Cross-validation and building of classification tree models for protein interaction prediction.
Usage
X.tune.tree(
data = NULL,
labels.col = NA,
predictors = NA,
tree.type = "RF",
cross.folds = 5,
eval.metric = "prc",
CP = 0.01,
CF = 0.3,
mTry = 1,
winnowing = FALSE,
boost = 1,
noGlobalPruning = FALSE,
costs = NA,
nodesize = 10,
rf.split = 5,
downsample = 1,
ntree = 151
)
Arguments
- data
Data frame with pairwise differential values. Must contain any number of columns with predictors for modelling and a labels column with values 1 (for protein pairs that form a complex) and 0 (for protein pairs that do not form a complex).
- labels.col
Character string: Name of the data column that contains the label values (1 or 0). If not specified, the last column is used as the labels column.
- predictors
Vector of character strings: Names of the columns that contain predictors. If not specified, all columns are used as predictors except the labels column and columns whose names start with "protein".
- tree.type
Character string: Type of tree model to build. Options are: CART, PART, J48, C5.0, C5.0Rules, RF. Default is "RF".
- cross.folds
Numeric: Number of pieces (folds) into which the training data is split for cross-validation. Default is 5. This parameter is ignored when mode is set to 'm'.
- eval.metric
Character string: Metric used to evaluate the model in cross-validation. Default is "prc" for area under the precision-recall curve. Other options are "roc" for area under the receiver operating characteristic curve and "kappa" for Cohen's kappa.
- CP
Numeric vector: Complexity parameters to be cross-validated.
- CF
Numeric vector: Confidence factors to be cross-validated.
- mTry
Numeric vector: mtry values for random forest models. If not specified, five evenly spread mtry values between a fifth of the number of predictors and the number of predictors will be validated.
- winnowing
Vector of logicals: If winnowing should be validated, set to c(FALSE,TRUE). Default is FALSE.
- boost
Vector of integers: Numbers of boosting iterations to be used in cross-validation. Default is 1.
- noGlobalPruning
Vector of logicals: If global pruning should be validated, set to c(FALSE,TRUE). Default is FALSE.
- costs
Integer: How many times higher the cost of falsely predicting a non-interacting protein pair is than the cost of falsely predicting an interacting pair, or vice versa.
- nodesize
Vector of integers: Minimum terminal node sizes to be validated. Default is 10.
- rf.split
Integer: Number of steps into which the random forest training is split. Splitting the training helps to avoid running out of RAM. Default is 5.
- downsample
Integer: Factor by which the non-interacting protein pairs are downsampled for training (for example, 2 uses half as many). Default is 1. Applicable for C5.0 models.
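Examples
The following is a minimal sketch of the input shape that the data, labels.col and predictors arguments describe. It is not taken from the package itself. The toy data frame and all of its column names (protein1, protein2, diff.a, diff.b, diff.c, label) are hypothetical, and the default selection of the labels column and predictors is reproduced by hand in base R from the descriptions above. The final call is left commented out because it needs the package that provides X.tune.tree() to be loaded, and the mTry values in it are purely illustrative.

```r
# Toy input in the shape described under `data`.
# All column names here are hypothetical.
set.seed(1)
n <- 40
toy <- data.frame(
  protein1 = paste0("P", seq_len(n)),
  protein2 = paste0("Q", seq_len(n)),
  diff.a   = runif(n),
  diff.b   = runif(n),
  diff.c   = runif(n),
  label    = rep(c(1, 0), times = c(10, 30))  # 1 = forms a complex, 0 = does not
)

# Defaults as described in the argument list, reproduced by hand:
# the labels column is the last column, and the predictors are all
# remaining columns whose names do not start with "protein".
labels.col <- names(toy)[ncol(toy)]
predictors <- setdiff(
  names(toy)[!startsWith(names(toy), "protein")],
  labels.col
)
print(predictors)

# With the package that provides X.tune.tree() loaded, a call could then
# look like this (commented out so the sketch runs with base R only):
# model <- X.tune.tree(
#   data        = toy,
#   labels.col  = labels.col,
#   predictors  = predictors,
#   tree.type   = "RF",
#   cross.folds = 5,
#   eval.metric = "prc",
#   mTry        = c(1, 2, 3)  # illustrative mtry values to cross-validate
# )
```

Passing labels.col and predictors explicitly, as in the commented call, avoids relying on the column order of the data frame.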