Cross-validation and building of classification tree models for protein interaction prediction.

Usage

X.tune.tree(
  data = NULL,
  labels.col = NA,
  predictors = NA,
  tree.type = "RF",
  cross.folds = 5,
  eval.metric = "prc",
  CP = 0.01,
  CF = 0.3,
  mTry = 1,
  winnowing = FALSE,
  boost = 1,
  noGlobalPruning = FALSE,
  costs = NA,
  nodesize = 10,
  rf.split = 5,
  downsample = 1,
  ntree = 151
)

Arguments

data

Data frame with pairwise differential values. It may contain any number of predictor columns for modelling and must contain a labels column with values 1 (for protein pairs that form a complex) and 0 (for protein pairs that do not form a complex).

labels.col

Character string: Name of the data column that contains the label values (1 or 0). If not specified, the last column is used as the labels column.

predictors

Vector of character strings: Names of the columns that contain predictors. If not specified, all columns except the labels column and columns with names starting with "protein" are used as predictors.
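
The default predictor selection described above can be sketched in base R. This is only an illustration of the stated rule, not the package's internal code; the data frame and its column names are hypothetical.

```r
# Hypothetical data frame; column names are illustrative only.
df <- data.frame(
  protein1  = c("A", "B", "C"),
  protein2  = c("D", "E", "F"),
  cor_diff  = c(0.1, 0.8, 0.4),
  dist_diff = c(1.2, 0.3, 0.9),
  complex   = c(0, 1, 0)
)

labels_col <- "complex"

# Drop columns whose names start with "protein", then drop the labels column.
default_predictors <- setdiff(
  names(df)[!startsWith(names(df), "protein")],
  labels_col
)
default_predictors
```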

tree.type

Character string: A type of tree model to build. Options are: CART, PART, J48, C5.0, C5.0Rules, RF.

cross.folds

Numeric: Number of pieces into which the training data is split for cross-validation. Default is 5. The parameter is ignored when mode is set to 'm'.
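
As a rough illustration of what splitting the data into cross.folds pieces means, the sketch below assigns rows to folds at random. How X.tune.tree actually builds its folds (for example, whether it stratifies by label) is not stated here, so treat this as an assumption.

```r
set.seed(1)  # reproducible fold assignment

n <- 20  # hypothetical number of training rows
k <- 5   # corresponds to cross.folds = 5

# Assign each row to one of k folds of (near) equal size, in random order.
fold_id <- sample(rep(seq_len(k), length.out = n))

# Each fold serves once as the held-out set.
for (f in seq_len(k)) {
  test_idx  <- which(fold_id == f)
  train_idx <- which(fold_id != f)
  # A model would be fitted on train_idx and evaluated on test_idx here.
}

table(fold_id)
```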

eval.metric

Character string: Metric used to evaluate the model in cross-validation. Default is "prc" for area under the precision-recall curve. Other options are "roc" for area under the receiver operating characteristic curve and "kappa" for Cohen's kappa.
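
For orientation, two of the metrics named above can be computed by hand in base R. The helper names cohen_kappa and roc_auc are hypothetical and the vectors are made-up toy data; the package may compute these metrics differently.

```r
# Cohen's kappa: observed agreement corrected for chance agreement.
cohen_kappa <- function(truth, pred) {
  classes <- union(truth, pred)
  po <- mean(truth == pred)
  pe <- sum(sapply(classes, function(cl) mean(truth == cl) * mean(pred == cl)))
  (po - pe) / (1 - pe)
}

# Area under the ROC curve via the rank-sum (Mann-Whitney) formulation.
roc_auc <- function(truth, score) {
  n_pos <- sum(truth == 1)
  n_neg <- sum(truth == 0)
  (sum(rank(score)[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

truth <- c(1, 1, 0, 0, 0, 0)              # toy labels
pred  <- c(1, 0, 0, 0, 0, 1)              # toy hard predictions
score <- c(0.9, 0.4, 0.3, 0.2, 0.1, 0.6)  # toy predicted probabilities

kappa_value <- cohen_kappa(truth, pred)
auc_value   <- roc_auc(truth, score)
```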

CP

Numeric vector: Complexity parameters to be cross-validated.

CF

Numeric vector: Confidence factors to be cross-validated.

mTry

Numeric vector: mtry values for random forest models to be cross-validated. If not specified, five evenly spread mtry values between a fifth of the number of predictors and the number of predictors will be validated.
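
One way to picture "five evenly spread values" is the sketch below. The rounding and de-duplication steps are assumptions, and p is a hypothetical predictor count; the function's own computation may differ.

```r
p <- 20  # hypothetical number of predictors

# Five evenly spaced values from p / 5 up to p, rounded to whole numbers.
mtry_grid <- unique(round(seq(from = p / 5, to = p, length.out = 5)))
mtry_grid
```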

winnowing

Vector of logicals: If winnowing should be validated, set to c(FALSE,TRUE). Default is FALSE.

boost

Vector of integers: Numbers of boosting iterations to be cross-validated. Default is 1.

noGlobalPruning

Vector of logicals: If global pruning should be validated, set to c(FALSE,TRUE). Default is FALSE.

costs

Integer: A number specifying how much higher the cost of falsely predicting non-interacting protein pairs is than the cost of falsely predicting interacting protein pairs, and vice versa.

nodesize

Vector of integers: Minimum terminal node sizes to be validated. Default is 10.
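
When several tuning arguments are given as vectors, one plausible reading is that every combination is cross-validated. That is an assumption rather than something this page confirms; under it, the candidate settings for mTry and nodesize could be enumerated like this:

```r
# Every combination of the candidate values (assumed tuning grid).
tune_grid <- expand.grid(
  mTry     = c(3, 5),
  nodesize = c(10, 20)
)
tune_grid
nrow(tune_grid)  # number of parameter combinations to evaluate
```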

rf.split

Integer: Number of steps into which the random forest training is split. This is to avoid running out of RAM. Default is 5.

downsample

Integer: Factor by which the non-interacting protein pairs are downsampled for training; for example, 3 means a third of them is used. Default is 1. Applicable for C5.0 models.
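
The effect of downsampling the non-interacting class can be sketched as follows. The label vector is toy data, and random selection with floor() is an assumption about how the reduction might be done, not the package's documented behaviour.

```r
set.seed(1)  # reproducible sampling

labels <- c(rep(1, 10), rep(0, 90))  # toy labels: 10 interacting, 90 not
downsample <- 3                      # keep one third of the negatives

pos_idx <- which(labels == 1)
neg_idx <- which(labels == 0)

# Randomly keep 1 / downsample of the non-interacting pairs.
keep_neg  <- sample(neg_idx, size = floor(length(neg_idx) / downsample))
train_idx <- sort(c(pos_idx, keep_neg))

table(labels[train_idx])
```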

Value

A list with three elements. $data is a data frame with the tuning results, $eval.plot is the plotted data and $eval.plot.SDs are plotted standard deviations of the results.

Examples

# theme_bw() is a ggplot2 function, so ggplot2 must be attached before the call
library(ggplot2)
CV <- X.tune.tree(GS_specific, labels.col = "complex", tree.type = "RF",
                  mTry = c(3, 5), nodesize = c(10, 20), downsample = 3)