Learning the R caret Package (Part 4): Building and Validating Models

Date: 2019-11-07


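This post walks through building and validating a model with the caret package. The main functions involved are train(), predict(), and confusionMatrix(), together with the ROC plotting functions from the pROC package.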

Building the model

When building a model, its parameters need to be tuned. In caret, the main function for this is train().

train(x, y, method = "rf", preProcess = NULL, ...,
  weights = NULL, metric = ifelse(is.factor(y), "Accuracy", "RMSE"),
  maximize = ifelse(metric %in% c("RMSE", "logLoss", "MAE"), FALSE, TRUE),
  trControl = trainControl(), tuneGrid = NULL,
  tuneLength = ifelse(trControl$method == "none", 1, 3))

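x A matrix or data frame with samples in rows and features in columns. The columns must be named.
y The outcome for each sample, either numeric or a factor.
method The model to fit. caret supports a large number of models; the full list is in the caret documentation's list of available models.
preProcess A character vector naming the preprocessing methods to apply to the predictors. The default is NULL (no preprocessing). Possible values are "BoxCox", "YeoJohnson", "expoTrans", "center", "scale", "range", "knnImpute", "bagImpute", "medianImpute", "pca", "ica" and "spatialSign".
weights A numeric vector of case weights. Only used by models that accept case weights.
metric The summary metric used to select the optimal model. By default, the possible values are "RMSE" and "Rsquared" for regression and "Accuracy" and "Kappa" for classification.
maximize A logical value: whether the metric should be maximized (TRUE) or minimized (FALSE).
trControl A list of values that controls how train() runs, created with trainControl(). See below.
tuneGrid A data frame of candidate tuning values. Its column names must match the names of the tuning parameters.
tuneLength An integer giving the granularity of the tuning parameter grid. By default, it is the number of levels generated for each tuning parameter.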
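As a minimal sketch of how these arguments fit together, the call below tunes a k-nearest-neighbours classifier. The built-in iris data and the candidate values of k are assumptions chosen only for illustration and are not part of the original post. With no trControl supplied, train() falls back to its default bootstrap resampling.

```r
library(caret)

set.seed(42)  # resampling is random; fix the seed for reproducibility

x <- iris[, 1:4]      # data frame of predictors; the columns are named
y <- iris$Species     # factor outcome, so train() treats this as classification

# tuneGrid column names must match the model's tuning parameters;
# the "knn" method has a single tuning parameter, k
knn_grid <- data.frame(k = c(3, 5, 7, 9))

fit <- train(x, y,
             method     = "knn",
             preProcess = c("center", "scale"),  # standardize the predictors
             metric     = "Accuracy",            # metric used to pick the best k
             tuneGrid   = knn_grid)

print(fit$results)    # one row of resampled metrics per candidate k
print(fit$bestTune)   # the k with the highest resampled Accuracy
```

Passing tuneLength = 4 instead of tuneGrid would let train() pick four candidate values of k itself.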

The trainControl() function is described in more detail next.

trainControl(method = "boot", number = ifelse(grepl("cv", method), 10, 25),
  repeats = ifelse(grepl("[d_]cv$", method), 1, NA), p = 0.75,
  search = "grid", initialWindow = NULL, horizon = 1,
  fixedWindow = TRUE, skip = 0, verboseIter = FALSE, returnData = TRUE,
  returnResamp = "final",.....)

method The resampling method: "boot", "boot632", "optimism_boot", "boot_all", "cv", "repeatedcv", "LOOCV", "LGOCV" (for repeated training/test splits), "none" (only fits one model to the entire training set), "oob" (only for random forest, bagged trees, bagged earth, bagged flexible discriminant analysis, or conditional tree forest models), "timeslice", "adaptive_cv", "adaptive_boot" or "adaptive_LGOCV".
number The number of folds, or the number of resampling iterations.
repeats For repeated k-fold cross-validation only: the number of complete sets of folds to compute.
p For leave-group-out cross-validation only: the proportion of the data used for training.
search Either "grid" or "random", describing how the tuning parameter grid is determined.
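The resampling options above can be sketched as follows. This is an illustrative example only: the iris data, the kNN model, and the specific values of number, repeats and p are assumptions, not settings taken from the original post.

```r
library(caret)

set.seed(42)

# 5-fold cross-validation repeated 3 times: 5 * 3 = 15 resamples per candidate model
ctrl_cv <- trainControl(method = "repeatedcv", number = 5, repeats = 3)

# Leave-group-out CV: 10 random splits, each training on 80% of the rows
ctrl_lgocv <- trainControl(method = "LGOCV", number = 10, p = 0.8)

# Pass the control object to train() through trControl
fit_cv <- train(Species ~ ., data = iris,
                method     = "knn",
                trControl  = ctrl_cv,
                tuneLength = 3)   # try 3 candidate values of k

print(fit_cv$results)         # resampled Accuracy and Kappa for each k
print(nrow(fit_cv$resample))  # per-resample results kept for the final model
```

Because returnResamp defaults to "final", fit_cv$resample holds the per-fold metrics only for the selected value of k.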

The experiments use the spam data set from the kernlab package.