Resampling methods are an indispensable tool in modern statistics.
In this chapter, we discuss two of the most commonly used resampling methods, cross-validation and the bootstrap.
For example, cross-validation can be used to estimate the test error associated with a given statistical learning method in order to evaluate its performance, or to select the appropriate level of flexibility.
The bootstrap is used in several contexts, most commonly to provide a measure of accuracy of a parameter estimate or of a given statistical learning method.
5.1 Cross-Validation
In this section, we consider a class of methods that estimate the test error rate by holding out a subset of the training observations from the fitting process, and then applying the fitted statistical learning method to those held-out observations.
5.1.1 The Validation Set Approach
Suppose that we would like to estimate the test error associated with fitting a particular statistical learning method on a set of observations. The validation set approach, displayed in Figure 5.1, is a very simple strategy for this task.
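This approach first randomly divides the available set of observations into two parts: a training set and a validation set (also called a hold-out set). The model is fit on the training set, and the fitted model is then used to predict the responses for the observations in the validation set. The resulting validation set error rate, typically measured by the mean squared error (MSE) when the response is quantitative, provides an estimate of the test error rate.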
FIGURE 5.1. A schematic display of the validation set approach. A set of n observations are randomly
split into a training set (shown in blue, containing observations 7, 22, and 13, among others) and a
validation set (shown in beige, and containing observation 91, among others). The statistical learning
method is fit on the training set, and its performance is evaluated on the validation set.
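The validation set approach can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `validation_set_mse`, the polynomial model, the 50/50 split, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def validation_set_mse(x, y, degree=1, train_frac=0.5, seed=0):
    """Estimate the test MSE of a polynomial fit from one random split."""
    rng = np.random.default_rng(seed)
    n = len(x)
    perm = rng.permutation(n)                  # random ordering of the indices
    n_train = int(train_frac * n)
    train_idx, val_idx = perm[:n_train], perm[n_train:]
    # Fit on the training set only.
    coefs = np.polyfit(x[train_idx], y[train_idx], degree)
    # Predict the held-out responses and average the squared errors.
    preds = np.polyval(coefs, x[val_idx])
    return float(np.mean((y[val_idx] - preds) ** 2))

# Synthetic data (an assumption for illustration): a quadratic signal plus noise.
rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=200)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=0.5, size=200)

mse_linear = validation_set_mse(x, y, degree=1)
mse_quadratic = validation_set_mse(x, y, degree=2)
print(mse_linear, mse_quadratic)
```

Changing `seed` changes the split and therefore the estimate, which is one reason to prefer the cross-validation variants discussed next.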
5.1.2 Leave-One-Out Cross-Validation (LOOCV)
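Like the validation set approach, LOOCV splits the set of observations into two parts, but only a single observation (x1, y1) is used as the validation set, and the remaining observations {(x2, y2), . . . , (xn, yn)} make up the training set.
Repeating this procedure with each observation held out in turn yields n squared errors, MSE1, . . . , MSEn, where MSEi = (yi − ŷi)^2. The LOOCV estimate for the test MSE is the average of these n test error estimates:
CV(n) = (1/n) Σ_{i=1}^{n} MSEi.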
FIGURE 5.3. A schematic display of LOOCV.
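A minimal sketch of LOOCV follows, assuming a polynomial regression fit with NumPy. The function name `loocv_mse` and the synthetic data are illustrative choices only.

```python
import numpy as np

def loocv_mse(x, y, degree=1):
    """Average the n squared errors obtained by holding out one point at a time."""
    n = len(x)
    errors = np.empty(n)
    for i in range(n):
        mask = np.ones(n, dtype=bool)
        mask[i] = False                        # hold out observation i
        coefs = np.polyfit(x[mask], y[mask], degree)
        pred = np.polyval(coefs, x[i])
        errors[i] = (y[i] - pred) ** 2         # MSE_i for the held-out point
    return float(errors.mean())

# Synthetic data (an assumption for illustration): a quadratic signal plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=50)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=0.5, size=50)

loocv_linear = loocv_mse(x, y, degree=1)
loocv_quadratic = loocv_mse(x, y, degree=2)
print(loocv_linear, loocv_quadratic)
```

Unlike a single validation split, this procedure involves no randomness, so repeated runs on the same data give the same estimate. The cost is n model fits.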
5.1.3 k-Fold Cross-Validation
This approach involves randomly dividing the set of observations into k groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining k − 1 folds. The mean squared error, MSE1, is then computed on the observations in the held-out fold. This procedure is repeated k times; each time, a different group of observations is treated as a validation set. This process results in k estimates of the test error, MSE1, MSE2, . . . , MSEk. The k-fold CV estimate is computed by averaging these values,
CV(k) = (1/k) Σ_{i=1}^{k} MSEi.
LOOCV is the special case of k-fold CV in which k = n.
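The k-fold procedure can be sketched as below. This is a hedged illustration under stated assumptions: the function name `kfold_mse`, the polynomial model, and the synthetic data are not part of the original notes.

```python
import numpy as np

def kfold_mse(x, y, k=5, degree=1, seed=0):
    """Return the k-fold CV estimate of test MSE and the per-fold MSEs."""
    rng = np.random.default_rng(seed)
    n = len(x)
    # Randomly partition the indices into k folds of roughly equal size.
    folds = np.array_split(rng.permutation(n), k)
    fold_mses = []
    for j in range(k):
        val_idx = folds[j]
        train_idx = np.concatenate([folds[m] for m in range(k) if m != j])
        coefs = np.polyfit(x[train_idx], y[train_idx], degree)
        preds = np.polyval(coefs, x[val_idx])
        fold_mses.append(float(np.mean((y[val_idx] - preds) ** 2)))
    return float(np.mean(fold_mses)), fold_mses

# Synthetic data (an assumption for illustration): a quadratic signal plus noise.
rng = np.random.default_rng(2)
x = rng.uniform(-2, 2, size=100)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=0.5, size=100)

cv_linear, _ = kfold_mse(x, y, k=5, degree=1)
cv_quadratic, fold_mses = kfold_mse(x, y, k=5, degree=2)
print(cv_linear, cv_quadratic)
```

Only k models are fit, compared with n for LOOCV, which is the computational advantage of choosing k smaller than n.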
5.1.4 Bias-Variance Trade-Off for k-Fold Cross-Validation
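When k < n, k-fold CV has a computational advantage over LOOCV. From the standpoint of bias alone, LOOCV is preferable, since each of its training sets contains n − 1 observations, almost as many as the full data set.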
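However, LOOCV has higher variance than k-fold CV with k < n. When we perform LOOCV, we are in effect averaging the outputs of n fitted models, each trained on an almost identical set of observations; these outputs are therefore highly (positively) correlated with each other. In contrast, when we perform k-fold CV with k < n, the overlap between the training sets is smaller, so we are averaging the outputs of k fitted models that are less correlated with each other. Since the mean of many highly correlated quantities has higher variance than the mean of quantities that are less correlated, the test error estimate from LOOCV tends to have higher variance than the test error estimate from k-fold CV.
Given these considerations, k-fold cross-validation is typically performed with k = 5 or k = 10, since these values have been shown empirically to yield test error rate estimates that suffer from neither excessively high bias nor very high variance.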
5.1.5 Cross-Validation on Classification Problems
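In the classification setting, cross-validation uses the number of misclassified observations, rather than MSE, to quantify test error. For example, the LOOCV error rate takes the form
CV(n) = (1/n) Σ_{i=1}^{n} Erri,
where Erri = I(yi ≠ ŷi). The k-fold CV error rate and validation set error rates are defined analogously.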
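As a rough illustration of the misclassification-based error rate, the sketch below computes the LOOCV error rate for a simple nearest-centroid classifier. The classifier, the function names, and the two-cluster synthetic data are assumptions chosen to keep the example self-contained; any classifier could be substituted.

```python
import numpy as np

def nearest_centroid_predict(X_train, y_train, X_new):
    """Assign each new point to the class whose training mean is closest."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(X_new[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]

def loocv_error_rate(X, y):
    """Average the indicators Err_i = I(y_i != yhat_i) over all n hold-outs."""
    n = len(y)
    err = np.empty(n)
    for i in range(n):
        mask = np.ones(n, dtype=bool)
        mask[i] = False                        # hold out observation i
        y_hat = nearest_centroid_predict(X[mask], y[mask], X[i:i + 1])[0]
        err[i] = float(y[i] != y_hat)
    return float(err.mean())

# Synthetic data (an assumption for illustration): two Gaussian clusters.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=1.0, size=(40, 2)),
    rng.normal(loc=[3.0, 3.0], scale=1.0, size=(40, 2)),
])
y = np.array([0] * 40 + [1] * 40)

error_rate = loocv_error_rate(X, y)
print(error_rate)
```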
5.2 The Bootstrap
The bootstrap is a widely applicable and extremely powerful statistical tool that can be used to quantify the uncertainty associated with a given estimator
or statistical learning method.
Suppose that we wish to invest a fixed sum of money in two financial assets that yield returns of X and Y, respectively, where X and Y are random quantities.
We will invest a fraction α of our money in X, and will invest the remaining 1 − α in Y. Since there is variability associated with the returns on these two assets,
we wish to choose α to minimize the total risk, or variance, of our investment. In other words, we want to minimize Var(αX + (1 − α)Y). One can show that the
value that minimizes the risk is given by
α = (σ²_Y − σ_XY) / (σ²_X + σ²_Y − 2σ_XY),
where σ²_X = Var(X), σ²_Y = Var(Y), and σ_XY = Cov(X, Y).
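The step summarized by "one can show" can be worked out as follows. The derivation expands the variance of the portfolio, differentiates with respect to α, and solves the first-order condition; it uses only the definitions of σ²_X, σ²_Y, and σ_XY given above.

```latex
\begin{align*}
\mathrm{Var}\bigl(\alpha X + (1-\alpha) Y\bigr)
  &= \alpha^2 \sigma_X^2 + (1-\alpha)^2 \sigma_Y^2
     + 2\alpha(1-\alpha)\,\sigma_{XY}, \\
\frac{d}{d\alpha}\,\mathrm{Var}\bigl(\alpha X + (1-\alpha) Y\bigr)
  &= 2\alpha \sigma_X^2 - 2(1-\alpha)\,\sigma_Y^2
     + 2(1-2\alpha)\,\sigma_{XY} = 0, \\
\alpha\,\bigl(\sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY}\bigr)
  &= \sigma_Y^2 - \sigma_{XY}, \\
\alpha &= \frac{\sigma_Y^2 - \sigma_{XY}}
               {\sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY}}.
\end{align*}
```

The second derivative is 2(σ²_X + σ²_Y − 2σ_XY) = 2 Var(X − Y), which is nonnegative, so the stationary point is a minimum whenever X − Y is not constant.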
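In practice, the variances and covariance of X and Y are unknown, so α is estimated by plugging in sample estimates of these quantities, giving α̂. To measure the variability of α̂ we would ideally draw many new samples from the population, but with real data this is not possible.
However, the bootstrap approach allows us to use a computer to emulate the process of obtaining new sample sets,
so that we can estimate the variability of α̂ without generating additional samples. Rather than repeatedly obtaining
independent data sets from the population, we instead obtain distinct data sets by repeatedly sampling observations, with replacement, from the original data set.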
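A minimal sketch of the bootstrap for this investment example is given below. The function names `alpha_hat` and `bootstrap_se`, the number of bootstrap replicates, and the simulated returns (with Var(X) = 1, Var(Y) = 1.25, and Cov(X, Y) = 0.5, so that the true minimizing fraction is 0.6) are assumptions made for illustration.

```python
import numpy as np

def alpha_hat(x, y):
    """Plug-in estimate of the variance-minimizing fraction alpha."""
    cov = np.cov(x, y)                         # 2x2 sample covariance matrix
    var_x, var_y, cov_xy = cov[0, 0], cov[1, 1], cov[0, 1]
    return float((var_y - cov_xy) / (var_x + var_y - 2 * cov_xy))

def bootstrap_se(x, y, n_boot=1000, seed=0):
    """Bootstrap standard error of alpha_hat, plus the replicate estimates."""
    rng = np.random.default_rng(seed)
    n = len(x)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        # Resample n observation indices with replacement from the original data.
        idx = rng.integers(0, n, size=n)
        estimates[b] = alpha_hat(x[idx], y[idx])
    return float(estimates.std(ddof=1)), estimates

# Simulated returns (an assumption for illustration).
rng = np.random.default_rng(1)
returns = rng.multivariate_normal(
    mean=[0.0, 0.0], cov=[[1.0, 0.5], [0.5, 1.25]], size=100
)
x, y = returns[:, 0], returns[:, 1]

estimate = alpha_hat(x, y)
se, estimates = bootstrap_se(x, y)
print(estimate, se)
```

Each bootstrap data set pairs the resampled X and Y values by observation, so the dependence between the two returns is preserved in every replicate.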