In ADAM, we keep running averages of both the first and second moments of the gradient and use this information to adapt the learning rate for each parameter individually. The method is efficient for large problems involving lots of data and/or parameters. It combines the gradient descent with momentum algorithm and the RMSprop algorithm discussed above.
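A minimal sketch of a single ADAM update in NumPy may help make this concrete. The function name `adam_step` and its signature are illustrative choices, not taken from this text; the default hyperparameters (`lr=0.001`, `beta1=0.9`, `beta2=0.999`, `eps=1e-8`) are the commonly used values from the original ADAM paper.

```python
import numpy as np

def adam_step(params, grads, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Illustrative single ADAM update (not the exact implementation from this text).

    m, v  -- running averages of the first and second moments of the gradient
    t     -- current step count (1-based), used for bias correction
    """
    m = beta1 * m + (1 - beta1) * grads        # first moment: momentum-style average
    v = beta2 * v + (1 - beta2) * grads ** 2   # second moment: RMSprop-style average
    m_hat = m / (1 - beta1 ** t)               # bias correction for zero initialization
    v_hat = v / (1 - beta2 ** t)
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter scaled step
    return params, m, v
```

The `m` term plays the role of momentum, while dividing by `sqrt(v_hat)` gives the RMSprop-style per-parameter scaling of the learning rate; the bias-correction factors compensate for initializing `m` and `v` at zero during the first few steps.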