Train/test split

Python notebook: https://github.com/daviskregers/data-science-recap/blob/main/12-train-test-split.ipynb

Train / test in practice

  • Need to ensure both sets are large enough to contain representatives of all the variations and outliers in the data you care about
  • The data sets must be selected randomly
  • Train/test is a great way to guard against overfitting
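A minimal sketch of a random train/test split using scikit-learn's `train_test_split` (the synthetic data here is just for illustration and is not from the notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: a noisy linear relationship
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 1))           # 100 samples, 1 feature
y = 3 * X[:, 0] + rng.normal(size=100)  # signal plus noise

# Shuffle and split randomly: 80% train, 20% test.
# random_state makes the random split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(len(X_train), len(X_test))  # → 80 20
```

You would then fit your model on `X_train`/`y_train` only and score it on `X_test`/`y_test`; a large gap between train and test scores is a sign of overfitting.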

Train/test is not infallible

  • Maybe your sample sizes are too small
  • Or due to random chance your train and test sets look remarkably similar
  • Overfitting can still happen

K-fold Cross validation

  • One way to further protect against overfitting is K-fold cross validation
  • Sounds complicated, but it's not:
    • Split your data into K randomly assigned segments
    • In turn, reserve each segment as your test data
    • Train on the remaining K-1 segments and measure performance (e.g. the r-squared score) against the reserved segment
    • Take the average of the K resulting scores
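The steps above can be sketched with scikit-learn's `cross_val_score`, which handles the fold rotation for you (again with synthetic data for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: a noisy linear relationship
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3 * X[:, 0] + rng.normal(size=100)

model = LinearRegression()

# cv=5 means K=5: the data is split into 5 segments, each segment
# serves once as the test set, and we get 5 r-squared scores back.
scores = cross_val_score(model, X, y, cv=5, scoring="r2")

print(scores)         # one r-squared score per fold
print(scores.mean())  # the averaged cross-validation score
```

The average score is a more stable estimate of generalization than a single train/test split, because every data point is used for testing exactly once.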