Train/test split

Python notebook: https://github.com/daviskregers/data-science-recap/blob/main/12-train-test-split.ipynb

Train / test in practice

  • Need to ensure both sets are large enough to contain representatives of all the variations and outliers in the data you care about
  • The data sets must be selected randomly
  • Train/test is a great way to guard against overfitting
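A minimal sketch of a random train/test split using scikit-learn's `train_test_split` (the synthetic data here is just for illustration and is not from the notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: a noisy linear relationship
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 1))           # 100 samples, 1 feature
y = 3 * X[:, 0] + rng.normal(size=100)  # signal plus noise

# Shuffle and split randomly: 80% train, 20% test.
# random_state makes the random split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(len(X_train), len(X_test))  # → 80 20
```

You would then fit your model on `X_train`/`y_train` only and score it on `X_test`/`y_test`; a large gap between train and test scores is a sign of overfitting.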

Train/test is not infallible

  • Maybe your sample sizes are too small
  • Or due to random chance your train and test sets look remarkably similar
  • Overfitting can still happen

K-fold Cross validation

  • One way to further protect against overfitting is K-fold cross validation
  • Sounds complicated, but it's not:
    • Split your data into K randomly assigned segments
    • In turn, reserve each segment as your test data
    • Train on the remaining K-1 segments and measure performance (e.g. the r-squared score) against the reserved segment
    • Take the average of the K resulting scores
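The steps above can be sketched with scikit-learn's `cross_val_score`, which handles the fold rotation for you (again with synthetic data for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: a noisy linear relationship
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3 * X[:, 0] + rng.normal(size=100)

model = LinearRegression()

# cv=5 means K=5: the data is split into 5 segments, each segment
# serves once as the test set, and we get 5 r-squared scores back.
scores = cross_val_score(model, X, y, cv=5, scoring="r2")

print(scores)         # one r-squared score per fold
print(scores.mean())  # the averaged cross-validation score
```

The average score is a more stable estimate of generalization than a single train/test split, because every data point is used for testing exactly once.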