
Tuning Neural Networks: Learning Rate and Batch Size Hyperparameters

Learning Rate

  • Neural networks are trained by gradient descent (or similar means)
  • We start at some random point and sample different solutions (weights), seeking to minimize some cost function over many epochs
  • The size of the steps between those samples is the learning rate (see the sketch after this list)
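Here is that idea as a minimal NumPy sketch (the cost function and starting point below are made up purely for illustration): each step moves the weight in the direction that reduces the cost, and the learning rate scales how far each step goes.

    import numpy as np

    # Toy cost function and its gradient (chosen just for illustration)
    def cost(w):
        return (w - 3.0) ** 2          # minimum at w = 3

    def gradient(w):
        return 2.0 * (w - 3.0)

    learning_rate = 0.1                # the hyperparameter being discussed
    w = np.random.uniform(-10, 10)     # start at some random point

    for epoch in range(100):
        w -= learning_rate * gradient(w)   # step "downhill", scaled by the learning rate

    print(f"Final weight: {w:.4f}, cost: {cost(w):.6f}")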

Effect of learning rate

  • Too high a learning rate means you might overshoot the optimal solution
  • Too small a learning rate will take too long to find the optimal solution (both cases are compared in the sketch after this list)
  • Learning rate is an example of a hyperparameter
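A hedged sketch of both failure modes on the same toy cost function as above (values chosen only for illustration): a learning rate that is too high makes the updates diverge past the minimum, while one that is too small barely moves in the same number of steps.

    def gradient(w):
        return 2.0 * (w - 3.0)            # toy cost (w - 3)^2, minimum at w = 3

    def train(learning_rate, steps=50, w=10.0):
        for _ in range(steps):
            w -= learning_rate * gradient(w)
        return w

    print("lr = 1.5   ->", train(1.5))    # too high: overshoots and diverges
    print("lr = 0.001 ->", train(0.001))  # too small: still far from 3 after 50 steps
    print("lr = 0.1   ->", train(0.1))    # reasonable: lands close to the optimum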

Batch Size

  • How many training samples are used for each gradient update within an epoch
  • Somewhat counterintuitively:
    • Smaller batch sizes can work their way out of "local minima" more easily
    • Batch sizes that are too large can end up getting stuck in the wrong solution
    • Because the data is randomly shuffled at each epoch, this can show up as very inconsistent results from run to run (see the mini-batch sketch after this list)
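A minimal mini-batch loop in plain NumPy, assuming a simple linear model with squared-error loss (the data, model, and hyperparameter values are made up for illustration); it shows where the batch size and the per-epoch shuffle fit into training:

    import numpy as np

    rng = np.random.default_rng(42)
    X = rng.normal(size=(1000, 3))                    # made-up training data
    y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=1000)

    w = np.zeros(3)                                   # weights of a toy linear model
    learning_rate, batch_size, epochs = 0.01, 32, 10

    for epoch in range(epochs):
        order = rng.permutation(len(X))               # random shuffle at each epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]     # one mini-batch of samples
            xb, yb = X[idx], y[idx]
            grad = 2 * xb.T @ (xb @ w - yb) / len(idx)  # gradient of mean squared error
            w -= learning_rate * grad                 # one update per batch, not per epoch

    print("Learned weights:", w)

Smaller batches mean more, noisier updates per epoch; that noise is what helps training escape shallow local minima, and it is also why runs with different shuffles can look inconsistent from one another.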

To Recap

  • Small batch sizes tend not to get stuck in local minima
  • Large batch sizes can converge on the wrong solution at random
  • Large learning rates can overshoot the correct solution
  • Small learning rates increase training time (see the sketch after this list for where each hyperparameter is set)
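To see where these two hyperparameters actually get set, here is a sketch of a typical Keras workflow (assuming TensorFlow/Keras; the model architecture and data below are placeholders, not taken from this lesson): the learning rate is given to the optimizer, and the batch size is passed to fit().

    import numpy as np
    import tensorflow as tf

    # Placeholder data just to make the sketch runnable
    X = np.random.normal(size=(1000, 20)).astype("float32")
    y = np.random.randint(0, 2, size=(1000,)).astype("float32")

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(20,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])

    # The learning rate is a property of the optimizer...
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
    model.compile(optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"])

    # ...while the batch size (and number of epochs) are arguments to fit()
    model.fit(X, y, epochs=10, batch_size=32, shuffle=True)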