Binning, Transforming, Encoding, Scaling and Shuffling

Binning

  • Bucket observations together based on ranges of values
  • Example: estimated ages of people
    • Put all 20-somethings in one classification, 30-somethings in another, etc.
  • Quantile binning categorizes data by their place in the data distribution
    • Ensures even sizes of bins
  • Transforms numeric data to ordinal data
  • Especially useful when there is uncertainty in the measurements (see the sketch below)
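
A minimal sketch of fixed-width vs. quantile binning using pandas; the ages below are made-up example data:

```python
import pandas as pd

# Hypothetical estimated ages with some measurement uncertainty
ages = pd.Series([22, 25, 31, 34, 38, 41, 47, 52, 58, 63])

# Fixed-width binning: bucket the 20-somethings, 30-somethings, etc. together
decades = pd.cut(ages, bins=[20, 30, 40, 50, 60, 70],
                 labels=["20s", "30s", "40s", "50s", "60s"])

# Quantile binning: 4 bins, each holding roughly the same number of observations
quartiles = pd.qcut(ages, q=4, labels=["Q1", "Q2", "Q3", "Q4"])

print(decades.value_counts())    # bin sizes follow the data distribution
print(quartiles.value_counts())  # bin sizes are even by construction
```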

Transforming

  • Applying some function to a feature to make it better suited for training
  • Feature data with an exponential trend may benefit from a logarithmic transform
  • Example: YouTube recommendations
    • A numeric feature x is also represented by x^2 and sqrt(x)
    • This allows learning of super- and sub-linear functions (sketch below)
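
A short sketch of these transforms with NumPy; the feature values here are hypothetical:

```python
import numpy as np

# Hypothetical feature with an exponential trend (e.g., view counts)
x = np.array([3.0, 10.0, 55.0, 320.0, 2100.0, 15000.0])

# Logarithmic transform compresses the exponential trend
log_x = np.log1p(x)  # log(1 + x), still defined at x = 0

# YouTube-style expansion: also feed the model x^2 and sqrt(x),
# so it can fit super- and sub-linear relationships
features = np.column_stack([x, x ** 2, np.sqrt(x)])
print(features.shape)  # (6, 3)
```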

Encoding

  • Transforming data into some new representation required by the model
  • One-hot encoding
    • Create buckets for every category
    • The bucket for your category has a 1, all others have 0
    • Very common in deep learning, where categories are represented by individual output neurons (see the sketch below)

[Figure: one-hot encoding example]
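
A minimal one-hot encoding sketch using pandas; the color column is a made-up example:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One bucket (column) per category: the observed category gets a 1, all others 0
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)
print(one_hot)
#    color_blue  color_green  color_red
# 0           0            0          1
# 1           0            1          0
# 2           1            0          0
# 3           0            1          0
```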

Scaling / normalization

  • Some models prefer feature data to be normally distributed around 0 (most neural nets)
  • Most models require feature data to at least be scaled to comparable values
    • Otherwise features with larger magnitudes will have more weight than they should
    • Example: modeling age and income as features - incomes will be much higher than ages
  • Scikit-learn's preprocessing module helps with this (MinMaxScaler, StandardScaler, etc.)
  • Remember to scale your results back up to the original units when you are done (see the sketch below)
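
A sketch using scikit-learn's preprocessing module; the age and income values are hypothetical:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features on very different scales: age and income
X = np.array([[25, 40_000.0],
              [38, 85_000.0],
              [52, 120_000.0],
              [61, 230_000.0]])

# StandardScaler: zero mean, unit variance (what most neural nets prefer)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# MinMaxScaler squashes each feature into [0, 1] instead
X_minmax = MinMaxScaler().fit_transform(X)

# Scale results back up to the original units when interpreting them
X_original = scaler.inverse_transform(X_scaled)
```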

Shuffling

  • Many algorithms benefit from shuffling their training data
  • Otherwise they may learn from residual signals in the training data resulting from the order in which the data were collected (sketch below)
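
A minimal shuffling sketch using scikit-learn's shuffle utility on toy placeholder data:

```python
import numpy as np
from sklearn.utils import shuffle

# Hypothetical training data collected in a meaningful order
X = np.arange(10).reshape(5, 2)
y = np.array([0, 0, 1, 1, 1])

# Shuffle features and labels together so rows stay aligned
X_shuffled, y_shuffled = shuffle(X, y, random_state=42)
```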