Imputation techniques for missing data
Mean replacement
- Replace missing values with the mean value from the rest of the column (columns, not rows. A column represents a single feature.)
- Fast & easy, won't affect mean or sample size of overall data set.
- Median may be a better choice than mean when outliers are present
- But it's generally pretty terrible
- Only works on column level, misses correlations between features
- Can't use on categorical features (imputing the most frequent value can work in this case, thoough)
- Not very accurate
Dropping
- If not many rows contain missing data
- And dropping those rows doesn't bias your data
- And you don't have a lot of time
- But it's never going to be the right answer for the best approach.
- Almost anything is better. Can you substitute another similar fields perhaps? (i.e. review summary vs full text)
Machine learning
- KNN: Find K nearest (most similar) rows and average their values
- Assumes numerical data, not categorical
- There are ways to handle categorical data (hamming distance), but categorical data is probably better served by
- Deep Learning
- Build a machine learning model to impute data for your machine learning model.
- Works well for categorical data. Really well, but it's complicated.
- Regression
- Find linear or non-linear relationships between the missing feature and other features
- Most advanced technique: MICE (Multiple Imputation by Chained Equations)
Just get more data
- What's better than imputing data? Getting more real data!
- Sometimes you just have to try harder or collect more data.