
Decision trees

Python notebook: https://github.com/daviskregers/data-science-recap/blob/main/16-decision-trees.ipynb

With machine learning, you can construct a flowchart that helps you decide how to classify something
This is called a decision tree
It is another form of supervised learning
- Give it some sample data and the resulting classifications

Example

You want to build a system to filter out resumes based on historical hiring data
You have a database of some important attributes of job candidates and you know which ones were hired and which ones weren't
You can train a decision tree on this data, and arrive at a system for predicting whether a candidate will get hired based on those attributes!
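
To make the "flowchart" idea concrete, here is a minimal sketch of what a finished tree could look like and how it classifies a candidate. The tree is written by hand rather than learned, and the attribute names (`years_experience_5_plus`, `has_internship`) are made up for illustration; they are not taken from the notebook.

```python
# Hypothetical hand-written tree (NOT learned from data).
# Internal nodes are (attribute, {value: subtree}); leaves are class labels.
tree = ("years_experience_5_plus", {
    True: "hired",
    False: ("has_internship", {
        True: "hired",
        False: "not hired",
    }),
})


def classify(node, candidate):
    """Walk the tree from the root until a leaf label is reached."""
    while isinstance(node, tuple):
        attribute, branches = node
        # Follow the branch that matches this candidate's attribute value.
        node = branches[candidate[attribute]]
    return node


candidate = {"years_experience_5_plus": False, "has_internship": True}
print(classify(tree, candidate))  # prints "hired"
```

Training a decision tree means building a structure like `tree` automatically from the historical data instead of writing it by hand.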

How Decision Trees work

At each step, find the attribute we can use to partition the data set to minimize the entropy of the data at the next step
Fancy term for this simple algorithm: ID3
It's a greedy algorithm - as it goes down the tree, it just picks the decision that reduces entropy the most at that stage.
- That might not actually result in an optimal tree
- In practice it still tends to work well
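
The greedy step described above can be sketched as follows. This is only one step of an ID3-style algorithm (choosing a single split), not a full tree builder, and the tiny data set and attribute names (`employed`, `top_school`) are invented for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def split_entropy(rows, labels, attribute):
    """Weighted average entropy of the subsets produced by splitting on one attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    return sum(len(g) / len(labels) * entropy(g) for g in groups.values())


def best_attribute(rows, labels, attributes):
    """Greedy choice: the attribute whose split leaves the least entropy."""
    return min(attributes, key=lambda a: split_entropy(rows, labels, a))


# Hypothetical toy data: each row is one candidate.
rows = [
    {"employed": True, "top_school": False},
    {"employed": True, "top_school": True},
    {"employed": False, "top_school": True},
    {"employed": False, "top_school": False},
]
labels = ["hired", "hired", "not hired", "not hired"]

print(best_attribute(rows, labels, ["employed", "top_school"]))  # prints "employed"
```

Here `employed` separates the two classes perfectly (entropy 0 after the split), while `top_school` leaves each subset as mixed as before (entropy 1 bit), so the greedy step picks `employed`. A full tree would repeat this choice on each resulting subset.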

Random forests

Decision trees are very susceptible to overfitting
To fight this, we can construct several alternate decision trees and let them "vote" on the final classification
- Randomly re-sample the input data for each tree (fancy term for this: bootstrap aggregating or bagging)
- At each step, restrict the choice to a random subset of the attributes
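
The two randomization ideas and the vote can be sketched in plain Python. To keep it short, each "tree" here is only a one-level stump, so the random attribute subset is drawn once per tree (a stump has only one step), and the stump picks its attribute by training error rather than entropy. All function names, parameters, and the toy data are assumptions made for illustration.

```python
import random
from collections import Counter


def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]


def train_stump(rows, labels, attributes):
    """One-level tree: pick the attribute whose per-value majority rule
    misclassifies the fewest training rows (simplification: error, not entropy)."""
    best = None
    for attr in attributes:
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[attr], []).append(label)
        rule = {value: majority(g) for value, g in groups.items()}
        errors = sum(rule[row[attr]] != label for row, label in zip(rows, labels))
        if best is None or errors < best[0]:
            best = (errors, attr, rule)
    _, attr, rule = best
    default = majority(labels)  # fallback for values not seen in this sample
    return lambda row: rule.get(row[attr], default)


def train_forest(rows, labels, attributes, n_trees=25, n_attrs=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bagging: re-sample the training rows with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        # Only a random subset of the attributes may be used for the split.
        subset = rng.sample(attributes, n_attrs)
        forest.append(
            train_stump([rows[i] for i in idx], [labels[i] for i in idx], subset)
        )
    return forest


def predict(forest, row):
    """Every tree votes; the most common answer wins."""
    return majority([tree(row) for tree in forest])


# Hypothetical toy data.
rows = [
    {"employed": True, "top_school": False},
    {"employed": True, "top_school": True},
    {"employed": False, "top_school": True},
    {"employed": False, "top_school": False},
]
labels = ["hired", "hired", "not hired", "not hired"]

forest = train_forest(rows, labels, ["employed", "top_school"])
print(predict(forest, {"employed": True, "top_school": False}))
```

Because every tree sees a different sample and a different set of attributes, individual trees make different mistakes, and the vote averages those mistakes out.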