
K-means clustering

Python notebook: https://github.com/daviskregers/data-science-recap/blob/main/14-naive-bayes-spam-classifier.ipynb

  • Attempts to split data into K groups, based on which of K centroids each point is closest to
  • Unsupervised learning - uses only the position of each data point.
  • Can uncover interesting groupings of people / things / behaviour
    • Example: Where do millionaires live?
    • What genres of music / movies / etc naturally fall out of data?
    • Create your own stereotypes from demographic data

How it works

  • Randomly pick K centroids (that's the "K" in K-means)
  • Assign each data point to the centroid it's closest to
  • Recompute each centroid as the average position of the points assigned to it
  • Iterate until points stop changing which centroid they're assigned to (sketched in code below)
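
A minimal NumPy sketch of these steps (the function name, parameters and stopping rule below are my own choices for illustration, not taken from the notebook, which may well use scikit-learn instead):

    import numpy as np

    def kmeans(points, k, max_iterations=100, seed=0):
        rng = np.random.default_rng(seed)
        # Randomly pick K of the data points as the initial centroids
        centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
        assignments = None
        for _ in range(max_iterations):
            # Assign each point to the centroid it's closest to
            distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
            new_assignments = distances.argmin(axis=1)
            # Stop once points no longer change assignment
            if assignments is not None and np.array_equal(new_assignments, assignments):
                break
            assignments = new_assignments
            # Recompute each centroid as the average position of its points
            for j in range(k):
                members = points[assignments == j]
                if len(members) > 0:
                    centroids[j] = members.mean(axis=0)
        return centroids, assignments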

If you want to predict the cluster for new points, just find the centroid they're closest to.
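
A small example of that lookup, assuming we already have centroids from an earlier run (the centroid values and new points below are made up; on a fitted scikit-learn KMeans model, predict() does the same thing):

    import numpy as np

    # Centroids from an earlier K-means run (made-up values) and some new points
    centroids = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
    new_points = np.array([[0.5, 0.2], [4.8, 5.1]])

    # Each new point joins the cluster whose centroid is nearest to it
    distances = np.linalg.norm(new_points[:, None, :] - centroids[None, :, :], axis=2)
    print(distances.argmin(axis=1))  # [0 1]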

Gotchas

  • Choosing K
    • Try increasing K until you stop getting large reductions in the squared error (the sum of squared distances from each point to its centroid) - see the elbow plot sketch below
  • Avoiding local minima
    • The random choice of initial centroids can yield different results
    • Run it a few times just to make sure your initial results aren't wacky (see the multiple-runs example below)
  • Labeling the clusters
    • K-means does not attempt to assign any meaning to the clusters you find
    • It's up to you to dig into the data and try to determine that (see the cluster-summary example below)
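
For choosing K, a sketch of the "keep increasing K until the squared error stops dropping sharply" idea (often called the elbow method), using scikit-learn and synthetic data purely for illustration:

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Synthetic data just for illustration
    X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

    # Fit K-means for a range of K and record the total squared error (inertia)
    ks = range(1, 11)
    inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_ for k in ks]

    # Look for the "elbow": the K after which the error stops dropping sharply
    plt.plot(ks, inertias, marker="o")
    plt.xlabel("K")
    plt.ylabel("Sum of squared distances to centroids")
    plt.show()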
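
For the local-minima gotcha, one way to see it with scikit-learn: run K-means from a single random initialization with different seeds and compare the final squared errors, then let n_init repeat the run and keep the best result (the data is again synthetic):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=500, centers=5, cluster_std=2.0, random_state=0)

    # A single random initialization can land in different local minima
    for seed in range(5):
        model = KMeans(n_clusters=5, init="random", n_init=1, random_state=seed).fit(X)
        print(f"seed={seed} inertia={model.inertia_:.1f}")

    # Repeating the run and keeping the best result guards against a wacky start
    best = KMeans(n_clusters=5, init="random", n_init=10, random_state=0).fit(X)
    print(f"best of 10 runs: inertia={best.inertia_:.1f}")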
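
And for labeling, one common way to dig into the clusters is to summarise each one, e.g. with per-cluster averages (the tiny demographic table below is made up; in practice you'd also want to scale the features before clustering):

    import pandas as pd
    from sklearn.cluster import KMeans

    # Made-up demographic data purely for illustration
    df = pd.DataFrame({
        "age":    [23, 25, 61, 64, 40, 42],
        "income": [30_000, 35_000, 90_000, 95_000, 60_000, 62_000],
    })

    # Cluster the people, then summarise each cluster to decide what it represents
    df["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(df)
    print(df.groupby("cluster").mean())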