Imagine you have a data set and you would like to know whether its points fall into clusters that describe how they relate to one another. If you plot the points on a chart, you will most likely see with your own eyes how they divide into groups. But how do you make the computer notice that?
That’s what K-Means Clustering does: it is an algorithm that lets you find those groups without knowing in advance what the groups actually are. You only choose how many groups to look for, and the machine decides which points belong to each one.
Easy way to get started
An easy way to get started with K-Means is to follow this process:
- Choose the number of clusters you want to extract from the data;
- Place that many centroids at random positions;
- Assign each point to the centroid closest to it;
- Move each centroid to the mean position of all the points assigned to its cluster;
- Repeat the last two steps until the assignments stop changing, which gives you a clear boundary dividing the data set.
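The steps above can be sketched in plain Python. This is only a minimal illustration, not a production implementation: the function names (`k_means`, `squared_distance`), the `seed` argument, the toy points, and the choice to start the centroids on randomly picked data points are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iter=100, seed=0):
    """Cluster `points` (a list of tuples) into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Place the centroids randomly (assumption: on k distinct data points).
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assign each point to the index of its closest centroid.
        new_labels = [
            min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            for p in points
        ]
        # Stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Move each centroid to the mean of the points assigned to it.
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


# Toy data: two well-separated groups of four points each.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9)]
centroids, labels = k_means(points, k=2)
print(centroids)
print(labels)
```

On data this cleanly separated, the two centroids should end up in the middle of each group; which group gets label 0 depends on the random start.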
The problem with this approach is that if you set the centroids randomly, you might get different clusters each time you run the process. It behaves like a hill climbing problem: the algorithm settles on a local optimum, so the result depends on the starting point. It is necessary to initialize the algorithm several times to find the clusters the process most commonly produces.
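To see how much the starting point matters, here is a small one-dimensional sketch that runs the same procedure twice from two hand-picked starting positions. The helper names (`kmeans_1d`, `sse`), the data, and the starting centroids are made up for illustration. Picking the run with the lowest sum of squared distances is one common way to choose between restarts; it is an assumption here, not something the process above prescribes.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Run plain K-Means on 1-D data from the given starting centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its points (unchanged if the cluster is empty).
        new_centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged: the centroids stopped moving
        centroids = new_centroids
    return clusters


def sse(clusters):
    """Sum of squared distances from each point to the mean of its cluster."""
    total = 0.0
    for c in clusters:
        if c:
            mean = sum(c) / len(c)
            total += sum((p - mean) ** 2 for p in c)
    return total


points = [0, 1, 9, 10, 20, 21]
run_a = kmeans_1d(points, centroids=[0, 1])    # both centroids start on the left
run_b = kmeans_1d(points, centroids=[20, 21])  # both centroids start on the right

print(run_a, sse(run_a))  # [[0, 1], [9, 10, 20, 21]] 122.5
print(run_b, sse(run_b))  # [[0, 1, 9, 10], [20, 21]] 82.5

# Keep the tighter of the two results.
best = min([run_a, run_b], key=sse)
```

Both runs converge, but to different groupings of the same six points, which is exactly why a single random start is not enough.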
Sklearn has a clustering estimator, KMeans, that makes this process very easy. You just have to set the main values, such as max_iter, and you are ready to go.
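A minimal sketch of what that might look like with scikit-learn's `KMeans`. Only `max_iter` is named in the text above; `n_clusters`, `n_init`, and `random_state` are other `KMeans` parameters added here as assumptions about which "main values" are worth setting, and the toy data is invented for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: two well-separated groups of four points each.
X = np.array(
    [[1, 1], [1, 2], [2, 1], [2, 2], [8, 8], [8, 9], [9, 8], [9, 9]],
    dtype=float,
)

model = KMeans(
    n_clusters=2,    # how many clusters to look for
    n_init=10,       # number of random restarts; the best run is kept
    max_iter=300,    # cap on iterations for each run
    random_state=0,  # fixed seed so the example is repeatable
)
labels = model.fit_predict(X)

print(labels)                  # cluster index assigned to each point
print(model.cluster_centers_)  # final centroid positions
print(model.inertia_)          # sum of squared distances to the closest centroid
```

The `n_init` parameter handles the restart problem for you: the estimator runs the algorithm from several different starting centroids and keeps the run with the lowest inertia.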