K-Means Clustering: A very simple overview


This week I spent some time reviewing the K-Means Clustering algorithm for unsupervised learning.

Imagine you have a data set and would like to know whether it contains clusters that describe the relationships between the data points. If you plot the points on a chart, you will most likely see with your own eyes how they divide into groups. But how do you make the computer notice that?

That’s what K-Means Clustering does: it is an algorithm that finds those groups without knowing in advance what the groups actually are. You choose how many groups to look for, and the machine decides which points belong to each one.

Easy way to get started

An easy way to get started with K-Means is to follow this process:

  1. Choose the number of clusters you want to extract from the data;
  2. Place one centroid per cluster at a random position;
  3. Assign each point to the centroid it is closest to;
  4. Move each centroid to the mean position of all the points assigned to its cluster;
  5. Repeat steps 3 and 4 until the assignments stop changing and you have a clear boundary dividing the data set.
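
The five steps above can be sketched in plain Python. This is only a minimal illustration, not a production implementation: the function name `k_means`, the helper `squared_distance`, the sample points, and the choice to start from randomly picked data points (one common way to do step 2) are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iter=100, seed=None):
    """Hypothetical helper: returns (centroids, assignments) for the points."""
    rng = random.Random(seed)
    # Step 2: use k distinct data points as the random starting centroids.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Step 3: assign each point to the index of its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Step 5: stop once no point changes cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 4: move each centroid to the mean of the points it holds.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


# Step 1: choose the number of clusters (two, for this made-up data).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = k_means(points, k=2, seed=0)
print(labels)
print(centroids)
```

The cluster numbers themselves are arbitrary: which group ends up as 0 and which as 1 depends on the starting centroids, so only the grouping is meaningful.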

The problem with this approach is that if you place the centroids randomly, you might get different clusters each time you run the process. It’s a hill climbing problem: the result depends on the starting point, and the algorithm can settle on a local optimum. It is usually necessary to run the algorithm several times from different starting points and keep the best result, for example the run with the tightest clusters.

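Scikit-learn (sklearn) has a KMeans estimator that makes this process very easy. You just have to set the main parameters, n_clusters (the number of clusters), n_init (how many times to restart from different centroids) and max_iter (the maximum number of iterations per run), and you are ready to go.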
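
As a rough sketch of how that might look, the snippet below runs `sklearn.cluster.KMeans` with those three parameters. The data points and the specific parameter values are made up for illustration, and `random_state` is added only so the example is repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2D data: two loose groups of three points each.
X = np.array([
    [1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
    [8.0, 8.0], [9.0, 8.5], [8.5, 9.0],
])

# n_clusters: how many groups to look for.
# n_init: how many times to restart with different starting centroids.
# max_iter: cap on the assign-and-move iterations within each run.
model = KMeans(n_clusters=2, n_init=10, max_iter=300, random_state=0)
labels = model.fit_predict(X)

print(labels)                  # cluster index assigned to each row of X
print(model.cluster_centers_)  # final centroid positions
print(model.inertia_)          # sum of squared distances to the nearest centroid
```

With `n_init` greater than 1, scikit-learn keeps the run with the lowest inertia, which addresses the starting-point problem described earlier.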

About The Author

Thiago Ricieri

I'm a software engineer at Pluto TV, working on the iOS, Android and Roku platforms. I'm personally interested in machine learning, algorithms, data science and the tech industry. I travel a lot and take good pictures.
