K-means Clustering Algorithm

K-means algorithm is the simplest and perhaps the most popular clustering algorithm in unsupervised machine learning, mostly used due to its performance and ease of implementation.

It aims to partition n observations in k partitions and does so based on entities attributes and their similarity between each other.

K-means algorithm procedure

  1. Define the initial cluster centres. Can be randomly generated or defined as copies of k entities (data points) attributes.
  2. Assign each entity to the cluster that has the closest centroid by calculating the distance between all data points and each cluster centre.
  3. Recalculate coordinates of cluster centres by taking the average of all entities' attributes assigned to the cluster.
  4. Repeat steps 2 and 3 iteratively until none of the entities get assigned to new group. When this happen, the algorithm converges.

Also, the excerpt below from the lecture explains really well how the algorithm works:

Pros of K-means

  • K-means is computationally fast in comparison to other clustering algorithms such as Hierarchical Clustering;
  • Results are easy to interpret.
  • Works really well if cluster sizes in the original data are similar, there are no outliers, entities within a cluster have similar variance, and features are independent of each other.

Cons of K-means

  • Hard to satisfy all the conditions listed in lastly listed advantage of K-means i.e. to have equally sized clusters with an equal variance between the features; to have no outliers; guarantee that feature are independent from one another.
  • Difficult to know what k value to use i.e. how many partitions should there be.
  • Sensitive to initial partitioning i.e. results may differ based on initial cluster centre attributes.
  • Does not work well if clusters in original data are of different size and different density.

Demo

In the demo you already have clusters and data points generated. So start by clicking 3. Execute iteration which goes over one cycle of cluster assignment and new centroid calculation. Keep on iterating until the algorithm converges, and reset all to start from the beginning. Also, number of data points and clusters can be altered with parameters at the top of the code in JS column.

The aim of the demo is to give a more intuitive understanding of how K-means clustering works, hope it helps :)

See the Pen K-means classification algorithm by Karolis (@karolis) on CodePen.