Clustering Metrics

Here are some metrics available for validating clustering results. An explanation of each one is available in the scikit-learn documentation. [0]

If ground truth labels are available:

  • Adjusted Rand Index
  • Mutual Information Based scores
  • Homogeneity, completeness and V-measure
  • Fowlkes-Mallows scores
  • Contingency Matrix

If ground truth labels are not available:

  • Silhouette Coefficient
    • Range [-1, 1]
    • A value near 1 means the point is close to the points in its own cluster and far from the points in other clusters
  • Calinski-Harabasz Index
  • Davies-Bouldin Index
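
As a rough illustration of how the metrics listed above can be called from scikit-learn, here is a minimal sketch. The synthetic blobs, the choice of KMeans, and all variable names are assumptions made for this example only; they are not part of the original notebook.

```python
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics.cluster import contingency_matrix

# Synthetic data: three well-separated blobs with known labels (illustrative only).
X, y_true = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)
y_pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Metrics that need ground truth labels.
ari = metrics.adjusted_rand_score(y_true, y_pred)
ami = metrics.adjusted_mutual_info_score(y_true, y_pred)
homogeneity, completeness, v_measure = metrics.homogeneity_completeness_v_measure(
    y_true, y_pred
)
fmi = metrics.fowlkes_mallows_score(y_true, y_pred)
contingency = contingency_matrix(y_true, y_pred)

# Metrics that only need the data and the predicted labels.
sil = metrics.silhouette_score(X, y_pred)
ch = metrics.calinski_harabasz_score(X, y_pred)
db = metrics.davies_bouldin_score(X, y_pred)

print("ARI:", ari, "AMI:", ami, "FMI:", fmi)
print("Homogeneity / completeness / V-measure:", homogeneity, completeness, v_measure)
print("Silhouette:", sil, "Calinski-Harabasz:", ch, "Davies-Bouldin:", db)
print(contingency)
```

Higher is better for every score here except the Davies-Bouldin index, where lower values indicate better separation.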

 

Calculating SSE

SSE (sum of squared errors) is the sum of the squared distances between each point and its cluster center.
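
A minimal pure-Python sketch of this calculation, assuming Euclidean distance and that the cluster centers are already known. The function name `sse` and the toy data are made up for illustration.

```python
def sse(points, labels, centers):
    """Sum of squared Euclidean distances from each point to its assigned center."""
    total = 0.0
    for point, label in zip(points, labels):
        center = centers[label]
        total += sum((p - c) ** 2 for p, c in zip(point, center))
    return total


# Toy data: two clusters, every point exactly 1 unit away from its center.
points = [(1.0, 1.0), (1.0, 3.0), (8.0, 8.0), (10.0, 8.0)]
labels = [0, 0, 1, 1]
centers = {0: (1.0, 2.0), 1: (9.0, 8.0)}

print(sse(points, labels, centers))  # 4.0
```

With scikit-learn's KMeans, the same quantity is exposed as the `inertia_` attribute of a fitted model.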


Silhouette Score

A silhouette value is calculated for each point, and the score is the average of these values over all points.


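a(i) is the average distance from point i to the other points in the same cluster.

b(i) is the minimum, over all other clusters, of the average distance from point i to the points in that cluster. It gives the distance to the nearest neighboring cluster.

s(i) close to 1 means the data point is appropriately clustered. s(i) close to -1 means it is badly clustered and would fit better in a neighboring cluster.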
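
For reference, the standard textbook definition of these quantities can be written as follows. The symbols C_I (the cluster containing point i) and d(i, j) (the distance between points i and j) are notation introduced here, not taken from the original figures.

```latex
a(i) = \frac{1}{|C_I| - 1} \sum_{j \in C_I,\, j \neq i} d(i, j)
\qquad
b(i) = \min_{J \neq I} \frac{1}{|C_J|} \sum_{j \in C_J} d(i, j)

s(i) =
\begin{cases}
\dfrac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} & \text{if } |C_I| > 1 \\[1ex]
0 & \text{if } |C_I| = 1
\end{cases}
```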

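By convention, s(i) is set to 0 when the point is the only member of its cluster. If a(i) were instead treated as 0 in that case, every single-point cluster would get s(i) = 1, and the average score would keep rising as more clusters are added.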
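
The definitions above can be sketched in pure Python as follows. This is an illustrative implementation under stated assumptions, not the code from the notebook: it uses Euclidean distance, needs at least two clusters, and assumes the points are not all identical. The function name `silhouette_values` is hypothetical.

```python
import math


def silhouette_values(points, labels):
    """Return s(i) for every point, using Euclidean distance."""
    # Group point indices by cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    values = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            # Convention: a single-point cluster gets s(i) = 0.
            values.append(0.0)
            continue
        # a(i): mean distance to the other points in the same cluster.
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b(i): smallest mean distance to the points of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        values.append((b - a) / max(a, b))
    return values


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]

good = silhouette_values(points, [0, 0, 1, 1])  # natural grouping
bad = silhouette_values(points, [0, 1, 0, 1])   # mixes the two groups

print(sum(good) / len(good))  # close to 1
print(sum(bad) / len(bad))    # negative
```

In practice, `sklearn.metrics.silhouette_score` and `sklearn.metrics.silhouette_samples` compute the same quantities.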

 

Elbow method and Silhouette Analysis

Notebook is available at https://github.com/arcarchit/datastories/blob/master/Silhouette.ipynb


Rand Index

  • When ground truth labels are available we can use this metric
  • It checks the similarity between two cluster assignments
    • The ground truth labels can be seen as one cluster assignment
    • The score tells us how similar the two cluster assignments are
  • It works by looking at pairs of points
    • Out of all pairs, it counts how many pairs the two clusterings agree on
    • The clusterings agree on a pair when either
      • The two points are in the same cluster in both clusterings
      • The two points are in different clusters in both clusterings
  • The Rand index has a value between 0 and 1, with 0 indicating that the two clusterings do not agree on any pair of points and 1 indicating that the two clusterings are exactly the same
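
The pair-counting idea above can be sketched in a few lines of pure Python. The function name `rand_index` and the toy labels are illustrative assumptions.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which the two clusterings agree."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_in_a = labels_a[i] == labels_a[j]
        same_in_b = labels_b[i] == labels_b[j]
        # Agreement: together in both clusterings, or apart in both.
        if same_in_a == same_in_b:
            agree += 1
        total += 1
    return agree / total


truth = [0, 0, 1, 1]

# Same grouping with different label names: perfect agreement.
print(rand_index(truth, [1, 1, 0, 0]))  # 1.0

# One point moved to the wrong cluster: 3 of the 6 pairs agree.
print(rand_index(truth, [0, 0, 0, 1]))  # 0.5
```

Note that the score depends only on which points are grouped together, not on the label names themselves.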


  • One drawback of the Rand index is that it can give a value well above zero for a random assignment of clusters. To mitigate that, there is a metric called the Adjusted Rand Index, which corrects the score for chance agreement. [2]
    • The problem is most pronounced when the number of clusters is high
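
To make the drawback concrete, here is a minimal sketch that compares the two scores on a pair of unrelated clusterings. It uses the standard contingency-table formula for the Adjusted Rand Index. The function names and the toy labels are assumptions made for illustration, and `math.comb` requires Python 3.8 or later.

```python
from collections import Counter
from itertools import combinations
from math import comb


def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which the two clusterings agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def adjusted_rand_index(labels_a, labels_b):
    """Rand index corrected for chance, computed from contingency counts."""
    n = len(labels_a)
    # Pairs that are together in both clusterings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs that are together in each clustering separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)  # value expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both clusterings are one big cluster
    return (sum_cells - expected) / (max_index - expected)


# Four clusters each; no pair of points is grouped together in both clusterings.
truth = [0, 0, 1, 1, 2, 2, 3, 3]
unrelated = [0, 1, 2, 3, 0, 1, 2, 3]

print(round(rand_index(truth, unrelated), 2))           # 0.71
print(round(adjusted_rand_index(truth, unrelated), 2))  # -0.17
```

The plain Rand index looks respectable here only because most pairs are "apart in both" clusterings, while the adjusted score falls below zero. scikit-learn provides the same metric as `sklearn.metrics.adjusted_rand_score`.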

 

 

References

[0] : https://scikit-learn.org/stable/modules/clustering.html

[1] : https://github.com/anthonyng2/udemy-the-complete-machine-learning-course-with-python

[2] : https://davetang.org/muse/2017/09/21/adjusted-rand-index/