silhouette | Data Stories

Here are some metrics available for validating a clustering; an explanation of each one is available in the scikit-learn documentation. [0]

If ground truth labels are available:

Rand Index
Adjusted Rand Index
Contingency Matrix

If not available:

Silhouette Coefficient
- Range: -1 to 1
- A value near 1 means the point is similar to the other points in its own cluster
Calinski-Harabasz Index
Davies-Bouldin Index

It is the sum of the distances between each point and its cluster center.

The silhouette coefficient is calculated for each point, and the per-point values are then averaged.

a(i) is the average distance from point i to the other points in the same cluster.

b(i) is the minimum, over all other clusters, of the average distance from point i to the points in that cluster. It gives the distance to the nearest neighboring cluster.

s(i) close to 1 means the data point is appropriately clustered; s(i) close to -1 means it is very badly clustered.
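The formula that combines a(i) and b(i) is not written out in the text above. Assuming the standard definition of the silhouette coefficient, it can be sketched as:

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1
```

When a(i) is much smaller than b(i), the point sits well inside its own cluster and s(i) approaches 1. When a(i) is much larger than b(i), the point is closer to a neighboring cluster and s(i) approaches -1.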

Setting s(i) to 0 when the cluster contains only one point ensures that the curve is not monotonically decreasing.
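The definitions above can be put together in a few lines of code. This is a minimal sketch using only the Python standard library, assuming Euclidean distance and the standard formula s(i) = (b(i) - a(i)) / max(a(i), b(i)). The function names `silhouette_values` and `silhouette_average` are illustrative and not taken from the original post; in practice scikit-learn's `silhouette_samples` and `silhouette_score` would normally be used instead.

```python
import math


def silhouette_values(points, labels):
    """Return s(i) for every point (Euclidean distance assumed)."""
    # Group point indices by cluster label.
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    def mean_dist(i, members):
        return sum(math.dist(points[i], points[j]) for j in members) / len(members)

    values = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            # Cluster of size one: s(i) is set to 0 by convention.
            values.append(0.0)
            continue
        a = mean_dist(i, own)  # a(i): mean distance within the own cluster
        b = min(               # b(i): mean distance to the nearest other cluster
            mean_dist(i, members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


def silhouette_average(points, labels):
    """Average of s(i) over all points."""
    vals = silhouette_values(points, labels)
    return sum(vals) / len(vals)


# Two well-separated pairs of points (made-up data for illustration).
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print(silhouette_average(points, [0, 0, 1, 1]))  # sensible grouping
print(silhouette_average(points, [0, 1, 0, 1]))  # poor grouping
```

The first call groups nearby points together and scores close to 1; the second splits each natural pair across clusters and scores below 0.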


When ground truth labels are available, we can use the Rand index.
It checks the similarity between two cluster assignments
- The labels can also be seen as one type of cluster assignment
- The score tells us how similar the two cluster assignments are
It works by looking at pairs of points
- Out of all pairs, it counts how many pairs the two clustering mechanisms agree on
- The two mechanisms agree on a pair when either
  - The points are in the same cluster in both mechanisms
  - The points are in different clusters in both mechanisms
The Rand index has a value between 0 and 1, with 0 indicating that the two clusterings do not agree on any pair of points and 1 indicating that the two clusterings are exactly the same.
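The pair-counting idea can be written out directly. The following is a minimal sketch in plain Python; the function name `rand_index` is illustrative, and the loop over all pairs is quadratic in the number of points, so it is meant for understanding rather than for large datasets. scikit-learn provides `rand_score` for real use.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two cluster assignments agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both assignments must cover the same points")
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_in_a = labels_a[i] == labels_a[j]
        same_in_b = labels_b[i] == labels_b[j]
        # Agreement: together in both, or apart in both.
        if same_in_a == same_in_b:
            agree += 1
        total += 1
    if total == 0:
        raise ValueError("at least two points are needed")
    return agree / total


# Label names do not matter, only the grouping does.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # same grouping, renamed labels
print(rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # groupings that mostly disagree
```

Because only pair membership is compared, the two assignments do not need to use the same label names or even the same number of clusters.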


One drawback of the Rand index is that it can give a non-zero value for a random assignment of clusters. To mitigate this, there is a metric called the Adjusted Rand Index. [2]
- The unadjusted Rand index is especially unreliable when the number of clusters is large
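To see the drawback in practice, the sketch below computes both scores for two purely random assignments. It assumes the usual contingency-table formulation of the Adjusted Rand Index (the observed pair agreement corrected by its expected value under random labeling). The function name `rand_and_adjusted_rand` is illustrative, and the choice of 200 points and 20 clusters is arbitrary. scikit-learn's `adjusted_rand_score` would normally be used instead.

```python
import random
from collections import Counter
from math import comb


def rand_and_adjusted_rand(labels_a, labels_b):
    """Return (Rand index, Adjusted Rand Index) for two assignments."""
    n = len(labels_a)
    if n != len(labels_b) or n < 2:
        raise ValueError("need two assignments over the same >= 2 points")
    total_pairs = comb(n, 2)
    # Pairs placed together in both assignments (contingency-table cells).
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together within each assignment separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    # Agreements = together in both + apart in both.
    rand = (total_pairs + 2 * sum_cells - sum_a - sum_b) / total_pairs

    expected = sum_a * sum_b / total_pairs  # chance-level value of sum_cells
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Degenerate case, e.g. both assignments put everything in one cluster.
        return rand, 1.0
    adjusted = (sum_cells - expected) / (max_index - expected)
    return rand, adjusted


# Two unrelated random assignments with many clusters.
rng = random.Random(0)
a = [rng.randrange(20) for _ in range(200)]
b = [rng.randrange(20) for _ in range(200)]
ri, ari = rand_and_adjusted_rand(a, b)
print(f"Rand index: {ri:.3f}, Adjusted Rand Index: {ari:.3f}")
```

With many clusters, most pairs are apart in both assignments, so the plain Rand index comes out high even though the assignments are unrelated, while the adjusted score stays near 0.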
