dendrogram | Data Stories

This post is a lecture summary of course by University of Washington : https://www.coursera.org/learn/ml-clustering-and-retrieval/home/welcome

Why one might want to use hierarchical clustering ?

A common problem is clustering is to predefined no of clusters (k)
- Hierarchical clustering does not need that
Deprogram provides effective visualization of clusters at various levels without re-running the algorithm
Any distance metric can work, while k-mean required euclidean distance [0][1]
We can construct more complex shape compared to k-means/EM

Divisive clustering

Example algorithm is recursive k-mean
We first cluster with k = 2 and then recursive apply k-means on these two clusters
Various choices we need to make :
- Which algorithm to use (k-means)
- How many split at each level (k=2, 3)
- When to stop? Here are some heuristics:
  - Number of points in a cluster falls below certain threshold (Minimum point in each cluster)
  - Distance to farthest point falls below certain threshold (Minimum cluster radius)
  - Until per-specified no of clusters are reached

Agglomerative clustering

Example would single linkage algorithm
We start with assigning each point to its own cluster
Next task is how to merge this cluster
- How do we define the distance between cluster?
  - Linkage function
    - It gives us two points from each cluster
    - Single linkage given points which has minimum distance across all possible pairs
    - Cluster A has points (a1, a2, a3) and B has (b1, b2, b3)
    - Linkage will give say (a2, b3) if it has the minimum distance across all combination
  - Distance Function
    - Can be euclidean or cosine anything
- We merge the cluster which has minimum distance
Dendrogram provides efficient visualization [4]
- Data-points of x-axis carefully ordered
- Height (y-axis) represents distance between two cluster as per specified linkage
- It gives us the complete visualization of which nodes were merged first followed by which one etc.
How to cut dendrogram ? [4]
- Suppose we are cutting the dendrogram at y=D*
- There are no pair of cluster which has distance < D* and which is not merged
- All the pairs of clusters having distance < D* are merged
- D* is the minimum distance between cluster at this level
- Distance between clusters identified > D*
Various linkage functions [2][3]
- Complete linkage takes two farthest points
- Average linkage considers all pairs and takes an average
- Ward linkage focus on increasing sum of squares
- These linkage however brings down the shaping flexibility
  - Single linage would allow for complex shapes in general
  - Single linkage however can result in chaining
Choice we need to make in agglomerative clustering
- Which linkage function to use
- Which distance metric to use
- Should we cut dendrogram flat or some weird cut