This post is a lecture summary of course by University of Washington : https://www.coursera.org/learn/ml-clustering-and-retrieval/home/welcome
Why one might want to use hierarchical clustering ?
- A common problem is clustering is to predefined no of clusters (k)
- Hierarchical clustering does not need that
- Deprogram provides effective visualization of clusters at various levels without re-running the algorithm
- Any distance metric can work, while k-mean required euclidean distance [0][1]
- We can construct more complex shape compared to k-means/EM
Divisive clustering
- Example algorithm is recursive k-mean
- We first cluster with k = 2 and then recursive apply k-means on these two clusters

- Various choices we need to make :
- Which algorithm to use (k-means)
- How many split at each level (k=2, 3)
- When to stop? Here are some heuristics:
- Number of points in a cluster falls below certain threshold (Minimum point in each cluster)
- Distance to farthest point falls below certain threshold (Minimum cluster radius)
- Until per-specified no of clusters are reached
Agglomerative clustering
- Example would single linkage algorithm
- We start with assigning each point to its own cluster
- Next task is how to merge this cluster
- How do we define the distance between cluster?
- Linkage function
- It gives us two points from each cluster
- Single linkage given points which has minimum distance across all possible pairs
- Cluster A has points (a1, a2, a3) and B has (b1, b2, b3)
- Linkage will give say (a2, b3) if it has the minimum distance across all combination
- Distance Function
- Can be euclidean or cosine anything
- Linkage function
- We merge the cluster which has minimum distance
- How do we define the distance between cluster?
- Dendrogram provides efficient visualization [4]
- Data-points of x-axis carefully ordered
- Height (y-axis) represents distance between two cluster as per specified linkage
- It gives us the complete visualization of which nodes were merged first followed by which one etc.
- How to cut dendrogram ? [4]
- Suppose we are cutting the dendrogram at y=D*
- There are no pair of cluster which has distance < D* and which is not merged
- All the pairs of clusters having distance < D* are merged
- D* is the minimum distance between cluster at this level
- Distance between clusters identified > D*
- Various linkage functions [2][3]
- Complete linkage takes two farthest points
- Average linkage considers all pairs and takes an average
- Ward linkage focus on increasing sum of squares
- These linkage however brings down the shaping flexibility
- Single linage would allow for complex shapes in general
- Single linkage however can result in chaining
- Choice we need to make in agglomerative clustering
- Which linkage function to use
- Which distance metric to use
- Should we cut dendrogram flat or some weird cut

- Complexity
- For brute-force it is O(N^2 log N)
- We can also prune search space by kd-trees or using triangular inequality
- Best know is O(N^2) [5]
Linkage Functions

References
[1]: https://pdfs.semanticscholar.org/a630/316f9c98839098747007753a9bb6d05f752e.pdf
[2] : https://www.saedsayad.com/clustering_hierarchical.htm
[4] : http://www.econ.upf.edu/~michael/stanford/maeb7.pdf
[5] : https://www.cs.ucsb.edu/~veronika/MAE/summary_SLINK_Sibson72.pdf
