Hierarchical Clustering

December 9, 2018May 8, 2020Archit Vora Leave a comment

This post is a lecture summary of course by University of Washington : https://www.coursera.org/learn/ml-clustering-and-retrieval/home/welcome

Why one might want to use hierarchical clustering ?

A common problem is clustering is to predefined no of clusters (k)
- Hierarchical clustering does not need that
Deprogram provides effective visualization of clusters at various levels without re-running the algorithm
Any distance metric can work, while k-mean required euclidean distance [0][1]
We can construct more complex shape compared to k-means/EM

Divisive clustering

Example algorithm is recursive k-mean
We first cluster with k = 2 and then recursive apply k-means on these two clusters
Various choices we need to make :
- Which algorithm to use (k-means)
- How many split at each level (k=2, 3)
- When to stop? Here are some heuristics:
  - Number of points in a cluster falls below certain threshold (Minimum point in each cluster)
  - Distance to farthest point falls below certain threshold (Minimum cluster radius)
  - Until per-specified no of clusters are reached

Agglomerative clustering

Example would single linkage algorithm
We start with assigning each point to its own cluster
Next task is how to merge this cluster
- How do we define the distance between cluster?
  - Linkage function
    - It gives us two points from each cluster
    - Single linkage given points which has minimum distance across all possible pairs
    - Cluster A has points (a1, a2, a3) and B has (b1, b2, b3)
    - Linkage will give say (a2, b3) if it has the minimum distance across all combination
  - Distance Function
    - Can be euclidean or cosine anything
- We merge the cluster which has minimum distance
Dendrogram provides efficient visualization [4]
- Data-points of x-axis carefully ordered
- Height (y-axis) represents distance between two cluster as per specified linkage
- It gives us the complete visualization of which nodes were merged first followed by which one etc.
How to cut dendrogram ? [4]
- Suppose we are cutting the dendrogram at y=D*
- There are no pair of cluster which has distance < D* and which is not merged
- All the pairs of clusters having distance < D* are merged
- D* is the minimum distance between cluster at this level
- Distance between clusters identified > D*
Various linkage functions [2][3]
- Complete linkage takes two farthest points
- Average linkage considers all pairs and takes an average
- Ward linkage focus on increasing sum of squares
- These linkage however brings down the shaping flexibility
  - Single linage would allow for complex shapes in general
  - Single linkage however can result in chaining
Choice we need to make in agglomerative clustering
- Which linkage function to use
- Which distance metric to use
- Should we cut dendrogram flat or some weird cut

dendrogram

Complexity
- For brute-force it is O(N^2 log N)
- We can also prune search space by kd-trees or using triangular inequality
- Best know is O(N^2) [5]

Linkage Functions

linkage

References

[0]: https://stats.stackexchange.com/questions/81481/why-does-k-means-clustering-algorithm-use-only-euclidean-distance-metric

[1]: https://pdfs.semanticscholar.org/a630/316f9c98839098747007753a9bb6d05f752e.pdf

[2] : https://www.saedsayad.com/clustering_hierarchical.htm

[3] : https://stats.stackexchange.com/questions/195446/choosing-the-right-linkage-method-for-hierarchical-clustering

[4] : http://www.econ.upf.edu/~michael/stanford/maeb7.pdf

[5] : https://www.cs.ucsb.edu/~veronika/MAE/summary_SLINK_Sibson72.pdf

Taylor Series Expansion

December 8, 2018December 9, 2018Archit Vora Leave a comment

General

general

Single Variable

single

Multivariable

vector

where x and a are vectors. We had discussed gradient vector (D) and hessian matrix(H) in this post.

Reference

[0] : http://www.math.ucdenver.edu/~esulliva/Calculus3/Taylor.pdf

[1] : https://mathinsight.org/taylors_theorem_multivariable_introduction

Which ML algorithm to use when?

December 7, 2018December 7, 2018Archit Vora Leave a comment

We generally find this question while starting up new project or we want to compare some algorithms to discriminate them (discrimination helps understand things better sometimes).

Although there is no definitive answer to this, I am writing here summaries from some posts. [0]

machine-learning-cheet-sheet

Factors to consider

Accuracy
- Most of people focus on accuracy but it practically it is not the only
Training time
- Naive Bayes and logistic regression are much faster than boosting or neural nets
Linearity
- LR and SVM are suitable when classes are linearly seperatble
  - Of course SVM bypasses it via kernel trick but still not as much complex decision boundary as nueral nets
- Despite the risk of non linearity in data linear algorithms tends to work well in practice and are often used as starting point
Number of parameters
- Parameters does affect training time and accuracy
- More parameters helps learning complex function, however it requires more data to prevent over-fitting
No of features
- When data point are not enough for no of features (text, NLP) SVM works well

Notes

Try out linear/logistic regression, SVM first when you most dependent variables are numeric.
SVM
- SVM suites more when no of data points are less for given no of features.
- SVM is linear classifier only. It just uses kernel trick to project linearly inseparable data on high dimension.
- SVM is solved by mathematical optimization problem unlike nueral nets. Hence tends to be bit faster.
What is the difference between LR and SVM?
- LR has linear decision boundary while SVM can have non linear decision boundary.
Reinforcement learning
- Analyses and optimized behavior of agent, (via feedback from environment)
- They try to discover different actions to maximize reward
- Trial-error and delayed reward distinguishes reinforcement learning from other ML algorithms

References

[0] : https://blogs.sas.com/content/subconsciousmusings/2017/04/12/machine-learning-algorithm-use/#prettyPhoto

[1] : https://docs.microsoft.com/en-us/azure/machine-learning/studio/algorithm-choice

Matrices Definitions

December 6, 2018May 8, 2020Archit Vora 1 Comment

Jacobian Matrix

jscobian

Gradient Vector

gradient

Hessian Matrix

hessian

Hessian Matrix is Jacobian of a gradient.

Symmetric Matrix

Matrix and its transpose are same

Hermitian Matrix

Matrix is hermitian if A_transpose (A^T) = A_bar (Conjugate of complex no)

Positive Semi-definite Matrices

One definition:
- Matrix M ∈ L(V) is positive definite iff
  - M is symmetric
  - v^T * M * v >= 0 for all v ∈ V
Now following are equal [0]
- v^T * M * v >= 0 for all v ∈ V
- All eigenvalues are non negative
- There exists a matrix B such that B^T * B = M
Application
- A twice differentiable function of n variable is convex if and only if Hessian of it is positive semi-definite (PSD) [2] [3]

References

[0] : https://www.cse.iitk.ac.in/users/rmittal/prev_course/s14/notes/lec11.pdf

[1] : http://www.princeton.edu/~amirali/Public/Teaching/ORF363_COS323/F15/ORF363_COS323_F15_Lec2.pdf

[2] : https://wiki.math.ntnu.no/_media/tma4180/2016v/note2.pdf

[3] : https://math.stackexchange.com/questions/946156/proving-convexity-of-a-function-whose-hessian-is-positive-semidefinite-over-a-co

Principal Component Analysis (PCA)

December 5, 2018May 5, 2020Archit Vora Leave a comment

PCA find its application in dimensionality reduction. It in turn helps in data visualization and inference. It is also useful in image compression.

PCA arises naturally as maximum likelihood estimation of a particular form of a linear gaussian latent model. However in this post we are focusing on standard non probabilistic view of PCA.

Applications

Dimentionality Reduction
Data Compression
- We had an example of images of digit, where in we can re-generate images with minimum loss by using first 4 eigen vectors
Factor Analysis
- Personality test answers are driven by underlying skill IQ and EQ
Data normalization
- Principle component Regression
Data Visualization

Intuition of Continuous Latent Variables

Bishop’s book[8] had example of images of digits
Latent variable can correspond to stretching or rotation
- Rotation can be say 30 degree, 31 degree which is continuous
- Contrast this with latent variable of hidden Markov models, where latent variables like rainy or sunny atmosphere is discrete
Wikipedia article[7] mentions classes of analysis when latent/observed variables are continuous/discrete
- Gaussian mixture model is example where latent variables are discrete and observed variables are continuos
- Factor analysis is example when both of them are continuous

Two Definitions

Direction that maximizes variance
- Variance is given by eigen values
Direction that minimizes error
- Error is given by eigen value

PCA vs Fisher’s discriminant

We have a note about Fisher’s discriminant analysis
Both PCA and fisher’s discriminant analysis projects data
- But PCA is unsupervised

Derivation

https://github.com/arcarchit/datastories/blob/master/notes/pca.pdf

Calculation

There are two ways to calculate principle component, 1) via covariance matrix and 2) SVD. It has been derived at [0] that both of them are actually the same.

Principle components represent direction with maximum variance. All the principle components are orthogonal.

We are computing major axis of variation. We can see this as projecting data on this axis and variance will be maximized here.

1) Via covariance Matrix

Steps involved [1]

Get data (X_raw)
Normalize( X = data after subtracting mean and divide by standard deviation )
Calculate covariance matrix ( C = X*X^T )
Calculate eigenvalues and eigenvectors of covariance matrix ( C = V*L*V^T )
1. Cx = λx
2. ( C – Iλ )x = 0
3. C – Iλ = 0
4. Determinant of | C – Iλ | = 0 given eigenvalues lambda
5. For each λ we can find x using (2), they will be eigenvectors [5]
Choosing components and forming a feature vector (F = Put eigenvector in columns )
Deriving new data-set (X_new = X * F, you may need to transpose)
How to get old data back ( X = X_new * F^(-1))

2) Via SVD

X = U*S*V^T

where

X – is n * p matrix

U^T * U = I | U is n * n | U are left singular vectors

V^T * V = I and V is p * p | V are right singular vectors

S is diagonal and n * p

X_new = Ur * Sr where Ur and Sr are reduced matrices based on variance we want to keep (say 90 %).

SVD is represents expansion of original data in a coordinate system where covariance matrix is orthogonal. [4]. Values of S are square root of eigenvalues.

U and V can are calculated by finding eigen-vectors of A*A^T and A^T*A. [4][6]

On eigenvalues and eigenvectors

Consider input matrix as transformation matrix [3]
There are some v for given matrix, which when transformed by the matrix just changes their scale but directions remain same
Such a vectors are called eigenvectors of the transformation matrix
Scale by which they change is represented by eigenvalues.

References

[0] : https://stats.stackexchange.com/questions/134282/relationship-between-svd-and-pca-how-to-use-svd-to-perform-pca

[1] : http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf

[2] : http://people.ciirc.cvut.cz/~hlavac/TeachPresEn/11ImageProc/15PCA.pdf

[3] : https://www.mathplanet.com/education/geometry/transformations/transformation-using-matrices

[4] : http://web.mit.edu/be.400/www/SVD/Singular_Value_Decomposition.htm

[5] : https://www.scss.tcd.ie/~dahyotr/CS1BA1/SolutionEigen.pdf

[6] : http://mysite.science.uottawa.ca/phofstra/MAT2342/SVDproblems.pdf

[7] : https://en.wikipedia.org/wiki/Latent_variable_model

[8] : Book : Pattern Recognition and Machine Learning by Christopher Bishop https://www.springer.com/gp/book/9780387310732

Naive Bayes Classifier

December 4, 2018May 17, 2023Archit Vora 2 Comments

There are two things[1]
- Probability model
- Classification model

Probability Model

A probability model is an extension of Bayes’ rule. It makes two assumptions:

Independence of Features: This assumption assumes that all features are independent of each other. However, it does not hold true in many cases. For example, having higher temperature does not necessarily imply higher humidity.
Equal Weight of Features: This assumption assumes that all features have equal importance or weight in the model.

Classification Model

The classification model involves the following steps:

Probability of Each Class: P(y) represents the probability of each class based on the training set.
Probability Estimation of Feature Values: The goal is to estimate the probability distribution of each feature value given a specific class, denoted as P(x_i|y). For discrete features, this can be achieved through simple probability calculations, such as multinomial Naive Bayes. For continuous features, Gaussian distributions can be used. In the case of count data, multinomial distributions are suitable.
Parameter Estimation: Parameter estimation is performed for each combination of class and feature.
Scikit-learn and Distribution Types: Scikit-learn library provides implementations of Gaussian Naive Bayes, Bernoulli Naive Bayes, and multinomial Naive Bayes classifiers. These classifiers refer to the distribution of features. It is important to note that different features can follow different distributions. Therefore, customization of the distribution based on the application may be necessary.

Advantages

Fast and Easy Implementation: Naive Bayes classifiers are known for their simplicity and efficiency in implementation.
Acceptable Classification Performance: While Naive Bayes classifiers may not always accurately predict probabilities, their classification performance is generally satisfactory.

Disadvantage

Independence Assumption: The assumption of feature independence does not hold true in all scenarios, which can affect the model’s accuracy.

Reference

[0] https://towardsdatascience.com/naive-bayes-classifier-81d512f50a7c

[1] https://en.wikipedia.org/wiki/Naive_Bayes_classifier

ANOVA Introduction

November 15, 2018May 18, 2023Archit Vora Leave a comment

ANOVA (Analysis of Variance) is a statistical technique used to determine if there are significant differences between the means of two or more groups. It is an extension of the t-test, which is used for comparing means between two groups.

The key idea behind ANOVA is to assess whether the data in different groups come from the same distribution or not. It is based on the assumption that data within each group should have low variance, while the variances between the groups should be relatively high.

The null hypothesis in ANOVA states that the means of all the groups are equal. If the groups are well separated and one group appears to be significantly different from the others, the null hypothesis is rejected.

There are different types of ANOVA designs, including one-way ANOVA and two-way ANOVA.

One-way ANOVA involves comparing the means of a single dependent variable across three or more independent groups. For example, you can use one-way ANOVA to analyze the stress levels of employees before the layoff announcement, after the announcement, and during the layoff period.

Two-way ANOVA, on the other hand, involves examining the interaction effects between two independent variables on a dependent variable. For instance, you can use two-way ANOVA to explore the stress levels of men and women before the layoff announcement, after the announcement, and during the layoff period. In two-way ANOVA, there are multiple null hypotheses to test, including whether the stress levels are the same for men and women, whether the stress levels are the same across different time points, and whether there is an interaction effect between gender and time.

The F distribution is utilized in ANOVA, similar to how the t distribution is used in t-tests. The F distribution has different shapes depending on the degrees of freedom for the numerator and denominator. As the degrees of freedom increase, the F distribution becomes more concentrated. It ranges from 0 to infinity, denoted as [0, infinity).

I have coded a notebook for calculating one way anova manually in python [1]

ANCOVA, MANOVA, MANCOVA

ANCOVA (Analysis of Covariance), MANOVA (Multivariate Analysis of Variance), and MANCOVA (Multivariate Analysis of Covariance) are related techniques.

ANCOVA (Analysis of Covariance) is an extension of ANOVA that incorporates a continuous independent variable, referred to as a covariate. The purpose of ANCOVA is to examine whether the relationship between the dependent variable and the independent variable(s) remains significant after controlling for the effect of the covariate. By including the covariate in the analysis, ANCOVA allows for a more accurate assessment of the impact of the independent variables on the dependent variable.

MANOVA (Multivariate Analysis of Variance) is used when there are two or more dependent variables. It examines whether there are significant differences between groups across the multiple dependent variables. MANOVA allows for the analysis of complex relationships among variables and provides a comprehensive assessment of group differences. It is particularly useful when the dependent variables are related or correlated.

MANCOVA (Multivariate Analysis of Covariance) is an extension of MANOVA that incorporates continuous independent variables along with the categorical independent variables. MANCOVA allows for the examination of the effects of both categorical and continuous independent variables on multiple dependent variables while controlling for the influence of covariates. It helps to determine whether the relationships between the independent variables and the dependent variables remain significant after accounting for the effects of covariates.

In summary:

ANOVA is used to compare means between three or more groups.
ANCOVA extends ANOVA by incorporating a continuous covariate to control for its effect.
MANOVA analyzes differences between groups across multiple dependent variables.
MANCOVA extends MANOVA by including both categorical and continuous independent variables along with covariates.

These techniques are widely used in research and can provide valuable insights into group differences and relationships among variables.

References:

[0] : https://www.technologynetworks.com/informatics/articles/one-way-vs-two-way-anova-definition-differences-assumptions-and-hypotheses-306553

[1] : https://github.com/arcarchit/datastories/blob/master/ANOVA.ipynb

[2] : http://www.statsmakemecry.com/smmctheblog/stats-soup-anova-ancova-manova-mancova

Linear Regression Derivations

November 12, 2018May 8, 2020Archit Vora Leave a comment

Suppose you have just two parameters a and b and have n observation
- y = a + b*x
- a and b can be found by simple optimization problem [1]
  - By setting partial derivative to 0

Simpler R2 formula is available at https://datastoriesweb.wordpress.com/2017/01/15/interpreting-statistical-values/

Matrix closed form Solution

gradient closed_form

Gradient Descent

We assume some initial values for parameters
We calculate gradient for each parameter and move in the negative direction of it by step size

I have coded gradient descent here at [2]

References :

[1] http://seismo.berkeley.edu/~kirchner/eps_120/Toolkits/Toolkit_10.pdf

[2] https://github.com/arcarchit/datastories/blob/master/gradient_descent.ipynb

Bayesian Learning – Quick Summary

October 19, 2018May 8, 2023Archit Vora 2 Comments

We had used Bayesian learning for house price prediction project, notebook is available at [0]. Purpose of this blog is to have quick summary of concept involved.

Bayesian learning allows us to have distribution for parameters rather than point estimate.
We keep sampling values of parameters say 10000 times. Histogram of last 6000 sample represent approximate posterior of parameter.
Every parameters, latent variables, output have distribution associated with them.
Top parameters in the hierarchy will have prior associated with them.
- We sample for these top parameters
- Calculate corresponding value for all latent variable coming down the tree
- Finally at output calculate likelihood against observed data
Different MCMC algorithms differs in two steps:
- How to jump to next to next sample
- How to decide whether next sample is acceptable or not
Metropolis algorithm:
- Jumps considering normal distribution at previous parameter value and some fixed standard deviation
- Acceptance test : random(0,1) < ratio of (Pnew/Pold)
  - When Pnew is higher we will definitely accept
  - Else probability of acceptance depends on how large Pold is
  - If Pold = 2 * Pnew it is 50 %
- Pnew = liklihood_new * prior_new
- Pold = liklihood_old * prior_old
Example of other MCMC algorithms:
- Metropolis-Hastings
- The Gibbs Sampler
- Hamiltonian MCMC
- The No-U-Turn Sampler (NUTS)

Reference

[0] : https://github.com/arcarchit/House-Prices-Advanced-Regression-Techniques/blob/master/notebooks/Bayesian%2BMaster.ipynb

[1] : http://twiecki.github.io/blog/2015/11/10/mcmc-sampling/

[2] : https://www.quantstart.com/articles/Bayesian-Inference-of-a-Binomial-Proportion-The-Analytical-Approach

Probabilistic Clustering

October 11, 2018May 21, 2023Archit Vora 1 Comment

This post serves as a lecture summary of a course offered by the University of Washington, which can be accessed at the following link: University of Washington – ML: Clustering and Retrieval.

Sample code related to the course can be found on GitHub at: Sample Code for EM Algorithm.

In this lecture, we will cover several important topics:

Why we need probabilistic clustering: We will explore the motivations behind using probabilistic clustering methods, which allow for soft assignments to clusters rather than rigid assignments.
Mixture models: We will dive into the concept of mixture models, which are a probabilistic approach to clustering. Mixture models allow data points to belong to multiple clusters with different probabilities.
Expectation Maximization (EM): We will study the Expectation Maximization algorithm, a powerful iterative algorithm used to estimate the parameters of mixture models. EM alternates between the expectation step, where cluster assignments are updated based on current parameter estimates, and the maximization step, where the parameters are updated based on current cluster assignments.

Why we need probabilistic clustering ?

To understand the need for probabilistic clustering, let’s consider the problem of determining user preferences based on their feedback. Suppose users provide feedback on whether they liked or disliked an article. We want to gather this feedback and group it into clusters to gain insights into their preferences.

For instance, imagine a user who likes an article about the first human landing on Mars. This article could be relevant to both world news and science news categories. In such cases, assigning articles to clusters with a hard assignment (where an article belongs to only one cluster) may not capture the complete picture. Soft assignment to clusters, where an article can belong to multiple clusters with different probabilities, is more suitable.

Now, let’s explore some scenarios where the traditional K-means algorithm fails:

Clusters with Different Spreads: When clusters have varying spreads, simply deciding based on a single sample may not provide an accurate representation. Instead, it is important to consider the number of data points within each cluster. Assigning an ambiguous point to a cluster that has more data points can be a more meaningful approach.
Overlapping Clusters: In cases where clusters overlap, soft assignment becomes crucial. It allows for assigning data points to multiple clusters based on the degree of membership.
Clusters with Different Shapes or Orientations: When clusters have distinct shapes or orientations, measuring the distance to the cluster center alone is insufficient. Considering the shape of the clusters is necessary for accurate clustering.

One approach to address these challenges is using K-means with scaled Euclidean distance. This modification results in clusters represented by axis-aligned ellipses, accommodating different spreads in each dimension. However, determining appropriate weights for each dimension in the calculation remains a question that needs to be addressed. Shape is still axis-aligned ellipses, which is a limitation in itself.

Motivating application

Let’s consider the task of clustering images based on their color composition. One approach is to compute the average values of the red (R), green (G), and blue (B) channels for each image. This results in representing each image as a single 3-dimensional vector [R, G, B], capturing the average color values.

By applying clustering algorithms to these image vectors, we can group similar images together based on their dominant color characteristics. Although our data is unlabeled, we can observe potential clusters such as:

Sky Cluster (Blue Dominated): Images that predominantly contain blue colors, such as those depicting clear skies, can be grouped into this cluster.
Sunset Cluster (Red Dominated): Images showcasing vibrant sunsets with warm hues and red tones can be grouped into this cluster.
Forest Cluster (Green Dominated): Images featuring lush green forests, landscapes, or vegetation will likely be grouped into this cluster due to their dominant green color composition.

Mixture Models

Here is how distribution of blue might look for all images

We can model image clustering using a mixture of Gaussian distributions. The model consists of three Gaussian components, each with its own parameters (μ and σ). The final distribution is a convex combination of these components: π1g1 + π2g2 + π3*g3, where 0 ≤ π ≤ 1 and the sum of all π values is equal to 1.

Each mixture component represents a unique cluster, characterized by its own set of parameters (πk, μk, σk). In the case of multidimensional data, such as image vectors, each individual distribution is a multivariate Gaussian distribution.

Without examining the image vector, we can determine the probability of it belonging to cluster k using the πk value. Given an image vector x, we can estimate the probability of it belonging to cluster k as P(x | params) = P(x | μk, Σk), where Σ represents the covariance matrix in the multivariate Gaussian.

Clustering with soft assignment involves calculating a responsibility vector for each data point. The length of the responsibility vector corresponds to the number of clusters, which in this case is three (sky, forest, sunset). Each element of the responsibility vector represents the probability of a data point belonging to a specific cluster.

In the case of document clustering, where the dimensionality is high (proportional to the number of words in a vocabulary), keeping track of parameters for each Gaussian becomes computationally expensive. To address this, we simplify the model by considering only diagonal values in the covariance matrix, effectively restricting the model to axis-aligned ellipses. This approach is similar to K-means with scaled Euclidean distance, which also allows for axis-aligned ellipses.

However, the advantage of the mixture model is twofold:

We do not need to specify weights for each dimension in advance; the model learns them from the data.
Different clusters can have varying weights for each dimension, allowing for more flexible modeling of the data.

EM

Responsibility vector calculation with known cluster parameters

The responsibility vector can be calculated if we know the cluster parameters: π_k, μ_k, and ∑_k.
The responsibility vector is different from the π values.
Given a data point, we can find its corresponding responsibility vector by estimating the probabilities in that cluster.
The responsibility vector needs to be normalized after evaluating the probabilities.

This process is a consequence of Bayes’ rule.

Estimating cluster parameters with known responsibility vectors

It is similar to parameter estimation of a multivariate Gaussian.
Each data point, x_i, is multiplied, and there is no need to worry about the denominator since it is also not N (the number of data points).
N_k_soft implies how many points belong to cluster 1
- Also it is a soft assignment

EM Algorithm

E-step: Estimation of the responsibility vector given parameters.
M-step: Maximum likelihood estimation of the parameters given the responsibility vector.
These two steps are repeated until convergence.
The EM algorithm can be derived as a coordinate ascent algorithm and converges to a local mode.
The biased coin example provides a good illustration of the EM algorithm. Check the image below

Challenges

The probability becomes infinite when the variance becomes zero.
This can occur when there is only one data point in a cluster, and the mean is itself with zero variance.
In high-dimensional spaces, this often happens, such as when a document does not have a specific word at all.
To solve this issue, a small value is added to the diagonal elements of the covariance matrix, similar to adding a prior in Bayesian approaches.

Relationship to K-means:

When a Gaussian has the same variance in all dimensions and shrinks to zero, the likelihood becomes either 0 or 1.
This behavior is similar to hard assignment, resembling the K-means algorithm.

Initialisation

Choose k observations at random and assign observations to nearest centroid
Above via k-means++
Solution of k-means
Grow mixture model by splitting cluster (sometime merging) until K clusters are formed

I have coded simple EM at [0]

References

[0] : https://github.com/arcarchit/datastories/blob/master/EM.ipynb