Expectation Maximization - An explanation of statistical inference using the example of Gaussian Mixture Models

Christoph Winkler
M. Sc. Business Information Systems, Data Scientist.
Generative Modeling is simply about modeling "How the world could be" and not necessarily "How the world actually is".

Clustering forms a group of unsupervised learning algorithms that are designed to find unknown patterns in data. It is one of the fundamental methods for many researches and practitioners working with data. The k-means clustering algorithm is one of the best known and simpliest clustering methods used today. The algorithm assigns each data point to exactly one cluster. This is called hard assignment. However, the lack of in-between assignment often leads to issues regarding overlapping clusters. Additionally, k-means ignores the variance of the clusters and therefore it can happen that the algorithm does not work well with different cluster sizes.

In this article the Expectation Maximization (EM) algorithm is explained and discussed in simple words as a fundamental principal of statistical inference. Afterwards an implementation of the concept is presented in Python using the example of univariate Gaussian Mixture Models (GMMs). The article is written for researchers and practitioners that have a basic understanding of statistics and machine learning.

Model

EM clustering is a method to adress the issue of hard assignment and data generated by different cluster sizes. It adds the statistical assumption that every data point x_i is randomly drawn from a statistical distribution k. In Gaussian Mixture Models the underlying distribution is a normal distribution. Therefore, every cluster k ∈ {1,…,K} equals a normal distribution with mean μ_k and standard deviation σ_k:

$z_{i} \sim Categorial(\frac{1}{K},...,\frac{1}{K})$

$x_{i}|z_{i}, \mu \sim N(z_i^T,\mu,1)$

K is a hyperparameter of the model and determines the number of clusters which is fixed. A hyperparameter is a constant that has to be estimated before inferring the model parameters. Usually a hyperparameter does not change during training. However, a model parameter is not known before. It has to be estimated during inference. In many cases model parameters are randomly initialized.
Another relevant variable is x. It represents the observed data which depends on the cluster assignment z, the mean μ and the standard deviation σ. Φ is a K dimensional vector of a categorial distribution. It encodes the prior probability assumption that a data point x_i was generated from a certain cluster z_i. However, we do not have an inital assumption. Therefore, we set Φ_k = 1/K for k ∈ {1,…,K}.

EM Clustering

The goal of the algorithm is to maximize the overall likelihood p(x|z,μ,σ) that the data x is observed given the final cluster assignments z_i and its parameters (μ and σ). A requirement of the EM algorithm is that the probability density function (pdf) of the a posteriori distribution is known and available in closed form. The probability density function of the posterior distribution in univariate GMMs is the probability density function of the univariate normal distribution:

$p(x)=\frac{1}{\sqrt{2\pi \sigma^{2}}}e^{-\frac{(x-\mu)^2}{2 \sigma}}$

The EM algorithm computes a point estimate of the parameters of the actual posterior distribution. However, the function that is optimized during inference is non-convex. The properties of a non-convex function let us conclude that a found optimum is not guaranteed to be the global optimum. Therefore, it only finds a local optimal solution for the latent variables (z, μ; and σ) by using the observed variables x.

E-Step

In the first step the probability for each data point x_i and every possible cluster assignment is computed.

$p(x_{i}| \mu_{k}, \sigma_{k}) = \frac{\Phi_{k}N(x_{i},\mu_{k},\sigma_{k})}{\sum_{k=1}^{K}\Phi_{k}N(x_{i},\mu_{k},\sigma_{k})}$

In the numerator the prior expectation of the cluster assignment is multiplied by the density of the currently selected cluster. The denominator is the normalization factor that simply computes the sum of densities over all possible cluster assignments k ∈ {1,…,K}. Therefore, the outcome of the normalized densities is a probability value between 0 and 1. It represents the probability that a data point x_i was generated by cluster k and its estimated parameters. These probabilites are computed for each data point and each possible cluster assignment which forces a runtime complexity of at least O(n * k).

M-Step

In the next step the model parameters are updated. First the prior expectation of the cluster assignment:

$\phi_k = \frac{\sum p(x_i|\mu_k,\sigma_k)}{N}$

Next the cluster parameters μ and σ are updated:

$\mu_{k} = \frac{\sum_{i=1}^{N}p(x_{i}|\mu_{k}, \sigma_{k})x_{i}}{\sum_{i=1}^{N}p(x_{i}|\mu_{k}, \sigma_{k})}$

$\sigma_{k} = \sqrt{\frac{\sum_{i=1}^{N}p(x_{i}|\mu_{k}, \sigma_{k})(x_{i}-\mu_{k})^2}{\sum_{i=1}^{N}p(x_{i}|\mu_{k}, \sigma_{k})}}$

The updated value for μ is the weighted average of all data points x_i that are assigned to cluster k. Similar the updated value for σ is also computed by using the probabilities as weights.

Example

Let’s assume we observe the data points x. We set K=2 with a prior cluster assignment Φ. The initial values are set as follows:

$x = [3, 4.5],\mu = [2,5], \sigma = [1,1], \phi = [0.5, 0.5]$

The density for data point 1 (assuming cluster 1 generated it):

$N(3,\mu_{1},\sigma_{1})=N(3,2,1)=0,24$

The density for data point 1 (assuming cluster 2 generated it):

$N(3,\mu_{2},\sigma_{2})=N(3,5,1)=0.05$

Next step is to normalize the densities to compute the probability values (E-Step):

$p(x_{1},\mu_{1},\sigma_{1})=\frac{0.5*0.24}{0.5*0.24 + 0.5*0.05}=0.83$

$p(x_{1},\mu_{2},\sigma_{2})=\frac{0.5*0.05}{0.5*0.24 + 0.5*0.05}=0.17$

The probability that data point 1 was generated by cluster 1 is 83 percent whereas the probability that it was generated by cluster 2 is 17 percent. We also compute the probability values for data point 2:

$p(x_{2},\mu_{1},\sigma_{1})=0.05$

$p(x_{2},\mu_{2},\sigma_{2})=0.95$

The probability that data point 2 was generated by cluster 1 is 5 percent whereas the probability that it was generated by cluster 2 is 95 percent.

Last step is to update the model parameters (M-Step). Here you can find the new model parameter estimates after the first iteration:

$\phi_1 = \frac{0.83 + 0.05}{2} = 0.44$

$\phi_2 = \frac{0.17 + 0.95}{2} = 0.56$

$\mu_{1} = \frac{0.83*3+0.05*4.5}{0.83+0.0.05}=3.09$

$\mu_{2} = \frac{0.17*3+0.95*4.5}{0.17+0.95}=4.27$

$\sigma_{1} = \sqrt{\frac{0.83*(3-2)^2+0.05*(4.5-2)^2}{0.83+0.05}}=1.14$

$\sigma_{2} = \sqrt{\frac{0.17*(3-5)^2+0.95*(4.5-5)^2}{0.18+0.95}}=0.82$

In practice both steps are repeated several times. It is guaranteed that the model parameters converge to a stationary point. Therefore, let’s evaluate the convergence of the model. We compute the overall log likelihood before and after the first iteration:

$log(p(x|\mu, \sigma)) = -2.46$

$log(p(x|\mu, \sigma)) = -1.81$

It turns out that the log likelihood is increasing which lets us conclude it is more likely that the estimated model after the first iteration has generated the observed data. Therefore, the algorithm works as expected and the model is getting better.

I hope you enjoyed reading this article and got a better understanding of the Expectation Maximization algorithm. As a next step, please go on to read Expectation Maximization - A Python implementation.