# Gaussian Mixture Models and Expectation-Maximization

Like K-Means, Gaussian Mixture Models (GMM) can be regarded as a type of unsupervised learning or clustering methods. They are among the most statistically mature methods for clustering. But unlike K-Means, GMMs are able to build soft clustering boundaries, i.e., points in space can belong to any class with a given probability.

The code presented here is part of the Accord.NET Framework. The Accord.NET Framework is a framework for developing machine learning, computer vision, computer audition, statistics and math applications in .NET. To use the framework in your projects, install it by typing Install-Package Accord.MachineLearning in your IDE’s NuGet package manager.

## Introduction

In statistics, a mixture model is a probabilistic model which assumes the underlying data to belong to a mixture distribution. In a mixture distribution, its density function is just a convex combination (a linear combination in which all coefficients or weights sum to one) of other probability density functions:

$p(x) = w_1 p_1(x) + w_2 p_2(x) + ... + w_n p_n(x)$

The individual pi(x) density functions that are combined to make the mixture density p(x) are called the mixture components, and the weights w1, w2, …, wn associated with each component are called the mixture weights or mixture coefficients.

The most common mixture distribution is the Gaussian (Normal) density function, in which each of the mixture components are Gaussian distributions, each with their own mean and variance parameters.

$p(x) = w_1 N(x \mid \mu_1,\Sigma_1) + w_2 N(x \mid \mu_2,\Sigma_2) + ... + w_n N(x \mid \mu_n,\Sigma_n)$

To give a more solid and concrete understanding of the Gaussian mixture models, we will be jumping directly on how to represent those abstractions in source code through the use of class diagrams. Hopefully class diagrams may give a better and direct overview on the subject than a lengthy theoretical discussion.

## Source code

First we will discuss about the characteristics of any probability distribution. The IDistribution interface (depicted below) says any probability distribution may have a probability function and a distribution function. It also says that probability distributions can also be fitted to sets of observations through a parameter estimation method.

Since distribution functions can be either univariate or multivariate, the methods above accept any number of scalar values as input through the use of the params keyword. The Fit method for parameter estimation also accepts a general System.Array because we could be given either an array of univariate observations in the form of a double[] or an array of multivariate observations in the form of a jagged double[][] array. Since each observation may have a different weight in the estimation process, the weights parameter can be used to inform those weights to the fitting method.

Probability distributions may also have some associated measures, mainly the Expectation and the Variance of the modeled random variable. Depending if the distribution is univariate or multivariate, the expected values and variances can be either a single scalar or a vector of scalars. For multivariate distributions the variance can be also represented by a symmetric, positive-definite matrix known as the variance-covariance matrix.

Moreover, probability distributions can also be continuous or discrete. In continuous distributions the probability function is known as the Probability Density Function (pdf), and in discrete distributions the probability function is known as the Probability Mass Function (pmf). However, since our distribution of interest, the Gaussian, is a continuous distribution, be focusing only on continuous multivariate distributions:

The Normal (Gaussian) distribution and the mixture distributions fall under the multivariate continuous distributions category and are implemented as such.

### Normal distributions

The Normal (or Gaussian) multivariate distribution is a multivariate distribution whose parameters are the mean vector μ and a variance-covariance matrix Σ. Its probability density function is given by:

$p(x) = \frac{1}{ (2\pi)^{k/2}|\Sigma|^{1/2} } \exp\!\Big( {-\tfrac{1}{2}}(x-\mu)' \Sigma^{-1} (x-\mu) \Big)$

The more detailed class diagram for the normal distribution shows the multivariate characteristics of most of its methods, in particular of the Fit method for estimating Gaussian parameters from a multivariate set of data. Each sample, given in the form of a double[], may also have different associated weights. In this case, the weight associated with each of the samples can be passed through the weights parameter. The source code for the Gaussian distribution can be found in the NormalDistribution.cs class.

To estimate the parameters of a Gaussian distribution all we have to do is compute the mean and the variance-covariance matrix out of the given data. This is a very straightforward procedure and can be done by using the standard methods of the Tools class in the Accord.Statistics namespace. The Fit method updates the distribution parameters, overwriting the current distribution.

### Mixture distributions

Mixture distributions are just convex combinations of other probability distributions. Thus said, mixture distributions have an array of component distributions and a coefficient array which contains the weights of the each of the component probability distributions.

In the generic implementation above, T is the type of the distributions used in the mixture. The most common mixture distribution, the Gaussian Mixture Distribution, could then be created by instantiating a Mixture object passing the initial Normal distributions as constructor parameters.

Alternatively, the parameters of the mixture distributions could be estimated from a set of observations by calling the Fit method. To estimate the parameters of a mixture distribution we will be using a common technique known as the Expectation-Maximization algorithm.

### Expectation Maximization

Expectation-Maximization (EM) is a well established maximum likelihood algorithm for fitting a mixture model to a set of training data. It should be noted that EM requires an a priori selection of model order, namely, the number of M components to be incorporated into the model. In the Accord.NET Framework, it can be called implicitly by using the Mixture.Fit method or explicitly using the ExpectationMaximization class. The source code for Expectation Maximization can be found in the ExpectationMaximization.cs file.

The general E-M algorithm is comprised of the following simple steps:

1. #### Initialization

Initialize the distribution parameters, such as the means, covariances and mixing coefficients and evaluate the initial value of the log-likelihood (the goodness of fit of the current distribution against the observation dataset)’;

2. #### Expectation

Evaluate the responsibilities (i.e. weight factors of each sample) using the current parameter values;

3. #### Maximization

Re-estimate the parameters using the responsibilities found in the previous step;

4. #### Repeat

Re-evaluate the log-likelihood and check if it has changed; if it has changed less than a given threshold, the algorithm has converged.

Below is the actual realization of the algorithm as implemented in the Fit method of the Mixture<T> class.

### Gaussian Mixture Models

Gaussian mixture models are among the most commonly used examples of mixture distributions. The GaussianMixtureModel class encompasses a Mixture<NormalDistribution> object and provides methods to learn from data and to perform actual classification through a simplified interface.

Moreover, a common problem which rises in mixture model fitting through E-M is the proper initialization of the mixture parameters. One popular approach is to initialize the mixture parameters using the centroids detected by the K-Means algorithm. The code below shows how the mixture parameters are computed by first creating a KMeans object and then passing the clusters to the mixture model for further processing.

## Using the code

Code usage is rather simple. First, instantiate a new GaussianMixtureModel object passing the desired number of components in the Gaussian mixture. Then, call the Compute(double[][], double) method passing the learning samples and a desired convergence threshold.

After training has been completed, a new sample can be classified by calling the Classify(double[]) method.

## Sample application

The accompanying sample application demonstrates how Gaussian Mixture Models can be used to cluster normally-distributed samples of data. By clicking the “Create random data” button, a given number of Normal distributions will be created in the two-dimensional space using random mean and covariance matrices.

After the data has been created, you can use the “Fit a Gaussian Mixture Model” button to fit a mixture of Gaussians to the data. Each of the Gaussians will receive a random color and the samples which have the greatest probability of belonging to any of the Gaussians will be colored accordingly.

Left: a random dataset containing three Gaussian clusters. Right: the clusters as identified by the Gaussian Mixture Model.

Left: a random dataset containing 10 Gaussian clusters. Right: the clusters as identified by the Gaussian Mixture Model.

## Conclusion

In this article we found how Gaussian Mixture Models can be successfully used to create soft clustering boundaries around data. Those soft boundaries are possible because in a mixture model each sample is said to belong to a cluster only within certain probability.

A main drawback of GMMs is that the number of Gaussian mixture components, as in K-Means, is assumed known as prior, so it cannot be considered as totally unsupervised clustering method. To aid in this situation, one could use additional algorithms such as the Minimum Message Length (MML) criteria.

Another problem arises with the correct initialization of the mixture components. A poor choice of initial parameters will invariably lead to poor results, as the E-M algorithm converges only to a local optimization point. A commonly used solution is initialization by randomly sampling in the mixture data. Other common solution, covered by the implementation presented here, is to use K-Means to perform initialization. However, K-Means itself may also converge to a poor solution and impact negatively in the mixture model fitting.

## References

• Raja, Y., Shaogang, G. Gaussian Mixture Models. Department of Computer Science, Queen Mary and Westfield College, England.
• Bishop, C. M. (2007), Pattern Recognition and Machine Learning (Information Science and Statistics), Springer.
• Wikipedia contributors, “Normal distribution,” Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/w/index.php?title=Normal_distribution (accessed October 18, 2010).
• Wikipedia contributors, “Probability density function,” Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/w/index.php?title=Probability_density_function (accessed October 18, 2010).
• Wikipedia contributors, “Mixture density,” Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/w/index.php?title=Mixture_density (accessed October 18, 2010).

1. Excellent work and very well presented article. I didn’t even imagine that Gaussian Mixtures could be used as a classifier. BTW, the result with 10 clusters is amazing especially if compared with the mere K-Means. I look forward for the HMM with Gaussian mixtures 🙂

2. Thanks for this great post : )
well i have a question
how l is initialized in this constructor ?:D

public CholeskyDecomposition(double[,] value)
{
if (value == null)
{
throw new ArgumentNullException(“value”, “Matrix cannot be null.”);
}

if (value.GetLength(0) != value.GetLength(1))
{
throw new ArgumentException(“Matrix is not square.”, “value”);
}

int dimension = value.GetLength(0);
L = new double[dimension, dimension];

double[,] a = value;
double[,] l = L;

this.positiveDefinite = true;
this.symmetric = true;

for (int j = 0; j < dimension; j++)
{
//double[] Lrowj = l.GetRow(j);
double d = 0.0;
for (int k = 0; k < j; k++)
{
//double[] Lrowk = l.GetRow(k);
double s = 0.0;
for (int i = 0; i < k; i++)
{
s += l[k, i] * l[j, i];
}
l[j, k] = s = (a[j, k] – s) / l[k, k];
d = d + s * s;

this.symmetric = this.symmetric & (a[k, j] == a[j, k]);
}

d = a[j, j] – d;

this.positiveDefinite = this.positiveDefinite & (d > 0.0);
l[j, j] = System.Math.Sqrt(System.Math.Max(d, 0.0));
for (int k = j + 1; k < dimension; k++)
{
l[j, k] = 0.0;
}
}
}

3. Hi Rehab,

I think I didn’t understand your question. Could you clarify?

By the way, for the most current version of this code you can download the Accord.NET Framework, which has updated versions of Gaussian Mixture Model fitting.

Best regards,
César

4. Thanks Cesar alot for ur reply
i have downloaded the last version : ))and there is no problem now
well i have a quetsion if u want to test ur own implementation of GMM what will u do ?

5. i don’t mean accord.net, i mean any implementation of GMM, how could i say that this implementation works well?

6. Hi Rehab,

Well, I would create a bunch of gaussian mixtures and try to fit them 🙂

If you find any better method, please let me know, as I would also find it useful.

Best regards,
César

7. Thanks Cesar for ur reply
well i have another question
why the covariance of the data i used is not positive definite all the time ?

8. and what is the solution if the determinant of the decomposition =0 ?

9. Hi Rehab,

Not positive definite matrices most probably occurs when your data is rank deficient or when the data could be sufficiently fit in a smaller number of dimensions.

This is the case when one of the variables is a constant, for example. In this case, either a reformulation of the problem or a variable reduction step can sidestep this problem.

Another reason is that past Gaussians may have become degenerate, leading to all sorts of numerical problems, such as non-invertible covariance matrices. Again, this could be happening by the same problems listed above.

To sidestep those two problems, one could also include a small regularization constant to be added in each covariance matrix’s diagonal. This value is added until the covariance eventually turns out to be positive definite.

If would like to take a look, the latest version of the Accord.NET Framework has included support for specifying regularization constants in Gaussian mixture fitting. This should avoid most errors but may result in loss of accuracy by the final model.

Best regards,
Cesar

10. Thanks Cesar for this useful information :))
yeah i know this part of code as i debugged it line by line 😀 Really great Work (Y)(Y)
but what about the determinant the matrix may be positive definite but its determinant is zero ?

11. yes but i am getting a singular exception matrix while the matrix is a positive definite

this.chol = new CholeskyDecomposition(cov, false, true);

if (!this.chol.PositiveDefinite)
throw new NonPositiveDefiniteMatrixException(“Matrix is not positive definite.”);

double det = chol.Determinant;

if (det == 0) throw new SingularMatrixException(“Matrix is singular.”);

12. Anonymous says:

Hi cesar. First, thank you for this impressive library.

I have an image that consist of white background and black characters. Is possible to use GMM to segment the characters???

If yes could you please post a simple example on how to do this.

13. Hi Anonymous,

Well, this potentially could, but I haven’t tried using it for image processing yet. However, since you have white background and black characters, wouldn’t a simple thresholding work better in your case?

Best regards,
César

14. Anonymous says:

Hi cesar thank you for your response.
My problem is that characters are overlapping each others.

15. Hi Cesar i think there is a bug in checking whether the matrix is positive definite or not i can send u a test case if u’d like : )

16. Hi Rehab!

Sure, please send it. The more bugs reported and fixed, better the code. If you wish you can send me by email.

Best regards,
César

17. There used to be a button below my picture in the Google Profile page at the Contact tab. I am unsure if the page has changed since then. Anyways, my email is cesarsouza at gmail dot com. Thanks!

Regards,
César

Hi Cesar
This is really a great work. Please I have a question. My observations are complex numbers, Can I deduce the pdf of these data using the GMM? If yes how can I do that?

Thanks;

One more question Cesar. Now how can I determine the optimal number of Gaussian components? Also, do the number of components and the initial values of parameters affect strongly the final pdf?

Thanks;

20. Reinhold says:

Thank you for this excellent tutorial on GMMs and the code samples!
A few comments to the code and the way you handle weighted samples: If I’m not mistaken, in the E-Step in method ‘Fit’, there is some bug here:

double w = weights[n];

for (int k = 0; k < K; k++)
den += cache[k] = pi[k] * pdf[k].ProbabilityDensityFunction(x) * w;

for (int k = 0; k < K; k++)
gamma[k][n] = cache[k] / den;

If you multiply the weighted PDF with your per-observation weight ‘w’, It will be cancelled out when dividing through the variable ‘den’ (since ‘den’ is scaled by w too). It’s also in the current version of your repo. What you want to achieve is performing the expectation step on weighted samples. This problem was discussed in this forum, and the suggestion the last poster came up with (who is right IMHO) is – applied to your code – this:

for (int k = 0; k < K; k++)
den += cache[k] = pi[k] * pdf[k].ProbabilityDensityFunction(x);

for (int k = 0; k < K; k++)
gamma[k][n] = cache[k] / den * w;

So you just scale the gamma of the n-th sample by the n_th weight. But now you also have to correctly normalize the new weights pi[k] in the maximization step:

pi[k]= Nk / weights.Sum();

(or leave the division away if you assume the weights to sum up to one).

I also didn’t quite understand why you multiply the weights array with N in the beginning of the method ‘Fit’? In the case of equally-weighted samples, this just cancels out the scaling by ‘w’ in the ‘logLikelihood’ method.

Actually, I think one should weight the log of the pdf, instead of calculating the log of the weighted pdf, i.e.:

for (int k = 0; k < K; k++)
sum += pi[k] * pdf[k].ProbabilityFunction(x);
likelihood += System.Math.Log(sum) * w;

for (int k = 0; k < K; k++)
sum += pi[k] * pdf[k].ProbabilityFunction(x) * w;
likelihood += System.Math.Log(sum);

This would also normalize the result of your LogLikelihood method, i.e. divide the result by N (see [CG10], end of Section 6.2). Currently, this is not the case, so you should get a value that scales with the number of samples, which would render the threshold as stop-criterion senseless. In the current version in the repo, I saw you switched to some kind of percentage threshold by multiplying the threshold with the absolute value of your current log likelihood? I think this does the trick, but dividing the return value of your logLikelihood function by N (or proper weighting) would be the way to go if the threshold denotes an absolute delta.

Please correct me if I’m wrong!

Thanks a lot!
Reinhold

[CG10] Chen and Gupta, 2010. EM Demystified. Technical Report. [https://www.ee.washington.edu/techsite/papers/documents/UWEETR-2010-0002.pdf].

21. Hi Reinhold!

This is an excellent beginning for a new year! Many thanks for pointing that out. There was already another issue with the Mixture class which I was going to address in soon-to-be released version update, so your fix came in a good time. The weighting were indeed being canceled in the summations, and this is now fixed.

Many thanks, and a happy new year!

Regards,
Cesar

22. Hi César!

You’re welcome! I’m currently implementing a GMM-Fit module in C++, which was mainly inspired by your design you posted here.

I thought you might also be interested in a performance boost I recognized (don’t know how it behaves in your C# implementation).
See here.

Cheers!

23. Hi Reinhold!

The speedup is very interesting. I had implemented this method for dealing with very low probabilities or near- non-positive-definite covariance matrices, but never measured the speedup. To avoid numerical instabilities, It is also necessary compute the log-determinant directly instead of using Log(det) as it seems you are doing in the second code picture shown in your Google+ post. This technique is available in the full Accord.NET Framework (unlike the code of the post, which is a bit outdated… sorry about that).

Here is the most up-to-date multivariate Normal distribution code:

And here is the code for the direct Log-Determinant for the Cholesky decomposition:

I hope you also find this useful!

Best regards,
César

24. Anonymous says:

Hi César,

Is implementation of HMM with Gaussian mixtures available to use in current version of Accord.NET?
Also is it possible to use K mean for initialization of continuous HMM of Accord.Net in case of uni-variate data?

Regards,
coder

25. Hi Coder,

Yes, it is. You can create a generic Hidden Markov Model specifying a Mixture as the emission density. For example, you can create a HiddenMarkovModel<Mixture<NormalDistribution>> and then train it using the appropriate learning algorithm.

You could either initialize those distributions directly using K-Means, or you could also use the Segmental K-Means, available in the framework as the Viterbi learning algorithm.

Best regards,
Cesar

1. Anonymous says:

Cesar hi,

I’m guessing that if there is lack of expert knowledge to label training data, K-mean clustering can be used to label each data point in training sequence. Hence, appropriate learning algorithms (i.e,. continuous HMM , mixture of Gaussian HMM) will learn from the labeled training data from K-mean clustering. Am i understood it properly?

Another idea which i’m guessing is that no only continuous HMM, but also discrete HMM can be use K-mean to label data by given number of classes. Then we can use appropriate techniques data discretization technique to convert the distributions into discrete sequences before learning training.

I would appreciate it if you comment on those.
Regards,
coder

2. Hmm… I am not very sure about it. When I answered that mixture of Gaussians and K-Means could be used with HMMs, I actually meant they could be used to initialize the emission densities for those HMMs. What you seems to be asking is a way to perform sequence clustering using the HMMs, and this is something I hadn’t thought about yet.

Perhaps you could use each HMM as a probability distribution, then create a mixture of HMMs and use the Expectation-Maximization algorithm to perform the clustering. Perhaps it could work.

Best regards,
Cesar

26. Anonymous says:

Hi Cesar,

I am new to GMM and C#. To understand GMM I tried running your sample application – clicked “Create random data” but nothing happens. Am I missing anything ?

Thank you

1. Hi there,

Thanks for reporting the issue. The application has been fixed.

Best regards,
Cesar

27. Cesar, thanks for this tutorial, I’m learning a lot out of it, I want to ask for your help with the following questions:

1. Why the density function is used instead of the cumulative function of the multivariate normal?

2. Is there a normalization step?

Thanks again

1. Hi Raul,

About the first question: The density function is the probability of the distribution assuming a given value, which in this case is the value of your sample. If you are using a GMM for say, classification, then it would make more sense to consider the likelihood of the given sample rather than the likelihood of all points below your sample. Besides, there is no easy way to compute the cumulative function for a multivariate Gaussian.

And now about the second question, do you mean normalizing your samples to have zero-mean and unit variance? This could be done before starting the estimation procedure. However, since we are using Gaussian clusters, the clusters would likely be able to fit the form of your distribution.

Best regards,
Cesar

28. Cesar
Thanks for this. How would i use mixture model with data which have weights, i want to have mixture or density estimate which fits data which have weights. Can you explain how this can be done through a snippet or method.
Thanks a ton!

29. Hi Uday,

Yes, this can be done. However, I would like to ask you if you can download the latest version of the Accord.NET Framework for doing this. Once you have downloaded and installed the framework, you may fit a weighted Normal distribution directly using:

// Randomly initialize some mixture components
NormalDistribution[] components = new NormalDistribution[2];
components[0] = new NormalDistribution(2, 1);
components[1] = new NormalDistribution(5, 1);

// Create an initial mixture
Mixture mixture = new Mixture(components);

// Now, suppose we have a weighted data
// set. Those will be the input points:

double[] points = { 0, 3, 1, 7, 3, 5, 1, 2, -1, 2, 7, 6, 8, 6 }; // (14 points)

// And those are their respective unnormalized weights:
double[] weights = { 1, 1, 1, 2, 2, 1, 1, 1, 2, 1, 2, 3, 1, 1 }; // (14 weights)

// Let’s normalize the weights so they sum up to one:
weights = weights.Divide(weights.Sum());

// Now we can fit our model to the data:
mixture.Fit(points, weights); // done!

// Our model will be:
double mean1 = mixture.Components[0].Mean; // 1.41126
double mean2 = mixture.Components[1].Mean; // 6.53301

I hope it helps. If you need further help, please leave a message at the Accord.NET Forums so we can continue the conversation there!

Best regards,
Cesar

30. Hi Cesar
can u tell me why i m getting the below RunTime ERROR

gmm.Compute(mixture, 0.0001);
at that line NonPositiveDefiniteMatrixException was unhandled

31. dear Mr.
I am a student, I have a question how to retrieve the data and how you draw because I want to draw my data from datagrid because I want to implement in my web Application
thx

32. Hi Cesar
How could i get the final centroid of each cluster?
thenaks

33. Hi Cesar!

I would like to use your tools to infer position of an object in an indoor environment.

For example, I receive data from 8 access points each time instant. So, each time I get a vector with 8 measurements, which correspond exactly one position of my object (each position is one state).

The question is how to use your tools to infer the state of that object.

For exemple:

t1 = (0.5, 1.0, 2.0, 1.5, 3.0, -0.5, 2.0, 1.0) => state 1
t1 = (1.5, 0.1, 0. 2, 0.5, 2.5, 1.0, 0.4, 1.1) => state 2

etc