Like KMeans, Gaussian Mixture Models (GMM) can be regarded as a type of unsupervised learning or clustering methods. They are among the most statistically mature methods for clustering. But unlike KMeans, GMMs are able to build soft clustering boundaries, i.e., points in space can belong to any class with a given probability.
 Download sample application (with source code).
 Browse the sample application source code online
 Install the Accord.NET machine learning framework.
The code presented here is part of the Accord.NET Framework. The Accord.NET Framework is a framework for developing machine learning, computer vision, computer audition, statistics and math applications in .NET. To use the framework in your projects, install it by typing InstallPackage Accord.MachineLearning in your IDE’s NuGet package manager.
Contents
Introduction
In statistics, a mixture model is a probabilistic model which assumes the underlying data to belong to a mixture distribution. In a mixture distribution, its density function is just a convex combination (a linear combination in which all coefficients or weights sum to one) of other probability density functions:
The individual p_{i}(x) density functions that are combined to make the mixture density p(x) are called the mixture components, and the weights w_{1}, w_{2}, …, w_{n} associated with each component are called the mixture weights or mixture coefficients.
The most common mixture distribution is the Gaussian (Normal) density function, in which each of the mixture components are Gaussian distributions, each with their own mean and variance parameters.
To give a more solid and concrete understanding of the Gaussian mixture models, we will be jumping directly on how to represent those abstractions in source code through the use of class diagrams. Hopefully class diagrams may give a better and direct overview on the subject than a lengthy theoretical discussion.
Source code
First we will discuss about the characteristics of any probability distribution. The IDistribution interface (depicted below) says any probability distribution may have a probability function and a distribution function. It also says that probability distributions can also be fitted to sets of observations through a parameter estimation method.
Since distribution functions can be either univariate or multivariate, the methods above accept any number of scalar values as input through the use of the params keyword. The Fit method for parameter estimation also accepts a general System.Array because we could be given either an array of univariate observations in the form of a double[] or an array of multivariate observations in the form of a jagged double[][] array. Since each observation may have a different weight in the estimation process, the weights parameter can be used to inform those weights to the fitting method.
Probability distributions may also have some associated measures, mainly the Expectation and the Variance of the modeled random variable. Depending if the distribution is univariate or multivariate, the expected values and variances can be either a single scalar or a vector of scalars. For multivariate distributions the variance can be also represented by a symmetric, positivedefinite matrix known as the variancecovariance matrix.
Moreover, probability distributions can also be continuous or discrete. In continuous distributions the probability function is known as the Probability Density Function (pdf), and in discrete distributions the probability function is known as the Probability Mass Function (pmf). However, since our distribution of interest, the Gaussian, is a continuous distribution, be focusing only on continuous multivariate distributions:
The Normal (Gaussian) distribution and the mixture distributions fall under the multivariate continuous distributions category and are implemented as such.
Normal distributions
The Normal (or Gaussian) multivariate distribution is a multivariate distribution whose parameters are the mean vector μ and a variancecovariance matrix Σ. Its probability density function is given by:
The more detailed class diagram for the normal distribution shows the multivariate characteristics of most of its methods, in particular of the Fit method for estimating Gaussian parameters from a multivariate set of data. Each sample, given in the form of a double[], may also have different associated weights. In this case, the weight associated with each of the samples can be passed through the weights parameter. The source code for the Gaussian distribution can be found in the NormalDistribution.cs class.
To estimate the parameters of a Gaussian distribution all we have to do is compute the mean and the variancecovariance matrix out of the given data. This is a very straightforward procedure and can be done by using the standard methods of the Tools class in the Accord.Statistics namespace. The Fit method updates the distribution parameters, overwriting the current distribution.
Mixture distributions
Mixture distributions are just convex combinations of other probability distributions. Thus said, mixture distributions have an array of component distributions and a coefficient array which contains the weights of the each of the component probability distributions.
In the generic implementation above, T is the type of the distributions used in the mixture. The most common mixture distribution, the Gaussian Mixture Distribution, could then be created by instantiating a Mixture object passing the initial Normal distributions as constructor parameters.
Alternatively, the parameters of the mixture distributions could be estimated from a set of observations by calling the Fit method. To estimate the parameters of a mixture distribution we will be using a common technique known as the ExpectationMaximization algorithm.
Expectation Maximization
ExpectationMaximization (EM) is a well established maximum likelihood algorithm for fitting a mixture model to a set of training data. It should be noted that EM requires an a priori selection of model order, namely, the number of M components to be incorporated into the model. In the Accord.NET Framework, it can be called implicitly by using the Mixture.Fit method or explicitly using the ExpectationMaximization class. The source code for Expectation Maximization can be found in the ExpectationMaximization.cs file.
The general EM algorithm is comprised of the following simple steps:

Initialization
Initialize the distribution parameters, such as the means, covariances and mixing coefficients and evaluate the initial value of the loglikelihood (the goodness of fit of the current distribution against the observation dataset)’;

Expectation
Evaluate the responsibilities (i.e. weight factors of each sample) using the current parameter values;

Maximization
Reestimate the parameters using the responsibilities found in the previous step;

Repeat
Reevaluate the loglikelihood and check if it has changed; if it has changed less than a given threshold, the algorithm has converged.
Below is the actual realization of the algorithm as implemented in the Fit method of the Mixture<T> class.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 
/// <summary> /// Fits the underlying distribution to a given set of observations. /// </summary> /// /// <param name="observations">The array of observations to fit the model against. The array /// elements can be either of type double (for univariate data) or /// type double[] (for multivariate data).</param> /// <param name="weights">The weight vector containing the weight for each of the samples.</param> /// <param name="options">Optional arguments which may be used during fitting, such /// as regularization constants and additional parameters.</param> /// /// <remarks> /// Although both double[] and double[][] arrays are supported, /// providing a double[] for a multivariate distribution or a /// double[][] for a univariate distribution may have a negative /// impact in performance. /// </remarks> /// public override void Fit(double[] observations, double[] weights, IFittingOptions options) { // Estimation parameters double threshold = 1e3; IFittingOptions innerOptions = null; if (options != null) { // Process optional arguments MixtureOptions o = (MixtureOptions)options; threshold = o.Threshold; innerOptions = o.InnerOptions; } // 1. Initialize means, covariances and mixing coefficients // and evaluate the initial value of the loglikelihood int N = observations.Length; int K = components.Length; double weightSum = weights.Sum(); // Initialize responsabilities double[] norms = new double[N]; double[][] gamma = new double[K][]; for (int k = 0; k < gamma.Length; k++) gamma[k] = new double[N]; // Clone the current distribution values double[] pi = (double[])coefficients.Clone(); T[] pdf = new T[components.Length]; for (int i = 0; i < components.Length; i++) pdf[i] = (T)components[i].Clone(); // Prepare the iteration double likelihood = logLikelihood(pi, pdf, observations, weights); bool converged = false; // Start while (!converged) { // 2. Expectation: Evaluate the component distributions // responsibilities using the current parameter values. Array.Clear(norms, 0, norms.Length); for (int k = 0; k < gamma.Length; k++) for (int i = 0; i < observations.Length; i++) norms[i] += gamma[k][i] = pi[k] * pdf[k].ProbabilityFunction(observations[i]); for (int k = 0; k < gamma.Length; k++) for (int i = 0; i < weights.Length; i++) if (norms[i] != 0) gamma[k][i] *= weights[i] / norms[i]; // 3. Maximization: Reestimate the distribution parameters // using the previously computed responsibilities for (int k = 0; k < gamma.Length; k++) { double sum = gamma[k].Sum(); for (int i = 0; i < gamma[k].Length; i++) gamma[k][i] /= sum; pi[k] = sum / weightSum; pdf[k].Fit(observations, gamma[k], innerOptions); } // 4. Evaluate the loglikelihood and check for convergence double newLikelihood = logLikelihood(pi, pdf, observations, weights); if (Double.IsNaN(newLikelihood)  Double.IsInfinity(newLikelihood)) throw new ConvergenceException("Fitting did not converge."); if (Math.Abs(likelihood  newLikelihood) < threshold * Math.Abs(likelihood)) converged = true; likelihood = newLikelihood; } // Become the newly fitted distribution. this.initialize(pi, pdf); } 
Gaussian Mixture Models
Gaussian mixture models are among the most commonly used examples of mixture distributions. The GaussianMixtureModel class encompasses a Mixture<NormalDistribution> object and provides methods to learn from data and to perform actual classification through a simplified interface.
Moreover, a common problem which rises in mixture model fitting through EM is the proper initialization of the mixture parameters. One popular approach is to initialize the mixture parameters using the centroids detected by the KMeans algorithm. The code below shows how the mixture parameters are computed by first creating a KMeans object and then passing the clusters to the mixture model for further processing.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 
/// <summary> /// Divides the input data into K clusters modeling each /// cluster as a multivariate Gaussian distribution. /// </summary> public double Compute(double[][] data, double threshold) { int components = this.gaussians.Count; // Create a new KMeans algorithm KMeans kmeans = new KMeans(components); // Compute the KMeans kmeans.Compute(data, threshold); // Initialize the Mixture Model with data from KMeans NormalDistribution[] distributions = new NormalDistribution[components]; double[] proportions = kmeans.Clusters.Proportions; for (int i = 0; i < components; i++) { double[] mean = kmeans.Clusters.Centroids[i]; double[,] covariance = kmeans.Clusters.Covariances[i]; distributions[i] = new NormalDistribution(mean, covariance); } // Fit a multivariate Gaussian distribution model = Mixture<NormalDistribution>.Estimate(data, threshold, proportions, distributions); // Return the loglikelihood as a measure of goodnessoffit return model.LogLikelihood(data); } 
Using the code
Code usage is rather simple. First, instantiate a new GaussianMixtureModel object passing the desired number of components in the Gaussian mixture. Then, call the Compute(double[][], double) method passing the learning samples and a desired convergence threshold.
1 2 3 4 5 6 7 8 
// Create a new Gaussian Mixture Model with 2 components GaussianMixtureModel gmm = new GaussianMixtureModel(2); // Compute the model (estimate) gmm.Compute(samples, 0.0001); // Classify a single sample int c = gmm.Gaussians.Nearest(sample); 
After training has been completed, a new sample can be classified by calling the Classify(double[]) method.
Sample application
The accompanying sample application demonstrates how Gaussian Mixture Models can be used to cluster normallydistributed samples of data. By clicking the “Create random data” button, a given number of Normal distributions will be created in the twodimensional space using random mean and covariance matrices.
After the data has been created, you can use the “Fit a Gaussian Mixture Model” button to fit a mixture of Gaussians to the data. Each of the Gaussians will receive a random color and the samples which have the greatest probability of belonging to any of the Gaussians will be colored accordingly.
Left: a random dataset containing three Gaussian clusters. Right: the clusters as identified by the Gaussian Mixture Model.
Left: a random dataset containing 10 Gaussian clusters. Right: the clusters as identified by the Gaussian Mixture Model.
Conclusion
In this article we found how Gaussian Mixture Models can be successfully used to create soft clustering boundaries around data. Those soft boundaries are possible because in a mixture model each sample is said to belong to a cluster only within certain probability.
A main drawback of GMMs is that the number of Gaussian mixture components, as in KMeans, is assumed known as prior, so it cannot be considered as totally unsupervised clustering method. To aid in this situation, one could use additional algorithms such as the Minimum Message Length (MML) criteria.
Another problem arises with the correct initialization of the mixture components. A poor choice of initial parameters will invariably lead to poor results, as the EM algorithm converges only to a local optimization point. A commonly used solution is initialization by randomly sampling in the mixture data. Other common solution, covered by the implementation presented here, is to use KMeans to perform initialization. However, KMeans itself may also converge to a poor solution and impact negatively in the mixture model fitting.
References
 Raja, Y., Shaogang, G. Gaussian Mixture Models. Department of Computer Science, Queen Mary and Westfield College, England.
 Bishop, C. M. (2007), Pattern Recognition and Machine Learning (Information Science and Statistics), Springer.
 Wikipedia contributors, “Normal distribution,” Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/w/index.php?title=Normal_distribution (accessed October 18, 2010).
 Wikipedia contributors, “Probability density function,” Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/w/index.php?title=Probability_density_function (accessed October 18, 2010).
 Wikipedia contributors, “Mixture density,” Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/w/index.php?title=Mixture_density (accessed October 18, 2010).
See also
 KMeans Clustering Algorithm
 Principal Component Analysis (PCA)
 Kernel Principal Component Analysis (KPCA)
 Linear Discriminant Analysis (LDA)
 NonLinear Discriminant Analysis with Kernels (KDA)
 Kernel Support Vector Machines (SVM)
 Handwriting Recognition Revisited: Kernel Support Vector Machines
 Logistic Regression Analysis in C#