K-means is one of the most popular, classic, and simplest approaches to clustering. Clustering is a method of unsupervised learning and a common technique for statistical data analysis used in many fields, including machine learning, data mining, pattern recognition, image analysis, and bioinformatics [3].
 Download source code and sample application.
The code presented here is also part of the Accord.NET Framework. The Accord.NET Framework is a C# framework for developing machine learning, computer vision, computer audition, statistics and math applications in .NET. It is based on the AForge.NET Framework. Please see the starting guide for more details. The latest version of the framework includes the latest version of this code plus many other statistics and machine learning tools.
Contents
 Introduction
 Source code
 Using the code
 Sample application
 Conclusion
 Acknowledgements
 References
 See also
Introduction
In statistics and machine learning, k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean [4]. In its most common form, it is an iterative greedy algorithm which converges to a local optimum after a certain number of iterations.
Illustration of the k-means algorithm.
The algorithm works by first selecting k locations at random to be the initial centroids for the clusters. Each observation is then assigned to the cluster with the nearest centroid, and the centroids are recalculated as the mean of the observations assigned to them. The algorithm repeats this process until the cluster centroids no longer change, or until the change is less than a given threshold.
There are other refinements and extensions of the algorithm. The version depicted above is its most common form, also referred to as Lloyd’s algorithm.
Lloyd’s k-means algorithm
 Place K points into the space represented by the objects that are being clustered. These points represent initial group centroids.
 Assign each object to the group that has the closest centroid.
 When all objects have been assigned, recalculate the positions of the K centroids.
 Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
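The steps above can be condensed into a minimal, self-contained sketch. This is illustrative only (one-dimensional data, naive nearest-centroid search, and made-up names), independent of the framework implementation shown later:

```csharp
using System;
using System.Linq;

static class LloydSketch
{
    // A minimal 1-D illustration of steps 1-4 above: 'centroids'
    // holds the initial guesses (step 1); the loop alternates
    // assignment (step 2) and re-estimation (step 3) until the
    // centroids stop moving (step 4).
    public static double[] Cluster(double[] points, double[] centroids, double tolerance)
    {
        while (true)
        {
            // Step 2: label each point with the index of its nearest centroid
            int[] labels = points
                .Select(p => Enumerable.Range(0, centroids.Length)
                    .OrderBy(i => Math.Abs(centroids[i] - p)).First())
                .ToArray();

            // Step 3: move each centroid to the mean of its members
            double[] updated = centroids.Select((c, i) =>
            {
                double[] members = points.Where((p, j) => labels[j] == i).ToArray();
                return members.Length > 0 ? members.Average() : c; // keep empty clusters in place
            }).ToArray();

            // Step 4: stop when no centroid moved more than 'tolerance'
            if (centroids.Zip(updated, (a, b) => Math.Abs(a - b)).Max() < tolerance)
                return updated;

            centroids = updated;
        }
    }
}
```

For instance, clustering the points {1, 2, 10, 11} starting from centroids {0, 5} converges to centroids at 1.5 and 10.5, the means of the two obvious groups.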
Source code
The source code has been implemented using Accord.NET and is now part of the framework. In the current version (2.1.2), the following classes related to k-means are contained in the Accord.MachineLearning namespace. The source code available on this page contains only the parts of the framework actually needed to support the algorithm.
Class diagram for the k-means algorithm.
The KMeans class is the main class implementing the algorithm. The algorithm itself is implemented in the Compute(double[][] data, double threshold) method, which accepts a set of observations and a convergence threshold used to determine when the method should stop.
```csharp
/// <summary>
///   Divides the input data into K clusters.
/// </summary>
public int[] Compute(double[][] data, double threshold)
{
    int k = this.K;
    int rows = data.Length;
    int cols = data[0].Length;

    // pick K unique random indexes in the range 0..n-1
    int[] idx = Accord.Statistics.Tools.Random(rows, k);

    // assign centroids from data set
    this.centroids = data.Submatrix(idx);

    // initial variables
    int[] count = new int[k];
    int[] labels = new int[rows];
    double[][] newCentroids;

    do // main loop
    {
        // Reset the centroids and the
        // cluster member counters
        newCentroids = new double[k][];
        for (int i = 0; i < k; i++)
        {
            newCentroids[i] = new double[cols];
            count[i] = 0;
        }

        // First we will accumulate the data points
        // into their nearest clusters, storing this
        // information into the newCentroids variable.

        // For each point in the data set,
        for (int i = 0; i < data.Length; i++)
        {
            // Get the point
            double[] point = data[i];

            // Compute the nearest cluster centroid
            int c = labels[i] = Nearest(data[i]);

            // Increase the cluster's sample counter
            count[c]++;

            // Accumulate in the corresponding centroid
            double[] centroid = newCentroids[c];
            for (int j = 0; j < centroid.Length; j++)
                centroid[j] += point[j];
        }

        // Next we will compute each cluster's new centroid
        // by dividing the accumulated sums by the number of
        // samples in each cluster, thus averaging its members.
        for (int i = 0; i < k; i++)
        {
            double[] mean = newCentroids[i];
            double clusterCount = count[i];

            for (int j = 0; j < cols; j++)
                mean[j] /= clusterCount;
        }

        // The algorithm stops when there is no further change in
        // the centroids (difference is less than the threshold).
        if (centroids.IsEqual(newCentroids, threshold))
            break;

        // go to next generation
        centroids = newCentroids;

    } while (true);

    // Compute cluster information (optional)
    for (int i = 0; i < k; i++)
    {
        // Extract the data for the current cluster
        double[][] sub = data.Submatrix(labels.Find(x => x == i));

        // Compute the current cluster variance
        covariances[i] = Statistics.Tools.Covariance(sub, centroids[i]);

        // Compute the proportion of samples in the cluster
        proportions[i] = (double)sub.Length / data.Length;
    }

    // Return the classification result
    return labels;
}
```
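The Nearest method called inside the loop above is not listed. What it computes is simply the index of the centroid closest to a given point under the chosen metric; a standalone sketch (the static form and helper names here are illustrative, not the framework's actual members) could look like this:

```csharp
using System;

static class NearestSketch
{
    // Returns the index of the centroid closest to 'point'
    // under the supplied distance function.
    public static int Nearest(double[] point, double[][] centroids,
        Func<double[], double[], double> distance)
    {
        int best = 0;
        double minimum = distance(point, centroids[0]);

        for (int i = 1; i < centroids.Length; i++)
        {
            double d = distance(point, centroids[i]);
            if (d < minimum)
            {
                minimum = d;
                best = i;
            }
        }

        return best;
    }

    // Squared Euclidean distance, the framework's default metric.
    public static double SquareEuclidean(double[] a, double[] b)
    {
        double sum = 0;
        for (int i = 0; i < a.Length; i++)
        {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }
}
```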
The implementation is quite straightforward and does not use additional techniques to avoid convergence problems. More refined techniques may be added to the implementation in the future, so please make sure to download the latest version of the Accord.NET Framework for the most up-to-date revision of the code.
Using the code
To use the code, create a new instance of the KMeans class, passing the desired number of clusters to its constructor. You may also pass a distance function to be used as the distance metric during clustering. The default is the squared Euclidean distance.
```csharp
// Declare some observations
double[][] observations =
{
    new double[] { -5, -2, -1 },
    new double[] { -5, -5, -6 },
    new double[] {  2,  1,  1 },
    new double[] {  1,  1,  2 },
    new double[] {  1,  2,  2 },
    new double[] {  3,  1,  2 },
    new double[] { 11,  5,  4 },
    new double[] { 15,  5,  6 },
    new double[] { 10,  5,  6 },
};

// Create a new KMeans algorithm with 3 clusters
KMeans kmeans = new KMeans(3);

// Compute the algorithm, retrieving an integer array
// containing the labels for each of the observations
int[] labels = kmeans.Compute(observations);

// As a result, the first two observations should belong to the
// same cluster (thus having the same label). The same should
// happen to the next four observations and to the last three.
```
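As an example of a custom metric, a Manhattan (city-block) distance can be expressed as a delegate and handed to the constructor in place of the default. The delegate below is only a sketch; the exact delegate type expected by the constructor may vary between framework versions:

```csharp
using System;

static class CustomDistance
{
    // A Manhattan (city-block) distance expressed as a delegate:
    // the sum of the absolute coordinate differences. It can be
    // supplied to KMeans in place of the default squared Euclidean
    // metric (constructor usage shown in the sample application).
    public static readonly Func<double[], double[], double> Manhattan = (x, y) =>
    {
        double sum = 0;
        for (int i = 0; i < x.Length; i++)
            sum += Math.Abs(x[i] - y[i]);
        return sum;
    };
}
```

Something like `new KMeans(3, CustomDistance.Manhattan)` would then cluster under this metric instead.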
Sample application
The k-means clustering algorithm is commonly used in computer vision as a form of image segmentation. The results of the segmentation are often used to aid border detection and object recognition. The sample application performs image segmentation using the standard squared Euclidean distance over the RGB pixel color space. There are, however, better distance metrics for image segmentation, such as weighted distances and other color spaces, which will not be addressed in this example.
Original image (from Ossi Petruska's Flickr page*).
To perform image segmentation, we will first translate our image into an array of pixel values. The image will be read, pixel by pixel, into a jagged array in which each element is a double array of length 3. Each element of those arrays will contain one of the three RGB values, scaled to the interval [-1, 1].
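The scaling just described maps each 8-bit channel from [0, 255] onto [-1, 1]. A sketch of the per-pixel conversion follows (the actual ToDoubleArray extension used in the sample may differ in details):

```csharp
static class PixelScaling
{
    // Maps an 8-bit channel value (0..255) into the interval [-1, 1].
    public static double Scale(byte channel)
    {
        return channel / 255.0 * 2.0 - 1.0;
    }

    // Converts one RGB pixel into the double[3] observation
    // consumed by the clustering step.
    public static double[] ToObservation(byte r, byte g, byte b)
    {
        return new[] { Scale(r), Scale(g), Scale(b) };
    }
}
```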
After we perform clustering on this array of pixel values, each pixel will have an associated cluster label. Each pixel value will then be replaced with its corresponding cluster centroid. The source code below is called when one clicks the Run button in the application.
```csharp
private void btnRun_Click(object sender, EventArgs e)
{
    // Retrieve the number of clusters
    int k = (int)numClusters.Value;

    // Load original image
    Bitmap image = Properties.Resources.leaf;

    // Transform the image into an array of pixel values
    double[][] pixels = image.ToDoubleArray();

    // Create a KMeans algorithm using given k and a
    // squared Euclidean distance as distance metric.
    KMeans kmeans = new KMeans(k, Distance.SquareEuclidean);

    // Compute the KMeans algorithm until the difference in
    // cluster centroids between two iterations is below 0.05
    int[] idx = kmeans.Compute(pixels, 0.05);

    // Replace every pixel with its corresponding centroid
    pixels.ApplyInPlace((x, i) => kmeans.Clusters.Centroids[idx[i]]);

    // Show resulting image in the picture box
    pictureBox.Image = pixels.ToBitmap(image.Width, image.Height);
}
```
After segmentation, the following resulting images can be obtained:
Same image after k-means clustering with k = 5.
Image after k-means clustering with k = 10.
* The sample image used above has been licensed by Ossi Petruska under a Creative Commons Attribution-NonCommercial-ShareAlike 2.0 Generic license.
Conclusion
K-means is a very simple and popular approach to clustering. The implementation presented here is the same implementation used in Accord.NET. As can be seen, the method can easily be extended with custom distance functions through delegates or lambda expressions, and can be used in different contexts, such as image segmentation, without further modification. As a suggestion for improvement, the method can be further sped up by using the triangle inequality, as suggested in the paper "Using the triangle inequality to accelerate k-means" by Charles Elkan.
In the next article, we will see how we can use k-means to initialize the Expectation-Maximization algorithm for estimating Gaussian mixture densities in Gaussian Mixture Models. Those articles will form the basis for continuous-density Hidden Markov Models.
Acknowledgements
To Antonino Porcino, who provided the first version of the code and valuable information about many other methods and algorithms.
References

[1] Matteo Matteucci. “Tutorial on Clustering Algorithms,” Politecnico di Milano, http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/kmeans.html (accessed October 4, 2010).

[2] Teknomo, Kardi. “K-Means Clustering Tutorials,” http://people.revoledu.com/kardi/tutorial/kMean/ (accessed October 6, 2010).

[3] Wikipedia contributors, "Cluster analysis," Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/wiki/Cluster_analysis (accessed October 4, 2010).

[4] Wikipedia contributors, "K-means clustering," Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/wiki/K-means_clustering (accessed October 4, 2010).
See also
 Principal Component Analysis (PCA)
 Kernel Principal Component Analysis (KPCA)
 Linear Discriminant Analysis (LDA)
 Non-Linear Discriminant Analysis with Kernels (KDA)
 Kernel Support Vector Machines (SVM)
 Handwriting Recognition Revisited: Kernel Support Vector Machines
 Logistic Regression Analysis in C#