K-Means is one of the most popular, classic and simple approaches to clustering. Clustering is a method of unsupervised learning, and a common technique for statistical data analysis used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics [3].
- Download source code and sample application.
The code presented here is also part of the Accord.NET Framework. The Accord.NET Framework is a C# framework for developing machine learning, computer vision, computer audition, statistics and math applications in .NET. It is based on the AForge.NET Framework. Please see the starting guide for more details. The latest version of the framework includes the latest version of this code plus many other statistics and machine learning tools.
Contents
- Introduction
- Source code
- Using the code
- Sample application
- Conclusion
- Acknowledgements
- References
- See also
Introduction
In statistics and machine learning, k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean [4]. In its most common form, the algorithm is an iterative greedy algorithm which converges to a local optimum after a certain number of iterations.
Illustration of the K-Means algorithm.
The algorithm works by first selecting k locations at random to be the initial centroids for the clusters. Each observation is then assigned to the cluster with the nearest centroid, and the centroids are recalculated as the mean of the observations assigned to them. The algorithm repeats this process until the cluster centroids no longer change, or until the change is less than a given threshold.
There are other refinements and extensions of the algorithm. The version depicted above is its most common form, also referred to as Lloyd’s algorithm.
Lloyd’s K-Means algorithm
- Place K points into the space represented by the objects that are being clustered. These points represent initial group centroids.
- Assign each object to the group that has the closest centroid.
- When all objects have been assigned, recalculate the positions of the K centroids.
- Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
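The four steps above can be sketched in a few dozen lines of plain C#. This is a self-contained illustration, not the Accord.NET implementation (the `LloydsKMeans` and `Cluster` names are invented for this sketch):

```csharp
using System;
using System.Linq;

// Minimal sketch of Lloyd's algorithm. Illustrative only;
// the Accord.NET implementation shown later is the real one.
static class LloydsKMeans
{
    // Returns a cluster label (0..k-1) for each observation.
    public static int[] Cluster(double[][] data, int k, double threshold = 1e-5, int seed = 0)
    {
        var rand = new Random(seed);

        // Step 1: pick k observations at random as initial centroids.
        double[][] centroids = data.OrderBy(_ => rand.Next())
                                   .Take(k)
                                   .Select(p => (double[])p.Clone())
                                   .ToArray();

        int[] labels = new int[data.Length];
        double shift;
        do
        {
            // Step 2: assign each observation to its nearest centroid.
            for (int i = 0; i < data.Length; i++)
                labels[i] = NearestCentroid(data[i], centroids);

            // Step 3: recompute each centroid as the mean of its members.
            shift = 0;
            for (int c = 0; c < k; c++)
            {
                double[][] members = data.Where((_, i) => labels[i] == c).ToArray();
                if (members.Length == 0) continue; // leave empty clusters in place

                for (int j = 0; j < centroids[c].Length; j++)
                {
                    double mean = members.Average(p => p[j]);
                    shift += Math.Abs(mean - centroids[c][j]);
                    centroids[c][j] = mean;
                }
            }
        }
        while (shift > threshold); // Step 4: stop when centroids settle

        return labels;
    }

    static int NearestCentroid(double[] point, double[][] centroids)
    {
        int best = 0;
        double bestDist = double.MaxValue;
        for (int c = 0; c < centroids.Length; c++)
        {
            // squared Euclidean distance to centroid c
            double d = point.Zip(centroids[c], (a, b) => (a - b) * (a - b)).Sum();
            if (d < bestDist) { bestDist = d; best = c; }
        }
        return best;
    }
}
```

Note that the result depends on the random initialization in Step 1, a point that comes up again in the comments below.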
Source code
The source code has been implemented using Accord.NET and is now part of the framework. In the current version (2.1.2), the following classes related to K-Means are contained inside the Accord.MachineLearning namespace. The source code available on this page contains only the parts of the framework actually needed to support the algorithm.
Class diagram for the K-Means algorithm.
The KMeans class is the main class representing the K-Means algorithm. The algorithm itself is implemented in the Compute(double[][] data, double threshold) method, which accepts a set of observations and a convergence threshold to determine when the method should stop.
```csharp
/// <summary>
///   Divides the input data into K clusters.
/// </summary>
public int[] Compute(double[][] data, double threshold)
{
    int k = this.K;
    int rows = data.Length;
    int cols = data[0].Length;

    // pick K unique random indexes in the range 0..n-1
    int[] idx = Accord.Statistics.Tools.Random(rows, k);

    // assign centroids from data set
    this.centroids = data.Submatrix(idx);

    // initial variables
    int[] count = new int[k];
    int[] labels = new int[rows];
    double[][] newCentroids;

    do // main loop
    {
        // Reset the centroids and the
        // cluster member counters
        newCentroids = new double[k][];
        for (int i = 0; i < k; i++)
        {
            newCentroids[i] = new double[cols];
            count[i] = 0;
        }

        // First we will accumulate the data points
        // into their nearest clusters, storing this
        // information into the newCentroids variable.

        // For each point in the data set,
        for (int i = 0; i < data.Length; i++)
        {
            // Get the point
            double[] point = data[i];

            // Compute the nearest cluster centroid
            int c = labels[i] = Nearest(data[i]);

            // Increase the cluster's sample counter
            count[c]++;

            // Accumulate in the corresponding centroid
            double[] centroid = newCentroids[c];
            for (int j = 0; j < centroid.Length; j++)
                centroid[j] += point[j];
        }

        // Next we will compute each cluster's new centroid
        // by dividing the accumulated sums by the number of
        // samples in each cluster, thus averaging its members.
        for (int i = 0; i < k; i++)
        {
            double[] mean = newCentroids[i];
            double clusterCount = count[i];

            for (int j = 0; j < cols; j++)
                mean[j] /= clusterCount;
        }

        // The algorithm stops when there is no further change in
        // the centroids (difference is less than the threshold).
        if (centroids.IsEqual(newCentroids, threshold))
            break;

        // go to next generation
        centroids = newCentroids;

    } while (true);

    // Compute cluster information (optional)
    for (int i = 0; i < k; i++)
    {
        // Extract the data for the current cluster
        double[][] sub = data.Submatrix(labels.Find(x => x == i));

        // Compute the current cluster variance
        covariances[i] = Statistics.Tools.Covariance(sub, centroids[i]);

        // Compute the proportion of samples in the cluster
        proportions[i] = (double)sub.Length / data.Length;
    }

    // Return the classification result
    return labels;
}
```
The implementation is quite straightforward and does not use additional techniques to avoid convergence problems. More refined techniques may be added to the implementation in the future, so please make sure to download the latest version of Accord.NET Framework for the most up-to-date revision of the code.
Using the code
To use the code, create a new instance of the KMeans class passing the desired number of clusters to its constructor. Additionally, you may also pass a distance function to be used as a distance metric during clustering. The default is to use the square Euclidean distance.
```csharp
// Declare some observations
double[][] observations =
{
    new double[] { -5, -2, -1 },
    new double[] { -5, -5, -6 },
    new double[] {  2,  1,  1 },
    new double[] {  1,  1,  2 },
    new double[] {  1,  2,  2 },
    new double[] {  3,  1,  2 },
    new double[] { 11,  5,  4 },
    new double[] { 15,  5,  6 },
    new double[] { 10,  5,  6 },
};

// Create a new K-Means algorithm with 3 clusters
KMeans kmeans = new KMeans(3);

// Compute the algorithm, retrieving an integer array
// containing the labels for each of the observations
int[] labels = kmeans.Compute(observations);

// As a result, the first two observations should belong to the
// same cluster (thus having the same label). The same should
// happen to the next four observations and to the last three.
```
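Any custom distance metric passed to the constructor has the same shape as the default: it takes two observation vectors and returns a non-negative number. As a self-contained illustration (the `Metrics` class is invented here; Accord.NET ships its own `Distance` class with these functions):

```csharp
using System;

static class Metrics
{
    // The default metric used by KMeans: squared Euclidean distance,
    // the sum of squared per-coordinate differences.
    public static double SquareEuclidean(double[] x, double[] y)
    {
        double sum = 0;
        for (int i = 0; i < x.Length; i++)
        {
            double d = x[i] - y[i];
            sum += d * d;
        }
        return sum;
    }

    // An alternative metric with the same delegate shape: Manhattan (L1).
    public static double Manhattan(double[] x, double[] y)
    {
        double sum = 0;
        for (int i = 0; i < x.Length; i++)
            sum += Math.Abs(x[i] - y[i]);
        return sum;
    }
}
```

Assuming the constructor accepts such a delegate, usage would look like `KMeans kmeans = new KMeans(3, Metrics.Manhattan);`.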
Sample application
The k-means clustering algorithm is commonly used in computer vision as a form of image segmentation. The results of the segmentation are often used to aid border detection and object recognition. The sample application performs image segmentation using the standard squared Euclidean distance over RGB pixel color space. There are, however, better distance metrics to be used for image segmentation, such as weighted distances and other color spaces, which will not be addressed in this example.
Original image (from Ossi Petruska’s Flickr page*).
To perform image segmentation, we will first translate our image into an array of pixel values. The image will be read, pixel by pixel, into a jagged array where each element is a double array of length 3. Each element of those arrays will contain one of the three RGB values, scaled to the interval [–1, 1].
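The scaling step can be sketched as follows. This is a hypothetical helper for illustration; in the sample application the conversion is done by the `ToDoubleArray` extension method:

```csharp
using System;

static class PixelScaling
{
    // Maps one RGB byte channel (0..255) linearly onto the interval [-1, 1].
    public static double Scale(byte channel)
    {
        return channel / 127.5 - 1.0;
    }

    // Converts one (r, g, b) pixel into the double[3] observation
    // that K-Means will cluster.
    public static double[] ToObservation(byte r, byte g, byte b)
    {
        return new[] { Scale(r), Scale(g), Scale(b) };
    }
}
```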
After we perform clustering on this array of pixel values, each pixel will have an associated cluster label. Each of these values will then be replaced by its corresponding cluster centroid. The source code below is called when one clicks the Run button in the application.
```csharp
private void btnRun_Click(object sender, EventArgs e)
{
    // Retrieve the number of clusters
    int k = (int)numClusters.Value;

    // Load original image
    Bitmap image = Properties.Resources.leaf;

    // Transform the image into an array of pixel values
    double[][] pixels = image.ToDoubleArray();

    // Create a K-Means algorithm using given k and a
    // square Euclidean distance as distance metric.
    KMeans kmeans = new KMeans(k, Distance.SquareEuclidean);

    // Compute the K-Means algorithm until the difference in
    // cluster centroids between two iterations is below 0.05
    int[] idx = kmeans.Compute(pixels, 0.05);

    // Replace every pixel with its corresponding centroid
    pixels.ApplyInPlace((x, i) => kmeans.Clusters.Centroids[idx[i]]);

    // Show resulting image in the picture box
    pictureBox.Image = pixels.ToBitmap(image.Width, image.Height);
}
```
After segmentation, the following resulting images can be obtained:
Same image after K-Means clustering with k = 5.
Image after K-Means clustering with k = 10.
* The sample image used above has been licensed by Ossi Petruska under a Creative Commons Attribution-NonCommercial-ShareAlike 2.0 Generic license.
Conclusion
K-Means is a very simple and popular approach to clustering. The implementation presented here is the same implementation used in Accord.NET. As can be seen, the method can be easily extended with custom distance functions through delegates or lambda expressions, and can be used in different contexts, such as image segmentation, without further modifications. As a suggestion for improvement, the method can be further sped up by using the triangle inequality, as suggested in the paper "Using the triangle inequality to accelerate k-means" by Charles Elkan.
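The core observation in Elkan's paper can be illustrated in a few lines: if the distance between two centroids is at least twice the distance from a point to the first centroid, the triangle inequality guarantees the second centroid cannot be closer, so its distance to the point never needs to be computed. A minimal sketch of that pruning test (illustrative names, not part of Accord.NET):

```csharp
using System;

static class ElkanBound
{
    // Euclidean distance between two points.
    public static double Distance(double[] a, double[] b)
    {
        double sum = 0;
        for (int i = 0; i < a.Length; i++)
        {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.Sqrt(sum);
    }

    // Triangle inequality: d(x, c2) >= d(c1, c2) - d(x, c1).
    // So if d(c1, c2) >= 2 * d(x, c1), then d(x, c2) >= d(x, c1)
    // and c2 can be skipped when searching for x's nearest centroid.
    public static bool CanSkip(double[] x, double[] c1, double[] c2)
    {
        return Distance(c1, c2) >= 2.0 * Distance(x, c1);
    }
}
```

Caching inter-centroid distances once per iteration and applying this test avoids most point-to-centroid computations in the assignment step.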
In the next article, we will see how we can use K-Means to initialize the Expectation-Maximization algorithm for estimating Gaussian Mixture densities in Gaussian Mixture Models. Those articles will form the basis for Continuous density Hidden Markov Models.
Acknowledgements
To Antonino Porcino, who provided the first version of the code and valuable information about many other methods and algorithms.
References
- [1] Matteo Matteucci. “Tutorial on Clustering Algorithms,” Politecnico di Milano, http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/kmeans.html (accessed October 4, 2010).
- [2] Teknomo, Kardi. “K-Means Clustering Tutorials,” http://people.revoledu.com/kardi/tutorial/kMean/ (accessed October 6, 2010).
- [3] Wikipedia contributors, "Cluster analysis," Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/wiki/Cluster_analysis (accessed October 4, 2010).
- [4] Wikipedia contributors, "K-means clustering," Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/wiki/K-means_clustering (accessed October 4, 2010).
See also
- Principal Component Analysis (PCA)
- Kernel Principal Component Analysis (KPCA)
- Linear Discriminant Analysis (LDA)
- Non-Linear Discriminant Analysis with Kernels (KDA)
- Kernel Support Vector Machines (SVM)
- Handwriting Recognition Revisited: Kernel Support Vector Machines
- Logistic Regression Analysis in C#
Hey great site! Your tutorial helped me so much in using Kmeans. Thank you very much!
I have a question though: every time I run the kmeans algorithm with the same cluster number, I get different results. I guess it is due to random initialization. I tried to change the Kmeans compute function through inheritance, but couldn’t. I was wondering if you have any suggestions… Thank you so much in advance!
Hi!
First of all, thanks for the positive feedback. It is really appreciated. And yes, you are right about the random initializations.
The random initialization is part of the K-Means algorithm, but if you really wish to disable it, there is a quick way to accomplish it.
All you may have to do is to call:
Accord.Math.Tools.SetRandomSeed(0);
right before any calls to the kmeans.Compute() method.
This will effectively disable the framework’s internal random number generator by resetting it to a common fixed seed before any initialization. Thus the same values will always be used. But please note this method only has effect in Debug mode. If you wish to add it for Release mode as well, please remove the #define safeguards and the conditional flag from this method in the Accord.Math/Tools.cs file. Those will be removed in the next version of the framework.
I hope it helps!
Best regards,
César
I can not thank you enough for your reply! I did what you said and works perfectly right now, thank you soo much man, you are the best!
God bless you:)
Thanks again -also thank you so much for the libraries, it was really helpful
Best,
Arda
Hi César,
It’s a really good post; however, I am encountering the same problem. Although I call the Accord.Math.Tools.SetRandomSeed function, it seems it does not set the random seed, as the random numbers generated after this call are still different on each run.
To be clear, I am not talking about Kmeans clustering itself. I am talking about generating the same random numbers; once I get that, Kmeans will work perfectly.
Am I forgetting something?
I am using VS 2010 to develop my c# application. It uses .NET Framework 3.5
I would appreciate any help…
Thanks
Ahmet
Hi Gerard,
You could try using the latest version of the Accord.NET Framework available for download here. I’ve made SetRandomSeed available from any kind of build (not debug builds only) since version 2.1.5 of the framework.
By the way, please remember that, to achieve the desired effect, SetRandomSeed(0) should be called immediately before any calls to the kmeans.Compute() method. This will ensure the same set of numbers are generated during the execution of Compute().
Since this seems to be a desired feature, I guess I will add custom initializations in K-Means as an enhancement request in the issue tracker for Accord.NET.
If you need further help, please don’t hesitate to ask again.
Best regards,
Cesar
Hi again Cesar, thank you for your help, but it seems it is not working…
Accord.Math.Tools.SetRandomSeed(0);
Console.WriteLine(Accord.Math.Tools.Random.NextDouble());
Shouldn’t the code above return the same numbers every time I run it? I am using the latest version, but it still doesn’t seem to work… I guess the problem is caused by the SetRandomSeed method.
I guess I have no other option but to code it myself…
Thank you very much
Gerard
Hi Gerard,
Are you sure you are using the latest version of the framework? I forgot to tell you that I renamed this method before turning it officially public in the official Accord.NET Framework build. The method has been renamed to SetupGenerator(int seed).
If you have tried to call SetRandomSeed and it did compile, perhaps you are not referencing the correct version of the assembly.
Best regards,
César
It works!!! Thank you soo much César for all your help, I guess I was not using correct version- thank you very much again. Great software and it’s a dream that it’s free!
Cheers,
Gerard
Hi, César.
I’ve been messing around with your k-means class and I tried to use it to cluster taking into account both the spatial distance and color.
I made an image with white background and four black polygons randomly disposed in it. Then I instantiated the class with the following parameters:
KMeans kmeans = new KMeans(4, Distance.SquareEuclidean);
But I couldn’t make it generate 4 clusters. No matter what I do, it only generates 2 clusters: one for the white pixels and another for the black ones.
What do I have to do to make your algorithm create one cluster for each polygon, instead of taking into account only the color of the pixel?
Thank you in advance.
Hi Hugo,
In this case, what you need is not color clustering, but spatial clustering. If you will always have a black and white image, a very naive way to apply K-Means to this problem is to retrieve all 2D coordinates of the black pixels in your image and use this as your input data.
In pseudo-code, try something like this:
– Create a new List<double[]> l to accommodate the two dimensional coordinates (x,y).
– Iterate your image and add the current point if its color is black; otherwise just ignore it.
– Convert this list into a double[][] array by calling l.ToArray()
– Use this as your input data and attempt clustering.
PS: You can also try to use AForge.NET’s CollectActivePixels method of the UnmanagedImage class to replace the steps above.
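The collection steps above can be sketched without any imaging library by treating the image as a boolean mask (true = black pixel). The `SpatialInput` class name is illustrative:

```csharp
using System.Collections.Generic;

static class SpatialInput
{
    // Gathers the (x, y) coordinates of every "black" pixel in the mask,
    // producing the double[][] observations that K-Means expects.
    public static double[][] CollectCoordinates(bool[,] isBlack)
    {
        var points = new List<double[]>();
        for (int y = 0; y < isBlack.GetLength(0); y++)
            for (int x = 0; x < isBlack.GetLength(1); x++)
                if (isBlack[y, x])
                    points.Add(new double[] { x, y });
        return points.ToArray();
    }
}
```

The resulting array can be handed directly to `kmeans.Compute`.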
Now, to see your results:
– Define a suitable color for each of your clusters;
– Create an empty image with the same size as your original;
– Iterate over each coordinate of the empty image;
– Classify each (x,y) coordinate using k-means. Depending on the classification, paint the pixel with any color you have chosen for the cluster.
Now please keep in mind that the approach I suggested is very naive and will only be able to detect circles around the clusters’ centroids since it uses only 2D coordinate information. Alternatively, you could try a similar approach using other metrics or more elaborate features as input.
Best regards,
César
Hi César, I wrote my last message in English because your whole blog is in English, but since I am Brazilian and so are you, I guess there is no problem if I just write in Portuguese.
First of all, thank you for your quick and detailed reply. It helped me a lot.
I followed your suggestions and almost everything went as expected. However, the clusters did not end up matching the polygons, as you can see below.
Original image:
http://i52.tinypic.com/jgpcvo.png
Test 1:
http://i52.tinypic.com/fda64k.png
Test 2:
http://i52.tinypic.com/29uq3rq.png
Test 3:
http://i52.tinypic.com/fda64k.png
Test 4:
http://i53.tinypic.com/20pbqs7.png
Test 5 (this one got it right):
http://i52.tinypic.com/2pr8apy.png
So it turned out quite random; it seems the initial cluster positions (which are indeed random) directly influence the final result.
Do you think this is the correct result, and this is just how k-means works in this case, or am I doing something wrong?
I applied k-means as follows:
KMeans kmeans = new KMeans(4, Distance.Manhattan);
// I also tried other distance measures
int[] idx = kmeans.Compute(BlackPixels, 15);
// I also tried other threshold values
BlackPixels.ApplyInPlace((x, y) => kmeans.Clusters.Centroids[idx[y]]);
And to paint the new image, I iterated over the pixels of the image, and if a pixel was black I painted it according to:
kmeans.Nearest(pixel)
If it was 0, 1, 2 or 3, I would paint it red, green, blue or yellow, respectively.
Do you have any idea what I could do, or what I am doing wrong?
In any case, thank you very much; you have done an incredible job with this framework. Congratulations.
Hugo
Hi Hugo!
In this case, try specifying a really small threshold in the Compute method, something like 0.0001.
Now, one observation: if what you are after is separating simple shapes, there may be methods better suited than K-Means. For example, you could try AForge.NET’s template and shape matching features.
I hope it helps!
César
Hey César.
Thanks for the help, man. I followed your advice and passed a really low threshold to Compute(), and the results were indeed better.
Unfortunately it seems k-means is not the best technique for what I need; I will take a look at the link you sent me.
Thank you very much for your attention, and congratulations on your work.
See you.
Hugo
Hello,
I am learning a lot; thanks for sharing these. Is it possible to modify KMeans to cluster strings?
Hello César,
Each time I use k-means with the same multivariate data to obtain a discrete observation sequence, it gives different observation sequences. I mean, the cluster labels change, and sometimes some data points belong to a different cluster when I run the code again on the same data.
Do you know how I can make it deterministic?
regards, coder
Hello!
The easiest way to achieve fixed cluster labels would be to fix the random number generator with a seed, using
Accord.Math.Tools.SetupGenerator(0);
However, a much more proper way to do it would be to order the clusters by size! After you run the clustering algorithm, sort the clusters by their Proportion and store their positions in a new vector to act as a dictionary. Then you can use this new ordering to achieve more reproducible results.
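The relabeling idea above can be sketched as follows. This is a self-contained illustration; `RelabelBySize` is an invented name, not an Accord.NET API:

```csharp
using System;
using System.Linq;

static class StableLabels
{
    // Remaps raw cluster labels so that label 0 is the largest cluster,
    // label 1 the second largest, and so on. This gives reproducible
    // labels even when random initialization permutes the clusters.
    public static int[] RelabelBySize(int[] labels, int k)
    {
        int[] counts = new int[k];
        foreach (int l in labels) counts[l]++;

        // Order the old labels by descending cluster size (ties by old label).
        int[] order = Enumerable.Range(0, k)
                                .OrderByDescending(c => counts[c])
                                .ThenBy(c => c)
                                .ToArray();

        // map[oldLabel] = new, size-ranked label.
        int[] map = new int[k];
        for (int rank = 0; rank < k; rank++)
            map[order[rank]] = rank;

        return labels.Select(l => map[l]).ToArray();
    }
}
```

In Accord.NET, the per-cluster sizes needed here are exactly what the clusters' Proportion values express.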
Best regards,
Cesar