K-Means clustering in C#


K-Means is one of the most popular, classic and simple approaches to clustering. Clustering is a method of unsupervised learning, and a common technique for statistical data analysis used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics [3].

The code presented here is also part of the Accord.NET Framework. The Accord.NET Framework is a C# framework for developing machine learning, computer vision, computer audition, statistics and math applications in .NET. It is based on the AForge.NET Framework. Please see the starting guide for more details. The latest version of the framework includes the latest version of this code plus many other statistics and machine learning tools.

Contents

  1. Introduction
    1. Lloyd’s K-Means algorithm
  2. Source code
  3. Using the code
  4. Sample application
  5. Conclusion
  6. Acknowledgements
  7. References
  8. See also

Introduction

In statistics and machine learning, k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean [4]. In its most common form, it is an iterative greedy algorithm which converges to a local optimum after a certain number of iterations.


Illustration of the K-Means algorithm.

The algorithm works by first selecting k locations at random to be the initial centroids for the clusters. Each observation is then assigned to the cluster with the nearest centroid, and the centroids are recalculated as the mean of the observations assigned to them. The algorithm repeats this process until the cluster centroids no longer change, or until the change is less than a given threshold.

There are other refinements and extensions of the algorithm. The version depicted above is its most common form, also referred to as Lloyd’s algorithm.

Lloyd’s K-Means algorithm

  1. Place K points into the space represented by the objects that are being clustered. These points represent initial group centroids.
  2. Assign each object to the group that has the closest centroid.
  3. When all objects have been assigned, recalculate the positions of the K centroids.
  4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
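The steps above can be sketched as a minimal, self-contained C# implementation. This is illustrative only: the method name KMeansCluster and its parameters are assumptions made for this sketch, and the actual Accord.NET implementation differs in its details.

```csharp
using System;
using System.Linq;

// Minimal, self-contained sketch of Lloyd's algorithm as listed above.
// Illustrative only: the name and parameters are assumptions for this
// sketch, not the Accord.NET implementation.
static int[] KMeansCluster(double[][] data, int k, double threshold = 1e-5, int seed = 0)
{
    var rand = new Random(seed);
    int dim = data[0].Length;

    // Step 1: pick k observations at random as the initial centroids.
    double[][] centroids = data.OrderBy(_ => rand.Next()).Take(k)
                               .Select(p => (double[])p.Clone()).ToArray();
    int[] labels = new int[data.Length];

    while (true)
    {
        // Step 2: assign each observation to the nearest centroid
        // (squared Euclidean distance).
        for (int i = 0; i < data.Length; i++)
        {
            double best = double.MaxValue;
            for (int c = 0; c < k; c++)
            {
                double d = 0;
                for (int j = 0; j < dim; j++)
                {
                    double diff = data[i][j] - centroids[c][j];
                    d += diff * diff;
                }
                if (d < best) { best = d; labels[i] = c; }
            }
        }

        // Step 3: recompute each centroid as the mean of its members.
        var sums = new double[k][];
        var counts = new int[k];
        for (int c = 0; c < k; c++) sums[c] = new double[dim];
        for (int i = 0; i < data.Length; i++)
        {
            counts[labels[i]]++;
            for (int j = 0; j < dim; j++) sums[labels[i]][j] += data[i][j];
        }

        // Step 4: stop once no centroid moves more than the threshold.
        double shift = 0;
        for (int c = 0; c < k; c++)
            for (int j = 0; j < dim; j++)
            {
                double mean = counts[c] > 0 ? sums[c][j] / counts[c] : centroids[c][j];
                shift = Math.Max(shift, Math.Abs(mean - centroids[c][j]));
                centroids[c][j] = mean;
            }
        if (shift < threshold) return labels;
    }
}
```

Note that initializing the centroids from randomly chosen observations (rather than arbitrary points in space) is a common practical choice; it guarantees every centroid starts inside the data's range.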

 

Source code

The source code has been implemented using Accord.NET and is now part of the framework. In the current version (2.1.2), the following classes related to K-Means are contained inside the Accord.MachineLearning namespace. The source code available in this page contains only the parts of the framework actually needed to support the algorithm.

Class diagram for the K-Means algorithm.

The KMeans class is the main class representing the K-Means algorithm. The algorithm itself is implemented in the Compute(double[][] data, double threshold) method, which accepts a set of observations and a convergence threshold to determine when the method should stop.
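The threshold-based stopping rule can be illustrated with a small helper. This is a sketch of the idea only, not the framework’s actual code:

```csharp
using System;

// Illustrative sketch of the threshold-based stopping rule described
// above (not the framework's actual code): iteration stops once no
// centroid coordinate has moved more than the given threshold.
static bool Converged(double[][] oldCentroids, double[][] newCentroids, double threshold)
{
    for (int i = 0; i < oldCentroids.Length; i++)
        for (int j = 0; j < oldCentroids[i].Length; j++)
            if (Math.Abs(newCentroids[i][j] - oldCentroids[i][j]) > threshold)
                return false;
    return true;
}
```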

 

The implementation is quite straightforward and does not use additional techniques to avoid convergence problems. More refined techniques may be added to the implementation in the future, so please make sure to download the latest version of Accord.NET Framework for the most up-to-date revision of the code.

Using the code

To use the code, create a new instance of the KMeans class, passing the desired number of clusters to its constructor. Additionally, you may also pass a distance function to be used as a distance metric during clustering. The default is to use the squared Euclidean distance.
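A typical call might then look like the sketch below, following the constructor and the Compute(double[][], double) signature described in this post for version 2.1.2; the observation values are placeholders.

```csharp
// Sketch of typical usage, following the API described in this post
// (Accord.NET 2.1.2); the observation values are placeholders.
double[][] observations =
{
    new double[] { -5.0, -2.0 },
    new double[] { -5.5, -1.5 },
    new double[] {  2.0,  1.0 },
    new double[] {  1.5,  1.5 },
};

// Create the algorithm with the desired number of clusters. The default
// metric is the squared Euclidean distance; a custom one, such as
// Distance.Manhattan, may be passed to the constructor instead.
KMeans kmeans = new KMeans(2);

// Run the clustering, stopping once centroid changes fall below the
// convergence threshold; one cluster label is returned per observation.
int[] labels = kmeans.Compute(observations, 0.001);
```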

 

Sample application

The k-means clustering algorithm is commonly used in computer vision as a form of image segmentation. The results of the segmentation are often used to aid border detection and object recognition. The sample application performs image segmentation using the standard squared Euclidean distance over RGB pixel color space. There are, however, better distance metrics to be used for image segmentation, such as weighted distances and other color spaces, which will not be addressed in this example.


Original image (from Ossi Petruska Flickr page*).

To perform image segmentation, we will first translate our image into an array of pixel values. The image will be read, pixel by pixel, into a jagged array in which each element is a double array of length 3. Each element of those arrays will contain one of the three RGB values, scaled to the interval [–1,1].

After we perform clustering on this array of pixel values, each pixel will have an associated cluster label. Each pixel value will then be replaced by its corresponding cluster centroid. The source code below is called when one clicks the Run button in the application.
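The translation step described above can be sketched with a small helper; the names Scale and ToObservation are assumptions made for this sketch, not the sample application’s actual code.

```csharp
using System;

// Illustrative sketch of the pixel translation step described above
// (the helper names are assumptions, not the sample application's code):
// each RGB channel in [0, 255] is mapped linearly to [-1, 1].
static double Scale(byte channel) => channel / 127.5 - 1.0;

// One observation per pixel: a double array of length 3.
static double[] ToObservation(byte r, byte g, byte b)
    => new double[] { Scale(r), Scale(g), Scale(b) };
```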

After segmentation, the following resulting images can be obtained:


Same image after K-Means clustering with k = 5.


Image after K-Means clustering with k = 10.

 

* The sample image used above has been licensed by Ossi Petruska under a Creative Commons Attribution-NonCommercial-ShareAlike 2.0 Generic license.

Conclusion

K-Means is a very simple and popular approach to clustering. The implementation presented here is the same implementation used in Accord.NET. As can be seen, the method can be easily extended with custom distance functions through delegates or lambda expressions, and can be used in different contexts, such as image segmentation, without further modifications. As a suggestion for improvement, the method can be further sped up by using the triangle inequality, as suggested in the paper "Using the triangle inequality to accelerate k-means" by Charles Elkan.

In the next article, we will see how we can use K-Means to initialize the Expectation-Maximization algorithm for estimating Gaussian Mixture densities in Gaussian Mixture Models. Those articles will form the basis for Continuous density Hidden Markov Models.

Acknowledgements

To Antonino Porcino, for providing the first version of the code and for valuable information about many other methods and algorithms.

References

See also

17 Comments

  1. Hey great site! Your tutorial helped me so much in using Kmeans. Thank you very much!
    I have a question though: every time I run the kmeans algorithm with the same cluster number, I get different results. I guess it is due to the random initialization. I tried to change the Kmeans Compute function through inheritance, but couldn’t. I was wondering if you have any suggestions… Thank you so much in advance!

  2. Hi!

    First of all, thanks for the positive feedback. It is really appreciated. And yes, you are right about the random initializations.

    The random initialization is part of the K-Means algorithm, but if you really wish to disable it, there is a quick way to accomplish it.

    All you may have to do is to call:

    Accord.Math.Tools.SetRandomSeed(0);

    right before any calls to the kmeans.Compute() method.

    This will effectively disable the framework’s internal random number generator by resetting it to a common fixed seed before any initialization, so the same values will always be used. But please note this method only has an effect in Debug mode. If you wish to add it for Release mode as well, please remove the #define safeguards and the conditional flag from this method in the Accord.Math/Tools.cs file. Those will be removed in the next version of the framework.

    I hope it helps!

    Best regards,
    César

  3. I can not thank you enough for your reply! I did what you said and works perfectly right now, thank you soo much man, you are the best!
    God bless you:)
    Thanks again -also thank you so much for the libraries, it was really helpful

    Best,

    Arda

  4. Hi César,

    It’s a really good post; however, I am encountering the same problem, and although I call the Accord.Math.Tools.SetRandomSeed function, it seems that it does not set the random seed, as the random numbers generated after this command are always random, not identical in each run.

    To be clear, I am not talking about Kmeans clustering. I am talking about generating the same random numbers; once I get that, Kmeans will work perfectly.

    Am I forgetting something?
    I am using VS 2010 to develop my c# application. It uses .NET Framework 3.5

    I would appreciate any help…

    Thanks

    Ahmet

  5. Hi Gerard,

    You could try using the latest version of the Accord.NET Framework available for download here. I’ve made SetRandomSeed available from any kind of build (not debug builds only) since version 2.1.5 of the framework.

    By the way, please remember that, to achieve the desired effect, SetRandomSeed(0) should be called immediately before any calls to the kmeans.Compute() method. This will ensure the same set of numbers are generated during the execution of Compute().

    Since this seems to be a desired feature, I guess I will add custom initializations in K-Means as an enhancement request in the issue tracker for Accord.NET.

    If you need further help, please don’t hesitate to ask again.

    Best regards,
    Cesar

  6. Hi again Cesar, thank you for your help, but it seems it’s not working…

    Accord.Math.Tools.SetRandomSeed(0);
    Console.WriteLine(Accord.Math.Tools.Random.NextDouble());

    Shouldn’t the code above return the same numbers every time I run it? I am using the latest version, but it still doesn’t seem to work… I guess the problem is caused by the SetRandomSeed method.

    I guess I have no other option to code it myself..

    Thank you very much

    Gerard

  7. Hi Gerard,

    Are you sure you are using the latest version of the framework? I forgot to tell you that I renamed this method before turning it officially public in the official Accord.NET Framework build. The method has been renamed to SetupGenerator(int seed).

    If you have tried to call SetRandomSeed and it did compile, perhaps you are not referencing the correct version of the assembly.

    Best regards,
    César

  8. It works!!! Thank you soo much César for all your help, I guess I was not using correct version- thank you very much again. Great software and it’s a dream that it’s free!

    Cheers,

    Gerard

  9. Hi, César.

    I’ve been messing around with your k-means class and I tried to use it to cluster taking into account both the spatial distance and color.

    I made an image with white background and four black polygons randomly disposed in it. Then I instantiated the class with the following parameters:

    KMeans kmeans = new KMeans(4, Distance.SquareEuclidean);

    But I couldn’t make it generate 4 clusters. No matter what I do, it only generates 2 clusters: one for the white pixels and another for the black ones.

    What do I have to do to make your algorithm to create one cluster for each polygon, instead of taking into account only the color of the pixel?

    Thank you in advance.

  10. Hi Hugo,

    In this case, what you need is not color clustering, but spatial clustering. If you will always have a black and white image, a very naive way to apply K-Means to this problem is to retrieve all 2D coordinates of the black pixels in your image and use this as your input data.

    In pseudo-code, try something like this:

    – Create a new List<double[]> l to accommodate the two dimensional coordinates (x,y).
    – Iterate your image and add the current point if its color is black; otherwise just ignore it.
    – Convert this list into a double[][] array by calling l.ToArray()
    – Use this as your input data and attempt clustering.

    PS: You can also try to use AForge.NET’s CollectActivePixels method of the UnmanagedImage class to replace the steps above.

    Now, to see your results:

    – Define a suitable color for each of your clusters;
    – Create an empty image with the same size as your original;
    – Iterate over each coordinate of the empty image;
    – Classify each (x,y) coordinate using k-means. Depending on the classification, paint the pixel with any color you have chosen for the cluster.

    Now please keep in mind that the approach I suggested is very naive and will only be able to detect circles around the clusters’ centroids since it uses only 2D coordinate information. Alternatively, you could try a similar approach using other metrics or more elaborate features as input.

    Best regards,
    César

  11. Hi César, I wrote my last message in English because your whole blog is in English, but since I’m Brazilian and so are you, I guess it won’t be a problem if I write in Portuguese.

    First of all, thank you for your quick and detailed reply. You helped me a lot.

    I followed your suggestions and almost everything came out as expected. However, the clusters did not end up corresponding to the polygons, as you can see below.

    Original image:
    http://i52.tinypic.com/jgpcvo.png

    Test 1:
    http://i52.tinypic.com/fda64k.png

    Test 2:
    http://i52.tinypic.com/29uq3rq.png

    Test 3:
    http://i52.tinypic.com/fda64k.png

    Test 4:
    http://i53.tinypic.com/20pbqs7.png

    Test 5 (this one got it right):
    http://i52.tinypic.com/2pr8apy.png

    So it turned out quite random; it seems the initial cluster positions (which are indeed random) are directly influencing the final result.

    Do you think this is the correct result and this is just how k-means works in this case, or am I doing something wrong?

    I applied k-means as follows:

    KMeans kmeans = new KMeans(4, Distance.Manhattan);
    // I also tried other distance measures

    int[] idx = kmeans.Compute(BlackPixels, 15);
    // I also tried other threshold values

    BlackPixels.ApplyInPlace((x, y) => kmeans.Clusters.Centroids[idx[y]]);

    And to paint the new image, I iterated over the pixels of the image, and if a pixel was black I painted it according to:

    kmeans.Nearest(pixel)

    If it returned 0, 1, 2 or 3, I would paint it red, green, blue or yellow, respectively.

    Do you have any idea what I could do, or what I might be doing wrong?

    In any case, thank you very much; you have done an incredible job on this framework. Congratulations.

    Hugo

  12. Hey César.

    Thanks for the help, man. I followed your advice and passed a very low threshold to Compute(), and the results really were better.

    Unfortunately, it seems that k-means is not the best technique for what I need; I will take a look at that link you sent me.

    Thank you very much for your attention, and congratulations on your work.

    See you.

    Hugo

  13. Hello,
    I am learning a lot; thanks for sharing these. Is it possible to modify KMeans to cluster strings?

  14. Hello César,

    Each time I use k-means with the same multivariate data to obtain a discrete observation sequence, it gives different observation sequences. I mean, the cluster labels change, and sometimes some data points belong to a different cluster when I run the code again with the same data.

    Do you know how i can make it fixed?

    regards, coder

  15. Hello!

    The easiest way to achieve fixed cluster labels would be to fix the random number generator with a seed, using

    Accord.Math.Tools.SetupGenerator(0);

    However, a much more proper way to do it would be to order the clusters by size! After you run the clustering algorithm, sort the clusters by their Proportion and store their positions in a new vector to act as a dictionary. Then you can use this new ordering to achieve more reproducible results.

    Best regards,
    Cesar
