Principal Component Analysis in C#

Principal Component Analysis (PCA) is an exploratory tool designed by Karl Pearson in 1901 to identify unknown trends in a multidimensional data set. It involves a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components.

Foreword

Before you read this article, please keep in mind that it was written before the Accord.NET Framework was created and became popular. As such, if you would like to do Principal Component Analysis in your projects, download the accord-net framework from NuGet and either follow the starting guide or download the PCA sample application from the sample gallery in order to get up and running quickly with the framework.

Introduction

PCA essentially rotates the set of points around their mean in order to align them with the principal components. This moves as much of the variance as possible (using a linear transformation) into the first few dimensions. The values in the remaining dimensions therefore tend to carry little variance and may be dropped with minimal loss of information. Please note that the signs of the columns of the rotation matrix are arbitrary, and so may differ between different programs for PCA.

For a more complete explanation of PCA, please see Lindsay Smith’s excellent Tutorial On Principal Component Analysis (2002).

Accord.NET Framework

This new library, which I called Accord.NET, was initially intended to extend the AForge.NET Framework through the addition of new features such as Principal Component Analysis, numerical decompositions, and a few other mathematical transformations and tools. However, the library I created grew larger than the original framework I was trying to extend. In a few months, both libraries will merge under Accord.NET. (Update April 2015)

Design decisions

As people who want to use PCA in their projects usually already have their own Matrix class definitions, I decided to avoid using custom Matrix and Vector classes in order to make the code more flexible. I also tried to avoid dependencies on other methods whenever possible, to keep the code self-contained. I think this also made the code simpler to understand.

The code is divided into two projects:

  • Accord.Math, which provides mathematical tools, decompositions and transformations, and
  • Accord.Statistics, which provides the statistical analysis, statistical tools and visualizations.

Both of them depend on the AForge.NET core. Also, their internal structure and organization try to mimic AForge’s wherever possible.

The given source code doesn’t include the full source of the Accord Framework, which remains as a test bed for new features I’d like to see in AForge.NET. Rather, it includes only limited portions of the code to support PCA. It also contains code for Kernel Principal Component Analysis, as both share the same framework. Please be sure to look for the correct project when testing.

Code overview

Below is the main code behind PCA.
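
As a rough, self-contained sketch of the steps involved, the following centers the data, builds the covariance matrix, finds the first principal component and projects the data onto it. Please note this is not the library's listing: the framework relies on a singular value decomposition, whereas this sketch swaps in plain power iteration to stay short. The class and method names (PcaSketch, Center, Covariance, FirstComponent, Project) are hypothetical and chosen only for illustration.

```csharp
using System;

public static class PcaSketch
{
    // Subtracts each column's mean so the data is centered on the origin.
    public static double[,] Center(double[,] data)
    {
        int rows = data.GetLength(0), cols = data.GetLength(1);
        var centered = new double[rows, cols];
        for (int j = 0; j < cols; j++)
        {
            double mean = 0;
            for (int i = 0; i < rows; i++) mean += data[i, j];
            mean /= rows;
            for (int i = 0; i < rows; i++) centered[i, j] = data[i, j] - mean;
        }
        return centered;
    }

    // Sample covariance matrix (divisor rows - 1) of already-centered data.
    public static double[,] Covariance(double[,] centered)
    {
        int rows = centered.GetLength(0), cols = centered.GetLength(1);
        var cov = new double[cols, cols];
        for (int a = 0; a < cols; a++)
            for (int b = 0; b < cols; b++)
            {
                double sum = 0;
                for (int i = 0; i < rows; i++) sum += centered[i, a] * centered[i, b];
                cov[a, b] = sum / (rows - 1);
            }
        return cov;
    }

    // Power iteration: approximates the eigenvector with the largest eigenvalue.
    // Caveat: it fails if the start vector is orthogonal to that eigenvector.
    public static double[] FirstComponent(double[,] cov, int iterations = 200)
    {
        int n = cov.GetLength(0);
        var v = new double[n];
        for (int i = 0; i < n; i++) v[i] = 1.0 / (i + 1); // arbitrary start vector
        for (int it = 0; it < iterations; it++)
        {
            var w = new double[n];
            double norm = 0;
            for (int i = 0; i < n; i++)
            {
                for (int j = 0; j < n; j++) w[i] += cov[i, j] * v[j];
                norm += w[i] * w[i];
            }
            norm = Math.Sqrt(norm);
            if (norm == 0) break; // cov * v vanished; v is returned as-is
            for (int i = 0; i < n; i++) v[i] = w[i] / norm;
        }
        return v;
    }

    // Projects every centered row onto the component (one score per row).
    public static double[] Project(double[,] centered, double[] component)
    {
        int rows = centered.GetLength(0), cols = centered.GetLength(1);
        var scores = new double[rows];
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                scores[i] += centered[i, j] * component[j];
        return scores;
    }
}
```

For data in which the second column is exactly twice the first, FirstComponent should return a unit vector whose second entry is twice its first, and the scores from Project then carry all of the variance in a single dimension. Further components could be obtained by deflating the covariance matrix and repeating, although a proper SVD or eigendecomposition is the more robust choice.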

Using the code

To perform a simple analysis, you can simply instantiate a new PrincipalComponentAnalysis object, passing your data, and call its Compute method to compute the model. Then you can call the Transform method to project the data into the principal component space.

Sample code demonstrating its usage is presented below.

Example application

To demonstrate the use of PCA, I created a simple Windows Forms Application which performs simple statistical analysis and PCA transformations.

[Screenshot: input data]

The application can open Excel workbooks. Here we are loading some random Gaussian data, some random Poisson data, and a scalar multiple of the first variable (thus also being Gaussian).
 
[Screenshot: univariate analysis]
Simple descriptive analysis of the source data, with a histogram plot of the first variable. We can see it fits a Gaussian distribution with 1.0 mean and 1.0 standard deviation.
 
[Screenshot: principal components]
Here we perform PCA using the Correlation method. Strictly speaking, the transformation applies SVD to the standardized data rather than an eigendecomposition of the correlation matrix, but the effect is the same. As the third variable is a scalar multiple of the first, the analysis detected it as redundant, so the last component has a zero importance level.
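
To make the importance level more concrete, here is a minimal sketch, assuming it is the usual proportion of variance explained by each component. Since the eigenvalues are proportional to the squared singular values, any common scale factor cancels out in the ratio. The names ExplainedVariance, Proportions and ComponentsFor are hypothetical and are not part of the framework.

```csharp
using System;

public static class ExplainedVariance
{
    // Proportion of the total variance carried by each component,
    // computed from the singular values (eigenvalue ~ singular value squared).
    public static double[] Proportions(double[] singularValues)
    {
        double total = 0;
        foreach (double s in singularValues) total += s * s;
        var proportions = new double[singularValues.Length];
        if (total == 0) return proportions; // no variance at all
        for (int i = 0; i < singularValues.Length; i++)
            proportions[i] = singularValues[i] * singularValues[i] / total;
        return proportions;
    }

    // Smallest number of leading components whose cumulative proportion
    // reaches the threshold (proportions assumed sorted in decreasing order).
    public static int ComponentsFor(double[] proportions, double threshold)
    {
        double cumulative = 0;
        for (int i = 0; i < proportions.Length; i++)
        {
            cumulative += proportions[i];
            if (cumulative >= threshold) return i + 1;
        }
        return proportions.Length;
    }
}
```

A redundant variable, such as the third one here, shows up as a singular value of zero and therefore as a proportion of zero, which is why it can be dropped without losing information.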
 
[Screenshot: projection]
Now we can make a projection of the source data using only the first two components.

Note: The principal components are not unique because the Singular Value Decomposition is not unique. Also the signs of the columns of the rotation matrix are arbitrary, and so may differ between different programs for PCA.
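
When comparing results against another program, one possible way around the sign ambiguity is to impose a convention on both outputs before comparing them. The sketch below flips each column of a component matrix so that its largest-magnitude entry is positive. This convention is only an assumption chosen for illustration (other programs may use a different one, or none), and the SignConvention class is hypothetical.

```csharp
using System;

public static class SignConvention
{
    // Flips each column so that its largest-magnitude entry is positive.
    // Returns a new matrix; the input is left untouched.
    public static double[,] Normalize(double[,] components)
    {
        int rows = components.GetLength(0), cols = components.GetLength(1);
        var result = (double[,])components.Clone();
        if (rows == 0) return result; // nothing to flip
        for (int j = 0; j < cols; j++)
        {
            // Find the row holding the entry with the largest absolute value.
            int pivot = 0;
            for (int i = 1; i < rows; i++)
                if (Math.Abs(result[i, j]) > Math.Abs(result[pivot, j])) pivot = i;

            // Negate the whole column if that entry is negative.
            if (result[pivot, j] < 0)
                for (int i = 0; i < rows; i++) result[i, j] = -result[i, j];
        }
        return result;
    }
}
```

Keep in mind that flipping the sign of a component also flips the sign of the corresponding projected values, so the same convention has to be applied before the data is projected.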

Together with the demo application comes an Excel spreadsheet containing several data examples. The first example is the same one used by Lindsay Smith in his Tutorial on Principal Component Analysis. The others include Gaussian data, uncorrelated data and linear combinations of Gaussian data to further exemplify the analysis.

I hope this code and example can be useful! If you have any comments about the code or the article, please let me know.

See also

A Tutorial On Principal Component Analysis with the Accord.NET Framework

This is a tutorial for those who are interested in learning how PCA works and how each step of Lindsay’s tutorial can be computed in the Accord.NET Framework, in C#.

Kernel Principal Component Analysis in C#

This is the non-linear extension of Principal Component Analysis. While linear PCA is restricted to rotating or scaling the data, kernel PCA can do arbitrary transformations (such as folding and twisting the data and the space that contains the data).

49 Comments

  1. your project is really very good but i have problems,when i open the program, it has difficulties in identifing the “statistics” and “samples” characterizing them as unavailable. Thus some properties in solution cannot be read.

    Thank you in advance

  2. I suspect this is because the PCA is being computed using the SVD.

    In SVD, singular values are equal to the square root of the eigenvalues. So the singular values are being squared (thus becoming large) to give the eigenvalues. Those eigenvalues, however, may not be the actual eigenvalues that would be obtained using an eigendecomposition, because the SVD implementation used automatically normalizes its singular values.

    As the eigenvalues are used only to compute the amount of variance explained by each component, the important thing to note is that their ratio is preserved.

    Cesar

  3. thanks alot for ur efforts, it’s really great
    i have a question, if i wanna get the principal components for an image
    i mean my target is to classify faces and non faces, so i need to get the pca to the face images
    how can i pass my data which is image in my case to the pca object?
    thanks in advance

  4. Hi,

    You can always transform your image into a single vector and then pass it to PCA as you would pass any other input vector.

    For example, if you have a 320×240 image, you can create a vector of 320*240=76800 positions and then copy the image pixel by pixel to this vector, in any order you wish, as long as you are consistent using the same ordering with all your images.

    In the page http://www.face-rec.org/algorithms/ you can find more information about face recognition algorithms, including ones using PCA.

    Best regards,
    César

  5. Really thanks
    great work, and it works with me well

    just another small question
    are there any restrictions about type or size of the images entered to pca.compute()?

    Thanks alot and sorry for too many questions.

  6. Well, I have not tested the memory limits of this implementation, but in theory it should work as long as you have enough memory to accommodate those matrices during computation.

    About the type of the image, it all depends on how you are going to create your input vector. But typically only gray-scale images are used.

    Best regards,
    César

  7. Dear César,
    Thank you very much for making this valuable code available to the public.
    I have a problem using your code:
    My data is too large: about 500,000 rows and 118 columns. When I give this matrix to the DescriptiveAnalysis(sourceMatrix, sourceColumns) method, the exception “System.OutOfMemoryException” is thrown.
    Is there any way that instead giving a matrix, I can pass a comma-seperated file containing the info to the methods?

  8. Hi Mohammed,

    Have you tried running the method on a 64-bit system? It may work there. Besides, the error happens in the DescriptiveAnalysis class, so it may also work if you comment out this portion of the code and use only the PrincipalComponentAnalysis classes.

    Best regards,
    César

  9. Hello! This code is really helpful but my problem is that I dont know how to use this. My project is about face recognition and base on my researches, PCA is paired with neural network on most of the face recognition systems. How will I connect your code with my neural network? What is really the output of PCA that will be the input of the neural network? Are eigenvectors values? I have read that the output of PCA are eigenvectors? I really cant understand PCA. I wish you could help me. Thank you in advance!

  10. Hi Ligemm,

    PCA can be seen as a linear transformation. Being a transformation, what it does is project your data into another space. The PCA output you are looking for is the projection of your original data into this space, which in the case of PCA, will be a space where your variables are (hopefully) uncorrelated.

    The eigenvectors found by the analysis will form the basis for this new space, and the eigenvalues can be used to measure the importance of each of the vectors. If you discard the less important eigenvectors before performing the projection, then you can also perform dimensionality reduction in the process.

    By the way, I have used PCA as a preprocessing step for ANNs too. If you wish, please take a look at the images in this poster; they might help you understand how PCA can be used in this scenario.

    Regards,
    César

  11. Dear César,

    I reduced the number of samples to 50, but still the number of features (118) seems to be too much for your code. The maximum number of features the code can handle seems to be 46 for my data, otherwise it takes forever to finish pca.Compute().
    Do you have any suggestions?

    Thanks.

  12. Hi Mohammad,

    Well, can I have a look at your data? If Compute is taking forever (and is not throwing any exceptions) then this may be a bug. If you could provide an excerpt of your data (perhaps the 50 samples with 118 columns you mentioned) it would be great!

    Best regards,
    César

  13. Hello, I think there is a small bug in the PCA implementation. When you use a matrix where the number of columns is higher than the number of rows, it does not work correctly (the Transform method returns all zeros).
    I think the solution is in the PCA.Compute method, where setting all parameters to “true” helps.

    SingularValueDecomposition svd = new SingularValueDecomposition(matrix, true, true, true);

  14. Hi Anonymous,

    Thanks, you are correct about that. The latest version of PCA in the development branch of Accord.NET does indeed use those parameters when creating the SingularValueDecomposition, but I forgot to update the code available here.

    Thanks again!
    César

  15. Yes, Cesar and Anonymous, I have noticed this for my data, as well (I have a matrix where the column nr is higher than the row nr and I get zeros from the transform method).

    I am using the latest Accord.Net framework (version 2.1.4) and get this problem… How do I update this myself?

    BTW – THANK YOU VERY MUCH for providing us with this great framework, Cesar!

    Greetings from Monika

  16. Hi Monika,

    I am going to release an updated version of the framework soon. If you wish, you can change the aforementioned line in the PrincipalComponentAnalysis.cs class to:

    SingularValueDecomposition svd = new SingularValueDecomposition(matrix, true, true, true);

    If you have installed the framework using the executable installer, the source code will be available in the installation folder. However, if you can wait a little, I will try to release a new version of the framework this week.

    Best regards,
    César

  17. Dear César,

    As I encountered the same problem with column count > row count, I wanted to ask, if the updated version of the framework is already online and maybe within the version 2.1.4?

    Thanks again for your efforts and regards,
    MSA

  18. Dear César,
    thank you very much, this code is really helpful
    i use PCA for my examination, but how can i add Jacobi iteration to this code to get the eigenvectors and eigenvalues?
    thanks…

  19. Dear César,

    I’m all of a sudden stuck with a problem concerning the dimensions of input data and pca / kda and I’d like to ask, if you have encountered similar behaviour.

    Is it only possible to set the principal components count maximum to rows count of input data? I always receive an exception “Index was outside the bounds of the array” during the first call of “.Transform(matrix,pcacomponents)” after “.Compute()” with Analysis method “Center”.

    Hope you can help me.

    Thanks in advance.
    MSA2000

  20. Hi MSA2000,

    In PCA this shouldn’t be a problem. I believe I had corrected this in recent versions of the Accord.NET Framework. However, for KPCA, the limit is indeed the number of rows in your data. KPCA works by performing PCA over the kernel matrix. The kernel matrix is square, with as many rows and columns as there are rows in your data, so I guess it is not possible to generate more components than rows.

    Best regards,
    César

  21. the SVD computation never ends with my data. For example if i pass data 51×400 it works, if i pass 52×400 stops working 😡 i did some debugging, and found out that the “p” variable in SingularValueDecomposition class is not decreasing its value in switch (kase) statement (only when kase==4), what is never reached 😡

  22. Hi there,

    Well, sort of. When we apply PCA to categorical data, we could indeed obtain a lower-dimensional representation of the data. Please see page 339 from the book “Principal Component Analysis”, by I.T. Jolliffe. This particular page is available in Google Books. As it can be seen, the author states that “For data in which all variables are binary, Gower (1966) points out that using PCA does provide a plausible low-dimensional representation.”

    However, a better approach would be to use Multiple Correspondence Analysis instead.

    Best regards,
    Cesar

  23. Hi César,

    I discovered a bug in the PCA adjust function. If any of the standard deviations are zero this will return NaN and propagate through SVD causing the process to fail:

    matrix[i, j] = (m[i, j] - columnMeans[j]) / columnStdDev[j];

    I’ve updated it to do an additional check now for 0 standard deviations and divide by epsilon if found.

    matrix[i, j] = (m[i, j] - columnMeans[j]) / Double.Epsilon;

    I’m also a little confused by the number of eigenvalue generated when using data with more dimensions than samples.

    I have a dataset consisting of 5 samples (rows) with 480 dimensions (columns). The SVD algorithm returns 480 eigenvectors, however, it only returns 6 eigenvalues. I was under the impression that there should be one eigenvalue for every eigenvector. I figured the number of Principal Components would correspond to the dimensions such that it would be possible to analyze the dimensions in their entirety. I’m still new to PCA and was wondering if you could explain.

    Thanks,

    Eric

  24. Hi Eric,

    Thanks for reporting the issue. This was fixed a while ago in the main Accord.NET sources. If a variable has zero standard deviation then it should be removed from the data set, since it will have no impact on the analysis.

    About your second question, I have written a tutorial about using PCA through SVD. However, it is still somewhat unfinished, so if you wish I can send you a partial version by email.

    Best regards,
    Cesar

  25. Thanks César,

    I’ll update to the latest version of Accord.NET. That would be great if you could email me your tutorial. I’ll send you an email so you’ll have my address.

    Cheers,

    Eric

  26. hi,

    this is ananth. i downloaded the sample from your website and could not able to run the code in visual studio 2010. its saying that the accord.statistics.dll and accord.statistics.controls.dll.?? what could be the solution

  27. When you do “using the code”, why is the sourceMatrix a double[,] in :

    PrincipalComponentAnalysis pca = new PrincipalComponentAnalysis(sourceMatrix,
    PrincipalComponentAnalysis.AnalysisMethod.Correlation);

    and here is double [][]:
    double[,] components = pca.Transform(sourceMatrix, 0.8f, true);

    ?

  28. Hi Miguel,

    This had to do with how the matrices were handled by the classifiers in the AForge.NET Framework (which I wanted to keep compatible). The internal matrix processing routines in Accord.NET (such as the matrix decompositions) can work faster on multidimensional matrices than on jagged ones. This is possible because they make heavy use of unsafe pointer operations.

    However, as the classifiers often expect data as double[][], I included .Transform methods for both matrix types. The latest version of Accord.NET should process both types of matrices.

    Hope it helps!

    Best regards,
    Cesar

  29. Hi Cesar, I’m working on image denoising Using PCA-LPG approach, LPG=> Local Pixel Grouping. The Problem is how to group the pixels that are output from your PCA. Or better still if u can provide an insight into how I can apply ur code to denoising. ur will really be appreciated thanks.

  30. Hi Mustapha,

    If you wish to perform denoising, one way is to

    1) Compute the analysis;
    2) Project the image into principal component space using some principal components (but not all of them);
    3) Revert the projection using the .Revert method.

    The reversion hopefully acts as a denoising step. A similar and more powerful result can also be obtained using Kernel Principal Component Analysis. The PCA/KPCA sample application includes a demo of the reversion procedure in the “Project Reversion” tab.

    Hope it helps!

    Best regards,
    Cesar

  31. Hi Cesar, yesterday i downloaded the framework and tried to implemt sample you have mentioned in the section “Using the code” above.

    double[,] components = pca.Transform(sourceMatrix, 0.8f, true);

    But there is no function named “Transform” which is taking 3 arguments.
    Is it changed in new version. If so how can we specify “percentage of information(0.8f)”.

    Thanks in advance.

  32. Hi there,

    Thanks for letting me know about the issue! Perhaps it got lost in some refactoring. If you wish to determine how many components you need in order to achieve a given percentage of information, you can use the GetNumberOfComponents method of the PrincipalComponentAnalysis class. Then you can use this number as the second argument of the Transform function.

    Hope it helps!

    Best regards,
    Cesar

  33. Hi Cesar,
    I downloaded the source code, but it shows me warnings about missing some files. Did I made a mistake or what else could happen? Thanks Sona

  34. For those missing files, in downloaded projects, please use Nuget. For example, search for “nuget zedgraph” will provide you with the latest ZedGraph package…

  35. Hello Cesar,

    I run the PCA and it works fine. However, I cannot find a way to save the principal component space in order to reuse it later.
    In fact I need to put my soft in production and I do not want to provide the complete set of data. I just want to project the new sample into the principal component space.
    Is there a way to do that ?

    Best regards
