# Kernel Principal Component Analysis in C# Kernel principal component analysis (kernel PCA) is an extension of principal component analysis (PCA) using techniques of kernel methods. Using a kernel, the originally linear operations of PCA are done in a reproducing kernel Hilbert space with a non-linear mapping.

The code presented here is also part of the Accord.NET Framework. The Accord.NET Framework is a framework for developing machine learning, computer vision, computer audition, statistics and math applications. It is based on the already excellent AForge.NET Framework. Please see the starting guide for mode details. The latest version of the framework includes the latest version of this code plus many other statistics and machine learning tools.

## Analysis Overview

KPCA is an extension of PCA to non-linear distributions. Instead of directly doing a PCA, the original data points are mapped into a higher-dimensional (possibly infinite-dimensional) feature space defined by a (usually nonlinear) function φ through a mathematical process called the “kernel trick”.

### The Kernel trick

The kernel trick is a very interesting technique, able to transform any algorithm that solely depends on the dot product between two vectors. To apply the trick, we replace all dot products in an algorithm with a kernel function. Thus, a linear algorithm can easily be transformed into a non-linear algorithm. This non-linear algorithm is equivalent to the linear algorithm operating in the range space of φ. However, because kernels are used, the φ function is never explicitly computed. This is desirable , as the high-dimensional space can have an arbitrary large dimensionality, or can be even infinite-dimensional (such as in the case the chosen kernel function is a Gaussian) and thus infeasible to compute.

Interesting video by Udi Aharoni demonstrating how points that are not linearly separable in a 2D space can almost always be linearly separated in a higher dimensional (3D) space.

### Principal Components in Kernel Space

Like in PCA, the overall idea is to perform a transformation that will maximize the variance of the captured variables while minimizing the overall covariance between those variables. Using the kernel trick, the covariance matrix is substituted by the Kernel matrix and the analysis is carried analogously in feature space. An Eigen value decomposition is performed and the eigenvectors are sorted in ascending order of Eigen values, so those vectors may form a basis in feature space that explain most of the variance in the data on its first dimensions.

However, because the principal components are in feature space, we will not be directly performing dimensionality reduction. Suppose that the number of observations m exceeds the input dimensionality n. In linear PCA, we can find at most n nonzero Eigen values. On the other hand, using Kernel PCA we can find up to m nonzero Eigen values because we will be operating on a m x m kernel matrix.

A more detailed discussion can be seen in the works of Mika et al. (Kernel PCA and De-Noising in Feature Spaces) and Scholkopf et al. (Nonlinear Component Analysis as a Kernel Eigenvalue Problem). A more accessible explanation of the topic is also available in the work from Ian Fasel.

### The pre-image problem

In standard PCA it was possible to revert the transformation, retrieving the original data points and enabling the use of PCA for compression* and denoising. However, as the results provided by kernel PCA live in some high dimensional feature space,  they need not have pre-images in input space. In fact, even if it exists, it also need be not unique.

Thus, in KPCA, it is only possible to obtain an approximate pre-image for a given set of patterns projected on the feature space. This problem can (and has) been tackled by many approaches, some of them of iterative nature, but some closed-form solutions also exists. Typically those solutions make use of the fact that distances in feature space are related to distances in input space. Thus, those solutions try to achieve an optimal transformation that can embed those feature points in input space respecting those distances relationships.

One of the first solutions of this form were proposed by Kwok and Tsang in their paper “The Pre-Image Problem in Kernel Methods". A better approach is also described by Bakir on his thesis “Extensions to Kernel Dependency Estimation”, alongside with a nice comprehensive description on various other methods for handling the pre-image problem.

Note: actually the use of KPCA for compression is a bit limited, since one needs to store the original data set to compute transformations and its approximate reversion.

## Code Overview

Below is the main code behind KPCA. This is a direct, naive implementation of the algorithm that may not scale very well for very large datasets. It is implemented on top of the previous standard PCA code and thus on top of the AForge.NET Framework.

The main code behind the reversion of the projection. It is a direct, naive implementation of the closed-form solution algorithm as proposed by Kwok and Tsang. Currently it is completely not optimized, but it gives an better overview of the algorithm on its essential form. The source code available to download may be better optimized in the future.

## Using the Code

As with PCA, code usage is simple. First, choose a kernel implementing the IKernel interface then pass it to the KernelPrincipalComponentAnalysis constructor together with the data points.

Then, simply call Compute() to perform the analysis. Once done, data about the analysis will be available through the object properties and transformations on additional data can be computed by calling Transform().

## Sample Application

To demonstrate the use of KPCA, I created a simple Windows Forms Application which performs simple statistical analysis and KPCA transformations. It can also perform projection reversion by solving the approximate pre-image problem.

On left: Scholkopf Toy Example. Middle: Kernel principal components detected by the KPCA. Right: The two first dimensions obtained by the two first Kernel principal components.

Left: Wikipedia example. Middle: Projection using a non-centered Gaussian kernel. Right: Reconstruction of the original data using all Kernel principal components, showing some denoising ability.

#### Linear PCA as a special case

Additionally we may check that linear PCA is indeed a special case of kernel PCA by using a linear kernel on the example by Lindsay Smith on his A Tutorial On Principal Component Analysis.

## References

1. Antonino Porcino says:

I’m playing around with the code… amazing stuff! I’m impressed. BTW, do you have some C# recipe for CCA (curvilinear component analysis)? I’m trying to code it directly from the “paper” but, alas, I’m stuck. Also I wonder how all they do compare, PCA, KCA, CCA (and also CDA).

2. César Souza says:

Hi Antonino,

Thanks for the feedback! Unfortunately, I don’t have any implementations of Curvilinear Component Analysis. Is it the same as Principal Curves?

This class of methods is surely in my todo list, but I can’t say for sure when I’ll have time to study and implement them. Anyway, I think they would be a nice addition for the Accord.NET Framework.

Best regards,
César

3. Antonino Porcino says:

don’t know much about Principal Curves (to tell the truth I don’t know much of almost anything in this field as I’m self taught).

CCA is a very simple technique for dimensional reduction; it does that by trying to maintain the euclidean distances from the original space so that “topology” is preserved to a certain extent.

CDA is a more complex variant that uses non euclidean distances but that is able to map highly non-linear manifold structures (spirals and so on).

Today I managed to make my CCA code work (as I’ve said it’s a very simple technique) but it’s not perfect yet. I’m going to compare it versus KCA.

If you want, I can send you the code and the paper stuff I’ve collected, so to have a look.

P.S. I’m going to check your work on SVMs… very interesting stuff!

4. César Souza says:

Hi Antonino,

If you could send your material, I would be very grateful! Please send me a message through the “contact” link on the top of this page, so we can talk by email.

Thanks!
César

5. Anonymous says:

Hi César!

I am unable to downolad your demo neither the source. Is there something wrong with the links?

thx,
Bart

6. Antonino Porcino says:

César can you please tell me if I have understood correctly…

In PCA, after the Compute() step:

– pca.Results[,] are the extracted principal components, arranged in decreasing order of importance

– pca.Components[i].EigenVector[] are the weights that express the i-th component as a linear combination of the source data[,] vectors

– pca.ComponentMatrix[,] are the weights that express the original data[,] vectors as a linear combination of principal components vectors

– if I want to “denoise” a specific input data series, all I have to do is to make a linear combination of pca.Results[,] * pca.ComponentMatrix[,], setting weights to zero if proportion is low (or something like that).

Is that correct?

7. César Souza says:

Hi Antonino,

– Actually, the pca.Results[,] are the projections of the source data into principal component space. It is already a “denoised” version of the source data, obtained through a projection using the entire base of eigenvectors. It is the “result” of this full projection.

– The pca.Components[i].EigenVector[] are indeed weights that express the formation of the i-th component as a linear combination of the source data[,] vectors. It is one of the vectors defining the basis for the orthogonal projection.

– Actually, this is the entire projection basis. It is the combination of all Components[i].EigenVectors. The only difference is that the Components[i].EigenVectors are easier to analyse in a per-component fashion.

– If you would like to apply the transformation to another data set, all you have to do is to call the Transform(double[,]) method passing the data and the number of components you wish to use in the projection. You can select the number of components by observing the amount of variance explained by each of them. If you compute the Transform() of the original data, you will obtain the same projection as the one stored in pca.Result[,]. The pca.Result[,] is only a shortcut, as the Result[,] becomes readily available during the Compute() evaluation and thus doesn’t need to be computed again.

Well, I hope this made things more clear!

Best regards,
César

8. Antonino Porcino says:

Ok, thanks much for the explanations, I have a better understanding now.

Anyway for my problem pca.Transform() isn’t suitable I think… let me explain:

Suppose I have 3 somewhat-correlated data series, d1[], d2[] and d3[]. I put them into columns of data[,] matrix and do pca.Compute().

Now suppose I am interested only in d3 and want to “denoise” it, that is, rewrite d3[] using only the most meaningful principal components. The problem is that PCA.Transform() needs a matrix[,] so I cant give d3[] as a parameter.

What I did was to rebuild back d3[] as weighted sum of pca.Result[t,i]*weights[i], where weights[i] is the corresponding row in pca.ComponentMatrix. I also had to sum pca.Means[i] because, I guess, means is removed during the computation of PCA.

It seems to work because the more components I add, the more d3[] gets similar to the original. And it is also working in the sense of “filtering”, by visual inspection.

So from my way of approaching the problem, d1 and d2 are seen as “hints” and d3 is the working data.

All this have sense or am I totally off-road?

9. César Souza says:

Hi Antonino,

I see. What you are describing is very similar to the projection reversion. Yesterday I completely forgot the fact that denoise is tipically achieved using the Reverse transform. Sorry for that. What you can try to do is to:

– Compute the principal components of your entire data (d1;d2;d3);

– Project all your data to principal component space by calling Transform([,]) and specifying a single principal component;

– Revert your data back by calling Revert([,]) and discarding d2 and d3.

However, I am not sure if you are talking about PCA or KPCA. In PCA this should work OK, but in KPCA, the reverse transform is not that straightforward and may not even exist. There is also a lot of room for improvement in the code of the reverse KPCA transform. Currently, it is a direct implementation of one of the possible algorithms for the reverse transform without optimizations.

Another option is to use Independent Component Analysis instead. It is specially suited to separate multiple data channels which have been mixed. Please take a look on the sample application I linked, perhaps you will find it interesting.

Best regards,
César

10. Antonino Porcino says:

Wow, didn’t even know there was an ICA implementation in the framework… that’s wonderful!

I’ve experimented with both PCA and ICA yesterday, just to see what is best suited for my problem (PCA seems to be more stable over time when I add new datapoints, while ICA is more “erratic”).

BTW, a small note: there is a PCA.Means[] but not a ICA.Means[] (it’s a private array) so it’s not possible to demix data and get back the exact original data.

11. César Souza says:

Thanks, Antonino. The Means[] property will be available in the next release. However, if you would like to get the exact original data back, you could try using the ica.Combine[,] method as well.

Best regards,
César

12. Guanta says:

Hi César,

Thank you for your implementation, it helped me a lot as I am trying right now to implement it, too,

I still have a few question:

1. I read the original article by Schoelkopf. He assumes that the observation data is centerd and only if not you could center the Kernel in Feature-space. So I guess, one of both is neccessary, not both (as you do), is that correct? Or perhaps I only need to center the training data and not the testing data?

2. In your Transform()-method your Kernel has the dimensions (rows X N).
You then center the Kernel. Both in the centerKernel()-function and later in the Transform function you assume a symmetric Kernel-matrix (thus N == rows), otherwise the index in K[i, k] would be out of Bounce.
But normally you use less training-data then test-data, don’t you?
Thus you would have to center the Testing-Kernel differently and change the ‘rows’ in the inner loop “for (int k = 0; k < rows; k++)” to the number of samples, so in your case ‘N’.
Are my considerations true?

Thank you again & best regards,
Guanta

13. César Souza says:

Hi Guanta,

Your considerations are all correct. The Transform function should be performing a different centralization. I had started writing a preliminary correction but didn’t finish it yet. The correct centering should take the previous kernel matrix in account. I forgot where I did find the references for the correct centering, but the code I had written some time ago, and which I didn’t yet finish testing, was:

private static double[,] centerKernel(double[,] newK, double[,] K)
{
int testSamples = newK.GetLength(0);
int N = K.GetLength(0);

for (int t = 0; t < testSamples; t++)
{
double[] k = newK.GetRow(t);

double[] onet = Matrix.Vector(N, 1.0 / N);
double[,] oneN = Matrix.Create(N, N, 1.0 / N);

double[] onetK = onet.Multiply(K);
double[] kone = k.Multiply(oneN);
double[] oneKone = onet.Multiply(K).Multiply(oneN);

for (int i = 0; i < k.Length; i++)
newK[t, i] = k[i] – onetK[i] – kone[i] + oneKone[i];
}

return newK;
}

I hope it helps.

Best regards,
César

14. Anonymous says:

Hi César,

I also got to work it by using a similar approach. I just wrote it in terms of matrix-multiplication which I found more readable (btw. you could also write the last 3 foor-loops in the transform()-function as a matrix-multiplication, i.e.: Result=CenteredKernel*ComponentMatrix).
I still wonder if I need to center the source and/or trainingsdata or if I don’t need it due to the centering of the Kernel. Anyway, the results look good. Thank you again for your fast reply!
Guanta

15. Guanta says:

Hi César,

Thank you for your response! I solved it similarly but in terms of matrix-multiplication. Btw you could also express your three for-loops in the transform()-function as a matrix multiplication (i.e. result = centerd_test_kernel * ComponentMatrix.transpose() ).

But I really wonder whether I need to center the source if I have centered the kernel already (or vice versa)? Do you know that perhaps? Or tested it?
In normal PCA you center the data with the mean and when you project the data you add the mean again. Do I need that here, too?

16. Guanta says:

Sry, with my last assumption I am wrong, you don’t add the mean again in normal pca. You just subtract the mean.
I guess, one of both centering with the mean or centering in feature-space is sufficiant.

17. kausar says:

Thank you so much for such a nice work i have used teh sample application for both PCA and KPCA. the PCA is working but KPCA is giving me error when i do the example “”scholkopf””. Please help me becuase am also trying my dataset and getting the same error.

Error: “column names must correspond to array columns. Parameter name: colnames.

18. César Souza says:

Hi Kausar,

Please try to download the latest version of the Accord.NET Framework. The framework comes with the latest enhancements and bug-fixes, so if this error is indeed a bug, it is very likely it has already been fixed on the most current release.

Best regards,
César

19. kausar says:

Thank you for the advice. i installed the latest version and then tried but now its giving an error

Error: The number of colors must be between 1 and 32.
Parameter name: number.

Please help me.i have find the PCA successfully of my dataset using your software. i have 38 features(column). i have to do feature reduction in order to give it to SVM classifier. Please do tell me about KPCA or either any other tool similar in your resource to do feature reduction n get better accuracy.

Thank again..

20. César Souza says:

Hi Kausar,

This is simply a limitation of the scatterplot chart. Are you sure you have more than 32 classes in your problem? It is very hard to produce more than 32 distinguishable colors in a chart, so I would suggest trying another approach for displaying it.

Regards,
César

21. Ali says:

Thanks for such a nice implementation. I have used it to compare my result with your implementation. I appreciate it. I need one more suggestions “How can I use this for a large dimensional data set?”. As I am doing my reserach on remote sensing and working on a image size of 145x145x220. Can you help me in this regards.

I have tried this impelmentation with a 40x40x220 data set but it fail to execute that. Please consider and suggest me on a quick basis.

Thanks

22. Rafi says:

hy cesar..
This project really nice and very helpful 😀

but, i have one question. when i do my researh with large image size, i got message this:
“An unhandled exception of type ‘System.OutOfMemoryException’ occurred in Accord.Math.dll”
what can i do for this? i have tried with data matrix 100×100 but it fail execute that.