Kernel Discriminant Analysis in C#


Kernel (Fisher) discriminant analysis (kernel FDA) is a non-linear generalization of linear discriminant analysis (LDA) using techniques of kernel methods. Using a kernel, the originally linear operations of LDA are done in a reproducing kernel Hilbert space with a non-linear mapping.

This code has also been incorporated into the Accord.NET Framework, which includes the latest version of this code along with many other statistics and machine learning tools.

Analysis Overview

KDA is an extension of LDA to non-linear distributions, just as KPCA is to PCA. For information about LDA, please check the previous post, Linear Discriminant Analysis in C#. The algorithm presented here is a multi-class generalization of the original algorithm by Mika et al. in Fisher discriminant analysis with kernels (1999).

The objective of KDA is to find a transformation maximizing the between-class variance and minimizing the within-class variance. It can be shown that, with Kernels, the original objective function can be expressed as:

J(\alpha) = \frac{\alpha^T S_B \alpha}{\alpha^T S_W \alpha}

with:

S_B = \sum_{c=1}^C (\mu_c - \bar{x}) (\mu_c - \bar{x})^T

S_W = \sum_{c=1}^C K_c (I - 1_{l_c}) K_c^T

where K_c is the kernel matrix for class c, μ_c is the column means vector of K_c, x̄ is the column means vector of the full kernel matrix, I is the identity matrix, l_c is the number of samples in class c, and 1_{l_c} is an l_c × l_c matrix with all entries equal to 1/l_c.

The Kernel Trick

The Kernel trick transforms any algorithm that solely depends on the dot product between two vectors. Wherever a dot product is used, it is replaced with the kernel function.

K(x,y) = <φ(x),φ(y)>

Thus, a linear algorithm can easily be transformed into a non-linear one: the algorithm is effectively carried out in a higher-dimensional space without ever explicitly mapping the input points into that space. For more information, and a cool video showing how the kernel trick works, please see the introductory text in the previous post, Kernel Principal Component Analysis in C#.
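As a quick illustration, here is a minimal Gaussian (RBF) kernel written as a plain C# function. The class name, method name and sigma parameter are just illustrative choices, not part of any particular framework API.

using System;

static class Kernels
{
    // Gaussian (RBF) kernel: k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)).
    public static double Gaussian(double[] x, double[] y, double sigma)
    {
        double squaredNorm = 0.0;
        for (int i = 0; i < x.Length; i++)
        {
            double d = x[i] - y[i];
            squaredNorm += d * d;
        }
        return Math.Exp(-squaredNorm / (2.0 * sigma * sigma));
    }
}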

Adding regularization

As the problem proposed above is ill-posed (we are estimating l-dimensional covariance structures from l samples), the S_W matrix may become singular, and computing the objective function may become problematic. To avoid singularities, we can add a multiple of the identity matrix to S_W:

S_W := S_W + \lambda I

The λ term adds numerical stability, since for large λ the matrix S_W becomes positive definite. It can also be seen as a regularization term, favoring solutions with small expansion coefficients [Mika et al., 1999].
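In code, this amounts to adding λ to the diagonal of the within-class scatter matrix. A minimal sketch, assuming Sw has already been computed as an l × l array and lambda is a small user-chosen constant:

// Regularization: Sw := Sw + lambda * I  (Sw and lambda are assumed to exist).
for (int i = 0; i < Sw.GetLength(0); i++)
    Sw[i, i] += lambda;   // e.g. lambda = 1e-4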

 

Source Code

The code below implements a generalization of the original Kernel Discriminant Analysis algorithm by Mika et al. in Fisher discriminant analysis with kernels (1999).
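In outline, the computation builds the kernel scatter matrices defined above and then solves a (regularized) generalized eigenvalue problem for the expansion coefficients α. The sketch below shows one way the scatter matrices could be assembled in plain C#; all names are illustrative, the eigendecomposition itself is left to a linear algebra library, and this is not the framework's actual implementation.

using System;
using System.Linq;

// Builds Sb and Sw (in kernel space) for l samples x[0..l-1] with labels
// y[0..l-1] in {0, ..., classes-1}, using any kernel function k(x, z).
static (double[,] Sb, double[,] Sw) KernelScatterMatrices(
    double[][] x, int[] y, int classes, Func<double[], double[], double> kernel)
{
    int l = x.Length;

    // Full kernel matrix K[i, j] = k(x_i, x_j) and its overall column mean.
    var K = new double[l, l];
    var mean = new double[l];
    for (int i = 0; i < l; i++)
        for (int j = 0; j < l; j++)
        {
            K[i, j] = kernel(x[i], x[j]);
            mean[i] += K[i, j] / l;
        }

    var Sb = new double[l, l];
    var Sw = new double[l, l];

    for (int c = 0; c < classes; c++)
    {
        int[] idx = Enumerable.Range(0, l).Where(j => y[j] == c).ToArray();
        int lc = idx.Length;

        // Kc: the columns of K belonging to class c, and its column means mu_c.
        var Kc = new double[l, lc];
        var mu = new double[l];
        for (int i = 0; i < l; i++)
            for (int j = 0; j < lc; j++)
            {
                Kc[i, j] = K[i, idx[j]];
                mu[i] += Kc[i, j] / lc;
            }

        // Sb += (mu_c - mean)(mu_c - mean)^T
        for (int i = 0; i < l; i++)
            for (int j = 0; j < l; j++)
                Sb[i, j] += (mu[i] - mean[i]) * (mu[j] - mean[j]);

        // Sw += Kc (I - 1_lc) Kc^T, using Kc (I - 1_lc) Kc^T = Kc Kc^T - lc * mu_c mu_c^T.
        for (int i = 0; i < l; i++)
            for (int j = 0; j < l; j++)
            {
                double dot = 0.0;
                for (int t = 0; t < lc; t++)
                    dot += Kc[i, t] * Kc[j, t];
                Sw[i, j] += dot - lc * mu[i] * mu[j];
            }
    }

    // The discriminant vectors alpha are the leading generalized eigenvectors
    // of (Sb, Sw + lambda * I); compute them with a linear algebra library.
    return (Sb, Sw);
}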

After completing the analysis, we may wish to project new data into discriminant space. The following code projects a data matrix into this space using the basis found during the analysis.
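In essence, each new point y is compared against every training point through the kernel, and the resulting kernel values are combined with the expansion coefficients α found by the analysis: z_d = Σ_j α_{j,d} k(x_j, y). A minimal sketch with illustrative names:

using System;

// Projects each row of "data" onto the first "dimensions" discriminant vectors.
// "training" holds the original samples; "alpha" is the l-by-d matrix whose
// columns are the expansion coefficients found by the analysis.
static double[,] Project(
    double[][] data, double[][] training, double[,] alpha, int dimensions,
    Func<double[], double[], double> kernel)
{
    int n = data.Length, l = training.Length;
    var result = new double[n, dimensions];

    for (int i = 0; i < n; i++)
        for (int j = 0; j < l; j++)
        {
            double k = kernel(training[j], data[i]);      // k(x_j, y_i)
            for (int d = 0; d < dimensions; d++)
                result[i, d] += alpha[j, d] * k;          // z_id = sum_j alpha_jd * k(x_j, y_i)
        }

    return result;
}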

Sample Application

The sample application shows how Kernel Discriminant Analysis works. The application can load Excel worksheets containing tables in the same format as the included sample worksheet.

[Images: screenshots of the Kernel Discriminant Analysis (KDA) sample application]

The first image shows the Wikipedia example for Kernel Principal Component Analysis. The second image shows the analysis carried out with a Gaussian kernel with sigma = 3.6. The third image shows the kernel between-class and within-class equivalent matrices. The last image shows the subset spanned by each class and its kernel scatter matrix.

[Images: KDA projection and feature-space mapping]

The picture on the left shows the projection of the original data set onto the first two discriminant dimensions. Note that just one dimension would already be sufficient to linearly separate the classes. On the right, the picture shows how a point in input space (given by the mouse cursor) gets mapped into feature space.

Linear Discriminant Analysis equivalent as a special case

[Images: Linear Discriminant Analysis obtained as a special case of Kernel Discriminant Analysis]

We can also check that a projection equivalent to Linear Discriminant Analysis can be obtained by using a Linear kernel in the Kernel Discriminant Analysis.

See also

References

On Google Account Security, Trustfulness and Dependence

Today I received a very strange email in my personal gmail address from gmail itself:

O seu endereço secundário, “my-personal-address”@gmail.com, está associado a:

“xxxxxx”@gmail.com
“yyyyy”@gmail.com
“zzzzzzz”@gmail.com

Para efetuar login, clique no link abaixo.

http://www.google.com/accounts/
Se você clicar no link acima e ele não funcionar, copie e cole o URL em uma nova janela do navegador.Obrigado por usar o Google.

Em caso de dúvidas ou preocupações em relação à sua conta, visite as Perguntas freqüentes (FAQs) do Google no endereço
http://www.google.com/support/accounts/

Esse correio é apenas para envio de mensagens. As respostas para essa mensagem não serão monitoradas nem respondidas.

Which roughly translates to:

Your secondary email, “my-personal-address”@gmail.com, is associated with:

“xxxxxx”@gmail.com
“yyyyy”@gmail.com
“zzzzzzz”@gmail.com

To login, click on the link below.
http://www.google.com/accounts/

However, only the first address is known to me. I remember creating it to test a web project for a university class last year. But for the others, I have to ask: How did those people add MY personal email address as their secondary email?

 

I’m pretty sure my personal account is safe and has not been compromised. I switch (strong) passwords often and always take care about when and where I enter my credentials. Could this be an error within GMail itself? I always thought GMail required secondary email addresses to be confirmed before they could be linked with an account.

 

I’m getting worried because I just realized GMail accounts often aren’t just email accounts. They are much more. They form the online identity for a lot of people. Through Google services, they hold financial, health, social and content information. They are often the primary identity on sites such as Twitter, Facebook, LinkedIn and many others. Gosh, I know businesses which run on GMail addresses. For some people, losing a Google ID would mean starting life over.

I think Google has acquired a lot of responsibility these days. And I hope they can handle it well and wisely, as I (and almost everyone else I know) trust Google to handle a lot of their personal data.

 

On a final note, I strongly suggest that Google start researching ways to identify people other than just asking for passwords. I have to use 2048-bit cryptographic keys with unlocking pass-phrases to access my remote hosts; why should I keep using passwords, even strong ones traveling within SSL by default, to access my much more important personal data?

Linear Discriminant Analysis in C#


Linear discriminant analysis (LDA) is a method used in statistics and machine learning to find a linear combination of features which best characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier or, more commonly, for dimensionality reduction before later classification.

The code presented here is also part of the Accord.NET Framework. The Accord.NET Framework is a framework for developing machine learning, computer vision, computer audition, statistics and math applications. It is based on the already excellent AForge.NET Framework. Please see the starting guide for more details. The latest version of the framework includes the latest version of this code plus many other statistics and machine learning tools.

Motivations

The goals of LDA are somewhat similar to those of PCA. But unlike LDA, PCA is an unsupervised technique and as such does not use the label information of the data, effectively ignoring this often useful information. For instance, consider two cigar-like clusters in two dimensions, one cigar having y = c and the other y = –c (with c being an arbitrary constant), as the image below suggests:

[Figure: two parallel cigar-shaped clusters]

This example was adapted from the note on Fisher Linear Discriminant Analysis by Max Welling.

 

The cigars are positioned in parallel and very close together, such that the variance in the total data set, ignoring the labels, is in the direction of the cigars. For classification, this would be a terrible projection, because all labels would get mixed together and any useful information would be destroyed:

[Figures: PCA projections of the cigar data]

 

A much more useful projection is orthogonal to the cigars, which would perfectly separate the data-cases:

[Figures: LDA projections of the cigar data]

 

The first row of images was obtained by performing PCA on the original data. The left image is the PCA projection using the first two components. However, as the analysis says the first component accounts for 96% of the information, one can infer that all other components could be safely discarded. The result is the image on the right, which is clearly not a very useful projection.

The second row of images was obtained by performing LDA on the original data. The left image is the LDA projection using the first two dimensions. However, as the analysis says the first dimension accounts for 100% of the information, all other dimensions could be safely discarded. Well, this is actually true, as we can see from the result in the right image.

 

Analysis Overview

Linear Discriminant Analysis is closely related to principal component analysis (PCA) and factor analysis in that all three look for linear combinations of variables which best explain the data. LDA explicitly attempts to model the difference between the classes of data. PCA, on the other hand, does not take into account any difference in class, and factor analysis builds the feature combinations based on differences rather than similarities. Linear Discriminant Analysis considers maximizing the following objective:

J(w) = \frac{w^T S_B w}{w^T S_W w}

where

S_B = \sum_{c=1}^C (\mu_c - \bar{x}) (\mu_c - \bar{x})^T

S_W = \sum_{c=1}^C \sum_{i \in c} (x_i - \mu_c) (x_i - \mu_c)^T

are the between-class scatter matrix and the within-class scatter matrix, respectively. The optimal solution can be found by computing the eigenvalues of S_W⁻¹ S_B and taking the eigenvectors corresponding to the largest eigenvalues to form a new basis for the data. These can be conveniently obtained by computing the generalized eigenvalue decomposition of S_B and S_W.

 

Source Code

The source code below shows the main algorithm for Linear Discriminant Analysis.
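In outline, the algorithm computes the two scatter matrices above, takes their generalized eigenvalue decomposition, and keeps the leading eigenvectors as the new basis. The sketch below assembles the scatter matrices in plain C#; the names are illustrative, the eigendecomposition is left to a linear algebra library, and this is not the framework's actual listing.

using System;
using System.Linq;

// Builds Sb and Sw for l samples x[0..l-1] (n features each) with labels
// y[0..l-1] in {0, ..., classes-1}.
static (double[,] Sb, double[,] Sw) ScatterMatrices(double[][] x, int[] y, int classes)
{
    int l = x.Length, n = x[0].Length;

    // Overall mean vector.
    var mean = new double[n];
    foreach (var xi in x)
        for (int j = 0; j < n; j++)
            mean[j] += xi[j] / l;

    var Sb = new double[n, n];
    var Sw = new double[n, n];

    for (int c = 0; c < classes; c++)
    {
        double[][] xc = x.Where((_, i) => y[i] == c).ToArray();

        // Class mean mu_c.
        var mu = new double[n];
        foreach (var xi in xc)
            for (int j = 0; j < n; j++)
                mu[j] += xi[j] / xc.Length;

        // Sb += (mu_c - mean)(mu_c - mean)^T
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                Sb[i, j] += (mu[i] - mean[i]) * (mu[j] - mean[j]);

        // Sw += sum over class members of (x_i - mu_c)(x_i - mu_c)^T
        foreach (var xi in xc)
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    Sw[i, j] += (xi[i] - mu[i]) * (xi[j] - mu[j]);
    }

    // The discriminant directions are the leading generalized eigenvectors of
    // (Sb, Sw), i.e. the eigenvectors of inv(Sw) * Sb; use a linear algebra
    // library to compute the decomposition.
    return (Sb, Sw);
}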

 

 

Using The Code

Code usage is simple, as it follows the same object model as in the previous PCA example.
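A hypothetical usage example following that object model is shown below; the class and member names are assumptions made for illustration, not a documented API.

// Toy data: one sample per row, and the class label of each sample.
double[,] inputs =
{
    { 2.0, 2.1 }, { 2.5, 2.4 }, { 3.0, 3.1 },   // class 0
    { 7.0, 7.2 }, { 7.5, 7.9 }, { 8.0, 8.1 },   // class 1
};
int[] outputs = { 0, 0, 0, 1, 1, 1 };

var lda = new LinearDiscriminantAnalysis(inputs, outputs);   // assumed class name
lda.Compute();                                               // perform the analysis
double[,] projection = lda.Transform(inputs);                // project onto the discriminants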

 

Considerations

While linear discriminants are useful in many situations, they are not suitable in many practical cases. A nice insight is that LDA can be extended to non-linear classification via the kernel trick. Using kernels, the original observations can be effectively mapped into a higher-dimensional non-linear space. Linear classification in this non-linear space is then equivalent to non-linear classification in the original space.

While the code available here already contains code for Kernel Discriminant Analysis, this is something I’ll address in the next post. If you have any suggestions or questions, please leave me a comment.

Finally, as usual, I hope someone finds this information useful.

 

References:

A word on Neural Network learning

micro_tanks

Some time ago, while doing random research, I came across this funny but important note about neural networks, and thought it was worth sharing.

The choice of the dimensionality and domain of the input set is crucial to the success of any connectionist model. A common example of a poor choice of input set and test data is the Pentagon’s foray into the field of object recognition. This story is probably apocryphal and many different versions exist on-line, but the story describes a true difficulty with neural nets.

As the story goes, a network was set up with the input being the pixels in a picture, and the output was a single bit, yes or no, for the existence of an enemy tank hidden somewhere in the picture. When the training was complete, the network performed beautifully, but when applied to new data, it failed miserably. The problem was that in the test data, all of the pictures that had tanks in them were taken on cloudy days, and all of the pictures without tanks were taken on sunny days. The neural net was identifying the existence or non-existence of sunshine, not tanks.

– David Gerhard; Pitch Extraction and Fundamental Frequency: History and Current Techniques, Technical Report TR-CS 2003-06, 2003.

 

🙂

Kernel Principal Component Analysis in C#


Kernel principal component analysis (kernel PCA) is an extension of principal component analysis (PCA) using techniques of kernel methods. Using a kernel, the originally linear operations of PCA are done in a reproducing kernel Hilbert space with a non-linear mapping.

The code presented here is also part of the Accord.NET Framework. The Accord.NET Framework is a framework for developing machine learning, computer vision, computer audition, statistics and math applications. It is based on the already excellent AForge.NET Framework. Please see the starting guide for more details. The latest version of the framework includes the latest version of this code plus many other statistics and machine learning tools.

Analysis Overview

KPCA is an extension of PCA to non-linear distributions. Instead of directly doing a PCA, the original data points are mapped into a higher-dimensional (possibly infinite-dimensional) feature space defined by a (usually nonlinear) function φ through a mathematical process called the “kernel trick”.

The Kernel trick

The kernel trick is a very interesting technique, able to transform any algorithm that solely depends on the dot product between two vectors. To apply the trick, we replace all dot products in an algorithm with a kernel function. Thus, a linear algorithm can easily be transformed into a non-linear algorithm. This non-linear algorithm is equivalent to the linear algorithm operating in the range space of φ. However, because kernels are used, the φ function is never explicitly computed. This is desirable [8], as the high-dimensional space can have arbitrarily large dimensionality, or can even be infinite-dimensional (such as when the chosen kernel function is a Gaussian) and thus be infeasible to compute.

Interesting video by Udi Aharoni demonstrating how points that are not linearly separable in a 2D space can almost always be linearly separated in a higher dimensional (3D) space.

Principal Components in Kernel Space

As in PCA, the overall idea is to perform a transformation that will maximize the variance of the captured variables while minimizing the overall covariance between those variables. Using the kernel trick, the covariance matrix is replaced by the kernel matrix and the analysis is carried out analogously in feature space. An eigenvalue decomposition is performed and the eigenvectors are sorted in descending order of eigenvalue, so that those vectors form a basis in feature space whose first dimensions explain most of the variance in the data.

However, because the principal components live in feature space, we will not be directly performing dimensionality reduction. Suppose that the number of observations m exceeds the input dimensionality n. In linear PCA, we can find at most n nonzero eigenvalues. On the other hand, using kernel PCA we can find up to m nonzero eigenvalues because we will be operating on an m × m kernel matrix [2].

A more detailed discussion can be seen in the works of Mika et al. (Kernel PCA and De-Noising in Feature Spaces) and Scholkopf et al. (Nonlinear Component Analysis as a Kernel Eigenvalue Problem). A more accessible explanation of the topic is also available in the work from Ian Fasel.

The pre-image problem

In standard PCA it was possible to revert the transformation, retrieving the original data points and enabling the use of PCA for compression* and denoising. However, as the results provided by kernel PCA live in some high-dimensional feature space, they need not have pre-images in input space. In fact, even if a pre-image exists, it need not be unique.

Thus, in KPCA it is only possible to obtain an approximate pre-image for a given set of patterns projected onto the feature space. This problem can be (and has been) tackled by many approaches, some of an iterative nature, but closed-form solutions also exist. Typically these solutions exploit the fact that distances in feature space are related to distances in input space, and try to find a transformation that embeds the feature points back into input space while respecting those distance relationships.

One of the first solutions of this form was proposed by Kwok and Tsang in their paper “The Pre-Image Problem in Kernel Methods”. A better approach is also described by Bakir in his thesis “Extensions to Kernel Dependency Estimation”, along with a nice comprehensive description of various other methods for handling the pre-image problem.

* Note: the use of KPCA for compression is actually a bit limited, since one needs to store the original data set to compute transformations and their approximate reversion.

 

Code Overview

Below is the main code behind KPCA. This is a direct, naive implementation of the algorithm that may not scale very well for very large datasets. It is implemented on top of the previous standard PCA code and thus on top of the AForge.NET Framework.
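For reference, the core of the computation is to build the kernel matrix, center it in feature space, and take its eigendecomposition. The sketch below covers the first two steps in plain C#; the eigendecomposition is left to a linear algebra library, the names are illustrative, and this is not the framework's actual listing.

using System;

// Builds and centers the kernel matrix for m observations x[0..m-1]. The
// eigenvectors of the centered matrix give the principal components in
// feature space; compute them with a linear algebra library.
static double[,] CenteredKernelMatrix(double[][] x, Func<double[], double[], double> kernel)
{
    int m = x.Length;
    var K = new double[m, m];
    for (int i = 0; i < m; i++)
        for (int j = 0; j < m; j++)
            K[i, j] = kernel(x[i], x[j]);

    // Center in feature space: K' = K - 1m*K - K*1m + 1m*K*1m,
    // where 1m is the m-by-m matrix with all entries equal to 1/m.
    var rowMean = new double[m];
    double totalMean = 0.0;
    for (int i = 0; i < m; i++)
    {
        for (int j = 0; j < m; j++)
            rowMean[i] += K[i, j] / m;
        totalMean += rowMean[i] / m;
    }

    var centered = new double[m, m];
    for (int i = 0; i < m; i++)
        for (int j = 0; j < m; j++)
            centered[i, j] = K[i, j] - rowMean[i] - rowMean[j] + totalMean;

    return centered;
}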

 

Next is the main code behind the reversion of the projection. It is a direct, naive implementation of the closed-form solution proposed by Kwok and Tsang. It is currently not optimized at all, but it gives a better view of the algorithm in its essential form. The source code available for download may be better optimized in the future.

 

 

Using the Code

As with PCA, code usage is simple. First, choose a kernel implementing the IKernel interface, then pass it to the KernelPrincipalComponentAnalysis constructor together with the data points.

Then, simply call Compute() to perform the analysis. Once done, data about the analysis will be available through the object properties and transformations on additional data can be computed by calling Transform().
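A hypothetical usage example is shown below; IKernel, KernelPrincipalComponentAnalysis, Compute and Transform come from the description above, while the Gaussian kernel class name and the constructor argument order are assumptions.

// Toy data: one observation per row.
double[,] data =
{
    { 2.5, 2.4 }, { 0.5, 0.7 }, { 2.2, 2.9 },
    { 1.9, 2.2 }, { 3.1, 3.0 }, { 2.3, 2.7 },
};

IKernel kernel = new Gaussian(2.0);                            // assumed Gaussian kernel, sigma = 2.0
var kpca = new KernelPrincipalComponentAnalysis(data, kernel);

kpca.Compute();                                                // perform the analysis
double[,] projection = kpca.Transform(data);                   // map the data onto the kernel components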

 

Sample Application

To demonstrate the use of KPCA, I created a simple Windows Forms Application which performs simple statistical analysis and KPCA transformations. It can also perform projection reversion by solving the approximate pre-image problem.

Left: Scholkopf's KPCA toy example. Middle: Kernel principal components detected by the KPCA. Right: The first two dimensions obtained from the first two kernel principal components.

 

Left: Wikipedia's example on KPCA. Middle: Projection using a non-centered Gaussian kernel. Right: Reconstruction of the original data using all kernel principal components, showing some denoising ability.

Linear PCA as a special case

[Images: Lindsay Smith's PCA example data and its linear principal component analysis]

Additionally, we may check that linear PCA is indeed a special case of kernel PCA by using a linear kernel on the example given by Lindsay Smith in his A Tutorial On Principal Component Analysis.

 

See also

 

References

Optical (DVD) drive not recognized on Windows Vista (“the device driver could not be loaded”)

Some time ago my brother bought an Acer netbook. Since netbooks usually don't come with optical drives, he decided to buy an external LG DVD burner with a USB connection.

However, for some reason the drive was not listed in Windows Explorer and appeared with an exclamation mark in the Device Manager. Windows (Vista) could not load the correct driver for the drive and displayed the message “the device driver could not be loaded”.

Well, if you are having this same problem, all you need to do to solve it is delete an entry from the Windows registry. The path of the entry is:

HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\Class\{4D36E965-E325-11CE-BFC1-08002BE10318}\UpperFilters

 

The steps for the solution are:

  • Click Start, type “regedit” in the search box and press Enter.
  • In regedit, navigate to
    HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\Class\{4D36E965-E325-11CE-BFC1-08002BE10318}
  • Select the “UpperFilters” entry, right-click it and choose Delete.
  • Select the “LowerFilters” entry, right-click it and choose Delete.
  • Close regedit and restart the computer. Done!

 

I hope this is useful to someone else 🙂

Note: I think this has something to do with the installation of iTunes or other junk that was present at the time. In any case, problem solved!

Note 2: More details on Microsoft's website.

Discriminatory Power Analysis by Receiver-Operating Characteristic Curves (Part 2 of 2: C# Source Code)


Part 2 of 2: C# Source Code. Go to Part 1 of 2: Theory.

Happy new year everyone! As promised, here is the second part of the article Performing Discriminatory Power Analysis by Receiver-Operating Characteristic Curves. This second part shows how to create ROC curves in C#. The following C# code implements the creation, visualization and analysis of ROC curves. The sample application accompanying the source code demonstrates how to use them inside your own applications.

 

Source code

Download source code and sample applications.

Confusion matrix

The ConfusionMatrix class represents a 2×2 contingency table. All derived measures mentioned above, such as sensitivity and specificity, are available through its properties and methods.
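A hypothetical usage example is shown below; the constructor signature (cell counts in TP, FN, FP, TN order) and the exact property names are assumptions made for illustration.

// Build a 2x2 contingency table from the four cell counts (assumed order:
// true positives, false negatives, false positives, true negatives).
var matrix = new ConfusionMatrix(42, 8, 5, 45);

Console.WriteLine(matrix.Sensitivity);   // derived measures exposed as properties
Console.WriteLine(matrix.Specificity);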

 

Receiver Operating Characteristic

The ReceiverOperatingCharacteristic class represents a ROC curve. The collection of points that define the curve is available through the Points property. Because the points are ConfusionMatrix-derived objects, the collection can be bound directly to a DataGridView, enabling quick visual inspection of the curve.

Additional information, such as the total area under the curve (AUC) and its associated error, is available through the Area and Error properties, respectively.

 

Using the code

Code usage is very simple. To use the aforementioned classes and create a ROC curve, simply create a new ReceiverOperatingCharacteristic object passing the actual data, as measured by the experiment, and the test data, as given by the prediction model.

The actual data must be a dichotomous variable, with only two valid values. The “false” value is assumed to be the lowest value of the dichotomy, while the “true” value is assumed to be the highest. It is recommended to use 0 and 1 values for simplicity, although this isn’t mandatory.

The test data, as given by the prediction model, must have continuous values between the lowest and highest values for the actual data. For example, if the two valid values are 0 and 1, then its values must be inside the [0,1] range.

To compute the Curve using different cutoff values, call the Compute method passing the desired number of points or the desired increment in the cutoff value between points.
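Putting it together, a hypothetical usage example could look like the following; the class name, Compute, Area and Error come from the description above, while the constructor argument order and the data are illustrative.

// Actual outcomes (dichotomous) and the model's predicted scores in [0, 1].
double[] actual     = { 0, 0, 0, 0, 1, 1, 1, 1 };
double[] prediction = { 0.1, 0.3, 0.4, 0.6, 0.5, 0.7, 0.8, 0.9 };

var roc = new ReceiverOperatingCharacteristic(actual, prediction);
roc.Compute(100);                        // evaluate the curve at 100 cutoff points

Console.WriteLine(roc.Area);             // area under the curve (AUC)
Console.WriteLine(roc.Error);            // associated error of the AUC estimate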

 

Sample applications

Together with the source code, there is an accompanying sample application demonstrating the use of the ROC analysis. To open the example table, click File → Load, then select the Excel spreadsheet located inside the executable's folder. To plot the curve, select the number of points or the threshold increment and click Plot.

A “good” classifier.

 

An approximately “random” classifier.

 

A perfect, ideal classifier.

 

Curve points for the “good” classifier.

 

Further Reading

  • Receiver Operating Curves: An Introduction
    Excellent page about ROC curves and their applications. Includes excellent applets for experimenting with the curves, allowing for a better understanding of their workings and meaning.

  • BBC NEWS MAGAZINE, A scanner to detect terrorists; Very interesting article about how statistics are often misinterpreted when published by the media. “To find one terrorist in 3000 people, using a screen that works 90% of the time, you’ll end up detaining 300 people, one of whom might be your target”. Written by Michael Blastland.

 

References