Discriminatory Power Analysis by Receiver-Operating Characteristic Curves (Part 1 of 2: Theory)


Part 1 of 2: Theory. Go to Part 2 of 2: C# Source code.

When dealing with systems, methods or tests involving the detection, diagnosis or prediction of results, it is very important to validate the obtained results in order to quantify and qualify their discriminative power for a given analysis. However, we must consider that simply counting hits and misses on a test group does not necessarily reflect how good a system is, because this count will fundamentally depend on the quality and distribution of the test group's data.

 

To exemplify what has just been said, let's consider the example of a diabetes detection system whose outputs are 1 or 0, indicating whether a patient has the condition or not. Now, let's suppose we applied this system to a test group of 100 patients for whom we (but not the system) already know who has diabetes and who does not, and the system correctly identified 90% of the conditions. Seems a rather good performance, doesn't it?

Not exactly. What has not been said is how the condition is distributed over the test group. Let's reveal now that 90% of the patients actually had diabetes. The system could thus have answered "1" for any input given and still obtain a 90% hit rate, as the remaining 10% of patients were healthy. By now, we cannot be sure whether the system was really good or just acted by chance, saying everyone was ill without any prior knowledge or calculation.

 

Contingency Table (confusion matrix) for a binary classifier

For situations like this, other measures have been created to take such imbalance in the test groups into account. Before going further into those measures, let's discuss the so-called contingency table (or confusion matrix), which will act as a basis for the measures shown next. Its mechanics are rather simple: positive cases that the system predicted as positive are counted as true positives (hits), positive cases that the system predicted as negative as false negatives (misses), negative cases the system predicted as negative as true negatives (hits), and negative cases the system predicted as positive as false positives (misses).
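
To make the counting concrete, here is a minimal C# sketch that fills the four cells of the table from arrays of known labels and system predictions (the method and variable names are my own, for illustration only):

    // A minimal sketch: counting the four cells of the contingency table.
    // 'actual' and 'predicted' hold true (positive) or false (negative) per case.
    static void CountContingencyTable(bool[] actual, bool[] predicted,
        out int tp, out int fp, out int tn, out int fn)
    {
        tp = fp = tn = fn = 0;
        for (int i = 0; i < actual.Length; i++)
        {
            if (actual[i] && predicted[i]) tp++;        // true positive (hit)
            else if (actual[i] && !predicted[i]) fn++;  // false negative (miss)
            else if (!actual[i] && !predicted[i]) tn++; // true negative (hit)
            else fp++;                                  // false positive (miss)
        }
    }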

 

Now I’ll present some measures derived from this rather simple table.

Accuracy

    The proportion of correct predictions, without considering what is positive and what is negative. This measure is highly dependent on the data set distribution and can easily lead to wrong conclusions about the system's performance.

    ACC = TOTAL HITS / NUMBER OF ENTRIES IN THE SET

            = (TP + TN) / (P + N)

Sensitivity

    The proportion of true positives: the ability of the system to correctly predict the condition in cases where it is really present.

    SENS = POSITIVE HITS / TOTAL POSITIVES

             = TP / (TP + FN)

Specificity

    The proportion of true negatives: the ability of the system to correctly predict the absence of the condition in cases where it is not present.

    SPEC = NEGATIVE HITS / TOTAL NEGATIVES

             = TN / (TN + FP)

Efficiency

    The arithmetic mean of sensitivity and specificity. In practical situations, sensitivity and specificity vary in opposite directions. Generally, when a method is too responsive to positives, it tends to produce many false positives, and vice versa. Therefore, a perfect decision method (with 100% sensitivity and 100% specificity) is rarely achieved, and a balance between the two must be found.

    EFF = (SENS + SPEC) / 2

Positive Predictive Value

    The proportion of true positives among all positive predictions. This measure is highly susceptible to the prevalence of the condition in the data set, but gives an estimate of how good the system is when making a positive statement. It also can easily lead to wrong conclusions about system performance.

    PPV = POSITIVE HITS / TOTAL POSITIVE PREDICTIONS

           = TP / (TP + FP)

Negative Predictive Value

    The proportion of true negatives among all negative predictions. This measure is highly susceptible to the prevalence of the condition in the data set, but gives an estimate of how good the system is when making a negative statement. It can easily lead to wrong conclusions about system performance.

    NPV = NEGATIVE HITS / TOTAL NEGATIVE PREDICTIONS

           = TN / (TN + FN)

Matthews Correlation Coefficient – or Phi (φ) Coefficient

    The Matthews correlation coefficient is a measure of the quality of binary (two-class) classifications that can be used even if the classes have very different sizes. It returns a value between −1 and +1, in which a +1 coefficient represents a perfect prediction, 0 a random prediction, and −1 an inverse prediction. This statistic is equivalent to the phi coefficient, and attempts, like the efficiency measure, to summarize the quality of the contingency table in a single value which can be compared.

    MCC = φ = (TP*TN – FP*FN) / sqrt((TP + FP)*(TP + FN)*(TN + FP)*(TN + FN))

    Note that if any of the sums in the denominator equals zero, the denominator can be set to 1, resulting in an MCC of 0, which is also the correct limit for this situation.
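
For reference, all of the measures above can be computed directly from the four counts of the contingency table. The following C# sketch (helper names are my own) mirrors each formula, including the zero-denominator convention for the MCC:

    using System;

    static double Accuracy(int tp, int tn, int fp, int fn)
    {
        return (double)(tp + tn) / (tp + tn + fp + fn);
    }

    static double Sensitivity(int tp, int fn) { return (double)tp / (tp + fn); }
    static double Specificity(int tn, int fp) { return (double)tn / (tn + fp); }
    static double Efficiency(double sens, double spec) { return (sens + spec) / 2.0; }
    static double PositivePredictiveValue(int tp, int fp) { return (double)tp / (tp + fp); }
    static double NegativePredictiveValue(int tn, int fn) { return (double)tn / (tn + fn); }

    static double MatthewsCorrelation(int tp, int tn, int fp, int fn)
    {
        // If any sum in the denominator is zero, use 1 instead (MCC becomes 0).
        double d = Math.Sqrt((double)(tp + fp) * (tp + fn) * (tn + fp) * (tn + fn));
        return (tp * (double)tn - fp * (double)fn) / (d == 0 ? 1 : d);
    }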

 

The Receiver Operating Characteristic (ROC) Curve

The ROC curve was developed by electrical and radar system engineers during World War II to detect enemy objects on battlefields. ROC analysis has been used in medicine, radiology, psychology and other areas for many decades and, more recently, has been introduced to areas such as machine learning and data mining.

 

ROC curves for different classifiers

Because the outputs from classification systems are generally continuous, it is necessary to define a cutoff value, or discriminatory threshold, to classify and count the number of positive and negative predictions (such as positive or negative diagnoses in the case of pathology occurrence). As this threshold can be arbitrarily determined, the best practice for comparing the performance of different systems is to study the effect of selecting diverse cutoff values over the output data.

Considering many cutoff values, it is possible to calculate a set of (sensitivity, 1 − specificity) pairs which can then be plotted as a curve. This curve will be the ROC curve for the system, having the sensitivity values as its ordinates (y-axis) and the complements of specificity (1 − specificity) as its abscissas (x-axis).
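
As an illustration, the C# sketch below (my own, not the accompanying source code) computes such a set of pairs by sweeping evenly spaced cutoff values over the classifier's continuous outputs:

    using System.Collections.Generic;
    using System.Linq;

    // Sweep 'steps' evenly spaced cutoffs and collect one ROC point per cutoff.
    static List<double[]> RocPoints(double[] scores, bool[] actual, int steps)
    {
        var points = new List<double[]>();
        double min = scores.Min(), max = scores.Max();

        for (int k = 0; k <= steps; k++)
        {
            double cutoff = min + (max - min) * k / steps;
            int tp = 0, fn = 0, tn = 0, fp = 0;

            for (int i = 0; i < scores.Length; i++)
            {
                bool positive = scores[i] >= cutoff; // prediction at this cutoff
                if (actual[i] && positive) tp++;
                else if (actual[i]) fn++;
                else if (!positive) tn++;
                else fp++;
            }

            double sensitivity = (double)tp / (tp + fn);
            double specificity = (double)tn / (tn + fp);
            points.Add(new[] { 1.0 - specificity, sensitivity }); // (x, y)
        }
        return points;
    }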

 

 

A standard measure for system comparison is the area under the ROC curve (AUC), which can be obtained by numerical integration, for example with the trapezoidal rule. Theoretically, the higher the AUC, the better the system.

Calculating the area of a ROC curve in Microsoft Excel®

    Put the sensitivity and (1 − specificity) pairs in columns A and B, respectively, sorted so that the values in column B decrease down the column. If you have 10 points (from A1 to B10), you can use the following formula to calculate the ROC area:

    =SUMPRODUCT((A1:A9+A2:A10)*(B1:B9-B2:B10))*0.5
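
The same trapezoidal rule can also be written in a few lines of C#. This is a generic sketch, not the accompanying implementation; here the points are assumed sorted by increasing x:

    // Trapezoidal rule for the area under a ROC curve. The arrays hold the
    // curve points sorted by increasing x = (1 - specificity); y = sensitivity.
    static double AreaUnderCurve(double[] x, double[] y)
    {
        double area = 0.0;
        for (int i = 1; i < x.Length; i++)
            area += (y[i] + y[i - 1]) * (x[i] - x[i - 1]) * 0.5; // one trapezoid
        return area;
    }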

Determining the standard error when calculating the area

The standard error of a ROC curve is based on the standard deviation assumed when applying our system to a population sample rather than to the entire population. This measure comes from the fact that, depending on which samples of a population we take to perform the ROC analysis, the area under the curve will vary according to the particular sample distribution.

The error calculation is, to a certain extent, simple, as it comes from only three known values: the area A under the ROC curve, the number Na of samples which have the investigated condition (i.e. have diabetes) and the number Nn of samples which do not have the condition (i.e. do not have diabetes).

ERROR = sqrt( ( A*(1 – A) + (Na – 1)*(Q1 – A²) + (Nn – 1)*(Q2 – A²) ) / (Na * Nn) )

where:

    Q1 = A/(2 – A)
    Q2 = 2*A²/(1 + A)
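
In C#, the error formula above translates almost directly. A sketch, assuming the area and the two sample counts are already known (the names are my own):

    using System;

    // Standard error of the ROC area, following the formula above.
    // 'a' is the area; 'na'/'nn' count samples with/without the condition.
    static double RocAreaStandardError(double a, int na, int nn)
    {
        double q1 = a / (2.0 - a);
        double q2 = 2.0 * a * a / (1.0 + a);
        return Math.Sqrt((a * (1.0 - a)
            + (na - 1) * (q1 - a * a)
            + (nn - 1) * (q2 - a * a)) / ((double)na * nn));
    }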

 

Source code

For a C# code implementing ROC curve creation and analysis, please follow to the next part of this article, Discriminatory Power Analysis using Receiver-Operating Characteristic Curves (Part 2 of 2: C# Source Code).

 

Further Reading

  • Receiver Operating Curves: An Introduction
    Excellent page about ROC curves and their applications. Includes excellent applets for experimenting with the curves, allowing a better understanding of their workings and meaning.

  • BBC NEWS MAGAZINE, A scanner to detect terrorists; a very interesting article about how statistics are often wrongly interpreted when published by the media. “To find one terrorist in 3000 people, using a screen that works 90% of the time, you’ll end up detaining 300 people, one of whom might be your target”. Written by Michael Blastland.

 

References

Neural Network Learning by the Levenberg-Marquardt Algorithm with Bayesian Regularization (part 1)


A complete explanation for the totally lost, part 1 of 2.

Downloads

The code has been incorporated in Accord.NET Framework, which includes the latest version of this code plus many other statistics and machine learning tools.

Contents

  1. Overview
    1. Neural Networks
    2. AForge Framework
    3. Network training as a function optimization problem
  2. Levenberg-Marquardt
    1. Computing the Jacobian
    2. Approximating the Hessian
    3. Solving the Levenberg-Marquardt equation
    4. General Algorithm
    5. Limitations
  3. Bayesian Regularization
    1. Expanded Levenberg-Marquardt Algorithm
  4. Source Code
    1. Using the Code
    2. Sample Applications
  5. Additional Notes
  6. References

Overview

The problem of neural network learning can be seen as a function optimization problem, where we are trying to determine the best network parameters (weights and biases) in order to minimize the network error. That said, several function optimization techniques from numerical linear algebra can be directly applied to network learning, one of these techniques being the Levenberg-Marquardt algorithm.
The Levenberg–Marquardt algorithm provides a numerical solution to the problem of minimizing a (generally nonlinear) function, over a space of parameters for the function. It is a popular alternative to the Gauss-Newton method of finding the minimum of a function.

Neural Networks

Neural networks are a relatively new artificial intelligence technique. In most cases an ANN is an adaptive system that changes its structure based on external or internal information that flows through the network during the learning phase. The learning procedure tries to find a set of connections w that gives a mapping that fits the training set well.
Furthermore, neural networks can be viewed as highly nonlinear functions with the basic form:

y = F(x, w)     (1)

Where x is the input vector presented to the network, w are the weights of the network, and y is the corresponding output vector approximated or predicted by the network. The weight vector w is commonly ordered first by layer, then by neuron, and finally by the weights of each neuron plus its bias.
This view of the network as a parameterized function will be the basis for applying standard function optimization methods to solve the problem of neural network training.

AForge Framework

AForge.NET Framework is a C# framework designed for developers and researchers in the fields of Computer Vision and Artificial Intelligence. Here, the Levenberg-Marquardt learning algorithm is implemented as a class implementing the ISupervisedLearning interface from the AForge framework.

Network training as a function optimization problem

As mentioned previously, neural networks can be viewed as highly non-linear functions. From this perspective, the training problem can be considered as a general function optimization problem, with the adjustable parameters being the weights and biases of the network, and the Levenberg-Marquardt algorithm can be applied straightforwardly in this case.

Levenberg-Marquardt Algorithm

The Levenberg-Marquardt algorithm is a very simple, but robust, method for approximating a function. Basically, it consists of solving the equation:

(JtJ + λI)δ = JtE     (2)

Where J is the Jacobian matrix for the system, λ is Levenberg’s damping factor, δ is the weight update vector that we want to find, and E is the error vector containing the output errors for each input vector used in training the network. The vector δ tells us by how much we should change our network weights to achieve a (possibly) better solution. The JtJ matrix is also known as the approximated Hessian.
The λ damping factor is adjusted at each iteration, and guides the optimization process. If reduction of E is rapid, a smaller value can be used, bringing the algorithm closer to the Gauss–Newton algorithm, whereas if an iteration gives insufficient reduction in the residual, λ can be increased, giving a step closer to the gradient descent direction.

Computing the Jacobian

The Jacobian is a matrix of all first-order partial derivatives of a vector-valued function. In the neural network case, it is an N-by-W matrix, where N is the number of entries in our training set and W is the total number of parameters (weights + biases) of our network. It can be created by taking the partial derivatives of each output with respect to each weight, so that each entry has the form:

J(i, j) = ∂F(xi, w) / ∂wj

Where F(xi, w) is the network function evaluated for the i-th input vector of the training set using the weight vector w, and wj is the j-th element of the weight vector w of the network.
In traditional Levenberg-Marquardt implementations, the Jacobian is approximated by using finite differences. However, for neural networks, it can be computed very efficiently by using the chain rule of calculus and the first derivatives of the activation functions.
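
For illustration, a finite-differences version could look like the C# sketch below. The network evaluation delegate and the step size are my own assumptions; this is not the optimized chain-rule implementation used in the accompanying code:

    using System;

    // Approximate the N-by-W Jacobian by central finite differences.
    // 'network' evaluates the network output for one input and a weight vector.
    static double[,] JacobianByFiniteDifferences(
        Func<double[], double[], double> network,
        double[][] inputs, double[] weights, double step)
    {
        int n = inputs.Length;          // N: entries in the training set
        int w = weights.Length;         // W: weights + biases
        double[,] jacobian = new double[n, w];

        for (int j = 0; j < w; j++)
        {
            double saved = weights[j];

            weights[j] = saved + step;
            double[] plus = new double[n];
            for (int i = 0; i < n; i++)
                plus[i] = network(inputs[i], weights);

            weights[j] = saved - step;
            for (int i = 0; i < n; i++)
                jacobian[i, j] = (plus[i] - network(inputs[i], weights)) / (2.0 * step);

            weights[j] = saved; // restore the original weight
        }
        return jacobian;
    }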

Approximating the Hessian

For the least-squares problem, the Hessian generally doesn’t need to be calculated. As stated earlier, it can be approximated by using the Jacobian matrix with the formula:

H ≈ JtJ     (3)

This is a very good approximation of the Hessian if the residual errors at the solution are “small”. If the residuals are not sufficiently small at the solution, this approach may result in slow convergence. The Hessian can also be used to apply regularization to the learning process, which will be discussed later.

Solving the Levenberg-Marquardt equation

Levenberg’s main contribution to the method was the introduction of the damping factor λ. This value is added to every element of the approximated Hessian’s diagonal before the system is solved for the gradient. Typically, λ starts as a small value such as 0.1.
Then, the Levenberg-Marquardt equation is solved, commonly by using an LU decomposition. However, the system can only be solved if the approximated Hessian has not become singular (i.e. it still has an inverse). If it has, the equation can still be solved by using an SVD decomposition.
After the equation is solved, the weights w are updated using δ and network errors for each entry in the training set are recalculated. If the new sum of squared errors has decreased, λ is decreased and the iteration ends. If it has not, then the new weights are discarded and the method is repeated with a higher value for λ.
This adjustment of λ is done using an adjustment factor v, usually set to 10. If λ needs to increase, it is multiplied by v. If it needs to decrease, it is divided by v. The process is repeated until the error decreases. When this happens, the current iteration ends.

General Levenberg-Marquardt Algorithm

As stated earlier, the Levenberg-Marquardt algorithm consists basically of solving (2) with different λ values until the sum of squared errors decreases. So, each learning iteration (epoch) will consist of the following basic steps (a C# sketch of this loop follows the list):

  1. Compute the Jacobian (by using finite differences or the chain rule)
  2. Compute the error gradient
    • g = JtE
  3. Approximate the Hessian using the cross product Jacobian (eq. 3)
    1. H = JtJ
  4. Solve (H + λI)δ = g to find δ
  5. Update the network weights w using δ
  6. Recalculate the sum of squared errors using the updated weights
  7. If the sum of squared errors has not decreased,
    1. Discard the new weights, increase λ using v and go to step 4.
  8. Else decrease λ using v and stop.
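
Below is a compact C# sketch of this loop. The matrix helpers (Transpose, Multiply, AddToDiagonal, Solve, Add) and the error and Jacobian computations are hypothetical placeholders for real routines; this outlines the control flow only, not the article's actual implementation:

    // One Levenberg-Marquardt epoch, following the numbered steps above.
    // Transpose, Multiply, AddToDiagonal, Solve, Add, ComputeJacobian,
    // ComputeErrors and SumOfSquaredErrors are hypothetical helpers.
    double lambda = 0.1;       // Levenberg's damping factor
    const double v = 10.0;     // adjustment factor
    const double lambdaMax = 1e25;

    double[,] J = ComputeJacobian(weights);            // step 1
    double[] E = ComputeErrors(weights);               // output errors
    double[] g = Multiply(Transpose(J), E);            // step 2: g = JtE
    double[,] H = Multiply(Transpose(J), J);           // step 3: H = JtJ
    double currentError = SumOfSquaredErrors(weights);

    while (lambda < lambdaMax)
    {
        double[] delta = Solve(AddToDiagonal(H, lambda), g); // step 4
        double[] trial = Add(weights, delta);                // step 5
        double trialError = SumOfSquaredErrors(trial);       // step 6

        if (trialError < currentError)   // step 8: success
        {
            weights = trial;
            lambda /= v;
            break;                       // the epoch ends here
        }
        lambda *= v;                     // step 7: discard weights and retry
    }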

Variations of the algorithm may include different values for v, one for decreasing λ and another for increasing it. Others may solve (H + λ diag(H))δ = g instead of (H + λI)δ = g (2), while others may select the initial λ according to the size of the elements of H, by setting λ0 = t max(diag(H)), where t is a value chosen by the user. I’ve chosen the identity matrix equation because, apparently, it is the same method implemented internally by the Neural Network Toolbox in MATLAB.
We can see we will have a problem if the error does not decrease after some iterations. For this reason, the algorithm also stops if λ becomes too large.

Limitations

The Levenberg-Marquardt algorithm is very sensitive to the initial network weights. Also, it does not consider outliers in the data, which may lead to overfitting noise. To avoid those situations, we can use a technique known as regularization.

In the next part of this article (part 2), we’ll discuss more about Bayesian regularization. We will also present and demonstrate the usage of the article’s accompanying source code. Please click here to go to Neural Network Learning by the Levenberg-Marquardt Algorithm with Bayesian Regularization (part 2).

Neural Network Learning by the Levenberg-Marquardt Algorithm with Bayesian Regularization (part 2)


A complete explanation for the totally lost, part 2 of 2. Click here to return to part 1.

Downloads

The code has been incorporated in Accord.NET Framework, which includes the latest version of this code plus many other statistics and machine learning tools.

Contents

  1. Overview
    1. Neural Networks
    2. AForge Framework
    3. Network training as a function optimization problem
  2. Levenberg-Marquardt
    1. Computing the Jacobian
    2. Approximating the Hessian
    3. Solving the Levenberg-Marquardt equation
    4. General Algorithm
    5. Limitations
  3. Bayesian Regularization
    1. Expanded Levenberg-Marquardt Algorithm
  4. Source Code
    1. Using the Code
    2. Sample Applications
  5. Additional Notes
  6. References

Bayesian Regularization

To overcome the problem of interpolating noisy data, MacKay (1992) proposed a Bayesian framework which can be directly applied to the neural network learning problem. It also allows estimating the effective number of parameters actually used by the model – in this case, the number of network weights actually needed to solve a particular problem.
Bayesian regularization expands the cost function to search not only for the minimal error, but for the minimal error using the minimal weights. It works by introducing two Bayesian hyperparameters, alpha and beta, to tell which direction (minimal error or minimal weights) the learning process must seek.
The cost function will then become:

C(k) = β*Ed+α*Ew, where:

  1. Ed is the sum of squared errors, and
  2. Ew is the sum of squared weights

By using Bayesian regularization, one can avoid costly cross-validation. It is particularly useful for problems that cannot afford to reserve a portion of the available data for a validation set, or that would suffer from doing so. Regularization also reduces (or eliminates) the need for testing different numbers of hidden neurons for a problem. A third variable, gamma, indicates the number of effective weights being used by the network, thus giving an indication of how complex the network should be.
Many practical realizations of Bayesian regularization perform an update of the hyperparameters alpha and beta after each training cycle. However, according to Poland (2001), the most popular update algorithm fails to produce robust iterates if there is not much training data. The algorithm below shows how to update the hyperparameters according to both rules. Both methods base their calculations on the inverse Hessian matrix.

Expanded Levenberg-Marquardt Algorithm

Adding Bayesian regularization to the Levenberg-Marquardt algorithm adds little overhead to the process because we already have a Hessian approximation. So, the algorithm for a learning epoch then becomes (a C# sketch of the hyperparameter update step follows the list):

  1. Compute the Jacobian (by finite differences or using the chain rule)
  2. Compute the error gradient
    • g = JtE
  3. Approximate the Hessian using the cross product Jacobian
    1. H = JtJ
  4. Calculate the cost function
    1. C = β*Ed+α*Ew, where:
      1. Ed is the sum of squared errors, and
      2. Ew is the sum of squared weights
  5. Solve (H + λI)δ = g to find δ
  6. Update the network weights w using δ
  7. Recalculate the cost function using the updated weights
  8. If the cost has not decreased,
    1. Discard the new weights, increase λ using v and go to step 5.
  9. Else decrease λ using v
  10. Update the Bayesian hyperparameters using MacKay’s or Poland’s formulae:
    1. gamma = W – (alpha * tr(H⁻¹))
    2. beta = (N – gamma) / (2.0 * Ed)
    3. alpha = W / (2.0 * Ew + tr(H⁻¹)) [modified Poland’s update], or
    4. alpha = gamma / (2.0 * Ew) [original MacKay’s update], where:
      1. W is the number of network parameters (number of weights and biases)
      2. N is the number of entries in the training set
      3. tr(H⁻¹) is the trace of the inverse Hessian matrix
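
As an illustration, step 10 could be written as follows in C#. The variable names are my own, and tr(H⁻¹) is assumed to have already been computed from the Hessian approximation:

    // Step 10: update the Bayesian hyperparameters after a successful epoch.
    // 'trace' holds tr(H^-1); 'ed'/'ew' are the squared error/weight sums;
    // 'w' is the number of parameters, 'n' the number of training entries.
    gamma = w - alpha * trace;

    beta = (n - gamma) / (2.0 * ed);

    // Poland's modified update ...
    alpha = w / (2.0 * ew + trace);
    // ... or MacKay's original update:
    // alpha = gamma / (2.0 * ew);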

Source Code

Here is the code for running a learning epoch of the Levenberg-Marquardt algorithm.

Using the Code

Code usage is very simple. It follows the common AForge.NET learning algorithm usage, instantiating a LevenbergMarquardtLearning class instead of the more traditional BackpropagationLearning. Optionally, we may use a Jacobian approximated by finite differences instead of computing it directly (although this would not be very practical except for debugging purposes). We may also enable Bayesian regularization by passing a boolean parameter to the constructor.
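
A short usage sketch following this description is shown below. The LevenbergMarquardtLearning constructor arguments mirror the text above (a network plus a boolean enabling Bayesian regularization), but the exact signature may differ from the shipped code, and the training arrays are assumed to be defined elsewhere:

    using AForge.Neuro;
    using AForge.Neuro.Learning;
    // LevenbergMarquardtLearning comes with this article's sources.

    // Create a network: 2 inputs, one hidden layer with 2 neurons, 1 output.
    ActivationNetwork network = new ActivationNetwork(
        new BipolarSigmoidFunction(2), 2, 2, 1);

    // Use Levenberg-Marquardt instead of BackpropagationLearning; the boolean
    // enables Bayesian regularization (constructor signature assumed).
    var teacher = new LevenbergMarquardtLearning(network, true);

    // Standard AForge.NET training loop; 'input' and 'output' are the
    // double[][] training samples, assumed defined elsewhere.
    double error = double.MaxValue;
    while (error > 0.01)
        error = teacher.RunEpoch(input, output);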

Sample applications

The original sample applications from AForge.NET have been modified to use Levenberg-Marquardt learning. However, it is quite tricky to get them to converge adequately.

Function approximation using regular Levenberg-Marquardt

Function approximation using Levenberg-Marquardt with Bayesian regularization. Note that some of the runs do not converge to a best-fit solution; instead, a possibly more general solution is preferred. The top-right image uses the Nguyen-Widrow method for initializing weights.

Solving the XOR problem using Levenberg-Marquardt

Additional Notes

Some versions of the Levenberg-Marquardt algorithm solve the equation (JtJ + λ diag(JtJ))δ = JtE instead of (JtJ + λI)δ = JtE, effectively replacing the identity matrix with the diagonal of the approximated Hessian in the weight update rule. According to Wikipedia, this was suggested by Marquardt to incorporate some local curvature estimation. However, for some reason, some more recent implementations that carry the Levenberg-Marquardt name do not include Marquardt’s suggestion. Other implementations suggest using an initial λ related to the size of the elements of the approximated Hessian, by making λ0 = t max(diag(JtJ)), where t is a value chosen by the user.
It is also important to note that the Levenberg-Marquardt algorithm is extremely dependent on the initial guess for the network parameters. Depending on the initial weights of the network, the algorithm may converge to a local minimum or not converge at all.
Besides that, it is an extremely fast method for neural network learning when compared to the standard backpropagation algorithm.

Finally, if you have any comments about the article or about the code, please let me know! All suggestions are welcome.

References

Universal Approximation Theorem


The Universal Approximation Theorem states that:

Let φ(·) be a nonconstant, bounded, and monotonically-increasing continuous function. Let Im0 denote the m0-dimensional unit hypercube [0,1]^m0. The space of continuous functions on Im0 is denoted by C(Im0). Then, given any function f ∈ C(Im0) and ε > 0, there exist an integer m1 and sets of real constants αi, bi and wij, where i = 1, …, m1 and j = 1, …, m0, such that we may define:

 
  F(x_1, \dots, x_{m_0}) = \sum_{i=1}^{m_1} \alpha_i \varphi \left( \sum_{j=1}^{m_0} w_{i,j} x_j + b_i \right)

as an approximate realization of the function f; that is,

 
  | F(x_1, \dots, x_{m_0}) - f(x_1, \dots, x_{m_0}) | < \varepsilon

for all x1, x2, …, xm0 that lie in the input space.

[ From Wikipedia ]

Beautiful, isn’t it? 😛

Principal Component Analysis in C#


For the english version, Principal Component Analysis in C#, click this link.

Principal Component Analysis (PCA) is an exploratory tool created by Karl Pearson in 1901 to identify unknown trends in a multidimensional data set. The analysis involves a mathematical procedure that transforms a set of possibly correlated variables into a smaller set of uncorrelated variables called principal components. The first principal components account for as much of the variability contained in the data as possible, while the later components account for the remaining information, in decreasing order of variability.

Overview

The PCA technique essentially rotates the set of points around its mean in order to align it with the first principal components. This moves as much of the variance as possible (using a linear transformation) into the first dimensions. The values in the remaining dimensions therefore tend to be highly correlated and can be dropped with minimal loss of information.

For a more detailed explanation of PCA, read the excellent text Tutorial On Principal Component Analysis by Lindsay Smith (2002).

 

AForge.NET Framework

My initial idea was to implement PCA directly inside the AForge.NET Framework. AForge.NET is an excellent open-source Artificial Intelligence and Computer Vision framework written for .NET and developed primarily by Andrew Kirillov.

However, since this would involve many changes and additions to the framework, I believe it would be very difficult to review all the changes required to add this functionality directly to its source code.

Because of that, this new and cleaner code was implemented using only the AForge binaries as a starting point. I hope this code helps others who need to perform Principal Component Analysis in their own C# projects.

 

This new library, which I called Accord.NET, extends AForge.NET by adding new features such as Principal Component Analysis, numerical decompositions, and a few other mathematical transformations and tools. In fact, this extension is much more like a testing ground for new features I would like to see in future versions of AForge. For simplicity, however, I have omitted the parts not related to PCA and created this compact version just to support the analysis.

 

Design Decisions

As people who want to use PCA in their projects usually already have their own matrix classes, I decided to avoid using specific implementations in order to make the code more flexible. I also tried to avoid dependencies on other methods whenever possible, to keep the code fairly independent. I believe this has also made the code simpler to understand.

The code is divided between these two projects:

  • Accord.Math, which provides the mathematical tools, such as decompositions and transformations; and
  • Accord.Statistics, which provides statistical analyses and tools.

Both depend on the AForge.NET core. Additionally, their internal structure and organization also try to mimic those employed in AForge whenever possible.

 

The source code provided does not include the full source code of the Accord Framework, which remains a testing ground for what would be interesting to see in AForge, but includes only the limited portions of the code needed to support PCA.

 

Code Overview

Below is the main code behind the PCA.

 

Using the Code

To perform a simple analysis, you can simply instantiate a new PrincipalComponentAnalysis object, passing in your data set, and call its Compute method to compute the model. Then you can simply call the Transform method to project the data into the principal component space.

A code example demonstrating its usage is shown below.
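
Below is a sketch of that usage. The data values are the first points of Lindsay's tutorial example, and the exact method signatures may differ from the accompanying source code:

    // Sample data: rows are observations, columns are variables.
    double[,] data =
    {
        { 2.5, 2.4 },
        { 0.5, 0.7 },
        { 2.2, 2.9 },
        { 1.9, 2.2 },
        { 3.1, 3.0 },
    };

    // Create the analysis and compute the model.
    var pca = new PrincipalComponentAnalysis(data);
    pca.Compute();

    // Project the data onto the space of the first principal component.
    double[,] projection = pca.Transform(data, 1);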

 

Demonstration

To demonstrate the use of Principal Component Analysis, I created a simple Windows Forms application which performs simple data analysis and PCA transformations on Excel spreadsheets.

The application can open Excel workbooks. Here we are loading some random Gaussian data, some random Poisson data, and a linear multiplication of the first variable (thus also being Gaussian).
 
Simple descriptive analysis of the source data, with a histogram plot of the first variable. We can see it fits a Gaussian distribution with 1.0 mean and 1.0 standard deviation.
 
Here we perform PCA by using the Correlation method (actually the transformation uses SVD on the standardized data, not the correlation matrix, but the effect is the same). Because the third variable is a linear multiplication of the first, the analysis detected it was totally irrelevant, thus having zero importance level.
 
Now we can make a projection of the source data using only the first two components.
 

 

Note: The principal components are not unique because the singular value decomposition is not unique. Moreover, the signs in the rotation matrix are arbitrary, and may therefore differ between different PCA programs.

Together with the demo application comes an Excel spreadsheet containing several data examples. The first example is the same one used by Lindsay in his Tutorial on Principal Component Analysis. Other examples include Gaussian data, uncorrelated data and linear combinations of Gaussian data to further illustrate the analysis.

 

Well, once again, I hope someone finds this code useful! 😛

Principal Component Analysis in C#

Principal Component Analysis (PCA) is an exploratory tool designed by Karl Pearson in 1901 to identify unknown trends in a multidimensional data set. It involves a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components.

Foreword

Before you read this article, please keep in mind that it was written before the Accord.NET Framework was created and became popular. As such, if you would like to do Principal Component Analysis in your projects, download the accord-net framework from NuGet and either follow the starting guide or download the PCA sample application from the sample gallery in order to get up and running quickly with the framework.

Introduction

PCA essentially rotates the set of points around their mean in order to align with the first few principal components. This moves as much of the variance as possible (using a linear transformation) into the first few dimensions. The values in the remaining dimensions, therefore, tend to be highly correlated and may be dropped with minimal loss of information. Please note that the signs of the columns of the rotation matrix are arbitrary, and so may differ between different programs for PCA.

For a more complete explanation of PCA, please visit Lindsay Smith’s excellent Tutorial On Principal Component Analysis (2002).

Accord.NET Framework

This new library, which I called Accord.NET, was initially intended to extend the AForge.NET Framework through the addition of new features such as Principal Component Analysis, numerical decompositions, and a few other mathematical transformations and tools. However, the library I created grew larger than the original framework I was trying to extend. In a few months, both libraries will merge under Accord.NET. (Update April 2015)

Design decisions

As people who want to use PCA in their projects usually already have their own Matrix class definitions, I decided to avoid using custom Matrix and Vector classes in order to make the code more flexible. I also tried to avoid dependencies on other methods whenever possible, to make the code very independent. I think this also made the code simpler to understand.

The code is divided into two projects:

  • Accord.Math, which provides mathematical tools, decompositions and transformations, and
  • Accord.Statistics, which provides the statistical analysis, statistical tools and visualizations.

Both of them depend on the AForge.NET core. Also, their internal structure and organization try to mimic AForge’s wherever possible.

The given source code doesn’t include the full source of the Accord Framework, which remains as a test bed for new features I’d like to see in AForge.NET. Rather, it includes only limited portions of the code to support PCA. It also contains code for Kernel Principal Component Analysis, as both share the same framework. Please be sure to look for the correct project when testing.

Code overview

Below is the main code behind PCA.

Using the code

To perform a simple analysis, you can simply instantiate a new PrincipalComponentAnalysis object, passing in your data, and call its Compute method to compute the model. Then you can simply call the Transform method to project the data into the principal component space.

Sample code demonstrating its usage is presented below.
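
A brief sketch of that flow is shown below. The AnalysisMethod.Correlation option mirrors the Correlation method used in the demonstration further down; the exact signatures may differ from the downloadable code:

    // Source data: rows are observations, columns are variables.
    double[,] data =
    {
        { 2.5, 2.4 },
        { 0.5, 0.7 },
        { 2.2, 2.9 },
        { 1.9, 2.2 },
        { 3.1, 3.0 },
    };

    // Create the analysis using the correlation method (standardized data).
    var pca = new PrincipalComponentAnalysis(data, AnalysisMethod.Correlation);

    // Compute the model, then project onto the first two components.
    pca.Compute();
    double[,] projection = pca.Transform(data, 2);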

Example application

To demonstrate the use of PCA, I created a simple Windows Forms Application which performs simple statistical analysis and PCA transformations.


The application can open Excel workbooks. Here we are loading some random Gaussian data, some random Poisson data, and a linear multiplication of the first variable (thus also being Gaussian).
 
Simple descriptive analysis of the source data, with a histogram plot of the first variable. We can see it fits a Gaussian distribution with 1.0 mean and 1.0 standard deviation.
 
Here we perform PCA by using the Correlation method. Actually, the transformation uses SVD on the standardized data rather than on the correlation matrix, the effect being the same. As the third variable is a linear multiplication of the first, the analysis detected it as irrelevant, thus having a zero importance level.
 
Now we can make a projection of the source data using only the first two components.
 

 

Note: The principal components are not unique because the Singular Value Decomposition is not unique. Also the signs of the columns of the rotation matrix are arbitrary, and so may differ between different programs for PCA.

Together with the demo application comes an Excel spreadsheet containing several data examples. The first example is the same used by Lindsay on his Tutorial on Principal Component Analysis. The others include Gaussian data, uncorrelated data and linear combinations of Gaussian data to further exemplify the analysis.

I hope this code and example can be useful! If you have any comments about the code or the article, please let me know.

See also

A Tutorial On Principal Component Analysis with the Accord.NET Framework

This is a tutorial for those who are interested in learning how PCA works and how each step of Lindsay’s tutorial can be computed in the Accord.NET Framework, in C#.

Kernel Principal Component Analysis in C#

This is the non-linear extension of Principal Component Analysis. While linear PCA is restricted to rotating or scaling the data, kernel PCA can do arbitrary transformations (such as folding and twisting the data and the space that contains the data).