Sequence Classifiers in C#: Hidden Markov Models


A few days ago I published a new article on CodeProject about classifiers based on banks of hidden Markov models for sequence classification.

While I have written about this subject in the past on this very blog, this time the article deals with Gaussian, continuous-density hidden Markov models rather than discrete ones. Moreover, the Accord.NET Framework has evolved considerably since that 2010 post, and the new article reflects most of the improvements and additions made over the last two years.

In the meantime, this article also serves as a hook for a future article about Hidden Conditional Random Fields (HCRFs). HCRF models can serve the same purpose as HMMs, but can be generalized to arbitrary graph structures and trained discriminatively, which can be an advantage in classification tasks.

As always, I hope readers can find it a good read 🙂

Deep Learning Artificial Neural Networks: Speech Recognition and Universal Translators

Happy new year everyone!

With the beginning of this year, I would like to share a video I wish I had found earlier. It is about the recent breakthrough achieved by deep neural networks in the field of speech recognition. Although I knew it was a breakthrough, I did not know it was already leading to such surprisingly good results.

 

Deep neural networks are also available in the Accord.NET Framework. However, they are a very recent addition; if you find any issues or bugs, or just wish to collaborate on development, please let me know!

Machine Learning Books and Lectures

Jordan, one of the readers of the blog, wrote to point out some cool references for machine learning, mathematics and artificial intelligence.

Thanks again for all your contributions. I’m a .NET programmer coming from a background of studying politics and Latin American studies so when the moment came that I realized I was interested in A.I., I didn’t have many resources to turn to.

I have a collection of books and links that I found very, very helpful when I was trying to learn about these concepts. I was thinking they could be useful to some of your readers who were in my shoes two years ago. So without further ado:

Handbook of Normal Frames and Coordinates

"Handbook of Normal Frames and Coordinates," Iliev, Bozhidar Z. (2006)

“This can be a heavy text, but if you understand the concept, this can take you to the next level – I would categorize this as Topology but has applications in quantum physics, theoretical mathematics, theoretical physics etc…- in context of your blogs – granted this is a dedicated read – I think readers will be able to appreciate and truly understand what a Hilbert Space is after reading this book.”

Linear Algebra

"Linear Algebra," Jim Heffron (2008)

“I liked this one because it was free and is still highly rated – I have also heard great reviews about David Poole’s book on linear algebra, but I couldn’t get myself to shell out the money when this was free.”

Complex To Real

http://www.complextoreal.com/tutorial.htm

“There are a ton of great articles here – I have personally read the ones on fourier transforms and wavelet transforms – great stuff”

Stanford Lectures – Fourier Analysis series

(free through Youtube or iTunesU)

“Email the Professor for the companion book. At this point it may have been published – but well worth shelling out the dough in any case.”

Bio-Inspired Artificial Intelligence

"Bio-Inspired Artificial Intelligence," Floreano and Mattiussi (2008)

“Excellent reference – fairly in depth for the number of topics it covers, lots of illustrations (illustrations are always a good thing 🙂 and I’ve also found it to be a useful source for inspiration. I bought this book while I was looking into fuzzy swarm intelligence – it doesn’t have all the answers, but I am simply in love with this book.”

Video lectures on Machine Learning

http://videolectures.net/pascal/

“A collection of video lectures featuring very brilliant people – let’s face it… if you’re doing or are interested in mathematics this complex… you probably don’t know too many people who you can talk to / learn from on the subject unless you’re in a University studying this kind of thing – these are some great lectures on machine learning – I just recently found this site but wish I had found it sooner – it’s great if you’re just exploring machine learning or are very well versed in it – however, I fall somewhere in the middle of that distribution so take it for what it’s worth!”

Fearless Symmetry

"Fearless Symmetry," Ash and Gross (2006)

Another book accessible to those without heavy training in math – a great intro to Galois Theory, the Riemann Hypothesis and several other concepts.

Zero: The Biography of a Dangerous Idea

    "Zero: The Biography of a Dangerous Idea," Charles Seife (2000)

This one is more historical and conceptual than technical but it’s a captivating read and will help get you through those hard times when you want to put down that book on K-Dimensional Manifolds, but still need to satisfy your mathematical mind (some say it’s never at rest once you catch that "learning bug").

Khan Academy

http://www.khanacademy.org/

Finally,  when you get lost, go here! The Khan Academy is a not-for-profit 501(c)(3) with the mission of providing a world-class education to anyone, anywhere.

 

Thanks Jordan! I hope readers can enjoy those resources as much as I did. Cheers!

K-Means clustering in C#


K-Means is one of the most popular, classic and simple approaches to clustering. Clustering is a method of unsupervised learning, and a common technique for statistical data analysis used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics [3].

The code presented here is also part of the Accord.NET Framework. The Accord.NET Framework is a C# framework for developing machine learning, computer vision, computer audition, statistics and math applications in .NET. It is based on the AForge.NET Framework. Please see the starting guide for more details. The latest version of the framework includes the latest version of this code plus many other statistics and machine learning tools.

Contents

  1. Introduction
    1. Lloyd’s K-Mean algorithm
  2. Source code
  3. Using the code
  4. Sample application
  5. Conclusion
  6. Acknowledgements
  7. References
  8. See also

Introduction

In statistics and machine learning, k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean [4]. In its most common form, the algorithm is an iterative greedy algorithm which converges to a local optimum after a certain number of iterations.


Illustration of the K-Means algorithm.

The algorithm works by first selecting k locations at random to be the initial centroids for the clusters. Each observation is then assigned to the cluster with the nearest centroid, and the centroids are recalculated as the mean of the observations assigned to them. The algorithm repeats this process until the cluster centroids no longer change, or until the change is smaller than a given threshold.

There are other refinements and extensions of the algorithm. The version depicted above is its most common form, also referred to as Lloyd's algorithm.

Lloyd’s K-Means algorithm

  1. Place K points into the space represented by the objects that are being clustered. These points represent initial group centroids.
  2. Assign each object to the group that has the closest centroid.
  3. When all objects have been assigned, recalculate the positions of the K centroids.
  4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
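
For readers who prefer code to pseudocode, below is a minimal, self-contained C# sketch of the iteration just described. It is only an illustration using plain arrays and the squared Euclidean distance, not the framework's implementation.

    using System;

    static class LloydKMeans
    {
        // Runs Lloyd's algorithm on 'data' (one observation per row) and
        // returns the cluster label assigned to each observation.
        public static int[] Cluster(double[][] data, int k, int maxIterations = 100)
        {
            var random = new Random();
            int dimensions = data[0].Length;

            // 1. Pick k observations at random as the initial centroids
            //    (duplicates are possible in this naive sketch).
            var centroids = new double[k][];
            for (int j = 0; j < k; j++)
                centroids[j] = (double[])data[random.Next(data.Length)].Clone();

            var labels = new int[data.Length];
            for (int i = 0; i < labels.Length; i++) labels[i] = -1;

            for (int iteration = 0; iteration < maxIterations; iteration++)
            {
                bool changed = false;

                // 2. Assign each observation to the cluster with the nearest
                //    centroid (squared Euclidean distance).
                for (int i = 0; i < data.Length; i++)
                {
                    int nearest = 0;
                    double best = double.MaxValue;
                    for (int j = 0; j < k; j++)
                    {
                        double distance = 0;
                        for (int d = 0; d < dimensions; d++)
                        {
                            double diff = data[i][d] - centroids[j][d];
                            distance += diff * diff;
                        }
                        if (distance < best) { best = distance; nearest = j; }
                    }
                    if (labels[i] != nearest) { labels[i] = nearest; changed = true; }
                }

                // 4. Stop when no observation changed cluster.
                if (!changed) break;

                // 3. Recalculate each centroid as the mean of its assigned observations.
                var counts = new int[k];
                var sums = new double[k][];
                for (int j = 0; j < k; j++) sums[j] = new double[dimensions];

                for (int i = 0; i < data.Length; i++)
                {
                    counts[labels[i]]++;
                    for (int d = 0; d < dimensions; d++)
                        sums[labels[i]][d] += data[i][d];
                }

                for (int j = 0; j < k; j++)
                    if (counts[j] > 0)
                        for (int d = 0; d < dimensions; d++)
                            centroids[j][d] = sums[j][d] / counts[j];
            }

            return labels;
        }
    }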

 

Source code

The source code has been implemented using Accord.NET and is now part of the framework. In the current version (2.1.2), the following classes related to K-Means are contained in the Accord.MachineLearning namespace. The source code available on this page contains only the parts of the framework actually needed to support the algorithm.

Class diagram for the K-Means algorithm.

The KMeans class is the main class representing the K-Means algorithm. The algorithm itself is implemented in the Compute(double[][] data, double threshold) method, which accepts a set of observations and a convergence threshold to determine when the method should stop.

 

The implementation is quite straightforward and does not use additional techniques to avoid convergence problems. More refined techniques may be added to the implementation in the future, so please make sure to download the latest version of Accord.NET Framework for the most up-to-date revision of the code.

Using the code

To use the code, create a new instance of the KMeans class, passing the desired number of clusters to its constructor. Optionally, you may also pass a distance function to be used as the distance metric during clustering. The default is the square Euclidean distance.
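
For illustration, a usage sketch along the lines described above (the class and method names follow the text; exact signatures may differ between framework versions):

    // Sketch only: KMeans and Compute(double[][], double) follow the description
    // above; consult the framework documentation for the exact signatures.
    double[][] observations =
    {
        new double[] { -5.0, -2.5 },
        new double[] { -5.1, -2.4 },
        new double[] {  1.0,  1.5 },
        new double[] {  1.1,  1.7 },
        new double[] {  8.0,  8.0 },
        new double[] {  7.9,  8.2 },
    };

    // Create the K-Means algorithm with k = 3 clusters
    // (square Euclidean distance is assumed by default).
    KMeans kmeans = new KMeans(3);

    // Run the algorithm until centroid changes fall below the threshold,
    // obtaining one cluster label for each observation.
    int[] labels = kmeans.Compute(observations, 1e-5);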

 

Sample application

The k-means clustering algorithm is commonly used in computer vision as a form of image segmentation. The results of the segmentation are often used to aid border detection and object recognition. The sample application performs image segmentation using the standard squared Euclidean distance over RGB pixel color space. There are, however, better distance metrics to be used for image segmentation, such as weighted distances and other color spaces, which will not be addressed in this example.


Original image (from Ossi Petruska Flickr page*).

To perform image segmentation, we will first translate our image into an array of pixel values. The image will be read, pixel by pixel, into a jagged array where each element is a double array of length 3. Each of these double arrays will contain the three RGB values scaled to the interval [−1, 1].

After we perform clustering on this array of pixel values, each pixel will have an associated cluster label. Each pixel value will then be replaced by its corresponding cluster centroid. The source code below is called when one clicks the Run button in the application.
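
As an illustration of these two steps, here is a rough sketch using plain System.Drawing calls (GetPixel/SetPixel are slow, but keep the idea clear). The cluster labels and centroids are assumed to come from the K-Means call shown earlier; the helper names are mine, not the sample application's.

    using System.Drawing;

    static class SegmentationSketch
    {
        // Converts the bitmap into a jagged array of RGB triplets scaled to [-1, +1].
        public static double[][] ToPixelArray(Bitmap image)
        {
            var pixels = new double[image.Width * image.Height][];
            for (int y = 0; y < image.Height; y++)
                for (int x = 0; x < image.Width; x++)
                {
                    Color c = image.GetPixel(x, y);
                    pixels[y * image.Width + x] = new double[]
                    {
                        c.R / 127.5 - 1.0,  // red   in [-1, +1]
                        c.G / 127.5 - 1.0,  // green in [-1, +1]
                        c.B / 127.5 - 1.0   // blue  in [-1, +1]
                    };
                }
            return pixels;
        }

        // Replaces every pixel by the centroid of the cluster it was assigned to.
        public static void ApplySegmentation(Bitmap image, int[] labels, double[][] centroids)
        {
            for (int y = 0; y < image.Height; y++)
                for (int x = 0; x < image.Width; x++)
                {
                    double[] c = centroids[labels[y * image.Width + x]];
                    image.SetPixel(x, y, Color.FromArgb(
                        (int)((c[0] + 1.0) * 127.5),
                        (int)((c[1] + 1.0) * 127.5),
                        (int)((c[2] + 1.0) * 127.5)));
                }
        }
    }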

After segmentation, the following resulting images can be obtained:


Same image after K-Means clustering with k = 5.


Image after K-Means clustering with k = 10.

 

* The sample image used above has been licensed by Ossi Petruska under a Creative Commons Attribution-NonCommercial-ShareAlike 2.0 Generic license.

Conclusion

K-Means is a very simple and popular approach to clustering. The implementation presented here is the same implementation used in Accord.NET. As can be seen, the method can be easily extended with custom distance functions through delegates or lambda expressions, and can be used in different contexts, such as image segmentation, without further modification. As a suggestion for improvement, the method can be further sped up by using the triangle inequality, as suggested in the paper "Using the triangle inequality to accelerate k-means" by Charles Elkan.

In the next article, we will see how we can use K-Means to initialize the Expectation-Maximization algorithm for estimating Gaussian Mixture densities in Gaussian Mixture Models. Those articles will form the basis for Continuous density Hidden Markov Models.

Acknowledgements

To Antonino Porcino, who provided the first version of the code as well as valuable information about many other methods and algorithms.

References

See also

Logistic Regression in C#

Logistic regression (sometimes called the logistic model or logit model) is used for prediction of the probability of occurrence of an event by fitting data to a logistic curve. Logistic regression is used extensively in the medical and social sciences as well as marketing applications such as prediction of a customer’s propensity to purchase a product or cease a subscription.

The code presented here is part of the Accord.NET Framework. The Accord.NET Framework is a framework for developing machine learning, computer vision, computer audition, statistics and math applications in .NET. To use the framework in your projects, install it by typing Install-Package Accord.MachineLearning in your IDE’s NuGet package manager.

Overview

Logistic regression is a generalized linear model used for binomial regression. Like many forms of regression analysis, it makes use of several predictor variables that may be either numerical or categorical.

Logistic Sigmoid Function

The logistic sigmoid function is given by

g(z) = 1 / (1 + Exp(-z))

where, in the context of logistic regression, z is called the logit.

Logistic Regression Model

The logistic regression model is a generalized linear model. This means that it is just a linear regression model taken as input for a non-linear link function. The linear model has the form

z = c1x1 + c2x2 + … + cnxn + i = ct x + i

where c is the coefficient vector, i is the intercept value and x is the observation vector for the n variables; in the context of logistic regression, z is called the logit. The logit is then given as input to the nonlinear logistic sigmoid function g(z), resulting in a probability.

So, in a binomial problem where we are trying to determine whether an observation belongs to class C1 or class C2, the logistic model tells us that:

p(C1|x) = g(ct x + i)

p(C2|x) = 1 – p(C1|x)

where p(C1|x) denotes the probability of class C1 given the observation x.
In other words, it denotes the probability of x belonging to class C1.
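
As a small, plain C# illustration of the two formulas above (not framework code):

    using System;

    static class LogisticModel
    {
        // p(C1|x) for a logistic model with coefficient vector c and intercept i.
        public static double ProbabilityOfC1(double[] c, double intercept, double[] x)
        {
            double z = intercept;                  // the logit starts at the intercept
            for (int j = 0; j < c.Length; j++)
                z += c[j] * x[j];                  // z = c'x + i
            return 1.0 / (1.0 + Math.Exp(-z));     // g(z) = 1 / (1 + Exp(-z))
        }

        // p(C2|x) is simply 1 - ProbabilityOfC1(c, intercept, x).
    }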

Coefficients

The coefficients for the logistic regression are the values which multiply each observation variable from a sample input vector. The intercept is analogous to the independent term in a polynomial linear regression model or the threshold or bias value for a neuron in a neural network model.

Odds Ratio

After the logistic regression model has been computed, every coefficient will have an associated measure called the odds ratio. The odds ratio is a measure of effect size, describing the strength of association or non-independence between two binary data values. It can be obtained by raising Euler's number to the coefficient value:

odds ratio_c = e^c

Standard Error

The standard error for the coefficients can be retrieved from the inverse Hessian matrix calculated during the model fitting phase and can be used to give confidence intervals for the odds ratio. The standard error for the i-th coefficient of the regression can be obtained as:

SE_i = sqrt(diag(H^-1)_i)

Confidence Intervals

The confidence interval around a logistic regression coefficient is plus or minus 1.96 * SE_i, where SE_i is the standard error for coefficient i. We can then define:

95% C.I._i = <lower, upper> = <coefficient_i - 1.96 * SE_i, coefficient_i + 1.96 * SE_i>

Wald Statistic and Wald’s Test

The Wald statistic is the ratio of the logistic coefficient to its standard error. A Wald test is used to test the statistical significance of each coefficient in the model. A Wald test calculates a Z statistic, which is:

z = coefficient / standard error

This z value can be squared, yielding a Wald statistic with a chi-square distribution, or, alternatively, it can be taken as is and compared directly with a Normal distribution.

The Wald test outputs a p-value indicating the significance of each individual independent variable. If the value is below a chosen significance threshold (typically 0.05), then the variable plays a role in determining the outcome that is most likely not the result of chance alone. However, there are some problems with the use of the Wald statistic, and the likelihood-ratio test is a better alternative.
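
Tying the last three subsections together, all of these quantities follow directly from a coefficient and its standard error. A small illustrative snippet (plain C#, not framework code):

    using System;

    static class CoefficientStatistics
    {
        // Given a fitted coefficient and its standard error, computes the
        // odds ratio, the 95% confidence intervals and the Wald z statistic.
        public static void Summarize(double coefficient, double standardError)
        {
            double oddsRatio = Math.Exp(coefficient);          // e^c
            double lower = coefficient - 1.96 * standardError; // 95% C.I. lower bound
            double upper = coefficient + 1.96 * standardError; // 95% C.I. upper bound
            double waldZ = coefficient / standardError;        // Wald statistic (z)

            Console.WriteLine($"Odds ratio: {oddsRatio:F4}");
            Console.WriteLine($"95% C.I. for the coefficient: [{lower:F4}, {upper:F4}]");
            Console.WriteLine($"95% C.I. for the odds ratio:  [{Math.Exp(lower):F4}, {Math.Exp(upper):F4}]");
            Console.WriteLine($"Wald z: {waldZ:F4} (compare against the standard Normal)");
        }
    }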

Likelihood-Ratio and Chi-Square Test

The likelihood-ratio is the ratio of the likelihood of the full model to the likelihood of a simpler nested model. When compared to the null model, the likelihood-ratio test gives an overall measure of model performance. When compared to nested models, each with one variable omitted, it tests the statistical significance of each coefficient in the model. These significance tests are considered more reliable than the Wald significance test.

The likelihood-ratio is a chi-square statistic given by:

D = -2 \ln \left( \frac{\text{likelihood for first model}}{\text{likelihood for second model}} \right)

The model with more parameters will always fit at least as well (have a greater log-likelihood). Whether it fits significantly better, and should therefore be preferred, can be determined by deriving the probability or p-value of the obtained difference D. In many cases, the probability distribution of the test statistic can be approximated by a chi-square distribution with (df1 − df2) degrees of freedom, where df1 and df2 are the degrees of freedom of models 1 and 2, respectively.

Regression to a Logistic Model

If we consider the mapping

φ(<x1, x2, …, xn>) = <x1, x2, … xn, 1>

The logistic regression model can also be rewritten as

p(C1|φ) = g(wt φ) = f(φ, w)

so that w contains all coefficients and the intercept value in a single weight vector. Doing so allows the logistic regression model to be expressed as a common optimization problem of the form f(φ, w), allowing many standard non-linear optimization algorithms to be applied directly in the search for the parameter vector w that best fits the probability of a class C1 given an input vector φ.

Likelihood function

The likelihood function for the logistic model is given by:

p(t|w) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}

but, as the log of a product equals the sum of the logs, taking the log of the likelihood function results in the log-likelihood function in the form:

\ln p(t|w) = \sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}

Furthermore, if we take the negative of the log-likelihood, we obtain an error function called the cross-entropy error function:

E(w) = -\ln p(t|w) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}

which gives better numerical accuracy and enables us to write the error gradient in the same form as the gradient of the sum-of-squares error function for linear regression models (Bishop, 2006).
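
A direct translation of the cross-entropy error into C# (here y holds the model outputs g(wᵀφn) and t the 0/1 targets; the clamping constant is an implementation detail added here to avoid log(0)):

    using System;

    static class CrossEntropy
    {
        // E(w) = - sum over n of { t_n ln y_n + (1 - t_n) ln(1 - y_n) }
        public static double Error(double[] y, double[] t)
        {
            double error = 0;
            for (int n = 0; n < y.Length; n++)
            {
                // Clamp predictions away from 0 and 1 to avoid taking log(0).
                double p = Math.Min(Math.Max(y[n], 1e-12), 1 - 1e-12);
                error -= t[n] * Math.Log(p) + (1 - t[n]) * Math.Log(1 - p);
            }
            return error;
        }
    }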

Another important detail is that the log-likelihood surface is concave (equivalently, the cross-entropy error is convex), so any local maximum is also the global maximum. One does not need to worry about getting trapped in a valley when walking down the error function.

Iterative Reweighted Least-Squares

The method of Iterative Reweighted Least-Squares is commonly used to find the maximum likelihood estimates of a generalized linear model. In most common applications, an iterative Newton-Raphson algorithm is used to calculate those maximum likelihood values for the parameters. This algorithm uses second order information, represented in the form of a Hessian matrix, to guide incremental coefficient changes. This is also the algorithm used in this implementation.

In the Accord.NET machine learning framework, the Iterative Reweighted Least-Squares method is implemented in the IterativeReweightedLeastSquares class (source).

Source Code

Below is the main code segment for the logistic regression, performing one pass of the Iterative Reweighted Least-Squares algorithm.
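
The framework's implementation is more elaborate; the following self-contained sketch only illustrates the essence of one Newton-Raphson (IRLS) update, w ← w + (XᵀRX)⁻¹ Xᵀ(t − y) with R = diag(yn(1 − yn)), using a small Gaussian-elimination solver so that no external dependencies are needed. The design-matrix convention (a trailing 1 per row for the intercept) is an assumption made for the sketch.

    using System;

    static class IrlsStep
    {
        // One IRLS (Newton-Raphson) update for logistic regression.
        // X is the design matrix (each row already includes a trailing 1 for the
        // intercept), t are the 0/1 targets and w is the current weight vector,
        // which is updated in place.
        public static void Update(double[][] X, double[] t, double[] w)
        {
            int n = X.Length, p = w.Length;
            var gradient = new double[p];
            var hessian = new double[p, p];

            for (int i = 0; i < n; i++)
            {
                // Current model output y_i = g(w'x_i).
                double z = 0;
                for (int j = 0; j < p; j++) z += w[j] * X[i][j];
                double y = 1.0 / (1.0 + Math.Exp(-z));
                double r = y * (1.0 - y);                         // weight R_ii

                for (int j = 0; j < p; j++)
                {
                    gradient[j] += X[i][j] * (t[i] - y);          // X'(t - y)
                    for (int k = 0; k < p; k++)
                        hessian[j, k] += r * X[i][j] * X[i][k];   // X'RX
                }
            }

            // Solve (X'RX) delta = X'(t - y), then take the Newton step w += delta.
            double[] delta = Solve(hessian, gradient);
            for (int j = 0; j < p; j++) w[j] += delta[j];
        }

        // Solves A x = b by Gaussian elimination with partial pivoting.
        static double[] Solve(double[,] A, double[] b)
        {
            int n = b.Length;
            var x = (double[])b.Clone();
            var M = (double[,])A.Clone();

            for (int col = 0; col < n; col++)
            {
                int pivot = col;
                for (int row = col + 1; row < n; row++)
                    if (Math.Abs(M[row, col]) > Math.Abs(M[pivot, col])) pivot = row;
                for (int k = 0; k < n; k++)
                    (M[col, k], M[pivot, k]) = (M[pivot, k], M[col, k]);
                (x[col], x[pivot]) = (x[pivot], x[col]);

                for (int row = col + 1; row < n; row++)
                {
                    double factor = M[row, col] / M[col, col];
                    for (int k = col; k < n; k++) M[row, k] -= factor * M[col, k];
                    x[row] -= factor * x[col];
                }
            }

            for (int row = n - 1; row >= 0; row--)
            {
                for (int k = row + 1; k < n; k++) x[row] -= M[row, k] * x[k];
                x[row] /= M[row, row];
            }
            return x;
        }
    }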

Furthermore, as the error surface is convex, the Logistic Regression Analysis can perform the regression without having to experiment with different starting points. Below is the code to compute a logistic regression analysis. The algorithm iterates until the largest absolute parameter change between two iterations becomes less than a given limit, or the maximum number of iterations is reached.

Using the code

Code usage is very simple. To perform a Logistic Regression Analysis, simply create an object of type LogisticRegressionAnalysis and pass the input and output data as constructor parameters. Optionally, you can also pass the variable names.

Then just call Compute() to compute the analysis.

After that, you can display information about the regression coefficients by binding the CoefficientCollection to a DataGridView.
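
Putting those three steps together, a usage sketch might look as follows (the class, method and property names follow the description above; constructor overloads may differ between framework versions, and dataGridView1 stands in for whatever grid control your form defines):

    // Sketch only; consult the framework documentation for exact signatures.
    double[][] inputs =
    {
        new double[] { 55, 0 },   // e.g. age, smoker
        new double[] { 28, 0 },
        new double[] { 65, 1 },
        new double[] { 46, 0 },
        new double[] { 86, 1 },
    };

    double[] outputs = { 0, 0, 1, 1, 1 };   // observed outcome (0 or 1)

    // Create the analysis, optionally naming the variables.
    var regression = new LogisticRegressionAnalysis(inputs, outputs,
        new[] { "Age", "Smoker" }, "Disease");

    // Compute the analysis (iterates IRLS until convergence).
    regression.Compute();

    // Inspect or data-bind the results.
    dataGridView1.DataSource = regression.CoefficientCollection;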

Sample application

The sample application creates Logistic Regression Analyses by reading Excel workbooks. It can also perform standard Linear Regressions, although there aren’t many options available to specify linear models.


Example data set adapted from http://www.statsdirect.co.uk/help/regression_and_correlation/logi.htm.

 


The image on the left shows information about the analysis performed, such as coefficient values, standard errors, p-value for the Wald statistic, Odds ratio and 95% confidence intervals. It also reports the log-likelihood, deviance and likelihood-ratio chi-square test for the final model. The newest available version can also compute likelihood-ratio tests for each coefficient, although not shown in the image.

Final Considerations

The logistic regression model can be seen as exactly the same as a one-layer MLP with only one hidden neuron using a sigmoid activation function and trained by a Newton's method learning algorithm. Thus we can say that logistic regression is just a special case of neural networks. However, one possible (and maybe unique) advantage of logistic regression is that it allows a simpler interpretation of its results, in the form of odds ratios and statistical hypothesis testing. Statistical analysis using neural networks is also possible, but some may argue it is not as straightforward as using ordinary logistic regression.

Multilayer perceptron neural networks with sigmoid activation functions can be created using the Accord.Neuro package and namespace.

References

A word on Neural Network learning


Some time ago, while doing random research, I came across this funny but important note about neural networks, and thought it was worth sharing.

The choice of the dimensionality and domain of the input set is crucial to the success of any connectionist model. A common example of a poor choice of input set and test data is the Pentagon’s foray into the field of object recognition. This story is probably apocryphal and many different versions exist on-line, but the story describes a true difficulty with neural nets.

As the story goes, a network was set up with the input being the pixels in a picture, and the output was a single bit, yes or no, for the existence of an enemy tank hidden somewhere in the picture. When the training was complete, the network performed beautifully, but when applied to new data, it failed miserably. The problem was that in the test data, all of the pictures that had tanks in them were taken on cloudy days, and all of the pictures without tanks were taken on sunny days. The neural net was identifying the existence or non-existence of sunshine, not tanks.

– David Gerhard; Pitch Extraction and Fundamental Frequency: History and Current Techniques, Technical Report TR-CS 2003-06, 2003.

 

🙂

Kernel Principal Component Analysis in C#


Kernel principal component analysis (kernel PCA) is an extension of principal component analysis (PCA) using techniques of kernel methods. Using a kernel, the originally linear operations of PCA are done in a reproducing kernel Hilbert space with a non-linear mapping.

The code presented here is also part of the Accord.NET Framework. The Accord.NET Framework is a framework for developing machine learning, computer vision, computer audition, statistics and math applications. It is based on the already excellent AForge.NET Framework. Please see the starting guide for more details. The latest version of the framework includes the latest version of this code plus many other statistics and machine learning tools.

Analysis Overview

KPCA is an extension of PCA to non-linear distributions. Instead of directly performing a PCA, the original data points are mapped into a higher-dimensional (possibly infinite-dimensional) feature space defined by a (usually nonlinear) function φ through a mathematical process called the "kernel trick".

The Kernel trick

The kernel trick is a very interesting technique, able to transform any algorithm that depends solely on the dot product between two vectors. To apply the trick, we replace all dot products in the algorithm with a kernel function. A linear algorithm can thus easily be transformed into a non-linear one, equivalent to the linear algorithm operating in the range space of φ. However, because kernels are used, the function φ is never explicitly computed. This is desirable [8], because the high-dimensional space can have an arbitrarily large dimensionality, or can even be infinite-dimensional (as in the case of the Gaussian kernel), making it infeasible to compute in explicitly.
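
As a concrete example, the Gaussian kernel k(x, y) = Exp(−‖x − y‖² / (2σ²)) corresponds to a dot product in an infinite-dimensional feature space, yet is computed entirely in input space. A plain C# illustration (not framework code):

    using System;

    static class Kernels
    {
        // Gaussian (RBF) kernel: k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).
        public static double Gaussian(double[] x, double[] y, double sigma)
        {
            double squaredNorm = 0;
            for (int i = 0; i < x.Length; i++)
            {
                double d = x[i] - y[i];
                squaredNorm += d * d;
            }
            return Math.Exp(-squaredNorm / (2 * sigma * sigma));
        }

        // Kernel matrix K, where K[i, j] = k(x_i, x_j): every place the linear
        // algorithm would use the dot product x_i . x_j, the kernel value is
        // used instead.
        public static double[,] KernelMatrix(double[][] data, double sigma)
        {
            int n = data.Length;
            var K = new double[n, n];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    K[i, j] = Gaussian(data[i], data[j], sigma);
            return K;
        }
    }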

Interesting video by Udi Aharoni demonstrating how points that are not linearly separable in a 2D space can almost always be linearly separated in a higher dimensional (3D) space.

Principal Components in Kernel Space

As in PCA, the overall idea is to perform a transformation that maximizes the variance of the captured variables while minimizing the overall covariance between them. Using the kernel trick, the covariance matrix is replaced by the kernel matrix and the analysis is carried out analogously in feature space. An eigenvalue decomposition is performed and the eigenvectors are sorted in descending order of eigenvalue, so that they form a basis in feature space that explains most of the variance in the data in its first dimensions.

However, because the principal components live in feature space, we will not be directly performing dimensionality reduction. Suppose the number of observations m exceeds the input dimensionality n. In linear PCA we can find at most n nonzero eigenvalues; with kernel PCA, on the other hand, we can find up to m nonzero eigenvalues, because we will be operating on an m × m kernel matrix [2].

A more detailed discussion can be seen in the works of Mika et al. (Kernel PCA and De-Noising in Feature Spaces) and Scholkopf et al. (Nonlinear Component Analysis as a Kernel Eigenvalue Problem). A more accessible explanation of the topic is also available in the work from Ian Fasel.

The pre-image problem

In standard PCA it is possible to revert the transformation, retrieving the original data points and enabling the use of PCA for compression* and denoising. However, as the results provided by kernel PCA live in some high-dimensional feature space, they need not have pre-images in input space. In fact, even when a pre-image exists, it need not be unique.

Thus, in KPCA it is only possible to obtain an approximate pre-image for a given set of patterns projected onto the feature space. This problem can be (and has been) tackled by many approaches, some of them iterative in nature, but some closed-form solutions also exist. Typically those solutions make use of the fact that distances in feature space are related to distances in input space, and try to find an optimal transformation that embeds the feature points in input space while respecting those distance relationships.

One of the first solutions of this form was proposed by Kwok and Tsang in their paper "The Pre-Image Problem in Kernel Methods". A better approach is also described by Bakir in his thesis "Extensions to Kernel Dependency Estimation", alongside a nice, comprehensive description of various other methods for handling the pre-image problem.

Note: the use of KPCA for compression is actually somewhat limited, since one needs to store the original data set to compute transformations and their approximate reversion.

 

Code Overview

Below is the main code behind KPCA. This is a direct, naive implementation of the algorithm that may not scale very well for very large datasets. It is implemented on top of the previous standard PCA code and thus on top of the AForge.NET Framework.
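
The framework class performs the complete analysis; as a stand-in, the sketch below shows only the step that is specific to KPCA, centering the kernel matrix in feature space (K̃[i,j] = K[i,j] − mean of row i − mean of column j + mean of all entries). The eigendecomposition of the centered matrix would then be delegated to a numerical routine, exactly as in standard PCA.

    static class KernelCentering
    {
        // Centers a kernel matrix in feature space:
        // K~[i,j] = K[i,j] - mean(row i) - mean(column j) + mean(all entries).
        public static double[,] Center(double[,] K)
        {
            int n = K.GetLength(0);
            var rowMean = new double[n];
            var colMean = new double[n];
            double totalMean = 0;

            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                {
                    rowMean[i] += K[i, j] / n;
                    colMean[j] += K[i, j] / n;
                    totalMean += K[i, j] / (n * (double)n);
                }

            var centered = new double[n, n];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    centered[i, j] = K[i, j] - rowMean[i] - colMean[j] + totalMean;

            return centered;
        }
    }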

 

The main code behind the reversion of the projection is a direct, naive implementation of the closed-form solution proposed by Kwok and Tsang. It is currently not optimized at all, but it gives a better overview of the algorithm in its essential form. The source code available for download may be better optimized in the future.

 

 

Using the Code

As with PCA, code usage is simple. First, choose a kernel implementing the IKernel interface, then pass it to the KernelPrincipalComponentAnalysis constructor together with the data points.

Then, simply call Compute() to perform the analysis. Once done, data about the analysis will be available through the object properties and transformations on additional data can be computed by calling Transform().
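
A usage sketch following that description (the class and interface names come from the text; the Gaussian kernel class and the exact constructor signatures are assumptions that may differ between framework versions):

    // Sketch only; consult the framework documentation for exact signatures.
    double[,] data =
    {
        { 2.5, 2.4 },
        { 0.5, 0.7 },
        { 2.2, 2.9 },
        { 1.9, 2.2 },
        { 3.1, 3.0 },
    };

    // Choose a kernel implementing IKernel (here, a Gaussian kernel with sigma = 10).
    IKernel kernel = new Gaussian(10.0);

    // Create and compute the analysis.
    var kpca = new KernelPrincipalComponentAnalysis(data, kernel);
    kpca.Compute();

    // Project points into kernel principal component space.
    double[,] projection = kpca.Transform(data);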

 

Sample Application

To demonstrate the use of KPCA, I created a simple Windows Forms Application which performs simple statistical analysis and KPCA transformations. It can also perform projection reversion by solving the approximate pre-image problem.


Left: Scholkopf's toy example. Middle: kernel principal components detected by the KPCA. Right: the first two dimensions obtained from the first two kernel principal components.

 


Left: Wikipedia example. Middle: Projection using a non-centered Gaussian kernel. Right: Reconstruction of the original data using all Kernel principal components, showing some denoising ability.

Linear PCA as a special case


Additionally, we may check that linear PCA is indeed a special case of kernel PCA by using a linear kernel on the example from Lindsay Smith's A Tutorial On Principal Component Analysis.

 

See also

 

References

Discriminatory Power Analysis by Receiver-Operating Characteristic Curves (Part 2 of 2: C# Source Code)


Part 2 of 2: C# Source Code. Go to Part 1 of 2: Theory.

Happy new year everyone! As promised, here is the second part of the article Performing Discriminatory Power Analysis by Receiver-Operating Characteristic Curves. This second part shows how to create ROC curves in C#. The following C# code implements the creation, visualization and analysis of ROC curves. The sample application accompanying the source code demonstrates how to use them inside your own applications.

 

Source code

Download source code and sample applications.

Confusion matrix

The ConfusionMatrix class represents a 2×2 contingency table. All derived measures mentioned above, such as sensitivity and specificity, are available through its properties and methods.

 

Receiver Operating Characteristic

The ReceiverOperatingCharacteristic class represents a ROC curve. The collection of points that define the curve are available through the Points property. Because points are ConfusionMatrix-inherited objects, the collection can be directly bound to a DataGridView, enabling quick curve visual inspection.

Additional information, such as the total Area Under the Curve and its associated deviation error are available through the Area and Error properties, respectively.

 

Using the code

Code usage is very simple. To use the aforementioned classes and create a ROC curve, simply create a new ReceiverOperatingCharacteristic object passing the actual data, as measured by the experiment, and the test data, as given by the prediction model.

The actual data must be a dichotomous variable with only two valid values. The "false" value is assumed to be the lowest value of the dichotomy, while the "true" value is assumed to be the highest. It is recommended to use 0 and 1 for simplicity, although this isn't mandatory.

The test data, as given by the prediction model, must have continuous values between the lowest and highest values for the actual data. For example, if the two valid values are 0 and 1, then its values must be inside the [0,1] range.

To compute the curve using different cutoff values, call the Compute method passing either the desired number of points or the desired increment in the cutoff value between points.
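
A usage sketch following that description (class, method and property names come from the text above; exact signatures may differ between versions):

    // Actual (measured) dichotomous outcomes and the model's continuous scores.
    double[] actual = { 0, 0, 0, 1, 1, 1, 1, 1 };
    double[] test   = { 0.1, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9, 0.95 };

    // Create the curve from the measured and predicted values.
    var roc = new ReceiverOperatingCharacteristic(actual, test);

    // Compute the curve using 100 cutoff points between the extreme values.
    roc.Compute(100);

    // Inspect the results: the area under the curve and its associated error
    // are available directly, and roc.Points can be bound to a DataGridView.
    Console.WriteLine("Area under the curve (AUC): " + roc.Area);
    Console.WriteLine("Standard error of the area: " + roc.Error);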

 

Sample applications

Together with the source code, there is an accompanying sample application demonstrating the use of the ROC analysis. To open the example table, click File –> Load, then select the Excel spreadsheet located inside the executable folder. To plot the curve, select the number of points or the threshold increment and click Plot.

A “good” classifier.

 

An approximately “random” classifier.

 

A perfect, ideal classifier.

 

Curve points for the “good” classifier.

 

Further Reading

  • Receiver Operating Curves: An Introduction
    Excellent page about ROC curves and their applications. Includes excellent applets for experimenting with the curves, allowing a better understanding of their workings and meaning.

  • BBC NEWS MAGAZINE, A scanner to detect terrorists; a very interesting article about how statistics are often wrongly interpreted when published by the media. “To find one terrorist in 3000 people, using a screen that works 90% of the time, you’ll end up detaining 300 people, one of whom might be your target”. Written by Michael Blastland.

 

References

Discriminatory Power Analysis by Receiver-Operating Characteristic Curves (Part 1 of 2: Theory)


Part 1 of 2: Theory. Go to Part 2 of 2: C# Source code.

When dealing with systems, methods or tests involving the detection, diagnosis or prediction of results, it is very important to validate the obtained results in order to quantify and qualify their discriminative power as good or not for a given analysis. However, we must consider that simply counting hits and misses on a test group does not necessarily reflect how good a system is, because this count will fundamentally depend on the quality and distribution of the test group data.

 

To exemplify what has just been said, let's consider the example of a diabetes detection system whose output is 1 or 0, indicating whether a patient has the condition or not. Now, let's suppose we applied this system to a test group of 100 patients for which we (but not the system) already know who has diabetes, and the system correctly identified 90% of the conditions. Seems a rather good performance, doesn't it?

Not exactly. What has not been said is how the condition is distributed across the test group. Let's now reveal that 90% of the patients actually had diabetes. The system could thus have answered "1" to any and every input and still obtained a 90% hit rate, since only the remaining 10% of patients were healthy. By now, we cannot be sure whether the system was really good or just acted by chance, saying everyone was ill without any prior knowledge or calculation.

 

Contingency Table (confusion matrix) for a binary classifier

For situations like this, other measures have been created to take such imbalance in the test group into account. Before going further into those measures, let's discuss the so-called contingency table (or confusion matrix), which will serve as the basis for the measures shown next. Its mechanics are rather simple: positive values that the system predicted as positive are counted as true positives (hits), positive values that the system predicted as negative as false negatives (misses), negative values the system said were negative as true negatives (hits), and negative values the system said were positive as false positives (misses).

 

Now I’ll present some measures derived from this rather simple table.

Accuracy

    The proportion of correct predictions, without considering what is positive and what is negative. This measure is highly dependent on the data set distribution and can easily lead to wrong conclusions about the system's performance.

    ACC = TOTAL HITS / NUMBER OF ENTRIES IN THE SET

            = (TP + TN) / (P + N)

Sensitivity

    The proportion of true positives: the ability of the system to correctly predict the condition in cases where it is really present.

    SENS = POSITIVE HITS / TOTAL POSITIVES

             = TP / (TP + FN)

Specificity

    The proportion of true negatives: the ability of the system to correctly predict the absence of the condition in cases where it is not present.

    SPEC = NEGATIVE HITS / TOTAL NEGATIVES

             = TN / (TN + FP)

Efficiency

    The arithmetic mean of sensitivity and specificity. In practical situations, sensitivity and specificity vary in opposite directions: generally, when a method is too responsive to positives, it tends to produce many false positives, and vice versa. Therefore, a perfect decision method (with 100% sensitivity and 100% specificity) is rarely achieved, and a balance between the two must be found.

    EFF = (SENS + SPEC) / 2

Positive Predictive Value

    The proportion of true positives among all positive predictions. This measure is highly susceptible to the prevalence in the data set, but gives an estimate of how good the system is when making a positive affirmation. It can also easily lead to wrong conclusions about system performance.

    PPV = POSITIVE HITS / TOTAL POSITIVE PREDICTIONS

           = TP / (TP + FP)

Negative Predictive Value

    The proportion of true negatives among all negative predictions. This measure is highly susceptible to the prevalence in the data set, but gives an estimate of how good the system is when making a negative affirmation. It can also easily lead to wrong conclusions about system performance.

    NPV = NEGATIVE HITS / TOTAL NEGATIVE PREDICTIONS

           = TN / (TN + FN)

Matthews Correlation Coefficient – or Phi (φ) Coefficient

    The Matthews correlation coefficient is a measure of the quality of two binary classifications that can be used even if the two classes have very different sizes. It returns a value between −1 and +1, in which +1 represents a perfect prediction, 0 a random prediction, and −1 an inverse prediction. This statistic is equivalent to the phi coefficient and attempts, like the efficiency measure, to summarize the quality of the contingency table in a single comparable value.

    MCC = φ = (TP*TN – FP*FN) / sqrt((TP + FP)*(TP + FN)*(TN + FP)*(TN + FN))

    Note that, if any of the sums in the denominator equals zero, the denominator can be set to 1, resulting in an MCC of 0, which is also the correct limiting value for this situation.
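
All of the measures above can be computed directly from the four cells of the table. An illustrative plain C# helper (division by zero is left unguarded except for the MCC, as in the note above):

    using System;

    static class ContingencyMeasures
    {
        public static void Report(int tp, int fn, int fp, int tn)
        {
            double p = tp + fn;                     // total positives
            double n = tn + fp;                     // total negatives

            double accuracy    = (tp + tn) / (p + n);
            double sensitivity = tp / (double)(tp + fn);
            double specificity = tn / (double)(tn + fp);
            double efficiency  = (sensitivity + specificity) / 2;
            double ppv         = tp / (double)(tp + fp);
            double npv         = tn / (double)(tn + fn);

            // Matthews correlation coefficient; a zero denominator is treated as 1.
            double denominator = Math.Sqrt((double)(tp + fp) * (tp + fn) * (tn + fp) * (tn + fn));
            double mcc = ((double)tp * tn - (double)fp * fn) / (denominator == 0 ? 1 : denominator);

            Console.WriteLine($"ACC={accuracy:F3} SENS={sensitivity:F3} SPEC={specificity:F3}");
            Console.WriteLine($"EFF={efficiency:F3} PPV={ppv:F3} NPV={npv:F3} MCC={mcc:F3}");
        }
    }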

 

The Receiver Operating Characteristic (ROC) Curve

The ROC curve was developed by electrical and radar engineers during World War II to detect enemy objects on the battlefield. ROC analysis has been used in medicine, radiology, psychology and other areas for many decades and, more recently, has been introduced to areas such as machine learning and data mining.

 

ROC curves for different classifiers

Because the outputs of classification systems are generally continuous, it is necessary to define a cutoff value, or discriminatory threshold, to classify and count the number of positive and negative predictions (such as positive or negative diagnoses in the case of pathology occurrence). As this threshold can be arbitrarily determined, the best practice for comparing the performance of different systems is to study the effect of selecting diverse cutoff values over the output data.

Considering many cutoff values, it is possible to calculate a set of (sensitivity, 1 − specificity) pairs which can then be plotted as a curve. This curve is the ROC curve of the system, with the sensitivity values as its ordinates (y-axis) and the complement of specificity (1 − specificity) as its abscissas (x-axis).
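
A sketch of this construction in C#: sweep the cutoff over the range of the output scores and record one (1 − specificity, sensitivity) pair per cutoff. The assumption that the scores lie in [0, 1] is made here only for simplicity.

    using System.Collections.Generic;

    static class RocPoints
    {
        // Returns (1 - specificity, sensitivity) pairs for a range of cutoff values.
        // 'actual' holds the true 0/1 labels and 'score' the continuous model outputs.
        public static List<(double X, double Y)> Compute(int[] actual, double[] score, int cutoffs = 100)
        {
            var points = new List<(double X, double Y)>();

            for (int c = 0; c <= cutoffs; c++)
            {
                double threshold = (double)c / cutoffs;   // assumes scores in [0, 1]
                int tp = 0, fp = 0, tn = 0, fn = 0;

                for (int i = 0; i < actual.Length; i++)
                {
                    bool predictedPositive = score[i] >= threshold;
                    if (actual[i] == 1) { if (predictedPositive) tp++; else fn++; }
                    else                { if (predictedPositive) fp++; else tn++; }
                }

                double sensitivity = tp / (double)(tp + fn);
                double specificity = tn / (double)(tn + fp);
                points.Add((1 - specificity, sensitivity));
            }

            return points;
        }
    }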

 

 

A standard measure for system comparison is the area under the ROC curve (AUC), which can be obtained by numerical integration, for example with the trapezoidal rule. Theoretically, the higher the AUC, the better the system.

Calculating the area of a ROC curve in Microsoft Excel®

    Put the sensitivity and (1 − specificity) pairs in columns A and B, respectively. If you have 10 points (from A1 to B10), you can use the following formula to calculate the ROC area:

    =SUMPRODUCT((A2:A10+A1:A9)*(B1:B9-B2:B10))*0.5

Determining the standard error when calculating the area

The standard error of a ROC curve is based on the standard deviation that arises from applying our system to a population sample rather than to the entire population. This measure comes from the fact that, depending on which samples of a population we take to perform the ROC analysis, the area under the curve will vary according to the particular sample distribution.

The error calculation is, to a certain extent, simple, as it comes from only three known values: the area A under the ROC curve, the number Na of samples which have the investigated condition (e.g. have diabetes) and the number Nn of samples which do not have the condition (e.g. do not have diabetes).

ERROR = sqrt((A*(1 – A) + (Na – 1)*(Q1 – A²) + (Nn – 1)*(Q2 – A²)) / (Na * Nn))

where:

    Q1 = A/(2 – A)
    Q2 = 2*A²/(1 + A)
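
The same computations expressed in C#: trapezoidal integration over the ROC points followed by the standard-error formula above (illustrative sketch, points assumed ordered by increasing 1 − specificity):

    using System;

    static class RocArea
    {
        // Trapezoidal rule over ROC points: x[i] = 1 - specificity, y[i] = sensitivity,
        // with the points ordered by increasing x.
        public static double Area(double[] x, double[] y)
        {
            double area = 0;
            for (int i = 1; i < x.Length; i++)
                area += (y[i] + y[i - 1]) * (x[i] - x[i - 1]) * 0.5;
            return area;
        }

        // Standard error of the area, given the number of positive (na)
        // and negative (nn) samples in the test set.
        public static double StandardError(double area, int na, int nn)
        {
            double a2 = area * area;
            double q1 = area / (2 - area);
            double q2 = 2 * a2 / (1 + area);
            return Math.Sqrt((area * (1 - area)
                             + (na - 1) * (q1 - a2)
                             + (nn - 1) * (q2 - a2)) / ((double)na * nn));
        }
    }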

 

Source code

For a C# code implementing ROC curve creation and analysis, please follow to the next part of this article, Discriminatory Power Analysis using Receiver-Operating Characteristic Curves (Part 2 of 2: C# Source Code).

 

Further Reading

  • Receiver Operating Curves: An Introduction
    Excellent page about ROC curves and their applications. Includes excellent applets for experimenting with the curves, allowing a better understanding of their workings and meaning.

  • BBC NEWS MAGAZINE, A scanner to detect terrorists; a very interesting article about how statistics are often wrongly interpreted when published by the media. “To find one terrorist in 3000 people, using a screen that works 90% of the time, you’ll end up detaining 300 people, one of whom might be your target”. Written by Michael Blastland.

 

References

Principal Component Analysis in C#

Principal Component Analysis (PCA) is an exploratory tool designed by Karl Pearson in 1901 to identify unknown trends in a multidimensional data set. It involves a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components.

Foreword

Before you read this article, please keep in mind that it was written before the Accord.NET Framework was created and became popular. If you would like to perform Principal Component Analysis in your projects, download the Accord.NET Framework from NuGet and either follow the starting guide or download the PCA sample application from the sample gallery in order to get up and running quickly with the framework.

Introduction

PCA essentially rotates the set of points around their mean in order to align them with the first few principal components. This moves as much of the variance as possible (using a linear transformation) into the first few dimensions. The values in the remaining dimensions therefore tend to be small and may be dropped with minimal loss of information. Please note that the signs of the columns of the rotation matrix are arbitrary, and so may differ between different programs for PCA.

For a more complete explanation of PCA, please visit Lindsay Smith's excellent A Tutorial On Principal Component Analysis (2002).

Accord.NET Framework

This new library, which I called Accord.NET, was initially intended to extend the AForge.NET Framework through the addition of new features such as Principal Component Analysis, numerical decompositions, and a few other mathematical transformations and tools. However, the library I created grew larger than the original framework I was trying to extend. In a few months, both libraries will merge under Accord.NET. (Update April 2015)

Design decisions

As people who want to use PCA in their projects usually already have their own Matrix class definitions, I decided to avoid using custom Matrix and Vector classes in order to make the code more flexible. I also tried to avoid dependencies on other methods whenever possible, to keep the code independent. I think this also made the code simpler to understand.

The code is divided into two projects:

  • Accord.Math, which provides mathematical tools, decompositions and transformations, and
  • Accord.Statistics, which provides the statistical analysis, statistical tools and visualizations.

Both of them depend on the AForge.NET core, and their internal structure and organization try to mimic AForge.NET's wherever possible.

The given source code doesn’t include the full source of the Accord Framework, which remains as a test bed for new features I’d like to see in AForge.NET. Rather, it includes only limited portions of the code to support PCA. It also contains code for Kernel Principal Component Analysis, as both share the same framework. Please be sure to look for the correct project when testing.

Code overview

Below is the main code behind PCA.
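
The framework's implementation relies on a singular value decomposition; as a compact, dependency-free stand-in, the sketch below centers the data, builds the covariance matrix and extracts the first principal component by power iteration. This is an illustration of the idea, not the code used by the library.

    using System;

    static class PcaSketch
    {
        // Returns the first principal component (unit-length direction of maximum
        // variance) of the data, given one observation per row.
        public static double[] FirstComponent(double[][] data, int iterations = 200)
        {
            int n = data.Length, d = data[0].Length;

            // 1. Center the data on its mean.
            var mean = new double[d];
            foreach (var row in data)
                for (int j = 0; j < d; j++) mean[j] += row[j] / n;

            // 2. Build the covariance matrix.
            var cov = new double[d, d];
            foreach (var row in data)
                for (int j = 0; j < d; j++)
                    for (int k = 0; k < d; k++)
                        cov[j, k] += (row[j] - mean[j]) * (row[k] - mean[k]) / (n - 1);

            // 3. Power iteration: repeatedly multiplying a vector by the covariance
            //    matrix makes it converge to the dominant eigenvector.
            var v = new double[d];
            v[0] = 1.0;
            for (int it = 0; it < iterations; it++)
            {
                var w = new double[d];
                for (int j = 0; j < d; j++)
                    for (int k = 0; k < d; k++)
                        w[j] += cov[j, k] * v[k];

                double norm = 0;
                for (int j = 0; j < d; j++) norm += w[j] * w[j];
                norm = Math.Sqrt(norm);
                for (int j = 0; j < d; j++) v[j] = w[j] / norm;
            }
            return v;
        }
    }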

Using the code

To perform a simple analysis, you can simply instantiate a new PrincipalComponentAnalysis object, passing your data, and call its Compute method to compute the model. Then you can simply call the Transform method to project the data into the principal component space.

Sample code demonstrating its usage is presented below.
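
A minimal usage sketch (class and method names follow the description above; exact signatures may differ between versions, and the sample values are the ones from Lindsay Smith's tutorial):

    // Sketch only; consult the framework documentation for exact signatures.
    double[,] data =
    {
        { 2.5, 2.4 }, { 0.5, 0.7 }, { 2.2, 2.9 }, { 1.9, 2.2 }, { 3.1, 3.0 },
        { 2.3, 2.7 }, { 2.0, 1.6 }, { 1.0, 1.1 }, { 1.5, 1.6 }, { 1.1, 0.9 },
    };

    // Create and compute the analysis.
    var pca = new PrincipalComponentAnalysis(data);
    pca.Compute();

    // Project the data into principal component space.
    double[,] projection = pca.Transform(data);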

Example application

To demonstrate the use of PCA, I created a simple Windows Forms Application which performs simple statistical analysis and PCA transformations.


The application can open Excel workbooks. Here we are loading some random Gaussian data, some random Poisson data, and a linear multiple of the first variable (thus also Gaussian).
 
Simple descriptive analysis of the source data, with a histogram plot of the first variable. We can see it fits a Gaussian distribution with 1.0 mean and 1.0 standard deviation.
 
Here we perform PCA by using the Correlation method. Actually, the transformation uses SVD on the standardized data rather than on the correlation matrix, the effect being the same. As the third variable is a linear multiplication of the first, the analysis detected it as irrelevant, thus having a zero importance level.
 
Now we can make a projection of the source data using only the first two components.
 

 

Note: The principal components are not unique because the Singular Value Decomposition is not unique. Also the signs of the columns of the rotation matrix are arbitrary, and so may differ between different programs for PCA.

Together with the demo application comes an Excel spreadsheet containing several data examples. The first example is the same used by Lindsay on his Tutorial on Principal Component Analysis. The others include Gaussian data, uncorrelated data and linear combinations of Gaussian data to further exemplify the analysis.

I hope this code and example can be useful! If you have any comments about the code or the article, please let me know.

See also

A Tutorial On Principal Component Analysis with the Accord.NET Framework

This is a tutorial for those who are interested in learning how PCA works and how each step of Lindsay’s tutorial can be computed in the Accord.NET Framework, in C#.

Kernel Principal Component Analysis in C#

This is the non-linear extension of Principal Component Analysis. While linear PCA is restricted to rotating or scaling the data, kernel PCA can do arbitrary transformations (such as folding and twisting the data and the space that contains the data).