# Recognition of Handwritten Digits using Non-linear Kernel Discriminant Analysis

I’ve just submitted a new article to CodeProject, entitled "Handwriting Recognition using Kernel Discriminant Analysis". The article is a demonstration of handwritten digit recognition using Non-linear Discriminant Analysis with Kernels and using the Optical Recognition of Handwritten Digits data set from the University of California’s Machine Learning Repository.

Recognition of Handwritten digits using Kernel Discriminant Analysis

The Code Project is a free repository of source code, tutorials and development resources. It is also home of a large and ever-growing community of software developers, architects and other technology professionals. It has fairly active forums, and is a reasonably good resource for resolving difficult software development issues.

# Hidden Markov Model -Based Sequence Classifiers in C#

Distinct Hidden Markov Models can be trained individually on data obtained from each class of a classification problem. If we create a set of those models and specialize each model to recognize each of the separate classes, we can then use the HMMs ability to calculate the likelihood that a given sequence belongs to the model to determine the most likely class of an unknown sequence.

 Download sample and source code Download the Accord.NET Framework This post is a continuation for the previousarticle, Hidden Markov Models in C#.

The code presented here is also part of the Accord.NET Framework. The Accord.NET Framework is a framework for developing machine learning, computer vision, computer audition, statistics and math applications. It is based on the already excellent AForge.NET Framework. Please see the starting guide for mode details. The latest version of the framework includes the latest version of this code plus many other statistics and machine learning tools.

## Introduction

### Hidden Markov Models

An introduction about Hidden Markov Models (HMM) was presented in a previous article, entitled Hidden Markov Models in C#. HMMs are models are stochastic methods to model temporal and sequence data.

They allow, among other things, (1) to infer the most likely sequence of states that produced a given output sequence, to (2) infer which will be the most likely next state (and thus predicting the next output) and (3) calculate the probability that a given sequence of outputs originated from the system (allowing the use of hidden Markov models for sequence classification). In the context of this article, we will be more interested in results from ability (3).

Hidden Markov Models can be seem as finite state machines where for each sequence unit observation there is a state transition and, for each state, there is a output symbol emission. The picture on the left summarizes the overall definition of a HMM given in the previous article.

### Hidden Markov Model -Based Sequence Classifier

Hidden Markov Models can be trained individually on data obtained from each class of a classification problem. If we create a set of models and specialize each model to recognize each of the separated classes, then we will be able to explore the HMMs ability to calculate the likelihood that a given sequence belongs to itself to perform classification of discrete sequences.

After all models have been trained, the probability of the unknown-class sequence can be computed for each model. As each model specialized in a given class, the one which outputs the highest probability can be used to determine the most likely class for the new sequence, as shown in the picture below.

Schematic representation of the sequence classification procedure.

## Source Code

The implementation is very straightforward, as shown in the picture below. The Tag property of the HiddenMarkovModel class can be used to store additional information about which class the model will be representing.

Class diagram for the HMM-based sequence classifier.

### Computing the most likely class for a given sequence

Computing the most likely class for a given sequence is as simple as iterating through all models in the set and selecting the one which has produced the highest likelihood value.

### Training each model to recognize each of the output classes

To train each model to recognize each of the output classes we have to split the training data into subsets whose outputs corresponds to each of the classes from the classification problem.

## Using the Code

Code usage is very simple. Once one has the input data available in the form of a sequence array and output data in form of a integer array, create a new HiddenMarkovClassifier object with the appropriate parameters.

Then, just call Learn() passing the input and output data and the convergence limit for the learning algorithm. After that, call Compute() passing a new integer sequence to be classified by the system.

## Sample Application

The accompanying sample application can read Excel spreadsheets containing integer sequences and labels for those sequences. An example spreadsheet contained inside the Resources folder in the compressed archive file demonstrates the expected format for this data.

Left: Input sequences data. Right: Hidden Markov Models contained in the Sequence Classifier with initial configurations.

Left: Trained models. Right: Performance test for the Hidden Markov Model Sequence Classifier.

# Hidden Markov Models in C#

Hidden Markov Models (HMM) are stochastic methods to model temporal and sequence data. They are especially known for their application in temporal pattern recognition such as speech, handwriting, gesture recognition, part-of-speech tagging, musical score following, partial discharges and bioinformatics.

The code presented here is also part of the Accord.NET Framework. The Accord.NET Framework is a framework for developing machine learning, computer vision, computer audition, statistics and math applications. It is based on the already excellent AForge.NET Framework. Please see the starting guide for mode details. The latest version of the framework includes the latest version of this code plus many other statistics and machine learning tools.

## Introduction

Hidden Markov Models were first described in a series of statistical papers by Leonard E. Baum and other authors in the second half of the 1960s. One of the first applications of HMMs was speech recognition, starting in the mid-1970s. Indeed, one of the most comprehensive explanations on the topic was published in “A Tutorial On Hidden Markov Models And Selected Applications in Speech Recognition”, by Lawrence R. Rabiner in 1989. In the second half of the 1980s, HMMs began to be applied to the analysis of biological sequences, in particular DNA. Since then, they have become ubiquitous in the field of bioinformatics.

Dynamical systems of discrete nature assumed to be governed by a Markov chain emits a sequence of observable outputs. Under the Markov assumption, it is also assumed that the latest output depends only on the current state of the system. Such states are often not known from the observer when only the output values are observable.

Hidden Markov Models attempt to model such systems and allow, among other things, (1) to infer the most likely sequence of states that produced a given output sequence, to (2) infer which will be the most likely next state (and thus predicting the next output) and (3) calculate the probability that a given sequence of outputs originated from the system (allowing the use of hidden Markov models for sequence classification).

The “hidden” in Hidden Markov Models comes from the fact that the observer does not know in which state the system may be in, but has only a probabilistic insight on where it should be.

## Definition

Hidden Markov Models can be seem as finite state machines where for each sequence unit observation there is a state transition and, for each state, there is a output symbol emission.

### Notation

Traditionally, HMMs have been defined by the following quintuple:

$\lambda = (N, M, A, B, \pi)$

where

• N is the number of states for the model
• M is the number of distinct observations symbols per state, i.e. the discrete alphabet size.
• A is the NxN state transition probability distribution given in the form of a matrix A = {aij}
• B is the NxM observation symbol probability distribution given in the form of a matrix B = {bj(k)}
• π is the initial state distribution vector π = {πi}

Note that, if we opt out the structure parameters M and N we have the more often used compact notation

$\lambda = (A, B, \pi)$

### Canonical problems

There are three canonical problems associated with hidden Markov models, which I’ll quote from Wikipedia:

1. Given the parameters of the model, compute the probability of a particular output sequence. This requires summation over all possible state sequences, but can be done efficiently using the Forward algorithm, which is a form of dynamic programming.
2. Given the parameters of the model and a particular output sequence, find the state sequence that is most likely to have generated that output sequence. This requires finding a maximum over all possible state sequences, but can similarly be solved efficiently by the Viterbi algorithm.
3. Given an output sequence or a set of such sequences, find the most likely set of state transition and output probabilities. In other words, derive the maximum likelihood estimate of the parameters of the HMM given a dataset of output sequences. No tractable algorithm is known for solving this problem exactly, but a local maximum likelihood can be derived efficiently using the Baum-Welch algorithm or the Baldi-Chauvin algorithm. The Baum-Welch algorithm is an example of a forward-backward algorithm, and is a special case of the Expectation-maximization algorithm.

The solution for those problems are exactly what makes Hidden Markov Models useful. The ability to learn from the data (using the solution of problem 3) and then become able to make predictions (solution to problem 2) and able to classify sequences (solution of problem 2) is nothing but applied machine learning. From this perspective, HMMs can just be seem as supervisioned sequence classifiers and sequence predictors with some other useful interesting properties.

### Choosing the structure

Choosing the structure for a hidden Markov model is not always obvious. The number of states depend on the application and to what interpretation one is willing to give to the hidden states. Some domain knowledge is required to build a suitable model and also to choose the initial parameters that an HMM can take. There is also some trial and error involved, and there are sometimes complex tradeoffs that have to be made between model complexity and difficulty of learning, just as is the case with most machine learning techniques.

Additional information can be found on http://www.cse.unsw.edu.au/~waleed/phd/tr9806/node12.html.

## Algorithms

The solution to the three canonical problems are the algorithms that makes HMMs useful. Each of the three problems are described in the three subsections below.

### Evaluation

The first canonical problem is the evaluation of the probability of a particular output sequence. It can be efficiently computed using either the Viterbi-forward or the Forward algorithms, both of which are forms of dynamic programming.

The Viterbi algorithm originally computes the most likely sequence of states which has originated a sequence of observations. In doing so, it is also able to return the probability of traversing this particular sequence of states. So to obtain Viterbi probabilities, please refer to the Decoding problem referred below.

The Forward algorithm, unlike the Viterbi algorithm, does not find a particular sequence of states; instead it computes the probability that any sequence of states has produced the sequence of observations. In both algorithms, a matrix is used to store computations about the possible state sequence paths that the model can assume. The forward algorithm also plays a key role in the Learning problem, and is thus implemented as a separate method.

### Decoding

The second canonical problem is the discovery of the most likely sequence of states that generated a given output sequence. This can be computed efficiently using the Viterbi algorithm. A trackback is used to detect the maximum probability path travelled by the algorithm. The probability of travelling such sequence is also computed in the process.

### Learning

The third and last problem is the problem of learning the most likely parameters that best models a system given a set of sequences originated from this system. Most implementations I’ve seem did not consider the problem of learning from a set of sequences, but only from a single sequence at a time. The algorithm below, however, is fully suitable to learn from a set of sequences and also uses scaling, which is another thing I have not seem in other implementations.

The source code follows the original algorithm by Rabiner (1989). There are, however, some known issues with the algorithms detailed in Rabiner’s paper. More information about those issues is available in a next section of this article entitled “Remarks”.

## Using the code

Lets suppose we have gathered some sequences from a system we wish to model. The sequences are expressed as a integer array such as:

For us, it can be obvious to see that the system is outputting sequences that always start with a zero and have one or more ones at the end. But lets try to fit a Hidden Markov Model to predict those sequences.

Once the model is trained, lets test to see if it recognizes some sequences:

Of course the model performs well as this a rather simple example. A more useful test case would consist of allowing for some errors in the input sequences in the hope that the model will become more tolerant to measurement errors.

We can see that, despite having a very low probability, the likelihood values for the sequences containing a simulated measurement error are greater than the likelihoods for the sequences which do not follow the sequence structure at all.

In a subsequent article, we will see that those low values for the likelihoods will not be a problem because HMMs are often used in sets to form sequence classifiers. When used in such configurations, what really matters is which HMM returns the highest probability among others in the set.

## Remarks

A practical issue in the use of Hidden Markov Models to model long sequences is the numerical scaling of conditional probabilities. The probability of observing a long sequence given most models is extremely small, and the use of these extremely small numbers in computations often leads to numerical instability, making application of HMMs to genome length sequences quite challenging.

There are two common approaches to dealing with small conditional probabilities. One approach is to rescale the conditional probabilities using carefully designed scaling factors, and the other approach is to work with the logarithms of the conditional probabilities. For more information on using logarithms please see the work entitled “Numerically Stable Hidden Markov Model Implementation”, by Tobias P. Mann.

### Known issues

The code on this article is based on the Tutorial by Rabiner. There are, however, some problems with the scaling and other algorithms. An errata depicting all issues is available in the website “An Erratum for ‘A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition’” and is maintained by Ali Rahimi. I have not yet verified if the implementation presented here also suffers from the same mistakes explained there. This code has worked well under many situations, but I cannot guarantee its perfectness. Please use at your own risk.

## Acknowledgements

Thanks to Guilherme C. Pedroso, for the help with the Baum-Welch generalization for multiple input sequences. He has also co-written a very interesting article using hidden Markov models for gesture recognition, entitled “Automatic Recognition of Finger Spelling for LIBRAS based on a Two-Layer Architecture” published in the 25th Symposium On Applied Computing (ACM SAC 2010).

# List of Windows Messages

Here is a list of all Windows Messages available in the form of a C# enumeration, just in case someone finds it useful.

# C# is now a better language than Java…

After some chatting and hearing wonders about this language I decided to try something new. I decided to try Scala.

Here is a list of cool references, not only for Scala, but about programming in general, kindly given by an experienced programmer I met some weeks ago.

## Books

### Programming in Scala: A comprehensive step-by-step guide

~ Martin Odersky, Lex Spoon, and Bill Venners

This book is the authoritative tutorial on the Scala programming language, co-written by the language’s designer, Martin Odersky.

### Clean Code: A Handbook of Agile Software Craftsmanship

Even bad code can function. But if code isn’t clean, it can bring a development organization to its knees. Every year, countless hours and significant resources are lost because of poorly written code. But it doesn’t have to be that way.

### Refactoring: Improving the Design of Existing Code

~ Martin Fowler, Kent Beck, John Brant, William Opdyke, Don Roberts

Refactoring is a controlled technique for improving the design of an existing code base. Its essence is applying a series of small behavior-preserving transformations, each of which “too small to be worth doing”. However the cumulative effect of each of these transformations is quite significant. By doing them in small steps you reduce the risk of introducing errors. You also avoid having the system broken while you are carrying out the restructuring – which allows you to gradually refactor a system over an extended period of time.

### Implementation Patterns

Great code doesn’t just function: it clearly and consistently communicates your intentions, allowing other programmers to understand your code, rely on it, and modify it with confidence. But great code doesn’t just happen. It is the outcome of hundreds of small but critical decisions programmers make every single day. Now, legendary software innovator Kent Beck–known worldwide for creating Extreme Programming and pioneering software patterns and test-driven development–focuses on these critical decisions, unearthing powerful “implementation patterns” for writing programs that are simpler, clearer, better organized, and more cost effective.

### Code Complete: A Practical Handbook of Software Construction

~ Steve McConnell

For more than a decade, Steve McConnell, one of the premier authors and voices in the software community, has helped change the way developers write code–and produce better software. Now his classic book, CODE COMPLETE, has been fully updated and revised with best practices in the art and science of constructing software. Whether you’re a new developer seeking a sound introduction to the practice of software development or a veteran exploring strategic new approaches to problem solving, you’ll find a wealth of practical suggestions and methods for strengthening your skills. Topics include design, applying good techniques to construction, eliminating errors, planning, managing construction activities, and relating personal character to superior software. This new edition features fully updated information on programming techniques, including the emergence of Web-style programming, and integrated coverage of object-oriented design. You’ll also find new code examples–both good and bad–in C++, Microsoft(r) Visual Basic(r), C#, and Java, though the focus is squarely on techniques and practices.

Amazon Editorial Review

### The Mythical Man-Month: Essays on Software Engineering, Anniversary Edition (2nd Edition)

~ Frederick P. Brooks

The classic book on the human elements of software engineering. Software tools and development environments may have changed in the 21 years since the first edition of this book, but the peculiarly nonlinear economies of scale in collaborative work and the nature of individuals and groups has not changed an epsilon. If you write code or depend upon those who do, get this book as soon as possible — from Amazon.com Books, your library, or anyone else. You (and/or your colleagues) will be forever grateful. Very Highest Recommendation.

## Blogs & Online Resources

### Joel on Software

A weblog by Joel Spolsky, a programmer working in New York City, about software and software companies.

### The Artima Developer Community

Artima.com is a collection of resources about Java, Jini, the JVM, and object oriented design.

### Mark’s Blog

Mark Russinovich’s technical blog covering topics such as Windows troubleshooting, technologies and security. Among other feats, Russinovich was the man behind the discovery of the Sony rootkit in Sony DRM products in 2005.

### Object Mentor’s Blog

A team of consultants who mentor their clients in C++, Java, OOP, Patterns, UML, Agile Methodologies, and Extreme Programming.

## People to follow

### @unclebobmartin

Known colloquially as “Uncle Bob”, Robert Cecil Martin has been a software professional since 1970 and an international software consultant since 1990. In 2001, he led the group that created Agile software development from Extreme programming techniques. He is also a leading member of the Software Craftsmanship movement.

He is founder and president of Object Mentor Inc., a team of consultants who mentor their clients in C++, Java, OOP, Patterns, UML, Agile Methodologies, and Extreme Programming.

### @KentBeck

Kent Beck is an American software engineer and the creator of Extreme Programming and Test Driven Development. Beck was one of the 17 original signatories of the Agile Manifesto in 2001.

# Logistic Regression in C#

Logistic regression (sometimes called the logistic model or logit model) is used for prediction of the probability of occurrence of an event by fitting data to a logistic curve. Logistic regression is used extensively in the medical and social sciences as well as marketing applications such as prediction of a customer’s propensity to purchase a product or cease a subscription.

The code presented here is part of the Accord.NET Framework. The Accord.NET Framework is a framework for developing machine learning, computer vision, computer audition, statistics and math applications in .NET. To use the framework in your projects, install it by typing Install-Package Accord.MachineLearning in your IDE’s NuGet package manager.

## Overview

The Logistic regression is a generalized linear model used for binomial regression. Like many forms of regression analysis, it makes use of several predictor variables that may be either numerical or categorical.

### Logistic Sigmoid Function

The logistic sigmoid function is given by

g(z) = 1 / (1 + Exp(-z))

where in the context of logistical regression z is called the logit.

### Logistic Regression Model

The logistic regression model is a generalized linear model. This means that it is just a linear regression model taken as input for a non-linear link function. The linear model has the form

z = c1x1 + c2x2 + … cnxn + i = ct x + i

where c is the coefficient vector, i is the intercept value and x is the observation vector for n variables and in the context of logistic regression is called the logit. The logit is then applied as input for the nonlinear logistic sigmoid function g(z), giving as result a probability.

So in a binomial problem where we are trying to determine whether a observation belongs to class C1 or class C2, the logistic model tell us that:

p(C1|x) = g(ct x + i)

p(C2|x) = 1 – p(C1|x)

where p(C1|x) denotes the probability of C1 being true when x is true.
In other words, denotes the probability of x belonging to class C1.

### Coefficients

The coefficients for the logistic regression are the values which multiply each observation variable from a sample input vector. The intercept is analogous to the independent term in a polynomial linear regression model or the threshold or bias value for a neuron in a neural network model.

### Odds Ratio

After computation of the logistic regression model, every coefficient will have an associated measure called the odds ratio. The odds ratio is a measure of effect size, describing the strength of association or non-independence between two binary data values and can be approximated by raising the coefficient value to the euler’s number.

Odds ratioc = ec

### Standard Error

The standard error for the coefficients can be retrieved from the inverse Hessian matrix calculated during the model fitting phase and can be used to give confidence intervals for the odds ratio. The standard error for the i-th coefficient of the regression can be obtained as:

SEi = sqrt(diag(H-1)i)

### Confidence Intervals

The confidence interval around the logistic regression coefficient is plus or minus 1.96*SEi, where SEi is the standard error for coefficient i. We can then define:

95% C.I.i = <lower, upper> = <coefficienti – 1.96 * SEi,  coefficienti + 1.96 * SEi>

### Wald Statistic and Wald’s Test

The Wald statistic is the ratio of the logistic coefficient to its standard error. A Wald test is used to test the statistical significance of each coefficient in the model. A Wald test calculates a Z statistic, which is:

z = coefficient / standard error

This z value can be squared, yielding a Wald statistic with a chi-square distribution, or, alternatively, it can be taken as is and compared directly with a Normal distribution.

The Wald test outputs a p-value indicating the significance of individual independent variables. If the value is below a chosen significance threshold (typically 0.05), then the variable plays some role in determining the outcome variable that most likely is not result of chance alone. However, there are some problems with the use of the Wald statistic. The Likelihood-ratio test is a better alternative for the Wald test.

### Likelihood-Ratio and Chi-Square Test

The likelihood-ratio is the ratio of the likelihood of the full model over the likelihood of a simpler nested model. When compared to the null model the likelihood-ratio test gives an overall model performance measure. When compared to nested models, each with one variable omitted, it tests the statistical significance of each coefficient in the model. These significance tests are considered to be more reliable than the Wald significance test.

The likelihood-ratio is a chi-square statistic given by:

The model with more parameters will always fit at least as well (have a greater log-likelihood). Whether it fits significantly better and should thus be preferred can be determined by deriving the probability or p-value of the obtained difference D. In many cases, the probability distribution of the test statistic can be approximated by a chi-square distribution with (df1 − df2) degrees of freedom, where df1 and df2 are the degrees of freedom of models 1 and 2 respectively.

### Regression to a Logistic Model

If we consider the mapping

φ(<x1, x2, …, xn>) = <x1, x2, … xn, 1>

The logistic regression model can also be rewritten as

p(C1|φ) = g(wt φ) = f(φ, w)

so that w contains all coefficients and the intercept value in a single weight vector. Doing so will allow the logistic regression model to be expressed as a common optimization model in the form f(φ, w) allowing many standard non-linear optimization algorithms to be directly applied in the search for the best parameters w that best fits the probability of a class C1 given a input vector φ.

### Likelihood function

The likelihood function for the logistic model is given by:

but, as the log of products equals the sum of logs, taking the log of the likelihood function results in the Log-likelihood function in the form:

Furthermore, if we take the negative of the log-likelihood, we will have a error function called the cross-entropy error function:

which gives both better numerical accuracy and enable us to write the error gradient in the same form as the gradient of the sum-of-squares error function for linear regression models (Bishop, 2006).

Another important detail is that the likelihood surface is convex, so it has no local maxima. Any local maxima will also be a global maxima, so one does not need to worry about getting trapped in a valley when walking down the error function.

### Iterative Reweighted Least-Squares

The method of Iterative Reweighted Least-Squares is commonly used to find the maximum likelihood estimates of a generalized linear model. In most common applications, an iterative Newton-Raphson algorithm is used to calculate those maximum likelihood values for the parameters. This algorithm uses second order information, represented in the form of a Hessian matrix, to guide incremental coefficient changes. This is also the algorithm used in this implementation.

In the Accord.NET machine learning framework. the Iterative Reweighted Least-Squares is implemented in the IterativeReweightedLeastSquares class (source).

## Source Code

Below is the main code segment for the logistic regression, performing one pass of the Iterative Reweighted Least-Squares algorithm.

Furthermore, as the likelihood function is convex, the Logistic Regression Analysis can perform regression without having to experiment different starting points. Below is the code to compute a logistic regression analysis. The algorithm iterates until the largest absolute parameter change between two iterations becomes less than a given limit, or the maximum number of iterations is reached.

## Using the code

Code usage is very simple. To perform a Logistic Regression Analysis, simply create an object of the type LogisticRegressionAnalysis and pass the input and output data as constructor parameters. Optionally you can pass the variables names as well.

Then just call Compute() to compute the analysis.

After that, you can display information about the regression coefficients by binding the CoefficientCollection to a DataGridView.

## Sample application

The sample application creates Logistic Regression Analyses by reading Excel workbooks. It can also perform standard Linear Regressions, although there aren’t many options available to specify linear models.

Example data set adapted from http://www.statsdirect.co.uk/help/regression_and_correlation/logi.htm.

The image on the left shows information about the analysis performed, such as coefficient values, standard errors, p-value for the Wald statistic, Odds ratio and 95% confidence intervals. It also reports the log-likelihood, deviance and likelihood-ratio chi-square test for the final model. The newest available version can also compute likelihood-ratio tests for each coefficient, although not shown in the image.

## Final Considerations

The logistic regression model can be seen as the exact same as a one layer MLP with only one hidden neuron using a sigmoid activation function trained by a Newton’s method learning algorithm. Thus we can say the Logistic Regression is just a special case of Neural Networks. However, one possible (and maybe unique) advantage of logistic regression is that it allows simpler interpretation of its results in the form of odds ratios and statistical hypothesis testing. Statistical analysis using neural networks is also possible, but some may argue it is not as straightforward as using ordinary logistic regression.

Multilayer perceptron neural networks with sigmoid activation functions can be created using the Accord.Neuro package and namespace.

# Kernel Discriminant Analysis in C#

Kernel (Fisher) discriminant analysis (kernel FDA) is a non-linear generalization of linear discriminant analysis (LDA) using techniques of kernel methods. Using a kernel, the originally linear operations of LDA are done in a