# RANdom Sample Consensus (RANSAC) in C#

RANSAC is an iterative method to build robust estimates for parameters of a mathematical model from a set of observed data which is known to contain outliers. The RANSAC algorithm is often used in computer vision, e.g., to simultaneously solve the correspondence problem and estimate the fundamental matrix related to a pair of stereo cameras.

The code presented here is part of the Accord.NET Framework. The Accord.NET Framework is a framework for developing machine learning, computer vision, computer audition, statistics and math applications in .NET. To use the framework in your projects, install it by typing Install-Package Accord.MachineLearning in your IDE’s NuGet package manager.

## Introduction

RANSAC is an abbreviation for “RANdom SAmple Consensus“. It is an iterative method to estimate parameters of a mathematical model from a set of observed data which may contains outliers. It is a non-deterministic algorithm in the sense that it produces a reasonable result only with a certain probability, with this probability increasing as more iterations are allowed. The algorithm was first published by Fischler and Bolles in 1981.

The basic assumption is that the data consists of “inliers”, i.e., data whose distribution can be explained by some mathematical model, and “outliers” which are data that do not fit the model. Outliers could be considered points which come from noise, erroneous measurements or simply incorrect data. RANSAC also assumes that, given a set of inliers, there exists a procedure which can estimate the parameters of a model that optimally explains or fits this data.

### Example: Fitting a simple linear regression

We can use RANSAC to robustly fit a linear regression model using noisy data. Consider the example below, in which we have a cloud of points that seems to belong to a line. These are the inliers of the data. The other points, which can be seem as measurement errors or extreme noise values, are points expected to be considered outliers.

Linear structure contained in noisy data.

RANSAC is able to automatically distinguish the inliers from the outliers through the evaluation of the linear regression model. To do so, it randomly selects subsets from the data and attempts to fit linear regression models using them. The model which best explains most of the data will then be returned as the most probably correct model fit.

The image below shows the result of fitting a linear regression directly (as shown by the red line) and using RANSAC (as shown by the blue line). We can see that the red line represents poorly the data structure because it considers all points in order to fit the regression model. The blue line seems to be a much better representation of the linear relationship hidden inside the overall noisy data.

Hidden linear structure inside noisy data. The red line shows the fitting of a linear regression model directly considering all data points. The blue line shows the same result using RANSAC.

## Source code

The code below implements RANSAC using a generic approach. Models are considered to be of the reference type TModel and the type of data manipulated by this model is considered to be of the type TData. This approach allows for the creation of a general purpose RANSAC algorithm which can be used in very different contexts, be it the fitting of linear regression models or the estimation of homography matrices from pair of points in different images.

This code is available in the class RANSAC of the Accord.NET Framework (source).

Besides the generic parameters, the class utilizes three delegated functions during execution.

• The Fitting function, which should accept a subset of the data and use it to fit a model of the chosen type, which should be returned by the function;
• The Degenerate function, which should check if a subset of the training data is already known to result in a poor model, to avoid unnecessary computations; and
• The Distance function, which should accept a model and a subset of the training data to compute the distance between the model prediction and the expected value for a given point. It should return the indices of the points only whose predicted and expected values are within a given threshold of tolerance apart.

## Using the code

In the following example, we will fit a simple linear regression of the form x→y using RANSAC. The first step is to create a RANSAC algorithm passing the generic type parameters of the model to be build, i.e. SimpleLinearRegression and of the data to be fitted, i.e. a double array.

In this case we will be using a double array because the first position will hold the values for the input variable x. The second position will be holding the values for the output variables y. If you are already using .NET 4 it is possible to use the Tuple type instead.

After the creation of the RANSAC algorithm, we should set the delegate functions which will tell RANSAC how to fit a model, how to tell if a set of samples is degenerate and how to check for inliers in data.

Finally, all we have to do is call the Compute method passing the data. The best model found will be returned by the function, while the given set of inliers indices for this model will be returned as an out parameter.

## Sample application

The accompanying source application demonstrates the fitting of the simple linear regression model with and without using RANSAC. The application accepts Excel worksheets containing the independent values in the first column and the dependent variables in the second column.

# Accord.NET Framework – An extension to AForge.NET

Today I have just released the first version of Accord.NET. The Accord.NET Framework is a C# framework I have developed over the years while working on several areas of artificial intelligence, machine learning and statistics.

The Accord.NET Framework extends the excellent AForge.NET Framework with new tools and libraries. In order to use Accord.NET, you must have AForge.NET already installed. The first version number of Accord.NET will be used to indicate the compatibility status with AForge.NET versions, thus the first version will be starting at 2.0.0.

The framework is comprised by the set of libraries and sample applications, which demonstrate their features:

• Accord.Statistics – library with statistical analysis and other tools;
• Accord.Imaging – extension to the AForge.NET Imaging library with new filters and routines;
• Accord.Neuro – extension to the AForge.NET Neuro library with other learning algorithms;
• Accord.MachineLearning – extension to AForge’s machine learning library with Support Vector Machines;
• Accord.Audio – experimental library with filters and audio processing routines.

The Framework has just been released, so be ready to expect bugs and unpolished/unfinished areas. The code is released under a LGPL license. For additional help, issues, and discussions, please refer to the recently created forums.

# Harris Corners Detector in C#

Corner detection, or the more general terminology interest point detection, is an approach used within computer vision systems to extract certain kinds of features from an image. Corner detection is frequently used in motion detection, image matching, tracking, image mosaicing, panorama stitching, 3D modelling and object recognition.

This code has also been incorporated in Accord.NET Framework, which includes the latest version of this code plus many other statistics and machine learning tools. In order to install the Accord.NET Framework in your projects, use NuGet. Type in the package manager: Install-Package Accord.Imaging

## Introduction

One of the firsts operators for interest point detection was developed by Hans P. Moravec in 1977 for his research involving the automatic navigation of a robot through a clustered environment. It was also Moravec who defined the concept of “points of interest” in a image and concluded these interest points could be used to find matching regions in different images.

The Moravec operator is considered to be a corner detector because it defines interest points as points where there are large intensity variations in all directions. This often is the case at corners. It is interesting to note, however, that Moravec was not specifically interested in finding corners, just distinct regions in an image that could be used to register consecutive image frames.

### The Harris Operator

This operator was developed by Chris Harris and Mike Stephens in 1988 as a processing step to build interpretations of a robot’s environment based on image sequences. Like Moravec, they needed a method to match corresponding points in consecutive image frames, but were interested in tracking both corners and edges between frames.

Harris and Stephens improved upon Moravec’s corner detector by considering the differential of the corner score with respect to direction directly. The Harris corner detector computes the locally averaged moment matrix computed from the image gradients, and then combines the Eigenvalues of the moment matrix to compute a corner measure, from which maximum values indicate corners positions.

## Source Code

Below is the source code for the Harris Corners Detector algorithm. This is mostly the same code implemented in the Accord.NET Framework.

## Using the code

Code usage is very simple. A Corners Marker filter from the AForge Framework can be used to directly draw the detected interest points in the original image as it would be usual within the framework.

## Sample application

The accompanying sample application is pretty much self-explanatory. It performs the corners detection in the famous Lena Söderberg‘s picture, but can be adapted to work with other pictures as well. Just add another image to the project settings’ resources and change the line of code which loads the bitmap.

It is also very straightforward to adapt the application to load arbitrary images from the hard disk. I have opted to leave it this way for simplicity.

# Recognition of Handwritten Digits using Non-linear Kernel Discriminant Analysis

I’ve just submitted a new article to CodeProject, entitled "Handwriting Recognition using Kernel Discriminant Analysis". The article is a demonstration of handwritten digit recognition using Non-linear Discriminant Analysis with Kernels and using the Optical Recognition of Handwritten Digits data set from the University of California’s Machine Learning Repository.

Recognition of Handwritten digits using Kernel Discriminant Analysis

The Code Project is a free repository of source code, tutorials and development resources. It is also home of a large and ever-growing community of software developers, architects and other technology professionals. It has fairly active forums, and is a reasonably good resource for resolving difficult software development issues.

# Kernel Support Vector Machines for Classification and Regression in C#

Kernel methods in general have gained increased attention in recent years, partly due to the grown of popularity of the Support Vector Machines. Support Vector Machines are linear classifiers and regressors that, through the Kernel trick, operate in reproducing Kernel Hilbert spaces and are thus able to perform non-linear classification and regression in their input space.

## Foreword

If you would like to use SVMs in your .NET applications, download the Accord.NET Framework through NuGet. Afterwards, creating support vector machines for binary and multi-class problems with a variety of kernels becomes very easy. The Accord.NET Framework is a LGPL framework which can be used freely in commercial, closed-source, open-source or free applications. This article explains a bit how the SVM algorithms and the overall SVM module was designed before being added as part of the framework.

## Introduction

### Support Vector Machines

Support vector machines (SVMs) are a set of related supervised learning methods used for classification and regression. In simple words, given a set of training examples, each marked as belonging to one of two categories, a SVM training algorithm builds a model that predicts whether a new example falls into one category or the other. Intuitively, an SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on.

A linear support vector machine is composed of a set of given support vectors z and a set of weights w. The computation for the output of a given SVM with N support vectors z1, z2, … , zN and weights w1, w2, … , wN is then given by:

$F(x) = \sum_{i=1}^N w_i \, \left \langle z_i,x \right \rangle + b$

### Kernel Support Vector Machines

The original optimal hyperplane algorithm proposed by Vladimir Vapnik in 1963 was a linear classifier. However, in 1992, Bernhard Boser, Isabelle Guyon and Vapnik suggested a way to create non-linear classifiers by applying the kernel trick (originally proposed by Aizerman et al.) to maximum-margin hyperplanes. The resulting algorithm is formally similar, except that every dot product is replaced by a non-linear kernel function. This allows the algorithm to fit the maximum-margin hyperplane in a transformed feature space. The transformation may be non-linear and the transformed space high dimensional; thus though the classifier is a hyperplane in the high-dimensional feature space, it may be non-linear in the original input space.

Using kernels, the original formulation for the SVM given SVM with support vectors z1, z2, … , zN and weights w1, w2, … , wN is now given by:

$F(x) = \sum_{i=1}^N w_i \, k(z_i,x) + b$

It is also very straightforward to see that, using a linear kernel of the form K(z,x) = <z,x> = zTx, we recover the original formulation for the linear SVM.

#### The Kernel trick

The Kernel trick is a very interesting and powerful tool. It is powerful because it provides a bridge from linearity to non-linearity to any algorithm that solely depends on the dot product between two vectors. It comes from the fact that, if we first map our input data into a higher-dimensional space, a linear algorithm operating in this space will behave non-linearly in the original input space.

Now, the Kernel trick is really interesting because that mapping does not need to be ever computed. If our algorithm can be expressed only in terms of a inner product between two vectors, all we need is replace this inner product with the inner product from some other suitable space. That is where resides the “trick”: wherever a dot product is used, it is replaced with a Kernel function. The kernel function denotes an inner product in feature space and is usually denoted as:

K(x,y) = <φ(x),φ(y)>

Using the Kernel function, the algorithm can then be carried into a higher-dimension space without explicitly mapping the input points into this space. This is highly desirable, as sometimes our higher-dimensional feature space could even be infinite-dimensional and thus infeasible to compute.

#### Standard Kernels

Some common Kernel functions include the linear kernel, the polynomial kernel and the Gaussian kernel. Below is a simple list with their most interesting characteristics.

 Linear Kernel The Linear kernel is the simplest kernel function. It is given by the common inner product plus an optional constant c. Kernel algorithms using a linear kernel are often equivalent to their non-kernel counterparts, i.e. KPCA with linear kernel is equivalent to standard PCA. $k(x, y) = x^T y + c$ Polynomial Kernel The Polynomial kernel is a non-stationary kernel. It is well suited for problems where all data is normalized. $k(x, y) = (\alpha x^T y + c)^d$ Gaussian Kernel The Gaussian kernel is by far one of the most versatile Kernels. It is a radial basis function kernel, and is the preferred Kernel when we don’t know much about the data we are trying to model. $k(x, y) = \exp\left(-\frac{ \lVert x-y \rVert ^2}{2\sigma^2}\right)$

For more Kernel functions, check Kernel functions for Machine Learning Applications. The accompanying source code includes definitions for over 20 distinct Kernel functions, many of them detailed in the aforementioned post.

## Learning Algorithms

### Sequential Minimal Optimization

Previous SVM learning algorithms involved the use of quadratic programming solvers. Some of them used chunking to split the problem in smaller parts which could be solved more efficiently. Platt’s Sequential Minimal Optimization (SMO) algorithm puts chunking to the extreme by breaking the problem down into 2-dimensional sub-problems that can be solved analytically, eliminating the need for a numerical optimization algorithm.

The algorithm makes use of Lagrange multipliers to compute the optimization problem. Platt’s algorithm is composed of three main procedures or parts:

• run, which iterates over all points until convergence to a tolerance threshold;
• examineExample, which finds two points to jointly optimize;
• takeStep, which solves the 2-dimensional optimization problem analytically.

The algorithm is also governed by three extra parameters besides the Kernel function and the data points.

• The parameter Ccontrols the trade off between allowing some training errors and forcing rigid margins. Increasing the value of C increases the cost of misclassifications but may result in models that do not generalize well to points outside the training set.
• The parameter ε controls the width of the ε-insensitive zone, used to fit the training data. The value of ε can affect the number of support vectors used to construct the regression function. The bigger ε, the fewer support vectors are selected and the solution becomes more sparse. On the other hand, increasing the ε-value by too much will result in less accurate models.
• The parameter T is the convergence tolerance. It is the criterion for completing the training process.

After the algorithm ends, a new Support Vector Machine can be created using only the points whose Lagrange multipliers are higher than zero. The expected outputs yi can be individually multiplied by their corresponding Lagrange multipliers ai to form a single weight vector w.

$F(x) = \sum_{i=0}^N \{ \alpha_i y \, k(z_i,x) \} + b = \sum_{i=0}^N \{ w_i \, k(z_i,x) \} + b$

### Sequential Minimal Optimization for Regression

A version of SVM for regression was proposed in 1996 by Vladimir Vapnik, Harris Drucker, Chris Burges, Linda Kaufman and Alex Smola. The method was called support vector regression and, as is the case with the original Support Vector Machine formulation, depends only on a subset of the training data, because the cost function for building the model ignores any training data close to the model prediction that is within a tolerance threshold ε.

Platt’s algorithm has also been modified for regression. Albeit still maintaining much of its original structure, the difference lies in the fact that the modified algorithm uses two Lagrange multipliers âi and ai for each input point i. After the algorithm ends, a new Support Vector Machine can be created using only points whose both Lagrange multipliers are higher than zero. The multipliers âi and ai are then subtracted to form a single weight vector w.

$F(x) = \sum_{i=0}^N \{ (\hat{\alpha_i} - \alpha_i) \, k(z_i,x) \} + b = \sum_{i=0}^N \{ w_i \, k(z_i,x) \} + b$

The algorithm is also governed by the same three parameters presented above. The parameter ε, however, receives a special meaning. It governs the size of the ε-insensitive tube over the regression line. The algorithm has been further developed and adapted by Alex J. Smola, Bernhard Schoelkopf and further optimizations were introduced by Shevade et al and Flake et al.

## Source Code

Here is the class diagram for the Support Vector Machine module. We can see it is very simple in terms of standard class organization.

Class diagram for the (Kernel) Support Vector Machines module.

### Support Vector Machine

Below is the class definition for the Linear Support Vector Machine. It is pretty much self explanatory.

### Kernel Support Vector Machine

Here is the class definition for the Kernel Support Vector Machine. It inherits from Support Vector Machine and extends it with a Kernel property. The Compute method is also overridden to include the chosen Kernel in the model computation.

### Sequential Minimal Optimization

Here is the code for the Sequential Minimal Optimization (SMO) algorithm.

## Using the code

In the following example, we will be training a Polynomial Kernel Support Vector Machine to recognize the XOR classification problem. The XOR function is classic example of a pattern classification problem that is not linearly separable.

Here, remember that the SVM is a margin classifier that classifies instances as either 1 or –1. So the training and expected output for the classification task should also be in this range. There are no such requirements for the inputs, though.

To create the Kernel Support Vector Machine with a Polynomial Kernel, do:

After the machine has been created, create a new Learning algorithm. As we are going to do classification, we will be using the standard SequentialMinimalOptimization algorithm.

After the model has been trained, we can compute its outputs for the given inputs.

The machine should be able to correctly identify all of the input instances.

## Sample application

The sample application is able to perform both Classification and Regression using Support Vector Machines. It can read Excel spreadsheets and determines the task to be performed depending on the number of the columns in the sheet. If the input table contains two columns (e.g. X and Y) it will be interpreted as a regression problem X –> Y. If the input table contains three columns (e.g. x1, x2 and Y) it will be interpreted as a classification problem <x1,x2> belongs to class Y, Y being either 1 or -1.

### Classification

To perform classification, load a classification task data such as the Yin Yang classification problem.

Yin Yang classification problem. The goal is to create a model which best determines whether a given point belongs to class blue or green. It is a clear example of a non-linearly separable problem.

Creation of a Gaussian Kernel Support Vector Machine with σ = 1.2236, C = 1.0, ε = 0.001 and T = 0.001.

Classification using the created Support Vector Machine. Notice it achieves an accuracy of 97%, with sensitivity and specifity rates of 98% and 96%, respectively.

### Regression

To perform regression, we can load the Gaussian noise sine wave example.

Noise sine wave regression problem.

Creation of a Gaussian Kernel Support Vector Machine with σ = 1.2236, C = 1.0, ε = 0.2 and T = 0.001.

After the model has been created, we can plot the model approximation for the sine wave data. The blue line shows the curve approximation for the original red training dots.

Regression using the created Kernel Support Vector Machine. Notice the coefficient of determination r² of 0.95. The closer to one, the better.

# Java over-engineered?

The following gem, that belongs into a Java programming best practices book, was found in the source of Sun’s JDK.

Well, I must confess this is almost better than the simple Hello World in Java [cached copy].

# Partial Least Squares Analysis and Regression in C#

Partial Least Squares Regression (PLS) is a technique that generalizes and combines features from principal component analysis and (multivariate) multiple regression. It has been widely adopted in the field of chemometrics and social sciences.

The code presented here is also part of the Accord.NET Framework. The Accord.NET Framework is a framework for developing machine learning, computer vision, computer audition, statistics and math applications. Please see the starting guide for mode details. The latest version of the framework includes the latest version of this code plus many other statistics and machine learning tools.

## Introduction

Partial least squares regression (PLS-regression) is a statistical method that is related to principal components regression. The goal of this method is to find a linear regression model by projecting both the predicted variables and the observable variables to new, latent variable spaces. It was developed in the 1960s by Herman Wold to be used in econometrics. However, today it is most commonly used for regression in the field of chemometrics.

In statistics, latent variables (as opposed to observable variables), are variables
that are not directly observed but are rather inferred (through a mathematical model) from other variables that are observed (directly measured). Mathematical models that aim to explain observed variables in terms of latent variables are called latent variable models.

A PLS model will try to find the multidimensional direction in the X space that explains the maximum multidimensional variance direction in the Y space. PLS-regression is particularly suited when the matrix of predictors has more variables
than observations, and when there is multicollinearity among X values. Its interesting to note that standard linear regression would likely fail to produce meaningful interpretable models in those cases.

## Overview

### Multivariate Linear Regression in Latent Space

Multiple Linear Regression is a generalization of simple linear regression for multiple inputs. In turn, Multivariate Linear Regression is a generalization of Multiple Linear Regression for multiple outputs. The multivariate linear regression is a general linear regression model which can map an arbitrary dimension space into another arbitrary dimension space using only linear relationships. In the context of PLS, it is used to map the latent variable space for the inputs X into the latent variable space for the output variables Y. Those latent variable spaces are spawned by the loading matrices for X and Y, commonly denoted P and Q, respectively.

The goal of PLS algorithms are therefore to find those two matrices. There are mainly two algorithms to do this: NIPALS and SIMPLS.

### Algorithm 1: NIPALS

Here is an exposition of the NIPALS algorithm for finding the loading matrices required for PLS regression. There are, however, many variations of this algorithm which normalize or do not normalize certain vectors.

Algorithm:

• Let X be the mean-centered input matrix,
• Let Y be the mean-centered output matrix,
• Let P be the loadings matrix for X, and let pdenote the i-th column of P;
• Let Q be the loadings matrix for Y, and let qdenote the i-th column of Q;
• Let T be the score matrix for X, and tdenote the i-th column of T;
• Let U be the score matrix for Y, and udenote the i-th column of U;
• Let W be the PLS weight matrix, and wdenote the i-th column of W; and
• Let B be a diagonal matrix of diagonal coefficients bi

Then:

1. For each factor i to be calculated:
1. Initially choose ui as the largest column vector in
X (having the largest sum of squares)
2. While (ti has not converged to a desired precision)
1. wi ∝ X’u(estimate X weights)
2. ti ∝ Xw(estimate X factor scores)
3. qi ∝ Y’t(estimate Y weights)
4. ui = Yq(estimate Y scores)
3. bi = t’u (compute prediction coefficient b)
5. X = X – tp’ (deflate X)

In other statistical analysis such as PCA, it is often interesting to inspect how much of the variance can be explained by each of the principal component dimensions. The same can also be accomplished for PLS, both for the input (predictor) variables X and outputs (regressor) variables Y. For the input variables, the amount of variance explained by each factor can be computed as bi². For outputs, it can be computed as the sum of the squared elements of its column in the matrix P,  i.e. as Sum(pi²).

### Algorithm 2: SIMPLS

SIMPLS is an alternative algorithm for finding the PLS matrices P and Q that has been derived considering the true objective of maximizing the covariance between the latent factors and the output vectors. Both NIPALS and SIMPLS are equivalent when there is just one output variable Y to be regressed. However, their answers can differ when Y is multi-dimensional. However, because the construction of the weight vectors used by SIMPLS is based on the empirical variance–covariance matrix of the joint input and output variables, outliers present in the data can adversely impact its performance.

Algorithm:

• Let X be the mean-centered input matrix,
• Let Y be the mean-centered output matrix,
• Let P be the loadings matrix for X, and let pi denote the i-th column of P;
• Let C be the loadings matrix for Y, and let ci denote the i-th column of C;
• Let T be the score matrix for X, and ti denote the i-th column of T;
• Let U be the score matrix for Y, and ui denote the i-th column of U; and
• Let W be the PLS weight matrix, and wi denote the i-th column of W.

Then:

1. Create the covariance matrix C = X’Y
2. For each factor i to be calculated:
1. Perform SVD on the covariance matrix and store the first left singular vector in wi and the first right singular value times the singular values in ci.
2. ti ∝ X*wi            (estimate X factor scores)
4. ci = ci/norm(ti)     (estimate Y weights)
5. wi = wi/norm(ti)     (estimate X weights)
6. ui = Y*ci            (estimate Y scores)
7. vi = pi              (form the basis vector vi)
9. Make u orthogonal to the previous scores T
10. Deflate the covariance matrix C
1. C = C – vi*(vi‘*C)

## Source Code

This section contains the realization of the NIPALS and SIMPLS algorithms in C#. The models have been implemented considering an object-oriented structure that is particularly suitable to be data-bound to Windows.Forms (or WPF) controls.

## Class Diagram

Class diagram for the Partial Least Squares Analysis.

## Performing PLS using NIPALS

Here is the source code for computing PLS using the NIPALS algorithm:

## Performing PLS using SIMPLS

And here is the source code for computing PLS using the SIMPLS algorithm:

## Multivariate Linear Regression

Multivariate Linear Regression is computed in a similar manner to a Multiple Linear Regression. The only difference is that, instead of having a weight vector and a intercept, we have a weight matrix and a intercept vector.

The weight matrix and the intercept vector are computed in the PartialLeastSquaresAnalysis class by the CreateRegression method. In case the analyzed data already was mean centered before being fed to the analysis, the constructed intercept vector will consist only of zeros.

## Using the code

As an example, lets consider the example data from Hervé Abdi, where the goal is to predict the subjective evaluation of a set of 5 wines. The dependent variables that we want to predict for each wine are its likeability, and how well it goes with meat, or dessert (as rated by a panel of experts). The predictors are the price, the sugar, alcohol, and acidity content of each wine.

Next, we proceed to create the Partial Least Squares Analysis using the Covariance
method (data will only be mean centered but not normalized) and using the SIMPLS
algorithm.

After the analysis has been computed, we can proceed and create the regression model.

Now after the regression has been computed, we can find how well it has performed. The coefficient of determination r² for the variables Hedonic, Goes with Meat and Goes with Dessert can be computed by the CoefficientOfDetermination method of the MultivariateRegressionClass and will be, respectively, 0.9999, 0.9999 and 0.8750 – the closer to one, the better.

## Sample application

The accompanying sample application performs Partial Least Squares Analysis and Regression in Excel worksheets. The predictors and dependent variables can be selected once the data has been loaded in the application.

Left: Wine example from Hervé Abdi. Right: Variance explained by PLS using the SIMPLS algorithm

Left: Partial Least Squares Analysis results and regression coefficients for the
full regression model. Right: projection of the dependent and predictors variables
using the three first factors.

Results from the Multivariate Linear Regression performed in Latent Space using
three factors from PLS.

Left: Example data from Geladi and Kowalski. Right: Analysis results for the data using NIPALS. We can see that just two factors are enough to explain the whole variance of the set.

Left: Projections to the latent spaces. Right: Loadings and coefficients for the
two factor Multivariate Linear Regression model.

Results from the Partial Least Squares Regression showing a perfect fit using only the first two factors.

# Hidden Markov Model -Based Sequence Classifiers in C#

Distinct Hidden Markov Models can be trained individually on data obtained from each class of a classification problem. If we create a set of those models and specialize each model to recognize each of the separate classes, we can then use the HMMs ability to calculate the likelihood that a given sequence belongs to the model to determine the most likely class of an unknown sequence.

The code presented here is also part of the Accord.NET Framework. The Accord.NET Framework is a framework for developing machine learning, computer vision, computer audition, statistics and math applications. It is based on the already excellent AForge.NET Framework. Please see the starting guide for mode details. The latest version of the framework includes the latest version of this code plus many other statistics and machine learning tools.

## Introduction

### Hidden Markov Models

An introduction about Hidden Markov Models (HMM) was presented in a previous article, entitled Hidden Markov Models in C#. HMMs are models are stochastic methods to model temporal and sequence data.

They allow, among other things, (1) to infer the most likely sequence of states that produced a given output sequence, to (2) infer which will be the most likely next state (and thus predicting the next output) and (3) calculate the probability that a given sequence of outputs originated from the system (allowing the use of hidden Markov models for sequence classification). In the context of this article, we will be more interested in results from ability (3).

Hidden Markov Models can be seem as finite state machines where for each sequence unit observation there is a state transition and, for each state, there is a output symbol emission. The picture on the left summarizes the overall definition of a HMM given in the previous article.

### Hidden Markov Model -Based Sequence Classifier

Hidden Markov Models can be trained individually on data obtained from each class of a classification problem. If we create a set of models and specialize each model to recognize each of the separated classes, then we will be able to explore the HMMs ability to calculate the likelihood that a given sequence belongs to itself to perform classification of discrete sequences.

After all models have been trained, the probability of the unknown-class sequence can be computed for each model. As each model specialized in a given class, the one which outputs the highest probability can be used to determine the most likely class for the new sequence, as shown in the picture below.

Schematic representation of the sequence classification procedure.

## Source Code

The implementation is very straightforward, as shown in the picture below. The Tag property of the HiddenMarkovModel class can be used to store additional information about which class the model will be representing.

Class diagram for the HMM-based sequence classifier.

### Computing the most likely class for a given sequence

Computing the most likely class for a given sequence is as simple as iterating through all models in the set and selecting the one which has produced the highest likelihood value.

### Training each model to recognize each of the output classes

To train each model to recognize each of the output classes we have to split the training data into subsets whose outputs corresponds to each of the classes from the classification problem.

## Using the Code

Code usage is very simple. Once one has the input data available in the form of a sequence array and output data in form of a integer array, create a new HiddenMarkovClassifier object with the appropriate parameters.

Then, just call Learn() passing the input and output data and the convergence limit for the learning algorithm. After that, call Compute() passing a new integer sequence to be classified by the system.

## Sample Application

The accompanying sample application can read Excel spreadsheets containing integer sequences and labels for those sequences. An example spreadsheet contained inside the Resources folder in the compressed archive file demonstrates the expected format for this data.

Left: Input sequences data. Right: Hidden Markov Models contained in the Sequence Classifier with initial configurations.

Left: Trained models. Right: Performance test for the Hidden Markov Model Sequence Classifier.

# Hidden Markov Models in C#

Hidden Markov Models (HMM) are stochastic methods to model temporal and sequence data. They are especially known for their application in temporal pattern recognition such as speech, handwriting, gesture recognition, part-of-speech tagging, musical score following, partial discharges and bioinformatics.

The code presented here is also part of the Accord.NET Framework. The Accord.NET Framework is a framework for developing machine learning, computer vision, computer audition, statistics and math applications. It is based on the already excellent AForge.NET Framework. Please see the starting guide for mode details. The latest version of the framework includes the latest version of this code plus many other statistics and machine learning tools.

## Introduction

Hidden Markov Models were first described in a series of statistical papers by Leonard E. Baum and other authors in the second half of the 1960s. One of the first applications of HMMs was speech recognition, starting in the mid-1970s. Indeed, one of the most comprehensive explanations on the topic was published in “A Tutorial On Hidden Markov Models And Selected Applications in Speech Recognition”, by Lawrence R. Rabiner in 1989. In the second half of the 1980s, HMMs began to be applied to the analysis of biological sequences, in particular DNA. Since then, they have become ubiquitous in the field of bioinformatics.

Dynamical systems of discrete nature assumed to be governed by a Markov chain emits a sequence of observable outputs. Under the Markov assumption, it is also assumed that the latest output depends only on the current state of the system. Such states are often not known from the observer when only the output values are observable.

Hidden Markov Models attempt to model such systems and allow, among other things, (1) to infer the most likely sequence of states that produced a given output sequence, to (2) infer which will be the most likely next state (and thus predicting the next output) and (3) calculate the probability that a given sequence of outputs originated from the system (allowing the use of hidden Markov models for sequence classification).

The “hidden” in Hidden Markov Models comes from the fact that the observer does not know in which state the system may be in, but has only a probabilistic insight on where it should be.

## Definition

Hidden Markov Models can be seem as finite state machines where for each sequence unit observation there is a state transition and, for each state, there is a output symbol emission.

### Notation

Traditionally, HMMs have been defined by the following quintuple:

$\lambda = (N, M, A, B, \pi)$

where

• N is the number of states for the model
• M is the number of distinct observations symbols per state, i.e. the discrete alphabet size.
• A is the NxN state transition probability distribution given in the form of a matrix A = {aij}
• B is the NxM observation symbol probability distribution given in the form of a matrix B = {bj(k)}
• π is the initial state distribution vector π = {πi}

Note that, if we opt out the structure parameters M and N we have the more often used compact notation

$\lambda = (A, B, \pi)$

### Canonical problems

There are three canonical problems associated with hidden Markov models, which I’ll quote from Wikipedia:

1. Given the parameters of the model, compute the probability of a particular output sequence. This requires summation over all possible state sequences, but can be done efficiently using the Forward algorithm, which is a form of dynamic programming.
2. Given the parameters of the model and a particular output sequence, find the state sequence that is most likely to have generated that output sequence. This requires finding a maximum over all possible state sequences, but can similarly be solved efficiently by the Viterbi algorithm.
3. Given an output sequence or a set of such sequences, find the most likely set of state transition and output probabilities. In other words, derive the maximum likelihood estimate of the parameters of the HMM given a dataset of output sequences. No tractable algorithm is known for solving this problem exactly, but a local maximum likelihood can be derived efficiently using the Baum-Welch algorithm or the Baldi-Chauvin algorithm. The Baum-Welch algorithm is an example of a forward-backward algorithm, and is a special case of the Expectation-maximization algorithm.

The solution for those problems are exactly what makes Hidden Markov Models useful. The ability to learn from the data (using the solution of problem 3) and then become able to make predictions (solution to problem 2) and able to classify sequences (solution of problem 2) is nothing but applied machine learning. From this perspective, HMMs can just be seem as supervisioned sequence classifiers and sequence predictors with some other useful interesting properties.

### Choosing the structure

Choosing the structure for a hidden Markov model is not always obvious. The number of states depend on the application and to what interpretation one is willing to give to the hidden states. Some domain knowledge is required to build a suitable model and also to choose the initial parameters that an HMM can take. There is also some trial and error involved, and there are sometimes complex tradeoffs that have to be made between model complexity and difficulty of learning, just as is the case with most machine learning techniques.

Additional information can be found on http://www.cse.unsw.edu.au/~waleed/phd/tr9806/node12.html.

## Algorithms

The solution to the three canonical problems are the algorithms that makes HMMs useful. Each of the three problems are described in the three subsections below.

### Evaluation

The first canonical problem is the evaluation of the probability of a particular output sequence. It can be efficiently computed using either the Viterbi-forward or the Forward algorithms, both of which are forms of dynamic programming.

The Viterbi algorithm originally computes the most likely sequence of states which has originated a sequence of observations. In doing so, it is also able to return the probability of traversing this particular sequence of states. So to obtain Viterbi probabilities, please refer to the Decoding problem referred below.

The Forward algorithm, unlike the Viterbi algorithm, does not find a particular sequence of states; instead it computes the probability that any sequence of states has produced the sequence of observations. In both algorithms, a matrix is used to store computations about the possible state sequence paths that the model can assume. The forward algorithm also plays a key role in the Learning problem, and is thus implemented as a separate method.

### Decoding

The second canonical problem is the discovery of the most likely sequence of states that generated a given output sequence. This can be computed efficiently using the Viterbi algorithm. A trackback is used to detect the maximum probability path travelled by the algorithm. The probability of travelling such sequence is also computed in the process.

### Learning

The third and last problem is the problem of learning the most likely parameters that best models a system given a set of sequences originated from this system. Most implementations I’ve seem did not consider the problem of learning from a set of sequences, but only from a single sequence at a time. The algorithm below, however, is fully suitable to learn from a set of sequences and also uses scaling, which is another thing I have not seem in other implementations.

The source code follows the original algorithm by Rabiner (1989). There are, however, some known issues with the algorithms detailed in Rabiner’s paper. More information about those issues is available in a next section of this article entitled “Remarks”.