Video classification with hybrid models

Deep learning reigns undisputed as the new de-facto method for image classification. However, at least until the beginning of 2016, it was not yet quite clear whether deep learning methods were undisputedly better than more traditional approaches based on handcrafted features for the specific task of video classification. While most research interest is certainly being directed towards deep learning, we thought it would be fun to play on the handcrafted features side for a while and we actually made some pretty interesting discoveries along the way.

Disclaimer: this post is directed towards beginners in computer vision and may contain very loose descriptions or simplifications of some topics in order to bring interest about the field to unexperienced readers – for a succinct and peer-reviewed discussion about the topic, the reader is strongly advised to refer to https://arxiv.org/abs/1608.07138 instead.

Three (non-exhaustive) approaches for classification

In the following paragraphs, we will explore a little three different ways of performing video classification: the task of determining to which of a finite set of labels a video must belong to.

Handcrafted features: designing image features yourself

Initial video classification systems were based on handcrafted features, or, in other words, based on techniques for extracting useful information from a video that have been discovered or invented based on the expertise of the researcher.

For example, a very basic approach would be to consider that whatever could be considered as a “corner” in an image (see link for an example) would be quite relevant for determining what was inside this image. We could then present this collection of points (which would be of much reduced size than the image itself) to a classifier  and hope it would be able to determine its contents from this simplified representation of the data.

In the case of video, the most successful example of those is certainly the Dense Trajectories approach of Wang et al., that captures frame-level descriptors over pixel trajectories determined by optical flow.

Deep learning: letting networks figure out features for you

Well, some could actually think that using corners or other features to try to guess what is in an image or video is not straightforward at all – why not simply use the image or video itself as a whole and present this to the classifier to check out what it would come up with? Well the truth is that this had been tried, but until a few years ago, it simply wouldn’t work except for simple problems. For some time we didn’t know how to make classifiers that could handle extremely large amounts of data, and yet extract anything useful from it.

This started to change in 2006, when the interest on training large neural networks started to raise. In 2012, a deep convolutional network with 8 layers managed to win the ImageNet challenge, far outperforming other approaches based on Fisher Vector encodings of handcrafted features. Since then, many works have shown how deep nets could perform way better than other methods in image classification and other domains.

However, while deep nets have certainly been shown to be the undisputed winners in action classification, the same could not yet be said about video classification, at least not in the beginning of 2016.  Moreover, training deep neural networks for video can also be extremely costly, both in terms of computational power needed as well as the number and sheer size of examples needed to train those networks.

Hybrid models: getting the best out of the both worlds 😉

So, could we design classification architectures that could be both powerful and easy to train? In the sense that it wouldn’t be necessary to rely on huge amounts of labelled data in order to learn even the most basic, pixel-level aspects of a video, but instead in a way that we could leverage this knowledge already from techniques that are already known to work fairly well?

Well, indeed, the answer seems to be yes – as long as you pay attention to some to some details that can actually make a huge difference.

This is shown in the paper “Sympathy for Details: Hybrid Classification Architectures for Action Recognition.” This paper is mostly centered about two things: a) showing that it is possible to make standard methods perform as good as deep nets for video; and b) showing that by combining Fisher Vectors of traditional handcrafted video features with a multi-layer architecture can actually perform better than most deep nets for video, achieving state-of-the-art results at the date of submission.

The paper, co-written with colleagues and advisors from Xerox Research Centre and the Computer Vision Center of the Autonomous University of Barcelona, has been accepted for publication in the 14th European Conference on Computer Vision (ECCV’16), to be held in Amsterdam, The Netherlands, this October, and can be found here.

Liblinear algorithms in C#

The Accord.NET Framework is not only an image processing and computer vision framework, but also a machine learning framework for .NET. One of its features is to encompass the exact same algorithms that can be found in other libraries, such as LIBLINEAR, but offer them in .NET withing a common interface ready to be incorporated in your application.

What is LIBLINEAR?

As its authors put, LIBLINEAR is a library for large linear classification. It is intended to be used to tackle classification and regression problems with millions of instances and features, although it can only produce linear classifiers, i.e. linear support vector machines.

The framework now offers almost all liblinear algorithms in C#, except for one. Those include:

… but also always make sure to interpret results correctly! This post presents a quick intro on how to perform statistical hypothesis testing and power analysis with the Accord.NET Framework in C#.

Hypothesis testing

What does hypothesis testing means in statistics (and should also mean everywhere, for that matter)? You may recall from Karl Popper’s theory of falsiability that good theories can rarely be accurately proven, but you may gain a considerable confidence on them by constantly challenging and failing to refute them.

This comes from the fact that it is often easier to falsify something than to prove it. Consider for instance the white-swan/black-swan example: Let’s say a theory states that all swans are white. This is a very strong claim; it does not applies to one or a few particular observations of a swan, but all of them. It would be rather difficult to verify if all of the swans in Earth are indeed white. It is thus almost impossible to prove this theory directly.

However, the catch is that it will take only a single contrary example to refute it. If
we find a single swan that is black, the entire theory should be rejected, so alternate theories could be raised. It should be fairly easy to attempt to prove a theory wrong. If a theory continuously survives those attempts to be proven wrong, it becomes stronger. This does not necessarily means it is correct, only that it is very unlikely to be wrong.

This is pretty much how the science method works; it also provides a solution to the demarcation problem originally proposed by Kant: the problem of separating the sciences from the pseudo-sciences (i.e. astronomy from astrology). A “good” theory should be easy to attack so we can try to refute it; and by constantly challenging it and failing to prove it wrong, we gain further confidence that this theory may,
indeed, be right. In short:

Often the most interesting theories can’t be proven right, they can only be proven wrong. By continuously refuting alternatives, a theory becomes stronger (but most likely never reaching the ‘truth’).

Answering the question in the first phrase of this section, hypothesis testing means verifying if a theory holds even when confronted with alternative theories. In statistical hypothesis testing, this often means checking if a hypothesis holds even when confronted with the fact that it may have just happened to be true by pure chance or plain luck.

Statistical hypothesis testing

Fisher (1925) also noted that we can’t always prove a theory but we can attempt to refute it. Therefore, statistical hypothesis testing includes stating a hypothesis, which is the hypothesis we are trying to invalidade; and check if we can confidently reject it by confronting it with data from a test or experiment. This hypothesis to be pickpocketed is often called the null hypothesis (commonly denoted H0). It receives this name as it is usually the hypothesis of no change: there is no difference, nothing changed after the experiment, there is no effect.

The hypotheses verified by statistical hypothesis tests are often theories about whether or not a random sample from a population comes from a given probability distribution. This seems weird, but several problems can be cast in this way. Suppose, for example, we would like to determine if students from a classroom have significantly different grades than students from another room. Any difference could
possibly be attributed to chance, as some students may just perform better on a exam because of luck.

An exam was applied to both classrooms. The exam results (by each student) are written below:

 Classroom A Classroom B 8.12, 8.34, 7.54, 8.98, 8.24, 7.15, 6.60, 7.84, 8.68, 9.44, 8.83, 8.21, 8.83, 10.0, 7.94, 9.58, 9.44, 8.36, 8.48, 8.47, 8.02, 8.20, 10.0, 8.66, 8.48, 9.17, 6.54, 7.50 7.50, 6.70, 8.55, 7.84, 9.23, 6.10, 8.45, 8.27, 7.01, 7.18, 9.05, 8.18, 7.70, 7.93, 8.20, 8.19, 7.65, 9.25, 8.71, 8.34, 7.47, 7.47, 8.24, 7.10, 7.87, 10.0, 8.26, 6.82, 7.53 Students: 28 Mean: 8.416 Students: 29 Mean: 7.958

We have two hypothesis:

• Results for classroom A are not significantly different from the results from classroom B. Any difference in means could have been explained due to chance alone.
• Results are indeed different. The apparent differences are very unlikely to have occurred by chance.

Since we have less than 30 samples, we will be using a Two-Sample T-Test to test the hypothesis that the population’s mean values of the two samples are not equal. Besides, we will not be assuming equal variances. So let’s we create our test object:

And now we can query it:

Which reveals the test is indeed significant. And now we have at least two problems to address…

Problems

Problem 1: Statistical significance does not imply practical significance

So the test was significant. But would this mean the difference itself is significant?
Would this mean there any serious problem with the school teaching method?

No – but it doesn’t mean the contrary either. It is impossible to tell just by looking at the p-level.

The test only said there was a difference, but it can not tell the importance of this difference. Besides the two classes having performed so differently they could trigger statistical significance, we don’t know if this difference has any practical significance. A statistical test being significant is not a proof; it is just an evidence to be balanced together with other pieces of information in order to drawn a conclusion.

Perhaps one of best examples illustrating this problem is given by Martha K. Smith:

Suppose a large clinical trial is carried out to compare a new medical treatment with a standard one. The statistical analysis shows a statistically significant difference in lifespan when using the new treatment compared to the old one. But the increase in lifespan is at most three days, with average increase less than 24 hours, and with poor quality of life during the period of extended life. Most people would not consider the improvement practically significant.

In our classroom example, the difference in means is about 0.46 points. If principals believe a difference of less than 0.50 in a scale from 0.00 to 10.00 is not that critical, there may be no need to force students from the room with lower grades to start taking extra lessons after school. In other words, statistical hypothesis testing does not lead to automatic decision making. A statistically significant test is just another evidence which should be balanced with other clues in order to take a decision or
draw a conclusion
.

Problem 2: Powerless tests can be misleading

The p-level reported by the significance test is the probability of the extreme data we found be occurring given the null hypothesis is correct. Often, this is not the case. We
must also know the probability that the test will reject the null hypothesis when the null hypothesis is false. To do so, we must compute the power of our test, and, better yet, we should have used this information to conclude how many samples we would need to achieve a more informative test before we conducted our experiment. The power is then the probability of detecting a difference if this difference indeed exists (Smith, 2011). So let’s see:

The test we performed had astonishingly small power; so if the null hypothesis is false (and there is actually a difference between the classrooms) we have only about 50% chance of correctly rejecting it. Therefore, this would also lead to a 50% chance of producing a false negative – incorrectly saying there is no difference when actually there is.  The table below exemplifies the different errors we can get by rejecting or not the null hypothesis.

 Fail to reject the null hypothesis Null hypothesis is true Null hypothesis is false Correct True negative Type II error (beta) False negative Type I error (alpha) False positive Correct outcome True positive

Tests with little statistical power are  often inconsistent in the literature. Suppose, for example, that the score from the first student from classroom B had earned a  7.52 instead of 7.50. Due to the low power of the test, this little change would already be sufficient to render the test nonsignificant, and we will not be able to reject the null hypothesis that the two population means aren’t different anymore. Due to the low power of the test, we can’t distinguish between a correct true negative and a type II error. This is why powerless tests can be misleading and should never be relied upon for decision making (Smith, 2011b).

The power of a test increases with the sample size. To obtain a power of at least 80%, let’s see how many samples should have been collected:

So, we would actually need 57 students in each classroom to confidently affirm whether there was an difference or not. Pretty disappointing, since in the real world we wouldn’t be able to enroll more students and wait years until we could perform another exam just to adjust the sample size. On those situations, the power for the test can be increased by increasing the significance threshold (Thomas, Juanes, 1996), although clearly sacrificing our ability to detect false positives (Type I errors) in the process.

My test turned insignificant. Should I accept the null hypothesis?

The short answer is ‘only if you have enough power‘. Otherwise, definitively no.

If you have reason to believe the test you performed had enough power to detect a difference within the given Type-II error rate, and it didn’t, then accepting the null would most likely be acceptable. The acceptance should also be accompanied of an analysis of confidence intervals or effect sizes. Consider, for example, that some actual scientific discoveries were  indeed made by accepting the null hypothesis rather than by contradicting it; one notable example being the discovery of the X-Ray (Yu, 2012).

Further criticism

Much of the criticism associated with statistical hypothesis testing is often related not to the use of statistical hypothesis testing per se, but on how the significance outcomes from such tests are interpreted. Often it boils down to incorrectly believing a p-value is the probability of a null hypothesis being true, when, in fact, it is only the probability of obtaining a test statistic as extreme as the one calculated from the data, always within the assumptions of the test under question.

Moreover, another problem may arise when we chose a null hypothesis which is obviously false. There is no point on hypothesis testing when the null hypothesis couldn’t possibly be true. For instance, it is very difficult to believe a parameter of a continuous distribution is exactly equal to an hypothesized value, such as zero. Given enough samples, it will always be possible to find a difference, as small as it gets, and the test will invariably turn significant. Under those specific circumstances, statistical testing can be useless as it has relationship to practical significance. That is why analyzing the effect size is important in order to determine the practical significance of an hypothesis test. Useful hypothesis would also need to be probable, plausible and falsifiable (Beaulieu-Prévost, 2005).

The following links also summarize much of the criticism in statistical hypothesis testing. The last one includes very interesting (if not enlightening) comments (in the comment section) on common criticisms of the hypothesis testing method.

The Accord.NET Framework

Now that we presented the statistical hypothesis testing framework, and now that the reader is aware of its drawbacks, we can start talking about performing those tests with the aid of a computer. The Accord.NET Framework is a framework for scientific computing with wide support for statistical and power analysis tests, without entering the merit if they are valid or no. In short, it provides some scissors;
feel free to run with them
.

Tests available in the framework (at the time of this writing)

As it may already have been noticed, the sample code included in the previous section was C# code using the framework. In the aforementioned example, we created a T-Test for comparing the population means of two samples drawn from Normal distributions. The framework, nevertheless, includes  many other tests, some with support for power analysis. Those include:

Tests marked with a * are available in versions for one and two samples. Tests on the second row can be used to test hypothesis about contingency tables. Just remembering, the framework is open source and all code is available on GitHub.

A class diagram for the hypothesis testing module is shown in the picture below. Click for a larger version.

Class diagram for the Accord.Statistics.Testing namespace.

Framework usage should be rather simple. In order to illustrate it, the next section brings some example problems and their solution using the hypothesis testing module.

Example problems and solutions

Problem 1. Clairvoyant card game.

This is the second example from Wikipedia’s page on hypothesis testing. In this example, a person is tested for clairvoyance (ability of gaining information about something through extra sensory perception; detecting something without using the known human senses.

Problem 2. Worried insurance company

This is a common example with variations given by many sources. Some of them can be found here and here.

Problem 3. Differences among multiple groups (ANOVA)

This example comes from Wikipedia’s page on the F-test. Suppose we would like to study the effect of three different levels of a factor ona response (such as, for example, three levels of a fertilizer on plant growth. We have made 6 observations for each of the three levels a1, a2 and a3, and have written the results as in the table below.

The last line in the example shows the ANOVA table using the framework’s DataGridBox object. The DataGridBox is a convenience class for displaying DataGridViews just as one would display a message using MessageBox.
The table is shown below:

Problem 4. Biased bees

This example comes from the stats page of the College of Saint Benedict and Saint John’s University (Kirkman, 1996). It is a very interesting example
as it shows a case in which a t-test fails to see a difference between the samples because of the non-normality of of the sample’s distributions. The Kolmogorov-Smirnov nonparametric test, on the other hand, succeeds.

The example deals with the preference of bees between two nearby blooming trees in an empty field. The experimenter has colelcted data measurinbg how much time does a bee spents near a particular tree. The time starts to be measured when a bee first touches the tree, and is stopped when the bee moves more than 1 meter far from it. The samples below represents the measured time, in seconds, of the observed
bees for each of the trees.

Problem 5. Comparing classifier performances

The last example comes from (E. Ientilucci, 2006) and deals with comparing the performance of two different raters (classifiers) to see if their performance are significantly different.

Suppose an experimenter has two classification systems, both trained to classify observations into one of 4 mutually exclusive categories. In order to measure the performance of each classifier, the experimenter confronted their classification labels with the ground truth for a testing dataset, writing the respective results in the form of contingency tables.

The hypothesis to be tested is that the performance of the two classifiers are the same.

In this case, the test didn’t show enough evidence to confidently reject the null hypothesis. Therefore, one should restrain from affirming anything about differences between the two systems, unless the power for the test is known.

Unfortunately I could not find a clear indication in the literature about the power of a two matrix Kappa test. However, since the test statistic is asymptotically normal, one would try checking the power for this test by analysis the power of the underlying Z-test. If there is enough power, one could possibly accept the null hypothesis that there are no large differences between the two systems.

Suggestions

As always, I expect the above discussion and examples could be useful for interested readers and users. However, if you believe you have found a flaw or would like to discuss any portion of this post, please feel free to do so by posting on the comments section.

PS: The classroom example uses a T-test to test for differences in populations means. The T-Test assumes a normal distribution. The data, however, it not exactly normal, since it is crippled between 0 and 10. Suggestions for a better example would also be appreciated!

References

R. A. Fisher, 1925. Statistical Methods for Research Workers. Available online from:
http://psychclassics.yorku.ca/Fisher/Methods/

M. K. Smith, 2011. Common mistakes in using statistics: Spotting and Avoiding Them – Power of a
Statistical Procedure. Available online from:
http://www.ma.utexas.edu/users/mks/statmistakes/power.html

M. K. Smith, 2011b. Common mistakes in using statistics: Spotting and Avoiding Them – Detrimental
Effects of Underpowered or Overpowered Studies. Available online from:
http://www.ma.utexas.edu/users/mks/statmistakes/UnderOverPower.html

L. Thomas, F. Juanes, 1996. The importance of statistical power analysis: an example from animal behaviour,
Animal Behaviour, Volume 52, Issue 4, October., Pages 856-859. Available online from: http://otg.downstate.edu/downloads/2007/Spring07/thomas.pdf

C. H. (Alex) Yu, 2012. Don’t believe in the null hypothesis? Available online from:
http://www.creative-wisdom.com/computer/sas/hypothesis.html

D. Beaulieu-Prévost, 2005. Statistical decision and falsification in science : going beyond the null
hypothesis. Séminaires CIC. Université de Montréal.

T.W. Kirkman, 1996. Statistics to Use. Acessed July 2012. Avalilable online from
http://www.physics.csbsju.edu/stats/

E. Ientilucci, 2006. “On Using and Computing the Kappa Statistic”. Available online from http://www.cis.rit.edu/~ejipci/Reports/On_Using_and_Computing_the_Kappa_Statistic.pdf

Machine Learning Books and Lectures

Jordan, one of the readers of the blog, wrote to point out some cool references for machine learning, mathematics and artificial intelligence.

Thanks again for all your contributions. I’m a .NET programmer coming from a background of studying politics and Latin American studies so when the moment came that I realized I was interested in A.I., I didn’t have many resources to turn to.

I have a collection of books and links that I found very very helpful when I was trying to learn about these concepts. I was thinking they could be useful to some of your readers who were in my shoes two years ago. So without further adieu:

Handbook of Normal Frames and Coordinates

"Handbook of Normal Frames and Coordinates," Iliev, Bozhidar Z. (2006)

“This can be a heavy text, but if you understand the concept, this can take you to the next level – I would categorize this as Topology but has applications in quantum physics, theoretical mathematics, theoretical physics etc…- in context of your blogs – granted this is a dedicated read – I think readers will be able to appreciate and truly understand what a Hilbert Space is after reading this book.”

Linear Algebra

"Linear Algebra," Jim Heffron (2008)

“I liked this one because it was free and is still highly rated – I have also heard great reviews about David Poole’s book on linear algebra, but I couldn’t get myself to shell out the money when this was free.”

Complex To Real

“There are a ton of great articles here – I have personally read the ones on fourier transforms and wavelet transforms – great stuff”

Stanford Lectures – Fourier Analysis series

“Email the Professor for the companion book. At this point it may have been published – but well worth shelling out the dough in any case.”

Bio-Inspired Artificial Intelligence

"Bio-Inspired Artificial Intelligence," Floreano and Mattiussi (2008)

“Excellent reference – fairly in depth for the number of topics it covers, lots of illustrations (illustrations are always a good thing 🙂 and I’ve also found it to be a useful source for inspiration. I bought this book while I was looking into fuzzy swarm intelligence – it doesn’t have all the answers, but I am simply in love with this book.”

Video lectures on Machine Learning

“A collection of video lectures featuring very brilliant people – let’s face it… if you’re doing or are interested in mathematics this complex… you probably don’t know too many people who you can talk to / learn from on the subject unless you’re in a University studying this kind of thing – these are some great lectures on machine learning – I just recently found this site but wish I had found it sooner – it’s great if you’re just exploring machine learning or are very well versed in it – however, I fall somewhere in the middle of that distribution so take it for what it’s worth!”

Fearless Symmetry

"Fearless Symmetry," Ash and Gross (2006)

Another accessible book to those without heavy training in math – great intro to Galois Theory, the Riemann Hypothesis and several other concepts.

Zero: The Biography of a Dangerous Idea

"Zero: The Biography of a Dangerous Idea," Charles Seife (2000)

This one is more historical and conceptual than technical but it’s a captivating read and will help get you through those hard times when you want to put down that book on K-Dimensional Manifolds, but still need to satisfy your mathematical mind (some say it’s never at rest once you catch that "learning bug").

Finally,  when you get lost, go here! The Khan Academy is a not-for-profit 501(c)(3) with the mission of providing a world-class education to anyone, anywhere.

Thanks Jordan! I hope readers can enjoy those resources as much as I did. Cheers!

Gaussian Mixture Models and Expectation-Maximization

Like K-Means, Gaussian Mixture Models (GMM) can be regarded as a type of unsupervised learning or clustering methods. They are among the most statistically mature methods for clustering. But unlike K-Means, GMMs are able to build soft clustering boundaries, i.e., points in space can belong to any class with a given probability.

The code presented here is part of the Accord.NET Framework. The Accord.NET Framework is a framework for developing machine learning, computer vision, computer audition, statistics and math applications in .NET. To use the framework in your projects, install it by typing Install-Package Accord.MachineLearning in your IDE’s NuGet package manager.

Introduction

In statistics, a mixture model is a probabilistic model which assumes the underlying data to belong to a mixture distribution. In a mixture distribution, its density function is just a convex combination (a linear combination in which all coefficients or weights sum to one) of other probability density functions:

$p(x) = w_1 p_1(x) + w_2 p_2(x) + ... + w_n p_n(x)$

The individual pi(x) density functions that are combined to make the mixture density p(x) are called the mixture components, and the weights w1, w2, …, wn associated with each component are called the mixture weights or mixture coefficients.

The most common mixture distribution is the Gaussian (Normal) density function, in which each of the mixture components are Gaussian distributions, each with their own mean and variance parameters.

$p(x) = w_1 N(x \mid \mu_1,\Sigma_1) + w_2 N(x \mid \mu_2,\Sigma_2) + ... + w_n N(x \mid \mu_n,\Sigma_n)$

To give a more solid and concrete understanding of the Gaussian mixture models, we will be jumping directly on how to represent those abstractions in source code through the use of class diagrams. Hopefully class diagrams may give a better and direct overview on the subject than a lengthy theoretical discussion.

Source code

First we will discuss about the characteristics of any probability distribution. The IDistribution interface (depicted below) says any probability distribution may have a probability function and a distribution function. It also says that probability distributions can also be fitted to sets of observations through a parameter estimation method.

Since distribution functions can be either univariate or multivariate, the methods above accept any number of scalar values as input through the use of the params keyword. The Fit method for parameter estimation also accepts a general System.Array because we could be given either an array of univariate observations in the form of a double[] or an array of multivariate observations in the form of a jagged double[][] array. Since each observation may have a different weight in the estimation process, the weights parameter can be used to inform those weights to the fitting method.

Probability distributions may also have some associated measures, mainly the Expectation and the Variance of the modeled random variable. Depending if the distribution is univariate or multivariate, the expected values and variances can be either a single scalar or a vector of scalars. For multivariate distributions the variance can be also represented by a symmetric, positive-definite matrix known as the variance-covariance matrix.

Moreover, probability distributions can also be continuous or discrete. In continuous distributions the probability function is known as the Probability Density Function (pdf), and in discrete distributions the probability function is known as the Probability Mass Function (pmf). However, since our distribution of interest, the Gaussian, is a continuous distribution, be focusing only on continuous multivariate distributions:

The Normal (Gaussian) distribution and the mixture distributions fall under the multivariate continuous distributions category and are implemented as such.

Normal distributions

The Normal (or Gaussian) multivariate distribution is a multivariate distribution whose parameters are the mean vector μ and a variance-covariance matrix Σ. Its probability density function is given by:

$p(x) = \frac{1}{ (2\pi)^{k/2}|\Sigma|^{1/2} } \exp\!\Big( {-\tfrac{1}{2}}(x-\mu)' \Sigma^{-1} (x-\mu) \Big)$

The more detailed class diagram for the normal distribution shows the multivariate characteristics of most of its methods, in particular of the Fit method for estimating Gaussian parameters from a multivariate set of data. Each sample, given in the form of a double[], may also have different associated weights. In this case, the weight associated with each of the samples can be passed through the weights parameter. The source code for the Gaussian distribution can be found in the NormalDistribution.cs class.

To estimate the parameters of a Gaussian distribution all we have to do is compute the mean and the variance-covariance matrix out of the given data. This is a very straightforward procedure and can be done by using the standard methods of the Tools class in the Accord.Statistics namespace. The Fit method updates the distribution parameters, overwriting the current distribution.

Mixture distributions

Mixture distributions are just convex combinations of other probability distributions. Thus said, mixture distributions have an array of component distributions and a coefficient array which contains the weights of the each of the component probability distributions.

In the generic implementation above, T is the type of the distributions used in the mixture. The most common mixture distribution, the Gaussian Mixture Distribution, could then be created by instantiating a Mixture object passing the initial Normal distributions as constructor parameters.

Alternatively, the parameters of the mixture distributions could be estimated from a set of observations by calling the Fit method. To estimate the parameters of a mixture distribution we will be using a common technique known as the Expectation-Maximization algorithm.

Expectation Maximization

Expectation-Maximization (EM) is a well established maximum likelihood algorithm for fitting a mixture model to a set of training data. It should be noted that EM requires an a priori selection of model order, namely, the number of M components to be incorporated into the model. In the Accord.NET Framework, it can be called implicitly by using the Mixture.Fit method or explicitly using the ExpectationMaximization class. The source code for Expectation Maximization can be found in the ExpectationMaximization.cs file.

The general E-M algorithm is comprised of the following simple steps:

1. Initialization

Initialize the distribution parameters, such as the means, covariances and mixing coefficients and evaluate the initial value of the log-likelihood (the goodness of fit of the current distribution against the observation dataset)’;

2. Expectation

Evaluate the responsibilities (i.e. weight factors of each sample) using the current parameter values;

3. Maximization

Re-estimate the parameters using the responsibilities found in the previous step;

4. Repeat

Re-evaluate the log-likelihood and check if it has changed; if it has changed less than a given threshold, the algorithm has converged.

Below is the actual realization of the algorithm as implemented in the Fit method of the Mixture<T> class.

Gaussian Mixture Models

Gaussian mixture models are among the most commonly used examples of mixture distributions. The GaussianMixtureModel class encompasses a Mixture<NormalDistribution> object and provides methods to learn from data and to perform actual classification through a simplified interface.

Moreover, a common problem which rises in mixture model fitting through E-M is the proper initialization of the mixture parameters. One popular approach is to initialize the mixture parameters using the centroids detected by the K-Means algorithm. The code below shows how the mixture parameters are computed by first creating a KMeans object and then passing the clusters to the mixture model for further processing.

Using the code

Code usage is rather simple. First, instantiate a new GaussianMixtureModel object passing the desired number of components in the Gaussian mixture. Then, call the Compute(double[][], double) method passing the learning samples and a desired convergence threshold.

After training has been completed, a new sample can be classified by calling the Classify(double[]) method.

Sample application

The accompanying sample application demonstrates how Gaussian Mixture Models can be used to cluster normally-distributed samples of data. By clicking the “Create random data” button, a given number of Normal distributions will be created in the two-dimensional space using random mean and covariance matrices.

After the data has been created, you can use the “Fit a Gaussian Mixture Model” button to fit a mixture of Gaussians to the data. Each of the Gaussians will receive a random color and the samples which have the greatest probability of belonging to any of the Gaussians will be colored accordingly.

Left: a random dataset containing three Gaussian clusters. Right: the clusters as identified by the Gaussian Mixture Model.

Left: a random dataset containing 10 Gaussian clusters. Right: the clusters as identified by the Gaussian Mixture Model.

Conclusion

In this article we found how Gaussian Mixture Models can be successfully used to create soft clustering boundaries around data. Those soft boundaries are possible because in a mixture model each sample is said to belong to a cluster only within certain probability.

A main drawback of GMMs is that the number of Gaussian mixture components, as in K-Means, is assumed known as prior, so it cannot be considered as totally unsupervised clustering method. To aid in this situation, one could use additional algorithms such as the Minimum Message Length (MML) criteria.

Another problem arises with the correct initialization of the mixture components. A poor choice of initial parameters will invariably lead to poor results, as the E-M algorithm converges only to a local optimization point. A commonly used solution is initialization by randomly sampling in the mixture data. Other common solution, covered by the implementation presented here, is to use K-Means to perform initialization. However, K-Means itself may also converge to a poor solution and impact negatively in the mixture model fitting.

References

• Raja, Y., Shaogang, G. Gaussian Mixture Models. Department of Computer Science, Queen Mary and Westfield College, England.
• Bishop, C. M. (2007), Pattern Recognition and Machine Learning (Information Science and Statistics), Springer.
• Wikipedia contributors, “Normal distribution,” Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/w/index.php?title=Normal_distribution (accessed October 18, 2010).
• Wikipedia contributors, “Probability density function,” Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/w/index.php?title=Probability_density_function (accessed October 18, 2010).
• Wikipedia contributors, “Mixture density,” Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/w/index.php?title=Mixture_density (accessed October 18, 2010).

K-Means clustering in C#

K-Means is one of the most popular, classic and simple approaches to clustering. Clustering is a method of unsupervised learning, and a common technique for statistical data analysis used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics [3].

The code presented here is also part of the Accord.NET Framework. The Accord.NET Framework is a C# framework for developing machine learning, computer vision, computer audition, statistics and math applications in .NET. It is based on the AForge.NET Framework. Please see the starting guide for mode details. The latest version of the framework includes the latest version of this code plus many other statistics and machine learning tools.

Introduction

In statistics and machine learning, k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean [4]. In its most common form, the algorithm is an iterative greedy algorithm which converges to a local optimum after a certain number of iterations.

Illustration of the K-Means algorithm.

The algorithm works by first selecting k locations at random to be the initial centroids for the clusters. Each observation is then assigned to the cluster which has the nearest centroid, and the centroids are recalculated using the mean value of assigned values. The algorithm then repeats this process until the cluster centroids do not change anymore, or until the change less than a given threshold.

There are other refinements and extensions of the algorithm. The version depicted above is its most common form, also referred as Lloyd’s algorithm.

Lloyd’s K-Means algorithm

1. Place K points into the space represented by the objects that are being clustered. These points represent initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.

Source code

The source code has been implemented using Accord.NET and is now part of the framework. In the current version (2.1.2), the following classes related to K-Means are contained inside the Accord.MachineLearning namespace. The source code available in this page contains only the parts of the framework actually needed to support the algorithm.

Class diagram for the K-Means algorithm.

The KMeans class is the main class representing the K-Means algorithm. The algorithm itself is implemented in the Compute(double[][] data, double threshold) method, which accepts a set of observations and a convergence threshold to determine when the method should stop.

The implementation is quite straightforward and does not use additional techniques to avoid convergence problems. More refined techniques may be added to the implementation in the future, so please make sure to download the latest version of Accord.NET Framework for the most up-to-date revision of the code.

Using the code

To use the code, create a new instance of the KMeans class passing the desired number of clusters to its constructor. Additionally, you may also pass a distance function to be used as a distance metric during clustering. The default is to use the square Euclidean distance.

Sample application

The k-means clustering algorithm is commonly used in computer vision as a form of image segmentation. The results of the segmentation are often used to aid border detection and object recognition. The sample application performs image segmentation using the standard squared Euclidean distance over RGB pixel color space. There are, however, better distance metrics to be used for image segmentation, such as weighted distances and other color spaces, which will not be addressed in this example.

Original image (from Ossi Petruska Flickr page*).

To perform image segmentation, we will first translate our image into an array of pixel values. The single image will be read, pixel by pixel, into a jagged array where each element is a double array of length 3. Each element of those double array will contain one of the three RGB values scaled to the interval [–1,1].

After we perform clustering on this array of pixel values, each pixel will have an associated cluster label. Each of these values will then be swapped by its corresponding cluster centroid. The source code below is called when one clicks the Run button in the application.

After segmentation, the following resulting images can be obtained:

Same image after K-Means clustering with k = 5.

Image after K-Means clustering with k = 10.

* The sample image used above has been licensed by Ossi Petruska in a Creative Commons Attribution-NonCommercial-ShareAlike 2.0 Generic license.

Conclusion

K-Means is a very simple and popular approach to clustering. The implementation presented here is the same implementation used in Accord.NET. As it can be seem, the method can be easily extended with custom distance functions through delegates or lambda expressions, and can be used in different contexts, such as image segmentation, without further modifications. As a suggestion for improvement, the method can be further speeded up by using the triangle inequality as suggested on the paper "Using the triangle inequality to accelerate k-means", by Charles Elkan.

In the next article, we will see how we can use K-Means to initialize the Expectation-Maximization algorithm for estimating Gaussian Mixture densities in Gaussian Mixture Models. Those articles will form the basis for Continuous density Hidden Markov Models.

Acknowledgements

To Antonino Porcino, who provided the first version of the code and for the valuable information about many other methods and algorithms.

Handwritten Digits Recognition with C# SVMs in Accord.NET

I’ve posted a new article on CodeProject, entitled “Handwriting Recognition Revisited: Kernel Support Vector Machines”. It is a continuation of a previous article on handwritten digits recognition but this time using SVMs instead of KDA.

The code uses the SVM library in Accord.NET Framework. The framework, which runs on .NET and is written mostly in C#, supports standard or multiclass support vector machines for either classification or regression, having more than 20 kernel functions available to choose from.

In the article, the same set and the same amount of training and testing samples have been used as in the previous Kernel Discriminant Analysis article for comparison purposes. The experimental results shows that  SVMs can outperform KDA both in terms of efficiency and accuracy, requiring much less processing time and memory available while still producing robust results.

RANdom Sample Consensus (RANSAC) in C#

RANSAC is an iterative method to build robust estimates for parameters of a mathematical model from a set of observed data which is known to contain outliers. The RANSAC algorithm is often used in computer vision, e.g., to simultaneously solve the correspondence problem and estimate the fundamental matrix related to a pair of stereo cameras.

The code presented here is part of the Accord.NET Framework. The Accord.NET Framework is a framework for developing machine learning, computer vision, computer audition, statistics and math applications in .NET. To use the framework in your projects, install it by typing Install-Package Accord.MachineLearning in your IDE’s NuGet package manager.

Introduction

RANSAC is an abbreviation for “RANdom SAmple Consensus“. It is an iterative method to estimate parameters of a mathematical model from a set of observed data which may contains outliers. It is a non-deterministic algorithm in the sense that it produces a reasonable result only with a certain probability, with this probability increasing as more iterations are allowed. The algorithm was first published by Fischler and Bolles in 1981.

The basic assumption is that the data consists of “inliers”, i.e., data whose distribution can be explained by some mathematical model, and “outliers” which are data that do not fit the model. Outliers could be considered points which come from noise, erroneous measurements or simply incorrect data. RANSAC also assumes that, given a set of inliers, there exists a procedure which can estimate the parameters of a model that optimally explains or fits this data.

Example: Fitting a simple linear regression

We can use RANSAC to robustly fit a linear regression model using noisy data. Consider the example below, in which we have a cloud of points that seems to belong to a line. These are the inliers of the data. The other points, which can be seem as measurement errors or extreme noise values, are points expected to be considered outliers.

Linear structure contained in noisy data.

RANSAC is able to automatically distinguish the inliers from the outliers through the evaluation of the linear regression model. To do so, it randomly selects subsets from the data and attempts to fit linear regression models using them. The model which best explains most of the data will then be returned as the most probably correct model fit.

The image below shows the result of fitting a linear regression directly (as shown by the red line) and using RANSAC (as shown by the blue line). We can see that the red line represents poorly the data structure because it considers all points in order to fit the regression model. The blue line seems to be a much better representation of the linear relationship hidden inside the overall noisy data.

Hidden linear structure inside noisy data. The red line shows the fitting of a linear regression model directly considering all data points. The blue line shows the same result using RANSAC.

Source code

The code below implements RANSAC using a generic approach. Models are considered to be of the reference type TModel and the type of data manipulated by this model is considered to be of the type TData. This approach allows for the creation of a general purpose RANSAC algorithm which can be used in very different contexts, be it the fitting of linear regression models or the estimation of homography matrices from pair of points in different images.

This code is available in the class RANSAC of the Accord.NET Framework (source).

Besides the generic parameters, the class utilizes three delegated functions during execution.

• The Fitting function, which should accept a subset of the data and use it to fit a model of the chosen type, which should be returned by the function;
• The Degenerate function, which should check if a subset of the training data is already known to result in a poor model, to avoid unnecessary computations; and
• The Distance function, which should accept a model and a subset of the training data to compute the distance between the model prediction and the expected value for a given point. It should return the indices of the points only whose predicted and expected values are within a given threshold of tolerance apart.

Using the code

In the following example, we will fit a simple linear regression of the form x→y using RANSAC. The first step is to create a RANSAC algorithm passing the generic type parameters of the model to be build, i.e. SimpleLinearRegression and of the data to be fitted, i.e. a double array.

In this case we will be using a double array because the first position will hold the values for the input variable x. The second position will be holding the values for the output variables y. If you are already using .NET 4 it is possible to use the Tuple type instead.

After the creation of the RANSAC algorithm, we should set the delegate functions which will tell RANSAC how to fit a model, how to tell if a set of samples is degenerate and how to check for inliers in data.

Finally, all we have to do is call the Compute method passing the data. The best model found will be returned by the function, while the given set of inliers indices for this model will be returned as an out parameter.

Sample application

The accompanying source application demonstrates the fitting of the simple linear regression model with and without using RANSAC. The application accepts Excel worksheets containing the independent values in the first column and the dependent variables in the second column.

Accord.NET Framework – An extension to AForge.NET

Today I have just released the first version of Accord.NET. The Accord.NET Framework is a C# framework I have developed over the years while working on several areas of artificial intelligence, machine learning and statistics.

The Accord.NET Framework extends the excellent AForge.NET Framework with new tools and libraries. In order to use Accord.NET, you must have AForge.NET already installed. The first version number of Accord.NET will be used to indicate the compatibility status with AForge.NET versions, thus the first version will be starting at 2.0.0.

The framework is comprised by the set of libraries and sample applications, which demonstrate their features:

• Accord.Statistics – library with statistical analysis and other tools;
• Accord.Imaging – extension to the AForge.NET Imaging library with new filters and routines;
• Accord.Neuro – extension to the AForge.NET Neuro library with other learning algorithms;
• Accord.MachineLearning – extension to AForge’s machine learning library with Support Vector Machines;
• Accord.Audio – experimental library with filters and audio processing routines.

The Framework has just been released, so be ready to expect bugs and unpolished/unfinished areas. The code is released under a LGPL license. For additional help, issues, and discussions, please refer to the recently created forums.

Recognition of Handwritten Digits using Non-linear Kernel Discriminant Analysis

I’ve just submitted a new article to CodeProject, entitled "Handwriting Recognition using Kernel Discriminant Analysis". The article is a demonstration of handwritten digit recognition using Non-linear Discriminant Analysis with Kernels and using the Optical Recognition of Handwritten Digits data set from the University of California’s Machine Learning Repository.

Recognition of Handwritten digits using Kernel Discriminant Analysis

The Code Project is a free repository of source code, tutorials and development resources. It is also home of a large and ever-growing community of software developers, architects and other technology professionals. It has fairly active forums, and is a reasonably good resource for resolving difficult software development issues.