A new article has been published in CodeProject! This article details the Viola-Jones face detection algorithm available in the Accord.NET Framework. The article page also provides a standalone package for face detection which can be reused without instantiating the entire framework.
Always test your hypothesis…
… but also always make sure to interpret results correctly! This post presents a quick intro on how to perform statistical hypothesis testing and power analysis with the Accord.NET Framework in C#.
 Download the Accord.NET Framework (through NuGet)
 Fork the code repository on GitHub or browse it online
Hypothesis testing
What does hypothesis testing mean in statistics (and what should it also mean everywhere else, for that matter)? You may recall from Karl Popper's theory of falsifiability that good theories can rarely be accurately proven, but that we may gain considerable confidence in them by constantly challenging and failing to refute them.
This comes from the fact that it is often easier to falsify something than to prove it. Consider, for instance, the white swan/black swan example: let's say a theory states that all swans are white. This is a very strong claim; it does not apply to one or a few particular observations of a swan, but to all of them. It would be rather difficult to verify that all of the swans on Earth are indeed white. It is thus almost impossible to prove this theory directly.
However, the catch is that it takes only a single contrary example to refute it. If we find a single swan that is black, the entire theory should be rejected, so that alternative theories can be raised. It should be fairly easy to attempt to prove a theory wrong. If a theory continuously survives those attempts to be proven wrong, it becomes stronger. This does not necessarily mean it is correct, only that it is very unlikely to be wrong.
This is pretty much how the scientific method works; it also provides a solution to the demarcation problem originally proposed by Kant: the problem of separating the sciences from the pseudosciences (e.g. astronomy from astrology). A "good" theory should be easy to attack, so we can try to refute it; and by constantly challenging it and failing to prove it wrong, we gain further confidence that this theory may, indeed, be right. In short:
Often the most interesting theories can't be proven right; they can only be proven wrong. By continuously refuting alternatives, a theory becomes stronger (though most likely never reaching the 'truth').
Answering the question posed at the start of this section: hypothesis testing means verifying whether a theory holds even when confronted with alternative theories. In statistical hypothesis testing, this often means checking whether a hypothesis holds even when confronted with the fact that it may have just happened to be true by pure chance or plain luck.
Statistical hypothesis testing
Fisher (1925) also noted that we can't always prove a theory, but we can attempt to refute it. Therefore, statistical hypothesis testing involves stating a hypothesis, which is the hypothesis we are trying to invalidate, and checking whether we can confidently reject it by confronting it with data from a test or experiment. This hypothesis to be challenged is often called the null hypothesis (commonly denoted H0). It receives this name as it is usually the hypothesis of no change: there is no difference, nothing changed after the experiment, there is no effect.
The hypotheses verified by statistical hypothesis tests are often theories about whether or not a random sample from a population comes from a given probability distribution. This may seem odd, but many problems can be cast in this way. Suppose, for example, we would like to determine whether students from one classroom have significantly different grades than students from another. Any difference could possibly be attributed to chance, as some students may simply perform better on an exam because of luck.
An exam was applied to both classrooms. The exam results (for each student) are listed below:
Classroom A (28 students, mean 8.416): 8.12, 8.34, 7.54, 8.98, 8.24, 7.15, 6.60, 7.84, 8.68, 9.44, 8.83, 8.21, 8.83, 10.0, 7.94, 9.58, 9.44, 8.36, 8.48, 8.47, 8.02, 8.20, 10.0, 8.66, 8.48, 9.17, 6.54, 7.50

Classroom B (29 students, mean 7.958): 7.50, 6.70, 8.55, 7.84, 9.23, 6.10, 8.45, 8.27, 7.01, 7.18, 9.05, 8.18, 7.70, 7.93, 8.20, 8.19, 7.65, 9.25, 8.71, 8.34, 7.47, 7.47, 8.24, 7.10, 7.87, 10.0, 8.26, 6.82, 7.53
We have two hypotheses:
 The null hypothesis: results for classroom A are not significantly different from the results for classroom B. Any difference in means could be explained by chance alone.
 The alternative hypothesis: the results are indeed different. The apparent differences are very unlikely to have occurred by chance.
Since we have fewer than 30 samples per group, we will be using a two-sample t-test to test the hypothesis that the population means of the two samples are not equal. Moreover, we will not be assuming equal variances. So let's create our test object:
using Accord.Statistics;
using Accord.Statistics.Testing;
using Accord.Statistics.Testing.Power;

// A and B hold the exam results listed above
double[] A = { 8.12, 8.34, 7.54, 8.98, 8.24, 7.15, 6.60, 7.84, 8.68, 9.44,
               8.83, 8.21, 8.83, 10.0, 7.94, 9.58, 9.44, 8.36, 8.48, 8.47,
               8.02, 8.20, 10.0, 8.66, 8.48, 9.17, 6.54, 7.50 };

double[] B = { 7.50, 6.70, 8.55, 7.84, 9.23, 6.10, 8.45, 8.27, 7.01, 7.18,
               9.05, 8.18, 7.70, 7.93, 8.20, 8.19, 7.65, 9.25, 8.71, 8.34,
               7.47, 7.47, 8.24, 7.10, 7.87, 10.0, 8.26, 6.82, 7.53 };

var test = new TwoSampleTTest(A, B,
    hypothesis: TwoSampleHypothesis.ValuesAreNotEqual);
And now we can query it:
// Check whether the test was significant (returns true)
Console.WriteLine("Significant: " + test.Significant);
Which reveals the test is indeed significant. And now we have at least two problems to address…
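For readers who would like to check the arithmetic behind this result, the statistic can be recomputed by hand. The sketch below is plain Python (not the Accord API), using only the exam data listed above and the standard Welch formula for unequal variances; the difference in means comes out at about 0.46, and the t statistic lands close to 2.

```python
# Sketch of the arithmetic behind a two-sample t-test without assuming
# equal variances (Welch's test), applied to the classroom data above.
from math import sqrt
from statistics import mean, variance

A = [8.12, 8.34, 7.54, 8.98, 8.24, 7.15, 6.60, 7.84, 8.68, 9.44,
     8.83, 8.21, 8.83, 10.0, 7.94, 9.58, 9.44, 8.36, 8.48, 8.47,
     8.02, 8.20, 10.0, 8.66, 8.48, 9.17, 6.54, 7.50]
B = [7.50, 6.70, 8.55, 7.84, 9.23, 6.10, 8.45, 8.27, 7.01, 7.18,
     9.05, 8.18, 7.70, 7.93, 8.20, 8.19, 7.65, 9.25, 8.71, 8.34,
     7.47, 7.47, 8.24, 7.10, 7.87, 10.0, 8.26, 6.82, 7.53]

# Standard error of the difference in means, unequal variances allowed
se = sqrt(variance(A) / len(A) + variance(B) / len(B))
t = (mean(A) - mean(B)) / se

print(round(mean(A) - mean(B), 2))  # 0.46
print("t =", round(t, 2))           # close to 2
```

The statistic sits right at the edge of the usual 5% critical region, which is exactly why the power analysis in the next section matters so much here.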
Problems
Problem 1: Statistical significance does not imply practical significance
So the test was significant. But does this mean the difference itself is significant? Does it mean there is any serious problem with the school's teaching method?
No – but it doesn't mean the contrary, either. It is impossible to tell just by looking at the p-value.
The test only said there was a difference; it cannot tell the importance of this difference. Although the two classes performed differently enough to trigger statistical significance, we don't know whether this difference has any practical significance. A significant statistical test is not a proof; it is just a piece of evidence to be balanced together with other pieces of information in order to draw a conclusion.
Perhaps one of the best examples illustrating this problem is given by Martha K. Smith:
Suppose a large clinical trial is carried out to compare a new medical treatment with a standard one. The statistical analysis shows a statistically significant difference in lifespan when using the new treatment compared to the old one. But the increase in lifespan is at most three days, with average increase less than 24 hours, and with poor quality of life during the period of extended life. Most people would not consider the improvement practically significant.
In our classroom example, the difference in means is about 0.46 points. If the school's principals believe a difference of less than 0.50 on a scale from 0.00 to 10.00 is not that critical, there may be no need to force students from the room with lower grades to start taking extra lessons after school. In other words, statistical hypothesis testing does not lead to automatic decision making. A statistically significant test is just another piece of evidence which should be balanced with other clues in order to make a decision or draw a conclusion.
Problem 2: Powerless tests can be misleading
The p-value reported by the significance test is the probability of observing data as extreme as ours, given that the null hypothesis is correct. But this is only half of the picture: we must also know the probability that the test will reject the null hypothesis when the null hypothesis is indeed false. To do so, we must compute the power of our test; better yet, we should have used this information to determine how many samples we would need to achieve a more informative test before we conducted our experiment. The power is the probability of detecting a difference if this difference indeed exists (Smith, 2011). So let's see:
// Check the actual power of the test...
Console.WriteLine("Power: " + test.Analysis.Power); // gives about 0.50
The test we performed had astonishingly small power: if the null hypothesis is false (and there actually is a difference between the classrooms), we have only about a 50% chance of correctly rejecting it. In other words, this gives a 50% chance of producing a false negative, i.e. incorrectly saying there is no difference when there actually is one. The table below summarizes the different errors we can commit by rejecting, or failing to reject, the null hypothesis.
                                     Null hypothesis is true            Null hypothesis is false
 Fail to reject the null hypothesis  Correct (true negative)            Type II error, beta (false negative)
 Reject the null hypothesis          Type I error, alpha (false positive)  Correct outcome (true positive)
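The ~50% power figure can be sanity-checked with a simple normal approximation (the exact value Accord computes uses the noncentral t distribution, so this is only a sketch). Assuming the true effect equals the observed one, and noting that a barely significant two-sided test at alpha = 0.05 implies an observed t statistic of about 2.0, the power is roughly the probability that a standard normal variable centered at that statistic exceeds the critical value:

```python
# Back-of-the-envelope power estimate via a normal approximation.
# The exact computation uses the noncentral t distribution; this sketch
# assumes the true effect equals the observed one (t about 2.0).
from math import erf, sqrt

def phi(x):
    # Standard normal cumulative distribution function
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

t_observed = 2.00   # approximate observed t statistic for the classroom data
z_crit = 1.96       # two-sided critical value at alpha = 0.05

# Power: chance the statistic falls beyond the critical value when the
# sampling distribution is centered at the observed effect.
power = phi(t_observed - z_crit)
print(round(power, 2))  # about 0.5
```

An observed effect sitting right on the critical boundary always yields roughly 50% power under this approximation, which is exactly the situation described above.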
Tests with little statistical power are often inconsistent in the literature. Suppose, for example, that the first student from classroom B had earned a 7.52 instead of a 7.50. Due to the low power of the test, this little change would already be sufficient to render the test non-significant, and we would no longer be able to reject the null hypothesis that the two population means are equal. With such low power, we can't distinguish between a correct true negative and a Type II error. This is why underpowered tests can be misleading and should never be relied upon for decision making (Smith, 2011b).
The power of a test increases with the sample size. To obtain a power of at least 80%, let's see how many samples we should have collected:
// Create a post hoc analysis to determine the sample size
var analysis = new TwoSampleTTestPowerAnalysis(test)
{
    Power = 0.80
};

analysis.ComputeSamples();

Console.WriteLine("Samples in each group: "
    + Math.Ceiling(analysis.Samples1)); // gives 57
So we would actually need 57 students in each classroom to confidently affirm whether there was a difference or not. This is rather disappointing, since in the real world we wouldn't be able to enroll more students and wait years until we could perform another exam just to adjust the sample size. In those situations, the power of the test can be increased by raising the significance threshold (Thomas and Juanes, 1996), although this clearly comes at the cost of a higher risk of false positives (Type I errors).
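The 57-samples figure can be approximated with the classical textbook formula n ≈ (z_{1-a/2} + z_{1-b})² (s1² + s2²) / d² per group. The sketch below is plain Python, not the Accord API; the variances (about 0.77 and 0.71) and the mean difference (about 0.46) are computed from the classroom data listed earlier, and the z quantiles are the usual tabulated values:

```python
# Normal-approximation sample-size formula for a two-sample comparison.
# Accord's exact computation uses the noncentral t distribution, so this
# rough estimate lands close to, but not exactly at, the reported 57.
from math import ceil

z_alpha = 1.96    # two-sided alpha = 0.05
z_beta = 0.8416   # desired power = 0.80
s1_sq, s2_sq = 0.770, 0.715  # sample variances of classrooms A and B
delta = 0.457                # observed difference in means

n = (z_alpha + z_beta) ** 2 * (s1_sq + s2_sq) / delta ** 2
print(ceil(n))  # 56, close to the 57 from the exact computation
```

The normal approximation slightly understates the requirement because it ignores the extra uncertainty from estimating the variances, which is precisely what the noncentral t computation accounts for.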
My test turned out non-significant. Should I accept the null hypothesis?
The short answer is 'only if you have enough power'. Otherwise, definitely not.
If you have reason to believe the test you performed had enough power to detect a difference within the given Type II error rate, and it didn't detect one, then accepting the null would most likely be acceptable. The acceptance should also be accompanied by an analysis of confidence intervals or effect sizes. Consider, for example, that some actual scientific discoveries were indeed made by accepting the null hypothesis rather than by contradicting it; one notable example being the discovery of the X-ray (Yu, 2012).
Further criticism
Much of the criticism associated with statistical hypothesis testing is often related not to the use of statistical hypothesis testing per se, but to how the significance outcomes from such tests are interpreted. Often it boils down to incorrectly believing a p-value is the probability of the null hypothesis being true when, in fact, it is only the probability of obtaining a test statistic as extreme as the one calculated from the data, always under the assumptions of the test in question.
Moreover, another problem may arise when we choose a null hypothesis which is obviously false. There is no point in hypothesis testing when the null hypothesis couldn't possibly be true. For instance, it is very difficult to believe that a parameter of a continuous distribution is exactly equal to a hypothesized value, such as zero. Given enough samples, it will always be possible to find a difference, however small, and the test will invariably turn significant. Under those circumstances, statistical testing can be useless, as it bears no relationship to practical significance. That is why analyzing the effect size is important in order to determine the practical significance of a hypothesis test. Useful hypotheses also need to be probable, plausible and falsifiable (Beaulieu-Prévost, 2005).
The following links also summarize much of the criticism of statistical hypothesis testing. The last one includes very interesting (if not enlightening) remarks, in its comment section, on common criticisms of the hypothesis testing method.
 Rational Wiki, Problems with statistical significance
 Simply Statistics, P-values and hypothesis testing get a bad rap – but we sometimes find them useful
 John D. Cook, Five criticisms of significance testing
 John D. Cook, Significance testing and congress
The Accord.NET Framework
Now that we have presented the statistical hypothesis testing framework, and now that the reader is aware of its drawbacks, we can start talking about performing those tests with the aid of a computer. The Accord.NET Framework is a framework for scientific computing with wide support for statistical tests and power analyses, without entering into the merits of whether they are valid or not. In short, it provides some scissors; feel free to run with them.
Tests available in the framework (at the time of this writing)
As may already have been noticed, the sample code included in the previous sections was C# code using the framework. In the aforementioned example, we created a t-test for comparing the population means of two samples drawn from normal distributions. The framework, nevertheless, includes many other tests, some with support for power analysis. Those include:
Parametric tests  Nonparametric tests 

Tests marked with an asterisk (*) are available in versions for one and two samples. Tests on the second row can be used to test hypotheses about contingency tables. As a reminder, the framework is open source and all code is available on GitHub.
A class diagram for the hypothesis testing module is shown in the picture below. Click for a larger version.
Class diagram for the Accord.Statistics.Testing namespace.
Framework usage should be rather simple. To illustrate it, the next section presents some example problems and their solutions using the hypothesis testing module.
Example problems and solutions
Problem 1. Clairvoyant card game.
This is the second example from Wikipedia's page on hypothesis testing. In this example, a person is tested for clairvoyance (the ability to gain information about something through extrasensory perception; that is, detecting something without using the known human senses).
// A person is shown the reverse of a playing card 25 times and is
// asked which of the four suits the card belongs to. Every time
// the person correctly guesses the suit of the card, we count this
// result as a correct answer. Let's suppose the person obtained 13
// correct answers out of the 25 cards.

// Since each suit appears 1/4 of the time in the card deck, we
// would assume the probability of producing a correct answer by
// chance alone would be 1/4.

// Finally, we must consider that we are interested in whether the
// subject performs better than what would be expected by chance.
// In other words, that the person's probability of predicting
// a card is higher than the hypothesized chance value of 1/4.

BinomialTest test = new BinomialTest(
    successes: 13, trials: 25,
    hypothesizedProbability: 1.0 / 4.0,
    alternate: OneSampleHypothesis.ValueIsGreaterThanHypothesis);

Console.WriteLine("Test p-value: " + test.PValue);     // ~0.003
Console.WriteLine("Significant? " + test.Significant); // True
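The one-tailed binomial p-value in this example can be computed from first principles, since it is just the probability of seeing 13 or more correct guesses out of 25 under the chance probability of 1/4. The sketch below is plain Python, not the Accord API:

```python
# Exact upper-tail binomial probability: P(X >= 13) for X ~ Binomial(25, 1/4).
from math import comb

n, p, successes = 25, 0.25, 13

p_value = sum(comb(n, k) * p**k * (1 - p)**(n - k)
              for k in range(successes, n + 1))

print(round(p_value, 3))  # about 0.003, matching the reported result
```

Since 13 successes is more than three standard deviations above the chance expectation of 6.25, such a small tail probability is exactly what one should expect here.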
Problem 2. Worried insurance company
This is a common example with variations given by many sources. Some of them can be found here and here.
// An insurance company is reviewing its current policy rates. When the
// company initially set its rates, they believed the average claim amount
// was about $1,800. Now that the company is already up and running, the
// executive directors want to know whether the mean is greater than $1,800.

double hypothesizedMean = 1800;

// Now we have two hypotheses. The null hypothesis (H0) is that there is no
// difference between the initially set value of $1,800 and the average claim
// amount of the population. The alternate hypothesis is that the average
// is greater than $1,800.

// H0 : population mean ≤ $1,800
// H1 : population mean > $1,800

OneSampleHypothesis alternate =
    OneSampleHypothesis.ValueIsGreaterThanHypothesis;

// To verify those hypotheses, we draw 40 random claims and obtain a
// sample mean of $1,950. The standard deviation of claims is assumed
// to be around $500.

double sampleMean = 1950;
double standardDev = 500;
int sampleSize = 40;

// Let's create our test and check the results
ZTest test = new ZTest(sampleMean, standardDev,
    sampleSize, hypothesizedMean, alternate);

Console.WriteLine("Test p-value: " + test.PValue);     // ~0.03
Console.WriteLine("Significant? " + test.Significant); // True

// In case we would like more information about what was calculated:
Console.WriteLine("z statistic: " + test.Statistic);     // ~1.89736
Console.WriteLine("std. error:  " + test.StandardError); // 79.05694
Console.WriteLine("test tail:   " + test.Tail);          // Upper (right)
Console.WriteLine("alpha level: " + test.Size);          // 0.05
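The z-test numbers above follow directly from the textbook formulas, so they are easy to verify by hand. The sketch below is plain Python, not the Accord API, and reproduces the standard error, z statistic and one-tailed p-value:

```python
# The z-test arithmetic from the insurance example, done by hand:
# z = (sample mean - hypothesized mean) / (sigma / sqrt(n)),
# with an upper-tail p-value from the standard normal distribution.
from math import erf, sqrt

sample_mean, hyp_mean = 1950.0, 1800.0
std_dev, n = 500.0, 40

std_error = std_dev / sqrt(n)
z = (sample_mean - hyp_mean) / std_error
p_value = 1.0 - 0.5 * (1.0 + erf(z / sqrt(2.0)))  # upper-tail probability

print(round(std_error, 5))  # 79.05694
print(round(z, 4))          # 1.8974
print(round(p_value, 2))    # 0.03
```

These match the `test.StandardError`, `test.Statistic` and `test.PValue` values printed by the C# listing.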
Problem 3. Differences among multiple groups (ANOVA)
This example comes from Wikipedia's page on the F-test. Suppose we would like to study the effect of three different levels of a factor on a response (for example, three levels of a fertilizer on plant growth). We have made 6 observations for each of the three levels a1, a2 and a3, and have written the results in the table below.
double[][] outcomes = new double[,]
{
    // a1  a2  a3
    {   6,  8, 13 },
    {   8, 12,  9 },
    {   4,  9, 11 },
    {   5, 11,  8 },
    {   3,  6,  7 },
    {   4,  8, 12 },
}
.Transpose().ToArray();

// Now we can create an ANOVA for the outcomes
OneWayAnova anova = new OneWayAnova(outcomes);

// Retrieve the F-test
FTest test = anova.FTest;

Console.WriteLine("Test p-value: " + test.PValue);     // ~0.002
Console.WriteLine("Significant? " + test.Significant); // true

// Show the ANOVA table
DataGridBox.Show(anova.Table);
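The F statistic behind this ANOVA can be recomputed by hand from the same table: it is the ratio of the between-group to the within-group mean squares. The sketch below is plain Python, not the Accord API:

```python
# One-way ANOVA F statistic computed from first principles on the same data.
groups = [
    [6, 8, 4, 5, 3, 4],     # a1
    [8, 12, 9, 11, 6, 8],   # a2
    [13, 9, 11, 8, 7, 12],  # a3
]

total_n = sum(len(g) for g in groups)
grand_mean = sum(sum(g) for g in groups) / total_n
means = [sum(g) / len(g) for g in groups]

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

df_between = len(groups) - 1         # 2
df_within = total_n - len(groups)    # 15

F = (ss_between / df_between) / (ss_within / df_within)
print(round(F, 2))  # 9.26
```

An F of about 9.26 on (2, 15) degrees of freedom corresponds to the ~0.002 p-value printed by the C# listing.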
The last line in the example shows the ANOVA table using the framework’s DataGridBox object. The DataGridBox is a convenience class for displaying DataGridViews just as one would display a message using MessageBox.
The table is shown below:
Problem 4. Biased bees
This example comes from the statistics pages of the College of Saint Benedict and Saint John's University (Kirkman, 1996). It is a very interesting example, as it shows a case in which a t-test fails to see a difference between the samples because of the non-normality of the samples' distributions. The Kolmogorov-Smirnov nonparametric test, on the other hand, succeeds.
The example deals with the preference of bees between two nearby blooming trees in an empty field. The experimenter has collected data measuring how much time a bee spends near a particular tree. The time starts to be measured when a bee first touches the tree, and is stopped when the bee moves more than 1 meter away from it. The samples below represent the measured times, in seconds, of the observed bees for each of the trees.
double[] redwell =
{
    23.4, 30.9, 18.8, 23.0, 21.4, 1, 24.6, 23.8, 24.1, 18.7, 16.3, 20.3,
    14.9, 35.4, 21.6, 21.2, 21.0, 15.0, 15.6, 24.0, 34.6, 40.9, 30.7,
    24.5, 16.6, 1, 21.7, 1, 23.6, 1, 25.7, 19.3, 46.9, 23.3, 21.8, 33.3,
    24.9, 24.4, 1, 19.8, 17.2, 21.5, 25.5, 23.3, 18.6, 22.0, 29.8, 33.3,
    1, 21.3, 18.6, 26.8, 19.4, 21.1, 21.2, 20.5, 19.8, 26.3, 39.3, 21.4,
    22.6, 1, 35.3, 7.0, 19.3, 21.3, 10.1, 20.2, 1, 36.2, 16.7, 21.1,
    39.1, 19.9, 32.1, 23.1, 21.8, 30.4, 19.62, 15.5
};

double[] whitney =
{
    16.5, 1, 22.6, 25.3, 23.7, 1, 23.3, 23.9, 16.2, 23.0, 21.6, 10.8,
    12.2, 23.6, 10.1, 24.4, 16.4, 11.7, 17.7, 34.3, 24.3, 18.7, 27.5,
    25.8, 22.5, 14.2, 21.7, 1, 31.2, 13.8, 29.7, 23.1, 26.1, 25.1, 23.4,
    21.7, 24.4, 13.2, 22.1, 26.7, 22.7, 1, 18.2, 28.7, 29.1, 27.4, 22.3,
    13.2, 22.5, 25.0, 1, 6.6, 23.7, 23.5, 17.3, 24.6, 27.8, 29.7, 25.3,
    19.9, 18.2, 26.2, 20.4, 23.3, 26.7, 26.0, 1, 25.1, 33.1, 35.0, 25.3,
    23.6, 23.2, 20.2, 24.7, 22.6, 39.1, 26.5, 22.7
};

// Create a t-test as a first attempt
var t = new TwoSampleTTest(redwell, whitney);

Console.WriteLine("T-Test");
Console.WriteLine("Test p-value: " + t.PValue);     // ~0.837
Console.WriteLine("Significant? " + t.Significant); // false

// Create a nonparametric Kolmogorov-Smirnov test
var ks = new TwoSampleKolmogorovSmirnovTest(redwell, whitney);

Console.WriteLine("KS-Test");
Console.WriteLine("Test p-value: " + ks.PValue);     // ~0.038
Console.WriteLine("Significant? " + ks.Significant); // true
Problem 5. Comparing classifier performances
The last example comes from Ientilucci (2006) and deals with comparing the performance of two different raters (classifiers) to see whether their performances are significantly different.
Suppose an experimenter has two classification systems, both trained to classify observations into one of 4 mutually exclusive categories. In order to measure the performance of each classifier, the experimenter confronted their classification labels with the ground truth for a testing dataset, writing the respective results in the form of contingency tables.
The hypothesis to be tested is that the performance of the two classifiers are the same.
// Create the confusion matrix for the first system
var a = new GeneralConfusionMatrix(new int[,]
{
    { 317,  23,  0, 0 },
    {  61, 120,  0, 0 },
    {   2,   4, 60, 0 },
    {  35,  29,  0, 8 },
});

// Create the confusion matrix for the second system
var b = new GeneralConfusionMatrix(new int[,]
{
    { 377, 79,  0, 0 },
    {   2, 72,  0, 0 },
    {  33,  5, 60, 0 },
    {   3, 20,  0, 8 },
});

var test = new TwoMatrixKappaTest(a, b);

Console.WriteLine("Test p-value: " + test.PValue);     // ~0.628
Console.WriteLine("Significant? " + test.Significant); // false
In this case, the test didn't show enough evidence to confidently reject the null hypothesis. Therefore, one should refrain from affirming anything about differences between the two systems, unless the power of the test is known.
Unfortunately, I could not find a clear indication in the literature about the power of a two-matrix Kappa test. However, since the test statistic is asymptotically normal, one could try assessing the power of this test by analyzing the power of the underlying z-test. If there is enough power, one could possibly accept the null hypothesis that there are no large differences between the two systems.
Suggestions
As always, I hope the above discussion and examples are useful for interested readers and users. However, if you believe you have found a flaw, or would like to discuss any portion of this post, please feel free to do so by posting in the comments section.
PS: The classroom example uses a t-test to check for differences in population means. The t-test assumes a normal distribution. The data, however, is not exactly normal, since it is bounded between 0 and 10. Suggestions for a better example would also be appreciated!
References
R. A. Fisher, 1925. Statistical Methods for Research Workers. Available online from:
http://psychclassics.yorku.ca/Fisher/Methods/
M. K. Smith, 2011. Common mistakes in using statistics: Spotting and Avoiding Them – Power of a
Statistical Procedure. Available online from:
http://www.ma.utexas.edu/users/mks/statmistakes/power.html
M. K. Smith, 2011b. Common mistakes in using statistics: Spotting and Avoiding Them – Detrimental
Effects of Underpowered or Overpowered Studies. Available online from:
http://www.ma.utexas.edu/users/mks/statmistakes/UnderOverPower.html
L. Thomas, F. Juanes, 1996. The importance of statistical power analysis: an example from animal behaviour. Animal Behaviour, Volume 52, Issue 4, October, pages 856-859. Available online from: http://otg.downstate.edu/downloads/2007/Spring07/thomas.pdf
C. H. (Alex) Yu, 2012. Don’t believe in the null hypothesis? Available online from:
http://www.creativewisdom.com/computer/sas/hypothesis.html
D. Beaulieu-Prévost, 2005. Statistical decision and falsification in science: going beyond the null hypothesis. Séminaires CIC. Université de Montréal.
T. W. Kirkman, 1996. Statistics to Use. Accessed July 2012. Available online from:
http://www.physics.csbsju.edu/stats/
E. Ientilucci, 2006. “On Using and Computing the Kappa Statistic”. Available online from http://www.cis.rit.edu/~ejipci/Reports/On_Using_and_Computing_the_Kappa_Statistic.pdf
Accord.NET has a new home!
As some might have noticed, Origo is going away by the end of the month. The Origo maintainers were kind enough to provide previous project owners, upon request, with a tarball containing the entire contents of their hosted projects; this means all Accord.NET content previously hosted on Origo can be gradually restored at Google Code over the next months.
The Accord.NET Framework is now at http://code.google.com/p/accord/
With the hosting change there also come a few updates. Version 2.7 is about to be released; this version includes the highly requested Bag of Visual Words (BoW) based on the SURF feature point detector. This feature can be used to extract fixed-length feature vectors from images and train standard methods such as support vector machines to perform image classification. Initial experiments with the χ² (chi-square) kernel have shown promising results.
The Accord.NET Framework project page, now hosted at Google Code. 
Other hopefully useful features include new learning algorithms for hidden Markov models, such as Viterbi training and Maximum Likelihood Estimation training, and the introduction of the ISampleable interface for specifying distributions which can be sampled efficiently (i.e. it is now possible to draw random samples from some distributions, including hidden Markov models). Another nice addition is the ScatterplotBox class for displaying data points in a scatter plot, in a similar way to MessageBox.Show. It comes in handy to quickly inspect some spatial data in the middle of other calculations, or even in live demonstrations.
As others might also have noticed, this site’s design also has changed, hopefully for the better.
As usual, I hope someone will find the new release useful 🙂
Quadratic Programming in C#
I have manually translated and adapted the QuadProg solver for quadratic programming problems made by Berwin A. Turlach. His code was originally published under the GNU Library General Public License, which has since been superseded by the GNU Lesser General Public License. This adapted version honors the original work and is thus distributed under the same license.
 Download source code
 Download sample application
The code in this article is part of the Accord.NET Framework. Download the full framework through NuGet and use this code in your own applications.
Introduction
Despite the name, the terms linear and quadratic programming bear little resemblance to the set of activities most people now know as programming. These terms usually refer to a specific set of function optimization methods, i.e. methods which can be used to determine the maximum or minimum points of special kinds of functions under a given number of solution constraints. For example, suppose we would like to determine the minimum value of the function:
f(x, y) = 2x + y + 4
under the constraints that x and y must be non-negative (i.e. either positive or zero). This may seem fairly simple and trivial, but remember that practical linear programming problems may have hundreds or even thousands of variables and possibly millions of constraints.
When the problem to be solved involves a quadratic function instead of a linear function, but still presents linear constraints, it can be cast as a quadratic programming problem. Quadratic functions are polynomial functions in which each term may have at most a total degree of 2. For example, consider the function
f(x, y, z) = 2x² + 5xy + y² – z² + x – 5.
Now let's check the sum of the degrees of the variables in each polynomial term. We start by writing out the missing factors of the polynomial:
f(x, y, z) = 2x²y⁰z⁰ + 5x¹y¹z⁰ + x⁰y²z⁰ − x⁰y⁰z² + x¹y⁰z⁰ − 5x⁰y⁰z⁰
and then proceed to check the sum of the degrees in each term. In the first term, 2 + 0 + 0 = 2. For the second, 1 + 1 + 0 = 2, and so on. These functions have the nice property that they can be expressed in matrix form:
f(x) = 1/2 xᵀQx + cᵀx.
Here, x and c are vectors. The matrix Q is a symmetric matrix specifying how the variables combine in the quadratic terms of the function. If this matrix is positive definite, then the function is convex, and the optimization has a single, unique optimum (we say it has a global optimum). The coefficients c specify the linear terms of our function.
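To make the matrix form concrete, the three-variable example above can be checked numerically. The sketch below is plain Python (not the framework's C# API); the Q and c values are derived by hand from the polynomial, with Q being the matrix of second derivatives, and the constant term −5 is dropped since it does not affect the optimization:

```python
# Check that 1/2 x'Qx + c'x reproduces f(x, y, z) = 2x² + 5xy + y² - z² + x - 5
# up to the dropped constant -5. Q and c here are hand-derived for this example.
Q = [[4, 5,  0],   # d²f/dx² = 4, d²f/dxdy = 5, no x-z cross term
     [5, 2,  0],   # d²f/dy² = 2
     [0, 0, -2]]   # d²f/dz² = -2
c = [1, 0, 0]      # linear terms: +x only

def quadratic_form(v):
    quad = 0.5 * sum(v[i] * Q[i][j] * v[j]
                     for i in range(3) for j in range(3))
    return quad + sum(c[i] * v[i] for i in range(3))

def f(x, y, z):
    return 2*x*x + 5*x*y + y*y - z*z + x - 5

# At any point the two differ exactly by the dropped constant, i.e. by 5:
x, y, z = 1.5, -2.0, 3.0
print(quadratic_form([x, y, z]) - f(x, y, z))  # 5.0
```

Note how the off-diagonal entries of Q split the 5xy coefficient symmetrically, and the diagonal entries are twice the coefficients of x², y² and z², which is exactly the "multiply the diagonal by two" rule discussed later in this article.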
Source code
The available source code is based on a translation of the Fortran code written by Berwin A. Turlach. However, some modifications have been made. Fortran uses column-major ordering for matrices, meaning that matrices are stored in memory in sequential order of column elements. Almost all other languages use row-major ordering, including C, C++, Java and C#. In order to improve data locality, I have modified the code to use the transpose of the original matrices D and A. I have also modified the QP formulation adopted in the Goldfarb and Idnani paper to reflect the form presented in the introduction.
In this article, we will be using the GoldfarbIdnani class from the Accord.NET Framework, as well as the QuadraticObjectiveFunction. Their source code is available in GoldfarbIdnani.cs and QuadraticObjectiveFunction.cs, respectively. Please note that the sample application available in this blog post will most likely not be the most recent, updated, fixed and enhanced version of the code. For the latest version, be sure to download the latest version of the framework from the project site or through a NuGet package.
Using the code
The first step in solving a quadratic programming problem is, well, specifying the problem. To specify a quadratic programming problem, one would need two components: a matrix D describing the relationship between the quadratic terms, and a vector d describing the linear terms. Perhaps this would work better with an example.
Suppose we are trying to solve a minimization problem. Given a function, the goal in such problems is to find the correct set of function arguments which would result in the minimum possible value for the function. An example of a quadratic minimization problem is given below:
min f(x, y) = 2x² − xy + 4y² − 5x − 6y

subject to the constraints:

x − y = 5
x ≥ 10

(generated with Wolfram Alpha)
However, note that this problem involves a set of constraints. The solution to this minimization problem is required to lie in the region specified by the constraints. More specifically, any candidate pair (x, y) for the minimum of the function must satisfy the relations x − y = 5 and x ≥ 10. Thus, instead of lying at the unconstrained minimum of the function surface shown above, the solution lies slightly off the center of the surface. This is an obviously easy problem to solve manually, but it will serve well for this demonstration.
As can be seen (and also demonstrated live by asking Wolfram Alpha), the solution lies at the point (10, 5), and the constrained minimum of the function is 170. So, now that we know what a quadratic programming problem looks like, how can we actually solve it?
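Before moving on, the claimed solution can be sanity-checked with a few lines of plain Python (not the solver itself): substituting the equality constraint y = x − 5 into the objective reduces the problem to a single variable, whose unconstrained minimum falls below the bound x ≥ 10, so the constrained minimum must sit on that boundary.

```python
# Sanity check of the stated solution (10, 5) with value 170.
def f(x, y):
    # The objective: 2x² - xy + 4y² - 5x - 6y
    return 2*x*x - x*y + 4*y*y - 5*x - 6*y

def g(x):
    # Objective restricted to the line y = x - 5; expands to 5x² - 46x + 130
    return f(x, x - 5)

# g is a parabola with vertex at x = 46/10 = 4.6, which violates x >= 10,
# so along the feasible set g is increasing and the minimum is at x = 10:
print(f(10, 5))  # 170
print(min(g(10 + k / 10) for k in range(100)))  # also 170, at the boundary
```

This matches the value reported by Wolfram Alpha, and is the number the Goldfarb-Idnani solver should reproduce below.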
Specifying the objective function
The first step in solving a quadratic programming problem is to specify the objective function. Using this code, there are three ways to specify it. Each of them has their own advantages and disadvantages.
1. Manually specifying the QP matrix.
This is the most common approach in numerical software, and probably the most cumbersome for the user. The problem matrix has to be specified manually. This matrix is sometimes denoted Q, D or H, as it is actually the Hessian matrix of the problem.
The matrix Q describes the quadratic terms of our problem. It is an n × n matrix, in which n corresponds to the number of variables in our problem, covering all possible combinations of variables. Recall the example given at the start of this section. We have two variables, x and y, so our matrix Q is 2 × 2. The possible combinations of x and y are expressed in the table below.
x  y  
x  x*x  x*y 
y  y*x  y*y 
To form our matrix Q, we take the coefficient associated with each pair in the table above. The diagonal elements should also be multiplied by two (this is because the matrix is actually the Hessian matrix of the problem: the matrix of all second-order derivatives of the function. Since we have at most quadratic terms, the elementary power rule of differentiation “drops” the ² from the x² and y² terms – I think a mathematician would hit me with a stick for explaining it like this, but it serves well as a quick, non-technical explanation).
Remember our quadratic terms were 2x² – 1xy + 4y². Writing the terms in their proper positions and differentiating, we obtain the matrix Q = [ +4 –1 ; –1 +8 ].
As can be seen, the matrix is symmetric (and often, but not always, positive definite). The next, more trivial step is to write a vector d containing the linear terms. The linear terms are –5x – 6y, and thus our vector d is given by d = [ –5 –6 ].
Therefore our C# code can be created like this:
double[,] Q =
{
    { +4, -1 },
    { -1, +8 },
};

double[] d = { -5, -6 };
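To double-check that Q and d encode the intended function, one can evaluate ½·xᵀQx + dᵀx at a test point and compare it against the original expression. The sketch below is a quick sanity check in plain C# (not part of the framework API; the class and method names are made up for this example):

```csharp
using System;

static class QpSanityCheck
{
    static readonly double[,] Q = { { +4, -1 }, { -1, +8 } };
    static readonly double[] d = { -5, -6 };

    // f(x) = 0.5 * x'Qx + d'x, computed from the matrices
    public static double FromMatrices(double[] p)
    {
        double sum = 0;
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++)
                sum += 0.5 * p[i] * Q[i, j] * p[j];
        return sum + d[0] * p[0] + d[1] * p[1];
    }

    // f(x, y) = 2x² - xy + 4y² - 5x - 6y, written out directly
    public static double Direct(double x, double y)
    {
        return 2 * x * x - x * y + 4 * y * y - 5 * x - 6 * y;
    }

    static void Main()
    {
        // Both forms agree at any test point, e.g. (3, 2):
        Console.WriteLine(FromMatrices(new double[] { 3, 2 })); // 1
        Console.WriteLine(Direct(3, 2));                        // 1
    }
}
```

If the two evaluations disagree, a coefficient (most likely a forgotten factor of two on the diagonal) was entered incorrectly.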
2. Using lambda expressions
This approach is a bit more intuitive and less error-prone. However, it involves lambda functions, and some people find them hard to follow. Another disadvantage is that we lose Visual Studio’s edit-and-continue debugging ability. The advantage is that the compiler may catch some obvious syntax errors automatically. Below we use the QuadraticObjectiveFunction from Accord.NET to specify our target objective function.
double x = 0, y = 0;

QuadraticObjectiveFunction f = // 2x² - xy + 4y² - 5x - 6y
    new QuadraticObjectiveFunction(() => 2*(x*x) - (x*y) + 4*(y*y) - 5*x - 6*y);
Note that the x and y variables could have been initialized to any value. They are used only as symbols, and do not take part in any computation.
3. Using text strings
This approach is the most intuitive, but also a bit more error-prone. The function can be specified as a string, as in a standard mathematical formula. However, since all we have are strings, there is no way to enforce static, compile-time checking.
QuadraticObjectiveFunction f =
    new QuadraticObjectiveFunction("2x² - xy + 4y² - 5x - 6y");
Couldn’t be easier.
Specifying the constraints
The next step in specifying a quadratic programming problem is to specify the constraints. The constraints can be specified in almost the same way as the objective function.
1. Manually specifying the constraints matrix
The first option is to manually specify the constraint matrix A and vector b. The constraint matrix expresses how the variables should be combined when compared to the corresponding value in vector b. It is possible to specify either equality or inequality constraints. The formulation used in this code is slightly different from the one used in Turlach’s original implementation. The constraints are assumed to be in the form:
A_{1} x = b_{1}
A_{2} x ≥ b_{2}
This means that each row of matrix A expresses a possible combination of the variables x which should be compared to the corresponding entry of b. An integer variable m specifies how many of the first rows of A should be treated as equalities rather than inequalities. Recall that in our example the constraints are given by 1x – 1y = 5 and 1x ≥ 10. Let’s write this down in tabular form:
#      x   y   ?   b
q_{1}  1  -1   =   5
q_{2}  1   0   ≥  10
Thus our matrix A and vector b can be specified as:
And we must not forget that m = 1, because the first constraint is an equality.
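In code, the table above translates directly into the matrix A, the vector b and the integer m. The sketch below shows these plain arrays together with a small sanity check against the known solution (the check itself is just illustrative, not part of the framework):

```csharp
using System;

static class ConstraintSetup
{
    // Constraint matrix A and right-hand side b, following the table above;
    // the first m rows of A are equalities, the remaining rows are >= inequalities.
    public static readonly double[,] A =
    {
        { 1, -1 },  // 1x - 1y  =  5  (equality)
        { 1,  0 },  // 1x + 0y >= 10  (inequality)
    };

    public static readonly double[] b = { 5, 10 };
    public static readonly int m = 1;

    // Evaluates row i of A at the point p
    public static double Row(int i, double[] p)
        => A[i, 0] * p[0] + A[i, 1] * p[1];

    static void Main()
    {
        // Sanity check: the known solution (x, y) = (10, 5) satisfies both rows
        double[] solution = { 10, 5 };
        Console.WriteLine(Row(0, solution) == b[0]);  // True:  10 - 5  =  5
        Console.WriteLine(Row(1, solution) >= b[1]);  // True:  10 + 0 >= 10
    }
}
```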
2. Using classes and objects
A more natural way to specify constraints is using the classes and objects of the Accord.NET Framework. The LinearConstraint class allows one to specify a single constraint using an object-oriented approach. It doesn’t have the most intuitive usage on Earth, but it is much more expressive. It can also be read aloud, if that adds anything! 🙂
List<LinearConstraint> list = new List<LinearConstraint>();

list.Add(new LinearConstraint(numberOfVariables: 1)
{
    VariablesAtIndices = new[] { 0 }, // index 0 (x)
    ShouldBe = ConstraintType.GreaterThanOrEqualTo,
    Value = 10
});

list.Add(new LinearConstraint(numberOfVariables: 2)
{
    VariablesAtIndices = new int[] { 0, 1 }, // index 0 (x) and index 1 (y)
    CombinedAs = new double[] { 1, -1 },     // when combined as 1x - 1y
    ShouldBe = ConstraintType.EqualTo,
    Value = 5
});
The specification is centered around the notion that variables are numbered and have an associated index. For example, x is the zeroth variable of the problem. Thus x has an index of 0 and y has an index of 1. So for example, reading aloud the last constraint, it is possible to express how the variables at indices 0 and 1, when combined as 1x and –1y, should be equal to value 5.
3. Using lambda expressions
A more intuitive way to express constraints is again through lambda expressions. And again the drawbacks are the same: some people find lambdas hard to follow, and we lose edit-and-continue.
var constraints = new List<LinearConstraint>();
constraints.Add(new LinearConstraint(f, () => x - y == 5));
constraints.Add(new LinearConstraint(f, () => x >= 10));
4. Using text strings
Same as above, but with strings.
var constraints = new List<LinearConstraint>();
constraints.Add(new LinearConstraint(f, "x - y = 5"));
constraints.Add(new LinearConstraint(f, "x >= 10"));
Finally, creating and solving the problem
Once we have specified what we want, we can ask the code for a solution. In case we have opted for manually specifying the objective function through Q and d, and the constraints through matrix A, vector b and integer m, we can use:
// Create the optimization problem
var solver = new GoldfarbIdnani(Q, d, A, b, m);
In case we have opted for creating a QuadraticObjectiveFunction object and a list of constraints instead, we can use:
// Create our optimization problem
var solver = new GoldfarbIdnani(function, constraints: list);
After the solver object has been created, we can call Minimize() to solve the problem:
// Attempt to solve the problem
bool success = solver.Minimize();
The solution will be available in the Solution property of the solver object, and will be given by:
double value = solver.Value;    // f(x,y) = 170
double x = solver.Solution[0];  // x = 10
double y = solver.Solution[1];  // y = 5
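As a quick independent check of this answer (plain C#, no solver involved): substituting the equality x – y = 5 reduces the problem to one variable, and the bound x ≥ 10 turns out to be active, confirming the reported optimum:

```csharp
using System;

static class SolutionCheck
{
    // The objective function, written out directly
    public static double F(double x, double y)
        => 2 * x * x - x * y + 4 * y * y - 5 * x - 6 * y;

    static void Main()
    {
        // Along the equality constraint y = x - 5, f becomes the 1-D quadratic
        // g(x) = f(x, x - 5) = 5x² - 46x + 130, minimized at x = 4.6.
        // Since 4.6 violates x >= 10, that bound is active, so the constrained
        // optimum sits exactly at x = 10, y = 5.
        double x = 10, y = x - 5;

        Console.WriteLine(F(x, y)); // 170

        // Moving along the feasible ray (larger x) only increases f:
        Console.WriteLine(F(10.1, 10.1 - 5) > F(x, y)); // True
    }
}
```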
Sample application
The Accord.NET Framework now includes a sample application demonstrating the use of the GoldfarbIdnani quadratic programming solver. It can be downloaded at the Accord.NET Framework site, and also comes together with recent versions of the framework (2.6 and later).
Solver sample application included in the Accord.NET Framework.
Remarks
Because the code has been translated by hand (in contrast to using automatic translators such as f2c), there could be potential bugs in it. I have tested the code’s behavior against R’s quadprog package and so far have not found errors. But this does not mean the code is bug-free. As always, as with everything else on this blog, this code is published in good faith, but I cannot guarantee the correctness of everything. Please read the disclaimer for more information.
References
 B. A. Turlach. QuadProg: Quadratic Programming Routines (1998). Website, available at http://school.maths.uwa.edu.au/~berwin/software/quadprog.html.
 D. Goldfarb and A. Idnani. Dual and Primal-Dual Methods for Solving Strictly Convex Quadratic Programs (1982). In J. P. Hennart (ed.), Numerical Analysis, Springer-Verlag, Berlin, pages 226–239.
 D. Goldfarb and A. Idnani. A numerically stable dual method for solving strictly convex quadratic programs (1983). Mathematical Programming, 27, 1–33.
 Wikipedia contributors, Quadratic programming. In Wikipedia, The Free Encyclopedia. Available on: http://en.wikipedia.org/wiki/Quadratic_programming.
 Wikipedia contributors, Quadratic form. In Wikipedia, The Free Encyclopedia. Available on: http://en.wikipedia.org/wiki/Quadratic_form.
 Wikipedia contributors, Linear programming. In Wikipedia, The Free Encyclopedia. Available on: http://en.wikipedia.org/wiki/Linear_programming.
 Wikipedia contributors, Mathematical optimization. In Wikipedia, The Free Encyclopedia. Available on: http://en.wikipedia.org/wiki/Mathematical_optimization.
 Wolfram Alpha LLC. WolframAlpha, 2012
Limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) for Unconstrained Optimization in C#
The Limited-memory Broyden–Fletcher–Goldfarb–Shanno method is an optimization method belonging to the family of quasi-Newton methods for unconstrained nonlinear optimization. In short, it is an off-the-shelf optimizer for seeking minimum or maximum points of any differentiable, possibly nonlinear function, requiring only an expression for the function and its gradient. The goal of this article is to provide and demonstrate a C# port of the L-BFGS method originally written by Jorge Nocedal in Fortran.
 Download source code.
 Download the full Accord.NET Framework.
Introduction
Function optimization is a common problem found in many numerical applications. Suppose we have a differentiable function f : Rⁿ → R and we would like to obtain its minimal or maximal value while traversing its space of possible input values. These kinds of problems arise often in parameter optimization of machine learning models and other statistics-related applications (and in a myriad of other applications too – however, we will pay special attention to cases related to machine learning).
The problem of maximizing or minimizing a function can simply be stated as max f(x) or min f(x). When there are no constraints limiting the possible values of x, it is widely known from calculus that a maximum or minimum of such a function occurs where the first derivative f’(x) of the function f(x) with respect to x is zero. Newton’s method is a common method for finding the roots of a differentiable function, and is indeed a suitable method for finding those values at which the first derivatives of a function are zero.
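To make this concrete, here is a one-dimensional illustration (not part of the framework; the function is made up for this sketch). Newton's iteration x ← x − f’(x)/f’’(x) applied to f(x) = x⁴ − 3x³ + 2 finds the stationary point where f’(x) = 4x³ − 9x² vanishes:

```csharp
using System;

static class NewtonMinimize
{
    // f(x) = x^4 - 3x^3 + 2; a stationary point is where f'(x) = 0
    static double FPrime(double x)  => 4 * x * x * x - 9 * x * x;  // f'(x)
    static double FSecond(double x) => 12 * x * x - 18 * x;        // f''(x)

    // Newton's iteration x <- x - f'(x)/f''(x), started from x0
    public static double Solve(double x0)
    {
        double x = x0;
        for (int i = 0; i < 100; i++)
        {
            double step = FPrime(x) / FSecond(x);
            x -= step;
            if (Math.Abs(step) < 1e-12)
                break;
        }
        return x;
    }

    static void Main()
    {
        // Starting from x = 6, the iteration converges to x = 2.25,
        // where f'(x) = 4x³ - 9x² = x²(4x - 9) = 0
        Console.WriteLine(NewtonMinimize.Solve(6.0));
    }
}
```

Note that each step uses the second derivative; it is exactly this ingredient that becomes problematic in higher dimensions, as discussed next.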
Example of a convex, but nonlinear function f(x, y) = –exp{–(x–1)²} – exp{–(y–2)²/2}. Images have been created using Wolfram Alpha.
However, while Newton’s method may work very well with functions of a single variable, the generalization to higher dimensions requires replacing the first derivative f’(x) with the function’s gradient vector g of partial derivatives, and the second derivative f’’(x) with the inverse Hessian matrix H⁻¹. The problem is that the Hessian is a square matrix with as many rows and columns as parameters of our function f, and, besides being very costly and challenging to compute, even its size may be infeasible to accommodate in main memory when dealing with very high-dimensional problems.
To overcome those limitations, several methods attempt to replace the Hessian matrix with an approximation extracted from the first partial derivatives alone. Those methods are called quasi-Newton methods, as they replace H in Newton’s method by an approximation instead of using the full Hessian. In case the function being optimized has a particular form or special characteristic which can be exploited, it is also possible to use more specialized methods such as Gauss-Newton, Levenberg-Marquardt, or even lower-bound approximation methods to avoid computing the full Hessian.
The quasi-Newton BFGS method for nonlinear optimization builds an approximation of the Hessian matrix based on estimates extracted solely from first-order information. However, even though it is cheaper to compute, its memory requirements are still high, as the method needs to accommodate a dense N × N matrix for the inverse Hessian approximation (where N is the number of variables of the function). To overcome this problem, the L-BFGS method, the limited-memory version of BFGS, never forms or stores the approximation matrix, but only a few vectors which represent it implicitly.
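The core of the limited-memory idea is the so-called two-loop recursion: the search direction −H·g is rebuilt on the fly from the last m pairs (sᵢ, yᵢ) of step and gradient differences, so the matrix itself is never formed. The sketch below is a simplified, list-based managed variant of that recursion (the pointer-based production code appears further down; names here are illustrative):

```csharp
using System;
using System.Collections.Generic;

static class TwoLoopRecursion
{
    // Computes the L-BFGS direction -H*g from the stored (s, y) pairs,
    // using H0 = gamma * I as the initial Hessian approximation.
    public static double[] Direction(double[] g,
        IList<double[]> s, IList<double[]> y, double gamma)
    {
        int m = s.Count, n = g.Length;
        double[] q = (double[])g.Clone();
        double[] alpha = new double[m];

        // First loop: newest to oldest correction pairs
        for (int i = m - 1; i >= 0; i--)
        {
            double rho = 1.0 / Dot(y[i], s[i]);
            alpha[i] = rho * Dot(s[i], q);
            for (int j = 0; j < n; j++)
                q[j] -= alpha[i] * y[i][j];
        }

        // Scale by the initial Hessian approximation H0 = gamma * I
        for (int j = 0; j < n; j++)
            q[j] *= gamma;

        // Second loop: oldest to newest
        for (int i = 0; i < m; i++)
        {
            double rho = 1.0 / Dot(y[i], s[i]);
            double beta = rho * Dot(y[i], q);
            for (int j = 0; j < n; j++)
                q[j] += (alpha[i] - beta) * s[i][j];
        }

        // Negate to obtain a descent direction
        for (int j = 0; j < n; j++)
            q[j] = -q[j];
        return q;
    }

    static double Dot(double[] a, double[] b)
    {
        double sum = 0;
        for (int i = 0; i < a.Length; i++)
            sum += a[i] * b[i];
        return sum;
    }
}
```

With an empty history the recursion degenerates to steepest descent scaled by gamma, which is exactly how the first iteration of the method behaves.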
Source code
The code presented here is based upon a direct translation of the original code by Jorge Nocedal. His original code was released under the permissive BSD license, and honoring the original license and the author’s considerations, this code is released under the BSD as well.
The L-BFGS method is implemented in the BroydenFletcherGoldfarbShanno class. Upon object creation, it is possible to specify the function and its gradient either through lambda expressions or through handles to named functions. After the object has been initialized, a simple call to Minimize runs the standard L-BFGS method until convergence. The details of the Minimize method are given below.
/// <summary>
///   Optimizes the defined function.
/// </summary>
///
/// <param name="values">The initial guess values for the parameters.</param>
/// <returns>The values of the parameters which optimizes the function.</returns>
///
public unsafe double Minimize(double[] values)
{
    if (values == null)
        throw new ArgumentNullException("values");

    if (values.Length != numberOfVariables)
        throw new DimensionMismatchException("values");

    if (Function == null) throw new InvalidOperationException(
        "The function to be minimized has not been defined.");

    if (Gradient == null) throw new InvalidOperationException(
        "The gradient function has not been defined.");

    // Initialization
    x = (double[])values.Clone();
    int n = numberOfVariables, m = corrections;

    // Make initial evaluation
    f = getFunction(x);
    g = getGradient(x);

    this.iterations = 0;
    this.evaluations = 1;

    // Obtain initial Hessian
    double[] diagonal = null;

    if (Diagonal != null)
    {
        diagonal = getDiagonal();
    }
    else
    {
        diagonal = new double[n];
        for (int i = 0; i < diagonal.Length; i++)
            diagonal[i] = 1.0;
    }

    fixed (double* w = work)
    {
        // The first N locations of the work vector are used to
        // store the gradient and other temporary information.

        double* rho = &w[n];                   // Stores the scalars rho.
        double* alpha = &w[n + m];             // Stores the alphas in the computation of -H*g.
        double* steps = &w[n + 2 * m];         // Stores the last M search steps.
        double* delta = &w[n + 2 * m + n * m]; // Stores the last M gradient differences.

        // Initialize work vector
        for (int i = 0; i < g.Length; i++)
            steps[i] = -g[i] * diagonal[i];

        // Initialize statistics
        double gnorm = Norm.Euclidean(g);
        double xnorm = Norm.Euclidean(x);
        double stp = 1.0 / gnorm;
        double stp1 = stp;

        // Initialize loop
        int nfev, point = 0;
        int npt = 0, cp = 0;
        bool finish = false;

        // Make initial progress report with initialization parameters
        if (Progress != null) Progress(this, new OptimizationProgressEventArgs
            (iterations, evaluations, g, gnorm, x, xnorm, f, stp, finish));

        // Start main
        while (!finish)
        {
            iterations++;
            double bound = iterations - 1;

            if (iterations != 1)
            {
                if (iterations > m)
                    bound = m;

                double ys = 0;
                for (int i = 0; i < n; i++)
                    ys += delta[npt + i] * steps[npt + i];

                // Compute the diagonal of the Hessian
                // or use an approximation by the user.
                if (Diagonal != null)
                {
                    diagonal = getDiagonal();
                }
                else
                {
                    double yy = 0;
                    for (int i = 0; i < n; i++)
                        yy += delta[npt + i] * delta[npt + i];

                    double d = ys / yy;
                    for (int i = 0; i < n; i++)
                        diagonal[i] = d;
                }

                // Compute -H*g using the formula given in:
                //   Nocedal, J. 1980, "Updating quasi-Newton matrices with limited storage",
                //   Mathematics of Computation, Vol. 24, No. 151, pp. 773-782.
                cp = (point == 0) ? m : point;
                rho[cp - 1] = 1.0 / ys;

                for (int i = 0; i < n; i++)
                    w[i] = -g[i];

                cp = point;
                for (int i = 1; i <= bound; i += 1)
                {
                    if (--cp == -1)
                        cp = m - 1;

                    double sq = 0;
                    for (int j = 0; j < n; j++)
                        sq += steps[cp * n + j] * w[j];

                    double beta = alpha[cp] = rho[cp] * sq;
                    for (int j = 0; j < n; j++)
                        w[j] -= beta * delta[cp * n + j];
                }

                for (int i = 0; i < diagonal.Length; i++)
                    w[i] *= diagonal[i];

                for (int i = 1; i <= bound; i += 1)
                {
                    double yr = 0;
                    for (int j = 0; j < n; j++)
                        yr += delta[cp * n + j] * w[j];

                    double beta = alpha[cp] - rho[cp] * yr;
                    for (int j = 0; j < n; j++)
                        w[j] += beta * steps[cp * n + j];

                    if (++cp == m)
                        cp = 0;
                }

                npt = point * n;

                // Store the search direction
                for (int i = 0; i < n; i++)
                    steps[npt + i] = w[i];

                stp = 1;
            }

            // Save original gradient
            for (int i = 0; i < g.Length; i++)
                w[i] = g[i];

            // Obtain the one-dimensional minimizer of f by computing a line search
            mcsrch(x, ref f, ref g, &steps[point * n], ref stp, out nfev, diagonal);

            // Register evaluations
            evaluations += nfev;

            // Compute the new step and
            // new gradient differences
            for (int i = 0; i < g.Length; i++)
            {
                steps[npt + i] *= stp;
                delta[npt + i] = g[i] - w[i];
            }

            if (++point == m)
                point = 0;

            // Check for termination
            gnorm = Norm.Euclidean(g);
            xnorm = Norm.Euclidean(x);
            xnorm = Math.Max(1.0, xnorm);

            if (gnorm / xnorm <= tolerance)
                finish = true;

            if (Progress != null) Progress(this, new OptimizationProgressEventArgs
                (iterations, evaluations, g, gnorm, x, xnorm, f, stp, finish));
        }
    }

    return f; // return the minimum value found (at solution x)
}
The version of the code detailed here only supports minimization problems. However, any minimization problem can be converted into a maximization problem (and vice versa) by taking the opposite of the function and of its gradient.
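In practice this conversion is a one-line wrapper around the function and its gradient. The toy function below is made up just to demonstrate the sign flip:

```csharp
using System;

static class MaximizeViaMinimize
{
    // Original maximization target and its gradient
    public static double G(double[] x) => -(x[0] - 1) * (x[0] - 1) + 3;
    public static double[] GGrad(double[] x) => new double[] { -2 * (x[0] - 1) };

    // Equivalent minimization problem: negate both the function and the gradient
    public static double F(double[] x) => -G(x);
    public static double[] FGrad(double[] x)
    {
        double[] grad = GGrad(x);
        for (int i = 0; i < grad.Length; i++)
            grad[i] = -grad[i];
        return grad;
    }

    static void Main()
    {
        // The maximum of G and the minimum of F occur at the same point x = 1,
        // with opposite function values:
        Console.WriteLine(G(new double[] { 1 }));  // 3
        Console.WriteLine(F(new double[] { 1 })); // -3
    }
}
```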
Using the code
Code usage is rather simple. Suppose we would like to maximize the function g(x, y) = exp{–(x–1)²} + exp{–(y–2)²/2}:
Since we would like to perform a maximization, we first have to convert it into a minimization problem. The minimization version of the function is simply given by taking f(x, y) = –g(x, y):
This is a convex function, as can be seen by plotting its surface. The minimum of the function lies at the point (x, y) = (1, 2). As expected, this point coincides with the roots of the partial derivative functions, as shown in the line plots below:
The minimization problem min f(x, y) = –(exp(–(x–1)²) + exp(–(y–2)²/2)), computed by Wolfram Alpha.
The roots of the partial derivatives with respect to x (left) and y (right). It is possible to see that the roots occur at 1 and 2, respectively. Images computed using Wolfram Alpha.
The first step in solving this optimization problem automatically is to specify the target function. The target function f, in this case, can be specified through a lambda function of the form:
Func<double[], double> f = (x) =>
    -Math.Exp(-Math.Pow(x[0] - 1, 2)) - Math.Exp(-0.5 * Math.Pow(x[1] - 2, 2));
The next step is to specify the gradient g of the function f. In this particular case, we can compute the partial derivatives manually: df/dx = 2 e^(–(x–1)²) (x–1) and df/dy = e^(–½ (y–2)²) (y–2), with respect to x and y, respectively. Writing a lambda function to compute the gradient vector g, we have:
Func<double[], double[]> g = (x) => new double[]
{
    2 * Math.Exp(-Math.Pow(x[0] - 1, 2)) * (x[0] - 1),
    Math.Exp(-0.5 * Math.Pow(x[1] - 2, 2)) * (x[1] - 2)
};
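Whenever a gradient is derived by hand, it is a good habit to verify it numerically before handing it to the optimizer. The sketch below uses a hand-rolled central-difference approximation for illustration (the framework offers a FiniteDifferences class that automates this):

```csharp
using System;

static class GradientCheck
{
    // f(x, y) = -exp(-(x-1)²) - exp(-(y-2)²/2)
    public static double F(double[] x)
        => -Math.Exp(-Math.Pow(x[0] - 1, 2))
           - Math.Exp(-0.5 * Math.Pow(x[1] - 2, 2));

    // Analytic gradient, as derived in the text
    public static double[] G(double[] x) => new double[]
    {
        2 * Math.Exp(-Math.Pow(x[0] - 1, 2)) * (x[0] - 1),
        Math.Exp(-0.5 * Math.Pow(x[1] - 2, 2)) * (x[1] - 2)
    };

    // Central-difference approximation of the gradient
    public static double[] Numerical(double[] x, double h = 1e-6)
    {
        double[] grad = new double[x.Length];
        for (int i = 0; i < x.Length; i++)
        {
            double[] xp = (double[])x.Clone(); xp[i] += h;
            double[] xm = (double[])x.Clone(); xm[i] -= h;
            grad[i] = (F(xp) - F(xm)) / (2 * h);
        }
        return grad;
    }

    static void Main()
    {
        double[] p = { 0.5, 1.0 };
        double[] exact = G(p);
        double[] approx = Numerical(p);

        // The analytic and numerical gradients agree to high precision
        Console.WriteLine(Math.Abs(exact[0] - approx[0]) < 1e-6); // True
        Console.WriteLine(Math.Abs(exact[1] - approx[1]) < 1e-6); // True
    }
}
```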
We can also note that this example was rather simple, so the gradient vector was easy to calculate manually. However, the gradient could also have been computed automatically using Accord.NET’s FiniteDifferences class. In either case, all we have to do now is create our L-BFGS solver and call its Minimize() method to begin optimization:
// Create the L-BFGS solver
BroydenFletcherGoldfarbShanno lbfgs =
    new BroydenFletcherGoldfarbShanno(numberOfVariables: 2, function: f, gradient: g);

// Minimize the function
double minValue = lbfgs.Minimize(); // should be -2
After the optimization is complete, the solution will be available in the Solution property of the lbfgs object, and will be equal to { 1.0, 2.0 }. In case one is interested in progress reports (such as when optimizing very large functions), it is also possible to register a listener with the Progress event handler. The complete version of the sample accompanying the source code is given below:
// Suppose we would like to find the minimum of the function
//
//   f(x,y) = -exp{-(x-1)²} - exp{-(y-2)²/2}
//
// First we need to write down the function, either as a named
// method, an anonymous method or as a lambda function:

Func<double[], double> f = (x) =>
    -Math.Exp(-Math.Pow(x[0] - 1, 2)) - Math.Exp(-0.5 * Math.Pow(x[1] - 2, 2));

// Now, we need to write its gradient, which is just the
// vector of first partial derivatives del_f / del_x, as:
//
//   g(x,y) = { del f / del x, del f / del y }
//
Func<double[], double[]> g = (x) => new double[]
{
    // df/dx = 2 e^(-(x-1)^2) (x-1)
    2 * Math.Exp(-Math.Pow(x[0] - 1, 2)) * (x[0] - 1),

    // df/dy = e^(-1/2 (y-2)^2) (y-2)
    Math.Exp(-0.5 * Math.Pow(x[1] - 2, 2)) * (x[1] - 2)
};

Console.WriteLine("Solving:");
Console.WriteLine();
Console.WriteLine("   min f(x,y) = -exp{-(x-1)²} - exp{-(y-2)²/2}");
Console.WriteLine();

// Finally, we can create the L-BFGS solver, passing the functions as arguments
var lbfgs = new BroydenFletcherGoldfarbShanno(numberOfVariables: 2,
    function: f, gradient: g);

// And then minimize the function:
double minValue = lbfgs.Minimize();
double[] solution = lbfgs.Solution;

// The resultant minimum value should be -2, and the solution
// vector should be { 1.0, 2.0 }. The answer can be checked on
// Wolfram Alpha by clicking the following link:
// http://www.wolframalpha.com/input/?i=maximize+%28exp%28-%28x-1%29%C2%B2%29+%2B+exp%28-%28y-2%29%C2%B2%2F2%29%29

Console.WriteLine("The minimum value of {0:0} occurs at the solution point ({1:0},{2:0})",
    minValue, solution[0], solution[1]);
Conclusion
In this post, we showed how to use a reduced set of the Accord.NET Framework to perform nonlinear optimization. This routine is also used by the Conditional Random Field and Hidden Conditional Random Field trainers to optimize the parameters of such models. The solver routines have been adapted from Nocedal’s original Fortran source code, which, tracing back to a 2001 message from the author, has also been reported to be available in the public domain. Nevertheless, the reduced set of the framework available for download within this post is available under a BSD license, as an alternative to the version available within the Accord.NET Framework, which is available only under the LGPL license.
As always, I hope someone will find it useful 🙂
References

 D. C. Liu and J. Nocedal. On the Limited Memory Method for Large Scale Optimization (1989), Mathematical Programming B, 45, 3, pp. 503–528.
 J. Nocedal. Updating Quasi-Newton Matrices with Limited Storage (1980), Mathematics of Computation, 35, pp. 773–782.
 J. Nocedal. L-BFGS: Software for Large-scale Unconstrained Optimization. Website. http://users.eecs.northwestern.edu/~nocedal/lbfgs.html
 Souza, César R. The Accord.NET Framework. 01 Sep. 2010. Web. http://accordnet.origo.ethz.ch.
 Wolfram Alpha LLC. Wolfram|Alpha. http://www.wolframalpha.com/
 Wikipedia contributors, Newton’s method. Wikipedia, The Free Encyclopedia. Available on: http://en.wikipedia.org/wiki/Newton's_method
 Wikipedia contributors, Quasi-Newton method. Wikipedia, The Free Encyclopedia. Available on: http://en.wikipedia.org/wiki/Quasi-Newton_method
 Wikipedia contributors, Limited-memory BFGS. Wikipedia, The Free Encyclopedia. Available on: http://en.wikipedia.org/wiki/Limited-memory_BFGS
 Wikipedia contributors, BFGS method. Wikipedia, The Free Encyclopedia. Available on: http://en.wikipedia.org/wiki/BFGS_method
Decision Trees in C#
Decision trees are simple predictive models which map input attributes to a target value using simple conditional rules. Trees are commonly used in problems whose solutions must be readily understandable or explainable by humans, such as in computer-aided diagnostics and credit analysis.
 Download sample application and source code
 Download the full Accord.NET Framework
Introduction
Decision trees give a direct and intuitive way of obtaining the classification of a new instance from a set of simple rules. Because of this characteristic, decision trees find wide use in situations in which the interpretation of the obtained results and of the reasoning process is crucial, such as in computer-aided diagnostics (CAD) and in financial credit analysis. Consumer credit analysis is an interesting example because, in many countries, one cannot simply deny credit without giving a justification, a justification which is trivial to extract from a decision tree.
The tree on the right has been generated using the Context Free software based on the grammar shared by davdrn.
Learning decision trees
Decision trees can simply be drawn by hand based on any prior knowledge the author may have. However, their real power becomes apparent when trees are learned automatically, through a learning algorithm.
The most notable and classic examples of decision tree learning are the ID3 (Quinlan, 1986) and C4.5 (Quinlan, 1993) algorithms. Both are examples of greedy algorithms, performing locally optimal decisions in the hope of producing the most general tree. Such algorithms are based on the principle of Occam’s razor, favoring simpler or smaller trees in the hope that such smaller trees could retain more generalization power. This preference is formalized through the specification of a hypothesis ordering criterion such as the information gain. The information gain measures, as the name implies, the gain of information obtained by using each of the attributes as the next factor to consider during a decision. The information gain can be defined as:

Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)

However, the information gain has a bias towards attributes with a high number of possible values (Mitchell, 1997). A way to overcome this bias is to select attributes based on alternative criteria, such as the gain ratio (Quinlan, 1986), defined as:

GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)

In the gain ratio, the SplitInformation term attenuates the importance given to attributes with many, uniformly distributed, possible values.
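To make these quantities concrete, here is a small hand-computation of entropy and information gain on a made-up toy attribute (illustrative only, not the framework's implementation):

```csharp
using System;
using System.Linq;

static class InfoGainDemo
{
    // Shannon entropy (base 2) of a sequence of class labels
    public static double Entropy(int[] labels)
    {
        return labels.GroupBy(v => v)
            .Select(g => (double)g.Count() / labels.Length)
            .Sum(p => -p * Math.Log(p, 2));
    }

    // Gain(S, A) = Entropy(S) - sum_v (|S_v| / |S|) * Entropy(S_v)
    public static double Gain(int[] attribute, int[] labels)
    {
        double gain = Entropy(labels);
        foreach (var group in Enumerable.Range(0, labels.Length)
                                        .GroupBy(i => attribute[i]))
        {
            int[] subset = group.Select(i => labels[i]).ToArray();
            gain -= (double)subset.Length / labels.Length * Entropy(subset);
        }
        return gain;
    }

    static void Main()
    {
        // Toy data: the attribute perfectly predicts the label,
        // so splitting on it removes all uncertainty
        int[] attribute = { 0, 0, 1, 1 };
        int[] labels    = { 0, 0, 1, 1 };

        Console.WriteLine(Entropy(labels));         // 1 (maximally uncertain labels)
        Console.WriteLine(Gain(attribute, labels)); // 1 (uncertainty fully removed)
    }
}
```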
Iterative Dichotomizer 3 (ID3)
The algorithm presented below is a slightly different version of the original ID3 algorithm as presented by Quinlan. The modifications are to support multiple output labels. In each recursion of the algorithm, the attribute which best classifies the set of instances (or examples, or input-output pairs, or data) is selected according to some selection criterion, such as the information gain or the gain ratio.
 ID3(instances, target_attribute, attributes)
 Create a new root node to the tree.
 If all instances have the target_attribute belonging to the same class c,
 Return the tree with single root node with label c.
 If attributes is empty, then
 Return the tree with single root node with the most common label of the target_attribute in instances.
 Else
 A ← the attribute in attributes which best classifies instances
 root decision attribute ← A
 Foreach possible value v_{i} of A,
 Add a new ramification below root, corresponding to the test A = v_{i}
 Let instances_{vi} be the subset of instances with the value v_{i} for A
 If instances_{vi} is empty then
 Below this ramification, add a new leaf node with the most common value of target_attribute in instances.
 Else below this ramification, add the subtree given by the recursion:
ID3(instances_{vi}, target_attribute, attributes – { A })
 End
 Return root
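The pseudocode above can be sketched compactly in C#. The following is an illustrative, list-based implementation for categorical attributes and integer labels (not the framework's DecisionTree classes; all names are made up, and the empty-subset branch of the pseudocode never arises here because branches are created only for values actually present):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Node
{
    public int Attribute = -1;  // attribute tested at this node (-1 for leaves)
    public int Label = -1;      // predicted label (leaves only)
    public Dictionary<int, Node> Children = new Dictionary<int, Node>();
}

static class Id3
{
    public static Node Learn(int[][] inputs, int[] outputs, List<int> attributes)
    {
        var node = new Node();

        // All instances share the same class: return a leaf
        if (outputs.Distinct().Count() == 1)
        {
            node.Label = outputs[0];
            return node;
        }

        // No attributes left: return a leaf with the most common label
        if (attributes.Count == 0)
        {
            node.Label = MostCommon(outputs);
            return node;
        }

        // Select the attribute with the highest information gain
        int best = attributes.OrderByDescending(a => Gain(inputs, outputs, a)).First();
        node.Attribute = best;

        // Recurse on each subset of instances sharing a value of the best attribute
        var remaining = attributes.Where(a => a != best).ToList();
        foreach (var group in Enumerable.Range(0, outputs.Length)
                                        .GroupBy(i => inputs[i][best]))
        {
            int[][] subInputs = group.Select(i => inputs[i]).ToArray();
            int[] subOutputs = group.Select(i => outputs[i]).ToArray();
            node.Children[group.Key] = Learn(subInputs, subOutputs, remaining);
        }
        return node;
    }

    public static int Decide(Node node, int[] input)
        => node.Attribute < 0 ? node.Label
         : Decide(node.Children[input[node.Attribute]], input);

    static int MostCommon(int[] values)
        => values.GroupBy(v => v).OrderByDescending(g => g.Count()).First().Key;

    static double Entropy(int[] labels)
        => labels.GroupBy(v => v)
            .Select(g => (double)g.Count() / labels.Length)
            .Sum(p => -p * Math.Log(p, 2));

    static double Gain(int[][] inputs, int[] outputs, int attribute)
        => Entropy(outputs) - Enumerable.Range(0, outputs.Length)
            .GroupBy(i => inputs[i][attribute])
            .Sum(g => (double)g.Count() / outputs.Length
                    * Entropy(g.Select(i => outputs[i]).ToArray()));
}
```

For example, trained on the truth table of the AND of two binary attributes, the learned tree reproduces the training labels exactly.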
Difficulties and disadvantages of decision tree learning
Despite relying on Occam’s razor to guide the learning, neither the ID3 nor the C4.5 algorithm is guaranteed to produce the smallest or most general tree possible. This happens because their learning is based solely on heuristics, and not on true optimality criteria. The following example, from (Esmeir & Markovitch, 2007), illustrates the learning of the concept xor (exclusive or) by the ID3 algorithm. In this example, A3 and A4 are irrelevant attributes, having absolutely no influence on the target answer. However, despite being irrelevant, ID3 will select one of them to belong to the tree. In fact, ID3 will select the irrelevant attribute A4 to be the root of the learned tree.

One complication of decision tree learning is that the problem of finding the smallest consistent tree (in the sense of classifying all training samples correctly) is known to be NP-complete (Hyafil & Rivest, 1976). Moreover, the separating decision surfaces generated by such trees are always formed by cuts parallel to the axes of the attribute space, which can be a severely suboptimal solution (Bishop, 2007, p. 666). The example given by Bishop illustrates this problem well: to separate classes whose optimal decision margin lies at 45 degrees from one of the axes, a large number of axis-parallel cuts will be needed, in comparison with a single cut on the diagonal such as could be given by any linear decision machine. Another disadvantage of traditional decision tree learning algorithms is that most methods require only a constant learning time, and, as such, do not allow trading extra training time for better solutions. The work of (Esmeir & Markovitch, 2007) is dedicated to overcoming this problem.
The following picture shows an example of how learning by decision trees is limited to cuts parallel to the axes, as described by Bishop. The Yin-Yang classification problem is a classical example of a non-linearly separable decision problem. Decision trees, albeit not linear classifiers, have difficulty classifying this set with simple thresholds.
Top left: Yin-Yang non-linear classification problem. Top right: Decision surface extracted by a decision tree. Bottom: Decision threshold rules extracted from the tree.
However, despite all of those shortcomings, decision trees play a major role as the basis of many successful algorithms. One interesting and notoriously successful application is in the field of object detection in images. The first real-time face detection algorithm (Viola & Jones, 2001) is based on a degenerate decision tree (also known as a cascade). The body recognition and segmentation algorithm used by the Xbox 360 gaming platform to process depth images generated by its companion sensor Kinect is equally based on the use of decision trees (Shotton, et al., 2011). The same is true of the FAST algorithm for interest point detection in images (Rosten & Drummond, 2006).
I should also note that both the Viola-Jones and the FAST algorithms are available in the Accord.NET Framework for immediate use (the Viola-Jones detector (HaarObjectDetector) has also recently been updated to support multiple threads, so feel free to take a look and experiment with it!).
In sum, it is possible to say that a great part of the applicability of decision trees comes from the simple fact that they are extremely fast to evaluate. One of the reasons for this is that they are easily translated into sets of conditional instructions in a conventional computer program. The decision trees now available in the Accord.NET Framework make full use of this fact and enable the user to actually compile the decision trees to native code on-the-fly, improving their classification performance even further.
Source code
The code presented in this section is actually part of the Accord.NET Framework. The Accord.NET Framework is a framework extension to the already very popular AForge.NET Framework, adding and providing new algorithms and techniques for machine learning, computer vision and more.
The Accord.MachineLearning.DecisionTree namespace is comprised of the following classes:
 DecisionTree, the main class representing a decision tree, with methods such as Compute to compute the tree classification given an input vector;
 DecisionNode, the node class for the decision tree, which may or may not have children nodes contained under a collection of children represented by a DecisionBranchNodeCollection;
 DecisionVariable, a class to specify the nature of each variable processable by the tree, such as whether the variable is continuous or discrete, and what its expected or valid ranges are;
 DecisionBranchNodeCollection, a class to contain children nodes together with information about which attribute of the data should be compared with the child nodes during reasoning.
Their relationships can be seen in the following class diagram:
The learning algorithms available are either the ID3 algorithm discussed above, or its improved version C4.5 (which can handle continuous variables, but at this time does not yet support pruning), both proposed and published by Ross Quinlan.
Despite the somewhat complicated look, usage is rather simple, as will be shown in the next section.
Using the code
Consider, for example, the famous Play Tennis example by Tom Mitchell (1997):
DataTable data = new DataTable("Mitchell's Tennis Example");

data.Columns.Add("Day", "Outlook", "Temperature", "Humidity", "Wind", "PlayTennis");

data.Rows.Add("D1",  "Sunny",    "Hot",  "High",   "Weak",   "No");
data.Rows.Add("D2",  "Sunny",    "Hot",  "High",   "Strong", "No");
data.Rows.Add("D3",  "Overcast", "Hot",  "High",   "Weak",   "Yes");
data.Rows.Add("D4",  "Rain",     "Mild", "High",   "Weak",   "Yes");
data.Rows.Add("D5",  "Rain",     "Cool", "Normal", "Weak",   "Yes");
data.Rows.Add("D6",  "Rain",     "Cool", "Normal", "Strong", "No");
data.Rows.Add("D7",  "Overcast", "Cool", "Normal", "Strong", "Yes");
data.Rows.Add("D8",  "Sunny",    "Mild", "High",   "Weak",   "No");
data.Rows.Add("D9",  "Sunny",    "Cool", "Normal", "Weak",   "Yes");
data.Rows.Add("D10", "Rain",     "Mild", "Normal", "Weak",   "Yes");
data.Rows.Add("D11", "Sunny",    "Mild", "Normal", "Strong", "Yes");
data.Rows.Add("D12", "Overcast", "Mild", "High",   "Strong", "Yes");
data.Rows.Add("D13", "Overcast", "Hot",  "Normal", "Weak",   "Yes");
data.Rows.Add("D14", "Rain",     "Mild", "High",   "Strong", "No");
In the aforementioned example, one would like to infer whether a person would play tennis or not based solely on four input variables. Those variables are all categorical, meaning that there is no order between the possible values for the variable (i.e. there is no order relationship between Sunny and Rain; one is not bigger nor smaller than the other, they are just distinct). Moreover, the rows, or instances, presented above represent days on which the behavior of the person has been registered and annotated, pretty much building our set of observation instances for learning.
In order to try to learn a decision tree, we will first convert this problem to a simpler representation. Since all variables are categorical, it does not matter whether they are represented as strings or as numbers, since both are just symbols for the events they represent. Since numbers are more easily representable than text strings, we will convert the problem to use a discrete alphabet through the use of a codebook.
A codebook effectively transforms any distinct possible value for a variable into an integer symbol. For example, “Sunny” could as well be represented by the integer label 0, “Overcast” by 1, “Rain” by 2, and the same goes for the other variables. So:
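To make the idea concrete, a bare-bones codebook could be sketched as below. Note this is a hypothetical illustration only; the framework's Codification class (used next) builds one such mapping per column and can also translate symbols back into their original strings:

```csharp
using System;
using System.Collections.Generic;

// A hypothetical, minimal codebook: maps each distinct string
// to the integer symbol given by its order of first appearance.
class SimpleCodebook
{
    private readonly Dictionary<string, int> symbols = new Dictionary<string, int>();

    public int Translate(string value)
    {
        int code;
        if (!symbols.TryGetValue(value, out code))
        {
            // Unseen value: assign the next free integer symbol
            code = symbols.Count;
            symbols[value] = code;
        }
        return code;
    }
}
```

With this sketch, translating "Sunny", "Overcast" and "Rain" in order would yield the symbols 0, 1 and 2, and translating "Sunny" again would yield 0 once more.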
// Create a new codification codebook to
// convert strings into integer symbols
Codification codebook = new Codification(data);
Now we should specify our decision tree. We will be trying to build a tree to predict the last column, entitled “PlayTennis”. For this, we will be using the “Outlook”, “Temperature”, “Humidity” and “Wind” columns as predictors (variables which we will use for our decision). Since those are categorical, we must specify, at the moment of creation of our tree, the characteristics of each of those variables. So:
DecisionVariable[] attributes =
{
    new DecisionVariable("Outlook",     3), // 3 possible values (Sunny, overcast, rain)
    new DecisionVariable("Temperature", 3), // 3 possible values (Hot, mild, cool)
    new DecisionVariable("Humidity",    2), // 2 possible values (High, normal)
    new DecisionVariable("Wind",        2)  // 2 possible values (Weak, strong)
};

int classCount = 2; // 2 possible output values for playing tennis: yes or no
Let’s now proceed and create our DecisionTree:
DecisionTree tree = new DecisionTree(attributes, classCount); 
Now we have created our decision tree. Unfortunately, it is not yet very useful, since we haven’t taught it the problem we are trying to predict. So now we must instantiate a learning algorithm to make it useful. For this task, in which we have only categorical variables, the simplest choice is to use the ID3 algorithm by Quinlan. Let’s do it:
// Create a new instance of the ID3 algorithm
ID3Learning id3learning = new ID3Learning(tree);

// Translate our training data into integer symbols using our codebook:
DataTable symbols = codebook.Apply(data);
int[][] inputs = symbols.ToIntArray("Outlook", "Temperature", "Humidity", "Wind");
int[] outputs = symbols.ToIntArray("PlayTennis").GetColumn(0);

// Learn the training instances!
id3learning.Run(inputs, outputs);
At this point, the tree has been created. In order to ask the tree to classify new samples, we can use:
int[] query = codebook.Translate("Sunny", "Hot", "High", "Strong");

int output = tree.Compute(query);

string answer = codebook.Translate("PlayTennis", output); // answer will be "No".
Please note that, in case any of the steps in this post doesn’t work out, it might be because the most up-to-date version of the Accord.NET Framework may have undergone some changes. In this case, please refer to the official documentation at the Accord.NET website.
Going further, a suitable representation of the learned tree could be given by the following diagram:
However, more than just a diagram, we can also generate a .NET Expression Tree describing our decision tree. Expression trees represent code in the form of a tree of expressions, which can then be read, modified or compiled. By compiling the DecisionTree‘s expression, we are able to generate code on-the-fly and let the JIT compile it down to native code at runtime, greatly improving the performance of our decision tree:
// Convert to an expression tree
var expression = tree.ToExpression();

// Compiles the expression to IL
var func = expression.Compile();
And here is the resulting C# code, obtained by compiling the expression into a lambda function, dumping the function into a dynamic assembly, and opening and inspecting this assembly using ILSpy:
public static int Compute(double[] input)
{
    if (input[0] == 0.0)
    {
        if (input[2] == 0.0)
        {
            return 0;
        }
        if (input[2] == 1.0)
        {
            return 1;
        }
        throw new ArgumentException(
            "Input contains a value outside of expected ranges.", "input");
    }
    else
    {
        if (input[0] == 1.0)
        {
            return 1;
        }
        if (input[0] != 2.0)
        {
            throw new ArgumentException(
                "Input contains a value outside of expected ranges.", "input");
        }
        if (input[3] == 0.0)
        {
            return 1;
        }
        if (input[3] == 1.0)
        {
            return 0;
        }
        throw new ArgumentException(
            "Input contains a value outside of expected ranges.", "input");
    }
}
Conclusion
Decision trees are useful tools when the problem to be solved needs to be quickly interpreted and understood by humans. Another suitable use is when the decision needs to be fast. However, decision trees, at least those trained by simple training algorithms such as ID3 and C4.5, can perform quite poorly depending on the problem. As happens with all machine learning techniques, it is futile to believe there is one true classifier which would act perfectly on all imaginable scenarios. As always, it is important to know our tools and to know in which situations each technique works best.
References
 Bishop, C. M., 2007. Pattern Recognition and Machine Learning (Information Science and Statistics). 1st ed. 2006. Corr. 2nd printing ed. s.l.:Springer.
 Fayyad, U. M. & Irani, K. B., 1992. On the Handling of Continuous-Valued Attributes in Decision Tree Generation. Machine Learning, January, 8(1), pp. 87-102.
 Quinlan, J. R., 1986. Induction of decision trees. Machine Learning, 1(1), pp. 81-106.
 Quinlan, J. R., 1993. C4.5: Programs for Machine Learning (Morgan Kaufmann Series in Machine Learning). 1st ed. s.l.:Morgan Kaufmann.
 Rosten, E. & Drummond, T., 2006. Machine Learning for High-Speed Corner Detection. s.l., s.n.
 Shotton, J. et al., 2011. Real-Time Human Pose Recognition in Parts from Single Depth Images. s.l., s.n.
 Viola, P. & Jones, M., 2001. Robust Real-time Object Detection. s.l., s.n.
 Mitchell, T. M., 1997. Decision Tree Learning. In: Machine Learning (McGraw-Hill Series in Computer Science). s.l.:McGraw-Hill.
 Mitchell, T. M., 1997. Machine Learning (McGraw-Hill Series in Computer Science). Boston (MA): WCB/McGraw-Hill.
 Esmeir, S. & Markovitch, S., 2007. Anytime Learning of Decision Trees. J. Mach. Learn. Res., May, Volume 8, pp. 891-933.
 Hyafil, L. & Rivest, R. L., 1976. Constructing Optimal Binary Decision Trees is NP-complete. Information Processing Letters, 5(1), pp. 15-17.
Accord.NET Framework: Gesture Controller Components
A new version (2.2.0) of the Accord.NET Framework has just been released. This new version introduces many new features, fixes and improvements. The most interesting additions are certainly the HeadController and FaceController .NET components.
The Accord.NET Controller components can be used to generate events based on webcam motion. By using a combination of HaarCascadeClassifiers, Camshift and Template-based Tracking, those components are able to detect when a face enters the scene, leaves the scene, and moves across the scene.
The video above shows only the sample application which comes together with the framework. However, the interesting part is that this is just a sample of what can be accomplished using the real controller components. The controller components are .NET components, similar to Button, Label or Timer, and can be dragged and dropped from Visual Studio’s ToolBox directly into any application.
Once inside an application, it will be possible to set event actions just as in any other .NET component:
The controls have built-in support for calibration. All values except the tilting angle are passed to the hosting application in the [-1;+1] range, in which -1 indicates a total left/down/backwards position and +1 indicates a total right/up/forward position. The tilting angle is given in radians. Please note that the face controller is still a bit experimental and still requires some tuning.
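As a quick illustration of how a hosting application might consume those calibrated values, the hypothetical helper below (not part of the framework) maps a value in the [-1;+1] range onto a pixel coordinate:

```csharp
using System;

static class CalibrationSketch
{
    // Map a calibrated controller value in [-1, +1] onto a pixel
    // coordinate in [0, extent]. Illustrative helper, not framework code.
    public static int ToPixel(double calibrated, int extent)
    {
        return (int)Math.Round((calibrated + 1.0) / 2.0 * extent);
    }
}
```

For an 800-pixel-wide window, a fully-left head position (-1) would map to pixel 0, a centered one (0) to pixel 400, and a fully-right one (+1) to pixel 800.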
This new version also introduces HSL color range object trackers, more default Haar cascades, an experimental version of linear-chain Conditional Random Fields, and the ability to generate hard-coded C# definitions of any Haar cascade available in the OpenCV XML format. There is also initial support for finger detection using new implementations of border-following contour extraction, K-curvatures and convex hull defect extraction. On the statistics side, there has been the inclusion of the von Mises distribution, Moving and Running Normal distributions and improvements in the Multivariate Gaussian implementation. The full release notes are available on the release’s download page.
Also, a special thanks to Professor Dr. Modesto Castrillón Santana for letting me embed some of his Haar definitions into the framework under the LGPL license. Please be sure to include a reference to his work if you plan to use it in an academic publication.
As always, I hope those additions and improvements will be useful to everyone 🙂
Machine Learning Books and Lectures
Jordan, one of the readers of the blog, wrote to point out some cool references for machine learning, mathematics and artificial intelligence.
Thanks again for all your contributions. I’m a .NET programmer coming from a background of studying politics and Latin American studies so when the moment came that I realized I was interested in A.I., I didn’t have many resources to turn to.
I have a collection of books and links that I found very, very helpful when I was trying to learn about these concepts. I was thinking they could be useful to some of your readers who were in my shoes two years ago. So without further ado:
Handbook of Normal Frames and Coordinates
"Handbook of Normal Frames and Coordinates," Iliev, Bozhidar Z. (2006)
“This can be a heavy text, but if you understand the concept, this can take you to the next level – I would categorize this as Topology but has applications in quantum physics, theoretical mathematics, theoretical physics etc… in context of your blogs – granted this is a dedicated read – I think readers will be able to appreciate and truly understand what a Hilbert Space is after reading this book.”
Linear Algebra
"Linear Algebra," Jim Heffron (2008)
“I liked this one because it was free and is still highly rated – I have also heard great reviews about David Poole’s book on linear algebra, but I couldn’t get myself to shell out the money when this was free.”
Complex To Real
http://www.complextoreal.com/tutorial.htm
“There are a ton of great articles here – I have personally read the ones on fourier transforms and wavelet transforms – great stuff”
Stanford Lectures – Fourier Analysis series
(free through Youtube or iTunesU)
“Email the Professor for the companion book. At this point it may have been published – but well worth shelling out the dough in any case.”
BioInspired Artificial Intelligence
"BioInspired Artificial Intelligence," Floreano and Mattiussi (2008)
“Excellent reference – fairly in depth for the number of topics it covers, lots of illustrations (illustrations are always a good thing 🙂 and I’ve also found it to be a useful source for inspiration. I bought this book while I was looking into fuzzy swarm intelligence – it doesn’t have all the answers, but I am simply in love with this book.”
Video lectures on Machine Learning
http://videolectures.net/pascal/
“A collection of video lectures featuring very brilliant people – let’s face it… if you’re doing or are interested in mathematics this complex… you probably don’t know too many people who you can talk to / learn from on the subject unless you’re in a University studying this kind of thing – these are some great lectures on machine learning – I just recently found this site but wish I had found it sooner – it’s great if you’re just exploring machine learning or are very well versed in it – however, I fall somewhere in the middle of that distribution so take it for what it’s worth!”
Fearless Symmetry
"Fearless Symmetry," Ash and Gross (2006)
Another accessible book to those without heavy training in math – great intro to Galois Theory, the Riemann Hypothesis and several other concepts.
Zero: The Biography of a Dangerous Idea
"Zero: The Biography of a Dangerous Idea," Charles Seife (2000)
This one is more historical and conceptual than technical but it’s a captivating read and will help get you through those hard times when you want to put down that book on K-Dimensional Manifolds, but still need to satisfy your mathematical mind (some say it’s never at rest once you catch that "learning bug").
Khan Academy
Finally, when you get lost, go here! The Khan Academy is a notforprofit 501(c)(3) with the mission of providing a worldclass education to anyone, anywhere.
Thanks Jordan! I hope readers can enjoy those resources as much as I did. Cheers!
Gaussian Mixture Models and ExpectationMaximization
Like K-Means, Gaussian Mixture Models (GMM) can be regarded as a type of unsupervised learning or clustering method. They are among the most statistically mature methods for clustering. But unlike K-Means, GMMs are able to build soft clustering boundaries, i.e., points in space can belong to any class with a given probability.
 Download sample application (with source code).
 Browse the sample application source code online
 Install the Accord.NET machine learning framework.
The code presented here is part of the Accord.NET Framework. The Accord.NET Framework is a framework for developing machine learning, computer vision, computer audition, statistics and math applications in .NET. To use the framework in your projects, install it by typing Install-Package Accord.MachineLearning in your IDE’s NuGet package manager.
Contents
Introduction
In statistics, a mixture model is a probabilistic model which assumes the underlying data to belong to a mixture distribution. In a mixture distribution, the density function is just a convex combination (a linear combination in which all coefficients or weights sum to one) of other probability density functions:

p(x) = w_{1} p_{1}(x) + w_{2} p_{2}(x) + … + w_{n} p_{n}(x)
The individual p_{i}(x) density functions that are combined to make the mixture density p(x) are called the mixture components, and the weights w_{1}, w_{2}, …, w_{n} associated with each component are called the mixture weights or mixture coefficients.
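In code, evaluating such a mixture density amounts to accumulating the weighted component densities. The sketch below uses two univariate Gaussian components with illustrative, hand-picked parameters; it is independent of the framework classes introduced later:

```csharp
using System;

static class MixtureSketch
{
    // Univariate Gaussian density with mean mu and standard deviation sigma
    static double Gaussian(double x, double mu, double sigma)
    {
        double z = (x - mu) / sigma;
        return Math.Exp(-0.5 * z * z) / (sigma * Math.Sqrt(2 * Math.PI));
    }

    // p(x) = w1 * p1(x) + w2 * p2(x), with the weights summing to one
    public static double MixtureDensity(double x)
    {
        double[] w = { 0.3, 0.7 };     // mixture coefficients (sum to one)
        double[] mu = { -1.0, 2.0 };   // component means (illustrative values)
        double[] sigma = { 0.5, 1.0 }; // component standard deviations

        double p = 0;
        for (int i = 0; i < w.Length; i++)
            p += w[i] * Gaussian(x, mu[i], sigma[i]);
        return p;
    }
}
```

Because each component density integrates to one and the weights form a convex combination, the resulting mixture density also integrates to one.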
The most common mixture distribution is the Gaussian (Normal) density function, in which each of the mixture components is a Gaussian distribution, each with its own mean and variance parameters.
To give a more solid and concrete understanding of Gaussian mixture models, we will jump directly into how to represent those abstractions in source code, through the use of class diagrams. Hopefully class diagrams may give a better and more direct overview of the subject than a lengthy theoretical discussion.
Source code
First we will discuss the characteristics of any probability distribution. The IDistribution interface (depicted below) says that any probability distribution may have a probability function and a distribution function. It also says that probability distributions can be fitted to sets of observations through a parameter estimation method.
Since distribution functions can be either univariate or multivariate, the methods above accept any number of scalar values as input through the use of the params keyword. The Fit method for parameter estimation also accepts a general System.Array because we could be given either an array of univariate observations in the form of a double[] or an array of multivariate observations in the form of a jagged double[][] array. Since each observation may have a different weight in the estimation process, the weights parameter can be used to inform those weights to the fitting method.
Probability distributions may also have some associated measures, mainly the Expectation and the Variance of the modeled random variable. Depending on whether the distribution is univariate or multivariate, the expected values and variances can be either a single scalar or a vector of scalars. For multivariate distributions the variance can also be represented by a symmetric, positive-definite matrix known as the variance-covariance matrix.
Moreover, probability distributions can also be continuous or discrete. In continuous distributions the probability function is known as the Probability Density Function (pdf), and in discrete distributions the probability function is known as the Probability Mass Function (pmf). However, since our distribution of interest, the Gaussian, is a continuous distribution, we will be focusing only on continuous multivariate distributions:
The Normal (Gaussian) distribution and the mixture distributions fall under the multivariate continuous distributions category and are implemented as such.
Normal distributions
The Normal (or Gaussian) multivariate distribution is a multivariate distribution whose parameters are the mean vector μ and a variance-covariance matrix Σ. Its probability density function is given by:

p(x) = (2π)^{-k/2} |Σ|^{-1/2} exp( -(1/2) (x - μ)^T Σ^{-1} (x - μ) )

in which k is the dimension of the vector x.
The more detailed class diagram for the normal distribution shows the multivariate characteristics of most of its methods, in particular of the Fit method for estimating Gaussian parameters from a multivariate set of data. Each sample, given in the form of a double[], may also have different associated weights. In this case, the weight associated with each of the samples can be passed through the weights parameter. The source code for the Gaussian distribution can be found in the NormalDistribution.cs class.
To estimate the parameters of a Gaussian distribution, all we have to do is compute the mean and the variance-covariance matrix out of the given data. This is a very straightforward procedure and can be done by using the standard methods of the Tools class in the Accord.Statistics namespace. The Fit method updates the distribution parameters, overwriting the current distribution.
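For illustration, the computation behind such an estimation step can be sketched in plain C# as below. This uses the maximum-likelihood estimates (dividing by n); note that the framework's actual helpers may differ in details (e.g., an unbiased n - 1 divisor for the covariance):

```csharp
using System;

static class GaussianFitSketch
{
    // Maximum-likelihood estimates of the mean vector and the
    // variance-covariance matrix of a multivariate sample.
    public static void Estimate(double[][] samples, out double[] mean, out double[,] cov)
    {
        int n = samples.Length, d = samples[0].Length;

        // Sample mean: average of each coordinate
        mean = new double[d];
        foreach (double[] x in samples)
            for (int j = 0; j < d; j++)
                mean[j] += x[j] / n;

        // Covariance: average outer product of the centered samples
        cov = new double[d, d];
        foreach (double[] x in samples)
            for (int i = 0; i < d; i++)
                for (int j = 0; j < d; j++)
                    cov[i, j] += (x[i] - mean[i]) * (x[j] - mean[j]) / n;
    }
}
```

For instance, for the two samples (0, 0) and (2, 2), the estimated mean is (1, 1) and every entry of the estimated covariance matrix equals 1.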
Mixture distributions
Mixture distributions are just convex combinations of other probability distributions. That said, mixture distributions have an array of component distributions and a coefficient array which contains the weights of each of the component probability distributions.
In the generic implementation above, T is the type of the distributions used in the mixture. The most common mixture distribution, the Gaussian Mixture Distribution, could then be created by instantiating a Mixture object passing the initial Normal distributions as constructor parameters.
Alternatively, the parameters of the mixture distributions could be estimated from a set of observations by calling the Fit method. To estimate the parameters of a mixture distribution we will be using a common technique known as the ExpectationMaximization algorithm.
Expectation Maximization
Expectation-Maximization (EM) is a well-established maximum likelihood algorithm for fitting a mixture model to a set of training data. It should be noted that EM requires an a priori selection of the model order, namely, the number M of components to be incorporated into the model. In the Accord.NET Framework, it can be called implicitly by using the Mixture.Fit method or explicitly by using the ExpectationMaximization class. The source code for Expectation-Maximization can be found in the ExpectationMaximization.cs file.
The general EM algorithm is comprised of the following simple steps:

Initialization
Initialize the distribution parameters, such as the means, covariances and mixing coefficients, and evaluate the initial value of the log-likelihood (the goodness of fit of the current distribution against the observation dataset);

Expectation
Evaluate the responsibilities (i.e. weight factors of each sample) using the current parameter values;

Maximization
Re-estimate the parameters using the responsibilities found in the previous step;

Repeat
Re-evaluate the log-likelihood and check if it has changed; if it has changed less than a given threshold, the algorithm has converged.
Below is the actual realization of the algorithm as implemented in the Fit method of the Mixture<T> class.
/// <summary>
///   Fits the underlying distribution to a given set of observations.
/// </summary>
///
/// <param name="observations">The array of observations to fit the model against. The array
///   elements can be either of type double (for univariate data) or
///   type double[] (for multivariate data).</param>
/// <param name="weights">The weight vector containing the weight for each of the samples.</param>
/// <param name="options">Optional arguments which may be used during fitting, such
///   as regularization constants and additional parameters.</param>
///
/// <remarks>
///   Although both double[] and double[][] arrays are supported,
///   providing a double[] for a multivariate distribution or a
///   double[][] for a univariate distribution may have a negative
///   impact in performance.
/// </remarks>
///
public override void Fit(double[] observations, double[] weights, IFittingOptions options)
{
    // Estimation parameters
    double threshold = 1e-3;
    IFittingOptions innerOptions = null;

    if (options != null)
    {
        // Process optional arguments
        MixtureOptions o = (MixtureOptions)options;
        threshold = o.Threshold;
        innerOptions = o.InnerOptions;
    }

    // 1. Initialize means, covariances and mixing coefficients
    //    and evaluate the initial value of the log-likelihood

    int N = observations.Length;
    int K = components.Length;

    double weightSum = weights.Sum();

    // Initialize responsibilities
    double[] norms = new double[N];
    double[][] gamma = new double[K][];
    for (int k = 0; k < gamma.Length; k++)
        gamma[k] = new double[N];

    // Clone the current distribution values
    double[] pi = (double[])coefficients.Clone();
    T[] pdf = new T[components.Length];
    for (int i = 0; i < components.Length; i++)
        pdf[i] = (T)components[i].Clone();

    // Prepare the iteration
    double likelihood = logLikelihood(pi, pdf, observations, weights);
    bool converged = false;

    // Start
    while (!converged)
    {
        // 2. Expectation: Evaluate the component distributions'
        //    responsibilities using the current parameter values.
        Array.Clear(norms, 0, norms.Length);

        for (int k = 0; k < gamma.Length; k++)
            for (int i = 0; i < observations.Length; i++)
                norms[i] += gamma[k][i] = pi[k] * pdf[k].ProbabilityFunction(observations[i]);

        for (int k = 0; k < gamma.Length; k++)
            for (int i = 0; i < weights.Length; i++)
                if (norms[i] != 0)
                    gamma[k][i] *= weights[i] / norms[i];

        // 3. Maximization: Re-estimate the distribution parameters
        //    using the previously computed responsibilities
        for (int k = 0; k < gamma.Length; k++)
        {
            double sum = gamma[k].Sum();

            for (int i = 0; i < gamma[k].Length; i++)
                gamma[k][i] /= sum;

            pi[k] = sum / weightSum;
            pdf[k].Fit(observations, gamma[k], innerOptions);
        }

        // 4. Evaluate the log-likelihood and check for convergence
        double newLikelihood = logLikelihood(pi, pdf, observations, weights);

        if (Double.IsNaN(newLikelihood) || Double.IsInfinity(newLikelihood))
            throw new ConvergenceException("Fitting did not converge.");

        if (Math.Abs(likelihood - newLikelihood) < threshold * Math.Abs(likelihood))
            converged = true;

        likelihood = newLikelihood;
    }

    // Become the newly fitted distribution.
    this.initialize(pi, pdf);
}
Gaussian Mixture Models
Gaussian mixture models are among the most commonly used examples of mixture distributions. The GaussianMixtureModel class encompasses a Mixture<NormalDistribution> object and provides methods to learn from data and to perform actual classification through a simplified interface.
Moreover, a common problem which arises in mixture model fitting through EM is the proper initialization of the mixture parameters. One popular approach is to initialize the mixture parameters using the centroids detected by the K-Means algorithm. The code below shows how the mixture parameters are computed by first creating a KMeans object and then passing the clusters to the mixture model for further processing.
/// <summary>
///   Divides the input data into K clusters modeling each
///   cluster as a multivariate Gaussian distribution.
/// </summary>
public double Compute(double[][] data, double threshold)
{
    int components = this.gaussians.Count;

    // Create a new K-Means algorithm
    KMeans kmeans = new KMeans(components);

    // Compute the K-Means
    kmeans.Compute(data, threshold);

    // Initialize the Mixture Model with data from K-Means
    NormalDistribution[] distributions = new NormalDistribution[components];
    double[] proportions = kmeans.Clusters.Proportions;
    for (int i = 0; i < components; i++)
    {
        double[] mean = kmeans.Clusters.Centroids[i];
        double[,] covariance = kmeans.Clusters.Covariances[i];
        distributions[i] = new NormalDistribution(mean, covariance);
    }

    // Fit a multivariate Gaussian distribution
    model = Mixture<NormalDistribution>.Estimate(data, threshold, proportions, distributions);

    // Return the log-likelihood as a measure of goodness-of-fit
    return model.LogLikelihood(data);
}
Using the code
Code usage is rather simple. First, instantiate a new GaussianMixtureModel object passing the desired number of components in the Gaussian mixture. Then, call the Compute(double[][], double) method passing the learning samples and a desired convergence threshold.
// Create a new Gaussian Mixture Model with 2 components
GaussianMixtureModel gmm = new GaussianMixtureModel(2);

// Compute the model (estimate)
gmm.Compute(samples, 0.0001);

// Classify a single sample
int c = gmm.Gaussians.Nearest(sample);
After training has been completed, a new sample can be classified by calling the Gaussians.Nearest(double[]) method, as shown in the last line of the example above.
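Classifying a sample in a mixture model amounts to picking the component with the highest posterior probability for that sample. The following is a minimal univariate sketch of this idea in Python; it is only illustrative and is not the Accord.NET implementation (the classify and gaussian_pdf helpers are hypothetical names):

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a univariate normal distribution at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def classify(x, weights, means, variances):
    """Return the index of the mixture component with the highest
    (unnormalized) posterior responsibility for sample x."""
    posteriors = [w * gaussian_pdf(x, m, v)
                  for w, m, v in zip(weights, means, variances)]
    return max(range(len(posteriors)), key=lambda i: posteriors[i])

# Two equally weighted components: one centered at 0, one at 10
print(classify(9.3, [0.5, 0.5], [0.0, 10.0], [1.0, 1.0]))  # -> 1
print(classify(0.2, [0.5, 0.5], [0.0, 10.0], [1.0, 1.0]))  # -> 0
```

Since the mixture weights are equal here, classification reduces to picking the component whose mean is nearest to the sample, which is exactly what the Nearest call does.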
Sample application
The accompanying sample application demonstrates how Gaussian Mixture Models can be used to cluster normally-distributed samples of data. By clicking the “Create random data” button, a given number of Normal distributions will be created in the two-dimensional space using random mean and covariance matrices.
After the data has been created, you can use the “Fit a Gaussian Mixture Model” button to fit a mixture of Gaussians to the data. Each of the Gaussians will receive a random color and the samples which have the greatest probability of belonging to any of the Gaussians will be colored accordingly.
Left: a random dataset containing three Gaussian clusters. Right: the clusters as identified by the Gaussian Mixture Model.
Left: a random dataset containing 10 Gaussian clusters. Right: the clusters as identified by the Gaussian Mixture Model.
Conclusion
In this article we saw how Gaussian Mixture Models can be successfully used to create soft clustering boundaries around data. Those soft boundaries are possible because, in a mixture model, each sample is said to belong to a cluster only with a certain probability.
A main drawback of GMMs is that the number of Gaussian mixture components, as in K-Means, is assumed to be known a priori, so the method cannot be considered a totally unsupervised clustering method. To aid in this situation, one could use additional criteria such as the Minimum Message Length (MML) criterion.
Another problem arises with the correct initialization of the mixture components. A poor choice of initial parameters will invariably lead to poor results, as the EM algorithm converges only to a local optimum. A commonly used solution is initialization by randomly sampling from the mixture data. Another common solution, covered by the implementation presented here, is to use K-Means to perform the initialization. However, K-Means itself may also converge to a poor solution and negatively impact the mixture model fitting.
References
 Raja, Y., Shaogang, G. Gaussian Mixture Models. Department of Computer Science, Queen Mary and Westfield College, England.
 Bishop, C. M. (2007), Pattern Recognition and Machine Learning (Information Science and Statistics), Springer.
 Wikipedia contributors, “Normal distribution,” Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/w/index.php?title=Normal_distribution (accessed October 18, 2010).
 Wikipedia contributors, “Probability density function,” Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/w/index.php?title=Probability_density_function (accessed October 18, 2010).
 Wikipedia contributors, “Mixture density,” Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/w/index.php?title=Mixture_density (accessed October 18, 2010).
See also
 K-Means Clustering Algorithm
 Principal Component Analysis (PCA)
 Kernel Principal Component Analysis (KPCA)
 Linear Discriminant Analysis (LDA)
 Non-Linear Discriminant Analysis with Kernels (KDA)
 Kernel Support Vector Machines (SVM)
 Handwriting Recognition Revisited: Kernel Support Vector Machines
 Logistic Regression Analysis in C#
K-Means clustering in C#
K-Means is one of the most popular, classic and simple approaches to clustering. Clustering is a method of unsupervised learning, and a common technique for statistical data analysis used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics [3].
 Download source code and sample application.
The code presented here is also part of the Accord.NET Framework. The Accord.NET Framework is a C# framework for developing machine learning, computer vision, computer audition, statistics and math applications in .NET. It is based on the AForge.NET Framework. Please see the starting guide for more details. The latest version of the framework includes the latest version of this code plus many other statistics and machine learning tools.
Contents
 Introduction
 Source code
 Using the code
 Sample application
 Conclusion
 Acknowledgements
 References
 See also
Introduction
In statistics and machine learning, k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean [4]. In its most common form, the algorithm is an iterative greedy algorithm which converges to a local optimum after a certain number of iterations.
Illustration of the K-Means algorithm.
The algorithm works by first selecting k locations at random to be the initial centroids for the clusters. Each observation is then assigned to the cluster which has the nearest centroid, and the centroids are recalculated as the mean of the assigned observations. The algorithm repeats this process until the cluster centroids no longer change, or until the change is less than a given threshold.
There are other refinements and extensions of the algorithm. The version depicted above is its most common form, also referred to as Lloyd’s algorithm.
Lloyd’s K-Means algorithm
 Place K points into the space represented by the objects that are being clustered. These points represent initial group centroids.
 Assign each object to the group that has the closest centroid.
 When all objects have been assigned, recalculate the positions of the K centroids.
 Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
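The four steps above can be sketched in a few lines of Python. This is only an illustrative translation of Lloyd's algorithm, not the Accord.NET implementation (the kmeans function name and its signature are hypothetical):

```python
import random

def kmeans(data, k, threshold=1e-6):
    """A minimal sketch of Lloyd's algorithm for points in R^d."""
    rng = random.Random(0)
    # Step 1: pick K observations at random as the initial centroids
    centroids = rng.sample(data, k)
    while True:
        # Step 2: assign each point to the cluster with the nearest centroid
        labels = [min(range(k),
                      key=lambda c: sum((x - m) ** 2
                                        for x, m in zip(p, centroids[c])))
                  for p in data]
        # Step 3: recompute every centroid as the mean of its members
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(data, labels) if label == c]
            if not members:
                new_centroids.append(centroids[c])  # keep empty clusters in place
            else:
                new_centroids.append([sum(xs) / len(members)
                                      for xs in zip(*members)])
        # Step 4: stop once no centroid moves more than the threshold
        moved = max(abs(a - b)
                    for old, new in zip(centroids, new_centroids)
                    for a, b in zip(old, new))
        centroids = new_centroids
        if moved < threshold:
            return labels, centroids

# Two well-separated groups of three points each
data = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
        [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]]
labels, centroids = kmeans(data, 2)
# Points in the same spatial group receive the same label
```

Note that the convergence test mirrors the one used by the framework: the loop stops when the centroid positions between two iterations differ by less than the threshold.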
Source code
The source code has been implemented using Accord.NET and is now part of the framework. In the current version (2.1.2), the following classes related to K-Means are contained inside the Accord.MachineLearning namespace. The source code available on this page contains only the parts of the framework actually needed to support the algorithm.
Class diagram for the K-Means algorithm.
The KMeans class is the main class representing the K-Means algorithm. The algorithm itself is implemented in the Compute(double[][] data, double threshold) method, which accepts a set of observations and a convergence threshold to determine when the method should stop.
/// <summary>
///   Divides the input data into K clusters.
/// </summary>
public int[] Compute(double[][] data, double threshold)
{
    int k = this.K;
    int rows = data.Length;
    int cols = data[0].Length;

    // pick K unique random indexes in the range 0..n-1
    int[] idx = Accord.Statistics.Tools.Random(rows, k);

    // assign centroids from data set
    this.centroids = data.Submatrix(idx);

    // initial variables
    int[] count = new int[k];
    int[] labels = new int[rows];
    double[][] newCentroids;

    do // main loop
    {
        // Reset the centroids and the
        // cluster member counters
        newCentroids = new double[k][];
        for (int i = 0; i < k; i++)
        {
            newCentroids[i] = new double[cols];
            count[i] = 0;
        }

        // First we will accumulate the data points
        // into their nearest clusters, storing this
        // information into the newClusters variable.

        // For each point in the data set,
        for (int i = 0; i < data.Length; i++)
        {
            // Get the point
            double[] point = data[i];

            // Compute the nearest cluster centroid
            int c = labels[i] = Nearest(data[i]);

            // Increase the cluster's sample counter
            count[c]++;

            // Accumulate in the corresponding centroid
            double[] centroid = newCentroids[c];
            for (int j = 0; j < centroid.Length; j++)
                centroid[j] += point[j];
        }

        // Next we will compute each cluster's new centroid
        // by dividing the accumulated sums by the number of
        // samples in each cluster, thus averaging its members.
        for (int i = 0; i < k; i++)
        {
            double[] mean = newCentroids[i];
            double clusterCount = count[i];

            for (int j = 0; j < cols; j++)
                mean[j] /= clusterCount;
        }

        // The algorithm stops when there is no further change in
        // the centroids (difference is less than the threshold).
        if (centroids.IsEqual(newCentroids, threshold))
            break;

        // go to next generation
        centroids = newCentroids;

    } while (true);

    // Compute cluster information (optional)
    for (int i = 0; i < k; i++)
    {
        // Extract the data for the current cluster
        double[][] sub = data.Submatrix(labels.Find(x => x == i));

        // Compute the current cluster variance
        covariances[i] = Statistics.Tools.Covariance(sub, centroids[i]);

        // Compute the proportion of samples in the cluster
        proportions[i] = (double)sub.Length / data.Length;
    }

    // Return the classification result
    return labels;
}
The implementation is quite straightforward and does not use additional techniques to avoid convergence problems. More refined techniques may be added to the implementation in the future, so please make sure to download the latest version of the Accord.NET Framework for the most up-to-date revision of the code.
Using the code
To use the code, create a new instance of the KMeans class, passing the desired number of clusters to its constructor. Additionally, you may also pass a distance function to be used as the distance metric during clustering. The default is the squared Euclidean distance.
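The role of a pluggable distance function can be illustrated with a short Python sketch. The helper names below (square_euclidean, manhattan, nearest) are hypothetical and only show the idea of swapping metrics; they are not part of Accord.NET:

```python
def square_euclidean(a, b):
    """Squared Euclidean distance, the default metric."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def manhattan(a, b):
    """An alternative metric: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def nearest(point, centroids, distance=square_euclidean):
    """Index of the centroid closest to point under the given metric."""
    return min(range(len(centroids)), key=lambda i: distance(point, centroids[i]))

centroids = [[0.0, 0.0], [10.0, 0.0]]
print(nearest([2.0, 1.0], centroids))             # -> 0
print(nearest([9.0, 3.0], centroids, manhattan))  # -> 1
```

Passing the metric as a function value is the same design the framework uses with C# delegates and lambda expressions.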
// Declare some observations
double[][] observations =
{
    new double[] {  5, 2, 1 },
    new double[] {  5, 5, 6 },
    new double[] {  2, 1, 1 },
    new double[] {  1, 1, 2 },
    new double[] {  1, 2, 2 },
    new double[] {  3, 1, 2 },
    new double[] { 11, 5, 4 },
    new double[] { 15, 5, 6 },
    new double[] { 10, 5, 6 },
};

// Create a new K-Means algorithm with 3 clusters
KMeans kmeans = new KMeans(3);

// Compute the algorithm, retrieving an integer array
// containing the labels for each of the observations
int[] labels = kmeans.Compute(observations);

// As a result, the first two observations should belong to the
// same cluster (thus having the same label). The same should
// happen to the next four observations and to the last three.
Sample application
The kmeans clustering algorithm is commonly used in computer vision as a form of image segmentation. The results of the segmentation are often used to aid border detection and object recognition. The sample application performs image segmentation using the standard squared Euclidean distance over RGB pixel color space. There are, however, better distance metrics to be used for image segmentation, such as weighted distances and other color spaces, which will not be addressed in this example.
Original image (from Ossi Petruska Flickr page*).
To perform image segmentation, we will first translate our image into an array of pixel values. The image will be read, pixel by pixel, into a jagged array where each element is a double array of length 3. Each element of those arrays will contain one of the three RGB values scaled to the interval [–1, 1].
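The exact conversion performed by the framework's ToDoubleArray method is not shown here, but a linear map from the 0–255 channel range onto [–1, 1] can be sketched as follows (the scale_pixel helper is a hypothetical illustration in Python):

```python
def scale_pixel(r, g, b):
    """Map 8-bit RGB channel values (0..255) onto the interval [-1, 1],
    producing one length-3 feature vector per pixel."""
    return [c / 127.5 - 1.0 for c in (r, g, b)]

# A tiny two-pixel "image" flattened into a jagged array of features
image = [(0, 128, 255), (255, 255, 255)]
pixels = [scale_pixel(r, g, b) for r, g, b in image]
# (0, 128, 255) -> [-1.0, ~0.0039, 1.0]
```

Clustering then runs over these feature vectors exactly as it would over any other numeric observations.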
After we perform clustering on this array of pixel values, each pixel will have an associated cluster label. Each of these values will then be swapped by its corresponding cluster centroid. The source code below is called when one clicks the Run button in the application.
private void btnRun_Click(object sender, EventArgs e)
{
    // Retrieve the number of clusters
    int k = (int)numClusters.Value;

    // Load original image
    Bitmap image = Properties.Resources.leaf;

    // Transform the image into an array of pixel values
    double[][] pixels = image.ToDoubleArray();

    // Create a K-Means algorithm using the given k and the
    // squared Euclidean distance as the distance metric.
    KMeans kmeans = new KMeans(k, Distance.SquareEuclidean);

    // Compute the K-Means algorithm until the difference in
    // cluster centroids between two iterations is below 0.05
    int[] idx = kmeans.Compute(pixels, 0.05);

    // Replace every pixel with its corresponding centroid
    pixels.ApplyInPlace((x, i) => kmeans.Clusters.Centroids[idx[i]]);

    // Show resulting image in the picture box
    pictureBox.Image = pixels.ToBitmap(image.Width, image.Height);
}
After segmentation, the following resulting images can be obtained:
Same image after K-Means clustering with k = 5.
Image after K-Means clustering with k = 10.
* The sample image used above has been licensed by Ossi Petruska under a Creative Commons Attribution-NonCommercial-ShareAlike 2.0 Generic license.
Conclusion
K-Means is a very simple and popular approach to clustering. The implementation presented here is the same implementation used in Accord.NET. As can be seen, the method can easily be extended with custom distance functions through delegates or lambda expressions, and can be used in different contexts, such as image segmentation, without further modification. As a suggestion for improvement, the method can be further sped up by using the triangle inequality, as suggested in the paper "Using the triangle inequality to accelerate k-means" by Charles Elkan.
In the next article, we will see how we can use K-Means to initialize the Expectation-Maximization algorithm for estimating Gaussian mixture densities in Gaussian Mixture Models. Those articles will form the basis for continuous-density Hidden Markov Models.
Acknowledgements
To Antonino Porcino, who provided the first version of the code and valuable information about many other methods and algorithms.
References

[1] Matteo Matteucci. “Tutorial on Clustering Algorithms,” Politecnico di Milano, http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/kmeans.html (accessed October 4, 2010).

[2] Teknomo, Kardi. “K-Means Clustering Tutorials,” http://people.revoledu.com/kardi/tutorial/kMean/ (accessed October 6, 2010).

[3] Wikipedia contributors, "Cluster analysis," Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/wiki/Cluster_analysis (accessed October 4, 2010).

[4] Wikipedia contributors, "K-means clustering," Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/wiki/K-means_clustering (accessed October 4, 2010).
See also
 Principal Component Analysis (PCA)
 Kernel Principal Component Analysis (KPCA)
 Linear Discriminant Analysis (LDA)
 Non-Linear Discriminant Analysis with Kernels (KDA)
 Kernel Support Vector Machines (SVM)
 Handwriting Recognition Revisited: Kernel Support Vector Machines
 Logistic Regression Analysis in C#