Logistic Regression in C#

Logistic regression (sometimes called the logistic model or logit model) is used for prediction of the probability of occurrence of an event by fitting data to a logistic curve. Logistic regression is used extensively in the medical and social sciences as well as marketing applications such as prediction of a customer’s propensity to purchase a product or cease a subscription.

The code presented here is part of the Accord.NET Framework. The Accord.NET Framework is a framework for developing machine learning, computer vision, computer audition, statistics and math applications in .NET. To use the framework in your projects, install it by typing Install-Package Accord.MachineLearning in your IDE’s NuGet package manager.


The Logistic regression is a generalized linear model used for binomial regression. Like many forms of regression analysis, it makes use of several predictor variables that may be either numerical or categorical.

Logistic Sigmoid Function

The logistic sigmoid function is given by

g(z) = 1 / (1 + Exp(-z))

where in the context of logistical regression z is called the logit.

Logistic Regression Model

The logistic regression model is a generalized linear model. This means that it is just a linear regression model taken as input for a non-linear link function. The linear model has the form

z = c1x1 + c2x2 + … cnxn + i = ct x + i

where c is the coefficient vector, i is the intercept value and x is the observation vector for n variables and in the context of logistic regression is called the logit. The logit is then applied as input for the nonlinear logistic sigmoid function g(z), giving as result a probability.

So in a binomial problem where we are trying to determine whether a observation belongs to class C1 or class C2, the logistic model tell us that:

p(C1|x) = g(ct x + i)

p(C2|x) = 1 – p(C1|x)

where p(C1|x) denotes the probability of C1 being true when x is true.
In other words, denotes the probability of x belonging to class C1.


The coefficients for the logistic regression are the values which multiply each observation variable from a sample input vector. The intercept is analogous to the independent term in a polynomial linear regression model or the threshold or bias value for a neuron in a neural network model.

Odds Ratio

After computation of the logistic regression model, every coefficient will have an associated measure called the odds ratio. The odds ratio is a measure of effect size, describing the strength of association or non-independence between two binary data values and can be approximated by raising the coefficient value to the euler’s number.

Odds ratioc = ec

Standard Error

The standard error for the coefficients can be retrieved from the inverse Hessian matrix calculated during the model fitting phase and can be used to give confidence intervals for the odds ratio. The standard error for the i-th coefficient of the regression can be obtained as:

SEi = sqrt(diag(H-1)i)

Confidence Intervals

The confidence interval around the logistic regression coefficient is plus or minus 1.96*SEi, where SEi is the standard error for coefficient i. We can then define:

95% C.I.i = <lower, upper> = <coefficienti – 1.96 * SEi,  coefficienti + 1.96 * SEi>

Wald Statistic and Wald’s Test

The Wald statistic is the ratio of the logistic coefficient to its standard error. A Wald test is used to test the statistical significance of each coefficient in the model. A Wald test calculates a Z statistic, which is:

z = coefficient / standard error

This z value can be squared, yielding a Wald statistic with a chi-square distribution, or, alternatively, it can be taken as is and compared directly with a Normal distribution.

The Wald test outputs a p-value indicating the significance of individual independent variables. If the value is below a chosen significance threshold (typically 0.05), then the variable plays some role in determining the outcome variable that most likely is not result of chance alone. However, there are some problems with the use of the Wald statistic. The Likelihood-ratio test is a better alternative for the Wald test.

Likelihood-Ratio and Chi-Square Test

The likelihood-ratio is the ratio of the likelihood of the full model over the likelihood of a simpler nested model. When compared to the null model the likelihood-ratio test gives an overall model performance measure. When compared to nested models, each with one variable omitted, it tests the statistical significance of each coefficient in the model. These significance tests are considered to be more reliable than the Wald significance test.

The likelihood-ratio is a chi-square statistic given by:

D & = -2lnleft( frac{text{likelihood for first model}}{text{likelihood for second model}} right).

The model with more parameters will always fit at least as well (have a greater log-likelihood). Whether it fits significantly better and should thus be preferred can be determined by deriving the probability or p-value of the obtained difference D. In many cases, the probability distribution of the test statistic can be approximated by a chi-square distribution with (df1 − df2) degrees of freedom, where df1 and df2 are the degrees of freedom of models 1 and 2 respectively.

Regression to a Logistic Model

If we consider the mapping

φ(<x1, x2, …, xn>) = <x1, x2, … xn, 1>

The logistic regression model can also be rewritten as

p(C1|φ) = g(wt φ) = f(φ, w)

so that w contains all coefficients and the intercept value in a single weight vector. Doing so will allow the logistic regression model to be expressed as a common optimization model in the form f(φ, w) allowing many standard non-linear optimization algorithms to be directly applied in the search for the best parameters w that best fits the probability of a class C1 given a input vector φ.

Likelihood function

The likelihood function for the logistic model is given by:

p(t|w) = prod_{n=1}^N y_n^{t_n} { 1 - y_n }^{1-t_n}

but, as the log of products equals the sum of logs, taking the log of the likelihood function results in the Log-likelihood function in the form:

ln p(t|w) = sum_{n=1}^N { t_n ln y_n + (1-t_n) ln (1 - y_n) }

Furthermore, if we take the negative of the log-likelihood, we will have a error function called the cross-entropy error function:

E(w) = -ln p(t|w) = - sum_{n=1}^N { t_n ln y_n + (1-t_n) ln (1 - y_n) }

which gives both better numerical accuracy and enable us to write the error gradient in the same form as the gradient of the sum-of-squares error function for linear regression models (Bishop, 2006).

Another important detail is that the likelihood surface is convex, so it has no local maxima. Any local maxima will also be a global maxima, so one does not need to worry about getting trapped in a valley when walking down the error function.

Iterative Reweighted Least-Squares

The method of Iterative Reweighted Least-Squares is commonly used to find the maximum likelihood estimates of a generalized linear model. In most common applications, an iterative Newton-Raphson algorithm is used to calculate those maximum likelihood values for the parameters. This algorithm uses second order information, represented in the form of a Hessian matrix, to guide incremental coefficient changes. This is also the algorithm used in this implementation.

In the Accord.NET machine learning framework. the Iterative Reweighted Least-Squares is implemented in the IterativeReweightedLeastSquares class (source).

Source Code

Below is the main code segment for the logistic regression, performing one pass of the Iterative Reweighted Least-Squares algorithm.

Furthermore, as the likelihood function is convex, the Logistic Regression Analysis can perform regression without having to experiment different starting points. Below is the code to compute a logistic regression analysis. The algorithm iterates until the largest absolute parameter change between two iterations becomes less than a given limit, or the maximum number of iterations is reached.

Using the code

Code usage is very simple. To perform a Logistic Regression Analysis, simply create an object of the type LogisticRegressionAnalysis and pass the input and output data as constructor parameters. Optionally you can pass the variables names as well.

Then just call Compute() to compute the analysis.

After that, you can display information about the regression coefficients by binding the CoefficientCollection to a DataGridView.

Sample application

The sample application creates Logistic Regression Analyses by reading Excel workbooks. It can also perform standard Linear Regressions, although there aren’t many options available to specify linear models.

lr-1 lr-6

Example data set adapted from http://www.statsdirect.co.uk/help/regression_and_correlation/logi.htm.


lr-2 lr-3

The image on the left shows information about the analysis performed, such as coefficient values, standard errors, p-value for the Wald statistic, Odds ratio and 95% confidence intervals. It also reports the log-likelihood, deviance and likelihood-ratio chi-square test for the final model. The newest available version can also compute likelihood-ratio tests for each coefficient, although not shown in the image.

Final Considerations

The logistic regression model can be seen as the exact same as a one layer MLP with only one hidden neuron using a sigmoid activation function trained by a Newton’s method learning algorithm. Thus we can say the Logistic Regression is just a special case of Neural Networks. However, one possible (and maybe unique) advantage of logistic regression is that it allows simpler interpretation of its results in the form of odds ratios and statistical hypothesis testing. Statistical analysis using neural networks is also possible, but some may argue it is not as straightforward as using ordinary logistic regression.

Multilayer perceptron neural networks with sigmoid activation functions can be created using the Accord.Neuro package and namespace.


Kernel Principal Component Analysis in C#


Kernel principal component analysis (kernel PCA) is an extension of principal component analysis (PCA) using techniques of kernel methods. Using a kernel, the originally linear operations of PCA are done in a reproducing kernel Hilbert space with a non-linear mapping.

The code presented here is also part of the Accord.NET Framework. The Accord.NET Framework is a framework for developing machine learning, computer vision, computer audition, statistics and math applications. It is based on the already excellent AForge.NET Framework. Please see the starting guide for mode details. The latest version of the framework includes the latest version of this code plus many other statistics and machine learning tools.

Analysis Overview

KPCA is an extension of PCA to non-linear distributions. Instead of directly doing a PCA, the original data points are mapped into a higher-dimensional (possibly infinite-dimensional) feature space defined by a (usually nonlinear) function φ through a mathematical process called the “kernel trick”.

The Kernel trick

The kernel trick is a very interesting technique, able to transform any algorithm that solely depends on the dot product between two vectors. To apply the trick, we replace all dot products in an algorithm with a kernel function. Thus, a linear algorithm can easily be transformed into a non-linear algorithm. This non-linear algorithm is equivalent to the linear algorithm operating in the range space of φ. However, because kernels are used, the φ function is never explicitly computed. This is desirable [8], as the high-dimensional space can have an arbitrary large dimensionality, or can be even infinite-dimensional (such as in the case the chosen kernel function is a Gaussian) and thus infeasible to compute.

Interesting video by Udi Aharoni demonstrating how points that are not linearly separable in a 2D space can almost always be linearly separated in a higher dimensional (3D) space.

Principal Components in Kernel Space

Like in PCA, the overall idea is to perform a transformation that will maximize the variance of the captured variables while minimizing the overall covariance between those variables. Using the kernel trick, the covariance matrix is substituted by the Kernel matrix and the analysis is carried analogously in feature space. An Eigen value decomposition is performed and the eigenvectors are sorted in ascending order of Eigen values, so those vectors may form a basis in feature space that explain most of the variance in the data on its first dimensions.

However, because the principal components are in feature space, we will not be directly performing dimensionality reduction. Suppose that the number of observations m exceeds the input dimensionality n. In linear PCA, we can find at most n nonzero Eigen values. On the other hand, using Kernel PCA we can find up to m nonzero Eigen values because we will be operating on a m x m kernel matrix[2].

A more detailed discussion can be seen in the works of Mika et al. (Kernel PCA and De-Noising in Feature Spaces) and Scholkopf et al. (Nonlinear Component Analysis as a Kernel Eigenvalue Problem). A more accessible explanation of the topic is also available in the work from Ian Fasel.

The pre-image problem

In standard PCA it was possible to revert the transformation, retrieving the original data points and enabling the use of PCA for compression* and denoising. However, as the results provided by kernel PCA live in some high dimensional feature space,  they need not have pre-images in input space. In fact, even if it exists, it also need be not unique.

Thus, in KPCA, it is only possible to obtain an approximate pre-image for a given set of patterns projected on the feature space. This problem can (and has) been tackled by many approaches, some of them of iterative nature, but some closed-form solutions also exists. Typically those solutions make use of the fact that distances in feature space are related to distances in input space. Thus, those solutions try to achieve an optimal transformation that can embed those feature points in input space respecting those distances relationships.

One of the first solutions of this form were proposed by Kwok and Tsang in their paper “The Pre-Image Problem in Kernel Methods". A better approach is also described by Bakir on his thesis “Extensions to Kernel Dependency Estimation”, alongside with a nice comprehensive description on various other methods for handling the pre-image problem.

Note: actually the use of KPCA for compression is a bit limited, since one needs to store the original data set to compute transformations and its approximate reversion.


Code Overview

Below is the main code behind KPCA. This is a direct, naive implementation of the algorithm that may not scale very well for very large datasets. It is implemented on top of the previous standard PCA code and thus on top of the AForge.NET Framework.


The main code behind the reversion of the projection. It is a direct, naive implementation of the closed-form solution algorithm as proposed by Kwok and Tsang. Currently it is completely not optimized, but it gives an better overview of the algorithm on its essential form. The source code available to download may be better optimized in the future.



Using the Code

As with PCA, code usage is simple. First, choose a kernel implementing the IKernel interface then pass it to the KernelPrincipalComponentAnalysis constructor together with the data points.

Then, simply call Compute() to perform the analysis. Once done, data about the analysis will be available through the object properties and transformations on additional data can be computed by calling Transform().


Sample Application

To demonstrate the use of KPCA, I created a simple Windows Forms Application which performs simple statistical analysis and KPCA transformations. It can also perform projection reversion by solving the approximate pre-image problem.

Scholkopf's KPCA Toy Example Kernel principal components detected by the KPCA Two first components detected by the KPCA

On left: Scholkopf Toy Example. Middle: Kernel principal components detected by the KPCA. Right: The two first dimensions obtained by the two first Kernel principal components.


Wikipedia's Example on KPCA Projection using a non-centered Gaussian kernel Reconstruction of the original data

Left: Wikipedia example. Middle: Projection using a non-centered Gaussian kernel. Right: Reconstruction of the original data using all Kernel principal components, showing some denoising ability.

Linear PCA as a special case

Lindsay Smith's PCA example data Linear Principal Component Analysis

Additionally we may check that linear PCA is indeed a special case of kernel PCA by using a linear kernel on the example by Lindsay Smith on his A Tutorial On Principal Component Analysis.


See also



Introducing System.Console.Forms


Some days ago I started a new project just for fun. System.Console.Forms is a extension to .NET framework to “port” as much as possible of System.Windows.Forms to a Console environment.


I know a lot of people think there is absolutely no reason for developing console programs nowadays, but they are just plain wrong. It obviously isn’t mainstream, but if you use ssh a lot and cannot afford a X connection, then working with console interfaces makes a lot of sense. Also, I know there are other known, well-tested and proven solutions for console user interfaces like the ncurses library, but they are not very object oriented and not very user-friendly if you came from a .NET WinForms or Java SWT/Swing world.

Thus this project was started.


I know accomplishing this task would consume a lot of time and effort, so I can’t promise it will ever be complete, or even if it will not die a few months from now. But if are interested and want to help, feel free to contact me. Much of the base code is almost done (it is already possible to create Forms and UserControls) but most of the default controls, like TextBoxes, RadioButtons and ToolStripMenus are still incomplete or missing.

nano#nano# - 2

As you can see, it currently lacks some visual enhancements (such as drop-shadow borders and a better default form design), but this is something scheduled only for future releases. The first milestone is to replicate only the basic functionality and interface of nano, and the second is to fully reproduce MS-DOS edit.exe. When this becomes possible, the library will be almost feature complete.

I don’t think I’ll ever be able to finish this by myself alone, so I licensed the project under the LGPL if anyone else gets interested. The project is being hosted at Google Code, and the project page is http://console-forms.googlecode.com.

Software Licenses’ Legal Notices

When releasing software it is always good to release under a suitable license. Licensing policies were created to protect you and your code, restricting what others can do using your code, and not what you can do with your own code. The copyright holder (the author, e.g. you) is able to change licensing policies anytime, but, depending on the license you choose, only as long as every author and every person who has ever contributed to your project also agree.

Wikipedia offers a list of popular licenses and a comparison of free software licenses you can use in your projects. But no matter which one you choose, a common practice is to include a license notice header in all your source files to enforce the restrictions you have chosen. Here is a list of common headers for the most common available software licenses. Comparison between its different implications is beyond the scope of this post, but more information is available on the Additional Information section below.

GPL – GNU Public License (details)

LGPL – Lesser GNU Public License (details)

MPL – Mozilla Public License (details)

BSD License (details)

MIT License (details)

Apache License (details)

CC – Creative Commons (details)

Creative Commons licenses are not intended to apply to, and should not be used for software. I’ve included it here just to remember you.

Public Domain (details)

This is only a suggestion as there is no common sense on how to dedicate your work for the public domain. This excerpt was borrowed from the SQLite site, a popular software that is itself dedicated to the public domain.

WARNING: Please pay attention that the Public Domain Dedication is not a license. By using it, you do not simply carve out exceptions to your copyright; you grant your entire copyright to the public without condition. This grant is permanent and irreversible. You’ve been warned.

Additional Information

For a nice comparison of different licenses, implications and more useful information, please check:

Comparison of Source Code Licenses

Also, before blindly adopting the GPLv3, please be sure you have completely and correctly understood everything it says. For more info about the GPL and the GPLv2 vs GPLv3 debate, see:

The GPL for Dummies
The GPLv2 vs. GPLv3 Debate
What the kernel guys are and aren’t (and really should be) saying about GPLv3
The Dangers and Problems with GPLv3
The GPLv2 vs. GPLv3 Chart

Personally, I would rather go with the GPL v2 instead.