Logistic Regression in C#

Logistic regression (sometimes called the logistic model or logit model) is used for prediction of the probability of occurrence of an event by fitting data to a logistic curve. Logistic regression is used extensively in the medical and social sciences as well as marketing applications such as prediction of a customer’s propensity to purchase a product or cease a subscription.

The code presented here is part of the Accord.NET Framework. The Accord.NET Framework is a framework for developing machine learning, computer vision, computer audition, statistics and math applications in .NET. To use the framework in your projects, install it by typing Install-Package Accord.MachineLearning in your IDE’s NuGet package manager.

Overview

The Logistic regression is a generalized linear model used for binomial regression. Like many forms of regression analysis, it makes use of several predictor variables that may be either numerical or categorical.

Logistic Sigmoid Function

The logistic sigmoid function is given by

g(z) = 1 / (1 + Exp(-z))

where in the context of logistical regression z is called the logit.

Logistic Regression Model

The logistic regression model is a generalized linear model. This means that it is just a linear regression model taken as input for a non-linear link function. The linear model has the form

z = c1x1 + c2x2 + … cnxn + i = ct x + i

where c is the coefficient vector, i is the intercept value and x is the observation vector for n variables and in the context of logistic regression is called the logit. The logit is then applied as input for the nonlinear logistic sigmoid function g(z), giving as result a probability.

So in a binomial problem where we are trying to determine whether a observation belongs to class C1 or class C2, the logistic model tell us that:

p(C1|x) = g(ct x + i)

p(C2|x) = 1 – p(C1|x)

where p(C1|x) denotes the probability of C1 being true when x is true.
In other words, denotes the probability of x belonging to class C1.

Coefficients

The coefficients for the logistic regression are the values which multiply each observation variable from a sample input vector. The intercept is analogous to the independent term in a polynomial linear regression model or the threshold or bias value for a neuron in a neural network model.

Odds Ratio

After computation of the logistic regression model, every coefficient will have an associated measure called the odds ratio. The odds ratio is a measure of effect size, describing the strength of association or non-independence between two binary data values and can be approximated by raising the coefficient value to the euler’s number.

Odds ratioc = ec

Standard Error

The standard error for the coefficients can be retrieved from the inverse Hessian matrix calculated during the model fitting phase and can be used to give confidence intervals for the odds ratio. The standard error for the i-th coefficient of the regression can be obtained as:

SEi = sqrt(diag(H-1)i)

Confidence Intervals

The confidence interval around the logistic regression coefficient is plus or minus 1.96*SEi, where SEi is the standard error for coefficient i. We can then define:

95% C.I.i = <lower, upper> = <coefficienti – 1.96 * SEi,  coefficienti + 1.96 * SEi>

Wald Statistic and Wald’s Test

The Wald statistic is the ratio of the logistic coefficient to its standard error. A Wald test is used to test the statistical significance of each coefficient in the model. A Wald test calculates a Z statistic, which is:

z = coefficient / standard error

This z value can be squared, yielding a Wald statistic with a chi-square distribution, or, alternatively, it can be taken as is and compared directly with a Normal distribution.

The Wald test outputs a p-value indicating the significance of individual independent variables. If the value is below a chosen significance threshold (typically 0.05), then the variable plays some role in determining the outcome variable that most likely is not result of chance alone. However, there are some problems with the use of the Wald statistic. The Likelihood-ratio test is a better alternative for the Wald test.

Likelihood-Ratio and Chi-Square Test

The likelihood-ratio is the ratio of the likelihood of the full model over the likelihood of a simpler nested model. When compared to the null model the likelihood-ratio test gives an overall model performance measure. When compared to nested models, each with one variable omitted, it tests the statistical significance of each coefficient in the model. These significance tests are considered to be more reliable than the Wald significance test.

The likelihood-ratio is a chi-square statistic given by:

D & = -2lnleft( frac{text{likelihood for first model}}{text{likelihood for second model}} right).

The model with more parameters will always fit at least as well (have a greater log-likelihood). Whether it fits significantly better and should thus be preferred can be determined by deriving the probability or p-value of the obtained difference D. In many cases, the probability distribution of the test statistic can be approximated by a chi-square distribution with (df1 − df2) degrees of freedom, where df1 and df2 are the degrees of freedom of models 1 and 2 respectively.

Regression to a Logistic Model

If we consider the mapping

φ(<x1, x2, …, xn>) = <x1, x2, … xn, 1>

The logistic regression model can also be rewritten as

p(C1|φ) = g(wt φ) = f(φ, w)

so that w contains all coefficients and the intercept value in a single weight vector. Doing so will allow the logistic regression model to be expressed as a common optimization model in the form f(φ, w) allowing many standard non-linear optimization algorithms to be directly applied in the search for the best parameters w that best fits the probability of a class C1 given a input vector φ.

Likelihood function

The likelihood function for the logistic model is given by:

p(t|w) = prod_{n=1}^N y_n^{t_n} { 1 - y_n }^{1-t_n}

but, as the log of products equals the sum of logs, taking the log of the likelihood function results in the Log-likelihood function in the form:

ln p(t|w) = sum_{n=1}^N { t_n ln y_n + (1-t_n) ln (1 - y_n) }

Furthermore, if we take the negative of the log-likelihood, we will have a error function called the cross-entropy error function:

E(w) = -ln p(t|w) = - sum_{n=1}^N { t_n ln y_n + (1-t_n) ln (1 - y_n) }

which gives both better numerical accuracy and enable us to write the error gradient in the same form as the gradient of the sum-of-squares error function for linear regression models (Bishop, 2006).

Another important detail is that the likelihood surface is convex, so it has no local maxima. Any local maxima will also be a global maxima, so one does not need to worry about getting trapped in a valley when walking down the error function.

Iterative Reweighted Least-Squares

The method of Iterative Reweighted Least-Squares is commonly used to find the maximum likelihood estimates of a generalized linear model. In most common applications, an iterative Newton-Raphson algorithm is used to calculate those maximum likelihood values for the parameters. This algorithm uses second order information, represented in the form of a Hessian matrix, to guide incremental coefficient changes. This is also the algorithm used in this implementation.

In the Accord.NET machine learning framework. the Iterative Reweighted Least-Squares is implemented in the IterativeReweightedLeastSquares class (source).

Source Code

Below is the main code segment for the logistic regression, performing one pass of the Iterative Reweighted Least-Squares algorithm.

Furthermore, as the likelihood function is convex, the Logistic Regression Analysis can perform regression without having to experiment different starting points. Below is the code to compute a logistic regression analysis. The algorithm iterates until the largest absolute parameter change between two iterations becomes less than a given limit, or the maximum number of iterations is reached.

Using the code

Code usage is very simple. To perform a Logistic Regression Analysis, simply create an object of the type LogisticRegressionAnalysis and pass the input and output data as constructor parameters. Optionally you can pass the variables names as well.

Then just call Compute() to compute the analysis.

After that, you can display information about the regression coefficients by binding the CoefficientCollection to a DataGridView.

Sample application

The sample application creates Logistic Regression Analyses by reading Excel workbooks. It can also perform standard Linear Regressions, although there aren’t many options available to specify linear models.

lr-1 lr-6

Example data set adapted from http://www.statsdirect.co.uk/help/regression_and_correlation/logi.htm.

 

lr-2 lr-3

The image on the left shows information about the analysis performed, such as coefficient values, standard errors, p-value for the Wald statistic, Odds ratio and 95% confidence intervals. It also reports the log-likelihood, deviance and likelihood-ratio chi-square test for the final model. The newest available version can also compute likelihood-ratio tests for each coefficient, although not shown in the image.

Final Considerations

The logistic regression model can be seen as the exact same as a one layer MLP with only one hidden neuron using a sigmoid activation function trained by a Newton’s method learning algorithm. Thus we can say the Logistic Regression is just a special case of Neural Networks. However, one possible (and maybe unique) advantage of logistic regression is that it allows simpler interpretation of its results in the form of odds ratios and statistical hypothesis testing. Statistical analysis using neural networks is also possible, but some may argue it is not as straightforward as using ordinary logistic regression.

Multilayer perceptron neural networks with sigmoid activation functions can be created using the Accord.Neuro package and namespace.

References

2 Comments

  1. Thank you very much for taking the time to write this up and post the sample code and application. It’s very useful for me as it’s been a while since I’ve studied LOGIT and the example helped get up to speed very quickly.

  2. Your documentation and your code is amazing!

    We desperately need to find a Logistic Regression Equation calculation in java. So:
    1.) Do you have any suggestions on 3rd party Logistic Regression Equation java tools – Or-
    2.) Could I get the source code in C#, so that I could convert it to java.

Leave a Reply

Your email address will not be published. Required fields are marked *