Logistic regression (sometimes called the logistic model or logit model) is used for prediction of the probability of occurrence of an event by fitting data to a logistic curve. Logistic regression is used extensively in the medical and social sciences as well as marketing applications such as prediction of a customer’s propensity to purchase a product or cease a subscription.
The code presented here is part of the Accord.NET Framework. The Accord.NET Framework is a framework for developing machine learning, computer vision, computer audition, statistics and math applications in .NET. To use the framework in your projects, install it by typing InstallPackage Accord.MachineLearning in your IDE’s NuGet package manager.
Overview
The Logistic regression is a generalized linear model used for binomial regression. Like many forms of regression analysis, it makes use of several predictor variables that may be either numerical or categorical.
Logistic Sigmoid Function
The logistic sigmoid function is given by
g(z) = 1 / (1 + Exp(z))
where in the context of logistical regression z is called the logit.
Logistic Regression Model
The logistic regression model is a generalized linear model. This means that it is just a linear regression model taken as input for a nonlinear link function. The linear model has the form
z = c_{1}x_{1} + c_{2}x_{2} + … c_{n}x_{n} + i = c^{t} x + i
where c is the coefficient vector, i is the intercept value and x is the observation vector for n variables and in the context of logistic regression is called the logit. The logit is then applied as input for the nonlinear logistic sigmoid function g(z), giving as result a probability.
So in a binomial problem where we are trying to determine whether a observation belongs to class C_{1} or class C_{2}, the logistic model tell us that:
p(C_{1}x) = g(c^{t} x + i)
p(C_{2}x) = 1 – p(C_{1}x)
where p(C_{1}x) denotes the probability of C_{1} being true when x is true.
In other words, denotes the probability of x belonging to class C_{1}.
Coefficients
The coefficients for the logistic regression are the values which multiply each observation variable from a sample input vector. The intercept is analogous to the independent term in a polynomial linear regression model or the threshold or bias value for a neuron in a neural network model.
Odds Ratio
After computation of the logistic regression model, every coefficient will have an associated measure called the odds ratio. The odds ratio is a measure of effect size, describing the strength of association or nonindependence between two binary data values and can be approximated by raising the coefficient value to the euler’s number.
Odds ratio_{c} = e^{c}
Standard Error
The standard error for the coefficients can be retrieved from the inverse Hessian matrix calculated during the model fitting phase and can be used to give confidence intervals for the odds ratio. The standard error for the ith coefficient of the regression can be obtained as:
SE_{i }= sqrt(diag(H^{1})_{i})
Confidence Intervals
The confidence interval around the logistic regression coefficient is plus or minus 1.96*SE_{i}, where SE_{i} is the standard error for coefficient i. We can then define:
95% C.I._{i} = <lower, upper> = <coefficient_{i} – 1.96 * SE_{i}, coefficient_{i} + 1.96 * SE_{i}>
Wald Statistic and Wald’s Test
The Wald statistic is the ratio of the logistic coefficient to its standard error. A Wald test is used to test the statistical significance of each coefficient in the model. A Wald test calculates a Z statistic, which is:
z = coefficient / standard error
This z value can be squared, yielding a Wald statistic with a chisquare distribution, or, alternatively, it can be taken as is and compared directly with a Normal distribution.
The Wald test outputs a pvalue indicating the significance of individual independent variables. If the value is below a chosen significance threshold (typically 0.05), then the variable plays some role in determining the outcome variable that most likely is not result of chance alone. However, there are some problems with the use of the Wald statistic. The Likelihoodratio test is a better alternative for the Wald test.
LikelihoodRatio and ChiSquare Test
The likelihoodratio is the ratio of the likelihood of the full model over the likelihood of a simpler nested model. When compared to the null model the likelihoodratio test gives an overall model performance measure. When compared to nested models, each with one variable omitted, it tests the statistical significance of each coefficient in the model. These significance tests are considered to be more reliable than the Wald significance test.
The likelihoodratio is a chisquare statistic given by:
The model with more parameters will always fit at least as well (have a greater loglikelihood). Whether it fits significantly better and should thus be preferred can be determined by deriving the probability or pvalue of the obtained difference D. In many cases, the probability distribution of the test statistic can be approximated by a chisquare distribution with (df1 − df2) degrees of freedom, where df1 and df2 are the degrees of freedom of models 1 and 2 respectively.
Regression to a Logistic Model
If we consider the mapping
φ(<x_{1}, x_{2}, …, x_{n}>) = <x_{1}, x_{2}, … x_{n}, 1>
The logistic regression model can also be rewritten as
p(C_{1}φ) = g(w^{t} φ) = f(φ, w)
so that w contains all coefficients and the intercept value in a single weight vector. Doing so will allow the logistic regression model to be expressed as a common optimization model in the form f(φ, w) allowing many standard nonlinear optimization algorithms to be directly applied in the search for the best parameters w that best fits the probability of a class C_{1} given a input vector φ.
Likelihood function
The likelihood function for the logistic model is given by:
but, as the log of products equals the sum of logs, taking the log of the likelihood function results in the Loglikelihood function in the form:
Furthermore, if we take the negative of the loglikelihood, we will have a error function called the crossentropy error function:
which gives both better numerical accuracy and enable us to write the error gradient in the same form as the gradient of the sumofsquares error function for linear regression models (Bishop, 2006).
Another important detail is that the likelihood surface is convex, so it has no local maxima. Any local maxima will also be a global maxima, so one does not need to worry about getting trapped in a valley when walking down the error function.
Iterative Reweighted LeastSquares
The method of Iterative Reweighted LeastSquares is commonly used to find the maximum likelihood estimates of a generalized linear model. In most common applications, an iterative NewtonRaphson algorithm is used to calculate those maximum likelihood values for the parameters. This algorithm uses second order information, represented in the form of a Hessian matrix, to guide incremental coefficient changes. This is also the algorithm used in this implementation.
In the Accord.NET machine learning framework. the Iterative Reweighted LeastSquares is implemented in the IterativeReweightedLeastSquares class (source).
Source Code
Below is the main code segment for the logistic regression, performing one pass of the Iterative Reweighted LeastSquares algorithm.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92

/// <summary> /// Iterates one pass of the optimization algorithm trying to find /// the best regression coefficients for the logistic model. /// </summary> /// <remarks> /// An iterative NewtonRaphson algorithm is used to calculate /// the maximum likelihood values of the parameters. This procedure /// uses the partial second derivatives of the parameters in the /// Hessian matrix to guide incremental parameter changes in an effort /// to maximize the log likelihood value for the likelihood function. /// </remarks> /// <returns> /// The absolute value of the largest parameter change. /// </returns> public double Regress(double[][] input, double[] output) { // Regress using Iterative Reweighted Least Squares estimation. // Initial definitions and memory allocations int N = input.Length; int M = this.Coefficients.Length; double[,] regression = new double[N, M]; double[,] hessian = new double[M, M]; double[] gradient = new double[M]; double[] errors = new double[N]; double[] R = new double[N]; double[] deltas; // Compute the regression matrix, errors and diagonal for (int i = 0; i < N; i++) { double y = this.Compute(input[i]); double o = output[i]; // Calculate error vector errors[i] = y  o; // Calculate R diagonal R[i] = y * (1.0  y); // Compute the regression matrix regression[i, 0] = 1; for (int j = 1; j < M; j++) regression[i, j] = input[i][j  1]; } // Compute error gradient and "Hessian" matrix (with diagonal R) for (int i = 0; i < M; i++) { // Compute error gradient for (int j = 0; j < N; j++) gradient[i] += regression[j, i] * errors[j]; // Compute "Hessian" matrix (regression'*R*regression) for (int j = 0; j < M; j++) for (int k = 0; k < N; k++) hessian[j, i] += regression[k, i] * (R[k] * regression[k, j]); } // Decompose to solve the linear system LuDecomposition lu = new LuDecomposition(hessian); double[,] inverse; if (lu.NonSingular) { // Solve using LU decomposition deltas = lu.Solve(gradient); inverse = lu.Inverse(); } else { // Hessian Matrix is singular, try pseudoinverse solution SingularValueDecomposition svd = new SingularValueDecomposition(hessian); deltas = svd.Solve(gradient); inverse = svd.Inverse(); } // Update coefficients using the calculated deltas for (int i = 0; i < coefficients.Length; i++) this.coefficients[i] = deltas[i]; // Calculate Coefficients standard errors for (int i = 0; i < standardErrors.Length; i++) standardErrors[i] = System.Math.Sqrt(inverse[i, i]); // Return the absolute value of the largest parameter change return Matrix.Max(Matrix.Abs(deltas)); } 
Furthermore, as the likelihood function is convex, the Logistic Regression Analysis can perform regression without having to experiment different starting points. Below is the code to compute a logistic regression analysis. The algorithm iterates until the largest absolute parameter change between two iterations becomes less than a given limit, or the maximum number of iterations is reached.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46

/// <summary> /// Computes the Logistic Regression Analysis. /// </summary> /// <remarks>The likelihood surface for the /// logistic regression learning is convex, so there will be only one /// peak. Any local maxima will be also a global maxima. /// </remarks> /// <param name="limit"> /// The difference between two iterations of the regression algorithm /// when the algorithm should stop. If not specified, the value of /// 10e4 will be used. The difference is calculated based on the largest /// absolute parameter change of the regression. /// </param> /// <returns> /// True if the model converged, false otherwise. /// </returns> public bool Compute(double limit, int maxIterations) { double delta; int iteration = 0; do // learning iterations until convergence { delta = regression.Regress(inputData, outputData); iteration++; } while (delta > limit && iteration < maxIterations); // Store additional information this.deviance = regression.GetDeviance(inputData, outputData); this.logLikelihood = regression.GetLogLikelihood(inputData, outputData); this.chiSquare = regression.ChiSquare(inputData, outputData); for (int i = 0; i < regression.Coefficients.Length; i++) { this.waldStatistics[i] = regression.GetWaldStatistic(i); this.standardErrors[i] = regression.GetStandardError(i); this.pvalues[i] = regression.GetPValue(i); this.coefficients[i] = regression.Coefficients[i]; this.confidences[i] = regression.GetConfidenceInterval(i); this.oddsRatios[i] = regression.GetOddsRatio(i); } return delta < limit; } 
Using the code
Code usage is very simple. To perform a Logistic Regression Analysis, simply create an object of the type LogisticRegressionAnalysis and pass the input and output data as constructor parameters. Optionally you can pass the variables names as well.

// Creates the Logistic Regression Analysis of the given source var lra = new LogisticRegressionAnalysis(input, output, independentNames, dependentName); 
Then just call Compute() to compute the analysis.

// Compute the Logistic Regression Analysis lra.Compute(); 
After that, you can display information about the regression coefficients by binding the CoefficientCollection to a DataGridView.

// Populates coefficient overview with analysis data dgvLogisticCoefficients.DataSource = lra.Coefficients; 
Sample application
The sample application creates Logistic Regression Analyses by reading Excel workbooks. It can also perform standard Linear Regressions, although there aren’t many options available to specify linear models.
Example data set adapted from http://www.statsdirect.co.uk/help/regression_and_correlation/logi.htm.
The image on the left shows information about the analysis performed, such as coefficient values, standard errors, pvalue for the Wald statistic, Odds ratio and 95% confidence intervals. It also reports the loglikelihood, deviance and likelihoodratio chisquare test for the final model. The newest available version can also compute likelihoodratio tests for each coefficient, although not shown in the image.
Final Considerations
The logistic regression model can be seen as the exact same as a one layer MLP with only one hidden neuron using a sigmoid activation function trained by a Newton’s method learning algorithm. Thus we can say the Logistic Regression is just a special case of Neural Networks. However, one possible (and maybe unique) advantage of logistic regression is that it allows simpler interpretation of its results in the form of odds ratios and statistical hypothesis testing. Statistical analysis using neural networks is also possible, but some may argue it is not as straightforward as using ordinary logistic regression.
Multilayer perceptron neural networks with sigmoid activation functions can be created using the Accord.Neuro package and namespace.
References
 Bishop, Christopher M.; Pattern Recognition and Machine Learning. Springer; 1st ed. 2006.
 Connor, Edward F.; Logistic Regression.Available in: http://userwww.sfsu.edu/~efc/classes/biol710/logistic/logisticreg.htm.
 DTREG; Logistic Regression.Available in: http://www.dtreg.com/logistic.htm.
 Guidry, Trent F.; Linear and Multiple Linear Regression in C#. Available in:http://www.trentfguidry.net/post/2009/07/19/LinearandMultipleLinearRegressioninC.aspx.
 Shalizi, Cosma; Logistic Regression and Newton’s method.Available in: http://www.stat.cmu.edu/~cshalizi/350/lectures/26/lecture26.pdf.
 Storkey, Amos; Learning from Data: Learning Logistic Regressors.Available in: http://www.inf.ed.ac.uk/teaching/courses/lfd/lectures/logisticlearnprint.pdf.
 Stata; Stata FAQ: How do I interpret odds ratios in logistic regression?
Available in: http://www.ats.ucla.edu/stat/Stata/faq/oratio.htm
 StatsDirect Statistical Software; Logistic Regression.Available in: http://www.statsdirect.co.uk/help/regression_and_correlation/logi.htm
 Wikipedia contributors, “Likelihoodratio Test”, Wikipedia, The Free Encyclopedia,http://en.wikipedia.org/wiki/Likelihoodratio_test (accessed February 10, 2010).
 Wikipedia contributors, “Logistic Regression”, Wikipedia, The Free Encyclopedia,http://en.wikipedia.org/wiki/Logistic_regression (accessed February 10, 2010).