The Limitedmemory BroydenFletcherGoldfarbShanno method is an optimization method belonging to the family of quasiNewton methods for unconstrained nonlinear optimization. In short terms, it is an offtheshelf optimizer for seeking either minimum or maximum points of a any differentiable and possibly nonlinear function, requiring only an expression of the function and its gradient. The goal of this article is to provide and demonstrate a C# port of the LBFGS method originally written by Jorge Nocedal in Fortran.
 Download source code.
 Download the full Accord.NET Framework.
Introduction
Function optimization is a common problem found in many numerical applications. Suppose we have a differentiable function f : R^{n} → R and we would like to obtain its minimal or maximum value while traversing its space of possible input values. Those kind of problems arise often in parameter optimization of machine learning models and other statistic related applications (and in a myriad of other applications too – however, we will pay special attention to cases related to machine learning).
The problem of maximizing or minimizing a function can be simply stated as max f(x) or min f(x). When there aren’t any constraints limiting possible values for x, it is widely known from calculus that the maximum or a minimum of such a function would occur when the first derivatives f’(x) of the function f(x) in respect to x are zero. Newton’s method is a common method for finding the roots of a differentiable function and is indeed a suitable method for finding those values such that first derivatives of a function are zero.
Example of a convex, but nonlinear function f(x,y) = exp{(x1)²} + exp{(y2)²/2}. Images have been created using Wolfram Alpha.
However, while Newton’s method may work very well with functions of a single variable, the generalization to higher dimensions will require the replacement of the first derivative f’(x) with the function’s gradient vector g of partial derivatives, and the second derivative f’’(x) with the inverse Hessian matrix H^{1}. The problem is that the Hessian is a square matrix with as many rows and columns as parameters of our function f, and, besides being very costly and challenging to compute, even its size may be infeasible to accommodate in main memory when dealing with very high dimensional problems.
To overcome those limitations, several methods attempt to replace the Hessian matrix with an approximation extracted from the first partial derivatives alone. Those methods are called quasiNewton methods, as they replace H in Newton’s method by an approximation instead of using the full Hessian. In case the function being optimized has any particular form or special characteristic which could be exploited, it is also possible to use more specialized methods such as the GaussNewton, LevenbergMarquardt or even lowerbound approximation methods to avoid computing the full Hessian.
The quasiNewton BFGS method for nonlinear optimization builds an approximation for the Hessian matrix based on estimates extracted solely from first order information. However, even being cheaper to compute, the memory requirements are still high as the method still needs to accommodate a dense N x N matrix for the inverse Hessian approximation (where N is the number of variables for the function). To overcome this problem, the LBFGS method, or the limitedmemory version of the BFGS method, never forms or stores the approximation matrix, but only a few vectors which can be used to represent it implicitly.
Source code
The code presented here is based upon a direct translation of the original code by Jorge Nocedal. His original code was released under the permissive BSD license, and honoring the original license and the author’s considerations, this code is released under the BSD as well.
The LBFGS method is implemented within the BroydenFletcherGoldfarbShanno class. Upon object creation, it is possible to specify a function and its gradient through either the use of Lambda expressions or by specifying handles for named functions. After the object has been initialized, a simple call to Minimize runs the standard LBFGS method until convergence. The details for the Minimize method is given below.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 
<span>/// <summary></span> <span>/// Optimizes the defined function. </span> <span>/// </summary></span> <span>/// </span> <span>/// <param name="values">The initial guess values for the parameters.</param></span> <span>/// <returns>The values of the parameters which optimizes the function.</returns></span> <span>/// </span> <span>public</span> <span>unsafe</span> <span>double</span> Minimize(<span>double</span>[] values) { <span>if</span> (values == <span>null</span>) <span>throw</span> <span>new</span> ArgumentNullException(<span>"values"</span>); <span>if</span> (values.Length != numberOfVariables) <span>throw</span> <span>new</span> DimensionMismatchException(<span>"values"</span>); <span>if</span> (Function == <span>null</span>) <span>throw</span> <span>new</span> InvalidOperationException( <span>"The function to be minimized has not been defined."</span>); <span>if</span> (Gradient == <span>null</span>) <span>throw</span> <span>new</span> InvalidOperationException( <span>"The gradient function has not been defined."</span>); <span>// Initialization</span> x = (<span>double</span>[])values.Clone(); <span>int</span> n = numberOfVariables, m = corrections; <span>// Make initial evaluation</span> f = getFunction(x); g = getGradient(x); <span>this</span>.iterations = 0; <span>this</span>.evaluations = 1; <span>// Obtain initial Hessian</span> <span>double</span>[] diagonal = <span>null</span>; <span>if</span> (Diagonal != <span>null</span>) { diagonal = getDiagonal(); } <span>else</span> { diagonal = <span>new</span> <span>double</span>[n]; <span>for</span> (<span>int</span> i = 0; i < diagonal.Length; i++) diagonal[i] = 1.0; } <span>fixed</span> (<span>double</span>* w = work) { <span>// The first N locations of the work vector are used to</span> <span>// store the gradient and other temporary information.</span> <span>double</span>* rho = &w[n]; <span>// Stores the scalars rho.</span> <span>double</span>* alpha = &w[n + m]; <span>// Stores the alphas in computation of H*g.</span> <span>double</span>* steps = &w[n + 2 * m]; <span>// Stores the last M search steps.</span> <span>double</span>* delta = &w[n + 2 * m + n * m]; <span>// Stores the last M gradient diferences.</span> <span>// Initialize work vector</span> <span>for</span> (<span>int</span> i = 0; i < g.Length; i++) steps[i] = g[i] * diagonal[i]; <span>// Initialize statistics</span> <span>double</span> gnorm = Norm.Euclidean(g); <span>double</span> xnorm = Norm.Euclidean(x); <span>double</span> stp = 1.0 / gnorm; <span>double</span> stp1 = stp; <span>// Initialize loop</span> <span>int</span> nfev, point = 0; <span>int</span> npt = 0, cp = 0; <span>bool</span> finish = <span>false</span>; <span>// Make initial progress report with initialization parameters</span> <span>if</span> (Progress != <span>null</span>) Progress(<span>this</span>, <span>new</span> OptimizationProgressEventArgs (iterations, evaluations, g, gnorm, x, xnorm, f, stp, finish)); <span>// Start main</span> while (!finish) { iterations++; <span>double</span> bound = iterations  1; <span>if</span> (iterations != 1) { <span>if</span> (iterations > m) bound = m; <span>double</span> ys = 0; <span>for</span> (<span>int</span> i = 0; i < n; i++) ys += delta[npt + i] * steps[npt + i]; <span>// Compute the diagonal of the Hessian</span> <span>// or use an approximation by the user.</span> <span>if</span> (Diagonal != <span>null</span>) { diagonal = getDiagonal(); } <span>else</span> { <span>double</span> yy = 0; <span>for</span> (<span>int</span> i = 0; i < n; i++) yy += delta[npt + i] * delta[npt + i]; <span>double</span> d = ys / yy; <span>for</span> (<span>int</span> i = 0; i < n; i++) diagonal[i] = d; } <span>// Compute H*g using the formula given in:</span> <span>// Nocedal, J. 1980, "Updating quasiNewton matrices with limited storage",</span> <span>// Mathematics of Computation, Vol.24, No.151, pp. 773782.</span> cp = (point == 0) ? m : point; rho[cp  1] = 1.0 / ys; <span>for</span> (<span>int</span> i = 0; i < n; i++) w[i] = g[i]; cp = point; <span>for</span> (<span>int</span> i = 1; i <= bound; i += 1) { <span>if</span> (cp == 1) cp = m  1; <span>double</span> sq = 0; <span>for</span> (<span>int</span> j = 0; j < n; j++) sq += steps[cp * n + j] * w[j]; <span>double</span> beta = alpha[cp] = rho[cp] * sq; <span>for</span> (<span>int</span> j = 0; j < n; j++) w[j] = beta * delta[cp * n + j]; } <span>for</span> (<span>int</span> i = 0; i < diagonal.Length; i++) w[i] *= diagonal[i]; <span>for</span> (<span>int</span> i = 1; i <= bound; i += 1) { <span>double</span> yr = 0; <span>for</span> (<span>int</span> j = 0; j < n; j++) yr += delta[cp * n + j] * w[j]; <span>double</span> beta = alpha[cp]  rho[cp] * yr; <span>for</span> (<span>int</span> j = 0; j < n; j++) w[j] += beta * steps[cp * n + j]; <span>if</span> (++cp == m) cp = 0; } npt = point * n; <span>// Store the search direction</span> <span>for</span> (<span>int</span> i = 0; i < n; i++) steps[npt + i] = w[i]; stp = 1; } <span>// Save original gradient</span> <span>for</span> (<span>int</span> i = 0; i < g.Length; i++) w[i] = g[i]; <span>// Obtain the onedimensional minimizer of f by computing a line search</span> mcsrch(x, <span>ref</span> f, <span>ref</span> g, &steps[point * n], <span>ref</span> stp, <span>out</span> nfev, diagonal); <span>// Register evaluations</span> evaluations += nfev; <span>// Compute the new step and</span> <span>// new gradient differences</span> <span>for</span> (<span>int</span> i = 0; i < g.Length; i++) { steps[npt + i] *= stp; delta[npt + i] = g[i]  w[i]; } <span>if</span> (++point == m) point = 0; <span>// Check for termination</span> gnorm = Norm.Euclidean(g); xnorm = Norm.Euclidean(x); xnorm = Math.Max(1.0, xnorm); <span>if</span> (gnorm / xnorm <= tolerance) finish = <span>true</span>; <span>if</span> (Progress != <span>null</span>) Progress(<span>this</span>, <span>new</span> OptimizationProgressEventArgs (iterations, evaluations, g, gnorm, x, xnorm, f, stp, finish)); } } <span>return</span> f; <span>// return the minimum value found (at solution x)</span> } 
The version of the code detailed here only supports Minimization problems. However, it is possible to note that any minimization problem can be converted into a maximization problem and viceversa by taking the opposite of the function and its gradient.
Using the code
Code usage is rather simple. Suppose we would like to maximize the function g(x,y) = exp{(x1)²} + exp{(y2)²/2}:
Since we would like to perform a maximization, we first have to convert it to a minimization problem. The minimization version of the function is simply given by taking f(x,y) = –g(x,y):
Which is a convex function, as can be seen by plotting its surface. As it can be seen, the minimum of the function lies on the point (x,y) = (1,2). As expected, this point coincides with the roots of the partial derivative functions, as shown in the line plots below:
The minimization problem min f(x,y) = (exp((x1)²) + exp((y2)²/2)), computed by Wolfram Alpha.
The roots of the partial derivatives in respective to x (left) and y (right). It is possible to see that the roots occur at 1 and 2, respectively. Images computed using Wolfram Alpha.
The first step in solving this optimization problem automatically is first to specify the target function. The target function f, in this case, can be specified through a lambda function in the form:
1 2 
Func<<span>double</span>[], <span>double</span>> f = (x) => Math.Exp(Math.Pow(x[0]  1, 2))  Math.Exp(0.5 * Math.Pow(x[1]  2, 2)); 
The next step is to specify the gradient g for the function f. In this particular case, we can manually compute the partial derivatives to be df/dx = 2 e^((x1)^2) (x1) and df/dy = e^(1/2 (y2)^2) (y2), in respect to x and y, respectively. Writing a lambda function to compute the gradient vector g, we have:
1 2 3 4 5 
Func<<span>double</span>[], <span>double</span>[]> g = (x) => <span>new</span> <span>double</span>[] { 2 * Math.Exp(Math.Pow(x[0]  1, 2)) * (x[0]  1), Math.Exp(0.5 * Math.Pow(x[1]  2, 2)) * (x[1]  2) }; 
We can also note that this example was rather simple, so the gradient vector was easy to calculate. However, the gradient could also have been computed automatically using Accord.NET‘s FiniteDifferences class. In either case, all we have to do now is to create our LBFGS solver and call its Minimize() method to begin optimization:
1 2 3 4 5 6 
<span>// Create the LBFGS solver</span> BroydenFletcherGoldfarbShanno lbfgs = <span>new</span> BroydenFletcherGoldfarbShanno(numberOfVariables: 2, function: f, gradient: g); <span>// Minimize the function</span> <span>double</span> minValue = lbfgs.Minimize(); <span>// should be 2</span> 
After the optimization is complete, the solution parameter will be available in the Solution property of the lbfgs object, and will be equal to { 1.0, 2.0 }. And also in case one is interested in progress reports (such as in the case of optimizing very large functions), it is also possible to register a listener to the Progress event handler. The complete version of the sample application accompanying the source code is given below:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 
<span>// Suppose we would like to find the minimum of the function</span> <span>// </span> <span>// f(x,y) = exp{(x1)²}  exp{(y2)²/2}</span> <span>//</span> <span>// First we need write down the function either as a named</span> <span>// method, an anonymous method or as a lambda function:</span> Func<<span>double</span>[], <span>double</span>> f = (x) => Math.Exp(Math.Pow(x[0]  1, 2))  Math.Exp(0.5 * Math.Pow(x[1]  2, 2)); <span>// Now, we need to write its gradient, which is just the</span> <span>// vector of first partial derivatives del_f / del_x, as:</span> <span>//</span> <span>// g(x,y) = { del f / del x, del f / del y }</span> <span>// </span> Func<<span>double</span>[], <span>double</span>[]> g = (x) => <span>new</span> <span>double</span>[] { <span>// df/dx = {2 e^( (x1)^2) (x1)}</span> 2 * Math.Exp(Math.Pow(x[0]  1, 2)) * (x[0]  1), <span>// df/dy = { e^(1/2 (y2)^2) (y2)}</span> Math.Exp(0.5 * Math.Pow(x[1]  2, 2)) * (x[1]  2) }; Console.WriteLine(<span>"Solving:"</span>); Console.WriteLine(); Console.WriteLine(<span>" min f(x,y) = exp{(x1)²}  exp{(y2)²/2}"</span>); Console.WriteLine(); <span>// Finally, we can create the LBFGS solver, passing the functions as arguments</span> <span>var</span> lbfgs = <span>new</span> BroydenFletcherGoldfarbShanno(numberOfVariables: 2, function: f, gradient: g); <span>// And then minimize the function:</span> <span>double</span> minValue = lbfgs.Minimize(); <span>double</span>[] solution = lbfgs.Solution; <span>// The resultant minimum value should be 2, and the solution</span> <span>// vector should be { 1.0, 2.0 }. The answer can be checked on</span> <span>// Wolfram Alpha by clicking the following the link:</span> <span>// http://www.wolframalpha.com/input/?i=maximize+%28exp%28%28x1%29%C2%B2%29+%2B+exp%28%28y2%29%C2%B2%2F2%29%29</span> Console.WriteLine(<span>"The minimum value of {0:0} occurs at the solution point ({1:0},{2:0})"</span>, minValue, solution[0], solution[1]); 
Conclusion
In this post, we showed how to use a reduced set of the Accord.NET Framework‘s to perform nonlinear optimization. This routine is also used by the Conditional Random Fields and Hidden Conditional Random Fields trainers to optimize parameters for such models. The solver routines have been adapted from the original Fortran’s source code from Nocedal, which, tracing back to a 2001 message from the author, also have been reported to be available under the public domain. Nevertheless, the reduced set of the framework available for download within this post is available under a BSD license, as an alternative to the version available within the Accord.NET Framework which is available only under a LGPL license.
As always, I hope someone will find it useful 🙂
References

D.C. Liu and J. Nocedal. On the Limited Memory Method for Large Scale Optimization (1989), Mathematical Programming B, 45, 3, pp. 503528.

J. Nocedal. Updating QuasiNewton Matrices with Limited Storage (1980), Mathematics of Computation 35, pp. 773782.

J. Nocedal. LBFGS: Software for Largescale Unconstrained Optimization. Website. http://users.eecs.northwestern.edu/~nocedal/lbfgs.html

Souza, César R. The Accord.NET Framework. 01 Sep. 2010. Web. http://accordnet.origo.ethz.ch.

Wolfram Alpha LLC. WolframAlpha. http://www.wolframalpha.com/

Wikipedia contributors, Newton’s method. Wikipedia, The Free Encyclopedia. Available on: http://en.wikipedia.org/wiki/Newton’s_method

Wikipedia contributors. QuasiNewton method. Wikipedia, The Free Encyclopedia. Available on: http://en.wikipedia.org/wiki/QuasiNewton_method

Wikipedia contributors. Limitedmemory BFGS. Wikipedia, The Free Encyclopedia. Available on: http://en.wikipedia.org/wiki/Limitedmemory_BFGS

Wikipedia contributors. BFGS method. Wikipedia, The Free Encyclopedia. Available on: http://en.wikipedia.org/wiki/BFGS_method