Neural Network Learning by the Levenberg-Marquardt Algorithm with Bayesian Regularization (part 2)


A complete explanation for the totally lost, part 2 of 2. Click here to return to part 1.

Downloads

The code has been incorporated into the Accord.NET Framework, which includes the latest version of this code plus many other statistics and machine learning tools.

Contents

  1. Overview
    1. Neural Networks
    2. AForge Framework
    3. Network training as a function optimization problem
  2. Levenberg-Marquardt
    1. Computing the Jacobian
    2. Approximating the Hessian
    3. Solving the Levenberg-Marquardt equation
    4. General Algorithm
    5. Limitations
  3. Bayesian Regularization
    1. Expanded Levenberg-Marquardt Algorithm
  4. Source Code
    1. Using the Code
    2. Sample Applications
  5. Additional Notes
  6. References

Bayesian Regularization

To overcome the problem of interpolating noisy data, MacKay (1992) proposed a Bayesian framework that can be applied directly to the neural network learning problem. It also makes it possible to estimate the effective number of parameters actually used by the model – in this case, the number of network weights actually needed to solve a particular problem.
Bayesian regularization expands the cost function to search not only for the minimal error, but for the minimal error using the minimal weights. It works by introducing two Bayesian hyperparameters, alpha and beta, which tell the learning process which direction (minimal error or minimal weights) to favor.
The cost function then becomes (a minimal code sketch follows the list below):

C = β*Ed + α*Ew, where:

  1. Ed is the sum of squared errors, and
  2. Ew is the sum of squared weights

By using Bayesian regularization, one can avoid costly cross-validation. It is particularly useful for problems that cannot afford to reserve a portion of the available data for a validation set, or whose results would suffer from doing so. Regularization also reduces (or eliminates) the need to test different numbers of hidden neurons for a problem. A third quantity, gamma, indicates the effective number of weights being used by the network, thus giving an indication of how complex the network needs to be.
Many practical realizations of Bayesian regularization update the hyperparameters alpha and beta after each training cycle. However, according to Poland (2001), the most popular update algorithm fails to produce robust iterates when there is not much training data. The following algorithm shows how to update the hyperparameters according to both rules; both methods base their calculations on the inverse Hessian matrix.

Expanded Levenberg-Marquardt Algorithm

Adding Bayesian regularization to the Levenberg-Marquardt algorithm adds little overhead to the process because we already have a Hessian approximation. The algorithm for a learning epoch then becomes:

  1. Compute the Jacobian (by finite differences or using the chain rule)
  2. Compute the error gradient
    1. g = JtE
  3. Approximate the Hessian using the cross product Jacobian
    1. H = JtJ
  4. Calculate the cost function
    1. C = β*Ed+α*Ew, where:
      1. Ed is the sum of squared errors, and
      2. Ew is the sum of squared weights
  5. Solve (H + λI)δ = g to find δ
  6. Update the network weights w using δ
  7. Recalculate the cost function using the updated weights
  8. If the cost has not decreased,
    1. Discard the new weights, increase λ using v and go to step 5.
  9. Else decrease λ using v
  10. Update the Bayesian hyperparameters using MacKay’s or Poland’s formulae (a code sketch follows this list):
    1. gamma = W – (alpha * tr(H-1))
    2. beta = (N – gamma) / (2.0 * Ed)
    3. alpha = W / (2.0 * Ew + tr(H-1)) [modified Poland’s update], or
    4. alpha = gamma / (2.0 * Ew) [original MacKay’s update], where:
      1. W is the number of network parameters (number of weights and biases)
      2. N is the number of entries in the training set
      3. tr(H-1) is the trace of the inverse Hessian matrix
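
As a minimal sketch of step 10 in C# (W, N, the trace trH of the inverse Hessian, Ed and Ew are assumed to be already computed; all names are illustrative rather than the framework's actual fields):

// Step 10: update the Bayesian hyperparameters after a successful epoch,
// using the formulas given above.
double gamma = W - alpha * trH;          // effective number of parameters
double beta = (N - gamma) / (2.0 * Ed);  // note the parentheses around 2*Ed
// Pick one of the two update rules for alpha:
double alphaPoland = W / (2.0 * Ew + trH);  // modified Poland's update
double alphaMacKay = gamma / (2.0 * Ew);    // original MacKay's update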

Source Code

Here is the code for running a learning epoch of the Levenberg-Marquardt algorithm.
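
Rather than reproducing the whole listing, here is a condensed, self-contained sketch of the damping loop at the heart of an epoch (steps 5 to 9 above). The actual implementation solves the system with an LU decomposition; the plain Gaussian-elimination solver and all names below are illustrative stand-ins, not the framework's code.

using System;

// H = approximated Hessian (JtJ), g = gradient (JtE); cost evaluates
// C = beta*Ed + alpha*Ew for a candidate weight vector.
static void LevenbergMarquardtStep(double[,] H, double[] g,
    double[] weights, Func<double[], double> cost,
    ref double lambda, double v)
{
    int n = weights.Length;
    double current = cost(weights);
    while (true)
    {
        // Step 5: solve (H + lambda*I) delta = g
        double[,] A = (double[,])H.Clone();
        for (int i = 0; i < n; i++) A[i, i] += lambda;
        double[] delta = Solve(A, (double[])g.Clone());

        // Step 6: compute the candidate weights
        double[] trial = new double[n];
        for (int i = 0; i < n; i++) trial[i] = weights[i] - delta[i];

        // Steps 7-9: accept and decrease lambda, or reject and increase it
        if (cost(trial) < current)
        {
            Array.Copy(trial, weights, n);
            lambda /= v;
            return;
        }
        lambda *= v;
    }
}

// Gaussian elimination without pivoting; sufficient here because
// H + lambda*I is symmetric positive definite for lambda > 0.
static double[] Solve(double[,] A, double[] b)
{
    int n = b.Length;
    for (int k = 0; k < n; k++)
        for (int i = k + 1; i < n; i++)
        {
            double f = A[i, k] / A[k, k];
            for (int j = k; j < n; j++) A[i, j] -= f * A[k, j];
            b[i] -= f * b[k];
        }
    for (int i = n - 1; i >= 0; i--)
    {
        for (int j = i + 1; j < n; j++) b[i] -= A[i, j] * b[j];
        b[i] /= A[i, i];
    }
    return b;
}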

Using the Code

Code usage is very simple. It follows the common AForge.NET learning algorithm pattern: we instantiate a LevenbergMarquardtLearning class instead of the more traditional BackpropagationLearning. Optionally, the Jacobian can be approximated by finite differences instead of being computed directly by backpropagation (although this is not very practical except for debugging purposes). We may also enable Bayesian regularization by passing a boolean parameter to the constructor.
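
For illustration, a typical setup for the XOR problem might look like the sketch below. It assumes the ActivationNetwork class from AForge.NET and the LevenbergMarquardtLearning class discussed in this article; exact constructor overloads may vary between versions.

using AForge.Neuro;
using Accord.Neuro.Learning;

// 2 inputs, 2 hidden neurons, 1 output - enough for the XOR problem
var network = new ActivationNetwork(new BipolarSigmoidFunction(2), 2, 2, 1);

// the second argument enables Bayesian regularization
var teacher = new LevenbergMarquardtLearning(network, true);

// bipolar (-1/+1) encoding of the XOR truth table
double[][] inputs =
{
    new double[] { -1, -1 }, new double[] { -1, +1 },
    new double[] { +1, -1 }, new double[] { +1, +1 }
};
double[][] outputs =
{
    new double[] { -1 }, new double[] { +1 },
    new double[] { +1 }, new double[] { -1 }
};

// run learning epochs until the error is small enough
double error = double.PositiveInfinity;
while (error > 0.01)
    error = teacher.RunEpoch(inputs, outputs);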

Sample Applications

The original sample applications from AForge.NET have been modified to use Levenberg-Marquardt learning. However, it is quite tricky to get them to converge adequately.

[Figures: function approximation using regular Levenberg-Marquardt]

[Figures: function approximation using Levenberg-Marquardt with Bayesian regularization. Note that some of the runs do not converge to a best-fit solution; instead, a possibly more general solution is preferred. The top-right image uses the Nguyen-Widrow method for initializing weights.]

[Figure: solving the XOR problem using Levenberg-Marquardt]

Additional Notes

Some versions of the Levenberg-Marquardt algorithm solve the equation (JtJ + λ diag(JtJ))δ = JtE instead of (JtJ + λI)δ = JtE, effectively replacing the identity matrix with the diagonal of the approximated Hessian in the weight update rule. According to Wikipedia, this was suggested by Marquardt to incorporate some local curvature estimation. However, for some reason, many more recent implementations carrying the Levenberg-Marquardt name do not include Marquardt’s suggestion. Other implementations suggest choosing an initial λ related to the size of the elements of the approximated Hessian, making λ0 = t max(diag(JtJ)), where t is a value chosen by the user.
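
A sketch of that initialization, with hypothetical variable names:

// Initial damping factor proportional to the largest element of the
// approximated Hessian's diagonal; t is a small user-chosen value.
double maxDiag = hessian[0, 0];
for (int i = 1; i < numberOfParameters; i++)
    maxDiag = Math.Max(maxDiag, hessian[i, i]);
double lambda0 = t * maxDiag;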
It is also important to note that the Levenberg-Marquardt algorithm is extremely dependent on the initial guess for the network parameters. Depending on the initial weights of the network, the algorithm may converge to a local minimum or not converge at all.
That said, when it does converge it is an extremely fast method for neural network learning compared to the standard backpropagation algorithm.

Finally, if you have any comments about the article or about the code, please let me know! All suggestions are welcome.

References

  1. MacKay, D. J. C. (1992). Bayesian Interpolation. Neural Computation, 4(3), 415–447.
  2. Foresee, F. D.; Hagan, M. T. (1997). Gauss-Newton Approximation to Bayesian Learning. Proceedings of the International Joint Conference on Neural Networks, June 1997.
  3. Poland, J. (2001). On the Robustness of Update Strategies for the Bayesian Hyperparameter α.
  4. Wilamowski, B. M. et al. Efficient Algorithm for Training Neural Networks with one Hidden Layer.

Comments

  1. Thank you for your effort. This is the only place I have found so far with such a simple, direct and comprehensive explanation of the LMA.

    Once again thx.

  2. The formulas:
    beta = (N – gamma) / 2.0 * Ed
    alpha = W / 2.0 * Ew + tr(H-1) [modified Poland’s update]

    should be read

    beta = (N – gamma) / ( 2.0 * Ed )
    alpha = W / ( 2.0 * Ew + tr(H-1) ) [modified Poland’s update]

    in case anyone’s interested.

  3. Hello,
    Do the Bayesian parameters (alpha & beta) affect the Hessian H in the equation at step 5?
    From the step 1 … step 10 algorithm, it seems they don’t.
    On the other hand, in your C# code I found that ‘alpha’ should affect H:

    // Update diagonal (Levenberg-Marquardt formula)
    for (int i = 0; i < numberOfParameters; i++)
        hessian[i, i] = diagonal[i] + (lambda + alpha);

    Moreover, according to http://hagan.okstate.edu/icnn97a.pdf , ‘beta’
    should be taken into consideration too.

    What would be the best choice ? Thanks.

  4. Hi Anonymous,

    Thanks for noticing. I should update the algorithm to better reflect the use of regularization.

    About the comment on the use of beta, if you notice the block under the comment “Compute Quasi-Hessian Matrix using Jacobian (H = J’J)” you can see that the Jacobian approximation to the Hessian is already being multiplied by beta. But I did notice that it isn’t being multiplied by 2 as the paper suggests.

    It has been some time since I last used this algorithm, but I will re-check it to see if something is missing.

    Thanks for the feedback,
    César

  5. Cesar, in your C# code you noted that only networks with one output neuron can be trained by the LM algorithm – why?
    I am trying to implement the LM algorithm for a network with more than one output neuron, but have some problems – after 1-3 iterations the weights grow and the network does not converge. Additionally, lambda becomes a very big number. Regularization does not help – the alpha parameter is always very small and reduces the influence of the sum of squared weights on the cost function. Maybe there is some “tricky” thing in the LMA implementation?

  6. Hi Alex,

    The L-M can be used to train networks with multiple outputs. The fact is that the code I presented is a specialization of the general L-M to the single-output case.

    To handle more than one output neuron, the Jacobian matrix becomes slightly different. If you have N samples, M weights and 1 output neuron, the Jacobian will have N rows and M columns (as depicted in the code). However, if you have 2 output neurons, you will need 2N rows and M columns. Every two consecutive rows will relate to the same input sample, but each of those rows will relate to one of the outputs for this sample.
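
    To illustrate the layout with a hypothetical sketch (K outputs, N samples, M weights; PartialDerivative is a placeholder, not actual framework code):

    // Row (K * sample + output) holds the derivatives of that output's
    // error for the given sample with respect to each of the M weights.
    for (int sample = 0; sample < N; sample++)
        for (int output = 0; output < K; output++)
            for (int w = 0; w < M; w++)
                jacobian[K * sample + output, w] =
                    PartialDerivative(sample, output, w);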

    When using multiple outputs, I believe that the Bayesian regularization part also needs to be replicated. Unfortunately I haven’t found many examples in the literature about it. Perhaps you may find some information in the original papers by MacKay (Neural Computation, Vol. 4, No. 3, 1992, pp. 415 to 447) and Foresee and Hagan (Proceedings of the International Joint Conference on Neural Networks, June, 1997).

    I hope it helps. Please let me know if you have success implementing it!

    Best regards,
    César

  7. Hello Cesar,
    I have tested my LMA implementation on the XOR problem – it does not converge :-((. Adding Bayesian regularization has no effect… I have compared my formulas for computing the Jacobian with yours – they are identical! But in my case the sum of squared errors for the test weights (derived from solving the LMA equation) is always greater than the initial sum of squared errors, and I need to increase lambda again and again, which makes the weights very big. Could you please produce some debug output from your LMA for XOR? It would be very helpful if I could compare my Jacobian values with yours for specific weights…
    I suppose I have missed some very little but very important thing in my LMA implementation…

  8. Cesar,
    I have found an error in my Jacobian computation formulas, but there is still no convergence, even for XOR. Maybe it is because I am not using any biases (thresholds) in my network? Are biases important for the LMA?

  9. Hi Alex,

    Well, bias terms are usually very important in neural networks in general. Their usefulness isn’t restricted to specific methods such as LMA.

    I really think you should try adding those. Or try fitting a simpler function such as an AND or an OR, which do not require multilayer networks to be learnt.

    Best regards,
    César

  10. Hello, Cesar!
    Your C# code is very helpful to me. I have implemented bias terms for my network and continue to fix my Jacobian computation routine. In your Jacobian computation by the chain rule, you compute the “sum” like this:

    for (int k = 0; k < network[layer + 1].InputsCount; k++)
    {
        sum += network[layer + 1][j][k] * network[layer].Output[j];
    }

    Is it correct to use “network[layer].Output[j]” as the multiplier? As I understand your code, you compute “sum” as the sum of products of output neuron weights and hidden layer outputs. Maybe “network[layer].Output[k]” would be correct?

  11. Hi Alex,

    Thanks. You have found the last bug in the code 🙂

    The correct term is indeed Output[k].

    By the way, there is a slightly clearer version of the code available at Google Code. This will be fixed in the next version of the framework.

    Best regards,
    César

  12. Cesar,
    Thanks a lot for all your help! Now my LMA works well! Additionally, my LMA works with neural networks with many output neurons.
    There are two questions:
    1. Does your version of the LMA with Bayesian regularization converge more slowly than the “plain” LMA?
    2. Have you published your results in some journal? Please let me know, so I can make a reference in my thesis.

  13. Hi Alex!

    I am glad this discussion helped both of us 🙂

    1) Bayesian regularization does not exactly “slow” the convergence in all cases, but indeed, it may seem like it does. The regularization tends to attenuate the fitting. It is as if the network presumes that some points are outliers or just noise in the data and chooses not to overfit them, thus converging to a presumably more general solution. I am under the impression that the convergence rate can be either faster or slower, depending on the problem.

    2) No, I have not yet. But I will do so soon. If you wish to make a reference, please make it to the Accord.NET Framework. At the bottom there is a small section, “Licensing”, with a small example for citation. Thanks, I would be very grateful if you could do that.

    Additionally, may I ask whether you would like to contribute the code for multiple outputs, so it could eventually be incorporated in the framework sometime in the future?

    All the best,
    César

  14. Good article, just starting to look at it to see if it fits my needs.

    I have a question. You calculate the derivatives both by the chain rule and by finite differences? Why the two? In the article you state that finite differences are not effective – why is that? It also sounds like your chain rule implementation is not complete, as there is the comment (currently only 1 hidden layer supported). Which do you suggest using, or is this LMA really limited to one hidden layer and one output?

  15. Hello, Cesar!

    In my case Bayesian regularization makes convergence slower, but the error dynamics are smoother. I think it is because Bayesian regularization uses the sum of squared weights while computing the weight change.

    My code for multiple outputs is based on the paper “A Constructive Algorithm for Feedforward Neural Networks” (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.108.2791&rep=rep1&type=pdf). I can email you my code if you need it. Unfortunately, it is in Delphi, not C# or C++…

  16. Hi Anonymous,

    Both are provided for debugging purposes. Chain rule is preferred, as it is a direct computation of the derivatives. Numerical differentiation can be costly in comparison to a closed-form solution in this case, since the backpropagation algorithm is often very efficient.

    Yes, only one hidden layer is supported. It has been proven that a neural network with a single hidden layer is sufficient to approximate any continuous function. If I remember correctly (sorry, it has been some time since I first wrote it), it is also a constraint assumed by the Bayesian regularization procedure.

    As I mentioned in a comment a few days ago, there is also a small bug in the chain rule which I will be correcting soon. Alex (from the comments above) has also implemented the code for multiple outputs, which I am looking forward to incorporating in the implementation.

    @Alex, if you could, can you please share the code by email? I would be very grateful. Please tell me if I can base my modifications on it. I will surely include a new copyright notice in the implementation mentioning your contribution.

    Best regards,
    César

  17. Dear Cesar,

    Thank you very much for sharing this very helpful information with us.

    I would like to ask you a question regarding the Levenberg-Marquardt training algorithm available in Matlab Neural Network Toolbox.

    I couldn’t figure out why each training run ends differently (i.e. different mean squared error, different number of iterations, etc.) when I perform multiple trainings with the same (time-delayed) neural network. I start each training with zero weights and biases. The data used for the training is the same (no random division into training and validation sets). And the network structure is the same. I delete all the past data (including the neural network) before starting the next training. Is there some randomness inside the MATLAB implementation of the LM algorithm? Why are the training results different at the end of each training?

    Many thanks.

  18. Hi Anonymous,

    Unfortunately I don’t know much about the Levenberg-Marquardt implementation in MATLAB, as I don’t have it installed. But as far as I know, the original Levenberg-Marquardt does not involve any random calculation.

    Perhaps what could be happening is that the toolbox is automatically initializing your network to have Gaussian-distributed weights as a prerequisite of the Bayesian Regularization. Have you tried asking in the community support forums for the neural network toolbox? Perhaps they could help better.

    Best regards,
    César

  19. I think there might be something very wrong with the way the gradients are calculated. Correct me if I am wrong, but isn’t the point of the LMA to use lambda to interpolate between Gauss-Newton (GNA) and gradient descent? Namely, when the GNA is failing to converge, lambda kicks in and gradient descent takes over… sort of like backprop. So in theory the LMA should never be outperformed by backprop… which I do not find to be the case using this with AForge.

    To further test this, I fixed lambda at its maximum. You would think the net could still learn, using pure gradient descent at this point. However, the net does not learn. I also started checking the gradient arrays against ones I calculated manually for a few numbers, and they do not match.

    Any idea? Shouldn’t I be able to fix lambda high and still see convergence? This makes me doubt that the gradient descent part of this is actually working, and that it’s all GNA.

  20. Hi Anonymous,

    Some months ago I replaced the download links with an updated version. The most recent version is available through the Accord.NET Framework and can also be seen on Google Code. In particular, the last fix was indeed in the calculation of the gradient. The development sources for the next release (not yet on Google Code) have had the gradient computation rewritten to support multiple hidden layers.

    Setting the sample application’s learning rate to 10 or 1000000 achieves convergence almost instantly. If we also go and fix the damping factor adjustment v to 1, it is still possible to see convergence, albeit really slowly. If we set the learning rate to a very low value and fix v, the code is not able to make progress, because it relies on the damping factor adjustment to make the matrix non-singular during the LU decomposition, which may prevent progress otherwise.

    By the way, if you would like, please take a look at the comments section of the first part of the post, particularly the ones preceding this one. If you believe you have found an error, please let me know. As I stated in the linked comment, I cannot guarantee the code is perfect. But we can always review it again and make any corrections as necessary.

    Update: I have just committed the latest source to Google Code, in case you would like to take a look.

    Best regards,
    Cesar

  21. for (int j = 0; j < ji.Length; j++)
        s += ji[j] * errors[j];

    I looked at the new code; I am not sure it is right. See the lines above… you are assuming that ji and errors will be of the same length. ji is a row from the Jacobian, so it will have a length equal to the number of weights. errors has a length equal to the number of training set elements. So if you run this on XOR, where you have fewer training set items than weights, it throws an out-of-bounds error.

  22. Hi Anonymous,

    Have you actually tested it? I cannot test it right now, but as far as I can read my own code, it creates and handles a transposed representation of the Jacobian matrix to improve locality in those loops.

    Best regards,
    César

  23. Hi Cesar,

    Thank you very much for sharing your experience.

    A couple of months ago I started to research a problem using the LMA. All works fine: the network converges (above 2500 epochs the delta between the outputs of two consecutive training iterations is 0), is stable (it always reaches the same solution for the same parameters), and the results seem to be good. But (there is always a but) the results aren’t good enough. The solution deviation is around 25% from the training output.
    As I saw in previous posts, there were some bugs. The odd fact is that my ANN returns the same result with your last release as with the older framework version. Even with many layers the result is almost the same.
    Should I suppose that I really hit the global minimum and cannot go further?
    Or are there some tricks that can be used to reduce the deltas?

    Kind Regards,
    Sorin.

  24. Good morning Cesar,

    I’ve tried without the Bayesian approach as well, but the deviation between the solution and the training output remains almost the same (24.xx%).
    Between the two solutions (with or without Bayes) there are important differences, so the only rational conclusion is that the global minimum wasn’t reached.
    I’m wondering if there is a procedure or a trick to make the LMA find the global minimum.
    As I’ve read so far, some authors claim that they made the LMA reach the global minimum by forcing alpha and beta to fixed values (10, for example).

    Rgrds,
    Sorin.

  25. Hi Sorin,

    While I can’t guarantee the code is completely bug free, everyone is invited to actually look at the sources to see if something is missing.

    Anyway, if you are not using Bayesian regularization, changing alpha or beta wouldn’t make much difference, as they are already set by default to 0 and 1, respectively. This means only the network errors will be taken into account in the objective function.

    However, something which may help is to set lambda (LearningRate) to a very low value (or 0). This will disable the damping factor, but may introduce problems if your Hessian is singular. If your Hessian is singular, you can try using the SingularValueDecomposition in place of the LuDecomposition in the RunEpoch computation.

    Best regards,
    César

  26. Dear Cesar,

    Since your code compiles and runs, I think it’s great. :-p

    Thank you for your recommendation, and I’m sorry for my late answer.
    In the meantime I was busy testing some scenarios.
    So far, I’ve managed to reduce the error from 24.xx% to 15.xx%. This is great!
    And guess how I did it!
    I’ll tell you: I used many hidden layers (5), following the formula:
    inputLayer(inputCount)->2*inputCount->inputCount->inputCount/2->inputCount/4->inputCount/8->..->1.
    I think that the topology of an ANN is very important. I will dig more, and if I find some verified rules, I will let you all know.
    Until then,
    Until then,

    Best regards,
    Sorin.

  27. Dear Cesar,

    Sorry for the double post.
    I’ve tried to set the damping factor to 0, but disaster strikes.
    In the while loop inside the RunEpoch procedure, the second condition is never met. To make things worse, the objective function value starts to rise, so I have a nice infinite loop here. More than this, starting with a nonzero learning rate will not solve the issue above, because the LR converges to 0: after, let’s say, 250 epochs it will be exactly 0, and multiplying it by 10 or another number after those epochs will still return 0.
    About the Hessian’s singularity, I see this issue is handled by reiterating until it becomes nonsingular.
    If I suppose that this is the global minimum, maybe I’m wrong, because I think that in this particular case the algorithm fails to change its search direction.

    What do you think?

    Thank You,

    BR,
    Sorin

  28. Hi Sorin,

    The Hessian may become non-singular after we add a positive factor to its diagonal. However, this comes at a cost: it is a form of regularization, and the resulting solution may be far from the best solution possible. Another way to solve this issue is to use the SingularValueDecomposition instead, which makes no assumptions about the positive-definiteness or singularity of the matrix being decomposed.

    About the other issues, I am afraid I do not have a final answer. It has been some time since I last used the code. But I remember having experimented with higher numbers of hidden layers as well. What I found (empirically), however, was that there was always a configuration with a single hidden layer which ended up producing better results than with many layers. There is also a theoretical result which says that a single hidden layer is sufficient to learn any continuous function (given a sufficient number of hidden nodes). Well, of course, it may depend on your particular problem. In the problem I was researching, training networks with more hidden layers was significantly faster than with single layers. But they also tended to overfit much more easily.

    Best regards,
    César

  29. Dear Cesar,

    Thank you for your reply, and I must apologize for my late answer. Testing the various scenarios is very time-consuming.

    Do you suggest replacing the LUD with the SVD only when the Hessian is singular, or for the entire calculation? The second option seems to make more sense to me, so I’m sorry if I understood wrong.

    I solved the problem of lambda converging to 0 by reinitializing it to its initial value. I don’t know if this is good or bad, but when I do it, the convergence doesn’t seem to be affected (the error descent rate doesn’t change). The odd thing happens when the damping factor starts to rise (the objective function doesn’t go lower): instead of searching for a better solution, the function seems to be trapped and converges more slowly to the same minimum until the delta becomes low and the algorithm stops. In the meantime, the lambda value oscillates between exp(5) and exp(1), for example.
    Indeed, using multiple hidden layers leads to faster convergence, and to overfitting as well. I know there is a theorem that states: “every continuous function can be approximated using a single hidden layer”.
    I’m using a time series prediction algorithm with a training set 3000 records long. It is changed a little bit, because each output is calculated with inputs from another time series (forecast vs. realized). I’m trying to forecast the next 50 values (having an input forecast for those values) in order to improve the forecast according to historical data. I’ve obtained decent values for the training set (no overfit, good deviation), but the last values (around 10 to 20) of the 50 calculated values are very overfitted. I cannot understand why only a few values (the last ones) of the solution are overfitted when the other values are OK.

    Thank you for you time,

    Sorin.

  30. Hi Sorin,

    You could substitute the LU with the SVD as you see fit. If you choose to compute the SVD only when the Hessian is singular, the only problem is that you will first have to compute the LU to know this fact, which may result in an unnecessary decomposition.

    About your overfitted values, I am not sure what could be causing them. But in any case, please try randomizing your training set before feeding it to the learning algorithms. This may be of some help.

    Best regards,
    Cesar

  31. Hi Cesar,

    About the overfitting… forget about it. I changed the test configuration to 3000 records for training and 800 for testing. But an odd thing still happens: when the learning error decreases, the forecast error rises, and vice versa. Strange.

    Since my last post I’ve tried different changes to your original algorithm, targeting especially the damping factor. I tried various ways to update lambda, but in almost all cases the same solution was reached. It seems that when switching to the steepest descent method, the error function does not go in another direction but in the same direction as Gauss-Newton, leading to a “forced” convergence. During those tests I only managed to improve the forecast by less than 10% when I expected 20-25%.

    BR
    Sorin

  32. Hi Cesar,

    Thank you for your explanation of LM. I saw in your references an article written by Wilamowski and others, “Efficient Algorithm for Training Neural Networks with one Hidden Layer”. Did you ever apply this article, and does it converge? I applied it, but it does not converge.

    good days

  33. Hello all,

    I would also like to announce that a completed version of this code, supporting multiple hidden layers and multiple outputs, is now available in the Accord.NET Framework. The code has also been optimized for memory usage, so the Hessian can be inverted in place using a Cholesky decomposition. Moreover, the code now supports a speed-memory tradeoff parameter which allows trading computing speed for lower memory requirements. I hope it can be more useful now.

    Best regards,
    César

  34. Hi Cesar,
    Thanks a lot for this tutorial. I am trying to fit a curve and I am able to do so. But now I want to save this network, and for a particular X value I want to get the corresponding Y value. Could you please help me?
    Thanks.

  35. Dear Cesar,
    Thanks for the enlightening tutorial! I would like to ask if you can help offer some advice on the following:
    1) I noticed that the value of lambda continues to increase during some training runs, making it impossible to achieve convergence. Besides, I feel there must be either convergence or no convergence depending on the initial parameters, as you mentioned?
    I also wish to know if other methods can be used to select the required regularization apart from cross-validation? Kindly let me know!
    Thanks.
