A complete explanation for the totally lost, part 2 of 2. Click here to return to part 1.
Downloads
- Download the sample applications.
- Download the Accord.NET Framework.
- Browse the source code online.
The code has been incorporated in Accord.NET Framework, which includes the latest version of this code plus many other statistics and machine learning tools.
Bayesian Regularization
To overcome the problem of interpolating noisy data, MacKay (1992) proposed a Bayesian framework that can be applied directly to the neural network learning problem. It also allows estimating the effective number of parameters actually used by the model – in this case, the number of network weights actually needed to solve a particular problem.
Bayesian regularization expands the cost function to search not only for the minimal error, but for the minimal error using the minimal weights. It works by introducing two Bayesian hyperparameters, alpha and beta, which tell the learning process which direction (minimal error or minimal weights) it must seek.
The cost function will then become:
C(k) = β*Ed + α*Ew, where:
- Ed is the sum of squared errors, and
- Ew is the sum of squared weights
By using Bayesian regularization, one can avoid costly cross-validation. It is particularly useful for problems that cannot afford to reserve a portion of the available data for a validation set, or that would suffer from doing so. Regularization also reduces (or eliminates) the need for testing different numbers of hidden neurons for a problem. A third variable, gamma, indicates the number of effective weights being used by the network, thus giving an indication of how complex the network should be.
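To make this concrete, here is a minimal sketch of the regularized cost in C# (the names are mine and do not come from any particular library):

// Regularized cost C = beta*Ed + alpha*Ew: beta weighs the data misfit,
// alpha penalizes large weights.
static double Cost(double[] errors, double[] weights, double alpha, double beta)
{
    double Ed = 0.0, Ew = 0.0;
    foreach (double e in errors) Ed += e * e;  // sum of squared errors
    foreach (double w in weights) Ew += w * w; // sum of squared weights
    return beta * Ed + alpha * Ew;
}

// The effective number of parameters would then follow from the trace
// of the inverse Hessian: gamma = W - alpha * trace(inverse(H)).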
Many practical realizations of Bayesian regularization update the hyperparameters alpha and beta after each training cycle. However, according to Poland (2001), the most popular update algorithm fails to produce robust iterates if there is not much training data. The algorithm below shows how to update the hyperparameters according to both rules. Both methods base their calculations on the inverse Hessian matrix.
Expanded Levenberg-Marquardt Algorithm
Adding Bayesian regularization to Levenberg-Marquardt adds little overhead to the process because we already have a Hessian approximation. The algorithm for a learning epoch then becomes:
1. Compute the Jacobian (by finite differences or using the chain rule)
2. Compute the error gradient
   - g = JᵀE
3. Approximate the Hessian using the cross product of the Jacobian
   - H = JᵀJ
4. Calculate the cost function
   - C = β*Ed + α*Ew, where:
   - Ed is the sum of squared errors, and
   - Ew is the sum of squared weights
5. Solve (H + λI)δ = g to find δ
6. Update the network weights w using δ
7. Recalculate the cost function using the updated weights
8. If the cost has not decreased, discard the new weights, increase λ using v and go to step 5
9. Else decrease λ using v
10. Update the Bayesian hyperparameters using MacKay’s or Poland’s formulae:
    - gamma = W – (alpha * tr(H⁻¹))
    - beta = (N – gamma) / (2.0 * Ed)
    - alpha = W / (2.0 * Ew + tr(H⁻¹)) [modified Poland’s update], or
    - alpha = gamma / (2.0 * Ew) [original MacKay’s update], where:
    - W is the number of network parameters (number of weights and biases),
    - N is the number of entries in the training set, and
    - tr(H⁻¹) is the trace of the inverse Hessian matrix
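For concreteness, step 10 could be written as in the sketch below. The variable names are mine: traceH stands for tr(H⁻¹), which is assumed to have been computed already, and W, N, Ed and Ew are as defined above.

// Step 10 sketch: update the Bayesian hyperparameters.
gamma = W - alpha * traceH;

// modified Poland's update (more robust when training data is scarce):
alpha = W / (2.0 * Ew + traceH);
// or the original MacKay's update:
// alpha = gamma / (2.0 * Ew);

// (the posted code wraps this in Math.Abs to keep beta positive)
beta = (N - gamma) / (2.0 * Ed);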
Source Code
Here is the code for running a learning epoch of the Levenberg-Marquardt algorithm.
/// <summary>
///   Runs a learning epoch.
/// </summary>
///
/// <param name="input">Array of input vectors.</param>
/// <param name="output">Array of output vectors.</param>
///
/// <returns>Returns the summary learning error for the epoch.</returns>
///
/// <remarks><para>The method runs one learning epoch, by running the
///   necessary iterations of Levenberg-Marquardt to achieve an error
///   decrease.</para></remarks>
///
public double RunEpoch(double[][] input, double[][] output)
{
    // Initial definitions and memory allocations
    int N = input.Length;
    this.errors = new double[N];
    this.jacobian = new double[N, numberOfParameters];
    LuDecomposition decomposition = null;
    double sumOfSquaredErrors = 0.0;
    double sumOfSquaredWeights = 0.0;
    double trace = 0.0;

    // Compute the Jacobian matrix
    if (method == JacobianMethod.ByBackpropagation)
    {
        sumOfSquaredErrors = JacobianByChainRule(input, output);
    }
    else
    {
        sumOfSquaredErrors = JacobianByFiniteDifference(input, output);
    }

    // Create the initial weights vector
    sumOfSquaredWeights = CreateWeights();

    // Compute the Jacobian errors and the Hessian
    for (int i = 0; i < numberOfParameters; i++)
    {
        // Compute the gradient from the Jacobian and the errors
        double s = 0.0;
        for (int j = 0; j < N; j++)
        {
            s += jacobian[j, i] * errors[j];
        }
        gradient[i] = s;

        // Compute the quasi-Hessian matrix using the Jacobian (H = J'J)
        for (int j = 0; j < numberOfParameters; j++)
        {
            double c = 0.0;
            for (int k = 0; k < N; k++)
            {
                c += jacobian[k, i] * jacobian[k, j];
            }
            hessian[i, j] = beta * c;
        }
    }

    // Store the Hessian diagonal for future computations
    for (int i = 0; i < numberOfParameters; i++)
        diagonal[i] = hessian[i, i];

    // Define the objective function
    // (Bayesian regularization objective)
    double objective = beta * sumOfSquaredErrors + alpha * sumOfSquaredWeights;
    double current = objective + 1.0;

    // Begin of the main Levenberg-Marquardt method
    lambda /= v;

    // We'll try to find a direction with less error
    // (or where the objective function is smaller)
    while (current >= objective && lambda < lambdaMax)
    {
        lambda *= v;

        // Update the diagonal (Levenberg-Marquardt formula)
        for (int i = 0; i < numberOfParameters; i++)
            hessian[i, i] = diagonal[i] + (lambda + alpha);

        // Decompose to solve the linear system
        decomposition = new LuDecomposition(hessian);

        // Check if the Jacobian has become non-invertible
        if (!decomposition.NonSingular) continue;

        // Solve using the LU (or SVD) decomposition
        deltas = decomposition.Solve(gradient);

        // Update the weights using the calculated deltas
        sumOfSquaredWeights = UpdateWeights();

        // Calculate the new error
        sumOfSquaredErrors = 0.0;
        for (int i = 0; i < N; i++)
        {
            network.Compute(input[i]);
            sumOfSquaredErrors += CalculateError(output[i]);
        }

        // Update the objective function. If it is bigger than before,
        // the method is tried again using a greater damping factor.
        current = beta * sumOfSquaredErrors + alpha * sumOfSquaredWeights;
    }

    // If this iteration caused an error drop, the next iteration
    // will use a smaller damping factor.
    lambda /= v;

    // If we are using Bayesian regularization, we need to
    // update the Bayesian hyperparameters alpha and beta
    if (useBayesianRegularization)
    {
        // References:
        // - http://www-alg.ist.hokudai.ac.jp/~jan/alpha.pdf
        // - http://www.inference.phy.cam.ac.uk/mackay/Bayes_FAQ.html

        // Compute the trace of the inverse Hessian
        trace = Matrix.Trace(decomposition.Inverse());

        // Poland's update formula:
        gamma = numberOfParameters - (alpha * trace);
        alpha = numberOfParameters / (2.0 * sumOfSquaredWeights + trace);
        beta = System.Math.Abs((N - gamma) / (2.0 * sumOfSquaredErrors));

        // Original MacKay's update formula:
        // gamma = (double)numberOfParameters - (alpha * trace);
        // alpha = gamma / (2.0 * sumOfSquaredWeights);
        // beta = (N - gamma) / (2.0 * sumOfSquaredErrors);
    }

    return sumOfSquaredErrors;
}
Using the Code
Code usage is very simple. It follows the common AForge.NET learning algorithm usage, instantiating a LevenbergMarquardtLearning class instead of the more traditional BackpropagationLearning. Optionally, we may approximate the Jacobian by finite differences instead of computing it directly (although this is not very practical except for debugging purposes). We may also enable Bayesian regularization by passing a boolean parameter to the constructor.
// initialize input and output values
double[][] input = new double[4][]
{
    new double[] { 0, 0 }, new double[] { 0, 1 },
    new double[] { 1, 0 }, new double[] { 1, 1 }
};

double[][] output = new double[4][]
{
    new double[] { 0 }, new double[] { 1 },
    new double[] { 1 }, new double[] { 0 }
};

// create the neural network
ActivationNetwork network = new ActivationNetwork(
    new SigmoidFunction(2),
    2,  // two inputs in the network
    2,  // two neurons in the first layer
    1); // one neuron in the second layer

// create the teacher
LevenbergMarquardtLearning teacher = new LevenbergMarquardtLearning(network);

// loop
while (!needToStop)
{
    // run an epoch of the learning procedure
    double error = teacher.RunEpoch(input, output);

    // check the error value to see if we need to stop
    // ...
}
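If we want the optional behaviors mentioned above, the teacher can be configured at construction. The sketch below assumes a constructor overload taking the regularization flag and the Jacobian method; please check the framework sources for the exact signature:

// enable Bayesian regularization and compute the Jacobian by backpropagation
LevenbergMarquardtLearning teacher = new LevenbergMarquardtLearning(
    network, true, JacobianMethod.ByBackpropagation);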
Sample applications
The original sample applications from AForge.NET have been modified to use Levenberg-Marquardt learning. However, it is quite tricky to get them to converge adequately.
Additional Notes
Some versions of the Levenberg-Marquardt algorithm solve the equation (JᵀJ + λ diag(JᵀJ))δ = JᵀE instead of (JᵀJ + λI)δ = JᵀE, effectively replacing the identity matrix with the diagonal of the approximated Hessian in the weight update rule. According to Wikipedia, this was suggested by Marquardt to incorporate some local curvature estimation. However, for some reason, more recent implementations that carry the Levenberg-Marquardt name do not include Marquardt’s suggestion. Other implementations suggest choosing an initial λ related to the size of the elements of the approximated Hessian, by making λ₀ = t max(diag(JᵀJ)), where t is a value chosen by the user.
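As an illustration only (this is not how the posted code behaves), Marquardt’s suggestion would change the diagonal update inside RunEpoch to something like:

// Marquardt's variant: damp each parameter in proportion to its own
// curvature estimate instead of adding a constant to the diagonal.
for (int i = 0; i < numberOfParameters; i++)
    hessian[i, i] = diagonal[i] * (1.0 + lambda) + alpha;

// One possible initialization, with t chosen by the user:
// lambda = t * max(diagonal);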
It is also important to note that Levenberg-Marquardt is extremely dependent on the initial guess for the network parameters. Depending on the initial weights of the network, the algorithm may converge to a local minimum or not converge at all.
Besides that, it is an extremely fast method for neural network learning when compared to the standard backpropagation algorithm.
Finally, if you have any comments about the article or about the code, please let me know! All suggestions are welcome.
References
- Levenberg-Marquardt
- Sam Roweis; Levenberg-Marquardt Optimization, available on: http://www.cs.nyu.edu/~roweis/notes/lm.pdf, 1999.
- Sarle, W.; Neural Network FAQ, part 2 of 7: Learning, available on: http://www.cs.otago.ac.nz/nnweb/FAQ2.html, 2002.
- Wilamowski, B., Chen. Y; Efficient Algorithm for Training Neural Networks with one Hidden Layer, available on: http://cs.olemiss.edu/~ychen/publications/conference/chen_ijcnn99.pdf, 1999.
- Pearlmutter, B. A.; Fast Exact Multiplication by the Hessian, available on: http://eprints.kfupm.edu.sa/40648/1/40648.pdf, 1999.
- Guidry, T. F.; Levenberg Marquardt Using Numerical Derivatives in C#, available on: http://www.trentfguidry.net/post/2009/08/12/Levenberg-Marquardt-using-numerical-derivatives-in-C.aspx, 2009.
- Bayesian Regularization
- MacKay, D.; Bayesian methods for neural networks – FAQ, available on: http://www.inference.phy.cam.ac.uk/mackay/Bayes_FAQ.html, 2004.
- Foresee, F. D.; Hagan, M. T.; Gauss-Newton Approximation to Bayesian Learning, available on: http://hagan.ecen.ceat.okstate.edu/icnn97a.pdf, 1997.
- Poland, J.; On the Robustness of Update Strategies for the Bayesian Hyperparameter alpha, available on: http://www-alg.ist.hokudai.ac.jp/~jan/alpha.pdf, 2001.
- The MathWorks, Inc; Neural Network Toolbox – Levenberg-Marquardt (trainlm), available on: http://matlab.izmiran.ru/help/toolbox/nnet/backpr12.html, 2005.
- MacKay, D.; A Practical Bayesian Framework for Backpropagation Networks, available on: http://authors.library.caltech.edu/13793/1/MACnc92b.pdf, 1992.
Thank you for your effort; this is the only place I have found so far with such a simple, direct and comprehensive explanation of LMA.
Once again thx.
exactly the same feeling
thanks
The:
beta = (N – gamma) / 2.0 * Ed
alpha = W / 2.0 * Ew + tr(H⁻¹) [modified Poland’s update]
should be read as:
beta = (N – gamma) / (2.0 * Ed)
alpha = W / (2.0 * Ew + tr(H⁻¹)) [modified Poland’s update]
in case anyone’s interested.
Thanks for noticing it!
Hello,
Do the Bayesian hyperparameters (alpha & beta) affect the Hessian H in the equation at step 5?
From the step 1 … step 10 algorithm it seems they don’t.
On the other hand, in your C# code I found that ‘alpha’ does affect H:
405 // Update diagonal (Levenberg-Marquardt formula)
406 for (int i = 0; i < numberOfParameters; i++)
407 hessian[i, i] = diagonal[i] + (lambda + alpha);
Moreover, according to http://hagan.okstate.edu/icnn97a.pdf, ‘beta’
should be taken into consideration too.
What would be the best choice ? Thanks.
Hi Anonymous,
Thanks for noticing. I should update the algorithm to better reflect the use of regularization.
About the comment on the use of beta, if you notice the block under the comment “Compute Quasi-Hessian Matrix using Jacobian (H = J’J)” you can see that the Jacobian approximation to the Hessian is already being multiplied by beta. But I did notice that it isn’t being multiplied by 2 as the paper suggests.
There has been some time since I last used this algorithm, but I will re-check it to see if something is missing.
Thanks for the feedback,
César
Thank you very much ! Excellent article !
Thank you. I searched many websites for LMA, but only here did I find such simple code.
Cesar, in your C# code you have noted that only networks with one output neuron can be trained by the LM algorithm – why?
I am trying to implement the LM algorithm for a network with more than one output neuron, but I have some problems – after 1-3 iterations the weights grow and the network does not converge. Additionally, lambda becomes a very big number. Regularization does not help – the alpha parameter is always very small and reduces the influence of the sum of squared weights on the cost function. Maybe there is some “tricky” thing in the LMA implementation?
Hi Alex,
The L-M can be used to train networks with multiple outputs. The code I presented is a specialization of the general L-M to the single-output case.
To handle more than one output neuron, the Jacobian matrix becomes slightly different. If you have N samples, M weights and 1 output neuron, the Jacobian will have N rows and M columns (as depicted in the code). However, if you have 2 output neurons, you will need 2N rows and M columns. Each consecutive pair of rows will relate to the same input sample, but each of those rows will relate to one of the outputs for that sample.
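Just to illustrate the layout (a sketch with made-up names, not code from the framework):

// Jacobian layout for O output neurons, N samples and M weights.
// Row (i * O + o) holds the derivatives of the error of output o on
// sample i with respect to each of the M network weights.
double[,] jacobian = new double[N * O, M];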
When using multiple outputs, I believe the Bayesian regularization part also needs to be replicated. Unfortunately I haven’t found many examples in the literature about it. Perhaps you may find some information in the original papers by MacKay (Neural Computation, Vol. 4, No. 3, 1992, pp. 415-447) and Foresee and Hagan (Proceedings of the International Joint Conference on Neural Networks, June 1997).
I hope it helps. Please let me know if you have success implementing it!
Best regards,
César
Hello Cesar,
I have tested my LMA implementation on the XOR problem – it does not converge :-((. Adding Bayesian regularization has no effect… I have compared my formulas for computing the Jacobian with yours – they are identical! But in my case the sum of squared errors for the test weights (derived from solving the LMA equation) is always greater than the initial sum of squared errors, so I need to increase lambda again and again, which makes the weights very big. Could you please produce some debug information with your LMA for XOR? It would be very helpful if I could compare my Jacobian values with yours for specific weights…
I suppose that I have missed some very little but very important thing in my LMA implementation…
Hi Alex,
There is a sample application using L-M for the XOR problem included in the Accord.NET Framework. If you wish, perhaps you can open the sample’s solution in Visual Studio and follow it with a debugger. The sample’s binaries and source code are available inside the framework (http://accord-net.origo.ethz.ch). You may also see the code in Google Code’s source code browser.
If you need further help, please ask!
Best regards,
César
Cesar,
I have found some errors in my Jacobian computation formulas, but still no convergence, even for XOR. Maybe it is because I’m not using any biases (thresholds) in my network? Are biases important for LMA?
Hi Alex,
Well, bias terms are usually very important in neural networks in general. Their usefulness isn’t restricted to specific methods such as LMA.
I really think you should try adding those. Or try fitting a simpler function such as an AND or an OR, which do not require multilayer networks to be learnt.
Best regards,
César
Hello, Cesar!
Your C# code is very helpful for me. I have implemented bias terms for my network and continue to fix my Jacobian computation routine. In your Jacobian computation by chain rule, you “sum” like this:
for (int k = 0; k < network[layer + 1].InputsCount; k++)
{
sum += network[layer + 1][j][k] * network[layer].Output[j];
}
Is it correct to use “network[layer].Output[j];” as the multiplier? As I understand your code, you compute “sum” as a sum of products of the output neuron weights and the hidden layer outputs. Maybe “network[layer].Output[k];” would be correct?
Hi Alex,
Thanks. You have found the last bug in the code 🙂
The correct one is indeed Output[k].
By the way, there is a slightly clearer version of the code available at Google Code. This will be fixed in the next version of the framework.
Best regards,
César
Cesar,
Thanks a lot for all your help! Now my LMA works well! Additionally, my LMA works with neural networks with many output neurons.
There are two questions:
1. Does your version of LMA with Bayesian regularization converge more slowly than “plain” LMA?
2. Have you published your results in some journal? Please let me know, so I can make a reference in my thesis.
Hi Alex!
I am glad this discussion helped both of us 🙂
1) Bayesian regularization does not exactly “slow” the convergence in all cases, but indeed, it may seem like it does. The regularization tends to attenuate the fitting. It’s as if the network could presume that some points are outliers or just noise in the data and choose not to overfit them, thus converging to a presumably more general solution. I am under the impression that the convergence rate can be either faster or slower, depending on the problem.
2) No, I still have not. But I will do it soon. If you wish to make a reference, please make it to the Accord.NET Framework. At the bottom there is a small section “Licensing” with a small example for citation. Thanks, I will be very grateful if you could do that.
Additionally, may I ask if you would like to contribute the code for multiple outputs so it could be eventually implemented in the framework sometime in the future?
All the best,
César
Good article, just starting to look at it to see if it fits my needs.
I have a question. You calculate the derivatives both by chain rule and by finite differences? Why the two? In the article you state that finite differences are not effective – why is that? It also sounds like your chain rule is not complete, as there is the comment (currently only 1 hidden layer supported). Which do you suggest using, or is this LMA really limited to one hidden layer and one output?
Hello, Cesar!
In my case Bayesian regularization makes convergence slower, but the error dynamics are smoother. I think it is because Bayesian regularization uses the sum of squared weights while computing the weight change.
My code for multiple outputs is based on the paper “A Constructive Algorithm for Feedforward Neural Networks” (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.108.2791&rep=rep1&type=pdf). I can email you my code if you need it. Unfortunately, it is in Delphi, not C# or C++…
Hi Anonymous,
Both are provided for debugging purposes. The chain rule is preferred, as it is a direct computation of the derivatives. Numerical differentiation can be costly in comparison to a closed-form solution in this case, since the backpropagation algorithm is often very efficient.
Yes, only one hidden layer is supported. It has been proven that a neural network with a single hidden layer is sufficient to approximate any continuous function (given enough hidden neurons). If I remember correctly (sorry, it has been some time since I first wrote it), it is also a constraint assumed by the Bayesian regularization procedure.
As I mentioned in a comment a few days ago, there is also a small bug in the chain rule which I am correcting soon. Alex (from the comments above) has also implemented the code for multiple outputs, which I am looking forward to incorporating into the implementation.
@Alex, if you could, can you please share the code by email? I would be very grateful. Please tell me if I can base my modifications on it. I will surely include a new copyright notice in the implementation mentioning your contribution.
Best regards,
César
Dear Cesar,
Thank you very much for sharing this very helpful information with us.
I would like to ask you a question regarding the Levenberg-Marquardt training algorithm available in Matlab Neural Network Toolbox.
I couldn’t figure out why each training ends differently (i.e. different mean square error, different number of iterations, etc.) when I do multiple trainings with the same (time-delayed) neural network. I start each training with zero weights and biases. The data used for the training is the same (no random division into training and validation sets). And the network structure is the same. I delete all the past data (including the neural network) before starting the next training. Is there some randomness inside the MATLAB implementation of the LM algorithm? Why are the training results different at the end of each training?
Many thanks.
Hi Anonymous,
Unfortunately I don’t know much about the Levenberg-Marquardt implementation in MATLAB, as I don’t have it installed. But as far as I know, the original Levenberg-Marquardt does not involve any random calculation.
Perhaps what could be happening is that the toolbox is automatically initializing your network to have Gaussian-distributed weights as a prerequisite of the Bayesian Regularization. Have you tried asking in the community support forums for the neural network toolbox? Perhaps they could help better.
Best regards,
César
I think there might be something very wrong with the way the gradients are calculated. Correct me if I am wrong, but isn’t the point of LMA to use the lambda to interpolate between GNA and gradient descent? Namely, when GNA is failing to converge, the lambda kicks in and gradient descent takes over… sort of like backprop. So in theory LMA should never be outperformed by backprop… which I do not find to be the case using this with AForge.
To further test this I fixed the lambda at max. You would think the net could still learn, using pure gradient descent at this point. However, the net does not learn. I also started checking the gradient arrays against ones I manually calculated for a few numbers, and they do not match.
Any idea? Shouldn’t I be able to fix the lambda high and still see convergence? This makes me doubt that the gradient descent part of this is actually working and that it’s all GNA.
Hi Anonymous,
Some months ago I replaced the download links with an updated version. The most recent version is available through the Accord.NET Framework and can also be seen in Google Code. In particular, the last fix was indeed in the calculation of the gradient. The development sources for the next release (not yet on Google Code) have had the gradient computation rewritten to support multiple hidden layers.
Setting the sample application’s learning rate to 10 or 1000000 achieves convergence almost instantly. If we also go and fix the damping factor adjustment v to 1, it is still possible to see convergence, albeit really slowly. If we set the learning rate to a very low value and fix v, the code is not able to make progress, because it relies on the damping factor adjustment to make the matrix non-singular during the LU decomposition, which may prevent progress otherwise.
By the way, if you would like, please take a look at the comments section of the first part of the post, particularly at the ones preceding this one. If you believe you have found an error, please let me know. As I stated in the linked comment, I cannot guarantee the code is perfect. But we can always review it again and make any corrections as necessary.
Update: I have just committed the latest source to Google Code, in case you would like to take a look.
Best regards,
Cesar
for (int j = 0; j < ji.Length; j++)
s += ji[j] * errors[j];
I looked at the new code; I am not sure it is right. See the lines above… you are assuming that ji and errors are going to be of the same length. ji is a row from the Jacobian, so it will have a length equal to the number of weights. errors has a length equal to the number of training set elements. So if you run this on an XOR, where you have fewer training set elements than weights, it throws an out-of-bounds error.
Hi Anonymous,
Have you actually tested it? I cannot test it right now, but as far as I can read my own code, it creates and handles a transposed representation of the Jacobian matrix to improve locality in those loops.
Best regards,
César
The code link above (for the latest), with the fixed gradients, gives a 404? (different anonymous)
Agreed, the googlecode link is bad
There was an issue with the repository, which I am gradually restoring. The link is working again. The rest of the sources will follow soon.
Regards,
César
Hi Cesar,
Thank you very much for sharing your experience.
A couple of months ago, I started to research a problem using LMA. All works fine: the network converges (above 2500 epochs the delta between the outputs of two consecutive training iterations is 0), it is stable (it always reaches the same solution for the same parameters), and the results seem to be good. But (there always is a but) the results aren’t good enough. The solution deviation is around 25% from the training output.
As I saw in previous posts, there were some bugs. The odd fact is that my ANN returns the same result with your last release as with the older framework version. Even with many layers the result is almost the same.
Should I suppose that I really hit the global minimum and cannot go further?
Or are there some tricks that can be used to reduce the deltas?
Kind Regards,
Sorin.
Hi Sorin,
Have you tried to disable Bayesian Regularization?
Best regards,
César
Good morning Cesar,
I’ve tried without Bayesian regularization as well, but the deviation between the solution and the training output remains almost the same (24.xx%).
There are important differences between the two solutions (with or without Bayes), so the only rational conclusion is that the global minimum wasn’t reached.
I’m wondering if there is a procedure or a trick to make LMA find the global minimum.
From what I’ve read so far, some authors claim that they made LMA reach the global minimum by forcing alpha and beta to fixed values (10, for example).
Rgrds,
Sorin.
Hi Sorin,
While I can’t guarantee the code is completely bug free, everyone is invited to actually look at the sources to see if something is missing.
Anyway, if you are not using Bayesian regularization, changing alpha or beta wouldn’t make much difference, as they are already set by default to 0 and 1, respectively. This means only the network errors will be taken into account in the objective function.
However, something which may help is to set lambda (LearningRate) to a very low value (or 0). This will disable the damping factor, but may introduce problems if your Hessian is singular. If your Hessian is singular, you can try using the SingularValueDecomposition in place of the LuDecomposition in the RunEpoch computation.
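A minimal sketch of that substitution, assuming the SingularValueDecomposition class from the framework’s math library (it exposes a similar Solve method):

// Replace the LU solve inside RunEpoch with an SVD-based solve, which
// tolerates singular Hessians at a higher computational cost.
var svd = new SingularValueDecomposition(hessian);
deltas = svd.Solve(gradient);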
Best regards,
César
Dear Cesar,
Since your code compiles and runs, I think it’s great. :-p
Thank you for your recommendation, and I’m sorry for my late answer.
In the meantime I was busy testing some scenarios.
Until now, I’ve managed to reduce the error from 24.xx% to 15.xx%. This is great!
And guess how I did it!
I’ll tell you: I used many hidden layers (5), following the formula:
inputLayer(inputCount)->2*inputCount->inputCount->inputCount/2->inputCount/4->inputCount/8->..->1.
I think that the topology of an ANN is very important. I will dig more, and if I find some verified rules, I will let you all know.
Until then,
Best regards,
Sorin.
Dear Cesar,
Sorry for double post.
I’ve tried to set the damping factor to 0, but disaster strikes.
In the while loop inside the RunEpoch procedure, the second condition will never be satisfied. To make things worse, the objective function value starts to rise, so I have a nice infinite loop here. More than this, starting with a nonzero learning rate will not solve the issue, because the LR will converge to 0; after, let’s say, 250 epochs it will be exactly 0, and multiplying it by 10 or another number after those epochs will return 0 as well.
About the Hessian’s singularity, I see this issue is handled by reiterating until it becomes nonsingular.
If I suppose that this is the global minimum, maybe I’m wrong, because I think that in this particular case the algorithm fails to change its search direction.
What do you think?
Thank You,
BR,
Sorin
Hi Sorin,
The Hessian may become non-singular after we add a positive factor to its diagonal. However, this comes at a cost. This is a form of regularization, and the resulting solution may be far from the best solution possible. Another way to solve this issue is to use the SingularValueDecomposition instead, which makes no assumption about positive-definiteness or singularity of the matrix being decomposed.
About the other issues, I am afraid I do not have a final answer. It has been some time since I last used the code. But I remember to have had experimented with higher numbers of hidden layers as well. What I found (empirically), however, was that there was always a configuration with a single hidden layer which ended up producing better results than with many layers. There is also a theoretical result which says that a single hidden layer is sufficient to learn any continuous function (given a sufficient number of hidden nodes). Well, of course, it may depend on your particular problem. In the problem I was researching, training networks with more hidden layers was significantly faster than with single layers. But they also tended to overfit much more easily.
Best regards,
César
Dear Cesar,
Thank you for your reply, and I should apologize for my late answer. Testing various scenarios is very time consuming.
Do you suggest replacing the LU decomposition with SVD only when the Hessian is singular, or for the entire calculation? The second option seems to make sense to me, so I’m sorry if I understood wrong.
I solved the problem of lambda converging to 0 by reinitializing it to its initial value. I don’t know if this is good or bad, but when I do this the convergence doesn’t seem to be influenced (the error descent rate doesn’t change). The odd thing happens when the damping factor starts to rise (the objective function doesn’t go lower): instead of searching for a better solution, the function seems to be trapped and converges more slowly to the same minimum until the delta becomes low and the algorithm stops. In the meantime, the lambda value oscillates between exp(5) and exp(1), for example.
Indeed, using multiple hidden layers will lead to faster convergence, and to overfitting as well. I know there is a theorem that states: “every continuous function can be approximated using a single hidden layer”.
I’m using a time series prediction algorithm with a training set 3000 records long. It is a little bit changed, because each output is calculated with inputs from another time series (forecast vs. realized). I’m trying to forecast the next 50 values (having an input forecast for those values) in order to improve the forecast according to historical data. I’ve obtained decent values for the training set (no overfit, good deviation), but the last values (around 10 to 20) of the 50 calculated values are very overfitted. I cannot understand why only a few values (the last ones) of the solution are overfitted when the other values are OK.
Thank you for you time,
Sorin.
Hi Sorin,
You could substitute the LU with SVD as you see fit. If you choose to compute the SVD only when the Hessian is singular, the only problem is that you will first have to compute the LU to know this fact, which may result in an unnecessary decomposition.
About your overfitted values, I am not sure what could be causing it. But in any case, please try randomizing your training set before feeding the learning algorithms. This may be of some help.
Best regards,
Cesar
Hi Cesar,
About the overfitting… forget about it. I changed the test configuration to 3000 records for training and 800 for testing. But an odd thing still happens: when the learning error decreases, the forecast error rises, and vice versa. Strange.
Since my last post, I’ve tried different changes in your original algorithm, targeting especially the damping factor. I tried various ways to update lambda, but in almost all cases the same solution was reached. It seems that when switching to the steepest descent method, the error function will not go in another direction but in the same direction as Gauss-Newton, leading to a “forced” convergence. During those tests I’ve only managed to improve the forecast by less than 10% when I expected 20-25%.
BR
Sorin
Hi Cesar,
Thank you for your explanation about LM. I saw in your references an article written by Wilamowski and others, “Efficient Algorithm for Training Neural Networks with one Hidden Layer”. Did you ever apply this article, and does it converge? I applied it, but it does not converge.
good days
Hi Baris,
I did see the article; unfortunately I didn’t read it through, so I didn’t try to apply it to the code (yet).
Best regards,
César
Hello all,
I would also like to announce that a completed version of this code, supporting multiple hidden layers and multiple outputs, is now available in the Accord.NET Framework. The code has also been optimized for memory usage, so the Hessian can be inverted in place using a Cholesky decomposition. Moreover, the code now supports a speed-memory tradeoff parameter which allows trading computing speed for lower memory requirements. I hope it can be more useful now.
Best regards,
César
Hi Cesar,
Thanks a lot for such a tutorial. I am trying to fit a curve and I am able to do this. But now I want to save this network, and for a particular X value I want to get the corresponding Y value. Could you please help me?
Thanks.
Hi Razib,
You can save a network by using the ActivationNetwork Save method. The neural network objects belong to the AForge.NET Framework libraries, so feel free to read the documentation on the Save method at the AForge site. Indeed, Andrew has answered a very similar question on AForge.NET’s forums, which I believe you will find useful.
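A minimal sketch of the idea (Save and Load belong to AForge.NET’s Network class; the file name and the input value are just examples):

// persist the trained network to disk
network.Save("network.bin");

// later: restore it and compute the Y value for a given X
ActivationNetwork loaded = (ActivationNetwork)Network.Load("network.bin");
double[] y = loaded.Compute(new double[] { x }); // x is your input value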
Best regards,
Cesar
Dear Cesar,
Thanks for the enlightening tutorial! I would like to ask if you can help offer some advice to the following:
1) I noticed that the value of lambda continues to increase during some trainings, making it impossible to achieve convergence. Besides, I feel there must be either convergence or no convergence depending on the initial parameters, as you mentioned?
I also wish to know if other methods can be used to select the required regularization apart from cross-validation? Kindly let me know!
Thanking.
Dear Cesar,
In the step of updating the Bayesian hyperparameters,
I’m wondering if one can use Tr[(H + λI)⁻¹] instead of Tr(H⁻¹).
Thank you.
Hi Cesar,
Thank you for a detailed tutorial. It bridges the gap between the knowledge provided in the original paper and the thorough understanding required for coding. However, I was wondering about the gradient evaluation ‘g’ in the provided algorithm. Although in the L-M algorithm it is Transpose(Jacobian) * residual, in the Bayesian regularization formulation shouldn’t it be [(Transpose(Jacobian) * residual) + alpha*Weights], because of the modification in the objective function F = beta*SSE + alpha*SSW, where SSE and SSW are the sums of squared errors and squared weights? I may be wrong. A brief derivation is provided below.
Starting from Taylor series: F(W+del_W) = F(W) + del_W * F’ + (1/2)(del_W)^2 * F”(W)
we have del_W = inv(F″)*F′ (this is the well-known Newton approximation); here del_W is the step size.
In the current context, the derivative of F w.r.t. W => F′ = beta * SSE′ + alpha * W
and the second derivative of F w.r.t. W => F″ = beta * SSE″ + alpha * I;
=> del_W = inv( beta * SSE″ + alpha * I ) * ( beta * SSE′ + alpha * W ); here SSE′ is Transpose(J) * residual and SSE″ is Transpose(J) * J;
In other words, del_W = inv(H) * [ beta * Transpose(J) * residual + alpha * W ]
The Hessian H is straightforward, as provided in your code. I am not sure why Transpose(J) * e is used for the ‘g’ evaluation instead of [ beta * Transpose(J) * e + alpha * W ], where e is the residual.
Please correct me if there are any assumptions I am missing in the derivation.
Thank you