Neural Network Learning by the Levenberg-Marquardt Algorithm with Bayesian Regularization (part 1)


A complete explanation for the totally lost, part 1 of 2.


The code has been incorporated in Accord.NET Framework, which includes the latest version of this code plus many other statistics and machine learning tools.


  1. Overview
    1. Neural Networks
    2. AForge Framework
    3. Network training as a function optimization problem
  2. Levenberg-Marquardt
    1. Computing the Jacobian
    2. Approximating the Hessian
    3. Solving the Levenberg-Marquardt equation
    4. General Algorithm
    5. Limitations
  3. Bayesian Regularization
    1. Expanded Levenberg-Marquardt Algorithm
  4. Source Code
    1. Using the Code
    2. Sample Applications
  5. Additional Notes
  6. References


The problem of neural network learning can be seen as a function optimization problem, where we are trying to determine the best network parameters (weights and biases) in order to minimize network error. This said, several function optimization techniques from numerical linear algebra can be directly applied to network learning, one of these techniques being the Levenberg-Marquardt algorithm.
The Levenberg–Marquardt algorithm provides a numerical solution to the problem of minimizing a (generally nonlinear) function, over a space of parameters for the function. It is a popular alternative to the Gauss-Newton method of finding the minimum of a function.

Neural Networks

Neural networks are a relatively new artificial intelligence technique. In most cases an artificial neural network (ANN) is an adaptive system that changes its structure based on external or internal information that flows through the network during the learning phase. The learning procedure tries to find a set of connections w that gives a mapping that fits the training set well.
Furthermore, neural networks can be viewed as highly nonlinear functions with the basic form:

$$F(x, w) = y \tag{1}$$

Where x is the input vector presented to the network, w are the weights of the network, and y is the corresponding output vector approximated or predicted by the network. The weight vector w is commonly ordered first by layer, then by neurons, and finally by the weights of each neuron plus its bias.
This view of the network as a parameterized function will be the basis for applying standard function optimization methods to solve the problem of neural network training.
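To make the "network as a function" view concrete, here is a minimal sketch of F(x, w) for a network with a single hidden layer. The tanh hidden activation, the linear output neuron, and the names `network` and `n_hidden` are assumptions made for illustration; only the ordering of w (by layer, then by neuron, then each neuron's weights followed by its bias) comes from the text above.

```python
import math

def network(x, w, n_hidden):
    """Evaluate a single-hidden-layer network F(x, w) with one linear output.

    w is a flat list laid out layer by layer, neuron by neuron:
    each neuron's input weights come first, followed by its bias.
    """
    n_in = len(x)
    pos = 0
    hidden = []
    for _ in range(n_hidden):
        weights = w[pos:pos + n_in]
        bias = w[pos + n_in]
        pos += n_in + 1
        s = sum(wi * xi for wi, xi in zip(weights, x)) + bias
        hidden.append(math.tanh(s))  # assumed hidden activation
    out_weights = w[pos:pos + n_hidden]
    out_bias = w[pos + n_hidden]
    return sum(wi * hi for wi, hi in zip(out_weights, hidden)) + out_bias

# 2 inputs, 2 hidden neurons, 1 output: (2+1)*2 + (2+1)*1 = 9 parameters
w = [0.1, -0.2, 0.0, 0.3, 0.4, 0.1, 1.0, -1.0, 0.5]
y = network([1.0, 2.0], w, n_hidden=2)
```

Seen this way, training only ever touches the flat list w, which is what lets a general-purpose optimizer work on it.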

AForge Framework

AForge.NET Framework is a C# framework designed for developers and researchers in the fields of Computer Vision and Artificial Intelligence. Here, the Levenberg-Marquardt learning algorithm is implemented as a class implementing the ISupervisedLearning interface from the AForge framework.

Network training as a function optimization problem

As mentioned previously, neural networks can be viewed as highly non-linear functions. From this perspective, the training problem can be considered as a general function optimization problem, with the adjustable parameters being the weights and biases of the network, and the Levenberg-Marquardt algorithm can be applied in a straightforward way.

Levenberg-Marquardt Algorithm

The Levenberg-Marquardt algorithm is a very simple, but robust, method for approximating a function. Basically, it consists of solving the equation:

$$(J^T J + \lambda I)\,\delta = J^T E \tag{2}$$

Where J is the Jacobian matrix for the system, λ is Levenberg’s damping factor, δ is the weight update vector that we want to find and E is the error vector containing the output errors for each input vector used in training the network. The δ tells us by how much we should change our network weights to achieve a (possibly) better solution. The J^T J matrix is also known as the approximated Hessian.
The λ damping factor is adjusted at each iteration, and guides the optimization process. If reduction of E is rapid, a smaller value can be used, bringing the algorithm closer to the Gauss–Newton algorithm, whereas if an iteration gives insufficient reduction in the residual, λ can be increased, giving a step closer to the gradient descent direction.
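The two limiting cases of the damping factor can be written out explicitly. This is only a sketch, and it assumes the error vector is defined as E = t − F(x, w) (target minus network output), so that the update is applied as w ← w + δ; the text above does not fix this sign convention.

```latex
\begin{aligned}
\lambda \to 0:            &\quad (J^T J)\,\delta \approx J^T E
                          && \text{(Gauss-Newton step)} \\
\lambda \gg \|J^T J\|:    &\quad \lambda\,\delta \approx J^T E
                          \;\Rightarrow\; \delta \approx \tfrac{1}{\lambda}\, J^T E
                          && \text{(short gradient descent step)}
\end{aligned}
```

Under the same assumption, the sum of squared errors S(w) = ½ EᵀE has gradient ∇S = −JᵀE, so for large λ the step δ points along −∇S with length proportional to 1/λ.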

Computing the Jacobian

The Jacobian is a matrix of all first-order partial derivatives of a vector-valued function. In the neural network case, it is an N-by-W matrix, where N is the number of entries in our training set and W is the total number of parameters (weights + biases) of our network. It can be created by taking the partial derivatives of each output with respect to each weight, and has the form:

$$J = \begin{bmatrix} \frac{\partial F(x_1, w)}{\partial w_1} & \cdots & \frac{\partial F(x_1, w)}{\partial w_W} \\ \vdots & \ddots & \vdots \\ \frac{\partial F(x_N, w)}{\partial w_1} & \cdots & \frac{\partial F(x_N, w)}{\partial w_W} \end{bmatrix}$$

Where F(x_i, w) is the network function evaluated for the i-th input vector of the training set using the weight vector w and w_j is the j-th element of the weight vector w of the network.
In traditional Levenberg-Marquardt implementations, the Jacobian is approximated by using finite differences. However, for neural networks, it can be computed very efficiently by using the chain rule of calculus and the first derivatives of the activation functions.
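The two ways of building the Jacobian can be compared on a toy case. The sketch below assumes a single sigmoid neuron with two inputs and no bias; the function names and the step size h are hypothetical choices, not part of the accompanying source code.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def F(x, w):
    """Single neuron without bias: F(x, w) = g(x . w)."""
    return sigmoid(sum(xi * wi for xi, wi in zip(x, w)))

def jacobian_fd(inputs, w, h=1e-6):
    """N-by-W Jacobian by forward finite differences."""
    J = []
    for x in inputs:
        base = F(x, w)
        row = []
        for j in range(len(w)):
            w_step = list(w)
            w_step[j] += h
            row.append((F(x, w_step) - base) / h)
        J.append(row)
    return J

def jacobian_chain(inputs, w):
    """Same Jacobian via the chain rule: dF/dw_j = y * (1 - y) * x_j."""
    J = []
    for x in inputs:
        y = F(x, w)
        J.append([y * (1.0 - y) * xj for xj in x])
    return J

inputs = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
w = [0.3, -0.1]
J_fd = jacobian_fd(inputs, w)
J_cr = jacobian_chain(inputs, w)
# Largest disagreement between the two matrices; small, but not zero,
# because the finite-difference version is only an approximation.
max_diff = max(abs(a - b) for ra, rb in zip(J_fd, J_cr) for a, b in zip(ra, rb))
```

The finite-difference version needs one extra network evaluation per weight and per sample, while the chain-rule version reuses the output y that was already computed, which is where the efficiency gain comes from.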

Approximating the Hessian

For the least-squares problem, the Hessian generally doesn’t need to be calculated. As stated earlier, it can be approximated by using the Jacobian matrix with the formula:

$$H \approx J^T J \tag{3}$$

Which is a very good approximation of the Hessian if the residual errors at the solution are “small”. If the residuals are not sufficiently small at the solution, this approach may result in slow convergence. The Hessian can also be used to apply regularization to the learning process, which will be discussed later.
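As a small illustration, the approximate Hessian and the error gradient g = JtE used later in the algorithm can be computed directly from a Jacobian. The matrix J and vector E below are made-up numbers, and the helper names are hypothetical.

```python
def approx_hessian(J):
    """H = J^T J, a symmetric W-by-W matrix."""
    n_rows, n_cols = len(J), len(J[0])
    return [[sum(J[i][a] * J[i][b] for i in range(n_rows))
             for b in range(n_cols)]
            for a in range(n_cols)]

def gradient(J, E):
    """g = J^T E, a vector of length W."""
    return [sum(J[i][a] * E[i] for i in range(len(J)))
            for a in range(len(J[0]))]

# Three training samples (rows), two weights (columns).
J = [[1.0, 2.0],
     [0.0, 1.0],
     [3.0, -1.0]]
E = [0.5, -1.0, 2.0]

H = approx_hessian(J)  # [[10.0, -1.0], [-1.0, 6.0]]
g = gradient(J, E)     # [6.5, -2.0]
```

Note that H only depends on W, not on N, so its size stays fixed no matter how many training samples are used.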

Solving the Levenberg-Marquardt equation

Levenberg’s main contribution to the method was the introduction of the damping factor λ. This value is added to every member of the approximate Hessian diagonal before the system is solved for the weight update δ. Typically, λ would start as a small value such as 0.1.
Then, the Levenberg-Marquardt equation is solved, commonly by using an LU decomposition. However, the system can only be solved this way if the approximated Hessian has not become singular (not having an inverse). If it has become singular, the equation can still be solved by using a singular value decomposition (SVD).
After the equation is solved, the weights w are updated using δ and network errors for each entry in the training set are recalculated. If the new sum of squared errors has decreased, λ is decreased and the iteration ends. If it has not, then the new weights are discarded and the method is repeated with a higher value for λ.
This adjustment for λ is done by using an adjustment factor v, usually defined as 10. If λ needs to increase, it is multiplied by v. If it needs to decrease, then it is divided by v. The process is repeated until the error decreases. When this happens, the current iteration ends.
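Here is a rough sketch of the solving step on its own. It uses plain Gaussian elimination with partial pivoting as a stand-in for the LU decomposition mentioned above; the function names `solve` and `lm_step` and the example numbers are assumptions for illustration.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting.

    Raises ValueError if A is numerically singular.
    """
    n = len(b)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def lm_step(H, g, lam):
    """Solve (H + lam * I) delta = g for the weight update delta."""
    n = len(g)
    damped = [[H[i][j] + (lam if i == j else 0.0) for j in range(n)]
              for i in range(n)]
    return solve(damped, g)

H = [[10.0, -1.0], [-1.0, 6.0]]
g = [6.5, -2.0]
small = lm_step(H, g, 0.1)     # close to the undamped (Gauss-Newton) step
large = lm_step(H, g, 1000.0)  # close to g / lam, a short gradient step
```

Increasing λ shrinks the step and also moves the damped matrix away from singularity, which is part of why retrying with a larger λ is a reasonable reaction to a failed step.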

General Levenberg-Marquardt Algorithm

As stated earlier, the Levenberg-Marquardt algorithm basically consists of solving (2) with different λ values until the sum of squared errors decreases. So, each learning iteration (epoch) will consist of the following basic steps:

  1. Compute the Jacobian (by using finite differences or the chain rule)
  2. Compute the error gradient
    • g = J^T E
  3. Approximate the Hessian using the cross product Jacobian (eq. 3)
    • H = J^T J
  4. Solve (H + λI)δ = g to find δ
  5. Update the network weights w using δ
  6. Recalculate the sum of squared errors using the updated weights
  7. If the sum of squared errors has not decreased,
    1. Discard the new weights, increase λ using v and go to step 4.
  8. Else decrease λ using v and stop.

Variations of the algorithm may include different values for v, one for decreasing λ and another for increasing it. Others may solve (H + λ diag(H))δ = g instead of (H + λI)δ = g (2), while others may select the initial λ according to the size of the elements of H, by setting λ0 = t max(diag(H)), where t is a value chosen by the user. I’ve chosen the identity matrix equation because, apparently, it is the same method implemented internally by the Neural Network Toolbox in MATLAB.
We can see we will have a problem if the error does not decrease after some iterations. For this reason, the algorithm also stops if λ becomes too large.
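The steps listed above can be sketched end to end on a tiny problem. This is not the code that accompanies the article: the one-neuron model, the sign convention E = t − F(x, w) (so that weights are updated as w + δ), the cap `lam_max`, and all function names are assumptions chosen to keep the example short.

```python
import math

def model(x, w):
    """Hypothetical one-neuron network: F(x, w) = tanh(w[0] * x + w[1])."""
    return math.tanh(w[0] * x + w[1])

def errors(xs, ts, w):
    """Error vector E, assuming E_i = t_i - F(x_i, w)."""
    return [t - model(x, w) for x, t in zip(xs, ts)]

def jacobian(xs, w):
    """Chain rule: dF/dw0 = (1 - y^2) * x and dF/dw1 = (1 - y^2)."""
    rows = []
    for x in xs:
        y = model(x, w)
        rows.append([(1.0 - y * y) * x, 1.0 - y * y])
    return rows

def solve_2x2(A, b):
    """Cramer's rule, enough for this two-parameter example."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - b[0] * A[1][0]) / det]

def train(xs, ts, w, lam=0.1, v=10.0, max_epochs=50, lam_max=1e10):
    n = len(xs)
    sse = sum(e * e for e in errors(xs, ts, w))
    history = [sse]
    for _ in range(max_epochs):
        J = jacobian(xs, w)                                    # step 1
        E = errors(xs, ts, w)
        g = [sum(J[i][a] * E[i] for i in range(n))             # step 2
             for a in range(2)]
        H = [[sum(J[i][a] * J[i][b] for i in range(n))         # step 3
              for b in range(2)] for a in range(2)]
        while True:
            A = [[H[0][0] + lam, H[0][1]],
                 [H[1][0], H[1][1] + lam]]
            delta = solve_2x2(A, g)                            # step 4
            w_new = [w[0] + delta[0], w[1] + delta[1]]         # step 5
            sse_new = sum(e * e for e in errors(xs, ts, w_new))  # step 6
            if sse_new < sse:                                  # step 8
                w, sse = w_new, sse_new
                lam /= v
                history.append(sse)
                break
            lam *= v                                           # step 7
            if lam > lam_max:   # give up when the damping grows too large
                return w, history
    return w, history

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ts = [math.tanh(1.5 * x - 0.5) for x in xs]  # data generated with w = [1.5, -0.5]
w_final, history = train(xs, ts, [0.5, 0.0])
```

Because a step is only accepted when it lowers the sum of squared errors, `history` never increases; how quickly it falls depends on the starting weights and on the initial λ.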


Limitations

The Levenberg-Marquardt algorithm is very sensitive to the initial network weights. Also, it does not consider outliers in the data, which may lead to overfitting noise. To avoid those situations, we can use a technique known as regularization.

In the next part of this article (part 2), we’ll discuss more about Bayesian regularization. We will also present and demonstrate the usage of the article’s accompanying source code. The discussion continues in Neural Network Learning by the Levenberg-Marquardt Algorithm with Bayesian Regularization (part 2).


  1. Hi
    First of all: great tutorial.
    I have one question though: I don’t understand how one would calculate the Jacobian via finite differences. If you could give me a specific example of one element in the Jacobian, with what F’s and what w’s are used for the differences, that would be really helpful.
    Thanks ahead.

  2. Hi,

    Thanks for the feedback. Well, if you see the definition of the Jacobian matrix, you will see that it is just a matrix of derivatives. If you express your network as a single function, you would have, for example in a single neuron network with 2 inputs and no bias term, a function like:

    F(x1,x2,w1,w2) = g(x1*w1 + x2*w2)

    Where x1, x2 are the inputs of the network, w1, w2 are the neuron’s weights and g(.) is the activation function.

    If you have a single pattern vector x = [1 2] in your training set, the first element of the matrix would be:

    del F(1,2,w1,w2) / del w1

    Now if you want to compute it by using finite differences, you would choose a suitable small constant h and compute del F / del w1 using

    (F(1,2,w1+h,w2) – F(1,2,w1,w2)) / h

    This is one of the simplest forms for computing a derivative using finite differences. For more information you could check the previously linked Wikipedia page or this other one.

    Hope it helps,

  3. First of all, thank you for the tutorial. What I do not understand is how I can update the weights between the input and hidden layers.
    I mean, this procedure is available to update the weights between the hidden and output layers, as it is based on the vector E.

    Thanks a lot

  4. Hi Anonymous,

    Well, I am afraid I didn’t understand your question. The procedure updates all weights in neural networks with a single hidden layer. The weights in the input layer are computed by applying the chain rule in the derivation of the network function.

    If you wish you can take a look on the sources or in the sample application to see how the algorithm can be realized in code.

    Best regards,

  5. Sorry César,

    First of all, thanks again for replying this fast. If I have an ANN with one hidden layer, I have two different types of weights (the first connects the input layer with the hidden layer and the second connects the hidden layer with the output layer). When the gradient descent method is used, the error is backpropagated from the output layer. So I do not understand whether I have to develop two different Jacobian matrices to update the different weights (each of these matrices would have a different size) or whether only one Jacobian matrix is necessary to update all weights of the ANN, that is, the weights between input-hidden and hidden-output.

    Thank you very much César


  6. Hi Anonymous,

    Only one Jacobian is necessary. From a function optimization point of view, it does not matter whether the weight comes from the input-hidden or hidden-output weights. It is still a free parameter to be optimized.

    Best regards,

  7. hi all

    Thank you very much for this tutorial; it helped me understand how LM training works.

    Best regards

  8. Nice article..

    I have a question. Even after going through the code, I did not really understand how the derivatives were calculated for the jacobian matrix. Is there any particular formula that computes the derivatives?


  9. When I looked at some other research papers, the Jacobian is calculated by the derivative of the error with respect to the weights. This is what we use in normal backpropagation. But when I used the same thing, it’s not working. Any ideas are greatly appreciated.

    Thanks in advance,

  10. Hi Anu,

    This is not the same Jacobian computed in backpropagation learning. Those papers may be referring to the Jacobian formed by the derivatives of the error with respect to the weights. The Jacobian used by the Levenberg-Marquardt algorithm is the derivative of the network function with respect to the weights. Perhaps you could take a look at my previous answer to one of the comments above to see if it can help.


  11. Cesar,
    Could you please explain how to compute the Jacobian’s parts via the chain rule? In classic backpropagation I can compute partial derivatives of the error function for each weight using only the neuron’s outputs and the neuron’s weights. In a previous answer you described that del F(1,2,w1,w2) / del w1 can be computed using finite differences, but what about the chain rule? Does computing partial derivatives for LM differ from backpropagation, and what is the difference?

  12. Hi Alex,

    As you mentioned, in classical backpropagation you can compute partial derivatives of the error function for each weight. In this Jacobian formulation, what has to be computed is the partial derivatives of the network function for each weight.

    In this code I derived the algorithm solely for networks with a single hidden layer because of the Bayesian regularization. But you can try for more general networks by first writing your network topology as a single function F(x|w) = y and then differentiating w.r.t. w1, w2, and so on. You can adapt the backpropagation code to reuse previously computed values in the calculation of the chain rule to avoid unnecessary computations.

    Best regards,

  13. Hi,
    I have pretty much the same question as Anupama above: I’ve been through several papers (including C. Bishop’s “Neural Networks for Pattern Recognition”, p. 292, and “Training feed-forward networks with the Marquardt algorithm” (Hagan, M.T.; Menhaj, M.B., Nov 1994)) and both consider the Jacobian of the total error, that is J = (dEi/dwj) … is it “normal”?

    Best regards,


  14. Hi all,

    I am in the process of revising the entire post. The neural network literature really seems to use the derivatives of the error function for the Jacobian. However, when writing this post and its accompanying source code, I was following the notes by Sam Roweis in his paper Levenberg-Marquardt Optimization. From the paper (page 3):

    One way to approximate E as locally quadratic (in w) near a minimum is to approximate f(x;w) as a linear function of w, which we will now derive. Remember that all gradients are with respect to w and all averages are over input output pairs.

    The paper continues and in the bottom of the page it writes the derivatives d and the Hessian H using f(x;w). It then continues and discusses the Levenberg-Marquardt optimization in terms of the same d and H. It seems this is an approximation to the E(w) found in the neural network literature.

    I confess I am now unsure which is most correct, if I have interpreted it wrong or even if the two aren’t equivalent. The source code uses this implementation and seems to work fine.

    By the way, the download links now redirect to the Accord.NET Framework download page. I encourage interested users to actually download the framework, as it is much better maintained than the older standalone packages which were available here.

    Best regards,

  15. Hello all,

    I am writing this follow up to confirm that the original post is still correct. Please see Bishop, C. Pattern Recognition and Machine Learning, 2006; pages 248-251.

    On page 248, Bishop shows how the Jacobian can be calculated using the partial derivatives of y_k with respect to x. Unless y_k is defined as something else, I am assuming y_k = F_k(x_i;w), which agrees with the definition given in this post.

    Moreover, H ~= J’J is an approximation of the exact Hessian, and not the exact Hessian itself. Even if the Hessian is defined as the second derivatives of the error function as given in eq. 5.83 (page 251), the second term (which is related to the error) is dropped from the equation to give the Levenberg-Marquardt outer product approximation. Thus I still believe the article is correct.

    However, I am not perfect, and if anyone else believes they have found an error, please let me know, so we can review it again and make any corrections as necessary.


  16. Hi,

    First, thanks for your answer… indeed, the [Bishop2006] tends to suggest it is the Jacobian of the output itself – however, when using this definition, i’m unsure about how to consider multidimensional outputs (let’s say, in R^P) with N multidimensional samples from R^M, for.
    First, because the Jacobian would be M-by-P matrix (d(F_k(x^(n)_j;w))/dx^(n)_j), for one particular sample x^(n) – wouldn’t it? If we consider the weights as the input variable, it would still be the R-by-P matrix (d(F_k(x^(n);w_i))/dx_j) where R is the number of bias. I can’t see how to obtain a Jacobian for multidimensional input, output and multiple samples, unless each coordinate (of the input, for example) is considered as a different input variable and the N M-dimensional samples are seen as a MN-dimensional variable.

    (i’m not sure i was really clear, but it bugs me a lot :/)

  17. Hi Ceacy,

    For the multiple output case, all you have to do is insert extra rows in your Jacobian matrix. Suppose you have a network with N samples of dimensionality M, N corresponding outputs of dimensionality P and W network parameters.

    If P = 1, your Jacobian matrix is the one depicted in the post, with N rows and W columns. Each row will correspond to a sample in your training set, and will be formed by the derivatives of the single output given by the corresponding sample w.r.t. weights.

    Now, if P = 2, your Jacobian matrix will have NP rows and W columns. Each two rows will be formed by the derivatives corresponding to one of the samples, and each of those two rows will correspond to the derivatives of each output w.r.t to the weights.

    And the same goes on for any P.

    Unfortunately I can’t remember exactly where I got this definition from, so I cannot give you a reference. But if you take a look at the comments section of the part 2 article, you can follow the discussion with reader Alex in which he seems to have implemented the method for multiple outputs using this formulation. I believe this description was given in one of the works in the reference section. Perhaps you could try taking a look at them!

    Best regards,

  18. Thanks for your answer, again. I finally figured out where i was wrong, as i had misunderstood the notations in the various articles (and books, for that matter) i had read.

    Indeed, they didn’t suggest taking the Jacobian of the *squared* total error (that is, (dE_i/dw_j)_{i,j}, where E=1/2*sum_i(E_i)=1/2*sum_i(e_i^2)), but the Jacobian of the n-dimensional error vector e=(e_i)_i itself, that is, without the square, which makes sense … as e_i=y_i-t_i, de_i/dw_j=dy_i/dw_j and we end up with the same matrix as if we had taken the Jacobian of the function itself.

    Best regards,


  19. Dear Cesar,
    Thanks for sharing your experience. I have several questions about your source code. In the Levenberg-Marquardt algorithm file you implement the Jacobian calculation using the chain rule. I am afraid of misunderstanding the lines devoted to derivatives. You have used the Derivative2 method of the activation function class. The name hints at a second-order derivative. As I see it, it returns the derivative computed from the function value, i.e. f'(x)=F(f(x)). Am I right? Besides, let me know if your default activation function is th(alpha*x).
    Best regards,

  20. Hi Ivan,

    I, myself, had misunderstood the meaning of the Derivative2 method in the past. But I soon discovered that it is not the second derivative. It is, indeed, the first derivative as you suggested.

    My source code uses the AForge.NET’s Neural Network implementation, and thus, uses the definition available in the framework’s documentation: Derivative2(x).

    As we can see, it is just an optimized version of the first derivative. It makes use of the fact that, often, the derivative of an activation function can be written directly in terms of its output. The derivative of functions such as the sigmoid can be written in terms of the output value as y(1-y). As the output has already been computed, there will be no need to compute it twice.

    The activation function could be any function implementing the IActivationFunction interface.

    Best regards,

  21. Dear Cesar,
    I have tested your LMA implementation on two variables function
    approximation problem. Divided on the samples amount the 73d
    iteration error was 0.00005. Not bad, but too slow…

    There are some questions.

    1) Could you explain the difference between
    “network[layer][neuron].Output” and “network[layer].Output[neuron]”.

    2) Let me quote some code:

    // Hidden layer case (only 1 hidden layer is currently supported)
    if (network.LayersCount == 2)
    layer = 0;
    previousLayerOutput = input;

    // for each neuron in the input layer
    for (int neuron = 0; neuron < network[layer].NeuronsCount; neuron++)
    output = network[layer][neuron].Output;

    // for each weight of the input neuron
    for (int i = 0; i < network[layer][neuron].InputsCount; i++)
    sum = 0.0;
    // for each neuron in the next layer
    for (int j = 0; j < network[layer + 1].NeuronsCount; j++)
    // for each weight of the next neuron
    for (int k = 0; k < network[layer + 1].InputsCount; k++)
    sum += network[layer + 1][j][k] *
    sum += network[layer + 1][j].Threshold;

    // consider there’s only one output neuron
    double w = network[layer + 1][outputIndex][neuron];

    weightDerivatives[layer][neuron][i] =
    function.Derivative2(output) *
    function.Derivative(sum) * w *


    In my opinion, weightDerivative should be

    function.Derivative2(output) *
    function.Derivative2(network[layer+1][outputIndex].Output) * w *

    Could you take a look at those lines and review them?

    I am looking forward to hearing from you.

    Best regards,

  22. IvanZ:

    As far as I can see, for the first question, the two are equivalent. When computing the output with Layer::Compute(), the output of the layer is set with the output of the neurons, when calling Neuron::Compute(). And this last method (at least for ActivationNeuron) sets its member output with the result before returning it.

    (for the second, i’m not familiar with the code, but, for what it’s worth, i’d tend to agree with you – the double sum on j and k seems a little odd: there should be only a sum on the output layer, and since it only involves one neuron, the sum itself vanishes)

  23. Hi all,

    Thanks for reporting, I will take a more careful look on it and post an answer. It has been some time since this code has been written.

    Best regards,

  24. Hello all,

    Yes, the last lines are also equivalent. I also took some time and implemented the method for any number of layers. I will see if I can also finish implementing it for any number of outputs, and then I will publish an updated version.

    Best regards,

  25. Hi…
    First of all is a great tutorial

    I have a problem to solve levenberg marquardt algorithm. My Final Project is about LevMar but I don’t have much references.

    I have 2 input, 3 hidden, and 1 output layers. My question is :
    1. How to get jacobian matrix and solve it?
    2.(H + λI)δ = g. how to get δ?
    3. dimension of jacobian matrix?
    4. dimension of hessian matrix?
    5. dimension of δ?

    my final project is Design PID using Neural Network Levenberg Marquardt

    thank a lot for helps.

    Ahmad Arif A

  26. Hi Arif,

    The Jacobian dimensions are NxW, in which N is the number of inputs in your data set and W the number of weights in your network. The Hessian is WxW in which W is again the number of weights in your network. Delta is the weight update to be used in the step, it has the same length as the number W of weights in your network. You can find delta by solving the linear system using any suitable method, such as a matrix decomposition.

    Best regards,

    1. Hi César,
      What about the dimension of E in equation (2)? Is it the number of outputs of the network, or is it the number of inputs in my dataset?

      Thanks ahead
      taosong fan

  27. Hello all,

    I just would like to announce that a completed version of this code, supporting multiple hidden layers and multiple outputs is now available in the Accord.NET Framework. The code has also been optimized for memory usage, so the Hessian can be inverted in place using a Cholesky decomposition. Moreover, the code now supports a speed-memory tradeoff parameter which allows to trade increasing computing speed for lower memory requirements. I hope it can be more useful now.

    Best regards,

  28. About the Jacobian: S. Haykin, in his book “Neural Networks”, p. 224-226, gives the same matrix as you did, but the thing that I didn’t understand is the Jacobian given by Chris Bishop in his book “Pattern Recognition and Machine Learning, 2006”. He said: “Here we consider the evaluation of the Jacobian matrix, whose elements are given by the derivatives of the network outputs with respect to the inputs”
    unless there is more than one jacobian.
    thanks ahead.

  34. hi

    i could not understand the jacobian computation.
    i need jacobian computation in c code.for 3inputneuron, 2 hidden neuron, 1 outputneuron. please send me.

  35. Hi Cesar,

    I think there is a problem with the Hessian calculation from the Jacobian: the upper diagonal is always zero, so the risk of the Hessian becoming singular is quite high.

    I will try to fix this (possible) bug.


  36. Hi Marcel,

    Thanks for the interest in the code. If I remember correctly, the code stores only the upper triangular part of the Hessian, since it is assumed symmetric. The lower triangular part is used to store its Cholesky decomposition, conserving memory.

    Best regards,

  37. Hi Hamdi,

    I think I must have missed your comment and questions. Yes, there can be two Jacobians in the context of neural networks: one for the error function and one for the network function itself. But actually, the Jacobian in Bishop’s book is the same as mine. The Jacobian used in Levenberg-Marquardt is the Jacobian of the network function itself.

    Best regards,

  38. Hi Cesar!

    First of all, nice work!

    Looking at your code, i’m wondering if LM ANN may be used for a regression problem with multiple output neurons. For example feeding forward with a training set of vectors (defined as value and phase) and backpropagating with another set of vectors similar (as structure) with the inputs.
    For example the training set is something like: input[i][0]=v*cos(alpha), input[i][1]=v*sin(alpha) and output[i][0]=w*cos(alpha), output[i][1]=w*sin(alpha).
    Thank you.

  39. Hi Anon!

    Yeah, most likely it could work (that is, if the function is really learnable). Just be sure to use the latest version of the framework, which has support for many layers and many outputs. The version I used to share here in the blog was much simpler and didn’t offer those options, but the one in the framework does!

    Best regards,

  40. Is LM algorithm similar to neural network training??? or training of neural network should be done after applying LM algorithm again. I’m using this for image compression in matlab..

  41. Hi César!
    Excellent work! I have read your article very carefully, and I have some questions.
    It seems that W should be all the weights of the neural network. If I have 4 inputs, 6 neurons in the hidden layer, and 3 neurons in the output layer, that is to say I have a W with (1 to 42). Do you agree with me?

  42. Hi there!

    Sorry for the little delay in answering your question! Indeed, W should reflect the number of parameters in the network. It is the total number of weights (plus biases) in the network. In my code, I have implemented a small function to count all those values given an activation neural network.

    If you have 4 inputs, 6 neurons in first layer, 3 neuron in second layer, then you will have W = (4+1)*6 + (6+1)*3 = 51 weights in your network (the +1’s are because of the bias terms in each neuron).

    Hope it helps!

    Best regards,

  43. Hi Cesar,
    Thanks for the awesome tutorial. Please kindly let me know if there is any method which can be used to select the initial weights for training. I would like to increase the chance of getting a candidate model when using the Levenberg-Marquardt algorithm. I make the above assertion given the fact that the convergence clearly depends on the initial weights.
    I would be happy to also compare any open source code with the one I wrote. I am trying to use regularization rather than divide the give data set into training and test set. I look forward to your reply.
    Kind regards,

  44. I read your article. I am doing my final year project on cash forecasting for banks with neural networks, using the Levenberg-Marquardt algorithm with Bayesian regularization. But I am unable to relate it to how banks work. Can you help me with that? If you can, then please try to explain it with a real example from a bank. I will be thankful if you can help.
