# Como enviar cartas pela internet

Pode parecer estranho neste mundo onde reina a internet e o correio digital. Mas ainda assim há horas em que precisamos recorrer ao bom e velho correio convencional para enviar correspondência. Além do que, uma carta em papel é muito mais formal e impactante do que um email. No entanto, se você está com preguiça de ir até o correio apenas para postar uma carta, ou se precisa enviar correspondência para outro país e não está certo de como preencher o envelope nos padrões do país destinatário, não se preocupe! Existem serviços que permitem enviar cartas convencionais através da internet de maneira eficiente e confiável!

Para enviar cartas pela internet, existem vários sites que podemos utilizar.

### Para o território nacional

Serviço próprio dos correios. Uma carta custa, em média, R\$3,60. Pode ser usada para cartas internacionais também, mas é um tanto caro e as opções de pagamento incluem penas cartão de crédito.

### Para outros países

O melhor serviço que encontrei (e já utilizei) é o Mail A Letter. Uma carta internacional custa U\$1,99 (R\$3,53 em out/2009) e sua maior vantagem é que o site aceita pagamentos via paypal. Para cartas para os Estados Unidos, o serviço custa apenas U\$0,99.

http://www.mailaletter.com/

# Neural Network Learning by the Levenberg-Marquardt Algorithm with Bayesian Regularization (part 1)

A complete explanation for the totally lost, part 1 of 2.

The code has been incorporated in Accord.NET Framework, which includes the latest version of this code plus many other statistics and machine learning tools.

## Overview

The problem of neural network learning can be seen as a function optimization problem, where we are trying to determine the best network parameters (weights and biases) in order to minimize network error. This said, several function optimization techniques from numerical linear algebra can be directly applied to network learning, one of these techniques being the Levenberg-Marquardt algorithm.
The Levenberg–Marquardt algorithm provides a numerical solution to the problem of minimizing a (generally nonlinear) function, over a space of parameters for the function. It is a popular alternative to the Gauss-Newton method of finding the minimum of a function.

### Neural Networks

Neural networks are a relatively new artificial intelligence technique. In most cases an ANN is an adaptive system that changes its structure based on external or internal information that flows through the network during the learning phase. The learning procedure tries is to find a set of connections w that gives a mapping that fits the training set well.
Furthermore, neural networks can be viewed as highly nonlinear functions with the basic the form:

 (1)

Where x is the input vector presented to the network, w are the weights of the network, and y is the corresponding output vector approximated or predicted by the network. The weight vector w is commonly ordered first by layer, then by neurons, and finally by the weights of each neuron plus its bias.
This view of network as an parameterized function will be the basis for applying standard function optimization methods to solve the problem of neural network training.

#### AForge Framework

AForge.NET Framework is a C# framework designed for developers and researchers in the fields of Computer Vision and Artificial Intelligence. Here, the Levenberg-Marquardt learning algorithm is implemented as a class implementing the ISupervisedLearning interface from the AForge framework.

#### Network training as a function optimization problem

As mentioned previously, neural networks can be viewed as highly non-linear functions. From this perspective, the training problem can be considered as a general function optimization problem, with the adjustable parameters being the weights and biases of the network, and the Levenberg-Marquardt can be straightforward applied in this case.

### Levenberg-Marquardt Algorithm

The Levenberg-Marquardt algorithm is a very simple, but robust, method for approximating a function. Basically, it consists in solving the equation:

 (2)

Where J is the Jacobian matrix for the system, λ is the Levenberg’s damping factor, δ is the weight update vector that we want to find and E is the error vector containing the output errors for each input vector used on training the network. The δ tell us by how much we should change our network weights to achieve a (possibly) better solution. The JtJ matrix can also be known as the approximated Hessian.
The λ damping factor is adjusted at each iteration, and guides the optimization process. If reduction of E is rapid, a smaller value can be used, bringing the algorithm closer to the Gauss–Newton algorithm, whereas if an iteration gives insufficient reduction in the residual, λ can be increased, giving a step closer to the gradient descent direction.

#### Computing the Jacobian

The Jacobian is a matrix of all first-order partial derivatives of a vector-valued function. In the neural network case, it is a N-by-W matrix, where N is the number of entries in our training set and W is the total number of parameters (weights + biases) of our network. It can be created by taking the partial derivatives of each output in respect to each weight, and has the form:

Where F(xi, w) is the network function evaluated for the i-th input vector of the training set using the weight vector w and wj is the j-th element of the weight vector w of the network.
In traditional Levenberg-Marquardt implementations, the Jacobian is approximated by using finite differences. However, for neural networks, it can be computed very efficiently by using the chain rule of calculus and the first derivatives of the activation functions.

#### Approximating the Hessian

For the least-squares problem, the Hessian generally doesn’t needs to be calculated. As stated earlier, it can be approximated by using the Jacobian matrix with the formula:

 (3)

Which is is a very good approximation of the Hessian if the residual errors at the solution are “small”. If the residuals are not sufficiently small at the solution, this approach may result in slow convergence. The Hessian can also be used to apply regularization to the learning process, which will be discussed later.

#### Solving the Levenberg-Marquardt equation

Levenberg’s main contribution to the method was the introduction of the damping factor λ. This value is summed to every member of the approximate Hessian diagonal before the system is solved for the gradient. Tipically, λ would start as a small value such as 0.1.
Then, the Levenberg-Marquardt equation is solved, commonly by using a LU decomposition. However, the system can only be solved if the approximated Hessian has not become singular (not having an inverse). If this is the case, the equation can still be solved by using a SVD decomposition.
After the equation is solved, the weights w are updated using δ and network errors for each entry in the training set are recalculated. If the new sum of squared errors has decreased, λ is decreased and the iteration ends. If it has not, then the new weights are discarded and the method is repeated with a higher value for λ.
This adjustment for λ is done by using an adjustment factor v, usually defined as 10. If  λ needs to increase, it is multiplied by v. If it needs to decrease, then it is divided by v. The process is repeated until the error decreases. When this happens, the current iteration ends.

#### General Levenberg-Marquardt Algorithm

As stated earlier, the Levenberg-Marquardt consists basically in solving (2) with different λ values until the sum of squared error decreases. So, each learning iteration (epoch) will consist of the following basic steps:

1. Compute the Jacobian (by using finite differences or the chain rule)
• g = JtE
3. Approximate the Hessian using the cross product Jacobian (eq. 3)
1. H = JtJ
4. Solve (H + λI)δ = g to find δ
5. Update the network weights w using δ
6. Recalculate the sum of squared errors using the updated weights
7. If the sum of squared errors has not decreased,
1. Discard the new weights, increase λ using v and go to step 4.
8. Else decrease λ using v and stop.

Variations of the algorithm may include different values for v, one for decreasing λ and other for increasing it. Others may solve (H + λdiag(H))δ = g instead of (H + λI)δ = g (2), while others may select the initial λ according to the size of the elements on H, by setting λ0 = t max(diag(H)), where t is a value chosen by the user. I’ve chosen the identity matrix equation because, apparently, it is the same method implemented internally by the Neural Network Toolbox in MATLAB.
We can see we will have a problem if the error does not decrease after a some iterations. In this case, the algorithm also stops if λ becomes too large.

#### Limitations

The Levenberg-Marquardt is very sensitive to the initial network weighs. Also, it does not consider outliers in the data, what may lead to overfitting noise. To avoid those situations, we can use a technique known as regularization.

In the next part of this article (part 2), we’ll discuss more about Bayesian regularization. We will also present and demonstrate the usage of the article’s accompanying source code. Please click here to go to Neural Network Learning by the Levenberg-Marquardt Algorithm with Bayesian Regularization (part 2).

# Neural Network Learning by the Levenberg-Marquardt Algorithm with Bayesian Regularization (part 2)

The code has been incorporated in Accord.NET Framework, which includes the latest version of this code plus many other statistics and machine learning tools.

## Bayesian Regularization

To overcome the problem in interpolating noisy data, MacKay (1992) has proposed a Bayesian framework which can be directly applied to the neural network learning problem. It also allows to estimate the effective number of parameters actually used by the model – in this case, the number of network weights actually needed to solve a particular problem.
Bayesian regularization expands the cost function to search not only for the minimal error, but for the minimal error using the minimal weights. It works by introducing two Bayesian hyperparameters, alpha and beta, to tell which direction (minimal error or minimal weights) the learning process must seek.
The cost function will then become:

C(k) = β*Ed+α*Ew, where:

1. Ed is the sum of squared errors, and
2. Ew is the sum of squared weights

By using Bayesian regularization, one can avoid costly cross validation. It is particularly useful to problems that can not, or would suffer, if a portion of the available data were reserved to a validation set. Regularization also reduces (or eliminates) the need for testing different number of hidden neurons for a problem. A third variable, gamma, indicates the number of effective weights being used by the network, thus giving an indication on how complex the network should be.
Many practical realizations of Bayesian Regularization perform an update of the hyperparameters alpha and beta after each training cycle. However, according to Poland (2001) the most popular update algorithm fails to produce robust iterates if there is not much training data. The following algorithm shows how to update the hyperparameters according to both rules. Both methods base their calculations in the inverse Hessian matrix.

### Expanded Levenberg-Marquardt Algorithm

Adding bayesian regularization to the Levenberg-Marquardt adds little overhead to the process because we already have an Hessian approximation. So, the algorithm for a learning epoch then becomes:

1. Compute the Jacobian (by finite differences or using the chain rule)
• g = JtE
3. Approximate the Hessian using the cross product Jacobian
1. H = JtJ
4. Calculate the cost function
1. C = β*Ed+α*Ew, where:
1. Ed is the sum of squared errors, and
2. Ew is the sum of squared weights
5. Solve (H + λI)δ = g to find δ
6. Update the network weights w using δ
7. Recalculate the cost function using the updated weights
8. If the cost has not decreased,
1. Discard the new weights, increase λ using v and go to step 5.
9. Else decrease λ using v
10. Update the Bayesian hyperparameters using MacKay’s or Poland’s formulae:
1. gamma = W – (alpha * tr(H-1))
2. beta = (N – gamma) / 2.0 * Ed
3. alpha = W / (2.0 * Ew + tr(H-1)) [modified Poland’s update], or
4. alpha = gamma / (2.0 * Ew) [original MacKay’s update], where:
1. W is the number of network parameters (number of weights and biases)
2. N is the number of entries in the training set
3. tr(H-1) is the trace of the inverse Hessian matrix

## Source Code

Here is the code for running a learning epoch of the Levenberg-Marquardt algorithm.

### Using the Code

Code usage is very simple. It follows the common AForge.NET learning algorithm usage, instantiating a LevenbergMarquardtLearning class instead of the more traditional BackpropagationLearning. Optionally, we may opt for using the Jacobian approximated by finite differences instead of computing it directly (although it would not be very practical unless for debugging purposes). We may also enable bayesian regularization by passing a boolean parameter to the constructor.

### Sample applications

The original sample applications from AForge.NET have been modified to use Levenberg-Marquardt learning. However, it is quite tricky to get them to converge adequately.

Function approximation using regular Levenberg-Marquardt

Function approximation using the Levenberg-Marquardt with Bayesian regularization. Note that some of them does not converge to a best fit solution; instead, a possibly more general solution is preferred. The top-right image uses the Nguyen-Widrow method for initializing weights.

Solving the XOR problem using Levenberg-Marquardt

Some versions of the Levenberg-Marquardt algorithm solve the equation (JtJ + λ diag(JtJ) I)δ = JtE instead of  (JtJ + λI)δ = JtE, effectively replacing the identity matrix with the diagonal of the approximated Hessian for the weight update rule. According to Wikipedia, this was suggested by Marquardt to incorporate some local curvature estimation. However, for some reason, more recent implementations that use the Levenberg-Marquardt tag do not include Marquardt’s suggestion. Other implementations suggest using an initial λ related to the size of the elements of the approximated Hessian by making λ0 = t max(diag(JtJ)), where t is a value chosen by the user.
It is also important to note that the Levenberg-Marquardt is extremely dependent on the initial guess for the network parameters. Depending on the initial weights of the network, the algorithm may converge to a local minima or not converge at all.
Besides that, it is an extremely fast method for neural network learning when compared to the standard backpropagation algorithm.

# Universal Approximation Theorem

The Universal Approximation Theorem states that:

Let φ(·) be a nonconstant, bounded, and monotonically-increasing continuous function. Let Im0 denote the m0-dimensional unit hypercube [0,1]m0. The space of continuous functions on Im0 is denoted by C(Im0). Then, given any function f Э C(Im0) and є > 0, there exist an integer m1 and sets of real constants αi, bi and wij, where i = 1, …, m1 and j = 1, …, m0 such that we may define:

$F( x_1 , dots, x_{m_0} ) = sum_{i=1}^{m_1} alpha_i varphi left( sum_{j=1}^{m_0} w_{i,j} x_j + b_iright)$

as an approximate realization of the function f; that is,

$| F( x_1 , dots, x_{m_0} ) - f ( x_1 , dots, x_{m_0} ) | < varepsilon$

for all x1, x2, …, xm0 that lie in the input space.

[ From Wikipedia ]

Beautiful, isn’t? 😛

# Análise de Componente Principal em C#

For the english version, Principal Component Analysis in C#, click this link.

A Análise de Componente Principal (PCA) é uma ferramenta exploratória criada por Karl Pearson em 1901 para identificar tendencias desconhecidas em um conjunto de dados multidimensional. A análise envolve um procedimento matemático que transforma um conjunto de variaveis possivelmente correlatas em um grupo menor de variaveis incorrelatas de denominadas componentes principais. Os primeiros componentes principais correspondem a maior variabilidade possivel contida nos dados, enquanto os componentes posteriores correspondem ao restante da informacao, em ordem decrescente de variabilidade.

### Visão Geral

A técnica PCA essencialmente rotaciona o conjunto de pontos ao redor de sua média objetivando seu alinhamento com os primeiros componentes principais. Isto move a maior parte da variancia possivel (usando uma transformação linear) para as primeiras dimensões. Os valores nas dimensoes restantes, portanto, tendem a ser altamente correlacionados e podem ser ignorados com perda mínima de informação.

Para uma explicação mais detalhada sobre PCA, leia o excelente texto Tutorial On Principal Component Analysis, por Lindsay Smith (2002).

### AForge.NET Framework

Minha idéia inicial era implementar a PCA internamente no AForge.NET Framework. O AForge.NET é um excelente framework de Inteligência Artificial e Visão Computacional de código aberto escrito para .NET e desenvolvido principalmente por Andrew Kirillov.

No entanto, como isto envolveria muitas mudanças e adições ao framework, acredito que seria muito difícil rever todas mudanças requeridas para adicionar esta funcionalidade diretamente em seu código fonte.

Devido a isso, este novo, e mais limpo, código, foi implementado usando somente os binários do AForge como ponto de partida. Espero que este código ajude outros precisando realizar a Análise de Componente Principal em seus próprios projetos em C#.

Esta nova biblioteca, a que chamei de Accord.NET, extende o AForge.NET adicionando novas features como a Principal Component Analysis, decomposições numéricas e algumas outras poucas transformações matemáticas e ferramentas. Na verdade, esta extensão é muito mais como um campo de teste para novas features que eu gostaria de ver em versões futuras do AForge. Mas para maior simplicidade, omiti as partes não relacionadas com PCA e criei esta versão compacta apenas para suportar a análise.

## Decisões de Implementação

Como pessoas que querem utilizar PCA em seus projetos geralmente já tem suas próprias classes de matrizes, decidi evitar usar implementações específicas para tornar o código mais flexível. Também tentei evitar dependencias em outros métodos sempre que possível, para tornar o código bastante independente. Acredito que isto também tenha tornado o código mais simples de se entender.

O código está dividido entre estes dois projetos:

• Accord.Math, que fornece as ferramentas matemáticas, como decomposições e transfomações; e
• Accord.Statistics, que fornece análises e ferramentas estatísticas.

Ambos dependem do núcleo do AForge.NET. Adicionalmente, sua estrutura e organização interna também tenta similar ao máximo a empregada no AForge sempre que possível.

O código fonte fornecido não inclui o código fonte na íntegra do Accord Framework, que permanece como um framework de testes do que seria interessante ver no AForge, mas inclui somente porções limitadas do código necessárias para suportar PCA.

## Visão Geral do Código

Below is the main code behind the PCA.

## Usando o Código

Para realizar uma análise simples, você pode simplesmente instanciar um novo objeto PrincipalComponentAnalysis passando seu conjunto de dados e chamar seu método Compute para computar o modelo. Então você pode simplesmente chamar o método Transform para projetar os dados no espaço dos componentes principais.

Abaixo está um exemplo de código demonstrando seu uso.

## Demonstração

Para demonstrar o uso da Análise de Componente Principal, criei uma simples aplicação Windows Forms que efetua uma análise de dados simples e as transformações PCA em planilhas Excel.

The application can open Excel workbooks. Here we are loading some random Gaussian data, some random Poisson data, and a linear multiplication of the first variable (thus also being Gaussian).

Simple descriptive analysis of the source data, with a histogram plot of the first variable. We can see it fits a Gaussian distribution with 1.0 mean and 1.0 standard deviation.

Here we perform PCA by using the Correlation method (actually the transformation uses SVD on the standardized data, not the correlation matrix, but the effect is the same). Because the third variable is a linear multiplication of the first, the analysis detected it was totally irrelevant, thus having zero importance level.

Now we can make a projection of the source data using only the first two components.

Nota: Os componentes principais não são únicos porque a decomposição em valor singular não é única. Além do mais, os sinais na matriz de rotação são arbitrários, e portanto podem difererir entre diferentes programas para PCA.

junto com a aplicação de demonstração vem uma planilha Excel contendo vários exemplos de dados. O primeiro exemplo é o mesmo utilizado por Lindsay em seu Tutorial on Principal Component Analysis. Outros exemplos incluem dados Gaussianos, dados não correlacionados e combinações lineares de dados Gaussianos para exemplificar melhor a análise.

Bem, novamente, espero que alguém ache este código útil! 😛