Deep Neural Networks and Restricted Boltzmann Machines


The new version of the Accord.NET brings a nice addition for those working with machine learning and pattern recognition: Deep Neural Networks and Restricted Boltzmann Machines.

Class diagram for Deep Neural Networks in the Accord.Neuro namespace.

Deep neural networks have been listed as a recent breakthrough in signal and image processing applications, such as in speech recognition and visual object detection. However, is not the neural networks which are the new things here; but rather, the learning algorithms. Neural Networks have existed for decades, but previous learning algorithms were unsuitable to learn networks with more than one or two hidden layers.

But why more layers?

The Universal Approximation Theorem (Cybenko 1989; Hornik 1991) states that a standard multi-layer activation neural network with a single hidden layer is already capable of approximating any arbitrary real function with arbitrary precision. Why then create networks with more than one layer?
To reduce complexity. Networks with a single hidden layer may arbitrarily approximate any function, but they may require an exponential number of neurons to do so. We can borrow a more tactile example from the electronics field. Any boolean function can be expressed using only a single layer of AND, OR and NOT gates (or even only NAND gates). However, one would hardly use only this to fully design, let’s say, a computer processor. Rather, specific behaviors would be modeled in logic blocks, and those blocks would then be combined to form more complex blocks until we create a all-compassing block implementing the entire CPU.
The use of several hidden layers is no different. By allowing more layers we allow the network to model more complex behavior with less activation neurons; futhermore the first layers of the network may specialize on detecting more specific structures to help in the later classification. Dimensionality reduction and feature extraction could have been performed directly inside the network on its first layers rather than using specific separate algorithms. 

Do computers dream of electric sheep?

The key insight in learning deep networks was to apply a pre-training algorithm which could be used to tune individual hidden layers separately. Each layer is learned separately without supervision. This means the layers are able to learn features without knowing their corresponding output label. This is known as a pre-training algorithm because, after all layers have been learned unsupervised, a final supervised algorithm is used to fine-tune the network to perform the specific classification task at hand.

As shown in the class diagram on top of this post, Deep Networks are simply cascades of Restricted Boltzmann Machines (RBMs). Each layer of the final network is created by connecting the hidden layers of each RBM as if they were hidden layers of a single activation neural network.

Now, the most interesting part about this approach will given now. It is about one specific detail on how the RBMs are learned, which in turn allows a very interesting use of the final networks. As each layer is a RBM learned using an unsupervised algorithm, they can be seen as standard generative models. And if they are generative, they can be used to reconstruct what they have learned. And by sequentially alternating computation and reconstruction steps initialized with a random observation vector, the networks may produce patterns which have been created using solely they inner knowledge about the concepts it has learned. This may be seen fantastically close to the concept of a dream.

At this point I would also like to invite you to watch the video linked above. And if you like what you see, I also invite you to download the latest version of the Accord.NET Framework and experiment with those newly added features.

The new release also includes k-dimensional trees, also known as kd-trees, which can be use to speed up nearest neighbor lookups in algorithms which need it. They are particularly useful in algorithms such as the mean shift algorithm for data clustering, which has been included as well; and in instance classification algorithms such as the k-nearest neighbors.