Video classification with hybrid models

Deep learning reigns as the de facto method for image classification. However, at least until the beginning of 2016, it was not yet clear whether deep learning methods were decisively better than more traditional approaches based on handcrafted features for the specific task of video classification. While most research interest is certainly directed towards deep learning, we thought it would be fun to play on the handcrafted-features side for a while, and we actually made some pretty interesting discoveries along the way.

Disclaimer: this post is directed towards beginners in computer vision and may contain very loose descriptions or simplifications of some topics in order to spark interest in the field among inexperienced readers – for a succinct and peer-reviewed discussion of the topic, the reader is strongly advised to refer to https://arxiv.org/abs/1608.07138 instead.

Three (non-exhaustive) approaches for classification

In the following paragraphs, we will briefly explore three different ways of performing video classification: the task of determining to which of a finite set of labels a video belongs.

Handcrafted features: designing image features yourself

Early video classification systems were based on handcrafted features or, in other words, on techniques for extracting useful information from a video that were discovered or invented through the expertise of the researcher.

For example, a very basic approach would be to consider that whatever could be regarded as a “corner” in an image (see link for an example) would be quite relevant for determining what is inside this image. We could then present this collection of points (which would be much smaller than the image itself) to a classifier and hope it would be able to determine the image's contents from this simplified representation of the data.
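
A minimal sketch of that idea, in Python with OpenCV and scikit-learn: detect corners, summarise them into a small feature vector, and hand that to an off-the-shelf classifier. The grid size, file paths and labels below are made-up placeholders, not anything from the post.

```python
# Minimal sketch: turn detected corners into a fixed-size feature vector
# and feed it to an off-the-shelf classifier. Illustrative only.
import cv2
import numpy as np
from sklearn.svm import LinearSVC

def corner_histogram(image_path, grid=(8, 8), max_corners=200):
    """Detect corners and summarise their positions as a spatial histogram."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    corners = cv2.goodFeaturesToTrack(gray, maxCorners=max_corners,
                                      qualityLevel=0.01, minDistance=5)
    hist = np.zeros(grid, dtype=np.float32)
    if corners is not None:
        h, w = gray.shape
        for x, y in corners.reshape(-1, 2):
            hist[int(y / h * grid[0]), int(x / w * grid[1])] += 1
    return hist.ravel() / max(hist.sum(), 1)  # normalised, much smaller than the image

# Hypothetical usage: 'train_paths' and 'train_labels' would come from your dataset.
# X = np.array([corner_histogram(p) for p in train_paths])
# clf = LinearSVC().fit(X, train_labels)
```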

In the case of video, the most successful example of such features is certainly the Dense Trajectories approach of Wang et al., which captures frame-level descriptors over pixel trajectories determined by optical flow.
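
The toy sketch below is not Wang et al.'s implementation; it only illustrates the underlying idea of sampling points on a grid and following them through a dense optical flow field. The frame list, grid step and trajectory length are assumptions made up for the example.

```python
# Toy sketch of the trajectory idea behind Dense Trajectories (not the authors' code):
# sample points on a grid and follow them through the dense optical flow field.
import cv2
import numpy as np

def track_points(frames, step=10, length=15):
    """frames: list of 8-bit grayscale images (at least length + 1 of them).
    Returns one (length + 1, 2) track of (x, y) positions per seed point."""
    h, w = frames[0].shape
    points = np.mgrid[step//2:w:step, step//2:h:step].reshape(2, -1).T.astype(np.float32)
    tracks = [points.copy()]
    for prev, nxt in zip(frames[:length], frames[1:length + 1]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        xi = np.clip(points[:, 0].astype(int), 0, w - 1)
        yi = np.clip(points[:, 1].astype(int), 0, h - 1)
        points = points + flow[yi, xi]            # move each point along the flow
        points[:, 0] = np.clip(points[:, 0], 0, w - 1)
        points[:, 1] = np.clip(points[:, 1], 0, h - 1)
        tracks.append(points.copy())
    return np.stack(tracks, axis=1)               # (num_points, length + 1, 2)

# Descriptors (HOG/HOF/MBH in the original work) would then be pooled
# along each trajectory and encoded, e.g. with Fisher Vectors.
```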

Deep learning: letting networks figure out features for you

Well, some might think that using corners or other features to try to guess what is in an image or video is not straightforward at all – why not simply use the image or video itself as a whole, present it to the classifier and see what it comes up with? Well, the truth is that this had been tried, but until a few years ago it simply wouldn't work except for simple problems. For a long time we didn't know how to build classifiers that could handle extremely large amounts of data and still extract anything useful from them.

This started to change in 2006, when interest in training large neural networks started to rise. In 2012, a deep convolutional network with 8 layers managed to win the ImageNet challenge, far outperforming other approaches based on Fisher Vector encodings of handcrafted features. Since then, many works have shown how deep nets can perform far better than other methods in image classification and other domains.

However, while deep nets have certainly been shown to be the undisputed winners in image classification, the same could not yet be said about video classification, at least not at the beginning of 2016. Moreover, training deep neural networks for video can also be extremely costly, both in terms of the computational power needed and the number and sheer size of examples needed to train those networks.

Hybrid models: getting the best of both worlds 😉

So, could we design classification architectures that are both powerful and easy to train? Architectures that would not need to rely on huge amounts of labelled data to learn even the most basic, pixel-level aspects of a video, but could instead leverage that knowledge from techniques that are already known to work fairly well?

Fisher and a deep net – although not quite the same variety found in video classification.

Well, indeed, the answer seems to be yes – as long as you pay attention to some details that can actually make a huge difference.

This is shown in the paper “Sympathy for the Details: Dense Trajectories and Hybrid Classification Architectures for Action Recognition”. The paper is mostly centered on two things: a) showing that it is possible to make standard methods perform as well as deep nets for video; and b) showing that combining Fisher Vectors of traditional handcrafted video features with a multi-layer architecture can actually perform better than most deep nets for video, achieving state-of-the-art results at the date of submission.
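
To give a flavour of what such a hybrid pipeline looks like, here is a rough, simplified sketch – not the architecture from the paper: local descriptors are encoded with a GMM-based Fisher Vector and a small multi-layer classifier is trained on top. The descriptor matrices, number of Gaussians and hidden-layer size are all assumptions for illustration.

```python
# Rough illustration of the hybrid idea (not the architecture from the paper):
# encode local descriptors with a GMM-based Fisher Vector, then train a small
# multi-layer classifier on top of the encodings.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neural_network import MLPClassifier

def fisher_vector(descriptors, gmm):
    """First- and second-order Fisher Vector statistics for one video's descriptors."""
    q = gmm.predict_proba(descriptors)                        # (N, K) posteriors
    n, d = descriptors.shape
    mu, var, w = gmm.means_, gmm.covariances_, gmm.weights_   # diagonal covariances
    diff = (descriptors[:, None, :] - mu) / np.sqrt(var)      # (N, K, D)
    g_mu = (q[:, :, None] * diff).sum(0) / (n * np.sqrt(w)[:, None])
    g_var = (q[:, :, None] * (diff ** 2 - 1)).sum(0) / (n * np.sqrt(2 * w)[:, None])
    fv = np.concatenate([g_mu.ravel(), g_var.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                    # power normalisation
    return fv / max(np.linalg.norm(fv), 1e-12)                # L2 normalisation

# Hypothetical usage: 'train_descs' is a list of per-video descriptor matrices.
# gmm = GaussianMixture(n_components=64, covariance_type='diag').fit(np.vstack(train_descs))
# X = np.array([fisher_vector(d, gmm) for d in train_descs])
# clf = MLPClassifier(hidden_layer_sizes=(512,)).fit(X, train_labels)
```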

The paper, co-written with colleagues and advisors from the Xerox Research Centre and the Computer Vision Center of the Autonomous University of Barcelona, has been accepted for publication at the 14th European Conference on Computer Vision (ECCV’16), to be held in Amsterdam, The Netherlands, this October, and can be found here.

Deep Learning Artificial Neural Networks: Speech Recognition and Universal Translators

Happy new year everyone!

With the beginning of this year, I would like to share a video I wish I had found earlier. It is about the recent breakthrough brought by Deep Neural Networks to the field of speech recognition – which, although I knew it was a breakthrough, I didn't know was already leading to such surprisingly good results.


Deep neural networks are also available in the Accord.NET Framework. However, they are a very recent addition – if you find any issues, bugs, or just wish to collaborate on development, please let me know!

Deep Neural Networks and Restricted Boltzmann Machines


The new version of the Accord.NET Framework brings a nice addition for those working with machine learning and pattern recognition: Deep Neural Networks and Restricted Boltzmann Machines.

Class diagram for Deep Neural Networks in the Accord.Neuro namespace.

Deep neural networks have been listed as a recent breakthrough in signal and image processing applications, such as speech recognition and visual object detection. However, it is not the neural networks themselves that are new here, but rather the learning algorithms. Neural networks have existed for decades, but previous learning algorithms were unsuitable for learning networks with more than one or two hidden layers.

But why more layers?

The Universal Approximation Theorem (Cybenko 1989; Hornik 1991) states that a standard multi-layer activation neural network with a single hidden layer is already capable of approximating any continuous real function with arbitrary precision. Why, then, create networks with more than one layer?

To reduce complexity. Networks with a single hidden layer may approximate arbitrary functions, but they may require an exponential number of neurons to do so. We can borrow a more tangible example from the electronics field. Any boolean function can be expressed using only a single layer of AND, OR and NOT gates (or even only NAND gates). However, one would hardly use only this to fully design, let's say, a computer processor. Rather, specific behaviors would be modeled in logic blocks, and those blocks would then be combined to form more complex blocks until we create an all-encompassing block implementing the entire CPU, as in the small sketch below.

The use of several hidden layers is no different. By allowing more layers we allow the network to model more complex behavior with fewer activation neurons; furthermore, the first layers of the network may specialize in detecting more specific structures to help in the later classification. Dimensionality reduction and feature extraction can thus be performed directly inside the network, in its first layers, rather than by separate, dedicated algorithms.
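
As a tiny, self-contained illustration of that analogy (not tied to any library mentioned in this post), here is the classic construction of XOR from four NAND gates – composing simple blocks in layers is what buys expressive power:

```python
# Tiny illustration of the analogy: a NAND gate is universal, and composing a few
# of them in "layers" yields a more complex block such as XOR.
def nand(a, b):
    return 1 - (a & b)

def xor(a, b):
    # Classic 4-NAND construction: depth (number of layers) buys compactness.
    c = nand(a, b)
    return nand(nand(a, c), nand(b, c))

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor(a, b))   # prints the XOR truth table: 0, 1, 1, 0
```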

Do computers dream of electric sheep?

The key insight for learning deep networks was to apply a pre-training algorithm that can tune the individual hidden layers separately. Each layer is learned on its own, without supervision, which means the layers are able to learn features without knowing their corresponding output labels. This is known as a pre-training algorithm because, after all layers have been learned in this unsupervised fashion, a final supervised algorithm is used to fine-tune the network to perform the specific classification task at hand.
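
A conceptual sketch of that recipe is given below, using scikit-learn's BernoulliRBM rather than the Accord.NET classes discussed in this post. For simplicity, the supervised step only trains a classifier on top of the pre-trained stack, standing in for a full fine-tuning pass; the layer sizes and data are made up for illustration.

```python
# Conceptual sketch of greedy layer-wise pre-training followed by a supervised step,
# using scikit-learn rather than the Accord.NET classes discussed in this post.
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression

def pretrain_stack(X, layer_sizes, n_iter=20):
    """Train each RBM, unsupervised, on the hidden activations of the previous one."""
    rbms, activations = [], X
    for size in layer_sizes:
        rbm = BernoulliRBM(n_components=size, learning_rate=0.05, n_iter=n_iter)
        activations = rbm.fit_transform(activations)   # no labels needed here
        rbms.append(rbm)
    return rbms, activations

# Hypothetical usage with training data X scaled to [0, 1] and labels y:
# rbms, features = pretrain_stack(X, layer_sizes=[256, 64])
# clf = LogisticRegression(max_iter=1000).fit(features, y)   # supervised step
# (True fine-tuning would instead back-propagate through the whole stack.)
```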

As shown in the class diagram at the top of this post, Deep Networks are simply cascades of Restricted Boltzmann Machines (RBMs). Each layer of the final network is created by connecting the hidden layer of each RBM as if they were the hidden layers of a single activation neural network.

Now comes the most interesting part about this approach. It concerns one specific detail of how the RBMs are learned, which in turn allows a very interesting use of the final networks. Since each layer is an RBM learned with an unsupervised algorithm, it can be seen as a standard generative model. And if the layers are generative, they can be used to reconstruct what they have learned. By sequentially alternating computation and reconstruction steps, starting from a random observation vector, the network can produce patterns created solely from its inner knowledge of the concepts it has learned. This may be seen as fantastically close to the concept of a dream.
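
Here is a minimal sketch of that alternating computation–reconstruction loop, again using scikit-learn's BernoulliRBM (whose gibbs method performs one visible → hidden → visible step) rather than the Accord.NET classes, with random binary data standing in for a real dataset:

```python
# Sketch of the "dreaming" idea: start from a random visible vector and alternate
# Gibbs sampling steps through a trained RBM, letting it settle on a learned pattern.
import numpy as np
from sklearn.neural_network import BernoulliRBM

rng = np.random.RandomState(0)
X = (rng.rand(500, 64) > 0.5).astype(float)       # stand-in binary training data
rbm = BernoulliRBM(n_components=32, n_iter=20, random_state=0).fit(X)

v = (rng.rand(1, 64) > 0.5).astype(float)         # random starting "observation"
for _ in range(1000):
    v = rbm.gibbs(v)                              # one visible -> hidden -> visible step
print(v.astype(int))                              # a pattern drawn from what the RBM learned
```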

At this point I would also like to invite you to watch the video linked above. And if you like what you see, I also invite you to download the latest version of the Accord.NET Framework and experiment with those newly added features.

The new release also includes k-dimensional trees, also known as kd-trees, which can be used to speed up nearest-neighbor lookups in algorithms that need them. They are particularly useful in algorithms such as the mean shift algorithm for data clustering, which has been included as well, and in instance classification algorithms such as k-nearest neighbors.
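
As a quick illustration of the kind of speed-up a kd-tree provides (shown here with SciPy's cKDTree rather than the Accord.NET implementation, and with random data standing in for a real dataset):

```python
# Quick illustration of the kd-tree speed-up for nearest-neighbour queries,
# using SciPy here instead of the Accord.NET implementation mentioned above.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.RandomState(0)
points = rng.rand(100_000, 3)                 # dataset of 3-D points
tree = cKDTree(points)                        # built once up front

query = rng.rand(5, 3)
dist, idx = tree.query(query, k=3)            # 3 nearest neighbours per query point
print(idx)   # indices into 'points'; each lookup is roughly O(log n) instead of O(n)
```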

A word on Neural Network learning


Some time ago, while doing random research, I came across this funny but important note about neural networks, and thought it was worth sharing.

The choice of the dimensionality and domain of the input set is crucial to the success of any connectionist model. A common example of a poor choice of input set and test data is the Pentagon’s foray into the field of object recognition. This story is probably apocryphal and many different versions exist on-line, but the story describes a true difficulty with neural nets.

As the story goes, a network was set up with the input being the pixels in a picture, and the output was a single bit, yes or no, for the existence of an enemy tank hidden somewhere in the picture. When the training was complete, the network performed beautifully, but when applied to new data, it failed miserably. The problem was that in the test data, all of the pictures that had tanks in them were taken on cloudy days, and all of the pictures without tanks were taken on sunny days. The neural net was identifying the existence or non-existence of sunshine, not tanks.

– David Gerhard; Pitch Extraction and Fundamental Frequency: History and Current Techniques, Technical Report TR-CS 2003-06, 2003.


🙂