Using neon for Scene Recognition: Mini-Places2

2016-02-09


Hunter Lang

Hunter Lang is a math major at MIT and interned with Nervana in Summer 2015 and Winter 2016. He worked on several projects ranging from data science to cloud software design. He surfs, reads, and likes to eat (whole) pizzas.

Introduction

Much of the latest research in computer vision has focused on deep learning. These techniques have been applied to object recognition, where the goal is to predict what type of object is pictured in an image, and to object localization, where the goal is to predict an object’s location in an image. Scene recognition, the problem of determining an image’s setting, remains a comparatively unexplored area.

Consider a photo of a knife in a kitchen. An object recognition system might output “knife.” A localization system would draw a bounding box around the knife. A scene recognition system would output “kitchen.”

Places2

The absence of a large public corpus of labeled scene images, analogous to ImageNet (the standard dataset for recognition and localization), has made it difficult to conduct experiments with deep learning models, since neural networks are best applied in settings with large amounts of training data. Aiming to solve this issue, researchers at MIT recently released the Places2 dataset, which contains roughly ten million images of over four hundred scenes. Places2 was included in ILSVRC 2015, an international computer vision competition, as a “taster challenge.”


At the time of the release, there was very little research into building neural scene recognition models on image data. Object recognition models are very effective, but a priori, one might expect that scene models need to be invariant to differences in image content and therefore need different architectures. For instance, a picture of a tomato being sliced on a cutting board should be labelled “kitchen,” but so should an image of a dish being washed at a sink: two images showing totally different objects can carry the same scene label. Places2 is full of such examples, all labelled “kitchen” but with very different object content.

At the same time, the deep convolutional networks that achieve state-of-the-art performance on object recognition tasks have been successfully applied to unrelated domains like speech recognition and language modeling, suggesting that the convolutional architecture has enough power to model many different complex systems, perhaps even enough to learn all the different types of objects in each scene category.

The Competition

To go along with the new dataset and the ILSVRC challenge, the MIT course 6.869: Advances in Computer Vision hosted a competition in Fall 2015 to build the most accurate scene recognition model on a miniature version of Places2, scaled down in the interest of time. The smaller dataset featured only 100,000 training images and 10,000 validation images, and was publicly available on the course page at the time of this post. Inspired by the cross-domain success of convolutional neural networks, we began the competition by using architectures from object recognition despite the potential difficulty mentioned above. We used neon to implement our models; its fast training speed let us iterate quickly and try many different layer configurations and hyperparameters. The simple model below, inspired by the VGG network, won second place in the MIT challenge.

from neon.initializers import Xavier
from neon.layers import Affine, Conv, Dropout, Pooling
from neon.models import Model
from neon.transforms import Rectlin, Softmax

# Xavier initialization for the convolutional (local) and fully-connected layers
init1 = Xavier()
init2 = Xavier(local=False)
relu = Rectlin()

# Settings shared by every convolutional layer
conv_params = {
    'strides': 1,
    'init': init1,
    'activation': relu,
    'batch_norm': True
}

layers = [
    # Stage 1: two 3x3 conv layers, 32 feature maps
    Conv((3, 3, 32), padding=1, **conv_params),
    Conv((3, 3, 32), padding=1, **conv_params),
    Pooling((2, 2), strides=2, padding=1),
    # Stage 2: two 3x3 conv layers, 64 feature maps
    Conv((3, 3, 64), padding=1, **conv_params),
    Conv((3, 3, 64), padding=1, **conv_params),
    Pooling((2, 2), strides=2, padding=1),
    # Stage 3: four 3x3 conv layers, 128 feature maps
    Conv((3, 3, 128), padding=1, **conv_params),
    Conv((3, 3, 128), padding=1, **conv_params),
    Conv((3, 3, 128), padding=1, **conv_params),
    Conv((3, 3, 128), padding=1, **conv_params),
    Pooling((2, 2), strides=2, padding=1),
    # Stage 4: three 3x3 conv layers, 256 feature maps
    Conv((3, 3, 256), padding=1, **conv_params),
    Conv((3, 3, 256), padding=1, **conv_params),
    Conv((3, 3, 256), padding=1, **conv_params),
    Pooling((2, 2), strides=2),
    # Classifier: two fully-connected layers with dropout, then a
    # 100-way softmax over the scene categories
    Affine(1024, init2, batch_norm=True, activation=relu),
    Dropout(keep=0.5),
    Affine(1024, init2, batch_norm=True, activation=relu),
    Dropout(keep=0.5),
    Affine(100, init2, activation=Softmax())
]
model = Model(layers=layers)

This network architecture uses batch normalization, Xavier weight initialization, and Dropout. Since the dataset was small, we had to experiment with different regularization techniques. Various initialization schemes, gradient clipping, and data augmentation (random cropping and flipping, color noise) were already built into neon. Additional techniques, such as data augmentation with blur, training with gradient noise (sketched below), and multi-scale evaluation, were straightforward to add.
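To give a sense of what adding such a technique involves, here is a minimal sketch of annealed gradient noise in the spirit of Neelakantan et al.; the function name and hyperparameter values are illustrative assumptions rather than neon API, and in practice the perturbation would be applied inside the optimizer’s update step.

import numpy as np

def add_gradient_noise(grad, step, eta=0.3, gamma=0.55, rng=np.random):
    # Noise variance anneals over training: sigma^2 = eta / (1 + step)^gamma
    sigma = np.sqrt(eta / (1.0 + step) ** gamma)
    # Perturb the gradient with zero-mean Gaussian noise before the update
    return grad + rng.normal(0.0, sigma, size=grad.shape)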

We also experimented with different built-in optimizers like Adam, Adagrad, and RMSProp, but found that there is no substitute for vanilla gradient descent with a carefully tuned learning rate schedule (see the sketch below). Since individual networks trained very quickly, it was easy to run models iteratively to find a good schedule. The network above had a 20% top-5 error rate on the validation set. (Top-k error is the percentage of images for which the correct scene label is not among the k labels the network considers most likely.) We trained it twice using slightly different hyperparameters and averaged the outputs of the final softmax layers to get predictions for the test set. The final test performance was 47.97% top-1 error and 19.87% top-5 error. The winning network’s scores were 45.32% and 18.43% top-1 and top-5, respectively.
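As a rough sketch of what this setup might look like, neon’s GradientDescentMomentum optimizer accepts a Schedule that rescales the learning rate at chosen epochs, and averaging two models’ softmax outputs is a one-liner; the step epochs, rates, and the model_a/model_b/test_set names below are illustrative assumptions, not the exact values we used.

from neon.optimizers import GradientDescentMomentum, Schedule

# Drop the learning rate by 10x at (hypothetical) epochs 20 and 40
schedule = Schedule(step_config=[20, 40], change=0.1)
opt = GradientDescentMomentum(learning_rate=0.01, momentum_coef=0.9,
                              wdecay=0.0005, schedule=schedule)

# Ensemble two trained models by averaging their final softmax outputs
probs = (model_a.get_outputs(test_set) + model_b.get_outputs(test_set)) / 2.0
top5 = probs.argsort(axis=1)[:, -5:]  # five most likely scene labels per image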

Following Up

Soon after the MIT competition ended, results came in from ILSVRC. The winning submission for object recognition came from Microsoft Research Asia and used a new model architecture called the Deep Residual Network (ResNet). Researchers had noticed that adding layers to a deep neural network can sometimes make accuracy worse, even though in principle the extra layers should be able to learn the identity mapping. With this intuition in mind, the authors group their network into modules and add additive skip connections between each module’s inputs and its final pre-activations. They suggest in the paper that the skip connection makes it easier to recover the identity mapping during learning: if the module sets all its weights and biases to zero, it will output its input. In practice, this seems to account for large gains in accuracy relative to previous state-of-the-art networks.
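As a rough illustration of the idea, here is one way a basic residual module can be expressed with neon’s container layers; this is a minimal sketch using neon’s MergeSum and SkipNode layers, not the exact modules we trained.

from neon.initializers import Kaiming
from neon.layers import Activation, Conv, MergeSum, SkipNode
from neon.transforms import Rectlin

def residual_module(nfm):
    # Main path: two 3x3 conv layers with batch norm; ReLU after the first only
    mainpath = [Conv((3, 3, nfm), init=Kaiming(), padding=1,
                     batch_norm=True, activation=Rectlin()),
                Conv((3, 3, nfm), init=Kaiming(), padding=1, batch_norm=True)]
    # Side path: identity skip connection around the main path
    sidepath = [SkipNode()]
    # Sum the two paths, then apply the post-activation ReLU
    return [MergeSum([mainpath, sidepath]), Activation(Rectlin())]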

Using an ensemble of three simple ResNets trained in neon, we achieved 38.84% and 12.64% top-1 and top-5 error on the mini-Places2 validation set, more than five percentage points lower than the best competition score. The Deep Residual Networks we used have 40 convolutional layers and train in neon in about 500 seconds per epoch.

If you would like to extend these models or apply them to your own datasets, you can get started here.

I would like to thank Alex Park, Arjun Bansal, Urs Köster, and my competition partner Nicholas Hynes, for their contributions to this work.

This work was performed on the Nervana Cloud. To get started with the Nervana Cloud or request more information, please contact us at products@nervanasys.com.
