Training Generative Adversarial Networks in Flexpoint

2017-06-19


Urs Köster

Urs has over 9 years of research experience in machine learning, spanning areas from computer vision and image processing to large scale neural data analysis. His data science experience ranges from working with national laboratories in applying deep learning to understand climate change, to helping customers solve challenging computer vision problems in medical imaging. Urs works on making the fastest implementations of convolutional and recurrent networks. For his postdoc at UC Berkeley, he used unsupervised machine learning algorithms such as Restricted Boltzmann Machines to understand the visual system.


With the recent flood of breakthrough products using deep learning for image classification, speech recognition and text understanding, it’s easy to think deep learning is just about supervised learning. But supervised learning requires labels, which most of the world’s data does not have. Instead, unsupervised learning, which extracts insights from unlabeled data, will open deep learning up to a much more diverse set of applications.

There are obvious use cases, such as generative models for texture generation or super-resolution (https://arxiv.org/abs/1609.04802). Even more interesting are the possibilities for semi-supervised learning: by learning efficient data representations, models that today require millions of images or thousands of hours of speech could reach similar performance with just a handful of labeled examples. In this blog post, we will explore how this research ties in with our work on high-performance, low bit-width deep learning hardware.

 

The Twitter Cortex team uses GANs for super-resolution. Left: the original image; right: the high-resolution version produced by the network.

 

Unsupervised deep learning has gained substantial momentum with Generative Adversarial Networks (GANs), which pose network training as a two-player game between two competing networks. One of the networks, the generator, learns to transform low-dimensional noise into samples that mimic the training data (e.g. images). The second network, the discriminator, learns to distinguish the fake images produced by the generator from the real images in the training data. The cost function of the GAN is therefore based on a simple binary classification problem: the discriminator is trained to classify as accurately as possible, while the generator is optimized to confuse the discriminator as much as possible. In a perfectly trained GAN with sufficient capacity, the generator outputs data whose statistics are indistinguishable from those of the real data, and the discriminator performs at chance level; this state is theoretically stable, though convergence to it is often tricky to achieve in practice.

GANs were invented by Ian Goodfellow in 2014, while he was a grad student in Yoshua Bengio’s lab in Montreal. Ian has a lot of practical information about training GANs in his NIPS 2016 Tutorial. A particularly popular flavor of this model is the DC-GAN, developed by researchers at FAIR.
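To make the two-player game concrete, here is a minimal NumPy sketch of the two loss terms. This is only an illustration, not the neon implementation; `discriminator` and `generator` are hypothetical callables standing in for the two networks.

```python
import numpy as np

def gan_losses(discriminator, generator, real_images, noise):
    """Binary cross-entropy losses for the two-player GAN game.

    discriminator(x) -> estimated probability that x is real, in (0, 1)
    generator(z)     -> fake images produced from noise z
    (Both are hypothetical callables standing in for the two networks.)
    """
    eps = 1e-8  # numerical safety for the logs
    d_real = discriminator(real_images)        # discriminator wants this near 1
    d_fake = discriminator(generator(noise))   # discriminator wants this near 0

    # Discriminator: classify real vs. fake as accurately as possible.
    d_loss = -np.mean(np.log(d_real + eps) + np.log(1.0 - d_fake + eps))

    # Generator: confuse the discriminator (non-saturating form).
    g_loss = -np.mean(np.log(d_fake + eps))
    return d_loss, g_loss
```

In practice the two losses are minimized in alternation, one network's weights held fixed while the other's are updated.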

 

One of these dogs is real and one was generated by the DC-GAN algorithm: the real dog is on the left, the generated one on the right.

A recent and major milestone in GAN development is the Wasserstein GAN (W-GAN), developed by Martin Arjovsky at NYU’s Courant Institute of Mathematical Sciences together with Facebook researchers. The W-GAN has two big advantages. First, it is easier to train than a standard GAN because its cost function provides a more robust gradient signal. Second, that cost function (an estimate of the Wasserstein-1 distance) can be used to monitor convergence, making it much easier to design a good model and find the right set of hyperparameters.
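For comparison with the standard GAN losses above, here is a sketch of the W-GAN losses in the original weight-clipping formulation, again with hypothetical `critic` and `generator` callables; the clipping constant of 0.01 is the default suggested in the W-GAN paper.

```python
import numpy as np

def wgan_losses(critic, generator, real_images, noise):
    """W-GAN losses in the original weight-clipping formulation.

    critic(x) returns an unbounded real-valued score rather than a probability.
    (critic and generator are hypothetical callables for the two networks.)
    """
    score_real = critic(real_images)
    score_fake = critic(generator(noise))

    # The critic maximizes the Wasserstein-1 estimate, i.e. minimizes its negative.
    critic_loss = np.mean(score_fake) - np.mean(score_real)

    # The generator tries to raise the critic's score on its own samples.
    gen_loss = -np.mean(score_fake)

    # The positive estimate is the quantity you plot to monitor convergence.
    wasserstein_estimate = -critic_loss
    return critic_loss, gen_loss, wasserstein_estimate

def clip_critic_weights(weights, c=0.01):
    """Enforce the Lipschitz constraint by clipping each critic weight to [-c, c]."""
    return [np.clip(w, -c, c) for w in weights]
```

Because the critic's score is unbounded, its weights are clipped after every update to keep it approximately Lipschitz, which is what makes the score difference a usable Wasserstein-1 estimate.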

GANs are popular as generative models of images, but they haven’t yet reached the holy grail of modeling the full distribution of natural images. They do a lot better when trained on images from a particular class, such as birds and flowers, faces, and, for unfathomable reasons, bedrooms, for which a set of 3 million images is available in Princeton’s large-scale scene understanding (LSUN) dataset.

Unsupervised Learning Meets Low Precision

At Nervana, we follow the cutting edge of machine learning research very closely so that we can optimally support new models in the next generation of AI hardware we are developing. Our team of data scientists implements new models in our neon deep learning framework, where we can run them through our suite of simulators. One of the main differences between Nervana hardware and other accelerators is that we use the Flexpoint data format, which combines the hardware-friendly aspects of fixed point with the “it just works” user friendliness of floating point.

There has been a lot of research into low precision data types for deep learning, ranging from 16-bit floating point all the way down to binary neural networks, which we have blogged about previously. These networks often require significant changes relative to their 32-bit counterparts, while Flexpoint is designed to work without any changes to the network or training procedure.
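To give a rough intuition for the format, the toy NumPy sketch below quantizes a tensor to integer mantissas that share a single exponent, in the spirit of flex16+5. The real exponent-management algorithm used by our hardware and simulators is more sophisticated, so treat this purely as an illustration.

```python
import numpy as np

def to_flex_like(x, mantissa_bits=16):
    """Quantize a tensor to integer mantissas plus one shared exponent.

    This is an illustrative approximation, not the actual Flexpoint
    exponent-management algorithm used in the hardware simulator.
    """
    max_mantissa = 2 ** (mantissa_bits - 1) - 1          # signed 16-bit range
    max_abs = np.max(np.abs(x)) + 1e-30
    # Choose the smallest power-of-two scale that keeps the tensor in range.
    exponent = int(np.ceil(np.log2(max_abs / max_mantissa)))
    scale = 2.0 ** exponent
    mantissas = np.clip(np.round(x / scale), -max_mantissa, max_mantissa)
    return mantissas.astype(np.int16), exponent

def from_flex_like(mantissas, exponent):
    """Reconstruct floating-point values from mantissas and the shared exponent."""
    return mantissas.astype(np.float32) * (2.0 ** exponent)
```

Because all entries of a tensor share one exponent, the arithmetic inside a layer looks like integer fixed point, while the exponent bookkeeping keeps the dynamic range behaving like floating point from the user's perspective.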

To prove the point, we took our W-GAN implementation and trained it on the LSUN bedroom dataset both in 32-bit floating point and in Flexpoint with a 16-bit mantissa and a 5-bit shared exponent. The only changes we made to the original model were to use uniform instead of Gaussian noise (it is a little faster to sample) and a noise dimensionality of 128 instead of 100 (we really like powers of two). Neither change seems to affect the quality of the results, which are shown below. The samples generated by the two models are shown after every epoch of training for a fixed set of noise inputs, so the content of the generated images changes frequently at the beginning and then stabilizes over the course of training. The results from the model trained in Flexpoint are visually indistinguishable, and inspecting the learning curves shows that convergence is unchanged.
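In code, the only change to the generator's input looks roughly like this; the [-1, 1] range and batch size are illustrative assumptions, not values taken from the actual script.

```python
import numpy as np

batch_size, noise_dim = 64, 128   # noise dimensionality raised from 100 to 128

# Standard DC-GAN / W-GAN recipe: Gaussian noise.
z_gaussian = np.random.normal(0.0, 1.0, size=(batch_size, noise_dim))

# Our variant: uniform noise, which is slightly cheaper to sample.
# (The [-1, 1] range and batch size of 64 are illustrative assumptions.)
z_uniform = np.random.uniform(-1.0, 1.0, size=(batch_size, noise_dim))
```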

Right: Real LSUN images; Left: Learning Curves in floating point and Flexpoint

 

   

Right: Images generated from Flex 16+5 model; Left: Images generated from 32-bit floating point model

 

As far as we know, GANs have not yet received any attention in reduced bit-width deep learning, yet we were able to train them without making any changes to the model or to our (simulated) hardware. The code we used to train this model in 32-bit and 16-bit floating point (though unfortunately not the Flexpoint simulator tools) is open source and available on GitHub as part of our neon examples.




“Training Generative Adversarial Networks in Flexpoint” was written by Urs Köster and Xin Wang.
 
