Jason Knight

Jason is on the algorithms team at Nervana. He is a passionate engineer and computational biologist. His experience includes designing hierarchical Bayesian statistical models for classification of biological datasets, implementing and optimizing MCMC techniques for tortuous energy landscapes, and engineering large distributed data processing systems on top of Apache Spark and Kubernetes. His other interests include cryptocurrencies, programming languages, and pairing active learning with cloud biology.

We are building the Intel Nervana Graph project to be the LLVM for deep learning, and today we are excited to announce a beta release of the work we previously shared as a technical preview. We see the Intel Nervana Graph project as the beginning of an ecosystem of optimization passes, hardware backends, and frontend connectors to popular deep learning frameworks. We also see our set of independent modules and utilities (serialization, visualization, autodiff) as reusable components for building the frameworks of the future.

Since our technical preview release late last year, we have been working on enabling bigger models, higher performance, and more frontend framework connectors. Let’s look at these improvements one at a time, starting with the frontends:

TensorFlow/XLA Support

With the experimental release of XLA and its associated APIs earlier this year, we jumped at the chance to integrate Intel Nervana Graph’s backends and optimization pass infrastructure into TensorFlow at a deeper level. This replaces our previous out-of-band TensorFlow model conversion routine with a seamless user experience in which most existing TensorFlow scripts and utilities (including TensorBoard) work out of the box without any modifications.

Connecting the XLA and Intel Nervana Graph APIs was quite straightforward, since both projects aim for a compact and explicit intermediate representation.

While today the XLA/Intel Nervana Graph integration is at a pre-alpha level, we’d love for people to take it for a spin and kick the tires. We’re working on ironing out known performance issues and improving op and backend support.

Caffe

We now have preliminary support for Caffe models, with command-line emulation and Python converter support.

Distributed Training Support

The heterogeneous transformer now supports training across multiple CPUs or GPUs, and this support will eventually expand to encompass multiple hosts as well. We’ll share additional technical details in an upcoming blog post.

Memory Optimizations

Moving to hardware-independent optimizations: by running liveness analysis on the dataflow graphs produced by frontends such as neon and TensorFlow, we can determine which temporary arrays are no longer needed and reclaim their memory for later computations. This reduces memory usage by a factor of 5 to 6 across a range of smaller test models.

In the view above, the green bar represents the amount of memory used directly by the operation in question, and the red bar represents the amount of memory that is still live at this point in the computation. This feature will be landing soon, and there are more improvements still to come.
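As a rough illustration of the idea, here is a minimal sketch of liveness-based memory reclamation over a linear schedule of operations. It is not the Intel Nervana Graph implementation; the `Op` record, the `last_uses` and `peak_memory` helpers, and the byte sizes are all hypothetical names and values chosen for the example.

```python
from collections import namedtuple

# Hypothetical op record: the tensor it produces, that tensor's size in bytes,
# and the names of the tensors it reads.
Op = namedtuple("Op", ["output", "nbytes", "inputs"])


def last_uses(schedule, keep):
    """Map each tensor name to the index of the last step that touches it."""
    last = {}
    for i, op in enumerate(schedule):
        last[op.output] = i          # live at least at the step that produces it
        for name in op.inputs:
            last[name] = i           # later reads extend the lifetime
    for name in keep:
        last[name] = len(schedule)   # results we must return are never freed
    return last


def peak_memory(schedule, keep=(), reclaim=True):
    """Peak bytes held by op outputs while executing the schedule in order."""
    last = last_uses(schedule, keep)
    live = {}
    peak = 0
    for i, op in enumerate(schedule):
        live[op.output] = op.nbytes
        peak = max(peak, sum(live.values()))
        if reclaim:
            # Free every tensor whose last use is at or before this step.
            for name in [n for n in live if last[n] <= i]:
                del live[name]
    return peak


# A chain a -> b -> c -> d where each temporary is read exactly once.
schedule = [
    Op("a", 100, ()),
    Op("b", 100, ("a",)),
    Op("c", 100, ("b",)),
    Op("d", 100, ("c",)),
]
print(peak_memory(schedule, keep=("d",)))                 # 200
print(peak_memory(schedule, keep=("d",), reclaim=False))  # 400
```

In this toy chain only two buffers are ever live at once, so the peak stays flat as the chain grows, whereas without reclamation it grows with every op. Real graphs have branches and in-place ops, so the savings depend on the model.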

Leveraging MKL-DNN

On the backend side of things, we have made considerable progress in using Intel’s open source MKL-DNN library to greatly accelerate deep learning computations on Intel processors. So far we’ve completed integration of MKL-DNN APIs into the ngraph CPU backend for Relu, Pooling, BatchNormalization, and element-wise additions. In addition, we are considering a full-graph optimization of tensor layout to get the best performance out of MKL-DNN’s hand-tuned kernels.

This effort also included implementing a new pattern matching toolkit that allows optimization passes to recognize arbitrarily complex computational patterns (think regular expressions for graphs). We then used this pattern matcher to detect operator compounding (aka “fusion”) opportunities, starting with matrix multiplications followed by a bias addition, for which the MKL-DNN library offers increased performance.
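To make the "regular expressions for graphs" analogy concrete, here is a minimal sketch of a pattern matcher that fuses a matrix multiplication followed by a bias addition into a single op. It is an illustration only, not the toolkit's API: the `Node` and `Any` classes, the `match` and `fuse` functions, and op names such as `dot_bias` are assumptions made up for this example.

```python
class Node:
    """A graph op with positional inputs; a leaf is an op with no inputs."""

    def __init__(self, op, *args):
        self.op = op
        self.args = args

    def __repr__(self):
        if not self.args:
            return self.op
        return f"{self.op}({', '.join(map(repr, self.args))})"


class Any:
    """Pattern wildcard: matches any subgraph and binds it to `name`."""

    def __init__(self, name):
        self.name = name


def match(pattern, node, bindings):
    """Return True if `node` has the shape of `pattern`, filling `bindings`."""
    if isinstance(pattern, Any):
        bindings[pattern.name] = node
        return True
    if pattern.op != node.op or len(pattern.args) != len(node.args):
        return False
    return all(match(p, n, bindings) for p, n in zip(pattern.args, node.args))


def fuse(node, pattern, build):
    """Rewrite the graph bottom-up, replacing each match with build(bindings)."""
    node = Node(node.op, *(fuse(arg, pattern, build) for arg in node.args))
    bindings = {}
    if match(pattern, node, bindings):
        return build(bindings)
    return node


# Pattern: add(dot(x, w), b), i.e. a matrix multiplication followed by a bias.
MATMUL_BIAS = Node("add", Node("dot", Any("x"), Any("w")), Any("b"))


def build_fused(b):
    return Node("dot_bias", b["x"], b["w"], b["b"])


graph = Node("relu", Node("add", Node("dot", Node("x"), Node("W")), Node("bias")))
fused = fuse(graph, MATMUL_BIAS, build_fused)
print(fused)  # relu(dot_bias(x, W, bias))
```

This sketch handles only tree-shaped patterns with a fixed argument order. A production matcher would also need to handle commutative ops, shared subgraphs, and wildcards that must bind to the same node in several places.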

Expect more announcements from this integration as we begin our benchmarking process and continue the optimization efforts on Intel architectures.

neon 3.0

In addition to the highlighted Intel Nervana Graph improvements above, we are also releasing a technical preview of neon 3.0. Neon is Intel’s reference implementation of a deep learning framework designed for high performance and high productivity. Paired with a library of recent topologies that are available for easy reuse in our Model Zoo, this enables users to quickly get up and running to solve their data science needs.

The neon 3.0 technical preview includes more layer types, more pre-assembled models, and more APIs to make new layers and new models even easier to build. As two examples of the new layer types, we now include the Connectionist Temporal Classification (CTC) cost function as implemented in Baidu’s warp-ctc, and an abstraction around RNN layers that should make it easy for anyone to implement recurrent layers with arbitrary internal computations and minimal boilerplate.
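The general shape of such an RNN abstraction can be sketched as a base class that owns the unrolling loop, so that a new recurrent layer only has to define its per-step computation. This is not the neon 3.0 API; `Recurrent`, `step`, and `ScalarTanhRNN` are hypothetical names, and scalars stand in for tensors to keep the example self-contained.

```python
import math


class Recurrent:
    """Hypothetical base class: unrolls a user-defined step over a sequence."""

    def __init__(self, initial_state):
        self.initial_state = initial_state

    def step(self, x, h):
        """Compute the next state from input x and previous state h."""
        raise NotImplementedError

    def __call__(self, sequence):
        h = self.initial_state
        outputs = []
        for x in sequence:
            h = self.step(x, h)
            outputs.append(h)
        return outputs


class ScalarTanhRNN(Recurrent):
    """A one-unit vanilla RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1})."""

    def __init__(self, w_x, w_h):
        super().__init__(initial_state=0.0)
        self.w_x = w_x
        self.w_h = w_h

    def step(self, x, h):
        return math.tanh(self.w_x * x + self.w_h * h)


rnn = ScalarTanhRNN(w_x=1.0, w_h=0.5)
print(rnn([0.0, 1.0, -1.0]))
```

A gated cell would override `step` in the same way, leaving the loop over time steps untouched.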

We added several models to demonstrate recent topologies and the usage of these new features. These include residual networks on the CIFAR-10 dataset, Deep Convolutional Generative Adversarial Networks (DCGAN) on the MNIST dataset, a speech model with bidirectional RNNs and CTC on LibriSpeech, an LSTM network for sentiment analysis of IMDB movie reviews, a character-based seq2seq model, and a deep Q-network (DQN) playing an Atari game.

Finally, we have introduced the first pieces of a query selector API in neon that lets users easily query and manipulate operators in the Intel Nervana computation graph that underlies the higher-level model objects. Users can build up a model by specifying a series of layers, then query for “all trainable tensors in convolution layers which are followed by pooling”, or “all element-wise operators not in convolutional layers”. With the output of this query, the user can then modify the attributes of those Intel Nervana Graph operators or add debug operators. The selector API is still at a very early stage, but we are excited about its potential, so please feel free to give us feedback on the design and intended use cases.
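To give a feel for this style of query, here is a toy sketch of a chainable selector over a sequential model. It does not reflect the actual neon selector API, which is still being designed; `Tensor`, `Layer`, `LayerQuery`, its methods, and the `learning_rate_scale` attribute are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Tensor:
    name: str
    trainable: bool = False
    attrs: dict = field(default_factory=dict)


@dataclass
class Layer:
    kind: str
    tensors: list


class LayerQuery:
    """Chainable filter over a sequential list of layers (hypothetical API)."""

    def __init__(self, model, indices=None):
        self.model = model
        self.indices = list(range(len(model))) if indices is None else indices

    def of_kind(self, kind):
        keep = [i for i in self.indices if self.model[i].kind == kind]
        return LayerQuery(self.model, keep)

    def followed_by(self, kind):
        keep = [i for i in self.indices
                if i + 1 < len(self.model) and self.model[i + 1].kind == kind]
        return LayerQuery(self.model, keep)

    def tensors(self, trainable=None):
        return [t for i in self.indices for t in self.model[i].tensors
                if trainable is None or t.trainable == trainable]


model = [
    Layer("conv", [Tensor("conv1/W", trainable=True), Tensor("conv1/out")]),
    Layer("pool", [Tensor("pool1/out")]),
    Layer("conv", [Tensor("conv2/W", trainable=True), Tensor("conv2/out")]),
    Layer("affine", [Tensor("fc/W", trainable=True), Tensor("fc/out")]),
]

# "All trainable tensors in convolution layers which are followed by pooling"
selected = LayerQuery(model).of_kind("conv").followed_by("pool").tensors(trainable=True)
for t in selected:
    t.attrs["learning_rate_scale"] = 0.1  # modify attributes of the selection
print([t.name for t in selected])  # ['conv1/W']
```

Each filter returns a new query object, so conditions compose left to right and the model itself is only touched when the caller modifies the selected tensors.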

Join us

Join us by making pull requests, suggestions, and comments on GitHub. We are also hiring for full-time and internship positions.
