neon v2.1.0: Leveraging Intel® Advanced Vector Extensions 512 (Intel® AVX-512)

2017-09-18

Jayaram Bobba

We are excited to announce the availability of the neon™ 2.1 framework. With this release, an optimized backend based on the Intel® Math Kernel Library (Intel® MKL) is enabled by default on CPU platforms. neon™ 2.1 also uses a newer version of the Intel® MKL for Deep Neural Networks (Intel® MKL-DNN), which features optimizations for the new Intel® Xeon® processor (code name Skylake) and the upcoming Intel® Xeon Phi™ processor (code name Knights Mill). As usual, the DNN component of Intel MKL is provided free of charge and is downloaded automatically as part of the neon installation.
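For readers who want to pin the backend explicitly rather than rely on the new default, here is a minimal sketch of backend selection in a neon 2.x script (our illustration, not code from this release; the batch size and seed are arbitrary):

```python
# Minimal sketch (assumption: neon 2.x, where gen_backend accepts the
# 'mkl' backend string). 'cpu' and 'gpu' select the plain NumPy and
# CUDA backends, respectively.
from neon.backends import gen_backend

be = gen_backend(backend='mkl', batch_size=128, rng_seed=0)
print(be)  # should report an MKL-based backend object
```

The bundled examples expose the same switch on the command line through neon's argument parser, e.g. python examples/mnist_mlp.py -b mkl.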

Intel® Xeon® Platinum processors (code name Skylake) are some of the fastest microprocessors currently available. Aside from general-purpose compute performance, features like the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instruction set can significantly accelerate deep learning (DL) workloads; neon 2.1 and the associated Intel MKL-DNN library automatically use Intel AVX-512 instructions to accelerate DL kernels like convolution and inner product. This release also leverages data layouts that are better suited to the machine's vector width of 16 single-precision floats. As seen in Figure 1, this results in up to a 1.9X improvement in DL training performance compared to the previous-generation Intel® Xeon® processor (based on the Broadwell microarchitecture)(1).
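To make the layout point concrete, the following NumPy sketch (our illustration; the shapes are made up) shows the channel-blocked layout, often written nChw16c in MKL-DNN, in which channels are grouped in blocks of 16 so that the innermost dimension lines up with one 512-bit vector of 16 single-precision floats:

```python
# Hedged illustration: NCHW -> blocked "nChw16c". Sixteen consecutive
# channel values become contiguous in memory, matching one 512-bit
# (16 x float32) Intel AVX-512 register. Shapes are illustrative.
import numpy as np

N, C, H, W, B = 2, 64, 8, 8, 16   # B: float32 lanes per 512-bit vector

x = np.random.rand(N, C, H, W).astype(np.float32)

# Split C into C//B blocks of B channels, then move each B-channel block
# to the innermost (contiguous, vectorizable) dimension.
x_blocked = np.ascontiguousarray(
    x.reshape(N, C // B, B, H, W).transpose(0, 1, 3, 4, 2))
print(x_blocked.shape)   # (2, 4, 8, 8, 16)
```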

For the next release of neon, more performance improvements are in the pipeline. neon currently stores tensors in a CHWN data layout, which helps build fast GPU kernels, while CPU kernels prefer a different layout; this mismatch leads to data conversions between the layers of a model (see the sketch below), and we are working both to reduce the number of these conversions and to accelerate each conversion. Lastly, we are planning to improve Recurrent Neural Network (RNN) performance and to optimize for a wider variety of models.
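The conversion in question is essentially a transpose plus a copy. A hedged sketch of the idea (our illustration, not neon's internal code): a GPU-friendly CHWN tensor is rearranged into the NCHW order that CPU kernels typically prefer, and it is this copy, repeated at layer boundaries, that we want to minimize.

```python
# Hedged sketch of a CHWN -> NCHW layout conversion at a layer boundary.
import numpy as np

C, H, W, N = 64, 8, 8, 32
t_chwn = np.random.rand(C, H, W, N).astype(np.float32)

# transpose() only changes the view; ascontiguousarray() performs the
# actual (and costly) memory copy that layout conversions pay for.
t_nchw = np.ascontiguousarray(t_chwn.transpose(3, 0, 1, 2))
print(t_nchw.shape)   # (32, 64, 8, 8)
```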

Over the past year, we have accelerated DL training performance on Intel CPUs by almost two orders of magnitude (Figure 2). Resnet-50 training throughput went up from about 1 image/sec on an Intel Xeon E5-2699 v4(1) before optimization to 62 images/sec on an Intel Xeon Platinum 8180(2) (see the table below), an improvement of 62X. The latest Intel Xeon processors have thus evolved into a platform that is optimized for DL as well as many other AI problem-solving approaches.


Figure 1: Performance improvement for neon™ 2.1 on Intel® Xeon® Platinum CPUs (SKX) relative to Intel® Xeon® E5-2699 V4 (BDW) CPUs

 

Figure 2: neon™ performance improvements since mid-2016

 

Training throughput in images/sec:

                                         BDW(1)     BDW(1)     SKX(2)
Model                                    neon 1.9   neon 2.1   neon 2.1
Convnet-Alexnet (batch size 256)         12.09      503        963
Convnet-Googlenet v1 (batch size 128)    1.66       167        275
Resnet-50 on Imagenet (batch size 50)    1          42         62

(1) BDW: Intel® Xeon® E5-2699 v4 processor (Broadwell) system configuration
(2) SKX: Intel® Xeon® Platinum 8180 processor (Skylake) system configuration

 

Intel® neon™ 2.1 Framework Key Contributors:

Peng Zhang (Development), Wei Wang (Benchmarking and Documentation), Dawn Stone (Validation)

Jayaram Bobba is a senior software engineer in the AI Products Group at Intel. He works on graph compilers and CPU optimizations for Machine Learning frameworks like TensorFlow and neon. He also leads the team developing the CPU backend for Nervana Graph. During his time at Intel, he has contributed to many binary translation projects and to hardware/software co-design efforts ranging from microarchitecture enhancements to software algorithm improvements. Jayaram has a PhD in Computer Architecture from the University of Wisconsin.

Peng Zhang is a software engineer in the Software and Services Group at Intel. He works on the optimization of Deep Learning frameworks including neon and Torch. He has a Master's degree in Control Science and Engineering from Tsinghua University.

Wei Wang is a software engineer in the AI Products Group at Intel. He works on benchmarking Machine Learning frameworks like neon and Caffe. He has a PhD in High-Performance Computing (HPC) from the University of Delaware.

Dawn Stone is a software engineer in the AI Products Group at Intel. She works on validating Machine Learning frameworks including Intel Nervana™ Graph and neon™.

 


Legal Notices and Disclaimers

Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.

For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer.

Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.


Intel® Advanced Vector Extensions (Intel® AVX)* provides higher throughput to certain processor operations. Due to varying processor power characteristics, utilizing AVX instructions may cause a) some parts to operate at less than the rated frequency and b) some parts with Intel® Turbo Boost Technology 2.0 to not achieve any or maximum turbo frequencies. Performance varies depending on hardware, software, and system configuration and you can learn more at http://www.intel.com/go/turbo.

Intel processors of the same SKU may vary in frequency or power as a result of natural variability in the production process.

© 2017 Intel Corporation. Intel, the Intel logo, MKL-DNN, Xeon, Xeon Phi, Xeon Phi logos, Xeon logos, neon, and AVX-512 are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

[1] neon is a trademark of Intel Corporation or its subsidiaries in the U.S. and/or other countries.

Related Blog Posts

neon™ 2.0: Optimized for Intel® Architectures

Intel® Nervana™ Graph Beta

Training Generative Adversarial Networks in Flexpoint