"Not so fast, FFT": Winograd



Scott Leishman

Scott has over nine years of experience creating machine-learning-based solutions to large-scale, real-world problems. At Nervana, Scott is a member of the Algo team, with one foot in the Cloud. Inside of work he can often be found pushing and reviewing code. Outside of work he can often be found running long distances and quaffing local craft beer (occasionally simultaneously).

Deep learning thrives on speed. Faster training enables the construction of larger and more complex networks to tackle new domains such as speech or decision making.

Recently, small convolutional filter sizes have become an important component in convolutional neural networks such as Google’s AlphaGo network or Microsoft’s deep residual networks. While most convolutions are computed with the fast Fourier transform (FFT) algorithm, the rising prominence of small 3×3 filters paves the way for a lesser-known technique specialized for small filter sizes: Winograd’s minimal filtering algorithms (Lavin and Gray, 2015).
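To make the idea concrete, here is a minimal one-dimensional sketch of the smallest case described by Lavin and Gray (2015), F(2,3), which produces two outputs of a 3-tap filter from four inputs. It is an illustration in plain Python, not the GPU kernels discussed in this post; the function names `direct_f23` and `winograd_f23` are made up for this example.

```python
def direct_f23(d, g):
    """Two outputs of a 3-tap filter over 4 inputs: 6 multiplications."""
    return [
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ]


def winograd_f23(d, g):
    """Same two outputs with 4 multiplications between data and filter terms."""
    # Filter transform: depends only on g, so it can be computed once and reused.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]
    # Data transform (additions only), followed by 4 elementwise multiplications.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3
    # Output transform (additions only).
    return [m1 + m2 + m3, m2 - m3 - m4]


if __name__ == "__main__":
    data = [1, 2, 3, 4]
    taps = [1, 2, 3]
    assert winograd_f23(data, taps) == direct_f23(data, taps)
    print(direct_f23(data, taps), winograd_f23(data, taps))
```

The saving comes from trading multiplications for additions: 4 instead of 6 here. Nesting the same construction in two dimensions for 3×3 filters gives 16 multiplications instead of 36 per 2×2 output tile, a 2.25x reduction in multiplications before any implementation overhead.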

We have implemented the Winograd algorithm on GPUs and benchmarked performance and convergence on state-of-the-art networks. Depending on the network architecture, training with Nervana’s Winograd algorithm yields speed-ups of 2-3x over NVIDIA’s cuDNN v4 kernels.

We benchmarked speed on:

  • NVIDIA Titan X GPU (fixed at 1 GHz)
  • Intel® Core™ i7 CPU 975 @ 3.33GHz

with several convolutional networks operating on the ImageNet dataset: VGG, GoogleNet, and Microsoft’s deep residual networks. We tested different minibatch sizes and compared the computation time of the 3×3 layers between Nervana Winograd and NVIDIA cuDNN v4.

Performance was measured in units of algorithmic speedup: computation speed (e.g. images/second) normalized to the maximum theoretical speed of the direct convolutional approach. Values above one are only reachable with an algorithm that performs fewer arithmetic operations than direct convolution, and their size indicates how efficiently that faster algorithm is implemented.
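One way to read this metric is sketched below. This is our interpretation of the normalization, not the benchmarking code itself: the function names, the layer shape, the timing, and the peak-throughput figure are all hypothetical values chosen for illustration.

```python
def direct_conv_flops(n, c, k, p, q, r, s):
    """Floating-point operations for a direct convolution.

    n: minibatch size, c: input channels, k: output channels,
    p, q: output height and width, r, s: filter height and width.
    Each multiply-accumulate is counted as 2 operations.
    """
    return 2 * n * c * k * p * q * r * s


def algorithmic_speedup(flops_direct, seconds, peak_flops_per_second):
    """Effective throughput relative to the device's theoretical peak.

    The work is always counted as if direct convolution had been used,
    so a faster algorithm can score above 1.0.
    """
    effective = flops_direct / seconds
    return effective / peak_flops_per_second


if __name__ == "__main__":
    # Hypothetical 3x3 layer: minibatch 64, 64 -> 64 channels, 224x224 output.
    flops = direct_conv_flops(64, 64, 64, 224, 224, 3, 3)
    # Assumed peak: 3072 cores x 2 flops per cycle x 1 GHz (an assumption, not a measurement).
    peak = 3072 * 2 * 1e9
    # Hypothetical measured time for the layer, in seconds.
    measured_seconds = 0.030
    print(algorithmic_speedup(flops, measured_seconds, peak))
```

Under this reading, a perfectly efficient direct convolution would score exactly 1.0, which is why any value above 1.0 has to come from an algorithm that does less arithmetic.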

For the 3×3 layers of the VGG model, Nervana Winograd is up to 3 times faster than cuDNN v4 (see figure below). These speed-ups also held when we measured the end-to-end forward propagation and backward propagation times.

We obtained similar results with GoogleNetv2 and MSRA networks.

Nervana Winograd is not only fast but also numerically accurate. For example, AlexNet convergence is exactly the same as with direct convolution, as shown below.

AlexNet convergence

These results are also confirmed by third-party benchmarking. Our Winograd implementation is open source and can be found in the latest release of our deep learning library, neon. Look out for a forthcoming part 2 with more technical details on Nervana Winograd. We are actively working on improving Nervana Winograd to allow your networks to train faster and handle more complex problems.



Lavin, Andrew and Gray, Scott (2015). Fast Algorithms for Convolutional Neural Networks. http://arxiv.org/abs/1509.09308
