Flexpoint: numerical innovation underlying the Intel® Nervana™ Neural Network Processor

Nov 10, 2017

We are delighted to announce our research team’s new paper, Flexpoint: an adaptive numerical format for efficient training of deep neural networks, accepted to the 2017 Neural Information Processing Systems (NIPS) conference.

This paper provides a complete description and validation of Flexpoint, the core numerical technology powering the Intel® Nervana™ Neural Network Processor, specialized hardware that enables efficient scale-up and acceleration of deep learning.  In this blog post, we explain Flexpoint’s operational mechanisms and the underlying philosophy of its design.  The reader is expected to have some basic knowledge of deep learning.

 

A rush for specialized deep learning hardware and the call for numerical innovations

First, a quick look at the background to set the stage.

In recent years deep learning has grown so successful that it now dominates today’s artificial intelligence (AI) landscape.  It has proven itself in almost every industry with remarkable success stories, heralding a world-changing AI revolution with unstoppable momentum.

Deep learning’s success stems from its ability to learn from huge amounts of data, which requires massive computing power.  The performance and energy efficiency of computing devices in today’s data centers are seriously limiting the rapid growth of deep-learning-powered AI.  More than ever, this vibrant and fast-expanding sector of today’s economy creates a huge demand for cheaper and faster processors specifically designed for AI, and especially deep learning, workloads.

The insatiable demand for deep learning hardware translates to growing efforts to design ASICs (application-specific integrated circuits) specifically tailored for deep learning, i.e. special processors capable of performing inference and training of deep neural networks with less silicon, less power and less time than general purpose computing architectures.  It is widely accepted that ASICs can deliver on the promise of further scaling up compute capability for AI development[1].  In essence, such new hardware should be optimized for dense matrix operations, or more generally tensor processing.  In response to the need to scale up computations for deep networks, and alongside similar efforts by other industry leaders, Intel Nervana has announced the new Intel® Nervana™ Neural Network Processor (Intel® Nervana™ NNP, Figure 1), the first generation of Intel’s deep learning ASIC designed from the ground up for AI.

Figure 1.  Intel® Nervana™ Neural Network Processor.  This is the first generation of Intel’s ASIC processors for deep learning, previously known as Lake Crest, and the first chip capable of tensor processing in Flexpoint.

Not unlike our competitors’ efforts, the Intel Nervana NNP boasts key features optimized for tensor processing, including fast high-bandwidth links between processing cores connected in a smart topology, high compute density, high-bandwidth high-speed memory, and streamlined numerical operations that balance the trade-off between speed and precision.  Despite these general similarities at a glance, our design is unique on many levels.

In particular, on the forefront of limited numerical precision for accelerated deep learning, we have pushed significantly further than our competitors.  This is epitomized by our invention of Flexpoint, a tensorial numerical data format that works on our hardware to offer higher speed and higher compute density than the conventional numerical formats widely used in digital computing today.

Before we share the inspiration behind the invention of Flexpoint and the value of such numerical innovations, let us first start with an explanation of standard numerical formats and their use in deep learning today.

 

Standard numerical formats and their hardware requirements

In digital computing devices, all numbers have to be represented by finite bit strings, upon which logic is implemented to realize mathematical operations.

Readers with basic programming experience will recognize a few number types that correspond to standard numerical formats: integers and floating point numbers (Figure 2).

Figure 2.  Several standard numerical formats.  On the right we list the smallest and largest number each numerical format can represent (defining a dynamic range of encoding) and its accuracy, i.e. relative error for floating point formats and absolute error for integers.  A single bit (“binary”) is also included, though not strictly a standard numerical format. 

Two notions are important here: dynamic range and precision.  The former is the span between the smallest and the largest representable absolute numbers, and the latter is related to how many numbers can be represented within the dynamic range.

Integers are simple: the smallest representable positive number is always 1, and there is a largest representable number (typically 2^n − 1, n being the number of encoding bits).  Spacing between neighboring representable numbers is always 1, leading to a linear scaling between dynamic range and relative accuracy.  To represent numbers with fractions, integers can be scaled by a certain power of 2, equivalent to shifting the binary point by a fixed number of bits, leading to fixed point.  Fixed point arithmetic is essentially integer arithmetic; it simply shifts all representable numbers along the logarithmic scale by the same amount, preserving the same relationship between dynamic range and absolute accuracy as integers.  In order to represent numbers over a much wider range with the same number of bits, scientific notation is used, leading to floating point: a certain number (N) of bits, called the mantissa, is allocated to represent the significand, while the rest of the bits (M) encode an exponent.  Here the dynamic range is determined by the maximum and minimum exponent, and precision (relative accuracy) by the number of mantissa bits.
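
To make this distinction concrete, here is a small illustrative NumPy sketch (ours, not from the paper; the helper name to_fixed_point is made up) contrasting a fixed point representation, i.e. an integer mantissa with a fixed power-of-two scale, with the float16 representation of the same value:

```python
import numpy as np

def to_fixed_point(x, frac_bits=8, total_bits=16):
    """Quantize x to a signed fixed point number: an integer mantissa
    scaled by the fixed factor 2**-frac_bits."""
    lo, hi = -(2 ** (total_bits - 1)), 2 ** (total_bits - 1) - 1
    mantissa = int(np.clip(np.round(x * 2.0 ** frac_bits), lo, hi))
    return mantissa, 2.0 ** -frac_bits   # the value is mantissa * 2**-frac_bits

x = 3.14159
mantissa, step = to_fixed_point(x)
print(mantissa * step)   # 3.140625 -- absolute error bounded by step / 2
print(np.float16(x))     # ~3.14    -- relative error bounded by roughly 2**-11
```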

What happens when a real number is represented in a fixed or floating point format?  Generally speaking, it becomes the nearest representable number in that format.  This process is called quantization, equivalent to the application of a staircase function, the quantization function (Figure 3)[2].  The difference between the representation and the original number is called quantization error.  Error arises either due to limited dynamic range, or due to lack of precision.  When a narrow dynamic range renders numbers of large absolute value unrepresentable, overflow occurs (Figure 3 provides an intuition); when small unrepresentable numbers have to be rounded to zero, underflow occurs, a phenomenon that does not necessarily lead to large quantization error but can kill the numerical stability of certain algorithms.  Within the dynamic range, if a lack of precision makes the quantization coarse, large rounding errors result.
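
The following sketch (again ours, purely for intuition and not the exact setup of Figure 3) applies such a staircase quantization function to a narrowly peaked distribution and counts overflow and underflow as the dynamic range window is shifted:

```python
import numpy as np

def quantize_fixed(x, exp, bits=16):
    """Staircase quantization: round x to the nearest multiple of 2**exp,
    clipping to the range representable by a signed `bits`-bit mantissa."""
    step = 2.0 ** exp
    lo, hi = -(2 ** (bits - 1)) * step, (2 ** (bits - 1) - 1) * step
    q = np.clip(np.round(x / step) * step, lo, hi)
    overflow = np.abs(x) > hi                      # too large for the range
    underflow = (x != 0) & (np.abs(x) < step / 2)  # nonzero values rounded to zero
    return q, overflow.mean(), underflow.mean()

x = np.random.randn(100000) * 0.01                 # a narrowly peaked distribution
for exp in (-20, -10, -2):                         # shifting the dynamic range window
    q, ovf, unf = quantize_fixed(x, exp)
    print(exp, ovf, unf, np.abs(q - x).max())      # overflow vs. underflow trade-off
```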

Of course, with infinite bits, all types of quantization error vanish, but given a certain bit budget, floating point formats usually strike a good balance between wide dynamic range and fine encoding precision (Figure 2), leading to lower quantization error for most data.  IEEE 754 standardizes the bit allocations, special numbers and arithmetic of 64-bit, 32-bit and 16-bit floating point formats: float64, float32 and float16 (also called double, single and half precision, respectively).  These are the numerical formats universally accepted and used in scientific computing, underlying simulations of everything from subatomic to cosmic phenomena across all scientific disciplines.  These standard floating point formats are, not surprisingly, native to existing mainstream computing devices, CPUs and general purpose GPUs.  Presently, the majority of deep learning research, development and practical deployment is conducted in single precision, i.e. float32.
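
These trade-offs can be read off directly from NumPy; the following lines (our illustration) print the dynamic range and relative precision (machine epsilon) of the three standard formats:

```python
import numpy as np

for dtype in (np.float64, np.float32, np.float16):
    info = np.finfo(dtype)
    # tiny and max bound the dynamic range; eps is the relative precision
    print(dtype.__name__, info.tiny, info.max, info.eps)
```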

Figure 3.  Quantization and quantization error.  Here we illustrate how real numbers are quantized.  Quantization functions are represented by orange curves, whose deviations from identity (blue diagonals) lead to quantization error.  A hypothetical distribution of raw values (blue) and the corresponding discrete distributions resulting from quantization (orange) are shown on the top and right sides.  (a) A fixed point quantization properly suited to the statistical characteristics of the data.  (b) A 1-bit shift from the scenario in (a) results in too narrow a dynamic range, leading to substantial overflow.  (c) An extreme case of quantization: binarization.  (d) A floating point quantization can achieve lower quantization error than its fixed point counterpart with the same bit budget, given the data distribution.

Then why would a hardware designer want to go beyond the numerical formats in which almost all known phenomena in the universe can be simulated?  The answer is simple: this generality is costly.

When a specialized processor handles specific types of computation, e.g. those related to digital signal processing or to deep neural networks, it may cut corners that are otherwise prohibited in general purpose computing, so as to improve efficiency while maintaining effectiveness.  Table 1 shows how expensive arithmetic operations are in silicon for floating point and integer formats.  Two trends are obvious: (1) floating point is more expensive than integer arithmetic; (2) high bit-widths are more expensive than low ones.  Hence, the guiding principles for the choice of numerical formats for ASICs are (1) to replace floating point with fixed point, and (2) to quantize to lower bit-widths, so far as computational performance is not (or only tolerably) impacted.

Operation                 Energy (pJ)    Area (μm²)
int8 addition             0.03           36
int16 addition            0.05           67
int32 addition            0.1            137
float16 addition          0.4            1,360
float32 addition          0.9            4,184
int8 multiplication       0.2            282
int32 multiplication      3.1            3,495
float16 multiplication    1.1            1,640
float32 multiplication    3.7            7,700

 

Table 1.  Hardware energy and area cost of operations in various numerical formats.  These numbers are from Dr. Bill Dally’s NIPS 2015 tutorial “High-performance hardware for machine learning” (the energy numbers are from Dr. Mark Horowitz’s ISSCC 2014 paper, and the area numbers are based on the TSMC 45 nm process).

 

The unreasonable effectiveness of low-precision inference and the challenge of low-precision training

Next, let us study numerical quantization in the context of deep learning computation.

First of all, most values involved in deep neural network theory are real numbers.  When inference and training are carried out numerically on a computer, this is equivalent to applying quantization functions (Figure 3) as additional activation functions throughout the network (note that this happens after each operation rather than only once per layer).  These additional activation functions are not part of the design of the network for inference, nor are they backpropagated through during training (strictly speaking, this is impossible due to their non-differentiability); instead, each operation in backpropagation is also followed by quantization.  Thus, intuitively, one should probably only use quantization functions that approximate identity sufficiently closely given a data distribution, such that quantization error does not accumulate catastrophically through the network; otherwise everything might break down.
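
As a simplified picture of what this means in practice, one can simulate limited-precision inference by inserting a quantization step after every operation of the forward pass, roughly as in this sketch (ours for illustration, not the simulator used in our paper):

```python
import numpy as np

def quantize(x, exp, bits=16):
    """Round to the fixed point grid 2**exp and clip to the representable range."""
    m = np.clip(np.round(x / 2.0 ** exp), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return m * 2.0 ** exp

def forward(x, layers, exp=-12):
    """Each affine op and each nonlinearity is followed by quantization,
    mimicking a network that runs in a limited-precision format."""
    for W, b in layers:
        x = quantize(x @ W + b, exp)            # quantize after the matmul + bias
        x = quantize(np.maximum(x, 0.0), exp)   # ... and again after the ReLU
    return x
```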

For uncanny reasons, this is a pleasantly wrong intuition.  When weights and activations in a trained deep neural network are aggressively quantized, or even binarized (to either 1 or -1), inference usually still works.  Moreover, provided that accumulated gradients are kept in a high-precision format while everything else is binarized, training sometimes works as well.  The term “unreasonable effectiveness” is, again, highly apt in this case.  These results sparked active research into limited precision inference and training, as we discuss extensively in our paper.  Limited precision inference is very successful and has already found its way into production hardware, whereas training at low precision remains largely an open challenge.

Recent theoretical work has just started to shed light on this phenomenon.  For example, Alex Anderson and Cory Berg showed that peculiarities of high-dimensional spaces might be responsible, e.g. binarization in a high-dimensional vector space approximately preserves dot products!  Perhaps the reason why purely binary inference works but training does not is that, mathematically, forward propagation in deep neural nets only requires a vector space over the underlying number field, whereas back-propagation further demands differentiability of the functions defined thereon; the former survives aggressive quantization while the latter is destroyed.
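
A quick numerical check of this geometric intuition (our own toy experiment, not taken from the cited work): for correlated high-dimensional Gaussian vectors, the cosine similarity after sign-binarization concentrates around a fixed, monotonic function of the original cosine similarity, so angular relationships largely survive binarization.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10000                                    # high-dimensional space
for rho in (0.1, 0.5, 0.9):                  # target correlation between u and v
    u = rng.standard_normal(d)
    v = rho * u + np.sqrt(1 - rho ** 2) * rng.standard_normal(d)
    bu, bv = np.sign(u), np.sign(v)          # binarize to +/-1
    cos_raw = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    cos_bin = bu @ bv / d                    # binary vectors have norm sqrt(d)
    # cos_bin concentrates near (2/pi)*arcsin(rho): a fixed monotone map of rho
    print(round(cos_raw, 3), round(cos_bin, 3), round(2 / np.pi * np.arcsin(rho), 3))
```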

So much for the digression.  Though an active direction of research, aggressively quantized (< 8-bit) training techniques are still not mature enough to go into specialized hardware.  Today’s burning question is rather the following: whether it is possible, and if so how, to go from float32, the current de facto numerical standard for training, down to 16-bit and/or fixed point operations.  This is obviously the lowest-hanging fruit in hardware acceleration of deep neural network training today.

This is our motivation to invent Flexpoint and its first practical incarnation flex16+5.

Now, let us investigate empirically whether 16-bit fixed point arithmetic is possible for training.  We use the training process of a deep ResNet trained on the CIFAR10 dataset as an example; we train it in high-precision floating point and inspect the value distributions of typical tensors before and after training (Figure 4).  All tensors, at any specific stage of training, have a rather peaked (and rightward-skewed) distribution sufficiently covered by a 16-bit range.  But the positions of these ranges vary from tensor to tensor, and for certain tensors they shift significantly during the course of training (weights are usually the most stable, and gradients/updates the most variable).  Thus, training with 16-bit integer operations is feasible, as long as the dynamic range is properly positioned and adaptively adjusted (instead of fixed) for each tensor during the course of training.
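
The bookkeeping behind such an inspection is straightforward; a minimal sketch (with hypothetical helper names of our choosing) records each tensor’s maximum absolute value per iteration and reports how far its required power-of-two window drifts over training:

```python
import numpy as np
from collections import defaultdict

history = defaultdict(list)   # tensor name -> max absolute value at each iteration

def record(name, tensor):
    """Call once per iteration for each tensor of interest (weights, activations, updates)."""
    history[name].append(float(np.max(np.abs(tensor))))

def window_drift(name, bits=16):
    """Exponent a `bits`-bit fixed point window would need at each recorded point,
    and how many bits that window drifted over the course of training."""
    exps = [int(np.ceil(np.log2(m))) - (bits - 1) for m in history[name] if m > 0]
    return min(exps), max(exps), max(exps) - min(exps)
```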

Figure 4.  Distributions of tensor scales in a deep neural network and their evolution during training.  Distributions of values for weights (a), activations (b) and weight updates (c), of a ResNet trained with the CIFAR10 dataset for 165 epochs; illustrated are those during the first epoch (blue) and last epoch (purple).  The horizontal bars beneath the histograms mark the dynamic range covered by a 16-bit fixed point representation.  The entire range of the horizontal axis covers all representable numbers by flex16+5.
 

This naturally leads to tensors with all-integer entries that share a common exponent, which is modified on the fly to shift the dynamic range dynamically (no pun intended).  Courbariaux et al. call this, rather oxymoronically, dynamic fixed point (DFXP).  They proposed an adaptive mechanism to adjust the exponent during training: tensors are polled periodically for the frequency of overflows; if it exceeds a certain threshold, the exponent is incremented to extend the dynamic range, and vice versa.  The main drawback of this approach is that the update mechanism only passively reacts to overflows rather than anticipating and preemptively avoiding them, i.e. “let it overflow and correct later”; the resulting quantization error turned out to be catastrophic for maintaining convergence of training, as reported in this paper.
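
In pseudocode, the reactive DFXP update amounts to something like the following (our paraphrase; the threshold and polling period are placeholders, not the values used by Courbariaux et al.):

```python
def dfxp_update_exponent(exp, overflow_rate, threshold=1e-4):
    """Reactive exponent management: widen the range after overflows are observed,
    narrow it when overflows are rare.  The adjustment happens only *after*
    the damage of clipping has already been done."""
    if overflow_rate > threshold:
        return exp + 1    # extend the dynamic range (coarser precision)
    elif overflow_rate < threshold / 2:
        return exp - 1    # tighten the dynamic range (finer precision)
    return exp
```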

A remedy to this drawback is to monitor a recent history of the absolute scale of each tensor, use a sound statistical model to predict its trend, estimate the probability of overflow, and preemptively adjust the scale when an overflow is imminently likely to occur.  Thus Flexpoint was born.

 

Flexpoint and Autoflex

Flexpoint is a tensorial numerical format based on an N-bit integer tensor storing mantissas in two’s complement form, and an M-bit exponent shared across all elements of the tensor.  This format is denoted flexN+M (Figure 5).

Figure 5.  Two ways of going 16-bit: floating point versus Flexpoint.  Schematics of a float16 tensor (a) and a flex16+5 tensor (b).  Both tensors aggregate n entries, with n typically a large number.  Elements of a Flexpoint tensor are integers (16-bit here) in two’s complement form (thus no separate sign bit is illustrated) with an external shared exponent (5-bit here); associated with the tensor is an additional deque of fixed length (typically a small number, 16 in this paper, one of the hyperparameters of Autoflex) that stores the recent history of the tensor’s maximum absolute mantissa values and associated exponents.  Physically, only the mantissas are present on the device, which communicates maximum absolute values to the host; all exponents and the history deque are managed externally on the host.

It is worth noting that a Flexpoint tensor is essentially a fixed point tensor, not a floating point one.  Even though there is a shared exponent, its storage and communication are amortized over the entire tensor, a negligible overhead for huge tensors.  Most of the on-device memory is used to store the tensor elements themselves at higher precision, a cost that scales with the size of the tensors (typically huge for deep neural networks).  The external storage (on the host) of the shared exponent and statistics deque requires only a small, constant amount of memory per tensor.

Operations on tensor elements leverage integer arithmetic, reducing hardware requirements in power and area compared to floating point.  Specifically, element-wise multiplication of two tensors can be computed as fixed point operations, since the output’s common exponent (the sum of the two input exponents) is identical across all output elements.  Similarly, addition across elements of the same tensor is also a fixed point operation since they share a common exponent.
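
The point is easiest to see in code.  Below is a toy, host-side model of a Flexpoint-like tensor (our sketch under simplifying assumptions, not the NNP implementation): mantissas are plain int16, the exponent lives outside the tensor, and a matrix product needs only integer arithmetic plus a single exponent addition and a requantization step.

```python
import numpy as np

class FlexLikeTensor:
    """Toy model of a flexN+M tensor: int16 mantissas plus one shared exponent."""
    def __init__(self, mantissa, exp):
        self.m = mantissa.astype(np.int16)   # two's complement mantissas
        self.exp = exp                       # shared power-of-two scale (an int)

    @classmethod
    def from_float(cls, x, bits=16):
        # pick the exponent so the largest magnitude fits in a (bits-1)-bit mantissa
        exp = int(np.ceil(np.log2(np.max(np.abs(x)) + 1e-30))) - (bits - 1)
        m = np.clip(np.round(x / 2.0 ** exp), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
        return cls(m, exp)

    def to_float(self):
        return self.m.astype(np.float64) * 2.0 ** self.exp

def flex_matmul(a, b, bits=16):
    """Matrix product of two Flexpoint-like tensors: pure integer multiplies and
    adds in a wide accumulator; the shared exponents simply add."""
    acc = a.m.astype(np.int64) @ b.m.astype(np.int64)
    out_exp = a.exp + b.exp
    # requantize the wide accumulator back to a 16-bit mantissa
    shift = max(0, int(np.ceil(np.log2(np.max(np.abs(acc)) + 1))) - (bits - 1))
    return FlexLikeTensor((acc >> shift).astype(np.int16), out_exp + shift)
```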

These remarkable hardware advantages over floating point tensors come at the cost of added complexity in exponent management, as Courbariaux et al. suggested.  Seeking an elegant solution, we devised an exponent management algorithm called Autoflex (Figure 6), designed for iterative algorithms such as stochastic gradient descent, in which tensor operations, e.g. matrix multiplication and convolution, are performed repeatedly and outputs are stored in hardware buffers.

Autoflex runs in initialization mode before training starts (Figure 6, top).  In this mode, the exponent of each tensor is iteratively adjusted, starting from an initial guess, until it is appropriate.  During training, each operation on a tensor is wrapped in an adaptive exponent management mechanism (Figure 6, bottom), which predicts the trend of the tensor’s maximum absolute mantissa value based on statistics gathered from previous iterations.  The prediction is compared against a threshold and the exponent is adjusted to preempt overflow.  This is executed on a per-operation basis for each tensor, typically twice per training iteration: once after forward and once after backward propagation.  The formulation of the threshold involves a few hyperparameters of Autoflex; see our paper for details.
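
The sketch below captures the spirit of the per-operation step in greatly simplified form (the exact statistics, thresholds and hyperparameters in the paper differ): keep a short history of the tensor’s maximum absolute values, extrapolate with a safety margin of a few standard deviations, and choose the exponent so the predicted maximum still fits in the mantissa.

```python
import numpy as np
from collections import deque

class AutoflexLike:
    """Greatly simplified predictive exponent manager for a single tensor."""
    def __init__(self, bits=16, history_len=16, n_std=3.0):
        self.bits, self.n_std = bits, n_std
        self.maxima = deque(maxlen=history_len)   # recent true maxima of the tensor

    def next_exponent(self, tensor_max_abs):
        """Record the tensor's current max absolute value (in real units) and
        return the shared exponent to use for the next iteration."""
        self.maxima.append(tensor_max_abs)
        hist = np.array(self.maxima)
        # predict the next maximum, leaving headroom of a few standard deviations
        predicted = max(hist.mean() + self.n_std * hist.std(), tensor_max_abs)
        # choose the exponent so the predicted maximum still fits in the mantissa
        return int(np.ceil(np.log2(predicted + 1e-30))) - (self.bits - 1)
```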

Figure 6.  The Autoflex algorithm for exponent management.  Flow charts showing the Autoflex algorithm in initialization mode (before training) and operation mode (executed after each operation on the tensor during training).  The algorithm manages tensor exponents externally, wrapping around the actual operations of the neural network in fixed point (black boxes in the diagram).

To gain further intuition, let us observe Autoflex in action with a concrete training example, by means of a Flexpoint simulator on GPU (see our paper for technical details).  Figure 7 shows three typical tensors in flex16+5 during training of a small 2-layer perceptron for 400 iterations on the CIFAR10 dataset.  During training, maximum absolute mantissa values and exponent scales are stored at each iteration.  The first column of Figure 7 shows a weight tensor, which is highly stable as it is only updated with small gradient steps.  Its maximum absolute mantissa slowly approaches 2^15, at which point the exponent is adjusted, and the maximum absolute mantissa drops by 1 bit accordingly.  Shown in the third row is the corresponding floating point representation of the statistics, computed as the product of the maximum absolute mantissa and the scale, which is used to perform the exponent prediction.  Using a sliding window of the 16 most recent values, the predicted maximum is computed and used to set the exponent for the next iteration.  In this example, the prediction crosses the power-of-two exponent boundary about 20 iterations before the actual value itself does, safely preventing an overflow.  Tensors with more variation in scale across iterations are shown in the second (activations) and third (updates) columns of Figure 7.  The algorithm adaptively leaves about half a bit and 1 bit of headroom, respectively, in these two cases.  Even as the tensor fluctuates in magnitude by more than a factor of two (Figure 7c), the maximum absolute mantissa value is effectively protected from overflowing.  The price to pay for this cautious approach is that, for volatile tensors such as the weight updates in this case, the maximum absolute mantissa lingers around 3 bits below the threshold (Figure 7c), often leaving the top bits zero and using only 13 of the 16 bits on average to represent data, trading precision for dynamic range.

Figure 7.  Autoflex in action.  Evolution of different tensors during training of a two-layer perceptron for 400 iterations on the CIFAR10 dataset.  The three columns show a linear layer’s weights (a), activations (b) and weight updates (c) over the whole course of training.  The first row plots the maximum absolute values of the mantissa.  The second row shows the scale (whose logarithm is the exponent being managed), which is adjusted to keep the maximum absolute mantissa values (first row) at the top of the dynamic range without overflowing.  When the product of these two values (third row) is anticipated to cross a power-of-two boundary (e.g. arrows in the plots of the first column), the scale is adjusted to keep the mantissa in the correct range.  Weights are the least volatile, and updates the most variable; Autoflex takes this into account in its statistical prediction by keeping maximum absolute values at least a certain number of standard deviations away from the boundary.  In each case the Autoflex prediction (green line, third row) crosses the exponent boundary (gray horizontal line) before the actual data (red line) does, which means that exponent adjustments preempt future overflows.

In sum, Flexpoint provides a tensorial numerical solution for deep neural network training.  Strictly speaking, it is not a mere numerical format, but a data structure wrapping an integer tensor together with associated adaptive exponent management algorithms.

 

Flexpoint versus floating point: going down to 16-bit

In our paper, we further asked how the specific 16-bit Flexpoint format flex16+5 compares to float16 in actual training of modern deep neural networks, using float32 as a benchmark.  We investigated three representative training processes: an AlexNet trained on the ImageNet 1000-class dataset, a ResNet-110 trained on CIFAR10, and a Wasserstein GAN trained on the LSUN bedroom images.  All experiments were carried out using a Flexpoint GPU simulator based on the neon framework.

We took the model topologies and hyperparameters exactly as in float32, without tweaking anything (such as scaling certain values in the network or changing training hyperparameters) when training in either of the 16-bit formats.

The results are summarized in Figure 8.  In all three experiments, flex16+5 achieved close numerical parity with float32, whereas significant performance gaps were observed for float16 in certain cases.  For the two convolutional networks for image classification, misclassification errors were significantly higher in float16 than in float32 and flex16+5 (Figure 8a, b).  In the case of the Wasserstein GAN, float16 deviated significantly from float32 and flex16+5, starting from an undertrained discriminator (Figure 8c); the quality of generated images was accordingly worse (Figure 8d-g).

It is worth pointing out that the degree to which float16 training deviated from float32 varied from case to case.  The underlying reasons might be as varied as the performance gaps themselves.  As our goal was to validate flex16+5 training, we did not investigate the causes of float16‘s underperformance in each case.  It is likely that, with certain extra modifications such as scaling of loss functions, float16 might still achieve performance parity with float32.  However, we show here that flex16+5 achieved this without modifying the model topology or any hyperparameters; furthermore, the chosen hyperparameters of Autoflex were robust enough that we did not need to tweak them for the different models tested.


Figure 8.  Flexpoint versus floating point.  Comparison of two 16-bit formats flex16+5 versus float16, in actual training of modern deep neural networks, using float32 as a benchmark.  We investigated three cases: an AlexNet trained on ImageNet 1000-class (a), a ResNet-110 trained on CIFAR10 (b) and a Wasserstein GAN trained on the LSUN bedroom images (c-g).  We plotted model performance measured by misclassification errors in the two cases of convolutional networks (a, b), and by the estimated Wasserstein-1 distance in the case of GAN (c).  In all three cases, flex16+5 achieved numerical parity with float32, whereas significant performance gaps were observed in float16.  Generated images (e-g) by the Wasserstein GAN trained in float16 had visibly more saturated patches and scored significantly worse than float32 or flex16+5 in Fréchet Inception Distance (d), a metric for the quality of generated images.

 

The Flexpoint Philosophy

In summary, there are three main advantages of Flexpoint versus floating point in the context of moving deep neural network training from 32-bit to 16-bit:

  1. Flexpoint requires smaller and more efficient multipliers than floating point, because its operations are essentially fixed point arithmetic, with only marginally added complexity of exponent management, a negligible overhead for huge tensors.
  2. In particular, flex16+5 has higher precision than float16, because it has a 16-bit mantissa instead of an 11-bit one, with the caveat that the full 16-bit range is usually not entirely used, especially for small values in the tensor (because the exponent is shared) and for volatile tensors (because more headroom needs to be reserved to prevent overflow).
  3. There is no need for extra engineering (i.e. changes to the model, its hyperparameters, or decisions about allocating numerical types as in mixed precision training) to move training workloads from float32 to flex16+5.  Hyperparameters of Autoflex are also robust enough not to require tuning from model to model, and they remain hidden from the user.  The upshot is that the user does not need to know anything about what happens under the hood when training in flex16+5; it works just like float32.

If you are still with us, we hope you can now appreciate why we went beyond standard numerical formats and invented Flexpoint, and how Flexpoint forms the core numerical technology powering our Crest family of deep learning ASICs.

It should be noted that limited precision training is an active research area.  When techniques using more aggressive quantization (8-bit or lower) become mature enough for large-scale hardware acceleration, the questions we answered here will probably need to be revisited, because it is not obvious whether Flexpoint’s advantages at 16-bit extrapolate to those regimes for deep neural network training.  However, in today’s market, we believe Flexpoint strikes a desirable balance between aggressive extraction of performance and broad support for seamless migration of existing training workloads.

Acknowledgements

We thank our colleagues Dr. Evren Tumer, Dr. Urs Köster, Dr. Arjun Bansal, Will Constable, Dr. Oguz Elibol, Stewart Hall, Dr. Luke Hornof, Dr. Amir Khosrowshahi, Carey Kloss, Ruby Pai, Dr. Naveen Rao, Scott Gray and Jordan Hart for early contributions to the project and helping with this blog post.

[1] A competing approach uses field-programmable gate arrays (FPGAs) for accelerated deep learning, but the advantages of FPGAs lie in their flexibility and versatility, not in making processing faster and cheaper, the two key factors determining the scaling of deep learning.

[2] The statistical theory of quantization is relevant here; accordingly, wisdom from digital signal processor (DSP) design can be very useful in the numerical analysis behind deep learning ASICs.