Three Flavors of Quantization

Deep Gan Team
4 min read · Feb 9, 2022

John Inacay, Mike Wang, Wiley Wang (all authors contributed equally)

What is Quantization? (and why?)

One of the early driving forces for quantization was the need to adapt large GPU-trained models to edge computing devices, so that low-power, low-cost hardware can benefit from the capabilities of a deep learning model.

Quantization in Deep Learning is the practice of reducing the numerical precision of a model’s weights with (hopefully) minimal loss in inference quality. In other words, we convert models from float to int. While we lose precision, we gain the ability to optimize for devices that are short on memory, efficient float math, or both.

A model’s operations are now carried out in lower-precision formats (e.g. float32 -> int8), so the model’s weights take up less space on disk and run faster on target hardware that may not be optimized for float calculations. This technique allows us to fit much larger models under tighter hardware and runtime constraints, such as on a phone or other local device (See here to read about AppleML using DETR and quantization to run on-device).
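
To make the idea concrete, here is a minimal sketch of the affine quantization arithmetic that most 8-bit schemes use. The weight values, the per-tensor scale, and the zero-point formula are illustrative assumptions, not any particular framework’s implementation:

```python
import numpy as np

def quantize(x, scale, zero_point):
    # Affine quantization: q = clip(round(x / scale) + zero_point, -128, 127)
    return np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)

def dequantize(q, scale, zero_point):
    # Approximate recovery of the original float values
    return scale * (q.astype(np.float32) - zero_point)

weights = np.array([0.52, -1.7, 0.013, 2.4], dtype=np.float32)
scale = float(weights.max() - weights.min()) / 255.0          # one scale for the whole tensor
zero_point = int(round(-128 - float(weights.min()) / scale))  # so the minimum maps to -128

q = quantize(weights, scale, zero_point)          # int8 codes: [10, -128, -21, 127]
recovered = dequantize(q, scale, zero_point)      # close to, but not exactly, the originals
```

The round trip loses a little precision, which is exactly the trade-off described above.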

In this blog post, we will show off three different quantization techniques.

Dynamic Quantization

Dynamic Quantization works by quantizing the weights of a network post-training, typically to a lower-bit representation such as 16-bit floating point or 8-bit integers. Activations are kept in floating point: at inference time they are quantized on the fly, with ranges determined dynamically from the data actually observed (hence the name), and results are handed back in a higher-bit representation such as 32-bit floating point.

However, this technique doesn’t work universally. In PyTorch, dynamic quantization currently covers layer types such as nn.Linear and the recurrent layers, so when we applied it to ResNet18 (a convolution-heavy CNN) it produced no reduction in model size. When we tried it on BERT, which is dominated by linear layers, the model shrank to less than half its original size.
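
As a quick illustration, here is a minimal sketch of dynamic quantization with PyTorch’s torch.quantization.quantize_dynamic; the toy Sequential model is just a stand-in for a linear-heavy network such as BERT:

```python
import torch

# A stand-in for a linear-heavy model such as BERT; swap in your own model.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 128),
).eval()

# Quantize the Linear layers' weights to int8 ahead of time; their activations
# are quantized on the fly at inference using dynamically observed ranges.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

out = quantized(torch.randn(1, 512))  # outputs come back as float32
```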

Post-Training Static Quantization

Post-training Static Quantization, as its name suggests, also occurs after training. The weights are quantized just as in dynamic quantization, but we add a calibration step: before the model is converted, a representative dataset is run through it to collect activation statistics, which are used to fix the quantization parameters ahead of time. While the next technique may seem strictly better, Post-Training Static Quantization can be advantageous when you need to optimize for multiple target devices without retraining for each one.
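
Here is a rough sketch of the eager-mode PyTorch workflow for post-training static quantization. SmallNet, the random calibration batches, and the "fbgemm" (x86) backend choice are illustrative assumptions; you would use your own model and a real representative dataset:

```python
import torch

class SmallNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Stubs mark where tensors enter and leave the quantized region.
        self.quant = torch.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(3, 16, 3)
        self.relu = torch.nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.conv(x))
        return self.dequant(x)

model = SmallNet().eval()
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")  # x86 CPU backend

# Insert observers, then run representative data through to collect statistics.
prepared = torch.quantization.prepare(model)
for _ in range(32):
    prepared(torch.randn(1, 3, 32, 32))  # stand-in for real calibration batches

# Convert to an int8 model using the collected activation statistics.
quantized = torch.quantization.convert(prepared)
out = quantized(torch.randn(1, 3, 32, 32))
```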

Quantization Aware Training (QAT)

In Quantization Aware Training, the model’s weights are stored and updated in higher precision, but are rounded and clipped to the target precision whenever a layer is computed. In other words, gradient descent still operates in higher precision, while the forward pass only ever sees the expressiveness of the target precision. By simulating the effects of quantization during training, QAT lets the network learn to compensate for them, which is why this technique typically gives the best accuracy of the three. When training is complete, the model weights and activations are converted to the target data types.
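
Below is a simplified sketch of what QAT can look like with PyTorch’s eager-mode API. The toy model, the random training data, and the hyperparameters are placeholders for your own:

```python
import torch

class SmallNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.fc = torch.nn.Linear(128, 10)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

model = SmallNet().train()
model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")

# Insert fake-quantization modules that round and clip in the forward pass
# while gradients still flow in full precision.
model = torch.quantization.prepare_qat(model)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
for _ in range(100):  # your real training loop goes here
    x, target = torch.randn(32, 128), torch.randint(0, 10, (32,))
    loss = torch.nn.functional.cross_entropy(model(x), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# After training, convert the fake-quantized model to a true int8 model.
quantized = torch.quantization.convert(model.eval())
```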

Common Optimizations

Here’s a list of common optimizations used for the quantization process.

  • Fusing Layers: Operations such as a convolution, its batch norm, and the following activation are commonly fused into a single module to reduce compute cost (see the sketch after this list).
  • Representative Dataset: A representative dataset, drawn from the same distribution the network will see during inference, is used to calibrate activation ranges (and, for QAT, to fine-tune the model).
  • Quantizing Activations: In addition to the weights, the activations flowing between layers can also be quantized, which is what the calibration statistics above make possible.
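
As an example of the fusion point above, here is a minimal sketch using PyTorch’s torch.quantization.fuse_modules on a toy conv-batchnorm-relu block; the module names are simply the Sequential indices:

```python
import torch

# A tiny conv -> batchnorm -> relu block standing in for part of a real network.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3),
    torch.nn.BatchNorm2d(16),
    torch.nn.ReLU(),
).eval()  # conv/bn fusion requires eval mode

# Fuse the three modules into one before quantizing the model.
fused = torch.quantization.fuse_modules(model, [["0", "1", "2"]])
```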

Common Tips

  • Quantize to 16-bit floating point for GPU inference: GPUs are optimized for floating point operations, and 16-bit floating point calculations tend to be much faster than 32- or 64-bit ones. By quantizing to 16-bit floating point, your network will typically run significantly faster than the unquantized version.
  • Quantize to 8-bit integers for CPU inference: In contrast to GPUs, CPUs are typically better at integer operations than floating point ones. Quantizing to 8-bit integers rather than 16-bit floating point will usually give you the fastest CPU inference (a sketch of both paths follows this list).
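
A sketch of both paths, with a toy model standing in for your own network, might look like this:

```python
import copy
import torch

# Hypothetical fp32 model; any nn.Module works the same way.
model = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.ReLU()).eval()

# GPU path: cast a copy to 16-bit floating point and move it to CUDA.
if torch.cuda.is_available():
    gpu_model = copy.deepcopy(model).half().to("cuda")
    out = gpu_model(torch.randn(1, 256, dtype=torch.float16, device="cuda"))

# CPU path: dynamically quantize the linear layers to 8-bit integers.
cpu_model = torch.quantization.quantize_dynamic(
    copy.deepcopy(model), {torch.nn.Linear}, dtype=torch.qint8
)
out = cpu_model(torch.randn(1, 256))
```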

Discussion

Dynamic quantization is free of tuning parameters: because ranges are determined at runtime from the observed data, it is very cheap to apply, but it is also the least optimized of the three approaches. Post-training static quantization instead fixes its parameters ahead of time from the calibration set’s activation distribution. Both can be done without redoing training, which makes them a speedy way to produce a compact quantized model. Quantization aware training folds quantization into the training loop itself, so when time and resources are available, QAT often provides the best results. Between post-training static quantization and QAT, the former lets us quickly create optimized models for multiple targets, while the latter shines when there is a single, known hardware target we want to optimize for. In that single-target case, hardware companies like Intel and Nvidia provide tools to facilitate QAT targeting their respective families of inference devices.

In addition to quantization, pruning is another well-known technique for compressing models. Pruning removes weights, channels, or entire branches of the neural network that have little impact on the final result. The pruned network is typically much smaller and faster than the original while delivering similar performance.
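
For reference, here is a minimal sketch of magnitude pruning with torch.nn.utils.prune applied to a single hypothetical layer; in practice you would prune the layers of a trained model and then fine-tune:

```python
import torch
from torch.nn.utils import prune

# A hypothetical layer standing in for one layer of a trained network.
layer = torch.nn.Linear(256, 256)

# Zero out the 30% of weights with the smallest L1 magnitude.
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Fold the pruning mask into the weight tensor to make it permanent.
prune.remove(layer, "weight")
```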

Conclusion

Quantization is a useful technique for productionizing and deploying models. In many use cases, you train deep learning models on high-end GPUs and then deploy them to different, typically less powerful hardware such as desktop CPUs or edge computing devices, and neural networks keep getting larger. As an analogy, quantization can be seen as a compilation step: in traditional software development, you compile your code with optimizations before deployment so that it runs as quickly as possible, and similarly, quantization is an optimization you apply before deployment once you’ve finished developing and training your model. Quantization is undoubtedly a useful technique to understand for engineers who need to deploy their neural networks to the real world.
