## Background

In recent years, deep learning-based Artificial Intelligence (AI) and Machine Learning (ML) models have been moving from the cloud to edge devices, driven by factors such as bandwidth and latency, as shown in Figure-1.

Power consumption, latency, and hardware size are important aspects of inference at the edge. However, the models developed in the cloud on GPUs are floating-point models.

Most importantly, it is highly desirable that edge inference hardware running neural networks does not have to support expensive floating-point units, so that it fits lower power and cost budgets. Compared to their floating-point counterparts, fixed-point math engines are very area- and power-efficient.

The model developed in the cloud therefore needs to be quantized without significant loss of accuracy. An efficient quantization mechanism can quantize 32-bit floating-point (FP) models to 8-bit integer (INT8) operations with an accuracy loss of less than 0.5% for most models. This process reduces the memory footprint and compute requirements, and thereby the latency and power required for inference.

Most edge devices, including GPUs (which have FP arithmetic), now take advantage of lower precision and quantized operations. As a result, quantization is a de facto step for edge inference. This blog covers common methods of quantization, their challenges, and ways to overcome them.

## Types of Quantization

There are primarily three types of quantization techniques for neural networks, listed here in increasing order of complexity and accuracy:

- Power-2
- Symmetric
- Asymmetric

#### Power-2

Power-2 quantization uses only left and right shifts of the data to perform the quantization. Shifts are very cheap to implement, as barrel shifters are part of most hardware architectures. To keep the most significant 7 bits (for INT8), the weights and biases are quantized by shifting left or right, based on the absolute maximum value. The shift amounts are tracked and re-adjusted, if needed, before and after the operation.
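As a rough illustration of the idea, the sketch below picks a single power-of-two scale from the absolute maximum of a tensor. The function names and the choice to model the shift as a multiplication by `2**shift` on floats are assumptions made for this example; real hardware would apply the shift to fixed-point integers.

```python
import math

def power2_quantize(values, bits=8):
    """Quantize floats to signed integers using a power-of-two scale.

    Returns (quantized, shift) where each value is approximately
    quantized * 2**-shift. A positive shift corresponds to a left shift.
    """
    qmax = 2 ** (bits - 1) - 1            # 127 for INT8 (7 magnitude bits)
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return [0] * len(values), 0
    # Largest shift for which max_abs * 2**shift still fits in qmax.
    shift = math.floor(math.log2(qmax / max_abs))
    quantized = [
        max(-qmax - 1, min(qmax, round(v * 2.0 ** shift)))
        for v in values
    ]
    return quantized, shift

def power2_dequantize(quantized, shift):
    """Undo the shift to recover approximate floating-point values."""
    return [q * 2.0 ** -shift for q in quantized]

weights = [0.5, -0.25, 0.1]
q, shift = power2_quantize(weights)
print(q, shift)                      # integers and the tracked shift
print(power2_dequantize(q, shift))   # approximate reconstruction
```

Because the scale is restricted to powers of two, part of the INT8 range can go unused, which is one reason this type sits at the low end of the accuracy ordering.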

#### Symmetric

The next level in complexity is symmetric quantization, sometimes also referred to as linear quantization, which takes the maximum absolute value in the tensor and divides the range evenly using that value. The activations are then re-adjusted based on the input and output scaling factors.
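A minimal sketch of symmetric quantization follows. It assumes a signed INT8 target clamped to [-127, 127] and a scale derived from the maximum absolute value; the function names are illustrative only.

```python
def symmetric_quantize(values, bits=8):
    """Quantize floats to signed integers with one scale and no zero offset."""
    qmax = 2 ** (bits - 1) - 1                    # 127 for INT8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale

def symmetric_dequantize(quantized, scale):
    """Map integers back to approximate floating-point values."""
    return [q * scale for q in quantized]

q, scale = symmetric_quantize([1.27, -0.5, 0.0])
print(q, scale)
print(symmetric_dequantize(q, scale))
```

Unlike the power-of-two case, the scale here is an arbitrary real number, so the largest magnitude in the tensor always maps to the edge of the integer range.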

#### Asymmetric

Finally, asymmetric quantization fully utilizes the available bits but has a high implementation complexity:

Xf = Scale * (Xq - Xz), where Xf is the floating-point value of X, Xq is the quantized value, and Xz is the zero offset.

The zero offset is chosen so that the real value zero is represented exactly by an integer (non-fractional) value, so that zero padding (for example, in Conv) does not introduce a bias.
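The relationship Xf = Scale * (Xq - Xz) can be sketched as follows. This is an illustrative version that assumes a UINT8 target and a range widened to include zero; the function names are hypothetical.

```python
def asymmetric_params(min_val, max_val, bits=8):
    """Derive (scale, zero_offset) for an unsigned integer target."""
    qmin, qmax = 0, 2 ** bits - 1
    # Widen the range to include 0 so that zero is exactly representable.
    min_val = min(min_val, 0.0)
    max_val = max(max_val, 0.0)
    if max_val == min_val:
        return 1.0, 0
    scale = (max_val - min_val) / (qmax - qmin)
    # Round so that the zero offset is an integer.
    zero_offset = round(qmin - min_val / scale)
    zero_offset = max(qmin, min(qmax, zero_offset))
    return scale, zero_offset

def asymmetric_quantize(values, scale, zero_offset, bits=8):
    qmax = 2 ** bits - 1
    return [max(0, min(qmax, round(v / scale) + zero_offset)) for v in values]

def asymmetric_dequantize(quantized, scale, zero_offset):
    # Xf = Scale * (Xq - Xz)
    return [scale * (q - zero_offset) for q in quantized]

scale, zero_offset = asymmetric_params(-1.0, 3.0)
q = asymmetric_quantize([0.0, -1.0, 3.0], scale, zero_offset)
print(zero_offset, q)
print(asymmetric_dequantize(q, scale, zero_offset))
```

Because the zero offset is an integer, the real value 0.0 survives a quantize and dequantize round trip without error, which is what keeps zero padding bias-free.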

Moreover, there is an additional Add/Sub operation for every operand when performing matrix multiplications and convolutions, which is not very conducive to many hardware architectures.

Conv = SUM((Xq-Xz) * (Wq-Wz)), where W is the weight and X is the input.
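To make the cost of the zero offsets concrete, the sketch below computes the sum directly and then in an algebraically expanded form. The expansion is a standard identity rather than something prescribed by this post, and the function names are illustrative. In the expanded form the per-operand subtractions disappear from the inner loop, and the terms that depend only on the weights and offsets could be computed ahead of time.

```python
def dot_direct(xq, wq, xz, wz):
    """SUM((Xq - Xz) * (Wq - Wz)): one subtraction per operand."""
    return sum((x - xz) * (w - wz) for x, w in zip(xq, wq))

def dot_expanded(xq, wq, xz, wz):
    """Same result with the zero offsets factored out of the inner loop."""
    n = len(xq)
    return (
        sum(x * w for x, w in zip(xq, wq))   # plain integer MAC
        - xz * sum(wq)                       # depends only on weights and Xz
        - wz * sum(xq)                       # depends on the input
        + n * xz * wz                        # constant
    )

xq, wq = [10, 200, 64], [3, 130, 255]
print(dot_direct(xq, wq, 64, 128), dot_expanded(xq, wq, 64, 128))
```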

Operations such as ReLU, which are present in most layers, remove negative values, so a signed integer representation (INT8) effectively loses a bit compared to a UINT8 representation.

After the Multiply-and-Accumulate step, the re-quantization step uses scalar multipliers and shifts to avoid having to support division.

Quantization can be applied per tensor (per layer) or, if the dynamic ranges of the channels' weights differ significantly, per output channel (for Conv).
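One common way to do this, sketched here under assumptions, is to approximate the real-valued rescaling factor as an integer multiplier followed by a right shift. The 15-bit multiplier width, the rounding rule, and the function names are choices made for this example and are not taken from any specific hardware.

```python
import math

def quantize_multiplier(real_multiplier, mult_bits=15):
    """Approximate a positive real multiplier as int_mult / 2**shift."""
    if real_multiplier <= 0:
        raise ValueError("real_multiplier must be positive")
    # real_multiplier == mantissa * 2**exponent with mantissa in [0.5, 1).
    mantissa, exponent = math.frexp(real_multiplier)
    int_mult = round(mantissa * (1 << mult_bits))
    shift = mult_bits - exponent
    if int_mult == (1 << mult_bits):      # mantissa rounded up to 1.0
        int_mult >>= 1
        shift -= 1
    if shift < 1:
        raise ValueError("multiplier too large for this sketch")
    return int_mult, shift

def requantize(acc, int_mult, shift, bits=8):
    """Rescale an integer accumulator using only a multiply and a shift."""
    qmin, qmax = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    # Adding half of the divisor before shifting rounds to nearest.
    rounded = (acc * int_mult + (1 << (shift - 1))) >> shift
    return max(qmin, min(qmax, rounded))

int_mult, shift = quantize_multiplier(0.0123)
print(int_mult, shift)
print(requantize(4000, int_mult, shift))   # close to 4000 * 0.0123
```

The multiplier and shift are computed once per tensor (or per channel) offline, so the inference path never needs a divider.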
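The difference between the two granularities can be sketched as follows. The weights are made-up values chosen so that the two output channels have very different dynamic ranges, and the helper names are illustrative.

```python
def max_abs_scale(values, bits=8):
    """Symmetric scale derived from the maximum absolute value."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0

def quantize(values, scale, bits=8):
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]

def quantize_per_tensor(weights):
    """One scale shared by every output channel."""
    scale = max_abs_scale([w for channel in weights for w in channel])
    return [quantize(channel, scale) for channel in weights], scale

def quantize_per_channel(weights):
    """One scale per output channel (one row per channel)."""
    scales = [max_abs_scale(channel) for channel in weights]
    quantized = [quantize(ch, s) for ch, s in zip(weights, scales)]
    return quantized, scales

# Hypothetical weights: channel 0 is large, channel 1 is tiny.
weights = [[0.6, -1.0, 0.25], [0.004, -0.003, 0.001]]
print(quantize_per_tensor(weights)[0])
print(quantize_per_channel(weights)[0])
```

With a single shared scale, the small channel collapses to almost all zeros, while per-channel scales let it use the full integer range.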

## Methods of Quantization

There are four methods of quantization, listed here in increasing order of accuracy:

- Without Vectors
  - This post-training quantization flow determines the range of the activations without any vectors.
  - For INT8, the scaling factors are determined using the Scale / Shift values.
- Use Vectors to establish a range
  - Vectors are used to determine the range of the activations.
  - The activations are quantized based on the range determined by running the vectors and registering the range of each tensor.
- Second pass Quantized Training
  - After a model is completely trained, it is retrained with quantization nodes inserted, so that it incrementally learns to compensate for the quantization error.
  - Here, the forward path uses quantized operations and the back-propagation uses floating-point arithmetic (TensorFlow Lite, or TFLite, fake quantization).
- Quantized Operator Flow (Q-Keras, Keras Quantization)
  - There are new frameworks in development that perform quantization-aware training (using quantized operands from the start of training).
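The second-pass retraining idea, where the forward path sees quantized values and the backward path stays in floating point, can be sketched for a single linear neuron. This is a toy illustration and not TFLite's implementation: the squared-error loss, the learning rate, and the choice to pass the gradient straight through the rounding step are all assumptions of this example.

```python
def fake_quantize(values, scale, bits=8):
    """Quantize and immediately dequantize, so values stay floats
    but carry the rounding error of the integer representation."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def train_step(weights, x, target, scale, lr=0.1):
    """One gradient step for y = w . x with fake-quantized weights."""
    # Forward path: use the quantized-then-dequantized weights.
    wq = fake_quantize(weights, scale)
    y = sum(w * xi for w, xi in zip(wq, x))
    error = y - target
    # Backward path: floating-point gradient of 0.5 * error**2,
    # applied to the float weights as if rounding were the identity.
    grads = [error * xi for xi in x]
    new_weights = [w - lr * g for w, g in zip(weights, grads)]
    return new_weights, 0.5 * error ** 2

weights = [0.3, -0.2]
for _ in range(3):
    weights, loss = train_step(weights, [1.0, 2.0], 0.5, scale=0.01)
    print(weights, loss)
```

The float weights keep accumulating small updates even when a single update is too small to change the quantized value, which is why the backward path is kept in floating point.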

The methods that involve training require customers to provide their datasets and the definitions of their loss/accuracy functions. Both are likely to be proprietary information that customers would not readily share (a common scenario), hence the need for the first two methods mentioned above.
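The second of those methods, running vectors to register the range of each tensor, needs no labels or loss function. A minimal sketch follows; the `RangeObserver` class is hypothetical, and the activation values stand in for whatever a layer would produce on real calibration vectors.

```python
class RangeObserver:
    """Registers the running min/max of one tensor across vectors."""

    def __init__(self):
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def observe(self, activations):
        self.min_val = min(self.min_val, min(activations))
        self.max_val = max(self.max_val, max(activations))

    def symmetric_scale(self, bits=8):
        if self.min_val > self.max_val:
            raise ValueError("no vectors have been observed yet")
        qmax = 2 ** (bits - 1) - 1
        max_abs = max(abs(self.min_val), abs(self.max_val))
        return max_abs / qmax if max_abs > 0 else 1.0

# Made-up activations of one tensor for two calibration vectors.
observer = RangeObserver()
observer.observe([0.1, 2.0, -0.5])
observer.observe([3.81, -1.0])
print(observer.min_val, observer.max_val, observer.symmetric_scale())
```

In a full flow there would be one observer per tensor, and the recorded ranges would feed whichever quantization type (Power-2, symmetric, or asymmetric) is in use.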

## Sample Results

The table below shows the results of quantization for some sample networks. Asymmetric UINT8 quantization, per-channel INT8 quantization, and UINT8 quantization with retraining are compared against FP accuracy.

From the above results, some networks perform very poorly with standard quantization and may require per-channel quantization, while others require retraining or other methods.

Moreover, retraining or training with quantized weights gives the best results but requires customers to provide datasets, model parameters, loss and accuracy functions, etc., which may not be feasible in all cases.

In addition, some customer models have frozen (constant) weights and cannot be retrained immediately after release.

## Methods to improve accuracy

As the figure below shows, one of the main reasons for the loss of accuracy is the dynamic range of the weights across different channels.

### Dynamic Range

The following methods address the loss of accuracy without retraining:

- Per-channel
  - As the figure above suggests, establishing a different scale for each output channel of a convolutional neural network (CNN) preserves the accuracy, as corroborated by the results.
  - However, this introduces additional complexity: the scaling of each channel must be tracked, and the scaling values at the output activations must be readjusted to a common level before they are used in the next layer.
  - The hardware must therefore track the per-channel scaling values and apply them to the output of each layer.
- Reducing the Dynamic Range
  - The distribution of the weights can be analyzed, and outlier weights that are few in number and extreme in value can be clamped to 2 or 3 standard deviations.
  - This step usually requires validating the accuracy, as some of the extreme weights may be required by the design.
- Equalization
  - The output weights of each layer are equalized in conjunction with the input weights of the next layer to normalize the ranges.
- Bias Correction
  - The bias values are adjusted to compensate for the bias error introduced by quantization.
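As an illustration of reducing the dynamic range, the sketch below clamps weights to a number of standard deviations. It assumes the deviation is measured around the mean of the tensor, which the text does not specify, and the weight values are invented to include one extreme outlier.

```python
import statistics

def clamp_outliers(weights, num_std=3.0):
    """Clamp weights to mean +/- num_std population standard deviations."""
    mean = statistics.mean(weights)
    std = statistics.pstdev(weights)
    low, high = mean - num_std * std, mean + num_std * std
    return [max(low, min(high, w)) for w in weights]

# Hypothetical weights: 100 small values and one extreme outlier.
weights = [0.1, -0.1] * 50 + [5.0]
clamped = clamp_outliers(weights, num_std=3.0)
print(max(abs(w) for w in weights), max(abs(w) for w in clamped))
```

After clamping, the maximum absolute value that sets the quantization scale is much smaller, so the bulk of the weights is resolved with finer steps. As noted above, accuracy should still be validated afterwards, since the clamped weight may have mattered.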

## Quantization Impact on Frames / Sec

8-bit quantization significantly improves model performance by reducing the memory size and by performing 8-bit integer computations.

Adopting dynamic bit widths per layer gives even more performance benefits. Some layers can be quantized to 4-bit, 2-bit, or 1-bit (ternary and binary neural networks) without significant loss of accuracy. This is layer dependent, and usually the final layers tend to perform well at lower bit widths.

Binary and ternary operations are computationally less intensive than 8-bit Multiply Accumulates, yielding much lower power.

Even in cases where the hardware does not support 4-bit or 2-bit operations, the model compression achieved by simply storing the weights in fewer bits reduces the storage and bus bandwidth requirements.
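The storage saving can be sketched by packing two signed 4-bit weights into each byte and unpacking them again before compute. The nibble order (low nibble first) and the function names are arbitrary choices for this example.

```python
def pack_int4(values):
    """Pack signed 4-bit integers (-8..7) two per byte, low nibble first."""
    if any(v < -8 or v > 7 for v in values):
        raise ValueError("values must fit in a signed 4-bit range")
    padded = list(values) + ([0] if len(values) % 2 else [])
    packed = bytearray()
    for low, high in zip(padded[0::2], padded[1::2]):
        packed.append((low & 0xF) | ((high & 0xF) << 4))
    return bytes(packed)

def unpack_int4(packed, count):
    """Recover `count` signed 4-bit integers from packed bytes."""
    values = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            # Nibbles 8..15 encode the negative values -8..-1.
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return values[:count]

weights = [-8, 7, 0, -1, 3]
packed = pack_int4(weights)
print(len(weights), "weights stored in", len(packed), "bytes")
print(unpack_int4(packed, len(weights)))
```

Hardware that only supports 8-bit arithmetic can expand each nibble to INT8 after the fetch, so the memory and bus savings apply even though the multiplies themselves are unchanged.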

## Gyrus Flows

To address these needs, Gyrus has developed several modules that bridge the gap between open-source AI frameworks (TensorFlow, PyTorch, etc.) and the hardware from silicon vendors.

The flows cover graph optimization, compression, pruning, quantization, scheduling, and compilation, along with ML/AI library modules.

*To learn more about our AI framework models, check out Gyrus AI Framework Modules*

Specifically, for quantization, the flows support all three types of quantization (Power-2, Symmetric, and Asymmetric) across the different methods of quantization (No Vectors, Vectors for Range, and Vectors for Retraining) mentioned above.

The flows also employ advanced methods to reduce the quantization loss, such as per-channel quantization, dynamic range adjustment, and bias correction. These flows are applicable to canned models or new models, and the results presented above were obtained with the Gyrus quantization modules. The quantization flows also work in conjunction with compression and pruning, and the quantization method used influences the compression scheme employed in the hardware.

## Conclusion

Quantized fixed-point operations are the norm in edge computing. Silicon vendors should support all or a subset of the different quantization schemes, as each has advantages depending on the network or model. To achieve close to FP accuracy, one needs to employ techniques beyond simple conversion. In addition, using smaller bit widths for certain layers gives additional power and size advantages.
