Activations are quantized to a specified bit-width (8-bit, in the case of BitNet b1.58) using absmax per-token quantization. This involves scaling the activations into the range [−128, 127] for an 8-bit bit-width. The quantization formula is:

$$\tilde{x} = \mathrm{Quant}(x) = \mathrm{Clip}\!\left(\mathrm{round}\!\left(x \times \frac{127}{\gamma}\right),\ -128,\ 127\right), \qquad \gamma = \max_i |x_i|$$

where $\gamma$ is computed per row (per token).
Breaking down each component:
- The scale factor, a scalar per row (token), is computed by dividing 127 by the maximum absolute value in that row: $s = 127 / \gamma$.
- A new quantized matrix is created by scaling each row by its scale factor, then applying round and clip as per the formula.
- Finally, dequantize by simply dividing each row by its scale factor.
Computing the scale per row preserves each token's dynamic range while ensuring all values fit within the signed 8-bit integer range.
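Below is a minimal NumPy sketch of the steps above. The function names, the `eps` guard against all-zero rows, and the shapes are illustrative choices, not from the original text:

```python
import numpy as np

def absmax_quantize(x: np.ndarray, eps: float = 1e-5):
    """Absmax 8-bit quantization applied per row (per token).

    Returns the int8 matrix and the per-row scale factors
    needed for dequantization.
    """
    # Per-row scale: 127 divided by the row's maximum absolute value.
    gamma = np.abs(x).max(axis=-1, keepdims=True)
    scale = 127.0 / np.maximum(gamma, eps)  # eps guards against all-zero rows
    # Scale, round, and clip into the signed 8-bit range [-128, 127].
    x_q = np.clip(np.round(x * scale), -128, 127).astype(np.int8)
    return x_q, scale

def absmax_dequantize(x_q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Approximately recover the original values: divide by the scale."""
    return x_q.astype(np.float32) / scale

# Round-trip example: 4 tokens, 16 features each.
x = np.random.randn(4, 16).astype(np.float32)
x_q, scale = absmax_quantize(x)
x_hat = absmax_dequantize(x_q, scale)
```

Note that because the scale is per row, one large outlier value only degrades the precision of its own token, not of the whole batch.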
Note
We apply layer normalization (LN) before quantizing the activations to maintain the variance of the output:

$$\mathrm{LN}(x) = \frac{x - E[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}}$$
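A short sketch of how this fits together, assuming the `absmax_quantize` helper above; the `layer_norm` helper here is illustrative (frameworks provide their own LayerNorm implementations):

```python
import numpy as np

def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize each token to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

# Normalize first, then quantize (uses absmax_quantize from the sketch above).
x = np.random.randn(4, 16).astype(np.float32)
x_q, scale = absmax_quantize(layer_norm(x))
```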