Activations are quantized to a specified bit-width (8-bit, in the case of BitNet b1.58) using absmax per-token quantization. This scales the activations into the range [-128, 127] for an 8-bit bit-width. The quantization formula is:

$$\tilde{x} = \operatorname{clamp}\!\left(\operatorname{round}\!\left(\frac{127}{\gamma}\, x\right),\; -128,\; 127\right), \qquad \gamma = \max_i |x_i|$$

where $\gamma$ is the largest absolute value in the row, so the formula is applied independently per token.

Breaking down each component:

  1. The scale factor, a scalar per row, is computed by dividing 127 by the maximum absolute value in that row: $s = 127 / \gamma$
  2. Multiply each row by its scale factor, round the results, and clamp (clip) them to [-128, 127], as in the formula above, to form the quantized matrix
  3. Finally, dequantize by dividing each quantized row by its scale factor

The process preserves per-row precision while ensuring all values fit within the signed 8-bit integer range.
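To make the three steps concrete, here is a minimal PyTorch sketch of per-token absmax quantization and dequantization. The function names and the `eps` guard against all-zero rows are my own illustrative choices, not taken from the BitNet code:

```python
import torch

def absmax_quantize(x: torch.Tensor, eps: float = 1e-5):
    """Per-token (per-row) absmax quantization to signed 8-bit.

    x: (tokens, features) activation matrix.
    Returns the int8 tensor and the per-row scale used to quantize.
    """
    # gamma: largest absolute value in each row; eps guards all-zero rows
    gamma = x.abs().max(dim=-1, keepdim=True).values.clamp(min=eps)
    scale = 127.0 / gamma                        # step 1: one scalar per row
    x_q = (x * scale).round().clamp(-128, 127)   # step 2: round, then clip
    return x_q.to(torch.int8), scale

def absmax_dequantize(x_q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # step 3: undo the scaling; rounding error is the only information lost
    return x_q.float() / scale

x = torch.randn(4, 8)                  # 4 tokens, 8 features
x_q, scale = absmax_quantize(x)
x_hat = absmax_dequantize(x_q, scale)
print((x - x_hat).abs().max())         # worst-case error is 0.5 / scale per row
```

Because the scale is recomputed for every row, a token with small activations gets just as fine a grid as a token with large ones, which is the point of doing this per token rather than per tensor.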

Note

We apply layer normalization (LN) before quantizing the activations to maintain the variance of the output:
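Here, LN is the standard per-token normalization ($\epsilon$ is a small constant for numerical stability):

$$\mathrm{LN}(x) = \frac{x - \mathbb{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}}, \qquad \tilde{x} = \operatorname{Quant}\big(\mathrm{LN}(x)\big)$$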