
In a transformer block, the feed-forward layers are typically the most computationally expensive. In 1-bit LLMs, each nn.Linear is replaced with a BitLinear layer, which quantizes the full-precision weights with a RoundClip operation that maps them to -1, 0, or 1 (in BitNet b1.58; the original BitNet binarizes weights to -1 and +1). In both BitNet and BitNet b1.58, the activations are also quantized to 8-bit precision.
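As a rough illustration, here is a minimal PyTorch sketch of such a layer. The names (round_clip, BitLinear's internals) are ours rather than the reference implementation, and it assumes absmean scaling for the ternary weights and absmax scaling for the 8-bit activations:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def round_clip(x, a=-1.0, b=1.0):
    # Round to the nearest integer, then clip to [a, b] -> values in {-1, 0, 1}
    return torch.clamp(torch.round(x), a, b)

class BitLinear(nn.Linear):
    """Illustrative drop-in replacement for nn.Linear with ternary weights
    and 8-bit activations (details are a sketch, not the official code)."""

    def forward(self, x):
        w = self.weight
        # Weight quantization: scale by the mean absolute value (absmean),
        # RoundClip to {-1, 0, 1}, then rescale so magnitudes stay comparable.
        gamma = w.abs().mean().clamp(min=1e-5)
        w_q = round_clip(w / gamma) * gamma
        # Activation quantization: absmax scaling into the 8-bit range [-128, 127].
        scale = 127.0 / x.abs().max(dim=-1, keepdim=True).values.clamp(min=1e-5)
        x_q = torch.round(x * scale).clamp(-128, 127) / scale
        return F.linear(x_q, w_q, self.bias)

# Usage: behaves like nn.Linear from the caller's point of view.
layer = BitLinear(512, 2048)
out = layer(torch.randn(4, 512))
```

Note that this only shows the forward pass; training such a layer also requires the gradient workaround discussed next.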

Note that the RoundClip operation is not differentiable, so we cannot backpropagate through it directly. A common trick is the straight-through estimator (STE), which lets gradients flow through the non-differentiable rounding operation by approximating its gradient as 1, i.e., treating the rounding as the identity in the backward pass.
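In PyTorch, the STE is often written with a detach trick: the forward pass uses the quantized value, while the backward pass sees only the identity. A small illustrative example (the helper name ste is ours):

```python
import torch

def ste(x, quantize):
    # Straight-through estimator: the forward pass returns quantize(x),
    # but the (quantize(x) - x) term is detached from the graph, so the
    # gradient of the whole expression with respect to x is exactly 1.
    return x + (quantize(x) - x).detach()

# Toy check: gradients flow through the non-differentiable rounding.
w = torch.randn(4, requires_grad=True)
w_q = ste(w, lambda t: torch.clamp(torch.round(t), -1, 1))
w_q.sum().backward()
print(w.grad)  # tensor of ones: the rounding was treated as identity
```

The same pattern would wrap the weight and activation quantization inside a BitLinear-style layer so that the full-precision weights keep receiving gradients during training.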