In a transformer block, the feed-forward layer is the most computationally heavy. In 1-bit LLMs, the nn.Linear layer is replaced with a BitLinear layer. This layer essentially performs a RoundClip operation that converts full-precision weights to -1, 0, or 1. We also quantize the activations to 8-bit precision in both BitNet and BitNet b1.58.
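To make this concrete, here is a minimal sketch of what a BitLinear layer could look like in PyTorch, assuming absmean scaling for the ternary weights and per-tensor absmax scaling for the 8-bit activations. The class and helper names are illustrative, not the official implementation, and the gradient handling is deliberately omitted here (see the STE discussion below).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def round_clip(x, a, b):
    # RoundClip(x, a, b) = max(a, min(b, round(x)))
    return torch.clamp(torch.round(x), a, b)

def quantize_weights(w, eps=1e-5):
    # Absmean scaling, then RoundClip to the ternary values {-1, 0, +1}
    gamma = w.abs().mean()
    w_q = round_clip(w / (gamma + eps), -1, 1)
    return w_q, gamma

def quantize_activations(x, bits=8, eps=1e-5):
    # Absmax scaling into the signed 8-bit range
    q = 2 ** (bits - 1)
    scale = x.abs().max().clamp(min=eps)
    x_q = torch.clamp(torch.round(x * q / scale), -q, q - 1)
    return x_q, scale / q

class BitLinear(nn.Module):
    # Illustrative drop-in replacement for nn.Linear (forward pass only, no STE)
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)

    def forward(self, x):
        w_q, w_scale = quantize_weights(self.weight)
        x_q, x_scale = quantize_activations(x)
        # Matmul on the quantized tensors, then rescale back to floats
        return F.linear(x_q, w_q) * (w_scale * x_scale)
```

As written, the rounding inside this forward pass would block gradient flow during training, which is exactly the problem the next paragraph addresses.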
Note that the RoundClip operation is not differentiable, so we can't backpropagate through it directly. Instead, we rely on tricks such as the straight-through estimator (STE), which lets gradients flow through the non-differentiable rounding operation by approximating its gradient as 1, i.e., treating the rounding as the identity function in the backward pass.