Quantization is typically a post-training process that lowers the memory usage of a model without requiring architectural changes (though quantization-aware training also exists). The standard approach is to reduce the numerical precision of the weights, e.g. storing 2.3456... as 2.3. This reduction in precision results in faster inference and lower memory usage, but at the cost of some accuracy.
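To make this concrete, here is a minimal sketch of post-training weight quantization using absmax scaling to int8. The function names and the toy weight matrix are illustrative assumptions, not taken from any particular library.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Map float weights to int8 using a single absmax scale factor."""
    scale = np.abs(w).max() / 127.0                      # largest magnitude maps to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights for computation."""
    return q.astype(np.float32) * scale

w = np.array([[2.3456, -0.7812], [0.0931, -1.5347]], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(q)      # int8 weights, 4x smaller than float32
print(w_hat)  # close to w, with a small rounding error
```

The dequantized weights are close to, but not exactly equal to, the originals; that rounding error is the accuracy cost mentioned above.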
In contrast, in 1-bit LLMs every weight is constrained to binary values {-1, +1} (or ternary {-1, 0, +1} in the case of the BitNet b1.58 model). The model is trained from scratch, and 1-bit LLMs make some architectural changes to accommodate this.
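As a rough illustration of the ternary case, the sketch below rounds each weight to {-1, 0, +1} after scaling by the mean absolute value of the weight matrix, in the spirit of the absmean quantization described in the BitNet b1.58 paper. The helper below is an illustrative assumption, not the reference implementation.

```python
import numpy as np

def ternarize(w: np.ndarray, eps: float = 1e-5):
    """Quantize weights to {-1, 0, +1} with a per-tensor absmean scale."""
    gamma = np.abs(w).mean() + eps            # absmean scale factor
    q = np.clip(np.round(w / gamma), -1, 1)   # ternary values in {-1, 0, +1}
    return q.astype(np.int8), gamma

w = np.array([[0.62, -1.30, 0.04], [-0.11, 0.95, -0.48]], dtype=np.float32)
q, gamma = ternarize(w)
print(q)      # ternary weight matrix
print(gamma)  # scale used to approximate w as gamma * q
```

Because the quantized weights are only -1, 0, or +1, matrix multiplication reduces to additions and subtractions, which is where much of the inference speedup comes from.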