Topics
The straight-through estimator (STE) is a technique for handling non-differentiable functions in neural networks; it is particularly useful for training networks with discrete or binary operations.
The core idea is elegant in its simplicity:
- During the forward pass: use the actual discrete/binary function
- During the backward pass: pretend the function was the identity (this introduces an explicit bias into the gradient)
- Other gradient approximations can be used instead of the plain identity, such as clipping (e.g., passing the gradient only where the input lies in [-1, 1]) or thresholding the gradients (see the sketch after this list)
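As an illustration, here is a minimal PyTorch sketch of an STE with the clipped-identity backward pass; the class name `BinarizeSTE` and the sign-function forward are illustrative choices, not a fixed recipe:

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Sign binarization with a clipped straight-through gradient.

    Forward: torch.sign(x) (a discrete, non-differentiable function).
    Backward: pass the incoming gradient through unchanged, but zero it
    where |x| > 1 (the clipped-identity approximation).
    """

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Identity gradient, clipped: only propagate where the input
        # lies in [-1, 1]; elsewhere the gradient is zeroed.
        return grad_output * (x.abs() <= 1).to(grad_output.dtype)


# Usage: behaves like sign() in the forward pass but remains trainable.
x = torch.randn(4, requires_grad=True)
y = BinarizeSTE.apply(x)
y.sum().backward()
print(y, x.grad)
```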
In practice, STEs are used in binarized neural networks and for quantizing activations, but they have limitations: the gradient estimates are biased, and training can become unstable in deep networks. A popular way to implement an STE is the detach trick, shown below.
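A minimal sketch of the detach trick, again assuming a PyTorch sign binarizer (the helper name `binarize_ste` is hypothetical): the detached term fixes the forward value to `sign(x)`, while gradients flow only through the plain `x` term, giving the identity backward pass.

```python
import torch

def binarize_ste(x: torch.Tensor) -> torch.Tensor:
    """Straight-through sign binarization via the detach trick.

    Forward value: x + (sign(x) - x) = sign(x), since detach does not
    change the value. Backward: the detached term contributes no
    gradient, so d(output)/dx = 1, i.e. the identity gradient.
    """
    return x + (torch.sign(x) - x).detach()


x = torch.randn(4, requires_grad=True)
y = binarize_ste(x)    # values equal sign(x)
y.sum().backward()     # x.grad is all ones: identity gradient
print(y, x.grad)
```

The same pattern applies to other discrete operations (e.g., rounding for activation quantization): compute the hard value, then add it back through a detached difference so the backward pass sees a smooth surrogate.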