
Hard attention differs fundamentally from soft attention by making a discrete choice rather than computing a weighted average. Instead of producing soft weights, the attention scores define a probability distribution over the input elements, from which one element (or a small subset) is sampled. The output depends solely on the selected element(s), which makes the selection process stochastic and discrete.
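
As a rough illustration, here is a minimal PyTorch-style sketch of that sampling step (the function `hard_attention` and its `query`/`keys`/`values` names and shapes are hypothetical, not taken from any particular implementation); soft attention would instead return the probability-weighted average `probs @ values`.

```python
import torch

def hard_attention(query, keys, values):
    # query: (d,), keys: (n, d), values: (n, d_v)  -- illustrative shapes
    scores = keys @ query                          # unnormalized attention scores, shape (n,)
    probs = torch.softmax(scores, dim=0)           # probability distribution over input elements
    idx = torch.multinomial(probs, num_samples=1)  # discrete, stochastic choice of one element
    return values[idx.item()], idx                 # output depends only on the sampled element

# usage: one query attending over 5 input elements
q, K, V = torch.randn(8), torch.randn(5, 8), torch.randn(5, 16)
out, chosen = hard_attention(q, K, V)              # out has shape (16,); chosen is the sampled index
```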

This discrete nature offers advantages:

  • improved model explainability by revealing the specific element(s) attended to
  • potential computational efficiency during inference, as only the selected element(s) are processed, unlike soft attention, which processes all inputs

The major challenge is non-differentiability: the discrete sampling step blocks standard gradient flow during backpropagation. Training therefore requires specialized methods, such as REINFORCE-style policy gradients (score-function estimators) or straight-through / Gumbel-Softmax relaxations.
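
As an example, the sketch below shows a REINFORCE-style surrogate loss for a single hard-attention step, reusing the hypothetical setup from the earlier snippet (no baseline or other variance reduction is included, so this estimator is unbiased but high-variance):

```python
import torch

def reinforce_step(query, keys, values, target, loss_fn):
    scores = keys @ query
    probs = torch.softmax(scores, dim=0)
    idx = torch.multinomial(probs, num_samples=1)      # non-differentiable sample
    task_loss = loss_fn(values[idx.item()], target)    # downstream loss uses only the chosen element
    log_prob = torch.log(probs[idx.item()] + 1e-9)     # log-probability of the chosen index
    # score-function (REINFORCE) term: its gradient, task_loss * d(log_prob),
    # is an unbiased estimate of the gradient through the sampling step
    surrogate = task_loss + task_loss.detach() * log_prob
    return surrogate, idx
```

A straight-through Gumbel-Softmax relaxation (e.g. `torch.nn.functional.gumbel_softmax(..., hard=True)`) is a common alternative that keeps a differentiable path through a relaxed version of the sample.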

Typical applications include image captioning and visual question answering (VQA), where selecting specific image regions (glimpses) is intuitive. Despite the conceptual benefits of interpretability and efficiency, the training complexity caused by non-differentiability has limited hard attention's widespread adoption.