
Causal linear attention can be expressed as an RNN. This applies when a token at position $t$ only attends to tokens at positions $s \le t$.

This formulation, shown by Katharopoulos et al. (2020), defines the state at time $t$ using an accumulated key-value (context) summary $S_t$ and an accumulated key summary $z_t$:

$$S_t = \sum_{s=1}^{t} \phi(k_s)\, v_s^\top, \qquad z_t = \sum_{s=1}^{t} \phi(k_s)$$
For more details on the notation, refer to the linear attention math formulation. A cool observation is that these states update recurrently:

$$S_t = S_{t-1} + \phi(k_t)\, v_t^\top, \qquad z_t = z_{t-1} + \phi(k_t)$$
The output at time $t$ is computed using the current query $q_t$ and the updated state $(S_t, z_t)$:

$$y_t = \frac{\phi(q_t)^\top S_t}{\phi(q_t)^\top z_t}$$
Standard causal attention with KV-caching (storing all previous keys and values) needs $O(Td)$ memory for a sequence of length $T$. Causal linear attention in the RNN formulation only needs $O(d^2)$ memory (specifically $O(d \times d)$ for $S_t$ and $O(d)$ for $z_t$ at each step during inference) for the fixed-size state $(S_t, z_t)$, making it efficient for long sequences.
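
To make the recurrence concrete, here is a minimal single-head, unbatched sketch in PyTorch. It assumes the $\phi(x) = \mathrm{elu}(x) + 1$ feature map from Katharopoulos et al. (2020); the function names and the small `eps` for numerical stability are illustrative choices, not from any particular library:

```python
import torch
import torch.nn.functional as F

def elu_feature_map(x):
    # phi(x) = elu(x) + 1: a simple positive feature map (Katharopoulos et al., 2020)
    return F.elu(x) + 1.0

def linear_attention_step(q_t, k_t, v_t, S, z, eps=1e-6):
    """One recurrent inference step for causal linear attention.

    q_t, k_t, v_t: (d,) tensors for the current token.
    S: (d, d) accumulated key-value summary; z: (d,) accumulated key summary.
    Returns the output y_t and the updated state (S, z).
    """
    phi_q, phi_k = elu_feature_map(q_t), elu_feature_map(k_t)
    S = S + torch.outer(phi_k, v_t)        # S_t = S_{t-1} + phi(k_t) v_t^T
    z = z + phi_k                          # z_t = z_{t-1} + phi(k_t)
    y_t = (phi_q @ S) / (phi_q @ z + eps)  # y_t = phi(q_t)^T S_t / (phi(q_t)^T z_t)
    return y_t, S, z

# Token-by-token decoding: the state stays (d, d) + (d,) regardless of sequence length.
d, T = 64, 128
S, z = torch.zeros(d, d), torch.zeros(d)
q, k, v = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)
outputs = [None] * T
for t in range(T):
    outputs[t], S, z = linear_attention_step(q[t], k[t], v[t], S, z)
```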

Pros:

  • $O(1)$ time and $O(1)$ space complexity per step during inference (with respect to sequence length)
  • Good for long sequences

Cons:

  • The sequential nature of the RNN formulation hinders parallelization during training
  • The kernel feature map $\phi$ only approximates the softmax kernel, so model quality typically trails standard attention
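
For reference, the feature map used in Katharopoulos et al. (2020) is the simple positive map

$$\phi(x) = \operatorname{elu}(x) + 1,$$

which keeps the attention weights positive but does not reproduce the exponential (softmax) kernel of standard attention.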

Trends:

  • To mitigate the training slowdown, use chunkwise processing: the input sequence is divided into non-overlapping chunks. Within each chunk, computations are parallelized; the recurrent state update is performed sequentially between chunks. This strikes a nice balance between parallelization and the sequential dependency required for causal modeling across chunks (see the sketch after this list)
  • Hardware-aware implementations of linear attention using tools like Triton
  • Gaining traction through libraries like flash-linear-attention
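
A hedged sketch of the chunkwise idea, again in plain PyTorch with the elu + 1 feature map (function names and the fixed chunk size are assumptions for illustration; libraries such as flash-linear-attention provide fused, hardware-aware kernels for this pattern):

```python
import torch
import torch.nn.functional as F

def elu_feature_map(x):
    return F.elu(x) + 1.0

def chunkwise_linear_attention(q, k, v, chunk_size=64, eps=1e-6):
    """Chunkwise-parallel causal linear attention (single head, unbatched).

    q, k, v: (T, d) tensors; T is assumed divisible by chunk_size for brevity.
    Work inside a chunk is dense matmuls (parallel); only the (d, d) state S
    and the (d,) normalizer z are carried sequentially across chunks.
    """
    T, d = q.shape
    phi_q, phi_k = elu_feature_map(q), elu_feature_map(k)
    S, z = torch.zeros(d, d), torch.zeros(d)
    causal_mask = torch.tril(torch.ones(chunk_size, chunk_size))
    outputs = []
    for start in range(0, T, chunk_size):
        Qc = phi_q[start:start + chunk_size]   # (C, d)
        Kc = phi_k[start:start + chunk_size]   # (C, d)
        Vc = v[start:start + chunk_size]       # (C, d)
        # Intra-chunk term: causal attention inside the chunk, fully parallel.
        A = (Qc @ Kc.T) * causal_mask          # (C, C)
        num = A @ Vc + Qc @ S                  # inter-chunk term uses the carried state
        den = A.sum(dim=-1) + Qc @ z + eps     # matching normalizer
        outputs.append(num / den.unsqueeze(-1))
        # Sequential state update between chunks.
        S = S + Kc.T @ Vc
        z = z + Kc.sum(dim=0)
    return torch.cat(outputs, dim=0)

# Usage: produces the same outputs as the token-by-token recurrence above.
out = chunkwise_linear_attention(torch.randn(256, 64), torch.randn(256, 64), torch.randn(256, 64))
```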