It is important to distinguish the causal mask from the padding mask:
- Causal Mask: Prevents attention to future positions within a sequence. Applied only in decoder self-attention. Shape is typically a lower-triangular matrix, so each position can attend only to itself and earlier positions. Purpose is to enforce the auto-regressive property during training and inference.
- Padding Mask: Prevents attention to padding tokens added to make sequences in a batch equal length. Can be applied in encoder self-attention, decoder self-attention, and decoder cross-attention whenever padding is present. Shape depends on the lengths of the sequences in the batch. Purpose is to ignore meaningless padding tokens during computation. (Both constructions are sketched after this list.)
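
As a minimal sketch of how the two masks differ in construction (PyTorch assumed; the tensor names, batch size, and sequence lengths below are illustrative, not from the original text):

```python
import torch

seq_len, batch_size = 5, 2

# Causal mask: lower-triangular, shape (seq_len, seq_len).
# True marks allowed positions; position i may attend only to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Padding mask: shape (batch_size, seq_len), derived from the real lengths
# in the batch. True marks real tokens, False marks padding.
lengths = torch.tensor([5, 3])  # hypothetical per-sequence lengths
padding_mask = torch.arange(seq_len)[None, :] < lengths[:, None]
```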
Both are often implemented by adding large negative values to the attention scores before the softmax, but they target different tokens for different reasons. Some libraries combine both masks automatically when both are provided.
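
Continuing the sketch above, one way to combine the two masks and apply them additively before the softmax (the attention scores here are random placeholders):

```python
# Placeholder attention logits of shape (batch_size, seq_len, seq_len).
scores = torch.randn(batch_size, seq_len, seq_len)

# Broadcast the two masks together:
# (1, seq_len, seq_len) AND (batch_size, 1, seq_len) -> (batch_size, seq_len, seq_len)
combined = causal_mask[None, :, :] & padding_mask[:, None, :]

# Disallowed positions receive -inf, so softmax assigns them zero weight.
masked_scores = scores.masked_fill(~combined, float("-inf"))
attn_weights = torch.softmax(masked_scores, dim=-1)
```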