
Perplexity measures how well a language model predicts a provided sample of text. A lower perplexity score indicates that the model is better at predicting the text, meaning the text is less “perplexing” to the model.

Unlike extrinsic evaluations that measure performance on a downstream task (such as translation or summarization), perplexity directly assesses the language-modeling objective itself: predicting the next token given the previous ones (the auto-regressive property).

Given a sequence of tokens $W = (w_1, w_2, \dots, w_N)$ and a language model that assigns a probability to each token conditioned on the preceding tokens, the perplexity (PP) of the sequence is the inverse probability of the sequence, normalized by the number of tokens $N$ (the normalization avoids favoring short documents).

The joint probability of the sequence under the language model follows from the chain rule:

$$P(W) = \prod_{i=1}^{N} P(w_i \mid w_1, \dots, w_{i-1})$$
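
As a concrete illustration, the sketch below computes this joint probability from a hypothetical list of per-token conditional probabilities (the numbers are made up for the example):

```python
import math

# Hypothetical conditional probabilities P(w_i | w_1, ..., w_{i-1})
# that a language model might assign to each token of a 5-token sequence.
token_probs = [0.20, 0.10, 0.35, 0.25, 0.05]

# Chain rule: the joint probability is the product of the conditionals.
joint_prob = math.prod(token_probs)
print(f"P(W) = {joint_prob:.6g}")  # -> 8.75e-05
```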

The perplexity is then defined as:

$$\text{PP}(W) = P(w_1, w_2, \dots, w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1, w_2, \dots, w_N)}}$$

After simplification, the formula boils down to the exponential of the average negative log-likelihood of the tokens:

$$\text{PP}(W) = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_1, \dots, w_{i-1})\right)$$
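
Continuing the toy example above, the snippet below evaluates perplexity both ways — as the $N$-th root of the inverse joint probability and as the exponential of the average negative log-likelihood — and confirms that the two forms agree:

```python
import math

token_probs = [0.20, 0.10, 0.35, 0.25, 0.05]
N = len(token_probs)

# Form 1: inverse joint probability, normalized by sequence length.
ppl_inverse = math.prod(token_probs) ** (-1.0 / N)

# Form 2: exponential of the average negative log-likelihood.
avg_nll = -sum(math.log(p) for p in token_probs) / N
ppl_exp = math.exp(avg_nll)

print(ppl_inverse, ppl_exp)  # both ~6.48
```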

A lower perplexity value means the model assigns a higher probability to the test sequence, indicating a better fit to the data.
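
In practice, perplexity is usually computed from a model's cross-entropy loss. Below is a minimal sketch, assuming the Hugging Face transformers and torch packages are installed and using gpt2 purely as an example checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # example checkpoint; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "Perplexity measures how well a language model predicts a sample of text."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy,
    # i.e. the average negative log-likelihood per predicted token.
    outputs = model(**inputs, labels=inputs["input_ids"])

perplexity = torch.exp(outputs.loss)
print(f"Perplexity: {perplexity.item():.2f}")
```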

Warning

While fundamental for assessing core modeling quality, a low perplexity alone doesn’t guarantee good performance on complex tasks like dialogue or QA. Modern evaluations therefore complement it with broader, task-specific metrics.