
Perplexity measures how well a language model predicts a provided sample of text. A lower perplexity score indicates that the model is better at predicting the text, meaning the text is less “perplexing” to the model.

Unlike extrinsic evaluations that measure performance on a downstream task (such as translation or summarization), perplexity directly assesses the language-modeling objective itself: predicting the next token given the previous ones (the auto-regressive property).

Given a sequence of tokens $W = (w_1, w_2, \dots, w_N)$ and a language model that assigns a probability to each token conditioned on the preceding tokens, the perplexity (PP) of the sequence is the inverse probability of the sequence, normalized by the number of tokens $N$ (the normalization avoids favoring short documents).

The joint probability of the sequence under the language model follows from the chain rule:

$$P(W) = \prod_{i=1}^{N} P(w_i \mid w_1, \dots, w_{i-1})$$
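
As a concrete illustration, the sketch below computes this joint probability from a hypothetical list of per-token conditional probabilities (the numbers are made up for the example):

```python
import math

# Hypothetical conditional probabilities P(w_i | w_1, ..., w_{i-1})
# that a language model might assign to each token of a 5-token sequence.
token_probs = [0.20, 0.10, 0.35, 0.25, 0.05]

# Chain rule: the joint probability is the product of the conditionals.
joint_prob = math.prod(token_probs)
print(f"P(W) = {joint_prob:.6g}")  # -> 8.75e-05
```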

The perplexity is then defined as:

$$\text{PP}(W) = P(w_1, w_2, \dots, w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1, w_2, \dots, w_N)}}$$

After simplification, the formula boils down to the exponential of the average negative log-likelihood of the tokens:

$$\text{PP}(W) = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_1, \dots, w_{i-1})\right)$$
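
Continuing the toy example above, the snippet below evaluates perplexity both ways — as the $N$-th root of the inverse joint probability and as the exponential of the average negative log-likelihood — and confirms that the two forms agree:

```python
import math

token_probs = [0.20, 0.10, 0.35, 0.25, 0.05]
N = len(token_probs)

# Form 1: inverse joint probability, normalized by sequence length.
ppl_inverse = math.prod(token_probs) ** (-1.0 / N)

# Form 2: exponential of the average negative log-likelihood.
avg_nll = -sum(math.log(p) for p in token_probs) / N
ppl_exp = math.exp(avg_nll)

print(ppl_inverse, ppl_exp)  # both ~6.48
```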

A lower perplexity value means the model assigns a higher probability to the test sequence, indicating a better fit to the data.
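
In practice, perplexity is usually computed from a model's cross-entropy loss. Below is a minimal sketch, assuming the Hugging Face transformers and torch packages are installed and using gpt2 purely as an example checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # example checkpoint; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "Perplexity measures how well a language model predicts a sample of text."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy,
    # i.e. the average negative log-likelihood per predicted token.
    outputs = model(**inputs, labels=inputs["input_ids"])

perplexity = torch.exp(outputs.loss)
print(f"Perplexity: {perplexity.item():.2f}")
```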

Warning

While fundamental for assessing core modeling quality, a low perplexity alone doesn’t guarantee good performance on complex tasks like dialogue or QA. Modern evaluations therefore complement it with broader, task-specific metrics.