Bilingual Evaluation Understudy (BLEU) is a metric for evaluating machine translation, with values in the range 0 to 1. The higher the BLEU score, the closer the machine-generated text is to the human-translated reference text. At its core it uses a weighted sum of log n-gram precisions, together with a brevity penalty (BP) that penalizes translations that are too short. The BLEU formula is:

BLEU = BP · exp( Σ_{n=1}^{N} w_n log p_n )

where p_n is the modified (clipped) precision for n-grams of size n, w_n is the weight for each order (typically w_n = 1/N), and BP = 1 if c > r, else e^{1 − r/c}, with c the candidate length and r the reference length.
Calculation steps:
- Choose the maximum n-gram order N (typically N = 4)
- Count matching n-grams: For each n-gram size, count how many n-grams in the machine-generated text also appear in the reference translations, clipping each count at its maximum count in any reference so that repeated phrases are not over-counted
- Calculate the modified n-gram precisions, i.e. the p_n values: clipped matches divided by the total number of candidate n-grams of that size
- Compute the weighted sum of the log precisions and exponentiate, exp(Σ w_n log p_n), which is a weighted geometric mean of the precisions
- Apply BP to get the final BLEU score
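The steps above can be sketched in Python. This is a minimal sentence-level implementation for illustration (the function and variable names are my own, not from a standard library); production work would normally use an established implementation such as NLTK or sacreBLEU:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU: weighted geometric mean of clipped n-gram
    precisions (n = 1..max_n), multiplied by the brevity penalty."""
    if not candidate:
        return 0.0
    weights = [1.0 / max_n] * max_n  # uniform weights, w_n = 1/N
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        # Clip each candidate n-gram count at its maximum count
        # in any single reference (modified precision).
        max_ref_counts = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(min(count, max_ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or clipped == 0:
            return 0.0  # a zero precision at any order makes BLEU 0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: BP = 1 if c > r, else e^(1 - r/c),
    # using the reference length closest to the candidate length.
    c = len(candidate)
    r = min((len(ref) for ref in references),
            key=lambda ref_len: (abs(ref_len - c), ref_len))
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(w * lp for w, lp in zip(weights, log_precisions)))
```

A candidate identical to its reference scores 1.0, while a candidate too short to contain any 4-gram scores 0.0, since its 4-gram precision is zero.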