Bilingual Evaluation Understudy (BLEU) is a metric for evaluating machine translation, with values ranging from 0 to 1. The higher the BLEU score, the closer the machine-generated text is to the human-translated reference text. At its core it uses a weighted sum of n-gram precisions together with a brevity penalty (BP) that penalizes translations that are too short. The BLEU formula is:

  BLEU = BP * exp( sum_{n=1..N} w_n * log p_n ),   where BP = 1 if c > r, else exp(1 - r/c)

Here p_n is the modified (clipped) precision for n-grams of size n, w_n is its weight (typically 1/N), c is the candidate length, and r is the reference length.

Calculation steps:

  • Choose the n-gram order: typically N = 4
  • Count matching n-grams: for each n-gram size, count how many n-grams in the machine-generated text also appear in the reference translations, clipping each count by the maximum number of times that n-gram occurs in any single reference so that repeated phrases are not over-counted
  • Calculate the modified n-gram precisions, i.e. the p_n values: the clipped match count divided by the total number of n-grams in the candidate
  • Take the weighted sum of the log precisions and exponentiate; this is equivalent to a weighted geometric mean of the precisions
  • Apply the brevity penalty (BP) to get the final BLEU score
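The steps above can be sketched in Python. This is a minimal sentence-level sketch with uniform weights (w_n = 1/N); the function and variable names are illustrative, not taken from any particular library:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of the given size in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    """Minimal BLEU sketch: clipped n-gram precisions, uniform weights,
    and a brevity penalty."""
    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        # Clip each candidate n-gram count by the maximum number of times
        # that n-gram appears in any single reference.
        max_ref = Counter()
        for ref in references:
            for gram, cnt in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], cnt)
        matched = sum(min(cnt, max_ref[gram]) for gram, cnt in cand.items())
        total = sum(cand.values())
        if matched == 0:
            return 0.0  # any zero precision drives the geometric mean to zero
        # Uniform weight 1/N on each log precision.
        log_sum += (1.0 / max_n) * math.log(matched / total)
    c = len(candidate)
    # Use the reference length closest to the candidate length as r.
    r = min((len(ref) for ref in references), key=lambda rl: (abs(rl - c), rl))
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_sum)
```

For example, a candidate identical to its reference scores 1.0, while a candidate sharing no n-grams with any reference scores 0.0. Real implementations add smoothing so that a single missing n-gram size does not zero out the whole score.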