How is the BLEU score calculated?
BLEU (Bilingual Evaluation Understudy) calculates the similarity between a machine-generated translation and one or more high-quality human reference translations. It generates a score between 0 and 1, where 1 indicates perfect alignment.
It computes precision for matching n-grams (typically n = 1 to 4) between the candidate translation and the references. This "modified precision" clips each candidate n-gram's count at the maximum number of times it appears in the reference, so repeated n-grams cannot inflate the score. To guard against artificially high scores from very short outputs, BLEU applies a brevity penalty. Finally, the individual n-gram precisions are combined via a geometric mean, typically with equal weights.
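The steps above can be sketched in a few lines of Python. This is a simplified, sentence-level illustration with a single reference and equal weights, not a full implementation (production BLEU, e.g. sacreBLEU, is corpus-level and handles multiple references and smoothing):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: one reference, equal weights."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Modified precision: each candidate n-gram is credited at most
        # as many times as it appears in the reference (count clipping).
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        if overlap == 0:
            return 0.0  # geometric mean is 0 if any precision is 0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / max_n)
```

A perfect match scores 1.0, a candidate sharing no 4-grams with the reference scores lower, and degenerate repetitive output (e.g. "the the the the") is driven to 0 by count clipping.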
BLEU serves as a fast, consistent, and scalable automated metric for evaluating machine translation system output quality. It helps track progress during model development and compare different systems efficiently. While valuable, it primarily measures surface-level n-gram overlap and correlates imperfectly with human judgments of fluency and meaning; it should typically be used alongside human evaluation.