
What is the BLEU metric?

The BLEU (Bilingual Evaluation Understudy) metric is an algorithm for automatically evaluating the quality of machine-translated text. It measures the similarity between the machine-generated translation and one or more high-quality human reference translations.

BLEU computes modified (clipped) n-gram precisions between the machine output and the reference(s), typically for n-grams up to length 4, and combines them as a geometric mean. It then multiplies by a brevity penalty that penalizes overly short translations, which would otherwise score well on precision while omitting content present in the references. The score ranges from 0 to 1 (often reported scaled to 0-100), where higher scores indicate closer resemblance to the reference translations. Its reliability improves with multiple, high-quality references.

BLEU is widely used in machine translation research and development to rapidly evaluate and compare the performance of different models or systems during training and experimentation. It provides an efficient, automated benchmark, enabling iterative improvement. However, while useful for system-level comparison, it correlates imperfectly with human judgments of fluency and adequacy and is best used alongside human evaluation.
