How is the BLEU score calculated?
BLEU (Bilingual Evaluation Understudy) calculates the similarity between a machine-generated translation and one or more high-quality human reference translations. It generates a score between 0 and 1, where 1 indicates perfect alignment.
It computes precision for matching n-grams (typically n = 1 to 4) between the candidate translation and the references. This "modified precision" clips each candidate n-gram's count at the maximum number of times it appears in the reference, so repeated n-grams cannot inflate the score. To guard against artificially high scores from very short outputs, BLEU applies a brevity penalty. Finally, the individual n-gram precisions are combined via a geometric mean, typically with equal weights.
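The steps above can be sketched in a few lines of Python. This is a simplified, sentence-level illustration with a single reference and equal weights, not a full implementation (production BLEU, e.g. sacreBLEU, is corpus-level and handles multiple references and smoothing):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: one reference, equal weights."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Modified precision: each candidate n-gram is credited at most
        # as many times as it appears in the reference (count clipping).
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        if overlap == 0:
            return 0.0  # geometric mean is 0 if any precision is 0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / max_n)
```

A perfect match scores 1.0, a candidate sharing no 4-grams with the reference scores lower, and degenerate repetitive output (e.g. "the the the the") is driven to 0 by count clipping.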
BLEU serves as a fast, consistent, and scalable automated metric for evaluating machine translation system output quality. It helps track progress during model development and compare different systems efficiently. While valuable, it primarily measures surface-level n-gram overlap and correlates imperfectly with human judgments of fluency and meaning; it should typically be used alongside human evaluation.