
What are the shortcomings of BLEU?

BLEU is popular but unreliable for evaluating text quality. Its oversimplified approach cannot accurately assess meaning or fluency.

BLEU measures only surface n-gram overlap with reference translations, so it cannot assess semantic adequacy or grammatical correctness: a candidate can match many n-grams while still mistranslating key content. The brevity penalty unfairly penalizes legitimate translations that are shorter than their references. BLEU is also insensitive to word-order changes beyond its maximum n-gram size, sometimes rewarding nonsensical output, and it depends entirely on the availability of high-quality reference texts.
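The failure modes above can be demonstrated with a minimal sentence-level BLEU sketch (modified n-gram precision, geometric mean, brevity penalty; the example sentences are illustrative, not from any benchmark). An adequate paraphrase scores zero because it shares no higher-order n-grams with the reference, while a scrambled, nonsensical sentence achieves perfect unigram precision:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(cand, ref, n):
    """Clipped n-gram precision: candidate counts capped at reference counts."""
    cand_counts = Counter(ngrams(cand, n))
    ref_counts = Counter(ngrams(ref, n))
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

def bleu(cand, ref, max_n=4):
    """Sentence BLEU: geometric mean of 1..max_n precisions times brevity penalty."""
    precisions = [modified_precision(cand, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # any zero precision collapses the geometric mean
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # brevity penalty: < 1 when the candidate is shorter than the reference
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_avg)

ref = "the cat sat on the mat".split()
paraphrase = "a cat was sitting on a rug".split()  # adequate meaning, few n-gram matches
scramble = "mat the on sat cat the".split()        # nonsense, same unigrams as ref

print(bleu(paraphrase, ref))                   # 0.0: no bigram overlap at all
print(modified_precision(scramble, ref, 1))    # 1.0: unigram precision ignores order
```

Production toolkits such as sacreBLEU add smoothing so zero-match sentences do not score exactly zero, but the underlying insensitivity to meaning and order is the same.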

These limitations reduce BLEU's value in real-world NLP tasks. Its scores often misalign with human judgments, especially for nuanced or creative language. Despite its long history in research and commercial systems, developers increasingly supplement or replace it with semantics-aware metrics such as BERTScore, and with human evaluation.
