Last Updated : 23 Jul, 2025
Neural Machine Translation (NMT) is a standard task in NLP that involves translating a text from a source language to a target language. BLEU (Bilingual Evaluation Understudy) is a score used to evaluate the translations performed by a machine translator. In this article, we'll see the mathematics behind the BLEU score and its implementation in Python.
What is BLEU Score?

As stated above, the BLEU score is an evaluation metric for machine translation tasks. It is calculated by comparing the n-grams of a machine-translated sentence to the n-grams of human reference translations. It has generally been observed that the BLEU score decreases as sentence length increases, though this can vary depending on the model used for translation.
Mathematical Expression for BLEU Score

Mathematically, the BLEU score is given as follows:

\text{BLEU Score} = BP \cdot \exp\left(\sum_{i=1}^{N} w_i \ln(p_i)\right)
Here,
- BP stands for Brevity Penalty
- w_i is the weight for n-gram precision of order i (typically weights are equal for all i)
- p_i is the n-gram modified precision score of order i.
- N is the maximum n-gram order to consider (usually up to 4)
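The combination step above can be sketched directly in Python. This is only an illustrative sketch, not library code: `bleu_from_components` is a hypothetical helper name, and skipping zero-weight terms is an assumption made so that a zero precision paired with a zero weight does not force \ln(0).

```python
import math

def bleu_from_components(bp, precisions, weights):
    """BLEU = BP * exp(sum_i w_i * ln(p_i))."""
    # Skip zero-weight terms so a zero precision with a zero
    # weight does not trigger log(0).
    s = sum(w * math.log(p) for w, p in zip(weights, precisions) if w > 0)
    return bp * math.exp(s)

# Values from the worked example in this article:
# BP = 1, p1 = 2/3, p2 = 2/5, weights (0.25, 0.25, 0, 0)
print(bleu_from_components(1.0, [2/3, 2/5, 0.0, 0.0], (0.25, 0.25, 0, 0)))  # ≈ 0.7186
```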
Modified n-gram Precision (p_i)

The modified precision p_i is the ratio between the number of n-grams in the candidate translation that match n-grams in any of the reference translations (with each match count clipped) and the total number of n-grams in the candidate translation:

p_i = \frac{\text{Count}_{\text{clip}}(\text{matches}_i,\ \text{max-ref-count}_i)}{\text{candidate-n-grams}_i}
Here,
- \text{Count}_{\text{clip}} is a function that clips the number of matched n-grams ( \text{matches}_i ) by the maximum count of that n-gram across all reference translations ( \text{max-ref-count}_i ).
- matches_i is the number of n-grams of order i that match exactly between the candidate translation and any of the reference translations.
- \text{max-ref-count}_i is the maximum number of occurrences of the specific n-gram of order i found in any single reference translation.
- \text{candidate-n-grams}_i is the total number of n-grams of order i present in the candidate translation.
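The clipped-count computation described above can be sketched with Python's `collections.Counter`. The helper names `ngrams` and `modified_precision` are hypothetical; the example sentences are the candidate and references used throughout this article.

```python
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision of order n."""
    cand_counts = Counter(ngrams(candidate, n))
    # For each n-gram, the maximum count in any single reference
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

candidate = ["the", "picture", "the", "picture", "by", "me"]
references = [["the", "picture", "is", "clicked", "by", "me"],
              ["this", "picture", "was", "clicked", "by", "me"]]
print(modified_precision(candidate, references, 1))  # 4/6
print(modified_precision(candidate, references, 2))  # 2/5
```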
Brevity Penalty (BP)

The brevity penalty penalizes translations that are shorter than the reference translations. The mathematical expression for the brevity penalty is:

BP = \begin{cases} 1 & \text{if } c \geq r \\ \exp\left(1 - \frac{r}{c}\right) & \text{if } c < r \end{cases}

Here,
- c is the length of the candidate (machine) translation
- r is the effective reference length (the length of the reference translation closest in length to the candidate)
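A minimal sketch of the brevity penalty in Python; `brevity_penalty` is a hypothetical helper name, and lengths are counted in tokens.

```python
import math

def brevity_penalty(candidate_len, reference_len):
    """BP = 1 if the candidate is at least as long as the reference,
    else exp(1 - reference_len / candidate_len)."""
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)

print(brevity_penalty(6, 6))  # 1.0 (no penalty, lengths match)
print(brevity_penalty(4, 6))  # exp(1 - 6/4) ≈ 0.61 (short candidate is penalized)
```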
How to Compute BLEU Score?

For a better understanding of the BLEU score calculation, let us take an example. The following is a case of French-to-English translation, where the machine translation is compared against two human reference translations:

- Machine translation (candidate): "the picture the picture by me"
- Reference 1: "the picture is clicked by me"
- Reference 2: "this picture was clicked by me"
We can clearly see that the translation done by the machine is not accurate. Let's calculate the BLEU score for the translation.
Unigram Modified Precision

For n = 1, we'll calculate the unigram modified precision:

| Unigram | Count in MT | Max Count in Ref | Clipped Count |
|---------|-------------|------------------|---------------|
| the     | 2           | 1                | 1             |
| picture | 2           | 1                | 1             |
| by      | 1           | 1                | 1             |
| me      | 1           | 1                | 1             |

Here the unigrams (the, picture, by, me) are taken from the machine-translated text. Count in MT is the frequency of each unigram in the machine-translated text, and Clipped Count is that count clipped by the unigram's maximum frequency in any single reference text.

P_1 = \frac{\text{Clipped Count}}{\text{Count in MT}} = \frac{1+1+1+1}{2+2+1+1} = \frac{4}{6} = \frac{2}{3}
Bigram Modified Precision

For n = 2, we'll calculate the bigram modified precision:

| Bigram      | Count in MT | Max Count in Ref | Clipped Count |
|-------------|-------------|------------------|---------------|
| the picture | 2           | 1                | 1             |
| picture the | 1           | 0                | 0             |
| picture by  | 1           | 0                | 0             |
| by me       | 1           | 1                | 1             |

P_2 = \frac{\text{Clipped Count}}{\text{Count in MT}} = \frac{1+0+0+1}{2+1+1+1} = \frac{2}{5}
Trigram Modified Precision

For n = 3, we'll calculate the trigram modified precision:

| Trigram             | Count in MT | Max Count in Ref | Clipped Count |
|---------------------|-------------|------------------|---------------|
| the picture the     | 1           | 0                | 0             |
| picture the picture | 1           | 0                | 0             |
| the picture by      | 1           | 0                | 0             |
| picture by me       | 1           | 0                | 0             |

P_3 = \frac{0+0+0+0}{1+1+1+1} = 0.0
4-gram Modified Precision

For n = 4, we'll calculate the 4-gram modified precision:

| 4-gram                  | Count in MT | Max Count in Ref | Clipped Count |
|-------------------------|-------------|------------------|---------------|
| the picture the picture | 1           | 0                | 0             |
| picture the picture by  | 1           | 0                | 0             |
| the picture by me       | 1           | 0                | 0             |

P_4 = \frac{0+0+0}{1+1+1} = 0.0
Computing Brevity Penalty

Now that we have computed all the precision scores, let's find the brevity penalty for the translation. The candidate translation and the references all have length 6, so with c = r = 6:

BP = \exp\left(1 - \frac{6}{6}\right) = 1
Computing BLEU Score

Finally, the BLEU score for the above translation is given by:

\text{BLEU Score} = BP \cdot \exp\left(\sum_{i=1}^{4} w_i \ln(p_i)\right)

We use the weights (0.25, 0.25, 0, 0); the trigram and 4-gram terms get zero weight, which drops them from the sum and avoids the undefined \ln(0). On substituting the values, we get:

\text{BLEU Score} = 1 \cdot \exp(0.25 \ln(2/3) + 0.25 \ln(2/5)) \approx 0.7186
Finally, we have calculated the BLEU score for the given translation.
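The substitution above can be verified with a couple of lines of Python:

```python
import math

# BP = 1, p1 = 2/3, p2 = 2/5, weights (0.25, 0.25, 0, 0);
# the zero-weight trigram and 4-gram terms are dropped
bleu = 1 * math.exp(0.25 * math.log(2 / 3) + 0.25 * math.log(2 / 5))
print(bleu)  # ≈ 0.7186
```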
BLEU Score Implementation in Python

Having calculated the BLEU score manually, we are now familiar with its mathematical workings. However, Python's NLTK library provides a built-in module for BLEU score calculation. Let's calculate the BLEU score for the same translation example, this time using NLTK.

Code:

```python
from nltk.translate.bleu_score import sentence_bleu

# Weights for unigram, bigram, trigram and 4-gram precision;
# the trigram and 4-gram weights are zero, matching the manual calculation
weights = (0.25, 0.25, 0, 0)

# Reference translations and the candidate (machine) translation
reference = [["the", "picture", "is", "clicked", "by", "me"],
             ["this", "picture", "was", "clicked", "by", "me"]]
predictions = ["the", "picture", "the", "picture", "by", "me"]

# Calculate BLEU score with the given weights
score = sentence_bleu(reference, predictions, weights=weights)
print(score)
```
Output:
0.7186082239261684
We can see that the BLEU score computed with NLTK matches the one computed manually (up to rounding). Thus, we have successfully calculated the BLEU score and understood the mathematics behind it.
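One caveat worth noting: with NLTK's default equal weights (0.25 for each order up to 4-grams), the zero trigram and 4-gram precisions collapse the score to effectively zero for short sentences like this one. NLTK's SmoothingFunction provides standard smoothing methods for exactly this situation. A sketch, using the same sentences:

```python
import warnings
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "picture", "is", "clicked", "by", "me"],
             ["this", "picture", "was", "clicked", "by", "me"]]
predictions = ["the", "picture", "the", "picture", "by", "me"]

# Default equal weights: the zero trigram/4-gram precisions drive the
# score to (effectively) zero, and NLTK emits a warning
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    score_default = sentence_bleu(reference, predictions)
print(score_default)

# method1 adds a small value to zero precision counts,
# giving a usable nonzero score instead
smooth = SmoothingFunction().method1
score_smooth = sentence_bleu(reference, predictions, smoothing_function=smooth)
print(score_smooth)
```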