Translation Quality Metrics

Asia Online encourages free distribution of the Asia Online Translation Quality Evaluation Tool, as this helps to ensure that reliable and accurate tools for measuring BLEU and F-Measure are available to all who need them. We encourage comparisons across the industry against any translation engine; the tool can perform up to 8 comparisons at one time. An update is currently being developed to support METEOR, NIST and TER, which will make all the key industry quality metrics available in one tool for the first time.

NOTE:

If you are comparing BLEU scores or F-Measure scores between different translation systems, it is essential that you use the same measurement/evaluation tool for all data sets. Do not use different tools for different data sets. Not all BLEU or F-Measure implementations are identical, and different tools may report different results for the same data.

Asia Online provides a free BLEU and F-Measure tool to evaluate translations. It can be downloaded from our Tools and Software Downloads page.

A solid discussion of both BLEU and F-Measure is available in the research paper “Evaluation of Machine Translation and its Evaluation” by Joseph P. Turian, Luke Shen, and I. Dan Melamed of New York University. It can be found at: http://nlp.cs.nyu.edu/publication/papers/turian-summit03eval.pdf

BLEU Definition

Measuring translation quality is difficult because there is no absolute way to measure how “correct” a translation is. The most common approach is to compare the output of automated translation to a human translation of the same document. The problem is that one human translator will translate the document significantly differently from another. This inconsistency in the human reference translations causes problems when they are used to measure the quality of an automated translation solution.

A document translated by an automated software solution may have 60% of its words overlap with one translator’s translation and only 40% with another’s. Even though both human reference translations can be technically correct, the one with the 60% overlap yields a higher “quality” score for the automated translation than the other. Therefore, although humans are the true test of correctness, they do not provide an entirely objective and consistent measurement of quality.
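The effect can be illustrated with a toy word-overlap count. The sentences below are invented for illustration, and `word_overlap` is a simplified stand-in for a real metric, not the calculation performed by the Asia Online tool.

```python
from collections import Counter


def word_overlap(candidate: str, reference: str) -> float:
    """Fraction of candidate words that also occur in the reference."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    # Counter intersection keeps, per word, the smaller of the two counts.
    shared = sum((Counter(cand) & Counter(ref)).values())
    return shared / len(cand) if cand else 0.0


# Hypothetical machine output and two equally valid human translations.
machine_output = "the meeting starts at noon"
translator_a = "the meeting begins at midday"
translator_b = "our session begins at noon"

print(word_overlap(machine_output, translator_a))  # 0.6
print(word_overlap(machine_output, translator_b))  # 0.4
```

The same machine output looks better or worse depending only on which human reference it happens to be compared with.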

The BLEU metric scores a translation on a scale of 0 to 1. The closer the score is to 1, the more the output overlaps with a human reference translation and thus the better the system is judged to be. In a nutshell, BLEU measures how many words overlap, giving higher scores to sequences of matching words. For example, a string of four words in the translation that matches the human reference translation (in the same order) has a positive impact on the BLEU score and is weighted more heavily (and scored higher) than a one- or two-word match.

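  • The scoring algorithm clips the count of each word, so that unnecessarily repeating high-frequency words like “the” earns no extra credit, and it applies a brevity penalty to translations that are shorter than the reference.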
  • Studies have shown that there is a high correlation between BLEU and human judgments of quality when properly used.
  • BLEU scores are often stated on a scale of 0 to 100 to simplify communication, but they should not be confused with a percentage of accuracy.
  • Even two competent human translations of the exact same material may only score in the 60 or 70 range if they use different vocabulary and phrasing.
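
For reference, the definition given by Papineni et al. (2002) can be written as follows. The symbols are notation chosen here: p_n is the clipped n-gram precision, c is the total length of the machine output, r is the reference length, and N = 4 with uniform weights w_n = 1/N is the usual setting.

```latex
p_n = \frac{\sum_{g} \min\!\big(\mathrm{Count}_{\text{output}}(g),\ \max_{\text{ref}} \mathrm{Count}_{\text{ref}}(g)\big)}
           {\sum_{g} \mathrm{Count}_{\text{output}}(g)}
\qquad \text{(sums over the distinct } n\text{-grams } g \text{ of the output)}

\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}

\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left( \sum_{n=1}^{N} w_n \log p_n \right)
```

As a worked case of clipping: for the output “the the the the the the the” and the reference “the cat is on the mat”, the word “the” occurs 7 times in the output but only 2 times in the reference, so p_1 = min(7, 2) / 7 = 2/7.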

To conduct a BLEU measurement, the following data is necessary:
  • One or more human reference translations. (This should be data that has NOT been used as training data in building the system and ideally should be unknown to the developer. It is generally recommended that at least 100 to 1000 sentences be used.)
  • Automated translation output of the exact same source data set.
  • This utility or one like it that performs the comparison and calculation for you.
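
The comparison and calculation that such a utility performs can be sketched in a few lines. This is a minimal illustration of corpus-level BLEU as defined by Papineni et al. (2002), not the Asia Online tool: the function name `bleu`, the whitespace tokenization, and the absence of smoothing are all assumptions made here.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidates, references, max_n=4):
    """Unsmoothed corpus BLEU on a 0-to-1 scale.

    candidates: list of token lists (machine output, one per segment)
    references: list of lists of token lists (human references per segment)
    """
    matches = [0] * max_n
    totals = [0] * max_n
    cand_len = ref_len = 0
    for cand, refs in zip(candidates, references):
        cand_len += len(cand)
        # Use the reference length closest to the candidate (ties: shorter).
        ref_len += min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
        for n in range(1, max_n + 1):
            cand_counts = ngrams(cand, n)
            max_ref = Counter()
            for r in refs:
                max_ref |= ngrams(r, n)  # per n-gram maximum over references
            # Clipping: an n-gram is credited at most max_ref[g] times.
            matches[n - 1] += sum((cand_counts & max_ref).values())
            totals[n - 1] += sum(cand_counts.values())
    if min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives zero BLEU
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return brevity * math.exp(log_precision)


# Hypothetical one-segment test set with two human references.
hyp = ["the cat is sitting on the mat".split()]
ref_a = "the cat sat on the mat".split()
ref_b = "a cat is sitting on a rug".split()

print(bleu(hyp, [[ref_a]]))         # no 4-gram matches the single reference
print(bleu(hyp, [[ref_a, ref_b]]))  # the second reference supplies more matches
```

On a single short segment the unsmoothed score collapses to zero as soon as one n-gram order has no match, which is one reason a test set of many sentences is recommended.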

As would be expected, using multiple human reference translations generally results in higher scores, as the SMT output has more human variations to match against. NIST (the National Institute of Standards & Technology) uses BLEU as an approximate measure of quality in its annual MT competitions, with four human reference sets to ensure that some of the variance in human translation is captured and thus allow more accurate quality evaluations of the MT alternatives.

What is BLEU useful for?

Asia Online provides an environment that allows users to make many adjustments while developing an SMT translation system. Adding new data is often beneficial, but sometimes it has a negative effect. Users therefore need to be able to measure quality quickly and regularly to confirm that their changes are in fact improving the system.

Competent and dispassionate human judgment is always the best gauge of a system’s translation quality. However, users and developers need immediate feedback on development strategies, so using human translators for every test is not an efficient solution. The nature of developing an SMT system is such that the developers will experiment with many different approaches and data combinations to find the one that produces the best results.

During the development process, an automatic test is necessary to quickly see the impact of a strategy. This utility measures BLEU, and in time other metrics, to provide quick feedback on development strategies and the current quality of an SMT system. BLEU gives developers a way “to monitor the effect of daily changes to their systems in order to weed out bad ideas from good ideas” (Papineni et al., 2002).

When used to evaluate the relative merit of different system-building strategies, BLEU can be quite effective: it provides very quick feedback, which enables SMT developers to refine the translation systems they are building and to keep improving quality over the long term.

What is BLEU not useful for?

BLEU scores are always tied directly to a specific “Test Set” and a specific language pair. Thus, BLEU should not be used as an absolute measure of translation quality, because the score can vary even for one language depending on the test set and subject domain. In most cases, comparing BLEU scores across different languages is meaningless unless very strict protocols have been followed.

Because of this, Asia Online always uses human translators to measure fluency and verify the accuracy of the systems. Most industry leaders also vet BLEU scores with human assessments before production use.

Problems with BLEU

There are several criticisms of BLEU that should also be understood if you are to use the metric effectively. BLEU measures only surface word similarity: it looks for and measures the extent to which word clusters in two documents are identical. Accurate translations that use different words may score poorly because they have no match in the human reference. BLEU has no understanding of paraphrases and synonyms, so scores can be somewhat misleading in terms of overall accuracy. Also, nonsensical language that contains the right phrases in the wrong order can score high. These issues are discussed further at http://www.theregister.co.uk/2007/05/15/google_translation/page2.html

An academic critique of BLEU that clearly points out many of the shortcomings of the metric is available at http://www.iccs.inf.ed.ac.uk/~miles/papers/eacl06.pdf
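
Both weaknesses can be reproduced with the clipped n-gram precision that BLEU is built on. The sketch below is illustrative only: the sentences are invented, and `ngram_precision` is a hypothetical helper that computes a single precision value, not a full BLEU score.

```python
from collections import Counter


def ngram_precision(candidate: str, reference: str, n: int) -> float:
    """Clipped n-gram precision of a candidate against one reference."""
    def grams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = grams(candidate), grams(reference)
    total = sum(cand.values())
    # Counter intersection clips each n-gram at its count in the reference.
    return sum((cand & ref).values()) / total if total else 0.0


reference = "the quick brown fox jumps over the lazy dog"
paraphrase = "a fast brown fox leaps over a sleepy dog"    # accurate, different words
scrambled = "over the lazy dog the quick brown fox jumps"  # right phrases, wrong order

for name, text in [("paraphrase", paraphrase), ("scrambled", scrambled)]:
    print(name, ngram_precision(text, reference, 1), ngram_precision(text, reference, 2))
```

The faithful paraphrase matches only 4 of its 9 words and 1 of its 8 word pairs, while the scrambled sentence matches every word and 7 of its 8 word pairs.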

Despite all these shortcomings, the BLEU metric is still a very useful tool for practitioners engaged in the difficult task of creating automated translation systems that are continually improving. Careful and informed use of BLEU can drive the development and evolution of systems and allow researchers to test many different hypotheses to determine whether they favorably affect the performance of a translation engine’s output.

Guidelines to Interpreting BLEU
  • Small differences in BLEU score are more meaningful when scores are low. Thus the difference between 20 and 22 will be much clearer than the difference between 70 and 72.
  • Very small differences (2 or 3 points) in BLEU scores will sometimes be meaningless, and output should always be examined for reasonableness.
  • The system with the highest BLEU score is not necessarily the best system, especially if competing systems score very close to it.
  • BLEU scores are very closely related to the Test Sets used and only reflect quality in terms of the Test data used.
  • The Test Set needs to be similar to the new material that the system will be used to translate. If it is not, the BLEU scores will not be representative of the likely results.

F-Measure Definition

Like BLEU, the F-Measure metric evaluates machine translation quality by comparing a candidate file with at least one reference file; it bases its score on precision and recall. It was developed at New York University.

The precision and recall scores are based on maximum matching sets of tokens between the candidate file and the different reference files.

The advantage over the popular BLEU is that there is both a precision and a recall value. The arbitrary choice of n-gram length is neatly avoided by taking maximum matching sets. However, this introduces another arbitrary setting: the exponent used when calculating maximum matching sets. Usually an exponent of 2 (squaring) produces good, comparable results.
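
As a rough illustration of precision and recall computed from a matching set, the sketch below handles only the simplest case, an exponent of 1 with word order ignored, where the maximum matching size is just the number of words the two texts share. The function name and sentences are invented here; rewarding longer contiguous runs through a larger exponent is not modeled.

```python
from collections import Counter


def precision_recall(candidate: str, reference: str) -> tuple[float, float]:
    """Precision and recall from a word-level maximum matching (exponent 1)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    # Each reference word can be matched by at most one candidate word,
    # so the matching size is the per-word minimum of the two counts.
    match_size = sum((Counter(cand) & Counter(ref)).values())
    precision = match_size / len(cand) if cand else 0.0
    recall = match_size / len(ref) if ref else 0.0
    return precision, recall


p, r = precision_recall("the cat sat on the mat today", "the cat sat on the mat")
print(p, r)  # precision 6/7, recall 1.0
```

Precision falls when the candidate contains extra words, and recall falls when it omits words from the reference, which is the information a single BLEU value does not separate.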

F-measure is the weighted harmonic mean of precision and recall. The formula of F-measure is as follows:

As the alpha value increases, the weight of recall increases in the measure.
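
One common parameterization of the weighted harmonic mean that is consistent with this behavior is shown below. It is an assumption and may differ from the exact form the tool uses; P denotes precision, R denotes recall, and alpha = 1 gives the balanced F-measure.

```latex
F_\alpha = \frac{(1 + \alpha)\, P\, R}{\alpha P + R}
\qquad\Longleftrightarrow\qquad
\frac{1}{F_\alpha} = \frac{1}{1+\alpha}\cdot\frac{1}{P} + \frac{\alpha}{1+\alpha}\cdot\frac{1}{R}

% Worked example with P = 0.8 and R = 0.5:
F_1 = \frac{2 \cdot 0.8 \cdot 0.5}{0.8 + 0.5} = \frac{0.8}{1.3} \approx 0.615
\qquad
F_4 = \frac{5 \cdot 0.8 \cdot 0.5}{4 \cdot 0.8 + 0.5} = \frac{2.0}{3.7} \approx 0.541
```

Under this form, raising alpha from 1 to 4 moves the score from about 0.615 toward the recall value of 0.5.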
