
Posts

Showing posts with the label MT evaluation

DeepL is a unicorn. How good are its machine translations?

With machine translation provider DeepL having closed its latest funding round, it is a good time to see how DeepL's machine translations rank in the MT Decider Benchmark. We benchmarked four large online MT providers, Amazon, DeepL, Google and Microsoft, with news-domain data in 23 language pairs/46 language directions (we left out the MT Decider language pairs English↔Arabic and English↔Korean, which only have transcribed-speech test data, to keep the comparison apples-to-apples). Using the evaluation metric COMET, DeepL ranks first for 24 of the 46 language directions! Google Translate is a close second, ranking first for 19 of the 46 language directions. This is an impressive result given the competition. Yet DeepL doesn't rank the highest in the Q4/2022 MT Decider Index, which reflects which provider is best across the evaluated language pairs. Google Translate does. Why? We use ranked voting to calculate our index, so it matters which provider ranks 1st, 2nd, 3rd and 4th. D...
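The excerpt doesn't spell out which voting scheme the index uses, so here is a minimal sketch of one common ranked-voting method, a Borda count, purely to illustrate why 2nd, 3rd and 4th places matter for the cross-language ranking; the per-direction rankings in the example are invented.

```python
from collections import defaultdict

# Hypothetical per-language-direction rankings, best provider first.
# The real benchmark covers 46 directions and its exact voting scheme may differ.
rankings = [
    ["DeepL", "Google", "Microsoft", "Amazon"],
    ["Google", "DeepL", "Amazon", "Microsoft"],
    ["Google", "Microsoft", "DeepL", "Amazon"],
]

def borda_index(rankings):
    """Aggregate per-direction rankings into one cross-language ranking.

    With four providers, each 1st place earns 3 points, 2nd place 2,
    3rd place 1 and 4th place 0 (a simple Borda count).
    """
    points = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, provider in enumerate(ranking):
            points[provider] += n - 1 - position
    return sorted(points.items(), key=lambda item: item[1], reverse=True)

print(borda_index(rankings))
```

A provider that wins many directions narrowly can still lose the index to one that places consistently high everywhere, which is exactly the effect described above.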

MT Decider Benchmark Q4/2022 now available

The Q4/2022 edition of the MT Decider Benchmark, with the latest comparison of machine translation quality from Amazon Translate, DeepL, Google Translate and Microsoft Translator, is out! With the addition of English↔Korean the benchmark now covers 25 language pairs. Quote handling differences by online MT services significantly distort BLEU score results. For the benchmark we now apply quote normalization before calculating BLEU scores. Now the metrics COMET and BLEU agree more often on the best service for a language direction, allowing you to confidently choose the best MT service. We kept the test data fresh by updating to 2021 data where available. We used the latest evaluation libraries, sacreBLEU 2.3.1 and COMET 1.1.3, incorporating the latest innovations and bug fixes from academic research. Instead of naming the benchmark with the quarter when the machine translations were captured, we now name it with the quarter when the benchmark reports are compiled. This is why MT Decider ...

mteval - an automation library for automatic machine translation evaluation

For creating the MT Decider Benchmark, Polyglot needed a library to automate the evaluation of machine translations for many different languages, with many MT services, with multiple datasets, using multiple automatic evaluation metrics. The tools sacreBLEU and COMET provide a great basis for scoring machine translations against human reference translations with the most popular metrics - BLEU, TER, chrF, chrF++ and COMET - but their focus is running evaluations from the command line. We needed automation from Python to run evaluations in Jupyter Notebook environments like Google Colaboratory, which offers free-to-use GPUs for COMET evaluation. We also wanted to translate the test sets with major online MT services and persist the test sets, machine translations and evaluation results. The result is the Python library mteval, with the source available under the Apache License 2.0 on GitHub. Feedback is welcome. The plan for December is to publish the MT Decider Scorer Jup...
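As an illustration of the kind of call mteval automates, here is a minimal sketch of scoring segments with the COMET Python API directly from a notebook; the model name is an assumption that may need adjusting to your installed COMET version, and the segments are invented examples (this is not mteval's own API).

```python
from comet import download_model, load_from_checkpoint

# Download and load a reference-based COMET model (model name assumed;
# adjust to the COMET release you have installed).
model_path = download_model("wmt20-comet-da")
model = load_from_checkpoint(model_path)

# COMET scores each segment from source, MT output and human reference.
data = [
    {
        "src": "Der Vertrag wurde gestern unterzeichnet.",
        "mt": "The contract was signed yesterday.",
        "ref": "The agreement was signed yesterday.",
    }
]

# gpus=1 uses the free Colab GPU; set gpus=0 to run on CPU.
output = model.predict(data, batch_size=8, gpus=1)
print(output)  # segment-level scores plus a corpus-level system score
```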

“May I quote you?” – why quotation marks are difficult for machine translation and problematic for BLEU scores

Today, a post about typography that nevertheless has a big impact on MT quality evaluation ... Why worry about quotation marks? You might wonder why you should care about how quotation marks are handled in machine translation and their impact on automatic MT quality evaluation. Isn't quoting an easy task in one language that can be transferred in translation with some simple rules? As we will see, it isn't an easy task - wrong quotation marks can significantly distort BLEU scores and thereby the relative ranking of MT systems. Quotation marks in English Back in the days of character encoding with ASCII we had one character for double quotes: " (U+0022: Quotation Mark) and another character for single quotes: ' (U+0027: Apostrophe), with the latter doing double duty as an apostrophe and single quote. With these we can quote a sentence like: "She said: 'It's getting late.'" This is quite ugly typographically, which is...
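As a minimal sketch of what quote normalization before scoring can look like (the character mapping below is an assumption for illustration, not the exact rules used in the benchmark):

```python
# Map common typographic quotation marks to plain ASCII before BLEU scoring,
# so that systems are not penalized merely for emitting different quote styles.
QUOTE_MAP = {
    "\u201c": '"', "\u201d": '"', "\u201e": '"',  # curly and low double quotes
    "\u00ab": '"', "\u00bb": '"',                 # guillemets
    "\u2018": "'", "\u2019": "'", "\u201a": "'",  # curly and low single quotes
    "\u2039": "'", "\u203a": "'",                 # single guillemets
}

def normalize_quotes(text: str) -> str:
    """Replace typographic quotation marks with their ASCII equivalents."""
    return text.translate(str.maketrans(QUOTE_MAP))

print(normalize_quotes("\u201eSie sagte: \u201aEs wird sp\u00e4t.\u2018\u201c"))
# -> "Sie sagte: 'Es wird spät.'"  (quotes normalized, other characters untouched)
```

Applying the same normalization to hypotheses and references keeps the n-gram matching focused on the words rather than the punctuation convention.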

MT Decider Benchmark: BLEU Differences by Language Pair

In the launch post for the MT Decider Benchmark I noted that the difference in machine translation quality, as measured by the BLEU score, can be as much as 54%, or more than 9 BLEU points, between the evaluated online MT services Amazon Translate, DeepL, Google Translate, and Microsoft Translator. But what are the differences for the individual language pairs/translation directions? Here is the chart of BLEU score differences by language pair, sorted from largest to smallest score difference: Somewhat unsurprisingly, online MT services differ most in quality for languages that are morphologically complex and/or low-resource. One interesting observation is that for every language pair the difference between the best and worst online MT service is at least 1.39 BLEU points. A score difference of over one BLEU point is considered significant in academic research. Therefore it is definitely worth checking whether you are using the best online MT service(s) for your lan...

The Best MT Services for Your Language Pairs

Affordable, High-quality Translations with Online Machine Translation Services Compared to a few years ago, we live in fortunate times when we want to translate from one human language into another using machines: there are many affordable online machine translation (MT) services available that deliver high-quality translations. MT Quality Matters For perishable, low-impact content, web publishers can publish machine-translated text directly in the languages they need. When high-quality, human-edited translations are needed, translation providers can use machine translations as draft translations for post-editing for increased speed and efficiency - provided that the machine translations are of sufficient quality. But MT quality varies by as much as 54%, or more than 9 BLEU points(!), between different MT services for some language pairs. This is a huge difference! What are the Best MT Services for Your Language Pairs? How then can MT users determine which MT providers offer the best qu...

MT Decider Index Q2/2022

As a machine translation user you want to use the online MT service with the highest translation quality for each language pair you are translating. But evaluating ever-changing MT services across many language pairs is hard! Polyglot Technology solves this challenge by producing the MT Decider Benchmark, a vendor-independent, transparent, and up-to-date evaluation of online MT services, every quarter, for 24 language pairs. The MT Decider Index is a cross-language ranking distilled from the MT Decider Benchmark. This is the MT Decider Index for the second quarter of 2022: 1. Google Translate, 2. Microsoft Translator, 3. DeepL, 4. Amazon Translate. The MT Decider Benchmark Q2/2022 is now available. To learn about TAUS DeMT Evaluate, an evaluation service and report jointly created by TAUS and Polyglot Technology, the MT Decider Index and the MT Decider Benchmark, please attend the Nimdzi Live webcast on August 3rd with Anne-Maj van der Meer (TAUS), Amir Kamran (TAUS), myself, Achim ...

An MT Journey in 11 Easy Steps

When I talk to people who are new to machine translation (MT), I often get the question of how they can determine whether MT can really help them with their translation needs, be it MT for post-editing or raw MT. This got me thinking about which steps are essential to choose an MT solution that satisfies these translation needs from a linguistic-quality and business perspective. I came up with this workflow, which can serve as a guide through your MT journey. I will describe each of the steps in detail below. The blue steps are required, while the green ones are optional, depending on the quality goals and use case. 1. Choosing an MT Project This first step in the MT Journey is less defined than the ones that follow. I believe that at the beginning of the journey it helps to broaden the perspective to clearly identify the destination of the journey. Deep learning is the foundational technology behind what is called artificial intelligence (AI) these days. Deep learning is what powers neural ma...

Healthcare MT with Google AutoML Translation

To help people make the right choices for their healthcare, the U.S. Centers for Medicare & Medicaid Services provide the site https://www.healthcare.gov/ as an information hub. The Centers try to reach many language communities, which is especially important for an aging population. With Spanish being the native language of roughly 13% of the US population, the most effort is put into a Spanish version of the site - https://www.cuidadodesalud.com/ . Reading information on a website is only the first step toward getting health insurance - there are navigators, assisters, partners, agents and brokers who assist in signing up for insurance. Wouldn't it be great if these people had a customized MT system available to communicate with people who need insurance? Such an MT system could also provide initial translations for English content that is not (yet) translated, as well as post-editing drafts for translators translating healthcare/health insurance information for this site or oth...
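As a rough sketch of how such a customized system could be called from an application, here is a minimal example using the google-cloud-translate v3 Python client with a custom AutoML Translation model; the project ID, model ID and example sentence are placeholders.

```python
from google.cloud import translate_v3 as translate

def translate_with_custom_model(text: str, project_id: str, model_id: str) -> str:
    """Translate English healthcare text to Spanish with a custom AutoML model.

    project_id and model_id are placeholders; AutoML Translation models
    live in a specific location (us-central1 at the time of writing).
    """
    client = translate.TranslationServiceClient()
    parent = f"projects/{project_id}/locations/us-central1"
    model = f"{parent}/models/{model_id}"

    response = client.translate_text(
        request={
            "parent": parent,
            "contents": [text],
            "mime_type": "text/plain",
            "source_language_code": "en",
            "target_language_code": "es",
            "model": model,
        }
    )
    return response.translations[0].translated_text

# Example call with placeholder identifiers:
# print(translate_with_custom_model(
#     "You may qualify for a premium tax credit.", "my-project", "my-model-id"))
```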

Bilingual Evaluation Understudy? A Practical Guide to MT Quality Evaluation with BLEU

Automatic Metrics for Machine Translation Whether you publish machine translations directly or use them as post-editing input, evaluating their quality is essential. In this blog post we evaluate the quality of the translations in isolation against available human reference translations, rather than in a larger context, e.g. in the context of business metrics. Judging the quality of machine translations is best done by humans. However, this is slow, expensive and not easily repeatable each time an MT system is updated. Automatic metrics provide a good way to repeatedly judge the quality of MT output. BLEU (Bilingual Evaluation Understudy) has been the prevalent automatic metric for close to two decades and will likely remain so, at least until document-level evaluation metrics get established (see Läubli, Sennrich and Volk: Has Machine Translation Achieved Human Parity? A Case for Document-level Evaluation). If you have been reading about machine translatio...
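For a concrete picture of what a BLEU evaluation looks like in practice, here is a minimal sketch using sacreBLEU's Python API; the hypothesis and reference sentences are invented examples.

```python
from sacrebleu.metrics import BLEU

# Hypothetical MT output and human reference translations, aligned line by line.
hypotheses = [
    "The contract was signed yesterday.",
    "Prices rose by three percent.",
]
references = [
    "The agreement was signed yesterday.",
    "Prices increased by three percent.",
]

bleu = BLEU()
# corpus_score expects a list of hypothesis strings and a list of
# reference streams (here a single reference per segment).
score = bleu.corpus_score(hypotheses, [references])
print(score)  # prints the corpus-level BLEU score with its signature details
```

Because BLEU counts overlapping n-grams, the corpus-level score is what you compare across MT services for a given test set, not individual sentence scores.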

Translating Zoellick - Aligning PDF Files to Create Evaluation Data for MT

Imagine we are a translation provider for an organization like the World Bank, we have heard about this new technology of neural machine translation, and we would like to try out how well it works for the materials we have to translate. We have access to some translated PDF files in English and German from years past, but unfortunately no access to a translation memory. To evaluate machine translation objectively with automated metrics like BLEU we need about 1,000 to 2,000 aligned, high-quality translated sentences that are representative of the material we intend to translate. In this blog post we create such evaluation data from the PDFs by extracting the text and manually aligning the sentences. In the next blog post we will use this evaluation data to evaluate the translation quality of different MT systems using automated metrics. Downloading World Bank Open Knowledge Repository PDF Files Most of the World Bank Open Knowledge Repository is generously licensed under Creative C...
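As a minimal sketch of the text-extraction step (the post itself walks through extraction and manual alignment in detail; the file names here are placeholders, and pdfminer.six plus NLTK are just one possible tool choice):

```python
from pdfminer.high_level import extract_text
import nltk

nltk.download("punkt")  # sentence tokenizer models

def pdf_to_sentences(pdf_path: str, txt_path: str, language: str = "english") -> None:
    """Extract text from a PDF and write one sentence per line for manual alignment."""
    text = extract_text(pdf_path)
    # Collapse hard line breaks from the PDF layout before sentence splitting.
    text = " ".join(text.split())
    sentences = nltk.sent_tokenize(text, language=language)
    with open(txt_path, "w", encoding="utf-8") as out:
        out.write("\n".join(sentences))

# Placeholder file names for the English and German sides of one report:
# pdf_to_sentences("report_en.pdf", "report_en.txt", "english")
# pdf_to_sentences("report_de.pdf", "report_de.txt", "german")
```

With one sentence per line in each language, the two files can then be aligned side by side before scoring with automated metrics.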