
Meta's large-scale study on language translation, the results are all 'routine'


In early July of this year, Meta AI released a new translation model called No Language Left Behind (NLLB).

As the name suggests, NLLB supports translation between any pair of 200 languages, and Meta AI has also open-sourced it. It can translate languages you may never have seen before, such as Luganda and Urdu.


  • Paper address: https://research.facebook.com/publications/no-language-left-behind/
  • Open source address: https://github.com/facebookresearch/fairseq/tree/nllb

However, this research has recently been called into question. Some believe that many of Meta AI's claims about NLLB are unfounded and misleading, and that the evaluation results are seriously flawed. The skeptics add that, using Meta AI's own evaluation methodology, it would be easy to obtain even higher numbers than those reported.

The skeptic is Benjamin Marie, a natural language processing research scientist well versed in translation technology. His criticism can be summarized in one sentence: Meta AI compared spBLEU and BLEU scores side by side.


Regarding this criticism, one researcher commented: spBLEU is a reasonable metric provided the text contains no spaces (as in Thai, for example), but comparing spBLEU with BLEU is definitely incorrect.


Arle Lommel replied to Benjamin Marie: This is a great point. It also taught me to be very cautious about machine learning research that lacks verification. What you found suggests that the problem gets complicated when people simply cite scores without controlling for how they were produced.


Vedanuj Goswami, one of the paper's authors, responded: "We 100% agree with the author that you cannot compare BLEU scores computed with different tokenizers. But the author's main argument, that most of the results in our paper are not comparable, is not true.

In our paper, Tables 30 and 31 use the same tokenizer for spBLEU evaluation (the FLORES-101 SPM tokenizer) specifically for comparability; we do not use the FLORES-200 SPM tokenizer. We describe this in detail in the caption of Table 30 and in Section 8.3.1. Similarly, Tables 35, 36, 37, and 38 all use comparable metrics/tokenizers for proper comparison. We have updated the paper accordingly.

In general, current machine translation evaluation methodology is not yet perfect, and different papers use different methods."


The evaluation method is flawed

First, let us make a simple analogy: Paul has 25 bananas and Bill has 30 tomatoes. Would you say Bill has 5 more bananas than Paul?

BLEU is like a banana, and spBLEU is like a tomato. Replace Paul with previous work and Bill with NLLB. We can now write something like this: previous work achieved 25 BLEU and NLLB achieved 30 spBLEU. Would you say NLLB is 5 BLEU points better than previous work?


With this analogy in mind, the content below should be easier to follow.

Earlier, Meta AI released a paper that comprehensively describes and evaluates NLLB. In the abstract, they claim that the model achieves a 44% BLEU improvement over the previous SOTA. In other words, NLLB would produce far better results than previous studies.

A 44% BLEU improvement over the previous SOTA is something rarely seen in the history of machine translation research, so this one sentence in the paper would represent real scientific progress. Some media outlets reported the claim directly and, without further verification, crowned Meta AI the leader in machine translation.

If Meta AI chooses to publish such a large-scale technical study, it should provide very reliable scientific evidence. Otherwise, claiming to do better than others without evidence only undermines the hard work that other research institutions have done and are still doing.

To explain the problems with NLLB's evaluation, Marie sets out to show how Meta AI was misled by its own results. Using simple examples from NLLB and similar examples he constructed himself, he demonstrates how easy it is to beat the SOTA under NLLB's flawed evaluation methodology. Finally, he identifies and explains the main errors in the evaluation.

Meta AI compared its model with more than 20 previous studies and concluded that NLLB significantly outperforms all of them. To make such a large number of comparisons feasible, they relied on automatic machine translation metrics, primarily BLEU and spBLEU.

BLEU is extremely popular in machine translation, despite its shortcomings.

For example, suppose we translate a French sentence from the FLORES-101 dataset into English using Google Translate. If you speak French, you will notice that the result is a very poor translation: grammatical errors, inconsistent terminology, and unnatural phrasing. In fact, since the dataset was created from English, Meta AI only evaluates machine translation when translating into English.


We can evaluate the translation by counting how many of its tokens also appear in a reference translation, where a token is defined as a sequence of characters separated by spaces. In Marie's article, every token sequence from the Google Translate output that also appears in the reference translation is highlighted in orange.
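
To make the token-matching idea concrete, here is a toy sketch in Python. It is a deliberate simplification (real BLEU uses clipped n-gram precisions up to 4-grams plus a brevity penalty, not plain unigram overlap), and the sentences are placeholders rather than the article's FLORES-101 example:

```python
# Toy illustration of space-separated tokenization and clipped token matching.
hyp = "the authorities did not say what caused the incident".split()
ref = "authorities have not said what caused the incident".split()

# Count hypothesis tokens that also occur in the reference, clipping each
# token's count at its frequency in the reference.
matches = sum(min(hyp.count(t), ref.count(t)) for t in set(hyp))
print(f"{matches} of {len(hyp)} hypothesis tokens match the reference")
```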


Counting only the matching tokens, the BLEU score works out to 50.8. On its own this number means nothing; it only becomes meaningful when compared with another BLEU score.

The key point to understand is that the score is computed over tokens, a detail that most machine translation research glosses over. The BLEU score here is calculated with SacreBLEU, which performs its own internal tokenization, essentially just adding spaces before punctuation; this is one of the most reliable and reproducible ways to compute BLEU. Meta AI, however, uses spBLEU.
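
For a sense of how this is computed in practice, here is a minimal sketch using the sacrebleu Python package with its default tokenization; the sentences are placeholders, not the article's example:

```python
# Minimal BLEU computation with SacreBLEU's default internal tokenization
# ('13a', which essentially adds spaces around punctuation).
import sacrebleu

hypotheses = ["The authorities did not say what caused the incident."]
references = [["Authorities have not said what caused the incident."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")
```

SacreBLEU also reports a "signature" recording the tokenizer and other settings, which is exactly what makes its scores reproducible and comparable across papers.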

So what is spBLEU? It is BLEU computed over a different tokenization: both the Google Translate output and the reference translation are split into subword units by a SentencePiece model.


spBLEU's tokenizer produces its tokens by breaking words into smaller fragments (the '_' marks attached to the tokens are unimportant here; try to ignore them). A direct consequence of spBLEU tokenization is that both the translation and the reference end up with more tokens. And since there are more tokens, the Google Translate output can be expected to match more tokens from the reference, so the score rises. Indeed, the spBLEU score here is 54.8.
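
To reproduce the effect Marie describes, one can score the exact same sentence pair under both tokenizations. A sketch, assuming a recent sacrebleu (>= 2.2), which ships a 'flores101' SentencePiece tokenizer for spBLEU (the SPM model is downloaded on first use):

```python
# Same translation, two tokenizations: default BLEU vs. spBLEU.
import sacrebleu

hypotheses = ["The authorities did not say what caused the incident."]
references = [["Authorities have not said what caused the incident."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)  # default '13a' tokenizer
spbleu = sacrebleu.corpus_bleu(hypotheses, references, tokenize="flores101")

# The translation has not changed, only the tokenization has, yet the
# two numbers will generally differ, with spBLEU typically higher.
print(f"BLEU   = {bleu.score:.1f}")
print(f"spBLEU = {spbleu.score:.1f}")
```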

That is 4 points higher than the BLEU score computed above with SacreBLEU's internal tokenization. So is the translation getting better?

Obviously not: the translation is unchanged. Comparing BLEU with spBLEU therefore makes no sense at all. BLEU and spBLEU process the Google Translate output and the reference translation differently, and only for evaluation purposes; they really are different metrics. If they were the same metric, we would not need different names for them. As we often read and hear in the machine translation research community, comparing translation quality using BLEU scores computed over different, or even nearly identical, tokenizations is not just unfair, it is simply invalid. If you want your research to be scientifically credible, you must compute your BLEU scores consistently, with exactly the same tokenization.

Meta AI claims that NLLB is far better than previous work because its spBLEU scores are consistently higher than previously published BLEU scores. But this proves little: for a given translation, getting an spBLEU score that is lower than the BLEU score is an extremely difficult task, since spBLEU is almost mechanically higher. What is even harder to understand is why, if the goal is simply the highest possible number, they did not go all the way and use a 'chrBLEU' metric.

With chrBLEU, every character of the Google Translate output and the reference translation becomes a token (in other words, spaces are inserted between characters).

The chrBLEU value then works out to 75.5, which is 20.7 points higher than spBLEU. By NLLB's evaluation standards, this would be a significant improvement and a new high point for machine translation, while the original Google Translate output remains unchanged.
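
Here is a sketch of that 'chrBLEU' idea: BLEU computed over characters, which Marie uses as a reductio ad absurdum rather than as an established metric. Spaces are inserted between characters by hand, and sacreBLEU's own tokenizer is disabled so that the character split is what gets matched; the sentences are placeholders:

```python
# BLEU over characters: every character becomes a token.
import sacrebleu

def char_tokenize(text: str) -> str:
    # Replace original spaces with '_' so they stay distinguishable tokens,
    # then insert a space between every character.
    return " ".join(text.replace(" ", "_"))

hypotheses = ["The authorities did not say what caused the incident."]
references = [["Authorities have not said what caused the incident."]]

chr_bleu = sacrebleu.corpus_bleu(
    [char_tokenize(h) for h in hypotheses],
    [[char_tokenize(r) for r in stream] for stream in references],
    tokenize="none",
)
print(f"'chrBLEU' = {chr_bleu.score:.1f}")  # far higher than BLEU on the same text
```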


Examples of errors in the paper

Now, let's look at specific examples from the NLLB evaluation.

Meta AI claims to have outperformed previous work by comparing its numbers with previously published figures. In the paper, the conclusions are drawn from Tables 30, 31, 32, 35, 36, 37, and 38, all of which compare against previous work.

Let's start with Table 32. It is one of the most illustrative examples because of the different kinds of errors it contains.


In this table, all numbers except those in the NLLB-200 column are copied directly from the previously published IndicBART and IndicTrans papers. For readability, Meta AI marks the highest score for each language in bold, indicating that the corresponding system is the best.

The table header says "spBLEU" for all columns, which is misleading: in reality "all" means only NLLB-200, since IndicBART and IndicTrans report BLEU, not spBLEU. The comparison then shows that NLLB's spBLEU scores are higher than the BLEU scores of previous work. But does that mean NLLB is better? Is this like saying 30 tomatoes are better than 25 bananas?

In the text explaining the results we can see:


For example: "(c) Google Translate, (d) Microsoft Translate. NLLB-200 significantly outperforms all models in most directions. The training dataset for NLLB-200 includes 25 Indic languages, almost twice as many as those covered by (a) and (b). The performance improvements can be attributed to greater multilingual transfer, as well as improved data quality from Indic-language mining and back-translation."

In other words: NLLB has more tomatoes than the previous study had bananas, therefore NLLB has more bananas.

The spBLEU scores are higher than the BLEU scores because they are computed over smaller, different tokens. Does NLLB translate better? We simply cannot tell. To make matters worse, IndicBART and IndicTrans are not comparable with each other either, as they also use two different tokenization methods.

Most of the tables listed above have similar problems, containing more or fewer errors of this kind.

And if you check the numbers against the published IndicBART and IndicTrans papers, you will find yet another issue: columns (a) and (b) in Table 32 are swapped, so the IndicBART numbers are actually IndicTrans's numbers and vice versa.

If you look at Table 30, the problem is even bigger. However, Table 30 has since been updated in the paper, and Benjamin Marie thanked Vedanuj for the update: the table now states that the same tokenizer is used, and Marie conceded his mistake on this point.


As with Table 32, Meta AI claims here that NLLB is superior to the earlier DeltaLM and DeepNet, while comparing BLEU scores obtained with different calculation methods. What is new here is that they also compare NLLB with their own previous work, M2M-100, which was likewise evaluated with spBLEU. So does this comparison make sense? No. Even though both use spBLEU, they actually use different tokenizers, which makes the comparison impossible. They make the following statement in footnote 28:


"Our analysis shows that when performed on the FLORES-101 language When measured, there are minor differences between the SPM-200 model of FLORES-200 and the SPM-100 model of FLORES-101. The main advantage of SPM-200 is that it covers more than 200 languages."

Small differences are still differences. In this case, the differences matter, because we are doing scientific research.

One advance of NLLB over their earlier M2M-100 work is the addition of more languages to the model and the dataset, and that includes the tokenization model. Technically, if you add more languages with different writing systems to the tokenizer while keeping the vocabulary size constant, you mechanically end up with a vocabulary of smaller tokens. And as we saw above, smaller tokens can produce higher scores. Let's verify this.

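As a sketch of what such a check might look like, the snippet below uses the sentencepiece Python package to segment the same sentence with two SentencePiece models. The model file paths are hypothetical stand-ins for the SPM models that Meta AI distributes with the FLORES-101 and FLORES-200 releases, and the sentence is a placeholder, not Marie's example.

```python
# Compare how two SentencePiece models segment the same sentence.
# The .model paths below are hypothetical; substitute the actual
# FLORES-101 / FLORES-200 SPM files distributed by Meta AI.
import sentencepiece as spm

text = "The authorities did not say what caused the incident."

spm_100 = spm.SentencePieceProcessor(model_file="flores101_spm.model")
spm_200 = spm.SentencePieceProcessor(model_file="flores200_spm.model")

for name, sp in [("SPM-100", spm_100), ("SPM-200", spm_200)]:
    pieces = sp.encode(text, out_type=str)
    print(f"{name}: {len(pieces)} tokens -> {pieces}")

# With the vocabulary size held constant but roughly twice as many
# languages, the 200-language model tends to split text into more,
# smaller pieces, which, as shown above, inflates spBLEU.
```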

In Marie's example, M2M-100's tokenization generates 95 tokens, while NLLB's generates 97. This is only a subtle difference, yet if spBLEU is computed with M2M-100's tokenization, the score is 53.8, one point lower than with NLLB's tokenization. In the machine translation research literature, a difference of 1 point is usually enough to claim that a system is significantly better. As expected, NLLB will thus produce higher scores than M2M-100.

The next table is the last table in this article: Table 31.


Likewise, we have the same problems mentioned above:

1. M2M-100 and NLLB use two different tokenizations for scoring, so their scores cannot be compared.
2. MMTAfrica appears to use M2M-100's tokenization in its paper, which makes it comparable to M2M-100, but not to NLLB.

There are further problems in the paper that we will not go through one by one here. The main mistake Meta AI made in NLLB is a very common one in machine translation evaluation. That said, the work itself is genuinely impressive and may well deliver higher translation quality for many languages.
