Meta’s AI translation breaks 200 language barrier • The Register
Meta’s quest to translate underserved languages has notched its first victory with the open-source release of a language model capable of translating between 202 languages.
Named after Meta’s No Language Left Behind initiative and dubbed NLLB-200, the model is the first capable of translating so many languages, according to its creators, all with the aim of improving translation for languages neglected by similar projects.
“The vast majority of improvements made in machine translation in the last decades have been for high-resource languages,” Meta researchers wrote in a paper [PDF]. “While machine translation continues to grow, the fruits it bears are unevenly distributed,” they said.
According to the announcement of NLLB-200, the model can translate 55 African languages “with high-quality results.” Prior to the creation of NLLB-200, Meta said, fewer than 25 African languages were covered by widely used translation tools. When measured against the BLEU benchmark, Meta said NLLB-200 showed an average improvement of 44 percent over other state-of-the-art translation models. For some African and Indian languages, the improvement reached 70 percent.
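For readers unfamiliar with the metric Meta quotes, BLEU scores a machine translation by how many of its n-grams overlap with a human reference translation. The following is a minimal, simplified sketch of sentence-level BLEU (single reference, no smoothing) in plain Python; it is illustrative only and not the exact scoring setup used in the NLLB paper.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of modified
    n-gram precisions, multiplied by a brevity penalty that punishes
    candidates shorter than the reference."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clip each n-gram's count to how often it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0  # no smoothing: any empty precision zeroes the score
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * geo_mean
```

A perfect match scores 1.0, a translation sharing no words scores 0.0, and partial overlaps fall in between, which is why percentage improvements in BLEU are a common way to compare translation systems.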
Alongside its release on GitHub as an open-source model, Meta said it is also giving $200,000 in grants to nonprofit organizations willing to research real-world applications for NLLB-200.
Lofty goals aside, Meta is already putting NLLB-200 to work. The model and other results from the NLLB program “will support more than 25 billion translations served daily across Facebook News Feed, Instagram, and our other platforms.”
Additionally, Meta has worked with the Wikimedia Foundation to use NLLB-200 as the back end for Wikipedia’s content translation tool. By incorporating NLLB-200, the tool added 10 languages that were not supported by any other translation service.
There are still obstacles. Meta explains it had to do considerable work to overcome the hurdles of doubling NLLB’s language capacity, which it managed through “regularization and curriculum learning, self-supervised learning and diversifying back-translation.” Meta also made extensive use of language model distillation, which reduces previously trained AIs to training data for newer, smaller models.
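The distillation the article mentions boils down to training a small student model to imitate a large teacher’s output distribution rather than hard labels. The plain-Python sketch below shows the core loss signal of that idea; the function names and toy logits are illustrative assumptions, not code from NLLB.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw model scores into a probability distribution.
    A higher temperature softens the distribution, exposing the
    teacher's relative preferences among all output tokens."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy of the student's softened distribution against
    the teacher's -- the training signal a distilled model minimizes."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))
```

The loss is smallest when the student reproduces the teacher’s distribution exactly, which is why a 600-million-parameter model can inherit much of the behavior of a 54.5-billion-parameter one.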
As part of its open sourcing of NLLB-200, Meta is also releasing the new Flores-200 evaluation dataset it built for the project, seed training data, its toxicity lists for 200 languages, its new LASER3 sentence encoder, the stopes data-mining library, 3.3 billion and 1.3 billion parameter dense transformer models, 1.3 billion and 600 million parameter models distilled from NLLB-200, and NLLB-200 itself, which contains 54.5 billion parameters.
Not all communities may welcome the inclusion of their language in NLLB, or other programs for that matter. New Zealand’s Māori community pushed back against translation companies last year, arguing that the firms were effectively buying language data and selling the Māori language back to its own speakers. ®