AI goes multilingual with Hugging Face’s BLOOM

With all the excitement and innovations surrounding artificial intelligence (AI) in recent years, one key element has often been overlooked: support for multiple languages, beyond English.

That will now change, in part thanks to the launch of BLOOM (an acronym for BigScience Large Open-science Open-access Multilingual Language Model). Work on BLOOM began in 2021, with development led by machine learning startup Hugging Face, which raised $100 million in May.

The BigScience effort also benefits from a wide range of contributors, including Nvidia’s Megatron and Microsoft’s DeepSpeed teams, as well as support from CNRS, the French National Centre for Scientific Research. The BLOOM model was built and trained on the Jean Zay supercomputer in France.

BLOOM has an architecture similar to OpenAI’s GPT-3 large language model; the fundamental difference is that BLOOM is multilingual.

“GPT-3 is monolingual and BLOOM was designed from the ground up to be multilingual, so it was trained on multiple languages, and also to incorporate a significant amount of programming language data,” Teven Le Scao, research engineer at Hugging Face, told VentureBeat. “BLOOM supports 46 human languages and 13 programming languages, so that’s a very big difference.”

How BLOOM was trained with open source machine learning models

The BLOOM effort involved several components, including collecting a large data set and then building a training model.

Le Scao explained that Hugging Face used Nvidia’s Megatron and Microsoft’s DeepSpeed open-source projects, both of which are designed to help data scientists train large language models. Both Megatron and DeepSpeed are based on the open-source PyTorch machine learning framework. For BLOOM, the researchers developed a fork of the Megatron and DeepSpeed projects that allowed the model to examine all of the different languages.

As for BLOOM itself, the project was developed in the open and uses its own open license, which is modeled on the Responsible AI License.

“We’re trying to define what open source means in the context of big AI models, because they don’t really work like software,” Le Scao said.

He explained that the goal of licensing for BLOOM was to make the model as open as possible, while still retaining some control over the use cases organizations have for the model.

How large language models fit into natural language processing

Large language models (LLMs) are a subset of the overall field of natural language processing (NLP).

Le Scao said a language model is like an “atomic unit” for NLP, providing the basic building block on which complex AI interactions and applications can be built.

For example, he noted that it doesn’t make sense for an NLP model to learn a language and learn to summarize at the same time. A human being, Le Scao said, does not learn to speak English and to write a full research report at the same time. Generally, it makes sense to learn to speak the language first.
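To make the “atomic unit” idea concrete, here is a minimal sketch of the simplest kind of language model: a bigram model that counts which word follows which, then uses those counts to continue a prompt. It is a toy illustration only, not how BLOOM works; the three-sentence corpus and the function names (`train_bigram`, `next_token_probs`, `complete`) are invented for this example.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus in three languages (not BLOOM's training data).
CORPUS = [
    "the model speaks english",
    "le modèle parle français",
    "el modelo habla español",
]


def train_bigram(sentences):
    """Count, for every token, which tokens follow it in the corpus."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def next_token_probs(counts, prev):
    """Turn the raw follow-up counts for `prev` into probabilities."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    return {tok: n / total for tok, n in followers.items()} if total else {}


def complete(counts, prompt, max_tokens=5):
    """Greedily extend the prompt with the most likely next token."""
    tokens = prompt.lower().split()
    prev = tokens[-1] if tokens else "<s>"
    for _ in range(max_tokens):
        followers = counts.get(prev)
        if not followers:
            break  # token never seen in training
        nxt = followers.most_common(1)[0][0]
        if nxt == "</s>":
            break  # the model predicts the sentence ends here
        tokens.append(nxt)
        prev = nxt
    return " ".join(tokens)


counts = train_bigram(CORPUS)
print(complete(counts, "le"))   # continues the French sentence
print(complete(counts, "the"))  # continues the English sentence
```

Everything here reduces to predicting the next token. Tasks such as summarization are typically layered on top of that one capability, which is the sense in which the language model serves as a building block.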

Use cases for multilingual models like BLOOM

To date, most AI language models have targeted English or Chinese. BLOOM now expands the use cases, including for speakers of French, Spanish and Arabic, for whom no open LLM was previously available.

In addition to providing a new basis for several spoken human languages, BLOOM could also open a new era for code development.

Using AI for code development is a relatively nascent space, with GitHub’s Copilot, which became generally available in late June, among the early leaders. Le Scao expects that the diversity of programming languages BLOOM understands will help enable new applications for developers.

“BLOOM is going to be a solid platform for coding apps,” said Le Scao.

Now that BLOOM is ready for use, Le Scao also expects new and unforeseen use cases to emerge.

“That’s the fun part, because we’ve done all the hard work to make BLOOM work, and now anyone can run whatever crazy experiment they want on a powerful language model,” he said.
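For readers who want to try such an experiment, the sketch below shows one plausible way to prompt a BLOOM checkpoint in several languages using the Hugging Face `transformers` library. It rests on assumptions the article does not confirm: that `transformers` and PyTorch are installed, and that a small checkpoint published under the name `bigscience/bloom-560m` is used in place of the full model. The prompts and the `generate` helper are made up for illustration.

```python
# Illustrative prompts: two human languages and one programming language.
PROMPTS = [
    "La capitale de la France est",   # French
    "La capital de España es",        # Spanish
    "def fibonacci(n):",              # Python code
]


def generate(prompts, model_name="bigscience/bloom-560m", max_new_tokens=20):
    """Return one greedy continuation per prompt.

    Assumption: `model_name` is a small BLOOM checkpoint on the Hugging Face
    Hub. The import is deferred so that merely defining this function does
    not require `transformers` or trigger a model download.
    """
    from transformers import pipeline

    generator = pipeline("text-generation", model=model_name)
    outputs = []
    for prompt in prompts:
        result = generator(prompt, max_new_tokens=max_new_tokens, do_sample=False)
        outputs.append(result[0]["generated_text"])
    return outputs
```

Calling `generate(PROMPTS)` downloads the checkpoint on first use and returns one string per prompt. Output quality from a small checkpoint will likely fall well short of the full-size model.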
