Multilingual models for inference

There are several multilingual models in 🤗 Transformers, and their inference usage differs from monolingual models. Not all multilingual model usage is different though. Some models, like google-bert/bert-base-multilingual-uncased, can be used just like a monolingual model. This guide will show you how to use multilingual models whose usage differs for inference.

XLM

XLM has ten different checkpoints, only one of which is monolingual. The nine remaining model checkpoints can be split into two categories: the checkpoints that use language embeddings and those that don’t.

XLM with language embeddings

The following XLM models use language embeddings to specify the language used at inference:

Language embeddings are represented as a tensor of the same shape as the input_ids passed to the model. The values in these tensors depend on the language used and are identified by the tokenizer’s lang2id and id2lang attributes.

In this example, load the FacebookAI/xlm-clm-enfr-1024 checkpoint (Causal language modeling, English-French):

>>> import torch
>>> from transformers import XLMTokenizer, XLMWithLMHeadModel

>>> tokenizer = XLMTokenizer.from_pretrained("FacebookAI/xlm-clm-enfr-1024")
>>> model = XLMWithLMHeadModel.from_pretrained("FacebookAI/xlm-clm-enfr-1024")

The lang2id attribute of the tokenizer displays this model’s languages and their ids:

>>> print(tokenizer.lang2id)
{'en': 0, 'fr': 1}

Next, create an example input:

>>> input_ids = torch.tensor([tokenizer.encode("Wikipedia was used to")])  # batch size of 1

Set the language id as "en" and use it to define the language embedding. The language embedding is a tensor filled with 0 since that is the language id for English. This tensor should be the same size as input_ids.

>>> language_id = tokenizer.lang2id["en"]  # 0
>>> langs = torch.tensor([language_id] * input_ids.shape[1])  # torch.tensor([0, 0, 0, ..., 0])

>>> # We reshape it to be of size (batch_size, sequence_length)
>>> langs = langs.view(1, -1)  # is now of shape [1, sequence_length] (we have a batch size of 1)

Now you can pass the input_ids and language embedding to the model:

>>> outputs = model(input_ids, langs=langs)

The run_generation.py script can generate text with language embeddings using the xlm-clm checkpoints.

XLM without language embeddings

The following XLM models do not require language embeddings during inference:

These models are used for generic sentence representations, unlike the previous XLM checkpoints.

BERT

The following BERT models can be used for multilingual tasks:

These models do not require language embeddings during inference. They should identify the language from the context and infer accordingly.

XLM-RoBERTa

The following XLM-RoBERTa models can be used for multilingual tasks:

XLM-RoBERTa was trained on 2.5TB of newly created and cleaned CommonCrawl data in 100 languages. It provides strong gains over previously released multilingual models like mBERT or XLM on downstream tasks like classification, sequence labeling, and question answering.

M2M100

The following M2M100 models can be used for multilingual translation:

In this example, load the facebook/m2m100_418M checkpoint to translate from Chinese to English. You can set the source language in the tokenizer:

>>> from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

>>> en_text = "Do not meddle in the affairs of wizards, for they are subtle and quick to anger."
>>> chinese_text = "不要插手巫師的事務, 因為他們是微妙的, 很快就會發怒."

>>> tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M", src_lang="zh")
>>> model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

Tokenize the text:

>>> encoded_zh = tokenizer(chinese_text, return_tensors="pt")

M2M100 forces the target language id as the first generated token to translate to the target language. Set the forced_bos_token_id to en in the generate method to translate to English:

>>> generated_tokens = model.generate(**encoded_zh, forced_bos_token_id=tokenizer.get_lang_id("en"))
>>> tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
'Do not interfere with the matters of the witches, because they are delicate and will soon be angry.'

MBart

The following MBart models can be used for multilingual translation:

In this example, load the facebook/mbart-large-50-many-to-many-mmt checkpoint to translate Finnish to English. You can set the source language in the tokenizer:

>>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

>>> en_text = "Do not meddle in the affairs of wizards, for they are subtle and quick to anger."
>>> fi_text = "Älä sekaannu velhojen asioihin, sillä ne ovat hienovaraisia ja nopeasti vihaisia."

>>> tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-50-many-to-many-mmt", src_lang="fi_FI")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

Tokenize the text:

>>> encoded_en = tokenizer(en_text, return_tensors="pt")

MBart forces the target language id as the first generated token to translate to the target language. Set the forced_bos_token_id to en in the generate method to translate to English:

>>> generated_tokens = model.generate(**encoded_en, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])
>>> tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
"Don't interfere with the wizard's affairs, because they are subtle, will soon get angry."

If you are using the facebook/mbart-large-50-many-to-one-mmt checkpoint, you don’t need to force the target language id as the first generated token otherwise the usage is the same.