NLLB-MOE

Overview

The NLLB model was presented in No Language Left Behind: Scaling Human-Centered Machine Translation by Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, and Jeff Wang.

The abstract of the paper is the following:

Driven by the goal of eradicating language barriers on a global scale, machine translation has solidified itself as a key focus of artificial intelligence research today. However, such efforts have coalesced around a small subset of languages, leaving behind the vast majority of mostly low-resource languages. What does it take to break the 200 language barrier while ensuring safe, high quality results, all while keeping ethical considerations in mind? In No Language Left Behind, we took on this challenge by first contextualizing the need for low-resource language translation support through exploratory interviews with native speakers. Then, we created datasets and models aimed at narrowing the performance gap between low and high-resource languages. More specifically, we developed a conditional compute model based on Sparsely Gated Mixture of Experts that is trained on data obtained with novel and effective data mining techniques tailored for low-resource languages. We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks. Critically, we evaluated the performance of over 40,000 different translation directions using a human-translated benchmark, Flores-200, and combined human evaluation with a novel toxicity benchmark covering all languages in Flores-200 to assess translation safety. Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art, laying important groundwork towards realizing a universal translation system.

This model was contributed by Arthur Zucker. The original code can be found here.

Usage tips

Implementation differences with SwitchTransformers

The biggest difference is the way tokens are routed. NLLB-MoE uses a top-2 gate, which means that for each input only the two experts with the highest predicted probabilities from the gating network are selected, and the remaining experts are ignored. In SwitchTransformers, only the top-1 probabilities are computed, which means that tokens have a lower probability of being forwarded. Moreover, if a token is not routed to any expert, SwitchTransformers still adds its unmodified hidden states (a kind of residual connection), while these states are masked in NLLB’s top-2 routing mechanism.
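
A minimal, standalone sketch of the two gating strategies (illustrative shapes only, not the library’s internal code):

>>> import torch

>>> # hypothetical router logits for 4 tokens and 8 experts
>>> router_logits = torch.randn(4, 8)
>>> probs = torch.softmax(router_logits, dim=-1)

>>> # top-1 gate (SwitchTransformers-style): each token is sent to a single expert
>>> top1_probs, top1_experts = probs.max(dim=-1)

>>> # top-2 gate (NLLB-MoE-style): each token is sent to its two most probable experts,
>>> # whose outputs are later combined using these probabilities
>>> top2_probs, top2_experts = probs.topk(k=2, dim=-1)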

Generating with NLLB-MoE

The available checkpoints require around 350GB of storage. Make sure to use accelerate if you do not have enough RAM on your machine.
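
For example, device_map="auto" (which relies on the Accelerate library) spreads the checkpoint across the available devices, and half precision further reduces the memory footprint. A minimal sketch:

>>> import torch
>>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-moe-54b")
>>> # device_map="auto" requires `accelerate` and places the weights on the available
>>> # GPUs/CPU (offloading if needed); float16 halves the memory footprint
>>> model = AutoModelForSeq2SeqLM.from_pretrained(
...     "facebook/nllb-moe-54b", device_map="auto", torch_dtype=torch.float16
... )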

While generating the target text, set the forced_bos_token_id to the target language id. The following example shows how to translate English to French using the facebook/nllb-moe-54b checkpoint.

Note that we’re using the BCP-47 code for French, fra_Latn. See here for the list of all BCP-47 codes in the Flores 200 dataset.

>>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-moe-54b")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-moe-54b")

>>> article = "Previously, Ring's CEO, Jamie Siminoff, remarked the company started when his doorbell wasn't audible from his shop in his garage."
>>> inputs = tokenizer(article, return_tensors="pt")

>>> translated_tokens = model.generate(
...     **inputs, forced_bos_token_id=tokenizer.lang_code_to_id["fra_Latn"], max_length=50
... )
>>> tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
"Auparavant, le PDG de Ring, Jamie Siminoff, a fait remarquer que la société avait commencé lorsque sa sonnette n'était pas audible depuis son magasin dans son garage."

Generating from any language other than English

English (eng_Latn) is set as the default source language. To translate from a different language, specify its BCP-47 code in the src_lang keyword argument when initializing the tokenizer.

See the example below for a translation from Romanian to German:

>>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-moe-54b", src_lang="ron_Latn")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-moe-54b")

>>> article = "Şeful ONU spune că nu există o soluţie militară în Siria"
>>> inputs = tokenizer(article, return_tensors="pt")

>>> translated_tokens = model.generate(
...     **inputs, forced_bos_token_id=tokenizer.lang_code_to_id["deu_Latn"], max_length=30
... )
>>> tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]

Resources

NllbMoeConfig

class transformers.NllbMoeConfig

( vocab_size = 128112 max_position_embeddings = 1024 encoder_layers = 12 encoder_ffn_dim = 4096 encoder_attention_heads = 16 decoder_layers = 12 decoder_ffn_dim = 4096 decoder_attention_heads = 16 encoder_layerdrop = 0.05 decoder_layerdrop = 0.05 use_cache = True is_encoder_decoder = True activation_function = 'relu' d_model = 1024 dropout = 0.1 attention_dropout = 0.1 activation_dropout = 0.0 init_std = 0.02 decoder_start_token_id = 2 scale_embedding = True router_bias = False router_dtype = 'float32' router_ignore_padding_tokens = False num_experts = 128 expert_capacity = 64 encoder_sparse_step = 4 decoder_sparse_step = 4 router_z_loss_coef = 0.001 router_aux_loss_coef = 0.001 second_expert_policy = 'all' normalize_router_prob_before_dropping = False batch_prioritized_routing = False moe_eval_capacity_token_fraction = 1.0 moe_token_dropout = 0.2 pad_token_id = 1 bos_token_id = 0 eos_token_id = 2 output_router_logits = False **kwargs )

Parameters

  • vocab_size (int, optional, defaults to 128112) — Vocabulary size of the NllbMoe model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling NllbMoeModel.
  • d_model (int, optional, defaults to 1024) — Dimensionality of the layers and the pooler layer.
  • encoder_layers (int, optional, defaults to 12) — Number of encoder layers.
  • decoder_layers (int, optional, defaults to 12) — Number of decoder layers.
  • encoder_attention_heads (int, optional, defaults to 16) — Number of attention heads for each attention layer in the Transformer encoder.
  • decoder_attention_heads (int, optional, defaults to 16) — Number of attention heads for each attention layer in the Transformer decoder.
  • decoder_ffn_dim (int, optional, defaults to 4096) — Dimensionality of the “intermediate” (often named feed-forward) layer in decoder.
  • encoder_ffn_dim (int, optional, defaults to 4096) — Dimensionality of the “intermediate” (often named feed-forward) layer in encoder.
  • activation_function (str or function, optional, defaults to "relu") — The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "silu" and "gelu_new" are supported.
  • dropout (float, optional, defaults to 0.1) — The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
  • attention_dropout (float, optional, defaults to 0.1) — The dropout ratio for the attention probabilities.
  • activation_dropout (float, optional, defaults to 0.0) — The dropout ratio for activations inside the fully connected layer.
  • classifier_dropout (float, optional, defaults to 0.0) — The dropout ratio for classifier.
  • max_position_embeddings (int, optional, defaults to 1024) — The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
  • init_std (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
  • encoder_layerdrop (float, optional, defaults to 0.05) — The LayerDrop probability for the encoder. See the LayerDrop paper (https://arxiv.org/abs/1909.11556) for more details.
  • decoder_layerdrop (float, optional, defaults to 0.05) — The LayerDrop probability for the decoder. See the LayerDrop paper (https://arxiv.org/abs/1909.11556) for more details.
  • second_expert_policy (str, optional, defaults to "all") — The policy used to sample the probability of routing each token to its second expert.
  • normalize_router_prob_before_dropping (bool, optional, defaults to False) — Whether or not to normalize the router probabilities before applying a mask based on the expert capacity (capacity dropping).
  • batch_prioritized_routing (bool, optional, defaults to False) — Whether or not to order the tokens by their router probabilities before capacity dropping. This means that the tokens with the highest probabilities are routed before other tokens that might appear later in the sequence.
  • moe_eval_capacity_token_fraction (float, optional, defaults to 1.0) — Fraction of tokens used as expert capacity during validation; if set to a negative value, the training capacity is used instead. Should be in the range (0.0, 1.0].
  • num_experts (int, optional, defaults to 128) — Number of experts for each NllbMoeSparseMlp layer.
  • expert_capacity (int, optional, defaults to 64) — Number of tokens that can be stored in each expert.
  • encoder_sparse_step (int, optional, defaults to 4) — Frequency of the sparse layers in the encoder. 4 means that one out of 4 layers will be sparse.
  • decoder_sparse_step (int, optional, defaults to 4) — Frequency of the sparse layers in the decoder. 4 means that one out of 4 layers will be sparse.
  • router_dtype (str, optional, defaults to "float32") — The dtype used for the routers. It is preferable to keep it as "float32", as specified in the selective precision discussion in the paper.
  • router_ignore_padding_tokens (bool, optional, defaults to False) — Whether to ignore padding tokens when routing; if True, padding tokens are not routed to any expert.
  • router_bias (bool, optional, defaults to False) — Whether or not the classifier of the router should have a bias.
  • moe_token_dropout (float, optional, defaults to 0.2) — Masking rate for MoE expert output masking (EOM), which is implemented via a Dropout2d on the expert outputs.
  • output_router_logits (bool, optional, defaults to False) — Whether or not to return the router logits. Only set to True to get the auxiliary loss when training.
  • use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/values attentions (not used by all models).

This is the configuration class to store the configuration of a NllbMoeModel. It is used to instantiate an NLLB-MoE model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the NLLB-MoE facebook/nllb-moe-54b architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example:

>>> from transformers import NllbMoeModel, NllbMoeConfig

>>> # Initializing a NllbMoe facebook/nllb-moe-54b style configuration
>>> configuration = NllbMoeConfig()

>>> # Initializing a model from the facebook/nllb-moe-54b style configuration
>>> model = NllbMoeModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
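
The MoE-specific parameters can be overridden in the same way; the values below are purely illustrative:

>>> # A smaller, sparser configuration for experimentation (illustrative values only)
>>> custom_configuration = NllbMoeConfig(
...     num_experts=8,
...     encoder_sparse_step=2,
...     decoder_sparse_step=2,
...     moe_token_dropout=0.2,
... )
>>> model = NllbMoeModel(custom_configuration)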

NllbMoeTop2Router

class transformers.NllbMoeTop2Router

( config: NllbMoeConfig )

Router in which each token chooses its top-2 expert assignment.

This router uses the same mechanism as in NLLB-MoE from the fairseq repository. Items are sorted by router_probs and then routed to their choice of expert until the expert’s expert_capacity is reached. There is no guarantee that each token is processed by an expert, or that each expert receives at least one token.

The router combining weights are also returned so that states that are not updated can be masked.
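
As a standalone illustration of this capacity mechanism (not the actual route_tokens implementation, and shown with a single expert choice per token for brevity), tokens can be greedily assigned to their preferred expert until it is full:

>>> import torch

>>> num_tokens, num_experts, expert_capacity = 6, 4, 2
>>> router_probs = torch.softmax(torch.randn(num_tokens, num_experts), dim=-1)

>>> # tokens with the highest router probability are served first
>>> top1_probs, top1_experts = router_probs.max(dim=-1)
>>> order = torch.argsort(top1_probs, descending=True)

>>> # greedily fill each expert up to `expert_capacity`; lower-probability tokens that
>>> # would overflow an expert are simply dropped, as described above
>>> slots_used = torch.zeros(num_experts, dtype=torch.long)
>>> dispatch_mask = torch.zeros(num_tokens, num_experts, dtype=torch.bool)
>>> for token in order.tolist():
...     expert = top1_experts[token].item()
...     if slots_used[expert] < expert_capacity:
...         dispatch_mask[token, expert] = True
...         slots_used[expert] += 1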

route_tokens

( router_logits: Tensor input_dtype: dtype = torch.float32 padding_mask: Optional = None )

Computes the dispatch_mask and the dispatch_weights for each expert. The masks are adapted to the expert capacity.

forward

( hidden_states: Tensor padding_mask: Optional = None ) top_1_mask (torch.Tensor of shape (batch_size, sequence_length))

Parameters

  • hidden_states (torch.Tensor) — (batch_size, sequence_length, hidden_dim) from which router probabilities are computed.

Returns

top_1_mask (torch.Tensor of shape (batch_size, sequence_length))

Index tensor of shape (batch_size, sequence_length) corresponding to the expert selected for each token using the top-1 probabilities of the router. router_probabilities (torch.Tensor of shape (batch_size, sequence_length, num_experts)): Probabilities for each token and expert, used to route tokens to the experts. router_logits (torch.Tensor of shape (batch_size, sequence_length, num_experts)): Raw router logits, used later for computing the router z-loss.

The hidden states are reshaped to simplify the computation of the router probabilities (the combining weights for each expert).

NllbMoeSparseMLP

class transformers.NllbMoeSparseMLP

( config: NllbMoeConfig ffn_dim: int expert_class: Module = <class 'transformers.models.nllb_moe.modeling_nllb_moe.NllbMoeDenseActDense'> )

Implementation of the NLLB-MoE sparse MLP module.

forward

( hidden_states: Tensor padding_mask: Optional = False ) hidden_states (torch.Tensor of shape (batch_size, sequence_length, hidden_dim))

Parameters

  • hidden_states (torch.Tensor of shape (batch_size, sequence_length, hidden_dim)) — The hidden states
  • padding_mask (torch.Tensor, optional, defaults to False) — Attention mask. Can be in the causal form or not.

Returns

hidden_states (torch.Tensor of shape (batch_size, sequence_length, hidden_dim))

Updated hidden states. router_logits (torch.Tensor of shape (batch_size, sequence_length, num_experts)): Needed for computing the loss.

The goal of this forward pass is to have the same number of operations as the equivalent NllbMoeDenseActDense (MLP) layer. This means that all of the hidden states should be processed at most twice (since we are using a top-2 gating mechanism), which keeps the complexity at O(batch_size x sequence_length x hidden_dim) instead of O(num_experts x batch_size x sequence_length x hidden_dim).

1- Get the router_probs from the router. The router_mask has shape (batch_size * sequence_length, num_experts) and corresponds to the boolean version of router_probs. The inputs are masked using router_mask.

2- Dispatch the hidden_states to their associated experts. The router probabilities are used to weight the contribution of each expert when updating the masked hidden states.
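
As a rough, self-contained illustration of these two steps (not the module’s actual implementation), the expert outputs can be recombined with the router probabilities as follows:

>>> import torch
>>> from torch import nn

>>> num_tokens, hidden_dim, num_experts = 10, 16, 4
>>> hidden_states = torch.randn(num_tokens, hidden_dim)
>>> experts = nn.ModuleList([nn.Linear(hidden_dim, hidden_dim) for _ in range(num_experts)])

>>> # hypothetical routing results: the top-2 combining weights and the boolean router_mask
>>> router_probs = torch.softmax(torch.randn(num_tokens, num_experts), dim=-1)
>>> top2_probs, top2_experts = router_probs.topk(k=2, dim=-1)
>>> router_mask = torch.zeros_like(router_probs).scatter(1, top2_experts, 1.0).bool()

>>> # each expert only processes the tokens routed to it; the outputs are summed back,
>>> # weighted by the router probabilities, so unrouted tokens stay masked (zero)
>>> combined = torch.zeros_like(hidden_states)
>>> for idx, expert in enumerate(experts):
...     routed = router_mask[:, idx]
...     if routed.any():
...         weight = router_probs[routed, idx].unsqueeze(-1)
...         combined[routed] += weight * expert(hidden_states[routed])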

NllbMoeModel

class transformers.NllbMoeModel

( config: NllbMoeConfig )

Parameters

  • config (NllbMoeConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare NllbMoe Model outputting raw hidden-states without any specific head on top. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward

( input_ids: Optional = None attention_mask: Optional = None decoder_input_ids: Optional = None decoder_attention_mask: Optional = None head_mask: Optional = None decoder_head_mask: Optional = None cross_attn_head_mask: Optional = None encoder_outputs: Optional = None past_key_values: Optional = None inputs_embeds: Optional = None decoder_inputs_embeds: Optional = None use_cache: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None output_router_logits: Optional = None return_dict: Optional = None ) transformers.modeling_outputs.Seq2SeqMoEModelOutput or tuple(torch.FloatTensor)

Parameters

  • input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.

Returns

transformers.modeling_outputs.Seq2SeqMoEModelOutput or tuple(torch.FloatTensor)

A transformers.modeling_outputs.Seq2SeqMoEModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (NllbMoeConfig) and inputs.

  • last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the decoder of the model.

    If past_key_values is used only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.

  • past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head)) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

    Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

  • decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the decoder at the output of each layer plus the optional initial embedding outputs.

  • decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

  • decoder_router_logits (tuple(torch.FloatTensor), optional, returned when output_router_logits=True is passed or when config.add_router_probs=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, sequence_length, num_experts).

    Router logits of the decoder model, useful to compute the auxiliary loss for Mixture of Experts models.

  • cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

  • encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.

  • encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the encoder at the output of each layer plus the optional initial embedding outputs.

  • encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

  • encoder_router_logits (tuple(torch.FloatTensor), optional, returned when output_router_logits=True is passed or when config.add_router_probs=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, sequence_length, num_experts).

    Router logits of the encoder model, useful to compute the auxiliary loss and the z_loss for the sparse modules.

The NllbMoeModel forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

>>> from transformers import AutoTokenizer, NllbMoeModel

>>> tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/random-nllb-moe-2-experts")
>>> model = NllbMoeModel.from_pretrained("hf-internal-testing/random-nllb-moe-2-experts")

>>> input_ids = tokenizer(
...     "Studies have been shown that owning a dog is good for you", return_tensors="pt"
... ).input_ids  # Batch size 1
>>> decoder_input_ids = tokenizer("Studies show that", return_tensors="pt").input_ids  # Batch size 1

>>> # NLLB-MoE uses the eos token as the decoder start token (see decoder_start_token_id);
>>> # for this simple forward pass we pass the decoder prompt directly

>>> # forward pass
>>> outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
>>> last_hidden_states = outputs.last_hidden_state

NllbMoeForConditionalGeneration

class transformers.NllbMoeForConditionalGeneration

( config: NllbMoeConfig )

Parameters

  • config (NllbMoeConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The NllbMoe Model with a language modeling head, which can be used for translation and summarization. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward

( input_ids: Optional = None attention_mask: Optional = None decoder_input_ids: Optional = None decoder_attention_mask: Optional = None head_mask: Optional = None decoder_head_mask: Optional = None cross_attn_head_mask: Optional = None encoder_outputs: Optional = None past_key_values: Optional = None inputs_embeds: Optional = None decoder_inputs_embeds: Optional = None labels: Optional = None use_cache: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None output_router_logits: Optional = None return_dict: Optional = None ) transformers.modeling_outputs.Seq2SeqMoEOutput or tuple(torch.FloatTensor)

Parameters

  • input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.

    Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

    What are input IDs?

  • attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,
    • 0 for tokens that are masked.

    What are attention masks?

  • decoder_input_ids (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — Indices of decoder input sequence tokens in the vocabulary.

    Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

    What are decoder input IDs?

    NllbMoe uses the eos_token_id as the starting token for decoder_input_ids generation. If past_key_values is used, optionally only the last decoder_input_ids have to be input (see past_key_values).

  • decoder_attention_mask (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default.
  • head_mask (torch.Tensor of shape (encoder_layers, encoder_attention_heads), optional) — Mask to nullify selected heads of the attention modules in the encoder. Mask values selected in [0, 1]:

    • 1 indicates the head is not masked,
    • 0 indicates the head is masked.
  • decoder_head_mask (torch.Tensor of shape (decoder_layers, decoder_attention_heads), optional) — Mask to nullify selected heads of the attention modules in the decoder. Mask values selected in [0, 1]:

    • 1 indicates the head is not masked,
    • 0 indicates the head is masked.
  • cross_attn_head_mask (torch.Tensor of shape (decoder_layers, decoder_attention_heads), optional) — Mask to nullify selected heads of the cross-attention modules in the decoder. Mask values selected in [0, 1]:

    • 1 indicates the head is not masked,
    • 0 indicates the head is masked.
  • encoder_outputs (tuple(tuple(torch.FloatTensor)), optional) — Tuple consisting of (last_hidden_state, optional: hidden_states, optional: attentions). last_hidden_state of shape (batch_size, sequence_length, hidden_size) is a sequence of hidden-states at the output of the last layer of the encoder, used in the cross-attention of the decoder.
  • past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head)) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

    Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

    If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).

  • inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
  • decoder_inputs_embeds (torch.FloatTensor of shape (batch_size, target_sequence_length, hidden_size), optional) — Optionally, instead of passing decoder_input_ids you can choose to directly pass an embedded representation. If past_key_values is used, optionally only the last decoder_inputs_embeds have to be input (see past_key_values). This is useful if you want more control over how to convert decoder_input_ids indices into associated vectors than the model’s internal embedding lookup matrix.

    If decoder_input_ids and decoder_inputs_embeds are both unset, decoder_inputs_embeds takes the value of inputs_embeds.

  • use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
  • output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
  • output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
  • output_router_logits (bool, optional) — Whether or not to return the logits of all the routers. They are useful for computing the router loss, and should not be returned during inference.
  • return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
  • labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see input_ids docstring). Tokens with indices set to -100 are ignored (masked), the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].

Returns

transformers.modeling_outputs.Seq2SeqMoEOutput or tuple(torch.FloatTensor)

A transformers.modeling_outputs.Seq2SeqMoEOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (NllbMoeConfig) and inputs.

  • loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss.

  • logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

  • past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head)) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

    Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

  • decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.

  • decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

  • decoder_router_logits (tuple(torch.FloatTensor), optional, returned when output_router_logits=True is passed or when config.add_router_probs=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, sequence_length, num_experts).

    Router logits of the decoder model, useful to compute the auxiliary loss for Mixture of Experts models.

  • cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

  • encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.

  • encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.

  • encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

  • encoder_router_logits (tuple(torch.FloatTensor), optional, returned when output_router_logits=True is passed or when config.add_router_probs=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, sequence_length, num_experts).

    Router logits of the encoder model, useful to compute the auxiliary loss and z_loss for Mixture of Experts models.

The NllbMoeForConditionalGeneration forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Translation example:

>>> from transformers import AutoTokenizer, NllbMoeForConditionalGeneration

>>> model = NllbMoeForConditionalGeneration.from_pretrained("facebook/nllb-moe-54b")
>>> tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-moe-54b")

>>> text_to_translate = "Life is like a box of chocolates"
>>> model_inputs = tokenizer(text_to_translate, return_tensors="pt")

>>> # translate to French
>>> gen_tokens = model.generate(**model_inputs, forced_bos_token_id=tokenizer.lang_code_to_id["fra_Latn"])
>>> print(tokenizer.batch_decode(gen_tokens, skip_special_tokens=True))
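
For training, labels can also be passed so that the language modeling loss is returned, and output_router_logits=True additionally returns the router logits needed for the auxiliary load-balancing and z-loss terms. A minimal sketch continuing the example above (the target sentence is illustrative):

>>> # passing `labels` returns `outputs.loss`; `output_router_logits=True` also returns
>>> # the encoder/decoder router logits used for the MoE auxiliary losses
>>> targets = tokenizer(text_target="La vie est comme une boîte de chocolat", return_tensors="pt")
>>> outputs = model(**model_inputs, labels=targets.input_ids, output_router_logits=True)
>>> loss = outputs.loss
>>> encoder_router_logits = outputs.encoder_router_logits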