SeamlessM4T-v2

Overview

The SeamlessM4T-v2 model was proposed in Seamless: Multilingual Expressive and Streaming Speech Translation by the Seamless Communication team from Meta AI.

SeamlessM4T-v2 is a collection of models designed to provide high quality translation, allowing people from different linguistic communities to communicate effortlessly through speech and text. It is an improvement on the previous version. For more details on the differences between v1 and v2, refer to section Difference with SeamlessM4T-v1.

SeamlessM4T-v2 enables multiple tasks without relying on separate models:

  • Speech-to-speech translation (S2ST)
  • Speech-to-text translation (S2TT)
  • Text-to-speech translation (T2ST)
  • Text-to-text translation (T2TT)
  • Automatic speech recognition (ASR)

SeamlessM4Tv2Model can perform all the above tasks, but each task also has its own dedicated sub-model.

The abstract from the paper is the following:

Recent advancements in automatic speech translation have dramatically expanded language coverage, improved multimodal capabilities, and enabled a wide range of tasks and functionalities. That said, large-scale automatic speech translation systems today lack key features that help machine-mediated communication feel seamless when compared to human-to-human dialogue. In this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion. First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4T model—SeamlessM4T v2. This newer model, incorporating an updated UnitY2 framework, was trained on more low-resource language data. The expanded version of SeamlessAlign adds 114,800 hours of automatically aligned data for a total of 76 languages. SeamlessM4T v2 provides the foundation on which our two newest models, SeamlessExpressive and SeamlessStreaming, are initiated. SeamlessExpressive enables translation that preserves vocal styles and prosody. Compared to previous efforts in expressive speech research, our work addresses certain underexplored aspects of prosody, such as speech rate and pauses, while also preserving the style of one’s voice. As for SeamlessStreaming, our model leverages the Efficient Monotonic Multihead Attention (EMMA) mechanism to generate low-latency target translations without waiting for complete source utterances. As the first of its kind, SeamlessStreaming enables simultaneous speech-to-speech/text translation for multiple source and target languages. To understand the performance of these models, we combined novel and modified versions of existing automatic metrics to evaluate prosody, latency, and robustness. For human evaluations, we adapted existing protocols tailored for measuring the most relevant attributes in the preservation of meaning, naturalness, and expressivity. To ensure that our models can be used safely and responsibly, we implemented the first known red-teaming effort for multimodal machine translation, a system for the detection and mitigation of added toxicity, a systematic evaluation of gender bias, and an inaudible localized watermarking mechanism designed to dampen the impact of deepfakes. Consequently, we bring major components from SeamlessExpressive and SeamlessStreaming together to form Seamless, the first publicly available system that unlocks expressive cross-lingual communication in real-time. In sum, Seamless gives us a pivotal look at the technical foundation needed to turn the Universal Speech Translator from a science fiction concept into a real-world technology. Finally, contributions in this work—including models, code, and a watermark detector—are publicly released and accessible at the link below.

Usage

In the following example, we’ll load an Arabic audio sample and an English text sample and convert them into Russian speech and French text.

First, load the processor and a checkpoint of the model:

>>> from transformers import AutoProcessor, SeamlessM4Tv2Model

>>> processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
>>> model = SeamlessM4Tv2Model.from_pretrained("facebook/seamless-m4t-v2-large")

You can seamlessly use this model on text or on audio, to generate either translated text or translated audio.

Here is how to use the processor to process text and audio:

>>> # let's load an audio sample from an Arabic speech corpus
>>> from datasets import load_dataset
>>> dataset = load_dataset("arabic_speech_corpus", split="test", streaming=True)
>>> audio_sample = next(iter(dataset))["audio"]

>>> # now, process it
>>> audio_inputs = processor(audios=audio_sample["array"], return_tensors="pt")

>>> # now, process some English text as well
>>> text_inputs = processor(text="Hello, my dog is cute", src_lang="eng", return_tensors="pt")

Speech

SeamlessM4Tv2Model can seamlessly generate text or speech with few or no changes. Let’s target Russian voice translation:

>>> audio_array_from_text = model.generate(**text_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
>>> audio_array_from_audio = model.generate(**audio_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()

With basically the same code, we've translated English text and Arabic speech into Russian speech samples.
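
If you want to listen to or keep the result, you can write the waveforms to disk. Here is a minimal sketch assuming scipy is installed; the sampling rate is taken from the model configuration:

>>> import scipy.io.wavfile
>>> sample_rate = model.config.sampling_rate  # 16000 Hz by default for this model
>>> scipy.io.wavfile.write("speech_from_text.wav", rate=sample_rate, data=audio_array_from_text)
>>> scipy.io.wavfile.write("speech_from_audio.wav", rate=sample_rate, data=audio_array_from_audio)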

Text

Similarly, you can generate translated text from audio files or from text with the same model. You only have to pass generate_speech=False to SeamlessM4Tv2Model.generate(). This time, let’s translate to French.

>>> # from audio
>>> output_tokens = model.generate(**audio_inputs, tgt_lang="fra", generate_speech=False)
>>> translated_text_from_audio = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)

>>> # from text
>>> output_tokens = model.generate(**text_inputs, tgt_lang="fra", generate_speech=False)
>>> translated_text_from_text = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)

Tips

1. Use dedicated models

SeamlessM4Tv2Model is the Transformers top-level model for generating speech and text, but you can also use dedicated models that perform the task without the additional components, thus reducing the memory footprint. For example, you can replace the audio-to-audio generation snippet with the model dedicated to the S2ST task; the rest of the code is exactly the same:

>>> from transformers import SeamlessM4Tv2ForSpeechToSpeech
>>> model = SeamlessM4Tv2ForSpeechToSpeech.from_pretrained("facebook/seamless-m4t-v2-large")

Or you can replace the text-to-text generation snippet with the model dedicated to the T2TT task; you only have to remove generate_speech=False.

>>> from transformers import SeamlessM4Tv2ForTextToText
>>> model = SeamlessM4Tv2ForTextToText.from_pretrained("facebook/seamless-m4t-v2-large")

Feel free to try out SeamlessM4Tv2ForSpeechToText and SeamlessM4Tv2ForTextToSpeech as well.
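
For instance, here is a sketch for the S2TT task with the dedicated speech-to-text model; as with T2TT, you reuse the earlier audio snippet and drop generate_speech=False:

>>> from transformers import SeamlessM4Tv2ForSpeechToText
>>> model = SeamlessM4Tv2ForSpeechToText.from_pretrained("facebook/seamless-m4t-v2-large")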

2. Change the speaker identity

You can change the speaker used for speech synthesis with the speaker_id argument. Some speaker_id values work better than others for some languages!
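
For example, reusing the text inputs from the usage section, the sketch below picks an arbitrary speaker (the value must stay below config.vocoder_num_spkrs):

>>> # same Russian speech translation as before, but with another vocoder speaker
>>> audio_array = model.generate(**text_inputs, tgt_lang="rus", speaker_id=5)[0].cpu().numpy().squeeze()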

3. Change the generation strategy

You can use different generation strategies for text generation, e.g. .generate(input_ids=input_ids, text_num_beams=4, text_do_sample=True), which will perform multinomial beam-search decoding on the text model. Note that speech generation only supports greedy decoding (the default) or multinomial sampling, which can be used with e.g. .generate(..., speech_do_sample=True, speech_temperature=0.6).
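
As a sketch, both strategies can be combined in a single call, reusing the inputs from the usage section:

>>> # multinomial beam-search decoding on the text model, multinomial sampling on the speech model
>>> audio_array = model.generate(
...     **text_inputs, tgt_lang="rus", text_num_beams=4, text_do_sample=True, speech_do_sample=True, speech_temperature=0.6
... )[0].cpu().numpy().squeeze()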

4. Generate speech and text at the same time

Use return_intermediate_token_ids=True with SeamlessM4Tv2Model to return both speech and text!
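
A minimal sketch, assuming the returned SeamlessM4Tv2GenerationOutput exposes the audio under waveform and the intermediate text tokens under sequences:

>>> outputs = model.generate(**text_inputs, tgt_lang="rus", return_intermediate_token_ids=True)
>>> audio_array = outputs.waveform[0].cpu().numpy().squeeze()  # translated speech
>>> translated_text = processor.decode(outputs.sequences[0].tolist(), skip_special_tokens=True)  # translated text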

Model architecture

SeamlessM4T-v2 features a versatile architecture that smoothly handles the sequential generation of text and speech. This setup comprises two sequence-to-sequence (seq2seq) models. The first model translates the input modality into translated text, while the second model generates speech tokens, known as “unit tokens,” from the translated text.

Each modality has its own dedicated encoder with a unique architecture. Additionally, for speech output, a vocoder inspired by the HiFi-GAN architecture is placed on top of the second seq2seq model.

Difference with SeamlessM4T-v1

The architecture of this new version differs from the first in a few aspects:

Improvements on the second-pass model

The second seq2seq model, called the text-to-unit model, is now non-autoregressive, meaning that it computes the units in a single forward pass. This is made possible by:

Difference in the speech encoder

The speech encoder, which is used during the first-pass generation process to predict the translated text, differs from the previous speech encoder mainly through these mechanisms:

Generation process

Here’s how the generation process works:

This model was contributed by ylacombe. The original code can be found here.

SeamlessM4Tv2Model

class transformers.SeamlessM4Tv2Model

< >

( config current_modality = 'text' )

Parameters

  • config (~SeamlessM4Tv2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
  • current_modality (str, optional, defaults to "text") — Default modality. Used only to initialize the model. It can be set to "text" or "speech". This will be updated automatically according to the modality passed to the forward and generate passes (input_ids for text and input_features for audio).

The original SeamlessM4Tv2 Model transformer which can be used for every task available (S2ST, S2TT, T2TT, T2ST). This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

generate

< >

( input_ids: Optional = None input_features: Optional = None return_intermediate_token_ids: Optional = None tgt_lang: Optional = None speaker_id: Optional = 0 generate_speech: Optional = True **kwargs ) Union[SeamlessM4Tv2GenerationOutput, Tuple[Tensor], ModelOutput]

Parameters

  • input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary.

    Indices can be obtained using SeamlessM4TTokenizer or SeamlessM4TProcessor. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

    What are input IDs?

  • input_features (torch.FloatTensor of shape (batch_size, sequence_length, num_banks), optional) — Input audio features. This should be returned by the SeamlessM4TFeatureExtractor class or the SeamlessM4TProcessor class. See SeamlessM4TFeatureExtractor.__call__() for details.
  • return_intermediate_token_ids (bool, optional) — If True, also returns the intermediate generated text and unit tokens. Set to True if you also want to get translated text alongside the audio. Note that if generate_speech=False, this parameter will be ignored.
  • tgt_lang (str, optional) — The language to use as target language for translation.
  • speaker_id (int, optional, defaults to 0) — The id of the speaker used for speech synthesis. Must be lower than config.vocoder_num_spkrs.
  • generate_speech (bool, optional, defaults to True) — If False, will only return the text tokens and won’t generate speech.
  • kwargs (optional) — Remaining dictionary of keyword arguments that will be passed to GenerationMixin.generate(). Keyword arguments are of two types:

    • Without a prefix, they will be entered as **kwargs for the generate method of each sub-model, except for decoder_input_ids which will only be passed through the text components.
    • With a text_ or speech_ prefix, they will be passed to the generate method of the text model and speech model respectively. They take priority over the keywords without a prefix.

    This means you can, for example, specify a generation strategy for one generation but not for the other.

Returns

Union[SeamlessM4Tv2GenerationOutput, Tuple[Tensor], ModelOutput]

  • If generate_speech and return_intermediate_token_ids, returns SeamlessM4Tv2GenerationOutput.
  • If generate_speech and not return_intermediate_token_ids, returns a tuple composed of waveforms of shape (batch_size, sequence_length) and waveform_lengths, which gives the length of each sample.
  • If generate_speech=False, it will return a ModelOutput.

Generates translated token ids and/or translated audio waveforms.

This method successively calls the .generate function of two different sub-models. You can specify keyword arguments at two different levels: general arguments that will be passed to both models, or prefixed arguments that will be passed to one of them.

For example, calling .generate(input_ids=input_ids, num_beams=4, speech_do_sample=True) will successively perform beam-search decoding on the text model, and multinomial beam-search sampling on the speech model.

For an overview of generation strategies and code examples, check out the following guide.

SeamlessM4Tv2ForTextToSpeech

class transformers.SeamlessM4Tv2ForTextToSpeech

< >

( config: SeamlessM4Tv2Config )

Parameters

  • config (~SeamlessM4Tv2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The text-to-speech SeamlessM4Tv2 Model transformer which can be used for T2ST. This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

generate

< >

( input_ids: Optional = None return_intermediate_token_ids: Optional = None tgt_lang: Optional = None speaker_id: Optional = 0 **kwargs ) Union[SeamlessM4Tv2GenerationOutput, Tuple[Tensor]]

Parameters

  • input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary.

    Indices can be obtained using SeamlessM4TTokenizer or SeamlessM4TProcessor. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

    What are input IDs?

  • return_intermediate_token_ids (bool, optional) — If True, also returns the intermediate generated text and unit tokens. Set to True if you also want to get translated text alongside the audio.
  • tgt_lang (str, optional) — The language to use as target language for translation.
  • speaker_id (int, optional, defaults to 0) — The id of the speaker used for speech synthesis. Must be lower than config.vocoder_num_spkrs.
  • kwargs (optional) — Remaining dictionary of keyword arguments that will be passed to GenerationMixin.generate(). Keyword arguments are of two types:

    • Without a prefix, they will be entered as **kwargs for the generate method of each sub-model, except for decoder_input_ids which will only be passed through the text components.
    • With a text_ or speech_ prefix, they will be passed to the generate method of the text model and speech model respectively. They take priority over the keywords without a prefix.

    This means you can, for example, specify a generation strategy for one generation but not for the other.

Returns

Union[SeamlessM4Tv2GenerationOutput, Tuple[Tensor]]

  • If return_intermediate_token_ids, returns SeamlessM4Tv2GenerationOutput.
  • If not return_intermediate_token_ids, returns a tuple composed of waveforms of shape (batch_size, sequence_length) and waveform_lengths, which gives the length of each sample.

Generates translated audio waveforms.

This method successively calls the .generate function of two different sub-models. You can specify keyword arguments at two different levels: general arguments that will be passed to both models, or prefixed arguments that will be passed to one of them.

For example, calling .generate(input_ids, num_beams=4, speech_do_sample=True) will successively perform beam-search decoding on the text model, and multinomial beam-search sampling on the speech model.

For an overview of generation strategies and code examples, check out the following guide.
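
For example, a minimal T2ST sketch mirroring the snippets from the usage section above:

>>> from transformers import AutoProcessor, SeamlessM4Tv2ForTextToSpeech

>>> processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
>>> model = SeamlessM4Tv2ForTextToSpeech.from_pretrained("facebook/seamless-m4t-v2-large")

>>> # translate English text into Russian speech
>>> text_inputs = processor(text="Hello, my dog is cute", src_lang="eng", return_tensors="pt")
>>> audio_array = model.generate(**text_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()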

SeamlessM4Tv2ForSpeechToSpeech

class transformers.SeamlessM4Tv2ForSpeechToSpeech

< >

( config )

Parameters

  • config (~SeamlessM4Tv2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The speech-to-speech SeamlessM4Tv2 Model transformer which can be used for S2ST. This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

generate

< >

( input_features: Optional = None return_intermediate_token_ids: Optional = None tgt_lang: Optional = None speaker_id: Optional = 0 **kwargs ) Union[SeamlessM4Tv2GenerationOutput, Tuple[Tensor]]

Parameters

  • input_features (torch.FloatTensor of shape (batch_size, sequence_length, num_banks)) — Input audio features. This should be returned by the SeamlessM4TFeatureExtractor class or the SeamlessM4TProcessor class. See SeamlessM4TFeatureExtractor.__call__() for details.
  • return_intermediate_token_ids (bool, optional) — If True, also returns the intermediate generated text and unit tokens. Set to True if you also want to get translated text alongside the audio.
  • tgt_lang (str, optional) — The language to use as target language for translation.
  • speaker_id (int, optional, defaults to 0) — The id of the speaker used for speech synthesis. Must be lower than config.vocoder_num_spkrs.
  • kwargs (optional) — Remaining dictionary of keyword arguments that will be passed to GenerationMixin.generate(). Keyword arguments are of two types:

    • Without a prefix, they will be entered as **kwargs for the generate method of each sub-model, except for decoder_input_ids which will only be passed through the text components.
    • With a text_ or speech_ prefix, they will be passed to the generate method of the text model and speech model respectively. They take priority over the keywords without a prefix.

    This means you can, for example, specify a generation strategy for one generation but not for the other.

Returns

Union[SeamlessM4Tv2GenerationOutput, Tuple[Tensor]]

  • If return_intermediate_token_ids, returns SeamlessM4Tv2GenerationOutput.
  • If not return_intermediate_token_ids, returns a tuple composed of waveforms of shape (batch_size, sequence_length) and waveform_lengths, which gives the length of each sample.

Generates translated audio waveforms.

This method successively calls the .generate function of two different sub-models. You can specify keyword arguments at two different levels: general arguments that will be passed to both models, or prefixed arguments that will be passed to one of them.

For example, calling .generate(input_features, num_beams=4, speech_do_sample=True) will successively perform beam-search decoding on the text model, and multinomial beam-search sampling on the speech model.

For an overview of generation strategies and code examples, check out the following guide.
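
For example, a minimal S2ST sketch mirroring the audio snippets from the usage section above:

>>> from datasets import load_dataset
>>> from transformers import AutoProcessor, SeamlessM4Tv2ForSpeechToSpeech

>>> processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
>>> model = SeamlessM4Tv2ForSpeechToSpeech.from_pretrained("facebook/seamless-m4t-v2-large")

>>> # load an Arabic speech sample and translate it into Russian speech
>>> dataset = load_dataset("arabic_speech_corpus", split="test", streaming=True)
>>> audio_sample = next(iter(dataset))["audio"]
>>> audio_inputs = processor(audios=audio_sample["array"], return_tensors="pt")
>>> audio_array = model.generate(**audio_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()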

SeamlessM4Tv2ForTextToText

class transformers.SeamlessM4Tv2ForTextToText

< >

( config: SeamlessM4Tv2Config )

Parameters

  • config (~SeamlessM4Tv2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The text-to-text SeamlessM4Tv2 Model transformer which can be used for T2TT. This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

< >

( input_ids: LongTensor = None attention_mask: Optional = None decoder_input_ids: Optional = None decoder_attention_mask: Optional = None encoder_outputs: Optional = None past_key_values: Optional = None inputs_embeds: Optional = None decoder_inputs_embeds: Optional = None labels: Optional = None use_cache: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None **kwargs )

Parameters

  • input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary.

    Indices can be obtained using SeamlessM4TTokenizer or SeamlessM4TProcessor. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

    What are input IDs?

  • attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,
    • 0 for tokens that are masked.

    What are attention masks?

  • decoder_input_ids (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — Indices of decoder input sequence tokens in the vocabulary.

    Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

    What are decoder input IDs?

    Bart uses the eos_token_id as the starting token for decoder_input_ids generation. If past_key_values is used, optionally only the last decoder_input_ids have to be input (see past_key_values).

    For translation and summarization training, decoder_input_ids should be provided. If no decoder_input_ids is provided, the model will create this tensor by shifting the input_ids to the right for denoising pre-training following the paper.

  • decoder_attention_mask (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default.

    If you want to change padding behavior, you should read modeling_bart._prepare_decoder_attention_mask and modify to your needs. See diagram 1 in the paper for more information on the default strategy.

  • encoder_outputs (tuple(tuple(torch.FloatTensor)), optional) — Tuple consisting of (last_hidden_state, optional: hidden_states, optional: attentions). last_hidden_state of shape (batch_size, sequence_length, hidden_size) is a sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.
  • past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

    Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

    If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).

  • inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
  • decoder_inputs_embeds (torch.FloatTensor of shape (batch_size, target_sequence_length, hidden_size), optional) — Optionally, instead of passing decoder_input_ids you can choose to directly pass an embedded representation. If past_key_values is used, optionally only the last decoder_inputs_embeds have to be input (see past_key_values). This is useful if you want more control over how to convert decoder_input_ids indices into associated vectors than the model’s internal embedding lookup matrix.

    If decoder_input_ids and decoder_inputs_embeds are both unset, decoder_inputs_embeds takes the value of inputs_embeds.

  • labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should be in [-100, 0, ..., config.vocab_size] (see input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
  • use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
  • output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
  • output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
  • return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.

The SeamlessM4Tv2ForTextToText forward method, overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

generate

< >

( input_ids = None tgt_lang = None generation_config = None logits_processor = None stopping_criteria = None prefix_allowed_tokens_fn = None synced_gpus = False **kwargs ) ModelOutput or torch.LongTensor

Parameters

  • input_ids (torch.Tensor of varying shape depending on the modality, optional) — Indices of input sequence tokens in the vocabulary.

    Indices can be obtained using SeamlessM4TTokenizer or SeamlessM4TProcessor. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

    What are input IDs?

  • tgt_lang (str, optional) — The language to use as target language for translation.
  • generation_config (~generation.GenerationConfig, optional) — The generation configuration to be used as base parametrization for the generation call. **kwargs passed to generate matching the attributes of generation_config will override them. If generation_config is not provided, the default will be used, which has the following loading priority: 1) from the generation_config.json model file, if it exists; 2) from the model configuration. Please note that unspecified parameters will inherit GenerationConfig’s default values, whose documentation should be checked to parameterize generation.
  • logits_processor (LogitsProcessorList, optional) — Custom logits processors that complement the default logits processors built from arguments and generation config. If a logit processor is passed that is already created with the arguments or a generation config an error is thrown. This feature is intended for advanced users.
  • stopping_criteria (StoppingCriteriaList, optional) — Custom stopping criteria that complement the default stopping criteria built from arguments and a generation config. If a stopping criteria is passed that is already created with the arguments or a generation config an error is thrown. This feature is intended for advanced users.
  • prefix_allowed_tokens_fn (Callable[[int, torch.Tensor], List[int]], optional) — If provided, this function constrains the beam search to allowed tokens only at each step. If not provided, no constraint is applied. This function takes 2 arguments: the batch ID batch_id and input_ids. It has to return a list with the allowed tokens for the next generation step conditioned on the batch ID batch_id and the previously generated tokens input_ids. This argument is useful for constrained generation conditioned on the prefix, as described in Autoregressive Entity Retrieval.
  • synced_gpus (bool, optional, defaults to False) — Whether to continue running the while loop until max_length (needed for ZeRO stage 3)
  • kwargs (Dict[str, Any], optional) — Ad hoc parametrization of generate_config and/or additional model-specific kwargs that will be forwarded to the forward function of the model.

Returns

ModelOutput or torch.LongTensor

A ModelOutput (if return_dict_in_generate=True or when config.return_dict_in_generate=True) or a torch.LongTensor. The possible ModelOutput types are:

Generates sequences of token ids.

Most generation-controlling parameters are set in generation_config which, if not passed, will be set to the model’s default generation configuration. You can override any generation_config by passing the corresponding parameters to generate(), e.g. .generate(inputs, num_beams=4, do_sample=True).

For an overview of generation strategies and code examples, check out the following guide.
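
For example, a minimal T2TT sketch mirroring the snippets from the usage section above:

>>> from transformers import AutoProcessor, SeamlessM4Tv2ForTextToText

>>> processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
>>> model = SeamlessM4Tv2ForTextToText.from_pretrained("facebook/seamless-m4t-v2-large")

>>> # translate English text into French text
>>> text_inputs = processor(text="Hello, my dog is cute", src_lang="eng", return_tensors="pt")
>>> output_tokens = model.generate(**text_inputs, tgt_lang="fra")
>>> translated_text = processor.decode(output_tokens[0].tolist(), skip_special_tokens=True)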

SeamlessM4Tv2ForSpeechToText

class transformers.SeamlessM4Tv2ForSpeechToText

< >

( config: SeamlessM4Tv2Config )

Parameters

  • config (~SeamlessM4Tv2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The speech-to-text SeamlessM4Tv2 Model transformer which can be used for S2TT. This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

< >

( input_features: LongTensor = None attention_mask: Optional = None decoder_input_ids: Optional = None decoder_attention_mask: Optional = None encoder_outputs: Optional = None past_key_values: Optional = None inputs_embeds: Optional = None decoder_inputs_embeds: Optional = None labels: Optional = None use_cache: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None **kwargs )

Parameters

  • input_features (torch.FloatTensor of shape (batch_size, sequence_length, num_banks)) — Input audio features. This should be returned by the SeamlessM4TFeatureExtractor class or the SeamlessM4TProcessor class. See SeamlessM4TFeatureExtractor.__call__() for details.
  • attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,
    • 0 for tokens that are masked.

    What are attention masks?

  • decoder_input_ids (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — Indices of decoder input sequence tokens in the vocabulary.

    Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

    What are decoder input IDs?

    Bart uses the eos_token_id as the starting token for decoder_input_ids generation. If past_key_values is used, optionally only the last decoder_input_ids have to be input (see past_key_values).

    For translation and summarization training, decoder_input_ids should be provided. If no decoder_input_ids is provided, the model will create this tensor by shifting the input_ids to the right for denoising pre-training following the paper.

  • decoder_attention_mask (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default.

    If you want to change padding behavior, you should read modeling_bart._prepare_decoder_attention_mask and modify to your needs. See diagram 1 in the paper for more information on the default strategy.

  • encoder_outputs (tuple(tuple(torch.FloatTensor)), optional) — Tuple consisting of (last_hidden_state, optional: hidden_states, optional: attentions). last_hidden_state of shape (batch_size, sequence_length, hidden_size) is a sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.
  • past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

    Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

    If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).

  • inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
  • decoder_inputs_embeds (torch.FloatTensor of shape (batch_size, target_sequence_length, hidden_size), optional) — Optionally, instead of passing decoder_input_ids you can choose to directly pass an embedded representation. If past_key_values is used, optionally only the last decoder_inputs_embeds have to be input (see past_key_values). This is useful if you want more control over how to convert decoder_input_ids indices into associated vectors than the model’s internal embedding lookup matrix.

    If decoder_input_ids and decoder_inputs_embeds are both unset, decoder_inputs_embeds takes the value of inputs_embeds.

  • labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should be in [-100, 0, ..., config.vocab_size] (see input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
  • use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
  • output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
  • output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
  • return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.

The SeamlessM4Tv2ForSpeechToText forward method, overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

generate

< >

( input_features = None tgt_lang = None generation_config = None logits_processor = None stopping_criteria = None prefix_allowed_tokens_fn = None synced_gpus = False **kwargs ) ModelOutput or torch.LongTensor

Parameters

  • input_features (torch.FloatTensor of shape (batch_size, sequence_length, num_banks)) — Input audio features. This should be returned by the SeamlessM4TFeatureExtractor class or the SeamlessM4TProcessor class. See SeamlessM4TFeatureExtractor.__call__() for details.
  • tgt_lang (str, optional) — The language to use as target language for translation.
  • generation_config (~generation.GenerationConfig, optional) — The generation configuration to be used as base parametrization for the generation call. **kwargs passed to generate matching the attributes of generation_config will override them. If generation_config is not provided, the default will be used, which has the following loading priority: 1) from the generation_config.json model file, if it exists; 2) from the model configuration. Please note that unspecified parameters will inherit GenerationConfig’s default values, whose documentation should be checked to parameterize generation.
  • logits_processor (LogitsProcessorList, optional) — Custom logits processors that complement the default logits processors built from arguments and generation config. If a logit processor is passed that is already created with the arguments or a generation config an error is thrown. This feature is intended for advanced users.
  • stopping_criteria (StoppingCriteriaList, optional) — Custom stopping criteria that complement the default stopping criteria built from arguments and a generation config. If a stopping criteria is passed that is already created with the arguments or a generation config an error is thrown. This feature is intended for advanced users.
  • prefix_allowed_tokens_fn (Callable[[int, torch.Tensor], List[int]], optional) — If provided, this function constrains the beam search to allowed tokens only at each step. If not provided, no constraint is applied. This function takes 2 arguments: the batch ID batch_id and input_ids. It has to return a list with the allowed tokens for the next generation step conditioned on the batch ID batch_id and the previously generated tokens input_ids. This argument is useful for constrained generation conditioned on the prefix, as described in Autoregressive Entity Retrieval.
  • synced_gpus (bool, optional, defaults to False) — Whether to continue running the while loop until max_length (needed for ZeRO stage 3)
  • kwargs (Dict[str, Any], optional) — Ad hoc parametrization of generate_config and/or additional model-specific kwargs that will be forwarded to the forward function of the model.

Returns

ModelOutput or torch.LongTensor

A ModelOutput (if return_dict_in_generate=True or when config.return_dict_in_generate=True) or a torch.LongTensor. The possible ModelOutput types are:

Generates sequences of token ids.

Most generation-controlling parameters are set in generation_config which, if not passed, will be set to the model’s default generation configuration. You can override any generation_config by passing the corresponding parameters to generate(), e.g. .generate(inputs, num_beams=4, do_sample=True).

For an overview of generation strategies and code examples, check out the following guide.
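
For example, a minimal S2TT sketch mirroring the audio snippets from the usage section above:

>>> from datasets import load_dataset
>>> from transformers import AutoProcessor, SeamlessM4Tv2ForSpeechToText

>>> processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
>>> model = SeamlessM4Tv2ForSpeechToText.from_pretrained("facebook/seamless-m4t-v2-large")

>>> # load an Arabic speech sample and translate it into French text
>>> dataset = load_dataset("arabic_speech_corpus", split="test", streaming=True)
>>> audio_sample = next(iter(dataset))["audio"]
>>> audio_inputs = processor(audios=audio_sample["array"], return_tensors="pt")
>>> output_tokens = model.generate(**audio_inputs, tgt_lang="fra")
>>> translated_text = processor.decode(output_tokens[0].tolist(), skip_special_tokens=True)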

SeamlessM4Tv2Config

class transformers.SeamlessM4Tv2Config

< >

( vocab_size = 256102 t2u_vocab_size = 10082 char_vocab_size = 10943 hidden_size = 1024 initializer_range = 0.02 layer_norm_eps = 1e-05 use_cache = True max_position_embeddings = 4096 is_encoder_decoder = True encoder_layerdrop = 0.05 decoder_layerdrop = 0.05 activation_function = 'relu' dropout = 0.1 attention_dropout = 0.1 activation_dropout = 0.0 scale_embedding = True encoder_layers = 24 encoder_ffn_dim = 8192 encoder_attention_heads = 16 decoder_layers = 24 decoder_ffn_dim = 8192 decoder_attention_heads = 16 decoder_start_token_id = 3 max_new_tokens = 256 pad_token_id = 0 bos_token_id = 2 eos_token_id = 3 speech_encoder_layers = 24 speech_encoder_attention_heads = 16 speech_encoder_intermediate_size = 4096 speech_encoder_hidden_act = 'swish' speech_encoder_dropout = 0.0 add_adapter = True speech_encoder_layerdrop = 0.1 feature_projection_input_dim = 160 adaptor_kernel_size = 8 adaptor_stride = 8 adaptor_dropout = 0.1 num_adapter_layers = 1 position_embeddings_type = 'relative_key' conv_depthwise_kernel_size = 31 left_max_position_embeddings = 64 right_max_position_embeddings = 8 speech_encoder_chunk_size = 20000 speech_encoder_left_chunk_num = 128 t2u_bos_token_id = 0 t2u_pad_token_id = 1 t2u_eos_token_id = 2 t2u_encoder_layers = 6 t2u_encoder_ffn_dim = 8192 t2u_encoder_attention_heads = 16 t2u_decoder_layers = 6 t2u_decoder_ffn_dim = 8192 t2u_decoder_attention_heads = 16 t2u_max_position_embeddings = 4096 t2u_variance_predictor_embed_dim = 1024 t2u_variance_predictor_hidden_dim = 256 t2u_variance_predictor_kernel_size = 3 t2u_variance_pred_dropout = 0.5 sampling_rate = 16000 upsample_initial_channel = 512 upsample_rates = [5, 4, 4, 2, 2] upsample_kernel_sizes = [11, 8, 8, 4, 4] resblock_kernel_sizes = [3, 7, 11] resblock_dilation_sizes = [[1, 3, 5], [1, 3, 5], [1, 3, 5]] leaky_relu_slope = 0.1 unit_hifi_gan_vocab_size = 10000 unit_embed_dim = 1280 lang_embed_dim = 256 spkr_embed_dim = 256 vocoder_num_langs = 36 vocoder_num_spkrs = 200 variance_predictor_kernel_size = 3 var_pred_dropout = 0.5 vocoder_offset = 4 **kwargs )

Parameters

Parameters shared across sub-models

  • hidden_size (int, optional, defaults to 1024) — Dimensionality of the “intermediate” layers in the architecture.
  • initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
  • layer_norm_eps (float, optional, defaults to 1e-05) — The epsilon used by the layer normalization layers.
  • use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/values attentions (not used by all models).
  • max_position_embeddings (int, optional, defaults to 4096) — The maximum sequence length that this model’s text encoder and decoder might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
  • is_encoder_decoder (bool, optional, defaults to True) — Whether the model is used as an encoder/decoder or not.
  • encoder_layerdrop (float, optional, defaults to 0.05) — The LayerDrop probability for the encoders. See the [LayerDrop paper](https://arxiv.org/abs/1909.11556) for more details.
  • decoder_layerdrop (float, optional, defaults to 0.05) — The LayerDrop probability for the decoders. See the [LayerDrop paper](https://arxiv.org/abs/1909.11556) for more details.
  • activation_function (str or function, optional, defaults to "relu") — The non-linear activation function (function or string) in the decoder and feed-forward layers. If string, "gelu", "relu", "selu", "swish" and "gelu_new" are supported.
  • dropout (float, optional, defaults to 0.1) — The dropout probability for all fully connected layers in the embeddings, encoder, decoder, and pooler.
  • attention_dropout (float, optional, defaults to 0.1) — The dropout probability for all attention layers.
  • activation_dropout (float, optional, defaults to 0.0) — The dropout probability for all activation layers in the model.
  • scale_embedding (bool, optional, defaults to True) — Scale embeddings by dividing by sqrt(d_model).

Text encoder and text decoder specific parameters

  • encoder_layers (int, optional, defaults to 24) — Number of hidden layers in the Transformer text encoder.
  • encoder_ffn_dim (int, optional, defaults to 8192) — Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer text encoder.
  • encoder_attention_heads (int, optional, defaults to 16) — Number of attention heads for each attention layer in the Transformer text encoder.
  • decoder_layers (int, optional, defaults to 24) — Number of hidden layers in the Transformer text decoder.
  • decoder_ffn_dim (int, optional, defaults to 8192) — Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer text decoder.
  • decoder_attention_heads (int, optional, defaults to 16) — Number of attention heads for each attention layer in the Transformer text decoder.
  • decoder_start_token_id (int, optional, defaults to 3) — If an encoder-decoder model starts decoding with a different token than bos, the id of that token. Only applied in the text decoder.
  • max_new_tokens (int, optional, defaults to 256) — The maximum numbers of text tokens to generate, ignoring the number of tokens in the prompt.
  • pad_token_id (int, optional, defaults to 0) — The id of the padding text token. Only applied to the text-decoder model.
  • bos_token_id (int, optional, defaults to 2) — The id of the beginning-of-stream text token. Only applied to the text-decoder model.
  • eos_token_id (int, optional, defaults to 3) — The id of the end-of-stream text token. Only applied to the text-decoder model.

Speech encoder specific parameters

  • speech_encoder_layers (int, optional, defaults to 24) — Number of hidden layers in the Transformer speech encoder.
  • speech_encoder_attention_heads (int, optional, defaults to 16) — Number of attention heads for each attention layer in the Transformer speech encoder.
  • speech_encoder_intermediate_size (int, optional, defaults to 4096) — Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer speech encoder.
  • speech_encoder_hidden_act (str or function, optional, defaults to "swish") — The non-linear activation function (function or string) in the speech encoder. If string, "gelu", "relu", "selu", "swish" and "gelu_new" are supported.
  • speech_encoder_dropout (float, optional, defaults to 0.0) — The dropout probability for all layers in the speech encoder.
  • add_adapter (bool, optional, defaults to True) — Add an adapter layer on top of the speech encoder.
  • speech_encoder_layerdrop (float, optional, defaults to 0.1) — The LayerDrop probability for the speech encoder. See the [LayerDrop paper](https://arxiv.org/abs/1909.11556) for more details.
  • feature_projection_input_dim (int, optional, defaults to 160) — Input dimension of the input feature projection of the speech encoder, i.e. the dimension after processing input audios with SeamlessM4TFeatureExtractor.
  • adaptor_kernel_size (int, optional, defaults to 8) — Kernel size of the convolutional layers in the adapter network. Only relevant if add_adapter is True.
  • adaptor_stride (int, optional, defaults to 8) — Stride of the convolutional layers in the adapter network. Only relevant if add_adapter is True.
  • adaptor_dropout (float, optional, defaults to 0.1) — The dropout probability for all layers in the speech adapter.
  • num_adapter_layers (int, optional, defaults to 1) — Number of convolutional layers that should be used in the adapter network. Only relevant if add_adapter is True.
  • position_embeddings_type (str, optional, defaults to "relative_key") — Can be set to "relative_key". If left to None, no relative position embedding is applied. Only applied to the speech encoder. For more information on "relative_key", please refer to Self-Attention with Relative Position Representations (Shaw et al.).
  • conv_depthwise_kernel_size (int, optional, defaults to 31) — Kernel size of convolutional depthwise 1D layer in Conformer blocks. Only applied to the speech encoder.
  • left_max_position_embeddings (int, optional, defaults to 64) — The left clipping value for relative positions.
  • right_max_position_embeddings (int, optional, defaults to 8) — The right clipping value for relative positions.
  • speech_encoder_chunk_size (int, optional, defaults to 20000) — The size of each attention chunk.
  • speech_encoder_left_chunk_num (int, optional, defaults to 128) — Number of chunks on the left up to which lookahead is allowed.

Text-To-Unit (t2u) model specific parameters

  • t2u_bos_token_id (int, optional, defaults to 0) — The id of the beginning-of-stream unit token. Only applied to the text-to-unit seq2seq model.
  • t2u_pad_token_id (int, optional, defaults to 1) — The id of the padding unit token. Only applied to the text-to-unit seq2seq model.
  • t2u_eos_token_id (int, optional, defaults to 2) — The id of the end-of-stream unit token. Only applied to the text-to-unit seq2seq model.
  • t2u_encoder_layers (int, optional, defaults to 6) — Number of hidden layers in the Transformer text-to-unit encoder.
  • t2u_encoder_ffn_dim (int, optional, defaults to 8192) — Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer text-to-unit encoder.
  • t2u_encoder_attention_heads (int, optional, defaults to 16) — Number of attention heads for each attention layer in the Transformer text-to-unit encoder.
  • t2u_decoder_layers (int, optional, defaults to 6) — Number of hidden layers in the Transformer text-to-unit decoder.
  • t2u_decoder_ffn_dim (int, optional, defaults to 8192) — Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer text-to-unit decoder.
  • t2u_decoder_attention_heads (int, optional, defaults to 16) — Number of attention heads for each attention layer in the Transformer text-to-unit decoder.
  • t2u_max_position_embeddings (int, optional, defaults to 4096) — The maximum sequence length that this model’s text-to-unit component might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
  • t2u_variance_predictor_embed_dim (int, optional, defaults to 1024) — The projection dimension of the text-to-unit’s duration predictor.
  • t2u_variance_predictor_hidden_dim (int, optional, defaults to 256) — Internal dimension of the text-to-unit’s duration predictor.
  • t2u_variance_predictor_kernel_size (int, optional, defaults to 3) — Kernel size of the convolutional layers of the text-to-unit’s duration predictor.
  • t2u_variance_pred_dropout (float, optional, defaults to 0.5) — The dropout probability of the text-to-unit’s duration predictor.

HiFi-GAN vocoder specific parameters

  • sampling_rate (int, optional, defaults to 16000) — The sampling rate at which the output audio will be generated, expressed in hertz (Hz).
  • upsample_initial_channel (int, optional, defaults to 512) — The number of input channels into the hifi-gan upsampling network. Applies to the vocoder only.
  • upsample_rates (Tuple[int] or List[int], optional, defaults to [5, 4, 4, 2, 2]) — A tuple of integers defining the stride of each 1D convolutional layer in the vocoder upsampling network. The length of upsample_rates defines the number of convolutional layers and has to match the length of upsample_kernel_sizes. Applies to the vocoder only.
  • upsample_kernel_sizes (Tuple[int] or List[int], optional, defaults to [11, 8, 8, 4, 4]) — A tuple of integers defining the kernel size of each 1D convolutional layer in the vocoder upsampling network. The length of upsample_kernel_sizes defines the number of convolutional layers and has to match the length of upsample_rates. Applies to the vocoder only.
  • resblock_kernel_sizes (Tuple[int] or List[int], optional, defaults to [3, 7, 11]) — A tuple of integers defining the kernel sizes of the vocoder 1D convolutional layers in the multi-receptive field fusion (MRF) module. Applies to the vocoder only.
  • resblock_dilation_sizes (Tuple[Tuple[int]] or List[List[int]], optional, defaults to [[1, 3, 5], [1, 3, 5], [1, 3, 5]]) — A nested tuple of integers defining the dilation rates of the vocoder dilated 1D convolutional layers in the multi-receptive field fusion (MRF) module. Applies to the vocoder only.
  • leaky_relu_slope (float, optional, defaults to 0.1) — The angle of the negative slope used by the leaky ReLU activation in the vocoder. Applies to the vocoder only.
  • unit_hifi_gan_vocab_size (int, optional, defaults to 10000) — Vocabulary size of the SeamlessM4Tv2 vocoder. Defines the number of different unit tokens that can be represented by the inputs_ids passed when calling the vocoder of ~SeamlessM4Tv2Model, ~SeamlessM4Tv2ForSpeechToSpeech or ~SeamlessM4Tv2ForTextToSpeech.
  • unit_embed_dim (int, optional, defaults to 1280) — The projection dimension of the input ids given to the hifi-gan vocoder. Applies to the vocoder only.
  • lang_embed_dim (int, optional, defaults to 256) — The projection dimension of the target language given to the hifi-gan vocoder. Applies to the vocoder only.
  • spkr_embed_dim (int, optional, defaults to 256) — The projection dimension of the speaker id given to the hifi-gan vocoder. Applies to the vocoder only.
  • vocoder_num_langs (int, optional, defaults to 36) — Number of languages supported by the vocoder. Might be different from t2u_num_langs.
  • vocoder_num_spkrs (int, optional, defaults to 200) — Number of speakers supported by the vocoder.
  • variance_predictor_kernel_size (int, optional, defaults to 3) — Kernel size of the duration predictor. Applies to the vocoder only.
  • var_pred_dropout (float, optional, defaults to 0.5) — The dropout probability of the duration predictor. Applies to the vocoder only.
  • vocoder_offset (int, optional, defaults to 4) — Offset the unit token ids by this number to account for symbol tokens. Applies to the vocoder only.

This is the configuration class to store the configuration of a ~SeamlessM4Tv2Model. It is used to instantiate a SeamlessM4Tv2 model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the SeamlessM4Tv2 facebook/seamless-m4t-v2-large architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

>>> from transformers import SeamlessM4Tv2Model, SeamlessM4Tv2Config

>>> # Initializing a SeamlessM4Tv2 "facebook/seamless-m4t-v2-large" style configuration
>>> configuration = SeamlessM4Tv2Config()

>>> # Initializing a model from the "facebook/seamless-m4t-v2-large" style configuration
>>> model = SeamlessM4Tv2Model(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config