The SeamlessM4T-v2 model was proposed in Seamless: Multilingual Expressive and Streaming Speech Translation by the Seamless Communication team from Meta AI.
SeamlessM4T-v2 is a collection of models designed to provide high quality translation, allowing people from different linguistic communities to communicate effortlessly through speech and text. It is an improvement on the previous version. For more details on the differences between v1 and v2, refer to section Difference with SeamlessM4T-v1.
SeamlessM4T-v2 enables multiple tasks without relying on separate models:
- Speech-to-speech translation (S2ST)
- Speech-to-text translation (S2TT)
- Text-to-speech translation (T2ST)
- Text-to-text translation (T2TT)
- Automatic speech recognition (ASR)
SeamlessM4Tv2Model can perform all the above tasks, but each task also has its own dedicated sub-model.
The abstract from the paper is the following:
Recent advancements in automatic speech translation have dramatically expanded language coverage, improved multimodal capabilities, and enabled a wide range of tasks and functionalities. That said, large-scale automatic speech translation systems today lack key features that help machine-mediated communication feel seamless when compared to human-to-human dialogue. In this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion. First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4T model—SeamlessM4T v2. This newer model, incorporating an updated UnitY2 framework, was trained on more low-resource language data. The expanded version of SeamlessAlign adds 114,800 hours of automatically aligned data for a total of 76 languages. SeamlessM4T v2 provides the foundation on which our two newest models, SeamlessExpressive and SeamlessStreaming, are initiated. SeamlessExpressive enables translation that preserves vocal styles and prosody. Compared to previous efforts in expressive speech research, our work addresses certain underexplored aspects of prosody, such as speech rate and pauses, while also preserving the style of one’s voice. As for SeamlessStreaming, our model leverages the Efficient Monotonic Multihead Attention (EMMA) mechanism to generate low-latency target translations without waiting for complete source utterances. As the first of its kind, SeamlessStreaming enables simultaneous speech-to-speech/text translation for multiple source and target languages. To understand the performance of these models, we combined novel and modified versions of existing automatic metrics to evaluate prosody, latency, and robustness. For human evaluations, we adapted existing protocols tailored for measuring the most relevant attributes in the preservation of meaning, naturalness, and expressivity. To ensure that our models can be used safely and responsibly, we implemented the first known red-teaming effort for multimodal machine translation, a system for the detection and mitigation of added toxicity, a systematic evaluation of gender bias, and an inaudible localized watermarking mechanism designed to dampen the impact of deepfakes. Consequently, we bring major components from SeamlessExpressive and SeamlessStreaming together to form Seamless, the first publicly available system that unlocks expressive cross-lingual communication in real-time. In sum, Seamless gives us a pivotal look at the technical foundation needed to turn the Universal Speech Translator from a science fiction concept into a real-world technology. Finally, contributions in this work—including models, code, and a watermark detector—are publicly released and accessible at the link below.
In the following example, we’ll load an Arabic audio sample and an English text sample and convert them into Russian speech and French text.
First, load the processor and a checkpoint of the model:
>>> from transformers import AutoProcessor, SeamlessM4Tv2Model
>>> processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
>>> model = SeamlessM4Tv2Model.from_pretrained("facebook/seamless-m4t-v2-large")
You can seamlessly use this model on text or on audio, to generate either translated text or translated audio.
Here is how to use the processor to process text and audio:
>>> # let's load an audio sample from an Arabic speech corpus
>>> from datasets import load_dataset
>>> dataset = load_dataset("arabic_speech_corpus", split="test", streaming=True)
>>> audio_sample = next(iter(dataset))["audio"]
>>> # now, process it
>>> audio_inputs = processor(audios=audio_sample["array"], return_tensors="pt")
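The processor's feature extractor works with 16kHz audio. If the corpus audio is sampled at a different rate, you can resample it on the fly with the datasets library before calling the processor (a minimal sketch beyond the original example; it assumes the corpus column is named "audio" as above):
>>> from datasets import Audio
>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))  # resample to 16kHz on the fly
>>> audio_sample = next(iter(dataset))["audio"]
>>> audio_inputs = processor(audios=audio_sample["array"], return_tensors="pt")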
>>> # now, process some English text as well
>>> text_inputs = processor(text="Hello, my dog is cute", src_lang="eng", return_tensors="pt")
SeamlessM4Tv2Model can seamlessly generate text or speech with few or no changes. Let’s target Russian voice translation:
>>> audio_array_from_text = model.generate(**text_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
>>> audio_array_from_audio = model.generate(**audio_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
With essentially the same code, we have translated English text and Arabic speech into Russian speech samples.
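To listen to the results, you can write the waveforms to WAV files, for example with scipy (a small sketch beyond the original example; the output sampling rate is read from the model configuration):
>>> import scipy.io.wavfile
>>> sample_rate = model.config.sampling_rate  # 16 kHz for this checkpoint
>>> scipy.io.wavfile.write("speech_from_text.wav", rate=sample_rate, data=audio_array_from_text)
>>> scipy.io.wavfile.write("speech_from_audio.wav", rate=sample_rate, data=audio_array_from_audio)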
Similarly, you can generate translated text from audio files or from text with the same model. You only have to pass generate_speech=False
to SeamlessM4Tv2Model.generate().
This time, let’s translate to French.
>>> # from audio
>>> output_tokens = model.generate(**audio_inputs, tgt_lang="fra", generate_speech=False)
>>> translated_text_from_audio = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
>>> # from text
>>> output_tokens = model.generate(**text_inputs, tgt_lang="fra", generate_speech=False)
>>> translated_text_from_text = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
SeamlessM4Tv2Model is the top-level Transformers model for generating both speech and text, but you can also use dedicated models that perform the task without the additional components, thus reducing the memory footprint. For example, you can replace the audio-to-audio generation snippet with the model dedicated to the S2ST task; the rest of the code is exactly the same:
>>> from transformers import SeamlessM4Tv2ForSpeechToSpeech
>>> model = SeamlessM4Tv2ForSpeechToSpeech.from_pretrained("facebook/seamless-m4t-v2-large")
Or you can replace the text-to-text generation snippet with the model dedicated to the T2TT task; you only have to remove generate_speech=False.
>>> from transformers import SeamlessM4Tv2ForTextToText
>>> model = SeamlessM4Tv2ForTextToText.from_pretrained("facebook/seamless-m4t-v2-large")
Feel free to try out SeamlessM4Tv2ForSpeechToText and SeamlessM4Tv2ForTextToSpeech as well.
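For instance, here is a hedged sketch of speech-to-text translation with the dedicated S2TT model, reusing the processor and audio_inputs from above (the dedicated text-generating models return token ids directly, so no extra indexing is needed before decoding):
>>> from transformers import SeamlessM4Tv2ForSpeechToText
>>> model = SeamlessM4Tv2ForSpeechToText.from_pretrained("facebook/seamless-m4t-v2-large")
>>> output_tokens = model.generate(**audio_inputs, tgt_lang="fra")
>>> translated_text_from_audio = processor.decode(output_tokens[0].tolist(), skip_special_tokens=True)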
You can change the speaker used for speech synthesis with the speaker_id argument. Some speaker_id values work better than others for certain languages!
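For example, with the speech-generating SeamlessM4Tv2Model loaded in the first snippet (a minimal sketch; the speaker_id value is arbitrary and must stay below config.vocoder_num_spkrs):
>>> audio_array = model.generate(**text_inputs, tgt_lang="rus", speaker_id=4)[0].cpu().numpy().squeeze()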
You can use different generation strategies for text generation, e.g. .generate(input_ids=input_ids, text_num_beams=4, text_do_sample=True), which will perform multinomial beam-search decoding on the text model. Note that speech generation only supports greedy decoding (the default) or multinomial sampling, which can be used with e.g. .generate(..., speech_do_sample=True, speech_temperature=0.6).
Use return_intermediate_token_ids=True with SeamlessM4Tv2Model to return both speech and text!
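A small sketch of what that looks like with SeamlessM4Tv2Model (the waveform and sequences attribute names of the returned SeamlessM4Tv2GenerationOutput are assumptions here, not taken from the snippets above):
>>> outputs = model.generate(**text_inputs, tgt_lang="rus", return_intermediate_token_ids=True)
>>> speech = outputs.waveform[0].cpu().numpy().squeeze()
>>> text = processor.decode(outputs.sequences[0].tolist(), skip_special_tokens=True)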
SeamlessM4T-v2 features a versatile architecture that smoothly handles the sequential generation of text and speech. This setup comprises two sequence-to-sequence (seq2seq) models. The first model translates the input modality into translated text, while the second model generates speech tokens, known as “unit tokens,” from the translated text.
Each modality has its own dedicated encoder with a unique architecture. Additionally, for speech output, a vocoder inspired by the HiFi-GAN architecture is placed on top of the second seq2seq model.
The architecture of this new version differs from the first in a few aspects:
The second seq2seq model, named the text-to-unit model, is now non-autoregressive, meaning that it computes units in a single forward pass. This is made possible by:
- the use of character-level embeddings, so that each character of the predicted translated text has its own embedding, which is then used to predict the unit tokens (see char_vocab_size in the configuration);
- an intermediate character-level duration predictor, which predicts the speech duration of the predicted translated text (see the t2u_variance_predictor_* configuration parameters).
The speech encoder, which is used during the first-pass generation process to predict the translated text, differs mainly from the previous speech encoder through these mechanisms:
- chunked attention, where each position attends only to positions within its own chunk and a fixed number of previous chunks (see speech_encoder_chunk_size and speech_encoder_left_chunk_num in the configuration);
- relative position embeddings, which only consider the distance between sequence elements rather than absolute positions (see position_embeddings_type="relative_key").
Here’s how the generation process works:
- The input text or speech is processed by its specific encoder.
- A decoder generates text tokens in the desired language.
- If speech output is requested, the second seq2seq model generates unit tokens from the translated text.
- These unit tokens are passed through the final vocoder to produce the actual speech.
This model was contributed by ylacombe. The original code can be found here.
( config current_modality = 'text' )
Parameters

current_modality (str, optional, defaults to "text") —
Default modality. Used only to initialize the model. It can be set to "text" or "speech". This will be updated automatically according to the modality passed to the forward and generate passes (input_ids for text and input_features for audio).

The original SeamlessM4Tv2 Model transformer which can be used for every task available (S2ST, S2TT, T2TT, T2ST). This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
( input_ids: Optional = None input_features: Optional = None return_intermediate_token_ids: Optional = None tgt_lang: Optional = None speaker_id: Optional = 0 generate_speech: Optional = True **kwargs ) → Union[SeamlessM4Tv2GenerationOutput, Tuple[Tensor], ModelOutput]
Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) —
Indices of input sequence tokens in the vocabulary.
Indices can be obtained using SeamlessM4TTokenizer or SeamlessM4TProcessor. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

input_features (torch.FloatTensor of shape (batch_size, sequence_length, num_banks), optional) —
Input audio features. This should be returned by the SeamlessM4TFeatureExtractor class or the SeamlessM4TProcessor class. See SeamlessM4TFeatureExtractor.__call__() for details.

return_intermediate_token_ids (bool, optional) —
If True, also returns the intermediate generated text and unit tokens. Set to True if you also want to get translated text alongside the audio. Note that if generate_speech=True, this parameter will be ignored.

tgt_lang (str, optional) —
The language to use as target language for translation.

speaker_id (int, optional, defaults to 0) —
The id of the speaker used for speech synthesis. Must be lower than config.vocoder_num_spkrs.

generate_speech (bool, optional, defaults to True) —
If False, will only return the text tokens and won't generate speech.

kwargs (optional) —
Remaining dictionary of keyword arguments. Keyword arguments come in two types:
- Without a prefix, they will be entered as **kwargs for the generate method of each sub-model, except for decoder_input_ids which will only be passed through the text components.
- With a text_ or speech_ prefix, they will be input for the generate method of the text model and speech model respectively. This has priority over the keywords without a prefix, which means you can, for example, specify a generation strategy for one generation but not for the other.
Returns

Union[SeamlessM4Tv2GenerationOutput, Tuple[Tensor], ModelOutput]

- If generate_speech and return_intermediate_token_ids, returns SeamlessM4Tv2GenerationOutput.
- If generate_speech and not return_intermediate_token_ids, returns a tuple composed of waveforms of shape (batch_size, sequence_length) and waveform_lengths which gives the length of each sample.
- If generate_speech=False, it will return ModelOutput.

Generates translated token ids and/or translated audio waveforms.
This method successively calls the .generate
function of two different sub-models. You can specify keyword
arguments at two different levels: general arguments that will be passed to both models, or prefixed arguments
that will be passed to one of them.
For example, calling .generate(input_ids=input_ids, num_beams=4, speech_do_sample=True)
will successively
perform beam-search decoding on the text model, and multinomial beam-search sampling on the speech model.
For an overview of generation strategies and code examples, check out the following guide.
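As an illustration, here is a hedged sketch mixing general and prefixed keyword arguments, reusing text_inputs from the usage example above (the values are arbitrary; text_num_beams and speech_do_sample follow the prefix convention described above):
>>> outputs = model.generate(
...     **text_inputs,
...     tgt_lang="rus",
...     text_num_beams=4,        # applies only to the first-pass text generation
...     speech_do_sample=True,   # applies only to the second-pass speech generation
...     speech_temperature=0.6,
... )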
( config: SeamlessM4Tv2Config )
Parameters
The text-to-speech SeamlessM4Tv2 Model transformer which can be used for T2ST. This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
( input_ids: Optional = None return_intermediate_token_ids: Optional = None tgt_lang: Optional = None speaker_id: Optional = 0 **kwargs ) → Union[SeamlessM4Tv2GenerationOutput, Tuple[Tensor]]
Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) —
Indices of input sequence tokens in the vocabulary.
Indices can be obtained using SeamlessM4TTokenizer or SeamlessM4TProcessor. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

return_intermediate_token_ids (bool, optional) —
If True, also returns the intermediate generated text and unit tokens. Set to True if you also want to get translated text alongside the audio.

tgt_lang (str, optional) —
The language to use as target language for translation.

speaker_id (int, optional, defaults to 0) —
The id of the speaker used for speech synthesis. Must be lower than config.vocoder_num_spkrs.

kwargs (optional) —
Remaining dictionary of keyword arguments. Keyword arguments come in two types:
- Without a prefix, they will be entered as **kwargs for the generate method of each sub-model, except for decoder_input_ids which will only be passed through the text components.
- With a text_ or speech_ prefix, they will be input for the generate method of the text model and speech model respectively. This has priority over the keywords without a prefix, which means you can, for example, specify a generation strategy for one generation but not for the other.
Returns

Union[SeamlessM4Tv2GenerationOutput, Tuple[Tensor]]

- If return_intermediate_token_ids, returns SeamlessM4Tv2GenerationOutput.
- If not return_intermediate_token_ids, returns a tuple composed of waveforms of shape (batch_size, sequence_length) and waveform_lengths which gives the length of each sample.

Generates translated audio waveforms.
This method successively calls the .generate
function of two different sub-models. You can specify keyword
arguments at two different levels: general arguments that will be passed to both models, or prefixed arguments
that will be passed to one of them.
For example, calling .generate(input_ids, num_beams=4, speech_do_sample=True)
will successively perform
beam-search decoding on the text model, and multinomial beam-search sampling on the speech model.
For an overview of generation strategies and code examples, check out the following guide.
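A short, hedged usage sketch for this class, mirroring the general usage example above rather than reproducing an official snippet:
>>> from transformers import AutoProcessor, SeamlessM4Tv2ForTextToSpeech
>>> processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
>>> model = SeamlessM4Tv2ForTextToSpeech.from_pretrained("facebook/seamless-m4t-v2-large")
>>> inputs = processor(text="Hello, my dog is cute", src_lang="eng", return_tensors="pt")
>>> audio_array = model.generate(**inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()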
( config )
Parameters
The speech-to-speech SeamlessM4Tv2 Model transformer which can be used for S2ST. This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
( input_features: Optional = None return_intermediate_token_ids: Optional = None tgt_lang: Optional = None speaker_id: Optional = 0 **kwargs ) → Union[SeamlessM4Tv2GenerationOutput, Tuple[Tensor]]
Parameters

input_features (torch.FloatTensor of shape (batch_size, sequence_length, num_banks)) —
Input audio features. This should be returned by the SeamlessM4TFeatureExtractor class or the SeamlessM4TProcessor class. See SeamlessM4TFeatureExtractor.__call__() for details.

return_intermediate_token_ids (bool, optional) —
If True, also returns the intermediate generated text and unit tokens. Set to True if you also want to get translated text alongside the audio.

tgt_lang (str, optional) —
The language to use as target language for translation.

speaker_id (int, optional, defaults to 0) —
The id of the speaker used for speech synthesis. Must be lower than config.vocoder_num_spkrs.

kwargs (optional) —
Remaining dictionary of keyword arguments. Keyword arguments come in two types:
- Without a prefix, they will be entered as **kwargs for the generate method of each sub-model, except for decoder_input_ids which will only be passed through the text components.
- With a text_ or speech_ prefix, they will be input for the generate method of the text model and speech model respectively. This has priority over the keywords without a prefix, which means you can, for example, specify a generation strategy for one generation but not for the other.
Returns

Union[SeamlessM4Tv2GenerationOutput, Tuple[Tensor]]

- If return_intermediate_token_ids, returns SeamlessM4Tv2GenerationOutput.
- If not return_intermediate_token_ids, returns a tuple composed of waveforms of shape (batch_size, sequence_length) and waveform_lengths which gives the length of each sample.

Generates translated audio waveforms.
This method successively calls the .generate
function of two different sub-models. You can specify keyword
arguments at two different levels: general arguments that will be passed to both models, or prefixed arguments
that will be passed to one of them.
For example, calling .generate(input_features, num_beams=4, speech_do_sample=True)
will successively perform
beam-search decoding on the text model, and multinomial beam-search sampling on the speech model.
For an overview of generation strategies and code examples, check out the following guide.
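A short, hedged usage sketch for this class, reusing the Arabic audio sample loaded in the usage section above:
>>> from transformers import AutoProcessor, SeamlessM4Tv2ForSpeechToSpeech
>>> processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
>>> model = SeamlessM4Tv2ForSpeechToSpeech.from_pretrained("facebook/seamless-m4t-v2-large")
>>> inputs = processor(audios=audio_sample["array"], return_tensors="pt")
>>> audio_array = model.generate(**inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()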
( config: SeamlessM4Tv2Config )
Parameters
The text-to-text SeamlessM4Tv2 Model transformer which can be used for T2TT. This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
( input_ids: LongTensor = None attention_mask: Optional = None decoder_input_ids: Optional = None decoder_attention_mask: Optional = None encoder_outputs: Optional = None past_key_values: Optional = None inputs_embeds: Optional = None decoder_inputs_embeds: Optional = None labels: Optional = None use_cache: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None **kwargs )
Parameters
torch.LongTensor
of shape (batch_size, sequence_length)
, optional) —
Indices of input sequence tokens in the vocabulary.
Indices can be obtained using SeamlessM4TTokenizer or SeamlessM4TProcessor. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
torch.FloatTensor
of shape (batch_size, sequence_length)
, optional) —
Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]
:
torch.LongTensor
of shape (batch_size, target_sequence_length)
, optional) —
Indices of decoder input sequence tokens in the vocabulary.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
Bart uses the eos_token_id
as the starting token for decoder_input_ids
generation. If past_key_values
is used, optionally only the last decoder_input_ids
have to be input (see past_key_values
).
For translation and summarization training, decoder_input_ids
should be provided. If no
decoder_input_ids
is provided, the model will create this tensor by shifting the input_ids
to the right
for denoising pre-training following the paper.
torch.LongTensor
of shape (batch_size, target_sequence_length)
, optional) —
Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids
. Causal mask will also
be used by default.
If you want to change padding behavior, you should read modeling_bart._prepare_decoder_attention_mask
and modify to your needs. See diagram 1 in the paper for more
information on the default strategy.
tuple(tuple(torch.FloatTensor)
, optional) —
Tuple consists of (last_hidden_state
, optional: hidden_states
, optional: attentions
)
last_hidden_state
of shape (batch_size, sequence_length, hidden_size)
, optional) is a sequence of
hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder. tuple(tuple(torch.FloatTensor))
, optional, returned when use_cache=True
is passed or when config.use_cache=True
) —
Tuple of tuple(torch.FloatTensor)
of length config.n_layers
, with each tuple having 2 tensors of shape
(batch_size, num_heads, sequence_length, embed_size_per_head)
) and 2 additional tensors of shape
(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)
.
Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
blocks) that can be used (see past_key_values
input) to speed up sequential decoding.
If past_key_values
are used, the user can optionally input only the last decoder_input_ids
(those that
don’t have their past key value states given to this model) of shape (batch_size, 1)
instead of all
decoder_input_ids
of shape (batch_size, sequence_length)
.
torch.FloatTensor
of shape(batch_size, sequence_length, hidden_size)
, optional) —
Optionally, instead of passing input_ids
you can choose to directly pass an embedded representation. This
is useful if you want more control over how to convert input_ids
indices into associated vectors than the
model’s internal embedding lookup matrix. torch.FloatTensor
of shape (batch_size, target_sequence_length, hidden_size)
, optional) —
Optionally, instead of passing decoder_input_ids
you can choose to directly pass an embedded
representation. If past_key_values
is used, optionally only the last decoder_inputs_embeds
have to be
input (see past_key_values
). This is useful if you want more control over how to convert
decoder_input_ids
indices into associated vectors than the model’s internal embedding lookup matrix.
If decoder_input_ids
and decoder_inputs_embeds
are both unset, decoder_inputs_embeds
takes the value
of inputs_embeds
.
torch.LongTensor
of shape (batch_size, sequence_length)
, optional) —
Labels for computing the masked language modeling loss. Indices should be in [-100, 0, ..., config.vocab_size]
(see input_ids
docstring) Tokens with indices set to -100
are ignored (masked), the
loss is only computed for the tokens with labels in [0, ..., config.vocab_size]
bool
, optional) —
If set to True
, past_key_values
key value states are returned and can be used to speed up decoding (see
past_key_values
). bool
, optional) —
Whether or not to return the attentions tensors of all attention layers. See attentions
under returned
tensors for more detail. bool
, optional) —
Whether or not to return the hidden states of all layers. See hidden_states
under returned tensors for
more detail. bool
, optional) —
Whether or not to return a ModelOutput instead of a plain tuple. The SeamlessM4Tv2ForTextToText forward method overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
( input_ids = None tgt_lang = None generation_config = None logits_processor = None stopping_criteria = None prefix_allowed_tokens_fn = None synced_gpus = False **kwargs ) → ModelOutput or torch.LongTensor
Parameters
torch.Tensor
of varying shape depending on the modality, optional) —
Indices of input sequence tokens in the vocabulary.
Indices can be obtained using SeamlessM4TTokenizer or SeamlessM4TProcessor. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
str
, optional) —
The language to use as target language for translation. ~generation.GenerationConfig
, optional) —
The generation configuration to be used as base parametrization for the generation call. **kwargs
passed to generate matching the attributes of generation_config
will override them. If
generation_config
is not provided, the default will be used, which has the following loading
priority: 1) from the generation_config.json
model file, if it exists; 2) from the model
configuration. Please note that unspecified parameters will inherit GenerationConfig’s
default values, whose documentation should be checked to parameterize generation. LogitsProcessorList
, optional) —
Custom logits processors that complement the default logits processors built from arguments and
generation config. If a logit processor is passed that is already created with the arguments or a
generation config an error is thrown. This feature is intended for advanced users. StoppingCriteriaList
, optional) —
Custom stopping criteria that complement the default stopping criteria built from arguments and a
generation config. If a stopping criteria is passed that is already created with the arguments or a
generation config an error is thrown. This feature is intended for advanced users. Callable[[int, torch.Tensor], List[int]]
, optional) —
If provided, this function constrains the beam search to allowed tokens only at each step. If not
provided no constraint is applied. This function takes 2 arguments: the batch ID batch_id
and
input_ids
. It has to return a list with the allowed tokens for the next generation step conditioned
on the batch ID batch_id
and the previously generated tokens input_ids
. This argument is useful
for constrained generation conditioned on the prefix, as described in Autoregressive Entity
Retrieval. bool
, optional, defaults to False
) —
Whether to continue running the while loop until max_length (needed for ZeRO stage 3) Dict[str, Any]
, optional) —
Ad hoc parametrization of generate_config
and/or additional model-specific kwargs that will be
forwarded to the forward
function of the model. Returns
ModelOutput or torch.LongTensor
A ModelOutput (if return_dict_in_generate=True
or when config.return_dict_in_generate=True
) or a torch.FloatTensor
. The possible
ModelOutput types are:
Generates sequences of token ids.
Most generation-controlling parameters are set in generation_config
which, if not passed, will be set to the
model’s default generation configuration. You can override any generation_config
by passing the corresponding
parameters to generate(), e.g. .generate(inputs, num_beams=4, do_sample=True)
.
For an overview of generation strategies and code examples, check out the following guide.
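A short, hedged usage sketch for this class:
>>> from transformers import AutoProcessor, SeamlessM4Tv2ForTextToText
>>> processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
>>> model = SeamlessM4Tv2ForTextToText.from_pretrained("facebook/seamless-m4t-v2-large")
>>> inputs = processor(text="Hello, my dog is cute", src_lang="eng", return_tensors="pt")
>>> output_tokens = model.generate(**inputs, tgt_lang="fra", num_beams=4)
>>> translated_text = processor.decode(output_tokens[0].tolist(), skip_special_tokens=True)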
( config: SeamlessM4Tv2Config )
Parameters
The speech-to-text SeamlessM4Tv2 Model transformer which can be used for S2TT. This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
( input_features: LongTensor = None attention_mask: Optional = None decoder_input_ids: Optional = None decoder_attention_mask: Optional = None encoder_outputs: Optional = None past_key_values: Optional = None inputs_embeds: Optional = None decoder_inputs_embeds: Optional = None labels: Optional = None use_cache: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None **kwargs )
Parameters
torch.FloatTensor
of shape (batch_size, sequence_length, num_banks)
) —
Input audio features. This should be returned by the SeamlessM4TFeatureExtractor class or the
SeamlessM4TProcessor class. See SeamlessM4TFeatureExtractor.call() for details. torch.FloatTensor
of shape (batch_size, sequence_length)
, optional) —
Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]
:
torch.LongTensor
of shape (batch_size, target_sequence_length)
, optional) —
Indices of decoder input sequence tokens in the vocabulary.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
Bart uses the eos_token_id
as the starting token for decoder_input_ids
generation. If past_key_values
is used, optionally only the last decoder_input_ids
have to be input (see past_key_values
).
For translation and summarization training, decoder_input_ids
should be provided. If no
decoder_input_ids
is provided, the model will create this tensor by shifting the input_ids
to the right
for denoising pre-training following the paper.
torch.LongTensor
of shape (batch_size, target_sequence_length)
, optional) —
Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids
. Causal mask will also
be used by default.
If you want to change padding behavior, you should read modeling_bart._prepare_decoder_attention_mask
and modify to your needs. See diagram 1 in the paper for more
information on the default strategy.
tuple(tuple(torch.FloatTensor)
, optional) —
Tuple consists of (last_hidden_state
, optional: hidden_states
, optional: attentions
)
last_hidden_state
of shape (batch_size, sequence_length, hidden_size)
, optional) is a sequence of
hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder. tuple(tuple(torch.FloatTensor))
, optional, returned when use_cache=True
is passed or when config.use_cache=True
) —
Tuple of tuple(torch.FloatTensor)
of length config.n_layers
, with each tuple having 2 tensors of shape
(batch_size, num_heads, sequence_length, embed_size_per_head)
) and 2 additional tensors of shape
(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)
.
Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
blocks) that can be used (see past_key_values
input) to speed up sequential decoding.
If past_key_values
are used, the user can optionally input only the last decoder_input_ids
(those that
don’t have their past key value states given to this model) of shape (batch_size, 1)
instead of all
decoder_input_ids
of shape (batch_size, sequence_length)
.
torch.FloatTensor
of shape(batch_size, sequence_length, hidden_size)
, optional) —
Optionally, instead of passing input_ids
you can choose to directly pass an embedded representation. This
is useful if you want more control over how to convert input_ids
indices into associated vectors than the
model’s internal embedding lookup matrix. torch.FloatTensor
of shape (batch_size, target_sequence_length, hidden_size)
, optional) —
Optionally, instead of passing decoder_input_ids
you can choose to directly pass an embedded
representation. If past_key_values
is used, optionally only the last decoder_inputs_embeds
have to be
input (see past_key_values
). This is useful if you want more control over how to convert
decoder_input_ids
indices into associated vectors than the model’s internal embedding lookup matrix.
If decoder_input_ids
and decoder_inputs_embeds
are both unset, decoder_inputs_embeds
takes the value
of inputs_embeds
.
torch.LongTensor
of shape (batch_size, sequence_length)
, optional) —
Labels for computing the masked language modeling loss. Indices should be in [-100, 0, ..., config.vocab_size]
(see input_ids
docstring) Tokens with indices set to -100
are ignored (masked), the
loss is only computed for the tokens with labels in [0, ..., config.vocab_size]
bool
, optional) —
If set to True
, past_key_values
key value states are returned and can be used to speed up decoding (see
past_key_values
). bool
, optional) —
Whether or not to return the attentions tensors of all attention layers. See attentions
under returned
tensors for more detail. bool
, optional) —
Whether or not to return the hidden states of all layers. See hidden_states
under returned tensors for
more detail. bool
, optional) —
Whether or not to return a ModelOutput instead of a plain tuple. The SeamlessM4Tv2ForSpeechToText forward method overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
( input_features = None tgt_lang = None generation_config = None logits_processor = None stopping_criteria = None prefix_allowed_tokens_fn = None synced_gpus = False **kwargs ) → ModelOutput or torch.LongTensor
Parameters
torch.FloatTensor
of shape (batch_size, sequence_length, num_banks)
) —
Input audio features. This should be returned by the SeamlessM4TFeatureExtractor class or the
SeamlessM4TProcessor class. See SeamlessM4TFeatureExtractor.call() for details. str
, optional) —
The language to use as target language for translation. ~generation.GenerationConfig
, optional) —
The generation configuration to be used as base parametrization for the generation call. **kwargs
passed to generate matching the attributes of generation_config
will override them. If
generation_config
is not provided, the default will be used, which has the following loading
priority: 1) from the generation_config.json
model file, if it exists; 2) from the model
configuration. Please note that unspecified parameters will inherit GenerationConfig’s
default values, whose documentation should be checked to parameterize generation. LogitsProcessorList
, optional) —
Custom logits processors that complement the default logits processors built from arguments and
generation config. If a logit processor is passed that is already created with the arguments or a
generation config an error is thrown. This feature is intended for advanced users. StoppingCriteriaList
, optional) —
Custom stopping criteria that complement the default stopping criteria built from arguments and a
generation config. If a stopping criteria is passed that is already created with the arguments or a
generation config an error is thrown. This feature is intended for advanced users. Callable[[int, torch.Tensor], List[int]]
, optional) —
If provided, this function constrains the beam search to allowed tokens only at each step. If not
provided no constraint is applied. This function takes 2 arguments: the batch ID batch_id
and
input_ids
. It has to return a list with the allowed tokens for the next generation step conditioned
on the batch ID batch_id
and the previously generated tokens input_ids
. This argument is useful
for constrained generation conditioned on the prefix, as described in Autoregressive Entity
Retrieval. bool
, optional, defaults to False
) —
Whether to continue running the while loop until max_length (needed for ZeRO stage 3) Dict[str, Any]
, optional) —
Ad hoc parametrization of generate_config
and/or additional model-specific kwargs that will be
forwarded to the forward
function of the model. Returns
ModelOutput or torch.LongTensor
A ModelOutput (if return_dict_in_generate=True
or when config.return_dict_in_generate=True
) or a torch.FloatTensor
. The possible
ModelOutput types are:
Generates sequences of token ids.
Most generation-controlling parameters are set in generation_config
which, if not passed, will be set to the
model’s default generation configuration. You can override any generation_config
by passing the corresponding
parameters to generate(), e.g. .generate(inputs, num_beams=4, do_sample=True)
.
For an overview of generation strategies and code examples, check out the following guide.
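A short, hedged usage sketch for this class, reusing the Arabic audio sample from the usage section above:
>>> from transformers import AutoProcessor, SeamlessM4Tv2ForSpeechToText
>>> processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
>>> model = SeamlessM4Tv2ForSpeechToText.from_pretrained("facebook/seamless-m4t-v2-large")
>>> inputs = processor(audios=audio_sample["array"], return_tensors="pt")
>>> output_tokens = model.generate(**inputs, tgt_lang="eng")
>>> translated_text = processor.decode(output_tokens[0].tolist(), skip_special_tokens=True)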
( vocab_size = 256102 t2u_vocab_size = 10082 char_vocab_size = 10943 hidden_size = 1024 initializer_range = 0.02 layer_norm_eps = 1e-05 use_cache = True max_position_embeddings = 4096 is_encoder_decoder = True encoder_layerdrop = 0.05 decoder_layerdrop = 0.05 activation_function = 'relu' dropout = 0.1 attention_dropout = 0.1 activation_dropout = 0.0 scale_embedding = True encoder_layers = 24 encoder_ffn_dim = 8192 encoder_attention_heads = 16 decoder_layers = 24 decoder_ffn_dim = 8192 decoder_attention_heads = 16 decoder_start_token_id = 3 max_new_tokens = 256 pad_token_id = 0 bos_token_id = 2 eos_token_id = 3 speech_encoder_layers = 24 speech_encoder_attention_heads = 16 speech_encoder_intermediate_size = 4096 speech_encoder_hidden_act = 'swish' speech_encoder_dropout = 0.0 add_adapter = True speech_encoder_layerdrop = 0.1 feature_projection_input_dim = 160 adaptor_kernel_size = 8 adaptor_stride = 8 adaptor_dropout = 0.1 num_adapter_layers = 1 position_embeddings_type = 'relative_key' conv_depthwise_kernel_size = 31 left_max_position_embeddings = 64 right_max_position_embeddings = 8 speech_encoder_chunk_size = 20000 speech_encoder_left_chunk_num = 128 t2u_bos_token_id = 0 t2u_pad_token_id = 1 t2u_eos_token_id = 2 t2u_encoder_layers = 6 t2u_encoder_ffn_dim = 8192 t2u_encoder_attention_heads = 16 t2u_decoder_layers = 6 t2u_decoder_ffn_dim = 8192 t2u_decoder_attention_heads = 16 t2u_max_position_embeddings = 4096 t2u_variance_predictor_embed_dim = 1024 t2u_variance_predictor_hidden_dim = 256 t2u_variance_predictor_kernel_size = 3 t2u_variance_pred_dropout = 0.5 sampling_rate = 16000 upsample_initial_channel = 512 upsample_rates = [5, 4, 4, 2, 2] upsample_kernel_sizes = [11, 8, 8, 4, 4] resblock_kernel_sizes = [3, 7, 11] resblock_dilation_sizes = [[1, 3, 5], [1, 3, 5], [1, 3, 5]] leaky_relu_slope = 0.1 unit_hifi_gan_vocab_size = 10000 unit_embed_dim = 1280 lang_embed_dim = 256 spkr_embed_dim = 256 vocoder_num_langs = 36 vocoder_num_spkrs = 200 variance_predictor_kernel_size = 3 var_pred_dropout = 0.5 vocoder_offset = 4 **kwargs )
Parameters
int
, optional, defaults to 256102) —
Vocabulary size of the text modality of the SeamlessM4Tv2 model. Defines the number of different tokens
that can be represented by the inputs_ids
passed when calling ~SeamlessM4Tv2Model,
~SeamlessM4Tv2ForTextToSpeech or ~SeamlessM4Tv2ForTextToText. int
, optional, defaults to 10082) —
Unit vocabulary size of the SeamlessM4Tv2 model. Defines the number of different “unit tokens” that can be
represented by the inputs_ids
passed when calling the Text-To-Units sub-model of ~SeamlessM4Tv2Model,
~SeamlessM4Tv2ForSpeechToSpeech or ~SeamlessM4Tv2ForTextToSpeech. int
, optional, defaults to 10943) —
Character vocabulary size of the SeamlessM4Tv2 model. Defines the number of different character tokens that
can be represented by the char_inputs_ids
passed when calling the Text-To-Units sub-model of
~SeamlessM4Tv2Model, ~SeamlessM4Tv2ForSpeechToSpeech or ~SeamlessM4Tv2ForTextToSpeech. Parameters shared across sub-models
int
, optional, defaults to 1024) —
Dimensionality of the “intermediate” layers in the architecture. float
, optional, defaults to 0.02) —
The standard deviation of the truncated_normal_initializer for initializing all weight matrices. float
, optional, defaults to 1e-05) —
The epsilon used by the layer normalization layers. bool
, optional, defaults to True
) —
Whether or not the model should return the last key/values attentions (not used by all models). int
, optional, defaults to 4096) —
The maximum sequence length that this model text encoder and decoder might ever be used with. Typically set
this to something large just in case (e.g., 512 or 1024 or 2048). bool
, optional, defaults to True
) —
Whether the model is used as an encoder/decoder or not. float
, optional, defaults to 0.05) —
The LayerDrop probability for the encoders. See the [LayerDrop paper](https://arxiv.org/abs/1909.11556)
for more details. float
, optional, defaults to 0.05) —
The LayerDrop probability for the decoders. See the [LayerDrop paper](https://arxiv.org/abs/1909.11556)
for more details. str
or function
, optional, defaults to "relu"
) —
The non-linear activation function (function or string) in the decoder and feed-forward layers. If string,
"gelu"
, "relu"
, "selu"
, "swish"
and "gelu_new"
are supported. float
, optional, defaults to 0.1) —
The dropout probability for all fully connected layers in the embeddings, encoder, decoder, and pooler. float
, optional, defaults to 0.1) —
The dropout probability for all attention layers. float
, optional, defaults to 0.0) —
The dropout probability for all activation layers in the model. bool
, optional, defaults to True
) —
Scale embeddings by dividing by sqrt(d_model). Text encoder and text decoder specific parameters
int
, optional, defaults to 24) —
Number of hidden layers in the Transformer text encoder. int
, optional, defaults to 8192) —
Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer text encoder. int
, optional, defaults to 16) —
Number of attention heads for each attention layer in the Transformer text encoder. int
, optional, defaults to 24) —
Number of hidden layers in the Transformer text decoder. int
, optional, defaults to 8192) —
Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer text decoder. int
, optional, defaults to 16) —
Number of attention heads for each attention layer in the Transformer text decoder. int
, optional, defaults to 3) —
If an encoder-decoder model starts decoding with a different token than bos, the id of that token. Only
applied in the text decoder. int
, optional, defaults to 256) —
The maximum numbers of text tokens to generate, ignoring the number of tokens in the prompt. int
, optional, defaults to 0) —
The id of the padding text token. Only applied to the text-decoder model. int
, optional, defaults to 2) —
The id of the beginning-of-stream text token. Only applied to the text-decoder model. int
, optional, defaults to 3) —
The id of the end-of-stream text token. Only applied to the text-decoder model. Speech encoder specific parameters
int
, optional, defaults to 24) —
Number of hidden layers in the Transformer speech encoder. int
, optional, defaults to 16) —
Number of attention heads for each attention layer in the Transformer speech encoder. int
, optional, defaults to 4096) —
Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer speech encoder. str
or function
, optional, defaults to "swish"
) —
The non-linear activation function (function or string) in the speech encoder. If string, "gelu"
,
"relu"
, "selu"
, "swish"
and "gelu_new"
are supported. float
, optional, defaults to 0.0) —
The dropout probability for all layers in the speech encoder. bool
, optional, defaults to True
) —
Add an adapter layer on top of the speech encoder. float
, optional, defaults to 0.1) —
The LayerDrop probability for the speech encoder. See the [LayerDrop paper](https://arxiv.org/abs/1909.11556) for more details. int
, optional, defaults to 160) —
Input dimension of the input feature projection of the speech encoder, i.e. the dimension after processing
input audios with SeamlessM4TFeatureExtractor. int
, optional, defaults to 8) —
Kernel size of the convolutional layers in the adapter network. Only relevant if add_adapter is True
. int
, optional, defaults to 8) —
Stride of the convolutional layers in the adapter network. Only relevant if add_adapter is True
. float
, optional, defaults to 0.1) —
The dropout probability for all layers in the speech adapter. int
, optional, defaults to 1) —
Number of convolutional layers that should be used in the adapter network. Only relevant if add_adapter is True
. str
, optional, defaults to "relative_key"
) —
Can be specified to relative_key
. If left to None
, no relative position embedding is applied. Only
applied to the speech encoder. For more information on "relative_key"
, please refer to Self-Attention
with Relative Position Representations (Shaw et al.). int
, optional, defaults to 31) —
Kernel size of convolutional depthwise 1D layer in Conformer blocks. Only applied to the speech encoder. int
, optional, defaults to 64) —
The left clipping value for relative positions. int
, optional, defaults to 8) —
The right clipping value for relative positions. int
, optional, defaults to 20000) — The size of each attention chunk. int
, optional, defaults to 128) —
Number of chunks on the left up to which lookahead is allowed. Text-To-Unit (t2u) model specific parameters
int
, optional, defaults to 0) —
The id of the beginning-of-stream unit token. Only applied to the text-to-unit seq2seq model. int
, optional, defaults to 1) —
The id of the padding unit token. Only applied to the text-to-unit seq2seq model. int
, optional, defaults to 2) —
The id of the end-of-stream unit token. Only applied to the text-to-unit seq2seq model. int
, optional, defaults to 6) —
Number of hidden layers in the Transformer text-to-unit encoder. int
, optional, defaults to 8192) —
Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer text-to-unit encoder. int
, optional, defaults to 16) —
Number of attention heads for each attention layer in the Transformer text-to-unit encoder. int
, optional, defaults to 6) —
Number of hidden layers in the Transformer text-to-unit decoder. int
, optional, defaults to 8192) —
Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer text-to-unit decoder. int
, optional, defaults to 16) —
Number of attention heads for each attention layer in the Transformer text-to-unit decoder. int
, optional, defaults to 4096) —
The maximum sequence length that this model text-to-unit component might ever be used with. Typically set
this to something large just in case (e.g., 512 or 1024 or 2048). int
, optional, defaults to 1024) —
The projection dimension of the text-to-unit’s duration predictor. int
, optional, defaults to 256) —
Internal dimension of the text-to-unit’s duration predictor. int
, optional, defaults to 3) —
Kernel size of the convolutional layers of the text-to-unit’s duration predictor. float
, optional, defaults to 0.5) —
The dropout probability of the text-to-unit’s duration predictor.
Hifi-Gan Vocoder specific parameters
int
, optional, defaults to 16000) —
The sampling rate at which the output audio will be generated, expressed in hertz (Hz). int
, optional, defaults to 512) —
The number of input channels into the hifi-gan upsampling network. Applies to the vocoder only. Tuple[int]
or List[int]
, optional, defaults to [5, 4, 4, 2, 2]
) —
A tuple of integers defining the stride of each 1D convolutional layer in the vocoder upsampling network.
The length of upsample_rates defines the number of convolutional layers and has to match the length of
upsample_kernel_sizes. Applies to the vocoder only. Tuple[int]
or List[int]
, optional, defaults to [11, 8, 8, 4, 4]
) —
A tuple of integers defining the kernel size of each 1D convolutional layer in the vocoder upsampling
network. The length of upsample_kernel_sizes defines the number of convolutional layers and has to match
the length of upsample_rates. Applies to the vocoder only. Tuple[int]
or List[int]
, optional, defaults to [3, 7, 11]
) —
A tuple of integers defining the kernel sizes of the vocoder 1D convolutional layers in the multi-receptive
field fusion (MRF) module. Applies to the vocoder only. Tuple[Tuple[int]]
or List[List[int]]
, optional, defaults to [[1, 3, 5], [1, 3, 5], [1, 3, 5]]
) —
A nested tuple of integers defining the dilation rates of the vocoder dilated 1D convolutional layers in
the multi-receptive field fusion (MRF) module. Applies to the vocoder only. float
, optional, defaults to 0.1) —
The angle of the negative slope used by the leaky ReLU activation in the vocoder. Applies to the vocoder
only. int
, optional, defaults to 10000) —
Vocabulary size of the SeamlessM4Tv2 vocoder. Defines the number of different unit tokens that can be
represented by the inputs_ids
passed when calling the vocoder of ~SeamlessM4Tv2Model,
~SeamlessM4Tv2ForSpeechToSpeech or ~SeamlessM4Tv2ForTextToSpeech. int
, optional, defaults to 1280) —
The projection dimension of the input ids given to the hifi-gan vocoder. Applies to the vocoder only. int
, optional, defaults to 256) —
The projection dimension of the target language given to the hifi-gan vocoder. Applies to the vocoder only. int
, optional, defaults to 256) —
The projection dimension of the speaker id given to the hifi-gan vocoder. Applies to the vocoder only. int
, optional, defaults to 36) —
Number of langs supported by the vocoder. Might be different from t2u_num_langs
. int
, optional, defaults to 200) —
Number of speakers supported by the vocoder. int
, optional, defaults to 3) —
Kernel size of the duration predictor. Applies to the vocoder only. float
, optional, defaults to 0.5) —
The dropout probability of the duration predictor. Applies to the vocoder only. int
, optional, defaults to 4) —
Offset the unit token ids by this number to account for symbol tokens. Applies to the vocoder only. This is the configuration class to store the configuration of a ~SeamlessM4Tv2Model. It is used to instantiate an SeamlessM4Tv2 model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the SeamlessM4Tv2 "" architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
>>> from transformers import SeamlessM4Tv2Model, SeamlessM4Tv2Config
>>> # Initializing a SeamlessM4Tv2 "" style configuration
>>> configuration = SeamlessM4Tv2Config()
>>> # Initializing a model from the "" style configuration
>>> model = SeamlessM4Tv2Model(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config