The SeamlessM4T model was proposed in SeamlessM4T — Massively Multilingual & Multimodal Machine Translation by the Seamless Communication team from Meta AI.
This is the version 1 release of the model. For the updated version 2 release, refer to the Seamless M4T v2 docs.
SeamlessM4T is a collection of models designed to provide high quality translation, allowing people from different linguistic communities to communicate effortlessly through speech and text.
SeamlessM4T enables multiple tasks without relying on separate models:
Speech-to-speech translation (S2ST)
Speech-to-text translation (S2TT)
Text-to-speech translation (T2ST)
Text-to-text translation (T2TT)
Automatic speech recognition (ASR)
SeamlessM4TModel can perform all the above tasks, but each task also has its own dedicated sub-model.
The abstract from the paper is the following:
What does it take to create the Babel Fish, a tool that can help individuals translate speech between any two languages? While recent breakthroughs in text-based models have pushed machine translation coverage beyond 200 languages, unified speech-to-speech translation models have yet to achieve similar strides. More specifically, conventional speech-to-speech translation systems rely on cascaded systems that perform translation progressively, putting high-performing unified systems out of reach. To address these gaps, we introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-speech translation, text-to-text translation, and automatic speech recognition for up to 100 languages. To build this, we used 1 million hours of open speech audio data to learn self-supervised speech representations with w2v-BERT 2.0. Subsequently, we created a multimodal corpus of automatically aligned speech translations. Filtered and combined with human-labeled and pseudo-labeled data, we developed the first multilingual system capable of translating from and into English for both speech and text. On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation. Compared to strong cascaded models, SeamlessM4T improves the quality of into-English translation by 1.3 BLEU points in speech-to-text and by 2.6 ASR-BLEU points in speech-to-speech. Tested for robustness, our system performs better against background noises and speaker variations in speech-to-text tasks compared to the current SOTA model. Critically, we evaluated SeamlessM4T on gender bias and added toxicity to assess translation safety. Finally, all contributions in this work are open-sourced and accessible at https://github.com/facebookresearch/seamless_communication
First, load the processor and a checkpoint of the model:
>>> from transformers import AutoProcessor, SeamlessM4TModel
>>> processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-medium")
>>> model = SeamlessM4TModel.from_pretrained("facebook/hf-seamless-m4t-medium")
You can seamlessly use this model on text or on audio to generate either translated text or translated audio.
Here is how to use the processor to process text and audio:
>>> # let's load an audio sample from an Arabic speech corpus
>>> from datasets import load_dataset
>>> dataset = load_dataset("arabic_speech_corpus", split="test", streaming=True)
>>> audio_sample = next(iter(dataset))["audio"]
>>> # now, process it
>>> audio_inputs = processor(audios=audio_sample["array"], return_tensors="pt")
>>> # now, process some English text as well
>>> text_inputs = processor(text="Hello, my dog is cute", src_lang="eng", return_tensors="pt")
SeamlessM4TModel can seamlessly generate text or speech with few or no changes. Let’s target Russian voice translation:
>>> audio_array_from_text = model.generate(**text_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
>>> audio_array_from_audio = model.generate(**audio_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
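If you want to listen to or save the generated audio, you can write the waveform to a WAV file. The snippet below is a minimal sketch that assumes scipy is installed and reads the output sampling rate from the model configuration (16 kHz by default):

>>> import scipy.io.wavfile
>>> sample_rate = model.config.sampling_rate  # 16000 Hz by default
>>> scipy.io.wavfile.write("speech_from_text.wav", rate=sample_rate, data=audio_array_from_text)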
With essentially the same code, we have translated English text and Arabic speech into Russian speech samples.
Similarly, you can generate translated text from audio files or from text with the same model. You only have to pass generate_speech=False
to SeamlessM4TModel.generate().
This time, let’s translate to French.
>>> # from audio
>>> output_tokens = model.generate(**audio_inputs, tgt_lang="fra", generate_speech=False)
>>> translated_text_from_audio = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
>>> # from text
>>> output_tokens = model.generate(**text_inputs, tgt_lang="fra", generate_speech=False)
>>> translated_text_from_text = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
SeamlessM4TModel is the top-level transformers model for generating speech and text, but you can also use dedicated models that perform the task without the extra components, thus reducing the memory footprint. For example, you can replace the audio-to-audio generation snippet with the model dedicated to the S2ST task; the rest of the code is exactly the same:
>>> from transformers import SeamlessM4TForSpeechToSpeech
>>> model = SeamlessM4TForSpeechToSpeech.from_pretrained("facebook/hf-seamless-m4t-medium")
Or you can replace the text-to-text generation snippet with the model dedicated to the T2TT task; you only have to remove generate_speech=False.
>>> from transformers import SeamlessM4TForTextToText
>>> model = SeamlessM4TForTextToText.from_pretrained("facebook/hf-seamless-m4t-medium")
Feel free to try out SeamlessM4TForSpeechToText and SeamlessM4TForTextToSpeech as well.
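For instance, here is a minimal sketch of how these two classes could be used, reusing the processor, audio_inputs and text_inputs defined above and mirroring the decoding patterns from the earlier snippets:

>>> from transformers import SeamlessM4TForSpeechToText, SeamlessM4TForTextToSpeech

>>> # speech-to-text: generate returns text token ids, no vocoder is loaded
>>> s2t_model = SeamlessM4TForSpeechToText.from_pretrained("facebook/hf-seamless-m4t-medium")
>>> output_tokens = s2t_model.generate(**audio_inputs, tgt_lang="fra")
>>> translated_text_from_audio = processor.decode(output_tokens[0].tolist(), skip_special_tokens=True)

>>> # text-to-speech: generate returns a (waveforms, waveform_lengths) tuple
>>> t2s_model = SeamlessM4TForTextToSpeech.from_pretrained("facebook/hf-seamless-m4t-medium")
>>> audio_array = t2s_model.generate(**text_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()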
You can change the speaker used for speech synthesis with the spkr_id argument. Some spkr_id values work better than others for some languages.
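For example, this sketch (reusing model and text_inputs from above) picks speaker 5; any id below config.vocoder_num_spkrs (200 for this checkpoint) is valid:

>>> audio_array = model.generate(**text_inputs, tgt_lang="rus", spkr_id=5)[0].cpu().numpy().squeeze()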
You can use different generation strategies for speech and text generation, e.g. .generate(input_ids=input_ids, text_num_beams=4, speech_do_sample=True)
which will successively perform beam-search decoding on the text model, and multinomial sampling on the speech model.
Use return_intermediate_token_ids=True with SeamlessM4TModel to return both speech and text!
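Putting these options together, here is a sketch that uses beam search for the text sub-model, sampling for the speech sub-model, and also returns the intermediate text; it assumes the output exposes waveform and sequences fields as described for SeamlessM4TGenerationOutput:

>>> outputs = model.generate(
...     **text_inputs,
...     tgt_lang="rus",
...     text_num_beams=4,       # beam search on the text sub-model
...     speech_do_sample=True,  # multinomial sampling on the speech sub-model
...     return_intermediate_token_ids=True,
... )
>>> waveform = outputs.waveform[0].cpu().numpy().squeeze()
>>> intermediate_text = processor.decode(outputs.sequences[0].tolist(), skip_special_tokens=True)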
SeamlessM4T features a versatile architecture that smoothly handles the sequential generation of text and speech. This setup comprises two sequence-to-sequence (seq2seq) models. The first model translates the input modality into translated text, while the second model generates speech tokens, known as “unit tokens,” from the translated text.
Each modality has its own dedicated encoder with a unique architecture. Additionally, for speech output, a vocoder inspired by the HiFi-GAN architecture is placed on top of the second seq2seq model.
Here’s how the generation process works:
Input text or speech is processed through its modality-specific encoder.
A decoder generates text tokens in the target language.
If speech generation is required, the second seq2seq model generates unit tokens from those text tokens.
The unit tokens are then passed through the final vocoder to produce the actual speech.
This model was contributed by ylacombe. The original code can be found here.
( config current_modality = 'text' )
Parameters
current_modality (str, optional, defaults to "text") —
Default modality. Used to initialize the model.
The original SeamlessM4T Model transformer which can be used for every task available (S2ST, S2TT, T2TT, T2ST). This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
( input_ids: Optional = None input_features: Optional = None return_intermediate_token_ids: Optional = None tgt_lang: Optional = None spkr_id: Optional = 0 generate_speech: Optional = True **kwargs ) → Union[SeamlessM4TGenerationOutput, Tuple[Tensor], ModelOutput]
Parameters
torch.LongTensor
of shape (batch_size, sequence_length)
, optional) —
Indices of input sequence tokens in the vocabulary.
Indices can be obtained using SeamlessM4TTokenizer or SeamlessM4TProcessor. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
torch.FloatTensor
of shape (batch_size, sequence_length, num_banks)
, optional) —
Input audio features. This should be returned by the SeamlessM4TFeatureExtractor class or the
SeamlessM4TProcessor class. See SeamlessM4TFeatureExtractor.call() for details. bool
, optional) —
If True
, also returns the intermediate generated text and unit tokens. Set to True
if you also want
to get translated text alongside the audio. Note that if generate_speech=True
, this parameter will be
ignored. str
, optional) —
The language to use as target language for translation. int
, optional, defaults to 0) —
The id of the speaker used for speech synthesis. Must be lower than config.vocoder_num_spkrs
. bool
, optional, defaults to True
) —
If False, will only return the text tokens and won’t generate speech.
**kwargs (optional) — Keyword arguments come in two types: without a prefix, they are entered as **kwargs for the generate method of each sub-model, except for decoder_input_ids, which is only passed through the text components; with a text_ or speech_ prefix, they are input for the generate method of the text model and speech model respectively, and have priority over the keywords without a prefix. This means you can, for example, specify a generation strategy for one generation but not for the other.
Returns
Union[SeamlessM4TGenerationOutput, Tuple[Tensor], ModelOutput]
If generate_speech and return_intermediate_token_ids, returns SeamlessM4TGenerationOutput.
If generate_speech and not return_intermediate_token_ids, returns a tuple composed of waveforms of shape (batch_size, sequence_length) and waveform_lengths, which gives the length of each sample.
If generate_speech=False, returns a ModelOutput.
Generates translated token ids and/or translated audio waveforms.
This method successively calls the .generate
function of two different sub-models. You can specify keyword
arguments at two different levels: general arguments that will be passed to both models, or prefixed arguments
that will be passed to one of them.
For example, calling .generate(input_ids=input_ids, num_beams=4, speech_do_sample=True)
will successively
perform beam-search decoding on the text model, and multinomial beam-search sampling on the speech model.
For an overview of generation strategies and code examples, check out the following guide.
( config: SeamlessM4TConfig )
Parameters
The text-to-speech SeamlessM4T Model transformer which can be used for T2ST. This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
( input_ids: Optional = None return_intermediate_token_ids: Optional = None tgt_lang: Optional = None spkr_id: Optional = 0 **kwargs ) → Union[SeamlessM4TGenerationOutput, Tuple[Tensor]]
Parameters
torch.LongTensor
of shape (batch_size, sequence_length)
) —
Indices of input sequence tokens in the vocabulary.
Indices can be obtained using SeamlessM4TTokenizer or SeamlessM4TProcessor. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
bool
, optional) —
If True
, also returns the intermediate generated text and unit tokens. Set to True
if you also want
to get translated text alongside the audio. str
, optional) —
The language to use as target language for translation. int
, optional, defaults to 0) —
The id of the speaker used for speech synthesis. Must be lower than config.vocoder_num_spkrs
. **kwargs (optional) — Keyword arguments come in two types: without a prefix, they are entered as **kwargs for the generate method of each sub-model, except for decoder_input_ids, which is only passed through the text components; with a text_ or speech_ prefix, they are input for the generate method of the text model and speech model respectively, and have priority over the keywords without a prefix. This means you can, for example, specify a generation strategy for one generation but not for the other.
Returns
Union[SeamlessM4TGenerationOutput, Tuple[Tensor]]
If return_intermediate_token_ids, returns SeamlessM4TGenerationOutput.
If not return_intermediate_token_ids, returns a tuple composed of waveforms of shape (batch_size, sequence_length) and waveform_lengths, which gives the length of each sample.
Generates translated audio waveforms.
This method successively calls the .generate
function of two different sub-models. You can specify keyword
arguments at two different levels: general arguments that will be passed to both models, or prefixed arguments
that will be passed to one of them.
For example, calling .generate(input_ids, num_beams=4, speech_do_sample=True)
will successively perform
beam-search decoding on the text model, and multinomial beam-search sampling on the speech model.
For an overview of generation strategies and code examples, check out the following guide.
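A minimal T2ST sketch, assuming the medium checkpoint and the same processing pattern as in the overview above:

>>> from transformers import AutoProcessor, SeamlessM4TForTextToSpeech

>>> processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-medium")
>>> model = SeamlessM4TForTextToSpeech.from_pretrained("facebook/hf-seamless-m4t-medium")

>>> text_inputs = processor(text="Hello, my dog is cute", src_lang="eng", return_tensors="pt")
>>> waveform = model.generate(**text_inputs, tgt_lang="fra")[0].cpu().numpy().squeeze()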
( config )
Parameters
The speech-to-speech SeamlessM4T Model transformer which can be used for S2ST. This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
( input_features: Optional = None return_intermediate_token_ids: Optional = None tgt_lang: Optional = None spkr_id: Optional = 0 **kwargs ) → Union[SeamlessM4TGenerationOutput, Tuple[Tensor]]
Parameters
torch.FloatTensor
of shape (batch_size, sequence_length, num_banks)
) —
Input audio features. This should be returned by the SeamlessM4TFeatureExtractor class or the
SeamlessM4TProcessor class. See SeamlessM4TFeatureExtractor.call() for details. bool
, optional) —
If True
, also returns the intermediate generated text and unit tokens. Set to True
if you also want
to get translated text alongside the audio. str
, optional) —
The language to use as target language for translation. int
, optional, defaults to 0) —
The id of the speaker used for speech synthesis. Must be lower than config.vocoder_num_spkrs
. **kwargs (optional) — Keyword arguments come in two types: without a prefix, they are entered as **kwargs for the generate method of each sub-model, except for decoder_input_ids, which is only passed through the text components; with a text_ or speech_ prefix, they are input for the generate method of the text model and speech model respectively, and have priority over the keywords without a prefix. This means you can, for example, specify a generation strategy for one generation but not for the other.
Returns
Union[SeamlessM4TGenerationOutput, Tuple[Tensor]]
If return_intermediate_token_ids, returns SeamlessM4TGenerationOutput.
If not return_intermediate_token_ids, returns a tuple composed of waveforms of shape (batch_size, sequence_length) and waveform_lengths, which gives the length of each sample.
Generates translated audio waveforms.
This method successively calls the .generate
function of two different sub-models. You can specify keyword
arguments at two different levels: general arguments that will be passed to both models, or prefixed arguments
that will be passed to one of them.
For example, calling .generate(input_features, num_beams=4, speech_do_sample=True)
will successively perform
beam-search decoding on the text model, and multinomial beam-search sampling on the speech model.
For an overview of generation strategies and code examples, check out the following guide.
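A minimal S2ST sketch; audio_sample is assumed to be a 16 kHz mono recording, for example the Arabic speech sample loaded in the overview above:

>>> from transformers import AutoProcessor, SeamlessM4TForSpeechToSpeech

>>> processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-medium")
>>> model = SeamlessM4TForSpeechToSpeech.from_pretrained("facebook/hf-seamless-m4t-medium")

>>> audio_inputs = processor(audios=audio_sample["array"], return_tensors="pt")
>>> waveform = model.generate(**audio_inputs, tgt_lang="fra")[0].cpu().numpy().squeeze()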
( config: SeamlessM4TConfig )
Parameters
The text-to-text SeamlessM4T Model transformer which can be used for T2TT. This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
( input_ids: LongTensor = None attention_mask: Optional = None decoder_input_ids: Optional = None decoder_attention_mask: Optional = None encoder_outputs: Optional = None past_key_values: Optional = None inputs_embeds: Optional = None decoder_inputs_embeds: Optional = None labels: Optional = None use_cache: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None **kwargs )
Parameters
torch.LongTensor
of shape (batch_size, sequence_length)
, optional) —
Indices of input sequence tokens in the vocabulary.
Indices can be obtained using SeamlessM4TTokenizer or SeamlessM4TProcessor. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
torch.FloatTensor
of shape (batch_size, sequence_length)
, optional) —
Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]
:
torch.LongTensor
of shape (batch_size, target_sequence_length)
, optional) —
Indices of decoder input sequence tokens in the vocabulary.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
Bart uses the eos_token_id
as the starting token for decoder_input_ids
generation. If past_key_values
is used, optionally only the last decoder_input_ids
have to be input (see past_key_values
).
For translation and summarization training, decoder_input_ids
should be provided. If no
decoder_input_ids
is provided, the model will create this tensor by shifting the input_ids
to the right
for denoising pre-training following the paper.
torch.LongTensor
of shape (batch_size, target_sequence_length)
, optional) —
Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids
. Causal mask will also
be used by default.
If you want to change padding behavior, you should read modeling_bart._prepare_decoder_attention_mask
and modify to your needs. See diagram 1 in the paper for more
information on the default strategy.
tuple(tuple(torch.FloatTensor)
, optional) —
Tuple consists of (last_hidden_state
, optional: hidden_states
, optional: attentions
)
last_hidden_state
of shape (batch_size, sequence_length, hidden_size)
, optional) is a sequence of
hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder. tuple(tuple(torch.FloatTensor))
, optional, returned when use_cache=True
is passed or when config.use_cache=True
) —
Tuple of tuple(torch.FloatTensor)
of length config.n_layers
, with each tuple having 2 tensors of shape
(batch_size, num_heads, sequence_length, embed_size_per_head)
) and 2 additional tensors of shape
(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)
.
Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
blocks) that can be used (see past_key_values
input) to speed up sequential decoding.
If past_key_values
are used, the user can optionally input only the last decoder_input_ids
(those that
don’t have their past key value states given to this model) of shape (batch_size, 1)
instead of all
decoder_input_ids
of shape (batch_size, sequence_length)
.
torch.FloatTensor
of shape(batch_size, sequence_length, hidden_size)
, optional) —
Optionally, instead of passing input_ids
you can choose to directly pass an embedded representation. This
is useful if you want more control over how to convert input_ids
indices into associated vectors than the
model’s internal embedding lookup matrix. torch.FloatTensor
of shape (batch_size, target_sequence_length, hidden_size)
, optional) —
Optionally, instead of passing decoder_input_ids
you can choose to directly pass an embedded
representation. If past_key_values
is used, optionally only the last decoder_inputs_embeds
have to be
input (see past_key_values
). This is useful if you want more control over how to convert
decoder_input_ids
indices into associated vectors than the model’s internal embedding lookup matrix.
If decoder_input_ids
and decoder_inputs_embeds
are both unset, decoder_inputs_embeds
takes the value
of inputs_embeds
.
torch.LongTensor
of shape (batch_size, sequence_length)
, optional) —
Labels for computing the masked language modeling loss. Indices should be in [-100, 0, ..., config.vocab_size]
(see input_ids
docstring) Tokens with indices set to -100
are ignored (masked), the
loss is only computed for the tokens with labels in [0, ..., config.vocab_size]
bool
, optional) —
If set to True
, past_key_values
key value states are returned and can be used to speed up decoding (see
past_key_values
). bool
, optional) —
Whether or not to return the attentions tensors of all attention layers. See attentions
under returned
tensors for more detail. bool
, optional) —
Whether or not to return the hidden states of all layers. See hidden_states
under returned tensors for
more detail. bool
, optional) —
Whether or not to return a ModelOutput instead of a plain tuple. The SeamlessM4TForTextToText forward method, overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
( input_ids = None tgt_lang = None generation_config = None logits_processor = None stopping_criteria = None prefix_allowed_tokens_fn = None synced_gpus = False **kwargs ) → ModelOutput or torch.LongTensor
Parameters
torch.Tensor
of varying shape depending on the modality, optional) —
Indices of input sequence tokens in the vocabulary.
Indices can be obtained using SeamlessM4TTokenizer or SeamlessM4TProcessor. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
str
, optional) —
The language to use as target language for translation. ~generation.GenerationConfig
, optional) —
The generation configuration to be used as base parametrization for the generation call. **kwargs
passed to generate matching the attributes of generation_config
will override them. If
generation_config
is not provided, the default will be used, which has the following loading
priority: 1) from the generation_config.json
model file, if it exists; 2) from the model
configuration. Please note that unspecified parameters will inherit GenerationConfig’s
default values, whose documentation should be checked to parameterize generation. LogitsProcessorList
, optional) —
Custom logits processors that complement the default logits processors built from arguments and
generation config. If a logit processor is passed that is already created with the arguments or a
generation config an error is thrown. This feature is intended for advanced users. StoppingCriteriaList
, optional) —
Custom stopping criteria that complement the default stopping criteria built from arguments and a
generation config. If a stopping criteria is passed that is already created with the arguments or a
generation config an error is thrown. This feature is intended for advanced users. Callable[[int, torch.Tensor], List[int]]
, optional) —
If provided, this function constrains the beam search to allowed tokens only at each step. If not
provided no constraint is applied. This function takes 2 arguments: the batch ID batch_id
and
input_ids
. It has to return a list with the allowed tokens for the next generation step conditioned
on the batch ID batch_id
and the previously generated tokens inputs_ids
. This argument is useful
for constrained generation conditioned on the prefix, as described in Autoregressive Entity
Retrieval. bool
, optional, defaults to False
) —
Whether to continue running the while loop until max_length (needed for ZeRO stage 3) Dict[str, Any]
, optional) —
Ad hoc parametrization of generate_config
and/or additional model-specific kwargs that will be
forwarded to the forward
function of the model. Returns
ModelOutput or torch.LongTensor
A ModelOutput (if return_dict_in_generate=True
or when config.return_dict_in_generate=True
) or a torch.LongTensor
. The possible
ModelOutput types are:
Generates sequences of token ids.
Most generation-controlling parameters are set in generation_config
which, if not passed, will be set to the
model’s default generation configuration. You can override any generation_config
by passing the corresponding
parameters to generate(), e.g. .generate(inputs, num_beams=4, do_sample=True)
.
For an overview of generation strategies and code examples, check out the following guide.
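A minimal T2TT sketch with beam search, assuming the medium checkpoint:

>>> from transformers import AutoProcessor, SeamlessM4TForTextToText

>>> processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-medium")
>>> model = SeamlessM4TForTextToText.from_pretrained("facebook/hf-seamless-m4t-medium")

>>> text_inputs = processor(text="Hello, my dog is cute", src_lang="eng", return_tensors="pt")
>>> output_tokens = model.generate(**text_inputs, tgt_lang="fra", num_beams=4)
>>> print(processor.decode(output_tokens[0].tolist(), skip_special_tokens=True))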
( config: SeamlessM4TConfig )
Parameters
The speech-to-text SeamlessM4T Model transformer which can be used for S2TT. This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
( input_features: LongTensor = None attention_mask: Optional = None decoder_input_ids: Optional = None decoder_attention_mask: Optional = None encoder_outputs: Optional = None past_key_values: Optional = None inputs_embeds: Optional = None decoder_inputs_embeds: Optional = None labels: Optional = None use_cache: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None **kwargs )
Parameters
torch.FloatTensor
of shape (batch_size, sequence_length, num_banks)
) —
Input audio features. This should be returned by the SeamlessM4TFeatureExtractor class or the
SeamlessM4TProcessor class. See SeamlessM4TFeatureExtractor.call() for details. torch.FloatTensor
of shape (batch_size, sequence_length)
, optional) —
Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]
:
torch.LongTensor
of shape (batch_size, target_sequence_length)
, optional) —
Indices of decoder input sequence tokens in the vocabulary.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
Bart uses the eos_token_id
as the starting token for decoder_input_ids
generation. If past_key_values
is used, optionally only the last decoder_input_ids
have to be input (see past_key_values
).
For translation and summarization training, decoder_input_ids
should be provided. If no
decoder_input_ids
is provided, the model will create this tensor by shifting the input_ids
to the right
for denoising pre-training following the paper.
torch.LongTensor
of shape (batch_size, target_sequence_length)
, optional) —
Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids
. Causal mask will also
be used by default.
If you want to change padding behavior, you should read modeling_bart._prepare_decoder_attention_mask
and modify to your needs. See diagram 1 in the paper for more
information on the default strategy.
tuple(tuple(torch.FloatTensor)
, optional) —
Tuple consists of (last_hidden_state
, optional: hidden_states
, optional: attentions
)
last_hidden_state
of shape (batch_size, sequence_length, hidden_size)
, optional) is a sequence of
hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder. tuple(tuple(torch.FloatTensor))
, optional, returned when use_cache=True
is passed or when config.use_cache=True
) —
Tuple of tuple(torch.FloatTensor)
of length config.n_layers
, with each tuple having 2 tensors of shape
(batch_size, num_heads, sequence_length, embed_size_per_head)
) and 2 additional tensors of shape
(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)
.
Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
blocks) that can be used (see past_key_values
input) to speed up sequential decoding.
If past_key_values
are used, the user can optionally input only the last decoder_input_ids
(those that
don’t have their past key value states given to this model) of shape (batch_size, 1)
instead of all
decoder_input_ids
of shape (batch_size, sequence_length)
.
torch.FloatTensor
of shape(batch_size, sequence_length, hidden_size)
, optional) —
Optionally, instead of passing input_ids
you can choose to directly pass an embedded representation. This
is useful if you want more control over how to convert input_ids
indices into associated vectors than the
model’s internal embedding lookup matrix. torch.FloatTensor
of shape (batch_size, target_sequence_length, hidden_size)
, optional) —
Optionally, instead of passing decoder_input_ids
you can choose to directly pass an embedded
representation. If past_key_values
is used, optionally only the last decoder_inputs_embeds
have to be
input (see past_key_values
). This is useful if you want more control over how to convert
decoder_input_ids
indices into associated vectors than the model’s internal embedding lookup matrix.
If decoder_input_ids
and decoder_inputs_embeds
are both unset, decoder_inputs_embeds
takes the value
of inputs_embeds
.
torch.LongTensor
of shape (batch_size, sequence_length)
, optional) —
Labels for computing the masked language modeling loss. Indices should be in [-100, 0, ..., config.vocab_size]
(see input_ids
docstring) Tokens with indices set to -100
are ignored (masked), the
loss is only computed for the tokens with labels in [0, ..., config.vocab_size]
bool
, optional) —
If set to True
, past_key_values
key value states are returned and can be used to speed up decoding (see
past_key_values
). bool
, optional) —
Whether or not to return the attentions tensors of all attention layers. See attentions
under returned
tensors for more detail. bool
, optional) —
Whether or not to return the hidden states of all layers. See hidden_states
under returned tensors for
more detail. bool
, optional) —
Whether or not to return a ModelOutput instead of a plain tuple. The SeamlessM4TForSpeechToText forward method, overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
( input_features = None tgt_lang = None generation_config = None logits_processor = None stopping_criteria = None prefix_allowed_tokens_fn = None synced_gpus = False **kwargs ) → ModelOutput or torch.LongTensor
Parameters
torch.FloatTensor
of shape (batch_size, sequence_length, num_banks)
) —
Input audio features. This should be returned by the SeamlessM4TFeatureExtractor class or the
SeamlessM4TProcessor class. See SeamlessM4TFeatureExtractor.call() for details. str
, optional) —
The language to use as target language for translation. ~generation.GenerationConfig
, optional) —
The generation configuration to be used as base parametrization for the generation call. **kwargs
passed to generate matching the attributes of generation_config
will override them. If
generation_config
is not provided, the default will be used, which has the following loading
priority: 1) from the generation_config.json
model file, if it exists; 2) from the model
configuration. Please note that unspecified parameters will inherit GenerationConfig’s
default values, whose documentation should be checked to parameterize generation. LogitsProcessorList
, optional) —
Custom logits processors that complement the default logits processors built from arguments and
generation config. If a logit processor is passed that is already created with the arguments or a
generation config an error is thrown. This feature is intended for advanced users. StoppingCriteriaList
, optional) —
Custom stopping criteria that complement the default stopping criteria built from arguments and a
generation config. If a stopping criteria is passed that is already created with the arguments or a
generation config an error is thrown. This feature is intended for advanced users. Callable[[int, torch.Tensor], List[int]]
, optional) —
If provided, this function constrains the beam search to allowed tokens only at each step. If not
provided no constraint is applied. This function takes 2 arguments: the batch ID batch_id
and
input_ids
. It has to return a list with the allowed tokens for the next generation step conditioned
on the batch ID batch_id
and the previously generated tokens inputs_ids
. This argument is useful
for constrained generation conditioned on the prefix, as described in Autoregressive Entity
Retrieval. bool
, optional, defaults to False
) —
Whether to continue running the while loop until max_length (needed for ZeRO stage 3) Dict[str, Any]
, optional) —
Ad hoc parametrization of generate_config
and/or additional model-specific kwargs that will be
forwarded to the forward
function of the model. Returns
ModelOutput or torch.LongTensor
A ModelOutput (if return_dict_in_generate=True
or when config.return_dict_in_generate=True
) or a torch.LongTensor
. The possible
ModelOutput types are:
Generates sequences of token ids.
Most generation-controlling parameters are set in generation_config
which, if not passed, will be set to the
model’s default generation configuration. You can override any generation_config
by passing the corresponding
parameters to generate(), e.g. .generate(inputs, num_beams=4, do_sample=True)
.
For an overview of generation strategies and code examples, check out the following guide.
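A minimal S2TT sketch; as above, audio_sample is assumed to be a 16 kHz mono recording and generate is assumed to return a plain tensor of token ids:

>>> from transformers import AutoProcessor, SeamlessM4TForSpeechToText

>>> processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-medium")
>>> model = SeamlessM4TForSpeechToText.from_pretrained("facebook/hf-seamless-m4t-medium")

>>> audio_inputs = processor(audios=audio_sample["array"], return_tensors="pt")
>>> output_tokens = model.generate(**audio_inputs, tgt_lang="fra")
>>> print(processor.decode(output_tokens[0].tolist(), skip_special_tokens=True))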
( vocab_size = 256102 t2u_vocab_size = 10082 hidden_size = 1024 initializer_range = 0.02 layer_norm_eps = 1e-05 use_cache = True max_position_embeddings = 1024 is_encoder_decoder = True encoder_layerdrop = 0.05 decoder_layerdrop = 0.05 activation_function = 'relu' dropout = 0.1 attention_dropout = 0.1 activation_dropout = 0.0 scale_embedding = True encoder_layers = 24 encoder_ffn_dim = 8192 encoder_attention_heads = 16 decoder_layers = 24 decoder_ffn_dim = 8192 decoder_attention_heads = 16 decoder_start_token_id = 3 max_new_tokens = 256 pad_token_id = 0 bos_token_id = 2 eos_token_id = 3 speech_encoder_layers = 24 speech_encoder_attention_heads = 16 speech_encoder_intermediate_size = 4096 speech_encoder_hidden_act = 'swish' speech_encoder_dropout = 0.0 add_adapter = True speech_encoder_layerdrop = 0.1 feature_projection_input_dim = 160 num_conv_pos_embeddings = 128 num_conv_pos_embedding_groups = 16 adaptor_kernel_size = 8 adaptor_stride = 8 adaptor_dropout = 0.1 num_adapter_layers = 1 position_embeddings_type = 'relative' rotary_embedding_base = 10000 max_source_positions = 4096 conv_depthwise_kernel_size = 31 t2u_bos_token_id = 0 t2u_pad_token_id = 1 t2u_eos_token_id = 2 t2u_decoder_start_token_id = 2 t2u_max_new_tokens = 1024 t2u_encoder_layers = 6 t2u_encoder_ffn_dim = 8192 t2u_encoder_attention_heads = 16 t2u_decoder_layers = 6 t2u_decoder_ffn_dim = 8192 t2u_decoder_attention_heads = 16 t2u_max_position_embeddings = 2048 sampling_rate = 16000 upsample_initial_channel = 512 upsample_rates = [5, 4, 4, 2, 2] upsample_kernel_sizes = [11, 8, 8, 4, 4] resblock_kernel_sizes = [3, 7, 11] resblock_dilation_sizes = [[1, 3, 5], [1, 3, 5], [1, 3, 5]] leaky_relu_slope = 0.1 unit_hifi_gan_vocab_size = 10000 unit_embed_dim = 1280 lang_embed_dim = 256 spkr_embed_dim = 256 vocoder_num_langs = 36 vocoder_num_spkrs = 200 variance_predictor_kernel_size = 3 var_pred_dropout = 0.5 vocoder_offset = 4 **kwargs )
Parameters
int
, optional, defaults to 256102) —
Vocabulary size of the SeamlessM4T model. Defines the number of different tokens that can be represented by
the inputs_ids
passed when calling ~SeamlessM4TModel, ~SeamlessM4TForTextToSpeech or
~SeamlessM4TForTextToText. int
, optional, defaults to 10082) —
Unit vocabulary size of the SeamlessM4T model. Defines the number of different unit tokens that can be
represented by the inputs_ids
passed when calling the Text-To-Units sub-model of ~SeamlessM4TModel,
~SeamlessM4TForSpeechToSpeech or ~SeamlessM4TForTextToSpeech. Parameters shared across sub-models
int
, optional, defaults to 1024) —
Dimensionality of the “intermediate” layers in the architecture. float
, optional, defaults to 0.02) —
The standard deviation of the truncated_normal_initializer for initializing all weight matrices. float
, optional, defaults to 1e-05) —
The epsilon used by the layer normalization layers. bool
, optional, defaults to True
) —
Whether or not the model should return the last key/values attentions (not used by all models). int
, optional, defaults to 1024) —
The maximum sequence length that this model's text encoder and decoder might ever be used with. Typically set
this to something large just in case (e.g., 512 or 1024 or 2048). bool
, optional, defaults to True
) —
Whether the model is used as an encoder/decoder or not. float
, optional, defaults to 0.05) —
The LayerDrop probability for the encoders. See the LayerDrop paper (https://arxiv.org/abs/1909.11556)
for more details. float
, optional, defaults to 0.05) —
The LayerDrop probability for the decoders. See the LayerDrop paper (https://arxiv.org/abs/1909.11556)
for more details. str
or function
, optional, defaults to "relu"
) —
The non-linear activation function (function or string) in the decoder and feed-forward layers. If string,
"gelu"
, "relu"
, "selu"
, "swish"
and "gelu_new"
are supported. float
, optional, defaults to 0.1) —
The dropout probability for all fully connected layers in the embeddings, encoder, decoder, and pooler. float
, optional, defaults to 0.1) —
The dropout probability for all attention layers. float
, optional, defaults to 0.0) —
The dropout probability for all activation layers in the model. bool
, optional, defaults to True
) —
Scale embeddings by dividing by sqrt(d_model). Text encoder and text decoder specific parameters
int
, optional, defaults to 24) —
Number of hidden layers in the Transformer text encoder. int
, optional, defaults to 8192) —
Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer text encoder. int
, optional, defaults to 16) —
Number of attention heads for each attention layer in the Transformer text encoder. int
, optional, defaults to 24) —
Number of hidden layers in the Transformer text decoder. int
, optional, defaults to 8192) —
Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer text decoder. int
, optional, defaults to 16) —
Number of attention heads for each attention layer in the Transformer text decoder. int
, optional, defaults to 3) —
If an encoder-decoder model starts decoding with a different token than bos, the id of that token. Only
applied in the text decoder. int
, optional, defaults to 256) —
The maximum numbers of text tokens to generate, ignoring the number of tokens in the prompt. int
, optional, defaults to 0) —
The id of the padding text token. Only applied to the text-decoder model. int
, optional, defaults to 2) —
The id of the beginning-of-stream text token. Only applied to the text-decoder model. int
, optional, defaults to 3) —
The id of the end-of-stream text token. Only applied to the text-decoder model. Speech encoder specific parameters
int
, optional, defaults to 24) —
Number of hidden layers in the Transformer speech encoder. int
, optional, defaults to 16) —
Number of attention heads for each attention layer in the Transformer speech encoder. int
, optional, defaults to 4096) —
Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer speech encoder. str
or function
, optional, defaults to "swish"
) —
The non-linear activation function (function or string) in the speech encoder. If string, "gelu"
,
"relu"
, "selu"
, "swish"
and "gelu_new"
are supported. float
, optional, defaults to 0.0) —
The dropout probability for all layers in the speech encoder. bool
, optional, defaults to True
) —
Add an adapter layer on top of the speech encoder. float
, optional, defaults to 0.1) —
The LayerDrop probability for the speech encoder. See the LayerDrop paper
(https://arxiv.org/abs/1909.11556) for more details. int
, optional, defaults to 160) —
Input dimension of the input feature projection of the speech encoder, i.e. the dimension after processing
input audios with SeamlessM4TFeatureExtractor. int
, optional, defaults to 128) —
Number of convolutional positional embeddings. Defines the kernel size of 1D convolutional positional
embeddings layer of the speech encoder. int
, optional, defaults to 16) —
Number of groups of 1D convolutional positional embeddings layer of the speech encoder. int
, optional, defaults to 8) —
Kernel size of the convolutional layers in the adapter network. Only relevant if add_adapter is True
. int
, optional, defaults to 8) —
Stride of the convolutional layers in the adapter network. Only relevant if add_adapter is True
. float
, optional, defaults to 0.1) —
The dropout probability for all layers in the speech adapter. int
, optional, defaults to 1) —
Number of convolutional layers that should be used in the adapter network. Only relevant if add_adapter is True
. str
, optional, defaults to "relative"
) —
Can be set to relative
or rotary
for relative or rotary position embeddings respectively. If left
None
no relative position embedding is applied. Only applied to the speech encoder. int
, optional, defaults to 10000) —
If "rotary"
position embeddings are used, defines the size of the embedding base. Only applied to the
speech encoder. int
, optional, defaults to 4096) —
if "relative"
position embeddings are used, defines the maximum source input positions. Only applied to
the speech encoder. int
, optional, defaults to 31) —
Kernel size of convolutional depthwise 1D layer in Conformer blocks. Only applied to the speech encoder. Text-To-Unit (t2u) model specific parameters
int
, optional, defaults to 0) —
The id of the beginning-of-stream unit token. Only applied to the text-to-unit seq2seq model. int
, optional, defaults to 1) —
The id of the padding unit token. Only applied to the text-to-unit seq2seq model. int
, optional, defaults to 2) —
The id of the end-of-stream unit token. Only applied to the text-to-unit seq2seq model. int
, optional, defaults to 2) —
If an encoder-decoder model starts decoding with a different token than bos, the id of that token. Only
applied to the text-to-unit seq2seq model. int
, optional, defaults to 1024) —
The maximum numbers of unit tokens to generate, ignoring the number of tokens in the prompt. Only applied
to the text-to-unit seq2seq model. int
, optional, defaults to 6) —
Number of hidden layers in the Transformer text-to-unit encoder. int
, optional, defaults to 8192) —
Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer text-to-unit encoder. int
, optional, defaults to 16) —
Number of attention heads for each attention layer in the Transformer text-to-unit encoder. int
, optional, defaults to 6) —
Number of hidden layers in the Transformer text-to-unit decoder. int
, optional, defaults to 8192) —
Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer text-to-unit decoder. int
, optional, defaults to 16) —
Number of attention heads for each attention layer in the Transformer text-to-unit decoder. int
, optional, defaults to 2048) —
The maximum sequence length that this model's text-to-unit component might ever be used with. Typically set
this to something large just in case (e.g., 512 or 1024 or 2048).
Hifi-Gan Vocoder specific parameters
int
, optional, defaults to 16000) —
The sampling rate at which the output audio will be generated, expressed in hertz (Hz). int
, optional, defaults to 512) —
The number of input channels into the hifi-gan upsampling network. Applies to the vocoder only. Tuple[int]
or List[int]
, optional, defaults to [5, 4, 4, 2, 2]
) —
A tuple of integers defining the stride of each 1D convolutional layer in the vocoder upsampling network.
The length of upsample_rates defines the number of convolutional layers and has to match the length of
upsample_kernel_sizes. Applies to the vocoder only. Tuple[int]
or List[int]
, optional, defaults to [11, 8, 8, 4, 4]
) —
A tuple of integers defining the kernel size of each 1D convolutional layer in the vocoder upsampling
network. The length of upsample_kernel_sizes defines the number of convolutional layers and has to match
the length of upsample_rates. Applies to the vocoder only. Tuple[int]
or List[int]
, optional, defaults to [3, 7, 11]
) —
A tuple of integers defining the kernel sizes of the vocoder 1D convolutional layers in the multi-receptive
field fusion (MRF) module. Applies to the vocoder only. Tuple[Tuple[int]]
or List[List[int]]
, optional, defaults to [[1, 3, 5], [1, 3, 5], [1, 3, 5]]
) —
A nested tuple of integers defining the dilation rates of the vocoder dilated 1D convolutional layers in
the multi-receptive field fusion (MRF) module. Applies to the vocoder only. float
, optional, defaults to 0.1) —
The angle of the negative slope used by the leaky ReLU activation in the vocoder. Applies to the vocoder
only. int
, optional, defaults to 10000) —
Vocabulary size of the SeamlessM4T vocoder. Defines the number of different unit tokens that can be
represented by the inputs_ids
passed when calling the vocoder of ~SeamlessM4TModel,
~SeamlessM4TForSpeechToSpeech or ~SeamlessM4TForTextToSpeech. int
, optional, defaults to 1280) —
The projection dimension of the input ids given to the hifi-gan vocoder. Applies to the vocoder only. int
, optional, defaults to 256) —
The projection dimension of the target language given to the hifi-gan vocoder. Applies to the vocoder only. int
, optional, defaults to 256) —
The projection dimension of the speaker id given to the hifi-gan vocoder. Applies to the vocoder only. int
, optional, defaults to 36) —
Number of langs supported by the vocoder. Might be different from t2u_num_langs
. int
, optional, defaults to 200) —
Number of speakers supported by the vocoder. int
, optional, defaults to 3) —
Kernel size of the duration predictor. Applies to the vocoder only. float
, optional, defaults to 0.5) —
The dropout probability of the duration predictor. Applies to the vocoder only. int
, optional, defaults to 4) —
Offset the unit token ids by this number to account for symbol tokens. Applies to the vocoder only. This is the configuration class to store the configuration of a ~SeamlessM4TModel. It is used to instantiate an SeamlessM4T model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the SeamlessM4T “facebook/hf-seamless-m4t-medium” architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
>>> from transformers import SeamlessM4TModel, SeamlessM4TConfig
>>> # Initializing a SeamlessM4T "facebook/hf-seamless-m4t-medium" style configuration
>>> configuration = SeamlessM4TConfig()
>>> # Initializing a model from the "facebook/hf-seamless-m4t-medium" style configuration
>>> model = SeamlessM4TModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
( vocab_file bos_token = '<s>' eos_token = '</s>' sep_token = '</s>' cls_token = '<s>' unk_token = '<unk>' pad_token = '<pad>' tokenizer_file = None src_lang = 'eng' tgt_lang = 'fra' sp_model_kwargs: Optional = None additional_special_tokens = None **kwargs )
Parameters
str
) —
Path to the vocabulary file. str
, optional, defaults to "<s>"
) —
The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token.
When building a sequence using special tokens, this is not the token that is used for the beginning of
sequence. The token used is the cls_token
.
str
, optional, defaults to "</s>"
) —
The end of sequence token.
When building a sequence using special tokens, this is not the token that is used for the end of sequence.
The token used is the sep_token
.
str
, optional, defaults to "</s>"
) —
The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
sequence classification or for a text and a question for question answering. It is also used as the last
token of a sequence built with special tokens. str
, optional, defaults to "<s>"
) —
The classifier token which is used when doing sequence classification (classification of the whole sequence
instead of per-token classification). It is the first token of the sequence when built with special tokens. str
, optional, defaults to "<unk>"
) —
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
token instead. str
, optional, defaults to "<pad>"
) —
The token used for padding, for example when batching sequences of different lengths. str
, optional) —
The path to a tokenizer file to use instead of the vocab file. str
, optional, defaults to "eng"
) —
The language to use as source language for translation. str
, optional, defaults to "fra"
) —
The language to use as target language for translation. Dict[str, Any]
, optional) —
Additional keyword arguments to pass to the model initialization. str
or tokenizers.AddedToken
, optional) —
A tuple or a list of additional special tokens. Can be used to specify the list of languages that will be
supported by the tokenizer. Construct a SeamlessM4T tokenizer.
Adapted from RobertaTokenizer and XLNetTokenizer. Based on SentencePiece.
The tokenization method is <language code> <tokens> <eos>
for source language documents, and <eos> <language code> <tokens> <eos>
for target language documents.
Examples:
>>> from transformers import SeamlessM4TTokenizer
>>> tokenizer = SeamlessM4TTokenizer.from_pretrained(
... "facebook/hf-seamless-m4t-medium", src_lang="eng", tgt_lang="fra"
... )
>>> example_english_phrase = " UN Chief Says There Is No Military Solution in Syria"
>>> expected_translation_french = "Le chef de l'ONU affirme qu'il n'y a pas de solution militaire en Syrie."
>>> inputs = tokenizer(example_english_phrase, text_target=expected_translation_french, return_tensors="pt")
( text: Union = None text_pair: Union = None text_target: Union = None text_pair_target: Union = None padding: Union = True pad_to_multiple_of: Optional = 2 src_lang: Optional = None tgt_lang: Optional = None **kwargs )
Parameters
str
, List[str]
, List[List[str]]
, optional) —
The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
(pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
is_split_into_words=True
(to lift the ambiguity with a batch of sequences). str
, List[str]
, List[List[str]]
, optional) —
The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
(pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
is_split_into_words=True
(to lift the ambiguity with a batch of sequences). str
, List[str]
, List[List[str]]
, optional) —
The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a
list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized),
you must set is_split_into_words=True
(to lift the ambiguity with a batch of sequences). str
, List[str]
, List[List[str]]
, optional) —
The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a
list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized),
you must set is_split_into_words=True
(to lift the ambiguity with a batch of sequences). bool
, str
or PaddingStrategy, optional, defaults to True
) —
Select a strategy to pad the returned sequences (according to the model’s padding side and padding
index) among:
True
or 'longest'
: Pad to the longest sequence in the batch (or no padding if only a single
sequence is provided).'max_length'
: Pad to a maximum length specified with the argument max_length
or to the maximum
acceptable input length for the model if that argument is not provided.False
or 'do_not_pad'
(default): No padding (i.e., can output a batch with sequences of different
lengths).int
, optional) —
If set will pad the sequence to a multiple of the provided value.
This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
>= 7.5
(Volta).
str
, optional) —
A string representing the source language. If not specified, the last src_lang
specified (either
during initialization or when calling this tokenizer) will be used. str
, optional) —
A string representing the target language. If not specified, the last tgt_lang
specified (either
during initialization or when calling this tokenizer) will be used. ( token_ids_0: List token_ids_1: Optional = None ) → List[int]
Parameters
List[int]
) —
List of IDs to which the special tokens will be added. List[int]
, optional) —
Optional second list of IDs for sequence pairs. Returns
List[int]
List of input IDs with the appropriate special tokens.
Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
adding special tokens. An NLLB sequence has the following format, where X
represents the sequence:
input_ids
(for encoder) X [eos, src_lang_code]
decoder_input_ids
: (for decoder) X [eos, tgt_lang_code]
BOS is never used. Pairs of sequences are not the expected use case, but they will be handled without a separator.
( token_ids_0: List token_ids_1: Optional = None already_has_special_tokens: bool = False ) → List[int]
Parameters
List[int]
) —
List of IDs. List[int]
, optional) —
Optional second list of IDs for sequence pairs. bool
, optional, defaults to False
) —
Whether or not the token list is already formatted with special tokens for the model. Returns
List[int]
A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
special tokens using the tokenizer prepare_for_model
method.
( token_ids_0: List token_ids_1: Optional = None ) → List[int]
Create a mask from the two sequences passed to be used in a sequence-pair classification task. NLLB does not make use of token type ids; therefore, a list of zeros is returned.
( vocab_file = None tokenizer_file = None bos_token = '<s>' eos_token = '</s>' sep_token = '</s>' cls_token = '<s>' unk_token = '<unk>' pad_token = '<pad>' src_lang = 'eng' tgt_lang = 'fra' additional_special_tokens = None **kwargs )
Parameters
str
, optional) —
Path to the vocabulary file. str
, optional) —
The path to a tokenizer file to use instead of the vocab file. str
, optional, defaults to "<s>"
) —
The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token.
When building a sequence using special tokens, this is not the token that is used for the beginning of
sequence. The token used is the cls_token
.
str
, optional, defaults to "</s>"
) —
The end of sequence token.
When building a sequence using special tokens, this is not the token that is used for the end of sequence.
The token used is the sep_token
.
str
, optional, defaults to "</s>"
) —
The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
sequence classification or for a text and a question for question answering. It is also used as the last
token of a sequence built with special tokens. str
, optional, defaults to "<s>"
) —
The classifier token which is used when doing sequence classification (classification of the whole sequence
instead of per-token classification). It is the first token of the sequence when built with special tokens. str
, optional, defaults to "<unk>"
) —
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
token instead. str
, optional, defaults to "<pad>"
) —
The token used for padding, for example when batching sequences of different lengths. str
, optional, defaults to "eng"
) —
The language to use as source language for translation. str
, optional, defaults to "fra"
) —
The language to use as target language for translation. str
or tokenizers.AddedToken
, optional) —
A tuple or a list of additional special tokens. Construct a “fast” SeamlessM4T tokenizer (backed by HuggingFace’s tokenizers library). Based on BPE.
This tokenizer inherits from PreTrainedTokenizerFast which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
The tokenization method is <language code> <tokens> <eos>
for source language documents, and <eos> <language code> <tokens> <eos>
for target language documents.
Examples:
>>> from transformers import SeamlessM4TTokenizerFast
>>> tokenizer = SeamlessM4TTokenizerFast.from_pretrained(
... "facebook/hf-seamless-m4t-medium", src_lang="eng", tgt_lang="fra"
... )
>>> example_english_phrase = " UN Chief Says There Is No Military Solution in Syria"
>>> expected_translation_french = "Le chef de l'ONU affirme qu'il n'y a pas de solution militaire en Syrie."
>>> inputs = tokenizer(example_english_phrase, text_target=expected_translation_french, return_tensors="pt")
( text: Union = None text_pair: Union = None text_target: Union = None text_pair_target: Union = None padding: Union = True pad_to_multiple_of: Optional = 2 src_lang: Optional = None tgt_lang: Optional = None **kwargs )
Parameters
text (str, List[str], List[List[str]], optional) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
text_pair (str, List[str], List[List[str]], optional) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
text_target (str, List[str], List[List[str]], optional) — The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
text_pair_target (str, List[str], List[List[str]], optional) — The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
padding (bool, str or PaddingStrategy, optional, defaults to True) — Select a strategy to pad the returned sequences (according to the model's padding side and padding index) among:
True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
False or 'do_not_pad': No padding (i.e., can output a batch with sequences of different lengths).
pad_to_multiple_of (int, optional, defaults to 2) — If set will pad the sequence to a multiple of the provided value. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).
src_lang (str, optional) — A string representing the source language. If not specified, the last src_lang specified (either during initialization or when calling this tokenizer) will be used.
tgt_lang (str, optional) — A string representing the target language. If not specified, the last tgt_lang specified (either during initialization or when calling this tokenizer) will be used.
( feature_size = 80 sampling_rate = 16000 num_mel_bins = 80 padding_value = 0.0 stride = 2 **kwargs )
Parameters
feature_size (int, optional, defaults to 80) — The feature dimension of the extracted features.
sampling_rate (int, optional, defaults to 16000) — The sampling rate at which the audio files should be digitized, expressed in hertz (Hz).
num_mel_bins (int, optional, defaults to 80) — Number of Mel-frequency bins.
padding_value (float, optional, defaults to 0.0) — The value that is used to fill the padding vectors.
stride (int, optional, defaults to 2) — Stride used to reshape audios from shape (batch_size, num_frames, num_mel_bins) to (batch_size, num_frames//stride, num_mel_bins*stride).
Constructs a SeamlessM4T feature extractor.
This feature extractor inherits from SequenceFeatureExtractor which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
This class extracts mel-filter bank features from raw speech.
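As a rough sketch of the stride reshape described above (this mirrors only the shape arithmetic, not the extractor's actual implementation, which also handles padding and the attention mask):
>>> import numpy as np
>>> batch_size, num_frames, num_mel_bins, stride = 1, 6, 80, 2
>>> features = np.zeros((batch_size, num_frames, num_mel_bins))
>>> # consecutive pairs of frames are stacked along the feature axis
>>> features.reshape(batch_size, num_frames // stride, num_mel_bins * stride).shape
(1, 3, 160)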
( raw_speech: Union padding: Union = True pad_to_multiple_of: Optional = 2 max_length: Optional = None truncation: bool = False return_tensors: Union = None sampling_rate: Optional = None return_attention_mask: Optional = None do_normalize_per_mel_bins: Optional = True **kwargs )
Parameters
raw_speech (np.ndarray, torch.Tensor, List[float], List[np.ndarray], List[torch.Tensor], List[List[float]], List[List[List[float]]]) — The sequence or batch of sequences to be padded. Each sequence can be a numpy array, a torch tensor, a list of float values, a list of numpy arrays, a list of torch tensors, a list of lists of float values or a list of lists of lists of float values.
If raw_speech is a one-dimensional np.ndarray, torch.Tensor or a List[float], raw_speech is considered a single-channel, single-sample sound. In all other cases, the first dimension of raw_speech, whether from an np.ndarray, a torch.Tensor or a List[...], corresponds to the number of samples in the batch, and the number of channels (i.e. mono or stereo character) is derived from the other dimensions (1D -> single-channel waveform batches; 2D -> stereo-channel waveform batches).
padding (bool, str or PaddingStrategy, optional, defaults to True) — Select a strategy to pad the returned sequences (according to the model's padding side and padding index) among:
True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
False or 'do_not_pad': No padding (i.e., can output a batch with sequences of different lengths).
pad_to_multiple_of (int, optional, defaults to 2) — If set will pad the sequence to a multiple of the provided value. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta), or on TPUs which benefit from having sequence lengths be a multiple of 128.
max_length (int, optional) — Maximum length of the returned list and optionally padding length (see above).
truncation (bool, optional, defaults to False) — Activates truncation to cut input sequences longer than max_length to max_length.
return_attention_mask (bool, optional) — Whether to return the attention mask. If left to the default, will return the attention mask according to the specific feature_extractor's default.
For SeamlessM4T models, attention_mask should always be passed for batched inference, to avoid subtle bugs.
return_tensors (str or TensorType, optional) — If set, will return tensors instead of list of python integers. Acceptable values are:
'tf': Return TensorFlow tf.constant objects.
'pt': Return PyTorch torch.Tensor objects.
'np': Return Numpy np.ndarray objects.
sampling_rate (int, optional) — The sampling rate at which the raw_speech input was sampled. It is strongly recommended to pass sampling_rate at the forward call to prevent silent errors.
do_normalize_per_mel_bins (bool, optional, defaults to True) — Whether or not to zero-mean unit-variance normalize the input per mel-channel.
Main method to featurize and prepare for the model one or several sequence(s).
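For instance, extracting features from a raw waveform could look like the following sketch; the two-second silent waveform is a stand-in for real audio, and the exact output keys and shapes depend on the checkpoint:
>>> import torch
>>> from transformers import SeamlessM4TFeatureExtractor
>>> feature_extractor = SeamlessM4TFeatureExtractor.from_pretrained("facebook/hf-seamless-m4t-medium")
>>> waveform = torch.zeros(2 * 16000)  # 2 seconds of mono audio at 16 kHz (illustrative stand-in)
>>> audio_inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")
>>> # expect "input_features" of shape (batch_size, num_frames // stride, num_mel_bins * stride),
>>> # plus an attention mask for batched inference
>>> sorted(audio_inputs.keys())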
( feature_extractor tokenizer )
Parameters
feature_extractor (SeamlessM4TFeatureExtractor) — The audio processor is a required input.
tokenizer (SeamlessM4TTokenizerFast) — The tokenizer is a required input.
Constructs a SeamlessM4T processor which wraps a SeamlessM4T feature extractor and a SeamlessM4T tokenizer into a single processor.
SeamlessM4TProcessor offers all the functionalities of SeamlessM4TFeatureExtractor and SeamlessM4TTokenizerFast. See the call() and decode() for more information.
( text = None audios = None src_lang = None tgt_lang = None **kwargs ) → BatchEncoding
Parameters
text (str, List[str], List[List[str]]) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
audios (np.ndarray, torch.Tensor, List[np.ndarray], List[torch.Tensor]) — The audio or batch of audios to be prepared. Each audio can be a NumPy array or a PyTorch tensor. In case of a NumPy array/PyTorch tensor, each audio should be of shape (C, T), where C is the number of channels and T the sample length of the audio.
src_lang (str, optional) — The language code of the input texts/audios. If not specified, the last src_lang specified will be used.
tgt_lang (str, optional) — The code of the target language. If not specified, the last tgt_lang specified will be used.
Returns
A BatchEncoding with the following fields:
input_ids — List of token ids to be fed to a model. Returned when text is not None.
attention_mask — List of indices specifying which tokens should be attended to by the model (when return_attention_mask=True or if "attention_mask" is in self.model_input_names and if text is not None).
input_features — Audio input features to be fed to a model. Returned when audios is not None.
Main method to prepare for the model one or several sequence(s) and audio(s). This method forwards the text and kwargs arguments to SeamlessM4TTokenizerFast's call() if text is not None to encode the text. To prepare the audio(s), this method forwards the audios and kwargs arguments to SeamlessM4TFeatureExtractor's call() if audios is not None. Please refer to the docstring of the above two methods for more information.
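Because decode() simply forwards to the wrapped tokenizer, converting generated text token ids back into a string does not require touching the tokenizer directly. A minimal illustration, reusing the processor loaded earlier on this page and round-tripping the processor's own encoding as a stand-in for model-generated ids:
>>> encoded = processor(text="The weather is nice today.", src_lang="eng", return_tensors="pt")
>>> processor.decode(encoded["input_ids"][0], skip_special_tokens=True)  # roughly recovers the input string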
( config )
Parameters
config (SeamlessM4TConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
Code HiFi-GAN vocoder as described in this repository. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
( input_ids: LongTensor spkr_id: Tensor lang_id: Tensor )
Parameters
input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using SeamlessM4TTextToUnitForConditionalGeneration.
spkr_id (int, optional) — The id of the speaker used for speech synthesis. Must be lower than config.vocoder_num_spkrs.
lang_id (str, optional) — The language id to use as target language for translation.
( input_embeds: FloatTensor ) → torch.FloatTensor
Parameters
input_embeds (torch.FloatTensor) — Tensor containing the log-mel spectrograms. Can be batched and of shape (batch_size, sequence_length, model_in_dim), or un-batched and of shape (sequence_length, model_in_dim). Note that model_in_dim is the sum of config.unit_embed_dim, config.lang_embed_dim and config.spkr_embed_dim.
Returns
torch.FloatTensor — Tensor containing the speech waveform. If the input spectrogram is batched, will be of shape (batch_size, num_frames). If un-batched, will be of shape (num_frames,).
Converts a log-mel spectrogram into a speech waveform. Passing a batch of log-mel spectrograms returns a batch of speech waveforms. Passing a single, un-batched log-mel spectrogram returns a single, un-batched speech waveform.
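As a shape-level sketch of the relationship described above (the vocoder is normally created and called internally by SeamlessM4TModel during speech generation, so hifi_gan below is only a placeholder for an instance of the vocoder module and the direct call is shown commented out):
>>> import torch
>>> from transformers import SeamlessM4TConfig
>>> config = SeamlessM4TConfig.from_pretrained("facebook/hf-seamless-m4t-medium")
>>> # model_in_dim as described above
>>> model_in_dim = config.unit_embed_dim + config.lang_embed_dim + config.spkr_embed_dim
>>> input_embeds = torch.randn(50, model_in_dim)  # un-batched dummy log-mel spectrogram, 50 frames
>>> # waveform = hifi_gan(input_embeds)  # would return a tensor of shape (num_frames,)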
( config: SeamlessM4TConfig embed_tokens_decoder: Optional = None )
Parameters
config (SeamlessM4TConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
embed_tokens_decoder (nn.Embedding, optional) — Input embedding of the decoder.
Bare Transformer text-to-unit encoder-decoder. The encoder is a SeamlessM4TEncoder without embeddings and the decoder is a SeamlessM4TDecoder.
This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
( config: SeamlessM4TConfig embed_tokens_decoder: Optional = None )
Parameters
config (SeamlessM4TConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
embed_tokens_decoder (nn.Embedding, optional) — Input embedding of the decoder.
Transformer text-to-unit encoder-decoder with a language model head. The base encoder-decoder model is a SeamlessM4TTextToUnit.
This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
( input_ids: LongTensor = None attention_mask: Optional = None decoder_input_ids: Optional = None decoder_attention_mask: Optional = None encoder_outputs: Optional = None past_key_values: Optional = None inputs_embeds: Optional = None decoder_inputs_embeds: Optional = None labels: Optional = None use_cache: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None )
Parameters
input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary.
Indices can be obtained using SeamlessM4TTokenizer or SeamlessM4TProcessor. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
1 for tokens that are not masked,
0 for tokens that are masked.
decoder_input_ids (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — Indices of decoder input sequence tokens in the vocabulary.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
Bart uses the eos_token_id as the starting token for decoder_input_ids generation. If past_key_values is used, optionally only the last decoder_input_ids have to be input (see past_key_values).
For translation and summarization training, decoder_input_ids should be provided. If no decoder_input_ids is provided, the model will create this tensor by shifting the input_ids to the right for denoising pre-training following the paper.
decoder_attention_mask (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default.
If you want to change padding behavior, you should read modeling_bart._prepare_decoder_attention_mask and modify to your needs. See diagram 1 in the paper for more information on the default strategy.
encoder_outputs (tuple(tuple(torch.FloatTensor)), optional) — Tuple consists of (last_hidden_state, optional: hidden_states, optional: attentions). last_hidden_state, of shape (batch_size, sequence_length, hidden_size) and optional, is a sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).
Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don't have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
decoder_inputs_embeds (torch.FloatTensor of shape (batch_size, target_sequence_length, hidden_size), optional) — Optionally, instead of passing decoder_input_ids you can choose to directly pass an embedded representation. If past_key_values is used, optionally only the last decoder_inputs_embeds have to be input (see past_key_values). This is useful if you want more control over how to convert decoder_input_ids indices into associated vectors than the model's internal embedding lookup matrix.
If decoder_input_ids and decoder_inputs_embeds are both unset, decoder_inputs_embeds takes the value of inputs_embeds.
labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should be in [-100, 0, ..., config.vocab_size] (see input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
The SeamlessM4TTextToUnitForConditionalGeneration forward method overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
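As a rough illustration of how the arguments above fit together in a teacher-forced training step (this sub-model is normally driven internally by SeamlessM4TModel, so text_to_unit below is only a placeholder for an instance of it and the tensors are dummies):
>>> import torch
>>> input_ids = torch.randint(0, 100, (2, 10))   # 2 text token sequences of length 10
>>> attention_mask = torch.ones_like(input_ids)  # no padding in this dummy batch
>>> labels = torch.randint(0, 100, (2, 12))      # 2 target unit sequences of length 12
>>> # outputs = text_to_unit(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
>>> # outputs.loss would then hold the cross-entropy over the unit vocabulary; when labels are given
>>> # and decoder_input_ids is not, the decoder inputs are typically derived by shifting the targets
>>> # to the right (standard teacher forcing).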