This model is in maintenance mode only, so we won’t accept any new PRs changing its code.
If you run into any issues running this model, please reinstall the last version that supported this model: v4.30.0.
You can do so by running the following command: pip install -U transformers==4.30.0
.
The M-CTC-T model was proposed in Pseudo-Labeling For Massively Multilingual Speech Recognition by Loren Lugosch, Tatiana Likhomanenko, Gabriel Synnaeve, and Ronan Collobert. The model is a 1B-param transformer encoder, with a CTC head over 8065 character labels and a language identification head over 60 language ID labels. It is trained on Common Voice (version 6.1, December 2020 release) and VoxPopuli. After training on Common Voice and VoxPopuli, the model is trained on Common Voice only. The labels are unnormalized character-level transcripts (punctuation and capitalization are not removed). The model takes as input Mel filterbank features from a 16Khz audio signal.
The abstract from the paper is the following:
Semi-supervised learning through pseudo-labeling has become a staple of state-of-the-art monolingual speech recognition systems. In this work, we extend pseudo-labeling to massively multilingual speech recognition with 60 languages. We propose a simple pseudo-labeling recipe that works well even with low-resource languages: train a supervised multilingual model, fine-tune it with semi-supervised learning on a target language, generate pseudo-labels for that language, and train a final model using pseudo-labels for all languages, either from scratch or by fine-tuning. Experiments on the labeled Common Voice and unlabeled VoxPopuli datasets show that our recipe can yield a model with better performance for many languages that also transfers well to LibriSpeech.
This model was contributed by cwkeam. The original code can be found here.
The PyTorch version of this model is only available in torch 1.9 and higher.
( vocab_size = 8065 hidden_size = 1536 num_hidden_layers = 36 intermediate_size = 6144 num_attention_heads = 4 attention_head_dim = 384 max_position_embeddings = 920 layer_norm_eps = 1e-05 layerdrop = 0.3 hidden_act = 'relu' initializer_range = 0.02 hidden_dropout_prob = 0.3 attention_probs_dropout_prob = 0.3 pad_token_id = 1 bos_token_id = 0 eos_token_id = 2 conv_glu_dim = 1 conv_dropout = 0.3 num_conv_layers = 1 conv_kernel = (7,) conv_stride = (3,) input_feat_per_channel = 80 input_channels = 1 conv_channels = None ctc_loss_reduction = 'sum' ctc_zero_infinity = False **kwargs )
Parameters
int
, optional, defaults to 8065) —
Vocabulary size of the M-CTC-T model. Defines the number of different tokens that can be represented by the
inputs_ids
passed when calling MCTCTModel. int
, optional, defaults to 1536) —
Dimension of the encoder layers and the pooler layer. int
, optional, defaults to 36) —
Number of hidden layers in the Transformer encoder. int
, optional, defaults to 6144) —
Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder. int
, optional, defaults to 4) —
Number of attention heads for each attention layer in the Transformer encoder. int
, optional, defaults to 384) —
Dimensions of each attention head for each attention layer in the Transformer encoder. int
, optional, defaults to 920) —
The maximum sequence length that this model might ever be used with (after log-mel spectrogram extraction). float
, optional, defaults to 1e-05) —
The epsilon used by the layer normalization layers. float
, optional, defaults to 0.3) —
The probability of dropping an encoder layer during training. The default 0.3 value is used in the original
implementation. str
or function
, optional, defaults to "relu"
) —
The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu"
,
"relu"
, "selu"
and "gelu_new"
are supported. float
, optional, defaults to 0.02) —
The standard deviation of the truncated_normal_initializer for initializing all weight matrices. float
, optional, defaults to 0.3) —
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. float
, optional, defaults to 0.3) —
The dropout ratio for the attention probabilities. int
, optional, defaults to 1) —
The tokenizer index of the pad token. int
, optional, defaults to 0) —
The tokenizer index of the bos token. int
, optional, defaults to 2) —
The tokenizer index of the eos token. int
, optional, defaults to 1) —
The dimension of the output of the Conv1dSubsampler
layer in which GLU is applied on. Though the original
Flashlight code uses the value of 2, here it’s adapted to 1 due to transposition differences. int
, optional, defaults to 0.3) —
The probability of randomly dropping the Conv1dSubsampler
layer during training. int
, optional, defaults to 1) —
Number of convolution layers before applying transformer encoder layers. Sequence[int]
, optional, defaults to (7,)
) —
The kernel size of the 1D convolution applied before transformer layers. len(conv_kernel)
must be equal
to num_conv_layers
. Sequence[int]
, optional, defaults to (3,)
) —
The stride length of the 1D convolution applied before transformer layers. len(conv_stride)
must be equal
to num_conv_layers
. int
, optional, defaults to 80) —
Feature dimensions of the channels of the input to the Conv1D layer. int
, optional, defaults to 1) —
Number of input channels of the input to the Conv1D layer. List[int]
, optional) —
Channel sizes of intermediate Conv1D layers. str
, optional, defaults to "sum"
) —
Specifies the reduction to apply to the output of torch.nn.CTCLoss
. Only relevant when training an
instance of MCTCTForCTC. bool
, optional, defaults to False
) —
Whether to zero infinite losses and the associated gradients of torch.nn.CTCLoss
. Infinite losses mainly
occur when the inputs are too short to be aligned to the targets. Only relevant when training an instance
of MCTCTForCTC. This is the configuration class to store the configuration of a MCTCTModel. It is used to instantiate an M-CTC-T model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the M-CTC-T speechbrain/m-ctc-t-large architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Example:
>>> from transformers import MCTCTConfig, MCTCTModel
>>> # Initializing a M-CTC-T mctct-large style configuration
>>> configuration = MCTCTConfig()
>>> # Initializing a model (with random weights) from the mctct-large style configuration
>>> model = MCTCTModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
( feature_size = 80 sampling_rate = 16000 padding_value = 0.0 hop_length = 10 win_length = 25 win_function = 'hamming_window' frame_signal_scale = 32768.0 preemphasis_coeff = 0.97 mel_floor = 1.0 normalize_means = True normalize_vars = True return_attention_mask = False **kwargs )
Parameters
int
, defaults to 80) —
The feature dimension of the extracted features. This is the number of mel_frequency int
, defaults to 16000) —
The sampling rate at which the audio files should be digitalized expressed in hertz (Hz). float
, defaults to 0.0) —
The value that is used to fill the padding values. int
, defaults to 10) —
Number of audio samples between windows. Otherwise referred to as “shift” in many papers. int
, defaults to 25) —
Number of ms per window str
, defaults to "hamming_window"
) —
Name for the window function used for windowing, must be accessible via torch.{win_function}
float
, defaults to 32768.0) —
Constant multiplied in creating the frames before applying DFT. float
, defaults to 0.97) —
Constant multiplied in applying Pre-emphasis before DFT. float
defaults to 1.0) —
Minimum value of mel frequency banks. bool
, optional, defaults to True
) —
Whether or not to zero-mean normalize the extracted features. bool
, optional, defaults to True
) —
Whether or not to unit-variance normalize the extracted features. Constructs a M-CTC-T feature extractor.
This feature extractor inherits from SequenceFeatureExtractor which contains most of the main methods. Users should refer to this superclass for more information regarding those methods. This code has been adapted from Flashlight’s C++ code. For more information about the implementation, one can refer to this notebook that takes the user step-by-step in the implementation.
( raw_speech: Union padding: Union = False max_length: Optional = None truncation: bool = False pad_to_multiple_of: Optional = None return_attention_mask: Optional = None return_tensors: Union = None sampling_rate: Optional = None **kwargs )
Parameters
torch.Tensor
, np.ndarray
, List[float]
, List[torch.Tensor]
, List[np.ndarray]
, List[List[float]]
) —
The sequence or batch of sequences to be padded. Each sequence can be a tensor, a numpy array, a list
of float values, a list of tensors, a list of numpy arrays or a list of list of float values. Must be
mono channel audio, not stereo, i.e. single float per timestep. bool
, str
or PaddingStrategy, optional, defaults to False
) —
Select a strategy to pad the returned sequences (according to the model’s padding side and padding
index) among:
True
or 'longest'
: Pad to the longest sequence in the batch (or no padding if only a single
sequence if provided).'max_length'
: Pad to a maximum length specified with the argument max_length
or to the maximum
acceptable input length for the model if that argument is not provided.False
or 'do_not_pad'
(default): No padding (i.e., can output a batch with sequences of different
lengths).int
, optional) —
Maximum length of the returned list and optionally padding length (see above). bool
) —
Activates truncation to cut input sequences longer than max_length to max_length. int
, optional) —
If set will pad the sequence to a multiple of the provided value.
This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
>= 7.5
(Volta), or on TPUs which benefit from having sequence lengths be a multiple of 128.
bool
, optional) —
Whether to return the attention mask. If left to the default, will return the attention mask according
to the specific feature_extractor’s default.
str
or TensorType, optional) —
If set, will return tensors instead of list of python integers. Acceptable values are:
'tf'
: Return TensorFlow tf.constant
objects.'pt'
: Return PyTorch torch.Tensor
objects.'np'
: Return Numpy np.ndarray
objects.int
, optional) —
The sampling rate at which the raw_speech
input was sampled. It is strongly recommended to pass
sampling_rate
at the forward call to prevent silent errors. float
, defaults to 0.0) — Main method to featurize and prepare for the model one or several sequence(s). sequences. It returns the log-mel spectrogram of the input audio, as implemented in the original Flashlight MFSC feature extraction code.
( feature_extractor tokenizer )
Parameters
MCTCTFeatureExtractor
) —
An instance of MCTCTFeatureExtractor. The feature extractor is a required input. AutoTokenizer
) —
An instance of AutoTokenizer. The tokenizer is a required input. Constructs a MCTCT processor which wraps a MCTCT feature extractor and a MCTCT tokenizer into a single processor.
MCTCTProcessor offers all the functionalities of MCTCTFeatureExtractor and AutoTokenizer. See the call() and decode() for more information.
When used in normal mode, this method forwards all its arguments to MCTCTFeatureExtractor’s
call() and returns its output. If used in the context
as_target_processor()
this method forwards all its arguments to AutoTokenizer’s
__call__()
. Please refer to the doctsring of the above two methods for more information.
( pretrained_model_name_or_path: Union cache_dir: Union = None force_download: bool = False local_files_only: bool = False token: Union = None revision: str = 'main' **kwargs )
Parameters
str
or os.PathLike
) —
This can be either:
./my_model_directory/
../my_model_directory/preprocessor_config.json
.
**kwargs —
Additional keyword arguments passed along to both
from_pretrained() and
~tokenization_utils_base.PreTrainedTokenizer.from_pretrained
.Instantiate a processor associated with a pretrained model.
This class method is simply calling the feature extractor
from_pretrained(), image processor
ImageProcessingMixin and the tokenizer
~tokenization_utils_base.PreTrainedTokenizer.from_pretrained
methods. Please refer to the docstrings of the
methods above for more information.
( save_directory push_to_hub: bool = False **kwargs )
Parameters
str
or os.PathLike
) —
Directory where the feature extractor JSON file and the tokenizer files will be saved (directory will
be created if it does not exist). bool
, optional, defaults to False
) —
Whether or not to push your model to the Hugging Face model hub after saving it. You can specify the
repository you want to push to with repo_id
(will default to the name of save_directory
in your
namespace). Dict[str, Any]
, optional) —
Additional key word arguments passed along to the push_to_hub() method. Saves the attributes of this processor (feature extractor, tokenizer…) in the specified directory so that it can be reloaded using the from_pretrained() method.
This class method is simply calling save_pretrained() and save_pretrained(). Please refer to the docstrings of the methods above for more information.
This method forwards all its arguments to AutoTokenizer’s batch_decode(). Please refer to the docstring of this method for more information.
This method forwards all its arguments to AutoTokenizer’s decode(). Please refer to the docstring of this method for more information.
( config )
Parameters
The bare M-CTC-T Model transformer outputting raw hidden-states without any specific head on top. This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
( input_features: Tensor attention_mask: Optional = None head_mask: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) → transformers.modeling_outputs.BaseModelOutput or tuple(torch.FloatTensor)
Parameters
torch.LongTensor
of shape (batch_size, sequence_length)
) —
Indices of input sequence tokens in the vocabulary.
Indices can be obtained using Wav2Vec2CTCTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
torch.FloatTensor
of shape (batch_size, sequence_length)
, optional) —
Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]
:
torch.FloatTensor
of shape (num_heads,)
or (num_layers, num_heads)
, optional) —
Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]
:
bool
, optional) —
Whether or not to return the attentions tensors of all attention layers. See attentions
under returned
tensors for more detail. bool
, optional) —
Whether or not to return the hidden states of all layers. See hidden_states
under returned tensors for
more detail. bool
, optional) —
Whether or not to return a ModelOutput instead of a plain tuple. Returns
transformers.modeling_outputs.BaseModelOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.BaseModelOutput or a tuple of
torch.FloatTensor
(if return_dict=False
is passed or when config.return_dict=False
) comprising various
elements depending on the configuration (MCTCTConfig) and inputs.
last_hidden_state (torch.FloatTensor
of shape (batch_size, sequence_length, hidden_size)
) — Sequence of hidden-states at the output of the last layer of the model.
hidden_states (tuple(torch.FloatTensor)
, optional, returned when output_hidden_states=True
is passed or when config.output_hidden_states=True
) — Tuple of torch.FloatTensor
(one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape (batch_size, sequence_length, hidden_size)
.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of torch.FloatTensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The MCTCTModel forward method, overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Example:
>>> from transformers import AutoProcessor, MCTCTModel
>>> from datasets import load_dataset
>>> import torch
>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> dataset = dataset.sort("id")
>>> sampling_rate = dataset.features["audio"].sampling_rate
>>> processor = AutoProcessor.from_pretrained("speechbrain/m-ctc-t-large")
>>> model = MCTCTModel.from_pretrained("speechbrain/m-ctc-t-large")
>>> # audio file is decoded on the fly
>>> inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
>>> with torch.no_grad():
... logits = model(**inputs).logits
>>> predicted_ids = torch.argmax(logits, dim=-1)
>>> # transcribe speech
>>> transcription = processor.batch_decode(predicted_ids)
>>> transcription[0]
[1, 195, 1536]
>>> inputs["labels"] = processor(text=dataset[0]["text"], return_tensors="pt").input_ids
>>> # compute loss
>>> loss = model(**inputs).loss
( config )
Parameters
MCTCT Model with a language modeling
head on top for Connectionist Temporal Classification (CTC).
This model is a PyTorch torch.nn.Module sub-class. Use
it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and
behavior.
( input_features: Tensor attention_mask: Optional = None head_mask: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None labels: Optional = None ) → transformers.modeling_outputs.CausalLMOutput or tuple(torch.FloatTensor)
Parameters
torch.LongTensor
of shape ({0})
) —
Indices of input sequence tokens in the vocabulary.
Indices can be obtained using Wav2Vec2CTCTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
torch.FloatTensor
of shape ({0})
, optional) —
Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]
:
torch.FloatTensor
of shape (num_heads,)
or (num_layers, num_heads)
, optional) —
Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]
:
bool
, optional) —
Whether or not to return the attentions tensors of all attention layers. See attentions
under returned
tensors for more detail. bool
, optional) —
Whether or not to return the hidden states of all layers. See hidden_states
under returned tensors for
more detail. bool
, optional) —
Whether or not to return a ModelOutput instead of a plain tuple. torch.LongTensor
of shape (batch_size, target_length)
, optional) —
Labels for connectionist temporal classification. Note that target_length
has to be smaller or equal to
the sequence length of the output logits. Indices are selected in [-100, 0, ..., config.vocab_size - 1]
.
All labels set to -100
are ignored (masked), the loss is only computed for labels in [0, ..., config.vocab_size - 1]
. Returns
transformers.modeling_outputs.CausalLMOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.CausalLMOutput or a tuple of
torch.FloatTensor
(if return_dict=False
is passed or when config.return_dict=False
) comprising various
elements depending on the configuration (MCTCTConfig) and inputs.
loss (torch.FloatTensor
of shape (1,)
, optional, returned when labels
is provided) — Language modeling loss (for next-token prediction).
logits (torch.FloatTensor
of shape (batch_size, sequence_length, config.vocab_size)
) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
hidden_states (tuple(torch.FloatTensor)
, optional, returned when output_hidden_states=True
is passed or when config.output_hidden_states=True
) — Tuple of torch.FloatTensor
(one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape (batch_size, sequence_length, hidden_size)
.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of torch.FloatTensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The MCTCTForCTC forward method, overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Example:
>>> from transformers import AutoProcessor, MCTCTForCTC
>>> from datasets import load_dataset
>>> import torch
>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> dataset = dataset.sort("id")
>>> sampling_rate = dataset.features["audio"].sampling_rate
>>> processor = AutoProcessor.from_pretrained("speechbrain/m-ctc-t-large")
>>> model = MCTCTForCTC.from_pretrained("speechbrain/m-ctc-t-large")
>>> # audio file is decoded on the fly
>>> inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
>>> with torch.no_grad():
... logits = model(**inputs).logits
>>> predicted_ids = torch.argmax(logits, dim=-1)
>>> # transcribe speech
>>> transcription = processor.batch_decode(predicted_ids)
>>> transcription[0]
"Mr. Quilter is the apostle of the middle classes, and we're glad to welcome his gospel."
>>> inputs["labels"] = processor(text=dataset[0]["text"], return_tensors="pt").input_ids
>>> # compute loss
>>> loss = model(**inputs).loss
>>> round(loss.item(), 2)
1885.65