StableLM 3B 4E1T
was proposed in StableLM 3B 4E1T
: Technical Report by Stability AI and is the first model in a series of multi-epoch pre-trained language models.
StableLM 3B 4E1T
is a decoder-only base language model pre-trained on 1 trillion tokens of diverse English and code datasets for four epochs.
The model architecture is transformer-based with partial Rotary Position Embeddings, SwiGLU activation, LayerNorm, etc.
We also provide StableLM Zephyr 3B
, an instruction fine-tuned version of the model that can be used for chat-based applications.
StableLM 3B 4E1T
-based models uses the same tokenizer as GPTNeoXTokenizerFast.StableLM 3B 4E1T
and StableLM Zephyr 3B
can be found on the Huggingface Hub
The following code snippet demonstrates how to use StableLM 3B 4E1T
for inference:
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> device = "cuda" # the device to load the model onto
>>> tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablelm-3b-4e1t")
>>> model = AutoModelForCausalLM.from_pretrained("stabilityai/stablelm-3b-4e1t")
>>> model.to(device)
>>> model_inputs = tokenizer("The weather is always wonderful in", return_tensors="pt").to(model.device)
>>> generated_ids = model.generate(**model_inputs, max_length=32, do_sample=True)
>>> responses = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
>>> responses
['The weather is always wonderful in Santa Barbara and, for visitors hoping to make the move to our beautiful seaside city, this town offers plenty of great places to...']
First, make sure to install the latest version of Flash Attention v2.
pip install -U flash-attn --no-build-isolation
Also make sure that your hardware is compatible with Flash-Attention 2. Read more about it in the official documentation of the flash-attn
repository. Note: you must load your model in half-precision (e.g. torch.bfloat16
).
Now, to run the model with Flash Attention 2, refer to the snippet below:
>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> device = "cuda" # the device to load the model onto
>>> tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablelm-3b-4e1t")
>>> model = AutoModelForCausalLM.from_pretrained("stabilityai/stablelm-3b-4e1t", torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2")
>>> model.to(device)
>>> model_inputs = tokenizer("The weather is always wonderful in", return_tensors="pt").to(model.device)
>>> generated_ids = model.generate(**model_inputs, max_length=32, do_sample=True)
>>> responses = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
>>> responses
['The weather is always wonderful in Santa Barbara and, for visitors hoping to make the move to our beautiful seaside city, this town offers plenty of great places to...']
( vocab_size = 50304 intermediate_size = 6912 hidden_size = 2560 num_hidden_layers = 32 num_attention_heads = 32 num_key_value_heads = 32 hidden_act = 'silu' max_position_embeddings = 4096 initializer_range = 0.02 layer_norm_eps = 1e-05 use_cache = True tie_word_embeddings = False rope_theta = 10000 rope_scaling = None use_qkv_bias = False hidden_dropout = 0.0 attention_dropout = 0.0 partial_rotary_factor = 0.25 bos_token_id = 0 eos_token_id = 0 **kwargs )
Parameters
int
, optional, defaults to 50304) —
Vocabulary size of the StableLM model. Defines the number of different tokens that
can be represented by the inputs_ids
passed when calling StableLmModel. int
, optional, defaults to 6912) —
Dimension of the MLP representations. int
, optional, defaults to 2560) —
Number of hidden layers in the Transformer decoder. int
, optional, defaults to 32) —
Number of hidden layers in the Transformer decoder. int
, optional, defaults to 32) —
Number of attention heads for each attention layer in the Transformer encoder. int
, optional, defaults to 32) —
This is the number of key_value heads that should be used to implement Grouped Query Attention. If
num_key_value_heads=num_attention_heads
, the model will use Multi Head Attention (MHA), if
num_key_value_heads=1 the model will use Multi Query Attention (MQA) otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed by meanpooling all the original heads within that group. For more details checkout [this paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
num_attention_heads`. str
or function
, optional, defaults to "silu"
) —
The non-linear activation function (function or string). int
, optional, defaults to 4096) —
The maximum sequence length that this model might ever be used with.
Typically set this to something large just in case (e.g., 512 or 1024 or 2048). float
, optional, defaults to 0.02) —
The standard deviation of the truncated_normal_initializer for initializing
all weight matrices. float
, optional, defaults to 1e-05) —
The epsilon used by the normalization layers. bool
, optional, defaults to True
) —
Whether or not the model should return the last key/values attentions
(not used by all models). Only relevant if config.is_decoder=True
. bool
, optional, defaults to False
) —
Whether the model’s input and output word embeddings should be tied. float
, optional, defaults to 10000.0
) —
The base period of the RoPE embeddings. Dict
, optional) —
Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling
strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is
{"type": strategy name, "factor": scaling factor}
. When using this flag, don’t update
max_position_embeddings
to the expected new maximum. See the following thread for more information on how
these scaling strategies behave:
https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This
is an experimental feature, subject to breaking API changes in future versions. bool
, optional, defaults to False
) —
Whether or not the model should use bias for qkv layers. float
, optional, defaults to 0.0) —
The dropout ratio after applying the MLP to the hidden states. float
, optional, defaults to 0.0) —
The dropout ratio for the attention probabilities. float
, optional, defaults to 0.25) —
Percentage of the query and keys which will have rotary embedding. BOS
token in the vocabulary. EOS
token in the vocabulary. This is the configuration class to store the configuration of a ~StableLmModel. It is used to instantiate an StableLM model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the StableLM stabilityai/stablelm-3b-4e1t architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
( config: StableLmConfig )
Parameters
The bare StableLm Model outputting raw hidden-states without any specific head on top. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
Transformer decoder consisting of config.num_hidden_layers layers. Each layer is a StableLmDecoderLayer
( input_ids: LongTensor = None attention_mask: Optional = None position_ids: Optional = None past_key_values: Optional = None inputs_embeds: Optional = None use_cache: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None )
Parameters
torch.LongTensor
of shape (batch_size, sequence_length)
) —
Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
it.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
torch.Tensor
of shape (batch_size, sequence_length)
, optional) —
Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]
:
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
If past_key_values
is used, optionally only the last decoder_input_ids
have to be input (see
past_key_values
).
If you want to change padding behavior, you should read modeling_opt._prepare_decoder_attention_mask
and modify to your needs. See diagram 1 in the paper for more
information on the default strategy.
torch.LongTensor
of shape (batch_size, sequence_length)
, optional) —
Indices of positions of each input sequence tokens in the position embeddings. Selected in the range [0, config.n_positions - 1]
.
Cache
or tuple(tuple(torch.FloatTensor))
, optional) —
Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
blocks) that can be used to speed up sequential decoding. This typically consists in the past_key_values
returned by the model at a previous stage of decoding, when use_cache=True
or config.use_cache=True
.
Two formats are allowed:
tuple(torch.FloatTensor)
of length config.n_layers
, with each tuple having 2 tensors of
shape (batch_size, num_heads, sequence_length, embed_size_per_head)
). This is also known as the legacy
cache format.The model will output the same cache format that is fed as input. If no past_key_values
are passed, the
legacy cache format will be returned.
If past_key_values
are used, the user can optionally input only the last input_ids
(those that don’t
have their past key value states given to this model) of shape (batch_size, 1)
instead of all input_ids
of shape (batch_size, sequence_length)
.
torch.FloatTensor
of shape (batch_size, sequence_length, hidden_size)
, optional) —
Optionally, instead of passing input_ids
you can choose to directly pass an embedded representation. This
is useful if you want more control over how to convert input_ids
indices into associated vectors than the
model’s internal embedding lookup matrix. bool
, optional) —
If set to True
, past_key_values
key value states are returned and can be used to speed up decoding (see
past_key_values
). bool
, optional) —
Whether or not to return the attentions tensors of all attention layers. See attentions
under returned
tensors for more detail. bool
, optional) —
Whether or not to return the hidden states of all layers. See hidden_states
under returned tensors for
more detail. bool
, optional) —
Whether or not to return a ModelOutput instead of a plain tuple. The StableLmModel forward method, overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
( input_ids: LongTensor = None attention_mask: Optional = None position_ids: Optional = None past_key_values: Optional = None inputs_embeds: Optional = None labels: Optional = None use_cache: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) → transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
Parameters
torch.LongTensor
of shape (batch_size, sequence_length)
) —
Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
it.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
torch.Tensor
of shape (batch_size, sequence_length)
, optional) —
Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]
:
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
If past_key_values
is used, optionally only the last decoder_input_ids
have to be input (see
past_key_values
).
If you want to change padding behavior, you should read modeling_opt._prepare_decoder_attention_mask
and modify to your needs. See diagram 1 in the paper for more
information on the default strategy.
torch.LongTensor
of shape (batch_size, sequence_length)
, optional) —
Indices of positions of each input sequence tokens in the position embeddings. Selected in the range [0, config.n_positions - 1]
.
Cache
or tuple(tuple(torch.FloatTensor))
, optional) —
Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
blocks) that can be used to speed up sequential decoding. This typically consists in the past_key_values
returned by the model at a previous stage of decoding, when use_cache=True
or config.use_cache=True
.
Two formats are allowed:
tuple(torch.FloatTensor)
of length config.n_layers
, with each tuple having 2 tensors of
shape (batch_size, num_heads, sequence_length, embed_size_per_head)
). This is also known as the legacy
cache format.The model will output the same cache format that is fed as input. If no past_key_values
are passed, the
legacy cache format will be returned.
If past_key_values
are used, the user can optionally input only the last input_ids
(those that don’t
have their past key value states given to this model) of shape (batch_size, 1)
instead of all input_ids
of shape (batch_size, sequence_length)
.
torch.FloatTensor
of shape (batch_size, sequence_length, hidden_size)
, optional) —
Optionally, instead of passing input_ids
you can choose to directly pass an embedded representation. This
is useful if you want more control over how to convert input_ids
indices into associated vectors than the
model’s internal embedding lookup matrix. bool
, optional) —
If set to True
, past_key_values
key value states are returned and can be used to speed up decoding (see
past_key_values
). bool
, optional) —
Whether or not to return the attentions tensors of all attention layers. See attentions
under returned
tensors for more detail. bool
, optional) —
Whether or not to return the hidden states of all layers. See hidden_states
under returned tensors for
more detail. bool
, optional) —
Whether or not to return a ModelOutput instead of a plain tuple.
Args —
labels (torch.LongTensor
of shape (batch_size, sequence_length)
, optional):
Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size]
or -100 (see input_ids
docstring). Tokens with indices set to -100
are ignored
(masked), the loss is only computed for the tokens with labels in [0, ..., config.vocab_size]
.
Returns
transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
A transformers.modeling_outputs.CausalLMOutputWithPast or a tuple of
torch.FloatTensor
(if return_dict=False
is passed or when config.return_dict=False
) comprising various
elements depending on the configuration (StableLmConfig) and inputs.
loss (torch.FloatTensor
of shape (1,)
, optional, returned when labels
is provided) — Language modeling loss (for next-token prediction).
logits (torch.FloatTensor
of shape (batch_size, sequence_length, config.vocab_size)
) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (tuple(tuple(torch.FloatTensor))
, optional, returned when use_cache=True
is passed or when config.use_cache=True
) — Tuple of tuple(torch.FloatTensor)
of length config.n_layers
, with each tuple having 2 tensors of shape
(batch_size, num_heads, sequence_length, embed_size_per_head)
)
Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
past_key_values
input) to speed up sequential decoding.
hidden_states (tuple(torch.FloatTensor)
, optional, returned when output_hidden_states=True
is passed or when config.output_hidden_states=True
) — Tuple of torch.FloatTensor
(one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape (batch_size, sequence_length, hidden_size)
.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of torch.FloatTensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The StableLmForCausalLM forward method, overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Example:
>>> from transformers import AutoTokenizer, StableLmForCausalLM
>>> model = StableLmForCausalLM.from_pretrained("stabilityai/stablelm-3b-4e1t")
>>> tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablelm-3b-4e1t")
>>> prompt = "The weather is always wonderful in"
>>> inputs = tokenizer(prompt, return_tensors="pt")
>>> # Generate
>>> generate_ids = model.generate(inputs.input_ids, max_length=30)
>>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
'The weather is always wonderful in the summer in the city of San Diego. The city is located on the coast of the Pacific Ocean and is surrounded by'
( config )
Parameters
The StableLm transformer with a sequence classification head on top (linear layer).
StableLmForSequenceClassification uses the last token in order to do the classification, as other causal models (e.g. GPT-2) do.
Since it does classification on the last token, it requires to know the position of the last token. If a
pad_token_id
is defined in the configuration, it finds the last token that is not a padding token in each row. If
no pad_token_id
is defined, it simply takes the last value in each row of the batch. Since it cannot guess the
padding tokens when inputs_embeds
are passed instead of input_ids
, it does the same (take the last value in
each row of the batch).
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
( input_ids: LongTensor = None attention_mask: Optional = None position_ids: Optional = None past_key_values: Optional = None inputs_embeds: Optional = None labels: Optional = None use_cache: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None )
Parameters
torch.LongTensor
of shape (batch_size, sequence_length)
) —
Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
it.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
torch.Tensor
of shape (batch_size, sequence_length)
, optional) —
Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]
:
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
If past_key_values
is used, optionally only the last decoder_input_ids
have to be input (see
past_key_values
).
If you want to change padding behavior, you should read modeling_opt._prepare_decoder_attention_mask
and modify to your needs. See diagram 1 in the paper for more
information on the default strategy.
torch.LongTensor
of shape (batch_size, sequence_length)
, optional) —
Indices of positions of each input sequence tokens in the position embeddings. Selected in the range [0, config.n_positions - 1]
.
Cache
or tuple(tuple(torch.FloatTensor))
, optional) —
Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
blocks) that can be used to speed up sequential decoding. This typically consists in the past_key_values
returned by the model at a previous stage of decoding, when use_cache=True
or config.use_cache=True
.
Two formats are allowed:
tuple(torch.FloatTensor)
of length config.n_layers
, with each tuple having 2 tensors of
shape (batch_size, num_heads, sequence_length, embed_size_per_head)
). This is also known as the legacy
cache format.The model will output the same cache format that is fed as input. If no past_key_values
are passed, the
legacy cache format will be returned.
If past_key_values
are used, the user can optionally input only the last input_ids
(those that don’t
have their past key value states given to this model) of shape (batch_size, 1)
instead of all input_ids
of shape (batch_size, sequence_length)
.
torch.FloatTensor
of shape (batch_size, sequence_length, hidden_size)
, optional) —
Optionally, instead of passing input_ids
you can choose to directly pass an embedded representation. This
is useful if you want more control over how to convert input_ids
indices into associated vectors than the
model’s internal embedding lookup matrix. bool
, optional) —
If set to True
, past_key_values
key value states are returned and can be used to speed up decoding (see
past_key_values
). bool
, optional) —
Whether or not to return the attentions tensors of all attention layers. See attentions
under returned
tensors for more detail. bool
, optional) —
Whether or not to return the hidden states of all layers. See hidden_states
under returned tensors for
more detail. bool
, optional) —
Whether or not to return a ModelOutput instead of a plain tuple. torch.LongTensor
of shape (batch_size,)
, optional) —
Labels for computing the sequence classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]
. If config.num_labels == 1
a regression loss is computed (Mean-Square loss), If
config.num_labels > 1
a classification loss is computed (Cross-Entropy). The StableLmForSequenceClassification forward method, overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.