The Llama2 model was proposed in LLaMA: Open Foundation and Fine-Tuned Chat Models by Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushka rMishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing EllenTan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom. It is a collection of foundation language models ranging from 7B to 70B parameters, with checkpoints finetuned for chat application!
The abstract from the paper is the following:
In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs.
Checkout all Llama2 model checkpoints here. This model was contributed by Arthur Zucker with contributions from Lysandre Debut. The code of the implementation in Hugging Face is based on GPT-NeoX here. The original code of the authors can be found here.
The Llama2
models were trained using bfloat16
, but the original inference uses float16
. The checkpoints uploaded on the Hub use torch_dtype = 'float16'
, which will be
used by the AutoModel
API to cast the checkpoints from torch.float32
to torch.float16
.
The dtype
of the online weights is mostly irrelevant unless you are using torch_dtype="auto"
when initializing a model using model = AutoModelForCausalLM.from_pretrained("path", torch_dtype = "auto")
. The reason is that the model will first be downloaded ( using the dtype
of the checkpoints online), then it will be casted to the default dtype
of torch
(becomes torch.float32
), and finally, if there is a torch_dtype
provided in the config, it will be used.
Training the model in float16
is not recommended and is known to produce nan
; as such, the model should be trained in bfloat16
.
Tips:
config.pretraining_tp
to a value different than 1 will activate the more accurate but slower computation of the linear layers, which should better match the original logits.pad_id = -1
which means that there is no padding token. We can’t have the same logic, make sure to add a padding token using tokenizer.add_special_tokens({"pad_token":"<pad>"})
and resize the token embedding accordingly. You should also set the model.config.pad_token_id
. The embed_tokens
layer of the model is initialized with self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.config.padding_idx)
, which makes sure that encoding the padding token will output zeros, so passing it when initializing is recommended.python src/transformers/models/llama/convert_llama_weights_to_hf.py \ --input_dir /path/to/downloaded/llama/weights --model_size 7B --output_dir /output/path
from transformers import LlamaForCausalLM, LlamaTokenizer
tokenizer = LlamaTokenizer.from_pretrained("/output/path")
model = LlamaForCausalLM.from_pretrained("/output/path")
Note that executing the script requires enough CPU RAM to host the whole model in float16 precision (even if the biggest versions come in several checkpoints they each contain a part of each weight of the model, so we need to load them all in RAM). For the 75B model, it’s thus 145GB of RAM needed.
The LLaMA tokenizer is a BPE model based on sentencepiece. One quirk of sentencepiece is that when decoding a sequence, if the first token is the start of the word (e.g. “Banana”), the tokenizer does not prepend the prefix space to the string.
When using Flash Attention 2 via attn_implementation="flash_attention_2"
, don’t pass torch_dtype
to the from_pretrained
class method and use Automatic Mixed-Precision training. When using Trainer
, it is simply specifying either fp16
or bf16
to True
. Otherwise, make sure you are using torch.autocast
. This is required because the Flash Attention only support fp16
and bf16
data type.
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with LLaMA2. If you’re interested in submitting a resource to be included here, please feel free to open a Pull Request and we’ll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
⚗️ Optimization
⚡️ Inference
🚀 Deploy
( vocab_size = 32000 hidden_size = 4096 intermediate_size = 11008 num_hidden_layers = 32 num_attention_heads = 32 num_key_value_heads = None hidden_act = 'silu' max_position_embeddings = 2048 initializer_range = 0.02 rms_norm_eps = 1e-06 use_cache = True pad_token_id = None bos_token_id = 1 eos_token_id = 2 pretraining_tp = 1 tie_word_embeddings = False rope_theta = 10000.0 rope_scaling = None attention_bias = False attention_dropout = 0.0 **kwargs )
Parameters
int
, optional, defaults to 32000) —
Vocabulary size of the LLaMA model. Defines the number of different tokens that can be represented by the
inputs_ids
passed when calling LlamaModel int
, optional, defaults to 4096) —
Dimension of the hidden representations. int
, optional, defaults to 11008) —
Dimension of the MLP representations. int
, optional, defaults to 32) —
Number of hidden layers in the Transformer decoder. int
, optional, defaults to 32) —
Number of attention heads for each attention layer in the Transformer decoder. int
, optional) —
This is the number of key_value heads that should be used to implement Grouped Query Attention. If
num_key_value_heads=num_attention_heads
, the model will use Multi Head Attention (MHA), if
num_key_value_heads=1 the model will use Multi Query Attention (MQA) otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed by meanpooling all the original heads within that group. For more details checkout [this paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
num_attention_heads`. str
or function
, optional, defaults to "silu"
) —
The non-linear activation function (function or string) in the decoder. int
, optional, defaults to 2048) —
The maximum sequence length that this model might ever be used with. Llama 1 supports up to 2048 tokens,
Llama 2 up to 4096, CodeLlama up to 16384. float
, optional, defaults to 0.02) —
The standard deviation of the truncated_normal_initializer for initializing all weight matrices. float
, optional, defaults to 1e-06) —
The epsilon used by the rms normalization layers. bool
, optional, defaults to True
) —
Whether or not the model should return the last key/values attentions (not used by all models). Only
relevant if config.is_decoder=True
. int
, optional) —
Padding token id. int
, optional, defaults to 1) —
Beginning of stream token id. int
, optional, defaults to 2) —
End of stream token id. int
, optional, defaults to 1) —
Experimental feature. Tensor parallelism rank used during pretraining. Please refer to this
document to understand more about it. This value is
necessary to ensure exact reproducibility of the pretraining results. Please refer to this
issue. bool
, optional, defaults to False
) —
Whether to tie weight embeddings float
, optional, defaults to 10000.0) —
The base period of the RoPE embeddings. Dict
, optional) —
Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling
strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is
{"type": strategy name, "factor": scaling factor}
. When using this flag, don’t update
max_position_embeddings
to the expected new maximum. See the following thread for more information on how
these scaling strategies behave:
https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This is an
experimental feature, subject to breaking API changes in future versions. bool
, defaults to False
, optional, defaults to False
) —
Whether to use a bias in the query, key, value and output projection layers during self-attention. float
, optional, defaults to 0.0) —
The dropout ratio for the attention probabilities. This is the configuration class to store the configuration of a LlamaModel. It is used to instantiate an LLaMA model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the LLaMA-7B.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
>>> from transformers import LlamaModel, LlamaConfig
>>> # Initializing a LLaMA llama-7b style configuration
>>> configuration = LlamaConfig()
>>> # Initializing a model from the llama-7b style configuration
>>> model = LlamaModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
( vocab_file unk_token = '<unk>' bos_token = '<s>' eos_token = '</s>' pad_token = None sp_model_kwargs: Optional = None add_bos_token = True add_eos_token = False clean_up_tokenization_spaces = False use_default_system_prompt = False spaces_between_special_tokens = False legacy = None **kwargs )
Parameters
str
) —
Path to the vocabulary file. str
or tokenizers.AddedToken
, optional, defaults to "<unk>"
) —
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
token instead. str
or tokenizers.AddedToken
, optional, defaults to "<s>"
) —
The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token. str
or tokenizers.AddedToken
, optional, defaults to "</s>"
) —
The end of sequence token. str
or tokenizers.AddedToken
, optional) —
A special token used to make arrays of tokens the same size for batching purpose. Will then be ignored by
attention mechanisms or loss computation. Dict[str, Any]
, Optional
, optional) —
Will be passed to the SentencePieceProcessor.__init__()
method. The Python wrapper for
SentencePiece can be used, among other things,
to set:
enable_sampling
: Enable subword regularization.
nbest_size
: Sampling parameters for unigram. Invalid for BPE-Dropout.
nbest_size = {0,1}
: No sampling is performed.nbest_size > 1
: samples from the nbest_size results.nbest_size < 0
: assuming that nbest_size is infinite and samples from the all hypothesis (lattice)
using forward-filtering-and-backward-sampling algorithm.alpha
: Smoothing parameter for unigram sampling, and dropout probability of merge operations for
BPE-dropout.
bool
, optional, defaults to True
) —
Whether or not to add an bos_token
at the start of sequences. bool
, optional, defaults to False
) —
Whether or not to add an eos_token
at the end of sequences. bool
, optional, defaults to False
) —
Whether or not to cleanup spaces after decoding, cleanup consists in removing potential artifacts like
extra spaces. bool
, optional, defaults to False
) —
Whether or not the default system prompt for Llama should be used. bool
, optional, defaults to False
) —
Whether or not to add spaces between special tokens. bool
, optional) —
Whether or not the legacy
behavior of the tokenizer should be used. Legacy is before the merge of #24622
and #25224 which includes fixes to properly handle tokens that appear after special tokens. A simple
example:
legacy=True
:Construct a Llama tokenizer. Based on byte-level Byte-Pair-Encoding. The default padding token is unset as there is no padding token in the original model.
( token_ids_0: List token_ids_1: Optional = None already_has_special_tokens: bool = False ) → List[int]
Parameters
List[int]
) —
List of IDs. List[int]
, optional) —
Optional second list of IDs for sequence pairs. bool
, optional, defaults to False
) —
Whether or not the token list is already formatted with special tokens for the model. Returns
List[int]
A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
special tokens using the tokenizer prepare_for_model
method.
( token_ids_0: List token_ids_1: Optional = None ) → List[int]
Parameters
List[int]
) —
List of ids. List[int]
, optional) —
Optional second list of IDs for sequence pairs. Returns
List[int]
List of token type IDs according to the given sequence(s).
Creates a mask from the two sequences passed to be used in a sequence-pair classification task. An ALBERT
sequence pair mask has the following format:
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence | second sequence |
if token_ids_1 is None, only returns the first portion of the mask (0s).
( save_directory filename_prefix: Optional = None ) → Tuple(str)
Save the vocabulary and special tokens file to a directory.
( vocab_file = None tokenizer_file = None clean_up_tokenization_spaces = False unk_token = '<unk>' bos_token = '<s>' eos_token = '</s>' add_bos_token = True add_eos_token = False use_default_system_prompt = False **kwargs )
Parameters
str
, optional) —
SentencePiece file (generally has a .model extension) that
contains the vocabulary necessary to instantiate a tokenizer. str
, optional) —
tokenizers file (generally has a .json extension) that
contains everything needed to load the tokenizer. bool
, optional, defaults to False
) —
Whether or not to cleanup spaces after decoding, cleanup consists in removing potential artifacts like
extra spaces. str
or tokenizers.AddedToken
, optional, defaults to "<unk>"
) —
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
token instead. str
or tokenizers.AddedToken
, optional, defaults to "<s>"
) —
The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token. str
or tokenizers.AddedToken
, optional, defaults to "</s>"
) —
The end of sequence token. bool
, optional, defaults to True
) —
Whether or not to add an bos_token
at the start of sequences. bool
, optional, defaults to False
) —
Whether or not to add an eos_token
at the end of sequences. bool
, optional, defaults to False
) —
Whether or not the default system prompt for Llama should be used. Construct a Llama tokenizer. Based on byte-level Byte-Pair-Encoding.
This uses notably ByteFallback and no normalization.
>>> from transformers import LlamaTokenizerFast
>>> tokenizer = LlamaTokenizerFast.from_pretrained("hf-internal-testing/llama-tokenizer")
>>> tokenizer.encode("Hello this is a test")
[1, 15043, 445, 338, 263, 1243]
If you want to change the bos_token
or the eos_token
, make sure to specify them when initializing the model, or
call tokenizer.update_post_processor()
to make sure that the post-processing is correctly done (otherwise the
values of the first token and final token of an encoded sequence will not be correct). For more details, checkout
[post-processors] (https://huggingface.co/docs/tokenizers/api/post-processors) documentation.
This tokenizer inherits from PreTrainedTokenizerFast which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
( token_ids_0: List token_ids_1: Optional = None already_has_special_tokens: bool = False ) → A list of integers in the range [0, 1]
Parameters
List[int]
) —
List of ids of the first sequence. List[int]
, optional) —
List of ids of the second sequence. bool
, optional, defaults to False
) —
Whether or not the token list is already formatted with special tokens for the model. Returns
A list of integers in the range [0, 1]
1 for a special token, 0 for a sequence token.
Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding
special tokens using the tokenizer prepare_for_model
or encode_plus
methods.
( token_ids_0: List token_ids_1: Optional = None ) → List[int]
Create the token type IDs corresponding to the sequences passed. What are token type IDs?
Should be overridden in a subclass if the model has a special way of building those.
Updates the underlying post processor with the current bos_token
and eos_token
.
( config: LlamaConfig )
Parameters
The bare LLaMA Model outputting raw hidden-states without any specific head on top. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
Transformer decoder consisting of config.num_hidden_layers layers. Each layer is a LlamaDecoderLayer
( input_ids: LongTensor = None attention_mask: Optional = None position_ids: Optional = None past_key_values: Optional = None inputs_embeds: Optional = None use_cache: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None cache_position: Optional = None )
Parameters
torch.LongTensor
of shape (batch_size, sequence_length)
) —
Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
it.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
torch.Tensor
of shape (batch_size, sequence_length)
, optional) —
Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]
:
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
If past_key_values
is used, optionally only the last input_ids
have to be input (see
past_key_values
).
If you want to change padding behavior, you should read modeling_opt._prepare_decoder_attention_mask
and modify to your needs. See diagram 1 in the paper for more
information on the default strategy.
torch.LongTensor
of shape (batch_size, sequence_length)
, optional) —
Indices of positions of each input sequence tokens in the position embeddings. Selected in the range [0, config.n_positions - 1]
.
Cache
or tuple(tuple(torch.FloatTensor))
, optional) —
Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
blocks) that can be used to speed up sequential decoding. This typically consists in the past_key_values
returned by the model at a previous stage of decoding, when use_cache=True
or config.use_cache=True
.
Two formats are allowed:
tuple(torch.FloatTensor)
of length config.n_layers
, with each tuple having 2 tensors of
shape (batch_size, num_heads, sequence_length, embed_size_per_head)
). This is also known as the legacy
cache format.The model will output the same cache format that is fed as input. If no past_key_values
are passed, the
legacy cache format will be returned.
If past_key_values
are used, the user can optionally input only the last input_ids
(those that don’t
have their past key value states given to this model) of shape (batch_size, 1)
instead of all input_ids
of shape (batch_size, sequence_length)
.
torch.FloatTensor
of shape (batch_size, sequence_length, hidden_size)
, optional) —
Optionally, instead of passing input_ids
you can choose to directly pass an embedded representation. This
is useful if you want more control over how to convert input_ids
indices into associated vectors than the
model’s internal embedding lookup matrix. bool
, optional) —
If set to True
, past_key_values
key value states are returned and can be used to speed up decoding (see
past_key_values
). bool
, optional) —
Whether or not to return the attentions tensors of all attention layers. See attentions
under returned
tensors for more detail. bool
, optional) —
Whether or not to return the hidden states of all layers. See hidden_states
under returned tensors for
more detail. bool
, optional) —
Whether or not to return a ModelOutput instead of a plain tuple. The LlamaModel forward method, overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
( input_ids: LongTensor = None attention_mask: Optional = None position_ids: Optional = None past_key_values: Optional = None inputs_embeds: Optional = None labels: Optional = None use_cache: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None cache_position: Optional = None ) → transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
Parameters
torch.LongTensor
of shape (batch_size, sequence_length)
) —
Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
it.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
torch.Tensor
of shape (batch_size, sequence_length)
, optional) —
Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]
:
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
If past_key_values
is used, optionally only the last input_ids
have to be input (see
past_key_values
).
If you want to change padding behavior, you should read modeling_opt._prepare_decoder_attention_mask
and modify to your needs. See diagram 1 in the paper for more
information on the default strategy.
torch.LongTensor
of shape (batch_size, sequence_length)
, optional) —
Indices of positions of each input sequence tokens in the position embeddings. Selected in the range [0, config.n_positions - 1]
.
Cache
or tuple(tuple(torch.FloatTensor))
, optional) —
Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
blocks) that can be used to speed up sequential decoding. This typically consists in the past_key_values
returned by the model at a previous stage of decoding, when use_cache=True
or config.use_cache=True
.
Two formats are allowed:
tuple(torch.FloatTensor)
of length config.n_layers
, with each tuple having 2 tensors of
shape (batch_size, num_heads, sequence_length, embed_size_per_head)
). This is also known as the legacy
cache format.The model will output the same cache format that is fed as input. If no past_key_values
are passed, the
legacy cache format will be returned.
If past_key_values
are used, the user can optionally input only the last input_ids
(those that don’t
have their past key value states given to this model) of shape (batch_size, 1)
instead of all input_ids
of shape (batch_size, sequence_length)
.
torch.FloatTensor
of shape (batch_size, sequence_length, hidden_size)
, optional) —
Optionally, instead of passing input_ids
you can choose to directly pass an embedded representation. This
is useful if you want more control over how to convert input_ids
indices into associated vectors than the
model’s internal embedding lookup matrix. bool
, optional) —
If set to True
, past_key_values
key value states are returned and can be used to speed up decoding (see
past_key_values
). bool
, optional) —
Whether or not to return the attentions tensors of all attention layers. See attentions
under returned
tensors for more detail. bool
, optional) —
Whether or not to return the hidden states of all layers. See hidden_states
under returned tensors for
more detail. bool
, optional) —
Whether or not to return a ModelOutput instead of a plain tuple.
Args —
labels (torch.LongTensor
of shape (batch_size, sequence_length)
, optional):
Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size]
or -100 (see input_ids
docstring). Tokens with indices set to -100
are ignored
(masked), the loss is only computed for the tokens with labels in [0, ..., config.vocab_size]
.
Returns
transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
A transformers.modeling_outputs.CausalLMOutputWithPast or a tuple of
torch.FloatTensor
(if return_dict=False
is passed or when config.return_dict=False
) comprising various
elements depending on the configuration (LlamaConfig) and inputs.
loss (torch.FloatTensor
of shape (1,)
, optional, returned when labels
is provided) — Language modeling loss (for next-token prediction).
logits (torch.FloatTensor
of shape (batch_size, sequence_length, config.vocab_size)
) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (tuple(tuple(torch.FloatTensor))
, optional, returned when use_cache=True
is passed or when config.use_cache=True
) — Tuple of tuple(torch.FloatTensor)
of length config.n_layers
, with each tuple having 2 tensors of shape
(batch_size, num_heads, sequence_length, embed_size_per_head)
)
Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
past_key_values
input) to speed up sequential decoding.
hidden_states (tuple(torch.FloatTensor)
, optional, returned when output_hidden_states=True
is passed or when config.output_hidden_states=True
) — Tuple of torch.FloatTensor
(one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape (batch_size, sequence_length, hidden_size)
.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of torch.FloatTensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The LlamaForCausalLM forward method, overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Example:
>>> from transformers import AutoTokenizer, LlamaForCausalLM
>>> model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
>>> prompt = "Hey, are you conscious? Can you talk to me?"
>>> inputs = tokenizer(prompt, return_tensors="pt")
>>> # Generate
>>> generate_ids = model.generate(inputs.input_ids, max_length=30)
>>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
( config )
Parameters
The LLaMa Model transformer with a sequence classification head on top (linear layer).
LlamaForSequenceClassification uses the last token in order to do the classification, as other causal models (e.g. GPT-2) do.
Since it does classification on the last token, it requires to know the position of the last token. If a
pad_token_id
is defined in the configuration, it finds the last token that is not a padding token in each row. If
no pad_token_id
is defined, it simply takes the last value in each row of the batch. Since it cannot guess the
padding tokens when inputs_embeds
are passed instead of input_ids
, it does the same (take the last value in
each row of the batch).
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
( input_ids: LongTensor = None attention_mask: Optional = None position_ids: Optional = None past_key_values: Optional = None inputs_embeds: Optional = None labels: Optional = None use_cache: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None )
Parameters
torch.LongTensor
of shape (batch_size, sequence_length)
) —
Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
it.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
torch.Tensor
of shape (batch_size, sequence_length)
, optional) —
Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]
:
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
If past_key_values
is used, optionally only the last input_ids
have to be input (see
past_key_values
).
If you want to change padding behavior, you should read modeling_opt._prepare_decoder_attention_mask
and modify to your needs. See diagram 1 in the paper for more
information on the default strategy.
torch.LongTensor
of shape (batch_size, sequence_length)
, optional) —
Indices of positions of each input sequence tokens in the position embeddings. Selected in the range [0, config.n_positions - 1]
.
Cache
or tuple(tuple(torch.FloatTensor))
, optional) —
Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
blocks) that can be used to speed up sequential decoding. This typically consists in the past_key_values
returned by the model at a previous stage of decoding, when use_cache=True
or config.use_cache=True
.
Two formats are allowed:
tuple(torch.FloatTensor)
of length config.n_layers
, with each tuple having 2 tensors of
shape (batch_size, num_heads, sequence_length, embed_size_per_head)
). This is also known as the legacy
cache format.The model will output the same cache format that is fed as input. If no past_key_values
are passed, the
legacy cache format will be returned.
If past_key_values
are used, the user can optionally input only the last input_ids
(those that don’t
have their past key value states given to this model) of shape (batch_size, 1)
instead of all input_ids
of shape (batch_size, sequence_length)
.
torch.FloatTensor
of shape (batch_size, sequence_length, hidden_size)
, optional) —
Optionally, instead of passing input_ids
you can choose to directly pass an embedded representation. This
is useful if you want more control over how to convert input_ids
indices into associated vectors than the
model’s internal embedding lookup matrix. bool
, optional) —
If set to True
, past_key_values
key value states are returned and can be used to speed up decoding (see
past_key_values
). bool
, optional) —
Whether or not to return the attentions tensors of all attention layers. See attentions
under returned
tensors for more detail. bool
, optional) —
Whether or not to return the hidden states of all layers. See hidden_states
under returned tensors for
more detail. bool
, optional) —
Whether or not to return a ModelOutput instead of a plain tuple. torch.LongTensor
of shape (batch_size,)
, optional) —
Labels for computing the sequence classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]
. If config.num_labels == 1
a regression loss is computed (Mean-Square loss), If
config.num_labels > 1
a classification loss is computed (Cross-Entropy). The LlamaForSequenceClassification forward method, overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.