The RWKV model was proposed in this repo.
It suggests a tweak in the traditional Transformer attention to make it linear. This way, the model can be used as a recurrent network: passing inputs for timestep 0 and timestep 1 together is the same as passing inputs at timestep 0, then inputs at timestep 1 along with the state of timestep 0 (see example below).
This can be more efficient than a regular Transformer and can deal with sentences of any length (even if the model uses a fixed context length for training).
This model was contributed by sgugger. The original code can be found here.
import torch
from transformers import AutoTokenizer, RwkvConfig, RwkvModel
model = RwkvModel.from_pretrained("sgugger/rwkv-430M-pile")
tokenizer = AutoTokenizer.from_pretrained("sgugger/rwkv-430M-pile")
inputs = tokenizer("This is an example.", return_tensors="pt")
# Feed everything to the model
outputs = model(inputs["input_ids"])
output_whole = outputs.last_hidden_state
outputs = model(inputs["input_ids"][:, :2])
output_one = outputs.last_hidden_state
# Using the state computed on the first inputs, we will get the same output
outputs = model(inputs["input_ids"][:, 2:], state=outputs.state)
output_two = outputs.last_hidden_state
torch.allclose(torch.cat([output_one, output_two], dim=1), output_whole, atol=1e-5)
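The same state mechanism can be used for step-by-step generation in RNN mode. Below is a minimal sketch (not part of the original example) that assumes the same checkpoint can also be loaded with RwkvForCausalLM: the prompt is processed once, then each new token is fed together with the returned state instead of re-feeding the whole prefix.

import torch
from transformers import AutoTokenizer, RwkvForCausalLM

lm = RwkvForCausalLM.from_pretrained("sgugger/rwkv-430M-pile")
tokenizer = AutoTokenizer.from_pretrained("sgugger/rwkv-430M-pile")

inputs = tokenizer("This is an example.", return_tensors="pt")

# Process the whole prompt once and keep the recurrent state
outputs = lm(inputs["input_ids"])
state = outputs.state
next_token = outputs.logits[:, -1].argmax(dim=-1, keepdim=True)

# Greedily generate a few more tokens, one at a time, reusing the state
generated = [next_token]
for _ in range(9):
    outputs = lm(next_token, state=state)
    state = outputs.state
    next_token = outputs.logits[:, -1].argmax(dim=-1, keepdim=True)
    generated.append(next_token)

print(tokenizer.decode(torch.cat(generated, dim=1)[0]))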
If you want to make sure the model stops generating when '\n\n'
is detected, we recommend using the following stopping criteria:
import torch
from transformers import StoppingCriteria

class RwkvStoppingCriteria(StoppingCriteria):
    def __init__(self, eos_sequence=[187, 187], eos_token_id=537):
        self.eos_sequence = eos_sequence
        self.eos_token_id = eos_token_id

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # Stop as soon as the last two generated ids match the '\n\n' sequence
        last_2_ids = input_ids[:, -2:].tolist()
        return self.eos_sequence in last_2_ids

output = model.generate(inputs["input_ids"], max_new_tokens=64, stopping_criteria=[RwkvStoppingCriteria()])
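To inspect the generated text, the ids returned by generate() can be decoded as usual (this assumes, as above, that model is a generation-capable RWKV checkpoint such as one loaded with RwkvForCausalLM):

print(tokenizer.decode(output[0], skip_special_tokens=True))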
class transformers.RwkvConfig

( vocab_size = 50277 context_length = 1024 hidden_size = 4096 num_hidden_layers = 32 attention_hidden_size = None intermediate_size = None layer_norm_epsilon = 1e-05 bos_token_id = 0 eos_token_id = 0 rescale_every = 6 tie_word_embeddings = False use_cache = True **kwargs )
Parameters
vocab_size (int, optional, defaults to 50277) — Vocabulary size of the RWKV model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling RwkvModel.

context_length (int, optional, defaults to 1024) — The maximum sequence length that this model can be used with in a single forward pass (using it in RNN mode lets you use any sequence length).

hidden_size (int, optional, defaults to 4096) — Dimensionality of the embeddings and hidden states.

num_hidden_layers (int, optional, defaults to 32) — Number of hidden layers in the model.

attention_hidden_size (int, optional) — Dimensionality of the attention hidden states. Will default to hidden_size if unset.

intermediate_size (int, optional) — Dimensionality of the inner feed-forward layers. Will default to 4 times hidden_size if unset.

layer_norm_epsilon (float, optional, defaults to 1e-05) — The epsilon to use in the layer normalization layers.

bos_token_id (int, optional, defaults to 0) — The id of the beginning of sentence token in the vocabulary. Defaults to 0 as RWKV uses the same tokenizer as GPTNeoX.

eos_token_id (int, optional, defaults to 0) — The id of the end of sentence token in the vocabulary. Defaults to 0 as RWKV uses the same tokenizer as GPTNeoX.

rescale_every (int, optional, defaults to 6) — At inference, the hidden states (and weights of the corresponding output layers) are divided by 2 every rescale_every layer. If set to 0 or a negative number, no rescale is done.

tie_word_embeddings (bool, optional, defaults to False) — Whether or not to tie the word embeddings with the input token embeddings.

use_cache (bool, optional, defaults to True) — Whether or not the model should return the last state.

This is the configuration class to store the configuration of a RwkvModel. It is used to instantiate a RWKV model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the RWKV-4 RWKV/rwkv-4-169m-pile architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Example:
>>> from transformers import RwkvConfig, RwkvModel
>>> # Initializing a Rwkv configuration
>>> configuration = RwkvConfig()
>>> # Initializing a model (with random weights) from the configuration
>>> model = RwkvModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
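The configuration can also be customized before building the model. A minimal sketch with arbitrary sizes (these values do not correspond to any released checkpoint):

>>> # Initializing a smaller configuration (arbitrary example values)
>>> small_configuration = RwkvConfig(hidden_size=512, num_hidden_layers=6, context_length=256)
>>> small_model = RwkvModel(small_configuration)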
class transformers.RwkvModel

( config )

Parameters

config (RwkvConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare RWKV Model transformer outputting raw hidden-states without any specific head on top.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward

( input_ids: Optional = None attention_mask: Optional = None inputs_embeds: Optional = None state: Optional = None use_cache: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) → transformers.models.rwkv.modeling_rwkv.RwkvOutput
or tuple(torch.FloatTensor)
Parameters
input_ids (torch.LongTensor of shape (batch_size, input_ids_length)) — input_ids_length = sequence_length if past_key_values is None else past_key_values[0][0].shape[-2] (sequence_length of input past key value states). Indices of input sequence tokens in the vocabulary. If past_key_values is used, only input_ids that do not have their past calculated should be passed as input_ids. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

attention_mask (torch.LongTensor of shape (batch_size, input_ids_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked. This is currently not used by RwkvModel, but will be supported in the future.

inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.

state (torch.FloatTensor of shape (batch_size, hidden_size, num_hidden_layers), optional) — If passed along, the model uses the previous state in all the blocks (which will give the output for the input_ids provided as if the model received state_input_ids + input_ids as context).

use_cache (bool, optional) — If set to True, the last state is returned and can be used to quickly generate the next logits.

output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.

output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.

return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.

Returns
transformers.models.rwkv.modeling_rwkv.RwkvOutput or tuple(torch.FloatTensor)

A transformers.models.rwkv.modeling_rwkv.RwkvOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (RwkvConfig) and inputs.

last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.

state (list of five torch.FloatTensor of shape (batch_size, hidden_size, num_hidden_layers)) — The state of the model at the last time step. Can be used in a forward method with the next input_ids to avoid providing the old input_ids.

hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The RwkvModel forward method overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Example:
>>> from transformers import AutoTokenizer, RwkvModel
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("RWKV/rwkv-4-169m-pile")
>>> model = RwkvModel.from_pretrained("RWKV/rwkv-4-169m-pile")
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state
class transformers.RwkvForCausalLM

( config )

Parameters

config (RwkvConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The RWKV Model transformer with a language modeling head on top (linear layer with weights tied to the input embeddings).
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward

( input_ids: Optional = None attention_mask: Optional = None inputs_embeds: Optional = None state: Optional = None labels: Optional = None use_cache: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) → transformers.models.rwkv.modeling_rwkv.RwkvCausalLMOutput
or tuple(torch.FloatTensor)
Parameters
input_ids (torch.LongTensor of shape (batch_size, input_ids_length)) — input_ids_length = sequence_length if past_key_values is None else past_key_values[0][0].shape[-2] (sequence_length of input past key value states). Indices of input sequence tokens in the vocabulary. If past_key_values is used, only input_ids that do not have their past calculated should be passed as input_ids. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

attention_mask (torch.LongTensor of shape (batch_size, input_ids_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked. This is currently not used by RwkvModel, but will be supported in the future.

inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.

state (torch.FloatTensor of shape (batch_size, hidden_size, num_hidden_layers), optional) — If passed along, the model uses the previous state in all the blocks (which will give the output for the input_ids provided as if the model received state_input_ids + input_ids as context).

use_cache (bool, optional) — If set to True, the last state is returned and can be used to quickly generate the next logits.

output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.

output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.

return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.

labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for language modeling. Note that the labels are shifted inside the model, i.e. you can set labels = input_ids. Indices are selected in [-100, 0, ..., config.vocab_size]. All labels set to -100 are ignored (masked); the loss is only computed for labels in [0, ..., config.vocab_size].
Returns
transformers.models.rwkv.modeling_rwkv.RwkvCausalLMOutput or tuple(torch.FloatTensor)

A transformers.models.rwkv.modeling_rwkv.RwkvCausalLMOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (RwkvConfig) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).

logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

state (list of five torch.FloatTensor of shape (batch_size, hidden_size, num_hidden_layers)) — The state of the model at the last time step. Can be used in a forward method with the next input_ids to avoid providing the old input_ids.

hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The RwkvForCausalLM forward method overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Example:
>>> import torch
>>> from transformers import AutoTokenizer, RwkvForCausalLM
>>> tokenizer = AutoTokenizer.from_pretrained("RWKV/rwkv-4-169m-pile")
>>> model = RwkvForCausalLM.from_pretrained("RWKV/rwkv-4-169m-pile")
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs, labels=inputs["input_ids"])
>>> loss = outputs.loss
>>> logits = outputs.logits
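Beyond computing the loss, the same model can generate text through the standard generate() API. A minimal sketch reusing the tokenizer and inputs above (the token budget is arbitrary):

>>> output = model.generate(inputs["input_ids"], max_new_tokens=20)
>>> print(tokenizer.decode(output[0], skip_special_tokens=True))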
In a traditional auto-regressive Transformer, attention is written as

$$O = \hbox{softmax}(QK^{T} / \sqrt{d}) V$$

with \(Q\), \(K\) and \(V\) matrices of shape seq_len x hidden_size named query, key and value (they are actually bigger matrices with a batch dimension and an attention head dimension, but we're only interested in the last two, which is where the matrix product is taken, so for the sake of simplicity we only consider those two). The product \(QK^{T}\) then has shape seq_len x seq_len and we can take the matrix product with \(V\) to get the output \(O\) of the same shape as the others.

Replacing the softmax by its value gives:

$$O_{i} = \frac{\sum_{j=1}^{i} e^{Q_{i} K_{j}^{T} / \sqrt{d}} V_{j}}{\sum_{j=1}^{i} e^{Q_{i} K_{j}^{T} / \sqrt{d}}}$$

Note that the entries in \(QK^{T}\) corresponding to \(j > i\) are masked (the sum stops at \(j = i\)) because the attention is not allowed to look at future tokens (only past ones).

In comparison, the RWKV attention is given by

$$O_{i} = \sigma(R_{i}) \frac{\sum_{j=1}^{i} e^{W_{i-j} + K_{j}} V_{j}}{\sum_{j=1}^{i} e^{W_{i-j} + K_{j}}}$$

where \(R\) is a new matrix called receptance by the author, \(K\) and \(V\) are still the key and value (\(\sigma\) here is the sigmoid function). \(W\) is a new vector that represents the position of the token and is given by

$$W_{0} = u \hbox{ and } W_{k} = (k-1)w \hbox{ for } k \geq 1$$

with \(u\) and \(w\) learnable parameters called time_first
and time_decay
respectively in the code. The numerator and denominator can both be expressed recursively. Naming them \(N_{i}\) and \(D_{i}\), we have:

$$N_{i} = e^{u + K_{i}} V_{i} + \hat{N}_{i} \hbox{ where } \hat{N}_{i} = e^{K_{i-1}} V_{i-1} + e^{w + K_{i-2}} V_{i-2} + \cdots + e^{(i-2)w + K_{1}} V_{1}$$

so \(\hat{N}_{i}\) (called numerator_state
in the code) satisfies

$$\hat{N}_{0} = 0 \hbox{ and } \hat{N}_{j+1} = e^{K_{j}} V_{j} + e^{w} \hat{N}_{j}$$

and

$$D_{i} = e^{u + K_{i}} + \hat{D}_{i} \hbox{ where } \hat{D}_{i} = e^{K_{i-1}} + e^{w + K_{i-2}} + \cdots + e^{(i-2)w + K_{1}}$$

so \(\hat{D}_{i}\) (called denominator_state
in the code) satisfies

$$\hat{D}_{0} = 0 \hbox{ and } \hat{D}_{j+1} = e^{K_{j}} + e^{w} \hat{D}_{j}$$

The actual recurrent formulas used are a tiny bit more complex, as for numerical stability we don't want to compute exponentials of big numbers. Usually the softmax is not computed as is: the exponential of the maximum term is factored out of both the numerator and the denominator:

$$\frac{e^{x_{i}}}{\sum_{j=1}^{n} e^{x_{j}}} = \frac{e^{x_{i} - M}}{\sum_{j=1}^{n} e^{x_{j} - M}}$$

with \(M\) the maximum of all \(x_{j}\). So here, on top of saving the numerator state (\(\hat{N}\)) and the denominator state (\(\hat{D}\)), we also keep track of the maximum of all terms encountered in the exponentials. We therefore actually use

$$\tilde{N}_{i} = e^{-M_{i}} \hat{N}_{i} \hbox{ and } \tilde{D}_{i} = e^{-M_{i}} \hat{D}_{i}$$

defined by the following recurrent formulas:

$$\tilde{N}_{0} = 0 \hbox{ and } \tilde{N}_{j+1} = e^{K_{j} - q} V_{j} + e^{w + M_{j} - q} \tilde{N}_{j} \hbox{ where } q = \max(K_{j}, w + M_{j})$$

and

$$\tilde{D}_{0} = 0 \hbox{ and } \tilde{D}_{j+1} = e^{K_{j} - q} + e^{w + M_{j} - q} \tilde{D}_{j} \hbox{ where } q = \max(K_{j}, w + M_{j})$$

and \(M_{j+1} = q\). With those, we can then compute

$$N_{i} = e^{u + K_{i} - q} V_{i} + e^{M_{i} - q} \tilde{N}_{i} \hbox{ where } q = \max(u + K_{i}, M_{i})$$

and

$$D_{i} = e^{u + K_{i} - q} + e^{M_{i} - q} \tilde{D}_{i} \hbox{ where } q = \max(u + K_{i}, M_{i})$$

which finally gives us

$$O_{i} = \sigma(R_{i}) \frac{N_{i}}{D_{i}}$$
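As an illustration, here is a small PyTorch sketch of this numerically stable recurrence for a single sequence. It is a re-derivation of the formulas above, not the library's actual implementation, and the function and variable names are assumptions for the example.

import torch

def wkv_recurrence(time_first, time_decay, key, value):
    # time_first = u and time_decay = w in the formulas above; key and value have shape (seq_len, hidden).
    seq_len, hidden = key.shape
    output = torch.zeros_like(value)
    num_state = torch.zeros(hidden)            # N-tilde
    den_state = torch.zeros(hidden)            # D-tilde
    max_state = torch.full((hidden,), -1e38)   # M, running maximum of the exponents

    for i in range(seq_len):
        k, v = key[i], value[i]

        # Output at step i: N_i / D_i, with q = max(u + K_i, M_i) factored out of both
        q = torch.maximum(time_first + k, max_state)
        e1 = torch.exp(max_state - q)
        e2 = torch.exp(time_first + k - q)
        output[i] = (e2 * v + e1 * num_state) / (e2 + e1 * den_state)

        # State update: q = max(K_i, w + M_i), then M_{i+1} = q
        q = torch.maximum(k, time_decay + max_state)
        e1 = torch.exp(time_decay + max_state - q)
        e2 = torch.exp(k - q)
        num_state = e2 * v + e1 * num_state
        den_state = e2 + e1 * den_state
        max_state = q

    # Multiplying elementwise by sigmoid(receptance) then gives O as in the last formula above.
    return output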