The CodeGen model was proposed in A Conversational Paradigm for Program Synthesis by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong.
CodeGen is an autoregressive language model for program synthesis trained sequentially on The Pile, BigQuery, and BigPython.
The abstract from the paper is the following:
Program synthesis strives to generate a computer program as a solution to a given problem specification. We propose a conversational program synthesis approach via large language models, which addresses the challenges of searching over a vast program space and user intent specification faced in prior approaches. Our new approach casts the process of writing a specification and program as a multi-turn conversation between a user and a system. It treats program synthesis as a sequence prediction problem, in which the specification is expressed in natural language and the desired program is conditionally sampled. We train a family of large language models, called CodeGen, on natural language and programming language data. With weak supervision in the data and the scaling up of data size and model size, conversational capacities emerge from the simple autoregressive language modeling. To study the model behavior on conversational program synthesis, we develop a multi-turn programming benchmark (MTPB), where solving each problem requires multi-step synthesis via multi-turn conversation between the user and the model. Our findings show the emergence of conversational capabilities and the effectiveness of the proposed conversational program synthesis paradigm. In addition, our model CodeGen (with up to 16B parameters trained on TPU-v4) outperforms OpenAI’s Codex on the HumanEval benchmark. We make the training library JaxFormer including checkpoints available as open source contribution: this https URL.
This model was contributed by Hiroaki Hayashi. The original code can be found here.
Checkpoint names follow the pattern Salesforce/codegen-{size}-{data}, where:
size: 350M, 2B, 6B, 16B
data:
  nl: Pre-trained on the Pile
  multi: Initialized with nl, then further pre-trained on data from multiple programming languages
  mono: Initialized with multi, then further pre-trained on Python data
For example, Salesforce/codegen-350M-mono offers a 350 million-parameter checkpoint pre-trained sequentially on the Pile, multiple programming languages, and Python.
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> checkpoint = "Salesforce/codegen-350M-mono"
>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)
>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> text = "def hello_world():"
>>> completion = model.generate(**tokenizer(text, return_tensors="pt"))
>>> print(tokenizer.decode(completion[0]))
def hello_world():
    print("Hello World")
hello_world()
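The naming scheme described above can be sketched as a pair of small helpers. This is only an illustration: `build_checkpoint_name`, `parse_checkpoint_name`, and `CheckpointName` are hypothetical names that are not part of the transformers library, and the code does not check that a given size/data combination is actually published.

```python
from typing import NamedTuple

# Sizes and data variants listed in the naming scheme.
SIZES = ("350M", "2B", "6B", "16B")
DATA_VARIANTS = ("nl", "multi", "mono")
PREFIX = "Salesforce/codegen-"


class CheckpointName(NamedTuple):
    """Hypothetical container for the two fields of a checkpoint name."""
    size: str
    data: str


def build_checkpoint_name(size: str, data: str) -> str:
    """Format a checkpoint identifier after validating both fields."""
    if size not in SIZES:
        raise ValueError(f"unknown size {size!r}, expected one of {SIZES}")
    if data not in DATA_VARIANTS:
        raise ValueError(f"unknown data variant {data!r}, expected one of {DATA_VARIANTS}")
    return f"{PREFIX}{size}-{data}"


def parse_checkpoint_name(name: str) -> CheckpointName:
    """Split a checkpoint identifier back into its size and data fields."""
    if not name.startswith(PREFIX):
        raise ValueError(f"{name!r} does not start with {PREFIX!r}")
    size, _, data = name[len(PREFIX):].partition("-")
    build_checkpoint_name(size, data)  # reuse the validation above
    return CheckpointName(size, data)


name = build_checkpoint_name("350M", "mono")
parsed = parse_checkpoint_name("Salesforce/codegen-2B-multi")
```

The resulting string can be passed as the `checkpoint` argument in the loading example above.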
CodeGenConfig ( vocab_size = 50400 n_positions = 2048 n_ctx = 2048 n_embd = 4096 n_layer = 28 n_head = 16 rotary_dim = 64 n_inner = None activation_function = 'gelu_new' resid_pdrop = 0.0 embd_pdrop = 0.0 attn_pdrop = 0.0 layer_norm_epsilon = 1e-05 initializer_range = 0.02 use_cache = True bos_token_id = 50256 eos_token_id = 50256 tie_word_embeddings = False **kwargs )
Parameters
vocab_size (int, optional, defaults to 50400): Vocabulary size of the CodeGen model. Defines the number of different tokens that can be represented by the input_ids passed when calling CodeGenModel.
n_positions (int, optional, defaults to 2048): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
n_ctx (int, optional, defaults to 2048): This attribute is used in CodeGenModel.__init__ without any real effect.
n_embd (int, optional, defaults to 4096): Dimensionality of the embeddings and hidden states.
n_layer (int, optional, defaults to 28): Number of hidden layers in the Transformer.
n_head (int, optional, defaults to 16): Number of attention heads for each attention layer.
rotary_dim (int, optional, defaults to 64): Number of dimensions in the embedding that Rotary Position Embedding is applied to.
n_inner (int, optional): Dimensionality of the inner feed-forward layers. None will set it to 4 times n_embd.
activation_function (str, optional, defaults to "gelu_new"): Activation function, to be selected in the list ["relu", "silu", "gelu", "tanh", "gelu_new"].
resid_pdrop (float, optional, defaults to 0.0): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
embd_pdrop (float, optional, defaults to 0.0): The dropout ratio for the embeddings.
attn_pdrop (float, optional, defaults to 0.0): The dropout ratio for the attention.
layer_norm_epsilon (float, optional, defaults to 1e-05): The epsilon to use in the layer normalization layers.
initializer_range (float, optional, defaults to 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
use_cache (bool, optional, defaults to True): Whether or not the model should return the last key/values attentions (not used by all models).
bos_token_id (int, optional, defaults to 50256): Beginning of stream token id.
eos_token_id (int, optional, defaults to 50256): End of stream token id.
tie_word_embeddings (bool, optional, defaults to False): Whether the model's input and output word embeddings should be tied. Note that this is only relevant if the model has an output word embedding layer.
This is the configuration class to store the configuration of a CodeGenModel. It is used to instantiate a CodeGen model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the CodeGen Salesforce/codegen-2B-mono architecture. Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Example:
>>> from transformers import CodeGenConfig, CodeGenModel
>>> # Initializing a CodeGen 6B configuration
>>> configuration = CodeGenConfig()
>>> # Initializing a model (with random weights) from the configuration
>>> model = CodeGenModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
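A few of the configuration fields interact: `n_inner=None` falls back to 4 times `n_embd`, and the hidden size is shared across the attention heads. The sketch below works through that arithmetic with the default values from the signature above. `ConfigSketch` is a hypothetical stand-in, not the real `CodeGenConfig`, and the even split of `n_embd` across heads is an assumption based on the usual multi-head attention layout.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConfigSketch:
    """Hypothetical stand-in holding a few CodeGenConfig fields and their defaults."""
    n_embd: int = 4096
    n_head: int = 16
    rotary_dim: int = 64
    n_inner: Optional[int] = None

    @property
    def head_dim(self) -> int:
        # Assumption: the hidden size is divided evenly among the heads.
        if self.n_embd % self.n_head != 0:
            raise ValueError("n_embd must be divisible by n_head")
        return self.n_embd // self.n_head

    @property
    def inner_dim(self) -> int:
        # None means "use 4 times n_embd", as stated in the parameter list.
        return self.n_inner if self.n_inner is not None else 4 * self.n_embd


defaults = ConfigSketch()
custom = ConfigSketch(n_inner=1024)
```

With the defaults, each head works on 256 dimensions and the feed-forward layers use 16384, so the default `rotary_dim` of 64 covers only part of each head's dimensions.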
CodeGenTokenizer ( vocab_file merges_file errors = 'replace' unk_token = '<|endoftext|>' bos_token = '<|endoftext|>' eos_token = '<|endoftext|>' pad_token = None add_prefix_space = False add_bos_token = False **kwargs )
Parameters
vocab_file (str): Path to the vocabulary file.
merges_file (str): Path to the merges file.
errors (str, optional, defaults to "replace"): Paradigm to follow when decoding bytes to UTF-8. See bytes.decode for more information.
unk_token (str, optional, defaults to "<|endoftext|>"): The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.
bos_token (str, optional, defaults to "<|endoftext|>"): The beginning of sequence token.
eos_token (str, optional, defaults to "<|endoftext|>"): The end of sequence token.
pad_token (str, optional): The token used for padding, for example when batching sequences of different lengths.
add_prefix_space (bool, optional, defaults to False): Whether or not to add an initial space to the input. This allows the leading word to be treated just like any other word. (The CodeGen tokenizer detects the beginning of words by the preceding space.)
add_bos_token (bool, optional, defaults to False): Whether to add a beginning of sequence token at the start of sequences.
Construct a CodeGen tokenizer. Based on byte-level Byte-Pair-Encoding.
This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece), so a word will be encoded differently depending on whether it is at the beginning of the sentence (without space) or not:
>>> from transformers import CodeGenTokenizer
>>> tokenizer = CodeGenTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
>>> tokenizer("Hello world")["input_ids"]
[15496, 995]
>>> tokenizer(" Hello world")["input_ids"]
[18435, 995]
You can get around that behavior by passing add_prefix_space=True when instantiating this tokenizer or when you call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance.
When used with is_split_into_words=True, this tokenizer will add a space before each word (even the first one).
This tokenizer inherits from PreTrainedTokenizer which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
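The space sensitivity described above comes from how byte-level BPE tokenizers split text before merging: a leading space stays attached to the word that follows it. The sketch below is a simplified, ASCII-only approximation of that pre-tokenization step, written for illustration only. The regular expression and the `pre_tokenize` helper are assumptions, not the pattern or API the library actually uses.

```python
import re
from typing import List

# Simplified, ASCII-only approximation of a GPT-2 style pre-tokenization pattern:
# an optional leading space is kept together with the following word, number,
# or run of punctuation; any other whitespace forms its own piece.
_PRETOKEN = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")


def pre_tokenize(text: str) -> List[str]:
    """Split text into pieces, keeping a leading space attached to its word."""
    return _PRETOKEN.findall(text)


without_space = pre_tokenize("Hello world")
with_space = pre_tokenize(" Hello world")
```

Under this approximation, "Hello" and " Hello" are different pieces, which is consistent with the two different first ids in the example above, while " world" is the same piece in both inputs.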
CodeGenTokenizerFast ( vocab_file = None merges_file = None tokenizer_file = None unk_token = '<|endoftext|>' bos_token = '<|endoftext|>' eos_token = '<|endoftext|>' add_prefix_space = False **kwargs )
Parameters
vocab_file (str, optional): Path to the vocabulary file.
merges_file (str, optional): Path to the merges file.
tokenizer_file (str, optional): Path to the tokenizers file (generally has a .json extension) that contains everything needed to load the tokenizer.
unk_token (str, optional, defaults to "<|endoftext|>"): The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.
bos_token (str, optional, defaults to "<|endoftext|>"): The beginning of sequence token.
eos_token (str, optional, defaults to "<|endoftext|>"): The end of sequence token.
add_prefix_space (bool, optional, defaults to False): Whether or not to add an initial space to the input. This allows the leading word to be treated just like any other word. (The CodeGen tokenizer detects the beginning of words by the preceding space.)
Construct a "fast" CodeGen tokenizer (backed by HuggingFace's tokenizers library). Based on byte-level Byte-Pair-Encoding.
This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece), so a word will be encoded differently depending on whether it is at the beginning of the sentence (without space) or not:
>>> from transformers import CodeGenTokenizerFast
>>> tokenizer = CodeGenTokenizerFast.from_pretrained("Salesforce/codegen-350M-mono")
>>> tokenizer("Hello world")["input_ids"]
[15496, 995]
>>> tokenizer(" Hello world")["input_ids"]
[18435, 995]
You can get around that behavior by passing add_prefix_space=True when instantiating this tokenizer, but since the model was not pretrained this way, it might yield a decrease in performance.
When used with is_split_into_words=True, this tokenizer needs to be instantiated with add_prefix_space=True.
This tokenizer inherits from PreTrainedTokenizerFast which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
decode ( token_ids: Union skip_special_tokens: bool = False clean_up_tokenization_spaces: bool = None truncate_before_pattern: Optional = None **kwargs ) → str
Parameters
token_ids (Union[int, List[int], np.ndarray, torch.Tensor, tf.Tensor]): List of tokenized input ids. Can be obtained using the __call__ method.
skip_special_tokens (bool, optional, defaults to False): Whether or not to remove special tokens in the decoding.
clean_up_tokenization_spaces (bool, optional): Whether or not to clean up the tokenization spaces. If None, will default to self.clean_up_tokenization_spaces (available in the tokenizer_config).
truncate_before_pattern (List[str], optional, defaults to None): A list of regular expression strings that will be used to truncate the returned string. This can be used to remove extra pieces of code (e.g. truncate if observing a comment symbol "#" at the beginning of a new line). An example pattern could be ["^#", re.escape("<|endoftext|>"), "^'''", "\n\n\n"].
kwargs (additional keyword arguments, optional): Will be passed to the underlying model specific decode method.
Returns
str
The decoded sentence.
Converts a sequence of ids into a string, using the tokenizer and vocabulary, with options to remove special tokens and clean up tokenization spaces.
Similar to doing self.convert_tokens_to_string(self.convert_ids_to_tokens(token_ids)).
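The effect of the truncation patterns passed to decode can be illustrated with a short standalone sketch: find the earliest position at which any pattern matches and cut the string there. This is a simplified illustration under that assumption, not the library's implementation, and `truncate_before` is a hypothetical helper name.

```python
import re
from typing import List


def truncate_before(text: str, patterns: List[str]) -> str:
    """Cut text at the earliest position where any of the patterns matches."""
    cut = len(text)
    for pattern in patterns:
        # re.MULTILINE lets "^" match at the start of every line, not just the string.
        match = re.search(pattern, text, flags=re.MULTILINE)
        if match is not None:
            cut = min(cut, match.start())
    return text[:cut]


completion = "def f():\n    return 1\n# trailing comment\nprint(f())"
truncated = truncate_before(completion, ["^#", re.escape("<|endoftext|>")])
```

Here the "^#" pattern matches at the start of the third line, so everything from the comment onward is dropped and only the function definition remains.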
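CodeGenModel ( config )
Parameters
config (CodeGenConfig): Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration.
The bare CodeGen Model transformer outputting raw hidden-states without any specific head on top. This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.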
forward ( input_ids: Optional = None past_key_values: Optional = None attention_mask: Optional = None token_type_ids: Optional = None position_ids: Optional = None head_mask: Optional = None inputs_embeds: Optional = None use_cache: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) → transformers.modeling_outputs.BaseModelOutputWithPast or tuple(torch.FloatTensor)
Parameters
input_ids (torch.LongTensor of shape (batch_size, sequence_length)): Indices of input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional): Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
token_type_ids (torch.LongTensor of shape (batch_size, sequence_length), optional): Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]: 0 corresponds to a sentence A token, 1 corresponds to a sentence B token.
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional): Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
head_mask (torch.FloatTensor of shape (num_attention_heads,) or (n_layer, num_attention_heads), optional): Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]: 1 indicates the head is not masked, 0 indicates the head is masked.
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_dim), optional): Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
output_attentions (bool, optional): Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional): Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional): Whether or not to return a ModelOutput instead of a plain tuple.
Returns
transformers.modeling_outputs.BaseModelOutputWithPast or tuple(torch.FloatTensor)
A transformers.modeling_outputs.BaseModelOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (CodeGenConfig) and inputs.
last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)): Sequence of hidden-states at the output of the last layer of the model. If past_key_values is used, only the last hidden-state of the sequences, of shape (batch_size, 1, hidden_size), is output.
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True): Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and, optionally if config.is_encoder_decoder=True, 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (keys and values in the self-attention blocks and, optionally if config.is_encoder_decoder=True, in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The CodeGenModel forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example:
>>> from transformers import AutoTokenizer, CodeGenModel
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-2B-mono")
>>> model = CodeGenModel.from_pretrained("Salesforce/codegen-2B-mono")
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state
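When several prompts of different lengths are batched, the attention_mask input marks which positions hold real tokens (1) and which hold padding (0). The sketch below builds both lists by hand in plain Python to show the convention. `pad_batch` is a hypothetical helper, and reusing id 50256 (the default bos/eos id) as the padding id is an assumption made here because the tokenizer's pad_token defaults to None.

```python
from typing import List, Sequence, Tuple

PAD_TOKEN_ID = 50256  # assumption: reuse the end-of-text id for padding


def pad_batch(
    sequences: Sequence[Sequence[int]], pad_id: int = PAD_TOKEN_ID
) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad token id sequences to a common length and build the matching mask."""
    if not sequences:
        return [], []
    max_len = max(len(seq) for seq in sequences)
    input_ids: List[List[int]] = []
    attention_mask: List[List[int]] = []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padding position to be ignored.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


ids, mask = pad_batch([[15496, 995], [18435]])
```

In practice the tokenizer produces both lists when called with padding enabled; the sketch only makes the 1/0 convention explicit.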
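CodeGenForCausalLM ( config )
Parameters
config (CodeGenConfig): Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration.
The CodeGen Model transformer with a language modeling head on top.
This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.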
forward ( input_ids: Optional = None past_key_values: Optional = None attention_mask: Optional = None token_type_ids: Optional = None position_ids: Optional = None head_mask: Optional = None inputs_embeds: Optional = None labels: Optional = None use_cache: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) → transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
Parameters
input_ids (torch.LongTensor of shape (batch_size, sequence_length)): Indices of input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional): Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
token_type_ids (torch.LongTensor of shape (batch_size, sequence_length), optional): Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]: 0 corresponds to a sentence A token, 1 corresponds to a sentence B token.
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional): Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
head_mask (torch.FloatTensor of shape (num_attention_heads,) or (n_layer, num_attention_heads), optional): Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]: 1 indicates the head is not masked, 0 indicates the head is masked.
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_dim), optional): Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
output_attentions (bool, optional): Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional): Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional): Whether or not to return a ModelOutput instead of a plain tuple.
labels (torch.LongTensor of shape (batch_size, sequence_length), optional): Labels for language modeling. Note that the labels are shifted inside the model, i.e. you can set labels = input_ids. Indices are selected in [-100, 0, ..., config.vocab_size]. All labels set to -100 are ignored (masked); the loss is only computed for labels in [0, ..., config.vocab_size].
Returns
transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
A transformers.modeling_outputs.CausalLMOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (CodeGenConfig) and inputs.
loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided): Language modeling loss (for next-token prediction).
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)): Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True): Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (keys and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The CodeGenForCausalLM forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example:
>>> import torch
>>> from transformers import AutoTokenizer, CodeGenForCausalLM
>>> tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-2B-mono")
>>> model = CodeGenForCausalLM.from_pretrained("Salesforce/codegen-2B-mono")
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs, labels=inputs["input_ids"])
>>> loss = outputs.loss
>>> logits = outputs.logits
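The labels description says the labels are shifted inside the model and that positions set to -100 are ignored. The plain-Python sketch below shows what that means for a single sequence: the logits at position t are scored against the label at position t + 1, and ignored positions do not contribute. `causal_lm_loss` is a hypothetical helper written for illustration, and averaging over the non-ignored positions is an assumption that matches the usual cross-entropy default rather than a statement about the library's exact code.

```python
import math
from typing import List

IGNORE_INDEX = -100  # label value that is skipped when computing the loss


def causal_lm_loss(logits: List[List[float]], labels: List[int]) -> float:
    """Mean next-token cross-entropy for one sequence, with shifted labels."""
    if len(logits) != len(labels):
        raise ValueError("logits and labels must have the same sequence length")
    # Shift: the prediction at position t is compared with the token at t + 1.
    shift_logits = logits[:-1]
    shift_labels = labels[1:]
    total, count = 0.0, 0
    for row, label in zip(shift_logits, shift_labels):
        if label == IGNORE_INDEX:
            continue
        # Numerically stable log-sum-exp for the softmax normalizer.
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_norm - row[label]
        count += 1
    if count == 0:
        raise ValueError("all labels are ignored")
    return total / count


# Uniform logits over a 4-token vocabulary: every target has probability 1/4.
uniform_logits = [[0.0] * 4 for _ in range(4)]
loss = causal_lm_loss(uniform_logits, [0, 1, IGNORE_INDEX, 3])
```

With uniform logits the loss equals log(4) regardless of which positions are ignored, which makes the example easy to check by hand.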