CodeGen
Overview
The CodeGen model was proposed in A Conversational Paradigm for Program Synthesis by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong.
CodeGen is an autoregressive language model for program synthesis trained sequentially on The Pile, BigQuery, and BigPython.
The abstract from the paper is the following:
Program synthesis strives to generate a computer program as a solution to a given problem specification. We propose a conversational program synthesis approach via large language models, which addresses the challenges of searching over a vast program space and user intent specification faced in prior approaches. Our new approach casts the process of writing a specification and program as a multi-turn conversation between a user and a system. It treats program synthesis as a sequence prediction problem, in which the specification is expressed in natural language and the desired program is conditionally sampled. We train a family of large language models, called CodeGen, on natural language and programming language data. With weak supervision in the data and the scaling up of data size and model size, conversational capacities emerge from the simple autoregressive language modeling. To study the model behavior on conversational program synthesis, we develop a multi-turn programming benchmark (MTPB), where solving each problem requires multi-step synthesis via multi-turn conversation between the user and the model. Our findings show the emergence of conversational capabilities and the effectiveness of the proposed conversational program synthesis paradigm. In addition, our model CodeGen (with up to 16B parameters trained on TPU-v4) outperforms OpenAIβs Codex on the HumanEval benchmark. We make the training library JaxFormer including checkpoints available as open source contribution: this https URL.
This model was contributed by Hiroaki Hayashi. The original code can be found here.
Checkpoint Naming
- CodeGen model checkpoints are available on different pre-training data with variable sizes.
- The format is:
Salesforce/codegen-{size}-{data}
, wheresize
:350M
,2B
,6B
,16B
data
:nl
: Pre-trained on the Pilemulti
: Initialized withnl
, then further pre-trained on multiple programming languages datamono
: Initialized withmulti
, then further pre-trained on Python data
- For example,
Salesforce/codegen-350M-mono
offers a 350 million-parameter checkpoint pre-trained sequentially on the Pile, multiple programming languages, and Python.
Usage example
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> checkpoint = "Salesforce/codegen-350M-mono"
>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)
>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> text = "def hello_world():"
>>> completion = model.generate(**tokenizer(text, return_tensors="pt"))
>>> print(tokenizer.decode(completion[0]))
def hello_world():
print("Hello World")
hello_world()
Resources
CodeGenConfig
class transformers.CodeGenConfig
< source >( vocab_size = 50400 n_positions = 2048 n_ctx = 2048 n_embd = 4096 n_layer = 28 n_head = 16 rotary_dim = 64 n_inner = None activation_function = 'gelu_new' resid_pdrop = 0.0 embd_pdrop = 0.0 attn_pdrop = 0.0 layer_norm_epsilon = 1e-05 initializer_range = 0.02 use_cache = True bos_token_id = 50256 eos_token_id = 50256 tie_word_embeddings = False **kwargs )
Parameters
- vocab_size (
int
, optional, defaults to 50400) — Vocabulary size of the CodeGen model. Defines the number of different tokens that can be represented by theinputs_ids
passed when calling CodeGenModel. - n_positions (
int
, optional, defaults to 2048) — The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048). - n_ctx (
int
, optional, defaults to 2048) — This attribute is used inCodeGenModel.__init__
without any real effect. - n_embd (
int
, optional, defaults to 4096) — Dimensionality of the embeddings and hidden states. - n_layer (
int
, optional, defaults to 28) — Number of hidden layers in the Transformer encoder. - n_head (
int
, optional, defaults to 16) — Number of attention heads for each attention layer in the Transformer encoder. - rotary_dim (
int
, optional, defaults to 64) — Number of dimensions in the embedding that Rotary Position Embedding is applied to. - n_inner (
int
, optional) — Dimensionality of the inner feed-forward layers.None
will set it to 4 times n_embd - activation_function (
str
, optional, defaults to"gelu_new"
) — Activation function, to be selected in the list["relu", "silu", "gelu", "tanh", "gelu_new"]
. - resid_pdrop (
float
, optional, defaults to 0.0) — The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. - embd_pdrop (
int
, optional, defaults to 0.0) — The dropout ratio for the embeddings. - attn_pdrop (
float
, optional, defaults to 0.0) — The dropout ratio for the attention. - layer_norm_epsilon (
float
, optional, defaults to 1e-05) — The epsilon to use in the layer normalization layers. - initializer_range (
float
, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices. - use_cache (
bool
, optional, defaults toTrue
) — Whether or not the model should return the last key/values attentions (not used by all models). - bos_token_id (
int
, optional, defaults to 50256) — Beginning of stream token id. - eos_token_id (
int
, optional, defaults to 50256) — End of stream token id. - tie_word_embeddings (
bool
, optional, defaults toFalse
) — Whether the model’s input and output word embeddings should be tied. Note that this is only relevant if the model has a output word embedding layer.
This is the configuration class to store the configuration of a CodeGenModel. It is used to instantiate a CodeGen model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the CodeGen Salesforce/codegen-2B-mono architecture. Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Example:
>>> from transformers import CodeGenConfig, CodeGenModel
>>> # Initializing a CodeGen 6B configuration
>>> configuration = CodeGenConfig()
>>> # Initializing a model (with random weights) from the configuration
>>> model = CodeGenModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
CodeGenTokenizer
class transformers.CodeGenTokenizer
< source >( vocab_file merges_file errors = 'replace' unk_token = '<|endoftext|>' bos_token = '<|endoftext|>' eos_token = '<|endoftext|>' pad_token = None add_prefix_space = False add_bos_token = False return_token_type_ids = False **kwargs )
Parameters
- vocab_file (
str
) — Path to the vocabulary file. - merges_file (
str
) — Path to the merges file. - errors (
str
, optional, defaults to"replace"
) — Paradigm to follow when decoding bytes to UTF-8. See bytes.decode for more information. - unk_token (
str
, optional, defaults to"<|endoftext|>"
) — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead. - bos_token (
str
, optional, defaults to"<|endoftext|>"
) — The beginning of sequence token. - eos_token (
str
, optional, defaults to"<|endoftext|>"
) — The end of sequence token. - pad_token (
str
, optional) — The token used for padding, for example when batching sequences of different lengths. - add_prefix_space (
bool
, optional, defaults toFalse
) — Whether or not to add an initial space to the input. This allows to treat the leading word just as any other word. (CodeGen tokenizer detect beginning of words by the preceding space). - add_bos_token (
bool
, optional, defaults toFalse
) — Whether to add a beginning of sequence token at the start of sequences. - return_token_type_ids (
bool
, optional, defaults toFalse
) — Whether to return token type IDs.
Construct a CodeGen tokenizer. Based on byte-level Byte-Pair-Encoding.
This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will
be encoded differently whether it is at the beginning of the sentence (without space) or not:
>>> from transformers import CodeGenTokenizer
>>> tokenizer = CodeGenTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
>>> tokenizer("Hello world")["input_ids"]
[15496, 995]
>>> tokenizer(" Hello world")["input_ids"]
[18435, 995]
You can get around that behavior by passing add_prefix_space=True
when instantiating this tokenizer or when you
call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance.
When used with is_split_into_words=True
, this tokenizer will add a space before each word (even the first one).
This tokenizer inherits from PreTrainedTokenizer which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
create_token_type_ids_from_sequences
< source >( token_ids_0: List token_ids_1: Optional = None ) β List[int]
Parameters
- token_ids_0 (
List[int]
) — List of IDs. - token_ids_1 (
List[int]
, optional) — Optional second list of IDs for sequence pairs.
Returns
List[int]
List of token type IDs according to the given sequence(s).
Create a mask from the two sequences passed to be used in a sequence-pair classification task. A sequence
pair mask has the following format:
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence | second sequence |
If token_ids_1
is None
, this method only returns the first portion of the mask (0s).
CodeGenTokenizerFast
class transformers.CodeGenTokenizerFast
< source >( vocab_file = None merges_file = None tokenizer_file = None unk_token = '<|endoftext|>' bos_token = '<|endoftext|>' eos_token = '<|endoftext|>' add_prefix_space = False return_token_type_ids = False **kwargs )
Parameters
- vocab_file (
str
, optional) — Path to the vocabulary file. - merges_file (
str
, optional) — Path to the merges file. - tokenizer_file (
str
, optional) — Path to tokenizers file (generally has a .json extension) that contains everything needed to load the tokenizer. - unk_token (
str
, optional, defaults to"<|endoftext|>"
) — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead. - bos_token (
str
, optional, defaults to"<|endoftext|>"
) — The beginning of sequence token. - eos_token (
str
, optional, defaults to"<|endoftext|>"
) — The end of sequence token. - add_prefix_space (
bool
, optional, defaults toFalse
) — Whether or not to add an initial space to the input. This allows to treat the leading word just as any other word. (CodeGen tokenizer detect beginning of words by the preceding space). - return_token_type_ids (
bool
, optional, defaults toFalse
) — Whether to return token type IDs.
Construct a βfastβ CodeGen tokenizer (backed by HuggingFaceβs tokenizers library). Based on byte-level Byte-Pair-Encoding.
This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will
be encoded differently whether it is at the beginning of the sentence (without space) or not:
>>> from transformers import CodeGenTokenizerFast
>>> tokenizer = CodeGenTokenizerFast.from_pretrained("Salesforce/codegen-350M-mono")
>>> tokenizer("Hello world")["input_ids"]
[15496, 995]
>>> tokenizer(" Hello world")["input_ids"]
[18435, 995]
You can get around that behavior by passing add_prefix_space=True
when instantiating this tokenizer, but since
the model was not pretrained this way, it might yield a decrease in performance.
When used with is_split_into_words=True
, this tokenizer needs to be instantiated with add_prefix_space=True
.
This tokenizer inherits from PreTrainedTokenizerFast which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
create_token_type_ids_from_sequences
< source >( token_ids_0: List token_ids_1: Optional = None ) β List[int]
Parameters
- token_ids_0 (
List[int]
) — List of IDs. - token_ids_1 (
List[int]
, optional) — Optional second list of IDs for sequence pairs.
Returns
List[int]
List of token type IDs according to the given sequence(s).
Create a mask from the two sequences passed to be used in a sequence-pair classification task. A sequence
pair mask has the following format:
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence | second sequence |
If token_ids_1
is None
, this method only returns the first portion of the mask (0s).
decode
< source >( token_ids: Union skip_special_tokens: bool = False clean_up_tokenization_spaces: bool = None truncate_before_pattern: Optional = None **kwargs ) β str
Parameters
- token_ids (
Union[int, List[int], np.ndarray, torch.Tensor, tf.Tensor]
) — List of tokenized input ids. Can be obtained using the__call__
method. - skip_special_tokens (
bool
, optional, defaults toFalse
) — Whether or not to remove special tokens in the decoding. - clean_up_tokenization_spaces (
bool
, optional) — Whether or not to clean up the tokenization spaces. IfNone
, will default toself.clean_up_tokenization_spaces
(available in thetokenizer_config
). - truncate_before_pattern (
List[str]
, optional, defaults toNone
) — A list of regular expression strings that will be used to truncate the returned string. This can be used to remove extra pieces of code (e.g. truncate if observing a comment symbol ”#” at the beginning of a new line). An example pattern could be `[”^#”, re.escape(”<|endoftext|>”), ”^'''”, ”
Returns
str
The decoded sentence.
Converts a sequence of ids in a string, using the tokenizer and vocabulary with options to remove special tokens and clean up tokenization spaces.
Similar to doing self.convert_tokens_to_string(self.convert_ids_to_tokens(token_ids))
.
β]`. kwargs (additional keyword arguments, optional): Will be passed to the underlying model specific decode method.
CodeGenModel
class transformers.CodeGenModel
< source >( config )
Parameters
- config (CodeGenConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare CodeGen Model transformer outputting raw hidden-states without any specific head on top. This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
< source >( input_ids: Optional = None past_key_values: Optional = None attention_mask: Optional = None token_type_ids: Optional = None position_ids: Optional = None head_mask: Optional = None inputs_embeds: Optional = None use_cache: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) β transformers.modeling_outputs.BaseModelOutputWithPast or tuple(torch.FloatTensor)
Parameters
- input_ids (
torch.LongTensor
of shape(batch_size, sequence_length)
) — Indices of input sequence tokens in the vocabulary.Indices can be obtained using
AutoProcenizer
. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. - attention_mask (
torch.FloatTensor
of shape(batch_size, sequence_length)
, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in[0, 1]
:- 1 for tokens that are not masked,
- 0 for tokens that are masked.
- token_type_ids (
torch.LongTensor
of shape(batch_size, sequence_length)
, optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in[0, 1]
:- 0 corresponds to a sentence A token,
- 1 corresponds to a sentence B token.
- position_ids (
torch.LongTensor
of shape(batch_size, sequence_length)
, optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range[0, config.n_positions - 1]
. - head_mask (
torch.FloatTensor
of shape(num_attention_heads,)
or(n_layer, num_attention_heads)
, optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in[0, 1]
:- 1 indicates the head is not masked,
- 0 indicates the head is masked.
- inputs_embeds (
torch.FloatTensor
of shape(batch_size, sequence_length, hidden_dim)
, optional) — Optionally, instead of passinginput_ids
you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix. - output_attentions (
bool
, optional) — Whether or not to return the attentions tensors of all attention layers. Seeattentions
under returned tensors for more detail. - output_hidden_states (
bool
, optional) — Whether or not to return the hidden states of all layers. Seehidden_states
under returned tensors for more detail. - return_dict (
bool
, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
transformers.modeling_outputs.BaseModelOutputWithPast or tuple(torch.FloatTensor)
A transformers.modeling_outputs.BaseModelOutputWithPast or a tuple of
torch.FloatTensor
(if return_dict=False
is passed or when config.return_dict=False
) comprising various
elements depending on the configuration (CodeGenConfig) and inputs.
-
last_hidden_state (
torch.FloatTensor
of shape(batch_size, sequence_length, hidden_size)
) β Sequence of hidden-states at the output of the last layer of the model.If
past_key_values
is used only the last hidden-state of the sequences of shape(batch_size, 1, hidden_size)
is output. -
past_key_values (
tuple(tuple(torch.FloatTensor))
, optional, returned whenuse_cache=True
is passed or whenconfig.use_cache=True
) β Tuple oftuple(torch.FloatTensor)
of lengthconfig.n_layers
, with each tuple having 2 tensors of shape(batch_size, num_heads, sequence_length, embed_size_per_head)
) and optionally ifconfig.is_encoder_decoder=True
2 additional tensors of shape(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)
.Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally if
config.is_encoder_decoder=True
in the cross-attention blocks) that can be used (seepast_key_values
input) to speed up sequential decoding. -
hidden_states (
tuple(torch.FloatTensor)
, optional, returned whenoutput_hidden_states=True
is passed or whenconfig.output_hidden_states=True
) β Tuple oftorch.FloatTensor
(one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape(batch_size, sequence_length, hidden_size)
.Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
-
attentions (
tuple(torch.FloatTensor)
, optional, returned whenoutput_attentions=True
is passed or whenconfig.output_attentions=True
) β Tuple oftorch.FloatTensor
(one for each layer) of shape(batch_size, num_heads, sequence_length, sequence_length)
.Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The CodeGenModel forward method, overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Example:
>>> from transformers import AutoTokenizer, CodeGenModel
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-2B-mono")
>>> model = CodeGenModel.from_pretrained("Salesforce/codegen-2B-mono")
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state
CodeGenForCausalLM
class transformers.CodeGenForCausalLM
< source >( config )
Parameters
- config (CodeGenConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The CodeGen Model transformer with a language modeling head on top.
This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
< source >( input_ids: Optional = None past_key_values: Optional = None attention_mask: Optional = None token_type_ids: Optional = None position_ids: Optional = None head_mask: Optional = None inputs_embeds: Optional = None labels: Optional = None use_cache: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) β transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
Parameters
- input_ids (
torch.LongTensor
of shape(batch_size, sequence_length)
) — Indices of input sequence tokens in the vocabulary.Indices can be obtained using
AutoProcenizer
. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. - attention_mask (
torch.FloatTensor
of shape(batch_size, sequence_length)
, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in[0, 1]
:- 1 for tokens that are not masked,
- 0 for tokens that are masked.
- token_type_ids (
torch.LongTensor
of shape(batch_size, sequence_length)
, optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in[0, 1]
:- 0 corresponds to a sentence A token,
- 1 corresponds to a sentence B token.
- position_ids (
torch.LongTensor
of shape(batch_size, sequence_length)
, optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range[0, config.n_positions - 1]
. - head_mask (
torch.FloatTensor
of shape(num_attention_heads,)
or(n_layer, num_attention_heads)
, optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in[0, 1]
:- 1 indicates the head is not masked,
- 0 indicates the head is masked.
- inputs_embeds (
torch.FloatTensor
of shape(batch_size, sequence_length, hidden_dim)
, optional) — Optionally, instead of passinginput_ids
you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix. - output_attentions (
bool
, optional) — Whether or not to return the attentions tensors of all attention layers. Seeattentions
under returned tensors for more detail. - output_hidden_states (
bool
, optional) — Whether or not to return the hidden states of all layers. Seehidden_states
under returned tensors for more detail. - return_dict (
bool
, optional) — Whether or not to return a ModelOutput instead of a plain tuple. - labels (
torch.LongTensor
of shape(batch_size, sequence_length)
, optional) — Labels for language modeling. Note that the labels are shifted inside the model, i.e. you can setlabels = input_ids
Indices are selected in[-100, 0, ..., config.vocab_size]
All labels set to-100
are ignored (masked), the loss is only computed for labels in[0, ..., config.vocab_size]
Returns
transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
A transformers.modeling_outputs.CausalLMOutputWithPast or a tuple of
torch.FloatTensor
(if return_dict=False
is passed or when config.return_dict=False
) comprising various
elements depending on the configuration (CodeGenConfig) and inputs.
-
loss (
torch.FloatTensor
of shape(1,)
, optional, returned whenlabels
is provided) β Language modeling loss (for next-token prediction). -
logits (
torch.FloatTensor
of shape(batch_size, sequence_length, config.vocab_size)
) β Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). -
past_key_values (
tuple(tuple(torch.FloatTensor))
, optional, returned whenuse_cache=True
is passed or whenconfig.use_cache=True
) β Tuple oftuple(torch.FloatTensor)
of lengthconfig.n_layers
, with each tuple having 2 tensors of shape(batch_size, num_heads, sequence_length, embed_size_per_head)
)Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
past_key_values
input) to speed up sequential decoding. -
hidden_states (
tuple(torch.FloatTensor)
, optional, returned whenoutput_hidden_states=True
is passed or whenconfig.output_hidden_states=True
) β Tuple oftorch.FloatTensor
(one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape(batch_size, sequence_length, hidden_size)
.Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
-
attentions (
tuple(torch.FloatTensor)
, optional, returned whenoutput_attentions=True
is passed or whenconfig.output_attentions=True
) β Tuple oftorch.FloatTensor
(one for each layer) of shape(batch_size, num_heads, sequence_length, sequence_length)
.Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The CodeGenForCausalLM forward method, overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Example:
>>> import torch
>>> from transformers import AutoTokenizer, CodeGenForCausalLM
>>> tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-2B-mono")
>>> model = CodeGenForCausalLM.from_pretrained("Salesforce/codegen-2B-mono")
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs, labels=inputs["input_ids"])
>>> loss = outputs.loss
>>> logits = outputs.logits