Transformers documentation

CodeLlama

Hugging Face's logo
Join the Hugging Face community

and get access to the augmented documentation experience

to get started

CodeLlama

Overview

The Code Llama model was proposed in Code Llama: Open Foundation Models for Code by Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, Gabriel Synnaeve.

The abstract from the paper is the following:

We release Code Llama, a family of large language models for code based on Llama 2 providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks. We provide multiple flavors to cover a wide range of applications: foundation models (Code Llama), Python specializations (Code Llama - Python), and instruction-following models (Code Llama - Instruct) with 7B, 13B and 34B parameters each. All models are trained on sequences of 16k tokens and show improvements on inputs with up to 100k tokens. 7B and 13B Code Llama and Code Llama - Instruct variants support infilling based on surrounding content. Code Llama reaches state-of-the-art performance among open models on several code benchmarks, with scores of up to 53% and 55% on HumanEval and MBPP, respectively. Notably, Code Llama - Python 7B outperforms Llama 2 70B on HumanEval and MBPP, and all our models outperform every other publicly available model on MultiPL-E. We release Code Llama under a permissive license that allows for both research and commercial use.

Check out all Code Llama model checkpoints here and the officially released ones in the codellama org.

This model was contributed by ArthurZucker. The original code of the authors can be found here.

Usage tips and examples

The Llama2 family models, on which Code Llama is based, were trained using bfloat16, but the original inference uses float16. Let’s look at the different precisions:

  • float32: PyTorch convention on model initialization is to load models in float32, no matter with which dtype the model weights were stored. transformers also follows this convention for consistency with PyTorch. This will be picked by default. If you want the AutoModel API to cast the load the checkpoints with the storage weights type, you must specify torch_dtype="auto", e.g. model = AutoModelForCausalLM.from_pretrained("path", torch_dtype = "auto").
  • bfloat16: Code Llama was trained with this precision, so we recommend using it for further training or fine-tuning.
  • float16: We recommend running inference using this precision, as it’s usually faster than bfloat16, and evaluation metrics show no discernible degradation with respect to bfloat16. You can also run inference using bfloat16, and we recommend you check inference results with both float16 and bfloat16 after fine-tuning.

As mentioned above, the dtype of the storage weights is mostly irrelevant unless you are using torch_dtype="auto" when initializing a model using. The reason is that the model will first be downloaded (using the dtype of the checkpoints online) and then will be casted to the default dtype of torch (becomes torch.float32). If there is a specified torch_dtype, it will be used instead.

Tips:

  • The infilling task is supported out of the box. You should be using the tokenizer.fill_token where you want your input to be filled.
  • The model conversion script is the same as for the Llama2 family:

Here is a sample usage:

python src/transformers/models/llama/convert_llama_weights_to_hf.py \
    --input_dir /path/to/downloaded/llama/weights --model_size 7B --output_dir /output/path

Note that executing the script requires enough CPU RAM to host the whole model in float16 precision (even if the biggest versions come in several checkpoints they each contain a part of each weight of the model, so we need to load them all in RAM).

After conversion, the model and tokenizer can be loaded via:

>>> from transformers import LlamaForCausalLM, CodeLlamaTokenizer

>>> tokenizer = CodeLlamaTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")
>>> model = LlamaForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf")
>>> PROMPT = '''def remove_non_ascii(s: str) -> str:
...     """ <FILL_ME>
...     return result
... '''
>>> input_ids = tokenizer(PROMPT, return_tensors="pt")["input_ids"]
>>> generated_ids = model.generate(input_ids, max_new_tokens=128)

>>> filling = tokenizer.batch_decode(generated_ids[:, input_ids.shape[1]:], skip_special_tokens = True)[0]
>>> print(PROMPT.replace("<FILL_ME>", filling))
def remove_non_ascii(s: str) -> str:
    """ Remove non-ASCII characters from a string.
<BLANKLINE>
    Args:
        s: The string to remove non-ASCII characters from.
<BLANKLINE>
    Returns:
        The string with non-ASCII characters removed.
    """
    result = ""
    for c in s:
        if ord(c) < 128:
            result += c
    return result
<BLANKLINE>

If you only want the infilled part:

>>> from transformers import pipeline
>>> import torch

>>> generator = pipeline("text-generation",model="codellama/CodeLlama-7b-hf",torch_dtype=torch.float16, device_map="auto")
>>> generator('def remove_non_ascii(s: str) -> str:\n    """ <FILL_ME>\n    return result', max_new_tokens = 128)
[{'generated_text': 'def remove_non_ascii(s: str) -> str:\n    """ <FILL_ME>\n    return resultRemove non-ASCII characters from a string. """\n    result = ""\n    for c in s:\n        if ord(c) < 128:\n            result += c'}]

Under the hood, the tokenizer automatically splits by <FILL_ME> to create a formatted input string that follows the original training pattern. This is more robust than preparing the pattern yourself: it avoids pitfalls, such as token glueing, that are very hard to debug. To see how much CPU and GPU memory you need for this model or others, try this calculator which can help determine that value.

The LLaMA tokenizer is a BPE model based on sentencepiece. One quirk of sentencepiece is that when decoding a sequence, if the first token is the start of the word (e.g. “Banana”), the tokenizer does not prepend the prefix space to the string.

Code Llama has the same architecture as the Llama2 models, refer to Llama2’s documentation page for the API reference. Find Code Llama tokenizer reference below.

CodeLlamaTokenizer

class transformers.CodeLlamaTokenizer

< >

( vocab_file unk_token = '<unk>' bos_token = '<s>' eos_token = '</s>' prefix_token = '▁<PRE>' middle_token = '▁<MID>' suffix_token = '▁<SUF>' eot_token = '▁<EOT>' fill_token = '<FILL_ME>' suffix_first = False sp_model_kwargs: Optional = None add_bos_token = True add_eos_token = False clean_up_tokenization_spaces = False additional_special_tokens = None use_default_system_prompt = False **kwargs )

Parameters

  • vocab_file (str) — Path to the vocabulary file.
  • unk_token (str, optional, defaults to "<unk>") — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.
  • bos_token (str, optional, defaults to "<s>") — The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
  • eos_token (str, optional, defaults to "</s>") — The end of sequence token.

    When building a sequence using special tokens, this is not the token that is used for the end of sequence. The token used is the sep_token.

  • prefix_token (str, optional, defaults to "▁<PRE>") — Prefix token used for infilling.
  • middle_token (str, optional, defaults to "▁<MID>") — Middle token used for infilling.
  • suffix_token (str, optional, defaults to "▁<SUF>") — Suffix token used for infilling.
  • eot_token (str, optional, defaults to "▁<EOT>") — End of text token used for infilling.
  • fill_token (str, optional, defaults to "<FILL_ME>") — The token used to split the input between the prefix and suffix.
  • suffix_first (bool, optional, defaults to False) — Whether the input prompt and suffix should be formatted with the suffix first.
  • sp_model_kwargs (dict, optional) — Will be passed to the SentencePieceProcessor.__init__() method. The Python wrapper for SentencePiece can be used, among other things, to set:

    • enable_sampling: Enable subword regularization.

    • nbest_size: Sampling parameters for unigram. Invalid for BPE-Dropout.

      • nbest_size = {0,1}: No sampling is performed.
      • nbest_size > 1: samples from the nbest_size results.
      • nbest_size < 0: assuming that nbest_size is infinite and samples from the all hypothesis (lattice) using forward-filtering-and-backward-sampling algorithm.
    • alpha: Smoothing parameter for unigram sampling, and dropout probability of merge operations for BPE-dropout.

  • add_bos_token (bool, optional, defaults to True) — Whether to add a beginning of sequence token at the start of sequences.
  • add_eos_token (bool, optional, defaults to False) — Whether to add an end of sequence token at the end of sequences.
  • clean_up_tokenization_spaces (bool, optional, defaults to False) — Whether or not to clean up the tokenization spaces.
  • additional_special_tokens (List[str], optional) — Additional special tokens used by the tokenizer.
  • use_default_system_prompt (bool, optional, defaults to False) — Whether or not the default system prompt for Llama should be used.

Construct a CodeLlama tokenizer. Based on byte-level Byte-Pair-Encoding. The default padding token is unset as there is no padding token in the original model.

The default configuration match that of codellama/CodeLlama-7b-Instruct-hf which supports prompt infilling.

build_inputs_with_special_tokens

< >

( token_ids_0 token_ids_1 = None )

get_special_tokens_mask

< >

( token_ids_0: List token_ids_1: Optional = None already_has_special_tokens: bool = False ) List[int]

Parameters

  • token_ids_0 (List[int]) — List of IDs.
  • token_ids_1 (List[int], optional) — Optional second list of IDs for sequence pairs.
  • already_has_special_tokens (bool, optional, defaults to False) — Whether or not the token list is already formatted with special tokens for the model.

Returns

List[int]

A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.

Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer prepare_for_model method.

create_token_type_ids_from_sequences

< >

( token_ids_0: List token_ids_1: Optional = None ) List[int]

Parameters

  • token_ids_0 (List[int]) — List of ids.
  • token_ids_1 (List[int], optional) — Optional second list of IDs for sequence pairs.

Returns

List[int]

List of token type IDs according to the given sequence(s).

Creates a mask from the two sequences passed to be used in a sequence-pair classification task. An ALBERT

sequence pair mask has the following format:

0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence    | second sequence |

if token_ids_1 is None, only returns the first portion of the mask (0s).

save_vocabulary

< >

( save_directory filename_prefix: Optional = None ) Tuple(str)

Parameters

  • save_directory (str) — The directory in which to save the vocabulary.

Returns

Tuple(str)

Paths to the files saved.

Save the vocabulary and special tokens file to a directory.

CodeLlamaTokenizerFast

class transformers.CodeLlamaTokenizerFast

< >

( vocab_file = None tokenizer_file = None clean_up_tokenization_spaces = False unk_token = '<unk>' bos_token = '<s>' eos_token = '</s>' prefix_token = '▁<PRE>' middle_token = '▁<MID>' suffix_token = '▁<SUF>' eot_token = '▁<EOT>' fill_token = '<FILL_ME>' additional_special_tokens = None add_bos_token = True add_eos_token = False use_default_system_prompt = False **kwargs )

Parameters

  • vocab_file (str, optional) — SentencePiece file (generally has a .model extension) that contains the vocabulary necessary to instantiate a tokenizer.
  • tokenizer_file (str, optional) — tokenizers file (generally has a .json extension) that contains everything needed to load the tokenizer.
  • clean_up_tokenization_spaces (str, optional, defaults to False) — Wether to cleanup spaces after decoding, cleanup consists in removing potential artifacts like extra spaces.
  • unk_token (str, optional, defaults to "<unk>") — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.
  • bos_token (str, optional, defaults to "<s>") — The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
  • eos_token (str, optional, defaults to "</s>") — The end of sequence token.
  • prefix_token (str, optional, defaults to "▁<PRE>") — Prefix token used for infilling.
  • middle_token (str, optional, defaults to "▁<MID>") — Middle token used for infilling.
  • suffix_token (str, optional, defaults to "▁<SUF>") — Suffix token used for infilling.
  • eot_token (str, optional, defaults to "▁<EOT>") — End of text token used for infilling.
  • fill_token (str, optional, defaults to "<FILL_ME>") — The token used to split the input between the prefix and suffix.
  • additional_special_tokens (List[str], optional) — Additional special tokens used by the tokenizer.
  • add_bos_token (bool, optional, defaults to True) — Whether to add a beginning of sequence token at the start of sequences.
  • add_eos_token (bool, optional, defaults to False) — Whether to add an end of sequence token at the end of sequences.
  • use_default_system_prompt (bool, optional, defaults to False) — Whether or not the default system prompt for Llama should be used.

Construct a Llama tokenizer. Based on byte-level Byte-Pair-Encoding.

This uses notably ByteFallback and no normalization.

>>> from transformers import CodeLlamaTokenizerFast

>>> tokenizer = CodeLlamaTokenizerFast.from_pretrained("hf-internal-testing/llama-tokenizer")
>>> tokenizer.encode("Hello this is a test")
[1, 15043, 445, 338, 263, 1243]

If you want to change the bos_token or the eos_token, make sure to specify them when initializing the model, or call tokenizer.update_post_processor() to make sure that the post-processing is correctly done (otherwise the values of the first token and final token of an encoded sequence will not be correct). For more details, checkout [post-processors] (https://huggingface.co/docs/tokenizers/api/post-processors) documentation.

This tokenizer inherits from PreTrainedTokenizerFast which contains most of the main methods. Users should refer to this superclass for more information regarding those methods. The default configuration match that of codellama/CodeLlama-7b-Instruct-hf which supports prompt infilling.

build_inputs_with_special_tokens

< >

( token_ids_0: List token_ids_1: Optional = None ) List[int]

Parameters

  • token_ids_0 (List[int]) — List of IDs to which the special tokens will be added.
  • token_ids_1 (List[int], optional) — Optional second list of IDs for sequence pairs.

Returns

List[int]

list of input IDs with the appropriate special tokens.

Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and adding special tokens. The special tokens depend on calling set_lang.

An NLLB sequence has the following format, where X represents the sequence:

  • input_ids (for encoder) X [eos, src_lang_code]
  • decoder_input_ids: (for decoder) X [eos, tgt_lang_code]

BOS is never used. Pairs of sequences are not the expected use case, but they will be handled without a separator.

get_special_tokens_mask

< >

( token_ids_0: List token_ids_1: Optional = None already_has_special_tokens: bool = False ) A list of integers in the range [0, 1]

Parameters

  • token_ids_0 (List[int]) — List of ids of the first sequence.
  • token_ids_1 (List[int], optional) — List of ids of the second sequence.
  • already_has_special_tokens (bool, optional, defaults to False) — Whether or not the token list is already formatted with special tokens for the model.

Returns

A list of integers in the range [0, 1]

1 for a special token, 0 for a sequence token.

Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer prepare_for_model or encode_plus methods.

create_token_type_ids_from_sequences

< >

( token_ids_0: List token_ids_1: Optional = None ) List[int]

Parameters

  • token_ids_0 (List[int]) — The first tokenized sequence.
  • token_ids_1 (List[int], optional) — The second tokenized sequence.

Returns

List[int]

The token type ids.

Create the token type IDs corresponding to the sequences passed. What are token type IDs?

Should be overridden in a subclass if the model has a special way of building those.

update_post_processor

< >

( )

Updates the underlying post processor with the current bos_token and eos_token.

save_vocabulary

< >

( save_directory: str filename_prefix: Optional = None )

< > Update on GitHub