Transformers documentation

Llama2

Transformers

You are viewing main version, which requires installation from source. If you'd like regular pip install, checkout the latest stable version (v4.40.2).

Join the Hugging Face community

and get access to the augmented documentation experience

Collaborate on models, datasets and Spaces

Faster examples with accelerated inference

Switch between documentation themes

to get started

Llama2

Overview

The Llama2 model was proposed in LLaMA: Open Foundation and Fine-Tuned Chat Models by Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushka rMishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing EllenTan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom. It is a collection of foundation language models ranging from 7B to 70B parameters, with checkpoints finetuned for chat application!

The abstract from the paper is the following:

In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs.

Checkout all Llama2 model checkpoints here. This model was contributed by Arthur Zucker with contributions from Lysandre Debut. The code of the implementation in Hugging Face is based on GPT-NeoX here. The original code of the authors can be found here.

Usage tips

The Llama2 models were trained using bfloat16, but the original inference uses float16. The checkpoints uploaded on the Hub use torch_dtype = 'float16', which will be used by the AutoModel API to cast the checkpoints from torch.float32 to torch.float16.

The dtype of the online weights is mostly irrelevant unless you are using torch_dtype="auto" when initializing a model using model = AutoModelForCausalLM.from_pretrained("path", torch_dtype = "auto"). The reason is that the model will first be downloaded ( using the dtype of the checkpoints online), then it will be casted to the default dtype of torch (becomes torch.float32), and finally, if there is a torch_dtype provided in the config, it will be used.

Training the model in float16 is not recommended and is known to produce nan; as such, the model should be trained in bfloat16.

Tips:

Weights for the Llama2 models can be obtained by filling out this form
The architecture is very similar to the first Llama, with the addition of Grouped Query Attention (GQA) following this paper
Setting config.pretraining_tp to a value different than 1 will activate the more accurate but slower computation of the linear layers, which should better match the original logits.
The original model uses pad_id = -1 which means that there is no padding token. We can’t have the same logic, make sure to add a padding token using tokenizer.add_special_tokens({"pad_token":"<pad>"}) and resize the token embedding accordingly. You should also set the model.config.pad_token_id. The embed_tokens layer of the model is initialized with self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.config.padding_idx), which makes sure that encoding the padding token will output zeros, so passing it when initializing is recommended.
After filling out the form and gaining access to the model checkpoints, you should be able to use the already converted checkpoints. Otherwise, if you are converting your own model, feel free to use the conversion script. The script can be called with the following (example) command:

python src/transformers/models/llama/convert_llama_weights_to_hf.py \
    --input_dir /path/to/downloaded/llama/weights --model_size 7B --output_dir /output/path

After conversion, the model and tokenizer can be loaded via:

from transformers import LlamaForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("/output/path")
model = LlamaForCausalLM.from_pretrained("/output/path")

Note that executing the script requires enough CPU RAM to host the whole model in float16 precision (even if the biggest versions come in several checkpoints they each contain a part of each weight of the model, so we need to load them all in RAM). For the 75B model, it’s thus 145GB of RAM needed.

The LLaMA tokenizer is a BPE model based on sentencepiece. One quirk of sentencepiece is that when decoding a sequence, if the first token is the start of the word (e.g. “Banana”), the tokenizer does not prepend the prefix space to the string.
When using Flash Attention 2 via attn_implementation="flash_attention_2", don’t pass torch_dtype to the from_pretrained class method and use Automatic Mixed-Precision training. When using Trainer, it is simply specifying either fp16 or bf16 to True. Otherwise, make sure you are using torch.autocast. This is required because the Flash Attention only support fp16 and bf16 data type.

Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with LLaMA2. If you’re interested in submitting a resource to be included here, please feel free to open a Pull Request and we’ll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

Llama 2 is here - get it on Hugging Face, a blog post about Llama 2 and how to use it with 🤗 Transformers and 🤗 PEFT.
LLaMA 2 - Every Resource you need, a compilation of relevant resources to learn about LLaMA 2 and how to get started quickly.

Text Generation

A notebook on how to fine-tune Llama 2 in Google Colab using QLoRA and 4-bit precision. 🌎
A notebook on how to fine-tune the “Llama-v2-7b-guanaco” model with 4-bit QLoRA and generate Q&A datasets from PDFs. 🌎

Text Classification

A notebook on how to fine-tune the Llama 2 model with QLoRa, TRL, and Korean text classification dataset. 🌎🇰🇷

⚗️ Optimization

Fine-tune Llama 2 with DPO, a guide to using the TRL library’s DPO method to fine tune Llama 2 on a specific dataset.
Extended Guide: Instruction-tune Llama 2, a guide to training Llama 2 to generate instructions from inputs, transforming the model from instruction-following to instruction-giving.
A notebook on how to fine-tune the Llama 2 model on a personal computer using QLoRa and TRL. 🌎

⚡️ Inference

A notebook on how to quantize the Llama 2 model using GPTQ from the AutoGPTQ library. 🌎
A notebook on how to run the Llama 2 Chat Model with 4-bit quantization on a local computer or Google Colab. 🌎

🚀 Deploy

Fine-tune LLaMA 2 (7-70B) on Amazon SageMaker, a complete guide from setup to QLoRA fine-tuning and deployment on Amazon SageMaker.
Deploy Llama 2 7B/13B/70B on Amazon SageMaker, a guide on using Hugging Face’s LLM DLC container for secure and scalable deployment.

Transformers

Llama2

Overview

Usage tips

Resources

LlamaConfig

class transformers.LlamaConfig

LlamaTokenizer

class transformers.LlamaTokenizer

build_inputs_with_special_tokens

get_special_tokens_mask

create_token_type_ids_from_sequences

save_vocabulary

LlamaTokenizerFast

class transformers.LlamaTokenizerFast

build_inputs_with_special_tokens

get_special_tokens_mask

create_token_type_ids_from_sequences

update_post_processor

save_vocabulary

LlamaModel

class transformers.LlamaModel

forward

LlamaForCausalLM

class transformers.LlamaForCausalLM

forward

LlamaForSequenceClassification

class transformers.LlamaForSequenceClassification

forward