Fine-tune and Test Llama-3 8B on AWS Trainium

Note: The complete script for this tutorial can be downloaded here.

This tutorial will teach you how to fine-tune open source LLMs like Llama 3 on AWS Trainium. In our example, we are going to leverage the Optimum Neuron, Transformers and Datasets libraries.

You will learn how to:

Setup AWS Environment
Load and process the dataset
Fine-tune Llama on AWS Trainium using the NeuronTrainer
Launch Training
Evaluate and test fine-tuned Llama model

While we will use Llama-3 8B in this tutorial, it is completely possible to use other models, simply by swtiching the model_id. For instance, it is possible to fine-tune:

Mistral models, such as Mistral 7b (mistralai/Mistral-7B-Instruct-v0.3)
Llama-2 models, such as Llama-2 7b (meta-llama/Llama-2-7b-hf)

And many others!

1. Setup AWS Environment

Before starting this tutorial, you will need to setup your environment:

Create an AWS Trainium instance. You will need a trn1.32xlarge, which contains 16 Neuron Devices. You can follow this guide to create one.
Make sure you are logged in on the Hugging Face Hub:

huggingface-cli login --token YOUR_TOKEN

Check that you have access to the model. Some open source models are gated, meaning that users need to apply to the model owner to be able to use the model weights. Here we will be training Llama-3 8B, for which there are two possibilities:

The official gated repo: meta-llama/Meta-Llama-3-8B
The non-official un-gated repo: NousResearch/Meta-Llama-3-8B

Clone the Optimum Neuron repository, which contains the complete script described in this tutorial:

git clone https://github.com/huggingface/optimum-neuron.git

2. Load and prepare the dataset

For this tutorial, we will use Dolly, an open source dataset of instruction-following records on categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.

Example:

{
  "instruction": "What is world of warcraft",
  "context": "",
  "response": (
        "World of warcraft is a massive online multi player role playing game. "
        "It was released in 2004 by blizarre entertainment"
    )
}

We can use the load_dataset() method from the 🤗 Datasets library to load the dolly dataset very easily.

from datasets import load_dataset
from random import randrange

# Load dataset from the hub
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

print(f"dataset size: {len(dataset)}")
print(dataset[randrange(len(dataset))])
# dataset size: 15011

To instruct tune our model we need to convert our structured examples into a collection of tasks described via instructions. We define a format_dolly that takes a raw sample and returns a string with our format instruction.

def format_dolly(sample):
    instruction = f"### Instruction\n{sample['instruction']}"
    context = f"### Context\n{sample['context']}" if len(sample["context"]) > 0 else None
    response = f"### Answer\n{sample['response']}"
    # join all the parts together
    prompt = "\n\n".join([i for i in [instruction, context, response] if i is not None])
    return prompt

In addition to formatting our samples, we also want to pack multiple samples to one sequence to have a more efficient training. In other words, we are stacking multiple samples to one sequence and split them with an EOS Token. Packing/stacking samples can be done during training or before.

The following function pack_dataset takes a dataset and a chunk_length and returns a packed dataset:

from functools import partial
from itertools import chain

# empty list to save remainder from batches to use in next batch
remainder = {"input_ids": [], "attention_mask": [], "token_type_ids": []}

def pack_dataset(dataset, chunk_length=2048):
    print(f"Chunking dataset into chunks of {chunk_length} tokens.")

    def chunk(sample, chunk_length=chunk_length):
        # define global remainder variable to save remainder from batches to use in next batch
        global remainder
        # Concatenate all texts and add remainder from previous batch
        concatenated_examples = {k: list(chain(*sample[k])) for k in sample.keys()}
        concatenated_examples = {k: remainder[k] + concatenated_examples[k] for k in concatenated_examples.keys()}
        # get total number of tokens for batch
        batch_total_length = len(concatenated_examples[list(sample.keys())[0]])

        # get max number of chunks for batch
        if batch_total_length >= chunk_length:
            batch_chunk_length = (batch_total_length // chunk_length) * chunk_length

        # Split by chunks of max_len.
        result = {
            k: [t[i : i + chunk_length] for i in range(0, batch_chunk_length, chunk_length)]
            for k, t in concatenated_examples.items()
        }
        # add remainder to global variable for next batch
        remainder = {k: concatenated_examples[k][batch_chunk_length:] for k in concatenated_examples.keys()}
        # prepare labels
        result["labels"] = result["input_ids"].copy()
        return result

    # tokenize and chunk dataset
    lm_dataset = dataset.map(
        partial(chunk, chunk_length=chunk_length),
        batched=True,
    )
    print(f"Total number of samples: {len(lm_dataset)}")
    return lm_dataset

To summarize to prepare our dataset we will:

Format our samples using the template method and add an EOS token at the end of each sample
Tokenize our dataset to convert it from text to tokens
Pack our dataset to 2048 tokens

from transformers import AutoTokenizer
from random import randint

# Hugging Face Hub model id 
# model_id = "meta-llama/Meta-Llama-3-8B" # gated
model_id = "NousResearch/Meta-Llama-3-8B" # ungated

tokenizer = AutoTokenizer.from_pretrained(model_id)

# template dataset to add prompt to each sample
def template_dataset(sample):
    sample["text"] = f"{format_dolly(sample)}{tokenizer.eos_token}"
    return sample

# apply prompt template per sample
dataset = dataset.map(template_dataset, remove_columns=list(dataset.features))

# print random sample
print(dataset[randint(0, len(dataset))]["text"])

# tokenize dataset
dataset = dataset.map(
    lambda sample: tokenizer(sample["text"]), batched=True, remove_columns=list(dataset.features)
)

# chunk dataset
lm_dataset = pack_dataset(dataset, chunk_length=2048) # We use 2048 as the maximum length for packing

3. Fine-tune Llama on AWS Trainium using the NeuronTrainer

Normally you would use the Trainer and TrainingArguments classes to fine-tune PyTorch-based transformer models.

But together with AWS, we have developed the [~optimum.neuron.NeuronTrainer] to improve performance, robustness, and ease-of-use when training on Trainium instances. It can be used as a 1-to-1 replacement for the Trainer.

Since Llama-3 8B is a big model it will not fit on a single Neuron core, we need distributed training. In Optimum Neuron we support:

ZeRO-1: It is an optimization of data-parallelism which consists in sharding the optimizer state (which usually represents half or more of the memory needed on the device) over the data-parallel ranks.
Tensor Parallelism: It is a technique which consists in sharding each of your model matrix-multiplications along a given axis (row or column) on multiple devices. It also known as intra-layer model parallelism. The number of devices to shard your parameters on is called the tensor_parallel_size.
Sequence parallelism: It is an optimization over Tensor Parallelism which shards the activations on the sequence axis outside of the tensor parallel regions. It is useful because it saves memory by sharding the activations.
Pipeline Parallelism: It consists in sharding the model block layers on multiple devices. It is also known as inter-layer model parallelism. The number of devices to shard your layers on is called the pipeline_parallel_size.

If you want to know more about distributed training you can take a look at the documentation.

Here, since we want to fine-tune an 8B model, we will not need to use pipeline parallelism. Our training code will look as follows:

from optimum.neuron import NeuronTrainer as Trainer
from optimum.neuron.distributed import lazy_load_for_parallelism

# Define the tensor_parallel_size
tensor_parallel_size = 8

# Load model from the Hugging face Hub 
with lazy_load_for_parallelism(tensor_parallel_size=tensor_parallel_size):
    model = AutoModelForCausalLM.from_pretrained(model_id)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=dataset,
    data_collator=default_data_collator,  # no special collator needed since we stacked the dataset
)

# Start training
trainer.train()

trainer.save_model()  # saves the tokenizer too for easy upload

The key points here are:

We use the lazy_load_for_parallelism context manager to lazily load the model. This will not load the full model weights on each worker, but instead only load the required weights (sharded or full). This is much more memory efficient, and often mandatory to use.
We use the [~optimum.neuron.NeuronTrainer] to perform training. It will take the lazily loaded model, along with the training_args, which are an instance of [~optimum.neuron.NeuronTrainingArguments], and will handle all the parallelization and training on the Neuron cores.

4. Launch Training

We prepared a script called finetune_llm.py summing up everything mentioned in this tutorial.

This script is a minimalistic version of our official example training script to run causal language modeling fine-tuning, called run_clm.py. For the sake of this tutorial, we tried to get rid of anything that is not necessary, and added the formatting step necessary for fine-tuning, but if you want to do more custom things, maybe the solution is already implemented in run_clm.py!

Also, these scripts are more designed as templates than final scripts. Feel free to take finetune_llm.py or run_clm.py and adapt them to your own needs!

PyTorch Neuron uses torch_xla. It evaluates operations lazily during execution of the training loops, which means it builds a symbolic graph in the background and the graph is executed on the hardware only when the tensor is printed, transfered to CPU, or xm.mark_step() is called. During execution, multiple graphs can be build depending on control-flow and it can take time to compile each graph sequentially. To alleviate that, the Neuron SDK provides neuron_parallel_compile, a tool which performs a fast trial run that builds all the graphs and compile them in parallel. This step is usually called precompilation.

Precompilation

When training models on AWS Trainium we first need to compile our model with our training arguments.

To overcome this, we added a model cache repository, which allows us to use precompiled models from the Hugging Face Hub to skip the compilation step. But be careful: every change in the model configuration might lead to a new compilation, which could result in some cache misses.

Note: If your model configuration is not cached please open an issue on Github, we are happy to include it.

The compilation command simply consists in calling your script as an input to the neuron_parallel_compile utility:

MALLOC_ARENA_MAX=64 XLA_USE_BF16=1 neuron_parallel_compile torchrun --nproc_per_node=32 finetune_llm.py \
 --model_id meta-llama/Meta-Llama-3-8B \
 --bf16 True \
 --learning_rate 5e-5 \
 --output_dir dolly_llama \
 --overwrite_output_dir True \
 --per_device_train_batch_size 1 \
 --gradient_accumulation_steps 16 \
 --gradient_checkpointing True \
 --tensor_parallel_size 8 \
 --max_steps 10 \
 --logging_steps 10

Make sure to run this precompilation phase for around 10 training steps. It is usually enough to accumulate and compile all the graphs that will be needed during the actual training.

Note: Compiling without a cache can take a while. It will also create dummy files in the dolly_llama_sharded during compilation you will have to remove them afterwards. We also need to add MALLOC_ARENA_MAX=64 to limit the CPU allocation to avoid potential crashes, don’t remove it for now.

# remove dummy artifacts which are created by the precompilation command
rm -rf dolly_llama

Actual Training

After compilation is done we can start our actual training with a similar command, we just need to remove the use of neuron_parallel_compile.

We will use torchrun to launch our training script. torchrun is a tool that automatically distributes a PyTorch model across multiple accelerators. We can pass the number of accelerators as nproc_per_node arguments alongside our hyperparameters.

The difference to the compilation command is that we changed from max_steps=10 to num_train_epochs=3.

Launch the training, with the following command.

MALLOC_ARENA_MAX=64 XLA_USE_BF16=1 torchrun --nproc_per_node=32 finetune_llm.py \
 --model_id meta-llama/Meta-Llama-3-8B \
 --bf16 True \
 --learning_rate 5e-5 \
 --output_dir dolly_llama \
 --overwrite_output_dir True \
 --skip_cache_push True \
 --per_device_train_batch_size 1 \
 --gradient_accumulation_steps 16 \
 --gradient_checkpointing True \
 --tensor_parallel_size 8 \
 --num_train_epochs 3 \
 --logging_steps 10

That’s it, we successfully trained Llama-3 8B on AWS Trainium!

But before we can share and test our model we need to consolidate our model. Since we used Tensor Parallelism during training, we saved sharded versions of the checkpoints. We need to consolidate them now.

Consolidate the Checkpoint

The Optimum CLI provides a way of doing that very easily via the optimum neuron consolidate [sharded_checkpoint] [output_dir] command:

optimum-cli neuron consolidate dolly_llama dolly_llama

5. Evaluate and test fine-tuned Llama model

As for training, to be able to run inference on AWS Trainium or AWS Inferentia2 we need to compile our model. In this case, we will use our Trainium instance for the inference test, but we recommend customer to switch to Inferentia2 for inference.

Optimum Neuron implements similar to Transformers AutoModel classes for easy inference use. We will use the NeuronModelForCausalLM class to load our vanilla transformers checkpoint and convert it to neuron.

from optimum.neuron import NeuronModelForCausalLM
from transformers import AutoTokenizer

compiler_args = {"num_cores": 2, "auto_cast_type": 'fp16'}
input_shapes = {"batch_size": 1, "sequence_length": 2048}

tokenizer = AutoTokenizer.from_pretrained("dolly_llama")
model = NeuronModelForCausalLM.from_pretrained(
        "dolly_llama",
        export=True,
        **compiler_args,
        **input_shapes)

Note: Inference compilation can take ~25minutes. Luckily, you need to only run this onces. Since you can save the model afterwards. If you are going to run on Inferentia2 you need to recompile again. The compilation is parameter and hardware specific.

# COMMENT IN if you want to save the compiled model
# model.save_pretrained("compiled_dolly_llama")

We can now test inference, but have to make sure we format our input to our prompt format we used for fine-tuning. Therefore we created a helper method, which accepts a dict with our instruction and optionally a context.

def format_dolly_inference(sample):
    instruction = f"### Instruction\n{sample['instruction']}"
    context = f"### Context\n{sample['context']}" if "context" in sample else None
    response = f"### Answer\n"
    prompt = "\n\n".join([i for i in [instruction, context, response] if i is not None])
    return prompt


def generate(sample):
    prompt = format_dolly_inference(sample)
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.9,
        top_k=50,
        top_p=0.9
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=False)[len(prompt):]

Let’s test inference. First we test without a context.

Note: Inference is not expected to be super fast on AWS Trainium using 2 cores. For Inference we recommend using Inferentia2.

prompt = {
  "instruction": "Can you tell me something about AWS?"
}
res = generate(prompt)

print(res)

AWS stands for Amazon Web Services. AWS is a suite of remote computing services offered by Amazon. The most widely used of these include Amazon Elastic Compute Cloud (Amazon EC2), which provides resizable compute capacity in the cloud; Amazon Simple Storage Service (Amazon S3), which is an object storage service; and Amazon Elastic Block Store (Amazon EBS), which is designed to provide high performance, durable block storage volumes for use with AWS instances. AWS also provides other services, such as AWS Identity and Access Management (IAM), a service that enables organizations to control access to their AWS resources, and AWS Key Management Service (AWS KMS), which helps customers create and control the use of encryption keys.

That looks correct. Now, lets add some context, e.g. as you would do for RAG applications:

prompt = {
  "instruction": "How can I train models on AWS Trainium?",
  "context": "🤗 Optimum Neuron is the interface between the 🤗 Transformers library and AWS Accelerators including [AWS Trainium](https://aws.amazon.com/machine-learning/trainium/?nc1=h_ls) and [AWS Inferentia](https://aws.amazon.com/machine-learning/inferentia/?nc1=h_ls). It provides a set of tools enabling easy model loading, training and inference on single- and multi-Accelerator settings for different downstream tasks."
}
res = generate(prompt)

print(res)

You can use the Optimum Neuron interface to train models on AWS Trainium.

Awesome, our model also correctly uses the provided context. We are done. Congrats on fine-tuning Llama on AWS Trainium.

AWS Trainium & Inferentia