Phinetuning 2.0

Community Article Published January 31, 2024

Meet Phi-2, Microsoft’s newly released small model, remarkably powerful yet compact. This tutorial will guide you through fine-tuning Phi-2, demonstrating how to build a unique dataset and fine-tune the model using QLoRA.

Image: “Small Greek robots hatching out of a Microsoft egg”, by DALL-E

In this tutorial, we are looking at Microsoft’s Phi-2, a model with only 2.7 billion parameters. Despite its smaller size compared to some of its contemporaries, Phi-2 impresses with its remarkable performance.

We focus on how to fine-tune Phi-2 using QLoRA and synthetic data. This tutorial is divided into two parts:

  • Create a synthetic dataset from a seed of instructions (Part 1)
  • Fine-tune Phi-2 using QLoRA (Part 2)

As always, simple and minimalistic Python code is provided.

Prerequisites

Before we start, you will need the latest Hugging Face packages. For reference, these are the specific versions that were used to put together this tutorial:

accelerate==0.26.1
bitsandbytes==0.42.0
datasets==2.16.1
peft==0.8.1
transformers==4.37.2
einops==0.7.0
torch==2.1.0

Part 1: Synthetic dataset

In this part, we focus on crafting a dataset to fine-tune Phi-2, aiming to build a chat/instruct-model. We start with an existing dataset and transform it into a series of engaging dialogues.

Initial Dataset: A Collection of Riddles

Some time ago I stumbled upon this dataset: Hypersniper/riddles_v1. It’s a collection of riddles, each paired with a clever, step-by-step solution provided by GPT-4.

This dataset will be our seed. The goal is to use a(nother) powerful language model to transform these ~470 riddles into actual conversations like this:

user: Asks a question.

assistant: Provides an answer with a detailed explanation and possibly a counter-question.

user: Follows up with another question.

assistant: Responds with an answer.

Let’s walk through the process, with code and comments. All the code for Part 1 is available in a notebook.

Load dataset

First, let’s load the dataset containing the riddles:

from datasets import load_dataset  
  
dataset_name="Hypersniper/riddles_v1"  
dataset = load_dataset(dataset_name, split="all")

This is what the data looks like:

{  
  "instruction": "What is pronounced like one letter, written with three   
letters. And belongs to all animals?",  
  "output": "The answer to this question is \"eye.\" Let's break down the   
logic behind this:\n\n1. Pronounced like one letter: The word \"eye\" is   
pronounced like the letter \"I.\" This is a common type of word play in   
English where a word sounds like another word or a letter.\n\n2. Written   
with three letters: The word \"eye\" is indeed spelled with three letters:   
E, Y, and E.\n\n3. Belongs to all animals: All animals, including humans,   
have eyes. They are an essential part of most animals' anatomy and serve   
the critical function of allowing them to see.\n\nSo, by considering these   
three clues together, we can deduce that the answer to the riddle is \"eye.\"",  
  "answer": "eye"  
}

The problem: this dataset is rather small, and each entry is only a single turn, a single question and a single answer.

Turning riddles into conversations

To expand the dataset and turn it into conversations we use one of the smartest LLMs around: mistralai/Mistral-7B-Instruct-v0.2. The process involves:

  1. Generate new riddles: prompt Mistral with 10 riddles and ask it to generate 10 more.
  2. Answer these synthetic riddles.
  3. Formulate a follow-up question as if from a human user.
  4. Generate a response to this follow-up question.

Step 0: Load a smart LLM

from transformers import AutoModelForCausalLM, AutoTokenizer  
import torch  
  
model = AutoModelForCausalLM.from_pretrained(  
    "mistralai/Mistral-7B-Instruct-v0.2",  
    device_map="auto",  
    torch_dtype=torch.bfloat16,  
)  
  
tokenizer = AutoTokenizer.from_pretrained("models/Mistral-7B-Instruct-v0.2")

Step 1: Expand riddles

We prompt mistralai/Mistral-7B-Instruct-v0.2 to generate new riddles based on the existing ones. This is done by feeding a set of 10 riddles into Mistral and asking it to create 10 more.

import random  
  
questions=[ q for q in dataset["instruction"] ]  
  
prompt_template=""""Below are 10 riddles. Come up with 10 more.   
Output just the riddles, no numbering. Don't output anything else.  
  
Riddles:  
{questions}"""  
  
synthetic_riddles = []  
  
# Ask Mistral 300 times = 3,000 new riddles  
for _ in range(300):  
    # Pick 10 random questions to include in prompt  
    random.shuffle(questions)  
    q10_sample = questions[0:10]  
      
    # Put 10 questions into prompt template = prompt  
    prompt=prompt_template.format( questions="\n\n".join(q10_sample) )  
    messages = [{"role": "user", "content": prompt}]  
      
    # Apply Mistral chat format to prompt, tokenize, generate  
    input_tokens = tokenizer.apply_chat_template(
        messages, 
        return_tensors="pt"
        ).to("cuda")  
        
    output_tokens = model.generate(  
        input_tokens,   
        max_new_tokens = 500,   
        do_sample = True,   
        pad_token_id = tokenizer.eos_token_id
        )  
  
    output_tokens = output_tokens[0][len(input_tokens[0]):] # Cut prompt from output  
    output = tokenizer.decode(output_tokens, skip_special_tokens=True)  
  
    synthetic_riddles.extend( output.split("\n") )

Examples of riddles generated by Mistral:

A bag contains apples, some red, some green. If you reach in without looking, and pick one apple, what is the chance that it was ripe?

A bath can’t give you one, a river can’t float one, but what finds it in the hot desert?

A big tree was once a little seed. A little bird was once a little egg. What was Alice before she was a girl?

Note: these are neither cherry- nor lemon-picked; they are simply the first three generated riddles, sorted alphabetically.

With this step, Mistral generates ~3,000 riddles. These synthetic riddles are pretty “dirty”, however. After cleaning (~1k riddles trashed) and deduplication (~300 duplicates removed), we are left with 1,682 new riddles for the next step. The basic code for data cleaning can be found in the previously mentioned notebook.
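
The cleaning itself is simple string filtering. Below is a minimal, illustrative sketch of what such a cleanup could look like; only the synthetic_riddles list from above and the "messages" structure used in the following steps are taken from the tutorial, everything else (the thresholds, the question-mark filter) is an assumption, the actual rules live in the notebook:

# minimal, illustrative cleanup: strip numbering and whitespace, drop lines that
# don't look like riddles, deduplicate, then wrap each riddle in the "messages"
# structure used by the following steps
import re

def clean(riddle):
    riddle = re.sub(r"^\s*\d+[\.\)]\s*", "", riddle)   # drop leading numbering, if any
    return riddle.strip()

cleaned = [clean(r) for r in synthetic_riddles]
cleaned = [r for r in cleaned if 20 < len(r) < 300 and r.endswith("?")]   # crude quality filter
cleaned = list(dict.fromkeys(cleaned))                                    # deduplicate, keep order

# one dict per conversation; "messages" grows from [riddle] to
# [riddle, answer, follow-up question, answer] in Steps 2-4
synthetic_riddles_dict = [{"number": i, "messages": [r]} for i, r in enumerate(cleaned)]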

Use GPT-4 (or any other model) for dataset expansion

We used Mistral to expand the riddles, but any other smart model like Mixtral or the OpenAI API could be used to do this.

To use GPT-4 for example, a function like ask_GPT4 could be plugged into the code above and used instead of generate:

from openai import OpenAI  
  
client = OpenAI()  
def ask_GPT4(prompt):  
    completion = client.chat.completions.create(  
      model = "gpt-4",  
      messages = [ {"role": "user", "content": prompt} ]  
    )  
    return completion.choices[0].message.content  
  
ask_GPT4(prompt)

Step 2: Generate answers to synthetic riddles

Next, we use Mistral to generate answers to our new riddles. This step completes the first turn of the conversation we are building.

import copy   
  
prompt_template = """"{riddle}  
  
Think step-by-step, keep your explanations simple, try your very best.   
If there is information missing for you to come up with a specific   
answer, just ask me a short question at the very end of your answer."""  
  
# copy the list of riddle dicts to a new one which will also hold the generated answers   
synthetic_riddles_step2 = copy.deepcopy(synthetic_riddles_dict)  
  
for riddle in synthetic_riddles_step2:  
    # format prompt using the template, insert the riddle  
    prompt = prompt_template.format( riddle=riddle["messages"][0] )  
  
    # apply Mistral prompt format and tokenize  
    messages = [{"role": "user", "content": prompt}]      
    input_tokens = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")  
  
    # generate. 500 output tokens are enough for the average answer length of Mistral  
    output_tokens = model.generate(  
        input_tokens,   
        max_new_tokens = 500,   
        do_sample = True,   
        pad_token_id = tokenizer.eos_token_id)  
  
    output_tokens = output_tokens[0][len(input_tokens[0]):]  
    output = tokenizer.decode(output_tokens, skip_special_tokens=True)  
  
    # append answer to each conversation  
    riddle["messages"].append(output)

This yields something like the following (previous examples continued):

user: A bag contains apples, some red, some green. If you reach in without looking, and pick one apple, what is the chance that it was ripe?
assistant: The information given does not specify anything about the ripeness of the apples being red or green. Therefore, we cannot determine the chance that a picked apple is ripe based on its color alone. Additionally, ripeness is a subjective factor that can vary from apple to apple even within the same color group. So, we would need more information, such as the definition of ripeness used or any context about the storage conditions of the apples, to provide an answer.

user: A bath can’t give you one, a river can’t float one, but what finds it in the hot desert?
assistant: The answer to the riddle is a shadow. A shadow cannot exist in the absence of sunlight, which is typically found in hot desert climates. Shadows cannot be physically created by a bath or a river, as they result from the blocking of light by an object.

user: A big tree was once a little seed. A little bird was once a little egg. What was Alice before she was a girl?
assistant: Alice was a child before she was a girl. However, I interpret your question as asking what Alice was before she grew up to be a girl, not just before she was labeled as a girl. If Alice was a character in a story or in real life, she might have been a baby before she was a girl. But if you meant something else by ”Alice”, feel free to ask for clarification.

These are pretty good answers!

What is still missing for a complete conversation? We want to build a multi-turn dataset. To keep things simple, we will generate two turns for each conversation. There’s still one follow-up question (by the user) and one answer (by the LLM) missing to complete the second turn.

Step 3: Generate a human follow-up question

If we present the conversation we have so far to Mistral and ask for a “smart” follow-up question, it tries to be very smart and starts explaining things, much like an assistant and not like a human user. We don’t want that.

We want simple, almost stupid, questions. This is the prompt that works best, asking Mistral to imitate a curious child:

import copy   
  
prompt_template = """"Please continue the converstation below. Provide   
the next reply by the user. Formulate a very short question.   
Imitate a curious 10 year old kid asking a question.  
  
user: {question}  
assistant: {answer}"""  
  
# copy the list of conversations to a new one which will also hold the follow-up questions  
synthetic_riddles_step3 = copy.deepcopy(synthetic_riddles_step2)  
  
for riddle in synthetic_riddles_step3:  
    # format prompt using the template, insert the conversation we have so far  
    prompt = prompt_template.format(   
        question = riddle["messages"][0],  
        answer = riddle["messages"][1]   
    )  
    messages = [ {"role": "user", "content": prompt} ]  
      
    input_tokens = tokenizer.apply_chat_template(
        messages, 
        return_tensors="pt"
        ).to("cuda")  
  
    output_tokens = model.generate(  
        input_tokens,   
        max_new_tokens = 500,   
        do_sample = True,   
        pad_token_id = tokenizer.eos_token_id
        )  
        
    output_tokens = output_tokens[0][len(input_tokens[0]):]  
    output = tokenizer.decode(output_tokens, skip_special_tokens = True)  
  
    riddle["messages"].append(output)

Apparently this is how Mistral thinks a 10-year-old child talks:

Cool! So they eat only what, in the garden?

Cool! So which bird is the fifth President of the United States symbolically linked to?

Cool! So you're like a container that helps make cookies, right? But what about those tiny ones for tea or soup? What are they called?

Cool! So, like, if I share my toys with my friends, they might share their toys with me too, right?

Cool! So, what can make you laugh but also makes you cry?

Cool! So, what happens if we don't drink enough water?

Cool! So, what other mythical creatures talk without using a mouth?

Cool! What about a pencil? It goes in your hand and gets stronger too, right?

Cool! What does lead look like when it's not all maimed and impure?

Example conversations generated so far:

user: A bag contains apples, some red, some green. If you reach in without looking, and pick one apple, what is the chance that it was ripe?
assistant: The information given does not specify anything about the ripeness of the apples being red or green. Therefore, we cannot determine the chance that a picked apple is ripe based on its color alone. Additionally, ripeness is a subjective factor that can vary from apple to apple even within the same color group. So, we would need more information, such as the definition of ripeness used or any context about the storage conditions of the apples, to provide an answer.
user: But how can we tell if it’s more likely to be a ripe apple if it’s red, since sometimes red apples aren’t ripe yet?

user: A bath can’t give you one, a river can’t float one, but what finds it in the hot desert?
assistant: The answer to the riddle is a shadow. A shadow cannot exist in the absence of sunlight, which is typically found in hot desert climates. Shadows cannot be physically created by a bath or a river, as they result from the blocking of light by an object.
user: So you’re saying a shadow is something that finds sunshine in the desert? Cool!

user: A big tree was once a little seed. A little bird was once a little egg. What was Alice before she was a girl?
assistant: Alice was a child before she was a girl. However, I interpret your question as asking what Alice was before she grew up to be a girl, not just before she was labeled as a girl. If Alice was a character in a story or in real life, she might have been a baby before she was a girl. But if you meant something else by ”Alice”, feel free to ask for clarification.
user: What was Alice like when she was really, really small?

Now on to the next step: Completing the second turn.

Step 4: Generate answer to the follow-up question

from tqdm import tqdm  
import copy   
  
synthetic_riddles_step4 = copy.deepcopy(synthetic_riddles_step3)  
  
for riddle in tqdm(synthetic_riddles_step4):  
  
    # this time no prompt, just apply the Mistral chat template   
    # to the three messages we generated so far   
    messages = [  
        {"role": "user", "content": riddle["messages"][0]},  
        {"role": "assistant", "content": riddle["messages"][1]},  
        {"role": "user", "content": riddle["messages"][2]},  
    ]  
      
  
    input_tokens = tokenizer.apply_chat_template(
        messages, 
        return_tensors = "pt"
        ).to("cuda")  
        
    output_tokens = model.generate(  
        input_tokens,   
        max_new_tokens = 500,   
        do_sample = True,   
        pad_token_id = tokenizer.eos_token_id
        )  
        
    output_tokens = output_tokens[0][len(input_tokens[0]):]  
    output = tokenizer.decode(output_tokens, skip_special_tokens = True)  
  
    riddle["messages"].append(output)

Example conversations:

user: A bag contains apples, some red, some green. If you reach in without looking, and pick one apple, what is the chance that it was ripe?
assistant: The information given does not specify anything about the ripeness of the apples being red or green. Therefore, we cannot determine the chance that a picked apple is ripe based on its color alone. Additionally, ripeness is a subjective factor that can vary from apple to apple even within the same color group. So, we would need more information, such as the definition of ripeness used or any context about the storage conditions of the apples, to provide an answer.

user: But how can we tell if it’s more likely to be a ripe apple if it’s red, since sometimes red apples aren’t ripe yet?
assistant: You are correct that the color of an apple (being red) does not necessarily indicate that it is ripe. Some apple varieties, such as Granny Smith apples, remain green even when fully ripe. Therefore, the color of an apple alone cannot be used to determine with certainty whether it is ripe or not.
To judge the ripeness of an apple, other factors must be taken into consideration, such as its texture, firmness, and the give when gently pressed. A ripe apple usually has a uniform, unblemished skin, and yields slightly to gentle pressure. The fruit should also have a sweet aroma.
So, while a red apple may be more likely to be ripe than, say, a green apple that is still unripe, there is still no foolproof way to predict ripeness based solely on color. Ultimately, the only accurate way to determine if an apple is ripe is to examine it for the above-mentioned signs of ripeness.

user: A bath can’t give you one, a river can’t float one, but what finds it in the hot desert?
assistant: The answer to the riddle is a shadow. A shadow cannot exist in the absence of sunlight, which is typically found in hot desert climates. Shadows cannot be physically created by a bath or a river, as they result from the blocking of light by an object.
user: So you’re saying a shadow is something that finds sunshine in the desert? Cool!
assistant: Yes, that’s correct! A shadow is an area or shape where light from a source is blocked, creating a darkened or depressed area. In this riddle, the desert was specified as a hot location, which could suggest that the sun is particularly intense there. Shadows are dependent on sunlight and therefore, they can be found in the desert, even though the desert itself cannot provide a shadow.

user: A big tree was once a little seed. A little bird was once a little egg. What was Alice before she was a girl?
assistant: Alice was a child before she was a girl. However, I interpret your question as asking what Alice was before she grew up to be a girl, not just before she was labeled as a girl. If Alice was a character in a story or in real life, she might have been a baby before she was a girl. But if you meant something else by ”Alice”, feel free to ask for clarification.
user: What was Alice like when she was really, really small?
assistant: If ”Alice” refers to a person, it is not possible for me to know exactly what she was like when she was really, really small, as I do not have access to her personal history or experiences. However, when a person is really, really small, they are typically referred to as a baby. Babies are not yet capable of complex thoughts or actions, and they rely on others for their basic needs such as food, shelter, and care. They communicate primarily through crying, cooing, and other nonverbal means, and they explore their environment through touch, taste, smell, and sight. Babies develop and grow rapidly during this stage of their life, and they eventually learn to crawl, walk, talk, and become independent. If ”Alice” is a character in a story or a figment of someone’s imagination, her early life may be described in different ways based on the intentions of the author or creator.

These conversations exemplify the depth and interactivity we’re aiming for in our chatbot’s responses.

In conclusion, Part 1 of this tutorial demonstrated how to use Mistral to evolve a simple riddle dataset into a two-turn conversational dataset. The dataset can be found on Hugging Face at g-ronimo/riddles_evolved. We ended up with only ~1.7k conversations, but despite this rather small size, the dataset led to good results (see Part 2: Training). However, we did not apply any quality control or filtering; this dataset, as always, would benefit from additional curation.

Seed your conversations with other kinds of data

We used a seed of a few hundred riddles; other instruction or question-answering datasets could be used as a base in the same way.

Part 2: Fine-tune Phi-2

In this part, we fine-tune Phi-2 using our newly created synthetic dataset.
The process involves:

  1. Load and prepare model and tokenizer for ChatML format
  2. Attach LoRA adapters to the model
  3. Load and tokenize dataset
  4. Set hyperparameters and train

We will conclude this second part with an analysis of the training and model output. All the code for Part 2 is available in a notebook.

Load 4-bit model and tokenizer

import torch  
from transformers import AutoModelForCausalLM, BitsAndBytesConfig  
  
# Load model
modelpath = "microsoft/phi-2"
model = AutoModelForCausalLM.from_pretrained(
    modelpath,    
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_quant_type="nf4",
    ),
    torch_dtype=torch.bfloat16,
    # FA2 does not work yet
    # attn_implementation="flash_attention_2",          
)

Note: Flash attention 2 saves VRAM and accelerates training, but the current implementation for Phi-2 has issues. I expect this to be fixed soon.

Load the tokenizer and add the ChatML special tokens to model and tokenizer:

from transformers import AutoTokenizer  
  
# fast tokenizer sometimes ignores the added tokens  
tokenizer = AutoTokenizer.from_pretrained(modelpath, use_fast=False)      
  
# add special tokens for ChatML formatting and a pad token  
tokenizer.add_tokens(["<|im_start|>", "<PAD>"])
tokenizer.pad_token = "<PAD>"
tokenizer.add_special_tokens(dict(eos_token="<|im_end|>"))
model.config.eos_token_id = tokenizer.eos_token_id

Note: Resizing the token embeddings is not necessary in this case. Phi-2 comes with a vocab size of 51,200, while its tokenizer knows 50,295 tokens. This means you can add roughly 900 tokens to the tokenizer before having to resize the model's embeddings.
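
If you want to verify this for your own setup (a small sanity check that is not part of the original notebook), compare the tokenizer length with the size of the model's embedding matrix and resize only when the tokenizer outgrows it:

# resize only if the tokenizer now holds more tokens than the embedding matrix has rows
embedding_size = model.get_input_embeddings().weight.shape[0]   # 51200 for Phi-2
if len(tokenizer) > embedding_size:
    model.resize_token_embeddings(len(tokenizer))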

Prepare LoRA adapters

LoRA (Low-Rank Adaptation) is an efficient way to fine-tune large models. During training it updates only selected parts of the model, which speeds up the process and saves memory. For a deeper dive, check out these resources on LoRA in general and QLoRA in particular.

from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model  
  
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True) 

# Adapter settings
lora_config = LoraConfig(
    r=32, 
    lora_alpha=32, 
    target_modules = [ "q_proj", "k_proj", "v_proj", "dense" ],
    modules_to_save = ["lm_head", "embed_tokens"],
    lora_dropout=0.1, 
    bias="none", 
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

model.config.use_cache = False

  • target_modules: With the settings above we train only ~9.2% (283 million) of the model parameters. We could also train all linear layers, which in the case of Phi-2 would be the layers ["q_proj", "k_proj", "v_proj", "dense", "fc1", "fc2"]; this increases the number of trainable parameters to 10.0% (309 M) for the given rank. 🛠 Training all linear layers should increase the model's performance since it's closer to a full fine-tune, but it also requires more VRAM and increases the size of the checkpoints. (See the sketch after this list for a quick way to check the trainable-parameter count.)
  • rank: The rank in Low-Rank Adapters (LoRA) also influences the number of trainable parameters. A higher rank increases the size of the update matrices, which means more trainable parameters and greater model flexibility, but at the cost of increased computational complexity. Conversely, a lower rank results in fewer parameters, leading to more efficient training and less computational burden, but potentially less flexibility in adapting the model. Thus, the choice of rank represents a trade-off between model adaptability and the computational resources required for training.
    🛠 Increasing the rank from 32 to 64, for example, increases the number of trainable parameters to 9.8% (304 million) for the given target_modules.
  • lora_alpha: This is a scaling factor that adjusts the influence of the low-rank updates on the original weights of the model. It modulates how much the original behaviour of the model is altered. The LoRA paper states that “tuning alpha is roughly the same as tuning the learning rate”.
    🛠 There is no consensus on how to set lora_alpha in relation to rank (reddit, Farty Pants on medium, Platypus paper, Ahead of AI blog). One approach seems to be setting lora_alpha = rank which is what we use here.
  • lora_dropout: The dropout rate applied during training. A value of 0.1 means that 10% of the LoRA activations are randomly dropped (set to zero) in each forward pass, which should help the model generalize and prevent overfitting. 5% and 10% are common values; in my limited experience with this parameter, it does not matter much which one you pick.
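
To see how these choices affect the number of trainable parameters, PEFT can print it directly; this is how the percentages quoted above can be checked for your own configuration:

# report the number and share of parameters that will actually be trained
# (roughly 283M / ~9.2% with the LoRA settings above)
model.print_trainable_parameters()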

Load dataset

Now, let’s load the dataset we created in Part 1 and split it into a training and validation set.

from datasets import load_dataset  
  
# load the dataset created in Part 1  
dataset = load_dataset("g-ronimo/riddles_evolved")  
  
# split into training (90%) and test set (10%)  
dataset = dataset["train"].train_test_split(test_size=0.1)

Example entry:

{  
  'number': 1509,  
  'messages': [  
    "What starts with a C is covered with keys, but we use only one. What is it?",  
    "The answer to the riddle is a coconut. While a coconut is covered with many small keys or eyes, we typically open it with just one tool, such   
      as a coconut crusher or a large nail.",  
    "Cool, so what's that thing we eat that looks like a big brown ball and   
      has lots of little bumps on it, but we only use one thing to open it?   
      I guess it's not a basketball?",  
    "I believe you're referring to an artichoke. While an artichoke looks   
      like a large brown ball with small bumps on it, we typically use a tool   
      called a dip tool or a lemon tool with a sharp tip or a pair of   
      scissors to remove the small, sharp thorns covering the artichoke's   
      bracts, allowing us to eat the tender heart inside."  
  ]  
}

Apply ChatML format and tokenize dataset

We first organize our conversations using the ChatML format and then tokenize them. It’s important to set up the dataset so that the model learns to answer like an assistant, not to ask questions like a user. To do this, during training we set the labels to the ignore index of -100 for user messages and to the corresponding input token IDs for the assistant’s replies.

Note: If you are familiar with the QLoRA repo you will notice that the original Guanaco models are trained on both question and answer (train_on_source=False by default in qlora.py). These models, trained on raw conversation text, perform well. However, focusing on the assistant's part only is the correct way to train, and that is what we will do here.

import os  
from functools import partial  
  
# ChatML format  
templates = [  
    "<|im_start|>assistant\n{msg}<|im_end|>",      # message by assistant  
    "<|im_start|>user\n{msg}<|im_end|>".           # message by user  
]  
  
# This special index is used to ignore certain tokens during loss calculation.  
IGNORE_INDEX = -100  
  
def tokenize(input, max_length):  
    input_ids, attention_mask, labels = [], [], []  
  
    # Iterate over each message in the dataset  
    for i, msg in enumerate(input["messages"]):  
  
        # Check if the message is from human (user) or assistant, apply ChatML template  
        isHuman = i%2==0  
        msg_chatml = templates[isHuman].format(msg=msg)  
  
        # tokenize all, truncate later  
        msg_tokenized = tokenizer(  
          msg_chatml,   
          truncation=False,   
          add_special_tokens=False)  
  
        # Copy tokens and attention mask without changes  
        input_ids += msg_tokenized["input_ids"]  
        attention_mask += msg_tokenized["attention_mask"]  
  
        # Adapt labels for loss calculation: if user -> IGNORE_INDEX, if assistant -> input_ids (ignore user messages, calculate loss only for the assistant messages since these are the responses we want to learn)  
        labels += [IGNORE_INDEX]*len(msg_tokenized["input_ids"]) if isHuman else msg_tokenized["input_ids"]  
  
    # truncate to max. length  
    return {  
        "input_ids": input_ids[:max_length],   
        "attention_mask": attention_mask[:max_length],  
        "labels": labels[:max_length],  
    }  
  
dataset_tokenized = dataset.map(  
    # cut samples at 1024 tokens  
    # enough for the riddles dataset (max. length ~1000 tokens)  
    # has to be adapted for other datasets, higher = more VRAM needed  
    partial(tokenize, max_length=1024),   
    batched = False,  
    num_proc = os.cpu_count(),    # multithreaded  
    remove_columns = dataset["train"].column_names  # Remove original columns, no longer needed  
)

Define collator

A collator function is used to batch our data during training. The crucial part is to pad the inputs properly.

The purpose of the collate function is to process and prepare batches of data for training (and evaluation). It standardizes the length of each data point in the batch by padding to the length of the longest sample using specific tokens. The input_ids are padded with the pad token, the labels with the IGNORE_INDEX (to indicate that these tokens shouldn't contribute to the loss calculation), and the attention_mask with 0 (to ignore the padded tokens).

# collate function - to transform list of dictionaries [ {input_ids: [123, ..]}, {.. ] to a single dictionary forming a batch { input_ids: [..], labels: [..], attention_mask: [..] }  
def collate(elements):  
  
    # Extract input_ids from each element and find the maximum length among them 
    tokens = [e["input_ids"] for e in elements]  
    tokens_maxlen = max([len(t) for t in tokens])  
  
    for i, sample in enumerate(elements):  
        input_ids = sample["input_ids"]  
        labels = sample["labels"]  
        attention_mask = sample["attention_mask"]  
  
        # Calculate the padding length required to match the maximum token length  
        pad_len = tokens_maxlen-len(input_ids)  
  
        # Pad 'input_ids' with the pad token ID, 'labels' with IGNORE_INDEX, and 'attention_mask' with 0  
        input_ids.extend( pad_len * [tokenizer.pad_token_id] )  
        labels.extend( pad_len * [IGNORE_INDEX] )  
        attention_mask.extend( pad_len * [0] )  
  
    # create and return batch with all the data in elements  
    batch={  
        "input_ids": torch.tensor( [e["input_ids"] for e in elements] ),  
        "labels": torch.tensor( [e["labels"] for e in elements] ),  
        "attention_mask": torch.tensor( [e["attention_mask"] for e in elements] ),  
    }  
    return batch

Train

Finally, we set our training hyperparameters and start training.

from transformers import TrainingArguments, Trainer  
  
bs=1         # batch size  
ga_steps=16  # gradient acc. steps  
epochs=20  
lr=0.00002  
  
steps_per_epoch=len(dataset_tokenized["train"])//(bs*ga_steps)  
  
args = TrainingArguments(  
    output_dir="out",  
    per_device_train_batch_size=bs,  
    per_device_eval_batch_size=16,  
    evaluation_strategy="steps",  
    logging_steps=1,  
    eval_steps=steps_per_epoch//2,      # eval twice per epoch  
    save_steps=steps_per_epoch,         # save once per epoch  
    gradient_accumulation_steps=ga_steps,  
    num_train_epochs=epochs,  
    lr_scheduler_type="constant",  
    optim="paged_adamw_32bit",      # val_loss will go NaN with paged_adamw_8bit  
    learning_rate=lr,  
    group_by_length=False,  
    bf16=True,  
    ddp_find_unused_parameters=False,  
)  
  
trainer = Trainer(  
    model=model,  
    tokenizer=tokenizer,  
    args=args,  
    data_collator=collate,  
    train_dataset=dataset_tokenized["train"],  
    eval_dataset=dataset_tokenized["test"],  
)  
  
trainer.train()

Note: To not blow this text up any further, please find the code for merging the LoRA adapters, inference, and model upload to the Hugging Face Hub in this notebook.
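
For reference, the merging step boils down to something like the following sketch; the checkpoint path is a placeholder, and the full version (including the upload) is in the notebook:

import torch
from peft import AutoPeftModelForCausalLM

# load the base model together with the trained adapters from a checkpoint
# ("out/checkpoint-XXX" is a placeholder), merge the LoRA weights into the
# base weights, then save a standalone model plus the extended tokenizer
model = AutoPeftModelForCausalLM.from_pretrained(
    "out/checkpoint-XXX",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
merged_model = model.merge_and_unload()
merged_model.save_pretrained("phi-2-riddles-merged", safe_serialization=True)
tokenizer.save_pretrained("phi-2-riddles-merged")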

Hyperparameters explained

A few words on the specific hyperparameters used:

  • batch_size: Larger batch sizes are preferable but constrained by available VRAM. The longer the training samples (increasing max_length during tokenization), the more VRAM is needed.
    🛠 In this specific case, with samples of max_length 1024 tokens, a batch_size of 1 is the maximum on a 24 GB VRAM GPU. Gradient checkpointing, a feature that saves VRAM, was recently made available for Phi-2, so higher batch sizes than the one used at the time of writing are now possible. To increase the effective batch size, gradient_accumulation_steps was set to 16, which has the downside of slowing down the training process.
  • learning_rate: Selected empirically. As I will try to convince you below, a rate of 2e-5 (0.00002) works well for this dataset.
    🛠 A learning rate of 4e-5 also “works” and results in a fine-tuned model that responds in line with the training data. Which learning rate is better, what is the best setting? This depends on the size and kind of training data; you simply have to try it and see how the model behaves.
    Note on the topic of measuring the performance of a fine-tuned model: LLM evaluation is hard, see below for my thoughts on benchmarks.
  • lr_scheduler_type: Following the recommendation of QLoRA author Tim Dettmers to use a constant learning rate schedule, I’ve adopted this approach and found it consistently effective for Phi-2, as well as for Llama 1/2 and Mistral.
  • bf16: For mixed precision training, we utilize bfloat16 (bf16), which consumes less memory compared to 32-bit formats and provides a broader dynamic range than fp16. Using fp16 previously led to Not a Number (NaN) errors when working with Phi-2. However, bf16 demands an NVIDIA Ampere (or newer) GPU.
  • epochs: 20 epochs is an unusually high number; we train this long because our dataset is tiny. A more detailed explanation follows below.

The specific parameter settings chosen here might not be perfect, but they are good. To convince ourselves that this is true, let’s take a look at loss curves, benchmarks and model output.

Analysis

Training for 20 epochs might seem excessive. For comparison, a dataset with around 12k conversations typically requires only 3 epochs. Applying this logic to our riddles dataset: one epoch covers 1,682 conversations, and our target was to train on approximately 36k conversations in total, which translates to around 21 epochs.
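
As a quick back-of-the-envelope check:

# match the ~36k conversations that a 12k-sample dataset sees over 3 epochs
target_conversations = 12_000 * 3                    # ~36k conversations in total
dataset_size = 1682                                  # conversations in riddles_evolved
print(round(target_conversations / dataset_size))    # -> 21 epochs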

Image: Phi-2 fine-tuned on g-ronimo/riddles_evolved. QLoRA, 20 epochs, LR=2e-5 (constant), BS=1, grad. acc. steps=16. Took 40 minutes on 4x3090; training on a single NVIDIA GeForce RTX 3090 takes around 2.5 hours.

The model seems to be overfitting: training loss decreases while the loss calculated on the validation set stays the same; the loss curves are diverging. This is a known phenomenon when training LLMs and has been observed before; Jeremy Howard wrote a great blog post about it.

Even with what appears to be suboptimal training performance, models trained past the point of loss divergence turn out quite well. I compared the outputs at epoch 7 (where the loss divergence starts) and at epoch 20, and the model's responses improved significantly over those additional epochs.

Benchmarks

One more quality control step: let’s check with a few benchmarks whether the fine-tuning process compromised the model’s original capabilities. The evaluations were carried out using EleutherAI’s LM Evaluation Harness.
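
The exact harness invocation is not part of this post; with a v0.4-style release of lm-evaluation-harness, something along these lines should reproduce the numbers below (the task names match the tables, everything else here is an assumption):

import lm_eval

# evaluate the base model; swap in the fine-tuned/merged checkpoint for the second run
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=microsoft/phi-2,dtype=bfloat16",
    tasks=["arc_challenge", "arc_easy", "hellaswag", "piqa", "winogrande"],
    batch_size=16,
)
print(results["results"])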

Note: Benchmarks, while useful, can be misleading as they are often targets rather than mere metrics. The grand 🤗 Open LLM Leaderboard is polluted with models trained on benchmark data. Related: Pretraining on the Test Set Is All You Need. Dataset decontamination efforts apparently even opened up a new sub-field of research. Please interpret benchmark results in general with many grains of salt.

Phi-2 base model

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| arc_challenge | Yaml | none | 0 | acc | 0.5299 | ± 0.0146 |
| arc_challenge | Yaml | none | 0 | acc_norm | 0.5410 | ± 0.0146 |
| arc_easy | Yaml | none | 0 | acc | 0.7997 | ± 0.0082 |
| arc_easy | Yaml | none | 0 | acc_norm | 0.7824 | ± 0.0085 |
| hellaswag | Yaml | none | 0 | acc | 0.5587 | ± 0.0050 |
| hellaswag | Yaml | none | 0 | acc_norm | 0.7378 | ± 0.0044 |
| piqa | Yaml | none | 0 | acc | 0.7867 | ± 0.0096 |
| piqa | Yaml | none | 0 | acc_norm | 0.7927 | ± 0.0095 |
| winogrande | Yaml | none | 0 | acc | 0.7561 | ± 0.0121 |

Phi-2 fine-tuned model

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| arc_challenge | Yaml | none | 0 | acc | 0.5026 | ± 0.0146 |
| arc_challenge | Yaml | none | 0 | acc_norm | 0.5316 | ± 0.0146 |
| arc_easy | Yaml | none | 0 | acc | 0.8035 | ± 0.0082 |
| arc_easy | Yaml | none | 0 | acc_norm | 0.7689 | ± 0.0086 |
| hellaswag | Yaml | none | 0 | acc | 0.5602 | ± 0.0050 |
| hellaswag | Yaml | none | 0 | acc_norm | 0.7340 | ± 0.0044 |
| piqa | Yaml | none | 0 | acc | 0.7933 | ± 0.0094 |
| piqa | Yaml | none | 0 | acc_norm | 0.8009 | ± 0.0093 |
| winogrande | Yaml | none | 0 | acc | 0.7680 | ± 0.0119 |

The benchmark results show minor fluctuations in performance metrics for the fine-tuned model compared to the base model, with no significant declines indicating knowledge loss.

With these positive indicators, let’s proceed to chat with the model and evaluate its conversational skills.
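
Phi-2's tokenizer has no built-in chat template, so prompting the fine-tuned model means assembling the ChatML string by hand, in the same format used during training. A minimal sketch, assuming the merged model and the extended tokenizer from above are loaded (the generation parameters are illustrative):

# build a ChatML prompt exactly as during training and let the model complete
# the assistant turn; generation stops at <|im_end|>, which we set as EOS token
prompt = "<|im_start|>user\nhello!<|im_end|><|im_start|>assistant\n"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

output_ids = model.generate(
    input_ids,
    max_new_tokens=300,
    do_sample=True,
    temperature=0.7,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True))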

Sample conversations with fine-tuned Phi-2

Conversation 1/3 (ChatML tags removed): Hello World

user: hello!
assistant: I'm here to help you with any questions or tasks. What would you like me to do for you?
user: who are you?
assistant: My name is an artificial intelligence language model developed by OpenAI, specifically the GPT-3 model. I am designed to generate human-like text based on patterns and context provided in my training data. If you have a specific question or need assistance with something else, please let me know what it is.
user: are you real?
assistant: Yes, I am real as long as there's someone using this interface to communicate with me through typing commands or asking questions. My responses come from complex algorithms that process vast amounts of information stored within my knowledge base. However, keep in mind that I don't possess consciousness or emotions like humans do. I simply provide helpful answers based on the given input. Is there anything else I can assist you with?

Note: The Phi-2 base model was trained on conversational data too. OpenAI does not appear a single time in our synthetic training set, yet the model still introduces itself as being trained by OpenAI.

Conversation 2/3

user: explain quantum mechanics


assistant: I'd be happy to help explain Quantum Mechanics! It's a fascinating field of physics that deals with the behavior and properties of matter and energy at very small scales, such as atoms and subatomic particles. Here are some key concepts:

  1. Uncertainty Principle: This principle states that it is impossible to know both the exact position and momentum (or velocity) of a particle at the same time. The more precisely you try to measure one property, the less accurately you can determine the other.
  2. Wave-particle duality: According to this concept, all particles exhibit wave-like properties, including electrons, photons, and even larger objects like planets. Conversely, waves also have particle-like characteristics, such as having mass and being localized in space.
  3. Superposition: In superposition, a system or particle exists in multiple states simultaneously until an observation is made, causing it to collapse into a single state. For example, an electron can exist in two places at once before being observed.
  4. Entanglement: When two or more particles become entangled, their quantum states become linked, regardless of how far apart they are. This means that measuring the state of one particle instantaneously affects the state of another, no matter the distance between them.
  5. Quantum tunneling: This phenomenon allows particles to pass through potential barriers that would normally be impenetrable according to classical physics. It occurs because particles can occupy regions where probabilities are high for finding them on the other side of the barrier.
  6. Uncertainty in measurement: Due to the probabilistic nature of quantum mechanics, there is inherent uncertainty in measurements. The act of observing a particle changes its state, making it impossible to predict exactly what will happen without performing many trials.
  7. Applications: Quantum mechanics has numerous practical applications, including quantum computing, cryptography, and advanced materials science. Understanding these principles is crucial for developing new technologies and pushing the boundaries of scientific ...


user: i don't get it


assistant: I understand that Quantum Mechanics can seem confusing at first. Let me break down some common misconceptions and simplify the concepts further:

Misconception 1: "Quantum mechanics only applies to tiny things." While Quantum Mechanics does primarily deal with phenomena at the atomic and subatomic levels, its principles can still apply to everyday situations involving large numbers of particles. For instance, understanding the behavior of light and electricity relies heavily on Quantum Mechanics.

Misconception 2: "The Uncertainty Principle implies randomness." Although the Uncertainty Principle introduces limitations on our ability to precisely measure certain pairs of physical quantities, it doesn't imply randomness. Instead, it reflects the fundamental limits imposed by the nature of reality itself.

Misconception 3: "Entangled particles always communicate instantly." Entanglement refers to the correlation between the quantum states of two or more particles. However, when we say that measuring one particle instantaneously affects the other, we mean that the information about the measurement outcome propagates faster than the speed of light within the context of classical physics. But in reality, no communication happens faster than the speed of light.

Misconception 4: "Quantum Tunneling defies logic." Quantum Tunneling may appear counterintuitive since it suggests that particles can go through seemingly impenetrable barriers. However, it follows from the mathematical foundations of Quantum Mechanics and is not a contradiction but rather a manifestation of the unique behaviors of particles at the quantum level.

Misconception 5: "Measurement outcomes are predetermined." In Quantum Mechanics, the act of measurement causes a particle to collapse into a specific state based on probabilities determined by its initial conditions. Therefore, while we cannot predict the precise outcome beforehand, we can make predictions about the likelihood of different possibilities.

These simplified explanations should provide a clearer understanding of Quantum Mechanics. If you have any questions or need further clarification, feel free to ask!

Conversation 3/3

user: what is real strength?
assistant: The concept of "real" or true strength can vary depending on the context. In general, it refers to physical power and ability to perform tasks efficiently without being easily fatigued. However, there are also other forms of strength such as mental toughness, emotional resilience, and spiritual fortitude that may not be directly related to physical abilities but still contribute significantly to a person's overall well-being and success in life. Ultimately, what one considers to be their own personal definition of real strength will depend on their values, experiences, and goals.
user: provide an example
assistant: Sure! Here's an example: A weightlifter who consistently lifts heavy weights with ease and minimal effort could be considered physically strong. Their muscles have developed enough mass and endurance to handle the demands placed upon them during lifting sessions. Another example might be someone who has overcome significant adversity in their lives, such as illness, injury, or financial hardship. Despite these challenges, they continue to persevere and thrive, demonstrating great inner strength and resilience. These individuals may not possess traditional measures of physical strength like muscle size or athletic prowess, but they exhibit qualities that many people consider to be equally valuable and admirable - determination, courage, and perseverance.

Looking at these conversations, I must say, the model’s responses are really impressive.

To wrap up, in Part 2 we covered fine-tuning Microsoft’s Phi-2 using a small synthetic dataset. Key steps include:

  • Preparing the model, tokenizer, and dataset for ChatML format.
  • Setting up LoRA adapters for efficient fine-tuning.
  • Selecting training hyperparameters suited for Phi-2 and confirming them by analyzing loss curves, benchmarks and model output.

One last thing

Having run benchmarks on both the base and fine-tuned models, I decided to broaden the scope and compare several current small models for a more comprehensive view.

Image: Normalized accuracies shown. Models tested: TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T, state-spaces/mamba-2.8b-slimpj, microsoft/phi-2, mistralai/Mistral-7B-Instruct-v0.2. Evaluation performed using the LM Evaluation Harness.

Interestingly, Phi-2’s performance is comparable, and in some cases superior, to the latest Instruct version of Mistral-7B. This raises a question: was Mistral trained on benchmark data, or is Phi-2 actually this good? I’m not sure what to make of this.

In any case, it’s evident that Phi-2 is a great small model, definitely not overhyped in Microsoft’s announcement: “Phi-2: The surprising power of small language models”.

Wrap up

Phi-2 demonstrates the power of quality training data over sheer model size, showcasing remarkable capabilities despite its relatively low number of parameters. Through this tutorial, you should now have a fundamental understanding of how to generate a synthetic dataset and (parameter-)efficiently fine-tune Microsoft's Phi-2.

The code and data used and produced in this tutorial can be found here:

  • Evolved riddles: Hugging Face
  • Dataset notebook: GitHub
  • Fine-tuning notebook (runs on a single 24 GB VRAM GPU) and
    🤗 accelerate-ready code for multi-GPU training

I hope you enjoyed this tutorial/braindump and build your own fine-tunes on top of this amazing model! Share your experiences, questions, and ideas with me on Twitter @Geronimo_AI. Or simply follow me.