Finetuning Mixtral 8x7B Instruct-v0.1 using Transformers


Fine-Tuning Mixtral Using QLoRA

In this tutorial, we will walk through fine-tuning Mixtral on a single A100 (40 GB) GPU. Specifically, we will use QLoRA to fine-tune the model on a distinctive dataset sourced from Hugging Face: a modern-to-Shakespearean translation set that pairs modern English text with its equivalent in the eloquent, archaic style of Shakespearean English. For evaluation, we will translate a set of five English sentences into Shakespearean English with both the original and the fine-tuned model and manually compare the results to gauge the effectiveness of the fine-tuning process.

Check out the easiest way to fine-tune Mistral-7B-v0.1 here.

Importing Libraries and Loading the Model

Before diving into the fine-tuning process, it's imperative to import the necessary libraries and load the model and tokenizer. Ensure that the required libraries are installed beforehand. The complete code for this tutorial is available on GitHub for reference.

import torch
import transformers
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model, PeftModel

# Loading the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1",
                                             load_in_4bit=True,
                                             torch_dtype=torch.float16,
                                             device_map="auto",
                                             )
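
On more recent transformers releases, the recommended way to request 4-bit loading is an explicit BitsAndBytesConfig rather than the bare load_in_4bit flag. A minimal equivalent sketch, assuming recent transformers and bitsandbytes versions:

from transformers import BitsAndBytesConfig

# NF4 4-bit quantization with fp16 compute, the setup commonly used for QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)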

Preparing the Model for Training with QLoRA

Next, we prepare the model for QLoRA training in 4-bit precision. This step casts a few layers back to full precision and enables gradient checkpointing, so that training the quantized model is stable and memory-efficient.

# Prepare model for k-bit training
model = prepare_model_for_kbit_training(model)

# Adjusting tokenizer settings
tokenizer.pad_token = "!"  # Use a distinct pad token rather than EOS, so padding is not confused with end-of-sequence

# Setting the LoRA hyperparameters
CUTOFF_LEN = 256  # The dataset consists of short texts
LORA_R = 8  # LoRA rank
LORA_ALPHA = 2 * LORA_R  # LoRA scaling factor, commonly set to twice the rank
LORA_DROPOUT = 0.1

# Configuring LoRA
config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    target_modules=["w1", "w2", "w3"],  # Target the expert feed-forward projections in Mixtral's MoE blocks
    lora_dropout=LORA_DROPOUT,
    bias="none",
    task_type="CAUSAL_LM"
)

# Incorporating LoRA into the model
model = get_peft_model(model, config)
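
After wrapping the model, it is worth sanity-checking that only the LoRA adapters are trainable. PEFT provides a helper for this; a quick check could look like:

# Reports trainable vs. total parameters; with LoRA only a small
# fraction of the full model should be trainable.
model.print_trainable_parameters()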

Loading the Dataset and Defining Utility Functions

Now, we load our dataset and define utility functions for building the prompt template and tokenizing the data. The same pipeline can easily be adapted to datasets in other formats, such as CSV or JSON (see the sketch after the code below).

# Loading the dataset
dataset = load_dataset("harpreetsahota/modern-to-shakesperean-translation")
train_data = dataset["train"]  # Using only training data

# Utility functions for prompt generation and tokenization
def generate_prompt(user_query):
    sys_msg = "Translate the given text to Shakespearean style."
    prompt = "<s> [INST]" + sys_msg + "\n" + user_query["modern"] + "[/INST]" + user_query["shakespearean"] + "</s>"
    return prompt

def tokenize(prompt):
    return tokenizer(
        prompt + tokenizer.eos_token,
        truncation=True,
        max_length=CUTOFF_LEN,
        padding="max_length"
    )

# Tokenizing the dataset
train_data = train_data.shuffle().map(lambda x: tokenize(generate_prompt(x)), remove_columns=["modern", "shakespearean"])
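
If your data lives in a local CSV or JSON file rather than on the Hub, load_dataset can read it directly. A minimal sketch, using hypothetical file names and assuming the same "modern" and "shakespearean" columns:

# Hypothetical local CSV with "modern" and "shakespearean" columns
dataset = load_dataset("csv", data_files={"train": "modern_to_shakespearean.csv"})
train_data = dataset["train"]

# The JSON Lines equivalent follows the same pattern:
# dataset = load_dataset("json", data_files={"train": "modern_to_shakespearean.jsonl"})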

Fine-Tuning the Model

With everything set up, we pass the data to the Hugging Face Trainer and start the fine-tuning run.

trainer = Trainer(
    model=model,
    train_dataset=train_data,
    args=TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        num_train_epochs=6,
        learning_rate=1e-4,
        logging_steps=2,
        optim="adamw_torch",
        save_strategy="epoch",
        output_dir="mixtral-moe-lora-instruct-shakespeare"
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
model.config.use_cache = False  # Disable the KV cache during training; it conflicts with gradient checkpointing

# Initiating training
trainer.train()
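
When training finishes, only the LoRA adapter weights need to be saved; they can later be attached to a freshly loaded base model using the PeftModel class imported earlier. A minimal sketch, with a hypothetical adapter directory name:

# Save only the LoRA adapter weights (small compared to the full model)
model.save_pretrained("mixtral-shakespeare-adapter")

# Later, for inference: reload the 4-bit base model and attach the adapter
base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    load_in_4bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
finetuned_model = PeftModel.from_pretrained(base_model, "mixtral-shakespeare-adapter")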

And like magic, our model starts to train!

Evaluation

For evaluation, we prepare a list of five statements similar in style to those in the dataset. Each statement is fed to both the original Mixtral model and the fine-tuned model to generate a Shakespearean English translation.

phrases = [
    "He dipped out on me after a couple of hangouts.",
    "I'm all about that artist, their vibe is straight fire!",
    "Linking up with the famjam this weekend.",
    "The expression 'Spill the beans' means to reveal secret information.",
    "Saying 'Hear it through the grapevine' indicates learning news through informal sources.",
]
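
For reference, here is a minimal generation sketch for producing these translations; the translate helper is illustrative rather than code from the original run:

model.config.use_cache = True  # Re-enable the KV cache for faster generation

def translate(model, phrase, max_new_tokens=128):
    # Build the same [INST] prompt used during training, without the answer part
    prompt = "<s> [INST]Translate the given text to Shakespearean style.\n" + phrase + "[/INST]"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)

for phrase in phrases:
    print(translate(model, phrase))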

Generating translations using the original Mixtral model

 [INST]Translate the given text to Shakespearean style.
He dipped out on me after a couple of hangouts.[/INST] Mine acquaintance, so familiar in our meetings but a few,
Has forsook me, left me in the lurch, with nary a goodbye or adieu.
Our gatherings, once filled with mirth and good company,
He hath abandoned, with naught but silence and absence in their stead.
Alas, the cause I know not, nor can I fathom the reason,
But know this, dear friend, his departure hath left a bitter taste.

Generating translations using the fine-tuned model

  [INST]Translate the given text to Shakespearean style.
He dipped out on me after a couple of hangouts.[/INST]With much delight, we oft did meet,
Two souls engaged in friendly treat,
But sudden, like the wind, he vanished,
Our camaraderie, by him, abandoned.

Conclusion

Comparing the translations generated by the original Mixtral model and the fine-tuned model above, the fine-tuned output is noticeably more refined and concise. In this tutorial, we demonstrated how to fine-tune Mixtral with QLoRA for translating modern English into Shakespearean English. Fine-tuning on the modern-to-Shakespearean dataset produced translations that closely resemble the style and elegance of Shakespeare's language, and the side-by-side comparison highlights the effectiveness of fine-tuning in improving translation quality.

This approach not only enhances Mixtral's performance on this task but also opens avenues for applying fine-tuning techniques to other natural language processing tasks. As has been shown with Japanese-language chat models, fine-tuning holds the potential to significantly improve model fluency and performance across diverse linguistic domains.

Learn to fine-tune LLMs here: https://exnrt.com/tag/huggingface/

During training, the GPU uses approximately 31 GB of memory, so I suspect the model could also be fine-tuned on a GPU with 32 GB of VRAM. If anyone succeeds with this, please let me know in the comments!
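
For anyone checking whether their GPU has enough headroom, a quick sketch of reading peak memory with PyTorch (assuming a single CUDA device):

# Peak GPU memory allocated by PyTorch so far, in GB
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak GPU memory allocated: {peak_gb:.1f} GB")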

Thanks for sharing! Very useful information!
