Comparing the Performance of LLMs: A Deep Dive into RoBERTa, Llama 2, and Mistral for Disaster Tweets Analysis with LoRA

Published November 7, 2023

Introduction

In the fast-moving world of Natural Language Processing (NLP), we often find ourselves comparing different language models to see which one works best for specific tasks. This blog post compares three models: RoBERTa, Mistral 7B, and Llama 2 7B. We used them to tackle a common problem: classifying tweets about disasters. Note that Mistral 7B and Llama 2 7B are large models with 7 billion parameters each, whereas RoBERTa-large (355M parameters) is a relatively small model that serves as the baseline for the comparison study.

In this blog, we use a Parameter-Efficient Fine-Tuning (PEFT) technique, LoRA (Low-Rank Adaptation of Large Language Models), to fine-tune each pre-trained model on the sequence classification task. LoRA is designed to significantly reduce the number of trainable parameters while maintaining strong downstream task performance.

The main objective of this blog post is to implement LoRA fine-tuning for sequence classification using three pre-trained models from Hugging Face: meta-llama/Llama-2-7b-hf, mistralai/Mistral-7B-v0.1, and roberta-large.

Hardware Used

  • Number of nodes: 1
  • Number of GPUs per node: 1
  • GPU type: A6000
  • GPU memory: 48GB

Goals

  • Implement fine-tuning of pre-trained LLMs using LoRA PEFT methods.
  • Learn how to use the HuggingFace APIs (transformers, peft, and datasets).
  • Setup the hyperparameter tuning and experiment logging using Weights & Biases.

Dependencies

datasets
evaluate
peft
scikit-learn
torch
transformers
wandb 

Note: For reproducing the reported results, please check the pinned versions in the wandb reports.

Pre-trained Models

RoBERTa

RoBERTa (Robustly Optimized BERT Approach) is an improved variant of the BERT model proposed by the Meta AI research team. BERT is a transformer-based language model that uses self-attention to build contextual word representations and is trained with a masked language modeling objective. Note that BERT is an encoder-only model used for natural language understanding tasks (such as sequence classification and token classification).

RoBERTa is a popular model to fine-tune and an appropriate baseline for our experiments. For more information, you can check the Hugging Face model card.

Llama 2

Llama 2, which stands for Large Language Model Meta AI, is a family of large language models (LLMs) introduced by Meta AI. The Llama 2 models vary in size, with parameter counts ranging from 7 billion to 70 billion.

Llama 2 is an auto-regressive language model based on the transformer decoder architecture. To generate text, Llama 2 takes a sequence of tokens as input and iteratively predicts the next token. Its architecture differs slightly from models like GPT-3; for instance, Llama 2 employs the SwiGLU activation function rather than ReLU and uses rotary positional embeddings in place of absolute learnable positional embeddings.

Llama 2 also introduced architectural refinements to better leverage longer sequences, extending the context length to 4096 tokens and adopting grouped-query attention (GQA) decoding in its larger variants.

Mistral 7B

Mistral 7B v0.1, with 7.3 billion parameters, is the first LLM introduced by Mistral AI. The main novel techniques used in Mistral 7B's architecture are:

  • Sliding Window Attention: replaces full attention (quadratic compute cost) with a sliding-window attention in which each token can attend to at most 4,096 tokens from the previous layer (linear compute cost). Because these windows are stacked across layers, higher layers can still access information beyond the window size of 4,096 tokens, which lets Mistral 7B handle longer sequences.
  • Grouped-query Attention: also used in Llama 2, this technique speeds up inference by sharing key and value projections across groups of query heads, which shrinks the cache of key and value vectors kept for previously decoded tokens.

LoRA

PEFT (Parameter-Efficient Fine-Tuning) is a collection of techniques (p-tuning, prefix tuning, IA3, Adapters, and LoRA) designed to fine-tune large models using a much smaller set of trainable parameters while preserving the performance levels typically achieved through full fine-tuning.

LoRA (Low-Rank Adaptation) is a PEFT method that shares similarities with Adapter layers. Its primary objective is to reduce the model's trainable parameters. LoRA works by learning a low-rank update matrix while keeping the pre-trained weights frozen.

(Figure: LoRA adds a trainable low-rank decomposition of the weight update alongside the frozen pre-trained weights.)
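
To make the idea concrete, here is a minimal sketch of a LoRA-adapted linear layer. This is only an illustration of the principle; the class, its name, and the initialization scheme are ours, not the actual peft implementation.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative sketch: y = W0 x + (alpha / r) * B A x, with W0 frozen."""
    def __init__(self, base_linear: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pre-trained weights
        # Trainable low-rank factors: only r * (d_in + d_out) extra parameters
        self.lora_A = nn.Parameter(torch.randn(r, base_linear.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base_linear.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # Frozen forward pass plus the scaled low-rank update
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

In practice, the peft library injects adapters like this into the chosen attention projection layers for us via get_peft_model, which is what we do below.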

Setup

RoBERTa has a maximum sequence length limitation of 512 tokens, so we set MAX_LEN=512 for all models to ensure a fair comparison.

MAX_LEN = 512 
roberta_checkpoint = "roberta-large"
mistral_checkpoint = "mistralai/Mistral-7B-v0.1"
llama_checkpoint = "meta-llama/Llama-2-7b-hf"

Data preparation

Data loading

We will load the dataset from Hugging Face:

from datasets import load_dataset
dataset = load_dataset("mehdiiraqui/twitter_disaster")

Now, let's split the train split into training and validation sets, and then attach the original test set:

# Split the train split into training and validation sets
data = dataset['train'].train_test_split(train_size=0.8, seed=42)
# train_test_split names the held-out part "test"; rename it to "val"
data['val'] = data.pop("test")
# Attach the original test split
data['test'] = dataset['test']

Here's an overview of the dataset:

DatasetDict({
    train: Dataset({
        features: ['id', 'keyword', 'location', 'text', 'target'],
        num_rows: 6090
    })
    val: Dataset({
        features: ['id', 'keyword', 'location', 'text', 'target'],
        num_rows: 1523
    })
    test: Dataset({
        features: ['id', 'keyword', 'location', 'text', 'target'],
        num_rows: 3263
    })
})

Let's check the structure of the original train and test splits:

import pandas as pd

dataset['train'].to_pandas().info()
dataset['test'].to_pandas().info()
  • Train dataset
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int64 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 297.5+ KB
  • Test dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3263 entries, 0 to 3262
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        3263 non-null   int64 
 1   keyword   3237 non-null   object
 2   location  2158 non-null   object
 3   text      3263 non-null   object
 4   target    3263 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 127.6+ KB

Target distribution in the train dataset
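
These counts can be obtained with a quick value_counts call, for example on the original train split (which is what the numbers below correspond to):

dataset['train'].to_pandas()['target'].value_counts()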

target
0    4342
1    3271
Name: count, dtype: int64

As the classes are not balanced, we compute the positive and negative class weights on the original train split and use them for the loss calculation later:

pos_weights = len(dataset['train'].to_pandas()) / (2 * dataset['train'].to_pandas().target.value_counts()[1])
neg_weights = len(dataset['train'].to_pandas()) / (2 * dataset['train'].to_pandas().target.value_counts()[0])

The final weights are:

POS_WEIGHT, NEG_WEIGHT = (1.1637114032405993, 0.8766697374481806)

Then, we compute the maximum length of the column text:

# Number of Characters
max_char = data['train'].to_pandas()['text'].str.len().max()
# Number of Words
max_words = data['train'].to_pandas()['text'].str.split().str.len().max()

The maximum number of characters is 152, and the maximum number of words is 31.

Data Processing

Let's take a look at one example row of the training data:

data['train'][0]
{'id': 5285,
 'keyword': 'fear',
 'location': 'Thibodaux, LA',
 'text': 'my worst fear. https://t.co/iH8UDz8mq3',
 'target': 0}

The data comprises a keyword, a location, and the text of the tweet. For the sake of simplicity, we select the text feature as the only input to the LLM.

At this stage, we have prepared the train, validation, and test sets in the Hugging Face format expected by the pre-trained LLMs. The next step is to build the tokenized datasets for training, using each model's tokenizer to transform the text feature into two tensors: the sequence of token IDs and the attention mask. Since each model has its own tokenizer, we need to define three tokenized datasets.

We start by defining the RoBERTa dataloader:

  • Load the tokenizer:
from transformers import AutoTokenizer
roberta_tokenizer = AutoTokenizer.from_pretrained(roberta_checkpoint, add_prefix_space=True)

Note: The RoBERTa tokenizer has been trained to treat spaces as part of the tokens. As a result, the first word of a sentence is encoded differently if it is not preceded by a white space. We set add_prefix_space=True so that the first word is tokenized as if it were preceded by a space. To keep the pre-processing consistent across all three models, we set the same parameter for Llama 2 and Mistral 7B.
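
As a quick sanity check, you can compare how the first word is tokenized with and without this option (the example sentence below is our own):

from transformers import AutoTokenizer
tok_default = AutoTokenizer.from_pretrained(roberta_checkpoint)
tok_prefix = AutoTokenizer.from_pretrained(roberta_checkpoint, add_prefix_space=True)
print(tok_default.tokenize("fire near the school"))  # first token carries no leading-space marker
print(tok_prefix.tokenize("fire near the school"))   # first token is tokenized as if preceded by a space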

  • Define the preprocessing function for converting one row of the dataframe:
def roberta_preprocessing_function(examples):
    return roberta_tokenizer(examples['text'], truncation=True, max_length=MAX_LEN)

By applying the preprocessing function to the first example of our training dataset, we have the tokenized inputs (input_ids) and the attention mask:

roberta_preprocessing_function(data['train'][0])
{'input_ids': [0, 127, 2373, 2490, 4, 1205, 640, 90, 4, 876, 73, 118, 725, 398, 13083, 329, 398, 119, 1343, 246, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
  • Now, let's apply the preprocessing function to the entire dataset:
col_to_delete = ['id', 'keyword','location', 'text']
# Apply the preprocessing function and remove the undesired columns
roberta_tokenized_datasets = data.map(roberta_preprocessing_function, batched=True, remove_columns=col_to_delete)
# Rename the target column to label, per Hugging Face conventions
roberta_tokenized_datasets = roberta_tokenized_datasets.rename_column("target", "label")
# Set to torch format
roberta_tokenized_datasets.set_format("torch")

Note: we removed the columns we no longer need: id, keyword, location, and text. The text column is dropped because it has already been converted into input IDs and an attention mask.

We can have a look into our tokenized training dataset:

roberta_tokenized_datasets['train'][0]
{'label': tensor(0),
 'input_ids': tensor([    0,   127,  2373,  2490,     4,  1205,   640,    90,     4,   876,
            73,   118,   725,   398, 13083,   329,   398,   119,  1343,   246,
             2]),
 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])}
  • For generating the training batches, we also need to pad the rows of a given batch to the maximum length found in the batch. For that, we will use the DataCollatorWithPadding class:
# Data collator for padding a batch of examples to the maximum length seen in the batch
from transformers import DataCollatorWithPadding
roberta_data_collator = DataCollatorWithPadding(tokenizer=roberta_tokenizer)

You can follow the same steps for preparing the data for Mistral 7B and Llama 2 models:

Note that the Llama 2 and Mistral 7B tokenizers don't define a pad token by default, so we reuse the end-of-sequence (EOS) token for padding.

  • Mistral 7B:
# Load Mistral 7B Tokenizer
from transformers import AutoTokenizer, DataCollatorWithPadding
mistral_tokenizer = AutoTokenizer.from_pretrained(mistral_checkpoint, add_prefix_space=True)
mistral_tokenizer.pad_token_id = mistral_tokenizer.eos_token_id
mistral_tokenizer.pad_token = mistral_tokenizer.eos_token

def mistral_preprocessing_function(examples):
    return mistral_tokenizer(examples['text'], truncation=True, max_length=MAX_LEN)

mistral_tokenized_datasets = data.map(mistral_preprocessing_function, batched=True, remove_columns=col_to_delete)
mistral_tokenized_datasets = mistral_tokenized_datasets.rename_column("target", "label")
mistral_tokenized_datasets.set_format("torch")

# Data collator for padding a batch of examples to the maximum length seen in the batch
mistral_data_collator = DataCollatorWithPadding(tokenizer=mistral_tokenizer)
  • Llama 2:
# Load Llama 2 Tokenizer
from transformers import AutoTokenizer, DataCollatorWithPadding
llama_tokenizer = AutoTokenizer.from_pretrained(llama_checkpoint, add_prefix_space=True)
llama_tokenizer.pad_token_id = llama_tokenizer.eos_token_id
llama_tokenizer.pad_token = llama_tokenizer.eos_token

def llama_preprocessing_function(examples):
    return llama_tokenizer(examples['text'], truncation=True, max_length=MAX_LEN)

llama_tokenized_datasets = data.map(llama_preprocessing_function, batched=True, remove_columns=col_to_delete)
llama_tokenized_datasets = llama_tokenized_datasets.rename_column("target", "label")
llama_tokenized_datasets.set_format("torch")

# Data collator for padding a batch of examples to the maximum length seen in the batch
llama_data_collator = DataCollatorWithPadding(tokenizer=llama_tokenizer)

Now that we have prepared the tokenized datasets, the next section will show how to load the pre-trained LLM checkpoints and attach the LoRA layers.

Models

RoBERTa

Load RoBERTa Checkpoints for the Classification Task

We load the pre-trained RoBERTa model with a sequence classification head using the Hugging Face AutoModelForSequenceClassification class:

from transformers import AutoModelForSequenceClassification 
roberta_model = AutoModelForSequenceClassification.from_pretrained(roberta_checkpoint, num_labels=2)

LoRA setup for RoBERTa classifier

We import the LoRA configuration and set some parameters for the RoBERTa classifier:

  • TaskType: Sequence classification
  • r (rank): Rank of the decomposition matrices
  • lora_alpha: Alpha parameter used to scale the learned weights. The LoRA paper advises fixing alpha at 16
  • lora_dropout: Dropout probability of the LoRA layers
  • bias: Whether to add a bias term to the LoRA layers

The code below uses the values recommended by the LoRA paper. Later in this post we will perform hyperparameter tuning of these parameters using wandb.

from peft import get_peft_model, LoraConfig, TaskType
roberta_peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS, r=2, lora_alpha=16, lora_dropout=0.1, bias="none",
)
roberta_model = get_peft_model(roberta_model, roberta_peft_config)
roberta_model.print_trainable_parameters()

We can see that the number of trainable parameters represents only 0.64% of the RoBERTa model parameters:

trainable params: 2,299,908 || all params: 356,610,052 || trainable%: 0.6449363911929212

Mistral

Load checkpoints for the classification model

Let's load the pre-trained Mistral-7B model with a sequence classification head:

from transformers import AutoModelForSequenceClassification
import torch
mistral_model =  AutoModelForSequenceClassification.from_pretrained(
  pretrained_model_name_or_path=mistral_checkpoint,
  num_labels=2,
  device_map="auto"
)

For Mistral 7B, we have to add the padding token id as it is not defined by default.

mistral_model.config.pad_token_id = mistral_model.config.eos_token_id

LoRA setup for Mistral 7B classifier

For the Mistral 7B model, we need to specify the target_modules (the query and value projection layers of the attention modules):

from peft import get_peft_model, LoraConfig, TaskType

mistral_peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS, r=2, lora_alpha=16, lora_dropout=0.1, bias="none", 
    target_modules=[
        "q_proj",
        "v_proj",
    ],
)

mistral_model = get_peft_model(mistral_model, mistral_peft_config)
mistral_model.print_trainable_parameters()

The number of trainable parameters represents only 0.024% of the Mistral model parameters:

trainable params: 1,720,320 || all params: 7,112,380,416 || trainable%: 0.02418768259540745

Llama 2

Load checkpoints for the classification model

Let's load the pre-trained Llama 2 model with a sequence classification head:

from transformers import AutoModelForSequenceClassification
import torch
llama_model =  AutoModelForSequenceClassification.from_pretrained(
  pretrained_model_name_or_path=llama_checkpoint,
  num_labels=2,
  device_map="auto",
  offload_folder="offload",
  trust_remote_code=True
)

For Llama 2, we have to add the padding token id as it is not defined by default.

llama_model.config.pad_token_id = llama_model.config.eos_token_id

LoRA setup for Llama 2 classifier

We define the LoRA configuration for Llama 2 with the same target modules as for Mistral 7B (the query and value projections), but with a higher rank (r=16) and a lower dropout (0.05):

from peft import get_peft_model, LoraConfig, TaskType
llama_peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS, r=16, lora_alpha=16, lora_dropout=0.05, bias="none", 
    target_modules=[
        "q_proj",
        "v_proj",  
    ],
)

llama_model = get_peft_model(llama_model, llama_peft_config)
llama_model.print_trainable_parameters()

The number of trainable parameters represents only 0.12% of the Llama 2 model parameters:

trainable params: 8,404,992 || all params: 6,615,748,608 || trainable%: 0.1270452143516515

At this point, we have defined the tokenized datasets for training as well as the LLMs set up with LoRA layers. The following section will show how to launch training using the Hugging Face Trainer class.

Setup the trainer

Evaluation Metrics

First, we define the performance metrics we will use to compare the three models: F1 score, recall, precision and accuracy:

import evaluate
import numpy as np

def compute_metrics(eval_pred):
    # All metrics are already predefined in the HF `evaluate` package
    precision_metric = evaluate.load("precision")
    recall_metric = evaluate.load("recall")
    f1_metric= evaluate.load("f1")
    accuracy_metric = evaluate.load("accuracy")

    logits, labels = eval_pred # eval_pred is the tuple of predictions and labels returned by the model
    predictions = np.argmax(logits, axis=-1)
    precision = precision_metric.compute(predictions=predictions, references=labels)["precision"]
    recall = recall_metric.compute(predictions=predictions, references=labels)["recall"]
    f1 = f1_metric.compute(predictions=predictions, references=labels)["f1"]
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"]
    # The trainer is expecting a dictionary where the keys are the metrics names and the values are the scores. 
    return {"precision": precision, "recall": recall, "f1-score": f1, 'accuracy': accuracy}

Custom Trainer for Weighted Loss

As mentioned at the beginning of this post, we have an imbalanced distribution between positive and negative classes, so we need to train our models with a weighted cross-entropy loss to account for that. The Trainer class doesn't support providing a custom loss out of the box, as it expects to get the loss directly from the model's outputs.

So, we need to define our custom WeightedCELossTrainer that overrides the compute_loss method to calculate the weighted cross-entropy loss based on the model's predictions and the input labels:

import torch
from transformers import Trainer

class WeightedCELossTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        # Get model's predictions
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # Compute custom loss
        loss_fct = torch.nn.CrossEntropyLoss(weight=torch.tensor([neg_weights, pos_weights], device=model.device, dtype=logits.dtype))
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss

Trainer Setup

Let's set the training arguments and the trainer for the three models.

RoBERTa

The first important step is to move the model to the GPU device for training.

roberta_model = roberta_model.cuda()
roberta_model.device

It will print the following:

device(type='cuda', index=0)

Then, we set the training arguments:

from transformers import TrainingArguments

lr = 1e-4
batch_size = 8
num_epochs = 5

training_args = TrainingArguments(
    output_dir="roberta-large-lora-token-classification",
    learning_rate=lr,
    lr_scheduler_type= "constant",
    warmup_ratio= 0.1,
    max_grad_norm= 0.3,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.001,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="wandb",
    fp16=False,
    gradient_checkpointing=True,
)

Finally, we define the RoBERTa trainer by providing the model, the training arguments and the tokenized datasets:

roberta_trainer = WeightedCELossTrainer(
    model=roberta_model,
    args=training_args,
    train_dataset=roberta_tokenized_datasets['train'],
    eval_dataset=roberta_tokenized_datasets["val"],
    data_collator=roberta_data_collator,
    compute_metrics=compute_metrics
)

Mistral-7B

Similar to RoBERTa, we initialize the WeightedCELossTrainer as follows:

from transformers import TrainingArguments, Trainer

mistral_model = mistral_model.cuda()

lr = 1e-4
batch_size = 8
num_epochs = 5

training_args = TrainingArguments(
    output_dir="mistral-lora-token-classification",
    learning_rate=lr,
    lr_scheduler_type= "constant",
    warmup_ratio= 0.1,
    max_grad_norm= 0.3,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.001,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="wandb",
    fp16=True,
    gradient_checkpointing=True,
)


mistral_trainer = WeightedCELossTrainer(
    model=mistral_model,
    args=training_args,
    train_dataset=mistral_tokenized_datasets['train'],
    eval_dataset=mistral_tokenized_datasets["val"],
    data_collator=mistral_data_collator,
    compute_metrics=compute_metrics
)

Note that we needed to enable half-precision training by setting fp16 to True. The main reason is that Mistral-7B is large: in full float32 precision, its weights alone occupy roughly 26 GiB of the single 48 GB GPU, leaving too little headroom for activations, gradients, and optimizer states during training.
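
As a rough back-of-the-envelope check (using the parameter count printed earlier, and ignoring activations, gradients, and optimizer states):

n_params = 7_112_380_416          # total Mistral-7B parameters reported by print_trainable_parameters
gib = 1024 ** 3
print(f"fp32 weights: ~{n_params * 4 / gib:.1f} GiB")  # ~26.5 GiB just for the weights
print(f"fp16 weights: ~{n_params * 2 / gib:.1f} GiB")  # ~13.2 GiB, leaving room for the rest of training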

Llama 2

Similar to Mistral 7B, we define the trainer as follows:

from transformers import TrainingArguments, Trainer

llama_model = llama_model.cuda()

lr = 1e-4
batch_size = 8
num_epochs = 5
training_args = TrainingArguments(
    output_dir="llama-lora-token-classification",
    learning_rate=lr,
    lr_scheduler_type= "constant",
    warmup_ratio= 0.1,
    max_grad_norm= 0.3,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.001,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="wandb",
    fp16=True,
    gradient_checkpointing=True,
)



llama_trainer = WeightedCELossTrainer(
    model=llama_model,
    args=training_args,
    train_dataset=llama_tokenized_datasets['train'],
    eval_dataset=llama_tokenized_datasets["val"],
    data_collator=llama_data_collator,
    compute_metrics=compute_metrics
)
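
With the three trainers defined, launching fine-tuning is a single call per model; since report_to="wandb" is set in the training arguments, metrics are logged to Weights & Biases automatically. In practice you would run each model in a separate process or session to free GPU memory between runs.

# Launch fine-tuning; evaluation on the validation split runs at the end of each epoch,
# and the best checkpoint is reloaded at the end (load_best_model_at_end=True)
roberta_trainer.train()
mistral_trainer.train()
llama_trainer.train()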

Hyperparameter Tuning

We used the W&B Sweeps API to run hyperparameter tuning with a Bayesian search strategy (30 runs). The hyperparameters searched are summarized below, followed by a sketch of the corresponding sweep configuration.

  • method: bayes
  • metric: maximize eval/f1-score
  • lora_alpha: categorical, values: 16, 32, 64
  • lora_bias: categorical, values: None
  • lora_dropout: uniform, min: 0, max: 0.1
  • lora_rank: categorical, values: 4, 8, 16, 32
  • lr: uniform, min: 1e-05, max: 2e-04
  • max_length: categorical, values: 512
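
A W&B sweep configuration matching this search space would look roughly like the following. This is a sketch: the project name and the train_fn wrapper around the Trainer setup above are placeholders, not the exact code used for the reported sweeps.

import wandb

sweep_config = {
    "method": "bayes",
    "metric": {"name": "eval/f1-score", "goal": "maximize"},
    "parameters": {
        "lora_alpha": {"values": [16, 32, 64]},
        "lora_bias": {"values": ["none"]},  # corresponds to bias="none" in LoraConfig
        "lora_dropout": {"distribution": "uniform", "min": 0.0, "max": 0.1},
        "lora_rank": {"values": [4, 8, 16, 32]},
        "lr": {"distribution": "uniform", "min": 1e-5, "max": 2e-4},
        "max_length": {"values": [512]},
    },
}

def train_fn():
    with wandb.init():
        cfg = wandb.config
        # Placeholder: rebuild LoraConfig and TrainingArguments from cfg (lora_rank, lora_alpha, lr, ...),
        # construct a WeightedCELossTrainer as shown above, and call .train()
        ...

sweep_id = wandb.sweep(sweep_config, project="disaster-tweets-lora")  # hypothetical project name
wandb.agent(sweep_id, function=train_fn, count=30)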

For more information, you can check the Wandb experiment report in the resources section.

Results

Model        F1 score   Training time   Memory consumption               Trainable parameters (% of total)
RoBERTa      0.8077     538 seconds     GPU1: 9.1 GB, GPU2: 8.3 GB       0.64%
Mistral 7B   0.7364     2030 seconds    GPU1: 29.6 GB, GPU2: 29.5 GB     0.024%
Llama 2      0.7638     2052 seconds    GPU1: 35 GB, GPU2: 33.9 GB       0.12%

Conclusion

In this blog post, we compared the performance of three models, RoBERTa, Mistral 7B, and Llama 2, for disaster tweet classification using LoRA. From the results, RoBERTa outperforms Mistral 7B and Llama 2 by a large margin. This raises the question of whether we really need a large and complex LLM for tasks like short-sequence binary classification.

One takeaway from this study is that the choice of model should account for the specific project requirements, available resources, and performance needs.

Also, for relatively simple prediction tasks with short sequences, smaller base models such as RoBERTa remain competitive.

Finally, we showed that the LoRA method can be applied to both encoder (RoBERTa) and decoder (Llama 2 and Mistral 7B) models.

Resources

  1. You can find the code in the following GitHub project.

  2. You can check the hyperparameter search results in the following Weights & Biases reports: