Unleashing the Power of Unsloth and QLoRA: Redefining Language Model Fine-Tuning

Community blog post
Published January 19, 2024



In the dynamic realm of language model optimization, a revolutionary force has emerged: Unsloth. This avant-garde framework, created by Daniel and Michael Han, is set to redefine the landscape of fine-tuning. As we work through its definition, advantages, and benefits, prepare to witness a paradigm shift in the way we approach language model optimization.



Unsloth is not just a library; it's a technological symphony orchestrated for the fine-tuning and training of large language models (LLMs). Specifically designed for optimal performance, Unsloth introduces innovative techniques to enhance speed, reduce memory consumption, and elevate accuracy during the fine-tuning process.

Advantages of Unsloth:

  1. Speed Redefined: Unsloth's authors report up to a 30x increase in training speed. Fine-tuning on Alpaca, a common benchmark task, reportedly takes about 3 hours instead of the conventional 85. This acceleration is a testament to Unsloth's commitment to efficiency and productivity.

  2. Memory Efficiency: A game-changer in the memory domain, Unsloth promises a 60% reduction in memory usage. This not only enables the handling of larger batches but also ensures a seamless fine-tuning process without compromising on performance.

  3. Accuracy Amplified: The authors proudly declare a 0% loss in accuracy, with an additional option for a +20% increase in accuracy using their MAX offering. This commitment to maintaining and elevating accuracy levels sets Unsloth apart in the competitive landscape.

  4. Hardware Compatibility: Unsloth extends its reach by supporting NVIDIA, Intel, and AMD GPUs. This inclusivity ensures accessibility to a wide array of hardware configurations, making it a versatile choice for developers across different platforms.
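
As a quick sanity check on the figures above, the following sketch redoes the arithmetic. The 85-hour and 3-hour timings and the 60% figure are the claims quoted in the list; the assumption that memory scales linearly with batch size is mine and only gives a rough upper bound.

```python
# Back-of-the-envelope check of the reported numbers (not a benchmark).
baseline_hours = 85.0   # reported conventional Alpaca training time
unsloth_hours = 3.0     # reported training time with Unsloth

speedup = baseline_hours / unsloth_hours
print(f"Implied speedup: {speedup:.1f}x")  # in the same ballpark as the advertised 30x

memory_reduction = 0.60  # reported reduction in memory usage
# ASSUMPTION: memory grows linearly with batch size. This ignores fixed costs
# such as the model weights, so treat the result as an optimistic upper bound.
batch_headroom = 1.0 / (1.0 - memory_reduction)
print(f"Rough batch-size headroom: {batch_headroom:.1f}x")
```

The implied speedup comes out a little under 30x, which is consistent with the headline figure being rounded.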

Benefits of Fine-Tuning with Unsloth and QLoRA:

  1. Efficiency Unleashed: Unsloth reduces the upcasting of weights during QLoRA, so fewer weights have to be held at higher precision at any one time, resulting in a smaller memory footprint. This efficiency, coupled with the direct use of bfloat16, empowers developers to achieve fine-tuning goals faster and with fewer resource demands.

  2. Innovative Attention Mechanisms: Unsloth integrates Flash Attention via xformers and Tri Dao's implementation, contributing to optimized transformer models. This innovative approach to attention mechanisms ensures that fine-tuning is not merely a technical task but a creative endeavor.

  3. Causal Mask for Speed: The adoption of a causal mask for speeding up training, instead of a separate attention mask, showcases Unsloth's commitment to reimagining traditional methodologies. This forward-thinking approach paves the way for more efficient and faster fine-tuning.

  4. Optimized Cross Entropy Loss: Unsloth doesn't just fine-tune; it fine-tunes with precision. The optimization of Cross Entropy loss computation significantly reduces memory consumption, ensuring that the process remains resource-friendly without compromising on accuracy.
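
To make the last two points more concrete, here is a minimal pure-Python sketch of the underlying ideas. It is not Unsloth's implementation; the function names are hypothetical and chosen for illustration only. The first function builds a causal mask, and the second computes cross entropy as logsumexp(logits) minus the target logit, which avoids storing the full softmax probability vector.

```python
import math

def causal_mask(seq_len):
    """Lower-triangular mask: position i may attend to positions 0..i only."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def cross_entropy_from_logits(logits, target):
    """Cross entropy computed as logsumexp(logits) - logits[target].

    Only a maximum and a running sum are needed, so the softmax
    probabilities are never materialized as a separate vector.
    """
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

def naive_cross_entropy(logits, target):
    """Reference version that stores every softmax probability."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -math.log(probs[target])

mask = causal_mask(4)
for row in mask:
    print(["x" if allowed else "." for allowed in row])

loss = cross_entropy_from_logits([2.0, 0.5, -1.0], target=0)
print(f"loss = {loss:.4f}")
```

Both loss functions return the same value; the difference is only in how much intermediate state they keep, which matters when the vocabulary has tens of thousands of entries.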

Code Implementation

Let's dive into the code for fine-tuning with Unsloth and QLoRA.

Step 1: Install Libraries

# Import the PyTorch library
import torch

# Get the compute capability (major, minor) of the current CUDA device (GPU)
major_version, minor_version = torch.cuda.get_device_capability()

# Apply the following if the GPU is Ampere or newer (RTX 30xx, RTX 40xx, A100, H100, L40, etc.)
if major_version >= 8:
    # Install the Unsloth library for Ampere and newer architectures from GitHub
    !pip install "unsloth[colab_ampere] @ git+https://github.com/unslothai/unsloth.git" -q
else:
    # Apply the following for older GPUs (V100, Tesla T4, RTX 20xx, etc.)
    # Install the Unsloth library for older GPUs from GitHub
    !pip install "unsloth[colab] @ git+https://github.com/unslothai/unsloth.git" -q

# Install the Hugging Face Transformers library from GitHub, which allows native 4-bit loading
!pip install "git+https://github.com/huggingface/transformers.git" -q

!pip install trl datasets -q
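
The `!pip` lines above only work inside a notebook. As a small sketch of the same decision in plain Python, the helper below maps a compute capability to the matching pip requirement string. The helper names are hypothetical; the extras `colab_ampere` and `colab` are the ones used in the install commands above.

```python
def pick_unsloth_extra(major_version):
    """Return the pip extra used above for a given CUDA compute capability."""
    # Compute capability 8.x and above covers Ampere and newer GPUs.
    return "colab_ampere" if major_version >= 8 else "colab"

def unsloth_pip_spec(major_version):
    """Build the requirement string that would be passed to `pip install`."""
    extra = pick_unsloth_extra(major_version)
    return f"unsloth[{extra}] @ git+https://github.com/unslothai/unsloth.git"

# Example: a Tesla T4 reports compute capability 7.5, an A100 reports 8.0.
print(unsloth_pip_spec(7))
print(unsloth_pip_spec(8))
```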

Step 2: Import Libraries

# Import FastLanguageModel from the Unsloth library
from unsloth import FastLanguageModel

# Maximum sequence length. Can be set arbitrarily; RoPE scaling is supported automatically.
max_seq_length = 2048

# Data type. None detects it automatically: Float16 for Tesla T4 and V100, Bfloat16 for Ampere+.
dtype = None

# Reduce memory usage with 4-bit quantization. Can be set to False to disable.
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-bnb-4bit",  # Use "unsloth/mistral-7b" for 16-bit loading
    max_seq_length=max_seq_length,             # Set the maximum sequence length
    dtype=dtype,                               # Set the data type
    load_in_4bit=load_in_4bit,                 # Apply the settings for 4-bit loading
    # token="hf_...",  # Pass a Hugging Face token for gated models (e.g., meta-llama/Llama-2-7b-hf)
)
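
To see why 4-bit loading matters, here is a rough estimate of the memory needed for the weights alone. This is a sketch under stated assumptions: the parameter count of roughly 7.24 billion for Mistral 7B is my approximation, and the estimate ignores quantization constants, activations, gradients, and optimizer state.

```python
GIB = 1024 ** 3

def weight_memory_gib(num_params, bits_per_param):
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / GIB

# ASSUMPTION: roughly 7.24 billion parameters for Mistral 7B.
num_params = 7.24e9

fp16_gib = weight_memory_gib(num_params, 16)
int4_gib = weight_memory_gib(num_params, 4)
print(f"16-bit weights: {fp16_gib:.1f} GiB")
print(f" 4-bit weights: {int4_gib:.1f} GiB")
```

Under these assumptions the 4-bit weights take a quarter of the space of the 16-bit weights, which is what lets a 7B model fit comfortably on a 16 GB GPU.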

Add LoRA Adapter and update only 1-10% of all parameters!

model = FastLanguageModel.get_peft_model(
    model,  # Specify the existing model

    r=16,  # Choose any positive number! Recommended values include 8, 16, 32, 64, 128, etc.
    # Rank parameter for LoRA. The smaller this value, the fewer parameters will be trained.

    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj",],
    # Specify the modules to which LoRA will be applied

    lora_alpha=16,  # Value not preserved in this post; 16 is an illustrative placeholder
    # Alpha parameter for LoRA. This value determines the strength of the applied LoRA.

    lora_dropout=0,  # Currently, only supports dropout = 0
    bias="none",     # Currently, only supports bias = "none"

    use_gradient_checkpointing=True,  # Use gradient checkpointing to improve memory efficiency
    random_state=3407,                # Seed for random number generation (illustrative placeholder)
    max_seq_length=max_seq_length,    # Set the maximum sequence length
)
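
To get a feel for how many parameters the adapter actually trains, the sketch below counts LoRA parameters for the seven target modules listed above. Each adapted linear layer of shape (in, out) adds r * (in + out) trainable weights. The layer dimensions are my assumptions based on the published Mistral 7B configuration (hidden size 4096, key/value projection size 1024, MLP size 14336, 32 layers), and the total parameter count is approximate.

```python
r = 16  # LoRA rank, as in the call above

# ASSUMPTION: dimensions taken from the Mistral 7B configuration.
hidden, kv_dim, intermediate, num_layers = 4096, 1024, 14336, 32

# (in_features, out_features) of each targeted linear layer
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

def lora_params(in_features, out_features, rank):
    """LoRA adds A (rank x in) and B (out x rank) next to each frozen weight."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, r) for i, o in shapes.values())
total = per_layer * num_layers

base_params = 7.24e9  # ASSUMPTION: approximate size of the base model
fraction = total / base_params
print(f"Trainable LoRA parameters: {total:,}")
print(f"Share of all parameters:   {fraction:.2%}")
```

With these dimensions, rank 16 lands below 1% of the model; larger ranks such as 64 or 128 scale the count linearly and move it into the range quoted above.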

Step 3: Load Dataset

# Define the prompt format for the Alpaca dataset
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

def formatting_prompts_func(examples):
    # Format each example in a batch from the dataset

    # Get instructions, inputs, and outputs
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]

    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Fill the prompt template with the instruction, input, and output
        text = alpaca_prompt.format(instruction, input, output)
        # Collect the formatted text
        texts.append(text)

    # Return the list of formatted texts under the "text" column
    return { "text" : texts, }

from datasets import load_dataset
# Import the load_dataset function from the datasets library

dataset = load_dataset("yahma/alpaca-cleaned", split="train")
# Load the training data of the cleaned version of the Alpaca dataset from yahma

dataset = dataset.map(formatting_prompts_func, batched=True,)
# Apply the formatting_prompts_func function to the dataset with batch processing
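
To check the formatting logic without downloading the dataset, the following self-contained sketch applies the same template to a tiny hand-written batch. The names `ALPACA_TEMPLATE` and `format_batch` are hypothetical, the sample rows are made up, and appending an end-of-sequence token is an optional addition of mine (it helps the model learn where a response ends); `</s>` is used here as an assumed EOS string.

```python
ALPACA_TEMPLATE = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

def format_batch(examples, eos_token=""):
    """Turn a batched dict of columns into a dict with one "text" column."""
    texts = []
    for instruction, input_text, output in zip(
        examples["instruction"], examples["input"], examples["output"]
    ):
        # Fill the three placeholders, then optionally mark the end of the sample.
        texts.append(ALPACA_TEMPLATE.format(instruction, input_text, output) + eos_token)
    return {"text": texts}

# Made-up rows in the same column layout as the Alpaca dataset.
batch = {
    "instruction": ["Add the numbers.", "Name a color."],
    "input": ["2 and 3", ""],
    "output": ["5", "Blue"],
}

formatted = format_batch(batch, eos_token="</s>")
print(formatted["text"][0])
```

With `batched=True`, `dataset.map` passes a dict of lists like `batch` above, which is why the function iterates with `zip` rather than handling a single row.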

Step 4: Train the Model

from trl import SFTTrainer
# Import SFTTrainer from the TRL library

from transformers import TrainingArguments
# Import TrainingArguments from the Transformers library

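trainer = SFTTrainer(
    # Initialize the SFTTrainer. The concrete values below were not preserved in
    # this post; they are illustrative placeholders, so tune them for your own run.
    model=model,                        # Specify the model to be used
    train_dataset=dataset,              # Specify the training dataset
    dataset_text_field="text",          # Specify the text field in the dataset
    max_seq_length=max_seq_length,      # Specify the maximum sequence length
    tokenizer=tokenizer,                # Tokenizer returned by from_pretrained
    args=TrainingArguments(             # Specify training arguments
        per_device_train_batch_size=2,  # Training batch size per device
        gradient_accumulation_steps=4,  # Number of steps for gradient accumulation
        warmup_steps=10,                # Number of warm-up steps
        max_steps=60,                   # Maximum number of steps
        learning_rate=2e-4,             # Learning rate
        fp16=not torch.cuda.is_bf16_supported(),  # Use fp16 when bf16 is unavailable
        bf16=torch.cuda.is_bf16_supported(),      # Use Bfloat16 when the GPU supports it
        logging_steps=1,                # Logging steps
        optim="adamw_8bit",             # Optimizer (here using 8-bit AdamW)
        weight_decay=0.01,              # Weight decay value
        lr_scheduler_type="linear",     # Type of learning rate scheduler (linear)
        seed=3407,                      # Random seed
        output_dir="outputs",           # Output directory
    ),
)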
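
Two of the training arguments interact in ways that are easy to misread: gradient accumulation multiplies the effective batch size, and the linear scheduler ramps the learning rate up during warm-up before decaying it to zero. The sketch below shows both with plain arithmetic. The function names are hypothetical and the numbers (batch size 2, 4 accumulation steps, 10 warm-up steps, 60 total steps, learning rate 2e-4) are illustrative values, not settings recovered from the original run.

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Number of examples that contribute to each optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices

def linear_schedule_lr(step, base_lr, warmup_steps, max_steps):
    """Linear warm-up from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, max_steps - step)
    return base_lr * remaining / max(1, max_steps - warmup_steps)

# Illustrative values only.
batch = effective_batch_size(per_device_batch=2, grad_accum_steps=4)
examples_seen = batch * 60
print(f"Effective batch size: {batch}, examples seen in 60 steps: {examples_seen}")

for step in (0, 5, 10, 35, 60):
    lr = linear_schedule_lr(step, base_lr=2e-4, warmup_steps=10, max_steps=60)
    print(f"step {step:2d}: lr = {lr:.2e}")
```

With so few optimizer steps, the run only touches a small slice of the dataset, which is fine for a demonstration but not for a full fine-tune.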


Step 5: Display Current Memory Statistics

gpu_stats = torch.cuda.get_device_properties(0)
# Get properties of the GPU device at index 0

start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
# Get the maximum reserved GPU memory in GB and round to 3 decimal places

max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
# Get the total GPU memory in GB and round to 3 decimal places

print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
# Display the GPU name and maximum memory

print(f"{start_gpu_memory} GB of memory reserved.")
# Display the reserved memory amount
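
The baseline recorded above becomes useful once training has finished, because the difference between the peak and the starting value approximates the memory used by training itself. Here is a small sketch of that calculation; the helper names are hypothetical and the numbers in the example are made up.

```python
def to_gb(num_bytes):
    """Convert bytes to GB, rounded to 3 decimal places as in the code above."""
    return round(num_bytes / 1024 / 1024 / 1024, 3)

def memory_report(start_gb, peak_gb, total_gb):
    """Summarize peak memory and the share attributable to training."""
    used_for_training = round(peak_gb - start_gb, 3)
    return {
        "peak_gb": peak_gb,
        "training_gb": used_for_training,
        "peak_pct": round(peak_gb / total_gb * 100, 3),
        "training_pct": round(used_for_training / total_gb * 100, 3),
    }

# After trainer.train(), the peak could be read with, for example:
#   peak_gb = to_gb(torch.cuda.max_memory_reserved())
report = memory_report(start_gb=4.0, peak_gb=6.0, total_gb=16.0)  # made-up numbers
for key, value in report.items():
    print(f"{key}: {value}")
```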

Step 6: Execute the Train Method

trainer_stats = trainer.train()
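
The object returned by `trainer.train()` carries a `metrics` dictionary that includes the total runtime under the key `train_runtime`. The sketch below formats that value; the helper name is hypothetical and the example dictionary is made up so that the snippet runs without a GPU.

```python
def summarize_training(metrics):
    """Format the training runtime from a Trainer metrics dictionary."""
    seconds = metrics["train_runtime"]
    minutes = round(seconds / 60, 2)
    return f"{seconds} seconds used for training ({minutes} minutes)."

# In the notebook this would be: summarize_training(trainer_stats.metrics)
example_metrics = {"train_runtime": 150.0}  # made-up value
print(summarize_training(example_metrics))
```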

Step 7: Convert the Model to GGUF

def colab_quantize_to_gguf(save_directory, quantization_method="q4_k_m"):
    # Define a function for conversion to GGUF

    from transformers.models.llama.modeling_llama import logger
    import os
    # Import the logger and the os module

    # Warn that this is still in development mode and encourage reporting any issues
    logger.warning_once(
        "Unsloth: `colab_quantize_to_gguf` is still in development mode.\n"\
        "If anything errors or breaks, please file a ticket on Github.\n"\
        "Also, if you used this successfully, please tell us on Discord!"
    )

    # From https://mlabonne.github.io/blog/posts/Quantize_Llama_2_models_using_ggml.html
    # Define the currently allowed quantization methods, with a description of each
    ALLOWED_QUANTS = {
        "q2_k"   : "Uses Q4_K for the attention.wv and feed_forward.w2 tensors, Q2_K for the other tensors.",
        "q3_k_l" : "Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
        "q3_k_m" : "Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
        "q3_k_s" : "Uses Q3_K for all tensors",
        "q4_0"   : "Original quant method, 4-bit.",
        "q4_1"   : "Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models.",
        "q4_k_m" : "Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K",
        "q4_k_s" : "Uses Q4_K for all tensors",
        "q5_0"   : "Higher accuracy, higher resource usage and slower inference.",
        "q5_1"   : "Even higher accuracy, resource usage and slower inference.",
        "q5_k_m" : "Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K",
        "q5_k_s" : "Uses Q5_K for all tensors",
        "q6_k"   : "Uses Q8_K for all tensors",
        "q8_0"   : "Almost indistinguishable from float16. High resource use and slow. Not recommended for most users.",
    }

    if quantization_method not in ALLOWED_QUANTS.keys():
        # If the specified quantization method is not allowed, raise an error
        error = f"Unsloth: Quant method = [{quantization_method}] not supported. Choose from below:\n"
        for key, value in ALLOWED_QUANTS.items():
            error += f"[{key}] => {value}\n"
        raise RuntimeError(error)

    # Display information about the conversion
    print_info = \
        f"==((====))==  Unsloth: Conversion from QLoRA to GGUF information\n"\
        f"   \\\\   /|    [0] Installing llama.cpp will take 3 minutes.\n"\
        f"O^O/ \\_/ \\    [1] Converting HF to GGUF 16bits will take 3 minutes.\n"\
        f"\\        /    [2] Converting GGUF 16bits to q4_k_m will take 20 minutes.\n"\
        f' "-____-"     In total, you will have to wait around 26 minutes.\n'
    print(print_info)
    # Display information about the conversion process

    if not os.path.exists("llama.cpp"):
        # If llama.cpp does not exist, install it
        print("Unsloth: [0] Installing llama.cpp. This will take 3 minutes...")
        !git clone https://github.com/ggerganov/llama.cpp
        !cd llama.cpp && make clean && LLAMA_CUBLAS=1 make -j
        !pip install gguf protobuf

    print("Unsloth: Starting conversion from HF to GGUF 16bit...")
    # Display that conversion from HF to GGUF 16bit is starting
    # print("Unsloth: [1] Converting HF into GGUF 16bit. This will take 3 minutes...")
    !python llama.cpp/convert.py {save_directory} \
        --outfile {save_directory}-unsloth.gguf \
        --outtype f16

    print("Unsloth: Starting conversion from GGUF 16bit to q4_k_m...")
    # Display that conversion from GGUF 16bit to the specified quantization method is starting
    # print("Unsloth: [2] Converting GGUF 16bit into q4_k_m. This will take 20 minutes...")
    final_location = f"./{save_directory}-{quantization_method}-unsloth.gguf"
    !./llama.cpp/quantize ./{save_directory}-unsloth.gguf \
        {final_location} {quantization_method}

    print(f"Unsloth: Output location: {final_location}")
    # Display the output location of the converted file

# Import the unsloth_save_model function from the Unsloth library
from unsloth import unsloth_save_model

# unsloth_save_model has the same arguments as model.save_pretrained
unsloth_save_model(model, tokenizer, "output_model", push_to_hub=False, token=None)
# Save the model and tokenizer as "output_model". Do not push to the Hugging Face Hub

colab_quantize_to_gguf("output_model", quantization_method="q4_k_m")
# Convert "output_model" to GGUF format. Use the quantization method "q4_k_m"
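
The `!` shell lines in the function above only work in a notebook. Outside one, the same two commands can be assembled as argument lists and passed to `subprocess.run`. This is a sketch: the helper names are hypothetical, and the script paths (`llama.cpp/convert.py`, `llama.cpp/quantize`) are the ones used in this post, which may differ in other llama.cpp versions.

```python
import shlex
import subprocess  # only needed if the run line below is uncommented

def build_convert_cmd(save_directory):
    """Command that converts the saved HF model to a 16-bit GGUF file."""
    return [
        "python", "llama.cpp/convert.py", save_directory,
        "--outfile", f"{save_directory}-unsloth.gguf",
        "--outtype", "f16",
    ]

def build_quantize_cmd(save_directory, quantization_method="q4_k_m"):
    """Command that quantizes the 16-bit GGUF file with the chosen method."""
    final_location = f"./{save_directory}-{quantization_method}-unsloth.gguf"
    return [
        "./llama.cpp/quantize",
        f"./{save_directory}-unsloth.gguf",
        final_location,
        quantization_method,
    ]

for cmd in (build_convert_cmd("output_model"), build_quantize_cmd("output_model")):
    print(shlex.join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment to actually run the command
```

Passing a list instead of a single shell string avoids quoting problems when the save directory contains spaces.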


In closing, our exploration with Unsloth has been a captivating journey into the frontier of advanced language models and AI innovations. From Ampere and Hopper architectures to the artistry of Low-Rank Adaptation adapters, we navigated the realms of data preparation, model training, and memory optimization.

The Alpaca dataset, formatted with a prompt template and trained on through TRL's SFTTrainer, served as our canvas. We looked at GPU memory usage, ran the training step, and converted the result to GGUF, showcasing both technical prowess and creativity.

As our article concludes, the Unsloth library stands as a testament to the fusion of technology and creativity. Our journey's final act saw the model transformed into GGUF format, highlighting the adaptability of our tools.

This exploration wasn't just about code; it was a quest for innovation and inspiration. Unsloth's commitment to originality and storytelling invites us to continue pushing the boundaries in the ever-evolving landscape of language models and AI.

Stay connected and support my work through various platforms:

Medium: You can read my latest articles and insights on Medium at https://medium.com/@andysingal

Paypal: Enjoyed my article? Buy me a coffee! https://paypal.me/alphasingal?country.x=US&locale.x=en_US

Requests and questions: If you have a project in mind that you’d like me to work on or if you have any questions about the concepts I’ve explained, don’t hesitate to let me know. I’m always looking for new ideas for future Notebooks and I love helping to resolve any doubts you might have.

