Fine-Tuning LLMs: Supervised Fine-Tuning and Reward Modelling

Community blog post
Published December 4, 2023

Fine-tuning adapts a pre-trained model to a specific task. This post covers two families of fine-tuning methods: Supervised Fine-Tuning (SFT) and preference-based methods built around reward modelling, namely Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO). It also covers the tools that support these methods, including the SFTTrainer and DPOTrainer classes from Hugging Face's Transformer Reinforcement Learning (TRL) library.

Supervised Fine-Tuning (SFT)

Supervised Fine-Tuning is a common way to adapt a pre-trained model to a specific task. The model is trained on a labeled dataset of inputs paired with the desired outputs. For a language model, this means learning to generate the reference answer for each prompt, token by token.

The TRL library provides the SFTTrainer class to facilitate the SFT process. You point the trainer at a text column in your training dataset (for example, a column of a CSV file). Each row of that column holds one complete training example, with the system instructions, question, and answer already combined into a single prompt structure.

Different models may require different prompt structures, but a standard approach is to use the dataset structure described in OpenAI's InstructGPT paper. A good example of such a dataset is the "text" column of the dataset available at this link.
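To make the idea of a single "text" column concrete, here is a minimal sketch of how separate system, question, and answer fields might be joined into one training string. The "### System / ### Human / ### Assistant" markers, the build_text helper, and the sample row are assumptions made for illustration; they are not a format required by TRL or by any particular model, so check the template your base model expects.

```python
def build_text(system: str, question: str, answer: str) -> str:
    """Join the three parts into one training string.

    The section markers below are a hypothetical template, not a TRL requirement.
    """
    return (
        f"### System:\n{system.strip()}\n\n"
        f"### Human:\n{question.strip()}\n\n"
        f"### Assistant:\n{answer.strip()}"
    )


# Illustrative rows; in practice these would come from your own dataset.
rows = [
    {
        "system": "You are a helpful assistant.",
        "question": "What is the capital of France?",
        "answer": "Paris",
    },
]

# Each row ends up with a single "text" field holding the full prompt structure.
dataset_rows = [{"text": build_text(**row)} for row in rows]
print(dataset_rows[0]["text"])
```

The resulting list of dictionaries can be written to a CSV file or loaded into a dataset object, and the "text" column is then the one the trainer reads.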

Reward Modelling: RLHF and DPO

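Reward modelling is another approach to fine-tuning, in which the model is trained to maximize a reward rather than to imitate reference answers. In Reinforcement Learning from Human Feedback (RLHF), a separate reward model is first trained on human preferences between model outputs, and the language model is then optimized against that reward model with reinforcement learning. Direct Preference Optimization (DPO) uses the same kind of pairwise comparison data but skips the explicit reward model and the reinforcement learning loop: the model is trained directly on pairs of preferred and rejected outputs, and the reward is defined only implicitly.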
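In RLHF, the reward function is usually learned from pairwise comparisons, where an annotator marks one of two outputs as preferred. A minimal sketch of the pairwise loss commonly used for this, assuming the reward model has already produced one scalar score per output (the function names and the example scores are illustrative):

```python
import math


def _softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pairwise_reward_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)).

    The loss is small when the preferred output scores well above the
    rejected one, and large when the ordering is reversed.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) is the same as softplus(-m).
    return _softplus(-margin)


# Illustrative scores: correct ordering, a tie, and a reversed ordering.
print(pairwise_reward_loss(2.0, 0.0))
print(pairwise_reward_loss(0.0, 0.0))
print(pairwise_reward_loss(0.0, 2.0))
```

A tie gives a loss of log(2), and the loss shrinks as the reward model separates the preferred output from the rejected one.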

The TRL library also provides the DPOTrainer class to facilitate the DPO process. It is typically applied after SFT, to align the fine-tuned model with preference data such as AI feedback.
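The objective that DPO optimizes can be sketched for a single preference pair in a few lines. This is a plain-Python illustration of the loss from the DPO paper, not DPOTrainer's implementation. It assumes the summed log-probabilities of the chosen and rejected responses have already been computed under both the model being trained (the policy) and a frozen reference model such as the SFT checkpoint. The function name, the example numbers, and beta = 0.1 are illustrative choices.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (chosen, rejected) pair of responses."""
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model, in log space.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)), written in a numerically stable form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# Policy identical to the reference: the loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy has shifted probability toward the chosen response: lower loss.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

Because the reward only appears implicitly through the log-ratios, no separate reward model is trained; beta controls how strongly the policy is pushed away from the reference.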

Full Training vs. Low-Rank Adaptation (LoRA) Training

When it comes to training models, there are two main approaches: Full Training and Low-Rank Adaptation (LoRA) Training.

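Full training updates every parameter of the model, which is computationally expensive. It typically requires a multi-GPU machine, on the order of 8 GPUs with 80GB of VRAM each, using DeepSpeed ZeRO-3 to shard the training state across devices. The exact requirement depends on the size of the model.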
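A rough back-of-the-envelope calculation shows why full training needs this much hardware. The sketch below assumes mixed-precision training with the Adam optimizer, which keeps about 16 bytes of state per parameter (the accounting used in the ZeRO paper), and a hypothetical 7-billion-parameter model. Activations, buffers, and memory fragmentation are ignored, so real usage is higher.

```python
# Bytes of training state per parameter for mixed-precision Adam:
# 2 (fp16 weights) + 2 (fp16 gradients) + 12 (fp32 weights, momentum, variance).
BYTES_PER_PARAM = 2 + 2 + 12


def training_state_gb(num_params: int, num_gpus: int = 1) -> float:
    """Approximate per-GPU training state in GB (1 GB = 1e9 bytes).

    Assumes ZeRO-3 style sharding, where weights, gradients, and optimizer
    states are split evenly across all GPUs. Activations are not counted.
    """
    return num_params * BYTES_PER_PARAM / num_gpus / 1e9


NUM_PARAMS = 7_000_000_000  # hypothetical model size, chosen for illustration

total_gb = training_state_gb(NUM_PARAMS)
per_gpu_gb = training_state_gb(NUM_PARAMS, num_gpus=8)
print(f"total training state: {total_gb:.0f} GB")
print(f"per GPU across 8 GPUs: {per_gpu_gb:.0f} GB")
```

Under these assumptions the training state alone exceeds the memory of any single 80GB GPU, which is why the state is sharded across several devices.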

LoRA or QLoRA fine-tuning, by contrast, is more resource-efficient. The base model's weights are frozen and only small low-rank adapter matrices are trained (QLoRA additionally keeps the frozen base model in 4-bit quantized form), which can be done on a single consumer 24GB GPU. In this approach, lora_target_modules is set to the attention projection layers q_proj, k_proj, v_proj, and o_proj.
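To see how small the trained part is, recall that LoRA adds two low-rank matrices of shapes (d_out x r) and (r x d_in) next to each targeted weight matrix, so each target contributes r * (d_in + d_out) trainable parameters. The sketch below counts them for a hypothetical model in which all four projections are square 4096 x 4096 matrices across 32 layers, with rank 16. These dimensions are assumptions for illustration; real models differ, and some use smaller k_proj and v_proj matrices.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one (d_out x d_in) weight matrix."""
    return rank * (d_in + d_out)


# Hypothetical model dimensions, chosen for illustration only.
HIDDEN = 4096
LAYERS = 32
RANK = 16
TARGET_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj"]

# Parameters trained by LoRA across all targeted projections.
adapter_params = LAYERS * len(TARGET_MODULES) * lora_params(HIDDEN, HIDDEN, RANK)
# Parameters in those same projections that stay frozen.
frozen_params = LAYERS * len(TARGET_MODULES) * HIDDEN * HIDDEN

print(f"trainable adapter parameters: {adapter_params:,}")
print(f"frozen projection parameters: {frozen_params:,}")
print(f"ratio: {adapter_params / frozen_params:.4%}")
```

With these numbers the adapters hold under 1% of the parameters in the projections they modify, so the gradients and optimizer states that dominate full-training memory become correspondingly small.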

AutoTrain Advanced by Hugging Face

If the process of fine-tuning seems daunting, Hugging Face offers a simpler route: AutoTrain Advanced. This tool lets you fine-tune models for free without writing the training code yourself. All you need to do is specify the base pre-trained model and the dataset to fine-tune on. A more detailed blog on using AutoTrain can be found at this link.


Conclusion

Fine-tuning is a critical step in adapting pre-trained models to specific tasks. Whether you choose Supervised Fine-Tuning, Reward Modelling, or a combination of both, the TRL library and the AutoTrain Advanced tool from Hugging Face provide the resources to carry it out. Understanding how these methods differ, and what hardware each training mode needs, makes it easier to pick the right setup for your own task.