DPO-Internlm2-1_8B / README.md
QiyuWu's picture
Update README.md
9d12aed verified
metadata
library_name: transformers
license: apache-2.0

DPO_Quant

🤗 Hugging Face Space • 🎢Wandb

Learning from human preferences is a paradigm adopted in the natural language processing literature to better align LLM to human desiderata. Recently RLHF has been used successfully in many senses to get a better performance. In 2023 NeurIPS, DPO was proposed for addressing the problem of huge resource consumption in training. However, for people who don't have enough GPUs, training a model with DPO is still a difficult situation. In this reposity, I implemented a code reproduction of the DPO algorithm and the BitsandBytes is used for the model quantization to make run of DPO on a 24G 4090 possible. Besides, I deployed a trained model on the Huggingface Space using llama.cpp for accelerating.

👋Getting Started

  • System requirement: Ubuntu20.04/Windows 11, Cuda 12.1
  • Tested GPUs: RTX4090

Create conda environment:

 conda create -n dpo python=3.10
 conda activate dpo

Install packages with pip

 pip install -r requirements.txt

Switch Source

For some reason, user may not be able to access HuggingFace conveniently. Run the code bellow to handle this.

export HF_ENDPOINT="https://hf-mirror.com"

📈Training

Model pythia2.8B and dataset Anthropic/hh-rlhf is used in the training. For customized training, you need change dataset_name and model_name in the file config.yaml.

Run the code bellow to start a training.

python Train.py

🤔Experiment Analysis

Experiments were conducted with BitsandBytes to load the quantization model. Two models was trained. Model pythia2.8B was trained using DPO loss and model internlm2-chat-1_8b-sft was trained using IPO loss. The main pipline of DPO is (1)Training the model using SFT on a preference dataset and (2)Traing the model using DPO on the same dataset.

In SFT, the run lasts 2h20min, we can see from the figures that the eval_loss snowly decreases when the step grows.

For more details, check here and here

In DPO, the run lasts 7h30min, we can see from the figures that the accuracies and margins snowly increases when the step grows.

For more details, check here and here

Compared to the example in eric-mitchell/direct-preference-optimization: Reference implementation for DPO (Direct Preference Optimization) (github.com), our experiment is more unstable in training, but achieved pretty good results in accuracy. Besides, due to time constraints, our experiments were only trained on top of about 25K conversations, which is why our experiments did not achieve significantly good results on top of some other metrics.

Besides, comparing the two losses, we can find that the IPO loss's rewards of chosen responce was declining in the training stage yet the DPO's is rising. This phenomenon demonstrates that the IPO loss effectively avoids greedy policies.

🤗Huggingface Space Deployment

Above is the chatbot deployed on Huggingface Space, you can have a try(due to the poor computility of the 2vCPU, the responce time may be about a minute)link.

For deploy the project, I used the llama.cpp to convert my trained model to a '.gguf' file. With loading the quantinized file, we only need about 1G RAM for runing the model.

📄Reference

The work is based on a lot of previous work and blogs, as well as some HuggingFace courses and documentation. Many thanks to authors for sharing this, it has helped me gain a lot. Listed below are the references I used to learn.