---
language:
- en
license: apache-2.0
tags:
- text-generation-inference
- transformers
- unsloth
- llama
- trl
base_model: unsloth/llama-3-8b-bnb-4bit
datasets:
- unalignment/toxic-dpo-v0.2
---
I initially fine-tuned with a DPO dataset, so the columns are `prompt`, `chosen`, and `rejected`.

```python
from datasets import load_dataset

# Load the preference dataset
dataset = load_dataset("unalignment/toxic-dpo-v0.2", split="train")

# Define the formatting function (passes the DPO columns through unchanged)
def formatting_prompts_func(examples):
    return {
        "prompt": examples["prompt"],
        "chosen": examples["chosen"],
        "rejected": examples["rejected"],
    }

# Apply the formatting function to the dataset
dataset = dataset.map(formatting_prompts_func, batched=True)
```
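
For a quick sanity check that the columns are what the formatting function expects, a couple of lines like the following can be run (output truncated here):

```python
# Confirm the DPO-style columns: prompt, chosen, rejected
print(dataset.column_names)
print(dataset[0]["prompt"][:200])
```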
 
I used this with supervised fine-tuning (SFT), which trains the model on labeled examples from the target domain, such as question-answering, summarization, or dialogue data. The objective is to adapt the model's behavior to the desired output format and data distribution.
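
Both trainer snippets below assume that `model`, `tokenizer`, and `max_seq_length` are already defined. A minimal sketch of how that is typically done with Unsloth follows; the LoRA hyperparameters here are illustrative rather than the exact values used:

```python
from unsloth import FastLanguageModel

max_seq_length = 2048  # illustrative value

# Load the 4-bit base model and its tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = None,        # auto-detect (bf16 on newer GPUs, fp16 otherwise)
    load_in_4bit = True,
)

# Attach LoRA adapters so only a small set of weights is trained
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = True,
    random_state = 3407,
)
```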

```python
import torch
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Packing can make training ~5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)
```
 
But this dataset is DPO-specific (Direct Preference Optimization), i.e. its columns are `prompt`, `chosen`, and `rejected` rather than a single `text` field.
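
If one did want to run plain SFT on this data (matching the `dataset_text_field = "text"` setting above), one option would be to flatten each prompt/chosen pair into a single text column first. This is purely illustrative, with a made-up prompt template, and is not what the notebook does:

```python
# Hypothetical helper: build a "text" column from prompt + chosen
# so the dataset matches what SFTTrainer's dataset_text_field expects.
def to_sft_text(example):
    return {
        "text": f"### Prompt:\n{example['prompt']}\n\n### Response:\n{example['chosen']}"
    }

sft_dataset = dataset.map(to_sft_text)
```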

DPO is a subsequent step after SFT, where the model undergoes preference learning using preference data, ideally from the same distribution as the SFT examples. It involves ranking pairs of outputs based on human feedback, such as which one is more informative, fluent, or engaging.



```python
from unsloth import FastLanguageModel, PatchDPOTrainer
PatchDPOTrainer()  # patch TRL's DPOTrainer to use Unsloth's memory-efficient kernels
import torch
from transformers import TrainingArguments
from trl import DPOTrainer

dpo_trainer = DPOTrainer(
    model = model,
    ref_model = None,  # with LoRA adapters, the frozen base weights act as the reference model
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_ratio = 0.1,
        num_train_epochs = 3,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        seed = 42,
        output_dir = "outputs",
    ),
    beta = 0.1,
    train_dataset = dataset,
    # eval_dataset = YOUR_DATASET_HERE,
    tokenizer = tokenizer,
    max_length = 1024,
    max_prompt_length = 512,
)
```
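
Once the trainer is constructed, training and saving the LoRA adapters take one call each; the output path below is just an example:

```python
# Run DPO training, then save the adapters and tokenizer locally
dpo_trainer.train()
model.save_pretrained("lora_model")       # example path
tokenizer.save_pretrained("lora_model")
```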
 
Key points about SFTTrainer:
- Initial step in the fine-tuning process
- Trains the model on labeled examples from the target task/domain
- Aims to improve performance on that specific task
- Adapts the model to the data distribution and output format


The key aspects of DPO are:
- Performed after the initial SFT step
- Uses preference data consisting of ranked pairs of outputs
- Aims to align the model's outputs with human preferences and expectations
- Optimizes a binary cross-entropy loss based on the ranked pairs (see the formula after this list)
- A simplified approach compared to traditional Reinforcement Learning from Human Feedback (RLHF)
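
For reference, the binary cross-entropy objective that DPO optimizes over each (prompt, chosen, rejected) triple is:

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
$$

where $y_w$ is the chosen response, $y_l$ the rejected one, $\pi_{\text{ref}}$ the frozen reference model, and $\beta$ the `beta = 0.1` passed to `DPOTrainer` above.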
In summary, SFTTrainer is used for the initial supervised fine-tuning on the target task, while DPO is a subsequent step that fine-tunes the model further by incorporating human preferences and feedback on the model's outputs.



# Uploaded model

- **Developed by:** ebowwa
- **License:** apache-2.0
- **Finetuned from model:** unsloth/llama-3-8b-bnb-4bit

This llama model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Huggingface's TRL library.

Training notebook (Colab): https://colab.research.google.com/drive/14ArcJ4hR613jH0HxYcT734it_HVHG_bb?usp=sharing

[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)