ebowwa committed on
Commit aef8a38
1 Parent(s): 4eee629

Update README.md

Files changed (1)
  1. README.md +99 -0
README.md CHANGED
@@ -10,6 +10,105 @@ tags:
  - trl
  base_model: unsloth/llama-3-8b-bnb-4bit
  ---
+ I initially fine-tuned with a DPO dataset, so the column headers are `prompt`, `chosen`, and `rejected`.
+
+ from datasets import load_dataset
+
+ # Load the dataset
+ dataset = load_dataset("unalignment/toxic-dpo-v0.2", split="train")
+
+ # Define the formatting function
+ def formatting_prompts_func(examples):
+     return {
+         "prompt": examples["prompt"],
+         "chosen": examples["chosen"],
+         "rejected": examples["rejected"],
+     }
+
+ # Apply the formatting function to the dataset
+ dataset = dataset.map(formatting_prompts_func, batched=True)
+
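+ A quick sanity check on the mapped dataset (an optional snippet, not part of the original training script) confirms the three preference columns are present:
+
+ ```python
+ # Expect ['prompt', 'chosen', 'rejected'] among the column names
+ print(dataset.column_names)
+ print(dataset[0]["prompt"][:200])
+ ```
+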
+ I used this with supervised fine-tuning (SFT), which fine-tunes the LLM on labeled examples from the target task or domain, such as question-answering, summarization, or dialogue data. The objective is to adapt the model's behavior to the desired output format and data distribution.
+
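+ The `model`, `tokenizer`, and `max_seq_length` used below come from the usual Unsloth 4-bit + LoRA setup, which is not shown in this commit; a minimal sketch of that step (the LoRA hyperparameters here are assumptions, not this repo's actual values) looks like:
+
+ ```python
+ from unsloth import FastLanguageModel
+
+ max_seq_length = 2048  # assumed value; choose to fit GPU memory
+
+ # Load the 4-bit quantized base model and its tokenizer
+ model, tokenizer = FastLanguageModel.from_pretrained(
+     model_name = "unsloth/llama-3-8b-bnb-4bit",
+     max_seq_length = max_seq_length,
+     dtype = None,        # auto-detect bf16/fp16
+     load_in_4bit = True,
+ )
+
+ # Attach LoRA adapters so only a small fraction of the weights is trained
+ model = FastLanguageModel.get_peft_model(
+     model,
+     r = 16,
+     target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
+                       "gate_proj", "up_proj", "down_proj"],
+     lora_alpha = 16,
+     lora_dropout = 0,
+     bias = "none",
+     use_gradient_checkpointing = True,
+     random_state = 3407,
+ )
+ ```
+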
+ import torch
+ from trl import SFTTrainer
+ from transformers import TrainingArguments
+
+ trainer = SFTTrainer(
+     model = model,
+     tokenizer = tokenizer,
+     train_dataset = dataset,
+     dataset_text_field = "text",
+     max_seq_length = max_seq_length,
+     dataset_num_proc = 2,
+     packing = False, # Can make training 5x faster for short sequences.
+     args = TrainingArguments(
+         per_device_train_batch_size = 2,
+         gradient_accumulation_steps = 4,
+         warmup_steps = 5,
+         max_steps = 60,
+         learning_rate = 2e-4,
+         fp16 = not torch.cuda.is_bf16_supported(),
+         bf16 = torch.cuda.is_bf16_supported(),
+         logging_steps = 1,
+         optim = "adamw_8bit",
+         weight_decay = 0.01,
+         lr_scheduler_type = "linear",
+         seed = 3407,
+         output_dir = "outputs",
+     ),
+ )
+
+ But this dataset is DPO (Direct Preference Optimization) specific, i.e. `prompt`, `chosen`, `rejected`.
+
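+ (For SFT to actually consume this data with `dataset_text_field = "text"`, the preference columns would first have to be flattened into a single text column; a hypothetical helper, not something done in this commit, might look like the following.)
+
+ ```python
+ # Hypothetical: build a "text" column from prompt + chosen response for SFT
+ def to_text(examples):
+     return {
+         "text": [p + "\n" + c for p, c in zip(examples["prompt"], examples["chosen"])]
+     }
+
+ sft_dataset = dataset.map(to_text, batched=True)
+ ```
+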
+ DPO is a subsequent step after SFT, where the model undergoes preference learning using preference data, ideally from the same distribution as the SFT examples. It involves ranking pairs of outputs based on human feedback, such as which one is more informative, fluent, or engaging.
+
+ from unsloth import FastLanguageModel, PatchDPOTrainer
+ PatchDPOTrainer()
+ import torch
+ from transformers import TrainingArguments
+ from trl import DPOTrainer
+
+ dpo_trainer = DPOTrainer(
+     model = model,
+     ref_model = None,
+     args = TrainingArguments(
+         per_device_train_batch_size = 2,
+         gradient_accumulation_steps = 4,
+         warmup_ratio = 0.1,
+         num_train_epochs = 3,
+         fp16 = not torch.cuda.is_bf16_supported(),
+         bf16 = torch.cuda.is_bf16_supported(),
+         logging_steps = 1,
+         optim = "adamw_8bit",
+         seed = 42,
+         output_dir = "outputs",
+     ),
+     beta = 0.1,
+     train_dataset = dataset,
+     # eval_dataset = YOUR_DATASET_HERE,
+     tokenizer = tokenizer,
+     max_length = 1024,
+     max_prompt_length = 512,
+ )
+
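+ From here, training and saving the LoRA adapters would typically be a couple of calls (a sketch; the local path "lora_model" is an arbitrary example, not taken from this repo):
+
+ ```python
+ # Run DPO training, then save the adapter weights and tokenizer locally
+ dpo_trainer.train()
+ model.save_pretrained("lora_model")
+ tokenizer.save_pretrained("lora_model")
+ ```
+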
+ Key points about SFTTrainer:
+ - Initial step in the fine-tuning process
+ - Trains the model on labeled examples from the target task/domain
+ - Aims to improve performance on that specific task
+ - Adapts the model to the data distribution and output format
+
+ The key aspects of DPO are:
+ - Performed after the initial SFT step
+ - Uses preference data consisting of ranked pairs of outputs
+ - Aims to align the model's outputs with human preferences and expectations
+ - Optimizes a binary cross-entropy loss based on the ranked pairs (see the objective written out below)
+ - A simplified approach compared to traditional Reinforcement Learning from Human Feedback (RLHF)
+
+ In summary, SFTTrainer is used for the initial supervised fine-tuning on the target task, while DPO is a subsequent step that fine-tunes the model further by incorporating human preferences and feedback on the model's outputs.
+
 
  # Uploaded model