CultriX committed
Commit 1ea5e91
1 Parent(s): 23e02b3

Update README.md

Files changed (1):
  1. README.md (+65 -2)

README.md CHANGED
@@ -6,5 +6,68 @@ pipeline_tag: text-generation
  dtype: bfloat16
  ---

- Finetuned zyh3826/GML-Mistral-merged-v1 model with DPO using Intel's dataset for neural-chat-7b-v3-1.
- Fine-tuning took about an hour on Google Colab A-1000 GPU with 40GB VRAM.
+ # DESCRIPTION
+ MistralTrix-v1 is a zyh3826/GML-Mistral-merged-v1 model that has been further fine-tuned with Direct Preference Optimization (DPO) using Intel's dataset for neural-chat-7b-v3-1.
+ It surpasses the original model on several benchmarks (see results).
+
+ It is directly inspired by the RLHF process that the authors of Intel/neural-chat-7b-v3-1 describe for improving performance.
+ I used the same dataset and reformatted it to apply the ChatML template; a sketch of that reformatting step is shown below.
+
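+ As an illustration only (not the exact notebook code), here is a minimal sketch of that reformatting, assuming the dataset is Intel/orca_dpo_pairs with `system`, `question`, `chosen`, and `rejected` columns; the function name `chatml_format` is illustrative:
+
+ ```python
+ # Sketch only: wrap each preference pair in ChatML control tokens so the
+ # DPO trainer sees prompt/chosen/rejected strings in the ChatML format.
+ from datasets import load_dataset
+
+ def chatml_format(example):
+     system = ""
+     if example["system"]:
+         system = "<|im_start|>system\n" + example["system"] + "<|im_end|>\n"
+     prompt = "<|im_start|>user\n" + example["question"] + "<|im_end|>\n<|im_start|>assistant\n"
+     return {
+         "prompt": system + prompt,
+         "chosen": example["chosen"] + "<|im_end|>\n",
+         "rejected": example["rejected"] + "<|im_end|>\n",
+     }
+
+ dataset = load_dataset("Intel/orca_dpo_pairs")["train"]
+ dataset = dataset.map(chatml_format, remove_columns=dataset.column_names)
+ ```
+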
+ The code used to train this model is available on Google Colab and GitHub.
+ Fine-tuning took about an hour on a Google Colab A100 GPU with 40GB of VRAM.
+
+ ```python
+ # Training script (condensed). Assumes `model_name`, `new_model`, `dataset`,
+ # and `tokenizer` are defined earlier in the notebook.
+ import torch
+ from transformers import AutoModelForCausalLM, TrainingArguments
+ from peft import LoraConfig
+ from trl import DPOTrainer
+
+ # LoRA configuration
+ peft_config = LoraConfig(
+     r=16,
+     lora_alpha=16,
+     lora_dropout=0.05,
+     bias="none",
+     task_type="CAUSAL_LM",
+     target_modules=['k_proj', 'gate_proj', 'v_proj', 'up_proj', 'q_proj', 'o_proj', 'down_proj']
+ )
+
+ # Model to fine-tune
+ model = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     torch_dtype=torch.float16,
+     load_in_4bit=True
+ )
+ model.config.use_cache = False
+
+ # Reference model
+ ref_model = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     torch_dtype=torch.float16,
+     load_in_4bit=True
+ )
+
+ # Training arguments
+ training_args = TrainingArguments(
+     per_device_train_batch_size=4,
+     gradient_accumulation_steps=4,
+     gradient_checkpointing=True,
+     learning_rate=5e-5,
+     lr_scheduler_type="cosine",
+     max_steps=200,
+     save_strategy="no",
+     logging_steps=1,
+     output_dir=new_model,
+     optim="paged_adamw_32bit",
+     warmup_steps=100,
+     bf16=True,
+     report_to="wandb",
+ )
+
+ # Create DPO trainer
+ dpo_trainer = DPOTrainer(
+     model,
+     ref_model,
+     args=training_args,
+     train_dataset=dataset,
+     tokenizer=tokenizer,
+     peft_config=peft_config,
+     beta=0.1,
+     max_prompt_length=1024,
+     max_length=1536,
+ )
+ ```
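+
+ The snippet above stops before the training call. A minimal sketch of how training could be completed and the LoRA adapter merged back into the base model is shown below; the directory name `final_checkpoint` is illustrative and not taken from the original notebook:
+
+ ```python
+ # Sketch only: run DPO training, save the adapter, then merge it into the
+ # base model. Reuses `model_name`, `new_model`, `tokenizer`, and `dpo_trainer`
+ # from the training script above.
+ import torch
+ from peft import PeftModel
+ from transformers import AutoModelForCausalLM
+
+ dpo_trainer.train()
+
+ # Save the trained adapter and tokenizer.
+ dpo_trainer.model.save_pretrained("final_checkpoint")
+ tokenizer.save_pretrained("final_checkpoint")
+
+ # Reload the base model in half precision and merge the adapter weights.
+ base_model = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     return_dict=True,
+     torch_dtype=torch.float16,
+ )
+ merged_model = PeftModel.from_pretrained(base_model, "final_checkpoint")
+ merged_model = merged_model.merge_and_unload()
+
+ # Save the merged model under the new name.
+ merged_model.save_pretrained(new_model)
+ tokenizer.save_pretrained(new_model)
+ ```
+
+ From there, the merged model can be loaded with transformers like any other causal language model.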