File size: 3,743 Bytes
3cc1996
 
7cc451c
 
 
 
 
 
 
3cc1996
 
7cc451c
3cc1996
7cc451c
 
 
 
72c79bf
7cc451c
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
---
library_name: transformers
base_model:
  - nbeerbower/llama-3-wissenschaft-8B
datasets:
- jondurbin/truthy-dpo-v0.1
- kyujinpy/orca_math_dpo
license: other
license_name: llama3
---

![image/png](https://huggingface.co/nbeerbower/bophades-mistral-7B/resolve/main/bophades.png)

# llama-3-bophades-v3-8B

This model is based on Llama-3-8b, and is governed by [META LLAMA 3 COMMUNITY LICENSE AGREEMENT](LICENSE)

[nbeerbower/llama-3-wissenschaft-8B](https://huggingface.co/nbeerbower/llama-3-wissenschaft-8B) finetuned on [jondurbin/truthy-dpo-v0.1](https://huggingface.co/datasets/jondurbin/truthy-dpo-v0.1) and [kyujinpy/orca_math_dpo](https://huggingface.co/datasets/kyujinpy/orca_math_dpo). 

### Method

Finetuned using an A100 on Google Colab.

[Fine-tune a Mistral-7b model with Direct Preference Optimization](https://towardsdatascience.com/fine-tune-a-mistral-7b-model-with-direct-preference-optimization-708042745aac) - [Maxime Labonne](https://huggingface.co/mlabonne)

### Configuration

Dataset preperation and message formatting:

```python
def chatml_format(example):
    # Initialize formatted system message
    system = ""

    # Check if 'system' field exists and is not None
    if example.get('system'):
        system = "<|im_start|>system\n" + example['system'] + "<|im_end|>\n"

    # Format instruction
    instruction = ""
    if example.get('prompt'):
        instruction = example['prompt']
    if example.get('question'):
        instruction = example['question']
    prompt = "<|im_start|>user\n" + instruction + "<|im_end|>\n<|im_start|>assistant\n"

    # Format chosen answer
    chosen = example['chosen'] + "<|im_end|>\n"

    # Format rejected answer
    rejected = example['rejected'] + "<|im_end|>\n"

    return {
        "prompt": system + prompt,
        "chosen": chosen,
        "rejected": rejected,
    }

# Array of datasets to concat
ds = [
    "jondurbin/truthy-dpo-v0.1",
    "kyujinpy/orca_math_dpo"
]

# load_dataset and combine all
loaded_datasets = [load_dataset(dataset_name, split='train') for dataset_name in ds]
dataset = concatenate_datasets(loaded_datasets)

# Save columns
original_columns = dataset.column_names

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

# Format dataset
dataset = dataset.map(
    chatml_format,
    remove_columns=original_columns
)
```

LoRA, model, and training settings:

```python
# LoRA configuration
peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['k_proj', 'gate_proj', 'v_proj', 'up_proj', 'q_proj', 'o_proj', 'down_proj']
)

# Model to fine-tune
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    load_in_4bit=True
)
model.config.use_cache = False

# Reference model
ref_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    load_in_4bit=True
)

# Training arguments
training_args = TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    max_steps=1000,
    save_strategy="no",
    logging_steps=1,
    output_dir=new_model,
    optim="paged_adamw_32bit",
    warmup_steps=100,
    bf16=True,
    report_to="wandb",
)

# Create DPO trainer
dpo_trainer = DPOTrainer(
    model,
    ref_model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
    beta=0.1,
    max_prompt_length=2048,
    max_length=4096,
    force_use_ref_model=True
)

# Fine-tune model with DPO
dpo_trainer.train()
```