I’m excited to introduce my third-generation model:

# Qwen2.5-14B-1M-YOYO-V3

This time, I’m not only releasing the model but also sharing some model merging techniques, which might be even more valuable than the model itself.

Let’s start by looking at the initial merge configuration (YAML):
```yaml
merge_method: model_stock
base_model: Qwen/Qwen2.5-14B
models:
  - model: Qwen/Qwen2.5-14B-Instruct
  - model: Qwen/Qwen2.5-14B-Instruct-1M
dtype: bfloat16
```
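
As background on the method itself: `model_stock` interpolates between the base model and the average of the fine-tuned models, picking the interpolation ratio from the geometry (the angle) of the fine-tuned models' weight deltas. Below is a toy per-tensor sketch of the published rule; this is my own illustration, and mergekit's actual implementation may differ in detail:

```python
# Toy per-tensor sketch of model_stock (Model Stock, Jang et al. 2024):
# interpolate between the base weights and the average of the fine-tuned
# weights, with a ratio derived from the angle between the deltas.
# Illustration only -- not mergekit's exact code.
import torch
import torch.nn.functional as F

def model_stock_tensor(base: torch.Tensor,
                       tuned: list[torch.Tensor]) -> torch.Tensor:
    k = len(tuned)
    deltas = [(t - base).flatten() for t in tuned]
    # Average pairwise cosine similarity between fine-tuned deltas,
    # clamped to [0, 1]: the rule assumes the fine-tunes point in
    # broadly the same direction away from the base.
    cos = torch.stack([
        F.cosine_similarity(deltas[i], deltas[j], dim=0)
        for i in range(k) for j in range(i + 1, k)
    ]).mean().clamp(0.0, 1.0)
    # Interpolation ratio from the paper: t = k*cos / (1 + (k-1)*cos).
    t = k * cos / (1 + (k - 1) * cos)
    w_avg = torch.stack(tuned).mean(dim=0)
    return t * w_avg + (1 - t) * base

base = torch.randn(64, 64)
tuned = [base + 0.1 * torch.randn(64, 64) for _ in range(2)]
merged = model_stock_tensor(base, tuned)
```
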
Seems straightforward, right? But the merged model occasionally suffered from **uncontrollable outputs**, likely due to the large divergence between the instruction-tuned models and the base model.
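
It helps to make "divergence" measurable. The sketch below is my own addition (not part of the original recipe): it compares a tuned checkpoint against the base via the mean relative L2 distance over their shared weight tensors, so you can rank candidate models by how far they sit from the base.

```python
# Rough sketch: quantify parameter-space "divergence" between a tuned
# checkpoint and its base. Loading two 14B models takes a lot of RAM;
# the same idea works on any smaller pair of checkpoints.
import torch
from transformers import AutoModelForCausalLM

def state_dict_of(name: str):
    model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)
    return model.state_dict()

base = state_dict_of("Qwen/Qwen2.5-14B")
tuned = state_dict_of("Qwen/Qwen2.5-14B-Instruct")

# Relative L2 distance per tensor, averaged over tensors both models share.
dists = []
for key, b in base.items():
    t = tuned.get(key)
    if t is None or t.shape != b.shape:
        continue
    b32, t32 = b.float(), t.float()
    dists.append(((t32 - b32).norm() / (b32.norm() + 1e-8)).item())

print(f"mean relative L2 divergence: {sum(dists) / len(dists):.4f}")
```
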
To address this, I first tried integrating a fine-tuned model with smaller divergence from the base model, like **Virtuoso-Small-v2**.

This gave rise to [Qwen2.5-14B-YOYO-latest-V2](https://huggingface.co/YOYO-AI/Qwen2.5-14B-YOYO-latest-V2):
```yaml
merge_method: model_stock
base_model: Qwen/Qwen2.5-14B
models:
  - model: Qwen/Qwen2.5-14B-Instruct
  - model: Qwen/Qwen2.5-14B-Instruct-1M
  - model: arcee-ai/Virtuoso-Small-v2
dtype: bfloat16
name: Qwen2.5-14B-YOYO-latest-V2
```

This reduced runaway outputs but still left the model unstable.

Through experimentation, I found that merging **"high-divergence"** models into **"low-divergence"** models (those close to the base) using the `della` method produced more stable and better-performing results.
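
For intuition about what `della` is doing: it works on the delta between each model and the base, randomly drops delta elements with a keep probability tied to their magnitude, rescales the survivors so the result stays unbiased, and adds the merged delta back scaled by `lambda`. Here is a toy single-tensor sketch; it is my simplification for intuition, not mergekit's exact algorithm:

```python
# Toy single-tensor sketch of della-style merging: magnitude-aware
# random dropping of delta parameters, rescaling of survivors, then
# lambda-scaled addition. Simplified -- not mergekit's exact algorithm.
import torch

def della_tensor(base: torch.Tensor, tuned: torch.Tensor,
                 density: float = 0.7, lam: float = 0.9,
                 eps: float = 0.1) -> torch.Tensor:
    delta = tuned - base
    if density >= 1.0:
        # density: 1 disables dropping entirely, so only the lambda
        # scaling remains -- the regime all the configs below run in.
        return base + lam * delta
    flat = delta.flatten()
    # Rank deltas by magnitude: larger deltas get a higher keep
    # probability, centred on `density` with spread `eps`.
    ranks = flat.abs().argsort().argsort().float() / (flat.numel() - 1)
    keep = (density - eps / 2 + eps * ranks).clamp(0.01, 1.0)
    mask = torch.bernoulli(keep)
    # Rescale survivors so the sparsified delta is unbiased in expectation.
    sparse = torch.where(mask.bool(), flat / keep, torch.zeros_like(flat))
    return base + lam * sparse.reshape(delta.shape)

base = torch.randn(256, 256)
tuned = base + 0.05 * torch.randn(256, 256)
merged = della_tensor(base, tuned)
```

Note that every `della` config below uses `density: 1` and `weight: 1`, so no dropping actually happens; in that regime the merge effectively grafts the high-divergence model's delta onto the low-divergence base, scaled by `lambda: 0.9`.
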
## Key models used:

*1. Low-divergence, high-performance models:*

- Virtuoso-Small-v2
- Blossom-V6-14B

*2. High-divergence, instruction-focused models:*

- Qwen2.5-14B-Instruct
- Qwen2.5-14B-Instruct-1M

## DELLA Merge Configuration:
```yaml
models:
  - model: Qwen/Qwen2.5-14B-Instruct
    parameters:
      density: 1
      weight: 1
      lambda: 0.9
merge_method: della
base_model: arcee-ai/Virtuoso-Small-v2
parameters:
  density: 1
  weight: 1
  lambda: 0.9
  normalize: true
  int8_mask: true
dtype: bfloat16
tokenizer_source: base
name: Qwen2.5-14B-YOYO-della1
```
```yaml
models:
  - model: Qwen/Qwen2.5-14B-Instruct-1M
    parameters:
      density: 1
      weight: 1
      lambda: 0.9
merge_method: della
base_model: arcee-ai/Virtuoso-Small-v2
parameters:
  density: 1
  weight: 1
  lambda: 0.9
  normalize: true
  int8_mask: true
dtype: bfloat16
tokenizer_source: base
name: Qwen2.5-14B-YOYO-della2
```
```yaml
models:
  - model: Qwen/Qwen2.5-14B-Instruct
    parameters:
      density: 1
      weight: 1
      lambda: 0.9
merge_method: della
base_model: Azure99/Blossom-V6-14B
parameters:
  density: 1
  weight: 1
  lambda: 0.9
  normalize: true
  int8_mask: true
dtype: bfloat16
tokenizer_source: base
name: Qwen2.5-14B-YOYO-della3
```
```yaml
models:
  - model: Qwen/Qwen2.5-14B-Instruct-1M
    parameters:
      density: 1
      weight: 1
      lambda: 0.9
merge_method: della
base_model: Azure99/Blossom-V6-14B
parameters:
  density: 1
  weight: 1
  lambda: 0.9
  normalize: true
  int8_mask: true
dtype: bfloat16
tokenizer_source: base
name: Qwen2.5-14B-YOYO-della4
```

This approach yielded four variants:
- `Qwen2.5-14B-YOYO-della1`
- `Qwen2.5-14B-YOYO-della2`
- `Qwen2.5-14B-YOYO-della3`
- `Qwen2.5-14B-YOYO-della4`

## Base Model:
To enhance the base model's roleplay and creative-writing capabilities, I applied the same strategy:
```yaml
models:
  - model: EVA-UNIT-01/EVA-Qwen2.5-14B-v0.2
    parameters:
      density: 1
      weight: 1
      lambda: 0.9
merge_method: della
base_model: Qwen/Qwen2.5-14B
parameters:
  density: 1
  weight: 1
  lambda: 0.9
  normalize: true
  int8_mask: true
dtype: bfloat16
tokenizer_source: base
name: EVA-Qwen2.5-14B-base
```

Next, I extended the context length using the SCE method:
```yaml
merge_method: sce
models:
  - model: EVA-Qwen2.5-14B-base
base_model: Qwen/Qwen2.5-14B-Instruct-1M
parameters:
  select_topk: 1
dtype: bfloat16
tokenizer_source: base
normalize: true
int8_mask: true
name: Qwen2.5-14B-pro
```

## Final Merge Step:
```yaml
merge_method: model_stock
base_model: Qwen2.5-14B-pro
models:
  - model: Qwen2.5-14B-YOYO-della1
  - model: Qwen2.5-14B-YOYO-della2
  - model: Qwen2.5-14B-YOYO-della3
  - model: Qwen2.5-14B-YOYO-della4
dtype: bfloat16
tokenizer_source: base
int8_mask: true
normalize: true
name: Qwen2.5-14B-1M-YOYO-V3
```
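
If you want to reproduce any stage of this pipeline, each YAML block above is a standalone mergekit config; the multi-stage recipe simply feeds one stage's output directory in as a model or base path for the next. Below is a minimal sketch using mergekit's Python API, with hypothetical file and output names; option names can vary between mergekit versions, and the `mergekit-yaml` CLI is the equivalent one-liner:

```python
# Minimal sketch of running one of the configs above with mergekit.
# Assumes mergekit's documented API (MergeConfiguration, MergeOptions,
# run_merge); check the current docs if option names have changed.
import yaml
from mergekit.config import MergeConfiguration
from mergekit.merge import MergeOptions, run_merge

# Hypothetical filename: save whichever YAML block you want to run.
with open("merge-config.yaml", encoding="utf-8") as fp:
    config = MergeConfiguration.model_validate(yaml.safe_load(fp))

run_merge(
    config,
    out_path="./merged-model",  # hypothetical output directory
    options=MergeOptions(
        cuda=True,            # merge on GPU if one is available
        copy_tokenizer=True,  # copy the tokenizer into the output
        lazy_unpickle=True,   # reduce peak memory while reading shards
    ),
)
```
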
Feel free to adapt these strategies for your own merging experiments! 🚀