Text Generation
Safetensors
English
qwen2
davidhornshaw committed on
Commit
4c65cb9
1 Parent(s): 5ff8e8c

Nearly finalised modelcard.

Files changed (1)
  1. README.md +61 -174
README.md CHANGED
@@ -1,109 +1,84 @@
1
  ---
2
- library_name: transformers
3
- tags: []
4
  ---
5
 
6
- # Model Card for Model ID
7
-
8
- <!-- Provide a quick summary of what the model is/does. -->
9
-
10
11
 
12
  ## Model Details
13
 
14
  ### Model Description
15
 
16
- <!-- Provide a longer summary of what this model is. -->
17
 
18
- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
19
 
20
- - **Developed by:** [More Information Needed]
21
- - **Funded by [optional]:** [More Information Needed]
22
- - **Shared by [optional]:** [More Information Needed]
23
- - **Model type:** [More Information Needed]
24
- - **Language(s) (NLP):** [More Information Needed]
25
- - **License:** [More Information Needed]
26
- - **Finetuned from model [optional]:** [More Information Needed]
27
 
28
- ### Model Sources [optional]
29
 
30
- <!-- Provide the basic links for the model. -->
31
 
32
- - **Repository:** [More Information Needed]
33
- - **Paper [optional]:** [More Information Needed]
34
- - **Demo [optional]:** [More Information Needed]
35
 
36
  ## Uses
37
 
38
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
 
 
39
 
40
  ### Direct Use
41
 
42
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
43
-
44
- [More Information Needed]
45
-
46
- ### Downstream Use [optional]
47
-
48
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
49
-
50
- [More Information Needed]
51
-
52
- ### Out-of-Scope Use
53
-
54
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
55
-
56
- [More Information Needed]
57
-
58
- ## Bias, Risks, and Limitations
59
 
60
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
61
 
62
- [More Information Needed]
63
 
64
  ### Recommendations
65
 
66
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
67
-
68
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
69
-
70
- ## How to Get Started with the Model
71
-
72
- Use the code below to get started with the model.
73
-
74
- [More Information Needed]
75
 
76
  ## Training Details
77
 
78
  ### Training Data
79
 
80
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
81
-
82
- [More Information Needed]
83
 
84
  ### Training Procedure
85
 
86
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
87
-
88
- #### Preprocessing [optional]
89
-
90
- [More Information Needed]
91
-
92
-
93
- #### Training Hyperparameters
94
 
95
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
96
 
97
- #### Speeds, Sizes, Times [optional]
98
-
99
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
100
-
101
- [More Information Needed]
102
 
103
  ## Evaluation
104
 
105
- We evaluate base and finetuned models on four general benchmarks and one usecase specific one. We work with an Eleuther test harness.
106
- Our usecase is information extraction of heterogeneous data from government documents. Here, we test on documents of the U.S. [FDA](https://en.wikipedia.org/wiki/Food_and_Drug_Administration).
107
 
108
  Benchmarks used:
109
 
@@ -117,10 +92,21 @@ Benchmarks used:
117
 
118
  3. USECASE: Logical and numerical reasoning.
119
  3.1 Arithmetic
120
- 3.2 ASDiv.
121
 
122
  Evaluation results:
123
 
 
 
124
  <figure>
125
 
126
  | Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
@@ -150,14 +136,15 @@ Evaluation results:
150
  |arithmetic_3ds| 1 | none | 0 |acc |↑ | 0.0055|± | 0.0017|
151
  |arithmetic_4da| 1 | none | 0 |acc |↑ | 0.0675|± | 0.0056|
152
  |arithmetic_4ds| 1 | none | 0 |acc |↑ | 0.0010|± | 0.0007|
153
- |arithmetic_5da| 1 | none | 0 |acc |↑ | 0.3720|± | 0.0108|
154
- |arithmetic_5ds| 1 | none | 0 |acc |↑ | 0.0260|± | 0.0036|
155
  |--------------|--------:|--------|-------:|--------|---|------:|---|-------:|
156
  |asdiv | 1 | none | 0 |acc |↑ | 0.0187|± | 0.0028|
157
  <figcaption>Collected USECASE benchmarks results for the base model.</figcaption>
158
 
159
  </figure>
160
 
 
161
 
162
  <figure>
163
 
@@ -188,110 +175,10 @@ Evaluation results:
188
  |arithmetic_3ds| 1 | none | 0 |acc |↑ | 0.0055|± | 0.0017|
189
  |arithmetic_4da| 1 | none | 0 |acc |↑ | 0.0710|± | 0.0057|
190
  |arithmetic_4ds| 1 | none | 0 |acc |↑ | 0.0005|± | 0.0005|
191
- |arithmetic_5da| 1 | none | 0 |acc |↑ | 0.4005|± | 0.0110|
192
- |arithmetic_5ds| 1 | none | 0 |acc |↑ | 0.0285|± | 0.0037|
193
  |--------------|--------:|--------|-------:|--------|---|------:|---|-------:|
194
  |asdiv | 1 | none | 0 |acc |↑ | 0.0204|± | 0.0029|
195
  <figcaption>Collected USECASE benchmarks results for the finetuned model.</figcaption>
196
 
197
- </figure>
198
-
199
- |----------|--------:|--------|-------:|--------|---|------:|---|-------:|
200
- |asdiv | 1 | none | 0 |acc |↑ | 0.0204|± | 0.0029|
201
-
202
-
203
- 1.2 Finetuned model
204
-
205
- ### Testing Data, Factors & Metrics
206
-
207
- #### Testing Data
208
-
209
- <!-- This should link to a Dataset Card if possible. -->
210
-
211
- [More Information Needed]
212
-
213
- #### Factors
214
-
215
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
216
-
217
- [More Information Needed]
218
-
219
- #### Metrics
220
-
221
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
222
-
223
- [More Information Needed]
224
-
225
- ### Results
226
-
227
- [More Information Needed]
228
-
229
- #### Summary
230
-
231
-
232
-
233
- ## Model Examination [optional]
234
-
235
- <!-- Relevant interpretability work for the model goes here -->
236
-
237
- [More Information Needed]
238
-
239
- ## Environmental Impact
240
-
241
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
242
-
243
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
244
-
245
- - **Hardware Type:** [More Information Needed]
246
- - **Hours used:** [More Information Needed]
247
- - **Cloud Provider:** [More Information Needed]
248
- - **Compute Region:** [More Information Needed]
249
- - **Carbon Emitted:** [More Information Needed]
250
-
251
- ## Technical Specifications [optional]
252
-
253
- ### Model Architecture and Objective
254
-
255
- [More Information Needed]
256
-
257
- ### Compute Infrastructure
258
-
259
- [More Information Needed]
260
-
261
- #### Hardware
262
-
263
- [More Information Needed]
264
-
265
- #### Software
266
-
267
- [More Information Needed]
268
-
269
- ## Citation [optional]
270
-
271
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
272
-
273
- **BibTeX:**
274
-
275
- [More Information Needed]
276
-
277
- **APA:**
278
-
279
- [More Information Needed]
280
-
281
- ## Glossary [optional]
282
-
283
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
284
-
285
- [More Information Needed]
286
-
287
- ## More Information [optional]
288
-
289
- [More Information Needed]
290
-
291
- ## Model Card Authors [optional]
292
-
293
- [More Information Needed]
294
-
295
- ## Model Card Contact
296
-
297
- [More Information Needed]
 
1
  ---
2
+ license: other
3
+ license_name: qwen-research
4
+ license_link: https://huggingface.co/Qwen/Qwen2.5-3B/blob/main/LICENSE
5
+ language:
6
+ - en
7
+ pipeline_tag: text-generation
8
+ datasets:
9
+ - mlabonne/orpo-dpo-mix-40k
10
+ base_model:
11
+ - Qwen/Qwen2.5-3B
12
  ---
13
 
14
+ # Qwen2.5-3B-ORPO
 
 
 
15
 
16
+ This model is a finetuned version of the [Qwen2.5-3B base model](https://huggingface.co/Qwen/Qwen2.5-3B) by Qwen, trained with ORPO on the [ORPO-DPO-mix dataset](https://huggingface.co/datasets/mlabonne/orpo-dpo-mix-40k) by M. Labonne.
17
+ We evaluate the model with several benchmarks using the EleutherAI evaluation harness. Besides checking that general performance does not degrade, we apply benchmarks that test for improved logical and numerical reasoning.
+ The results are inconclusive on most, but not all, metrics. This shows promise for further improvement on more tailored datasets or with additional DPO preference training.
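A minimal sketch of loading the model for plain text generation with 🤗 Transformers. The repository id below is a placeholder assumption, not a confirmed id for this model; the call pattern is standard `AutoModelForCausalLM` usage.

```python
# Minimal text-generation sketch with Hugging Face Transformers.
# NOTE: the repo id is a placeholder; replace it with this model's actual repo id.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "davidhornshaw/Qwen2.5-3B-ORPO"  # placeholder repo id (assumption)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# This is a base (non-instruct) LM, so we prompt it with plain text.
inputs = tokenizer("41268 + 50833 =", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```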
19
 
20
  ## Model Details
21
 
22
  ### Model Description
23
 
24
+ Qwen2.5 is the latest series of Qwen large language models. We finetune the 3B model with the following specifications:
25
 
26
+ - **Developed by:** Qwen (base model)
27
+ - **Language(s) (NLP):** English
28
+ - **Finetuned from model:** Qwen2.5-3B
29
+ - **Model type:** Causal LM
30
+ - **Architecture:** Transformers with RoPE, SwiGLU, RMSNorm, Attention QKV bias and tied word embeddings
31
+ - **Number of Parameters:** 3.09B
32
+ - **Number of Parameters (Non-Embedding):** 2.77B
33
+ - **Number of Layers:** 36
34
+ - **Number of Attention Heads (GQA):** 16 for Q and 2 for KV
35
+ - **Context Length:** Full 32,768 tokens
36
 
37
+ For additional details, we refer to the base model repository.
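For a programmatic cross-check of the specifications above, a small sketch using the base model's configuration; the field names assume the standard Qwen2 configuration class in `transformers`.

```python
# Sketch: inspect the Qwen2-style config to cross-check the specifications above.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen2.5-3B")
print(config.num_hidden_layers)        # number of layers (36 per the card)
print(config.num_attention_heads)      # query heads (16 per the card)
print(config.num_key_value_heads)      # KV heads under GQA (2 per the card)
print(config.max_position_embeddings)  # context length
print(config.tie_word_embeddings)      # tied word embeddings
```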
38
 
39
+ ### Model Sources
40
 
41
+ The Qwen2.5-3B base model can be found here:
42
 
43
+ - **Repository:** https://huggingface.co/Qwen/Qwen2.5-3B
 
 
44
 
45
  ## Uses
46
 
47
+ The finetuned model shows only a small performance increase over the base model in logical and numerical reasoning.
+ While better than the base model, we do not consider it sufficient for dependable logical and numerical reasoning at this stage.
+ However, we detect no performance decrease on common sense natural language reasoning.
50
 
51
  ### Direct Use
52
 
53
+ Common sense natural language reasoning.
54
 
55
+ ### Downstream Use
56
 
57
+ Logical and numerical reasoning.
58
 
59
  ### Recommendations
60
 
61
+ Additional finetuning on different datasets, as well as preference training.
62
 
63
  ## Training Details
64
 
65
  ### Training Data
66
 
67
+ We use the [ORPO-DPO-mix dataset](https://huggingface.co/datasets/mlabonne/orpo-dpo-mix-40k) by M. Labonne.
+ It is a dataset designed for ORPO or DPO training. See M. Labonne's article *Fine-tune Llama 3 with ORPO* for more information about how to use it.
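A short sketch of loading the dataset with 🤗 Datasets; the column names are taken from the dataset card and should be treated as assumptions here.

```python
# Sketch: load the ORPO-DPO-mix-40k preference dataset.
from datasets import load_dataset

dataset = load_dataset("mlabonne/orpo-dpo-mix-40k", split="train")
print(dataset)                # row count and column names
print(dataset[0]["chosen"])   # preferred response (assumed column name)
print(dataset[0]["rejected"]) # dispreferred response (assumed column name)
```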
 
69
 
70
  ### Training Procedure
71
 
72
+ We use the TRL [ORPO trainer](https://huggingface.co/docs/trl/main/en/orpo_trainer) for finetuning, together with [LoRA](https://arxiv.org/abs/2106.09685) for efficiency. A sketch of this setup is given below, after the hyperparameters.
73
 
74
+ ### Training Hyperparameters
75
 
76
+ - **Training regime:** fp16 non-mixed precision
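
A minimal sketch of this setup with TRL's `ORPOTrainer` and a PEFT LoRA adapter. Apart from the fp16 regime noted above, the hyperparameter values (LoRA rank, learning rate, batch size, ORPO beta) are illustrative assumptions rather than the exact values used.

```python
# Sketch of the ORPO + LoRA finetuning setup described above.
# Hyperparameters other than the fp16 regime are illustrative assumptions.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

model_id = "Qwen/Qwen2.5-3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)  # fp16 weights

# Conversational preference data; recent TRL versions apply the chat template themselves,
# older versions may require preprocessing into plain prompt/chosen/rejected strings.
dataset = load_dataset("mlabonne/orpo-dpo-mix-40k", split="train")

peft_config = LoraConfig(          # LoRA adapter for parameter-efficient finetuning
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

training_args = ORPOConfig(
    output_dir="qwen2.5-3b-orpo",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=8e-6,
    beta=0.1,             # weight of the odds-ratio (preference) term in the ORPO loss
    max_length=1024,
    logging_steps=50,
)

trainer = ORPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,    # `tokenizer=` in older TRL versions
    peft_config=peft_config,
)
trainer.train()
```

With this configuration only the LoRA adapter weights are updated, which is what provides the speed-up mentioned above.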
77
 
78
  ## Evaluation
79
 
80
+ We evaluate the base and finetuned models on four general benchmarks and two use-case-specific ones. We work with the EleutherAI evaluation harness.
+ Our use case is logical and numerical reasoning.
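A sketch of how such a run can be reproduced with the harness's Python API, assuming a recent lm-evaluation-harness that exposes `simple_evaluate`; the task names follow the tables below, and the full task list is abbreviated here.

```python
# Sketch: evaluate a model on the use-case benchmarks with EleutherAI's lm-evaluation-harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Qwen/Qwen2.5-3B,dtype=float16",
    tasks=["arithmetic_5da", "arithmetic_5ds", "asdiv"],  # subset of the tasks reported below
    num_fewshot=0,
)
print(results["results"])
```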
82
 
83
  Benchmarks used:
84
 
 
92
 
93
  3. USECASE: Logical and numerical reasoning.
94
  3.1 Arithmetic
95
+ 3.2 ASDiv
96
+
97
+ Summary of results:
98
+
99
+ Within standard error, there is no difference between the base and finetuned models on any general benchmark. This suggests finetuning caused no drop in performance on the chosen tasks.
+ The benchmarks for logical and numerical reasoning are more mixed. Ignoring standard error, the finetuned model generally outperforms the base model; however, the differences usually lie, often only just, within standard error.
+ The finetuned model *does* outperform the base model on **arithmetic_5da**, even under the most conservative reading of the standard error.
+ This is of interest, since it benchmarks a model's ability to add five-digit numbers: addition is *the* most fundamental arithmetic operation, and the five-digit variant is in effect the hardest of the addition benchmarks here.
+ Note that subtraction appears generally harder for both models, even though the finetuned model still performs better than the base model.
+ We highlight the rows for five-digit addition and subtraction in bold for easy comparison.
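As a quick check of the arithmetic_5da claim, one can compare the most conservative interval endpoints, using the values from the tables below.

```python
# Conservative check for arithmetic_5da: finetuned lower bound vs. base upper bound.
base_acc, base_err = 0.3720, 0.0108  # base model (table below)
ft_acc, ft_err = 0.4005, 0.0110      # finetuned model (table below)

print(ft_acc - ft_err > base_acc + base_err)  # True: 0.3895 > 0.3828
```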
105
 
106
  Evaluation results:
107
 
108
+ **BASE**
109
+
110
  <figure>
111
 
112
  | Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
 
136
  |arithmetic_3ds| 1 | none | 0 |acc |↑ | 0.0055|± | 0.0017|
137
  |arithmetic_4da| 1 | none | 0 |acc |↑ | 0.0675|± | 0.0056|
138
  |arithmetic_4ds| 1 | none | 0 |acc |↑ | 0.0010|± | 0.0007|
139
+ |**arithmetic_5da**| 1 | none | 0 |acc |↑ | **0.3720**|± | **0.0108**|
+ |**arithmetic_5ds**| 1 | none | 0 |acc |↑ | **0.0260**|± | **0.0036**|
141
  |--------------|--------:|--------|-------:|--------|---|------:|---|-------:|
142
  |asdiv | 1 | none | 0 |acc |↑ | 0.0187|± | 0.0028|
143
  <figcaption>Collected USECASE benchmarks results for the base model.</figcaption>
144
 
145
  </figure>
146
 
147
+ **FINETUNED**
148
 
149
  <figure>
150
 
 
175
  |arithmetic_3ds| 1 | none | 0 |acc |↑ | 0.0055|± | 0.0017|
176
  |arithmetic_4da| 1 | none | 0 |acc |↑ | 0.0710|± | 0.0057|
177
  |arithmetic_4ds| 1 | none | 0 |acc |↑ | 0.0005|± | 0.0005|
178
+ |**arithmetic_5da**| 1 | none | 0 |acc |↑ | **0.4005**|± | **0.0110**|
+ |**arithmetic_5ds**| 1 | none | 0 |acc |↑ | **0.0285**|± | **0.0037**|
180
  |--------------|--------:|--------|-------:|--------|---|------:|---|-------:|
181
  |asdiv | 1 | none | 0 |acc |↑ | 0.0204|± | 0.0029|
182
  <figcaption>Collected USECASE benchmarks results for the finetuned model.</figcaption>
183
 
184
+ </figure>