andreaskoepf committed
Commit 5997675
1 Parent(s): 7d15677

Update README.md

Files changed (1): README.md (+133, -24)

README.md CHANGED
---
language:
- en
datasets:
- OpenAssistant/oasst1
- ehartford/dolphin
- rombodawg/LosslessMegaCodeTrainingV2_1m_Evol_Uncensored
- argilla/databricks-dolly-15k-curated-multilingual
library_name: transformers
pipeline_tag: text-generation
tags:
- sft
---
# Open-Assistant Llama2 70B SFT v10

This model is an Open-Assistant fine-tuning of Meta's [Llama2 70B](https://huggingface.co/meta-llama/Llama-2-70b) LLM.
The model was fine-tuned in two stages: first on a mix of synthetic instruction and coding-task data, and then in a second "finishing" stage
on top-1 human Open-Assistant demonstrations exported on July 23, 2023 (see the Configuration Details section below).

## Model Details

- **Finetuned from:** [meta-llama/Llama-2-70b](https://huggingface.co/meta-llama/Llama-2-70b) via [epfLLM/old-Megatron-LM](https://github.com/epfLLM/old-Megatron-LM)
- **Model type:** Causal decoder-only transformer language model
- **Language:** English (with limited capabilities in German, Spanish, French, Italian, Portuguese, Polish, Dutch, Romanian, Czech, Swedish)
- **Weights & Biases training logs:** [Stage 1](https://wandb.ai/open-assistant/public-sft/runs/run45_oasst_pre10_llama2_70b) (1 epoch pretrain-mix, 12k steps), [Stage 2](https://wandb.ai/open-assistant/public-sft/runs/run46_oasst_sft10_llama2_70b) (3 epochs oasst top-1, 519 steps)
- **Demo:** [Continuations for 250 random prompts (TGI, 4bit nf4 quantization)](https://open-assistant.github.io/oasst-model-eval/?f=https%3A%2F%2Fraw.githubusercontent.com%2FOpen-Assistant%2Foasst-model-eval%2Fmain%2Fsampling_reports%2Foasst-sft%2F2023-08-22_OpenAssistant_llama2-70b-oasst-sft-v10_sampling_noprefix2_nf4.json%0A) - see also the 4-bit loading sketch below
- **Evaluation:** [FastEval-OpenAssistant Overview](https://tju01.github.io/FastEval-OpenAssistant/) (using [FastEval](https://github.com/FastEval/FastEval) & [vLLM](https://github.com/vllm-project/vllm))
- **License:** [LLAMA 2 COMMUNITY LICENSE AGREEMENT](https://huggingface.co/meta-llama/Llama-2-70b/raw/main/LICENSE.txt)
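
The demo linked above was served with 4-bit nf4 quantization. The following is a minimal, illustrative sketch (not part of the original training or inference setup) of loading the checkpoint the same way with `transformers` and `bitsandbytes`; the repository id is inferred from the demo link and the memory figure is an estimate.

```python
# Illustrative sketch: assumes a recent transformers release with bitsandbytes installed
# and roughly 40 GB of free GPU memory for the 4-bit 70B weights.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "OpenAssistant/llama2-70b-oasst-sft-v10"  # inferred from the demo URL

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4 bit on load
    bnb_4bit_quant_type="nf4",              # nf4 data type, as used for the demo
    bnb_4bit_compute_dtype=torch.bfloat16,  # bf16 compute, matching the training precision
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # shard layers across the available GPUs
)
```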
 
## Prompting

The prompt format is chat-style: each message in the conversation is terminated with an `<|im_end|>` token (chatml-style markers).
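
As a hedged illustration (not taken from the model card), the sketch below builds such a prompt by hand, assuming the usual `<|im_start|>`/`<|im_end|>` chatml layout and a generic system message, and reuses the `model` and `tokenizer` from the loading sketch above.

```python
# Hypothetical example prompt; the system and user texts are placeholders.
system_message = "You are a helpful, honest assistant."
user_message = "Give me three tips for writing clear documentation."

prompt = (
    f"<|im_start|>system\n{system_message}<|im_end|>\n"
    f"<|im_start|>user\n{user_message}<|im_end|>\n"
    f"<|im_start|>assistant\n"  # leave the assistant turn open for generation
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# Strip the prompt tokens and decode only the newly generated continuation.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```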
 
## Credits & Special Thanks

- Compute was generously sponsored by the EPFL [Machine Learning and Optimization Laboratory](https://www.epfl.ch/labs/mlo/).
- The open-source [epfLLM/Megatron-LLM](https://github.com/epfLLM/Megatron-LLM) trainer was used for fine-tuning.
- [rombodawg](https://huggingface.co/rombodawg) curated the [LosslessMegaCodeTrainingV2_1m_Evol_Uncensored](https://huggingface.co/datasets/rombodawg/LosslessMegaCodeTrainingV2_1m_Evol_Uncensored) dataset.
- [ehartford](https://huggingface.co/ehartford) generated and published the [ehartford/dolphin](https://huggingface.co/datasets/ehartford/dolphin) and [ehartford/oa_leet10k](https://huggingface.co/datasets/ehartford/oa_leet10k) datasets.
- [Argilla](https://huggingface.co/argilla) curated and published the [argilla/databricks-dolly-15k-curated-multilingual](https://huggingface.co/datasets/argilla/databricks-dolly-15k-curated-multilingual) dataset.
- [shahules786](https://github.com/shahules786) de-duplicated and filtered the Dolphin dataset with a cluster-center approach and generated the orca-best (orca-chat) dataset.
- [andreaskoepf](https://github.com/andreaskoepf/) prepared & orchestrated the training.

We especially want to thank everyone who contributed to the crowd-sourced Open-Assistant dataset creation on https://open-assistant.io/ - without you this project would not have been possible.

## Ethical Considerations and Limitations

Testing conducted to date has been in English, and has not covered, nor could it cover, all scenarios.
For these reasons, as with all LLMs, the potential outputs of llama2-70b-oasst-sft-v10 cannot be predicted
in advance, and the model may in some instances produce inaccurate, biased, or otherwise objectionable responses
to user prompts. Therefore, before deploying any applications of llama2-70b-oasst-sft-v10, developers should
perform safety testing and tuning tailored to their specific applications of the model.

Please see Meta's [Responsible Use Guide](https://ai.meta.com/llama/responsible-use-guide/).

## Configuration Details

The "pretokenizer" utility used to tokenize the data mix is part of the Open-Assistant GitHub repository and can be found here: [model/pretokenizer](https://github.com/LAION-AI/Open-Assistant/tree/main/model/pretokenizer).

### Stage 1 Pretokenizer Configuration

Entries of the dataset with assistant replies shorter than 25 tokens were excluded from training.

```
oasst_pre10_min25:
  datasets:
    [...]
  min_assistant_tokens: 25
```
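
For illustration only (this is not the Open-Assistant pretokenizer code), a minimal sketch of the effect of `min_assistant_tokens: 25`: samples whose assistant reply encodes to fewer than 25 tokens with the Llama 2 SentencePiece tokenizer are dropped.

```python
# Conceptual sketch of the min_assistant_tokens filter; not the actual pretokenizer.
from transformers import AutoTokenizer

# Assumption: the Llama 2 tokenizer (HF-format, gated repo) is used, as in the training setup.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")

def keep_sample(assistant_reply: str, min_assistant_tokens: int = 25) -> bool:
    """Keep a sample only if its assistant reply is at least `min_assistant_tokens` long."""
    n_tokens = len(tokenizer.encode(assistant_reply, add_special_tokens=False))
    return n_tokens >= min_assistant_tokens

samples = [
    {"user": "Hi!", "assistant": "Hello!"},  # far below 25 tokens -> dropped
    {"user": "What does --use_rms_norm do?", "assistant": (
        "It enables RMSNorm, a LayerNorm variant that rescales activations by their "
        "root mean square instead of subtracting the mean, which is cheaper to compute."
    )},  # long enough -> kept
]
kept = [s for s in samples if keep_sample(s["assistant"])]
print(f"kept {len(kept)} of {len(samples)} samples")
```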
 
Stage 1 dataset statistics:

```
# Stats for output/oasst_pre10_min25_llama2

## Stats for 'Subset of InstructionDataset (megacode2)' (466364 samples (50.0%))
-----------------
Accepted: 398223/466364 (85.4%)
Accepted tokens: 167676873
Skipped: 68141 (14.6%)
Min tokens per sample: 36
Max tokens per sample: 11810
Avg tokens per sample: 421.063
-----------------

## Stats for 'Subset of OrcaChat (orca-chat)' (325616 samples (100.0%))
-----------------
Accepted: 325616/325616 (100.0%)
Accepted tokens: 178307574
Skipped: 0 (0.0%)
Min tokens per sample: 105
Max tokens per sample: 10408
Avg tokens per sample: 547.601
-----------------

## Stats for 'Subset of Dolly15kMultilingual' (57020 samples (100.0%))
-----------------
Accepted: 47494/57020 (83.3%)
Accepted tokens: 13883177
Skipped: 9526 (16.7%)
Min tokens per sample: 34
Max tokens per sample: 9172
Avg tokens per sample: 292.314
-----------------

## Stats for 'Subset of InstructionDataset (oa_leet10k)' (22236 samples (100.0%))
-----------------
Accepted: 22236/22236 (100.0%)
Accepted tokens: 15905296
Skipped: 0 (0.0%)
Min tokens per sample: 168
Max tokens per sample: 10588
Avg tokens per sample: 715.295
-----------------

## Stats for 'total' (871236 samples (100.0%))
-----------------
Accepted: 793569/871236 (91.1%)
Accepted tokens: 375772920
Skipped: 77667 (8.9%)
Min tokens per sample: 34
Max tokens per sample: 11810
Avg tokens per sample: 473.523
-----------------
```

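The totals above are self-consistent; a minimal check using only the numbers from the dump:

```python
# Derived only from the 'total' row of the Stage 1 statistics above.
accepted_samples = 793_569
total_samples = 871_236
accepted_tokens = 375_772_920

acceptance_rate = accepted_samples / total_samples   # ~0.911 -> 91.1%
avg_tokens = accepted_tokens / accepted_samples      # ~473.523 tokens per accepted sample
print(f"acceptance rate: {acceptance_rate:.1%}, avg tokens/sample: {avg_tokens:.3f}")
```
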
### Stage 2 Pretokenizer Configuration

```
oasst_top1:
  [...]
  filename_prefix: "oasst_top1"
```
 
Stage 2 dataset statistics:

```
# Stats for output/oasst_top1_2023-07-23_llama2

## Stats for 'ListDataset' (11441 samples (100.0%))
-----------------
Accepted: 11441/11441 (100.0%)
Accepted tokens: 5315368
Skipped: 0 (0.0%)
Min tokens per sample: 20
Max tokens per sample: 5407
Avg tokens per sample: 464.58945896337735
-----------------

## Stats for 'total' (11441 samples (100.0%))
-----------------
Accepted: 11441/11441 (100.0%)
Accepted tokens: 5315368
Skipped: 0 (0.0%)
Min tokens per sample: 20
Max tokens per sample: 5407
Avg tokens per sample: 464.58945896337735
-----------------
```

### Megatron Fine-Tuning Arguments for Stage 1 (Instruction Tuning):

```
--tensor_model_parallel_size 8
--pipeline_model_parallel_size 4
--load ./checkpoints/llama2-70b-tp8-pp4
--save ./checkpoints/llama2-70b-tp8-pp4-oasst_pre10
--tensorboard_dir ./checkpoints/llama2-70b-tp8-pp4-oasst_pre10/logging
--data_path ./data/oasst_pre10_min25_llama2/oasst_sft10-train
--model_name llama2
--tokenizer_type SentencePieceTokenizer
--bf16
--global_batch_size 64
--micro_batch_size 2
--vocab_file=./llama2/Llama-2-7b/tokenizer.model
--use_rms_norm
--glu_activation swiglu
--no_tie_embed_logits
[...]
```
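
A short, hedged aside on the parallelism arithmetic implied by these arguments: tensor parallelism 8 × pipeline parallelism 4 means 32 GPUs hold one model replica; the data-parallel degree and gradient-accumulation steps depend on the total GPU count, which is not stated in the card, so the value below is an assumption.

```python
# Parallelism arithmetic for the Stage 1 arguments; total_gpus is a hypothetical value.
tensor_parallel = 8
pipeline_parallel = 4
gpus_per_replica = tensor_parallel * pipeline_parallel  # 32 GPUs per model replica

total_gpus = 64                                          # assumption, not from the card
data_parallel = total_gpus // gpus_per_replica           # 2 replicas under that assumption

global_batch_size = 64
micro_batch_size = 2
grad_accum_steps = global_batch_size // (micro_batch_size * data_parallel)  # 16 micro-steps per optimizer step
print(gpus_per_replica, data_parallel, grad_accum_steps)
```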
 
### Megatron Fine-Tuning Arguments for Stage 2 (OASST Top-1 Finishing):

```
--tensor_model_parallel_size 8
--pipeline_model_parallel_size 4
--load ./checkpoints/llama2-70b-tp8-pp4-oasst_pre10
--save ./checkpoints/llama2-70b-tp8-pp4-oasst_sft10
--tensorboard_dir ./checkpoints/llama2-70b-tp8-pp4-oasst_sft10/logging
--data_path ./data/oasst_top1_2023-07-23_llama2/oasst_top1-train
--model_name llama2
--tokenizer_type SentencePieceTokenizer
--bf16
--global_batch_size 64
--micro_batch_size 2
--vocab_file=./llama2/Llama-2-7b/tokenizer.model
--use_rms_norm
--glu_activation swiglu
--no_tie_embed_logits
[...]
--rope_scaling_factor 1.0
--finetune
--wandb_logger
```