andreaskoepf committed
Commit 5997675 • 1 parent: 7d15677
Update README.md

README.md CHANGED
@@ -4,18 +4,26 @@ language:
 - en
 datasets:
 - OpenAssistant/oasst1
+- ehartford/dolphin
+- rombodawg/LosslessMegaCodeTrainingV2_1m_Evol_Uncensored
+- argilla/databricks-dolly-15k-curated-multilingual
+library_name: transformers
+pipeline_tag: text-generation
+tags:
+- sft
 ---
 # Open-Assistant Llama2 70B SFT v10
 
 This model is an Open-Assistant fine-tuning of Meta's [Llama2 70B](https://huggingface.co/meta-llama/Llama-2-70b) LLM.
-
+The model was fine-tuned in two stages: first on a mix of synthetic instructions and coding-task data, and then in a second "finishing" stage
+on top-1 human Open-Assistant demonstrations exported on July 23, 2023 (see the configuration details section below).
 
 ## Model Details
 
 - **Finetuned from:** [meta-llama/Llama-2-70b](https://huggingface.co/meta-llama/Llama-2-70b) via [epfLLM/old-Megatron-LM](https://github.com/epfLLM/old-Megatron-LM)
 - **Model type:** Causal decoder-only transformer language model
-- **Language:** English
-- **Weights & Biases:** [Stage 1](https://wandb.ai/open-assistant/public-sft/runs/run45_oasst_pre10_llama2_70b) (1 epoch pretrain-mix, 12k steps), [Stage 2](https://wandb.ai/open-assistant/public-sft/runs/run46_oasst_sft10_llama2_70b) (3 epochs oasst top-1, 519 steps)
+- **Language:** English (and limited capabilities in German, Spanish, French, Italian, Portuguese, Polish, Dutch, Romanian, Czech, Swedish)
+- **Weights & Biases training logs:** [Stage 1](https://wandb.ai/open-assistant/public-sft/runs/run45_oasst_pre10_llama2_70b) (1 epoch pretrain-mix, 12k steps), [Stage 2](https://wandb.ai/open-assistant/public-sft/runs/run46_oasst_sft10_llama2_70b) (3 epochs oasst top-1, 519 steps)
 - **Demo:** [Continuations for 250 random prompts (TGI, 4bit nf4 quantization)](https://open-assistant.github.io/oasst-model-eval/?f=https%3A%2F%2Fraw.githubusercontent.com%2FOpen-Assistant%2Foasst-model-eval%2Fmain%2Fsampling_reports%2Foasst-sft%2F2023-08-22_OpenAssistant_llama2-70b-oasst-sft-v10_sampling_noprefix2_nf4.json%0A)
 - **Evaluation:** [FastEval-OpenAssistant Overview](https://tju01.github.io/FastEval-OpenAssistant/) (using [FastEval](https://github.com/FastEval/FastEval) & [vLLM](https://github.com/vllm-project/vllm))
 - **License:** [LLAMA 2 COMMUNITY LICENSE AGREEMENT](https://huggingface.co/meta-llama/Llama-2-70b/raw/main/LICENSE.txt)
@@ -47,10 +55,38 @@ If a question does not make any sense, or is not factually coherent, explain why
 <|im_end|>
 ```
 
+### Credits & Special Thanks
+
+- Compute was generously sponsored by the EPFL [Machine Learning and Optimization Laboratory](https://www.epfl.ch/labs/mlo/).
+- The open-source [epfLLM/Megatron-LLM](https://github.com/epfLLM/Megatron-LLM) trainer was used for fine-tuning.
+- [rombodawg](https://huggingface.co/rombodawg) curated the [LosslessMegaCodeTrainingV2_1m_Evol_Uncensored](https://huggingface.co/datasets/rombodawg/LosslessMegaCodeTrainingV2_1m_Evol_Uncensored) dataset.
+- [ehartford](https://huggingface.co/ehartford) generated and published the [ehartford/dolphin](https://huggingface.co/datasets/ehartford/dolphin) and [ehartford/oa_leet10k](https://huggingface.co/datasets/ehartford/oa_leet10k) datasets.
+- [Argilla](https://huggingface.co/argilla) curated and published the [argilla/databricks-dolly-15k-curated-multilingual](https://huggingface.co/datasets/argilla/databricks-dolly-15k-curated-multilingual) dataset.
+- [shahules786](https://github.com/shahules786) de-duplicated and filtered the Dolphin dataset with a cluster-center approach and generated the orca-best (orca-chat) dataset.
+- [andreaskoepf](https://github.com/andreaskoepf/) prepared & orchestrated the training.
+
+We especially want to thank everyone who contributed to the crowd-sourced Open-Assistant dataset creation on https://open-assistant.io/ - without you this project would not have been possible.
+
+## Ethical Considerations and Limitations
+
+Testing conducted to date has been in English, and has not covered, nor could it cover all scenarios.
+For these reasons, as with all LLMs, the potential outputs of llama2-70b-oasst-sft-v10 cannot be predicted
+in advance, and the model may in some instances produce inaccurate, biased or other objectionable responses
+to user prompts. Therefore, before deploying any applications of llama2-70b-oasst-sft-v10, developers should
+perform safety testing and tuning tailored to their specific applications of the model.
+
+Please see Meta's [Responsible Use Guide](https://ai.meta.com/llama/responsible-use-guide/).
+
+
 ## Configuration Details
 
+The "pretokenizer" utility used to tokenize the datamix is part of the Open-Assistant GitHub repository and can be found here: [model/pretokenizer](https://github.com/LAION-AI/Open-Assistant/tree/main/model/pretokenizer).
+
+
 ### Stage 1 Pretokenizer Configuration
 
+Entries of the dataset with assistant replies shorter than 25 tokens were excluded from training.
+
 ```
 oasst_pre10_min25:
 datasets:
@@ -72,6 +108,62 @@ oasst_pre10_min25:
 min_assistant_tokens: 25
 ```
 
+Stage 1 dataset statistics:
+```
+# Stats for output/oasst_pre10_min25_llama2
+
+## Stats for 'Subset of InstructionDataset (megacode2)' (466364 samples (50.0%))
+-----------------
+Accepted: 398223/466364 (85.4%)
+Accepted tokens: 167676873
+Skipped: 68141 (14.6%)
+Min tokens per sample: 36
+Max tokens per sample: 11810
+Avg tokens per sample: 421.063
+-----------------
+
+## Stats for 'Subset of OrcaChat (orca-chat)' (325616 samples (100.0%))
+-----------------
+Accepted: 325616/325616 (100.0%)
+Accepted tokens: 178307574
+Skipped: 0 (0.0%)
+Min tokens per sample: 105
+Max tokens per sample: 10408
+Avg tokens per sample: 547.601
+-----------------
+
+## Stats for 'Subset of Dolly15kMultilingual' (57020 samples (100.0%))
+-----------------
+Accepted: 47494/57020 (83.3%)
+Accepted tokens: 13883177
+Skipped: 9526 (16.7%)
+Min tokens per sample: 34
+Max tokens per sample: 9172
+Avg tokens per sample: 292.314
+-----------------
+
+## Stats for 'Subset of InstructionDataset (oa_leet10k)' (22236 samples (100.0%))
+-----------------
+Accepted: 22236/22236 (100.0%)
+Accepted tokens: 15905296
+Skipped: 0 (0.0%)
+Min tokens per sample: 168
+Max tokens per sample: 10588
+Avg tokens per sample: 715.295
+-----------------
+
+## Stats for 'total' (871236 samples (100.0%))
+-----------------
+Accepted: 793569/871236 (91.1%)
+Accepted tokens: 375772920
+Skipped: 77667 (8.9%)
+Min tokens per sample: 34
+Max tokens per sample: 11810
+Avg tokens per sample: 473.523
+-----------------
+```
+
+
 ### Stage 2 Pretokenizer Configuration
 
 ```
@@ -86,20 +178,47 @@ oasst_top1:
 filename_prefix: "oasst_top1"
 ```
 
+Stage 2 dataset statistics:
+
+```
+# Stats for output/oasst_top1_2023-07-23_llama2
+
+## Stats for 'ListDataset' (11441 samples (100.0%))
+-----------------
+Accepted: 11441/11441 (100.0%)
+Accepted tokens: 5315368
+Skipped: 0 (0.0%)
+Min tokens per sample: 20
+Max tokens per sample: 5407
+Avg tokens per sample: 464.58945896337735
+-----------------
+
+## Stats for 'total' (11441 samples (100.0%))
+-----------------
+Accepted: 11441/11441 (100.0%)
+Accepted tokens: 5315368
+Skipped: 0 (0.0%)
+Min tokens per sample: 20
+Max tokens per sample: 5407
+Avg tokens per sample: 464.58945896337735
+-----------------
+```
+
+
 ### Megatron Fine-Tuning Arguments for Stage 1 (Instruction Tuning):
 ```
 --tensor_model_parallel_size 8
 --pipeline_model_parallel_size 4
---load ./
---save ./
---tensorboard_dir ./
---data_path ./
+--load ./checkpoints/llama2-70b-tp8-pp4
+--save ./checkpoints/llama2-70b-tp8-pp4-oasst_pre10
+--tensorboard_dir ./checkpoints/llama2-70b-tp8-pp4-oasst_pre10/logging
+--data_path ./data/oasst_pre10_min25_llama2/oasst_sft10-train
 --model_name llama2
 --tokenizer_type SentencePieceTokenizer
 --bf16
 --global_batch_size 64
 --micro_batch_size 2
---vocab_file=./
+--vocab_file=./llama2/Llama-2-7b/tokenizer.model
 --use_rms_norm
 --glu_activation swiglu
 --no_tie_embed_logits
@@ -138,16 +257,16 @@ oasst_top1:
 ```
 --tensor_model_parallel_size 8
 --pipeline_model_parallel_size 4
---load ./
---save ./
---tensorboard_dir ./
---data_path ./
+--load ./checkpoints/llama2-70b-tp8-pp4-oasst_pre10
+--save ./checkpoints/llama2-70b-tp8-pp4-oasst_sft10
+--tensorboard_dir ./checkpoints/llama2-70b-tp8-pp4-oasst_sft10/logging
+--data_path ./data/oasst_top1_2023-07-23_llama2/oasst_top1-train
 --model_name llama2
 --tokenizer_type SentencePieceTokenizer
 --bf16
 --global_batch_size 64
 --micro_batch_size 2
---vocab_file=./
+--vocab_file=./llama2/Llama-2-7b/tokenizer.model
 --use_rms_norm
 --glu_activation swiglu
 --no_tie_embed_logits
@@ -182,14 +301,4 @@ oasst_top1:
 --rope_scaling_factor 1.0
 --finetune
 --wandb_logger
-```
-
-
-## Ethical Considerations and Limitations
-
-Testing conducted to date has been in English, and has not covered, nor could it cover all scenarios.
-For these reasons, as with all LLMs, the potential outputs of llama2-70b-oasst-sft-v10 cannot be predicted
-in advance, and the model may in some instances produce inaccurate, biased or other objectionable responses
-to user prompts. Therefore, before deploying any applications of llama2-70b-oasst-sft-v10, developers should
-perform safety testing and tuning tailored to their specific applications of the model.
-
+```
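For orientation only (not part of the commit): below is a minimal sketch of how the updated card's model might be loaded and prompted with `transformers`, using the 4-bit nf4 quantization mentioned in the demo link and the `<|im_start|>`/`<|im_end|>` markers referenced in the prompt-template section of the diff. The repository id, the system message, and the sampling settings are assumptions, not text from the card.

```python
# Sketch: load the model in 4-bit nf4 (as in the demo link) and prompt it with a
# ChatML-style template. Repo id, system prompt, and sampling settings are assumed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "OpenAssistant/llama2-70b-oasst-sft-v10"  # assumed repository id

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,              # 4-bit weights via bitsandbytes
    bnb_4bit_quant_type="nf4",      # nf4 quantization, as in the sampling report
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)

# ChatML-style prompt; the system message here is a placeholder, not the card's exact text.
prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\nWhat is a causal decoder-only transformer?<|im_end|>\n"
    "<|im_start|>assistant\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# Print only the newly generated continuation, without the prompt tokens.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```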