andreaskoepf committed
Commit 5997675 • 1 parent: 7d15677
Update README.md

README.md CHANGED
@@ -4,18 +4,26 @@ language:
 - en
 datasets:
 - OpenAssistant/oasst1
+- ehartford/dolphin
+- rombodawg/LosslessMegaCodeTrainingV2_1m_Evol_Uncensored
+- argilla/databricks-dolly-15k-curated-multilingual
+library_name: transformers
+pipeline_tag: text-generation
+tags:
+- sft
 ---
 # Open-Assistant Llama2 70B SFT v10
 
 This model is an Open-Assistant fine-tuning of Meta's [Llama2 70B](https://huggingface.co/meta-llama/Llama-2-70b) LLM.
-
+The model was fine-tuned in two stages: first on a mix of synthetic instructions and coding-task data, and then in a second "finishing" stage
+on top-1 human Open-Assistant demonstrations exported on July 23, 2023 (see the configuration details section below).
 
 ## Model Details
 
 - **Finetuned from:** [meta-llama/Llama-2-70b](https://huggingface.co/meta-llama/Llama-2-70b) via [epfLLM/old-Megatron-LM](https://github.com/epfLLM/old-Megatron-LM)
 - **Model type:** Causal decoder-only transformer language model
-- **Language:** English
-- **Weights & Biases:** [Stage 1](https://wandb.ai/open-assistant/public-sft/runs/run45_oasst_pre10_llama2_70b) (1 epoch pretrain-mix, 12k steps), [Stage 2](https://wandb.ai/open-assistant/public-sft/runs/run46_oasst_sft10_llama2_70b) (3 epochs oasst top-1, 519 steps)
+- **Language:** English (and limited capabilities in German, Spanish, French, Italian, Portuguese, Polish, Dutch, Romanian, Czech, Swedish)
+- **Weights & Biases training logs:** [Stage 1](https://wandb.ai/open-assistant/public-sft/runs/run45_oasst_pre10_llama2_70b) (1 epoch pretrain-mix, 12k steps), [Stage 2](https://wandb.ai/open-assistant/public-sft/runs/run46_oasst_sft10_llama2_70b) (3 epochs oasst top-1, 519 steps)
 - **Demo:** [Continuations for 250 random prompts (TGI, 4bit nf4 quantization)](https://open-assistant.github.io/oasst-model-eval/?f=https%3A%2F%2Fraw.githubusercontent.com%2FOpen-Assistant%2Foasst-model-eval%2Fmain%2Fsampling_reports%2Foasst-sft%2F2023-08-22_OpenAssistant_llama2-70b-oasst-sft-v10_sampling_noprefix2_nf4.json%0A)
 - **Evaluation:** [FastEval-OpenAssistant Overview](https://tju01.github.io/FastEval-OpenAssistant/) (using [FastEval](https://github.com/FastEval/FastEval) & [vLLM](https://github.com/vllm-project/vllm))
 - **License:** [LLAMA 2 COMMUNITY LICENSE AGREEMENT](https://huggingface.co/meta-llama/Llama-2-70b/raw/main/LICENSE.txt)
@@ -47,10 +55,38 @@ If a question does not make any sense, or is not factually coherent, explain why
 <|im_end|>
 ```
 
+### Credits & Special Thanks
+
+- Compute was generously sponsored by the EPFL [Machine Learning and Optimization Laboratory](https://www.epfl.ch/labs/mlo/).
+- The open-source [epfLLM/Megatron-LLM](https://github.com/epfLLM/Megatron-LLM) trainer was used for fine-tuning.
+- [rombodawg](https://huggingface.co/rombodawg) curated the [LosslessMegaCodeTrainingV2_1m_Evol_Uncensored](https://huggingface.co/datasets/rombodawg/LosslessMegaCodeTrainingV2_1m_Evol_Uncensored) dataset.
+- [ehartford](https://huggingface.co/ehartford) generated and published the [ehartford/dolphin](https://huggingface.co/datasets/ehartford/dolphin) and [ehartford/oa_leet10k](https://huggingface.co/datasets/ehartford/oa_leet10k) datasets.
+- [Argilla](https://huggingface.co/argilla) curated and published the [argilla/databricks-dolly-15k-curated-multilingual](https://huggingface.co/datasets/argilla/databricks-dolly-15k-curated-multilingual) dataset.
+- [shahules786](https://github.com/shahules786) de-duplicated and filtered the Dolphin dataset with a cluster-center approach and generated the orca-best (orca-chat) dataset.
+- [andreaskoepf](https://github.com/andreaskoepf/) prepared & orchestrated the training.
+
+We especially want to thank everyone who contributed to the crowd-sourced Open-Assistant dataset creation on https://open-assistant.io/ - without you this project would not have been possible.
+
+## Ethical Considerations and Limitations
+
+Testing conducted to date has been in English, and has not covered, nor could it cover all scenarios.
+For these reasons, as with all LLMs, the potential outputs of llama2-70b-oasst-sft-v10 cannot be predicted
+in advance, and the model may in some instances produce inaccurate, biased or other objectionable responses
+to user prompts. Therefore, before deploying any applications of llama2-70b-oasst-sft-v10, developers should
+perform safety testing and tuning tailored to their specific applications of the model.
+
+Please see Meta's [Responsible Use Guide](https://ai.meta.com/llama/responsible-use-guide/).
+
+
 ## Configuration Details
 
+The "pretokenizer" utility used to tokenize the datamix is part of the Open-Assistant GitHub repository and can be found here: [model/pretokenizer](https://github.com/LAION-AI/Open-Assistant/tree/main/model/pretokenizer).
+
+
 ### Stage 1 Pretokenizer Configuration
 
+Entries of the dataset with assistant replies shorter than 25 tokens were excluded from training.
+
 ```
 oasst_pre10_min25:
 datasets:
@@ -72,6 +108,62 @@ oasst_pre10_min25:
 min_assistant_tokens: 25
 ```
 
+Stage 1 dataset statistics:
+```
+# Stats for output/oasst_pre10_min25_llama2
+
+## Stats for 'Subset of InstructionDataset (megacode2)' (466364 samples (50.0%))
+-----------------
+Accepted: 398223/466364 (85.4%)
+Accepted tokens: 167676873
+Skipped: 68141 (14.6%)
+Min tokens per sample: 36
+Max tokens per sample: 11810
+Avg tokens per sample: 421.063
+-----------------
+
+## Stats for 'Subset of OrcaChat (orca-chat)' (325616 samples (100.0%))
+-----------------
+Accepted: 325616/325616 (100.0%)
+Accepted tokens: 178307574
+Skipped: 0 (0.0%)
+Min tokens per sample: 105
+Max tokens per sample: 10408
+Avg tokens per sample: 547.601
+-----------------
+
+## Stats for 'Subset of Dolly15kMultilingual' (57020 samples (100.0%))
+-----------------
+Accepted: 47494/57020 (83.3%)
+Accepted tokens: 13883177
+Skipped: 9526 (16.7%)
+Min tokens per sample: 34
+Max tokens per sample: 9172
+Avg tokens per sample: 292.314
+-----------------
+
+## Stats for 'Subset of InstructionDataset (oa_leet10k)' (22236 samples (100.0%))
+-----------------
+Accepted: 22236/22236 (100.0%)
+Accepted tokens: 15905296
+Skipped: 0 (0.0%)
+Min tokens per sample: 168
+Max tokens per sample: 10588
+Avg tokens per sample: 715.295
+-----------------
+
+## Stats for 'total' (871236 samples (100.0%))
+-----------------
+Accepted: 793569/871236 (91.1%)
+Accepted tokens: 375772920
+Skipped: 77667 (8.9%)
+Min tokens per sample: 34
+Max tokens per sample: 11810
+Avg tokens per sample: 473.523
+-----------------
+```
+
+
 ### Stage 2 Pretokenizer Configuration
 
 ```
@@ -86,20 +178,47 @@ oasst_top1:
 filename_prefix: "oasst_top1"
 ```
 
+Stage 2 dataset statistics:
+
+```
+# Stats for output/oasst_top1_2023-07-23_llama2
+
+## Stats for 'ListDataset' (11441 samples (100.0%))
+-----------------
+Accepted: 11441/11441 (100.0%)
+Accepted tokens: 5315368
+Skipped: 0 (0.0%)
+Min tokens per sample: 20
+Max tokens per sample: 5407
+Avg tokens per sample: 464.58945896337735
+-----------------
+
+## Stats for 'total' (11441 samples (100.0%))
+-----------------
+Accepted: 11441/11441 (100.0%)
+Accepted tokens: 5315368
+Skipped: 0 (0.0%)
+Min tokens per sample: 20
+Max tokens per sample: 5407
+Avg tokens per sample: 464.58945896337735
+-----------------
+```
+
+
 ### Megatron Fine-Tuning Arguments for Stage 1 (Instruction Tuning):
 ```
 --tensor_model_parallel_size 8
 --pipeline_model_parallel_size 4
---load ./
---save ./
---tensorboard_dir ./
---data_path ./
+--load ./checkpoints/llama2-70b-tp8-pp4
+--save ./checkpoints/llama2-70b-tp8-pp4-oasst_pre10
+--tensorboard_dir ./checkpoints/llama2-70b-tp8-pp4-oasst_pre10/logging
+--data_path ./data/oasst_pre10_min25_llama2/oasst_sft10-train
 --model_name llama2
 --tokenizer_type SentencePieceTokenizer
 --bf16
 --global_batch_size 64
 --micro_batch_size 2
---vocab_file=./
+--vocab_file=./llama2/Llama-2-7b/tokenizer.model
 --use_rms_norm
 --glu_activation swiglu
 --no_tie_embed_logits
@@ -138,16 +257,16 @@ oasst_top1:
 ```
 --tensor_model_parallel_size 8
 --pipeline_model_parallel_size 4
---load ./
---save ./
---tensorboard_dir ./
---data_path ./
+--load ./checkpoints/llama2-70b-tp8-pp4-oasst_pre10
+--save ./checkpoints/llama2-70b-tp8-pp4-oasst_sft10
+--tensorboard_dir ./checkpoints/llama2-70b-tp8-pp4-oasst_sft10/logging
+--data_path ./data/oasst_top1_2023-07-23_llama2/oasst_top1-train
 --model_name llama2
 --tokenizer_type SentencePieceTokenizer
 --bf16
 --global_batch_size 64
 --micro_batch_size 2
---vocab_file=./
+--vocab_file=./llama2/Llama-2-7b/tokenizer.model
 --use_rms_norm
 --glu_activation swiglu
 --no_tie_embed_logits
@@ -182,14 +301,4 @@ oasst_top1:
 --rope_scaling_factor 1.0
 --finetune
 --wandb_logger
-```
-
-
-## Ethical Considerations and Limitations
-
-Testing conducted to date has been in English, and has not covered, nor could it cover all scenarios.
-For these reasons, as with all LLMs, the potential outputs of llama2-70b-oasst-sft-v10 cannot be predicted
-in advance, and the model may in some instances produce inaccurate, biased or other objectionable responses
-to user prompts. Therefore, before deploying any applications of llama2-70b-oasst-sft-v10, developers should
-perform safety testing and tuning tailored to their specific applications of the model.
-
+```
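For orientation only (not part of the commit): below is a minimal sketch of how the updated card's model might be loaded and prompted with `transformers`, using the 4-bit nf4 quantization mentioned in the demo link and the `<|im_start|>`/`<|im_end|>` markers referenced in the prompt-template section of the diff. The repository id, the system message, and the sampling settings are assumptions, not text from the card.

```python
# Sketch: load the model in 4-bit nf4 (as in the demo link) and prompt it with a
# ChatML-style template. Repo id, system prompt, and sampling settings are assumed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "OpenAssistant/llama2-70b-oasst-sft-v10"  # assumed repository id

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,              # 4-bit weights via bitsandbytes
    bnb_4bit_quant_type="nf4",      # nf4 quantization, as in the sampling report
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)

# ChatML-style prompt; the system message here is a placeholder, not the card's exact text.
prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\nWhat is a causal decoder-only transformer?<|im_end|>\n"
    "<|im_start|>assistant\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# Print only the newly generated continuation, without the prompt tokens.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```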