Text Generation
Transformers
PyTorch
English
llama
sft
Inference Endpoints
text-generation-inference
andreaskoepf committed
Commit 036be44
1 Parent(s): 5997675

Update README.md

Files changed (1): README.md +30 -10
README.md CHANGED

# Open-Assistant Llama2 70B SFT v10

This model is an Open-Assistant fine-tuning of Meta's [Llama2 70B](https://huggingface.co/meta-llama/Llama-2-70b) LLM.
It was fine-tuned in two stages: first on a mix of synthetic instructions and coding-task data, and then in a "polishing" stage
on the best human demonstrations collected at [open-assistant.io](https://open-assistant.io/) up to July 23, 2023 (see *Configuration Details* below).

## Model Details

  ## Prompting / Prompt Template

Due to public demand we changed the prompt template for this model from custom prompter/assistant tokens to OpenAI's [chatml](https://github.com/openai/openai-python/blob/main/chatml.md) standard prompt format.
We hope that this leads to greater compatibility with chat inference/frontend applications.

Prompt template:

```
"""
<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
"""
```
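
To make the template concrete, here is a minimal usage sketch (our illustration, not part of the original card): it fills the template for a single turn and generates with `transformers`. The repo id, dtype, and generation settings are assumptions.

```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def build_prompt(system_message: str, prompt: str) -> str:
    # Fill the chatml template for a single user turn; the trailing
    # "<|im_start|>assistant\n" cues the model to write its reply.
    return (
        f"<|im_start|>system\n{system_message}<|im_end|>\n"
        f"<|im_start|>user\n{prompt}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

model_id = "OpenAssistant/llama2-70b-oasst-sft-v10"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

text = build_prompt("You are a helpful assistant.", "What is the capital of France?")
inputs = tokenizer(text, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```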

The model input can contain multiple conversation turns between user and assistant, e.g.:

```
<|im_start|>user
{prompt 1}<|im_end|>
<|im_start|>assistant
{reply 1}<|im_end|>
<|im_start|>user
{prompt 2}<|im_end|>
<|im_start|>assistant
(...)
```
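
A small helper (a sketch of ours, not from the original card) can serialize such a conversation history into this format before tokenization:

```
# Sketch: fold a chat history into the chatml format shown above.
# Each message is a dict: {"role": "user" or "assistant", "content": str}.
def build_chat_prompt(messages, system_message=None):
    parts = []
    if system_message is not None:
        parts.append(f"<|im_start|>system\n{system_message}<|im_end|>\n")
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")  # cue the next assistant turn
    return "".join(parts)

prompt = build_chat_prompt(
    [
        {"role": "user", "content": "Hi!"},
        {"role": "assistant", "content": "Hello! How can I help you?"},
        {"role": "user", "content": "Tell me a joke."},
    ]
)
```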

The model was partly trained with Orca system messages.
For inference we recommend using the official [Llama2 system message](https://github.com/facebookresearch/llama/blob/ea9f33d6d3ea8ed7d560d270986407fd6c2e52b7/example_chat_completion.py#L57-L61):

```
<|im_start|>system
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.<|im_end|>
```

  ### Credits & Special Thanks

- Thanks to [Meta AI](https://ai.meta.com/) for training and releasing the Llama2 model.
- Compute was generously sponsored by the EPFL [Machine Learning and Optimization Laboratory](https://www.epfl.ch/labs/mlo/).
- The open-source [epfLLM/Megatron-LLM](https://github.com/epfLLM/Megatron-LLM) trainer was used for fine-tuning.
- [rombodawg](https://huggingface.co/rombodawg) curated the [LosslessMegaCodeTrainingV2_1m_Evol_Uncensored](https://huggingface.co/datasets/rombodawg/LosslessMegaCodeTrainingV2_1m_Evol_Uncensored) dataset.

Please see Meta's [Responsible Use Guide](https://ai.meta.com/llama/responsible-use-guide/).

## Note regarding inference with TGI

During evaluation we noticed that this 70B model produced extremely poor outputs when it was loaded in 16-bit precision and sharded in [TGI](https://github.com/huggingface/text-generation-inference).
In contrast, the model could be evaluated without problems using [vLLM](https://github.com/vllm-project/vllm).
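
For reference, a minimal vLLM sketch (our illustration, not from the original card; the tensor-parallel size and sampling settings are assumptions):

```
from vllm import LLM, SamplingParams

# Assumed repo id; tensor_parallel_size=8 is an assumption for a 70B model.
llm = LLM(model="OpenAssistant/llama2-70b-oasst-sft-v10", tensor_parallel_size=8)
params = SamplingParams(temperature=0.7, max_tokens=256, stop=["<|im_end|>"])

prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\nWrite a haiku about llamas.<|im_end|>\n"
    "<|im_start|>assistant\n"
)
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```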

The model also worked decently well when loaded with TGI on a single GPU, nf4-quantized via [TimDettmers/bitsandbytes](https://github.com/TimDettmers/bitsandbytes).
We will get in touch with the TGI authors to find out why sharded 16-bit inference doesn't work as expected.
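
Outside of TGI, a comparable nf4 setup can be sketched with `transformers` and `bitsandbytes`; the repo id and compute dtype below are assumptions, not the card's exact configuration:

```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit nf4 quantization config, as supported by bitsandbytes.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "OpenAssistant/llama2-70b-oasst-sft-v10"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
```
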
  ## Configuration Details