Text Generation
Transformers
PyTorch
English
llama
sft
Inference Endpoints
text-generation-inference
andreaskoepf committed
Commit 036be44
1 Parent(s): 5997675

Update README.md

Files changed (1): README.md +30 -10
README.md CHANGED

# Open-Assistant Llama2 70B SFT v10

This model is an Open-Assistant fine-tuning of Meta's [Llama2 70B](https://huggingface.co/meta-llama/Llama-2-70b) LLM.
It was fine-tuned in two stages: first on a mix of synthetic instructions and coding-task data, and then in a "polishing" stage
on the best human demonstrations collected at [open-assistant.io](https://open-assistant.io/) up to July 23, 2023 (see *Configuration Details* below).

## Model Details

  ## Prompting / Prompt Template

Due to public demand we changed the prompt template for this model from custom prompter/assistant tokens to OpenAI's [chatml](https://github.com/openai/openai-python/blob/main/chatml.md) standard prompt format.
We hope that this leads to greater compatibility with chat inference/frontend applications.

Prompt template:

```
"""
<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
"""
```
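
To make the template concrete, here is a minimal usage sketch (our illustration, not part of the original card): it fills the template for a single turn and generates with `transformers`. The repo id, dtype, and generation settings are assumptions.

```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def build_prompt(system_message: str, prompt: str) -> str:
    # Fill the chatml template for a single user turn; the trailing
    # "<|im_start|>assistant\n" cues the model to write its reply.
    return (
        f"<|im_start|>system\n{system_message}<|im_end|>\n"
        f"<|im_start|>user\n{prompt}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

model_id = "OpenAssistant/llama2-70b-oasst-sft-v10"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

text = build_prompt("You are a helpful assistant.", "What is the capital of France?")
inputs = tokenizer(text, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```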

The model input can contain multiple conversation turns between user and assistant, e.g.:

```
<|im_start|>user
{prompt 1}<|im_end|>
<|im_start|>assistant
{reply 1}<|im_end|>
<|im_start|>user
{prompt 2}<|im_end|>
<|im_start|>assistant
(...)
```
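
A small helper (a sketch of ours, not from the original card) can serialize such a conversation history into this format before tokenization:

```
# Sketch: fold a chat history into the chatml format shown above.
# Each message is a dict: {"role": "user" or "assistant", "content": str}.
def build_chat_prompt(messages, system_message=None):
    parts = []
    if system_message is not None:
        parts.append(f"<|im_start|>system\n{system_message}<|im_end|>\n")
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")  # cue the next assistant turn
    return "".join(parts)

prompt = build_chat_prompt(
    [
        {"role": "user", "content": "Hi!"},
        {"role": "assistant", "content": "Hello! How can I help you?"},
        {"role": "user", "content": "Tell me a joke."},
    ]
)
```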

The model was partly trained with Orca system messages.
For inference we recommend using the official [Llama2 system message](https://github.com/facebookresearch/llama/blob/ea9f33d6d3ea8ed7d560d270986407fd6c2e52b7/example_chat_completion.py#L57-L61):

```
<|im_start|>system
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.<|im_end|>
```

  ### Credits & Special Thanks

- Thanks to [Meta AI](https://ai.meta.com/) for training and releasing the Llama2 model.
- Compute was generously sponsored by the EPFL [Machine Learning and Optimization Laboratory](https://www.epfl.ch/labs/mlo/).
- The open-source [epfLLM/Megatron-LLM](https://github.com/epfLLM/Megatron-LLM) trainer was used for fine-tuning.
- [rombodawg](https://huggingface.co/rombodawg) curated the [LosslessMegaCodeTrainingV2_1m_Evol_Uncensored](https://huggingface.co/datasets/rombodawg/LosslessMegaCodeTrainingV2_1m_Evol_Uncensored) dataset.

Please see Meta's [Responsible Use Guide](https://ai.meta.com/llama/responsible-use-guide/).

## Note regarding inference with TGI

During evaluation we noticed that this 70B model produced extremely poor outputs when it was loaded in 16-bit precision and sharded in [TGI](https://github.com/huggingface/text-generation-inference).
In contrast, the model could be evaluated without problems using [vLLM](https://github.com/vllm-project/vllm).
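
For reference, a minimal vLLM sketch (our illustration, not from the original card; the tensor-parallel size and sampling settings are assumptions):

```
from vllm import LLM, SamplingParams

# Assumed repo id; tensor_parallel_size=8 is an assumption for a 70B model.
llm = LLM(model="OpenAssistant/llama2-70b-oasst-sft-v10", tensor_parallel_size=8)
params = SamplingParams(temperature=0.7, max_tokens=256, stop=["<|im_end|>"])

prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\nWrite a haiku about llamas.<|im_end|>\n"
    "<|im_start|>assistant\n"
)
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```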

The model also worked decently well when loaded with TGI on a single GPU, nf4-quantized via [TimDettmers/bitsandbytes](https://github.com/TimDettmers/bitsandbytes).
We will get in touch with the TGI authors to find out why sharded 16-bit inference doesn't work as expected.
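
Outside of TGI, a comparable nf4 setup can be sketched with `transformers` and `bitsandbytes`; the repo id and compute dtype below are assumptions, not the card's exact configuration:

```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit nf4 quantization config, as supported by bitsandbytes.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "OpenAssistant/llama2-70b-oasst-sft-v10"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
```
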
  ## Configuration Details