Honkware committed on
Commit 51193ba
1 Parent(s): 1188b67

Update README.md

Files changed (1)
  1. README.md +1 -121
README.md CHANGED
@@ -1,121 +1 @@
- ---
- language:
- - en
- tags:
- - llama
- ---
-
- # OpenChat: Less is More for Open-source Models
-
- OpenChat is a series of open-source language models fine-tuned on a small amount of diverse, high-quality multi-round conversation data. The [dataset](https://huggingface.co/datasets/openchat/openchat_sharegpt4_dataset) contains only ~6K GPT-4 conversations filtered from the 90K ShareGPT conversations.
-
- Generic models:
-
- - OpenChat: based on LLaMA-13B (2048 context length)
-   - **105.7%** of ChatGPT score on the Vicuna GPT-4 evaluation
-   - **80.87%** win rate on AlpacaEval
-   - **🚀 Only 6K conversations used for fine-tuning!**
- - OpenChat-8192: based on LLaMA-13B (extended to 8192 context length)
-   - **106.6%** of ChatGPT score on the Vicuna GPT-4 evaluation
-
- Code models:
-
- - OpenCoderPlus: based on StarCoderPlus (native 8192 context length)
-   - **102.5%** of ChatGPT score on the Vicuna GPT-4 evaluation
-   - **78.70%** win rate on AlpacaEval
-
- **NOTE:** Please load the pretrained models using *bfloat16*.
-
- ## Conversation Template
-
- The conversation template **involves concatenating tokens**.
-
- Besides the base model vocabulary, an end-of-turn token `<|end_of_turn|>` is added, with id `eot_token_id`.
-
- ```python
- # OpenChat
- [bos_token_id] + tokenize("Human: ") + tokenize(user_question) + [eot_token_id] + tokenize("Assistant: ")
- # OpenCoder
- tokenize("User:") + tokenize(user_question) + [eot_token_id] + tokenize("Assistant:")
- ```
-
- *Hint: In BPE, `tokenize(A) + tokenize(B)` does not always equal `tokenize(A + B)`.*
-
- The following code generates the conversation templates:
-
- ```python
- from dataclasses import dataclass
- from typing import Optional
-
-
- @dataclass
- class ModelConfig:
-     # Prompt
-     system: Optional[str]
-
-     role_prefix: dict
-     ai_role: str
-     eot_token: str
-     bos_token: Optional[str] = None
-
-     # Get template
-     def generate_conversation_template(self, tokenize_fn, tokenize_special_fn, message_list):
-         tokens = []
-         masks = []
-
-         # begin of sentence (bos)
-         if self.bos_token:
-             t = tokenize_special_fn(self.bos_token)
-             tokens.append(t)
-             masks.append(False)
-
-         # System
-         if self.system:
-             t = tokenize_fn(self.system) + [tokenize_special_fn(self.eot_token)]
-             tokens.extend(t)
-             masks.extend([False] * len(t))
-
-         # Messages
-         for idx, message in enumerate(message_list):
-             # Prefix
-             t = tokenize_fn(self.role_prefix[message["from"]])
-             tokens.extend(t)
-             masks.extend([False] * len(t))
-
-             # Message
-             if "value" in message:
-                 t = tokenize_fn(message["value"]) + [tokenize_special_fn(self.eot_token)]
-                 tokens.extend(t)
-                 masks.extend([message["from"] == self.ai_role] * len(t))
-             else:
-                 assert idx == len(message_list) - 1, "The empty message used for completion must be the last one."
-
-         return tokens, masks
-
-
- MODEL_CONFIG_MAP = {
-     # OpenChat / OpenChat-8192
-     "openchat": ModelConfig(
-         # Prompt
-         system=None,
-
-         role_prefix={
-             "human": "Human: ",
-             "gpt": "Assistant: "
-         },
-         ai_role="gpt",
-         eot_token="<|end_of_turn|>",
-         bos_token="<s>",
-     ),
-
-     # OpenCoder / OpenCoderPlus
-     "opencoder": ModelConfig(
-         # Prompt
-         system=None,
-
-         role_prefix={
-             "human": "User:",
-             "gpt": "Assistant:"
-         },
-         ai_role="gpt",
-         eot_token="<|end_of_turn|>",
-         bos_token=None,
-     )
- }
- ```
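For illustration, a minimal sketch of how the removed card's `MODEL_CONFIG_MAP` helper might be driven with a Hugging Face tokenizer. The `tokenize_fn`/`tokenize_special_fn` wiring, the example messages, and the repository id below are assumptions added here, not part of the original card:

```python
# Hypothetical usage of the ModelConfig code above; the tokenizer wiring is an assumption.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openchat/openchat_8192")  # assumed repo id

config = MODEL_CONFIG_MAP["openchat"]
tokens, masks = config.generate_conversation_template(
    # Plain-text pieces are tokenized without adding special tokens,
    # so the concatenation matches the template described in the card.
    tokenize_fn=lambda text: tokenizer(text, add_special_tokens=False).input_ids,
    # Special tokens (<s>, <|end_of_turn|>) are mapped directly to their ids.
    tokenize_special_fn=tokenizer.convert_tokens_to_ids,
    message_list=[
        {"from": "human", "value": "What is the capital of France?"},
        {"from": "gpt"},  # empty last message: the model completes this turn
    ],
)
print(tokenizer.decode(tokens))
```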
 
+ 4-Bit Quantization of https://huggingface.co/openchat/openchat_8192
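As a rough companion to the new one-line card, here is a minimal sketch of loading the referenced base model in 4-bit with transformers + bitsandbytes. The commit does not state which quantization method this checkpoint uses, so the on-the-fly bitsandbytes approach, the prompt, and the generation settings below are assumptions for illustration only:

```python
# Minimal sketch: on-the-fly 4-bit loading of the base model with bitsandbytes.
# This is an assumption; the checkpoint in this repo may instead be a
# pre-quantized export (e.g. GPTQ) with its own loading path.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "openchat/openchat_8192"  # base model referenced by the card

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # the original card recommends bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)

# Prompt follows the OpenChat template from the removed card.
prompt = "Human: What is the capital of France?<|end_of_turn|>Assistant: "
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```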