Commit f6f7960 (parent: e2e0809) by imone: Create README.md
Files changed (1): README.md (+136 lines)
---
language:
- en
tags:
- llama
---

# Model Card: OpenChat

OpenChat is a series of open-source language models fine-tuned on 6K diverse and high-quality multi-round conversations.

Generic models:

- OpenChat: based on LLaMA-13B (**2048** context length)
- OpenChat-8192: based on LLaMA-13B (**extended to 8192** context length)

Code models **(coming)**:

- OpenCoder: based on StarCoder (**8192** context length)
- OpenCoderPlus: based on StarCoderPlus (**8192** context length)
- OpenCoderBase: based on StarCoderBase (**8192** context length)


## Conversation Template

The conversation template is constructed by **concatenating tokens**.

In addition to the base model's vocabulary, an end-of-turn token `<|end_of_turn|>` is added, with id `eot_token_id`.

```python
# OpenChat
[bos_token_id] + tokenize("Human: ") + tokenize(user_question) + [eot_token_id] + tokenize("Assistant: ")
# OpenCoder
tokenize("User:") + tokenize(user_question) + [eot_token_id] + tokenize("Assistant:")
```

*Hint: in BPE, `tokenize(A) + tokenize(B)` does not always equal `tokenize(A + B)`.*
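To see why, consider a toy greedy longest-match tokenizer (hypothetical, for illustration only; real BPE merge rules behave analogously). A token can merge across the boundary of `A + B`, so tokenizing the pieces separately yields a different segmentation:

```python
# Toy tokenizer (hypothetical, for illustration only): greedily takes the
# longest vocabulary entry at each step, so segmentation depends on context.
VOCAB = ["Hel", "He", "lo", "H", "e", "l", "o"]

def tokenize(text):
    tokens = []
    while text:
        # Longest vocabulary entry that prefixes the remaining text
        match = max((v for v in VOCAB if text.startswith(v)), key=len)
        tokens.append(match)
        text = text[len(match):]
    return tokens

print(tokenize("Hell") + tokenize("o"))  # ['Hel', 'l', 'o']
print(tokenize("Hello"))                 # ['Hel', 'lo']
```

Because `"lo"` only merges when both characters are present in the same string, the template above tokenizes each segment separately and inserts special-token ids directly, rather than tokenizing one concatenated string.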

The following code generates the conversation templates:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelConfig:
    name: str

    # Prompt
    system: Optional[str]

    role_prefix: dict
    ai_role: str
    eot_token: str
    bos_token: Optional[str] = None

    # Tokenize
    max_tokens: Optional[int] = None

    # Get template
    def generate_conversation_template(self, tokenize_fn, tokenize_special_fn, message_list):
        tokens = []
        masks = []

        # begin of sentence (bos)
        if self.bos_token:
            t = tokenize_special_fn(self.bos_token)
            tokens.append(t)
            masks.append(False)

        # System
        if self.system:
            t = tokenize_fn(self.system) + [tokenize_special_fn(self.eot_token)]
            tokens.extend(t)
            masks.extend([False] * len(t))

        # Messages
        for idx, message in enumerate(message_list):
            # Prefix
            t = tokenize_fn(self.role_prefix[message["from"]])
            tokens.extend(t)
            masks.extend([False] * len(t))

            # Message
            if "value" in message:
                t = tokenize_fn(message["value"]) + [tokenize_special_fn(self.eot_token)]
                tokens.extend(t)
                # Only the AI role's tokens are unmasked (used for the loss)
                masks.extend([message["from"] == self.ai_role] * len(t))
            else:
                assert idx == len(message_list) - 1, "Empty message for completion must be the last message."

        # Truncate to the specified number of tokens
        if self.max_tokens:
            tokens = tokens[:self.max_tokens]
            masks = masks[:self.max_tokens]

        return tokens, masks


MODEL_CONFIG_MAP = {
    # OpenChat
    "openchat": ModelConfig(
        name="OpenChat",

        # Prompt
        system=None,

        role_prefix={
            "human": "Human: ",
            "gpt": "Assistant: "
        },
        ai_role="gpt",
        eot_token="<|end_of_turn|>",
        bos_token="<s>",

        # Tokenize
        max_tokens=2048
    ),

    # OpenCoder / OpenCoderPlus
    "opencoder": ModelConfig(
        name="OpenCoder",

        # Prompt
        system=None,

        role_prefix={
            "human": "User:",
            "gpt": "Assistant:"
        },
        ai_role="gpt",
        eot_token="<|end_of_turn|>",
        bos_token=None,

        # Tokenize
        max_tokens=8192
    )
}
```
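To make the masking behavior concrete, here is a condensed, self-contained restatement of the OpenChat branch of the logic above, using stub tokenizers that map each string to a single fake id (hypothetical; a real setup would use the model's tokenizer). Only the assistant's message tokens and its end-of-turn token come out unmasked:

```python
# Stub tokenizers for illustration only: one fake id per distinct string.
ids = {}

def tokenize_fn(text):
    return [ids.setdefault(text, len(ids))]

def tokenize_special_fn(token):
    return ids.setdefault(token, len(ids))

ROLE_PREFIX = {"human": "Human: ", "gpt": "Assistant: "}

def build(messages):
    # bos token, always masked
    tokens = [tokenize_special_fn("<s>")]
    masks = [False]
    for message in messages:
        # Role prefix, always masked
        t = tokenize_fn(ROLE_PREFIX[message["from"]])
        tokens += t
        masks += [False] * len(t)
        # Message body + end-of-turn; unmasked only for the assistant ("gpt")
        t = tokenize_fn(message["value"]) + [tokenize_special_fn("<|end_of_turn|>")]
        tokens += t
        masks += [message["from"] == "gpt"] * len(t)
    return tokens, masks

tokens, masks = build([
    {"from": "human", "value": "Hello"},
    {"from": "gpt", "value": "Hi!"},
])
print(masks)  # [False, False, False, False, False, True, True]
```

The mask aligns one-to-one with `tokens`, so the training loss can be computed over assistant tokens only.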