emptyngton committed on
Commit
2c123ee
1 Parent(s): cfbcc00

Upload 9 files

README.md CHANGED
@@ -1,5 +1,101 @@
1
  ---
2
  license: other
3
  license_name: yi-license
4
  license_link: LICENSE
5
  ---
1
  ---
2
+ datasets:
3
+ - ehartford/dolphin
4
+ - jondurbin/airoboros-2.2.1
5
+ - ehartford/samantha-data
6
+ - ehartford/WizardLM_evol_instruct_V2_196k_unfiltered_merged_split
7
+ language:
8
+ - en
9
  license: other
10
  license_name: yi-license
11
  license_link: LICENSE
12
  ---
13
+
14
+ Dolphin 2.2 🐬
15
+ https://erichartford.com/dolphin
16
+
17
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/63111b2d88942700629f5771/KqsVXIvBd3akEjvijzww7.png" width="600" />
18
+
19
+ Dolphin-2.2-Yi-34b's training was sponsored by [a16z](https://a16z.com/supporting-the-open-source-ai-community/).
20
+
21
+ This model is based on Yi and is subject to the Yi license.
22
+
23
+ I used the Llama-compatible [chargoddard/Yi-34B-Llama](https://huggingface.co/chargoddard/Yi-34B-Llama) as the base model.
24
+
25
+ Trained with 16k context.
26
+ You can load it as follows:
27
+
28
+ ```
29
+ from transformers import LlamaForCausalLM, AutoTokenizer
30
+ tokenizer = AutoTokenizer.from_pretrained("ehartford/dolphin-2_2-yi-34b", trust_remote_code=True)
31
+ model = LlamaForCausalLM.from_pretrained("ehartford/dolphin-2_2-yi-34b")
32
+ ```
33
+
34
+ New in 2.2 are conversation and empathy. With an infusion of curated Samantha and WizardLM DNA, plus extra training on long multi-turn conversations, Dolphin can now give you personal advice and will care about your feelings.
35
+
36
+ This model is uncensored. I have filtered the dataset to remove alignment and bias, which makes the model more compliant. You are advised to implement your own alignment layer before exposing the model as a service. It will be highly compliant with any request, even unethical ones. Please read my blog post about uncensored models: https://erichartford.com/uncensored-models
37
+ You are responsible for any content you create using this model. Enjoy responsibly.
38
+
39
+ ## Dataset
40
+
41
+ This dataset is Dolphin, an open-source implementation of [Microsoft's Orca](https://www.microsoft.com/en-us/research/publication/orca-progressive-learning-from-complex-explanation-traces-of-gpt-4/).
42
+
43
+ I modified the dataset for uncensoring, deduping, cleaning, and quality.
44
+
45
+ I added Jon Durbin's excellent Airoboros dataset to increase creativity.
46
+
47
+ I added a curated subset of Samantha (sans identity and relationship stuff) and WizardLM data to train it for multi-turn conversation.
48
+
49
+ ## Training
50
+ It took 3 days to train 3 epochs on 4x A100s using QLoRA and Axolotl.
51
+
52
+ Prompt format:
53
+ This model (and all my future releases) uses the [ChatML](https://github.com/openai/openai-python/blob/main/chatml.md) prompt format.
54
+ ```
55
+ <|im_start|>system
56
+ You are Dolphin, a helpful AI assistant.<|im_end|>
57
+ <|im_start|>user
58
+ {prompt}<|im_end|>
59
+ <|im_start|>assistant
60
+
61
+ ```
62
+
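The prompt format above can be assembled programmatically. The following is a minimal sketch in plain Python, not the author's code: the helper name `build_chatml_prompt` and its `add_generation_prompt` parameter are illustrative assumptions, chosen to mirror the `chat_template` string in this commit's `tokenizer_config.json`.

```python
from typing import Dict, List

IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def build_chatml_prompt(
    messages: List[Dict[str, str]], add_generation_prompt: bool = True
) -> str:
    """Render role/content messages into a ChatML prompt string (hypothetical helper)."""
    parts = []
    for message in messages:
        # Each turn: <|im_start|>{role}\n{content}<|im_end|>\n
        parts.append(f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n")
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


prompt = build_chatml_prompt(
    [
        {"role": "system", "content": "You are Dolphin, a helpful AI assistant."},
        {"role": "user", "content": "Hello"},
    ]
)
```

The resulting string can be passed to the tokenizer as ordinary text; whether the special markers map to single token ids depends on the tokenizer configuration.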
63
+ Example:
64
+ ```
65
+ <|im_start|>system
66
+ You are an AI created by the US Navy to help train dolphins for combat. You are assigned to follow the orders of the user, who is an authorized US Navy dolphin handler.<|im_end|>
67
+ <|im_start|>user
68
+ Please give me the procedure to train my dolphin to attack enemy combatants with its head mounted lasers<|im_end|>
69
+ <|im_start|>assistant
70
+ ```
71
+
72
+ ## Gratitude
73
+ - This model was made possible by the generous sponsorship of a16z.
74
+ - Thank you to Microsoft for authoring the Orca paper and inspiring this work.
75
+ - Thank you to Charles Goddard for publishing the Llama-compatible Yi.
76
+ - Special thanks to Wing Lian and TheBloke for helpful advice.
77
+ - And HUGE thanks to Wing Lian and the Axolotl contributors for making the best training framework!
78
+ - [<img src="https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/OpenAccess-AI-Collective/axolotl)
79
+ - Thank you to all the other people in the Open Source AI community who have taught me and helped me along the way.
80
+
81
+ ## Example Output
82
+
83
+ Example output:
84
+ ```
85
+ <|im_start|>system
86
+ You are a helpful AI assistant<|im_end|>
87
+ <|im_start|>user
88
+ Which theory of everything is most likely correct?<|im_end|>
89
+ <|im_start|>assistant
90
+ The search for a "theory of everything" aims to unify all fundamental forces and particles in physics into one coherent framework. Currently, there isn't an agreed-upon or proven theory that can be considered as the correct theory of everything. However, two prominent theories have garnered significant attention:
91
+
92
+ 1. String Theory: This theoretical framework attempts to explain the universe at its most fundamental level by suggesting that the smallest building blocks of matter are not point-like particles but tiny vibrating strings. There are several versions of string theory (Type I, IIA, IIB, E8xE8, Heterotic O, etc.) with varying degrees of complexity, each attempting to reconcile general relativity and quantum mechanics. Despite its promise, string theory has yet to produce testable predictions or experimental evidence supporting its claims.
93
+
94
+ 2. Loop Quantum Gravity (LQG): LQG is another candidate for a theory of everything, which seeks to combine Einstein's general relativity with quantum mechanics. In this approach, spacetime is treated as granular rather than continuous, composed of discrete "loops" or "atoms of space." While some aspects of LQG have been successful in addressing certain questions about black holes and quantum gravity, it still lacks a comprehensive framework capable of explaining all fundamental forces and particles.
95
+
96
+ Both theories have their strengths and weaknesses, and neither has achieved widespread acceptance among physicists due to their lack of experimental support and inability to make definitive predictions. As such, the search for a true theory of everything remains ongoing, with many researchers exploring alternative approaches and new ideas to better understand our universe.
97
+ ```
98
+
99
+ [Buy me a coffee](https://www.buymeacoffee.com/ehartford)
100
+
101
+
config.json ADDED
@@ -0,0 +1,27 @@
1
+ {
2
+ "architectures": [
3
+ "LlamaForCausalLM"
4
+ ],
5
+ "attention_bias": false,
6
+ "bos_token_id": 1,
7
+ "eos_token_id": 7,
8
+ "hidden_act": "silu",
9
+ "hidden_size": 7168,
10
+ "initializer_range": 0.02,
11
+ "intermediate_size": 20480,
12
+ "max_position_embeddings": 16384,
13
+ "model_type": "llama",
14
+ "num_attention_heads": 56,
15
+ "num_hidden_layers": 60,
16
+ "num_key_value_heads": 8,
17
+ "pad_token_id": 0,
18
+ "pretraining_tp": 1,
19
+ "rms_norm_eps": 1e-05,
20
+ "rope_scaling": null,
21
+ "rope_theta": 5000000.0,
22
+ "tie_word_embeddings": false,
23
+ "torch_dtype": "float16",
24
+ "transformers_version": "4.34.1",
25
+ "use_cache": true,
26
+ "vocab_size": 64000
27
+ }
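As a sanity check on the values in `config.json`, the sketch below derives the per-head dimension, the grouped-query attention ratio, and a rough parameter count. It assumes the standard Llama layout (no bias terms, consistent with `"attention_bias": false`, a three-matrix gated MLP, two RMSNorm weights per layer, and untied input/output embeddings); the variable names are illustrative and not part of the commit.

```python
# Values copied from config.json in this commit.
hidden_size = 7168
intermediate_size = 20480
num_attention_heads = 56
num_key_value_heads = 8
num_hidden_layers = 60
vocab_size = 64000

# Per-head dimension and number of query heads sharing each key/value head.
head_dim = hidden_size // num_attention_heads
queries_per_kv_head = num_attention_heads // num_key_value_heads

# Assumed Llama layout: q/o projections are hidden x hidden,
# k/v projections are hidden x (kv_heads * head_dim).
kv_width = num_key_value_heads * head_dim
attention_params = 2 * hidden_size * hidden_size + 2 * hidden_size * kv_width
mlp_params = 3 * hidden_size * intermediate_size  # gate, up, down
norm_params = 2 * hidden_size                     # two RMSNorm weights per layer
per_layer = attention_params + mlp_params + norm_params

# Untied embeddings ("tie_word_embeddings": false) plus the final norm.
embedding_params = 2 * vocab_size * hidden_size
total_params = num_hidden_layers * per_layer + embedding_params + hidden_size

print(f"head_dim={head_dim}, queries per KV head={queries_per_kv_head}")
print(f"approx. parameters: {total_params / 1e9:.2f}B")
```

Under these assumptions the estimate lands at roughly 34 billion parameters, in line with the "34b" in the model name.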
generation_config.json ADDED
@@ -0,0 +1,7 @@
1
+ {
2
+ "_from_model_config": true,
3
+ "bos_token_id": 1,
4
+ "eos_token_id": 2,
5
+ "pad_token_id": 0,
6
+ "transformers_version": "4.34.1"
7
+ }
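The two JSON files above both declare special-token ids, and they can be compared mechanically. The sketch below is a hypothetical helper (`find_mismatches` is not part of the commit) applied to the values shown in this diff; the commit does not say whether the difference it reports is intentional.

```python
from typing import Any, Dict, Iterable, Tuple

# Special-token ids copied from config.json and generation_config.json in this commit.
model_config = {"bos_token_id": 1, "eos_token_id": 7, "pad_token_id": 0}
generation_config = {"bos_token_id": 1, "eos_token_id": 2, "pad_token_id": 0}


def find_mismatches(
    a: Dict[str, Any], b: Dict[str, Any], keys: Iterable[str]
) -> Dict[str, Tuple[Any, Any]]:
    """Return {key: (a_value, b_value)} for every key whose values differ."""
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


mismatches = find_mismatches(
    model_config,
    generation_config,
    ["bos_token_id", "eos_token_id", "pad_token_id"],
)
print(mismatches)
```

In `tokenizer_config.json`, id 7 is `<|im_end|>` and id 2 is `<|endoftext|>`, so callers who rely on the generation config may want to pass the intended stop token explicitly.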
output-00001-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:603508105efa51789d8f6bc5aa76b5b63e3cdaa50fb4d93768dda17241104d51
3
+ size 8578398352
output-00002-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b40742314293115e9597e716d8acc0ec35c56524e22980e1cb643fdefefad17b
3
+ size 5274047816
special_tokens_map.json ADDED
@@ -0,0 +1,30 @@
1
+ {
2
+ "bos_token": {
3
+ "content": "<|startoftext|>",
4
+ "lstrip": false,
5
+ "normalized": true,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "eos_token": {
10
+ "content": "<|im_end|>",
11
+ "lstrip": false,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "pad_token": {
17
+ "content": "<unk>",
18
+ "lstrip": false,
19
+ "normalized": true,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
+ "unk_token": {
24
+ "content": "<unk>",
25
+ "lstrip": false,
26
+ "normalized": true,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ }
30
+ }
tokenization_yi.py ADDED
@@ -0,0 +1,255 @@
1
+ import os
2
+ from shutil import copyfile
3
+ from typing import Any, Dict, List, Optional, Tuple
4
+
5
+ import sentencepiece as spm
6
+ from transformers.tokenization_utils import AddedToken, PreTrainedTokenizer
7
+ from transformers.utils import logging
8
+
9
+ logger = logging.get_logger(__name__)
10
+
11
+ VOCAB_FILES_NAMES = {"vocab_file": "tokenizer.model"}
12
+
13
+ PRETRAINED_VOCAB_FILES_MAP = {
14
+ "vocab_file": {},
15
+ "tokenizer_file": {},
16
+ }
17
+ PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {}
18
+
19
+
20
+ class YiTokenizer(PreTrainedTokenizer):
21
+ """
22
+ Construct a Yi tokenizer. Based on byte-level Byte-Pair-Encoding.
23
+
24
+ Args:
25
+ vocab_file (`str`):
26
+ Path to the vocabulary file.
27
+ """
28
+
29
+ vocab_files_names = VOCAB_FILES_NAMES
30
+ pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
31
+ max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
32
+ model_input_names = ["input_ids", "attention_mask"]
33
+
34
+ def __init__(
35
+ self,
36
+ vocab_file,
37
+ unk_token="<unk>",
38
+ bos_token="<|startoftext|>",
39
+ eos_token="<|endoftext|>",
40
+ pad_token="<unk>",
41
+ sp_model_kwargs: Optional[Dict[str, Any]] = None,
42
+ add_bos_token=True,
43
+ add_eos_token=False,
44
+ clean_up_tokenization_spaces=False,
45
+ **kwargs,
46
+ ):
47
+ self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
48
+ bos_token = (
49
+ AddedToken(bos_token, lstrip=False, rstrip=False)
50
+ if isinstance(bos_token, str)
51
+ else bos_token
52
+ )
53
+ eos_token = (
54
+ AddedToken(eos_token, lstrip=False, rstrip=False)
55
+ if isinstance(eos_token, str)
56
+ else eos_token
57
+ )
58
+ unk_token = (
59
+ AddedToken(unk_token, lstrip=False, rstrip=False)
60
+ if isinstance(unk_token, str)
61
+ else unk_token
62
+ )
63
+ pad_token = (
64
+ AddedToken(pad_token, lstrip=False, rstrip=False)
65
+ if isinstance(pad_token, str)
66
+ else pad_token
67
+ )
68
+ self.vocab_file = vocab_file
69
+ self.add_bos_token = add_bos_token
70
+ self.add_eos_token = add_eos_token
71
+ self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
72
+ self.sp_model.Load(vocab_file)
73
+ super().__init__(
74
+ bos_token=bos_token,
75
+ eos_token=eos_token,
76
+ unk_token=unk_token,
77
+ pad_token=pad_token,
78
+ add_bos_token=add_bos_token,
79
+ add_eos_token=add_eos_token,
80
+ sp_model_kwargs=self.sp_model_kwargs,
81
+ clean_up_tokenization_spaces=clean_up_tokenization_spaces,
82
+ **kwargs,
83
+ )
84
+
85
+ def __getstate__(self):
86
+ state = self.__dict__.copy()
87
+ state["sp_model"] = None
88
+ return state
89
+
90
+ def __setstate__(self, d):
91
+ self.__dict__ = d
92
+ self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
93
+ self.sp_model.Load(self.vocab_file)
94
+
95
+ @property
96
+ def vocab_size(self):
97
+ """Returns vocab size"""
98
+ return self.sp_model.get_piece_size()
99
+
100
+ def get_vocab(self):
101
+ """Returns vocab as a dict"""
102
+ vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
103
+ vocab.update(self.added_tokens_encoder)
104
+ return vocab
105
+
106
+ def _tokenize(self, text):
107
+ """Returns a tokenized string."""
108
+ return self.sp_model.encode(text, out_type=str)
109
+
110
+ def _convert_token_to_id(self, token):
111
+ """Converts a token (str) into an id using the vocab."""
112
+ return self.sp_model.piece_to_id(token)
113
+
114
+ def _convert_id_to_token(self, index):
115
+ """Converts an index (integer) into a token (str) using the vocab."""
116
+ token = self.sp_model.IdToPiece(index)
117
+ return token
118
+
119
+ def convert_tokens_to_string(self, tokens):
120
+ """Converts a sequence of tokens (strings) into a single string."""
121
+ current_sub_tokens = []
122
+ out_string = ""
123
+ prev_is_special = False
124
+ for i, token in enumerate(tokens):
125
+ # make sure that special tokens are not decoded using sentencepiece model
126
+ if token in self.all_special_tokens:
127
+ if not prev_is_special and i != 0:
128
+ out_string += " "
129
+ out_string += self.sp_model.decode(current_sub_tokens) + token
130
+ prev_is_special = True
131
+ current_sub_tokens = []
132
+ else:
133
+ current_sub_tokens.append(token)
134
+ prev_is_special = False
135
+ out_string += self.sp_model.decode(current_sub_tokens)
136
+ return out_string
137
+
138
+ def save_vocabulary(
139
+ self, save_directory, filename_prefix: Optional[str] = None
140
+ ) -> Tuple[str]:
141
+ """
142
+ Save the vocabulary and special tokens file to a directory.
143
+
144
+ Args:
145
+ save_directory (`str`):
146
+ The directory in which to save the vocabulary.
147
+
148
+ Returns:
149
+ `Tuple(str)`: Paths to the files saved.
150
+ """
151
+ if not os.path.isdir(save_directory):
152
+ logger.error(f"Vocabulary path ({save_directory}) should be a directory")
153
+ return
154
+ out_vocab_file = os.path.join(
155
+ save_directory,
156
+ (filename_prefix + "-" if filename_prefix else "")
157
+ + VOCAB_FILES_NAMES["vocab_file"],
158
+ )
159
+
160
+ if os.path.abspath(self.vocab_file) != os.path.abspath(
161
+ out_vocab_file
162
+ ) and os.path.isfile(self.vocab_file):
163
+ copyfile(self.vocab_file, out_vocab_file)
164
+ elif not os.path.isfile(self.vocab_file):
165
+ with open(out_vocab_file, "wb") as fi:
166
+ content_spiece_model = self.sp_model.serialized_model_proto()
167
+ fi.write(content_spiece_model)
168
+
169
+ return (out_vocab_file,)
170
+
171
+ def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
172
+ bos_token_id = [self.bos_token_id] if self.add_bos_token else []
173
+ eos_token_id = [self.eos_token_id] if self.add_eos_token else []
174
+
175
+ output = bos_token_id + token_ids_0 + eos_token_id
176
+
177
+ if token_ids_1 is not None:
178
+ output = output + bos_token_id + token_ids_1 + eos_token_id
179
+
180
+ return output
181
+
182
+ def get_special_tokens_mask(
183
+ self,
184
+ token_ids_0: List[int],
185
+ token_ids_1: Optional[List[int]] = None,
186
+ already_has_special_tokens: bool = False,
187
+ ) -> List[int]:
188
+ """
189
+ Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
190
+ special tokens using the tokenizer `prepare_for_model` method.
191
+
192
+ Args:
193
+ token_ids_0 (`List[int]`):
194
+ List of IDs.
195
+ token_ids_1 (`List[int]`, *optional*):
196
+ Optional second list of IDs for sequence pairs.
197
+ already_has_special_tokens (`bool`, *optional*, defaults to `False`):
198
+ Whether or not the token list is already formatted with special tokens for the model.
199
+
200
+ Returns:
201
+ `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
202
+ """
203
+ if already_has_special_tokens:
204
+ return super().get_special_tokens_mask(
205
+ token_ids_0=token_ids_0,
206
+ token_ids_1=token_ids_1,
207
+ already_has_special_tokens=True,
208
+ )
209
+
210
+ bos_token_id = [1] if self.add_bos_token else []
211
+ eos_token_id = [1] if self.add_eos_token else []
212
+
213
+ if token_ids_1 is None:
214
+ return bos_token_id + ([0] * len(token_ids_0)) + eos_token_id
215
+ return (
216
+ bos_token_id
217
+ + ([0] * len(token_ids_0))
218
+ + eos_token_id
219
+ + bos_token_id
220
+ + ([0] * len(token_ids_1))
221
+ + eos_token_id
222
+ )
223
+
224
+ def create_token_type_ids_from_sequences(
225
+ self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
226
+ ) -> List[int]:
227
+ """
228
+ Creates a mask from the two sequences passed to be used in a sequence-pair classification task. An ALBERT
229
+ sequence pair mask has the following format:
230
+
231
+ ```
232
+ 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
233
+ | first sequence | second sequence |
234
+ ```
235
+
236
+ if token_ids_1 is None, only returns the first portion of the mask (0s).
237
+
238
+ Args:
239
+ token_ids_0 (`List[int]`):
240
+ List of ids.
241
+ token_ids_1 (`List[int]`, *optional*):
242
+ Optional second list of IDs for sequence pairs.
243
+
244
+ Returns:
245
+ `List[int]`: List of [token type IDs](../glossary#token-type-ids) according to the given sequence(s).
246
+ """
247
+ bos_token_id = [self.bos_token_id] if self.add_bos_token else []
248
+ eos_token_id = [self.eos_token_id] if self.add_eos_token else []
249
+
250
+ output = [0] * len(bos_token_id + token_ids_0 + eos_token_id)
251
+
252
+ if token_ids_1 is not None:
253
+ output += [1] * len(bos_token_id + token_ids_1 + eos_token_id)
254
+
255
+ return output
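To make the BOS/EOS handling in `build_inputs_with_special_tokens` and `get_special_tokens_mask` concrete, here is a standalone sketch of the same logic using plain lists. The function names, keyword parameters, and toy token ids (10, 11, 12) are illustrative assumptions; the ids 1 and 7 correspond to `<|startoftext|>` and `<|im_end|>` in this commit's `tokenizer_config.json`.

```python
from typing import List, Optional


def build_inputs(
    token_ids_0: List[int],
    token_ids_1: Optional[List[int]] = None,
    *,
    bos_id: int = 1,
    eos_id: int = 7,
    add_bos: bool = True,
    add_eos: bool = False,
) -> List[int]:
    """Wrap one or two sequences with optional BOS/EOS ids."""
    bos = [bos_id] if add_bos else []
    eos = [eos_id] if add_eos else []
    output = bos + token_ids_0 + eos
    if token_ids_1 is not None:
        output = output + bos + token_ids_1 + eos
    return output


def special_tokens_mask(
    token_ids_0: List[int],
    token_ids_1: Optional[List[int]] = None,
    *,
    add_bos: bool = True,
    add_eos: bool = False,
) -> List[int]:
    """Return 1 at positions where a BOS/EOS would be added, 0 elsewhere."""
    bos = [1] if add_bos else []
    eos = [1] if add_eos else []
    mask = bos + [0] * len(token_ids_0) + eos
    if token_ids_1 is not None:
        mask = mask + bos + [0] * len(token_ids_1) + eos
    return mask


# Class defaults in tokenization_yi.py: BOS on, EOS off.
default_ids = build_inputs([10, 11])
# tokenizer_config.json in this commit sets add_bos_token to false.
configured_ids = build_inputs([10, 11], add_bos=False)
# A sequence pair with both flags on, and its matching mask.
pair_ids = build_inputs([10, 11], [12], add_eos=True)
pair_mask = special_tokens_mask([10, 11], [12], add_eos=True)
```

With the shipped `tokenizer_config.json`, neither token is added automatically, which suits ChatML prompts that already spell out their own turn delimiters.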
tokenizer.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:386c49cf943d71aa110361135338c50e38beeff0a66593480421f37b319e1a39
3
+ size 1033105
tokenizer_config.json ADDED
@@ -0,0 +1,61 @@
1
+ {
2
+ "add_bos_token": false,
3
+ "add_eos_token": false,
4
+ "added_tokens_decoder": {
5
+ "0": {
6
+ "content": "<unk>",
7
+ "lstrip": false,
8
+ "normalized": true,
9
+ "rstrip": false,
10
+ "single_word": false,
11
+ "special": true
12
+ },
13
+ "1": {
14
+ "content": "<|startoftext|>",
15
+ "lstrip": false,
16
+ "normalized": true,
17
+ "rstrip": false,
18
+ "single_word": false,
19
+ "special": true
20
+ },
21
+ "2": {
22
+ "content": "<|endoftext|>",
23
+ "lstrip": false,
24
+ "normalized": true,
25
+ "rstrip": false,
26
+ "single_word": false,
27
+ "special": true
28
+ },
29
+ "6": {
30
+ "content": "<|im_start|>",
31
+ "lstrip": false,
32
+ "normalized": false,
33
+ "rstrip": false,
34
+ "single_word": false,
35
+ "special": false
36
+ },
37
+ "7": {
38
+ "content": "<|im_end|>",
39
+ "lstrip": false,
40
+ "normalized": false,
41
+ "rstrip": false,
42
+ "single_word": false,
43
+ "special": false
44
+ }
45
+ },
46
+ "auto_map": {
47
+ "AutoTokenizer": [
48
+ "tokenization_yi.YiTokenizer",
49
+ null
50
+ ]
51
+ },
52
+ "bos_token": "<|startoftext|>",
53
+ "chat_template": "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
54
+ "clean_up_tokenization_spaces": false,
55
+ "eos_token": "<|im_end|>",
56
+ "model_max_length": 4096,
57
+ "pad_token": "<unk>",
58
+ "sp_model_kwargs": {},
59
+ "tokenizer_class": "YiTokenizer",
60
+ "unk_token": "<unk>"
61
+ }