Transformers
English
ctranslate2
int8
float16
Inference Endpoints
michaelfeil committed on
Commit
7b61394
1 Parent(s): 3a2eae3

Upload togethercomputer/RedPajama-INCITE-Chat-7B-v0.1 ctranslate fp16 weights

README.md ADDED
@@ -0,0 +1,263 @@
+ ---
+ tags:
+ - ctranslate2
+ - int8
+ - float16
+
+ license: apache-2.0
+ language:
+ - en
+ datasets:
+ - togethercomputer/RedPajama-Data-1T
+ - OpenAssistant/oasst1
+ - databricks/databricks-dolly-15k
+ widget:
+ - text: "<human>: Write an email to my friends inviting them to come to my home on Friday for a dinner party, bring their own food to share.\n<bot>:"
+   example_title: "Email Writing"
+ - text: "<human>: Create a list of things to do in San Francisco\n<bot>:"
+   example_title: "Brainstorming"
+ inference:
+   parameters:
+     temperature: 0.7
+     top_p: 0.7
+     top_k: 50
+     max_new_tokens: 128
+ ---
+ # Fast-Inference with CTranslate2
+ Speed up inference by 2x-8x using int8 inference in C++.
+
+ Quantized version of [togethercomputer/RedPajama-INCITE-Chat-7B-v0.1](https://huggingface.co/togethercomputer/RedPajama-INCITE-Chat-7B-v0.1)
+ ```bash
+ pip install "hf-hub-ctranslate2>=2.0.6" "ctranslate2>=3.13.0"
+ ```
+ Converted on 2023-05-19 using
+ ```
+ ct2-transformers-converter --model togethercomputer/RedPajama-INCITE-Chat-7B-v0.1 --output_dir /home/michael/tmp-ct2fast-RedPajama-INCITE-Chat-7B-v0.1 --force --copy_files tokenizer.json README.md tokenizer_config.json generation_config.json special_tokens_map.json .gitattributes --quantization float16
+ ```
+
+ Checkpoint compatible with [ctranslate2](https://github.com/OpenNMT/CTranslate2) and [hf-hub-ctranslate2](https://github.com/michaelfeil/hf-hub-ctranslate2)
+ - `compute_type=int8_float16` for `device="cuda"`
+ - `compute_type=int8` for `device="cpu"`
+
+ ```python
+ from hf_hub_ctranslate2 import TranslatorCT2fromHfHub, GeneratorCT2fromHfHub
+ from transformers import AutoTokenizer
+
+ model_name = "michaelfeil/ct2fast-RedPajama-INCITE-Chat-7B-v0.1"
+ # use either TranslatorCT2fromHfHub or GeneratorCT2fromHfHub here, depending on model.
+ model = GeneratorCT2fromHfHub(
+     # load in int8 on CUDA
+     model_name_or_path=model_name,
+     device="cuda",
+     compute_type="int8_float16",
+     tokenizer=AutoTokenizer.from_pretrained("togethercomputer/RedPajama-INCITE-Chat-7B-v0.1")
+ )
+ outputs = model.generate(
+     text=["How do you call a fast Flan-ingo?", "User: How are you doing?"],
+ )
+ print(outputs)
+ ```
+
+ # Licence and other remarks:
+ This is just a quantized version. Licence conditions are intended to be identical to those of the original Hugging Face repo.
+
+ # Original description
+
+ tags:
+ - ctranslate2
+ - int8
+ - float16
+
+
+ # RedPajama-INCITE-Chat-7B-v0.1
+
+ RedPajama-INCITE-Chat-7B-v0.1 was developed by Together and leaders from the open-source AI community including Ontocord.ai, ETH DS3Lab, AAI CERC, Université de Montréal, MILA - Québec AI Institute, Stanford Center for Research on Foundation Models (CRFM), Stanford Hazy Research research group and LAION.
+
+ It is fine-tuned on OASST1 and Dolly2 to enhance chatting ability.
+
+ - Base Model: [RedPajama-INCITE-Base-7B-v0.1](https://huggingface.co/togethercomputer/RedPajama-INCITE-Base-7B-v0.1)
+ - Instruction-tuned Version: [RedPajama-INCITE-Instruct-7B-v0.1](https://huggingface.co/togethercomputer/RedPajama-INCITE-Instruct-7B-v0.1)
+ - Chat Version: [RedPajama-INCITE-Chat-7B-v0.1](https://huggingface.co/togethercomputer/RedPajama-INCITE-Chat-7B-v0.1)
+
+
+ ## Model Details
+ - **Developed by**: Together Computer.
+ - **Model type**: Language Model
+ - **Language(s)**: English
+ - **License**: Apache 2.0
+ - **Model Description**: A 6.9B parameter pretrained language model.
+
+ # Quick Start
+
+ Please note that the model requires `transformers` version >= 4.25.1.
+
+ To prompt the chat model, use the following format:
+ ```
+ <human>: [Instruction]
+ <bot>:
+ ```
+
+ ## GPU Inference
+
+ This requires a GPU with 16GB memory.
+
+ ```python
+ import torch
+ import transformers
+ from packaging import version
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ MIN_TRANSFORMERS_VERSION = '4.25.1'
+ # check transformers version (parsed comparison; comparing raw strings would mis-order e.g. '4.9.0' and '4.25.1')
+ assert version.parse(transformers.__version__) >= version.parse(MIN_TRANSFORMERS_VERSION), f'Please upgrade transformers to version {MIN_TRANSFORMERS_VERSION} or higher.'
+
+ # init
+ tokenizer = AutoTokenizer.from_pretrained("togethercomputer/RedPajama-INCITE-Chat-7B-v0.1")
+ model = AutoModelForCausalLM.from_pretrained("togethercomputer/RedPajama-INCITE-Chat-7B-v0.1", torch_dtype=torch.float16)
+ model = model.to('cuda:0')
+ # infer
+ prompt = "<human>: Who is Alan Turing?\n<bot>:"
+ inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
+ input_length = inputs.input_ids.shape[1]
+ outputs = model.generate(
+     **inputs, max_new_tokens=128, do_sample=True, temperature=0.7, top_p=0.7, top_k=50, return_dict_in_generate=True
+ )
+ token = outputs.sequences[0, input_length:]
+ output_str = tokenizer.decode(token)
+ print(output_str)
+ """
+ Alan Mathison Turing (23 June 1912 7 June 1954) was an English computer scientist, mathematician, logician, cryptanalyst, philosopher, mathematician, and theoretical biologist.
+ """
+ ```
+
+ ## GPU Inference in Int8
+
+ This requires a GPU with 12GB memory.
+
+ To run inference with int8, please ensure you have installed accelerate and bitsandbytes. You can install them with the following commands:
+
+ ```bash
+ pip install accelerate
+ pip install bitsandbytes
+ ```
+
+ Then you can run inference with int8 as follows:
+
+ ```python
+ import torch
+ import transformers
+ from packaging import version
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ MIN_TRANSFORMERS_VERSION = '4.25.1'
+ # check transformers version (parsed comparison; comparing raw strings would mis-order e.g. '4.9.0' and '4.25.1')
+ assert version.parse(transformers.__version__) >= version.parse(MIN_TRANSFORMERS_VERSION), f'Please upgrade transformers to version {MIN_TRANSFORMERS_VERSION} or higher.'
+
+ # init
+ tokenizer = AutoTokenizer.from_pretrained("togethercomputer/RedPajama-INCITE-Chat-7B-v0.1")
+ model = AutoModelForCausalLM.from_pretrained("togethercomputer/RedPajama-INCITE-Chat-7B-v0.1", device_map='auto', torch_dtype=torch.float16, load_in_8bit=True)
+
+ # infer
+ prompt = "<human>: Who is Alan Turing?\n<bot>:"
+ inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
+ input_length = inputs.input_ids.shape[1]
+ outputs = model.generate(
+     **inputs, max_new_tokens=128, do_sample=True, temperature=0.7, top_p=0.7, top_k=50, return_dict_in_generate=True
+ )
+ token = outputs.sequences[0, input_length:]
+ output_str = tokenizer.decode(token)
+ print(output_str)
+ """
+ Alan Mathison Turing (23 June 1912 – 7 June 1954) was an English computer scientist, mathematician, logician, cryptanalyst, philosopher, and theoretical biologist.
+ """
+ ```
+
+ ## CPU Inference
+
+ ```python
+ import torch
+ import transformers
+ from packaging import version
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ MIN_TRANSFORMERS_VERSION = '4.25.1'
+ # check transformers version (parsed comparison; comparing raw strings would mis-order e.g. '4.9.0' and '4.25.1')
+ assert version.parse(transformers.__version__) >= version.parse(MIN_TRANSFORMERS_VERSION), f'Please upgrade transformers to version {MIN_TRANSFORMERS_VERSION} or higher.'
+
+ # init
+ tokenizer = AutoTokenizer.from_pretrained("togethercomputer/RedPajama-INCITE-Chat-7B-v0.1")
+ model = AutoModelForCausalLM.from_pretrained("togethercomputer/RedPajama-INCITE-Chat-7B-v0.1", torch_dtype=torch.bfloat16)
+ # infer
+ prompt = "<human>: Who is Alan Turing?\n<bot>:"
+ inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
+ input_length = inputs.input_ids.shape[1]
+ outputs = model.generate(
+     **inputs, max_new_tokens=128, do_sample=True, temperature=0.7, top_p=0.7, top_k=50, return_dict_in_generate=True
+ )
+ token = outputs.sequences[0, input_length:]
+ output_str = tokenizer.decode(token)
+ print(output_str)
+ """
+ Alan Mathison Turing, OBE, FRS, (23 June 1912 – 7 June 1954) was an English computer scientist, mathematician, logician, cryptanalyst, philosopher, and theoretical biologist.
+ """
+ ```
+
+ Please note that since `LayerNormKernelImpl` is not implemented in fp16 for CPU, we use `bfloat16` for CPU inference.
+
+
+ # Uses
+
+ ## Direct Use
+
+ Excluded uses are described below.
+
+ ### Misuse, Malicious Use, and Out-of-Scope Use
+
+ It is the responsibility of the end user to ensure that the model is used in a responsible and ethical manner.
+
+ #### Out-of-Scope Use
+
+ `RedPajama-INCITE-Chat-7B-v0.1` is a language model and may not perform well for other use cases outside of its intended scope.
+ For example, it may not be suitable for use in safety-critical applications or for making decisions that have a significant impact on individuals or society.
+ It is important to consider the limitations of the model and to only use it for its intended purpose.
+
+ #### Misuse and Malicious Use
+
+ `RedPajama-INCITE-Chat-7B-v0.1` is designed for language modeling.
+ Misuse of the model, such as using it to engage in illegal or unethical activities, is strictly prohibited and goes against the principles of the project.
+
+ Using the model to generate content that is cruel to individuals is a misuse of this model. This includes, but is not limited to:
+
+ - Generating fake news, misinformation, or propaganda
+ - Promoting hate speech, discrimination, or violence against individuals or groups
+ - Impersonating individuals or organizations without their consent
+ - Engaging in cyberbullying or harassment
+ - Defamatory content
+ - Spamming or scamming
+ - Sharing confidential or sensitive information without proper authorization
+ - Violating the terms of use of the model or the data used to train it
+ - Creating automated bots for malicious purposes such as spreading malware, phishing scams, or spamming
+
+ ## Limitations
+
+ `RedPajama-INCITE-Chat-7B-v0.1`, like other language models, has limitations that should be taken into consideration.
+ For example, the model may not always provide accurate or relevant answers, particularly for questions that are complex, ambiguous, or outside of its training data.
+ We therefore welcome contributions from individuals and organizations, and encourage collaboration towards creating a more robust and inclusive chatbot.
+
+ ## Training
+
+ **Training Data**
+
+ Please refer to [togethercomputer/RedPajama-Data-1T](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T)
+
+ **Training Procedure**
+
+ - **Hardware:** 8 A100
+ - **Optimizer:** Adam
+ - **Gradient Accumulations**: 1
+ - **Num of Tokens:** 131M tokens
+ - **Learning rate:** 1e-5
+
+ ## Community
+
+ Join us on [Together Discord](https://discord.gg/6ZVDU8tTD4)
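The chat prompt format that the README describes (`<human>: [Instruction]` followed by `<bot>:`) can be assembled with a small helper. The following is a minimal sketch and not part of this commit: `build_prompt` and `extract_reply` are hypothetical names, and cutting the output at the next `<human>:` marker is an assumption about how a generation might run on, not documented behaviour.

```python
def build_prompt(instruction: str) -> str:
    """Wrap one user instruction in the `<human>:` / `<bot>:` chat format."""
    return f"<human>: {instruction.strip()}\n<bot>:"


def extract_reply(generated: str) -> str:
    """Keep only the first bot turn.

    Assumption: the generated text may continue into a new `<human>:` turn,
    so everything from that marker onward is dropped.
    """
    return generated.split("<human>:", 1)[0].strip()


# Reproduces the prompt string used in the README's inference examples.
prompt = build_prompt("Who is Alan Turing?")
print(repr(prompt))
```

The resulting string can be passed as one element of the `text=[...]` list in the CTranslate2 example or to the tokenizer in the `transformers` examples.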
config.json ADDED
@@ -0,0 +1,5 @@
+ {
+   "bos_token": "<|endoftext|>",
+   "eos_token": "<|endoftext|>",
+   "unk_token": "<|endoftext|>"
+ }
generation_config.json ADDED
@@ -0,0 +1,6 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 0,
+   "eos_token_id": 0,
+   "transformers_version": "4.28.1"
+ }
model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cbab5c9ca2a8bb76dc0da12acf26b08c0ea1b31a7254e27b6bd2b40cee0d43b0
+ size 13714629236
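The `model.bin` entry above is a Git LFS pointer, not the weights themselves, and its `size` field gives a quick sanity check on the float16 conversion. The sketch below parses the pointer and estimates the parameter count. `parse_lfs_pointer` is a hypothetical helper, and the estimate assumes nearly all bytes are 2-byte float16 weights, ignoring any file-format overhead.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the `key value` lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    fields["size"] = int(fields["size"])  # size is given in bytes
    return fields


# Pointer contents copied from the model.bin diff above.
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:cbab5c9ca2a8bb76dc0da12acf26b08c0ea1b31a7254e27b6bd2b40cee0d43b0
size 13714629236
"""

BYTES_PER_FP16_PARAM = 2  # float16 stores each weight in 2 bytes

pointer = parse_lfs_pointer(POINTER)
approx_params = pointer["size"] / BYTES_PER_FP16_PARAM  # rough estimate only
print(f"{pointer['size'] / 1e9:.2f} GB on disk, "
      f"roughly {approx_params / 1e9:.2f}B float16 parameters")
```

Under that assumption the file corresponds to about 6.86 billion parameters, which is in line with the "6.9B parameter" figure in the README.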
special_tokens_map.json ADDED
@@ -0,0 +1,5 @@
+ {
+   "bos_token": "<|endoftext|>",
+   "eos_token": "<|endoftext|>",
+   "unk_token": "<|endoftext|>"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "add_prefix_space": false,
+   "bos_token": "<|endoftext|>",
+   "clean_up_tokenization_spaces": true,
+   "eos_token": "<|endoftext|>",
+   "model_max_length": 2048,
+   "tokenizer_class": "GPTNeoXTokenizer",
+   "unk_token": "<|endoftext|>"
+ }
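The `model_max_length` of 2048 in `tokenizer_config.json`, together with the `max_new_tokens: 128` setting in the README's inference parameters, suggests a simple token budget for prompts. The sketch below is illustrative only: it assumes that prompt tokens plus generated tokens should stay within `model_max_length`, and both helper names and the keep-the-most-recent-tokens truncation policy are hypothetical.

```python
from typing import List

MODEL_MAX_LENGTH = 2048  # "model_max_length" from tokenizer_config.json
MAX_NEW_TOKENS = 128     # "max_new_tokens" from the README's inference parameters


def max_prompt_tokens(model_max_length: int = MODEL_MAX_LENGTH,
                      max_new_tokens: int = MAX_NEW_TOKENS) -> int:
    """Return how many prompt tokens fit while leaving room for generation."""
    if max_new_tokens >= model_max_length:
        raise ValueError("max_new_tokens must be smaller than model_max_length")
    return model_max_length - max_new_tokens


def truncate_prompt_ids(token_ids: List[int], budget: int) -> List[int]:
    """Keep only the last `budget` token ids (hypothetical left-truncation policy)."""
    return token_ids[-budget:] if len(token_ids) > budget else token_ids


budget = max_prompt_tokens()
print(budget)
```

With the values above the prompt budget is 1920 tokens. Left truncation keeps the end of the prompt, including the trailing `<bot>:` marker.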
vocabulary.txt ADDED
The diff for this file is too large to render. See raw diff