---
base_model: stabilityai/stable-code-3b
datasets:
- tiiuae/falcon-refinedweb
- bigcode/the-stack-github-issues
- bigcode/commitpackft
- bigcode/starcoderdata
- EleutherAI/proof-pile-2
- meta-math/MetaMathQA
inference: false
language:
- en
library_name: transformers
license: other
metrics:
- code_eval
model-index:
- name: stable-code-3b
  results:
  - dataset:
      name: MultiPL-HumanEval (Python)
      type: nuprl/MultiPL-E
    metrics:
    - name: pass@1
      type: pass@1
      value: 32.4
      verified: false
    task:
      type: text-generation
  - dataset:
      name: MultiPL-HumanEval (C++)
      type: nuprl/MultiPL-E
    metrics:
    - name: pass@1
      type: pass@1
      value: 30.9
      verified: false
    task:
      type: text-generation
  - dataset:
      name: MultiPL-HumanEval (Java)
      type: nuprl/MultiPL-E
    metrics:
    - name: pass@1
      type: pass@1
      value: 32.1
      verified: false
    task:
      type: text-generation
  - dataset:
      name: MultiPL-HumanEval (JavaScript)
      type: nuprl/MultiPL-E
    metrics:
    - name: pass@1
      type: pass@1
      value: 32.1
      verified: false
    task:
      type: text-generation
  - dataset:
      name: MultiPL-HumanEval (PHP)
      type: nuprl/MultiPL-E
    metrics:
    - name: pass@1
      type: pass@1
      value: 24.2
      verified: false
    task:
      type: text-generation
  - dataset:
      name: MultiPL-HumanEval (Rust)
      type: nuprl/MultiPL-E
    metrics:
    - name: pass@1
      type: pass@1
      value: 23.0
      verified: false
    task:
      type: text-generation
model_creator: Stability AI
model_name: Stable Code 3B
model_type: stablelm_epoch
prompt_template: '{prompt}

  '
quantized_by: TheBloke
tags:
- causal-lm
- code
---
<!-- markdownlint-disable MD041 -->

<!-- header start -->
<!-- 200823 -->
<div style="width: auto; margin-left: auto; margin-right: auto">
<img src="https://i.imgur.com/EBdldam.jpg" alt="TheBlokeAI" style="width: 100%; min-width: 400px; display: block; margin: auto;">
</div>
<div style="display: flex; justify-content: space-between; width: 100%;">
    <div style="display: flex; flex-direction: column; align-items: flex-start;">
        <p style="margin-top: 0.5em; margin-bottom: 0em;"><a href="https://discord.gg/theblokeai">Chat & support: TheBloke's Discord server</a></p>
    </div>
    <div style="display: flex; flex-direction: column; align-items: flex-end;">
        <p style="margin-top: 0.5em; margin-bottom: 0em;"><a href="https://www.patreon.com/TheBlokeAI">Want to contribute? TheBloke's Patreon page</a></p>
    </div>
</div>
<div style="text-align:center; margin-top: 0em; margin-bottom: 0em"><p style="margin-top: 0.25em; margin-bottom: 0em;">TheBloke's LLM work is generously supported by a grant from <a href="https://a16z.com">andreessen horowitz (a16z)</a></p></div>
<hr style="margin-top: 1.0em; margin-bottom: 1.0em;">
<!-- header end -->

# Stable Code 3B - GPTQ
- Model creator: [Stability AI](https://huggingface.co/stabilityai)
- Original model: [Stable Code 3B](https://huggingface.co/stabilityai/stable-code-3b)

<!-- description start -->
# Description

This repo contains GPTQ model files for [Stability AI's Stable Code 3B](https://huggingface.co/stabilityai/stable-code-3b).

Multiple GPTQ parameter permutations are provided; see Provided Files below for details of the options provided, their parameters, and the software used to create them.

These files were quantised using hardware kindly provided by [Massed Compute](https://massedcompute.com/).

<!-- description end -->
<!-- repositories-available start -->
## Repositories available

* [GPTQ models for GPU inference, with multiple quantisation parameter options](https://huggingface.co/TheBloke/stable-code-3b-GPTQ)
* [2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference](https://huggingface.co/TheBloke/stable-code-3b-GGUF)
* [Stability AI's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/stabilityai/stable-code-3b)
<!-- repositories-available end -->

<!-- prompt-template start -->
## Prompt template: None

```
{prompt}

```

<!-- prompt-template end -->

<!-- README_GPTQ.md-compatible clients start -->
## Known compatible clients / servers

GPTQ models are currently supported on Linux (NVidia/AMD) and Windows (NVidia only). macOS users: please use GGUF models.

These GPTQ models are known to work in the following inference servers/webuis:

- [text-generation-webui](https://github.com/oobabooga/text-generation-webui)
- [KoboldAI United](https://github.com/henk717/koboldai)
- [LoLLMS Web UI](https://github.com/ParisNeo/lollms-webui)
- [Hugging Face Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference)

This may not be a complete list; if you know of others, please let me know!
<!-- README_GPTQ.md-compatible clients end -->

<!-- README_GPTQ.md-provided-files start -->
## Provided files, and GPTQ parameters

Multiple quantisation parameters are provided, to allow you to choose the best one for your hardware and requirements.

Each separate quant is in a different branch. See below for instructions on fetching from different branches.

Most GPTQ files are made with AutoGPTQ. Mistral models are currently made with Transformers.

<details>
  <summary>Explanation of GPTQ parameters</summary>

- Bits: The bit size of the quantised model.
- GS: GPTQ group size. Higher numbers use less VRAM, but have lower quantisation accuracy. "None" is the lowest possible value.
- Act Order: True or False. Also known as `desc_act`. True results in better quantisation accuracy. Some GPTQ clients have had issues with models that use Act Order plus Group Size, but this is generally resolved now.
- Damp %: A GPTQ parameter that affects how samples are processed for quantisation. 0.01 is default, but 0.1 results in slightly better accuracy.
- GPTQ dataset: The calibration dataset used during quantisation. Using a dataset more appropriate to the model's training can improve quantisation accuracy. Note that the GPTQ calibration dataset is not the same as the dataset used to train the model - please refer to the original model repo for details of the training dataset(s).
- Sequence Length: The length of the dataset sequences used for quantisation. Ideally this is the same as the model sequence length. For some very long sequence models (16+K), a lower sequence length may have to be used. Note that a lower sequence length does not limit the sequence length of the quantised model. It only impacts the quantisation accuracy on longer inference sequences.
- ExLlama Compatibility: Whether this file can be loaded with ExLlama, which currently only supports Llama and Mistral models in 4-bit.

</details>

| Branch | Bits | GS | Act Order | Damp % | GPTQ Dataset | Seq Len | Size | ExLlama | Desc |
| ------ | ---- | -- | --------- | ------ | ------------ | ------- | ---- | ------- | ---- |
| [main](https://huggingface.co/TheBloke/stable-code-3b-GPTQ/tree/main) | 4 | 128 | Yes | 0.1 | [Evol Instruct Code](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1/viewer/) | 4096 | 1.84 GB | No | 4-bit, with Act Order and group size 128g. Uses even less VRAM than 64g, but with slightly lower accuracy. |
| [gptq-4bit-32g-actorder_True](https://huggingface.co/TheBloke/stable-code-3b-GPTQ/tree/gptq-4bit-32g-actorder_True) | 4 | 32 | Yes | 0.1 | [Evol Instruct Code](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1/viewer/) | 4096 | 1.99 GB | No | 4-bit, with Act Order and group size 32g. Gives highest possible inference quality, with maximum VRAM usage. |
| [gptq-8bit--1g-actorder_True](https://huggingface.co/TheBloke/stable-code-3b-GPTQ/tree/gptq-8bit--1g-actorder_True) | 8 | None | Yes | 0.1 | [Evol Instruct Code](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1/viewer/) | 4096 | 3.06 GB | No | 8-bit, with Act Order. No group size, to lower VRAM requirements. |
| [gptq-8bit-128g-actorder_True](https://huggingface.co/TheBloke/stable-code-3b-GPTQ/tree/gptq-8bit-128g-actorder_True) | 8 | 128 | Yes | 0.1 | [Evol Instruct Code](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1/viewer/) | 4096 | 3.12 GB | No | 8-bit, with group size 128g for higher inference quality and with Act Order for even higher accuracy. |
| [gptq-8bit-32g-actorder_True](https://huggingface.co/TheBloke/stable-code-3b-GPTQ/tree/gptq-8bit-32g-actorder_True) | 8 | 32 | Yes | 0.1 | [Evol Instruct Code](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1/viewer/) | 4096 | 3.30 GB | No | 8-bit, with group size 32g and Act Order for maximum inference quality. |
| [gptq-4bit-64g-actorder_True](https://huggingface.co/TheBloke/stable-code-3b-GPTQ/tree/gptq-4bit-64g-actorder_True) | 4 | 64 | Yes | 0.1 | [Evol Instruct Code](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1/viewer/) | 4096 | 1.89 GB | No | 4-bit, with Act Order and group size 64g. Uses less VRAM than 32g, but with slightly lower accuracy. |

<!-- README_GPTQ.md-provided-files end -->

<!-- README_GPTQ.md-download-from-branches start -->
## How to download, including from branches

### In text-generation-webui

To download from the `main` branch, enter `TheBloke/stable-code-3b-GPTQ` in the "Download model" box.

To download from another branch, add `:branchname` to the end of the download name, e.g. `TheBloke/stable-code-3b-GPTQ:gptq-4bit-32g-actorder_True`

### From the command line

I recommend using the `huggingface-hub` Python library:

```shell
pip3 install huggingface-hub
```

To download the `main` branch to a folder called `stable-code-3b-GPTQ`:

```shell
mkdir stable-code-3b-GPTQ
huggingface-cli download TheBloke/stable-code-3b-GPTQ --local-dir stable-code-3b-GPTQ --local-dir-use-symlinks False
```

To download from a different branch, add the `--revision` parameter:

```shell
mkdir stable-code-3b-GPTQ
huggingface-cli download TheBloke/stable-code-3b-GPTQ --revision gptq-4bit-32g-actorder_True --local-dir stable-code-3b-GPTQ --local-dir-use-symlinks False
```

<details>
  <summary>More advanced huggingface-cli download usage</summary>

If you remove the `--local-dir-use-symlinks False` parameter, the files will instead be stored in the central Hugging Face cache directory (default location on Linux is: `~/.cache/huggingface`), and symlinks will be added to the specified `--local-dir`, pointing to their real location in the cache. This allows interrupted downloads to be resumed, and allows you to quickly clone the repo to multiple places on disk without triggering a download again. The downside, and the reason why I don't list that as the default option, is that the files are then hidden away in a cache folder, making it harder to know where your disk space is being used and to clear it up if/when you want to remove a downloaded model.

The cache location can be changed with the `HF_HOME` environment variable, and/or the `--cache-dir` parameter to `huggingface-cli`.

For more documentation on downloading with `huggingface-cli`, please see: [HF -> Hub Python Library -> Download files -> Download from the CLI](https://huggingface.co/docs/huggingface_hub/guides/download#download-from-the-cli).

To accelerate downloads on fast connections (1Gbit/s or higher), install `hf_transfer`:

```shell
pip3 install hf_transfer
```

And set environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`:

```shell
mkdir stable-code-3b-GPTQ
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download TheBloke/stable-code-3b-GPTQ --local-dir stable-code-3b-GPTQ --local-dir-use-symlinks False
```

Windows Command Line users: You can set the environment variable by running `set HF_HUB_ENABLE_HF_TRANSFER=1` before the download command.
</details>
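
If you prefer to stay in Python, the same downloads can be scripted with `huggingface_hub`'s `snapshot_download`. A minimal sketch, using the same repository and branch names as the commands above (the second `local_dir` is just an example folder name):

```python
from huggingface_hub import snapshot_download

# Download the main branch into ./stable-code-3b-GPTQ
snapshot_download(
    repo_id="TheBloke/stable-code-3b-GPTQ",
    local_dir="stable-code-3b-GPTQ",
    local_dir_use_symlinks=False,
)

# Fetch a specific quantisation branch by passing revision=
snapshot_download(
    repo_id="TheBloke/stable-code-3b-GPTQ",
    revision="gptq-4bit-32g-actorder_True",
    local_dir="stable-code-3b-GPTQ-32g",
    local_dir_use_symlinks=False,
)
```
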

### With `git` (**not** recommended)

To clone a specific branch with `git`, use a command like this:

```shell
git clone --single-branch --branch gptq-4bit-32g-actorder_True https://huggingface.co/TheBloke/stable-code-3b-GPTQ
```

Note that using Git with HF repos is strongly discouraged. It will be much slower than using `huggingface-hub`, and will use twice as much disk space as it has to store the model files twice (it stores every byte both in the intended target folder, and again in the `.git` folder as a blob).

<!-- README_GPTQ.md-download-from-branches end -->
<!-- README_GPTQ.md-text-generation-webui start -->
## How to easily download and use this model in [text-generation-webui](https://github.com/oobabooga/text-generation-webui)

Please make sure you're using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).

It is strongly recommended to use the text-generation-webui one-click-installers unless you're sure you know how to make a manual install.

1. Click the **Model tab**.
2. Under **Download custom model or LoRA**, enter `TheBloke/stable-code-3b-GPTQ`.

    - To download from a specific branch, enter for example `TheBloke/stable-code-3b-GPTQ:gptq-4bit-32g-actorder_True`
    - See Provided Files above for the list of branches for each option.

3. Click **Download**.
4. The model will start downloading. Once it's finished it will say "Done".
5. In the top left, click the refresh icon next to **Model**.
6. In the **Model** dropdown, choose the model you just downloaded: `stable-code-3b-GPTQ`
7. The model will automatically load, and is now ready for use!
8. If you want any custom settings, set them and then click **Save settings for this model** followed by **Reload the Model** in the top right.

    - Note that you do not need to and should not set manual GPTQ parameters any more. These are set automatically from the file `quantize_config.json`.

9. Once you're ready, click the **Text Generation** tab and enter a prompt to get started!

<!-- README_GPTQ.md-text-generation-webui end -->

<!-- README_GPTQ.md-use-from-tgi start -->
## Serving this model from Text Generation Inference (TGI)

It's recommended to use TGI version 1.1.0 or later. The official Docker container is: `ghcr.io/huggingface/text-generation-inference:1.1.0`

Example Docker parameters:

```shell
--model-id TheBloke/stable-code-3b-GPTQ --port 3000 --quantize gptq --max-input-length 3696 --max-total-tokens 4096 --max-batch-prefill-tokens 4096
```

Example Python code for interfacing with TGI (requires huggingface-hub 0.17.0 or later):

```shell
pip3 install huggingface-hub
```

```python
from huggingface_hub import InferenceClient

endpoint_url = "https://your-endpoint-url-here"

prompt = "Tell me about AI"
prompt_template = f'''{prompt}
'''

client = InferenceClient(endpoint_url)
response = client.text_generation(
    prompt_template,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.1
)

print(f"Model output: {response}")
```
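
If you would rather receive the TGI response token by token, `InferenceClient.text_generation` also accepts `stream=True`, which yields the generated text piece by piece. A minimal sketch against the same placeholder endpoint:

```python
from huggingface_hub import InferenceClient

client = InferenceClient("https://your-endpoint-url-here")

# With stream=True (and no details), text_generation yields strings as they are generated
for token in client.text_generation(
    "Tell me about AI\n",
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.1,
    stream=True,
):
    print(token, end="", flush=True)
print()
```
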
<!-- README_GPTQ.md-use-from-tgi end -->
<!-- README_GPTQ.md-use-from-python start -->
## Python code example: inference from this GPTQ model

### Install the necessary packages

Requires: Transformers 4.33.0 or later, Optimum 1.12.0 or later, and AutoGPTQ 0.4.2 or later.

```shell
pip3 install --upgrade transformers optimum
# If using PyTorch 2.1 + CUDA 12.x:
pip3 install --upgrade auto-gptq
# or, if using PyTorch 2.1 + CUDA 11.x:
pip3 install --upgrade auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
```

If you are using PyTorch 2.0, you will need to install AutoGPTQ from source. Likewise if you have problems with the pre-built wheels, you should try building from source:

```shell
pip3 uninstall -y auto-gptq
git clone https://github.com/PanQiWei/AutoGPTQ
cd AutoGPTQ
git checkout v0.5.1
pip3 install .
```
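
To confirm the installed versions meet the requirements above, a quick check with `importlib.metadata` (assuming the packages were installed from PyPI under the names `transformers`, `optimum` and `auto-gptq`; a source build may report its version differently):

```python
from importlib.metadata import version

# Compare against the minimums above: Transformers 4.33.0, Optimum 1.12.0, AutoGPTQ 0.4.2
for package in ("transformers", "optimum", "auto-gptq"):
    print(package, version(package))
```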

### Example Python code

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name_or_path = "TheBloke/stable-code-3b-GPTQ"
# To use a different branch, change revision
# For example: revision="gptq-4bit-32g-actorder_True"
model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                             device_map="auto",
                                             trust_remote_code=True,
                                             revision="main")

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

prompt = "Write a story about llamas"
prompt_template = f'''{prompt}
'''

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, do_sample=True, top_p=0.95, top_k=40, max_new_tokens=512)
print(tokenizer.decode(output[0]))

# Inference can also be done using transformers' pipeline

print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.1
)

print(pipe(prompt_template)[0]['generated_text'])
```
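
If you want to watch the output arrive token by token instead of waiting for the full completion, Transformers' `TextStreamer` can be passed to the same `generate()` call. A short sketch loading the model the same way as above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_name_or_path = "TheBloke/stable-code-3b-GPTQ"
model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                             device_map="auto",
                                             trust_remote_code=True,
                                             revision="main")
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

# TextStreamer prints tokens to stdout as they are generated; skip_prompt hides the echoed prompt
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

input_ids = tokenizer("Write a story about llamas\n", return_tensors="pt").input_ids.cuda()
model.generate(inputs=input_ids, streamer=streamer, temperature=0.7, do_sample=True,
               top_p=0.95, top_k=40, max_new_tokens=512)
```
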
<!-- README_GPTQ.md-use-from-python end -->

<!-- README_GPTQ.md-compatibility start -->
## Compatibility

The files provided are tested to work with Transformers. For non-Mistral models, AutoGPTQ can also be used directly.

[ExLlama](https://github.com/turboderp/exllama) is compatible with Llama architecture models (including Mistral, Yi, DeepSeek, SOLAR, etc) in 4-bit. Please see the Provided Files table above for per-file compatibility.

For a list of clients/servers, please see "Known compatible clients / servers", above.
<!-- README_GPTQ.md-compatibility end -->

<!-- footer start -->
<!-- 200823 -->
## Discord

For further support, and discussions on these models and AI in general, join us at:

[TheBloke AI's Discord server](https://discord.gg/theblokeai)

## Thanks, and how to contribute

Thanks to the [chirper.ai](https://chirper.ai) team!

Thanks to Clay from [gpus.llm-utils.org](https://gpus.llm-utils.org)!

I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training.

If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects.

Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits.

* Patreon: https://patreon.com/TheBlokeAI
* Ko-Fi: https://ko-fi.com/TheBlokeAI

**Special thanks to**: Aemon Algiz.

**Patreon special mentions**: Michael Levine, 阿明, Trailburnt, Nikolai Manek, John Detwiler, Randy H, Will Dee, Sebastain Graf, NimbleBox.ai, Eugene Pentland, Emad Mostaque, Ai Maven, Jim Angel, Jeff Scroggin, Michael Davis, Manuel Alberto Morcote, Stephen Murray, Robert, Justin Joy, Luke @flexchar, Brandon Frisco, Elijah Stavena, S_X, Dan Guido, Undi ., Komninos Chatzipapas, Shadi, theTransient, Lone Striker, Raven Klaugh, jjj, Cap'n Zoog, Michel-Marie MAUDET (LINAGORA), Matthew Berman, David, Fen Risland, Omer Bin Jawed, Luke Pendergrass, Kalila, OG, Erik Bjäreholt, Rooh Singh, Joseph William Delisle, Dan Lewis, TL, John Villwock, AzureBlack, Brad, Pedro Madruga, Caitlyn Gatomon, K, jinyuan sun, Mano Prime, Alex, Jeffrey Morgan, Alicia Loh, Illia Dulskyi, Chadd, transmissions 11, fincy, Rainer Wilmers, ReadyPlayerEmma, knownsqashed, Mandus, biorpg, Deo Leter, Brandon Phillips, SuperWojo, Sean Connelly, Iucharbius, Jack West, Harry Royden McLaughlin, Nicholas, terasurfer, Vitor Caleffi, Duane Dunston, Johann-Peter Hartmann, David Ziegler, Olakabola, Ken Nordquist, Trenton Dambrowitz, Tom X Nguyen, Vadim, Ajan Kanaga, Leonard Tan, Clay Pascal, Alexandros Triantafyllidis, JM33133, Xule, vamX, ya boyyy, subjectnull, Talal Aujan, Alps Aficionado, wassieverse, Ari Malik, James Bentley, Woland, Spencer Kim, Michael Dempsey, Fred von Graf, Elle, zynix, William Richards, Stanislav Ovsiannikov, Edmond Seymore, Jonathan Leane, Martin Kemka, usrbinkat, Enrico Ros

Thank you to all my generous patrons and donaters!

And thank you again to a16z for their generous grant.

<!-- footer end -->

# Original model card: Stability AI's Stable Code 3B

# `stable-code-3b`

## Model Description

`stable-code-3b` is a 2.7 billion parameter decoder-only language model pre-trained on 1.3 trillion tokens of diverse textual and code datasets. `stable-code-3b` is trained on 18 programming languages (selected based on the 2023 StackOverflow Developer Survey) and demonstrates state-of-the-art performance (compared to models of similar size) on the MultiPL-E metrics across multiple programming languages tested using [BigCode's Evaluation Harness](https://github.com/bigcode-project/bigcode-evaluation-harness/tree/main).

![spiderchart](stable_code_3b_spiderchart.svg)

| Model            | Size | Python | C++   | JavaScript | Java  | PHP   | Rust  |
|------------------|------|--------|-------|------------|-------|-------|-------|
| **Stable Code**  | 3B   | 32.4%  | 30.9% | 32.1%      | 32.1% | 24.2% | 23.0% |
| CodeLlama        | 7B   | 30.0%  | 28.2% | 32.5%      | 31.1% | 25.7% | 26.3% |
| Deepseek Coder   | 1.3B | 28.6%  | 29.2% | 28.7%      | 29.0% | 23.6% | 18.5% |
| Wizard Coder     | 3B   | 31.6%  | 25.6% | 26.2%      | 25.8% | 25.3% | 20.4% |
| StarCoder        | 3B   | 21.6%  | 19.8% | 21.5%      | 20.5% | 19.0% | 16.9% |
| Replit Code V1.5 | 3B   | 23.0%  | 25.9% | 26.2%      | 23.6% | 23.2% | 21.5% |
| Deci Coder       | 1B   | 19.1%  | 6.8%  | 18.4%      | 16.7% | 2.1%  | 1.7%  |

**Key Features**
* Fill in the Middle (FIM) capability
* Supports long context, trained with sequences of up to 16,384 tokens

## Usage

Get started generating text with `stable-code-3b` by using the following code snippet:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("stabilityai/stable-code-3b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "stabilityai/stable-code-3b",
    trust_remote_code=True,
    torch_dtype="auto",
)
model.cuda()
inputs = tokenizer("import torch\nimport torch.nn as nn", return_tensors="pt").to(model.device)
tokens = model.generate(
    **inputs,
    max_new_tokens=48,
    temperature=0.2,
    do_sample=True,
)
print(tokenizer.decode(tokens[0], skip_special_tokens=True))
```

### Run with Fill in Middle (FIM) ⚡️

<details>
<summary> Click to expand </summary>

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("stabilityai/stable-code-3b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "stabilityai/stable-code-3b",
    trust_remote_code=True,
    torch_dtype="auto",
)
model.cuda()
inputs = tokenizer("<fim_prefix>def fib(n):<fim_suffix> else:\n return fib(n - 2) + fib(n - 1)<fim_middle>", return_tensors="pt").to(model.device)
tokens = model.generate(
    **inputs,
    max_new_tokens=48,
    temperature=0.2,
    do_sample=True,
)
print(tokenizer.decode(tokens[0], skip_special_tokens=True))
```

</details>
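
The decoded FIM output contains the prompt pieces followed by the generated infill. A hypothetical post-processing sketch, continuing from the `tokenizer` and `tokens` in the FIM example above and assuming the decoded text still contains the `<fim_middle>` marker when special tokens are kept:

```python
# Hypothetical helper: keep special tokens so the FIM markers stay visible, then
# treat everything generated after <fim_middle> as the infilled middle.
text = tokenizer.decode(tokens[0], skip_special_tokens=False)
middle = text.split("<fim_middle>", 1)[-1]
if tokenizer.eos_token:  # trim a trailing end-of-text token, if present
    middle = middle.split(tokenizer.eos_token, 1)[0]
print(middle)
```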

### Run with Flash Attention 2 ⚡️

<details>
<summary> Click to expand </summary>

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("stabilityai/stable-code-3b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "stabilityai/stable-code-3b",
    trust_remote_code=True,
    torch_dtype="auto",
    attn_implementation="flash_attention_2",
)
model.cuda()
inputs = tokenizer("import torch\nimport torch.nn as nn", return_tensors="pt").to(model.device)
tokens = model.generate(
    **inputs,
    max_new_tokens=48,
    temperature=0.2,
    do_sample=True,
)
print(tokenizer.decode(tokens[0], skip_special_tokens=True))
```

</details>

## Model Details

* **Developed by**: [Stability AI](https://stability.ai/)
* **Model type**: `stable-code-3b` models are auto-regressive language models based on the transformer decoder architecture.
* **Language(s)**: English, Code
* **Library**: [GPT-NeoX](https://github.com/EleutherAI/gpt-neox)
* **License**: Other
* **Contact**: For questions and comments about the model, please email `lm@stability.ai`

### Model Architecture

The model is a decoder-only transformer similar to the LLaMA ([Touvron et al., 2023](https://arxiv.org/abs/2307.09288)) architecture, with the following modifications:

| Parameters    | Hidden Size | Layers | Heads | Sequence Length |
|---------------|-------------|--------|-------|-----------------|
| 2,796,431,360 | 2560        | 32     | 32    | 16384           |

* **Position Embeddings**: Rotary Position Embeddings ([Su et al., 2021](https://arxiv.org/abs/2104.09864)) applied to the first 25% of head embedding dimensions for improved throughput following [Black et al. (2022)](https://arxiv.org/pdf/2204.06745.pdf).
* **Tokenizer**: We use a modified version of the GPT-NeoX tokenizer ([`NeoX`](https://github.com/EleutherAI/gpt-neox)), adding special tokens such as `<FIM_PREFIX>` and `<FIM_SUFFIX>` to train for Fill in the Middle (FIM) capability, along with other special tokens.

## Training

### Training Dataset

The dataset comprises a filtered mixture of open-source large-scale datasets available on the [HuggingFace Hub](https://huggingface.co/datasets): Falcon RefinedWeb extract ([Penedo et al., 2023](https://huggingface.co/datasets/tiiuae/falcon-refinedweb)), along with [CommitPackFT](https://huggingface.co/datasets/bigcode/commitpackft) and [Github Issues](https://huggingface.co/datasets/bigcode/the-stack-github-issues) (BigCode, 2023), and StarCoder ([Li et al., 2023](https://arxiv.org/abs/2305.06161)). We further supplement our training with data from mathematical domains ([Azerbayev, Zhangir, et al., 2023](https://arxiv.org/abs/2310.10631) and [Yu, Longhui, et al., 2023](https://arxiv.org/abs/2309.12284)).

Top 18 programming languages trained on:
- C
- C++
- Java
- JavaScript
- CSS
- Go
- HTML
- Ruby
- Rust
- Markdown
- Shell
- PHP
- SQL
- R
- TypeScript
- Python
- Jupyter-Clean
- reStructuredText

### Training Procedure

The model is pre-trained on the aforementioned datasets in `bfloat16` precision, optimized with AdamW.

### Training Infrastructure

* **Hardware**: `stable-code-3b` was trained on the Stability AI cluster across 256 NVIDIA A100 40GB GPUs (AWS P4d instances).

* **Software**: We use a fork of `gpt-neox` ([EleutherAI, 2021](https://github.com/EleutherAI/gpt-neox)), train under 2D parallelism (Data and Tensor Parallel) with ZeRO-1 ([Rajbhandari et al., 2019](https://arxiv.org/abs/1910.02054v3)), and rely on flash-attention as well as SwiGLU and Rotary Embedding kernels from FlashAttention-2 ([Dao et al., 2023](https://tridao.me/publications/flash2/flash2.pdf)).

## Use and Limitations

### Intended Use

The model is intended to be used as a foundational base model for application-specific fine-tuning. Developers must evaluate and fine-tune the model for safe performance in downstream applications.

### Limitations and Bias

As a base model, this model may exhibit unreliable, unsafe, or other undesirable behaviors that must be corrected through evaluation and fine-tuning prior to deployment. The pre-training dataset may have contained offensive or inappropriate content, even after applying data cleansing filters, which can be reflected in the model-generated text. We recommend that users exercise caution when using these models in production systems. Do not use the models if they are unsuitable for your application, or for any applications that may cause deliberate or unintentional harm to others.

## How to Cite

```bibtex
@misc{stable-code-3b,
  url={https://huggingface.co/stabilityai/stable-code-3b},
  title={Stable Code 3B},
  author={Pinnaparaju, Nikhil and Adithyan, Reshinth and Phung, Duy and Tow, Jonathan and Baicoianu, James and Cooper, Nathan}
}
```