update readme

Browse files
- README.md: +22 −147
- config.json: +1 −1

README.md (CHANGED)

@@ -2,180 +2,55 @@
license: cc-by-sa-4.0
datasets:
- bigcode/the-stack-dedup
tags:
- code
language:
- code
- programming_language:
- - Markdown
- - Java
- - JavaScript
- - Python
- - TypeScript
- - PHP
- - SQL
- - JSX
- - reStructuredText
- - Rust
- - C
- - CSS
- - Go
- - C++
- - HTML
- - Vue
- - Ruby
- - Jupyter Notebook
- - R
- - Shell
- model-index:
- - name: replit-code-v1-3b
-   results:
-   - task:
-       name: Code Generation
-       type: code-generation
-     dataset:
-       name: "HumanEval"
-       type: openai_humaneval
-     metrics:
-     - name: pass@1
-       type: pass@1
-       value: 0.219
-       verified: false
---

- # replit-code-v1-3b
- Developed by: Replit, Inc.
-
- [**🧑‍💻 Test it on our Demo Space! 🧑‍💻**](https://huggingface.co/spaces/replit/replit-code-v1-3b-demo)

## Model Description
- `replit-code-v1-3b` is a 2.7B Causal Language Model focused on **Code Completion**. The model has been trained on a subset of the [Stack Dedup v1.2 dataset](https://arxiv.org/abs/2211.15533).
-
- The training mixture includes **20 different languages**, listed here in descending order of number of tokens:
- <br/>
- `Markdown`, `Java`, `JavaScript`, `Python`, `TypeScript`, `PHP`, `SQL`, `JSX`, `reStructuredText`, `Rust`, `C`, `CSS`, `Go`, `C++`, `HTML`, `Vue`, `Ruby`, `Jupyter Notebook`, `R`, `Shell`
- <br/>
- In total, the training dataset contains 175B tokens, which were repeated over 3 epochs -- in total, `replit-code-v1-3b` has been trained on **525B** tokens (~195 tokens per parameter).
-
- The model has been trained on the [MosaicML](https://www.mosaicml.com/) platform with 256 x A100-40GB GPUs, leveraging their latest [LLM examples repo](https://github.com/mosaicml/examples/tree/release/v0.0.4/examples/llm).
- <br/>
- `replit-code-v1-3b` is powered by state-of-the-art LLM techniques, such as:
- [Flash Attention](https://arxiv.org/abs/2205.14135) for fast training and inference,
- [AliBi positional embeddings](https://arxiv.org/abs/2108.12409) to support variable context length at inference time,
- the [LionW optimizer](https://arxiv.org/abs/2302.06675),
- etc.
-
- ## Intended Use
- Replit intends this model to be used by anyone as a foundational model for application-specific fine-tuning, without strict limitations on commercial use.
-
- ## Limitations
- The pre-training dataset may have contained offensive or inappropriate content even after applying data cleansing filters, and such content may be reflected in model-generated text. We recommend that users exercise reasonable caution when using the model in production systems. Do not use it for any applications that may cause harm or distress to individuals or groups.

## License
- The model checkpoint and vocabulary file are licensed under the Creative Commons license (CC BY-SA 4.0).

## Contact
- For questions and comments about the model, please post in the community section.

## How to Use
First of all, you need to install the latest versions of the following dependencies:
```
einops
sentencepiece
torch
transformers
```

- You can then load the model as follows:
```python
- from transformers import AutoModelForCausalLM
-
- # load model
- model = AutoModelForCausalLM.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)
- ```
-
- To use the optimized Triton implementation of FlashAttention on GPUs with BF16 precision, first install the following dependencies:
- ```
- flash-attn==0.2.8
- triton==2.0.0.dev20221202
- ```
-
- Then, move the model to `bfloat16` and use it as follows:
- ```python
- import torch
- from transformers import AutoModelForCausalLM
-
- # load model
- model = AutoModelForCausalLM.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True, attn_impl='triton')
- model.to(device='cuda:0', dtype=torch.bfloat16)
-
- # forward pass
- x = torch.tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
- x = x.to(device='cuda:0')
- y = model(x)
- ```
-
- Note that `trust_remote_code=True` is passed to the `from_pretrained` method because ReplitLM is not a class in the [Transformers](https://huggingface.co/docs/transformers/index) library.
-
- ### Tokenizer
-
- We have trained a custom SentencePiece Unigram tokenizer, optimized with a 32,768-token vocabulary specifically for code.
-
- Note that using this requires the `sentencepiece` library to be installed.
-
- The tokenizer can be used as follows:
-
- ```python
- from transformers import AutoTokenizer
-
- # load tokenizer
- tokenizer = AutoTokenizer.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)
-
- # single input encoding + generation
- x = tokenizer.encode('def hello():\n print("hello world")\n', return_tensors='pt')
- y = model.generate(x)
-
- # decoding; clean_up_tokenization_spaces=False to ensure syntactical correctness
- generated_code = tokenizer.decode(y[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
- print(generated_code)
- ```
-
- Note that:
- - `trust_remote_code=True` is passed to the `from_pretrained` method because ReplitLM is not a class in the [Transformers](https://huggingface.co/docs/transformers/index) library.
- - `clean_up_tokenization_spaces=False` is meant to avoid removing spaces in the output, because that would affect the syntactical correctness of the generated code.
-
- ### Generation
-
- You can generate code using the `transformers` library as follows:
-
- ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer
-
- tokenizer = AutoTokenizer.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)
- model = AutoModelForCausalLM.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)
-
- x = tokenizer.encode('def fibonacci(n): ', return_tensors='pt')
- y = model.generate(x, max_length=100, do_sample=True, top_p=0.95, top_k=4, temperature=0.2, num_return_sequences=1, eos_token_id=tokenizer.eos_token_id)
-
- # decoding; clean_up_tokenization_spaces=False to ensure syntactical correctness
- generated_code = tokenizer.decode(y[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
- print(generated_code)
- ```
-
- Note that, as with all code generation models, post-processing of the generated code is important. In particular, the following steps are recommended:
- - stop generation when the EOS token is encountered
- - remove trailing whitespaces
- - set `max_tokens` to a reasonable value based on your completion use case
- - truncate generation at stop words such as `return`, `def`, "```" or "\n\n\n" to avoid generating incomplete code when `max_tokens` is larger than the length of the expected generated code
license: cc-by-sa-4.0
datasets:
- bigcode/the-stack-dedup
+ - sadiqj/opam-source
tags:
- code
language:
- code
+ programming_language:
+ - OCaml
---

+ # camlcoder

## Model Description
+ `camlcoder` is a 2.7B Causal Language Model focused on **Code Completion** for OCaml. It is a fine-tuned version of [replit-code-v1-3b](https://huggingface.co/replit/replit-code-v1-3b). The model has been trained on a subset of the [Stack Dedup v1.2 dataset](https://arxiv.org/abs/2211.15533) and the most recent version of [all packages in Opam that compile on OCaml 5.0](https://huggingface.co/datasets/sadiqj/opam-source).

## License
+ The model checkpoint and vocabulary file are licensed under the Creative Commons license (CC BY-SA 4.0).

## Contact
+ For questions and comments about the model, please post in the community section.

## How to Use
First of all, you need to install the latest versions of the following dependencies:
```
einops
sentencepiece
+ safetensors
torch
transformers
```

+ You can then use the model as follows:
```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM, StoppingCriteria, StoppingCriteriaList
+ import torch
+
+ max_length = 256
+
+ tokenizer = AutoTokenizer.from_pretrained('sadiqj/camlcoder', trust_remote_code=True, max_length=max_length, use_safetensors=True)
+ model = AutoModelForCausalLM.from_pretrained('sadiqj/camlcoder', trust_remote_code=True, use_safetensors=True).to(device='cuda:0', dtype=torch.bfloat16)
+
+ input_ids = tokenizer.encode('(* Return the middle element of the list *)\nlet get_middle l =', return_tensors='pt').to(device='cuda:0')
+
+ # stop generating once the '\n\n' token (a blank line) appears in the output
+ newline_id = tokenizer.encode('\n\n', return_tensors='pt')[0][0].item()
+ class StopOnNewlines(StoppingCriteria):
+     def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
+         return newline_id in input_ids
+
+ output = model.generate(input_ids, max_length=max_length, stopping_criteria=StoppingCriteriaList([StopOnNewlines()]), use_cache=True)
+
+ print(tokenizer.decode(output[0], skip_special_tokens=True))
+ ```
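Even with the stopping criterion above, the decoded text can still end with the stop sequence itself or with a dangling partial line. A minimal, model-free sketch of the kind of truncation the base model's card recommends (the helper name and stop list are illustrative, not part of camlcoder's API):

```python
def truncate_completion(text: str, stop_sequences=("\n\n",)) -> str:
    """Cut generated text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut].rstrip()

# e.g. keep only the first complete OCaml definition from a raw completion
raw = "let get_middle l =\n  List.nth l (List.length l / 2)\n\nlet unrelated = ()"
print(truncate_completion(raw))
```

The same helper works for other stop words (e.g. `";;"` or `"let "`) by passing a different `stop_sequences` tuple.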
config.json (CHANGED)

@@ -1,5 +1,5 @@
{
- "_name_or_path": "
"alibi": true,
"alibi_bias_max": 8,
"architectures": [

{
+ "_name_or_path": "camlcoder",
"alibi": true,
"alibi_bias_max": 8,
"architectures": [
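The `"alibi": true` / `"alibi_bias_max": 8` entries configure the ALiBi position biases mentioned in the base model's card. As a rough illustration, assuming the MosaicML-style formulation this config appears to follow, head `i` of `n` (for power-of-two `n`) gets attention-bias slope `2^(-alibi_bias_max * i / n)`; the function name below is hypothetical:

```python
def alibi_slopes(n_heads: int, alibi_bias_max: int = 8) -> list:
    """Per-head ALiBi slopes: head i (1-indexed) gets 2 ** (-alibi_bias_max * i / n_heads).

    Simplified sketch for power-of-two head counts; real implementations
    interpolate extra slopes when n_heads is not a power of two.
    """
    return [2.0 ** (-alibi_bias_max * i / n_heads) for i in range(1, n_heads + 1)]

slopes = alibi_slopes(8)
print(slopes)  # first head 0.5, halving down to 2 ** -8 for the last head
```

Each slope scales a linear distance penalty added to that head's attention scores, which is what lets ALiBi extrapolate to context lengths beyond those seen in training.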