Text Generation
Transformers
Safetensors
code
replit_lm
custom_code
sadiqj committed on
Commit 29373cc
1 Parent(s): e56d7ea

update readme

Files changed (2)
  1. README.md +22 -147
  2. config.json +1 -1
README.md CHANGED
@@ -2,180 +2,55 @@
  license: cc-by-sa-4.0
  datasets:
  - bigcode/the-stack-dedup
  tags:
  - code
  language:
  - code
- programming_language:
- - Markdown
- - Java
- - JavaScript
- - Python
- - TypeScript
- - PHP
- - SQL
- - JSX
- - reStructuredText
- - Rust
- - C
- - CSS
- - Go
- - C++
- - HTML
- - Vue
- - Ruby
- - Jupyter Notebook
- - R
- - Shell
- model-index:
- - name: replit-code-v1-3b
-   results:
-   - task:
-       name: Code Generation
-       type: code-generation
-     dataset:
-       name: "HumanEval"
-       type: openai_humaneval
-     metrics:
-     - name: pass@1
-       type: pass@1
-       value: 0.219
-       verified: false
  ---


- # replit-code-v1-3b
- Developed by: Replit, Inc.
-
- [**🧑‍💻 Test it on our Demo Space! 🧑‍💻**](https://huggingface.co/spaces/replit/replit-code-v1-3b-demo)

  ## Model Description
- `replit-code-v1-3b` is a 2.7B-parameter Causal Language Model focused on **Code Completion**. The model has been trained on a subset of the [Stack Dedup v1.2 dataset](https://arxiv.org/abs/2211.15533).
-
- The training mixture includes **20 different languages**, listed here in descending order of token count:
- <br/>
- `Markdown`, `Java`, `JavaScript`, `Python`, `TypeScript`, `PHP`, `SQL`, `JSX`, `reStructuredText`, `Rust`, `C`, `CSS`, `Go`, `C++`, `HTML`, `Vue`, `Ruby`, `Jupyter Notebook`, `R`, `Shell`
- <br/>
- The training dataset contains 175B tokens, which were repeated over 3 epochs; in total, `replit-code-v1-3b` has been trained on **525B** tokens (~195 tokens per parameter).
-
- The model has been trained on the [MosaicML](https://www.mosaicml.com/) platform with 256 x A100-40GB GPUs, leveraging their latest [LLM examples repo](https://github.com/mosaicml/examples/tree/release/v0.0.4/examples/llm).
- <br/>
- `replit-code-v1-3b` is powered by state-of-the-art LLM techniques, such as
- [Flash Attention](https://arxiv.org/abs/2205.14135) for fast training and inference,
- [AliBi positional embeddings](https://arxiv.org/abs/2108.12409) to support variable context length at inference time, and
- the [LionW optimizer](https://arxiv.org/abs/2302.06675).
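To make the AliBi technique concrete, here is a minimal, illustrative sketch of the bias it adds to attention scores. The helper name `alibi_bias` and the tensor shapes are our own, and the slope schedule follows the AliBi paper for power-of-two head counts:

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    # Head-specific slopes: 2^(-8/n), 2^(-16/n), ..., 2^(-8)
    # (the AliBi paper's schedule for power-of-two head counts).
    slopes = torch.tensor([2.0 ** (-8.0 * (i + 1) / n_heads) for i in range(n_heads)])
    # distance[i, j] = j - i is <= 0 for past key positions; future positions
    # are clamped to 0 here and assumed to be removed by the causal mask.
    pos = torch.arange(seq_len)
    distance = (pos[None, :] - pos[:, None]).clamp(max=0)
    # Bias added to attention scores: each head penalizes distant keys linearly.
    return slopes[:, None, None] * distance[None, :, :]

bias = alibi_bias(n_heads=8, seq_len=4)  # shape (8, 4, 4)
```

Because the bias depends only on the relative distance between positions, the same rule extends to sequences longer than any seen during training, which is what enables variable context length at inference time.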
-
- ## Intended Use
- Replit intends this model to be used by anyone as a foundational model for application-specific fine-tuning, without strict limitations on commercial use.
-
- ## Limitations
- The pre-training dataset may have contained offensive or inappropriate content even after applying data cleansing filters, and such content may be reflected in model-generated text. We recommend that users exercise reasonable caution when using it in production systems. Do not use it for any applications that may cause harm or distress to individuals or groups.

  ## License
- The model checkpoint and vocabulary file are licensed under the Creative Commons license (CC BY-SA-4.0). Under the license, you must give credit to Replit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests that Replit endorses you or your use.

  ## Contact
- For questions and comments about the model, please post in the community section.
 
  ## How to Use
  First of all, you need to install the latest versions of the following dependencies:
  ```
  einops
  sentencepiece
  torch
  transformers
  ```

- You can then load the model as follows:
  ```python
- from transformers import AutoModelForCausalLM
-
- # load model
- model = AutoModelForCausalLM.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)
- ```
-
- To use the optimized Triton implementation of FlashAttention on GPUs with BF16 precision, first install the following dependencies:
- ```
- flash-attn==0.2.8
- triton==2.0.0.dev20221202
- ```
-
- Then, move the model to `bfloat16` and use it as follows:
- ```python
- import torch
- from transformers import AutoModelForCausalLM
-
- # load the model with the Triton FlashAttention implementation
- model = AutoModelForCausalLM.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True, attn_impl='triton')
- model.to(device='cuda:0', dtype=torch.bfloat16)
-
- # forward pass on a batch of dummy token ids
- x = torch.tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
- x = x.to(device='cuda:0')
- y = model(x)
- ```
-
- Note that `trust_remote_code=True` is passed to the `from_pretrained` method because ReplitLM is not a class in the [Transformers](https://huggingface.co/docs/transformers/index) library.
-
- ### Tokenizer
-
- We have trained a custom SentencePiece Unigram tokenizer with a vocabulary of 32768 tokens, optimized specifically for code.
-
- Note that using it requires the `sentencepiece` library to be installed.
-
- The tokenizer can be used as follows:
-
- ```python
- from transformers import AutoTokenizer
-
- # load tokenizer
- tokenizer = AutoTokenizer.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)
-
- # single input encoding + generation (assumes `model` was loaded as shown above)
- x = tokenizer.encode('def hello():\n print("hello world")\n', return_tensors='pt')
- y = model.generate(x)
-
- # decode with clean_up_tokenization_spaces=False to preserve syntactically significant whitespace
- generated_code = tokenizer.decode(y[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
- print(generated_code)
- ```
-
- Note that:
- - `trust_remote_code=True` is passed to the `from_pretrained` method because ReplitLM is not a class in the [Transformers](https://huggingface.co/docs/transformers/index) library.
- - `clean_up_tokenization_spaces=False` avoids removing spaces in the output, which would break the syntactical correctness of the generated code.
-
- ### Generation
-
- You can generate code using the `transformers` library as follows:
-
- ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer
-
- tokenizer = AutoTokenizer.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)
- model = AutoModelForCausalLM.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)
-
- x = tokenizer.encode('def fibonacci(n): ', return_tensors='pt')
- y = model.generate(x, max_length=100, do_sample=True, top_p=0.95, top_k=4, temperature=0.2, num_return_sequences=1, eos_token_id=tokenizer.eos_token_id)
-
- # decode with clean_up_tokenization_spaces=False to preserve syntactically significant whitespace
- generated_code = tokenizer.decode(y[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
- print(generated_code)
- ```
-
- Experiment with different decoding methods and parameters to get the best results for your use case.
-
- ### Post Processing
-
- Note that, as with all code generation models, post-processing of the generated code is important. In particular, the following post-processing steps are recommended (a sketch follows this list):
- - stop generation when the EOS token is encountered
- - remove trailing whitespace
- - set `max_tokens` to a reasonable value based on your completion use case
- - truncate generation at stop words such as `return`, `def`, "```", or "`\n\n\n`" to avoid generating incomplete code when `max_tokens` is larger than the length of the expected generated code
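For illustration, the truncation and whitespace steps could look like the minimal sketch below; the `postprocess` helper and the exact stop-word strings are hypothetical, chosen only to mirror the list above, and are not part of ReplitLM:

```python
# Hypothetical post-processing helper mirroring the recommendations above.
STOP_WORDS = ["\nreturn", "\ndef", "```", "\n\n\n"]

def postprocess(completion: str) -> str:
    """Truncate at the earliest stop word, then remove trailing whitespace."""
    cut = len(completion)
    for stop in STOP_WORDS:
        idx = completion.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut].rstrip()

print(postprocess('    return a + b\n\n\nprint("dead code")'))  # -> '    return a + b'
```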
-
- ## Model Hash
- 5bc28ce32c6f9aec935ead7b60ea1c46

  license: cc-by-sa-4.0
  datasets:
  - bigcode/the-stack-dedup
+ - sadiqj/opam-source
  tags:
  - code
  language:
  - code
+ programming_language:
+ - OCaml
  ---


+ # camlcoder

  ## Model Description
+ `camlcoder` is a 2.7B-parameter Causal Language Model focused on **Code Completion** for OCaml. It is a fine-tuned version of [replit-code-v1-3b](https://huggingface.co/replit/replit-code-v1-3b). The model has been trained on a subset of the [Stack Dedup v1.2 dataset](https://arxiv.org/abs/2211.15533) and the most recent versions of [all packages in Opam that compile on OCaml 5.0](https://huggingface.co/datasets/sadiqj/opam-source).

  ## License
+ The model checkpoint and vocabulary file are licensed under the Creative Commons license (CC BY-SA-4.0).

  ## Contact
+ For questions and comments about the model, please post in the community section.

  ## How to Use
  First of all, you need to install the latest versions of the following dependencies:
  ```
  einops
  sentencepiece
+ safetensors
  torch
  transformers
  ```

+ You can then use the model as follows:
  ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM, StoppingCriteria, StoppingCriteriaList
+ import torch
+
+ max_length = 256
+
+ tokenizer = AutoTokenizer.from_pretrained('sadiqj/camlcoder', trust_remote_code=True, max_length=max_length, use_safetensors=True)
+ model = AutoModelForCausalLM.from_pretrained('sadiqj/camlcoder', trust_remote_code=True, use_safetensors=True).to(device='cuda:0', dtype=torch.bfloat16)
+
+ # encode an OCaml prompt and move it to the same device as the model
+ input_ids = tokenizer.encode('(* Return the middle element of the list *)\nlet get_middle l =', return_tensors='pt').to(device='cuda:0')
+
+ # stop generating once the '\n\n' token appears, i.e. at the first blank line,
+ # which typically separates top-level OCaml definitions
+ newline_id = tokenizer.encode('\n\n', return_tensors='pt')[0][0].item()
+ class StopOnNewlines(StoppingCriteria):
+     def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
+         return newline_id in input_ids
+
+ output = model.generate(input_ids, max_length=max_length, stopping_criteria=StoppingCriteriaList([StopOnNewlines()]), use_cache=True)
+
+ print(tokenizer.decode(output[0], skip_special_tokens=True))
+ ```
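Since `generate` returns the prompt followed by the completion, you may want to decode only the newly generated tokens. A minimal sketch, continuing from the snippet above:

```python
# Illustrative only: drop the echoed prompt and trim trailing blank lines,
# leaving just the generated OCaml.
completion_ids = output[0][input_ids.shape[1]:]
print(tokenizer.decode(completion_ids, skip_special_tokens=True).rstrip())
```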
config.json CHANGED
@@ -1,5 +1,5 @@
  {
-   "_name_or_path": "replit-code-v1-3b",
    "alibi": true,
    "alibi_bias_max": 8,
    "architectures": [

  {
+   "_name_or_path": "camlcoder",
    "alibi": true,
    "alibi_bias_max": 8,
    "architectures": [