pirroh committed
Commit bcf8dff • 1 Parent(s): 4abee59

Update README.md

Files changed (1):
  1. README.md +17 -4

README.md CHANGED
@@ -44,9 +44,11 @@ model-index:


# replit-code-v1-3b
+ Developed by: Replit, Inc.

[**🧑‍💻 Test it on our Demo Space! 🧑‍💻**](https://huggingface.co/spaces/replit/replit-code-v1-3b-demo)

+ ## Model Description
`replit-code-v1-3b` is a 2.7B Causal Language Model focused on **Code Completion**. The model has been trained on a subset of the [Stack Dedup v1.2 dataset](https://arxiv.org/abs/2211.15533).

The training mixture includes **20 different languages**, listed here in descending order of number of tokens:
@@ -55,10 +57,19 @@ The training mixture includes **20 different languages**, listed here in descending order of number of tokens:

In total, the training dataset contains 175B tokens, which were repeated over 3 epochs, so `replit-code-v1-3b` has been trained on **525B** tokens (~195 tokens per parameter).

+ ## Intended Use
+ Replit intends this model to be used by anyone as a foundational model for application-specific fine-tuning, without strict limitations on commercial use.

+ ## Limitations
+ The pre-training dataset may have contained offensive or inappropriate content even after applying data cleansing filters, and such content may be reflected in model-generated text. We recommend that users exercise reasonable caution when using it in production systems. Do not use it for any applications that may cause harm or distress to individuals or groups.

+ ## License
+ The base model checkpoint is licensed under the Creative Commons license (CC BY-SA 4.0). Under the license, you must give credit to Replit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests that Replit endorses you or your use.

+ ## Contact
+ For questions and comments about the model, please post in the community section.

- ## How to use the model
+ ## How to Use
```python
from transformers import AutoModelForCausalLM
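# Illustrative aside, not part of the diff or the README: a quick sanity check
# of the token accounting quoted above (175B tokens x 3 epochs, 2.7B parameters).
total_training_tokens = 175e9 * 3              # 525B tokens seen during training
tokens_per_parameter = total_training_tokens / 2.7e9
print(round(tokens_per_parameter))             # ~194, consistent with the quoted "~195 tokens per parameter"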
 
@@ -85,7 +96,7 @@ y = model(x)
Note that `trust_remote_code=True` is passed to the `from_pretrained` method because ReplitLM is not a class in the
[Transformers](https://huggingface.co/docs/transformers/index) library.

- ## Tokenizer
+ ### Tokenizer

We have trained a custom SentencePiece Unigram tokenizer with a 32768-token vocabulary optimized specifically for code.
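The loading code under the new "How to Use" heading is unchanged context and therefore mostly outside this diff. As a stand-in, here is a minimal loading sketch consistent with the `trust_remote_code=True` note above; the Hub repo id `replit/replit-code-v1-3b` is an assumption, and the snippet is illustrative rather than copied from the README:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

REPO_ID = "replit/replit-code-v1-3b"  # assumed Hub id for this model card

# trust_remote_code=True is needed because the ReplitLM model class (and, we assume,
# the custom SentencePiece tokenizer) ship with the checkpoint repo, not with transformers.
model = AutoModelForCausalLM.from_pretrained(REPO_ID, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(REPO_ID, trust_remote_code=True)
```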
 
@@ -113,7 +124,7 @@ Note that:
- `clean_up_tokenization_spaces=False` is meant to avoid removing spaces in the output, because that would affect the syntactical correctness of the generated code.


- ## Generation
+ ### Generation

You can generate code using the `transformers` library as follows:
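The generation example that this sentence introduces is unchanged by the commit and not shown in the hunk. As a stand-in, a sketch of a typical `generate` call; the decoding parameters are illustrative, the decode step applies the `clean_up_tokenization_spaces=False` recommendation from the list item at the top of this hunk, and it assumes `model` and `tokenizer` from the loading sketch above:

```python
# Encode a prompt, sample a completion, and decode it without cleanup so that
# spacing (and hence syntactical correctness) is preserved.
x = tokenizer.encode("def fibonacci(n): ", return_tensors="pt")

y = model.generate(
    x,
    max_length=100,                  # budget this for your completion use case
    do_sample=True,                  # sampling; greedy or beam search also work
    top_p=0.95,
    temperature=0.2,
    eos_token_id=tokenizer.eos_token_id,
)

generated_code = tokenizer.decode(y[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(generated_code)
```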
 
@@ -131,7 +142,7 @@ print(generated_code)

Experiment with different decoding methods and parameters to get the best results for your use case.

- ## Post Processing
+ ### Post Processing

Note that as with all code generation models, post-processing of the generated code is important. In particular, the following post-processing steps are recommended:
- stop generation when the EOS token is encountered
@@ -139,5 +150,7 @@ Note that as with all code generation models, post-processing of the generated code is important.
- set `max_tokens` to a reasonable value based on your completion use case
- truncate generation to stop words such as `return`, `def`, "```", "`\n\n\n`" to avoid generating incomplete code when `max_tokens` is larger than the length of the expected generated code.

+
+
## Model Hash
5bc28ce32c6f9aec935ead7b60ea1c46
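To make the post-processing recommendations above concrete, a small stop-word truncation sketch. The helper name and the demo completion string are made up for illustration, and the stop-word list should be tuned to the completion use case, as the hunk notes:

```python
STOP_WORDS = ("\n\n\n", "def", "```")  # "return" can be added when it suits the prompt

def truncate_at_stop_words(completion: str, stop_words=STOP_WORDS) -> str:
    """Cut the completion at the first stop word so that a generous max_tokens
    budget does not leave trailing, unrelated code."""
    cut = len(completion)
    for stop in stop_words:
        idx = completion.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut]

completion = "a + b\n\n\n\ndef something_unrelated():\n    pass\n"
print(truncate_at_stop_words(completion))  # -> "a + b"
```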
 