julien-c (HF staff) committed
Commit 703329b • 1 Parent(s): 0ee93d7

Migrate model card from transformers-repo

Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/mrm8488/CodeBERTaPy/README.md

Files changed (1):
  1. README.md +123 -0
README.md ADDED

---
language: code
thumbnail:
---

# CodeBERTaPy

CodeBERTaPy is a RoBERTa-like model trained on the `python` portion of the [CodeSearchNet](https://github.blog/2019-09-26-introducing-the-codesearchnet-challenge/) dataset from GitHub by [Manuel Romero](https://twitter.com/mrm8488).

The **tokenizer** is a Byte-level BPE tokenizer trained on the corpus using Hugging Face `tokenizers`.

Because it is trained on a corpus of code (vs. natural language), it encodes the corpus efficiently (sequences are 33% to 50% shorter than the same corpus tokenized by `gpt2`/`roberta`).
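
For intuition, here is a minimal sketch, not part of the original card, that compares sequence lengths against the stock `roberta-base` tokenizer; the snippet and exact counts are illustrative and will vary by input:

```python
# Illustrative comparison: a code-trained BPE tokenizer vs. a
# natural-language one. Exact counts depend on the snippet.
from transformers import AutoTokenizer

code_tok = AutoTokenizer.from_pretrained("mrm8488/CodeBERTaPy")
nl_tok = AutoTokenizer.from_pretrained("roberta-base")

snippet = "def greet(name):\n    return 'Hello {}!'.format(name)\n"

print(len(code_tok.tokenize(snippet)))  # fewer tokens on code
print(len(nl_tok.tokenize(snippet)))    # than roberta-base produces
```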

The (small) **model** is a 6-layer, 84M-parameter, RoBERTa-like Transformer model – the same number of layers and heads as DistilBERT – initialized with the default settings and trained from scratch on the full `python` corpus for 4 epochs.
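
You can sanity-check those numbers yourself; a short sketch, assuming the checkpoint loads with the standard `transformers` auto classes:

```python
# Count parameters and inspect the depth of the published checkpoint.
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("mrm8488/CodeBERTaPy")

n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters")  # should be in the ~84M range
print(model.config.num_hidden_layers)       # should print 6
```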

## Quick start: masked language modeling prediction

```python
PYTHON_CODE = """
fruits = ['apples', 'bananas', 'oranges']
for idx, <mask> in enumerate(fruits):
    print("index is %d and value is %s" % (idx, val))
""".lstrip()
```

### Does the model know how to complete simple Python code?

```python
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="mrm8488/CodeBERTaPy",
    tokenizer="mrm8488/CodeBERTaPy"
)

fill_mask(PYTHON_CODE)

## Top 5 predictions:

'val'  # prob 0.980728805065155
'value'
'idx'
',val'
'_'
```
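
The pipeline wraps a few steps you can also run by hand. Here is a minimal sketch of the equivalent manual prediction, not from the original card, assuming `torch` is installed:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mrm8488/CodeBERTaPy")
model = AutoModelForMaskedLM.from_pretrained("mrm8488/CodeBERTaPy")

inputs = tokenizer(PYTHON_CODE, return_tensors="pt")
# Locate the <mask> token in the encoded sequence
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

with torch.no_grad():
    logits = model(**inputs).logits

# Decode the 5 highest-scoring candidates at the masked position
top5 = logits[0, mask_pos].topk(5).indices[0]
print([tokenizer.decode([int(t)]) for t in top5])
```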

### Yes! That was easy 🎉 Let's try with another, Flask-like example

```python
PYTHON_CODE2 = """
@app.route('/<name>')
def hello_name(name):
    return "Hello {}!".format(<mask>)

if __name__ == '__main__':
    app.run()
""".lstrip()

fill_mask(PYTHON_CODE2)

## Top 5 predictions:

'name'  # prob 0.9961813688278198
' name'
'url'
'description'
'self'
```

### Yeah! It works 🎉 Let's try with another TensorFlow/Keras-like example

```python
PYTHON_CODE3 = """
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.<mask>(128, activation='relu'),
    keras.layers.Dense(10, activation='softmax')
])
""".lstrip()

fill_mask(PYTHON_CODE3)

## Top 5 predictions:

'Dense'  # prob 0.4482928514480591
'relu'
'Flatten'
'Activation'
'Conv'
```
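
The confidence here is visibly lower than in the earlier examples (~0.45 for `'Dense'`). The card prints only the token strings, but `fill_mask` returns dicts that also carry the scores; a short sketch, with dict keys as in recent `transformers` releases:

```python
# Inspect the full pipeline output, not just the predicted strings.
for pred in fill_mask(PYTHON_CODE3):
    print(f"{pred['token_str']!r:15} score={pred['score']:.4f}")
```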

> Great! 🎉

## This work is heavily inspired by [CodeBERTa](https://github.com/huggingface/transformers/blob/master/model_cards/huggingface/CodeBERTa-small-v1/README.md) by the Hugging Face team

<br>

## CodeSearchNet citation

<details>

```bibtex
@article{husain_codesearchnet_2019,
    title = {{CodeSearchNet} {Challenge}: {Evaluating} the {State} of {Semantic} {Code} {Search}},
    shorttitle = {{CodeSearchNet} {Challenge}},
    url = {http://arxiv.org/abs/1909.09436},
    urldate = {2020-03-12},
    journal = {arXiv:1909.09436 [cs, stat]},
    author = {Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
    month = sep,
    year = {2019},
    note = {arXiv: 1909.09436},
}
```

</details>

> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)

> Made with <span style="color: #e25555;">&hearts;</span> in Spain