julien-c (HF staff) committed c0a8824 (1 parent: 31046d3)

Migrate model card from transformers-repo

Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/mrm8488/codeBERTaJS/README.md
---
language: code
thumbnail:
---

# CodeBERTaJS

CodeBERTaJS is a RoBERTa-like model trained on the [CodeSearchNet](https://github.blog/2019-09-26-introducing-the-codesearchnet-challenge/) dataset from GitHub for `JavaScript` by [Manuel Romero](https://twitter.com/mrm8488).

The **tokenizer** is a Byte-level BPE tokenizer trained on the corpus using Hugging Face `tokenizers`.

Because it is trained on a corpus of code (vs. natural language), it encodes the corpus efficiently: the sequences are between 33% and 50% shorter than the same corpus tokenized by gpt2/roberta.
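To make the figure above concrete, here is a minimal sketch of how such a reduction can be measured. The token counts below are hypothetical placeholders, not numbers from this card; in practice you would tokenize the same file with both tokenizers (e.g. via `transformers.AutoTokenizer`) and compare the lengths of the resulting `input_ids`.

```python
def length_reduction(n_code_tokens: int, n_gpt2_tokens: int) -> float:
    """Fraction by which the code-specific tokenization is shorter."""
    return 1.0 - n_code_tokens / n_gpt2_tokens

# Hypothetical counts for one JavaScript file: 70 tokens with the code
# tokenizer vs. 120 with gpt2's tokenizer
print(f"{length_reduction(70, 120):.0%} shorter")  # -> 42% shorter
```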

The (small) **model** is a 6-layer, 84M-parameter, RoBERTa-like Transformer model – the same number of layers and heads as DistilBERT – initialized with the default settings and trained from scratch on the full `javascript` corpus (120M after preprocessing) for 2 epochs.
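A back-of-the-envelope check of the 84M figure. Only the 6-layer depth comes from the card; the hidden size, feed-forward size, vocabulary size, and position count below are assumed RoBERTa-base defaults, and biases, LayerNorm, and the LM head are omitted.

```python
V, H, LAYERS, FF, P = 52_000, 768, 6, 3_072, 514  # vocab, hidden, layers, FFN, positions

embeddings = V * H + P * H               # token + position embedding matrices
per_layer = 4 * H * H + 2 * H * FF       # attention projections + feed-forward weights
total = embeddings + LAYERS * per_layer

print(f"~{total / 1e6:.0f}M parameters")  # -> ~83M parameters
```

The rough count lands close to the stated 84M, which is consistent with a 6-layer model at RoBERTa-base width.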

## Quick start: masked language modeling prediction

```python
JS_CODE = """
async function createUser(req, <mask>) {
  if (!validUser(req.body.user)) {
    return res.status(400);
  }
  user = userService.createUser(req.body.user);
  return res.json(user);
}
""".lstrip()
```

### Does the model know how to complete simple JS/Express-like code?

```python
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="mrm8488/codeBERTaJS",
    tokenizer="mrm8488/codeBERTaJS"
)

fill_mask(JS_CODE)

## Top 5 predictions:
#
'res' # prob 0.069489665329
'next'
'req'
'user'
',req'
```
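For reference, the fill-mask pipeline returns a list of candidate dicts (with fields such as `score` and `token_str`), so the predictions above can be post-processed programmatically. The scores below are illustrative stand-ins (only the `'res'` probability is taken from the output above), not a recorded run:

```python
# Illustrative stand-in for the pipeline's return value (a list of dicts)
predictions = [
    {"score": 0.0695, "token_str": "res"},
    {"score": 0.0510, "token_str": "next"},
    {"score": 0.0330, "token_str": "req"},
    {"score": 0.0210, "token_str": "user"},
    {"score": 0.0150, "token_str": ",req"},
]

# Keep only the decoded tokens, best first
top_tokens = [p["token_str"] for p in sorted(predictions, key=lambda p: p["score"], reverse=True)]
print(top_tokens[0])  # -> res
```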

### Yes! That was easy 🎉 Let's try another example

```python
JS_CODE_ = """
function getKeys(obj) {
  keys = [];
  for (var [key, value] of Object.entries(obj)) {
    keys.push(<mask>);
  }
  return keys
}
""".lstrip()
```

Results of `fill_mask(JS_CODE_)`:

```python
'obj', 'key', ' value', 'keys', 'i'
```

> Not so bad! The right token was predicted as the second option! 🎉
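That ranking can be checked mechanically. A tiny sketch – the expected completion `key` is an assumption read off the example above, and the predicted list is copied from the results:

```python
predicted = ["obj", "key", " value", "keys", "i"]

# 1-based rank of the expected token among the predictions
rank = next(i + 1 for i, tok in enumerate(predicted) if tok.strip() == "key")
print(rank)  # -> 2
```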

## This work is heavily inspired by [CodeBERTa](https://github.com/huggingface/transformers/blob/master/model_cards/huggingface/CodeBERTa-small-v1/README.md) by the Hugging Face team

<br>

## CodeSearchNet citation

<details>

```bibtex
@article{husain_codesearchnet_2019,
	title = {{CodeSearchNet} {Challenge}: {Evaluating} the {State} of {Semantic} {Code} {Search}},
	shorttitle = {{CodeSearchNet} {Challenge}},
	url = {http://arxiv.org/abs/1909.09436},
	urldate = {2020-03-12},
	journal = {arXiv:1909.09436 [cs, stat]},
	author = {Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
	month = sep,
	year = {2019},
	note = {arXiv: 1909.09436},
}
```

</details>

> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)

> Made with <span style="color: #e25555;">&hearts;</span> in Spain