# roberta-php
---
language: php
datasets:
- CodeSearchNet
---

This is a [RoBERTa](https://arxiv.org/pdf/1907.11692.pdf) model pre-trained on the [CodeSearchNet dataset](https://github.com/github/CodeSearchNet) for the **php** masked language modeling task.
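As a quick illustration of the masked-language-modeling objective (a toy sketch, not the model's actual training or tokenization code), RoBERTa-style pre-training hides a fraction of the input tokens behind a `<mask>` token and trains the model to recover them:

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="<mask>", seed=0):
    """Replace a random subset of tokens with the mask token.

    Returns the corrupted sequence and the positions that were masked,
    which serve as the prediction targets during pre-training.
    """
    rng = random.Random(seed)
    n_mask = max(1, int(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = list(tokens)
    for pos in positions:
        corrupted[pos] = mask_token
    return corrupted, positions

# Toy whitespace "tokenization" of a PHP fragment, for illustration only:
tokens = "$people = array ( 'name' => 'Kalle' )".split()
corrupted, positions = mask_tokens(tokens)
```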
To load the model (required packages: `pip install transformers sentencepiece`):

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

tokenizer = AutoTokenizer.from_pretrained("dbernsohn/roberta-php")
model = AutoModelForMaskedLM.from_pretrained("dbernsohn/roberta-php")

fill_mask = pipeline(
    "fill-mask",
    model=model,
    tokenizer=tokenizer
)
```

You can then use this model to fill masked tokens in PHP code.

```python
code = """
$people = array(
    array('name' => 'Kalle', 'salt' => 856412),
    array('name' => 'Pierre', 'salt' => 215863)
);

for($i = 0; $i < count($<mask>); ++$i) {
    $people[$i]['salt'] = mt_rand(000000, 999999);
}
""".lstrip()

pred = {x["token_str"].replace("Ġ", ""): x["score"] for x in fill_mask(code)}
sorted(pred.items(), key=lambda kv: kv[1], reverse=True)
# [('people', 0.785636842250824),
#  ('parts', 0.006270722020417452),
#  ('id', 0.0035842324141412973),
#  ('data', 0.0025512021966278553),
#  ('config', 0.002258970635011792)]
```

The full training process and hyperparameters are documented in my [GitHub repo](https://github.com/DorBernsohn/CodeLM/tree/main/CodeMLM).

> Created by [Dor Bernsohn](https://www.linkedin.com/in/dor-bernsohn-70b2b1146/)