---
language: Go
datasets:
- code_search_net
---

# roberta-go

This is a [RoBERTa](https://arxiv.org/pdf/1907.11692.pdf) model pre-trained on the [CodeSearchNet dataset](https://github.com/github/CodeSearchNet) for the **Golang** Masked Language Modeling task.

To load the model (required packages: `pip install transformers sentencepiece`):
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

tokenizer = AutoTokenizer.from_pretrained("dbernsohn/roberta-go")
model = AutoModelForMaskedLM.from_pretrained("dbernsohn/roberta-go")

fill_mask = pipeline(
    "fill-mask",
    model=model,
    tokenizer=tokenizer
)
```

You can then use this model to fill masked tokens in Go code.

```python
code = """
package main

import (
	"fmt"
	"runtime"
)

func main() {
	fmt.Print("Go runs on ")
	switch os := runtime.<mask>; os {
	case "darwin":
		fmt.Println("OS X.")
	case "linux":
		fmt.Println("Linux.")
	default:
		// freebsd, openbsd,
		// plan9, windows...
		fmt.Printf("%s.\n", os)
	}
}
""".lstrip()

pred = {x["token_str"].replace("Ġ", ""): x["score"] for x in fill_mask(code)}
sorted(pred.items(), key=lambda kv: kv[1], reverse=True)
# [('GOOS', 0.11810332536697388),
#  ('FileInfo', 0.04276798665523529),
#  ('Stdout', 0.03572738170623779),
#  ('Getenv', 0.025064032524824142),
#  ('FileMode', 0.01462600938975811)]
```
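
The `"Ġ"` stripped in the dictionary comprehension above is RoBERTa's byte-level BPE marker for a leading space. A minimal, self-contained sketch of that post-processing step, using hypothetical pipeline rows with made-up scores rather than a live model:

```python
# Hypothetical rows in the shape the fill-mask pipeline returns
# (real scores come from the model; these values are made up).
raw = [
    {"token_str": "ĠFileInfo", "score": 0.043},
    {"token_str": "ĠGOOS", "score": 0.118},
    {"token_str": "ĠStdout", "score": 0.036},
]

# Drop the byte-level BPE space marker and rank candidates by score.
pred = {x["token_str"].replace("Ġ", ""): x["score"] for x in raw}
ranked = sorted(pred.items(), key=lambda kv: kv[1], reverse=True)
print(ranked[0][0])  # GOOS
```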

The full training process and hyperparameters are in my [GitHub repo](https://github.com/DorBernsohn/CodeLM/tree/main/CodeMLM).

> Created by [Dor Bernsohn](https://www.linkedin.com/in/dor-bernsohn-70b2b1146/)