julien-c HF staff committed on
Commit
6825fc9
1 Parent(s): 6d42fdc

Migrate model card from transformers-repo


Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/huggingface/CodeBERTa-language-id/README.md

Files changed (1)
  1. README.md +300 -0
README.md ADDED
@@ -0,0 +1,300 @@
+ ---
+ language: code
+ thumbnail: https://cdn-media.huggingface.co/CodeBERTa/CodeBERTa.png
+ datasets:
+ - code_search_net
+ ---
+
+ # CodeBERTa-language-id: The World’s fanciest programming language identification algo 🤯
+
+
+ To demonstrate the usefulness of our CodeBERTa pretrained model on downstream tasks beyond language modeling, we fine-tune the [`CodeBERTa-small-v1`](https://huggingface.co/huggingface/CodeBERTa-small-v1) checkpoint on the task of classifying a sample of code into the programming language it's written in (*programming language identification*).
+
+ We add a sequence classification head on top of the model.
+
+ On the evaluation dataset, we attain an eval accuracy and F1 > 0.999, which is not surprising given that language identification is a relatively easy task (see the intuition for why below).
+
+ ## Quick start: using the raw model
+
+ ```python
+ import torch
+ from transformers import RobertaForSequenceClassification, RobertaTokenizer
+
+ CODEBERTA_LANGUAGE_ID = "huggingface/CodeBERTa-language-id"
+
+ tokenizer = RobertaTokenizer.from_pretrained(CODEBERTA_LANGUAGE_ID)
+ model = RobertaForSequenceClassification.from_pretrained(CODEBERTA_LANGUAGE_ID)
+
+ # CODE_TO_IDENTIFY is any string of source code you want to classify.
+ input_ids = tokenizer.encode(CODE_TO_IDENTIFY, return_tensors="pt")  # tensor of shape (1, seq_len)
+ with torch.no_grad():
+     logits = model(input_ids)[0]
+
+ language_idx = logits.argmax().item()  # index for the resulting label
+ ```
+
+
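To turn the raw logits into a readable label and a confidence score, something along these lines could work. This is a minimal sketch in plain Python: it assumes the label indices follow the order of the `LANGUAGES` list in the fine-tuning script further down, and the logit values are made up for illustration. If the checkpoint's config carries label names, `model.config.id2label` would be the more direct lookup.

```python
import math

# Assumed label order: same as the LANGUAGES list in the fine-tuning script.
LANGUAGES = ["go", "java", "javascript", "php", "python", "ruby"]


def softmax(logits):
    """Convert raw logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_language(logits):
    """Return the (label, probability) pair with the highest probability."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return LANGUAGES[idx], probs[idx]


# Made-up logits for a single sample; with the real model, use logits[0].tolist().
example_logits = [-1.2, -0.7, 0.3, -2.0, 6.1, -1.5]
label, score = top_language(example_logits)
print(label, score)
```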
+ ## Quick start: using Pipelines 💪
+
+ ```python
+ from transformers import RobertaForSequenceClassification, RobertaTokenizer, TextClassificationPipeline
+
+ CODEBERTA_LANGUAGE_ID = "huggingface/CodeBERTa-language-id"
+
+ pipeline = TextClassificationPipeline(
+     model=RobertaForSequenceClassification.from_pretrained(CODEBERTA_LANGUAGE_ID),
+     tokenizer=RobertaTokenizer.from_pretrained(CODEBERTA_LANGUAGE_ID)
+ )
+
+ pipeline(CODE_TO_IDENTIFY)
+ ```
+
+ Let's start with something very easy:
+
+ ```python
+ pipeline("""
+ def f(x):
+     return x**2
+ """)
+ # [{'label': 'python', 'score': 0.9999965}]
+ ```
+
+ Now let's probe shorter code samples:
+
+ ```python
+ pipeline("const foo = 'bar'")
+ # [{'label': 'javascript', 'score': 0.9977546}]
+ ```
+
+ What if I remove the `const` token from the assignment?
+
+ ```python
+ pipeline("foo = 'bar'")
+ # [{'label': 'javascript', 'score': 0.7176245}]
+ ```
+
+ For some reason, the model still classifies this as JS, though with much lower confidence, even though it's also valid Python code. However, if we slightly tweak it:
+
+ ```python
+ pipeline("foo = u'bar'")
+ # [{'label': 'python', 'score': 0.7638422}]
+ ```
+
+ This is now detected as Python (notice the `u` string prefix).
+
+ Okay, enough with the JS and Python domination already! Let's try fancier languages:
+
+ ```python
+ pipeline("echo $FOO")
+ # [{'label': 'php', 'score': 0.9995257}]
+ ```
+
+ (Yes, I used the word "fancy" to describe PHP 😅)
+
+ ```python
+ pipeline("outcome := rand.Intn(6) + 1")
+ # [{'label': 'go', 'score': 0.9936151}]
+ ```
+
+ Why is the problem of language identification so easy (with the correct toolkit)? Because code's syntax is rigid, and simple tokens such as `:=` (the short variable declaration operator in Go) are very strong predictors of the underlying language:
+
+ ```python
+ pipeline(":=")
+ # [{'label': 'go', 'score': 0.9998052}]
+ ```
+
+ By the way, because we trained our own custom tokenizer on the [CodeSearchNet](https://github.blog/2019-09-26-introducing-the-codesearchnet-challenge/) dataset, and it handles streams of bytes in a very generic way, syntactic constructs such as `:=` are represented by a single token:
+
+ ```python
+ tokenizer.encode(" :=", add_special_tokens=False)
+ # [521]
+ ```
+
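As a rough intuition for how a byte-level BPE vocabulary ends up with a single token for ` :=`, here is a toy sketch of applying learned merge rules to a symbol sequence. The two merge rules are hypothetical and chosen only for illustration; they are not read from the real `merges.txt`. `Ġ` is the marker byte-level BPE conventionally uses for a leading space.

```python
def apply_bpe(symbols, merges):
    """Greedily apply BPE merge rules; earlier rules have higher priority."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(symbols)
    while len(symbols) > 1:
        # Rank every adjacent pair; pairs without a merge rule get infinity.
        candidates = [
            (ranks.get((left, right), float("inf")), i)
            for i, (left, right) in enumerate(zip(symbols, symbols[1:]))
        ]
        best_rank, i = min(candidates)
        if best_rank == float("inf"):
            break  # no applicable merge rule is left
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


# Hypothetical merge rules, for illustration only.
MERGES = [(":", "="), ("Ġ", ":=")]

print(apply_bpe(["Ġ", ":", "="], MERGES))  # ['Ġ:=']
```

Once such merges are learned from a code corpus, the whole construct maps to one vocabulary id, which is what the `[521]` output above reflects.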
+ <br>
+
+ ## Fine-tuning code
+
+ <details>
+
+ ```python
+ import gzip
+ import json
+ import logging
+ import os
+ from pathlib import Path
+ from typing import List, Tuple
+
+ import numpy as np
+ import torch
+ from sklearn.metrics import f1_score
+ from tokenizers.implementations.byte_level_bpe import ByteLevelBPETokenizer
+ from tokenizers.processors import BertProcessing
+ from torch.nn.utils.rnn import pad_sequence
+ from torch.utils.data import DataLoader, Dataset
+ from torch.utils.tensorboard.writer import SummaryWriter
+ from tqdm import tqdm, trange
+
+ from transformers import RobertaForSequenceClassification
+
+
+ logging.basicConfig(level=logging.INFO)
+
+
+ CODEBERTA_PRETRAINED = "huggingface/CodeBERTa-small-v1"
+
+ LANGUAGES = [
+     "go",
+     "java",
+     "javascript",
+     "php",
+     "python",
+     "ruby",
+ ]
+ FILES_PER_LANGUAGE = 1
+ EVALUATE = True
+
+ # Set up tokenizer
+ tokenizer = ByteLevelBPETokenizer("./pretrained/vocab.json", "./pretrained/merges.txt",)
+ tokenizer._tokenizer.post_processor = BertProcessing(
+     ("</s>", tokenizer.token_to_id("</s>")), ("<s>", tokenizer.token_to_id("<s>")),
+ )
+ tokenizer.enable_truncation(max_length=512)
+
+ # Set up Tensorboard
+ tb_writer = SummaryWriter()
+
+
+ class CodeSearchNetDataset(Dataset):
+     examples: List[Tuple[List[int], int]]
+
+     def __init__(self, split: str = "train"):
+         """
+         train | valid | test
+         """
+
+         self.examples = []
+
+         src_files = []
+         for language in LANGUAGES:
+             src_files += list(
+                 Path("../CodeSearchNet/resources/data/").glob(f"{language}/final/jsonl/{split}/*.jsonl.gz")
+             )[:FILES_PER_LANGUAGE]
+         for src_file in src_files:
+             # The language name is the directory three levels above the split directory.
+             label = src_file.parents[3].name
+             label_idx = LANGUAGES.index(label)
+             print("🔥", src_file, label)
+             lines = []
+             with gzip.open(src_file, mode="rt", encoding="utf-8") as fh:
+                 for line in fh:
+                     o = json.loads(line)
+                     lines.append(o["code"])
+             examples = [(x.ids, label_idx) for x in tokenizer.encode_batch(lines)]
+             self.examples += examples
+         print("🔥🔥")
+
+     def __len__(self):
+         return len(self.examples)
+
+     def __getitem__(self, i):
+         # We’ll pad at the batch level.
+         return self.examples[i]
+
+
+ model = RobertaForSequenceClassification.from_pretrained(CODEBERTA_PRETRAINED, num_labels=len(LANGUAGES))
+
+ train_dataset = CodeSearchNetDataset(split="train")
+ eval_dataset = CodeSearchNetDataset(split="test")
+
+
+ def collate(examples):
+     # Pad every sequence to the longest one in the batch.
+     input_ids = pad_sequence([torch.tensor(x[0]) for x in examples], batch_first=True, padding_value=1)
+     labels = torch.tensor([x[1] for x in examples])
+     # ^^ unnecessary .unsqueeze(-1)
+     return input_ids, labels
+
+
+ train_dataloader = DataLoader(train_dataset, batch_size=256, shuffle=True, collate_fn=collate)
+
+ # Sanity check: pull one batch to make sure collation works.
+ batch = next(iter(train_dataloader))
+
+
+ model.to("cuda")
+ model.train()
+ for param in model.roberta.parameters():
+     param.requires_grad = False
+ ## ^^ Only train final layer.
+
+ print("num params:", model.num_parameters())
+ print("num trainable params:", model.num_parameters(only_trainable=True))
+
+
+ def evaluate():
+     eval_loss = 0.0
+     nb_eval_steps = 0
+     preds = np.empty((0), dtype=np.int64)
+     out_label_ids = np.empty((0), dtype=np.int64)
+
+     model.eval()
+
+     eval_dataloader = DataLoader(eval_dataset, batch_size=512, collate_fn=collate)
+     for step, (input_ids, labels) in enumerate(tqdm(eval_dataloader, desc="Eval")):
+         with torch.no_grad():
+             outputs = model(input_ids=input_ids.to("cuda"), labels=labels.to("cuda"))
+             loss = outputs[0]
+             logits = outputs[1]
+             eval_loss += loss.mean().item()
+             nb_eval_steps += 1
+         preds = np.append(preds, logits.argmax(dim=1).detach().cpu().numpy(), axis=0)
+         out_label_ids = np.append(out_label_ids, labels.detach().cpu().numpy(), axis=0)
+     eval_loss = eval_loss / nb_eval_steps
+     # Plain accuracy: fraction of predictions that match the labels.
+     acc = float((preds == out_label_ids).mean())
+     f1 = f1_score(y_true=out_label_ids, y_pred=preds, average="macro")
+     print("=== Eval: loss ===", eval_loss)
+     print("=== Eval: acc. ===", acc)
+     print("=== Eval: f1 ===", f1)
+     tb_writer.add_scalars("eval", {"loss": eval_loss, "acc": acc, "f1": f1}, global_step)
+
+
+ ### Training loop
+
+ global_step = 0
+ train_iterator = trange(0, 4, desc="Epoch")
+ optimizer = torch.optim.AdamW(model.parameters())
+ for _ in train_iterator:
+     epoch_iterator = tqdm(train_dataloader, desc="Iteration")
+     for step, (input_ids, labels) in enumerate(epoch_iterator):
+         optimizer.zero_grad()
+         outputs = model(input_ids=input_ids.to("cuda"), labels=labels.to("cuda"))
+         loss = outputs[0]
+         loss.backward()
+         tb_writer.add_scalar("training_loss", loss.item(), global_step)
+         optimizer.step()
+         global_step += 1
+         if EVALUATE and global_step % 50 == 0:
+             evaluate()
+             model.train()
+
+
+ evaluate()
+
+ os.makedirs("./models/CodeBERT-language-id", exist_ok=True)
+ model.save_pretrained("./models/CodeBERT-language-id")
+ ```
+
+ </details>
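The "pad at the batch level" step in `collate` can be pictured with plain Python lists. This is a minimal sketch, not the script's actual code path: it assumes `1` is the `<pad>` id (matching `padding_value=1` above), the token ids are made up, and the attention mask is an addition that the fine-tuning script itself does not build or pass to the model.

```python
from typing import List, Tuple

PAD_ID = 1  # assumed <pad> id, same value as padding_value=1 in collate()


def pad_batch(examples: List[Tuple[List[int], int]]):
    """Pad each sequence to the longest one in the batch and build a 0/1 mask."""
    max_len = max(len(ids) for ids, _ in examples)
    input_ids, attention_mask, labels = [], [], []
    for ids, label in examples:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [PAD_ID] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        labels.append(label)
    return input_ids, attention_mask, labels


# Two made-up examples of (token ids, label index).
batch = [([0, 521, 2], 0), ([0, 77, 88, 99, 2], 4)]
input_ids, attention_mask, labels = pad_batch(batch)
print(input_ids)       # [[0, 521, 2, 1, 1], [0, 77, 88, 99, 2]]
print(attention_mask)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Padding per batch rather than to a fixed length of 512 keeps batches of short snippets small, which matters here because most code samples are far shorter than the truncation limit.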
+
+ <br>
+
+ ## CodeSearchNet citation
+
+ <details>
+
+ ```bibtex
+ @article{husain_codesearchnet_2019,
+     title = {{CodeSearchNet} {Challenge}: {Evaluating} the {State} of {Semantic} {Code} {Search}},
+     shorttitle = {{CodeSearchNet} {Challenge}},
+     url = {http://arxiv.org/abs/1909.09436},
+     urldate = {2020-03-12},
+     journal = {arXiv:1909.09436 [cs, stat]},
+     author = {Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
+     month = sep,
+     year = {2019},
+     note = {arXiv: 1909.09436},
+ }
+ ```
+
+ </details>