---
language: code
thumbnail: https://cdn-media.huggingface.co/CodeBERTa/CodeBERTa.png
datasets:
- code_search_net
---
This is an *unofficial* reupload of [huggingface/CodeBERTa-small-v1](https://huggingface.co/huggingface/CodeBERTa-small-v1) in the `SafeTensors` format using `transformers` `4.41.1`. The goal of this reupload is to keep older models that remain relevant baselines from becoming stale as the Hugging Face ecosystem changes. I may also include minor corrections, such as the configured model max length.
Original model card below:
---
# CodeBERTa
CodeBERTa is a RoBERTa-like model trained on the [CodeSearchNet](https://github.blog/2019-09-26-introducing-the-codesearchnet-challenge/) dataset from GitHub.
Supported languages:
```shell
"go"
"java"
"javascript"
"php"
"python"
"ruby"
```
The **tokenizer** is a Byte-level BPE tokenizer trained on the corpus using Hugging Face `tokenizers`.
Because it is trained on a corpus of code (vs. natural language), it encodes the corpus efficiently: the sequences are between 33% and 50% shorter than the same corpus tokenized by gpt2/roberta.
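One way to check the length reduction on your own snippets is to tokenize the same text with both tokenizers and compare token counts. This is a minimal sketch, not the measurement behind the figures above: `roberta-base` is assumed as the natural-language baseline, and the helper names `count_tokens`, `relative_shortening`, and `compare` are hypothetical.

```python
def count_tokens(tokenizer_name: str, text: str) -> int:
    """Return how many tokens `tokenizer_name` produces for `text`.

    transformers is imported lazily because loading a tokenizer
    downloads files from the Hub.
    """
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    return len(tokenizer.tokenize(text))


def relative_shortening(baseline_len: int, code_len: int) -> float:
    """Fraction by which the code tokenizer shortens the baseline sequence."""
    if baseline_len <= 0:
        raise ValueError("baseline_len must be positive")
    return (baseline_len - code_len) / baseline_len


def compare(
    text: str,
    code_tokenizer: str = "huggingface/CodeBERTa-small-v1",
    baseline_tokenizer: str = "roberta-base",  # assumption: stand-in baseline
) -> float:
    """Tokenize `text` with both tokenizers and return the relative shortening."""
    baseline_len = count_tokens(baseline_tokenizer, text)
    code_len = count_tokens(code_tokenizer, text)
    return relative_shortening(baseline_len, code_len)
```

Calling `compare(snippet)` on a code string returns a fraction such as `0.4` for a 40% shorter sequence. A single snippet can fall outside the 33% to 50% range, since that range describes the corpus as a whole.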
The (small) **model** is a 6-layer, 84M-parameter, RoBERTa-like Transformer model (the same number of layers and heads as DistilBERT), initialized with the default initialization settings and trained from scratch on the full corpus (~2M functions) for 5 epochs.
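As a rough arithmetic check on the parameter count, the sketch below adds up the weights of a RoBERTa-style masked language model. Only the layer count comes from this card. The vocabulary size, hidden size, feed-forward size, and position count are assumptions chosen to match common RoBERTa settings, and the function name `roberta_mlm_param_count` is hypothetical.

```python
def roberta_mlm_param_count(
    vocab_size: int = 52_000,        # assumption: not stated in this card
    hidden_size: int = 768,          # assumption: DistilBERT/RoBERTa-base width
    intermediate_size: int = 3_072,  # assumption: 4 * hidden_size
    num_layers: int = 6,             # from the card
    max_positions: int = 514,        # assumption: RoBERTa default
) -> int:
    """Estimate trainable parameters, with the output projection tied to the
    word embeddings. The number of attention heads does not change the count."""
    h, i = hidden_size, intermediate_size
    layer_norm = 2 * h  # scale + bias

    # Word, position, and (single) token-type embeddings, plus LayerNorm.
    embeddings = vocab_size * h + max_positions * h + h + layer_norm

    # Q, K, V, and output projections (weights + biases), plus LayerNorm.
    attention = 4 * (h * h + h) + layer_norm

    # Two feed-forward projections (weights + biases), plus LayerNorm.
    feed_forward = (h * i + i) + (i * h + h) + layer_norm

    encoder = num_layers * (attention + feed_forward)

    # LM head: dense layer, LayerNorm, and an output bias per vocabulary entry.
    lm_head = (h * h + h) + layer_norm + vocab_size

    return embeddings + encoder + lm_head


print(f"{roberta_mlm_param_count() / 1e6:.1f}M parameters")
```

Under these assumptions the total lands close to the 84M quoted above, with roughly half of it in the word embeddings.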
### Tensorboard for this training ⤵️
[![tb](https://cdn-media.huggingface.co/CodeBERTa/tensorboard.png)](https://tensorboard.dev/experiment/irRI7jXGQlqmlxXS0I07ew/#scalars)
## Quick start: masked language modeling prediction
```python
# The backslash is doubled so Python does not treat "\I" as an (invalid) escape.
PHP_CODE = """
public static <mask> set(string $key, $value) {
    if (!in_array($key, self::$allowedKeys)) {
        throw new \\InvalidArgumentException('Invalid key given');
    }
    self::$storedValues[$key] = $value;
}
""".lstrip()
```
### Does the model know how to complete simple PHP code?
```python
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="huggingface/CodeBERTa-small-v1",
    tokenizer="huggingface/CodeBERTa-small-v1"
)

fill_mask(PHP_CODE)

## Top 5 predictions:
#
' function' # prob 0.9999827146530151
'function' #
' void' #
' def' #
' final' #
```
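To turn the raw pipeline output into a short ranked list like the one above, the result dicts can be reduced to `(token, score)` pairs. This is a minimal sketch: the helper `top_predictions` is hypothetical, and it assumes a recent `transformers` version in which each result dict carries a `token_str` key alongside `score`. The sample data is hand-written for illustration and is not real model output.

```python
from typing import Dict, List, Tuple


def top_predictions(results: List[Dict], k: int = 5) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (token_str, score) pairs."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(r["token_str"], r["score"]) for r in ranked[:k]]


# Hypothetical stand-in for fill_mask(PHP_CODE); scores are made up.
example = [
    {"token_str": " void", "score": 0.1},
    {"token_str": " function", "score": 0.8},
]

for token, score in top_predictions(example, k=2):
    print(f"{token!r}: {score:.4f}")
```

With a loaded pipeline, `top_predictions(fill_mask(PHP_CODE))` would be the equivalent call for a single-mask input.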
### Yes! That was easy 🎉 What about some Python (warning: this is going to be meta)
```python
PYTHON_CODE = """
def pipeline(
    task: str,
    model: Optional = None,
    framework: Optional[<mask>] = None,
    **kwargs
) -> Pipeline:
    pass
""".lstrip()
```
Results:
```python
'framework', 'Framework', ' framework', 'None', 'str'
```
> This program can auto-complete itself! 😱
### Just for fun, let's try to mask natural language (not code):
```python
fill_mask("My name is <mask>.")
# {'sequence': '<s> My name is undefined.</s>', 'score': 0.2548016905784607, 'token': 3353}
# {'sequence': '<s> My name is required.</s>', 'score': 0.07290805131196976, 'token': 2371}
# {'sequence': '<s> My name is null.</s>', 'score': 0.06323737651109695, 'token': 469}
# {'sequence': '<s> My name is name.</s>', 'score': 0.021919190883636475, 'token': 652}
# {'sequence': '<s> My name is disabled.</s>', 'score': 0.019681859761476517, 'token': 7434}
```
This (kind of) works because code contains comments (which contain natural language).
Of course, the most frequent name for a computer scientist must be undefined 🤓.
## Downstream task: [programming language identification](https://huggingface.co/huggingface/CodeBERTa-language-id)
See the model card for **[`huggingface/CodeBERTa-language-id`](https://huggingface.co/huggingface/CodeBERTa-language-id)** 🤯.
<br>
## CodeSearchNet citation
<details>
```bibtex
@article{husain_codesearchnet_2019,
title = {{CodeSearchNet} {Challenge}: {Evaluating} the {State} of {Semantic} {Code} {Search}},
shorttitle = {{CodeSearchNet} {Challenge}},
url = {http://arxiv.org/abs/1909.09436},
urldate = {2020-03-12},
journal = {arXiv:1909.09436 [cs, stat]},
author = {Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
month = sep,
year = {2019},
note = {arXiv: 1909.09436},
}
```
</details>