---
license: cc-by-4.0
---

This model is a RoBERTa model trained on C/C++ source code: the WolfSSL library, together with examples of cybersecurity vulnerabilities related to input validation, mixed with Linux kernel code. The model is pre-trained to understand the concept of a singleton in the code.

The training data is C/C++ code, but inference can also be applied to other programming languages.

Using the model to fill masked tokens can be done in the following way:

```python
from transformers import pipeline
unmasker = pipeline('fill-mask', model='mstaron/CyBERTa')
unmasker("Hello I'm a <mask> model.")
```
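
Since the model is pre-trained on C/C++ code, the mask token can also be placed inside a code snippet (an illustrative call; the actual predictions depend on what the model learned during pre-training):

```python
# predict a masked identifier in a C-style statement
unmasker("int <mask> = 0;")
```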

Obtaining the embeddings for a downstream task can be done in the following way:

```python
# import the model via the huggingface library
from transformers import AutoTokenizer, AutoModel

# load the tokenizer for the pre-trained CyBERTa
tokenizer = AutoTokenizer.from_pretrained('mstaron/CyBERTa')

# load the base model (without the masked-LM head), so that the
# pipeline returns hidden states, i.e. the embeddings
model = AutoModel.from_pretrained("mstaron/CyBERTa")

# import the feature extraction pipeline
from transformers import pipeline

# create the pipeline, which will extract the embedding vectors
# the models are already pre-defined, so we do not need to train anything here
features = pipeline(
    "feature-extraction",
    model=model,
    tokenizer=tokenizer, 
    return_tensors=False
)

# extract the features == embeddings
lstFeatures = features('Class HTTP::X1')

# print the embedding of the first token (<s>, RoBERTa's equivalent of [CLS]),
# which is often used as an approximation of the whole-sequence embedding;
# mean pooling over all tokens (see below) is a common alternative
lstFeatures[0][0]
```
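
An alternative to taking the first token is mean pooling over all token embeddings, as referenced in the comment above (a minimal sketch using numpy):

```python
import numpy as np

# average all token embeddings of the sequence into one vector
sentence_embedding = np.mean(lstFeatures[0], axis=0)
```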

In order to use the model for a downstream task, it needs to be fine-tuned on that task.
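
A minimal fine-tuning sketch for a classification task (e.g., labeling code lines as vulnerable or not) could look as follows; the toy dataset, labels, and training arguments are illustrative assumptions, not part of this model card:

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained('mstaron/CyBERTa')
model = AutoModelForSequenceClassification.from_pretrained(
    'mstaron/CyBERTa', num_labels=2)

# toy dataset: code lines labeled 1 (vulnerable) / 0 (safe) -- illustrative only
data = Dataset.from_dict({
    "text": ["strcpy(buf, input);",
             "strncpy(buf, input, sizeof(buf) - 1);"],
    "label": [1, 0],
})
data = data.map(lambda x: tokenizer(x["text"], truncation=True,
                                    padding="max_length", max_length=64))

training_args = TrainingArguments(
    output_dir="./cyberta-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=8,
)

trainer = Trainer(model=model, args=training_args, train_dataset=data)
trainer.train()
```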