mstaron committed on
Commit
12c57c3
1 Parent(s): 8d16336

Update README.md

---
license: cc-by-4.0
---

This model is a RoBERTa model trained on programming-language code: the WolfSSL library, together with examples of cybersecurity vulnerabilities related to input validation, mixed with Linux kernel code. The model is pre-trained to understand the concept of a singleton in code.

The training corpus is C/C++ code, but inference can also be applied to other languages.

Using the model to unmask tokens can be done in the following way:

```python
from transformers import pipeline

unmasker = pipeline('fill-mask', model='mstaron/CyBERTa')
unmasker("Hello I'm a <mask> model.")
```
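The pipeline returns a ranked list of candidate fills; each entry is a dictionary with `score`, `token`, `token_str`, and `sequence` keys (the standard `transformers` fill-mask output shape). A minimal sketch of consuming that structure — the candidate predictions below are illustrative placeholders, not actual CyBERTa output:

```python
# Illustrative fill-mask results, shaped like the output of the
# transformers fill-mask pipeline (NOT real CyBERTa predictions)
predictions = [
    {"score": 0.41, "token": 2777, "token_str": " language",
     "sequence": "Hello I'm a language model."},
    {"score": 0.12, "token": 92, "token_str": " new",
     "sequence": "Hello I'm a new model."},
]

# keep the highest-scoring candidate fill
best = max(predictions, key=lambda p: p["score"])
print(best["token_str"].strip())  # language
```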

Obtaining embeddings for a downstream task can be done in the following way:

```python
# import the model via the huggingface library
from transformers import AutoTokenizer, AutoModelForMaskedLM

# load the tokenizer for the pre-trained CyBERTa
tokenizer = AutoTokenizer.from_pretrained('mstaron/CyBERTa')

# load the model
model = AutoModelForMaskedLM.from_pretrained("mstaron/CyBERTa")

# import the feature extraction pipeline
from transformers import pipeline

# create the pipeline, which will extract the embedding vectors
# the models are already pre-trained, so we do not need to train anything here
features = pipeline(
    "feature-extraction",
    model=model,
    tokenizer=tokenizer,
    return_tensors=False
)

# extract the features == embeddings
lstFeatures = features('Class HTTP::X1')

# print the first token's embedding (the <s> token, RoBERTa's
# equivalent of [CLS]), which is often used as a sentence embedding;
# an alternative is the mean over all tokens: np.mean(lstFeatures[0], axis=0)
lstFeatures[0][0]
```
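The first-token vector and the mean over all token vectors mentioned in the comments are two different pooling choices. The pooling step itself can be illustrated with plain NumPy; the nested-list shape below mirrors what the feature-extraction pipeline returns with `return_tensors=False` (toy values, hidden size shrunk to 3 for readability):

```python
import numpy as np

# Toy stand-in for pipeline output, indexed [batch][token][hidden_dim]
# (hypothetical values, NOT real CyBERTa embeddings)
lstFeatures = [[
    [1.0, 2.0, 3.0],   # first token (<s>, RoBERTa's [CLS] equivalent)
    [3.0, 0.0, 1.0],
    [2.0, 4.0, 5.0],
    [0.0, 2.0, 3.0],
]]

cls_embedding = np.array(lstFeatures[0][0])       # first-token pooling
mean_embedding = np.mean(lstFeatures[0], axis=0)  # mean pooling over tokens

# cls_embedding  == [1.0, 2.0, 3.0]
# mean_embedding == [1.5, 2.0, 3.0]
```

For short inputs the two vectors are often interchangeable in practice, but they are not identical, so a downstream pipeline should commit to one pooling strategy.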

To use the model for a specific downstream task, it still needs to be fine-tuned on that task.
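One lightweight way to do this — a hypothetical sketch, not part of the original model card — is to train a small classification head on top of the extracted embeddings. The toy data below (random vectors and binary labels) stands in for real CyBERTa sentence embeddings and a labelled vulnerability dataset:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins: 32 "sentence embeddings" of RoBERTa's hidden
# size (768) with binary labels; in practice these would come from the
# feature-extraction pipeline above and a labelled dataset
embeddings = torch.randn(32, 768)
labels = torch.randint(0, 2, (32,))

head = nn.Linear(768, 2)                 # simple downstream classifier
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

losses = []
for _ in range(50):                      # short training loop on toy data
    optimizer.zero_grad()
    loss = loss_fn(head(embeddings), labels)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# the cross-entropy loss on the toy data decreases over the loop
```

For full fine-tuning one would instead load the checkpoint with `AutoModelForSequenceClassification` and update all weights, but a frozen-embedding head like this is a cheap first baseline.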