mstaron committed
Commit da26daa
1 Parent(s): 34cd09a

First version of the SingBERTa model for singleton analysis.

Files changed (1): README.md +44 -1
README.md CHANGED
@@ -4,4 +4,47 @@ license: cc-by-4.0
 
 This model is a RoBERTa model trained on programming language code: the WolfSSL code base, together with examples of Singletons interspersed with Linux kernel code. The model is pre-trained to understand the concept of a singleton in code.
 
- The programming language is C/C++, but inference can also be applied to code written in other languages.
+ The programming language is C/C++, but inference can also be applied to code written in other languages.
+ 
+ The model can be used for fill-mask prediction (unmasking) in the following way:
+ 
+ ```python
+ from transformers import pipeline
+ 
+ # load the fill-mask pipeline with the pre-trained SingBERTa model
+ unmasker = pipeline('fill-mask', model='mstaron/SingBERTa')
+ 
+ # predict the most likely replacements for the <mask> token
+ unmasker("Hello I'm a <mask> model.")
+ ```
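+ 
+ Since the model is pre-trained on C/C++ code, masking a token inside a code fragment may be more representative of its intended use; a minimal sketch (the snippet below is a hypothetical example, not taken from the training data):
+ 
+ ```python
+ # hypothetical code-oriented prompt: let the model fill in the masked token
+ unmasker("static Singleton* <mask>;")
+ ```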
+ 
+ Embeddings for downstream tasks can be obtained in the following way:
+ 
+ ```python
+ # import the tokenizer and model classes from the Hugging Face transformers library
+ from transformers import AutoTokenizer, AutoModelForMaskedLM
+ 
+ # load the tokenizer for the pre-trained SingBERTa
+ tokenizer = AutoTokenizer.from_pretrained('mstaron/SingBERTa')
+ 
+ # load the model
+ model = AutoModelForMaskedLM.from_pretrained('mstaron/SingBERTa')
+ 
+ # import the feature extraction pipeline
+ from transformers import pipeline
+ 
+ # create the pipeline, which will extract the embedding vectors
+ # the model is already pre-trained, so we do not need to train anything here
+ features = pipeline(
+     "feature-extraction",
+     model='mstaron/SingBERTa',
+     tokenizer='mstaron/SingBERTa',
+     return_tensors=False
+ )
+ 
+ # extract the features == embeddings
+ lstFeatures = features('Class SingletonX1')
+ 
+ # print the first token's embedding (<s>, RoBERTa's counterpart of [CLS]),
+ # which is often used as an approximation of the whole-sentence embedding
+ lstFeatures[0][0]
+ ```
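+ 
+ As an alternative to the first token's vector, a sentence-level embedding can be obtained by mean pooling over all token embeddings; a minimal sketch using numpy, reusing `lstFeatures` from the block above:
+ 
+ ```python
+ import numpy as np
+ 
+ # average all token vectors into a single sentence embedding
+ sentence_embedding = np.mean(lstFeatures[0], axis=0)
+ ```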
+ 
+ To use the model for a downstream task, it first needs to be fine-tuned on data for that task.
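+ 
+ A minimal fine-tuning sketch for a hypothetical downstream task (binary classification of whether a code fragment implements a singleton); the dataset variables are illustrative placeholders, not part of this repository:
+ 
+ ```python
+ from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments
+ 
+ # hypothetical task: classify code fragments as singleton / non-singleton
+ model = AutoModelForSequenceClassification.from_pretrained('mstaron/SingBERTa', num_labels=2)
+ 
+ training_args = TrainingArguments(output_dir='./singberta-finetuned', num_train_epochs=3)
+ 
+ # train_dataset and eval_dataset are placeholders for a labeled, tokenized dataset
+ trainer = Trainer(
+     model=model,
+     args=training_args,
+     train_dataset=train_dataset,
+     eval_dataset=eval_dataset,
+ )
+ trainer.train()
+ ```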