claudios committed
Commit 161917d
1 Parent(s): e4059d5

Upload folder using huggingface_hub

README.md ADDED
@@ -0,0 +1,119 @@
+ ---
+ license: mit
+ arxiv: 2205.12424
+ datasets:
+ - code_x_glue_cc_defect_detection
+ metrics:
+ - accuracy
+ - precision
+ - recall
+ - f1
+ - roc_auc
+ model-index:
+ - name: VulBERTa MLP
+   results:
+   - task:
+       type: defect-detection
+     dataset:
+       name: codexglue-devign
+       type: codexglue-devign
+     metrics:
+     - name: Accuracy
+       type: Accuracy
+       value: 64.71
+     - name: Precision
+       type: Precision
+       value: 64.80
+     - name: Recall
+       type: Recall
+       value: 50.76
+     - name: F1
+       type: F1
+       value: 56.93
+     - name: ROC-AUC
+       type: ROC-AUC
+       value: 71.02
+ pipeline_tag: text-classification
+ tags:
+ - devign
+ - defect detection
+ - code
+ ---
+
+ # VulBERTa MLP Devign
+ ## VulBERTa: Simplified Source Code Pre-Training for Vulnerability Detection
+
+ ![VulBERTa architecture](https://raw.githubusercontent.com/ICL-ml4csec/VulBERTa/main/VB.png)
+
+ ## Overview
+ This model is the unofficial HuggingFace version of "[VulBERTa](https://github.com/ICL-ml4csec/VulBERTa/tree/main)" with an MLP classification head, trained on CodeXGLUE Devign (C code) by Hazim Hanif & Sergio Maffeis (Imperial College London). I simplified the tokenization process by adding the cleaning (comment removal) step to the tokenizer itself, and I added the simplified tokenizer to this model repo as an AutoClass.
+
+ > This paper presents VulBERTa, a deep learning approach to detect security vulnerabilities in source code. Our approach pre-trains a RoBERTa model with a custom tokenisation pipeline on real-world code from open-source C/C++ projects. The model learns a deep knowledge representation of the code syntax and semantics, which we leverage to train vulnerability detection classifiers. We evaluate our approach on binary and multi-class vulnerability detection tasks across several datasets (Vuldeepecker, Draper, REVEAL and muVuldeepecker) and benchmarks (CodeXGLUE and D2A). The evaluation results show that VulBERTa achieves state-of-the-art performance and outperforms existing approaches across different datasets, despite its conceptual simplicity, and limited cost in terms of size of training data and number of model parameters.
+
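+ Because the cleaned-up tokenizer ships with this repo as an AutoClass, it can also be loaded on its own. The snippet below is a minimal illustration (the C function is an arbitrary example, and `libclang` must already be installed as described under Usage below):
+
+ ```python
+ from transformers import AutoTokenizer
+
+ # Resolves to tokenization_vulberta.VulBERTaTokenizer via the auto_map in tokenizer_config.json
+ tokenizer = AutoTokenizer.from_pretrained("claudios/VulBERTa-MLP-Devign", trust_remote_code=True)
+
+ # The Clang-based pre-tokenizer first splits the source into lexer tokens, then the usual encoding applies
+ encoding = tokenizer("int add(int a, int b) { return a + b; }", truncation=True)
+ print(encoding["input_ids"])
+ ```
+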
+ ## Usage
+ **You must install libclang for tokenization.**
+
+ ```bash
+ pip install libclang
+ ```
+
+ Note that due to the custom tokenizer, you must pass `trust_remote_code=True` when instantiating the model.
+ Example:
+ ```python
+ from transformers import pipeline
+ pipe = pipeline("text-classification", model="claudios/VulBERTa-MLP-Devign", trust_remote_code=True, return_all_scores=True)
+ pipe("static void filter_mirror_setup(NetFilterState *nf, Error **errp)\n{\n MirrorState *s = FILTER_MIRROR(nf);\n Chardev *chr;\n chr = qemu_chr_find(s->outdev);\n if (chr == NULL) {\n error_set(errp, ERROR_CLASS_DEVICE_NOT_FOUND,\n \"Device '%s' not found\", s->outdev);\n qemu_chr_fe_init(&s->chr_out, chr, errp);")
+ >> [[{'label': 'LABEL_0', 'score': 0.014685827307403088},
+ {'label': 'LABEL_1', 'score': 0.985314130783081}]]
+ ```
+
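+ If you prefer not to use `pipeline`, the tokenizer and the sequence-classification model can be loaded directly. This is a minimal sketch (the input snippet and the printout are illustrative, not taken from the original notebooks):
+
+ ```python
+ import torch
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+ model_id = "claudios/VulBERTa-MLP-Devign"
+ tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)  # custom Clang tokenizer
+ model = AutoModelForSequenceClassification.from_pretrained(model_id).eval()
+
+ code = "void copy(char *dst, char *src) { strcpy(dst, src); }"
+ inputs = tokenizer(code, return_tensors="pt", truncation=True)
+ with torch.no_grad():
+     probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
+
+ # Report one score per class, using the label names stored in the model config
+ for idx, p in enumerate(probs):
+     print(model.config.id2label[idx], round(p.item(), 4))
+ ```
+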
+ ***
+
+ ## Data
+ We provide all data required by VulBERTa.
+ This includes:
+ - Tokenizer training data
+ - Pre-training data
+ - Fine-tuning data
+
+ Please refer to the [data](https://github.com/ICL-ml4csec/VulBERTa/tree/main/data "data") directory for further instructions and details.
+
+ ## Models
+ We provide all models pre-trained and fine-tuned by VulBERTa.
+ This includes:
+ - Trained tokenisers
+ - Pre-trained VulBERTa model (core representation knowledge)
+ - Fine-tuned VulBERTa-MLP and VulBERTa-CNN models
+
+ Please refer to the [models](https://github.com/ICL-ml4csec/VulBERTa/tree/main/models "models") directory for further instructions and details.
+
+ ## How to use
+
+ In our project, we use JupyterLab notebooks to run experiments.
+ Therefore, we separate each task into a different notebook:
+
+ - [Pretraining_VulBERTa.ipynb](https://github.com/ICL-ml4csec/VulBERTa/blob/main/Pretraining_VulBERTa.ipynb "Pretraining_VulBERTa.ipynb") - Pre-trains the core VulBERTa knowledge representation model using the DrapGH dataset.
+ - [Finetuning_VulBERTa-MLP.ipynb](https://github.com/ICL-ml4csec/VulBERTa/blob/main/Finetuning_VulBERTa-MLP.ipynb "Finetuning_VulBERTa-MLP.ipynb") - Fine-tunes the VulBERTa-MLP model on a specific vulnerability detection dataset.
+ - [Evaluation_VulBERTa-MLP.ipynb](https://github.com/ICL-ml4csec/VulBERTa/blob/main/Evaluation_VulBERTa-MLP.ipynb "Evaluation_VulBERTa-MLP.ipynb") - Evaluates the fine-tuned VulBERTa-MLP models on the testing set of a specific vulnerability detection dataset (a minimal standalone evaluation sketch follows this list).
+ - [Finetuning+evaluation_VulBERTa-CNN](https://github.com/ICL-ml4csec/VulBERTa/blob/main/Finetuning%2Bevaluation_VulBERTa-CNN.ipynb "Finetuning+evaluation_VulBERTa-CNN.ipynb") - Fine-tunes VulBERTa-CNN models and evaluates them on the testing set of a specific vulnerability detection dataset.
+
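+ For a quick check outside the notebooks, the fine-tuned MLP model can also be scored against the CodeXGLUE Devign test split from the Hugging Face Hub. This is a rough, unbatched sketch rather than the authors' evaluation notebook; it assumes the `func` (source code) and `target` (label) fields of the `code_x_glue_cc_defect_detection` dataset and the binary Devign head shown in the usage example above:
+
+ ```python
+ import torch
+ from datasets import load_dataset
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+ model_id = "claudios/VulBERTa-MLP-Devign"
+ tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+ model = AutoModelForSequenceClassification.from_pretrained(model_id).eval()
+
+ test = load_dataset("code_x_glue_cc_defect_detection", split="test")
+
+ correct = 0
+ for example in test:
+     inputs = tokenizer(example["func"], return_tensors="pt", truncation=True, max_length=1024)
+     with torch.no_grad():
+         pred = model(**inputs).logits.argmax(dim=-1).item()
+     correct += int(pred == int(example["target"]))
+
+ print(f"accuracy: {correct / len(test):.4f}")
+ ```
+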
+ ## Citation
+
+ Accepted as a conference paper (oral presentation) at the International Joint Conference on Neural Networks (IJCNN) 2022.
+ Link to paper: https://ieeexplore.ieee.org/document/9892280
+
+ ```bibtex
+ @INPROCEEDINGS{hanif2022vulberta,
+   author={Hanif, Hazim and Maffeis, Sergio},
+   booktitle={2022 International Joint Conference on Neural Networks (IJCNN)},
+   title={VulBERTa: Simplified Source Code Pre-Training for Vulnerability Detection},
+   year={2022},
+   volume={},
+   number={},
+   pages={1-8},
+   doi={10.1109/IJCNN55064.2022.9892280}
+ }
+ ```
config.json ADDED
@@ -0,0 +1,114 @@
+ {
+   "_name_or_path": "VulBERTa-MLP-MVD",
+   "architectures": [
+     "RobertaForSequenceClassification"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bos_token_id": 0,
+   "classifier_dropout": null,
+   "eos_token_id": 2,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "id2label": {
+     "0": "non-vulnerable",
+     "1": "CWE-404",
+     "2": "CWE-476",
+     "3": "CWE-119",
+     "4": "CWE-706",
+     "5": "CWE-670",
+     "6": "CWE-673",
+     "7": "CWE-119, CWE-666, CWE-573",
+     "8": "CWE-573",
+     "9": "CWE-668",
+     "10": "CWE-400, CWE-665, CWE-020",
+     "11": "CWE-662",
+     "12": "CWE-400",
+     "13": "CWE-665",
+     "14": "CWE-020",
+     "15": "CWE-074",
+     "16": "CWE-362",
+     "17": "CWE-191",
+     "18": "CWE-190",
+     "19": "CWE-610",
+     "20": "CWE-704",
+     "21": "CWE-170",
+     "22": "CWE-676",
+     "23": "CWE-187",
+     "24": "CWE-138",
+     "25": "CWE-369",
+     "26": "CWE-662, CWE-573",
+     "27": "CWE-834",
+     "28": "CWE-400, CWE-665",
+     "29": "CWE-400, CWE-404",
+     "30": "CWE-221",
+     "31": "CWE-754",
+     "32": "CWE-311",
+     "33": "CWE-404, CWE-668",
+     "34": "CWE-506",
+     "35": "CWE-758",
+     "36": "CWE-666",
+     "37": "CWE-467",
+     "38": "CWE-327",
+     "39": "CWE-666, CWE-573",
+     "40": "CWE-469"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "label2id": {
+     "CWE-020": 14,
+     "CWE-074": 15,
+     "CWE-119": 3,
+     "CWE-119, CWE-666, CWE-573": 7,
+     "CWE-138": 24,
+     "CWE-170": 21,
+     "CWE-187": 23,
+     "CWE-190": 18,
+     "CWE-191": 17,
+     "CWE-221": 30,
+     "CWE-311": 32,
+     "CWE-327": 38,
+     "CWE-362": 16,
+     "CWE-369": 25,
+     "CWE-400": 12,
+     "CWE-400, CWE-404": 29,
+     "CWE-400, CWE-665": 28,
+     "CWE-400, CWE-665, CWE-020": 10,
+     "CWE-404": 1,
+     "CWE-404, CWE-668": 33,
+     "CWE-467": 37,
+     "CWE-469": 40,
+     "CWE-476": 2,
+     "CWE-506": 34,
+     "CWE-573": 8,
+     "CWE-610": 19,
+     "CWE-662": 11,
+     "CWE-662, CWE-573": 26,
+     "CWE-665": 13,
+     "CWE-666": 36,
+     "CWE-666, CWE-573": 39,
+     "CWE-668": 9,
+     "CWE-670": 5,
+     "CWE-673": 6,
+     "CWE-676": 22,
+     "CWE-704": 20,
+     "CWE-706": 4,
+     "CWE-754": 31,
+     "CWE-758": 35,
+     "CWE-834": 27,
+     "non-vulnerable": 0
+   },
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 1026,
+   "model_type": "roberta",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 1,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.37.0.dev0",
+   "type_vocab_size": 1,
+   "use_cache": true,
+   "vocab_size": 50000
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:933e5d838fa628428fec4d15e9fcd7b7cdd7c34fd60e38e35f9a981fe1aa6491
+ size 499491572
special_tokens_map.json ADDED
@@ -0,0 +1,3 @@
+ {
+   "pad_token": "<pad>"
+ }
tokenization_vulberta.py ADDED
@@ -0,0 +1,51 @@
+ from typing import List
+
+ from tokenizers import NormalizedString, PreTokenizedString
+ from tokenizers.pre_tokenizers import PreTokenizer
+ from transformers import PreTrainedTokenizerFast
+
+ try:
+     from clang import cindex
+ except ModuleNotFoundError as e:
+     raise ModuleNotFoundError(
+         "VulBERTa Clang tokenizer requires `libclang`. Please install it via `pip install libclang`.",
+     ) from e
+
+
+ # Custom pre-tokenizer that splits C/C++ source into lexical tokens using libclang.
+ class ClangPreTokenizer:
+     cidx = cindex.Index.create()
+
+     def clang_split(
+         self,
+         i: int,
+         normalized_string: NormalizedString,
+     ) -> List[NormalizedString]:
+         # Parse the input as an in-memory C file and collect the spelling of every lexer token.
+         tok = []
+         tu = self.cidx.parse(
+             "tmp.c",
+             args=[""],
+             unsaved_files=[("tmp.c", str(normalized_string.original))],
+             options=0,
+         )
+         for t in tu.get_tokens(extent=tu.cursor.extent):
+             spelling = t.spelling.strip()
+             if spelling == "":
+                 continue
+             tok.append(NormalizedString(spelling))
+         return tok
+
+     def pre_tokenize(self, pretok: PreTokenizedString):
+         pretok.split(self.clang_split)
+
+
+ # Fast tokenizer whose pre-tokenization step is replaced by the Clang-based splitter above.
+ class VulBERTaTokenizer(PreTrainedTokenizerFast):
+     def __init__(
+         self,
+         *args,
+         **kwargs,
+     ):
+         super().__init__(
+             *args,
+             **kwargs,
+         )
+         self._tokenizer.pre_tokenizer = PreTokenizer.custom(ClangPreTokenizer())
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,26 @@
+ {
+   "added_tokens_decoder": {
+     "1": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "max_length": 1024,
+   "model_max_length": 1024,
+   "pad_to_multiple_of": null,
+   "pad_token": "<pad>",
+   "pad_token_type_id": 0,
+   "padding_side": "right",
+   "stride": 0,
+   "tokenizer_class": "VulBERTaTokenizer",
+   "auto_map": {
+     "AutoTokenizer": ["tokenization_vulberta.VulBERTaTokenizer", null]
+   },
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first"
+ }