claudios committed
Commit e17731d
1 Parent(s): 439544d

Update README.md

Files changed (1)
  1. README.md +5 -18
README.md CHANGED
@@ -41,17 +41,17 @@ tags:
  ---
 
  # VulBERTa MLP Devign
- ## VulBERTa: Simplified Source Code Pre-Training for Vulnerability Detection
+ ## [VulBERTa: Simplified Source Code Pre-Training for Vulnerability Detection](https://github.com/ICL-ml4csec/VulBERTa/tree/main)
 
  ![VulBERTa architecture](https://raw.githubusercontent.com/ICL-ml4csec/VulBERTa/main/VB.png)
 
  ## Overview
- This model is the unofficial HuggingFace version of "VulBERTa" with an MLP classification head, trained on CodeXGlue Devign, by Hazim Hanif & Sergio Maffeis (Imperial College London).
+ This model is the unofficial HuggingFace version of "[VulBERTa](https://github.com/ICL-ml4csec/VulBERTa/tree/main)" with an MLP classification head, trained on CodeXGlue Devign (C code), by Hazim Hanif & Sergio Maffeis (Imperial College London). I simplified the tokenization process by adding the cleaning (comment removal) step to the tokenizer and added the simplified tokenizer to this model repo as an AutoClass.
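The new Overview paragraph states that the simplified tokenizer is published in this model repo as an AutoClass. A minimal loading sketch, assuming the Hub id is `claudios/VulBERTa-MLP-Devign` and that `trust_remote_code=True` is what pulls in the custom tokenizer (both are assumptions, not stated in the diff):

```python
from transformers import AutoTokenizer

# Assumed Hub id; trust_remote_code=True is presumed necessary because the
# simplified tokenizer ships with the repo as an AutoClass.
# libclang must be installed first (see the Usage section below).
tokenizer = AutoTokenizer.from_pretrained(
    "claudios/VulBERTa-MLP-Devign", trust_remote_code=True
)

# The trailing comment should be dropped by the tokenizer's built-in cleaning step.
code = "int add(int a, int b) { return a + b; } // helper"
print(tokenizer(code)["input_ids"])
```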
 
  > This paper presents VulBERTa, a deep learning approach to detect security vulnerabilities in source code. Our approach pre-trains a RoBERTa model with a custom tokenisation pipeline on real-world code from open-source C/C++ projects. The model learns a deep knowledge representation of the code syntax and semantics, which we leverage to train vulnerability detection classifiers. We evaluate our approach on binary and multi-class vulnerability detection tasks across several datasets (Vuldeepecker, Draper, REVEAL and muVuldeepecker) and benchmarks (CodeXGLUE and D2A). The evaluation results show that VulBERTa achieves state-of-the-art performance and outperforms existing approaches across different datasets, despite its conceptual simplicity, and limited cost in terms of size of training data and number of model parameters.
 
  ## Usage
- *You must install libclang for tokenization.*
+ **You must install libclang for tokenization.**
 
  ```bash
  pip install libclang
@@ -67,6 +67,8 @@ pipe("static void filter_mirror_setup(NetFilterState *nf, Error **errp)\n{\n
  {'label': 'LABEL_1', 'score': 0.985314130783081}]]
  ```
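The Usage block above classifies a C function with a `text-classification` pipeline and prints a score per label. A minimal end-to-end sketch of such a call, assuming the Hub id `claudios/VulBERTa-MLP-Devign` and a recent transformers release (both assumptions; only the function name and the LABEL_1 score appear in the original snippet):

```python
from transformers import pipeline

# Assumed Hub id; trust_remote_code=True loads the custom VulBERTa tokenizer
# bundled with the repo. libclang must already be installed (see above).
pipe = pipeline(
    "text-classification",
    model="claudios/VulBERTa-MLP-Devign",
    trust_remote_code=True,
    top_k=None,  # return scores for both labels; older releases use return_all_scores=True
)

code = "static void filter_mirror_setup(NetFilterState *nf, Error **errp)\n{\n    /* ... */\n}"
print(pipe(code))
# Expected shape: [[{'label': 'LABEL_0', 'score': ...}, {'label': 'LABEL_1', 'score': ...}]]
```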
 
+ ***
+
  ## Data
  We provide all data required by VulBERTa.
  This includes:
@@ -85,18 +87,6 @@ This includes:
 
  Please refer to the [models](https://github.com/ICL-ml4csec/VulBERTa/tree/main/models "models") directory for further instructions and details.
 
- ## Pre-requisites and requirements
-
- In general, we used these versions of the packages when running the experiments:
-
- - Python 3.8.5
- - PyTorch 1.7.0
- - Transformers 4.4.1
- - Tokenizers 0.10.1
- - Libclang (any version > 12.0 should work; see https://pypi.org/project/libclang/)
-
- For an exhaustive list of all the packages, please refer to the [requirements.txt](https://github.com/ICL-ml4csec/VulBERTa/blob/main/requirements.txt "requirements.txt") file.
-
  ## How to use
 
  In our project, we use JupyterLab notebooks to run experiments.
@@ -107,9 +97,6 @@ Therefore, we separate each task into different notebook:
  - [Evaluation_VulBERTa-MLP.ipynb](https://github.com/ICL-ml4csec/VulBERTa/blob/main/Evaluation_VulBERTa-MLP.ipynb "Evaluation_VulBERTa-MLP.ipynb") - Evaluates the fine-tuned VulBERTa-MLP models on the testing set of a specific vulnerability detection dataset.
  - [Finetuning+evaluation_VulBERTa-CNN](https://github.com/ICL-ml4csec/VulBERTa/blob/main/Finetuning%2Bevaluation_VulBERTa-CNN.ipynb "Finetuning+evaluation_VulBERTa-CNN.ipynb") - Fine-tunes VulBERTa-CNN models and evaluates them on the testing set of a specific vulnerability detection dataset.
 
- ## Running VulBERTa-CNN or VulBERTa-MLP on arbitrary codes
-
- Coming soon!
 
  ## Citation
 