claudios committed
Commit 18c9b3e · 1 Parent(s): ab5a888

Upload folder using huggingface_hub

README.md CHANGED
@@ -1,19 +1,52 @@
  ---
  license: mit
  arxiv: 2205.12424
- pipeline_tag: fill-mask
  tags:
  - defect detection
  - code
  ---

- # VulBERTa Pretrained
  ## VulBERTa: Simplified Source Code Pre-Training for Vulnerability Detection

  ![VulBERTa architecture](https://raw.githubusercontent.com/ICL-ml4csec/VulBERTa/main/VB.png)

  ## Overview
- This model is the unofficial HuggingFace version of "[VulBERTa](https://github.com/ICL-ml4csec/VulBERTa/tree/main)" with just the masked language modeling head for pretraining. I simplified the tokenization process by adding the cleaning (comment removal) step to the tokenizer and added the simplified tokenizer to this model repo as an AutoClass, allowing everyone to load this model without manually pulling any repos (with the caveat of requiring `trust_remote_code=True`).

  > This paper presents VulBERTa, a deep learning approach to detect security vulnerabilities in source code. Our approach pre-trains a RoBERTa model with a custom tokenisation pipeline on real-world code from open-source C/C++ projects. The model learns a deep knowledge representation of the code syntax and semantics, which we leverage to train vulnerability detection classifiers. We evaluate our approach on binary and multi-class vulnerability detection tasks across several datasets (Vuldeepecker, Draper, REVEAL and muVuldeepecker) and benchmarks (CodeXGLUE and D2A). The evaluation results show that VulBERTa achieves state-of-the-art performance and outperforms existing approaches across different datasets, despite its conceptual simplicity, and limited cost in terms of size of training data and number of model parameters.
@@ -28,7 +61,10 @@ Note that due to the custom tokenizer, you must pass `trust_remote_code=True` wh
  Example:
  ```
  from transformers import pipeline
- pipe = pipeline("fill-mask", model="claudios/VulBERTa-mlm", trust_remote_code=True, return_all_scores=True)
  ```

  ***
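For reference, a fill-mask query against this pretrained checkpoint would look something like the sketch below. The masked C fragment is made up for illustration, and it assumes the checkpoint exposes the usual RoBERTa `<mask>` token:

```
from transformers import pipeline

# Illustrative fill-mask query; the C snippet and mask placement are hypothetical.
pipe = pipeline("fill-mask", model="claudios/VulBERTa-mlm", trust_remote_code=True)
for pred in pipe("int <mask> = 0 ;"):
    print(pred["token_str"], pred["score"])
```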
 
  ---
  license: mit
  arxiv: 2205.12424
+ datasets:
+ - code_x_glue_cc_defect_detection
+ metrics:
+ - accuracy
+ - precision
+ - recall
+ - f1
+ - roc_auc
+ model-index:
+ - name: VulBERTa MLP
+   results:
+   - task:
+       type: defect-detection
+     dataset:
+       name: codexglue-devign
+       type: codexglue-devign
+     metrics:
+     - name: Accuracy
+       type: Accuracy
+       value: 64.71
+     - name: Precision
+       type: Precision
+       value: 64.80
+     - name: Recall
+       type: Recall
+       value: 50.76
+     - name: F1
+       type: F1
+       value: 56.93
+     - name: ROC-AUC
+       type: ROC-AUC
+       value: 71.02
+ pipeline_tag: text-classification
  tags:
+ - devign
  - defect detection
  - code
  ---

+ # VulBERTa MLP Devign
  ## VulBERTa: Simplified Source Code Pre-Training for Vulnerability Detection

  ![VulBERTa architecture](https://raw.githubusercontent.com/ICL-ml4csec/VulBERTa/main/VB.png)

  ## Overview
+ This model is the unofficial HuggingFace version of "[VulBERTa](https://github.com/ICL-ml4csec/VulBERTa/tree/main)" with an MLP classification head, trained on CodeXGLUE Devign (C code), by Hazim Hanif & Sergio Maffeis (Imperial College London). I simplified the tokenization process by adding the cleaning (comment removal) step to the tokenizer and added the simplified tokenizer to this model repo as an AutoClass.

  > This paper presents VulBERTa, a deep learning approach to detect security vulnerabilities in source code. Our approach pre-trains a RoBERTa model with a custom tokenisation pipeline on real-world code from open-source C/C++ projects. The model learns a deep knowledge representation of the code syntax and semantics, which we leverage to train vulnerability detection classifiers. We evaluate our approach on binary and multi-class vulnerability detection tasks across several datasets (Vuldeepecker, Draper, REVEAL and muVuldeepecker) and benchmarks (CodeXGLUE and D2A). The evaluation results show that VulBERTa achieves state-of-the-art performance and outperforms existing approaches across different datasets, despite its conceptual simplicity, and limited cost in terms of size of training data and number of model parameters.
  Example:
  ```
  from transformers import pipeline
+ pipe = pipeline("text-classification", model="claudios/VulBERTa-MLP-Devign", trust_remote_code=True, return_all_scores=True)
+ pipe("static void filter_mirror_setup(NetFilterState *nf, Error **errp)\n{\n MirrorState *s = FILTER_MIRROR(nf);\n Chardev *chr;\n chr = qemu_chr_find(s->outdev);\n if (chr == NULL) {\n error_set(errp, ERROR_CLASS_DEVICE_NOT_FOUND,\n \"Device '%s' not found\", s->outdev);\n qemu_chr_fe_init(&s->chr_out, chr, errp);")
+ >> [[{'label': 'LABEL_0', 'score': 0.014685827307403088},
+     {'label': 'LABEL_1', 'score': 0.985314130783081}]]
  ```

  ***
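Beyond the pipeline call above, the checkpoint can also be driven through the auto classes directly. A minimal sketch, assuming the auto classes resolve with `trust_remote_code=True` and that `LABEL_1` means "vulnerable" as the example output suggests:

```
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

repo = "claudios/VulBERTa-MLP-Devign"
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(repo, trust_remote_code=True)

# Score a C function for the binary defect-detection task.
inputs = tok("int main() { return 0; }", return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)
print(probs)  # index 1 ~ P(vulnerable) under the LABEL_1 reading above
```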
config.json ADDED
@@ -0,0 +1,28 @@
+ {
+   "_name_or_path": "VulBERTa",
+   "architectures": [
+     "RobertaForMaskedLM"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bos_token_id": 0,
+   "classifier_dropout": null,
+   "eos_token_id": 2,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 1026,
+   "model_type": "roberta",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 1,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.40.1",
+   "type_vocab_size": 1,
+   "use_cache": true,
+   "vocab_size": 50000
+ }
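These hyperparameters describe a standard 12-layer RoBERTa-base body with a 50k code vocabulary; `max_position_embeddings` is 1026 because RoBERTa-style models reserve two extra positions beyond the 1024-token window declared in `tokenizer_config.json`. A quick sanity check (repo id assumed):

```
from transformers import AutoConfig

# Inspect the uploaded config without downloading the weights.
cfg = AutoConfig.from_pretrained("claudios/VulBERTa-mlm")
print(cfg.model_type, cfg.num_hidden_layers, cfg.hidden_size)  # roberta 12 768
print(cfg.vocab_size, cfg.max_position_embeddings)             # 50000 1026
```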
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a9d0d35e89d5f4f97e647744a76852a211825e4f5e0db2d305db8fe08e219264
+ size 499363688
special_tokens_map.json ADDED
@@ -0,0 +1,3 @@
+ {
+   "pad_token": "<pad>"
+ }
tokenization_vulberta.py ADDED
@@ -0,0 +1,51 @@
+ from typing import List
+
+ from tokenizers import NormalizedString, PreTokenizedString
+ from tokenizers.pre_tokenizers import PreTokenizer
+ from transformers import PreTrainedTokenizerFast
+
+ try:
+     from clang import cindex
+ except ModuleNotFoundError as e:
+     raise ModuleNotFoundError(
+         "VulBERTa Clang tokenizer requires `libclang`. Please install it via `pip install libclang`.",
+     ) from e
+
+
+ class ClangPreTokenizer:
+     """Pre-tokenizer that lexes input as C code via libclang."""
+
+     cidx = cindex.Index.create()
+
+     def clang_split(
+         self,
+         i: int,
+         normalized_string: NormalizedString,
+     ) -> List[NormalizedString]:
+         tok = []
+         # Parse the raw string as an in-memory C file so clang's lexer
+         # produces proper C tokens.
+         tu = self.cidx.parse(
+             "tmp.c",
+             args=[""],
+             unsaved_files=[("tmp.c", str(normalized_string.original))],
+             options=0,
+         )
+         for t in tu.get_tokens(extent=tu.cursor.extent):
+             spelling = t.spelling.strip()
+             if spelling == "":
+                 continue
+             tok.append(NormalizedString(spelling))
+         return tok
+
+     def pre_tokenize(self, pretok: PreTokenizedString):
+         pretok.split(self.clang_split)
+
+
+ class VulBERTaTokenizer(PreTrainedTokenizerFast):
+     def __init__(
+         self,
+         *args,
+         **kwargs,
+     ):
+         super().__init__(
+             *args,
+             **kwargs,
+         )
+         # Swap in the Clang-based pre-tokenizer after the fast tokenizer loads.
+         self._tokenizer.pre_tokenizer = PreTokenizer.custom(ClangPreTokenizer())
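To see what the Clang pre-tokenizer does on its own, something like the following should work once `libclang` is installed (the C snippet is illustrative):

```
from tokenizers import NormalizedString
from tokenization_vulberta import ClangPreTokenizer

pre = ClangPreTokenizer()
# clang lexes the buffer into proper C tokens, so "x=0;" splits cleanly.
pieces = pre.clang_split(0, NormalizedString("int x=0;"))
print([p.normalized for p in pieces])  # should print ['int', 'x', '=', '0', ';']
```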
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,26 @@
+ {
+   "added_tokens_decoder": {
+     "1": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "max_length": 1024,
+   "model_max_length": 1024,
+   "pad_to_multiple_of": null,
+   "pad_token": "<pad>",
+   "pad_token_type_id": 0,
+   "padding_side": "right",
+   "stride": 0,
+   "tokenizer_class": "VulBERTaTokenizer",
+   "auto_map": {
+     "AutoTokenizer": ["tokenization_vulberta.VulBERTaTokenizer", null]
+   },
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first"
+ }
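The `auto_map` entry is what lets `AutoTokenizer` resolve the custom class from `tokenization_vulberta.py`; this is also why `trust_remote_code=True` is required. A minimal load (repo id assumed):

```
from transformers import AutoTokenizer

# auto_map routes AutoTokenizer to tokenization_vulberta.VulBERTaTokenizer.
tok = AutoTokenizer.from_pretrained("claudios/VulBERTa-mlm", trust_remote_code=True)
print(type(tok).__name__)          # VulBERTaTokenizer
print(tok.tokenize("int x = 0;"))  # Clang-lexed, then mapped to the model vocab
```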