---
license: cc-by-nc-4.0
pipeline_tag: fill-mask
widget:
- text: >-
    The PDF contains an action object. Upon a victim opening the PDF it will send a query to Google: http://www[.]google[.]com/url?q=http%3A%2F%2F9348243249382479234343284324023432748892349702394023.xyz&sa=D&sntz=1&usg=AFQjCNFWmVffgSGlrrv-2U9sSOJYzfUQqw. This link is a typical <mask> attack.
tags:
- cybersecurity

---


# CyBERTuned

CyBERTuned is a BERT-like masked language model for the cybersecurity domain. It is pretrained with a method that is aware of non-linguistic elements (NLEs), such as the URL in the example below.


## Sample Usage
```python
>>> from transformers import pipeline
>>> folder_dir = "CyBERTuned"
>>> unmasker = pipeline('fill-mask', model=folder_dir)
>>> unmasker("RagnarLocker, LockBit, and REvil are types of <mask>.")

[{'score': 0.8489783406257629, 'token': 25346, 'token_str': ' ransomware', 'sequence': 'RagnarLocker, LockBit, and REvil are types of ransomware.'},
{'score': 0.1364559829235077, 'token': 16886, 'token_str': ' malware', 'sequence': 'RagnarLocker, LockBit, and REvil are types of malware.'},
{'score': 0.0022238395176827908, 'token': 1912, 'token_str': ' attacks', 'sequence': 'RagnarLocker, LockBit, and REvil are types of attacks.'},
{'score': 0.001197474543005228, 'token': 11341, 'token_str': ' infections', 'sequence': 'RagnarLocker, LockBit, and REvil are types of infections.'},
{'score': 0.0009669850114732981, 'token': 6773, 'token_str': ' files', 'sequence': 'RagnarLocker, LockBit, and REvil are types of files.'}]

>>> # text requiring url comprehension (redirection attack), modified from https://intezer.com/blog/research/targeted-phishing-attack-against-ukrainian-government-expands-to-georgia/
>>> url_text = 'The PDF contains an action object. Upon a victim opening the PDF it will send a query to Google: http://www[.]google[.]com/url?q=http%3A%2F%2F9348243249382479234343284324023432748892349702394023.xyz&sa=D&sntz=1&usg=AFQjCNFWmVffgSGlrrv-2U9sSOJYzfUQqw. This link is a typical <mask> attack.'
>>> unmasker(url_text)[0]

{'score': 0.1701660305261612, 'token': 30970, 'token_str': ' redirect', 'sequence': 'The PDF contains an action object. Upon a victim opening the PDF it will send a query to Google: http://www[.]google[.]com/url?q=http%3A%2F%2F9348243249382479234343284324023432748892349702394023.xyz&sa=D&sntz=1&usg=AFQjCNFWmVffgSGlrrv-2U9sSOJYzfUQqw. This link is a typical redirect attack.'}

>>> # Load the bare encoder (no masked-LM head) to obtain contextual token embeddings
>>> from transformers import AutoModel, AutoTokenizer
>>> model = AutoModel.from_pretrained(folder_dir)
>>> tokenizer = AutoTokenizer.from_pretrained(folder_dir)
>>> text = "Cybersecurity information is often technically complex and relayed through unstructured text, making automation of cyber threat intelligence highly challenging."
>>> encoded = tokenizer(text, return_tensors="pt")
>>> output = model(**encoded)
>>> output[0].shape  # last hidden state: (batch size, number of tokens, hidden size)

torch.Size([1, 27, 768])

```


## Citation
If you use CyBERTuned, please cite the following paper:

```
Eugene Jang, Jian Cui, Dayeon Yim, Youngjin Jin, Jin-Woo Chung, Seungwon Shin, and Yongjae Lee. 2024. Ignore Me But Don’t Replace Me: Utilizing Non-Linguistic Elements for Pretraining on the Cybersecurity Domain. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 29–42, Mexico City, Mexico. Association for Computational Linguistics.
```

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0006
- train_batch_size: 64
- eval_batch_size: 32
- seed: 42
- distributed_type: multi-GPU
- num_devices: 4
- gradient_accumulation_steps: 8
- total_train_batch_size: 2048
- total_eval_batch_size: 128
- optimizer: Adam with betas=(0.9,0.98) and epsilon=1e-06
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.048
- num_epochs: 200
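
The totals in the list follow from the per-device values, and the learning-rate settings describe a warmup-then-decay curve. The sketch below reproduces that arithmetic. The batch sizes, device count, accumulation steps, peak learning rate, and warmup ratio are copied from the list. The function `linear_schedule_lr`, the rounding of the warmup length, and the step count passed to it are illustrative assumptions, not values or code from the original training run.

```python
import math

# Values copied from the hyperparameter list above.
train_batch_size = 64             # per-device training batch size
eval_batch_size = 32              # per-device evaluation batch size
num_devices = 4
gradient_accumulation_steps = 8
peak_lr = 6e-4                    # learning_rate
warmup_ratio = 0.048              # lr_scheduler_warmup_ratio

# Effective batch sizes: per-device size times devices (times accumulation for training).
total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps
total_eval_batch_size = eval_batch_size * num_devices


def linear_schedule_lr(step: int, total_steps: int) -> float:
    """Hypothetical helper: linear warmup to peak_lr, then linear decay to zero.

    Assumes the warmup length is total_steps * warmup_ratio, rounded up.
    """
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


print(total_train_batch_size)  # 2048
print(total_eval_batch_size)   # 128
```

Calling `linear_schedule_lr(step, total_steps)` with a placeholder `total_steps` shows the shape of the curve: the rate starts at zero, peaks at 0.0006 when warmup ends, and returns to zero at the final step. The real number of optimizer steps depends on the size of the pretraining corpus, which this card does not state.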

### Framework versions

- Transformers 4.27.0.dev0
- PyTorch 1.12.1
- Datasets 2.6.1
- Tokenizers 0.13.2