---
license: apache-2.0
datasets:
- FredZhang7/malicious-website-features-2.4M
widget:
- text: https://chat.openai.com/
- text: https://huggingface.co/FredZhang7/aivance-safesearch-v3
metrics:
- accuracy
language:
- af
- en
- et
- sw
- sv
- sq
- de
- ca
- hu
- da
- tl
- so
- fi
- fr
- cs
- hr
- cy
- es
- sl
- tr
- pl
- pt
- nl
- id
- sk
- lt
- 'no'
- lv
- vi
- it
- ro
- ru
- mk
- bg
- th
- ja
- ko
- multilingual
---

**Important:** this model is not production-ready.

<br>

The classification task for v1 is split into two stages:
1. URL features model
    - **96.5%+ accurate** on training and validation data
    - 2,436,727 rows of labelled URLs
    - evaluation from v2 suggests slight overfitting, by roughly 0.8%
2. Website features model
    - **98.4% accurate** on training data, and **98.9% accurate** on validation data
    - 911,180 rows of 42 features
    - evaluation from v2 suggests a slight bias towards the URL-derived feature (`bert_confidence`) relative to the other columns

## Training
I applied cross-validation with `cv=5` to the training dataset to search for the best hyperparameters.
Here's the parameter grid passed to `sklearn`'s `GridSearchCV` (note that every value in a `param_grid` must be a list, including settings that are held fixed):
```python
params = {
    'objective': ['binary'],
    'metric': ['binary_logloss'],
    'boosting_type': ['gbdt', 'dart'],
    'num_leaves': [15, 23, 31, 63],
    'learning_rate': [0.001, 0.002, 0.01, 0.02],
    'feature_fraction': [0.5, 0.6, 0.7, 0.9],
    'early_stopping_rounds': [10, 20],
    'num_boost_round': [500, 750, 800, 900, 1000, 1250, 2000]
}
```
To reproduce the 98.4% accurate model, you can follow the data analysis on the [dataset page](https://huggingface.co/datasets/FredZhang7/malicious-website-features-2.4M) to filter out the unimportant features.
Then train a LightGBM model using the hyperparameters best suited for this task:
```python
params = {
    'objective': 'binary',
    'metric': 'binary_logloss',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.01,
    'feature_fraction': 0.6,
    'early_stopping_rounds': 10,
    'num_boost_round': 800
}
```


## URL Features
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("FredZhang7/malware-phisher")
model = AutoModelForSequenceClassification.from_pretrained("FredZhang7/malware-phisher")

inputs = tokenizer("https://example.com/", return_tensors="pt", truncation=True)
logits = model(**inputs).logits  # class scores for the URL
```
## Website Features
```bash
pip install lightgbm
```
```python
import lightgbm as lgb

booster = lgb.Booster(model_file="phishing_model_combined_0.984_train.txt")
# score new rows with booster.predict(features), where `features` has the
# same 42 columns, in the same order, as the training data
```