|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- FredZhang7/malicious-website-features-2.4M |
|
widget:
|
- text: https://chat.openai.com/ |
|
- text: https://huggingface.co/FredZhang7/aivance-safesearch-v3 |
|
metrics: |
|
- accuracy |
|
language: |
|
- af |
|
- en |
|
- et |
|
- sw |
|
- sv |
|
- sq |
|
- de |
|
- ca |
|
- hu |
|
- da |
|
- tl |
|
- so |
|
- fi |
|
- fr |
|
- cs |
|
- hr |
|
- cy |
|
- es |
|
- sl |
|
- tr |
|
- pl |
|
- pt |
|
- nl |
|
- id |
|
- sk |
|
- lt |
|
- 'no' |
|
- lv |
|
- vi |
|
- it |
|
- ro |
|
- ru |
|
- mk |
|
- bg |
|
- th |
|
- ja |
|
- ko |
|
- multilingual |
|
--- |
|
|
|
It's very important to note that this model is not production-ready. |
|
|
|
<br> |
|
|
|
The classification task for v1 is split into two stages: |
|
1. URL features model |
|
- **96.5%+ accurate** on training and validation data |
|
- 2,436,727 rows of labelled URLs |
|
   - evaluation from v2: slightly overfitted, by roughly 0.8%
|
2. Website features model |
|
- **98.4% accurate** on training data, and **98.9% accurate** on validation data |
|
- 911,180 rows of 42 features |
|
   - evaluation from v2: relies slightly more heavily on the URL-derived feature (`bert_confidence`) than on the other columns
|
|
|
## Training |
|
I applied cross-validation with `cv=5` to the training dataset to search for the best hyperparameters. |
|
Here's the parameter grid passed to scikit-learn's `GridSearchCV`:
|
```python |
|
params = { |
|
    'objective': ['binary'],
|
    'metric': ['binary_logloss'],
|
'boosting_type': ['gbdt', 'dart'], |
|
'num_leaves': [15, 23, 31, 63], |
|
'learning_rate': [0.001, 0.002, 0.01, 0.02], |
|
'feature_fraction': [0.5, 0.6, 0.7, 0.9], |
|
'early_stopping_rounds': [10, 20], |
|
'num_boost_round': [500, 750, 800, 900, 1000, 1250, 2000] |
|
} |
|
``` |
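
A minimal sketch of how this grid could be searched. Since the grid mixes booster params with training arguments (`num_boost_round`, `early_stopping_rounds`), the sketch uses scikit-learn's `ParameterGrid` together with LightGBM's native `lgb.cv` instead of `GridSearchCV` itself; `X` and `y` stand for an assumed feature matrix and label vector built from the dataset.

```python
import lightgbm as lgb
from sklearn.model_selection import ParameterGrid

# Assumed inputs: X is the engineered feature matrix, y the binary labels.
train_set = lgb.Dataset(X, label=y)

best_score, best_candidate = float("inf"), None
for candidate in ParameterGrid(params):  # the full grid is large; trim it for a quick run
    booster_params = {
        k: v for k, v in candidate.items()
        if k not in ("num_boost_round", "early_stopping_rounds")
    }
    result = lgb.cv(
        booster_params,
        train_set,
        num_boost_round=candidate["num_boost_round"],
        nfold=5,
        callbacks=[lgb.early_stopping(candidate["early_stopping_rounds"], verbose=False)],
    )
    # The metric key name varies slightly across LightGBM versions, so look it up.
    mean_key = next(k for k in result if k.endswith("-mean"))
    score = min(result[mean_key])
    if score < best_score:
        best_score, best_candidate = score, candidate
```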
|
To reproduce the 98.4% accurate model, you can follow the data analysis on the [dataset page](https://huggingface.co/datasets/FredZhang7/malicious-website-features-2.4M) to filter out the unimportant features. |
|
Then train a LightGBM model using the best-suited hyperparameters for this task:
|
```python |
|
params = { |
|
'objective': 'binary', |
|
'metric': 'binary_logloss', |
|
'boosting_type': 'gbdt', |
|
'num_leaves': 31, |
|
'learning_rate': 0.01, |
|
'feature_fraction': 0.6, |
|
'early_stopping_rounds': 10, |
|
'num_boost_round': 800 |
|
} |
|
``` |
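
Below is a minimal sketch of training and saving a booster with these hyperparameters; `X` and `y` are again the assumed feature matrix and labels, and the 80/20 split is only illustrative.

```python
import lightgbm as lgb
from sklearn.model_selection import train_test_split

# Illustrative split of the assumed feature matrix X and labels y.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

train_set = lgb.Dataset(X_train, label=y_train)
val_set = lgb.Dataset(X_val, label=y_val, reference=train_set)

booster = lgb.train(
    {
        "objective": "binary",
        "metric": "binary_logloss",
        "boosting_type": "gbdt",
        "num_leaves": 31,
        "learning_rate": 0.01,
        "feature_fraction": 0.6,
    },
    train_set,
    num_boost_round=800,
    valid_sets=[val_set],
    callbacks=[lgb.early_stopping(10)],  # early_stopping_rounds from the params above
)
booster.save_model("phishing_model_combined_0.984_train.txt")
```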
|
|
|
|
|
## URL Features |
|
```python |
|
from transformers import AutoModelForSequenceClassification, AutoTokenizer |
|
tokenizer = AutoTokenizer.from_pretrained("FredZhang7/malware-phisher") |
|
model = AutoModelForSequenceClassification.from_pretrained("FredZhang7/malware-phisher") |
|
``` |
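
A usage sketch for scoring a single URL with the tokenizer and model loaded above; the label names are read from the model config rather than assumed.

```python
import torch

url = "https://huggingface.co/FredZhang7/aivance-safesearch-v3"
inputs = tokenizer(url, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits

# Softmax over the classes; see model.config.id2label for the label names.
probs = torch.softmax(logits, dim=-1).squeeze()
print({model.config.id2label[i]: round(float(p), 4) for i, p in enumerate(probs)})
```

The resulting confidence is presumably what feeds the `bert_confidence` column consumed by the website features model.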
|
## Website Features |
|
```bash |
|
pip install lightgbm |
|
``` |
|
```python |
|
import lightgbm as lgb |
|
model = lgb.Booster(model_file="phishing_model_combined_0.984_train.txt")
|
``` |
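
A sketch of scoring one example with the loaded booster; the zero-filled row is only a placeholder for the engineered website features, which must appear in the same column order as during training (see the dataset page).

```python
import numpy as np

# Placeholder input: replace with one row of the engineered website features.
features = np.zeros((1, model.num_feature()))

# With a binary objective, predict() returns the probability of the positive class
# (assumed here to be "malicious").
prob = model.predict(features)[0]
print(f"predicted probability: {prob:.4f}")
```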