codechrl committed on
Commit 7ebc894 · verified · 1 Parent(s): ab303c1

Training update: 0/349,936 rows (0.00%) | +100 new @ 2026-04-06 10:33:47

README.md ADDED
@@ -0,0 +1,84 @@
+ ---
+ language:
+ - en
+ - id
+ tags:
+ - bert
+ - text-classification
+ - token-classification
+ - cybersecurity
+ - fill-mask
+ - named-entity-recognition
+ - transformers
+ - tensorflow
+ - pytorch
+ - masked-language-modeling
+ base_model: boltuix/bert-mini
+ library_name: transformers
+ pipeline_tag: fill-mask
+ ---
+ # bert-mini-cybersecurity
+
+ ## 1. Model Details
+ **Model description**
+ "bert-mini-cybersecurity" is a compact transformer model adapted for cybersecurity text-classification tasks (e.g., threat detection, incident reports, malicious vs. benign content).
+ - Model type: fine-tuned lightweight BERT variant
+ - Languages: English and Indonesian
+ - Fine-tuned from: `boltuix/bert-mini`
+ - Status: **Early version**, trained on **0.00%** of the planned data so far.
+
+ **Model sources**
+ - Base model: [boltuix/bert-mini](https://huggingface.co/boltuix/bert-mini)
+ - Data: Cybersecurity Data
+
+ ## 2. Uses
+ ### Direct use
+ You can use this model to classify cybersecurity-related text, for example, whether a given message, report, or log entry indicates malicious intent, abnormal behaviour, or the presence of a threat.
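+ Since the pipeline tag is `fill-mask`, a minimal usage sketch might look like the following. The repo id `codechrl/bert-mini-cybersecurity` and the `___` placeholder convention are assumptions for illustration, not confirmed by this commit:
+
+ ```python
+ def mask_sentence(template: str, mask_token: str = "[MASK]") -> str:
+     """Swap the ___ placeholder for the BERT mask token."""
+     return template.replace("___", mask_token)
+
+ if __name__ == "__main__":
+     # Lazy import: needs `pip install transformers` and network access.
+     from transformers import pipeline
+
+     fill = pipeline("fill-mask", model="codechrl/bert-mini-cybersecurity")
+     for pred in fill(mask_sentence("The firewall blocked a ___ connection."), top_k=3):
+         print(pred["token_str"], round(pred["score"], 3))
+ ```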
+ ### Downstream use
+ - Embedding extraction for clustering.
+ - Named entity recognition on logs or other security data.
+ - Classification of security data.
+ - Anomaly detection in security logs.
+ - As part of a pipeline for phishing detection, malicious-email filtering, or incident triage.
+ - As a feature extractor feeding a downstream system (e.g., alert generation, a SOC dashboard).
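+ For the embedding-extraction use case, a mean-pooling sketch follows. The repo id is an assumption, and `mean_pool` is an illustrative helper, not part of this repo:
+
+ ```python
+ def mean_pool(last_hidden, attention_mask):
+     """Average token vectors over the sequence, ignoring padded positions.
+
+     Works on any array with numpy-style broadcasting (numpy or torch).
+     """
+     mask = attention_mask[..., None] * 1.0  # (batch, seq, 1), float
+     return (last_hidden * mask).sum(1) / mask.sum(1).clip(1e-9, None)
+
+ if __name__ == "__main__":
+     import torch
+     from transformers import AutoModel, AutoTokenizer
+
+     repo = "codechrl/bert-mini-cybersecurity"  # assumed repo id
+     tok = AutoTokenizer.from_pretrained(repo)
+     model = AutoModel.from_pretrained(repo)
+     batch = tok(["failed ssh login from 10.0.0.5"], return_tensors="pt", padding=True)
+     with torch.no_grad():
+         out = model(**batch)
+     # hidden_size is 128 per config.json, so each text maps to a 128-dim vector.
+     print(mean_pool(out.last_hidden_state, batch["attention_mask"]).shape)
+ ```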
+ ### Out-of-scope use
+ - Not meant for high-stakes automated blocking decisions without human review.
+ - Not optimized for languages other than English and Indonesian.
+ - Not tested on non-cybersecurity domains or out-of-distribution data.
+
+ ### Downstream use cases in development using this model
+ - NER on security logs, botnet data, and JSON data.
+ - Early classification of SIEM alerts and events.
+
+ ## 3. Bias, Risks, and Limitations
+ Because the model is based on a small subset (0.00%) of the planned data, performance is preliminary and may degrade on unseen or specialized domains (industrial control, IoT logs, other languages).
+ - Inherits any biases present in the base model (`boltuix/bert-mini`) and in the fine-tuning data, e.g., over-representation of certain threat types or vendor- and tooling-specific vocabulary.
+ - **Should not be used as the sole authority for incident decisions; only as an aid to human analysts.**
+
+ ## 4. Training Details
+
+ ### Text Processing & Chunking
+ Since cybersecurity data often contains lengthy alert descriptions and execution logs that exceed BERT's 512-token limit, we implement an overlapping chunking strategy:
+ - **Max sequence length**: 512 tokens
+ - **Stride**: 32 tokens (overlap between consecutive chunks)
+ - **Chunking behavior**: Long texts are split into overlapping segments. For example, with max_length=512 and stride=32, a 1,000-token document becomes 3 chunks (tokens 0–512, 480–992, and 960–1,000) with 32-token overlaps, preserving context across boundaries.
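+ The chunk arithmetic above can be sketched in plain Python. In practice, Hugging Face tokenizers produce these segments via `truncation=True, max_length=512, stride=32, return_overflowing_tokens=True`; the helper below only illustrates the span math:
+
+ ```python
+ def chunk_spans(n_tokens: int, max_length: int = 512, stride: int = 32):
+     """Return (start, end) token spans; consecutive spans overlap by `stride` tokens."""
+     step = max_length - stride  # how far the window advances each chunk
+     spans, start = [], 0
+     while True:
+         end = min(start + max_length, n_tokens)
+         spans.append((start, end))
+         if end == n_tokens:
+             break
+         start += step
+     return spans
+
+ print(chunk_spans(1000))  # [(0, 512), (480, 992), (960, 1000)] -> 3 chunks
+ ```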
+
+ ### Training Hyperparameters
+ - **Base model**: `boltuix/bert-mini`
+ - **Training epochs**: 3
+ - **Learning rate**: 5e-05
+ - **Batch size**: 16
+ - **Weight decay**: 0.01
+ - **Warmup ratio**: 0.06
+ - **Gradient accumulation steps**: 1
+ - **Optimizer**: AdamW
+ - **LR scheduler**: Linear with warmup
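+ A sketch of how these hyperparameters would map onto `transformers.TrainingArguments` (the `output_dir` name is arbitrary; AdamW and the linear-with-warmup scheduler are the library defaults):
+
+ ```python
+ # The hyperparameter list above, keyed by TrainingArguments parameter names.
+ HPARAMS = {
+     "num_train_epochs": 3,
+     "learning_rate": 5e-5,
+     "per_device_train_batch_size": 16,
+     "weight_decay": 0.01,
+     "warmup_ratio": 0.06,
+     "gradient_accumulation_steps": 1,
+     "lr_scheduler_type": "linear",  # linear decay after warmup
+ }
+
+ if __name__ == "__main__":
+     from transformers import TrainingArguments
+
+     args = TrainingArguments(output_dir="bert-mini-cybersecurity", **HPARAMS)
+     print(args.learning_rate)
+ ```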
+
+ ### Training Data
+ - **Total database rows**: 349,936
+ - **Rows processed (cumulative)**: 0 (0.00%)
+ - **Training date**: 2026-04-06 10:33:47
+
+ ### Post-Training Metrics
+ - **Final training loss**:
+ - **Rows→Samples ratio**:
config.json ADDED
@@ -0,0 +1,29 @@
+ {
+ "add_cross_attention": false,
+ "architectures": [
+ "BertForMaskedLM"
+ ],
+ "attention_probs_dropout_prob": 0.1,
+ "bos_token_id": null,
+ "classifier_dropout": null,
+ "dtype": "float32",
+ "eos_token_id": null,
+ "hidden_act": "gelu",
+ "hidden_dropout_prob": 0.1,
+ "hidden_size": 128,
+ "initializer_range": 0.02,
+ "intermediate_size": 512,
+ "is_decoder": false,
+ "layer_norm_eps": 1e-12,
+ "max_position_embeddings": 512,
+ "model_type": "bert",
+ "num_attention_heads": 2,
+ "num_hidden_layers": 4,
+ "pad_token_id": 0,
+ "position_embedding_type": "absolute",
+ "tie_word_embeddings": true,
+ "transformers_version": "5.2.0",
+ "type_vocab_size": 2,
+ "use_cache": false,
+ "vocab_size": 30522
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:912ab7a7e426b316b960b61f3dfca4590dd4638b79e684ea7de8868eaee97d82
+ size 19261424
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,17 @@
+ {
+ "backend": "tokenizers",
+ "clean_up_tokenization_spaces": true,
+ "cls_token": "[CLS]",
+ "do_basic_tokenize": true,
+ "do_lower_case": true,
+ "is_local": false,
+ "mask_token": "[MASK]",
+ "model_max_length": 1000000000000000019884624838656,
+ "never_split": null,
+ "pad_token": "[PAD]",
+ "sep_token": "[SEP]",
+ "strip_accents": null,
+ "tokenize_chinese_chars": true,
+ "tokenizer_class": "BertTokenizer",
+ "unk_token": "[UNK]"
+ }
training_metadata.json ADDED
@@ -0,0 +1,15 @@
+ {
+ "trained_at": 1775446427.337274,
+ "trained_at_readable": "2026-04-06 10:33:47",
+ "samples_this_session": 100,
+ "new_rows_this_session": 100,
+ "trained_rows_total": 0,
+ "total_db_rows": 349936,
+ "percentage": 0.0,
+ "final_loss": 0,
+ "epochs": 3,
+ "learning_rate": 5e-05,
+ "batch_size": 16,
+ "stride": 32,
+ "max_length": 512
+ }