Training update: 0/349,936 rows (0.00%) | +100 new @ 2026-04-06 10:33:47
- README.md +84 -0
- config.json +29 -0
- model.safetensors +3 -0
- tokenizer.json +0 -0
- tokenizer_config.json +17 -0
- training_metadata.json +15 -0
README.md
ADDED
@@ -0,0 +1,84 @@
---
language:
- en
- id
tags:
- bert
- text-classification
- token-classification
- cybersecurity
- fill-mask
- named-entity-recognition
- transformers
- tensorflow
- pytorch
- masked-language-modeling
base_model: boltuix/bert-mini
library_name: transformers
pipeline_tag: fill-mask
---

# bert-mini-cybersecurity

## 1. Model Details

**Model description**

`bert-mini-cybersecurity` is a compact transformer model adapted for cybersecurity text classification tasks (e.g., threat detection, incident reports, malicious vs. benign content).

- Model type: fine-tuned lightweight BERT variant
- Languages: English & Indonesian
- Finetuned from: `boltuix/bert-mini`
- Status: **Early version** — trained on **0.00%** of planned data.

**Model sources**

- Base model: [boltuix/bert-mini](https://huggingface.co/boltuix/bert-mini)
- Data: Cybersecurity Data

## 2. Uses

### Direct use

You can use this model to classify cybersecurity-related text — for example, whether a given message, report, or log entry indicates malicious intent, abnormal behaviour, or the presence of a threat.
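
The checkpoint in this commit is uploaded as a `BertForMaskedLM` (`pipeline_tag: fill-mask`), so the most direct way to exercise it is masked-token prediction. A minimal sketch, assuming a placeholder repository id:

```python
# Minimal fill-mask sketch. "your-namespace/bert-mini-cybersecurity" is a
# placeholder for this repository's actual id on the Hub.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="your-namespace/bert-mini-cybersecurity")

# The tokenizer's mask token is [MASK] (see tokenizer_config.json).
for pred in fill_mask("The attacker used SQL [MASK] to exfiltrate the database."):
    print(f"{pred['token_str']:>12}  {pred['score']:.3f}")
```

For sequence classification, the same weights can be loaded via `AutoModelForSequenceClassification`, which attaches a randomly initialized classification head that still needs fine-tuning.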

### Downstream use

- Embedding extraction for clustering (see the sketch after this list).
- Named Entity Recognition on logs or other security data.
- Classification of security data.
- Anomaly detection in security logs.
- As part of a pipeline for phishing detection, malicious email filtering, or incident triage.
- As a feature extractor feeding a downstream system (e.g., alert generation, a SOC dashboard).
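
For the embedding and feature-extraction bullets above, a hedged sketch of mean-pooled sentence embeddings (the 128-dimensional hidden size comes from config.json; the repository id is again a placeholder):

```python
# Mean-pool the last hidden state into one 128-dim vector per text,
# masking out [PAD] positions. The repo id below is a placeholder.
import torch
from transformers import AutoModel, AutoTokenizer

repo = "your-namespace/bert-mini-cybersecurity"  # placeholder id
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModel.from_pretrained(repo)

texts = [
    "Multiple failed SSH logins from 203.0.113.7",
    "Scheduled backup completed successfully",
]
batch = tokenizer(texts, padding=True, truncation=True,
                  max_length=512, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state     # (batch, seq_len, 128)

mask = batch["attention_mask"].unsqueeze(-1)      # 0 at [PAD] positions
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)                           # torch.Size([2, 128])
```

The resulting vectors can be fed to any off-the-shelf clustering algorithm such as k-means.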

### Out-of-scope use

- Not meant for high-stakes automated blocking decisions without human review.
- Not optimized for languages other than English and Indonesian.
- Not tested for non-cybersecurity domains or out-of-distribution data.

### Downstream use cases in development using this model

- NER on security logs, botnet data, and JSON data.
- Early classification of SIEM alerts and events.

## 3. Bias, Risks, and Limitations

Because the model is based on a small subset (0.00%) of the planned data, performance is preliminary and may degrade on unseen or specialized domains (industrial control systems, IoT logs, foreign languages).

- Inherits any biases present in the base model (`boltuix/bert-mini`) and in the fine-tuning data — e.g., over-representation of certain threat types, or vendor- and tooling-specific vocabulary.
- **Should not be used as the sole authority for incident decisions; only as an aid to human analysts.**

## 4. Training Details

### Text Processing & Chunking

Since cybersecurity data often contains lengthy alert descriptions and execution logs that exceed BERT's 512-token limit, we implement an overlapping chunking strategy (see the tokenizer sketch after this list):

- **Max sequence length**: 512 tokens
- **Stride**: 32 tokens (overlap between consecutive chunks)
- **Chunking behavior**: Long texts are split into overlapping segments. For example, with max_length=512 and stride=32, a 1000-token document becomes ~3 chunks with 32-token overlaps, preserving context across boundaries.
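
A minimal sketch of this chunking using the standard Hugging Face tokenizer API (max_length and stride match the values above; the sample text and the use of the base tokenizer are illustrative assumptions):

```python
# Split an over-length text into overlapping <=512-token chunks.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("boltuix/bert-mini")

long_text = " ".join(["alert"] * 2000)  # stand-in for a long execution log

encoded = tokenizer(
    long_text,
    truncation=True,
    max_length=512,                  # hard cap per chunk
    stride=32,                       # tokens shared between adjacent chunks
    return_overflowing_tokens=True,  # return every chunk, not just the first
)

# Each entry in encoded["input_ids"] is one chunk; adjacent chunks
# overlap by 32 tokens so context survives the split.
print(len(encoded["input_ids"]))
```

With a fast tokenizer (this repo uses the `tokenizers` backend), the output also includes `overflow_to_sample_mapping`, which is useful for tracing chunks back to their source rows.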

### Training Hyperparameters

- **Base model**: `boltuix/bert-mini`
- **Training epochs**: 3
- **Learning rate**: 5e-05
- **Batch size**: 16
- **Weight decay**: 0.01
- **Warmup ratio**: 0.06
- **Gradient accumulation steps**: 1
- **Optimizer**: AdamW
- **LR scheduler**: Linear with warmup
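
For readers reproducing the setup, a hedged sketch of how these hyperparameters map onto `TrainingArguments`; the corpus, output path, and masking probability below are assumptions, not part of this repo:

```python
# Hypothetical reconstruction of the training configuration above.
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model = AutoModelForMaskedLM.from_pretrained("boltuix/bert-mini")
tokenizer = AutoTokenizer.from_pretrained("boltuix/bert-mini")

# Tiny illustrative corpus; the real training data is the cybersecurity DB.
texts = [
    "Suspicious PowerShell execution detected on host WKS-042.",
    "Multiple failed login attempts followed by a successful login.",
]
train_dataset = [tokenizer(t, truncation=True, max_length=512) for t in texts]

args = TrainingArguments(
    output_dir="bert-mini-cybersecurity",  # assumed output path
    num_train_epochs=3,
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    weight_decay=0.01,
    warmup_ratio=0.06,
    gradient_accumulation_steps=1,
    lr_scheduler_type="linear",            # linear decay after warmup
)

# Masked-LM objective; 15% masking is the BERT default (the value actually
# used here is not recorded). AdamW is the Trainer default optimizer.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(model=model, args=args,
                  train_dataset=train_dataset, data_collator=collator)
trainer.train()
```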

### Training Data

- **Total database rows**: 349,936
- **Rows processed (cumulative)**: 0 (0.00%)
- **Training date**: 2026-04-06 10:33:47

### Post-Training Metrics

- **Final training loss**:
- **Rows→Samples ratio**:
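
A hedged reading of how the Rows→Samples ratio is presumably derived (both inputs come from training_metadata.json in this commit; chunking can yield more samples than rows, pushing the ratio above 1.0):

```python
# Presumed derivation of the Rows -> Samples ratio for this session.
samples_this_session = 100
new_rows_this_session = 100
print(samples_this_session / new_rows_this_session)  # 1.0
```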
config.json
ADDED
@@ -0,0 +1,29 @@
{
  "add_cross_attention": false,
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": null,
  "classifier_dropout": null,
  "dtype": "float32",
  "eos_token_id": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 128,
  "initializer_range": 0.02,
  "intermediate_size": 512,
  "is_decoder": false,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 2,
  "num_hidden_layers": 4,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "tie_word_embeddings": true,
  "transformers_version": "5.2.0",
  "type_vocab_size": 2,
  "use_cache": false,
  "vocab_size": 30522
}
model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:912ab7a7e426b316b960b61f3dfca4590dd4638b79e684ea7de8868eaee97d82
size 19261424
tokenizer.json
ADDED
The diff for this file is too large to render. See raw diff.
tokenizer_config.json
ADDED
@@ -0,0 +1,17 @@
{
  "backend": "tokenizers",
  "clean_up_tokenization_spaces": true,
  "cls_token": "[CLS]",
  "do_basic_tokenize": true,
  "do_lower_case": true,
  "is_local": false,
  "mask_token": "[MASK]",
  "model_max_length": 1000000000000000019884624838656,
  "never_split": null,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "unk_token": "[UNK]"
}
training_metadata.json
ADDED
@@ -0,0 +1,15 @@
{
  "trained_at": 1775446427.337274,
  "trained_at_readable": "2026-04-06 10:33:47",
  "samples_this_session": 100,
  "new_rows_this_session": 100,
  "trained_rows_total": 0,
  "total_db_rows": 349936,
  "percentage": 0.0,
  "final_loss": 0,
  "epochs": 3,
  "learning_rate": 5e-05,
  "batch_size": 16,
  "stride": 32,
  "max_length": 512
}