CYB002 Baseline Classifier

MITRE ATT&CK kill-chain phase classifier trained on the CYB002 synthetic cyber attack sample. Predicts which of 10 kill-chain phases an attack event belongs to, using observable event-level and segment-level features.

Baseline reference, not for production use. This model demonstrates that the CYB002 sample dataset is learnable end-to-end and gives prospective buyers a working starting point. It is not a production threat detector or SOC tool. See Limitations.

Model overview

Property	Value
Task	10-class kill-chain phase classification
Training data	`xpertsystems/cyb002-sample` (4,353 attack events across 100 campaigns)
Models	XGBoost + PyTorch MLP
Input features	90 (after one-hot encoding)
Split	Group-aware by campaign_id (disjoint train/val/test campaigns)
License	CC-BY-NC-4.0 (matches dataset)
Status	Reference baseline

Two model artifacts are published. They are designed to be used together — disagreement is a useful triage signal:

model_xgb.json — gradient-boosted trees, primary recommendation
model_mlp.safetensors — PyTorch MLP in SafeTensors format

Quick start

pip install xgboost torch safetensors pandas huggingface_hub

from huggingface_hub import hf_hub_download
import json, numpy as np, torch, xgboost as xgb
from safetensors.torch import load_file

REPO = "xpertsystems/cyb002-baseline-classifier"

paths = {n: hf_hub_download(REPO, n) for n in [
    "model_xgb.json", "model_mlp.safetensors",
    "feature_engineering.py", "feature_meta.json", "feature_scaler.json",
]}

import sys, os
sys.path.insert(0, os.path.dirname(paths["feature_engineering.py"]))
from feature_engineering import (
    transform_single, load_meta, INT_TO_LABEL, build_segment_lookup
)

meta = load_meta(paths["feature_meta.json"])
xgb_model = xgb.XGBClassifier(); xgb_model.load_model(paths["model_xgb.json"])

# Build the segment-aggregate lookup from the dataset's topology CSV
seg_lookup = build_segment_lookup("path/to/network_topology.csv")

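# Predict (see inference_example.ipynb for the full pattern).
# my_event is not defined in this snippet: supply one raw event record
# (dict-like, keyed by the dataset's actual column names).
seg_agg = seg_lookup.get(my_event["target_segment_id"], {})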
X = transform_single(my_event, meta, segment_aggregates=seg_agg)
proba = xgb_model.predict_proba(X)[0]
print(INT_TO_LABEL[int(np.argmax(proba))])

See inference_example.ipynb for an end-to-end copy-paste demo including segment-aggregate setup and batch prediction.

Training data

Trained on the public sample of CYB002, 4,353 attack events from 100 distinct campaigns:

Phase	Train (n=2,822)	Test (n=726)	Test share
`dwell_idle`	581	141	19.4%
`reconnaissance`	411	112	15.4%
`initial_access`	358	106	14.6%
`execution`	324	74	10.2%
`persistence`	287	79	10.9%
`privilege_escalation`	249	68	9.4%
`lateral_movement`	201	54	7.4%
`collection`	162	40	5.5%
`exfiltration`	113	31	4.3%
`impact`	105	21	2.9%

Group-aware split

A single campaign generates ~40 highly-correlated events. Random row-level splitting would put events from the same campaign in both train and test, inflating metrics in a way that does not generalize to new campaigns.

This release uses GroupShuffleSplit by campaign_id:

Fold	Campaigns	Events
Train	69	2,822
Validation	16	805
Test	15	726
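
The idea behind the group-aware split can be sketched without any dependencies. This is an illustration only, not the release's training code (which uses scikit-learn's GroupShuffleSplit): it shuffles campaign IDs, cuts them 70/15/15, and sends every event to the fold of its campaign. The `split_by_group` helper and the toy campaign IDs are hypothetical.

```python
import random

def split_by_group(groups, fractions=(0.70, 0.15, 0.15), seed=42):
    """Assign each row to train/val/test so that no group spans two folds."""
    unique = sorted(set(groups))
    random.Random(seed).shuffle(unique)
    n_train = round(fractions[0] * len(unique))
    n_val = round(fractions[1] * len(unique))
    fold_of = {}
    for i, group in enumerate(unique):
        if i < n_train:
            fold_of[group] = "train"
        elif i < n_train + n_val:
            fold_of[group] = "val"
        else:
            fold_of[group] = "test"
    return [fold_of[group] for group in groups]

# Toy data: 100 hypothetical campaigns with 40 events each.
campaign_ids = [f"camp_{c:03d}" for c in range(100) for _ in range(40)]
folds = split_by_group(campaign_ids)

train_campaigns = {c for c, f in zip(campaign_ids, folds) if f == "train"}
val_campaigns = {c for c, f in zip(campaign_ids, folds) if f == "val"}
test_campaigns = {c for c, f in zip(campaign_ids, folds) if f == "test"}

# No campaign appears in more than one fold.
print(len(train_campaigns), len(val_campaigns), len(test_campaigns))
```

Because the cut is made on campaigns rather than rows, the event counts per fold depend on how many events each selected campaign contains, which is why the fold sizes in the table are not exact 70/15/15 shares of the events.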

All test campaigns are completely unseen during training. Class imbalance is addressed with 'balanced' class weights, passed to XGBoost as per-row sample_weight and to the MLP as a weighted cross-entropy loss.
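
As a rough illustration of what balanced weighting does, the sketch below applies the usual 'balanced' heuristic, weight = n_samples / (n_classes × class count), to the per-class training counts listed in the Training data table. The exact weights used in training are not published, so treat the numbers as indicative; `balanced_class_weights` and the three example labels are hypothetical.

```python
# Per-class training counts copied from the Training data table.
train_counts = {
    "dwell_idle": 581, "reconnaissance": 411, "initial_access": 358,
    "execution": 324, "persistence": 287, "privilege_escalation": 249,
    "lateral_movement": 201, "collection": 162, "exfiltration": 113,
    "impact": 105,
}

def balanced_class_weights(counts):
    """weight_c = n_samples / (n_classes * n_c), the 'balanced' heuristic."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}

weights = balanced_class_weights(train_counts)

# Per-row weights: each row receives the weight of its own class.
labels = ["impact", "dwell_idle", "reconnaissance"]  # hypothetical rows
sample_weight = [weights[y] for y in labels]

for phase, w in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{phase:22s} {w:.3f}")
```

Under this heuristic the rarest class (impact) is weighted several times more heavily than the most common one (dwell_idle), which is consistent with the trade-off described under Limitations.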

Feature pipeline

The bundled feature_engineering.py is the canonical feature recipe.

Three columns are deliberately excluded because they leak the target:

technique_id: 62 of 63 ATT&CK techniques each map to exactly one phase, so the column all but reveals the label. Including it gives perfect-looking metrics that mean nothing.
technique_name: 1:1 alias of technique_id (63 unique values each).
tactic_category: direct alias of kill_chain_phase.
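
A simple way to spot this kind of leakage in your own pipeline is to count, for each value of a candidate column, how many distinct labels it co-occurs with. The sketch below is a minimal, dependency-free version of that check; the helper names and the toy rows are hypothetical and not taken from the dataset.

```python
from collections import defaultdict

def phases_per_value(rows, column, target="kill_chain_phase"):
    """Map each value of `column` to the set of target labels it co-occurs with."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[column]].add(row[target])
    return dict(seen)

def leakage_ratio(rows, column, target="kill_chain_phase"):
    """Fraction of distinct values in `column` that pin down a single label."""
    mapping = phases_per_value(rows, column, target)
    pure = sum(1 for labels in mapping.values() if len(labels) == 1)
    return pure / len(mapping)

# Toy rows for illustration only.
rows = [
    {"technique_id": "T1595", "protocol": "tcp", "kill_chain_phase": "reconnaissance"},
    {"technique_id": "T1595", "protocol": "udp", "kill_chain_phase": "reconnaissance"},
    {"technique_id": "T1566", "protocol": "tcp", "kill_chain_phase": "initial_access"},
    {"technique_id": "T1021", "protocol": "tcp", "kill_chain_phase": "lateral_movement"},
]

technique_ratio = leakage_ratio(rows, "technique_id")  # every technique has one phase
protocol_ratio = leakage_ratio(rows, "protocol")       # tcp spans several phases
print(technique_ratio, protocol_ratio)
```

A ratio near 1.0 on a column with many distinct values is a strong hint that the column encodes the label and should be excluded.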

90 features survive after encoding, drawn from:

Event-level numeric (10): timestep, dest_port, bytes_transferred, connection_duration_s, auth_failure_count, process_injection_flag, lateral_hop_count, c2_beacon_interval_s, edr_blocked_flag, siem_rule_triggered
Event-level categorical (7, one-hot encoded): target_asset_type, source_ip_class, protocol, attacker_capability_tier, defender_maturity_level, alert_severity, detection_outcome
Segment-level topology aggregates (13): mean patch_lag_days, mean exposure_score, max vulnerability_count, fraction with EDR/SIEM/NDR/MFA coverage, mean MTTD / MTTR baselines, plus segment_type and defender_maturity_level (segment-constant)
Engineered (6): byte_volume_log, has_c2_beacon, is_brute_forcing, attacker_defender_advantage, is_high_volume, is_privileged_port

None of the engineered features is derived from phase or technique — that would re-introduce the leakage we just excluded.
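
To make the constraint concrete, here is a sketch of how four of the engineered features could be derived from raw event fields alone. The definitions and the brute-force threshold are guesses made for illustration; the canonical definitions are in feature_engineering.py and may differ.

```python
import math

def engineer(event):
    """Derive illustrative features from raw event fields only (no phase, no technique)."""
    return {
        # Assumed definition: log-compress the heavy-tailed byte counts.
        "byte_volume_log": math.log1p(event["bytes_transferred"]),
        # Assumed definition: a positive beacon interval means C2 beaconing was observed.
        "has_c2_beacon": int(event["c2_beacon_interval_s"] > 0),
        # Assumed threshold of 5 failures; the real cutoff is not documented here.
        "is_brute_forcing": int(event["auth_failure_count"] >= 5),
        # Ports below 1024 are the conventional privileged range.
        "is_privileged_port": int(event["dest_port"] < 1024),
    }

# Hypothetical event record.
example_event = {
    "bytes_transferred": 1_500_000,
    "c2_beacon_interval_s": 60.0,
    "auth_failure_count": 7,
    "dest_port": 22,
}
features = engineer(example_event)
print(features)
```

Every input here is an observable event field, so none of these derived columns can carry the label back in through the side door.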

Note on detection-outcome features

detection_outcome, alert_severity, edr_blocked_flag, and siem_rule_triggered are post-hoc observables from the SOC's perspective. They are kept as features for the realistic use case where a SOC analyst has just seen an action and its initial detection signal and is reasoning about which phase the campaign is in. Buyers who want a strictly pre-detection model can drop these four columns and retrain — the ablation results below show this does not hurt accuracy (the model doesn't lean on them for phase prediction).

Evaluation

Test-set metrics (n = 726 events from 15 disjoint campaigns)

XGBoost

Metric	Value
Macro ROC-AUC (OvR)	0.8599
Accuracy	0.4683
Macro-F1	0.4255
Weighted-F1	0.4604

MLP

Metric	Value
Macro ROC-AUC (OvR)	0.8496
Accuracy	0.4449
Macro-F1	0.3911
Weighted-F1	0.4350

Headline interpretation

Accuracy of 47% looks low at first glance, but the right comparison is:

Baseline	Accuracy	Macro-F1
Random uniform guess (1/10 classes)	0.10	~0.10
Always predict majority (`dwell_idle`)	0.19	n/a
XGBoost (this model)	0.47	0.43

The macro ROC-AUC of 0.86 tells the cleaner story: in one-vs-rest terms the model usually scores the true phase above the others, even though the argmax prediction sometimes lands on an adjacent phase.
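
The two naive accuracy baselines in the table follow directly from the test-set class counts. The short sketch below reproduces them from the counts in the Training data table; the variable names are illustrative.

```python
# Test-set counts per phase, copied from the Training data table.
test_counts = {
    "dwell_idle": 141, "reconnaissance": 112, "initial_access": 106,
    "execution": 74, "persistence": 79, "privilege_escalation": 68,
    "lateral_movement": 54, "collection": 40, "exfiltration": 31,
    "impact": 21,
}

n_test = sum(test_counts.values())

# A uniform random guess is right with probability 1/K regardless of class balance.
uniform_acc = 1 / len(test_counts)

# Always predicting the most frequent class is right on exactly that class's share.
majority_class = max(test_counts, key=test_counts.get)
majority_acc = test_counts[majority_class] / n_test

xgb_acc = 0.4683  # reported test accuracy
print(f"uniform={uniform_acc:.2f} majority={majority_acc:.2f} xgboost={xgb_acc:.2f}")
print(f"lift over majority baseline: {xgb_acc / majority_acc:.1f}x")
```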

Per-class F1 — where the signal is and isn't

Phase	XGBoost F1	MLP F1	Note
`reconnaissance`	0.753	0.725	Strong: early timestep, distinct protocols/targets
`lateral_movement`	0.742	0.783	Strong: lateral-hop count, post-privesc pattern
`initial_access`	0.647	0.648	Strong: perimeter targets, specific protocols
`privilege_escalation`	0.500	0.488	Moderate
`execution`	0.441	0.510	Moderate
`persistence`	0.413	0.301	Moderate, easily confused with execution
`exfiltration`	0.273	0.119	Weak: late-phase, similar to collection/impact
`impact`	0.226	0.132	Weak: late-phase clustering
`collection`	0.220	0.191	Weak: late-phase clustering
`dwell_idle`	0.040	0.013	Very weak: no-op steps lack distinguishing features

The model has solid signal on early and mid-campaign phases and genuinely struggles to disambiguate late-stage objective-completion phases (collection / exfiltration / impact), which arrive close in time and look similar at the event level. This is an honest limitation of flat-tabular classification — sequence models would help here.
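
Macro-F1 is the unweighted mean of the per-class F1 scores, so the weak classes pull it down as much as the strong ones lift it. The following sketch recomputes it from the rounded values in the table above; small differences from the reported figures are rounding.

```python
# Per-class F1 scores copied from the table above.
xgb_f1 = {
    "reconnaissance": 0.753, "lateral_movement": 0.742, "initial_access": 0.647,
    "privilege_escalation": 0.500, "execution": 0.441, "persistence": 0.413,
    "exfiltration": 0.273, "impact": 0.226, "collection": 0.220,
    "dwell_idle": 0.040,
}
mlp_f1 = {
    "reconnaissance": 0.725, "lateral_movement": 0.783, "initial_access": 0.648,
    "privilege_escalation": 0.488, "execution": 0.510, "persistence": 0.301,
    "exfiltration": 0.119, "impact": 0.132, "collection": 0.191,
    "dwell_idle": 0.013,
}

def macro_f1(per_class):
    """Unweighted mean of per-class F1: every class counts equally."""
    return sum(per_class.values()) / len(per_class)

xgb_macro = macro_f1(xgb_f1)
mlp_macro = macro_f1(mlp_f1)
print(f"XGBoost macro-F1 = {xgb_macro:.4f}, MLP macro-F1 = {mlp_macro:.4f}")
```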

Ablation: which feature groups matter

Configuration	Accuracy	Macro-F1	Δ accuracy vs full
Full feature set (published)	0.4683	0.4255	—
No `timestep`	0.3264	0.3102	−0.1419
No topology aggregates	0.4601	0.4093	−0.0083
No engineered features	0.4642	0.4240	−0.0041
No detection-signal features	0.4725	0.4284	+0.0041

Two clear findings:

timestep is by far the most important feature: accuracy drops 14 pp when it is removed. The honest reading: kill chains progress in time, and where you are in the campaign timeline carries most of the phase signal.
Detection-signal features (detection_outcome, alert_severity, edr_blocked_flag, siem_rule_triggered) do not help phase prediction. Removing them actually improves the score marginally. A buyer who wants a pre-detection model can drop these four columns with no loss.

Topology aggregates contribute about 0.8 pp of accuracy and engineered features about 0.4 pp.
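
To put the ablation deltas in perspective, it can help to convert them into percentage points and into an approximate number of test events. The sketch below does this from the rounded accuracies in the table and the test-set size of 726, so the event counts are approximate.

```python
n_test = 726
full_acc = 0.4683

# Accuracy of each ablated configuration, copied from the table above.
ablations = {
    "no timestep": 0.3264,
    "no topology aggregates": 0.4601,
    "no engineered features": 0.4642,
    "no detection-signal features": 0.4725,
}

# Change in accuracy, in percentage points, relative to the full feature set.
delta_pp = {name: round((acc - full_acc) * 100, 1) for name, acc in ablations.items()}

# The same change expressed as a number of test events gained or lost.
delta_events = {name: round((acc - full_acc) * n_test) for name, acc in ablations.items()}

for name in ablations:
    print(f"{name:30s} {delta_pp[name]:+6.1f} pp  ({delta_events[name]:+d} events)")
```

Seen this way, every configuration except the one without timestep differs from the full model by only a handful of test events, so those smaller deltas should be read as indicative rather than precise.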

Architecture

XGBoost: multi-class gradient boosting (multi:softprob, 10 classes), hist tree method, class-balanced sample weights, early stopping on validation mlogloss.

MLP: 90 → 128 → 64 → 10, each hidden layer followed by BatchNorm1d → ReLU → Dropout(0.3), weighted cross-entropy loss, AdamW optimizer, early stopping on validation macro-F1.
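
The MLP description above can be read as the following PyTorch module. This is a sketch built from that description alone: the class name `BaselineMLP` and the layer naming are assumptions, so its state-dict keys may not match those stored in model_mlp.safetensors.

```python
import torch
from torch import nn

class BaselineMLP(nn.Module):
    """90 -> 128 -> 64 -> 10, with BatchNorm1d -> ReLU -> Dropout after each hidden layer."""

    def __init__(self, n_features=90, n_classes=10, dropout=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 128),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(128, 64),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(64, n_classes),  # raw logits; softmax is applied outside
        )

    def forward(self, x):
        return self.net(x)

# eval() switches dropout off and makes BatchNorm use its running statistics.
model = BaselineMLP().eval()
with torch.no_grad():
    logits = model(torch.randn(4, 90))    # 4 random rows standing in for scaled features
    proba = torch.softmax(logits, dim=1)  # per-class probabilities, one row per input

n_params = sum(p.numel() for p in model.parameters())
print(tuple(proba.shape), n_params)
```

With these layer sizes the network is small (about 21k trainable parameters), which fits the note under Limitations that it was trained on only ~2.8k events.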

Training hyperparameters (learning rate, batch size, n_estimators, early-stopping patience, weight decay, class-weighting strategy) are held internally by XpertSystems and are not part of this release.

Limitations

This is a baseline reference, not a production threat detection system.

Late-phase confusion. XGBoost per-class F1 for collection, exfiltration, and impact is 0.22–0.27 (MLP: 0.12–0.19). These phases arrive near campaign-end with similar feature signatures, and a flat-tabular event-level model can't easily disambiguate them. Sequence models (LSTM / transformer over the per-campaign event sequence) would substantially improve this.
dwell_idle is essentially unlearnable in this framing. The class-balanced weights amplify rare classes; dwell_idle is common but featureless ("no action this timestep"), so the model trades dwell_idle recall for late-phase recall. XGBoost F1 on dwell_idle is 0.04 (MLP: 0.013). A real SOC pipeline would handle idle steps with a separate gating rule, not a classifier head.
Sample-size constraints. 100 campaigns / 4,353 events with a group-aware split leaves 69 training campaigns. The full 380k-event CYB002 product supports much more reliable per-class estimation, especially on the rare late-phase classes.
Synthetic-vs-real transfer. The dataset is synthetic and calibrated to threat-intelligence benchmark targets (Mandiant M-Trends, IBM CODB, Verizon DBIR, MITRE ATT&CK Evaluations). Real attack telemetry has different noise characteristics, adversary adaptation, and gaps in coverage. Do not assume metrics transfer.
Adversarial robustness not evaluated. The dataset is not adversarially generated; the model has not been red-teamed.
MLP brittleness on OOD inputs. With ~2.8k training events, the MLP can produce confidently-wrong predictions on hand-crafted records far from the training manifold. XGBoost is more robust. Use both; treat disagreement as a signal for human review.
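
One possible way to turn model disagreement into a review flag is sketched below. It assumes you already have per-class probability arrays from both models for the same rows; the `triage` helper and the `min_confidence` threshold are hypothetical choices, not part of this release.

```python
import numpy as np

def triage(proba_xgb, proba_mlp, min_confidence=0.5):
    """Return XGBoost predictions plus a mask of rows that merit human review.

    A row is flagged when the two models pick different classes, or when
    the XGBoost top probability falls below `min_confidence`.
    """
    proba_xgb = np.asarray(proba_xgb, dtype=float)
    proba_mlp = np.asarray(proba_mlp, dtype=float)
    pred_xgb = proba_xgb.argmax(axis=1)
    pred_mlp = proba_mlp.argmax(axis=1)
    disagree = pred_xgb != pred_mlp
    low_confidence = proba_xgb.max(axis=1) < min_confidence
    return pred_xgb, disagree | low_confidence

# Toy 3-class probabilities for three events (illustrative values only).
proba_xgb = [[0.70, 0.20, 0.10], [0.40, 0.35, 0.25], [0.10, 0.80, 0.10]]
proba_mlp = [[0.60, 0.30, 0.10], [0.50, 0.30, 0.20], [0.60, 0.30, 0.10]]

pred, needs_review = triage(proba_xgb, proba_mlp)
print(pred.tolist(), needs_review.tolist())
```

In the toy data the first event passes, the second is flagged for low confidence, and the third is flagged because the models disagree.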

Notes on dataset schema

The CYB002 sample dataset README describes some fields differently from the actual schema. The model was trained on the actual schema; this note is to help buyers reconcile what they read with what they receive.

What the README says	What the data actually contains
"9 ATT&CK phases"	10 phases including `dwell_idle` (idle/no-op steps)
4 attacker tiers: `opportunistic`, `organized_crime`, `apt`, `nation_state`	4 tiers: `opportunistic`, `script_kiddie`, `apt`, `nation_state`
5 defender maturity levels: CMMI names (`ad_hoc`, `defined`, `managed`, `quantitatively_managed`, `optimizing`)	5 levels: `minimal`, `baseline`, `managed`, `advanced`, `zero_trust`
Field name `phase`	Actual column: `kill_chain_phase`
Field name `tactic`	Actual column: `tactic_category`
Field name `segment_id`	Actual column: `target_segment_id`
Field name `attacker_tier`	Actual column: `attacker_capability_tier`
Field name `defender_maturity`	Actual column: `defender_maturity_level`
Field name `detected`, `blocked`, `stealth_score`	Actual: `detection_outcome`, `edr_blocked_flag`, `siem_rule_triggered`; no `stealth_score` on events

None of this affects model correctness — feature_engineering.py uses the actual column names. If you build your own pipeline against the dataset, use the actual columns, not the README descriptions.
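
If you already have code written against the README field names, a small rename map may be enough to bridge the gap. The sketch below covers only the five unambiguous renames from the table above; the detection fields are left out because the table does not pair them one-to-one. `to_actual_schema` and the example record are hypothetical.

```python
# README-style field name -> actual column name, from the table above.
README_TO_ACTUAL = {
    "phase": "kill_chain_phase",
    "tactic": "tactic_category",
    "segment_id": "target_segment_id",
    "attacker_tier": "attacker_capability_tier",
    "defender_maturity": "defender_maturity_level",
}

def to_actual_schema(record):
    """Return a copy of `record` with README-style keys renamed to the actual columns."""
    return {README_TO_ACTUAL.get(key, key): value for key, value in record.items()}

# Hypothetical record keyed by README-style names.
record = {"phase": "execution", "segment_id": "seg_07", "dest_port": 443}
converted = to_actual_schema(record)
print(converted)
```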

Intended use

Evaluating fit of the CYB002 dataset for your ATT&CK / kill-chain research
Baseline reference for new model architectures (especially sequence models, which should beat this baseline on the late-phase classes)
Teaching and demo for tabular classification on attack-event data
Feature engineering reference for MITRE ATT&CK-aligned datasets

Out-of-scope use

Production threat detection on real network telemetry
SOC alert triage on real systems
Forensic attribution of real attacks
Adversarial-evasion evaluation (dataset not adversarially generated)
Any safety-critical or operational security decision

Reproducibility

Outputs above were produced with seed = 42 and a group-aware nested GroupShuffleSplit by campaign_id (nominal 70/15/15, which yields 69/16/15 campaigns) on the published sample (xpertsystems/cyb002-sample, version 1.0.0, generated 2026-05-16). The feature pipeline in feature_engineering.py is deterministic and the trained weights in this repo correspond exactly to the metrics above.

The training script itself is private to XpertSystems. The published artifacts contain the feature pipeline, model weights, scaler, metadata, and validation results — sufficient to reproduce inference but not training.

Files in this repo

File	Purpose
`model_xgb.json`	XGBoost weights
`model_mlp.safetensors`	PyTorch MLP weights
`feature_engineering.py`	Feature pipeline (load → aggregate topology → engineer → encode)
`feature_meta.json`	Feature column order + categorical levels
`feature_scaler.json`	MLP input mean/std (XGBoost ignores)
`validation_results.json`	Per-class metrics, confusion matrix, architecture
`ablation_results.json`	Per-feature-group ablation (timestep, topology, engineered, detection-signals)
`inference_example.ipynb`	End-to-end inference demo notebook
`README.md`	This file

Contact and full product

The full CYB002 dataset contains ~454,000 rows across four files, with calibrated benchmark validation against 12 metrics drawn from authoritative threat intelligence sources (Mandiant, IBM, Verizon, CrowdStrike, MITRE, SANS, ENISA). The full XpertSystems.ai synthetic data catalogue spans 41 SKUs across Cybersecurity, Healthcare, Insurance & Risk, Oil & Gas, and Materials & Energy.

📧 pradeep@xpertsystems.ai
🌐 https://xpertsystems.ai
🗂 Dataset: https://huggingface.co/datasets/xpertsystems/cyb002-sample
🤖 Companion model (network traffic): https://huggingface.co/xpertsystems/cyb001-baseline-classifier

Citation

@misc{xpertsystems_cyb002_baseline_2026,
  title  = {CYB002 Baseline Classifier: XGBoost and MLP for MITRE ATT&CK Kill-Chain Phase Classification},
  author = {XpertSystems.ai},
  year   = {2026},
  url    = {https://huggingface.co/xpertsystems/cyb002-baseline-classifier},
  note   = {Baseline reference model trained on xpertsystems/cyb002-sample}
}
