CASSANDRA — ASL configuration on AnnoCTR (regression case)

Fine-tuned CTI-BERT models for extracting MITRE ATT&CK techniques from cyber threat intelligence (CTI) reports. This repository contains the ASL configuration of the CASSANDRA recipe trained on AnnoCTR (118 ATT&CK techniques), comprising 3 ensemble members trained with seeds {42, 123, 456}.

Note: Unlike on TRAM2, the ASL configuration underperforms the BCE configuration on AnnoCTR. This is the regression case from the paper's label-density transfer analysis (§3.2, RQ4 results). For deployment on AnnoCTR-like sparse benchmarks, use the BCE configuration: cassandra-bce-annoctr.

Anonymous artifact for double-blind peer review. Author information will be added after the review period.

Headline result

On the AnnoCTR test set (33 scored documents):

  • 3-seed ensemble per-document F1 (τ=0.5): 60.17% (a 3.36-point regression vs the BCE configuration)
  • 3-seed ensemble per-document F1 (dev-tuned τ=0.69): 61.27% (a 2.26-point regression)
  • BCE configuration on the same benchmark: 63.53% (cassandra-bce-annoctr) — preferred for deployment

This regression is the label-density transfer effect analyzed in the paper (§3.2, RQ4 results): on AnnoCTR's 118-technique long tail, with a mean of 15.5 samples per train-present technique, ASL's aggressive easy-negative suppression also starves genuinely rare positive techniques of training signal.

Full per-seed and ensemble metrics are in results.json.
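
For readers who want to sanity-check the numbers above, here is a minimal sketch of one way a per-document F1 could be computed. It assumes (the card does not confirm this) that sentence-level predictions are unioned into one technique set per document, that F1 is computed per document against the gold set, and that the scores are averaged over documents with a non-empty gold set. The function names and toy IDs are illustrative only; results.json remains the authoritative source for the reported metrics.

```python
def document_f1(predicted, gold):
    """F1 between two sets of technique IDs for a single document."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


def per_document_f1(sentence_predictions, gold_by_doc):
    """Average document-level F1 over documents with non-empty gold sets.

    sentence_predictions: {doc_id: [set of technique IDs per sentence, ...]}
    gold_by_doc:          {doc_id: set of gold technique IDs}
    Skipping empty-gold documents is an assumption that mirrors the one
    excluded test report mentioned in this card.
    """
    scores = []
    for doc_id, gold in gold_by_doc.items():
        if not gold:
            continue
        # Union the sentence-level predictions into one set per document.
        predicted = set().union(*sentence_predictions.get(doc_id, []))
        scores.append(document_f1(predicted, gold))
    return sum(scores) / len(scores) if scores else 0.0
```

Under these assumptions, the 33 scored documents would each contribute one F1 value to the average.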

Why include this configuration?

The CASSANDRA paper's central finding is that the same training recipe transfers across benchmarks only when label density is sufficient. ASL helps on TRAM2 (mean ~82 samples/technique) and hurts on AnnoCTR (mean 15.5/technique). Releasing the AnnoCTR ASL weights makes this regression directly verifiable rather than reported-only.

If you want a deployable AnnoCTR classifier, use the BCE configuration linked above.

Architecture

LabelAttentionClassifier with asymmetric loss training:

  • Encoder: ibm-research/CTI-BERT (110M params, 768 hidden)
  • Head: 118 learned 768-dim label queries that attend over the encoder's last_hidden_state, followed by a shared 1-output linear layer applied per-label
  • Loss: Asymmetric Loss (Ridnik et al. 2021) with γ_neg=4, γ_pos=0, clip=0.05
  • Regularization / training tricks: layer-wise learning rate decay (α=0.85), exponential moving average (β=0.999), stochastic weight averaging (last 25% of epochs), per-seed best-of-{base, EMA, SWA} selection on validation macro-F1, multi-seed probability averaging at inference
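
To make the loss setting above concrete, the following is a small sketch of the asymmetric loss as formulated by Ridnik et al. (2021), using the hyperparameters listed in this card as defaults. It is written in plain Python for a single example and is not the training code from this repository; the function name, the `eps` guard, and the sum reduction are assumptions.

```python
import math


def asymmetric_loss(probs, targets, gamma_neg=4.0, gamma_pos=0.0, clip=0.05, eps=1e-8):
    """Sum of per-label ASL terms for one example.

    probs:   sigmoid outputs in [0, 1], one per label
    targets: 0/1 ground-truth indicators, one per label
    """
    total = 0.0
    for p, y in zip(probs, targets):
        if y == 1:
            # Positive term: focal weight (1 - p)^gamma_pos; with gamma_pos=0
            # this reduces to the ordinary log loss on positives.
            total += -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
        else:
            # Negative term: shift the probability down by `clip`, so negatives
            # with p <= clip contribute exactly zero loss, then down-weight the
            # remaining easy negatives with the factor p_shifted^gamma_neg.
            p_shifted = max(p - clip, 0.0)
            total += -(p_shifted ** gamma_neg) * math.log(max(1.0 - p_shifted, eps))
    return total
```

For a negative label predicted at p=0.55, this sketch yields a loss of roughly 0.043, compared with about 0.80 under plain binary cross-entropy, which illustrates how strongly γ_neg=4 suppresses negatives.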

The architecture is custom (not derived from transformers.PreTrainedModel), so loading requires the modeling.py file shipped with this repo.
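
As a rough illustration of the head described in the list above, the sketch below computes one logit per label from a sequence of token vectors. It assumes single-head scaled dot-product attention with no extra key/value projections, which the card does not specify; modeling.py is the authoritative implementation, and all names here are hypothetical. Plain Python lists are used only to keep the example dependency-free.

```python
import math


def label_attention_logits(hidden_states, label_queries, weights, bias):
    """Return one logit per label for a single input sequence.

    hidden_states: T token vectors of size H (stand-in for last_hidden_state)
    label_queries: L learned query vectors of size H (L=118, H=768 in this card)
    weights, bias: the shared H -> 1 linear layer applied to every label
    """
    scale = math.sqrt(len(weights))
    logits = []
    for query in label_queries:
        # Scaled dot-product score of this label's query against every token.
        scores = [sum(q * h for q, h in zip(query, token)) / scale
                  for token in hidden_states]
        # Numerically stable softmax over tokens.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        norm = sum(exps)
        attention = [e / norm for e in exps]
        # Label-specific pooled vector: attention-weighted average of tokens.
        pooled = [sum(a * token[d] for a, token in zip(attention, hidden_states))
                  for d in range(len(weights))]
        # The shared linear layer maps the pooled vector to a single logit.
        logits.append(sum(w * x for w, x in zip(weights, pooled)) + bias)
    return logits


def sigmoid(x):
    """Convert a logit into a per-label probability."""
    return 1.0 / (1.0 + math.exp(-x))
```

Because every label has its own query but shares the output layer, each technique can focus on different tokens of the same sentence while the parameter count of the head stays small.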

Training data

  • AnnoCTR: 104 reports, 5,265 sentences, 118 canonical ATT&CK techniques (113 train-present, 5 unobserved at training but present in test). Mean of 15.5 deduplicated positive examples per train-present technique. 78 of 113 train-present techniques have fewer than 10 positive examples.
  • Splits: report-level train/test split from Buchel et al. (2025): 70 train reports and 34 test reports. One test report is excluded from per-document F1 because its in-vocabulary ground truth is empty, leaving 33 scored documents.
  • Validation: 80:20 sentence-level random split within the training reports for early stopping and threshold selection.
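
The validation split described above can be sketched as follows. This is a minimal illustration, assuming a seeded shuffle over all training sentences; the function name, the seed value, and the rounding rule are hypothetical and not taken from the training code.

```python
import random


def split_sentences(sentences, val_fraction=0.2, seed=42):
    """Shuffle training sentences and hold out `val_fraction` for validation.

    Returns (train_sentences, val_sentences). The split is at sentence level,
    so sentences from the same report may land on both sides.
    """
    shuffled = list(sentences)             # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # local RNG keeps the split reproducible
    n_val = int(round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]
```

Only the training reports would be passed to such a function; the test reports stay separate under the report-level split.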

Intended use

Primarily as a reproducibility artifact for the paper's ASL-on-AnnoCTR regression analysis. For practical AnnoCTR deployment, prefer the BCE configuration.

Limitations:

  • ASL's easy-negative suppression is mistuned for AnnoCTR's sparsity; rare-technique predictions are noisier than under BCE training.
  • 118-label vocabulary is the canonical AnnoCTR set; sentences describing techniques outside this set produce all-zero predictions.
  • Trained on English-language CTI.

How to load and run

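import glob
import os

# modeling.py ships with this repo and must be importable (for example, run from the repo root).
from modeling import load_ensemble, predict_ensemble

# Collect the per-seed checkpoint directories (seeds/seed-*) next to this script.
# __file__ is only defined when this runs as a script; in a notebook or REPL, substitute the repo path.
seed_dirs = sorted(glob.glob(os.path.join(os.path.dirname(__file__), "seeds", "seed-*")))
seeds = load_ensemble(seed_dirs, device="cuda")

sentences = [
    "The malware uses Windows Command Shell to execute encoded scripts.",
    "After initial access, persistence was established via Registry Run Keys.",
]

# threshold=0.5 is the default operating point; 0.69 is the dev-tuned value reported in this card.
results = predict_ensemble(seeds, sentences, threshold=0.5)
for sentence, techniques in results:
    print(sentence, "->", techniques)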

A complete CLI example is in inference_example.py.
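
For intuition about what multi-seed probability averaging involves, here is a minimal sketch for a single sentence. It is not the body of predict_ensemble from modeling.py; the function name, the input layout, and the use of `>=` at the threshold are assumptions.

```python
def average_and_threshold(member_probs, label_names, threshold=0.5):
    """Average per-label probabilities across ensemble members, then threshold.

    member_probs: one list of per-label probabilities per ensemble member
    label_names:  technique IDs in the same order as the probabilities
    """
    n_members = len(member_probs)
    # zip(*...) regroups the values so each tuple holds one label's
    # probability from every member.
    mean_probs = [sum(column) / n_members for column in zip(*member_probs)]
    return [name for name, p in zip(label_names, mean_probs) if p >= threshold]


# Toy usage with three members and two labels (values are made up).
probs = [[0.9, 0.2], [0.6, 0.4], [0.3, 0.6]]
labels = ["T1059.003", "T1547.001"]
print(average_and_threshold(probs, labels, threshold=0.5))
print(average_and_threshold(probs, labels, threshold=0.69))
```

Averaging probabilities rather than votes lets one confident member offset two weakly negative ones, and raising the threshold trades recall for precision.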

Per-seed members

Seed                             Per-document F1    Selected weights
42 (τ=0.5)                       55.90%             base
123 (τ=0.5)                      58.80%             base
456 (τ=0.5)                      60.16%             base
3-seed ensemble (τ=0.5)          60.17%             n/a
3-seed ensemble (dev-τ=0.69)     61.27%             n/a

Notable: all three seeds selected base weights over EMA and SWA on validation macro-F1, consistent with ASL's regularization being unsuitable for this label-density regime.
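
The dev-tuned threshold in the table comes from selecting τ on the validation split. The sketch below shows one simple way such a sweep could work, assuming the selection metric is micro-averaged F1 over sentence-level label decisions; the metric actually used for tuning is not stated in this card, and the function name and grid are hypothetical.

```python
def tune_threshold(dev_probs, dev_gold, candidates):
    """Pick the candidate threshold with the highest micro-F1 on dev data.

    dev_probs:  per-sentence lists of ensemble-averaged label probabilities
    dev_gold:   per-sentence lists of 0/1 gold indicators (same shape)
    candidates: thresholds to try, e.g. [i / 100 for i in range(5, 96)]
    Ties are resolved in favor of the earliest candidate.
    """
    best_tau, best_f1 = candidates[0], -1.0
    for tau in candidates:
        tp = fp = fn = 0
        for probs, gold in zip(dev_probs, dev_gold):
            for p, y in zip(probs, gold):
                predicted = p >= tau
                if predicted and y:
                    tp += 1
                elif predicted and not y:
                    fp += 1
                elif y:
                    fn += 1
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_tau, best_f1 = tau, f1
    return best_tau, best_f1
```

The selected threshold would then be applied unchanged to the test set, as with the τ=0.69 row above.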

Citation

@misc{cassandra2026,
  title  = {CASSANDRA: How Many Parameters Suffice to Automate TTP Extractions from CTI Reports---Pushing Towards the Lower Bound},
  author = {{Anonymous Authors}},
  year   = {2026},
  note   = {Anonymous submission under review}
}

Please also cite the AnnoCTR dataset, the CTI-BERT encoder, and the asymmetric-loss work (Ridnik et al. 2021).

License

Apache-2.0. These fine-tuned weights are derived from ibm-research/CTI-BERT.
