ModernLogBERT (WGCE)
A ModernBERT encoder
fine-tuned to classify the severity level of a single log line into one of
six levels: TRACE, DEBUG, INFO, WARN, ERROR, FATAL.
This checkpoint is trained with a Weighted Generalized Cross-Entropy (WGCE)
objective (q = 0.7), a noise-tolerant loss designed to be more robust to
mislabeled training data. A sibling checkpoint trained with plain Weighted
Cross-Entropy is at
hazemkhaled-94/modernlogbert-wce.
Built with the log-lens
project, which also provides the Drain3 preprocessing pipeline these inputs
require (see "How to use").
Intended use
- Intended: triage and observability research, namely predicting or
sanity-checking log severity, and flagging entries whose predicted severity
disagrees with the emitted level as candidate anomalies.
- Out of scope: a sole source of truth for alerting or incident severity.
Aggregate accuracy hides brittle behavior on unfamiliar log formats, so keep a
human in the loop.
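The disagreement check described in the first bullet can be sketched as follows. This is a minimal illustration, not code from the log-lens project: the record layout, the `flag_disagreements` helper, and the `min_gap` rule are all assumptions made for the example, and the sample records are invented.

```python
# Severity labels in the order the model card lists them.
LEVELS = ["TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL"]
RANK = {level: i for i, level in enumerate(LEVELS)}


def flag_disagreements(records, min_gap=1):
    """Return (record, gap) pairs where the predicted level differs from the
    emitted level by at least `min_gap` steps (hypothetical rule)."""
    flagged = []
    for rec in records:
        gap = abs(RANK[rec["predicted"]] - RANK[rec["emitted"]])
        if gap >= min_gap:
            flagged.append((rec, gap))
    # Largest disagreements first, so a reviewer sees them at the top.
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Invented records: "emitted" is the level written by the application,
# "predicted" is the level returned by the classifier.
records = [
    {"text": "Connection refused after <NUM> retries", "emitted": "INFO", "predicted": "ERROR"},
    {"text": "Heartbeat ok from <IP>", "emitted": "INFO", "predicted": "INFO"},
    {"text": "Cache miss for key <UUID>", "emitted": "DEBUG", "predicted": "INFO"},
]

for rec, gap in flag_disagreements(records):
    print(f"{gap} step(s): emitted={rec['emitted']} predicted={rec['predicted']} | {rec['text']}")
```

Flagged entries are candidates for review only; a large gap may equally point to a model error on an unfamiliar format.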
How to use
Inputs must be Drain3-masked the same way as in training (variables
replaced by placeholders such as <NUM>, <IP>, <UUID>); raw text degrades
predictions. The log-lens
repo ships a ready-to-use Drain3 preprocessing pipeline that produces exactly
this masked form.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
repo = "hazemkhaled-94/modernlogbert-gce"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo).eval()
text = "Connection refused after <NUM> retries"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    pred = model(**inputs).logits.argmax(-1).item()
print(model.config.id2label[pred])
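To show what the masked form looks like, here is a rough regex-based approximation. It is not the log-lens Drain3 pipeline: the three patterns and the `mask_line` helper are assumptions made for illustration, and the real training configuration may mask more variable types or use different rules. For actual inference, use the pipeline from the log-lens repo.

```python
import re

# Illustrative rules only (assumed, not taken from the training config).
# Order matters: UUIDs and IPs are masked before bare numbers.
MASK_RULES = [
    ("<UUID>", re.compile(r"\b[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}\b")),
    ("<IP>", re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")),
    ("<NUM>", re.compile(r"\b\d+\b")),
]


def mask_line(line: str) -> str:
    """Replace variable tokens with placeholders, one rule at a time."""
    for placeholder, pattern in MASK_RULES:
        line = pattern.sub(placeholder, line)
    return line


print(mask_line("Connection from 10.0.0.5 refused after 3 retries"))
print(mask_line("req 123e4567-e89b-12d3-a456-426614174000 took 42 ms"))
```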
Training data
- In-distribution (train + held-out eval): a publicly available collection
of system log corpora (loghub),
preprocessed into a level-balanced, stratified sample.
- Out-of-distribution (evaluation only): a single private industrial
Kubernetes log deployment. Not released; used purely as an OOD
generalization probe.
Training procedure
| Hyperparameter | Value |
| --- | --- |
| Backbone | ModernBERT-base |
| Loss | Weighted Generalized Cross-Entropy (q = 0.7) |
| Epochs | 8 |
| Per-device batch size | 32 |
| Gradient accumulation | 4 (effective batch 128) |
| Learning rate | 1e-5 (separate LRs for head vs backbone) |
| Weight decay | 0.01 |
| Warmup ratio | 0.1 |
| Max sequence length | 512 |
| Best-model metric | macro F1 |
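For readers unfamiliar with the loss named in the table, the standard generalized cross-entropy for a true-class probability p is (1 - p^q) / q. It approaches ordinary cross-entropy as q goes to 0 and is bounded above by 1/q, which limits how much a single mislabeled example can contribute. The sketch below shows one plausible weighted form in plain Python. How this checkpoint applies its class weights is not stated in the card, so multiplying by the true class's weight is an assumption, as are the function names.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_gce(logits, target, class_weights, q=0.7):
    """Per-example weighted GCE: w_y * (1 - p_y**q) / q (weighting is assumed)."""
    p_true = softmax(logits)[target]
    return class_weights[target] * (1.0 - p_true ** q) / q


def cross_entropy(logits, target):
    """Ordinary cross-entropy, for comparison."""
    return -math.log(softmax(logits)[target])


# A confidently wrong prediction: the model favors class 4, the label says 2.
logits = [0.0, 0.0, 0.0, 0.0, 10.0, 0.0]
uniform = [1.0] * 6
print("cross-entropy:", cross_entropy(logits, target=2))
print("GCE (q=0.7):  ", weighted_gce(logits, target=2, class_weights=uniform))
```

On this example the cross-entropy grows with the size of the wrong logit, while the GCE term stays below 1/q; that bound is the source of the noise tolerance.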
Evaluation
In-distribution (held-out stratified slice)
| Metric | Value |
| --- | --- |
| Accuracy | 87.37% |
| Macro precision | 0.7368 |
| Macro recall | 0.7977 |
| Macro F1 | 0.7447 |
| Weighted F1 | 0.8884 |
| Mean confidence (all) | 95.57% |
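Macro F1 sits well below weighted F1 in the table. One common way to compute the two, assumed here because the card does not say which implementation was used, is to average per-class F1 either uniformly (macro) or by class support (weighted). A gap like this appears when less frequent classes score lower than frequent ones. The toy labels below are invented purely to show the mechanics.

```python
from collections import Counter


def per_class_f1(y_true, y_pred, labels):
    """F1 per label, computed as 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_and_weighted_f1(y_true, y_pred, labels):
    """Uniform average vs support-weighted average of per-class F1."""
    f1 = per_class_f1(y_true, y_pred, labels)
    support = Counter(y_true)
    macro = sum(f1.values()) / len(labels)
    weighted = sum(f1[label] * support[label] for label in labels) / len(y_true)
    return macro, weighted


# Invented data: a frequent class predicted well, a rare class predicted poorly.
y_true = ["INFO"] * 8 + ["FATAL"] * 2
y_pred = ["INFO"] * 8 + ["INFO", "FATAL"]
macro, weighted = macro_and_weighted_f1(y_true, y_pred, ["INFO", "FATAL"])
print(f"macro F1 = {macro:.4f}, weighted F1 = {weighted:.4f}")
```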
On the curated in-distribution slice the WCE sibling is slightly stronger and
better calibrated. This WGCE checkpoint assigns higher confidence to its
predictions, which is the calibration cost of a noise-tolerant objective.
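A simple way to quantify that over-confidence is the gap between mean confidence and accuracy. Using the two figures reported in the table (95.57% and 87.37%), the gap is about 8.2 percentage points. The helper below is a hypothetical sketch for computing the same quantity on your own predictions; it is not part of log-lens.

```python
def confidence_gap(confidences, correct):
    """Mean top-class confidence minus accuracy; positive means over-confident.

    `confidences` are top-class probabilities in [0, 1];
    `correct` are booleans, one per prediction.
    """
    mean_confidence = sum(confidences) / len(confidences)
    accuracy = sum(correct) / len(correct)
    return mean_confidence - accuracy


# Gap implied by the reported in-distribution numbers.
reported_gap = 0.9557 - 0.8737
print(f"reported gap: {reported_gap * 100:.2f} percentage points")

# Invented predictions, to show the call shape only.
toy_gap = confidence_gap([0.99, 0.95, 0.90, 0.97], [True, True, False, True])
print(f"toy gap: {toy_gap * 100:.2f} percentage points")
```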
Out-of-distribution
Evaluated on a private industrial Kubernetes domain, a different log
distribution than training. Performance degraded modestly but stayed
usable, the expected cost of moving to unfamiliar formats. Consistent with
its noise-tolerant design, WGCE produced ~21% fewer under-predictions than
WCE on this domain. As always for OOD use, validate on your own log
distribution.
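The card does not define "under-prediction"; the sketch below assumes it means predicting a level lower on the severity scale than the true one, and shows how a relative reduction between two models could be computed on your own data. The labels and predictions are invented and unrelated to the reported figure.

```python
# Severity order as listed in the model card, lowest to highest.
LEVELS = ["TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL"]
RANK = {level: i for i, level in enumerate(LEVELS)}


def count_under_predictions(y_true, y_pred):
    """Count predictions ranked below the true severity (assumed definition)."""
    return sum(RANK[p] < RANK[t] for t, p in zip(y_true, y_pred))


def relative_reduction(baseline_count, new_count):
    """Fraction by which `new_count` is smaller than `baseline_count`."""
    return (baseline_count - new_count) / baseline_count


# Invented example: one set of true labels, predictions from two models.
y_true = ["ERROR", "ERROR", "WARN", "FATAL", "INFO", "ERROR"]
model_a = ["INFO", "WARN", "INFO", "ERROR", "INFO", "ERROR"]   # baseline
model_b = ["ERROR", "WARN", "INFO", "ERROR", "WARN", "ERROR"]  # candidate

a = count_under_predictions(y_true, model_a)
b = count_under_predictions(y_true, model_b)
print(f"baseline={a}, candidate={b}, reduction={relative_reduction(a, b):.0%}")
```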
Limitations and biases
- OOD generalization: only modest degradation was observed on a single
private industrial domain; other distributions are unverified, so validate on
your own logs.
- Confidence ≠ correctness: this checkpoint is the more confident of the
two; treat scores as signals, not guarantees.
- Preprocessing coupling: inputs must be Drain3-masked exactly as in
training (use the log-lens preprocessing pipeline).
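One way to act on the second bullet is to route low-confidence predictions to a human reviewer instead of trusting the label outright. The sketch below works on raw logits in plain Python. The `triage` helper, the 0.9 threshold, and the label order in `ID2LABEL` are assumptions for illustration: read the real mapping from `model.config.id2label` and tune the threshold on your own logs, since this checkpoint tends to be over-confident.

```python
import math

# Assumed label order; the checkpoint's actual mapping may differ.
ID2LABEL = {0: "TRACE", 1: "DEBUG", 2: "INFO", 3: "WARN", 4: "ERROR", 5: "FATAL"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def triage(logits, id2label=ID2LABEL, review_below=0.9):
    """Return the top label, its probability, and whether to ask a human."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {
        "label": id2label[best],
        "confidence": probs[best],
        "needs_review": probs[best] < review_below,
    }


print(triage([0.1, 0.2, 5.0, 0.3, 0.2, 0.1]))  # peaked logits
print(triage([1.0, 1.0, 1.2, 1.0, 1.0, 1.0]))  # nearly flat logits
```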