ModernLogBERT (WGCE)

A ModernBERT encoder fine-tuned to classify the severity level of a single log line into one of six levels: TRACE, DEBUG, INFO, WARN, ERROR, FATAL.

This checkpoint is trained with a Weighted Generalized Cross-Entropy (WGCE) objective (q = 0.7), a noise-tolerant loss designed to be more robust to mislabeled training data. A sibling checkpoint trained with plain Weighted Cross-Entropy (WCE) is available at hazemkhaled-94/modernlogbert-wce.
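
For readers unfamiliar with the objective: generalized cross-entropy replaces the log loss -log(p_y) with (1 - p_y^q) / q, which is bounded above by 1/q and approaches ordinary cross-entropy as q tends to 0. The following is a minimal pure-Python sketch for a single example. The names `weighted_gce` and `class_weights`, and the choice to multiply the loss by the weight of the true class, are assumptions made for illustration; this is not the log-lens training code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_gce(logits, target, class_weights, q=0.7):
    """Generalized cross-entropy for one example, scaled by a class weight.

    Assumption: the weight of the true class multiplies the loss. The
    weighting scheme actually used in training is not documented here.
    """
    p_true = softmax(logits)[target]
    return class_weights[target] * (1.0 - p_true ** q) / q


# Hypothetical uniform weights for the six severity levels.
weights = [1.0] * 6
confident_wrong = weighted_gce([8.0, 0.0, 0.0, 0.0, 0.0, 0.0], target=4, class_weights=weights)
confident_right = weighted_gce([0.0, 0.0, 0.0, 0.0, 8.0, 0.0], target=4, class_weights=weights)
print(f"wrong: {confident_wrong:.3f}  right: {confident_right:.3f}  bound: {1 / 0.7:.3f}")
```

Because the loss is bounded, a confidently mislabeled example can contribute at most 1/q (times its class weight) to the objective, which is where the tolerance to label noise comes from.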

Built with the log-lens project, which also provides the Drain3 preprocessing pipeline these inputs require (see "How to use").

Intended use

Intended: triage and observability research — predicting or sanity-checking log severity, and flagging entries whose predicted severity disagrees with the emitted level as candidate anomalies.
Out of scope: use as the sole source of truth for alerting or incident severity. Aggregate accuracy can hide brittle behavior on unfamiliar log formats, so keep a human in the loop.
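
The disagreement check described above can be sketched as a comparison of severity ranks. This is a minimal sketch under stated assumptions: the rank order follows the level list given at the top of this card, and `flag_disagreement` and its `min_gap` threshold are hypothetical names and values, not part of log-lens.

```python
LEVELS = ["TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL"]
RANK = {name: i for i, name in enumerate(LEVELS)}


def flag_disagreement(emitted, predicted, min_gap=2):
    """Return the signed rank gap if it spans at least min_gap levels, else None.

    A positive gap means the model predicts a more severe level than the
    one the application emitted. min_gap is a hypothetical threshold.
    """
    gap = RANK[predicted] - RANK[emitted]
    return gap if abs(gap) >= min_gap else None


records = [
    ("INFO", "ERROR"),   # emitted INFO, predicted ERROR: candidate anomaly
    ("WARN", "ERROR"),   # one level apart: below the threshold
    ("ERROR", "DEBUG"),  # predicted far less severe than emitted
]
flags = [flag_disagreement(emitted, predicted) for emitted, predicted in records]
print(flags)  # [2, None, -3]
```

Flagged entries are candidates for human review, in line with the out-of-scope note above.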

How to use

Inputs must be Drain3-masked the same way as in training (variables replaced by placeholders such as <NUM>, <IP>, <UUID>); raw text degrades predictions. The log-lens repo ships a ready-to-use Drain3 preprocessing pipeline that produces exactly this masked form.
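
To show what a masked line looks like, here is a small regex-based illustration. It is not Drain3 and not the log-lens pipeline: the patterns are hypothetical and will not reproduce the training-time masking exactly, so use it only to understand the input shape and run real inputs through the log-lens preprocessing.

```python
import re

# Hypothetical patterns, ordered so that more specific ones run first.
# These are NOT the masking rules used in training.
PATTERNS = [
    (re.compile(r"\b[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}\b"), "<UUID>"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<IP>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def mask_line(line):
    """Replace variable fields with placeholders, one pattern at a time."""
    for pattern, placeholder in PATTERNS:
        line = pattern.sub(placeholder, line)
    return line


raw = "Connection to 10.0.3.17 refused after 5 retries"
masked = mask_line(raw)
print(masked)  # Connection to <IP> refused after <NUM> retries
```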

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

repo = "hazemkhaled-94/modernlogbert-gce"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo).eval()

text = "Connection refused after <NUM> retries"  # Drain3-masked input
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    pred = model(**inputs).logits.argmax(-1).item()
print(model.config.id2label[pred])

Training data

In-distribution (train + held-out eval): a publicly available collection of system log corpora (loghub), preprocessed into a level-balanced, stratified sample.
Out-of-distribution (evaluation only): a single private industrial Kubernetes log deployment. Not released; used purely as an OOD generalization probe.

Training procedure

Hyperparameter	Value
Backbone	ModernBERT-base
Loss	Weighted Generalized Cross-Entropy (`q = 0.7`)
Epochs	8
Per-device batch size	32
Gradient accumulation	4 (effective batch 128)
Learning rate	1e-5 (separate LRs for head vs backbone)
Weight decay	0.01
Warmup ratio	0.1
Max sequence length	512
Best-model metric	macro F1
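
As a worked example of how the batch and warmup settings combine, the sketch below derives the effective batch size and an approximate warmup length. The device count and the dataset size are assumptions chosen for illustration (the card does not state the number of training examples), and training frameworks may round step counts slightly differently.

```python
import math

# Values from the hyperparameter table.
per_device_batch = 32
grad_accumulation = 4
epochs = 8
warmup_ratio = 0.1

# Assumptions, not from the model card: one device and a made-up dataset size.
num_devices = 1
num_train_examples = 100_000

effective_batch = per_device_batch * grad_accumulation * num_devices
steps_per_epoch = math.ceil(num_train_examples / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(warmup_ratio * total_steps)

print(effective_batch, steps_per_epoch, total_steps, warmup_steps)  # 128 782 6256 626
```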

Evaluation

In-distribution (held-out stratified slice)

Metric	Value
Accuracy	87.37%
Macro precision	0.7368
Macro recall	0.7977
Macro F1	0.7447
Weighted F1	0.8884
Mean confidence (all)	95.57%
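
Macro F1 averages per-class F1 scores with equal weight, while weighted F1 weights each class by its support, so a gap between the two suggests that performance is uneven across levels. The sketch below assumes the usual definitions (the ones scikit-learn uses) and runs on made-up toy labels, not on the evaluation data behind this table.

```python
from collections import Counter


def per_class_f1(y_true, y_pred, labels):
    """Return {label: (f1, support)} computed from paired label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1
    scores = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1 = 2 * tp[label] / denom if denom else 0.0
        scores[label] = (f1, tp[label] + fn[label])
    return scores


def macro_and_weighted_f1(y_true, y_pred, labels):
    """Average per-class F1 equally (macro) and by support (weighted)."""
    scores = per_class_f1(y_true, y_pred, labels)
    macro = sum(f1 for f1, _ in scores.values()) / len(labels)
    total = sum(support for _, support in scores.values())
    weighted = sum(f1 * support for f1, support in scores.values()) / total
    return macro, weighted


# Made-up toy labels: a common class predicted well, a rare class predicted poorly.
y_true = ["INFO"] * 8 + ["FATAL"] * 2
y_pred = ["INFO"] * 8 + ["INFO", "FATAL"]
macro, weighted = macro_and_weighted_f1(y_true, y_pred, ["INFO", "FATAL"])
print(f"macro={macro:.3f} weighted={weighted:.3f}")  # macro=0.804 weighted=0.886
```

Under these definitions, macro F1 is the mean of per-class F1 values, not the harmonic mean of macro precision and macro recall, which is consistent with the reported 0.7447 differing from what the two macro figures alone would give.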

On the curated in-distribution slice, the WCE sibling is slightly stronger and better calibrated. This WGCE checkpoint is the more confident of the two, which is the calibration cost of a noise-tolerant objective.
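
The in-distribution table reports a mean confidence of 95.57% against 87.37% accuracy, a gap of about 8.2 points. To measure the same gap on your own labeled logs, one common approach is to compare mean confidence with accuracy and compute an expected calibration error (ECE). The sketch below is an illustration on made-up logits; `confidence_report` and the choice of 10 equal-width bins are assumptions, not the evaluation code used for this card.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_report(batch_logits, true_ids, num_bins=10):
    """Return (accuracy, mean confidence, expected calibration error)."""
    # Each bin tracks: example count, summed confidence, number correct.
    bins = [[0, 0.0, 0] for _ in range(num_bins)]
    for logits, truth in zip(batch_logits, true_ids):
        probs = softmax(logits)
        pred = max(range(len(probs)), key=probs.__getitem__)
        conf = probs[pred]
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx][0] += 1
        bins[idx][1] += conf
        bins[idx][2] += int(pred == truth)
    n = sum(count for count, _, _ in bins)
    accuracy = sum(correct for _, _, correct in bins) / n
    mean_conf = sum(conf_sum for _, conf_sum, _ in bins) / n
    # ECE: per-bin |accuracy - confidence|, weighted by the bin's share of examples.
    ece = sum(abs(conf_sum - correct) for _, conf_sum, correct in bins) / n
    return accuracy, mean_conf, ece


# Made-up logits for a 3-class toy problem; the second example is wrong.
toy_logits = [[4.0, 0.0, 0.0], [4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]
toy_truth = [0, 1, 1, 2]
accuracy, mean_conf, ece = confidence_report(toy_logits, toy_truth)
print(f"accuracy={accuracy:.3f} mean_conf={mean_conf:.3f} ece={ece:.3f}")
```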

Out-of-distribution

Evaluated on a private industrial Kubernetes domain — a different log distribution than training. Performance degraded modestly but stayed usable, the expected cost of moving to unfamiliar formats. Consistent with its noise-tolerant design, WGCE produced ~21% fewer under-predictions than WCE on this domain. As always for OOD use, validate on your own log distribution.
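
When validating on your own logs, a comparable under-prediction figure can be computed as sketched below. The card does not define the term, so this sketch assumes that an under-prediction is a prediction less severe than the labeled level, with severity ordered as in the level list at the top of this card. The labels and predictions are made up.

```python
LEVELS = ["TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL"]
RANK = {name: i for i, name in enumerate(LEVELS)}


def under_prediction_rate(y_true, y_pred):
    """Fraction of examples predicted as less severe than their labeled level."""
    under = sum(RANK[pred] < RANK[truth] for truth, pred in zip(y_true, y_pred))
    return under / len(y_true)


# Made-up labels and predictions from two hypothetical models.
y_true = ["ERROR", "ERROR", "WARN", "INFO", "FATAL"]
preds_a = ["WARN", "ERROR", "INFO", "INFO", "ERROR"]
preds_b = ["ERROR", "ERROR", "INFO", "WARN", "ERROR"]

rate_a = under_prediction_rate(y_true, preds_a)
rate_b = under_prediction_rate(y_true, preds_b)
relative_reduction = (rate_a - rate_b) / rate_a
print(f"{rate_a:.2f} {rate_b:.2f} {relative_reduction:.2%}")
```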

Limitations and biases

OOD generalization — only modest degradation was observed on a single private industrial domain; other distributions are unverified, so validate on your own logs.
Confidence ≠ correctness — this checkpoint is the more confident of the two; treat scores as signals, not guarantees.
Preprocessing coupling — inputs must be Drain3-masked exactly as in training (use the log-lens preprocessing pipeline).

Model details

Property	Value
Format	Safetensors
Model size	0.1B params
Tensor type	F32
Base model	answerdotai/ModernBERT-base
