ottema/gliner2-ptbr-ontoevidence (v0.18)

Ontology-guided evidence extraction for Brazilian Portuguese operational text (education, technical support, assistance) with hard-negative rejection.

This is the first version with real weights, trained using an anti-yes-man recipe that combines OE positives with diverse HAREM-style samples to prevent the model from collapsing into predicting all labels.

Performance

Evaluation on the OntoEvidence-BR test set (114 samples, 302 entities, full 58-label ontology, threshold 0.3):

Model	OE samples	Mix (OE/HAREM/anti)	Avg pred/text	Precision	Recall	F1	Yes-man?
v0.18 (this model)	2041	31% / 63% / 6%	4.41	0.256	0.427	0.320	No ✅
v0.17	2041	45% / 45% / 10%	4.27	0.253	0.407	0.312	No ✅
v0.19 (5x OE data)	3876	37% / 56% / 7%	15.12	0.102	0.583	0.174	Yes 🔴
v0.20 (3x vol, 31%)	3876	31% / 63% / 6%	14.51	0.105	0.576	0.178	Yes 🔴
v0.16 (OE 70%, no HAREM)	2041	70% / 30% / 0%	17.96	0.091	0.616	0.158	Yes 🔴
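
The precision, recall, and F1 columns can be cross-checked against each other and against the prediction volume. The following is a minimal sketch that uses only the numbers in the table above; the helper names are illustrative, and the implied recall is an approximation reconstructed from rounded values, not a figure reported by the evaluation:

```python
# Cross-check the reported metrics using only numbers from the table above.
N_TEXTS = 114   # test samples
N_GOLD = 302    # gold entities

# version -> (avg predictions per text, precision, recall, reported F1)
ROWS = {
    "v0.18": (4.41, 0.256, 0.427, 0.320),
    "v0.17": (4.27, 0.253, 0.407, 0.312),
    "v0.19": (15.12, 0.102, 0.583, 0.174),
    "v0.20": (14.51, 0.105, 0.576, 0.178),
    "v0.16": (17.96, 0.091, 0.616, 0.158),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_recall(avg_pred: float, precision: float) -> float:
    """Recall implied by prediction volume: TP = precision * total predictions."""
    true_positives = precision * avg_pred * N_TEXTS
    return true_positives / N_GOLD


for name, (avg_pred, p, r, f1) in ROWS.items():
    print(
        f"{name}: F1={f1_score(p, r):.3f} (reported {f1:.3f}), "
        f"implied recall={implied_recall(avg_pred, p):.3f} (reported {r:.3f})"
    )
```

The yes-man rows gain recall only by predicting three to four times as many labels per text as v0.18, which is what pushes their precision down to roughly 0.1.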

Data scaling findings:

More data is not always better. Nearly doubling the OE positives (2041 to 3876 in v0.19 and v0.20) brought the yes-man behavior back, even with a HAREM-style majority.
Proportion matters, but only at the right volume. v0.18 succeeded at 31% OE / 63% HAREM with about 6.5k samples in total; the same proportion at a larger volume (v0.20) failed.
The signal-to-noise ratio appears to break down at scale. A likely cause is that synthetic HAREM-style samples become repetitive, weakening the "real text has few entities" lesson.
Real data would likely help. Synthetic-only training has hit a ceiling around F1=0.32.

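A healthy model predicts 1-3 entities per text on average, in line with the test set's gold density of about 2.6 (302 entities over 114 texts). At 4.41 predictions per text, v0.18 sits somewhat above that range but far below the 14.5 to 18 predicted by the yes-man variants, which indicates the yes-man failure mode is broken while some over-prediction remains.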
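
Average predictions per text is straightforward to monitor during evaluation. The following is a minimal sketch, assuming predictions arrive as one list of predicted labels per text; the function name and the toy data are hypothetical, and the gold density of 302/114 comes from the test-set description above:

```python
def avg_predictions_per_text(predictions: list[list[str]]) -> float:
    """Mean number of predicted entities per input text."""
    if not predictions:
        return 0.0
    return sum(len(p) for p in predictions) / len(predictions)


# Gold density of the OntoEvidence-BR test set: 302 entities over 114 texts.
GOLD_DENSITY = 302 / 114

# Toy predictions (hypothetical, for illustration only).
toy = [
    ["motor", "pane_mecanica_signal"],
    [],
    ["febre", "condicao_medica", "servico_saude"],
]
density = avg_predictions_per_text(toy)
print(f"avg pred/text = {density:.2f}, ratio to gold = {density / GOLD_DENSITY:.2f}")
```

A ratio far above 1 on held-out text is a cheap early warning that a checkpoint is drifting toward yes-man behavior.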

Real-world examples (threshold 0.3)

Text: "a marcha n esta funcionando" ("the gear isn't working")

cambio_signal 0.83 ✅ (correct: gearbox issue)
pane_mecanica_signal 0.32 (correct: mechanical failure)
motor_signal 0.91 ⚠️ (over-predicted, but thresh=0.3 keeps it)

Text: "o motor falhou" ("the engine failed")

motor 0.998 ✅
pane_mecanica_signal 0.61 ✅
cambio_signal 0.73 ⚠️ (motor ≠ gearbox, but model is uncertain)

Text: "tô com febre" ("I have a fever")

febre 1.000 ✅
condicao_medica 1.000 ✅
servico_saude 0.59 (borderline)

The model assigns high confidence to the right entities but shows residual label confusion: it predicts several plausible labels for the same span. This is the next challenge to address in future versions.
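
One simple way to inspect this confusion is to group predictions by span text and see which labels compete. The following is a minimal sketch, assuming the output shape used in the Usage section (a mapping from label to a list of dicts with `text` and `confidence` keys). The helper name and the span texts in the sample are hypothetical, and keeping only the top label is an illustrative post-processing choice, not something the model does:

```python
from collections import defaultdict


def top_label_per_span(entities: dict) -> dict:
    """Keep only the highest-confidence label for each distinct span text."""
    by_span = defaultdict(list)
    for label, spans in entities.items():
        for span in spans:
            by_span[span["text"]].append((span["confidence"], label))
    # max() on (confidence, label) tuples selects the highest confidence.
    return {text: max(candidates)[1] for text, candidates in by_span.items()}


# Scores from the "o motor falhou" example; the span texts are assumed.
sample = {
    "motor": [{"text": "motor", "confidence": 0.998}],
    "cambio_signal": [{"text": "motor falhou", "confidence": 0.73}],
    "pane_mecanica_signal": [{"text": "motor falhou", "confidence": 0.61}],
}
print(top_label_per_span(sample))
```

On this sample the naive rule keeps cambio_signal (0.73) over pane_mecanica_signal (0.61), which shows why score-based deduplication alone does not resolve the confusion.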

What is OntoEvidence-BR?

Operational text in Brazilian Portuguese (assistance, technical support, education) is noisy, domain-specific, and full of hard negatives (everyday words that look like entities but aren't):

Text	Surface form	Why it's a hard negative
"dê um passo pra frente"	"frente"	Not a "front" entity; it's a movement direction
"o motor falhou"	"motor"	Not a "car part" entity; it's a generic device
"a marcha foi longa"	"marcha"	Could be "gear" (auto), "march" (protest), or "stride" (walking)
"tô com febre"	"febre"	Medical symptom, not a "condition code"

Standard NER models trained on HAREM (journalistic) collapse on operational text because they learned to predict "local" for any capitalized word, "pessoa" for any first name, etc. OntoEvidence-BR trains models to discriminate between entity and non-entity in noisy domains.

The yes-man problem (and how we fixed it)

Earlier attempts to fine-tune GLiNER2 on OntoEvidence-BR caused a structural failure mode we call "yes-man" — the model learns to predict ALL ontology labels with confidence 1.0, regardless of input. We tried:

❌ Curriculum learning (hard → easy)
❌ Hard-negative mining (Wikipedia)
❌ Decoy injection
❌ Increasing weight of rare labels

What worked:

✅ Mix OE positives with diverse HAREM-style samples (31% OE / 63% HAREM / 6% anti-yes)
✅ Train from HAREM-specialized checkpoint (not from raw base)
✅ Conservative encoder LR (5e-7, with task LR 1e-5) and a single epoch to prevent collapse

The HAREM-style mix teaches the model that most real text has few or no OE entities, breaking the "predict everything" bias.
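
The v0.18 counts listed under Training (2041 / 4082 / 408) correspond to two HAREM-style samples and 0.2 anti-yes-man samples per OE positive. The following is a minimal sketch of that arithmetic; the function name and the per-OE ratio parameters are assumptions inferred from those counts, not the author's data pipeline:

```python
def mix_counts(n_oe: int, harem_per_oe: float = 2.0, anti_per_oe: float = 0.2) -> dict:
    """Sample counts per source, derived from the number of OE positives.

    The ratios are assumptions inferred from the v0.18 counts.
    """
    return {
        "oe": n_oe,
        "harem": int(n_oe * harem_per_oe),
        "anti_yes_man": int(n_oe * anti_per_oe),
    }


counts = mix_counts(2041)
total = sum(counts.values())
shares = {name: round(100 * n / total) for name, n in counts.items()}
print(counts, total, shares)
```

Under these assumptions the shares round to 31% / 63% / 6%, matching the mix reported for v0.18.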

Usage

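from gliner2 import GLiNER2

# Load the fine-tuned v0.18 weights from the Hugging Face Hub.
model = GLiNER2.from_pretrained("ottema/gliner2-ptbr-ontoevidence")

text = "a marcha do carro n esta funcionando"
# Candidate labels: pass only the subset of the ontology relevant to the text.
labels = [
    "marcha", "motor", "cambio_signal", "pane_mecanica_signal",
    "motor_signal", "pane_eletrica_signal",
]
# threshold=0.3 favors recall; include_confidence=True attaches a score to each span.
entities = model.extract_entities(text, labels, threshold=0.3, include_confidence=True)
for label, spans in entities["entities"].items():
    for span_info in spans:
        if isinstance(span_info, dict):
            print(f"{label}: '{span_info['text']}' ({span_info['confidence']:.3f})")
        else:
            # Fallback in case a span comes back as a plain string without a score.
            print(f"{label}: '{span_info}'")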

Recommended threshold: 0.3 for high recall, 0.5+ for high precision.
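
The effect of the threshold can be seen on the scores from the first real-world example. The following is a minimal sketch; the helper name is hypothetical, and treating the threshold as inclusive (`>=`) is an assumption about how the library applies it:

```python
def filter_by_threshold(scores: dict[str, float], threshold: float) -> dict[str, float]:
    """Keep labels whose confidence is at or above the threshold (assumed inclusive)."""
    return {label: s for label, s in scores.items() if s >= threshold}


# Scores from the "a marcha n esta funcionando" example above.
scores = {"cambio_signal": 0.83, "pane_mecanica_signal": 0.32, "motor_signal": 0.91}

for threshold in (0.3, 0.5):
    kept = filter_by_threshold(scores, threshold)
    print(threshold, sorted(kept))
```

Raising the threshold to 0.5 drops pane_mecanica_signal (0.32), a correct label, while the over-predicted motor_signal (0.91) survives. A higher threshold trades recall for precision but does not remove confident errors.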

Try the schema today

ottema/gliner2-ptbr-demo — interactive Gradio demo. Select the HAREM-specialized model and the OntoEvidence label presets to test hard-negative discrimination. For production use, this v0.18 model is preferred.

Future work

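The model is functional, but precision is low (0.256 at threshold 0.3, F1 0.320). Known limitations:

Label confusion: predicts multiple plausible labels for the same span
Domain shift: trained mostly on synthetic data; performance on real text may degrade
Coverage: the ontology defines 58 labels while the dataset uses 62, so the two label sets are not fully aligned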

Planned improvements:

Focal loss for hard-negative emphasis
Span-level negative sampling during training
Real operational data (without PII) — needed to break the F1=0.32 ceiling
Active learning with model predictions to find edge cases
Per-domain specialization (separate models for assistance vs technical_support vs education)
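
For reference, the focal loss named in the first item down-weights examples the model already classifies confidently, so that confident mistakes on hard negatives dominate the gradient. The following is a minimal pure-Python sketch of the standard binary formulation; the `gamma` and `alpha` defaults are common choices from the literature, not values tuned or used for this model, and this is not the author's planned implementation:

```python
import math


def binary_focal_loss(p: float, target: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Focal loss for one binary prediction, where p is the positive-class probability."""
    eps = 1e-12  # guards against log(0)
    p_t = p if target == 1 else 1.0 - p            # probability of the true class
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))


easy_negative = binary_focal_loss(p=0.05, target=0)  # non-entity, confidently rejected
hard_negative = binary_focal_loss(p=0.90, target=0)  # non-entity, confidently predicted
print(f"easy={easy_negative:.6f} hard={hard_negative:.6f}")
```

With `gamma=0` and `alpha=0.5` the expression reduces to half the ordinary cross-entropy, which makes it easy to compare against a baseline.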

Training

Base: ottema/gliner2-ptbr-harem (HAREM-specialized)
Data: 2041 OE positives + 4082 HAREM-style mixed + 408 anti-yes-man = 6531 samples
Hyperparams: 1 epoch, encoder_lr=5e-7, task_lr=1e-5, batch_size=4, accum=4
Total time: ~2 min on RTX A4500
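
These hyperparameters imply a short run. The following is a minimal sketch of the step count, assuming `accum` means gradient accumulation steps and that the final partial batch is kept; both are assumptions about the training loop, which is not shown here:

```python
import math

n_samples = 2041 + 4082 + 408  # OE + HAREM-style + anti-yes-man
batch_size = 4
accum = 4                      # assumed: gradient accumulation steps

effective_batch = batch_size * accum
micro_batches = math.ceil(n_samples / batch_size)
optimizer_steps = math.ceil(micro_batches / accum)

print(n_samples, effective_batch, micro_batches, optimizer_steps)
```

Under these assumptions the single epoch amounts to about 409 optimizer updates at an effective batch size of 16.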

Credits

Base architecture: GLiNER2 (Urchade Zaratiana et al.)
Base weights: fastino/gliner2-multi-v1 (Fastino)
HAREM base: ottema/gliner2-ptbr-harem
Dataset + research: Ottema

License

Apache-2.0

Model size: 0.3B params (Safetensors, tensor type F32)
