ottema/gliner2-ptbr-ontoevidence (v0.18)

Ontology-guided evidence extraction for Brazilian Portuguese operational text (education, technical support, assistance) with hard-negative rejection.

This is the first version with real weights, trained using an anti-yes-man recipe that combines OE positives with diverse HAREM-style samples to prevent the model from collapsing into predicting all labels.

Performance

Evaluation on the OntoEvidence-BR test set (114 samples, 302 entities, full 58-label ontology, threshold 0.3):

| Model | OE samples | Mix (OE/HAREM/anti) | Avg pred/text | Precision | Recall | F1 | Yes-man? |
| --- | --- | --- | --- | --- | --- | --- | --- |
| v0.18 (this model) | 2041 | 31% / 63% / 6% | 4.41 | 0.256 | 0.427 | 0.320 | No |
| v0.17 | 2041 | 45% / 45% / 10% | 4.27 | 0.253 | 0.407 | 0.312 | No ✅ |
| v0.19 (5x OE data) | 3876 | 37% / 56% / 7% | 15.12 | 0.102 | 0.583 | 0.174 | Yes 🔴 |
| v0.20 (3x vol, 31%) | 3876 | 31% / 63% / 6% | 14.51 | 0.105 | 0.576 | 0.178 | Yes 🔴 |
| v0.16 (OE 70%, no HAREM) | 2041 | 70% / 30% / 0% | 17.96 | 0.091 | 0.616 | 0.158 | Yes 🔴 |

Data scaling findings:

  • More data is NOT always better. Doubling OE positives (v0.19, v0.20) reactivated yes-man even with HAREM majority.
  • Proportion matters more than volume. v0.18 succeeded at 31% OE / 63% HAREM with 6.5k total. Scaling the same proportion failed (v0.20).
  • The signal-to-noise ratio breaks down at scale. Synthetic HAREM samples become repetitive, weakening the "real text has few entities" lesson.
  • Real data would likely help. Synthetic-only training has hit a ceiling around F1=0.32.

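A healthy model predicts 1-3 entities per text on average (the test set has 302 / 114 ≈ 2.6 gold entities per text). At 4.41 predictions per text, v0.18 is still somewhat above this range but far below the 14-18 predictions per text of the yes-man runs, which indicates that the yes-man failure mode is broken.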
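
The F1 column can be re-derived from the precision and recall columns, and the predictions-per-text figure can be compared against the gold density of the test set (302 entities over 114 samples). The sketch below does both. The function names are illustrative and not part of any library, and small differences from the reported F1 values are expected because the inputs are already rounded.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def over_prediction_ratio(avg_pred_per_text: float, n_gold: int, n_texts: int) -> float:
    """Average number of predictions per gold entity."""
    gold_per_text = n_gold / n_texts
    return avg_pred_per_text / gold_per_text


# Rows copied from the table above: (avg pred/text, precision, recall).
runs = {
    "v0.18": (4.41, 0.256, 0.427),
    "v0.17": (4.27, 0.253, 0.407),
    "v0.19": (15.12, 0.102, 0.583),
    "v0.20": (14.51, 0.105, 0.576),
    "v0.16": (17.96, 0.091, 0.616),
}

for name, (avg_pred, p, r) in runs.items():
    ratio = over_prediction_ratio(avg_pred, n_gold=302, n_texts=114)
    print(f"{name}: F1={f1_score(p, r):.3f}, predictions per gold entity={ratio:.2f}")
```

Seen this way, the yes-man runs emit roughly five to seven predictions per gold entity, while v0.17 and v0.18 stay below two.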

Real-world examples (threshold 0.3)

Text: "a marcha n esta funcionando" (informal for "the gear isn't working")

  • cambio_signal 0.83 ✅ (correct: gearbox issue)
  • pane_mecanica_signal 0.32 (correct: mechanical failure)
  • motor_signal 0.91 ⚠️ (over-predicted; its score is far above the 0.3 threshold, so it is kept)

Text: "o motor falhou"

  • motor 0.998 ✅
  • pane_mecanica_signal 0.61 ✅
  • cambio_signal 0.73 ⚠️ (motor ≠ gearbox, but model is uncertain)

Text: "tô com febre"

  • febre 1.000 ✅
  • condicao_medica 1.000 ✅
  • servico_saude 0.59 (borderline)

The model finds the right entities with high confidence but shows residual label confusion: it predicts several plausible labels for the same span. This is the next challenge to address in future versions.
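
One possible stopgap for label confusion is a post-processing step that keeps only the highest-scoring label for each span. The sketch below is not part of this model or of the GLiNER2 library. `keep_best_label_per_span` is a hypothetical helper, it assumes the output shape shown in the Usage section, and it keys spans by their text, which merges identical strings at different offsets.

```python
from typing import Dict, List


def keep_best_label_per_span(result: Dict[str, Dict[str, List[dict]]]) -> Dict[str, dict]:
    """Map each span text to the single label with the highest confidence.

    Assumed input shape:
    {"entities": {label: [{"text": ..., "confidence": ...}, ...]}}
    """
    best: Dict[str, dict] = {}
    for label, spans in result["entities"].items():
        for span in spans:
            current = best.get(span["text"])
            if current is None or span["confidence"] > current["confidence"]:
                best[span["text"]] = {"label": label, "confidence": span["confidence"]}
    return best


# Illustrative input only: the span texts are hypothetical, the scores mirror
# the "o motor falhou" example above.
example = {
    "entities": {
        "motor": [{"text": "motor", "confidence": 0.998}],
        "cambio_signal": [{"text": "motor", "confidence": 0.73}],
        "pane_mecanica_signal": [{"text": "falhou", "confidence": 0.61}],
    }
}

for text, info in keep_best_label_per_span(example).items():
    print(f"'{text}' -> {info['label']} ({info['confidence']:.3f})")
```

This heuristic only helps when the correct label scores highest. In the first example, motor_signal (0.91) outranks cambio_signal (0.83), so if both attach to the same span the wrong label would win.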

What is OntoEvidence-BR?

Operational text in Brazilian Portuguese (assistance, technical support, education) is noisy, domain-specific, and full of hard negatives, that is, everyday words that look like entities but aren't:

| Text | Surface form | Why it's a hard negative |
| --- | --- | --- |
| "dê um passo pra frente" | "frente" | Not a "front" entity; it's a movement direction |
| "o motor falhou" | "motor" | Not a "car part" entity; it's a generic device |
| "a marcha foi longa" | "marcha" | Could be "gear" (auto), "march" (protest), or "stride" (walking) |
| "tô com febre" | "febre" | Medical symptom, not a "condition code" |

Standard NER models trained on HAREM (journalistic) collapse on operational text because they learned to predict "local" for any capitalized word, "pessoa" for any first name, etc. OntoEvidence-BR trains models to discriminate between entity and non-entity in noisy domains.

The yes-man problem (and how we fixed it)

Earlier attempts to fine-tune GLiNER2 on OntoEvidence-BR ran into a structural failure mode we call "yes-man": the model learns to predict ALL ontology labels with confidence 1.0, regardless of input. We tried the following, none of which fixed it:

  • ❌ Curriculum learning (hard → easy)
  • ❌ Hard-negative mining (Wikipedia)
  • ❌ Decoy injection
  • ❌ Increasing weight of rare labels

What worked:

  • Mix OE positives with diverse HAREM-style samples (31% OE / 63% HAREM / 6% anti-yes-man)
  • Train from HAREM-specialized checkpoint (not from raw base)
  • Conservative LR (5e-7) and 1 epoch to prevent collapse

The HAREM-style mix teaches the model that most real text has few or no OE entities, breaking the "predict everything" bias.
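
The mix can be sketched as a sampling step. The code below is an illustration, not the training script: `build_mix` is a hypothetical helper, and the ratios of 2 HAREM-style samples and 0.2 anti-yes-man samples per OE positive are assumptions inferred from the sample counts in the Training section (2041 / 4082 / 408). Pools are sampled without replacement to avoid adding duplicates.

```python
import random
from typing import List, Sequence, Tuple


def build_mix(
    oe: Sequence[dict],
    harem_pool: Sequence[dict],
    anti_pool: Sequence[dict],
    harem_per_oe: float = 2.0,  # assumption: 2 HAREM-style samples per OE positive
    anti_per_oe: float = 0.2,   # assumption: 1 anti-yes-man sample per 5 OE positives
    seed: int = 0,
) -> List[Tuple[str, dict]]:
    """Return a shuffled list of (source, sample) pairs at the requested ratios."""
    rng = random.Random(seed)
    n_harem = int(len(oe) * harem_per_oe)
    n_anti = int(len(oe) * anti_per_oe)
    if n_harem > len(harem_pool) or n_anti > len(anti_pool):
        raise ValueError("pools are too small for the requested ratios")
    mix = [("oe", s) for s in oe]
    mix += [("harem", s) for s in rng.sample(list(harem_pool), n_harem)]
    mix += [("anti", s) for s in rng.sample(list(anti_pool), n_anti)]
    rng.shuffle(mix)
    return mix


# Placeholder samples; only the counts matter for this illustration.
oe = [{"id": i} for i in range(2041)]
harem_pool = [{"id": i} for i in range(5000)]
anti_pool = [{"id": i} for i in range(500)]

mix = build_mix(oe, harem_pool, anti_pool)
counts = {src: sum(1 for s, _ in mix if s == src) for src in ("oe", "harem", "anti")}
shares = {src: n / len(mix) for src, n in counts.items()}
print(counts, {src: f"{share:.1%}" for src, share in shares.items()})
```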

Usage

from gliner2 import GLiNER2

# Load the fine-tuned checkpoint from the Hugging Face Hub.
model = GLiNER2.from_pretrained("ottema/gliner2-ptbr-ontoevidence")

text = "a marcha do carro n esta funcionando"
# Candidate labels to score against the text.
labels = [
    "marcha", "motor", "cambio_signal", "pane_mecanica_signal",
    "motor_signal", "pane_eletrica_signal",
]
# include_confidence=True returns each span as a dict with "text" and "confidence".
entities = model.extract_entities(text, labels, threshold=0.3, include_confidence=True)
for label, spans in entities["entities"].items():
    for span_info in spans:
        # Print only entries that carry a confidence score.
        if isinstance(span_info, dict):
            print(f"{label}: '{span_info['text']}' ({span_info['confidence']:.3f})")

Recommended threshold: 0.3 for high recall, 0.5+ for high precision.
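
The effect of the threshold can be illustrated on scores already reported in this card. The sketch below applies a plain cutoff to the scores for "a marcha n esta funcionando"; `apply_threshold` is a hypothetical helper, and treating the cutoff as inclusive (`>=`) is an assumption about how the library compares scores.

```python
from typing import Dict


def apply_threshold(scores: Dict[str, float], threshold: float) -> Dict[str, float]:
    """Keep only labels whose confidence is at or above the threshold."""
    return {label: s for label, s in scores.items() if s >= threshold}


# Scores reported above for "a marcha n esta funcionando".
scores = {
    "cambio_signal": 0.83,
    "pane_mecanica_signal": 0.32,
    "motor_signal": 0.91,
}

for threshold in (0.3, 0.5):
    kept = apply_threshold(scores, threshold)
    print(threshold, sorted(kept))
```

On this example, moving from 0.3 to 0.5 drops pane_mecanica_signal, which was a correct label, while the over-predicted motor_signal survives at 0.91. A higher threshold removes low-confidence errors only; confident errors need a different fix.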

Try the schema today

ottema/gliner2-ptbr-demo — interactive Gradio demo. Select the HAREM-specialized model and the OntoEvidence label presets to test hard-negative discrimination. For production use, this v0.18 model is preferred.

Future work

The model is functional but far from strong (F1 = 0.320 on the test set). Known limitations:

  • Label confusion: predicts multiple plausible labels for the same span
  • Domain shift: trained mostly on synthetic data; performance may degrade on real text
  • Coverage mismatch: the ontology has 58 labels, but the dataset has 62

Planned improvements:

  • Focal loss for hard-negative emphasis
  • Span-level negative sampling during training
  • Real operational data (without PII) — needed to break the F1=0.32 ceiling
  • Active learning with model predictions to find edge cases
  • Per-domain specialization (separate models for assistance vs technical_support vs education)
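
For reference, the first planned item can be sketched with the standard binary focal loss, FL(p_t) = -α_t (1 - p_t)^γ log(p_t). This is a generic pure-Python illustration, not the implementation planned for this model; the defaults γ = 2 and α = 0.25 are the commonly used values from the original focal loss paper, not settings chosen by the author.

```python
import math


def binary_focal_loss(p: float, target: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Focal loss for one binary prediction.

    p      -- predicted probability of the positive class, in (0, 1)
    target -- 1 for a true entity, 0 for a negative
    gamma  -- focusing parameter; 0 recovers alpha-weighted cross-entropy
    alpha  -- weight on positives (negatives get 1 - alpha)
    """
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)      # avoid log(0)
    p_t = p if target == 1 else 1.0 - p  # probability assigned to the true class
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confidently wrong prediction on a negative (a hard negative) versus an easy one.
hard_negative = binary_focal_loss(p=0.95, target=0)
easy_negative = binary_focal_loss(p=0.05, target=0)
print(f"hard negative: {hard_negative:.4f}, easy negative: {easy_negative:.6f}")
```

The (1 - p_t)^γ factor shrinks the loss on examples the model already gets right, so the gradient is dominated by confidently wrong cases such as hard negatives.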

Training

  • Base: ottema/gliner2-ptbr-harem (HAREM-specialized)
  • Data: 2041 OE positives + 4082 HAREM-style mixed + 408 anti-yes-man = 6531 samples
  • Hyperparams: 1 epoch, encoder_lr=5e-7, task_lr=1e-5, batch_size=4, accum=4
  • Total time: ~2 min on RTX A4500
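
The size of the training budget follows from these numbers. The short calculation below assumes that `accum` means gradient accumulation steps and that the last partial batch still triggers an optimizer step; neither is confirmed by the training script.

```python
import math

# Values copied from the list above.
n_samples = 2041 + 4082 + 408
batch_size = 4
accum = 4   # assumption: gradient accumulation steps
epochs = 1

effective_batch = batch_size * accum
# Assumption: the last partial batch still triggers an optimizer step.
optimizer_steps = math.ceil(n_samples / effective_batch) * epochs

print(f"samples={n_samples}, effective batch={effective_batch}, steps={optimizer_steps}")
```

Under these assumptions the run performs only a few hundred optimizer updates, which fits the short training time and the goal of moving the HAREM checkpoint only slightly.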

Credits

  • Base architecture: GLiNER2 (Urchade Zaratiana et al.)
  • Base weights: fastino/gliner2-multi-v1 (Fastino)
  • HAREM base: ottema/gliner2-ptbr-harem
  • Dataset + research: Ottema

License

Apache-2.0

Model size: 0.3B params (Safetensors, tensor type F32)