ottema/gliner2-ptbr-harem (v0.12b)

GLiNER2 fine-tuned for Brazilian Portuguese NER, benchmarked on HAREM. Best entity F1 among the compared models in our evaluation protocol.

This model is part of the Ottema GLiNER2-PTBR open-source ecosystem. Companion model: ottema/gliner2-ptbr (generalist, informal PT-BR).

Credits and acknowledgments

This model is a fine-tune of fastino/gliner2-multi-v1, the official multilingual GLiNER2 model released by Fastino. GLiNER2 builds on GLiNER, the open-vocabulary NER architecture originally proposed by Urchade Zaratiana and collaborators (GLiNER paper). We are grateful to the upstream teams for releasing the architecture and base model under Apache-2.0, which made this work possible.

  • Base architecture: GLiNER2 (Zaratiana et al.)
  • Base weights: fastino/gliner2-multi-v1 (Fastino)
  • Encoder: microsoft/mdeberta-v3-base
  • Fine-tuning, datasets, evaluation: Ottema

If you use this model, please also cite the original GLiNER work and the Fastino GLiNER2 release.

Performance (HAREM benchmark, 163 samples, 2511 entities, GPU)

Metrics are reported as per-sample macro F1 (the standard in our benchmark script). The corresponding global micro F1 is also reported for transparency.

| Model | entity_F1 (per-sample macro) | entity_F1 (global micro) | span_F1 | label_F1 | Latency |
|---|---|---|---|---|---|
| ottema/gliner2-ptbr-harem (v0.12b) @ t=0.4 | 0.4749 | 0.4501 | 0.4878 | 0.8725 | 31ms |
| hcaeryks/bert-crf-harem (BERT-Large specialist) | 0.4700 | - | 0.5220 | 0.8456 | 131ms |
| ottema/gliner2-ptbr-harem v0.11 (previous official) | 0.4711 | - | 0.4811 | 0.8776 | 32ms |
| fastino/gliner2-multi-v1 (zero-shot) | 0.4251 | - | 0.4366 | 0.8480 | 31ms |

Note on aggregation methods:

  • Per-sample macro F1: mean of per-sample F1 scores. Equal weight to each sample.
  • Global micro F1: F1 computed on the union of all entities. Equal weight to each entity.
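  • Both are valid. Per-sample macro gives a short sample with few entities the same weight as a long, entity-dense one, so it is more lenient on small samples; global micro is stricter because every missed or spurious entity counts equally.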
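
The difference between the two aggregations can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark script: the helper names, the exact-match (text, label) entity representation, and the choice to score a sample with no true positives as 0.0 are all assumptions.

```python
from typing import List, Set, Tuple

# Assumption: an entity is an exact (surface text, label) pair.
Entity = Tuple[str, str]


def f1_from_counts(tp: int, n_pred: int, n_gold: int) -> float:
    """F1 from raw counts; 0.0 when there are no true positives (assumed convention)."""
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


def per_sample_macro_f1(gold: List[Set[Entity]], pred: List[Set[Entity]]) -> float:
    """Mean of per-sample F1 scores: every sample weighs the same."""
    scores = [f1_from_counts(len(g & p), len(p), len(g)) for g, p in zip(gold, pred)]
    return sum(scores) / len(scores)


def global_micro_f1(gold: List[Set[Entity]], pred: List[Set[Entity]]) -> float:
    """F1 over pooled counts: every entity weighs the same."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    return f1_from_counts(tp, sum(len(p) for p in pred), sum(len(g) for g in gold))


# Toy data (made up): one short sample solved perfectly, one dense sample solved poorly.
gold = [
    {("Petrobras", "organização")},
    {("João da Silva", "pessoa"), ("São Paulo", "local"),
     ("1990", "data"), ("Ottema", "organização")},
]
pred = [
    {("Petrobras", "organização")},
    {("João da Silva", "pessoa"), ("Paulo", "pessoa")},
]

macro = per_sample_macro_f1(gold, pred)
micro = global_micro_f1(gold, pred)
print(f"per-sample macro F1 = {macro:.4f}, global micro F1 = {micro:.4f}")
```

On this toy data the macro score (0.6667) exceeds the micro score (0.5000), because the perfectly solved one-entity sample counts as much as the four-entity one.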

Key results:

  • Best entity F1 among compared models on our HAREM evaluation protocol (0.4749 vs BERT-CRF 0.4700)
  • About 4.2x faster than BERT-CRF (31ms vs 131ms)
  • Open-vocab (no fixed label set)
  • Generalist (trained on HAREM + lfcc + synthetic + Wikipedia pseudo-labels)

Trade-offs:

  • -3.4 pp span_F1 vs BERT-CRF (boundary detection is BERT-CRF's strength)
  • -0.51 pp label_F1 vs v0.11 (pseudo-labeling slightly reduces label precision)
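
The deltas quoted in the key results and trade-offs can be reproduced from the reported scores. The snippet below is only arithmetic on the published numbers; the dictionary layout and the `delta_pp` helper are illustrative names, not part of any released script.

```python
# Scores copied from the performance section (per-sample macro aggregation).
scores = {
    "v0.12b": {"entity_f1": 0.4749, "span_f1": 0.4878, "label_f1": 0.8725, "latency_ms": 31},
    "bert-crf": {"entity_f1": 0.4700, "span_f1": 0.5220, "label_f1": 0.8456, "latency_ms": 131},
    "v0.11": {"entity_f1": 0.4711, "span_f1": 0.4811, "label_f1": 0.8776, "latency_ms": 32},
}


def delta_pp(metric: str, model: str, baseline: str) -> float:
    """Difference between two models on one metric, in percentage points."""
    return (scores[model][metric] - scores[baseline][metric]) * 100


span_vs_crf = delta_pp("span_f1", "v0.12b", "bert-crf")
label_vs_v011 = delta_pp("label_f1", "v0.12b", "v0.11")
entity_vs_v011 = delta_pp("entity_f1", "v0.12b", "v0.11")
speedup = scores["bert-crf"]["latency_ms"] / scores["v0.12b"]["latency_ms"]

print(f"span_F1 vs BERT-CRF:  {span_vs_crf:+.2f} pp")
print(f"label_F1 vs v0.11:    {label_vs_v011:+.2f} pp")
print(f"entity_F1 vs v0.11:   {entity_vs_v011:+.2f} pp")
print(f"latency speedup:      {speedup:.1f}x")
```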

Usage

from gliner2 import GLiNER2

model = GLiNER2.from_pretrained("ottema/gliner2-ptbr-harem")
model = model.to("cuda")  # or "cpu"

text = "João da Silva nasceu em São Paulo em 1990 e trabalha na Petrobras."
entities = model.extract_entities(
    text,
    entity_types=["pessoa", "organização", "local", "data", "valor_monetário"],
    threshold=0.4,
)
print(entities)
# {'entities': {'pessoa': ['João da Silva'], 'local': ['São Paulo'], 'data': ['1990'], 'organização': ['Petrobras']}}

Recommended threshold: 0.4 (sweet spot from ablation).
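
A threshold sweep of this kind could be sketched as follows. This is not the ablation code: it assumes you already have, for each sample, a list of predicted entities with confidence scores, and the function names and toy scores are made up for illustration. The metric is the per-sample macro F1 described above.

```python
from typing import List, Set, Tuple

Entity = Tuple[str, str]            # assumed representation: (surface text, label)
Scored = Tuple[Entity, float]       # an entity with its confidence score


def macro_f1_at(threshold: float, gold: List[Set[Entity]], scored: List[List[Scored]]) -> float:
    """Per-sample macro F1 after dropping predictions below `threshold`."""
    sample_scores = []
    for g, preds in zip(gold, scored):
        kept = {entity for entity, confidence in preds if confidence >= threshold}
        tp = len(g & kept)
        if tp == 0:
            sample_scores.append(0.0)
            continue
        precision, recall = tp / len(kept), tp / len(g)
        sample_scores.append(2 * precision * recall / (precision + recall))
    return sum(sample_scores) / len(sample_scores)


def best_threshold(candidates: List[float], gold, scored) -> float:
    """Candidate with the highest macro F1 (first one wins on ties)."""
    return max(candidates, key=lambda t: macro_f1_at(t, gold, scored))


# Toy development data; the confidence scores are invented.
gold = [
    {("João da Silva", "pessoa"), ("São Paulo", "local")},
    {("Petrobras", "organização")},
]
scored = [
    [(("João da Silva", "pessoa"), 0.91), (("São Paulo", "local"), 0.45), (("Silva", "pessoa"), 0.22)],
    [(("Petrobras", "organização"), 0.80), (("Petrobras", "local"), 0.35)],
]

candidates = [0.2, 0.4, 0.6]
chosen = best_threshold(candidates, gold, scored)
for t in candidates:
    print(f"t={t:.1f}  macro F1={macro_f1_at(t, gold, scored):.4f}")
print("selected threshold:", chosen)
```

Because the toy scores are invented, the value selected here says nothing about the real ablation; run the sweep on a held-out split of your own data if your domain differs from HAREM.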

Training

  • Base: fastino/gliner2-multi-v1 (Apache-2.0)
  • Init from: v0.11 fine-tuned checkpoint
  • Data: 23k gold (synthetic + lfcc + HAREM train) + 4488 pseudo-labels from Wikipedia PT
  • Hyperparams: 2 epochs, encoder_lr=1e-6, task_lr=2e-5, batch_size=4, accum=4, warmup_ratio=0.1
  • Pseudo-label threshold: 0.85 (confidence-based filter)
  • Total time: ~20min on RTX A4500
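
The confidence-based filter for pseudo-labels can be sketched as below. The training code is not shown in this card, so the record layout, the function name, and the rule "keep a text only if every predicted entity clears the cut-off" are assumptions; a per-entity filter that keeps the confident entities and drops the rest would be an equally plausible reading.

```python
from typing import Dict, List

PSEUDO_LABEL_THRESHOLD = 0.85  # value stated in the training recipe


def filter_pseudo_labels(
    predictions: List[Dict], min_confidence: float = PSEUDO_LABEL_THRESHOLD
) -> List[Dict]:
    """Keep a text only if it has entities and all of them clear the cut-off."""
    kept = []
    for item in predictions:
        entities = item["entities"]
        if entities and all(e["confidence"] >= min_confidence for e in entities):
            kept.append(item)
    return kept


# Hypothetical teacher output on three sentences (confidence scores are made up).
teacher_output = [
    {"text": "Machado de Assis nasceu no Rio de Janeiro.",
     "entities": [{"text": "Machado de Assis", "label": "pessoa", "confidence": 0.97},
                  {"text": "Rio de Janeiro", "label": "local", "confidence": 0.93}]},
    {"text": "A Vale divulgou o balanço em março.",
     "entities": [{"text": "Vale", "label": "organização", "confidence": 0.88},
                  {"text": "março", "label": "data", "confidence": 0.61}]},
    {"text": "O tempo estava nublado.", "entities": []},
]

pseudo_labeled = filter_pseudo_labels(teacher_output)
print(len(pseudo_labeled), "of", len(teacher_output), "texts kept")
```

Rejecting a whole text when any entity is uncertain avoids training on partially labeled sentences, at the cost of discarding more data.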

Innovation Lab

We ran 5 experiments beyond standard fine-tuning. Full ablation below:

  • v0.12a/b: Pseudo-labeling Wiki PT (sweet spot at t=0.85, lr 1e-6) — +0.38 pp entity_F1
  • v0.13: Pseudo t=0.92 (too conservative) — -1.0 pp entity_F1
  • v0.14: Self-training iteration (v0.12b as teacher) — model became overconfident, no F1 gain
  • v0.15: Augmented hard negatives (truncation, label swap, distractor injection) — +3.4 pp precision but -0.5 pp recall
  • Hard-negative mining on Wikipedia: 95% of the apparent false positives ("FPs") were actually correct entities that the corpus simply does not annotate. Only 47 of 2000 candidates were safe enough to use as hard negatives.

Findings: Pseudo-labeling works at threshold 0.85 with a conservative encoder LR (1e-6). A stricter threshold (0.92) cost 1.0 pp entity_F1, and a self-training iteration made the model overconfident without improving F1. Hard-negative augmentation trades recall for precision and was not a net gain.
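
The card does not specify how the three augmentations in v0.15 were implemented. The sketch below shows one possible reading, with every function name and rule being an assumption: truncation drops the last token of a multi-token entity to create a wrong-boundary span, label swap pairs a correct span with a wrong label, and distractor injection adds entity types that do not occur in the text to the label prompt.

```python
from typing import List, Optional, Tuple

Entity = Tuple[str, str]  # assumed representation: (surface text, label)


def truncate_span(surface: str) -> Optional[str]:
    """Drop the last token of a multi-token entity; None for single-token entities."""
    tokens = surface.split()
    return " ".join(tokens[:-1]) if len(tokens) > 1 else None


def swap_label(label: str, label_set: List[str]) -> str:
    """Deterministically pick a different label (the next one in `label_set`)."""
    index = label_set.index(label)
    return label_set[(index + 1) % len(label_set)]


def inject_distractors(labels_in_text: List[str], distractors: List[str]) -> List[str]:
    """Extend the prompt labels with types that are absent from the text."""
    return list(labels_in_text) + [d for d in distractors if d not in labels_in_text]


def build_hard_negatives(gold: List[Entity], label_set: List[str]) -> List[Entity]:
    """(span, label) pairs that the model should NOT predict."""
    negatives = []
    for surface, label in gold:
        truncated = truncate_span(surface)
        if truncated is not None:
            negatives.append((truncated, label))              # wrong boundary
        negatives.append((surface, swap_label(label, label_set)))  # wrong label
    return negatives


label_set = ["pessoa", "organização", "local", "data"]
gold = [("João da Silva", "pessoa"), ("Petrobras", "organização")]

negatives = build_hard_negatives(gold, label_set)
prompt_labels = inject_distractors(["pessoa", "organização"], ["valor_monetário", "pessoa"])
print(negatives)
print(prompt_labels)
```

A production version would likely sample these perturbations randomly rather than deterministically; the fixed rules here just keep the example easy to follow.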

Limitations

  • HAREM is a single benchmark; performance on other PT-BR NER benchmarks (LeNER-Br, Paramopama, etc.) may differ.
  • Trained primarily on encyclopedic + journalistic text. May underperform on chat/WhatsApp (use ottema/gliner2-ptbr v0.4 for that).
  • Span boundary detection is weaker than BERT-CRF (consider ensemble for high-precision use).
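
One simple precision-oriented ensemble keeps only the spans on which both systems agree exactly. The sketch below is an assumption about how such an ensemble could be built, not something evaluated in this card: it presumes both models' outputs have already been converted to character offsets and mapped to a shared label vocabulary, and the `locate` and `agreement_ensemble` helpers are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, label)


def locate(text: str, surface: str, label: str) -> Span:
    """Build a span from the first occurrence of `surface` in `text`."""
    start = text.index(surface)
    return (start, start + len(surface), label)


def agreement_ensemble(a: List[Span], b: List[Span]) -> List[Span]:
    """Keep only spans with identical boundaries and label in both systems."""
    return sorted(set(a) & set(b))


text = "João da Silva nasceu em São Paulo em 1990 e trabalha na Petrobras."

# Made-up predictions: the two systems disagree on the label for "Petrobras".
gliner2_spans = [
    locate(text, "João da Silva", "pessoa"),
    locate(text, "São Paulo", "local"),
    locate(text, "Petrobras", "local"),
]
bert_crf_spans = [
    locate(text, "João da Silva", "pessoa"),
    locate(text, "São Paulo", "local"),
    locate(text, "Petrobras", "organização"),
]

agreed = agreement_ensemble(gliner2_spans, bert_crf_spans)
agreed_surfaces = [text[start:end] for start, end, _ in agreed]
print(agreed_surfaces)
```

Requiring agreement raises precision at the expense of recall, and it restricts the output to labels that both models can produce, which gives up the open-vocabulary benefit.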

License

  • Model weights: Apache-2.0
  • Code: Apache-2.0
  • Training data: synthetic (CC0), real datasets (used for training only, not redistributed)

Citation

@software{ottema_gliner2_ptbr_2026,
  author = {Ottema},
  title = {GLiNER2-PTBR: Open-source Brazilian Portuguese NER},
  year = {2026},
  version = {0.12b},
}

Related models

  • ottema/gliner2-ptbr (v0.4): generalist for informal PT-BR (chat, atendimento)
  • fastino/gliner2-multi-v1: base multilingual GLiNER2

Model size: 0.3B params (Safetensors, tensor type F32)