ottema/gliner2-ptbr-harem (v0.12b)

GLiNER2 fine-tuned for Brazilian Portuguese NER, benchmarked on HAREM. Best entity F1 among the compared models in our evaluation protocol.

This model is part of the Ottema GLiNER2-PTBR open-source ecosystem. Companion model: ottema/gliner2-ptbr (generalist, informal PT-BR).

Credits and acknowledgments

This model is a fine-tune of fastino/gliner2-multi-v1, the official multilingual GLiNER2 model released by Fastino. GLiNER2 builds on GLiNER, the open-vocabulary NER architecture originally proposed by Urchade Zaratiana and collaborators (GLiNER paper). We are grateful to the upstream teams for releasing the architecture and base model under Apache-2.0, which made this work possible.

Base architecture: GLiNER2 (Zaratiana et al.)
Base weights: fastino/gliner2-multi-v1 (Fastino)
Encoder: microsoft/mdeberta-v3-base
Fine-tuning, datasets, evaluation: Ottema

If you use this model, please also cite the original GLiNER work and the Fastino GLiNER2 release.

Performance (HAREM benchmark, 163 samples, 2511 entities, GPU)

Metrics are reported as per-sample macro F1 (the standard in our benchmark script). For this model, the corresponding global micro F1 is also reported for transparency.

Model	entity_F1 (per-sample macro)	entity_F1 (global micro)	span_F1	label_F1	Latency
ottema/gliner2-ptbr-harem (v0.12b) @ t=0.4 ⭐	0.4749	0.4501	0.4878	0.8725	31ms
hcaeryks/bert-crf-harem (BERT-Large specialist)	0.4700	—	0.5220	0.8456	131ms
ottema/gliner2-ptbr-harem v0.11 (previous official)	0.4711	—	0.4811	0.8776	32ms
fastino/gliner2-multi-v1 (zero-shot)	0.4251	—	0.4366	0.8480	31ms

Note on aggregation methods:

Per-sample macro F1: mean of per-sample F1 scores. Equal weight to each sample.
Global micro F1: F1 computed on the union of all entities. Equal weight to each entity.
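Both are valid. Per-sample macro weights a short sample as heavily as a long one, which makes it the more lenient of the two here (0.4749); global micro is dominated by entity-dense samples and is stricter (0.4501).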
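The difference between the two aggregations can be sketched in a few lines. This is a minimal illustration, not the benchmark script itself: the function names, the exact-match `(text, label)` comparison, and the choice to score an empty denominator as 0.0 are all assumptions made for the example.

```python
from typing import List, Set, Tuple

Entity = Tuple[str, str]            # (surface text, label), compared by exact match
Sample = Tuple[Set[Entity], Set[Entity]]  # (gold entities, predicted entities)


def counts(gold: Set[Entity], pred: Set[Entity]) -> Tuple[int, int, int]:
    """Return (true positives, false positives, false negatives)."""
    tp = len(gold & pred)
    return tp, len(pred - gold), len(gold - pred)


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts. Assumption: an empty denominator scores 0.0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def per_sample_macro_f1(samples: List[Sample]) -> float:
    """Mean of per-sample F1: every sample has equal weight."""
    scores = [f1(*counts(gold, pred)) for gold, pred in samples]
    return sum(scores) / len(scores)


def global_micro_f1(samples: List[Sample]) -> float:
    """F1 over pooled counts: every entity has equal weight."""
    tp = fp = fn = 0
    for gold, pred in samples:
        s_tp, s_fp, s_fn = counts(gold, pred)
        tp, fp, fn = tp + s_tp, fp + s_fp, fn + s_fn
    return f1(tp, fp, fn)


# Toy data: one short sample solved perfectly, one long sample mostly missed.
samples: List[Sample] = [
    ({("Petrobras", "organização")}, {("Petrobras", "organização")}),
    (
        {("João", "pessoa"), ("São Paulo", "local"), ("1990", "data"), ("Vale", "organização")},
        {("João", "pessoa"), ("Paulo", "pessoa")},
    ),
]

print(f"per-sample macro F1: {per_sample_macro_f1(samples):.3f}")
print(f"global micro F1:     {global_micro_f1(samples):.3f}")
```

On this toy data the macro score is about 0.667 and the micro score is 0.5: the perfectly solved short sample lifts the macro average, while the pooled counts are dominated by the longer sample.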

Key results:

Best entity F1 among compared models on our HAREM evaluation protocol (0.4749 vs BERT-CRF 0.4700)
4x faster than BERT-CRF (31ms vs 131ms)
Open-vocab (no fixed label set)
Generalist (trained on HAREM + lfcc + synthetic + Wikipedia pseudo-labels)

Trade-offs:

-3.4 pp span_F1 vs BERT-CRF (boundary detection is BERT-CRF's strength)
-0.51 pp label_F1 vs v0.11 (pseudo-labeling slightly reduces label precision)
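The percentage-point deltas and the speedup quoted above follow directly from the table. The short check below recomputes them; the dictionary keys and helper name are illustrative only, and the numbers are copied from the performance table.

```python
# Scores copied from the performance table (F1 on a 0-1 scale, latency in ms).
scores = {
    "v0.12b":   {"entity_f1": 0.4749, "span_f1": 0.4878, "label_f1": 0.8725, "latency_ms": 31},
    "bert-crf": {"entity_f1": 0.4700, "span_f1": 0.5220, "label_f1": 0.8456, "latency_ms": 131},
    "v0.11":    {"entity_f1": 0.4711, "span_f1": 0.4811, "label_f1": 0.8776, "latency_ms": 32},
}


def pp_delta(model: str, baseline: str, metric: str) -> float:
    """Difference between two F1 scores, in percentage points."""
    return round((scores[model][metric] - scores[baseline][metric]) * 100, 2)


entity_vs_bert = pp_delta("v0.12b", "bert-crf", "entity_f1")
span_vs_bert = pp_delta("v0.12b", "bert-crf", "span_f1")
label_vs_prev = pp_delta("v0.12b", "v0.11", "label_f1")
speedup = scores["bert-crf"]["latency_ms"] / scores["v0.12b"]["latency_ms"]

print(f"entity_F1 vs BERT-CRF: {entity_vs_bert:+.2f} pp")
print(f"span_F1 vs BERT-CRF:   {span_vs_bert:+.2f} pp")
print(f"label_F1 vs v0.11:     {label_vs_prev:+.2f} pp")
print(f"latency ratio:         {speedup:.1f}x")
```

The entity_F1 lead over BERT-CRF is about half a percentage point, so it is best read alongside the larger span_F1 gap in the other direction.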

Usage

from gliner2 import GLiNER2

model = GLiNER2.from_pretrained("ottema/gliner2-ptbr-harem")
model = model.to("cuda")  # or "cpu"

text = "João da Silva nasceu em São Paulo em 1990 e trabalha na Petrobras."
entities = model.extract_entities(
    text,
    entity_types=["pessoa", "organização", "local", "data", "valor_monetário"],
    threshold=0.4,
)
print(entities)
# {'entities': {'pessoa': ['João da Silva'], 'local': ['São Paulo'], 'data': ['1990'], 'organização': ['Petrobras']}}

Recommended threshold: 0.4 (sweet spot from ablation).
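For downstream processing it is often handier to work with a flat list of (label, span) pairs than with the nested dictionary. The helper below is a small sketch that assumes the output shape shown in the example above; `flatten_entities` is a hypothetical name, not part of the gliner2 package, and the sample dictionary is copied from the printed output rather than produced by running the model.

```python
from typing import Dict, List, Tuple


def flatten_entities(result: Dict[str, Dict[str, List[str]]]) -> List[Tuple[str, str]]:
    """Turn {'entities': {label: [span, ...]}} into [(label, span), ...].

    Labels with no spans (absent or mapped to an empty list) contribute nothing.
    """
    pairs: List[Tuple[str, str]] = []
    for label, spans in result.get("entities", {}).items():
        for span in spans:
            pairs.append((label, span))
    return pairs


# Output dictionary copied from the usage example above.
result = {
    "entities": {
        "pessoa": ["João da Silva"],
        "local": ["São Paulo"],
        "data": ["1990"],
        "organização": ["Petrobras"],
    }
}

pairs = flatten_entities(result)
for label, span in pairs:
    print(f"{label}: {span}")
```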

Training

Base: fastino/gliner2-multi-v1 (Apache-2.0)
Init from: v0.11 fine-tuned checkpoint
Data: 23k gold (synthetic + lfcc + HAREM train) + 4488 pseudo-labels from Wikipedia PT
Hyperparams: 2 epochs, encoder_lr=1e-6, task_lr=2e-5, batch_size=4, accum=4, warmup_ratio=0.1
Pseudo-label threshold: 0.85 (confidence-based filter)
Total time: ~20min on RTX A4500
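To get a feel for what these hyperparameters imply, the sketch below derives the effective batch size and a rough optimizer-step count. It is back-of-the-envelope arithmetic, not the training script: `TrainConfig` and its methods are hypothetical names, the 23k gold count is rounded in the source, and the actual trainer may round steps or warmup differently.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as listed above (field names are illustrative)."""
    epochs: int = 2
    encoder_lr: float = 1e-6
    task_lr: float = 2e-5
    batch_size: int = 4
    grad_accum: int = 4
    warmup_ratio: float = 0.1

    @property
    def effective_batch_size(self) -> int:
        # One optimizer update sees batch_size * grad_accum examples.
        return self.batch_size * self.grad_accum

    def optimizer_steps(self, n_examples: int) -> int:
        # Assumption: the last partial batch of each epoch is kept.
        return math.ceil(n_examples / self.effective_batch_size) * self.epochs

    def warmup_steps(self, n_examples: int) -> int:
        # Assumption: warmup is a fraction of total optimizer steps, rounded down.
        return int(self.optimizer_steps(n_examples) * self.warmup_ratio)


cfg = TrainConfig()
n_examples = 23_000 + 4_488  # approximate gold count plus pseudo-labels

print("effective batch size:", cfg.effective_batch_size)
print("optimizer steps:     ", cfg.optimizer_steps(n_examples))
print("warmup steps:        ", cfg.warmup_steps(n_examples))
```

Under these assumptions the run amounts to roughly 3,400 optimizer updates at an effective batch size of 16, which is in line with a short fine-tune on a single GPU.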

Innovation Lab

We ran 5 experiments beyond standard fine-tuning. Full ablation below:

v0.12a/b: Pseudo-labeling on Wikipedia PT (sweet spot at t=0.85, lr 1e-6). Result: +0.38 pp entity_F1 vs v0.11.
v0.13: Pseudo-labeling at t=0.92 (too conservative). Result: -1.0 pp entity_F1.
v0.14: Self-training iteration (v0.12b as teacher). Result: the model became overconfident, with no F1 gain.
v0.15: Augmented hard negatives (truncation, label swap, distractor injection). Result: +3.4 pp precision but -0.5 pp recall.
Hard-negative mining on Wikipedia: 95% of the apparent false positives were actually correct entities that the corpus leaves unannotated. Only 47 of 2000 candidates were safe enough to use as hard negatives.

Findings: Pseudo-labeling works at threshold 0.85 with a conservative learning rate (1e-6). A stricter filter (t=0.92) cost 1.0 pp entity_F1, and a self-training iteration made the model overconfident without improving F1. Hard-negative augmentation trades recall for precision and was not a net gain.
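A confidence-based pseudo-label filter of the kind described here can be sketched as follows. The details are assumptions made for illustration: the `Prediction` record, the function name, and the sentence-level rule (keep a sentence only if it has at least one prediction and every prediction clears the threshold) are hypothetical, and the scores are invented toy values rather than model output.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Prediction:
    """One teacher prediction (hypothetical record type)."""
    span: str
    label: str
    score: float


Candidate = Tuple[str, List[Prediction]]  # (sentence, teacher predictions)


def filter_pseudo_labels(candidates: List[Candidate], threshold: float = 0.85) -> List[Candidate]:
    """Keep sentences whose predictions are all at or above the threshold.

    Assumption: filtering is per sentence, so a single uncertain entity drops
    the whole sentence instead of leaving it half-annotated.
    """
    kept: List[Candidate] = []
    for sentence, preds in candidates:
        if preds and all(p.score >= threshold for p in preds):
            kept.append((sentence, preds))
    return kept


# Toy candidates with invented confidence scores.
candidates: List[Candidate] = [
    ("Lisboa é a capital de Portugal.",
     [Prediction("Lisboa", "local", 0.97), Prediction("Portugal", "local", 0.91)]),
    ("A Vale divulgou resultados.",
     [Prediction("Vale", "organização", 0.62)]),
    ("Choveu ontem.", []),
]

kept = filter_pseudo_labels(candidates, threshold=0.85)
kept_strict = filter_pseudo_labels(candidates, threshold=0.92)

print("kept at 0.85:", [sentence for sentence, _ in kept])
print("kept at 0.92:", [sentence for sentence, _ in kept_strict])
```

Raising the threshold from 0.85 to 0.92 discards the one remaining sentence in this toy set, which mirrors the trade-off in the ablation: a stricter filter yields cleaner labels but far fewer training examples.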

Limitations

HAREM is a single benchmark; performance on other PT-BR NER benchmarks (LeNER-Br, Paramopama, etc.) may differ.
Trained primarily on encyclopedic and journalistic text. May underperform on chat/WhatsApp (use ottema/gliner2-ptbr v0.4 for that).
Span boundary detection is weaker than BERT-CRF (consider an ensemble for high-precision use).

License

Model weights: Apache-2.0
Code: Apache-2.0
Training data: synthetic (CC0), real datasets (used for training only, not redistributed)

Citation

@software{ottema_gliner2_ptbr_2026,
  author = {Ottema},
  title = {GLiNER2-PTBR: Open-source Brazilian Portuguese NER},
  year = {2026},
  version = {0.12b},
}