BrazEmbed-PT-BR

A contamination-clean ~110M-parameter Brazilian Portuguese text-embedding model, built on the Brazilian BERTimbau base. This repo hosts the single drop-in model, a TIES merge of the four task experts, which scores MTEB(por) mean_16 = 0.6372 (roughly on par with the prior ~100M leader). The headline 0.6567 comes from the task-routed system (#1 in the ~100M class); for that, see 🛠️ https://github.com/tardellirs/brazembed-pt-br.

from sentence_transformers import SentenceTransformer
m = SentenceTransformer("tardellirs/brazembed-pt-br")   # mean-pooling, L2-normalized, no instruction prefix
emb = m.encode(["O Brasil é um país tropical da América do Sul."], normalize_embeddings=True)
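
Because the embeddings are L2-normalized, cosine similarity between two of them reduces to a plain dot product. The following is a minimal sketch of that identity using toy 2-D vectors in place of real model output; the helper names `l2_normalize` and `dot` are illustrative and not part of this repo or of sentence-transformers.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Toy stand-ins for two sentence embeddings (assumption: real ones are much wider).
a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])

# For unit-norm vectors the dot product equals the cosine similarity.
similarity = dot(a, b)
print(round(similarity, 4))
```

With real model output, the same dot product can be taken between rows of the array returned by `encode(..., normalize_embeddings=True)`.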

MTEB(por) — #1 in the ~100M class (routed system 0.6567; this single model 0.6372)

| Model | Params | mean_16 |
| --- | --- | --- |
| BrazEmbed-PT-BR (routed) | 110M | 0.6567 |
| BrazEmbed-PT-BR (single, this) | 110M | 0.6372 |
| serafim-100m | 100M | 0.6313 |
| multilingual-e5-small | 118M | 0.6142 |
| granite-embedding-107m | 107M | 0.6080 |

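The routed system (0.6567) also beats far larger models (granite-311m 0.6464, multilingual-e5-base 278M 0.6376, serafim-900m 0.6301, …). The single model (0.6372) beats serafim-900m and sits just below multilingual-e5-base.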

Contamination integrity

All training data was checked by hash overlap and found to be disjoint from the MTEB(por) test sets (e.g. mMARCO-pt: 0 / 8.8M overlaps vs the Quati pool). A self-audit removed 10 ASSIN2-test pairs that had leaked into our own STS set. The BERTimbau base is Brazilian (pretrained on brWaC); the training data is Portuguese, a mix of PT-BR-native and PT-translated sources.
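
A hash-overlap check of this kind can be sketched as follows. This is an illustration only, not the audit script used for this model: the normalization (NFKC, lowercasing, whitespace collapsing), the choice of SHA-1, and the names `normalize`, `text_hash`, and `count_overlaps` are all assumptions made for the example.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """Assumed normalization: NFKC, lowercase, collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    return " ".join(text.split())


def text_hash(text: str) -> str:
    """Hex digest of the normalized text."""
    return hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()


def count_overlaps(train_texts, test_texts) -> int:
    """Count training texts whose hash also appears among the test texts."""
    test_hashes = {text_hash(t) for t in test_texts}
    return sum(1 for t in train_texts if text_hash(t) in test_hashes)


# Toy data: the second training sentence matches a test sentence after normalization.
train = ["O Brasil é um país tropical.", "Qual é a capital do Brasil?"]
test = ["  qual é a capital do brasil? ", "Onde fica Lisboa?"]

overlaps = count_overlaps(train, test)
print(f"{overlaps} / {len(train)} overlaps")
```

Exact hashing only catches duplicates that survive the chosen normalization; paraphrases or translations of a test item would need a fuzzier comparison.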

Notes

Development was iterative against MTEB(por) with no held-out split, so we use only principled uniform weight soups and a headline with no tuned knobs, and we report every variant; see https://github.com/tardellirs/brazembed-pt-br. License: MIT (base BERTimbau). Benchmark: MTEB(por) (public release coming soon).
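
For readers unfamiliar with the term, a uniform weight soup averages the parameters of several checkpoints with equal weight, so there is no mixing coefficient to tune. The sketch below shows the idea on toy checkpoints stored as plain dicts of float lists; `uniform_soup` is a hypothetical helper, not code from this repo, and it is a simpler technique than the TIES merge used for the released single model.

```python
def uniform_soup(state_dicts):
    """Average parameters elementwise with equal weight 1/N per checkpoint."""
    if not state_dicts:
        raise ValueError("need at least one checkpoint")
    keys = state_dicts[0].keys()
    if any(sd.keys() != keys for sd in state_dicts):
        raise ValueError("checkpoints must share the same parameter names")
    n = len(state_dicts)
    return {
        name: [sum(vals) / n for vals in zip(*(sd[name] for sd in state_dicts))]
        for name in keys
    }


# Toy checkpoints (assumption: real ones would be framework tensors, not lists).
experts = [
    {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]},
    {"layer.weight": [3.0, 4.0], "layer.bias": [2.0]},
]

soup = uniform_soup(experts)
print(soup)
```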
