BrazEmbed-PT-BR

A contamination-clean, ~110M-parameter Brazilian Portuguese text-embedding model built on the Brazilian BERTimbau base. This repo hosts the single drop-in model, a TIES merge of the four task experts, which scores mean_16 = 0.6372 on MTEB(por), roughly level with the prior ~100M leader. The headline 0.6567 comes from the task-routed system (#1 in the ~100M class); for that system, see 🛠️ https://github.com/tardellirs/brazembed-pt-br.

from sentence_transformers import SentenceTransformer
m = SentenceTransformer("tardellirs/brazembed-pt-br")   # mean-pooling, L2-normalized, no instruction prefix
emb = m.encode(["O Brasil é um país tropical da América do Sul."], normalize_embeddings=True)
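Because the embeddings are L2-normalized, a plain dot product between two of them equals their cosine similarity, so ranking documents against a query needs no extra normalization step. The following is a minimal sketch under that assumption; `dot`, `rank_by_similarity`, and `embed` are hypothetical helper names, not part of this repo or of sentence-transformers.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; equals cosine similarity when both vectors are unit-length."""
    return sum(x * y for x, y in zip(a, b))


def rank_by_similarity(
    query_vec: Sequence[float], doc_vecs: Sequence[Sequence[float]]
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs, best match first."""
    scores = [(i, dot(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def embed(texts: List[str]):
    """Hypothetical wrapper around the snippet above; downloads the model on first use."""
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("tardellirs/brazembed-pt-br")
    return model.encode(texts, normalize_embeddings=True)
```

With the model available, something like `vecs = embed([query] + docs)` followed by `rank_by_similarity(vecs[0], vecs[1:])` would order `docs` by similarity to `query`.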

MTEB(por) — #1 in the ~100M class (routed system 0.6567; this single model 0.6372)

Model	params	mean_16
BrazEmbed-PT-BR (routed)	110M	0.6567
BrazEmbed-PT-BR (single, this)	110M	0.6372
serafim-100m	100M	0.6313
multilingual-e5-small	118M	0.6142
granite-embedding-107m	107M	0.6080

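The routed system (0.6567) also beats far larger models (granite-311m 0.6464, multilingual-e5-base 278M 0.6376, serafim-900m 0.6301, …). The single model in this repo (0.6372) is ahead of serafim-900m but not of the other two listed.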

Contamination integrity

All training data was checked by hash overlap and found to be disjoint from the MTEB(por) test sets (e.g. mMARCO-pt: 0 / 8.8M overlaps vs the Quati pool). A self-audit removed 10 ASSIN2-test pairs that had leaked into our own STS set. The BERTimbau base is Brazilian (pretrained on brWaC); the training data is Portuguese, a mix of PT-BR-native and PT-translated sources.
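A hash-overlap check of this kind can be sketched as follows. This is an illustration only, not the repo's audit script: the normalization (NFKC, lowercasing, whitespace collapsing), the choice of SHA-256, and the function names are all assumptions.

```python
import hashlib
import unicodedata
from typing import Iterable, Set


def normalize(text: str) -> str:
    """Assumed normalization: NFKC, lowercase, collapsed whitespace."""
    return " ".join(unicodedata.normalize("NFKC", text).lower().split())


def text_hash(text: str) -> str:
    """SHA-256 hex digest of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def overlap(train_texts: Iterable[str], test_texts: Iterable[str]) -> Set[str]:
    """Return the training texts whose hash also appears in the test set."""
    test_hashes = {text_hash(t) for t in test_texts}
    return {t for t in train_texts if text_hash(t) in test_hashes}
```

An empty result means no exact (post-normalization) duplicates were found; any returned texts are candidates to drop from training. Exact hashing does not catch paraphrases or near-duplicates.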

Notes

Development was iterative against MTEB(por), with no held-out split. For that reason we use only principled uniform weight soups, report a headline number that has no tuned knobs, and report every variant; see https://github.com/tardellirs/brazembed-pt-br. License: MIT (base BERTimbau). Benchmark: MTEB(por) (public release coming soon).
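For readers unfamiliar with the term, a uniform weight soup averages the parameters of several same-architecture checkpoints with equal weight, so there is no mixing coefficient to tune. The sketch below illustrates only that idea on plain Python lists; it is not the repo's merge code, the `Checkpoint` type and `uniform_soup` name are hypothetical, and it is distinct from the TIES merge used for the single model.

```python
from typing import Dict, List

# Hypothetical stand-in for a state dict: parameter name -> flattened values.
Checkpoint = Dict[str, List[float]]


def uniform_soup(checkpoints: List[Checkpoint]) -> Checkpoint:
    """Average every parameter elementwise with equal weight 1/N."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    names = checkpoints[0].keys()
    for ckpt in checkpoints[1:]:
        if ckpt.keys() != names:
            raise ValueError("checkpoints must share the same parameter names")
    n = len(checkpoints)
    soup: Checkpoint = {}
    for name in names:
        if len({len(ckpt[name]) for ckpt in checkpoints}) != 1:
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # zip(*...) groups the i-th value of this parameter across checkpoints.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        soup[name] = [sum(col) / n for col in columns]
    return soup
```

With real checkpoints the values would be tensors from each model's state dict, and the same equal-weight mean would apply per tensor.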

Model details

Format: Safetensors
Model size: 0.1B params
Tensor type: F32
Base model: neuralmind/bert-base-portuguese-cased (this model is a fine-tune of it)
