BrazEmbed-PT-BR
A contamination-clean, ~110M-parameter Brazilian-Portuguese text-embedding model built on the Brazilian BERTimbau base. This repo hosts the single drop-in model, a TIES-merge of the four task experts, which scores mean_16 = 0.6372 on MTEB(por), roughly on par with the prior ~100M leader. The headline 0.6567 belongs to the task-routed system (#1 in the ~100M class); for that, see https://github.com/tardellirs/brazembed-pt-br.
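For readers unfamiliar with TIES-merging, the sketch below illustrates the general technique (trim, elect sign, disjoint mean) on flat lists of floats. It is not the recipe used for this model: the function names, `keep_fraction`, and `scale` are illustrative assumptions, and the actual merge settings are not stated in this card.

```python
from typing import List, Sequence


def trim(task_vector: Sequence[float], keep_fraction: float) -> List[float]:
    """Keep the largest-magnitude entries and zero the rest.

    Entries tied with the threshold are all kept, so slightly more than
    keep_fraction of the entries may survive.
    """
    k = max(1, round(len(task_vector) * keep_fraction))
    threshold = sorted((abs(v) for v in task_vector), reverse=True)[k - 1]
    return [v if abs(v) >= threshold else 0.0 for v in task_vector]


def ties_merge(
    base: Sequence[float],
    experts: Sequence[Sequence[float]],
    keep_fraction: float = 0.2,  # assumed value, not from the model card
    scale: float = 1.0,          # assumed value, not from the model card
) -> List[float]:
    """Merge fine-tuned experts into one parameter vector with TIES."""
    # Task vector = expert weights minus base weights.
    task_vectors = [[e - b for e, b in zip(expert, base)] for expert in experts]
    trimmed = [trim(tv, keep_fraction) for tv in task_vectors]

    merged = []
    for i, b in enumerate(base):
        column = [tv[i] for tv in trimmed]
        total = sum(column)
        if total == 0.0:
            # No surviving update (or exact cancellation): keep the base weight.
            merged.append(b)
            continue
        # Elect the sign carrying the larger total mass.
        sign = 1.0 if total > 0 else -1.0
        # Average only the updates that agree with the elected sign.
        agreeing = [v for v in column if v * sign > 0]
        merged.append(b + scale * sum(agreeing) / len(agreeing))
    return merged
```

In practice the same steps would be applied tensor by tensor over each expert's state dict rather than to plain Python lists.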
```python
from sentence_transformers import SentenceTransformer

# mean-pooling, L2-normalized, no instruction prefix
m = SentenceTransformer("tardellirs/brazembed-pt-br")
emb = m.encode(["O Brasil é um país tropical da América do Sul."], normalize_embeddings=True)
```
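Once sentences are encoded, a typical next step is ranking candidates by cosine similarity to a query. The following is a minimal sketch under that assumption: the helper names and the Portuguese example sentences are illustrative, and `demo()` is not called automatically because it downloads the model.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, computed as the dot product of unit vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def rank_by_similarity(query: Sequence[float], docs: Sequence[Sequence[float]]) -> List[int]:
    """Return document indices ordered from most to least similar."""
    scores = [cosine_similarity(query, d) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


def demo() -> None:
    """Encode a query and two candidates, then print the ranking."""
    # Imported lazily so the helpers above work without the library installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("tardellirs/brazembed-pt-br")
    texts = [
        "Qual é a capital do Brasil?",          # query (hypothetical example)
        "Brasília é a capital federal do Brasil.",
        "O time venceu o jogo no estádio.",
    ]
    emb = model.encode(texts, normalize_embeddings=True)
    print(rank_by_similarity(emb[0], emb[1:]))
```

Because `normalize_embeddings=True` already returns unit-length vectors, the extra normalization inside `cosine_similarity` is redundant here; it is kept so the helper is also safe on unnormalized input.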
MTEB(por) — #1 in the ~100M class (routed system 0.6567; this single model 0.6372)
| Model | params | mean_16 |
|---|---|---|
| BrazEmbed-PT-BR (routed) | 110M | 0.6567 |
| BrazEmbed-PT-BR (single, this) | 110M | 0.6372 |
| serafim-100m | 100M | 0.6313 |
| multilingual-e5-small | 118M | 0.6142 |
| granite-embedding-107m | 107M | 0.6080 |
The routed system also beats far larger models (granite-311m 0.6464, multilingual-e5-base 278M 0.6376, serafim-900m 0.6301, …).
Contamination integrity
All training data was measured by hash overlap to be disjoint from the MTEB(por) test sets (e.g. mMARCO-pt: 0 / 8.8M overlaps vs the Quati pool). A self-audit removed 10 ASSIN2-test pairs that had leaked into our own STS set. The BERTimbau base is Brazilian (pretrained on brWaC); the training data is Portuguese, a mix of PT-BR-native and PT-translated sources.
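A hash-overlap check of this kind can be sketched as follows. This is an illustration only: the normalization steps (NFKC, case folding, whitespace collapsing), the choice of SHA-256, and the function names are assumptions, not the audit procedure actually used for this model.

```python
import hashlib
import unicodedata
from typing import Iterable, List


def normalize(text: str) -> str:
    """Canonicalize text so trivial formatting differences do not hide a match."""
    text = unicodedata.normalize("NFKC", text).casefold()
    return " ".join(text.split())  # collapse runs of whitespace


def fingerprint(text: str) -> str:
    """Hash the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def find_overlap(train_texts: Iterable[str], test_texts: Iterable[str]) -> List[str]:
    """Return training texts whose fingerprint also appears in the test set."""
    test_hashes = {fingerprint(t) for t in test_texts}
    return [t for t in train_texts if fingerprint(t) in test_hashes]


train = ["O gato dorme.", "Chove em  São Paulo."]
test = ["chove em são paulo.", "Outra frase."]
leaked = find_overlap(train, test)  # the second training sentence matches
```

Exact hashing only catches duplicates that survive normalization; paraphrases or translated variants would need a fuzzier comparison.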
Notes
Development was iterative against MTEB(por), with no held-out split. We therefore use only principled uniform weight-soups, report a headline number with no tuned knobs, and publish every variant; see https://github.com/tardellirs/brazembed-pt-br. License: MIT (base BERTimbau). Benchmark: MTEB(por) (public release coming soon).
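A uniform weight-soup is simply the element-wise mean of several checkpoints with equal weights, which leaves no mixing coefficient to tune against the benchmark. A minimal sketch, assuming each checkpoint is represented as a dict mapping parameter names to flat lists of floats (a stand-in for real tensors; the function name is hypothetical):

```python
from typing import Dict, List, Sequence

StateDict = Dict[str, List[float]]


def uniform_soup(state_dicts: Sequence[StateDict]) -> StateDict:
    """Average checkpoints parameter by parameter with equal weights."""
    if not state_dicts:
        raise ValueError("need at least one checkpoint")
    keys = state_dicts[0].keys()
    for sd in state_dicts:
        if sd.keys() != keys:
            raise ValueError("checkpoints must share the same parameter names")
    n = len(state_dicts)
    return {
        name: [sum(values) / n for values in zip(*(sd[name] for sd in state_dicts))]
        for name in keys
    }


ckpt_a = {"w": [1.0, 2.0], "b": [0.0]}
ckpt_b = {"w": [3.0, 4.0], "b": [1.0]}
soup = uniform_soup([ckpt_a, ckpt_b])
```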
Model tree for tardellirs/brazembed-pt-br
Base model
neuralmind/bert-base-portuguese-cased