cnmoro/wikipedia-domain-labels-ptbr
Bilingual (EN+PT) domain classifier: a two-stream MLP that uses gated fusion to combine inference-free SPLADE sparse features with static Nomic dense features.
Based on NeuML/domain-labeler and trained on a merged bilingual version of NeuML/wikipedia-domain-labels.
A pure NumPy implementation, with no transformers dependency, is also available.
| Metric | Value |
|---|---|
| Test accuracy | 84.5% |
| Classes | 67 |
| Training samples | 159,041 (EN + PT) |
| Architecture | Two-Stream MLP, gated fusion |
| Feature dims | 30,522 (sparse) + 384 (dense) |
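To make the "Two-Stream MLP, gated fusion" row more concrete, here is a minimal NumPy sketch of one way such a forward pass could look. It is not the released model's code: the hidden size (`HIDDEN = 256`), the sigmoid gate over the concatenated streams, the single hidden layer per stream, and the random weights are all assumptions for illustration. Only the input dimensions (30,522 and 384) and the class count (67) come from the table above.

```python
import numpy as np

# Dimensions from the table above; HIDDEN is an assumed value.
SPARSE_DIM, DENSE_DIM, HIDDEN, N_CLASSES = 30522, 384, 256, 67

rng = np.random.default_rng(0)

def init(shape):
    # Small random weights, only so the sketch runs; real weights would be trained.
    return rng.normal(0.0, 0.02, size=shape).astype(np.float32)

params = {
    "W_s": init((SPARSE_DIM, HIDDEN)), "b_s": np.zeros(HIDDEN, np.float32),
    "W_d": init((DENSE_DIM, HIDDEN)), "b_d": np.zeros(HIDDEN, np.float32),
    "W_g": init((2 * HIDDEN, HIDDEN)), "b_g": np.zeros(HIDDEN, np.float32),
    "W_o": init((HIDDEN, N_CLASSES)), "b_o": np.zeros(N_CLASSES, np.float32),
}

def forward(x_sparse, x_dense, p=params):
    """Return class probabilities for a batch of (sparse, dense) feature pairs."""
    # Stream 1: project the sparse vocabulary-sized features.
    h_s = np.maximum(x_sparse @ p["W_s"] + p["b_s"], 0.0)
    # Stream 2: project the dense embedding.
    h_d = np.maximum(x_dense @ p["W_d"] + p["b_d"], 0.0)
    # Gate: a per-unit sigmoid computed from both streams (assumed form).
    z = np.concatenate([h_s, h_d], axis=-1) @ p["W_g"] + p["b_g"]
    gate = 1.0 / (1.0 + np.exp(-z))
    # Gated fusion: convex combination of the two streams.
    fused = gate * h_s + (1.0 - gate) * h_d
    # Classification head followed by a numerically stable softmax.
    logits = fused @ p["W_o"] + p["b_o"]
    logits = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)
```

The gate lets the network decide, per hidden unit, how much to trust the lexical (sparse) stream versus the semantic (dense) stream for a given input.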
```python
from transformers import AutoModel

model = AutoModel.from_pretrained("cnmoro/domain-labeler-enpt", trust_remote_code=True)

# Single prediction (Portuguese)
model.predict("O Google Chrome é um navegador web")
# → ["computer_science_and_technology"]

# Single prediction (English)
model.predict("Edward Gein was an American murderer")
# → ["history"]

# Batch
texts = [
    "O universo é vasto e cheio de estrelas",
    "This is a movie review of a great film",
]
model.predict(texts)
# → ["astronomy", "movie"]

# With probabilities
model.predict_proba("A história do Brasil")
# → [{"history": 0.93, "movie": 0.04, "geography": 0.01}]
```
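When using the probabilities downstream, it can help to keep only confident predictions. The sketch below assumes `predict_proba` returns a list with one label-to-probability dict per input text, as the example output above suggests. The helper `top_label` and its `min_confidence` parameter are hypothetical and not part of the model's API.

```python
from typing import Dict, List, Optional, Tuple

def top_label(
    proba: Dict[str, float], min_confidence: float = 0.5
) -> Optional[Tuple[str, float]]:
    """Return (label, score) for the most likely class, or None if below threshold."""
    if not proba:
        return None
    label, score = max(proba.items(), key=lambda kv: kv[1])
    return (label, score) if score >= min_confidence else None

# Sample copied from the predict_proba example above (assumed output shape).
sample: List[Dict[str, float]] = [
    {"history": 0.93, "movie": 0.04, "geography": 0.01}
]
results = [top_label(row) for row in sample]
```

Inputs whose best score falls under the threshold come back as `None`, which makes it easy to route them to a fallback label or manual review.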