Domain Labeler EN-PT (Two-Stream MLP)

Bilingual (EN+PT) domain classifier: a two-stream MLP that uses gated fusion to combine inference-free SPLADE sparse features with static Nomic dense features.

Based on NeuML/domain-labeler and trained on a merged bilingual version of NeuML/wikipedia-domain-labels.

A pure NumPy implementation is also available (no transformers dependency).

Performance

Metric	Value
Test accuracy	84.5%
Classes	67
Training samples	159,041 (EN + PT)
Architecture	Two-Stream MLP, gated fusion
Feature dims	30,522 (sparse) + 384 (dense)

Usage

from transformers import AutoModel

model = AutoModel.from_pretrained("cnmoro/domain-labeler-enpt", trust_remote_code=True)

# Single prediction (Portuguese)
model.predict("O Google Chrome é um navegador web")
# → ["computer_science_and_technology"]

# Single prediction (English)
model.predict("Edward Gein was an American murderer")
# → ["history"]

# Batch
texts = [
    "O universo é vasto e cheio de estrelas",
    "This is a movie review of a great film",
]
model.predict(texts)
# → ["astronomy", "movie"]

# With probabilities
model.predict_proba("A história do Brasil")
# → [{"history": 0.93, "movie": 0.04, "geography": 0.01}]
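If you only want the most likely labels, the probability output can be post-processed with plain Python. The sketch below assumes predict_proba returns a list with one label-to-probability dict per input, as in the example above; top_labels, k, and min_prob are hypothetical names introduced here for illustration and are not part of the model's API.

```python
def top_labels(proba, k=2, min_prob=0.05):
    """Return up to k (label, probability) pairs at or above min_prob, highest first."""
    ranked = sorted(proba.items(), key=lambda kv: kv[1], reverse=True)
    return [(label, p) for label, p in ranked[:k] if p >= min_prob]

# Sample shaped like the predict_proba example above (assumed output format).
sample = [{"history": 0.93, "movie": 0.04, "geography": 0.01}]
results = [top_labels(row) for row in sample]
print(results)  # [[('history', 0.93)]]
```

A threshold like min_prob is a convenient way to skip low-confidence labels; the right value depends on your data.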

Architecture

Sparse features: cnmoro/inference-free-splade-co-condenser-en-ptbr-v2 → 30,522-dim sparse SPLADE vectors
Dense features: cnmoro/static-nomic-384-pten-v2 → 384-dim static embeddings
Two-Stream MLP: Separate projection heads for each modality, gated fusion, then classifier head
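To make the data flow concrete, here is a minimal NumPy sketch of one way a two-stream MLP with gated fusion can be wired. Only the input dimensions (30,522 and 384) and the class count (67) come from this card. The hidden width, the ReLU activations, the sigmoid gate computed from both streams, and the random weights are all assumptions for illustration, so this is not the released model's exact layer layout.

```python
import numpy as np

SPARSE_DIM, DENSE_DIM, NUM_CLASSES = 30_522, 384, 67  # from the model card
HIDDEN = 64  # assumed hidden width, not documented in the card

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)  # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def init_params(rng):
    """Random (untrained) weights, used only to demonstrate the shapes."""
    def layer(n_in, n_out):
        w = rng.normal(0.0, 0.02, (n_in, n_out)).astype(np.float32)
        return w, np.zeros(n_out, dtype=np.float32)
    return {
        "sparse": layer(SPARSE_DIM, HIDDEN),  # projection head for SPLADE features
        "dense": layer(DENSE_DIM, HIDDEN),    # projection head for static embeddings
        "gate": layer(2 * HIDDEN, HIDDEN),    # gate sees both projected streams
        "clf": layer(HIDDEN, NUM_CLASSES),    # classifier head
    }

def forward(params, x_sparse, x_dense):
    ws, bs = params["sparse"]
    wd, bd = params["dense"]
    wg, bg = params["gate"]
    wc, bc = params["clf"]
    h_s = relu(x_sparse @ ws + bs)  # sparse stream
    h_d = relu(x_dense @ wd + bd)   # dense stream
    # Assumed gate form: per-unit weights in (0, 1) that blend the two streams.
    g = sigmoid(np.concatenate([h_s, h_d], axis=-1) @ wg + bg)
    fused = g * h_s + (1.0 - g) * h_d
    return softmax(fused @ wc + bc)

rng = np.random.default_rng(0)
params = init_params(rng)
x_sparse = rng.random((2, SPARSE_DIM), dtype=np.float32)      # stand-in SPLADE vectors
x_dense = rng.normal(size=(2, DENSE_DIM)).astype(np.float32)  # stand-in dense embeddings
probs = forward(params, x_sparse, x_dense)
print(probs.shape)  # (2, 67)
```

The gate lets the network decide, per hidden unit and per input, how much to rely on lexical (sparse) versus semantic (dense) evidence, which is the usual motivation for gated fusion over plain concatenation.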

Attribution

Dataset: NeuML/wikipedia-domain-labels
Original project: NeuML/domain-labeler
Portuguese translation: fast-translate
Embedding models: cnmoro/inference-free-splade-co-condenser-en-ptbr-v2 + cnmoro/static-nomic-384-pten-v2
