Domain Labeler EN-PT (Two-Stream MLP)

Bilingual (EN+PT) domain classifier: a two-stream MLP that uses gated fusion to combine inference-free SPLADE sparse features with static Nomic dense features.

Based on NeuML/domain-labeler and trained on a merged bilingual version of NeuML/wikipedia-domain-labels.

A pure NumPy implementation (no transformers dependency) can be found here.

Performance

Metric             Value
Test accuracy      84.5%
Classes            67
Training samples   159,041 (EN + PT)
Architecture       Two-Stream MLP, gated fusion
Feature dims       30,522 (sparse) + 384 (dense)

Usage

from transformers import AutoModel

model = AutoModel.from_pretrained("cnmoro/domain-labeler-enpt", trust_remote_code=True)

# Single prediction (Portuguese)
model.predict("O Google Chrome é um navegador web")
# → ["computer_science_and_technology"]

# Single prediction (English)
model.predict("Edward Gein was an American murderer")
# → ["history"]

# Batch
texts = [
    "O universo é vasto e cheio de estrelas",
    "This is a movie review of a great film",
]
model.predict(texts)
# → ["astronomy", "movie"]

# With probabilities
model.predict_proba("A história do Brasil")
# → [{"history": 0.93, "movie": 0.04, "geography": 0.01}]
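
The probability output above can be post-processed without loading the model. The following is a minimal sketch, assuming predict_proba returns one dict of label-to-probability per input text, as in the example; top_labels is a hypothetical helper written for illustration and is not part of the model's API.

```python
def top_labels(proba, k=2, min_prob=0.0):
    """Return up to k (label, probability) pairs, highest first, at or above min_prob."""
    ranked = sorted(proba.items(), key=lambda kv: kv[1], reverse=True)
    return [(label, p) for label, p in ranked[:k] if p >= min_prob]

# Stand-in for model.predict_proba("A história do Brasil"), copied from the example above.
example = [{"history": 0.93, "movie": 0.04, "geography": 0.01}]

for proba in example:
    print(top_labels(proba, k=2, min_prob=0.02))
# [('history', 0.93), ('movie', 0.04)]
```

A threshold such as min_prob can help when a text plausibly belongs to more than one of the 67 domains; the value 0.02 here is arbitrary.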

Architecture

  1. Sparse features: cnmoro/inference-free-splade-co-condenser-en-ptbr-v2 → 30,522-dim sparse SPLADE vectors
  2. Dense features: cnmoro/static-nomic-384-pten-v2 → 384-dim static embeddings
  3. Two-Stream MLP: Separate projection heads for each modality, gated fusion, then classifier head
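
The three steps above can be sketched as a NumPy forward pass. This is an illustrative sketch only, not the released implementation: the hidden size, the ReLU activations, the single-layer projection heads, and the exact gate formula (a sigmoid over the concatenated streams, blending them as gate * sparse + (1 - gate) * dense) are all assumptions, and the weights are random. Only the input dimensions (30,522 and 384) and the class count (67) come from this card.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions from the model card; HIDDEN is an assumed value for illustration.
SPARSE_DIM, DENSE_DIM, N_CLASSES = 30_522, 384, 67
HIDDEN = 64

def init(shape):
    # Small random weights; a trained model would load learned parameters instead.
    return rng.normal(0.0, 0.02, size=shape).astype(np.float32)

params = {
    "W_s": init((SPARSE_DIM, HIDDEN)), "b_s": np.zeros(HIDDEN, np.float32),
    "W_d": init((DENSE_DIM, HIDDEN)), "b_d": np.zeros(HIDDEN, np.float32),
    "W_g": init((2 * HIDDEN, HIDDEN)), "b_g": np.zeros(HIDDEN, np.float32),
    "W_c": init((HIDDEN, N_CLASSES)), "b_c": np.zeros(N_CLASSES, np.float32),
}

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(x_sparse, x_dense, p=params):
    # Separate projection head per modality.
    h_s = relu(x_sparse @ p["W_s"] + p["b_s"])
    h_d = relu(x_dense @ p["W_d"] + p["b_d"])
    # Assumed gate: per-unit weight in (0, 1) computed from both streams.
    gate = sigmoid(np.concatenate([h_s, h_d], axis=-1) @ p["W_g"] + p["b_g"])
    fused = gate * h_s + (1.0 - gate) * h_d
    # Classifier head over the fused representation.
    return softmax(fused @ p["W_c"] + p["b_c"])

# Toy batch: mostly-zero sparse vectors and random dense vectors.
x_sparse = np.zeros((2, SPARSE_DIM), np.float32)
x_sparse[0, [101, 2054]] = [1.2, 0.7]  # arbitrary vocabulary indices and weights
x_dense = rng.normal(size=(2, DENSE_DIM)).astype(np.float32)

probs = forward(x_sparse, x_dense)
print(probs.shape)  # (2, 67)
```

Each row of probs is a distribution over the 67 classes. In this formulation the gate lets the model lean on lexical (sparse) evidence for some hidden units and semantic (dense) evidence for others, per input.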

Attribution
