ModernBERT-base-MoME-v0
This is a specialized variant of ModernBERT-base designed for Mixture of Multichain Experts (MoME) routing tasks, i.e., determining which chain expert (e.g., Aptos, Ripple, Polkadot, Crust) should handle an incoming transaction or query. It retains the core architectural and performance benefits of ModernBERT while adding custom training on chain-classification data.
Table of Contents
- Model Summary
- Usage
- Evaluation
- Limitations
- Training
- License
- Citation

Model Summary
ModernBERT-base-MoME-v0 is an encoder-only model (BERT-style) derived from ModernBERT-base. The original ModernBERT was trained on a large corpus of text and code (2T tokens), supporting context lengths of up to 8,192 tokens. Key enhancements include:
- Rotary Positional Embeddings (RoPE) for long-context support
- Local-Global Alternating Attention for efficient attention over extended sequences
- Unpadding + Flash Attention for fast inference times
ModernBERT-base-MoME-v0 extends these capabilities with a fine-tuned head specialized in routing transactions or queries to the correct “chain expert” in a Mixture of Multichain Experts (MoME) system. By integrating specialized training data for chain classification (e.g., Polkadot, Aptos, Ripple, Crust), the model can better determine which chain is relevant for a given transaction payload.
Usage
You can load ModernBERT-base-MoME-v0 using Hugging Face Transformers. The steps are largely identical to standard BERT usage, with two key notes:
- Long-Context Support: Thanks to the RoPE-based architecture, you can input sequences of up to 8,192 tokens without degraded performance.
- Routing Head: After the core BERT encoding, a classification head (or specialized projection layer) determines the most likely chain or domain.
Quickstart
```bash
pip install -U "transformers>=4.48.0"
pip install flash-attn  # optional but recommended if supported by your GPU
```
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "momeaicrypto/ModernBERT-base-MoME-v0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Sample transaction or query
text = "Transaction: {\"action\": \"transfer\", \"chain\": \"polkadot\", ...}"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# The logits in outputs.logits indicate which chain this transaction likely belongs to.
print("Logits:", outputs.logits)
predicted_label = outputs.logits.argmax(dim=-1).item()
print("Predicted chain ID:", predicted_label)
```
Note: If you want to adapt the model to a different classification scheme (e.g., additional chains), you can fine-tune via standard BERT classification recipes.
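As a rough sketch of such a fine-tune, the standard `Trainer` recipe applies. The CSV file, label list, and hyperparameters below are placeholders for illustration, not part of this release:

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Hypothetical CSV with "text" (transaction payload) and "label" (integer chain id) columns.
chains = ["polkadot", "aptos", "ripple", "crust", "solana"]  # example label set
dataset = load_dataset("csv", data_files={"train": "chain_transactions.csv"})

model_id = "momeaicrypto/ModernBERT-base-MoME-v0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    num_labels=len(chains),
    id2label=dict(enumerate(chains)),
    label2id={c: i for i, c in enumerate(chains)},
    ignore_mismatched_sizes=True,  # a fresh head replaces the original routing head
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=8192)

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mome-router-ft", num_train_epochs=3,
                           per_device_train_batch_size=8, learning_rate=5e-5),
    train_dataset=tokenized["train"],
    processing_class=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```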
Evaluation
The base ModernBERT architecture has been shown to outperform or match other leading encoder-only models across GLUE, BEIR, MLDR, CodeSearchNet, and StackQA. For ModernBERT-base-MoME-v0, we specifically evaluate:
- Chain Classification Accuracy: Using a specialized dataset of transactions labeled by their respective chains (Polkadot, Aptos, Ripple, Crust, etc.); a minimal evaluation sketch appears at the end of this section.
- Inference Efficiency on Long Inputs: Verifying that the local-global alternating attention and Flash Attention enable high throughput, even for large transaction payloads or logs (up to 8,192 tokens).
See the parent ModernBERT evaluation results for a broad performance context:
| Model | IR (DPR) BEIR | IR (ColBERT) BEIR | NLU (GLUE) | Code (CSN) |
|---|---|---|---|---|
| BERT | 38.9 | 49.0 | 84.7 | 41.2 |
| RoBERTa | 37.7 | 48.7 | 86.4 | 44.3 |
| ModernBERT | 41.6 | 51.3 | 88.4 | 56.4 |
ModernBERT-base-MoME-v0 maintains the same strong backbone while adding chain-routing capabilities.
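As a minimal sketch of the chain-classification accuracy check described above (the labeled JSONL file below is a hypothetical placeholder):

```python
import json
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "momeaicrypto/ModernBERT-base-MoME-v0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id).eval()

# Hypothetical evaluation file: one {"text": ..., "chain": ...} JSON object per line.
examples = [json.loads(line) for line in open("chain_eval.jsonl")]

correct = 0
for ex in examples:
    enc = tokenizer(ex["text"], return_tensors="pt", truncation=True, max_length=8192)
    with torch.no_grad():
        pred_id = model(**enc).logits.argmax(dim=-1).item()
    correct += int(model.config.id2label[pred_id].lower() == ex["chain"].lower())

print(f"Chain classification accuracy: {correct / len(examples):.2%}")
```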
Limitations
- Domain-Specific Training: While it handles chain routing, performance may degrade if you feed it data outside of the pre-trained or fine-tuned domain (e.g., medical or legal text).
- Biases: As with any large language model, biases in the underlying dataset can manifest in certain classification outcomes.
- Context Length: Though it can handle sequences up to 8,192 tokens, keep in mind that very long sequences can be slower on certain GPU hardware if Flash Attention is not installed.
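If flash-attn is installed, one way to opt into it explicitly at load time is the `attn_implementation` argument of `from_pretrained`; this is a sketch that assumes a CUDA GPU and half-precision weights:

```python
import torch
from transformers import AutoModelForSequenceClassification

# Request Flash Attention explicitly; requires the flash-attn package and a supported GPU.
model = AutoModelForSequenceClassification.from_pretrained(
    "momeaicrypto/ModernBERT-base-MoME-v0",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16,
).to("cuda")
```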
Training
- Base Model: ModernBERT-base (149M parameters, 22 layers).
- Fine-Tuning: Additional training on ~1k chain-labeled transactions, focusing on Polkadot, Aptos, Ripple, Crust, etc.
- Long Context: Trained with RoPE and local-global alternating attention for efficient extended context usage.
- Optimizer: StableAdamW with trapezoidal LR scheduling, consistent with the original ModernBERT approach.
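For reference, the trapezoidal (warmup-stable-decay) schedule mentioned above has the following shape; this is an illustrative sketch, and the step counts and peak rate are placeholders rather than the values used for this checkpoint:

```python
def trapezoidal_lr(step: int, peak_lr: float = 5e-5, warmup_steps: int = 100,
                   stable_steps: int = 800, decay_steps: int = 100) -> float:
    """Linear warmup to peak_lr, a constant plateau, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    if step < warmup_steps + stable_steps:
        return peak_lr
    decay_progress = (step - warmup_steps - stable_steps) / max(1, decay_steps)
    return peak_lr * max(0.0, 1.0 - decay_progress)
```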
License
This model inherits the Apache 2.0 license from ModernBERT.
Citation
If you use ModernBERT-base-MoME-v0 in your work, please cite the original ModernBERT:
```bibtex
@misc{modernbert,
      title={Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference},
      author={Benjamin Warner and Antoine Chaffin and Benjamin Clavié and Orion Weller and Oskar Hallström and Said Taghadouini and Alexis Gallagher and Raja Biswas and Faisal Ladhak and Tom Aarsen and Nathan Cooper and Griffin Adams and Jeremy Howard and Iacopo Poli},
      year={2024},
      eprint={2412.13663},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.13663},
}
```
Additional references for the MoME (Mixture of Multichain Experts) concept should be included if relevant.