
ModernBERT-base-MoME-v0

This is a specialized variant of ModernBERT-base designed for Mixture of Multichain Experts (MoME) routing tasks; in particular, it determines which chain expert (e.g., Aptos, Ripple, Polkadot, Crust) should handle an incoming transaction or query. It retains the core architectural and performance benefits of ModernBERT while adding custom training on chain-classification data.


Table of Contents

  1. Model Summary
  2. Usage
  3. Evaluation
  4. Limitations
  5. Training
  6. License
  7. Citation

Model Summary

ModernBERT-base-MoME-v0 is an encoder-only model (BERT-style) derived from ModernBERT-base. The original ModernBERT was trained on a large corpus of text and code (2T tokens), supporting context lengths of up to 8,192 tokens. Key enhancements include:

  • Rotary Positional Embeddings (RoPE) for long-context support
  • Local-Global Alternating Attention for efficient attention over extended sequences
  • Unpadding + Flash Attention for fast inference times

ModernBERT-base-MoME-v0 extends these capabilities with a fine-tuned head specialized in routing transactions or queries to the correct “chain expert” in a Mixture of Multichain Experts (MoME) system. By integrating specialized training data for chain classification (e.g., Polkadot, Aptos, Ripple, Crust), the model can better determine which chain is relevant for a given transaction payload.


Usage

You can load ModernBERT-base-MoME-v0 using Hugging Face Transformers. The steps are largely identical to standard BERT usage, with two key notes:

  1. Long-Context Support
    Thanks to the model’s RoPE-based architecture, you can input sequences of up to 8,192 tokens without performance degradation.
  2. Routing Head
    After the core BERT encoding, a classification head (a specialized projection layer) determines the most likely chain or domain.

Quickstart

pip install -U "transformers>=4.48.0"
pip install flash-attn  # optional but recommended if supported by your GPU

from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "momeaicrypto/ModernBERT-base-MoME-v0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Sample transaction or query
text = "Transaction: {\"action\": \"transfer\", \"chain\": \"polkadot\", ...}"

inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# The logits from outputs.logits will indicate which chain this transaction likely belongs to.
print("Logits:", outputs.logits)
predicted_label = outputs.logits.argmax(dim=-1).item()
print("Predicted chain ID:", predicted_label)

Note: If you want to adapt the model to a different classification scheme (e.g., additional chains), you can fine-tune via standard BERT classification recipes.
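
A minimal fine-tuning sketch using the standard Trainer recipe is shown below. The chains.csv file (with "text" and integer "label" columns) and the label set are hypothetical placeholders; substitute your own chain taxonomy and data.

from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

model_id = "momeaicrypto/ModernBERT-base-MoME-v0"
labels = ["aptos", "crust", "polkadot", "ripple"]  # hypothetical label set

# chains.csv is a hypothetical file with "text" (payload) and "label" (int ID) columns.
dataset = load_dataset("csv", data_files="chains.csv")["train"].train_test_split(test_size=0.1)

tokenizer = AutoTokenizer.from_pretrained(model_id)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=8192)

dataset = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={name: i for i, name in enumerate(labels)},
    ignore_mismatched_sizes=True,  # replace the shipped head if the label count differs
)

args = TrainingArguments(
    output_dir="modernbert-mome-finetuned",
    per_device_train_batch_size=8,
    learning_rate=5e-5,
    num_train_epochs=3,
    eval_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    processing_class=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()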


Evaluation

The base ModernBERT architecture has been shown to outperform or match other leading encoder-only models across GLUE, BEIR, MLDR, CodeSearchNet, and StackQA. For ModernBERT-base-MoME-v0, we specifically evaluate:

  • Chain Classification Accuracy: Using a specialized dataset of transactions labeled by their respective chains (Polkadot, Aptos, Ripple, Crust, etc.); a minimal accuracy check is sketched after this list.
  • Inference Efficiency on Long Inputs: Verifying that the local-global alternating attention and Flash Attention enable high throughput, even for large transaction payloads or logs (up to 8,192 tokens).
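
The chain-classification evaluation set is not published; the sketch below only illustrates how the accuracy is computed, using two hypothetical labeled payloads as stand-ins.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "momeaicrypto/ModernBERT-base-MoME-v0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id).eval()

# Hypothetical (payload, expected chain ID) pairs standing in for the real eval set.
eval_samples = [
    ('{"action": "transfer", "chain": "polkadot", "amount": 10}', 0),
    ('{"action": "swap", "chain": "aptos", "amount": 3}', 1),
]

correct = 0
with torch.no_grad():
    for text, label in eval_samples:
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=8192)
        pred = model(**inputs).logits.argmax(dim=-1).item()
        correct += int(pred == label)

print(f"Chain classification accuracy: {correct / len(eval_samples):.2%}")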

See the parent ModernBERT evaluation results for a broad performance context:

| Model      | IR (DPR) BEIR | IR (ColBERT) BEIR | NLU (GLUE) | Code (CSN) |
|------------|---------------|-------------------|------------|------------|
| BERT       | 38.9          | 49.0              | 84.7       | 41.2       |
| RoBERTa    | 37.7          | 48.7              | 86.4       | 44.3       |
| ModernBERT | 41.6          | 51.3              | 88.4       | 56.4       |

ModernBERT-base-MoME-v0 maintains the same strong backbone while adding chain-routing capabilities.


Limitations

  1. Domain-Specific Training: While it handles chain routing, performance may degrade if you feed it data outside of the pre-trained or fine-tuned domain (e.g., medical or legal text).
  2. Biases: As with any large language model, biases in the underlying dataset can manifest in certain classification outcomes.
  3. Context Length: Although the model accepts sequences of up to 8,192 tokens, very long inputs can be noticeably slower on hardware where Flash Attention is not available (see the loading sketch below).
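
If flash-attn is installed and your GPU supports it, you can request the fast kernels explicitly at load time. A minimal sketch using the standard Transformers attn_implementation switch; half precision is assumed here because the Flash Attention kernels require fp16/bf16.

import torch
from transformers import AutoModelForSequenceClassification

# Requires `pip install flash-attn` and a supported GPU.
model = AutoModelForSequenceClassification.from_pretrained(
    "momeaicrypto/ModernBERT-base-MoME-v0",
    torch_dtype=torch.float16,               # FA2 kernels expect fp16/bf16
    attn_implementation="flash_attention_2",
).to("cuda")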

Training

  • Base Model: ModernBERT-base (149M parameters, 22 layers).
  • Fine-Tuning: Additional training on ~1k chain-labeled transactions, focusing on Polkadot, Aptos, Ripple, Crust, etc.
  • Long Context: Trained with RoPE and local-global alternating attention for efficient extended context usage.
  • Optimizer: StableAdamW with trapezoidal LR scheduling, consistent with the original ModernBERT approach (a schedule sketch follows this list).
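
A trapezoidal (warmup → constant → linear decay) schedule can be reproduced with a plain LambdaLR. The sketch below is illustrative only: the phase lengths are made up, and AdamW stands in for StableAdamW, since the exact fine-tuning hyperparameters for this checkpoint are not published.

import torch

def trapezoidal_lr(step, warmup=100, stable=800, decay=100):
    # Multiplier applied to the base learning rate at a given optimizer step.
    if step < warmup:                               # linear warmup
        return step / max(1, warmup)
    if step < warmup + stable:                      # constant plateau
        return 1.0
    remaining = warmup + stable + decay - step      # linear decay to zero
    return max(0.0, remaining / max(1, decay))

params = torch.nn.Linear(768, 4).parameters()       # placeholder parameters
optimizer = torch.optim.AdamW(params, lr=5e-5)      # StableAdamW swapped for AdamW here
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, trapezoidal_lr)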

License

This model inherits the Apache 2.0 license from ModernBERT.


Citation

If you use ModernBERT-base-MoME-v0 in your work, please cite the original ModernBERT:

@misc{modernbert,
      title={Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference}, 
      author={Benjamin Warner and Antoine Chaffin and Benjamin Clavié and Orion Weller and Oskar Hallström and Said Taghadouini and Alexis Gallagher and Raja Biswas and Faisal Ladhak and Tom Aarsen and Nathan Cooper and Griffin Adams and Jeremy Howard and Iacopo Poli},
      year={2024},
      eprint={2412.13663},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.13663}, 
}

Additional references for the MoME (Mixture of Multichain Experts) concept should be included if relevant.
