feat: Improve mmBERT-base model card with full title, abstract, and updated content
This PR significantly improves the model card for `jhu-clsp/mmBERT-base` by:
1. **Updating Metadata**:
* Changing the `pipeline_tag` from `fill-mask` to `feature-extraction` to better reflect the model's primary use cases as a multilingual encoder for classification, embedding, and retrieval tasks. This ensures the model appears in the correct search results at https://huggingface.co/models?pipeline_tag=feature-extraction.
* Adding `fill-mask` as an additional tag to maintain discoverability for masked language modeling, which remains a supported capability, as demonstrated in the usage examples (see the quick check below).
2. **Enhancing Content**:
* Updating the main title to the full paper title: "mmBERT: A Modern Multilingual Encoder with Annealed Language Learning".
* Adding a direct link to the Hugging Face paper page: https://huggingface.co/papers/2509.06888, alongside the existing Arxiv link.
* Including the comprehensive paper abstract.
* Integrating and expanding sections like "Overview", "Quick Start", "Model Family", "Usage Examples", "Fine-tuning Examples", "Training Details", "Evaluation", "FAQ", and "Limitations" using the more detailed and up-to-date information from the official GitHub repository README, making the model card a more complete resource for users.
* Updating the `transformers` installation requirement to `transformers>=4.48.0` for consistency with the GitHub README.
* Renaming the "Base Model for Classification" example in the "Quick Start" to "Base Model for Masked Language Modeling" to accurately reflect its content.
These changes provide users with a clearer understanding of the model's capabilities, how to use it, and its underlying design.
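As a quick check of the retagging described above, both advertised capabilities load from the same checkpoint through the standard `transformers` pipelines. This is a minimal sketch rather than part of the PR itself, and it assumes the tokenizer's mask token is `[MASK]`, as in the card's own examples:

```python
from transformers import pipeline

# New primary tag: use the encoder for embeddings / feature extraction.
extractor = pipeline("feature-extraction", model="jhu-clsp/mmBERT-base")
features = extractor("mmBERT encodes text in over 1800 languages.")
print(len(features[0]), "token vectors of dimension", len(features[0][0]))

# Retained secondary tag: masked language modeling still works.
fill = pipeline("fill-mask", model="jhu-clsp/mmBERT-base")
for prediction in fill("The capital of [MASK] is Paris."):
    print(prediction["token_str"], round(prediction["score"], 3))
```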

The updated model card (README.md) is reproduced below. It replaces the previous card, which was titled "mmBERT: A Modern Multilingual Encoder".

---
datasets:
- jhu-clsp/mmbert-decay
- jhu-clsp/mmbert-midtraining
- jhu-clsp/mmbert-pretrain-p1-fineweb2-langs
- jhu-clsp/mmbert-pretrain-p2-fineweb2-remaining
- jhu-clsp/mmbert-pretrain-p3-others
library_name: transformers
license: mit
pipeline_tag: feature-extraction
tags:
- fill-mask
---

# mmBERT: A Modern Multilingual Encoder with Annealed Language Learning

[License: MIT](https://opensource.org/licenses/MIT)
[Paper (arXiv)](https://arxiv.org/abs/2509.06888)
[Paper (Hugging Face)](https://huggingface.co/papers/2509.06888)
[Model: mmBERT-base](https://huggingface.co/jhu-clsp/mmBERT-base)
[Collection](https://huggingface.co/collections/jhu-clsp/mmbert-a-modern-multilingual-encoder-68b725831d7c6e3acc435ed4)
[GitHub](https://github.com/jhu-clsp/mmBERT)

> 🌍 **TL;DR**: State-of-the-art multilingual encoder models trained on 3T tokens across 1833 languages with novel annealed language learning. Outperforms XLM-R and can even beat OpenAI's o3 and Google's Gemini 2.5 Pro.

## Paper Abstract

Encoder-only language models are frequently used for a variety of standard machine learning tasks, including classification and retrieval. However, there has been a lack of recent research for encoder models, especially with respect to multilingual models. We introduce mmBERT, an encoder-only language model pretrained on 3T tokens of multilingual text in over 1800 languages. To build mmBERT we introduce several novel elements, including an inverse mask ratio schedule and an inverse temperature sampling ratio. We add over 1700 low-resource languages to the data mix only during the decay phase, showing that it boosts performance dramatically and maximizes the gains from the relatively small amount of training data. Despite only including these low-resource languages in the short decay phase, we achieve similar classification performance to models like OpenAI's o3 and Google's Gemini 2.5 Pro. Overall, we show that mmBERT significantly outperforms the previous generation of models on classification and retrieval tasks -- on both high and low-resource languages.

## Overview

mmBERT introduces the first modern multilingual encoder trained with cascading annealed language learning (ALL), progressively incorporating 1833 languages during training. It significantly outperforms previous-generation models like XLM-R on classification, embedding, and retrieval tasks while achieving remarkable efficiency improvements (up to 4x faster). Built on the ModernBERT architecture with novel inverse masking schedules and high-quality multilingual data, mmBERT demonstrates that low-resource languages can be effectively learned during the decay phase of training.

## Table of Contents

- [Quick Start](#quick-start)
- [Model Family](#model-family)
- [Novel Training Innovations](#novel-training-innovations)
- [Training Data](#training-data)
- [Usage Examples](#usage-examples)
- [Fine-tuning Examples](#fine-tuning-examples)
- [Model Architecture](#model-architecture)
- [Training](#training)
- [Evaluation](#evaluation)
- [FAQ](#faq)
- [Limitations](#limitations)
- [Citation](#citation)

## Quick Start

### Installation

```bash
pip install "torch>=1.9.0"
pip install "transformers>=4.48.0"
```

### 30-Second Examples

**Small Model for Fast Inference:**

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmbert-small")
model = AutoModel.from_pretrained("jhu-clsp/mmbert-small")

# Example: Get multilingual embeddings
inputs = tokenizer("Hello world! 你好世界! Bonjour le monde!", return_tensors="pt")
outputs = model(**inputs)
embeddings = outputs.last_hidden_state.mean(dim=1)
```

**Base Model for Masked Language Modeling:**

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmbert-base")
model = AutoModelForMaskedLM.from_pretrained("jhu-clsp/mmbert-base")

# Example: Multilingual masked language modeling
text = "The capital of [MASK] is Paris."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Get predictions for [MASK] tokens
mask_indices = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)
predictions = outputs.logits[mask_indices]
top_tokens = torch.topk(predictions, 5, dim=-1)
predicted_words = [tokenizer.decode(token) for token in top_tokens.indices[0]]
print(f"Predictions: {predicted_words}")
```

## Model Family

### Main Models

| Size | Model | Parameters | Languages | Context | Best For | Download |
|:-----|:------|:-----------|:----------|:--------|:---------|:---------|
| Small | [mmbert-small](https://huggingface.co/jhu-clsp/mmbert-small) | 140M | 1833 | 8192 | Fast inference, edge deployment | [Download](https://huggingface.co/jhu-clsp/mmbert-small) |
| Base | [mmbert-base](https://huggingface.co/jhu-clsp/mmbert-base) | 307M | 1833 | 8192 | Best performance, production use | [Download](https://huggingface.co/jhu-clsp/mmbert-base) |

### Key Features

- **1833 Languages**: Covers more languages than any previous multilingual encoder
- **Extended Context**: Up to 8192 tokens (vs 512 for XLM-R)
- **Efficiency**: 2-4x faster inference than previous multilingual models
- **Modern Architecture**: Based on ModernBERT with RoPE, GLU activations, and Flash Attention 2
- **Open Training**: Complete training data, recipes, and checkpoints available

## Novel Training Innovations

**Model Merging**: Combine English-focused, high-resource, and all-language decay variants using TIES merging.

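The card itself does not ship a merging script; the following is a minimal, illustrative sketch of the TIES idea (trim small task-vector entries, elect a per-parameter sign, then average the entries that agree with it) applied to plain PyTorch `state_dict`s. The checkpoint handling is left as comments, and a real merge of the decay variants would more likely go through a dedicated tool such as mergekit.

```python
import torch

def ties_merge(base_state, variant_states, density=0.2):
    """Toy TIES merge: combine several fine-tuned variants onto a shared base."""
    merged = {}
    for name, base_w in base_state.items():
        # Task vectors: how each variant differs from the base weights.
        deltas = [variant[name].float() - base_w.float() for variant in variant_states]

        # Trim: keep only the largest-magnitude fraction (density) of each delta.
        trimmed = []
        for delta in deltas:
            k = max(1, int(density * delta.numel()))
            threshold = delta.abs().flatten().kthvalue(delta.numel() - k + 1).values
            trimmed.append(torch.where(delta.abs() >= threshold, delta, torch.zeros_like(delta)))
        stacked = torch.stack(trimmed)

        # Elect a per-parameter sign by total magnitude, then average agreeing entries.
        sign = torch.sign(stacked.sum(dim=0))
        agree = (torch.sign(stacked) == sign) & (stacked != 0)
        merged_delta = (stacked * agree).sum(dim=0) / agree.sum(dim=0).clamp(min=1)
        merged[name] = base_w + merged_delta.to(base_w.dtype)
    return merged

# Hypothetical usage with the decay variants described above:
# base = AutoModel.from_pretrained("jhu-clsp/mmbert-base").state_dict()
# variants = [AutoModel.from_pretrained(path).state_dict() for path in decay_checkpoint_paths]
# merged_state = ties_merge(base, variants)
```
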
## Training Data

mmBERT was trained on a carefully curated 3T+ token multilingual dataset:

| Phase | Dataset | Tokens | Description |
|:------|:--------|:-------|:------------|
| Mid-training | [mmbert-midtraining](https://huggingface.co/datasets/jhu-clsp/mmbert-midtraining-data) | 600B | 110 languages, context extension to 8K |
| Decay Phase | [mmbert-decay](https://huggingface.co/datasets/jhu-clsp/mmbert-decay-data) | 100B | 1833 languages, premium quality |

**Primary Sources:**

- **Filtered DCLM**: High-quality English content
- **FineWeb2**: Broad multilingual web coverage (1800+ languages)
- **FineWeb2-HQ**: Filtered subset of 20 high-resource languages
- **Code**: StarCoder and ProLong repositories
- **Academic**: ArXiv papers and PeS2o scientific content
- **Reference**: Wikipedia (MegaWika) and textbooks
- **Community**: StackExchange discussions

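All of the listed phases are published as Hugging Face datasets (they also appear in the card's front matter). The sketch below shows one way to peek at a phase without downloading it in full; it assumes the standard `datasets` streaming API and a `train` split, and it prints the schema rather than guessing field names:

```python
from datasets import load_dataset

# Stream the decay-phase mixture (100B tokens, 1833 languages).
decay = load_dataset("jhu-clsp/mmbert-decay", split="train", streaming=True)

for i, example in enumerate(decay):
    print(sorted(example.keys()))  # inspect the schema instead of assuming field names
    if i >= 2:
        break
```
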
## Usage Examples

### Classification Task

<details>
<summary><strong>Click to expand classification fine-tuning example</strong></summary>

```python
from transformers import AutoTokenizer, AutoModel
import torch.nn as nn

# Load model for classification
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmbert-base")
encoder = AutoModel.from_pretrained("jhu-clsp/mmbert-base")

# Add classification head
class MultilingualClassifier(nn.Module):
    def __init__(self, encoder, num_classes):
        super().__init__()
        self.encoder = encoder
        self.classifier = nn.Linear(encoder.config.hidden_size, num_classes)
        self.dropout = nn.Dropout(0.1)

    def forward(self, input_ids, attention_mask=None):
        outputs = self.encoder(input_ids, attention_mask=attention_mask)
        pooled_output = outputs.last_hidden_state[:, 0]  # Use [CLS] token
        pooled_output = self.dropout(pooled_output)
        return self.classifier(pooled_output)

# Initialize classifier
model = MultilingualClassifier(encoder, num_classes=3)

# Example multilingual inputs
texts = [
    "This is a positive review.",
    "Ceci est un avis négatif.",
    "这是一个中性评价。"
]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
predictions = model(**inputs)
```

</details>

### Multilingual Retrieval

<details>
<summary><strong>Click to expand multilingual retrieval example</strong></summary>

```python
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmbert-base")
model = AutoModel.from_pretrained("jhu-clsp/mmbert-base")

def get_embeddings(texts):
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean pooling
    embeddings = outputs.last_hidden_state.mean(dim=1)
    return embeddings.numpy()

# Multilingual document retrieval
documents = [
    "Artificial intelligence is transforming healthcare.",
    "L'intelligence artificielle transforme les soins de santé.",
    "人工智能正在改变医疗保健。",
    "Climate change requires immediate action.",
    "El cambio climático requiere acción inmediata."
]

query = "AI in medicine"

# Get embeddings
doc_embeddings = get_embeddings(documents)
query_embedding = get_embeddings([query])

# Compute similarities
similarities = np.dot(doc_embeddings, query_embedding.T).flatten()
ranked_docs = np.argsort(similarities)[::-1]

print("Most similar documents:")
for i, doc_idx in enumerate(ranked_docs[:3]):
    print(f"{i+1}. {documents[doc_idx]} (score: {similarities[doc_idx]:.3f})")
```

</details>

### Long Context Processing

<details>
<summary><strong>Click to expand long context processing example</strong></summary>

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmbert-base")
model = AutoModel.from_pretrained("jhu-clsp/mmbert-base")

# Process long multilingual document (up to 8192 tokens)
long_text = """
This is a very long multilingual document...
Ceci est un très long document multilingue...
这是一个非常长的多语言文档...
""" * 100  # Simulate long text

# Tokenize with extended context
inputs = tokenizer(
    long_text,
    return_tensors="pt",
    max_length=8192,
    truncation=True
)

# Process efficiently with Flash Attention
with torch.no_grad():
    outputs = model(**inputs)

print(f"Processed {inputs['input_ids'].shape[1]} tokens")
print(f"Output shape: {outputs.last_hidden_state.shape}")
```

</details>

## Fine-tuning Examples

### Dense Retrieval with Sentence Transformers

The full training script is collapsed in this diff; only this excerpt from inside `main()` is visible:

    args = parser.parse_args()

    lr = args.lr
    model_name = args.model_name
    model_shortname = model_name.split("/")[-1]

    model = SentenceTransformer(model_name)

## Model Architecture

| Parameter | mmBERT-small | mmBERT-base |
|:----------|:-------------|:------------|
| Layers | 22 | 22 |
| Hidden Size | 384 | 768 |
| Intermediate Size | 1152 | 1152 |
| Attention Heads | 6 | 12 |
| Total Parameters | 140M | 307M |
| Non-embedding Parameters | 42M | 110M |
| Max Sequence Length | 8192 | 8192 |
| Vocabulary Size | 256,000 | 256,000 |
| Tokenizer | Gemma 2 | Gemma 2 |

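The table can be cross-checked against the released configuration. A small sketch using `AutoConfig` follows; the attribute names are the standard ModernBERT-style config fields in `transformers` and should be treated as assumptions rather than guarantees:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("jhu-clsp/mmbert-base")

# These standard config fields should mirror the table above (22 layers, 768 hidden, ...).
print("num_hidden_layers:", config.num_hidden_layers)
print("hidden_size:", config.hidden_size)
print("num_attention_heads:", config.num_attention_heads)
print("intermediate_size:", config.intermediate_size)
print("vocab_size:", config.vocab_size)
print("max_position_embeddings:", getattr(config, "max_position_embeddings", None))
```
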
## Training

Using 8xH100s, training took approximately 10 days for mmBERT-small and 40 days for mmBERT-base.

### Training Recipe: Cascading Annealed Language Learning

mmBERT introduces novel training techniques:

1. **Inverse Masking Schedule**: Start with 30% masking, gradually reduce to 5%
2. **Language Progression**: 60 → 110 → 1833 languages across training phases
3. **Temperature Annealing**: 0.7 → 0.5 → 0.3 for increasingly uniform language sampling
4. **High-Quality Data**: Progressive upgrade from web crawl to filtered premium sources

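Read as a per-phase schedule, the recipe can be tabulated directly from the numbers quoted in this section (phase boundaries and any within-phase interpolation are not specified in the card, so this sketch is illustrative only):

```python
# Illustrative phase table for cascading annealed language learning,
# using only figures quoted elsewhere in this card.
PHASES = {
    # phase: (languages, mask_rate, sampling_temperature, tokens)
    "pre-training": (60, 0.30, 0.7, "2.0T"),
    "mid-training": (110, 0.15, 0.5, "600B"),
    "decay": (1833, 0.05, 0.3, "100B"),
}

def schedule(phase):
    """Look up the language count, masking rate, and sampling temperature for a phase."""
    languages, mask_rate, temperature, tokens = PHASES[phase]
    return {"languages": languages, "mask_rate": mask_rate,
            "temperature": temperature, "tokens": tokens}

print(schedule("decay"))  # {'languages': 1833, 'mask_rate': 0.05, 'temperature': 0.3, 'tokens': '100B'}
```
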
### Training Details

### Architecture

| Component | Small | Base |
|:----------|:------|:-----|
| Layers | 22 | 22 |
| Hidden Size | 384 | 768 |
| Intermediate Size | 1152 | 1152 |
| Attention Heads | 6 | 12 |
| Parameters (Total) | 140M | 307M |
| Parameters (Non-Embed) | 42M | 110M |
| Max Sequence Length | 8192 | 8192 |
| Vocabulary Size | 256,000 | 256,000 |

### Training Configuration

**Data Mixture:**

* Pre-training (2.0T tokens): Web crawl, code, scientific papers, reference materials
* Mid-training (600B tokens): Higher quality filtered data with context extension
* Decay phase (100B tokens): Premium sources including textbooks and curated content

**Architecture Features:**

* ModernBERT-based transformer with RoPE positional embeddings
* GLU activations and prenorm layer normalization
* Flash Attention 2 for efficient long-context processing
* Gemma 2 tokenizer for multilingual coverage

**Training Phases:**

1. **Base Pre-training**: 60 languages, 30% masking, learning rate warmup
2. **Context Extension**: 110 languages, 15% masking, extended context to 8K
3. **Decay Phase**: 1833 languages, 5% masking, high-quality data focus

## Evaluation

Evaluation code for retrieval tasks is the same as [Ettin](https://github.com/JHU-CLSP/ettin-encoder-vs-decoder/tree/main/retrieval_eval).

Evaluation code for efficiency is taken from the [ModernBERT](https://github.com/AnswerDotAI/ModernBERT/tree/main/efficiency) repo.

Evaluation code for NLU tasks is based on the [mGTE codebase](https://github.com/izhx/nlu-evals) and our fork will be uploaded soon. Please raise an issue or message us if this would be helpful for you.

## FAQ

**Q: How does mmBERT compare to XLM-R?**
**A:** mmBERT significantly outperforms XLM-R across all benchmarks:
- +2.4 points average on XTREME
- +3.0 points on GLUE
- 16x more languages (1833 vs 100)
- 16x longer context (8K vs 512 tokens)
- 2-4x faster inference

**Q: Which languages does mmBERT support?**
**A:** mmBERT supports 1833 languages and scripts from FineWeb2, including:
- All major world languages (English, Chinese, Spanish, etc.)
- European languages (including low-resource ones like Faroese)
- African languages (Swahili, Amharic, etc.)
- Asian languages (Hindi, Bengali, Thai, etc.)
- Many low-resource and indigenous languages

**Q: How does the annealed language learning work?**
**A:** We progressively add languages in three phases:
1. Start with 60 high-resource languages (pre-training)
2. Add 50 mid-resource languages (mid-training)
3. Add 1723 low-resource languages (decay phase)

This allows efficient learning without overfitting on low-resource data.

**Q: Can I fine-tune mmBERT for my specific task?**
**A:** Yes! mmBERT works as a drop-in replacement for XLM-R:

```python
from transformers import AutoModel, AutoTokenizer

# Load for fine-tuning
model = AutoModel.from_pretrained("jhu-clsp/mmbert-base")
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmbert-base")

# Add task-specific head and fine-tune normally
```

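To make "fine-tune normally" concrete, here is a hedged, self-contained sketch using the standard sequence-classification auto class and `Trainer`; the tiny in-memory dataset, label count, and hyperparameters are placeholders, not recommendations from the card:

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "jhu-clsp/mmbert-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Tiny placeholder dataset just to show the wiring; substitute your own labeled data.
raw = Dataset.from_dict({
    "text": ["Great product!", "Produit décevant.", "非常好", "No me gustó."],
    "label": [1, 0, 1, 0],
})
encoded = raw.map(lambda batch: tokenizer(batch["text"], truncation=True), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mmbert-cls", num_train_epochs=1,
                           per_device_train_batch_size=2, report_to="none"),
    train_dataset=encoded,
    processing_class=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```
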
**Q: What about efficiency and memory requirements?**
**A:** mmBERT is significantly more efficient:
- 2-4x faster inference than XLM-R
- Flash Attention 2 reduces memory usage for long sequences
- Support for variable-length batching
- Optimized for both CPU and GPU deployment

**Q: How do I access the training data and checkpoints?**
**A:** All data and checkpoints are publicly available:
- Training data: [jhu-clsp/mmbert-pretraining-data](https://huggingface.co/datasets/jhu-clsp/mmbert-pretraining-data)
- Checkpoints: [jhu-clsp/mmbert-checkpoints](https://huggingface.co/models/jhu-clsp/mmbert-checkpoints)
- GitHub code: [GitHub repository](https://github.com/jhu-clsp/mmBERT)
- Data processing code: [Same as Ettin models](https://github.com/jhu-clsp/ettin-encoder-vs-decoder)

## Limitations

- Structured prediction tasks (NER, POS) show slightly lower scores due to tokenizer prefix space handling
- Very low-resource languages still have limited training data
- High-quality educational content filtering could benefit from more languages

## Citation

If you use mmBERT models in your research, please cite our work:

```bibtex
@misc{marone2025mmbertmodernmultilingualencoder,
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.06888},
}
```