Fill-Mask
Transformers
PyTorch
modernbert
nielsr (HF Staff) committed on
Commit c1f0235 · verified · 1 Parent(s): 212719a

feat: Improve mmBERT-base model card with full title, abstract, and updated content


This PR significantly improves the model card for `jhu-clsp/mmBERT-base` by:

1. **Updating Metadata**:
* Changing the `pipeline_tag` from `fill-mask` to `feature-extraction` to better reflect the model's primary use cases as a multilingual encoder for classification, embedding, and retrieval tasks. This ensures the model appears in the correct search results at https://huggingface.co/models?pipeline_tag=feature-extraction.
* Adding `fill-mask` as an additional tag to maintain discoverability for masked language modeling, which remains a supported capability, as demonstrated in the usage examples (the resulting front matter is shown below).

2. **Enhancing Content**:
* Updating the main title to the full paper title: "mmBERT: A Modern Multilingual Encoder with Annealed Language Learning".
* Adding a direct link to the Hugging Face paper page: https://huggingface.co/papers/2509.06888, alongside the existing Arxiv link.
* Including the comprehensive paper abstract.
* Integrating and expanding sections like "Overview", "Quick Start", "Model Family", "Usage Examples", "Fine-tuning Examples", "Training Details", "Evaluation", "FAQ", and "Limitations" using the more detailed and up-to-date information from the official GitHub repository README, making the model card a more complete resource for users.
* Updating the `transformers` installation requirement to `transformers>=4.48.0` for consistency with the GitHub README.
* Renaming the "Base Model for Classification" example in the "Quick Start" to "Base Model for Masked Language Modeling" to accurately reflect its content.

These changes provide users with a clearer understanding of the model's capabilities, how to use it, and its underlying design.
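
For reference, the resulting YAML front matter (as it appears in the diff below):

```yaml
---
datasets:
- jhu-clsp/mmbert-decay
- jhu-clsp/mmbert-midtraining
- jhu-clsp/mmbert-pretrain-p1-fineweb2-langs
- jhu-clsp/mmbert-pretrain-p2-fineweb2-remaining
- jhu-clsp/mmbert-pretrain-p3-others
library_name: transformers
license: mit
pipeline_tag: feature-extraction
tags:
- fill-mask
---
```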

Files changed (1)
  1. README.md +293 -108
README.md CHANGED
@@ -1,71 +1,109 @@
1
  ---
2
- license: mit
3
  datasets:
4
  - jhu-clsp/mmbert-decay
5
  - jhu-clsp/mmbert-midtraining
6
  - jhu-clsp/mmbert-pretrain-p1-fineweb2-langs
7
  - jhu-clsp/mmbert-pretrain-p2-fineweb2-remaining
8
  - jhu-clsp/mmbert-pretrain-p3-others
9
- pipeline_tag: fill-mask
10
  library_name: transformers
11
  ---
12
 
13
- # mmBERT: A Modern Multilingual Encoder
14
 
15
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
16
  [![Paper](https://img.shields.io/badge/Paper-Arxiv-red)](https://arxiv.org/abs/2509.06888)
 
17
  [![Model](https://img.shields.io/badge/🤗%20Hugging%20Face-Model-blue)](https://huggingface.co/jhu-clsp/mmBERT-base)
18
  [![Collection](https://img.shields.io/badge/🤗%20Model%20Collection-blue)](https://huggingface.co/collections/jhu-clsp/mmbert-a-modern-multilingual-encoder-68b725831d7c6e3acc435ed4)
19
  [![GitHub](https://img.shields.io/badge/GitHub-Code-black)](https://github.com/jhu-clsp/mmBERT)
20
 
21
- > TL;DR: A state-of-the-art multilingual encoder trained on 3T+ tokens across 1800+ languages, introducing novel techniques for learning low-resource languages during the decay phase.
22
 
23
- mmBERT is a modern multilingual encoder that significantly outperforms previous generation models like XLM-R on classification, embedding, and retrieval tasks. Built on the ModernBERT architecture with novel multilingual training innovations, mmBERT demonstrates that low-resource languages can be effectively learned during the decay phase of training. It is also significantly faster than any previous multilingual encoder.
24
 
25
  ## Table of Contents
26
- - [Highlights](#highlights)
27
  - [Quick Start](#quick-start)
28
- - [Model Description](#model-description)
29
- - [Novel Training Innovations](#novel-training-innovations)
30
  - [Model Family](#model-family)
 
31
  - [Training Data](#training-data)
32
  - [Usage Examples](#usage-examples)
33
  - [Fine-tuning Examples](#fine-tuning-examples)
34
  - [Model Architecture](#model-architecture)
35
  - [Citation](#citation)
36
 
37
-
38
  ## Quick Start
39
 
40
  ### Installation
41
  ```bash
42
  pip install torch>=1.9.0
43
- pip install transformers>=4.21.0
44
  ```
45
 
46
- ### Usage
47
 
 
48
  ```python
49
  from transformers import AutoTokenizer, AutoModel
50
 
51
- tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmBERT-base")
52
- model = AutoModel.from_pretrained("jhu-clsp/mmBERT-base")
53
 
54
- inputs = tokenizer("Hello world", return_tensors="pt")
 
55
  outputs = model(**inputs)
 
56
  ```
57
 
58
- ## Model Description
59
 
60
- mmBERT represents the first significant advancement over XLM-R for massively multilingual encoder models. Key features include:
61
 
62
- 1. **Massive Language Coverage** - Trained on over 1800 languages with progressive inclusion strategy
63
- 2. **Modern Architecture** - Built on ModernBERT foundation with Flash Attention 2 and unpadding techniques
64
- 3. **Novel Training Recipe** - Introduces inverse mask scheduling and temperature sampling
65
- 4. **Open Training Data** - Complete 3T+ token dataset publicly available
66
- 5. **Decay Phase Innovation** - Demonstrates effective learning of low-resource languages in final training phase
67
 
68
- The model uses bidirectional attention with masked language modeling objectives, optimized specifically for multilingual understanding and cross-lingual transfer.
69
 
70
  ## Novel Training Innovations
71
 
@@ -77,16 +115,9 @@ The model uses bidirectional attention with masked language modeling objectives,
77
 
78
  **Model Merging**: Combine English-focused, high-resource, and all-language decay variants using TIES merging.
79
 
80
- ## Model Family
81
-
82
- | Model | Total Params | Non-embed Params | Languages | Download |
83
- |:------|:-------------|:------------------|:----------|:---------|
84
- | [mmBERT-small](https://huggingface.co/jhu-clsp/mmBERT-small) | 140M | 42M | 1800+ | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/mmBERT-small) |
85
- | [mmBERT-base](https://huggingface.co/jhu-clsp/mmBERT-base) | 307M | 110M | 1800+ | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/mmBERT-base) |
86
-
87
  ## Training Data
88
 
89
- mmBERT training data is publicly available across different phases:
90
 
91
  | Phase | Dataset | Tokens | Description |
92
  |:------|:--------|:-------|:------------|
@@ -96,89 +127,142 @@ mmBERT training data is publicly available across different phases:
96
  | Mid-training | [mmbert-midtraining](https://huggingface.co/datasets/jhu-clsp/mmbert-midtraining-data) | 600B | 110 languages, context extension to 8K |
97
  | Decay Phase | [mmbert-decay](https://huggingface.co/datasets/jhu-clsp/mmbert-decay-data) | 100B | 1833 languages, premium quality |
98
 
99
- **Data Sources**: Filtered DCLM (English), FineWeb2 (multilingual), FineWeb2-HQ (20 high-resource languages), Wikipedia (MegaWika), code repositories (StarCoder, ProLong), academic papers (ArXiv, PeS2o), and community discussions (StackExchange).
100
-
101
- ## Model Architecture
102
-
103
- | Parameter | mmBERT-small | mmBERT-base |
104
- |:----------|:-------------|:------------|
105
- | Layers | 22 | 22 |
106
- | Hidden Size | 384 | 768 |
107
- | Intermediate Size | 1152 | 1152 |
108
- | Attention Heads | 6 | 12 |
109
- | Total Parameters | 140M | 307M |
110
- | Non-embedding Parameters | 42M | 110M |
111
- | Max Sequence Length | 8192 | 8192 |
112
- | Vocabulary Size | 256,000 | 256,000 |
113
- | Tokenizer | Gemma 2 | Gemma 2 |
114
 
115
  ## Usage Examples
116
 
117
- ### Masked Language Modeling
118
-
119
- ```python
120
- from transformers import AutoTokenizer, AutoModelForMaskedLM
121
- import torch
122
 
123
- tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmBERT-base")
124
- model = AutoModelForMaskedLM.from_pretrained("jhu-clsp/mmBERT-base")
125
 
126
- def predict_masked_token(text):
127
- inputs = tokenizer(text, return_tensors="pt")
128
-     with torch.no_grad():
129
-         outputs = model(**inputs)
130
-
131
-     mask_indices = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)
132
-     predictions = outputs.logits[mask_indices]
133
-     top_tokens = torch.topk(predictions, 5, dim=-1)
134

135
-     return [tokenizer.decode(token) for token in top_tokens.indices[0]]
136
 
137
- # Works across languages
138
  texts = [
139
- "The capital of France is <mask>.",
140
- "La capital de España es <mask>.",
141
- "Die Hauptstadt von Deutschland ist <mask>."
142
  ]
143
-
144
- for text in texts:
145
- predictions = predict_masked_token(text)
146
-     print(f"Text: {text}")
147
-     print(f"Predictions: {predictions}")
148
  ```
149
 
150
- ### Cross-lingual Embeddings
151
 
152
  ```python
153
  from transformers import AutoTokenizer, AutoModel
154
  import torch
155
- from sklearn.metrics.pairwise import cosine_similarity
156
 
157
- tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmBERT-base")
158
- model = AutoModel.from_pretrained("jhu-clsp/mmBERT-base")
159
 
160
  def get_embeddings(texts):
161
- inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
162
-
163
      with torch.no_grad():
164
          outputs = model(**inputs)
165
-         embeddings = outputs.last_hidden_state.mean(dim=1)
166
-
167
      return embeddings.numpy()
168
 
169
- multilingual_texts = [
170
- "Artificial intelligence is transforming technology",
171
- "La inteligencia artificial está transformando la tecnología",
172
- "L'intelligence artificielle transforme la technologie",
173
- "人工智能正在改变技术"
 
 
174
  ]
175
 
176
- embeddings = get_embeddings(multilingual_texts)
177
- similarities = cosine_similarity(embeddings)
178
- print("Cross-lingual similarity matrix:")
179
- print(similarities)
180
  ```
181
 
 
 
182
  ## Fine-tuning Examples
183
 
184
  ### Dense Retrieval with Sentence Transformers
@@ -205,7 +289,6 @@ def main():
205
  args = parser.parse_args()
206
 
207
  lr = args.lr
208
- model_name = args.model_name
209
  model_shortname = model_name.split("/")[-1]
210
 
211
  model = SentenceTransformer(model_name)
@@ -443,31 +526,134 @@ if __name__ == "__main__":
443
 
444
  </details>
445
 
446
- ## Training Data
447
 
448
- mmBERT was trained on a carefully curated 3T+ token multilingual dataset:
449
 
450
- | Phase | Dataset | Description |
451
- |:------|:--------|:------------|
452
- | [Pre-training P1](https://huggingface.co/datasets/jhu-clsp/mmbert-pretrain-p1-fineweb2-langs) | 2.3T tokens | 60 languages, diverse data mixture |
453
- | [Pre-training P2](https://huggingface.co/datasets/jhu-clsp/mmbert-pretrain-p2-fineweb2-langs) | - | Extension data for pre-training |
454
- | [Pre-training P3](https://huggingface.co/datasets/jhu-clsp/mmbert-pretrain-p3-fineweb2-langs) | - | Final pre-training data |
455
- | [Mid-training](https://huggingface.co/datasets/jhu-clsp/mmbert-midtraining-data) | 600B tokens | 110 languages, context extension |
456
- | [Decay Phase](https://huggingface.co/datasets/jhu-clsp/mmbert-decay-data) | 100B tokens | 1833 languages, premium quality |
457
 
458
- **Primary Sources:**
459
- - **Filtered DCLM**: High-quality English content
460
- - **FineWeb2**: Broad multilingual web coverage (1800+ languages)
461
- - **FineWeb2-HQ**: Filtered subset of 20 high-resource languages
462
- - **Code**: StarCoder and ProLong repositories
463
- - **Academic**: ArXiv papers and PeS2o scientific content
464
- - **Reference**: Wikipedia (MegaWika) and textbooks
465
- - **Community**: StackExchange discussions
466
 
467
 
468
  ## Citation
469
 
470
- If you use mmBERT in your research, please cite our work:
471
 
472
  ```bibtex
473
  @misc{marone2025mmbertmodernmultilingualencoder,
@@ -479,5 +665,4 @@ If you use mmBERT in your research, please cite our work:
479
  primaryClass={cs.CL},
480
  url={https://arxiv.org/abs/2509.06888},
481
  }
482
- ```
483
- """
 
1
  ---
 
2
  datasets:
3
  - jhu-clsp/mmbert-decay
4
  - jhu-clsp/mmbert-midtraining
5
  - jhu-clsp/mmbert-pretrain-p1-fineweb2-langs
6
  - jhu-clsp/mmbert-pretrain-p2-fineweb2-remaining
7
  - jhu-clsp/mmbert-pretrain-p3-others
 
8
  library_name: transformers
9
+ license: mit
10
+ pipeline_tag: feature-extraction
11
+ tags:
12
+ - fill-mask
13
  ---
14
 
15
+ # mmBERT: A Modern Multilingual Encoder with Annealed Language Learning
16
 
17
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
18
  [![Paper](https://img.shields.io/badge/Paper-Arxiv-red)](https://arxiv.org/abs/2509.06888)
19
+ [![HF Paper](https://img.shields.io/badge/Paper-HuggingFace-blue)](https://huggingface.co/papers/2509.06888)
20
  [![Model](https://img.shields.io/badge/🤗%20Hugging%20Face-Model-blue)](https://huggingface.co/jhu-clsp/mmBERT-base)
21
  [![Collection](https://img.shields.io/badge/🤗%20Model%20Collection-blue)](https://huggingface.co/collections/jhu-clsp/mmbert-a-modern-multilingual-encoder-68b725831d7c6e3acc435ed4)
22
  [![GitHub](https://img.shields.io/badge/GitHub-Code-black)](https://github.com/jhu-clsp/mmBERT)
23
 
24
+ > 🌍 **TL;DR**: State-of-the-art multilingual encoder models trained on 3T tokens across 1833 languages with novel annealed language learning. Outperforms XLM-R and can even beat OpenAI's o3 and Google's Gemini 2.5 Pro.
25
 
26
+ ## Paper Abstract
27
+ Encoder-only language models are frequently used for a variety of standard machine learning tasks, including classification and retrieval. However, there has been a lack of recent research for encoder models, especially with respect to multilingual models. We introduce mmBERT, an encoder-only language model pretrained on 3T tokens of multilingual text in over 1800 languages. To build mmBERT we introduce several novel elements, including an inverse mask ratio schedule and an inverse temperature sampling ratio. We add over 1700 low-resource languages to the data mix only during the decay phase, showing that it boosts performance dramatically and maximizes the gains from the relatively small amount of training data. Despite only including these low-resource languages in the short decay phase we achieve similar classification performance to models like OpenAI's o3 and Google's Gemini 2.5 Pro. Overall, we show that mmBERT significantly outperforms the previous generation of models on classification and retrieval tasks -- on both high and low-resource languages.
28
+
29
+ ## Overview
30
+ mmBERT introduces the first modern multilingual encoder trained with cascading annealed language learning (ALL), progressively incorporating 1833 languages during training. It significantly outperforms previous generation models like XLM-R on classification, embedding, and retrieval tasks while achieving remarkable efficiency improvements (up to 4x faster). Built on the ModernBERT architecture with novel inverse masking schedules and high-quality multilingual data, mmBERT demonstrates that low-resource languages can be effectively learned during the decay phase of training.
31
 
32
  ## Table of Contents
 
33
  - [Quick Start](#quick-start)
 
 
34
  - [Model Family](#model-family)
35
+ - [Novel Training Innovations](#novel-training-innovations)
36
  - [Training Data](#training-data)
37
  - [Usage Examples](#usage-examples)
38
  - [Fine-tuning Examples](#fine-tuning-examples)
39
  - [Model Architecture](#model-architecture)
40
+ - [Training](#training)
41
+ - [Evaluation](#evaluation)
42
+ - [FAQ](#faq)
43
+ - [Limitations](#limitations)
44
  - [Citation](#citation)
45
 
 
46
  ## Quick Start
47
 
48
  ### Installation
49
  ```bash
50
  pip install torch>=1.9.0
51
+ pip install transformers>=4.48.0
52
  ```
53
 
54
+ ### 30-Second Examples
55
 
56
+ **Small Model for Fast Inference:**
57
  ```python
58
  from transformers import AutoTokenizer, AutoModel
59
 
60
+ tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmbert-small")
61
+ model = AutoModel.from_pretrained("jhu-clsp/mmbert-small")
62
 
63
+ # Example: Get multilingual embeddings
64
+ inputs = tokenizer("Hello world! 你好世界! Bonjour le monde!", return_tensors="pt")
65
  outputs = model(**inputs)
66
+ embeddings = outputs.last_hidden_state.mean(dim=1)
67
  ```
68
 
69
+ **Base Model for Masked Language Modeling:**
70
+ ```python
71
+ from transformers import AutoTokenizer, AutoModelForMaskedLM
72
+ import torch
73
+
74
+ tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmbert-base")
75
+ model = AutoModelForMaskedLM.from_pretrained("jhu-clsp/mmbert-base")
76
+
77
+ # Example: Multilingual masked language modeling
78
+ text = "The capital of [MASK] is Paris."
79
+ inputs = tokenizer(text, return_tensors="pt")
80
+ with torch.no_grad():
81
+     outputs = model(**inputs)
82
+
83
+ # Get predictions for [MASK] tokens
84
+ mask_indices = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)
85
+ predictions = outputs.logits[mask_indices]
86
+ top_tokens = torch.topk(predictions, 5, dim=-1)
87
+ predicted_words = [tokenizer.decode(token) for token in top_tokens.indices[0]]
88
+ print(f"Predictions: {predicted_words}")
89
+ ```
90
 
91
+ ## Model Family
92
 
93
+ ### Main Models
94
 
95
+ | Size | Model | Parameters | Languages | Context | Best For | Download |
96
+ |:-----|:------|:-----------|:----------|:--------|:---------|:---------|
97
+ | Small | [mmbert-small](https://huggingface.co/jhu-clsp/mmbert-small) | 140M | 1833 | 8192 | Fast inference, edge deployment | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/mmbert-small) |
98
+ | Base | [mmbert-base](https://huggingface.co/jhu-clsp/mmbert-base) | 307M | 1833 | 8192 | Best performance, production use | [![Download](https://img.shields.io/badge/🤗-Download-blue)](https://huggingface.co/jhu-clsp/mmbert-base) |
99
+
100
+ ### Key Features
101
+
102
+ - **1833 Languages**: Covers more languages than any previous multilingual encoder
103
+ - **Extended Context**: Up to 8192 tokens (vs 512 for XLM-R)
104
+ - **Efficiency**: 2-4x faster inference than previous multilingual models
105
+ - **Modern Architecture**: Based on ModernBERT with RoPE, GLU activations, and Flash Attention 2
106
+ - **Open Training**: Complete training data, recipes, and checkpoints available
107
 
108
  ## Novel Training Innovations
109
 
 
115
 
116
  **Model Merging**: Combine English-focused, high-resource, and all-language decay variants using TIES merging.
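
A minimal sketch of a TIES-style merge over PyTorch `state_dict`s (illustrative only; this is not the official mmBERT merging script, the checkpoint ids in the usage comment are placeholders for the decay-phase variants, and the `density` value is arbitrary):

```python
# Simplified TIES-style merge: trim small task-vector entries, elect a sign
# per parameter, then average the surviving entries that agree with that sign.
import torch


def ties_merge(base_sd, variant_sds, density=0.2):
    merged = {}
    for name, base in base_sd.items():
        base_f = base.float()
        deltas = [sd[name].float() - base_f for sd in variant_sds]
        trimmed = []
        for d in deltas:
            k = max(1, int(density * d.numel()))
            # Keep only the top-k entries by magnitude in each task vector.
            thresh = d.abs().flatten().kthvalue(d.numel() - k + 1).values
            trimmed.append(torch.where(d.abs() >= thresh, d, torch.zeros_like(d)))
        stacked = torch.stack(trimmed)
        elected_sign = torch.sign(stacked.sum(dim=0))
        keep = (torch.sign(stacked) == elected_sign) & (stacked != 0)
        counts = keep.sum(dim=0).clamp(min=1)
        merged[name] = (base_f + (stacked * keep).sum(dim=0) / counts).to(base.dtype)
    return merged


# Hypothetical usage (checkpoint ids are placeholders for the decay variants):
# from transformers import AutoModel
# base = AutoModel.from_pretrained("jhu-clsp/mmBERT-base")
# variant_sds = [AutoModel.from_pretrained(p).state_dict() for p in variant_ids]
# base.load_state_dict(ties_merge(base.state_dict(), variant_sds))
```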
117
 
118
  ## Training Data
119
 
120
+ mmBERT was trained on a carefully curated 3T+ token multilingual dataset:
121
 
122
  | Phase | Dataset | Tokens | Description |
123
  |:------|:--------|:-------|:------------|
 
127
  | Mid-training | [mmbert-midtraining](https://huggingface.co/datasets/jhu-clsp/mmbert-midtraining-data) | 600B | 110 languages, context extension to 8K |
128
  | Decay Phase | [mmbert-decay](https://huggingface.co/datasets/jhu-clsp/mmbert-decay-data) | 100B | 1833 languages, premium quality |
129
 
130
+ **Primary Sources:**
131
+ - **Filtered DCLM**: High-quality English content
132
+ - **FineWeb2**: Broad multilingual web coverage (1800+ languages)
133
+ - **FineWeb2-HQ**: Filtered subset of 20 high-resource languages
134
+ - **Code**: StarCoder and ProLong repositories
135
+ - **Academic**: ArXiv papers and PeS2o scientific content
136
+ - **Reference**: Wikipedia (MegaWika) and textbooks
137
+ - **Community**: StackExchange discussions
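
A minimal streaming-load sketch for one of the phase datasets listed above, assuming a `train` split (check each dataset card for the exact configuration and fields):

```python
from datasets import load_dataset

# Stream the decay-phase dataset without downloading it in full.
decay = load_dataset("jhu-clsp/mmbert-decay", split="train", streaming=True)
for i, example in enumerate(decay):
    print(example.keys())  # inspect the available fields
    if i >= 2:
        break
```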
138
 
139
  ## Usage Examples
140
 
141
+ ### Classification Task
142
 
143
+ <details>
144
+ <summary><strong>Click to expand classification fine-tuning example</strong></summary>
145
 
146
+ ```python
147
+ from transformers import AutoTokenizer, AutoModel
148
+ import torch.nn as nn
149
+
150
+ # Load model for classification
151
+ tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmbert-base")
152
+ encoder = AutoModel.from_pretrained("jhu-clsp/mmbert-base")
153
+
154
+ # Add classification head
155
+ class MultilingualClassifier(nn.Module):
156
+     def __init__(self, encoder, num_classes):
157
+         super().__init__()
158
+         self.encoder = encoder
159
+         self.classifier = nn.Linear(encoder.config.hidden_size, num_classes)
160
+         self.dropout = nn.Dropout(0.1)
161

162
+     def forward(self, input_ids, attention_mask=None):
163
+         outputs = self.encoder(input_ids, attention_mask=attention_mask)
164
+         pooled_output = outputs.last_hidden_state[:, 0]  # Use [CLS] token
165
+         pooled_output = self.dropout(pooled_output)
166
+         return self.classifier(pooled_output)
167
+
168
+ # Initialize classifier
169
+ model = MultilingualClassifier(encoder, num_classes=3)
170
 
171
+ # Example multilingual inputs
172
  texts = [
173
+ "This is a positive review.",
174
+ "Ceci est un avis négatif.",
175
+ "这是一个中性评价。"
176
  ]
177
+ inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
178
+ predictions = model(**inputs)
179
  ```
180
 
181
+ </details>
182
+
183
+ ### Multilingual Retrieval
184
+
185
+ <details>
186
+ <summary><strong>Click to expand multilingual retrieval example</strong></summary>
187
 
188
  ```python
189
  from transformers import AutoTokenizer, AutoModel
190
  import torch
191
+ import numpy as np
192
 
193
+ tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmbert-base")
194
+ model = AutoModel.from_pretrained("jhu-clsp/mmbert-base")
195
 
196
  def get_embeddings(texts):
197
+     inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)

198
      with torch.no_grad():
199
          outputs = model(**inputs)
200
+         # Mean pooling
201
+         embeddings = outputs.last_hidden_state.mean(dim=1)
202
      return embeddings.numpy()
203
 
204
+ # Multilingual document retrieval
205
+ documents = [
206
+ "Artificial intelligence is transforming healthcare.",
207
+ "L'intelligence artificielle transforme les soins de santé.",
208
+ "人工智能正在改变医疗保健。",
209
+ "Climate change requires immediate action.",
210
+ "El cambio climático requiere acción inmediata."
211
  ]
212
 
213
+ query = "AI in medicine"
214
+
215
+ # Get embeddings
216
+ doc_embeddings = get_embeddings(documents)
217
+ query_embedding = get_embeddings([query])
218
+
219
+ # Compute similarities
220
+ similarities = np.dot(doc_embeddings, query_embedding.T).flatten()
221
+ ranked_docs = np.argsort(similarities)[::-1]
222
+
223
+ print("Most similar documents:")
224
+ for i, doc_idx in enumerate(ranked_docs[:3]):
225
+ print(f"{i+1}. {documents[doc_idx]} (score: {similarities[doc_idx]:.3f})")
226
+ ```
227
+
228
+ </details>
229
+
230
+ ### Long Context Processing
231
+
232
+ <details>
233
+ <summary><strong>Click to expand long context processing example</strong></summary>
234
+
235
+ ```python
236
+ from transformers import AutoTokenizer, AutoModel
237
+
238
+ tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmbert-base")
239
+ model = AutoModel.from_pretrained("jhu-clsp/mmbert-base")
240
+
241
+ # Process long multilingual document (up to 8192 tokens)
242
+ long_text = """
243
+ This is a very long multilingual document...
244
+ Ceci est un très long document multilingue...
245
+ 这是一个非常长的多语言文档...
246
+ """ * 100 # Simulate long text
247
+
248
+ # Tokenize with extended context
249
+ inputs = tokenizer(
250
+ long_text,
251
+ return_tensors="pt",
252
+ max_length=8192,
253
+ truncation=True
254
+ )
255
+
256
+ # Process efficiently with Flash Attention
257
+ with torch.no_grad():
258
+ outputs = model(**inputs)
259
+
260
+ print(f"Processed {inputs['input_ids'].shape[1]} tokens")
261
+ print(f"Output shape: {outputs.last_hidden_state.shape}")
262
  ```
263
 
264
+ </details>
265
+
266
  ## Fine-tuning Examples
267
 
268
  ### Dense Retrieval with Sentence Transformers
 
289
  args = parser.parse_args()
290
 
291
  lr = args.lr
 
292
model_name = args.model_name
model_shortname = model_name.split("/")[-1]
293
 
294
  model = SentenceTransformer(model_name)
 
526
 
527
  </details>
528
 
529
+ ## Model Architecture
530
 
531
+ | Parameter | mmBERT-small | mmBERT-base |
532
+ |:----------|:-------------|:------------|
533
+ | Layers | 22 | 22 |
534
+ | Hidden Size | 384 | 768 |
535
+ | Intermediate Size | 1152 | 1152 |
536
+ | Attention Heads | 6 | 12 |
537
+ | Total Parameters | 140M | 307M |
538
+ | Non-embedding Parameters | 42M | 110M |
539
+ | Max Sequence Length | 8192 | 8192 |
540
+ | Vocabulary Size | 256,000 | 256,000 |
541
+ | Tokenizer | Gemma 2 | Gemma 2 |
542
 
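
These numbers can also be read off the model config; a quick check, assuming the standard `transformers` config attribute names:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("jhu-clsp/mmBERT-base")
print(config.num_hidden_layers)        # layers
print(config.hidden_size)              # hidden size
print(config.intermediate_size)        # intermediate size
print(config.num_attention_heads)      # attention heads
print(config.vocab_size)               # vocabulary size
print(config.max_position_embeddings)  # max sequence length
```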
543
+ ## Training
544
 
545
+ Using 8xH100s, training took approximately 10 days for mmBERT-small and 40 days for mmBERT-base.
546
+
547
+ ### Training Recipe: Cascading Annealed Language Learning
548
+
549
+ mmBERT introduces novel training techniques (a short schedule sketch follows this list):
550
+
551
+ 1. **Inverse Masking Schedule**: Start with 30% masking, gradually reduce to 5%
552
+ 2. **Language Progression**: 60 → 110 → 1833 languages across training phases
553
+ 3. **Temperature Annealing**: 0.7 → 0.5 → 0.3 for increasingly uniform language sampling
554
+ 4. **High-Quality Data**: Progressive upgrade from web crawl to filtered premium sources
555
+
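
A small illustration of the schedule above, assuming simple exponent-based temperature sampling; the per-phase values are taken from this list and the training-phase descriptions below, not from the released training configs:

```python
# Illustrative per-phase settings (not the official config files).
PHASES = {
    "pretraining":  {"languages": 60,   "mask_ratio": 0.30, "sampling_temperature": 0.7},
    "mid_training": {"languages": 110,  "mask_ratio": 0.15, "sampling_temperature": 0.5},
    "decay":        {"languages": 1833, "mask_ratio": 0.05, "sampling_temperature": 0.3},
}


def language_sampling_weights(token_counts, temperature):
    """Temperature-scaled sampling: lower temperature -> more uniform over languages."""
    scaled = {lang: n ** temperature for lang, n in token_counts.items()}
    total = sum(scaled.values())
    return {lang: s / total for lang, s in scaled.items()}


# A lower temperature boosts low-resource languages relative to their raw share.
counts = {"eng": 1_000_000, "fao": 1_000}
print(language_sampling_weights(counts, PHASES["pretraining"]["sampling_temperature"]))
print(language_sampling_weights(counts, PHASES["decay"]["sampling_temperature"]))
```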
556
+ ### Training Details
557
 
558
+ ### Architecture
559
+
560
+ | Component | Small | Base |
561
+ |:----------|:------|:-----|
562
+ | Layers | 22 | 22 |
563
+ | Hidden Size | 384 | 768 |
564
+ | Intermediate Size | 1152 | 1152 |
565
+ | Attention Heads | 6 | 12 |
566
+ | Parameters (Total) | 140M | 307M |
567
+ | Parameters (Non-Embed) | 42M | 110M |
568
+ | Max Sequence Length | 8192 | 8192 |
569
+ | Vocabulary Size | 256,000 | 256,000 |
570
+
571
+ ### Training Configuration
572
+
573
+ **Data Mixture:**
574
+ * Pre-training (2.0T tokens): Web crawl, code, scientific papers, reference materials
575
+ * Mid-training (600B tokens): Higher quality filtered data with context extension
576
+ * Decay phase (100B tokens): Premium sources including textbooks and curated content
577
+
578
+ **Architecture Features:**
579
+ * ModernBERT-based transformer with RoPE positional embeddings
580
+ * GLU activations and prenorm layer normalization
581
+ * Flash Attention 2 for efficient long-context processing (see the loading note below)
582
+ * Gemma 2 tokenizer for multilingual coverage
583
+
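
If the `flash-attn` package and a compatible GPU are available, the Flash Attention 2 kernels can typically be requested at load time; a minimal sketch (otherwise the default attention implementation is used):

```python
import torch
from transformers import AutoModel

# Requires a compatible GPU and the flash-attn package; drop the
# attn_implementation argument to fall back to the default kernels.
model = AutoModel.from_pretrained(
    "jhu-clsp/mmBERT-base",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
).to("cuda")
```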
584
+ **Training Phases:**
585
+ 1. **Base Pre-training**: 60 languages, 30% masking, learning rate warmup
586
+ 2. **Context Extension**: 110 languages, 15% masking, extended context to 8K
587
+ 3. **Decay Phase**: 1833 languages, 5% masking, high-quality data focus
588
+
589
+ ## Evaluation
590
+ Evaluation code for retrieval tasks is the same as [Ettin](https://github.com/JHU-CLSP/ettin-encoder-vs-decoder/tree/main/retrieval_eval).
591
+
592
+ Evaluation code for efficiency is taken from the [ModernBERT](https://github.com/AnswerDotAI/ModernBERT/tree/main/efficiency) repo.
593
+
594
+ Evaluation code for NLU tasks is based on the [mGTE codebase](https://github.com/izhx/nlu-evals) and our fork will be uploaded soon. Please raise an issue or message us if this would be helpful for you.
595
+
596
+ ## FAQ
597
+
598
+ **Q: How does mmBERT compare to XLM-R?**
599
+ **A:** mmBERT significantly outperforms XLM-R across all benchmarks:
600
+ - +2.4 points average on XTREME
601
+ - +3.0 points on GLUE
602
+ - 16x more languages (1833 vs 100)
603
+ - 16x longer context (8K vs 512 tokens)
604
+ - 2-4x faster inference
605
+
606
+ **Q: Which languages does mmBERT support?**
607
+ **A:** mmBERT supports 1833 languages and scripts from FineWeb2, including:
608
+ - All major world languages (English, Chinese, Spanish, etc.)
609
+ - European languages (including low-resource ones like Faroese)
610
+ - African languages (Swahili, Amharic, etc.)
611
+ - Asian languages (Hindi, Bengali, Thai, etc.)
612
+ - Many low-resource and indigenous languages
613
+
614
+ **Q: How does the annealed language learning work?**
615
+ **A:** We progressively add languages in three phases:
616
+ 1. Start with 60 high-resource languages (pre-training)
617
+ 2. Add 50 mid-resource languages (mid-training)
618
+ 3. Add 1723 low-resource languages (decay phase)
619
+
620
+ This allows efficient learning without overfitting on low-resource data.
621
+
622
+ **Q: Can I fine-tune mmBERT for my specific task?**
623
+ **A:** Yes! mmBERT works as a drop-in replacement for XLM-R:
624
+ ```python
625
+ from transformers import AutoModel, AutoTokenizer
626
+
627
+ # Load for fine-tuning
628
+ model = AutoModel.from_pretrained("jhu-clsp/mmbert-base")
629
+ tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmbert-base")
630
+
631
+ # Add task-specific head and fine-tune normally
632
+ ```
633
+
634
+ **Q: What about efficiency and memory requirements?**
635
+ **A:** mmBERT is significantly more efficient:
636
+ - 2-4x faster inference than XLM-R
637
+ - Flash Attention 2 reduces memory usage for long sequences
638
+ - Support for variable-length batching
639
+ - Optimized for both CPU and GPU deployment
640
+
641
+ **Q: How do I access the training data and checkpoints?**
642
+ **A:** All data and checkpoints are publicly available:
643
+ - Training data: [jhu-clsp/mmbert-pretraining-data](https://huggingface.co/datasets/jhu-clsp/mmbert-pretraining-data)
644
+ - Checkpoints: [jhu-clsp/mmbert-checkpoints](https://huggingface.co/models/jhu-clsp/mmbert-checkpoints)
645
+ - Github code: [GitHub repository](https://github.com/jhu-clsp/mmBERT)
646
+ - Data processing code: [Same as Ettin models](https://github.com/jhu-clsp/ettin-encoder-vs-decoder)
647
+
648
+ ## Limitations
649
+
650
+ - Structured prediction tasks (NER, POS) show slightly lower scores due to tokenizer prefix space handling
651
+ - Very low-resource languages still have limited training data
652
+ - High-quality educational content filtering could benefit from more languages
653
 
654
  ## Citation
655
 
656
+ If you use mmBERT models in your research, please cite our work:
657
 
658
  ```bibtex
659
  @misc{marone2025mmbertmodernmultilingualencoder,
 
665
  primaryClass={cs.CL},
666
  url={https://arxiv.org/abs/2509.06888},
667
  }
668
+ ```