moBERTo: A Modern Encoder for Portuguese via Continued Pretraining of ModernBERT
Abstract
moBERTo, a Portuguese adaptation of ModernBERT, demonstrates strong performance in information retrieval and natural language understanding tasks through continued pretraining and specialized architecture preservation.
Encoder-only transformer models remain essential for production NLP pipelines. We introduce moBERTo, a Portuguese adaptation of ModernBERT obtained through continued pretraining of the ModernBERT-base checkpoint on 60 billion tokens (5 epochs over a 12-billion-token corpus curated from FineWeb2 and filtered with educational and STEM classifiers). We preserve the original architecture, including rotary positional embeddings, alternating local-global attention, flash attention, and unpadding. We evaluate moBERTo across information retrieval (including long-context retrieval at up to 8,192 tokens), document classification, named entity recognition, and natural language understanding. Our best variant, which combines a Portuguese tokenizer with subword-matching embedding transfer and long-context post-training, achieves the highest average reranking nDCG@10 across three Portuguese retrieval benchmarks and the best results on PLUE-PT. Through ablation studies, we show that (i) continued pretraining is strongly preferable to training from scratch, particularly for preserving long-context capabilities; (ii) tokenizer adaptation improves token-level tasks but degrades long-context retrieval; (iii) a dedicated long-context post-training phase at 8,192 tokens further improves reranking and NER; and (iv) encoder-only architectures remain competitive with larger decoder-only alternatives for discriminative tasks. We publicly release the model weights at https://huggingface.co/Tropic-AI/moBERTo and training data at https://huggingface.co/datasets/Tropic-AI/moberto-pretraining-dataset-c4-compatible on Hugging Face.
Get this paper in your agent:
hf papers read 2606.22722 Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash Models citing this paper 2
Tropic-AI/moBERTo-orig-tokenizer
Datasets citing this paper 1
Tropic-AI/moberto-pretraining-dataset-c4-compatible
Spaces citing this paper 0
No Space linking this paper