XLM-R Longformer Model / XLM-Long

XLM-R Longformer (or XLM-Long for short) is a XLM-R model that has been extended to allow sequence lengths up to 4096 tokens, instead of the regular 512. The model was pre-trained from the XLM-RoBERTa checkpoint using the Longformer pre-training scheme on the English WikiText-103 corpus.

The reason for this was to investigate methods for creating efficient Transformers for low-resource languages, such as Swedish, without the need to pre-train them on long-context datasets in each respecitve language. The trained model came as a result of a master thesis project at Peltarion and was fine-tuned on multilingual quesion-answering tasks, with code available here.

Since both XLM-R model and Longformer models are large models, it it recommended to run the models with NVIDIA Apex (16bit precision), large GPU and several gradient accumulation steps.

How to Use

The model can be used as expected to fine-tune on a downstream task.
For instance for QA.

import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME_OR_PATH = "markussagen/xlm-roberta-longformer-base-4096"

tokenizer = AutoTokenizer.from_pretrained(

model = AutoModelForQuestionAnswering.from_pretrained(

Training Procedure

The model have been trained on the WikiText-103 corpus, using a 48GB GPU with the following training script and parameters. The model was pre-trained for 6000 iterations and took ~5 days. See the full training script and Github repo for more information

wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
unzip wikitext-103-raw-v1.zip   

export DATA_DIR=./wikitext-103-raw

scripts/run_long_lm.py \
    --model_name_or_path xlm-roberta-base \
    --model_name xlm-roberta-to-longformer \
    --output_dir ./output \
    --logging_dir ./logs \
    --val_file_path $DATA_DIR/wiki.valid.raw \
    --train_file_path $DATA_DIR/wiki.train.raw \
    --seed 42 \
    --max_pos 4096 \
    --adam_epsilon 1e-8 \
    --warmup_steps 500 \
    --learning_rate 3e-5 \
    --weight_decay 0.01 \
    --max_steps 6000 \
    --evaluate_during_training \
    --logging_steps 50 \
    --eval_steps 50 \
    --save_steps 6000  \
    --max_grad_norm 1.0 \
    --per_device_eval_batch_size 2 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 64 \
    --overwrite_output_dir \
    --fp16 \
    --do_train \

Select AutoNLP in the “Train” menu to fine-tune this model automatically.

Downloads last month
Hosted inference API
Mask token: <mask>
This model can be loaded on the Inference API on-demand.