Spanish BERT2BERT (BETO) fine-tuned on MLSUM ES for summarization


dccuchile/bert-base-spanish-wwm-cased (BERT Checkpoint)


MLSUM is the first large-scale MultiLingual SUMmarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five languages: French, German, Spanish, Russian, and Turkish. Together with English newspapers from the popular CNN/Daily Mail dataset, the collected data form a large-scale multilingual dataset that can enable new research directions for the text summarization community. We report cross-lingual comparative analyses based on state-of-the-art systems. These highlight existing biases that motivate the use of a multilingual dataset.



Set   Metric                    Value
Test  Rouge2 - mid - precision  9.6
Test  Rouge2 - mid - recall     8.4
Test  Rouge2 - mid - fmeasure   8.7
Test  Rouge1                    26.24
Test  Rouge2                    8.9
Test  RougeL                    21.01
Test  RougeLsum                 21.02
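To make the ROUGE-2 precision, recall, and F-measure rows easier to read, here is a minimal sketch of how these three quantities are defined for a single reference/candidate pair. It is an illustration only, not the scorer used to produce the numbers above: it assumes plain whitespace tokenization and lowercasing, and the helper names `ngrams` and `rouge_n` are hypothetical. The "mid" values in the table are most likely aggregates over the whole test set, so they will not in general satisfy the single-pair F-measure formula exactly.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=2):
    """Return (precision, recall, fmeasure) of n-gram overlap.

    Simplified: whitespace tokenization and lowercasing only (an assumption),
    no stemming and no aggregation across examples.
    """
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    # Clipped overlap: each n-gram counts at most as often as it appears in both.
    overlap = sum((ref & cand).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    denom = precision + recall
    fmeasure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, fmeasure


if __name__ == "__main__":
    reference = "el gobierno aprueba la nueva ley de vivienda"
    candidate = "el gobierno aprueba la ley"
    p, r, f = rouge_n(reference, candidate, n=2)
    print(f"precision={p:.3f} recall={r:.3f} fmeasure={f:.3f}")
```

In this toy pair, three of the candidate's four bigrams appear among the reference's seven bigrams, so precision is 3/4 and recall is 3/7.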


import torch
from transformers import BertTokenizerFast, EncoderDecoderModel

# Use the GPU when one is available, otherwise fall back to the CPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
ckpt = 'mrm8488/bert2bert_shared-spanish-finetuned-summarization'
tokenizer = BertTokenizerFast.from_pretrained(ckpt)
model = EncoderDecoderModel.from_pretrained(ckpt).to(device)

def generate_summary(text):
    # Tokenize, padding or truncating the input to 512 tokens.
    inputs = tokenizer([text], padding="max_length", truncation=True, max_length=512, return_tensors="pt")
    # Move the tensors to the same device as the model.
    input_ids = inputs.input_ids.to(device)
    attention_mask = inputs.attention_mask.to(device)
    output = model.generate(input_ids, attention_mask=attention_mask)
    return tokenizer.decode(output[0], skip_special_tokens=True)

text = "Your text here..."
print(generate_summary(text))

Created by Manuel Romero/@mrm8488 with the support of Narrativa

Made with ♥ in Spain
