Spanish BERT2BERT (BETO) fine-tuned on MLSUM ES for summarization

Model

dccuchile/bert-base-spanish-wwm-cased (BERT Checkpoint)

Dataset

MLSUM is the first large-scale MultiLingual SUMmarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five languages: French, German, Spanish, Russian, and Turkish. Together with English newspapers from the popular CNN/Daily Mail dataset, the collected data form a large-scale multilingual dataset that can enable new research directions for the text summarization community. The MLSUM authors report cross-lingual comparative analyses based on state-of-the-art systems; these highlight existing biases that motivate the use of a multilingual dataset.

MLSUM es

Results

Set    Metric                      Value
Test   Rouge2 - mid - precision     9.6
Test   Rouge2 - mid - recall        8.4
Test   Rouge2 - mid - fmeasure      8.7
Test   Rouge1                      26.24
Test   Rouge2                       8.9
Test   RougeL                      21.01
Test   RougeLsum                   21.02
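
To make the precision, recall, and fmeasure rows above easier to read, here is a minimal sketch of how a ROUGE-N score can be computed for a single reference/candidate pair. It is an illustration only, not the evaluation script used for this model: it assumes lowercasing and whitespace tokenization, applies no stemming, and the function names `ngram_counts` and `rouge_n` are hypothetical. The "mid" label in the table presumably refers to the mid-point value that common ROUGE tooling reports after aggregating over the whole test set, which this single-pair sketch does not reproduce.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=2):
    """Simplified ROUGE-N for one pair (assumption: whitespace tokens, no stemming)."""
    ref = ngram_counts(reference.lower().split(), n)
    cand = ngram_counts(candidate.lower().split(), n)
    # Clipped overlap: each n-gram counts at most as often as it occurs in both texts
    overlap = sum((ref & cand).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    total = precision + recall
    fmeasure = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}


# Made-up example sentences, not taken from MLSUM
reference = "el gobierno aprueba la nueva ley"
candidate = "el gobierno aprueba una ley"
scores = rouge_n(reference, candidate, n=2)
print(scores)  # precision 0.5, recall 0.4, fmeasure about 0.444
```

Precision is the share of the candidate's bigrams that also appear in the reference, recall is the share of the reference's bigrams recovered by the candidate, and fmeasure is their harmonic mean. Because the reported scores are aggregated over many examples, the fmeasure in the table need not equal the harmonic mean of the reported precision and recall.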

Usage

import torch
from transformers import BertTokenizerFast, EncoderDecoderModel

# Run on a GPU when one is available, otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
ckpt = 'mrm8488/bert2bert_shared-spanish-finetuned-summarization'
tokenizer = BertTokenizerFast.from_pretrained(ckpt)
model = EncoderDecoderModel.from_pretrained(ckpt).to(device)

def generate_summary(text):
    # Tokenize the input, padding or truncating it to 512 tokens
    inputs = tokenizer([text], padding="max_length", truncation=True, max_length=512, return_tensors="pt")
    input_ids = inputs.input_ids.to(device)
    attention_mask = inputs.attention_mask.to(device)
    # Generate the summary token ids and decode them back into text
    output = model.generate(input_ids, attention_mask=attention_mask)
    return tokenizer.decode(output[0], skip_special_tokens=True)

text = "Your text here..."
print(generate_summary(text))

Created by Manuel Romero/@mrm8488 with the support of Narrativa

Made with ♥ in Spain

Model size: 139M params (Safetensors; tensor types: I64, F32)