mrm8488 committed on
Commit
99bcc85
1 Parent(s): 1ef0787

Update README.md

Files changed (1)
  1. README.md +43 -0
README.md CHANGED
@@ -10,3 +10,46 @@ widget:
  ---
 
  # Spanish BERT2BERT (BETO) fine-tuned on MLSUM ES for summarization
+
+ ## Model
+ [dccuchile/bert-base-spanish-wwm-cased](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased) (BERT Checkpoint)
+
+ ## Dataset
+ **MLSUM** is the first large-scale MultiLingual SUMmarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five different languages: French, German, **Spanish**, Russian, and Turkish. Together with English newspapers from the popular CNN/Daily Mail dataset, the collected data form a large-scale multilingual dataset which can enable new research directions for the text summarization community. The MLSUM authors report cross-lingual comparative analyses based on state-of-the-art systems. These highlight existing biases which motivate the use of a multilingual dataset.
+
+ [MLSUM es](https://huggingface.co/datasets/viewer/?dataset=mlsum)
+
+ ## Results
+
+ | Set | Metric | Value |
+ |-----|--------|-------|
+ | Test | Rouge2 - mid - precision | **32.41** |
+ | Test | Rouge2 - mid - recall | **28.65** |
+ | Test | Rouge2 - mid - fmeasure | **29.48** |
+
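For readers unfamiliar with the metric, the sketch below shows roughly how ROUGE-2 precision, recall, and F-measure relate for a single reference/candidate pair. It is a simplified illustration and not the scoring code behind the table: the `rouge2` and `bigrams` helpers are hypothetical names, tokenization is assumed to be lowercased whitespace splitting with no stemming, and scores come back as fractions in [0, 1] rather than the percentage-style values reported above. It also scores one pair only and does not reproduce the aggregation that yields the "mid" figures.

```python
from collections import Counter


def bigrams(tokens):
    """Return a multiset (Counter) of adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))


def rouge2(reference: str, candidate: str) -> dict:
    """Simplified ROUGE-2 for one reference/candidate pair.

    Assumption: lowercased whitespace tokenization, no stemming.
    """
    ref = bigrams(reference.lower().split())
    cand = bigrams(candidate.lower().split())

    # Clipped overlap: each bigram counts at most as often as it
    # appears in both the reference and the candidate.
    overlap = sum((ref & cand).values())

    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    if precision + recall:
        fmeasure = 2 * precision * recall / (precision + recall)
    else:
        fmeasure = 0.0
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}


# Illustrative sentences, not taken from MLSUM
scores = rouge2("el gato duerme en casa", "el gato duerme")
print(scores)
```

In this toy pair every candidate bigram appears in the reference, so precision is 1.0, while only two of the four reference bigrams are covered, so recall is 0.5.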
+ ## Usage
+
+ ```python
+ import torch
+ from transformers import BertTokenizerFast, EncoderDecoderModel
+ device = 'cuda' if torch.cuda.is_available() else 'cpu'
+ ckpt = 'mrm8488/bert2bert_shared-spanish-finetuned-summarization'
+ tokenizer = BertTokenizerFast.from_pretrained(ckpt)
+ model = EncoderDecoderModel.from_pretrained(ckpt).to(device)
+
+ def generate_summary(text):
+     # Pad or truncate the input to 512 tokens before generating
+     inputs = tokenizer([text], padding="max_length", truncation=True, max_length=512, return_tensors="pt")
+     input_ids = inputs.input_ids.to(device)
+     attention_mask = inputs.attention_mask.to(device)
+     output = model.generate(input_ids, attention_mask=attention_mask)
+     return tokenizer.decode(output[0], skip_special_tokens=True)
+
+ text = "Your text here..."
+ print(generate_summary(text))
+ ```
+
+ > Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) with the support of [Narrativa](https://www.narrativa.com/)
+
+ > Made with <span style="color: #e25555;">&hearts;</span> in Spain
+