Update README.md
README.md
CHANGED
@@ -10,3 +10,46 @@ widget:
---
# Spanish BERT2BERT (BETO) fine-tuned on MLSUM ES for summarization

## Model

[dccuchile/bert-base-spanish-wwm-cased](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased) (BERT Checkpoint)

## Dataset

**MLSUM** is the first large-scale MultiLingual SUMmarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five languages: French, German, **Spanish**, Russian, and Turkish. Together with English newspapers from the popular CNN/Daily Mail dataset, the collected data form a large-scale multilingual dataset that can enable new research directions for the text summarization community. The MLSUM authors report cross-lingual comparative analyses based on state-of-the-art systems; these highlight existing biases that motivate the use of a multilingual dataset. This model was fine-tuned on the Spanish (ES) subset.

[MLSUM es](https://huggingface.co/datasets/viewer/?dataset=mlsum)

## Results

| Set | Metric | Value |
|------|--------|-------|
| Test | Rouge2 - mid - precision | **32.41** |
| Test | Rouge2 - mid - recall | **28.65** |
| Test | Rouge2 - mid - fmeasure | **29.48** |
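
ROUGE-2 scores the bigram overlap between a generated summary and its reference. As a rough illustration only, the sketch below computes precision, recall, and F-measure for a single summary pair using plain whitespace tokenization. It is not the evaluation script behind the numbers above (that script is not shown here); the `bigrams` and `rouge2` helpers and the example sentences are hypothetical. It returns fractions in [0, 1], whereas the table appears to report percentages, and the "mid" values are presumably aggregates over the whole test set rather than scores for any one pair.

```python
from collections import Counter


def bigrams(tokens):
    """Count the adjacent token pairs in a list of tokens."""
    return Counter(zip(tokens, tokens[1:]))


def rouge2(reference, candidate):
    """Return (precision, recall, fmeasure) of bigram overlap.

    Simplified sketch: lowercases and splits on whitespace, with no
    stemming or language-specific tokenization.
    """
    ref = bigrams(reference.lower().split())
    cand = bigrams(candidate.lower().split())
    # Counter intersection keeps the minimum count per bigram (clipped matches).
    overlap = sum((ref & cand).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    fmeasure = 2 * precision * recall / (precision + recall)
    return precision, recall, fmeasure


# Hypothetical reference and generated summaries, for illustration only.
reference = "el gobierno aprueba la nueva ley de vivienda"
candidate = "el gobierno aprueba la ley de vivienda"
precision, recall, fmeasure = rouge2(reference, candidate)
print(f"precision={precision:.2f} recall={recall:.2f} fmeasure={fmeasure:.2f}")
```

Here the candidate shares 5 of its 6 bigrams with the reference, which has 7, so precision is 5/6 and recall is 5/7. For comparable numbers on real data, an established ROUGE implementation is the safer choice, since tokenization and aggregation details change the scores.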

## Usage

```python
import torch
from transformers import BertTokenizerFast, EncoderDecoderModel

# Use a GPU when one is available; otherwise fall back to the CPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
ckpt = 'mrm8488/bert2bert_shared-spanish-finetuned-summarization'
tokenizer = BertTokenizerFast.from_pretrained(ckpt)
model = EncoderDecoderModel.from_pretrained(ckpt).to(device)

def generate_summary(text):
    # Tokenize the article, truncating or padding it to 512 tokens.
    inputs = tokenizer([text], padding="max_length", truncation=True, max_length=512, return_tensors="pt")
    input_ids = inputs.input_ids.to(device)
    attention_mask = inputs.attention_mask.to(device)
    # Generate the summary token ids, then decode them back to text.
    output = model.generate(input_ids, attention_mask=attention_mask)
    return tokenizer.decode(output[0], skip_special_tokens=True)

text = "Your text here..."
print(generate_summary(text))
```

> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) with the support of [Narrativa](https://www.narrativa.com/)

> Made with <span style="color: #e25555;">♥</span> in Spain