---
language:
- ru
tags:
- summarization
- bert
- rubert
license: MIT
---

# rubert_ria_headlines

## Description
A *bert2bert* model initialized with the `DeepPavlov/rubert-base-cased` pretrained weights and fine-tuned on the first 90% of the ["Rossiya Segodnya" news dataset](https://github.com/RossiyaSegodnya/ria_news_dataset) for 3 epochs.
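
The encoder and decoder can be warm-started from the same BERT checkpoint via the `EncoderDecoderModel` API. The sketch below is illustrative only and not necessarily the exact initialization code used here:

```python
from transformers import EncoderDecoderModel

# Warm-start both encoder and decoder from the same rubert checkpoint.
# tie_encoder_decoder=True shares their weights, matching the
# --tie_encoder_decoder flag in the training command below.
bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "DeepPavlov/rubert-base-cased",
    "DeepPavlov/rubert-base-cased",
    tie_encoder_decoder=True,
)
```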

## Usage example

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "dmitry-vorobiev/rubert_ria_headlines"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

text = "Скопируйте текст статьи / новости"

encoded_batch = tokenizer.prepare_seq2seq_batch(
    [text],
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=512)

# Generate a headline with beam search
output_ids = model.generate(
    input_ids=encoded_batch["input_ids"],
    max_length=32,
    no_repeat_ngram_size=3,
    num_beams=5,
    top_k=0
)

# Decode the best hypothesis back into text
headline = tokenizer.decode(output_ids[0],
                            skip_special_tokens=True,
                            clean_up_tokenization_spaces=False)
print(headline)
```
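
Note: on newer `transformers` versions where `prepare_seq2seq_batch` is deprecated or no longer available, calling the tokenizer directly should produce an equivalent batch:

```python
encoded_batch = tokenizer(
    [text],
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=512)
```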
   
## Datasets
- [ria_news](https://github.com/RossiyaSegodnya/ria_news_dataset)

## How was it trained?

I used a free TPUv3 on Kaggle. The model was trained for 3 epochs with an effective batch size of 192 and soft restarts (warmup steps of 1500 / 500 / 500, with a fresh optimizer state at the start of each epoch).
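
As a rough illustration of the soft-restart scheme (a sketch with placeholder values and an assumed linear warmup/decay schedule, not the actual training loop), a fresh optimizer and schedule are created at the start of every epoch:

```python
import torch
from transformers import get_linear_schedule_with_warmup

# `model` refers to the bert2bert encoder-decoder shown earlier.
warmup_per_epoch = [1500, 500, 500]   # warmup steps for epochs 1 / 2 / 3
steps_per_epoch = 1000                # placeholder; depends on dataset and batch size

for epoch, warmup_steps in enumerate(warmup_per_epoch):
    # New optimizer state on each epoch start ("soft restart"),
    # hyperparameters as in the training command below.
    optimizer = torch.optim.AdamW(
        model.parameters(), lr=5e-4, eps=1e-6, weight_decay=1e-5)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=warmup_steps,
        num_training_steps=steps_per_epoch)
    # ... run one epoch, calling optimizer.step() and scheduler.step() ...
```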

- [1 epoch notebook](https://www.kaggle.com/dvorobiev/try-train-seq2seq-ria-tpu?scriptVersionId=53254694)
- [2 epoch notebook](https://www.kaggle.com/dvorobiev/try-train-seq2seq-ria-tpu?scriptVersionId=53269040)
- [3 epoch notebook](https://www.kaggle.com/dvorobiev/try-train-seq2seq-ria-tpu?scriptVersionId=53280797)

Common training params (effective batch size = 8 TPU cores × 24 per device × 1 gradient accumulation step = 192):

```shell
export XLA_USE_BF16=1
export XLA_TENSOR_ALLOCATOR_MAXSIZE=100000000

python nlp_headline_rus/src/train_seq2seq.py \
    --do_train \
    --tie_encoder_decoder \
    --max_source_length 512 \
    --max_target_length 32 \
    --val_max_target_length 48 \
    --tpu_num_cores 8 \
    --per_device_train_batch_size 24 \
    --gradient_accumulation_steps 1 \
    --learning_rate 5e-4 \
    --adam_epsilon 1e-6 \
    --weight_decay 1e-5
```

## Validation results

- Using [last 1% of ria](https://drive.google.com/drive/folders/1xtCnkbGNNu5jGQ9H9Mg55Cx7RTcyhQw9) dataset
- Using [last 10% of ria](https://drive.google.com/drive/folders/1w6rAXhpFUO8I4A7xfHKUjMBPEKBHEO3h) dataset
- Using [gazeta_ru test](https://drive.google.com/drive/folders/185ALuNVbbT_C1ZHQYn1OlOc9vRVILvHs) split
- Using [gazeta_ru val](https://drive.google.com/drive/folders/1BLiL3H0n56e8Q9jSuDgaH_3LLpmKxuVG) split
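
For reference, generated headlines can be scored against the dataset titles with ROUGE. This is only an illustrative snippet (hypothetical, not the code used to produce the predictions linked above):

```python
import evaluate  # pip install evaluate rouge_score

rouge = evaluate.load("rouge")
predictions = ["сгенерированный заголовок"]  # model outputs
references = ["эталонный заголовок"]         # reference titles
print(rouge.compute(predictions=predictions, references=references))
```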