---
language:
- ru
tags:
- summarization
- bert
- rubert
license: mit
---
# rubert_ria_headlines
## Description
A *bert2bert* model for headline generation, initialized with the `DeepPavlov/rubert-base-cased` pretrained weights and
fine-tuned for 3 epochs on the first 90% of the ["Rossiya Segodnya" news dataset](https://github.com/RossiyaSegodnya/ria_news_dataset).
## Usage example
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "dmitry-vorobiev/rubert_ria_headlines"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

# Placeholder: paste the text of a Russian article / news item here
text = "Скопируйте текст статьи / новости"

# Calling the tokenizer directly replaces the deprecated prepare_seq2seq_batch
encoded_batch = tokenizer(
    [text],
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=512,
)

# Beam search; the attention mask keeps padding tokens from being attended to
output_ids = model.generate(
    input_ids=encoded_batch["input_ids"],
    attention_mask=encoded_batch["attention_mask"],
    max_length=32,
    no_repeat_ngram_size=3,
    num_beams=5,
)

headline = tokenizer.decode(
    output_ids[0],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)
print(headline)
```
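To illustrate what `no_repeat_ngram_size=3` asks of the decoder in the example above: no sequence of three consecutive tokens may occur twice in the generated output. The following is a minimal sketch of that constraint as a post-hoc check, not the library's implementation; `has_repeated_ngram` is a hypothetical helper and the token lists are made up.

```python
def has_repeated_ngram(tokens, n=3):
    """Return True if any n-gram of consecutive tokens occurs more than once."""
    seen = set()
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        if ngram in seen:
            return True
        seen.add(ngram)
    return False


# Made-up token sequences, for illustration only
ok = ["a", "b", "c", "a", "b", "d"]        # "a b" repeats, but no 3-gram does
bad = ["a", "b", "c", "d", "a", "b", "c"]  # "a b c" occurs twice

print(has_repeated_ngram(ok))
print(has_repeated_ngram(bad))
```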
## Datasets
- [ria_news](https://github.com/RossiyaSegodnya/ria_news_dataset)
## How was it trained?
I used a free TPU v3 on Kaggle. The model was trained for 3 epochs with an effective batch size of 192 and soft restarts: warmup of 1500 / 500 / 500 steps, with a fresh optimizer state at the start of each epoch.
- [1 epoch notebook](https://www.kaggle.com/dvorobiev/try-train-seq2seq-ria-tpu?scriptVersionId=53254694)
- [2 epoch notebook](https://www.kaggle.com/dvorobiev/try-train-seq2seq-ria-tpu?scriptVersionId=53269040)
- [3 epoch notebook](https://www.kaggle.com/dvorobiev/try-train-seq2seq-ria-tpu?scriptVersionId=53280797)
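The soft restarts described above can be sketched as a learning-rate schedule that starts over at every epoch. This is only an illustration under stated assumptions: the warmup lengths (1500 / 500 / 500) come from this card and the peak rate from the `--learning_rate 5e-4` flag, while `STEPS_PER_EPOCH` is a hypothetical value and the linear decay after warmup is assumed, since the card does not state the decay shape.

```python
PEAK_LR = 5e-4                         # from --learning_rate
WARMUP_PER_EPOCH = (1500, 500, 500)    # from the description above
STEPS_PER_EPOCH = 5000                 # hypothetical, not stated in the card


def linear_warmup_decay(step, warmup_steps, total_steps, peak_lr=PEAK_LR):
    """Linear warmup to peak_lr, then linear decay to zero (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


def soft_restart_lr(global_step):
    """Learning rate when the schedule is restarted at each epoch boundary."""
    epoch, step = divmod(global_step, STEPS_PER_EPOCH)
    if epoch >= len(WARMUP_PER_EPOCH):
        raise ValueError("global_step is past the last epoch")
    return linear_warmup_decay(step, WARMUP_PER_EPOCH[epoch], STEPS_PER_EPOCH)


# The rate drops back to zero and warms up again at each epoch start
for s in (0, 1500, 4999, 5000, 5500, 10000, 10500):
    print(s, soft_restart_lr(s))
```

The sketch covers only the learning rate; the fresh optimizer state at each epoch start would correspond to constructing a new optimizer object at every restart.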
Common training parameters:
```shell
export XLA_USE_BF16=1
export XLA_TENSOR_ALLOCATOR_MAXSIZE=100000000
python nlp_headline_rus/src/train_seq2seq.py \
    --do_train \
    --tie_encoder_decoder \
    --max_source_length 512 \
    --max_target_length 32 \
    --val_max_target_length 48 \
    --tpu_num_cores 8 \
    --per_device_train_batch_size 24 \
    --gradient_accumulation_steps 1 \
    --learning_rate 5e-4 \
    --adam_epsilon 1e-6 \
    --weight_decay 1e-5
```
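The effective batch size of 192 follows directly from these flags: per-device batch size times the number of TPU cores times the gradient accumulation steps. A quick check of the arithmetic:

```python
# Values taken from the training flags above
per_device_train_batch_size = 24
tpu_num_cores = 8
gradient_accumulation_steps = 1

effective_batch_size = (
    per_device_train_batch_size * tpu_num_cores * gradient_accumulation_steps
)
print(effective_batch_size)  # 192
```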
## Validation results
- Using the [last 1% of ria](https://drive.google.com/drive/folders/1xtCnkbGNNu5jGQ9H9Mg55Cx7RTcyhQw9) dataset
- Using the [last 10% of ria](https://drive.google.com/drive/folders/1w6rAXhpFUO8I4A7xfHKUjMBPEKBHEO3h) dataset
- Using the [gazeta_ru test](https://drive.google.com/drive/folders/185ALuNVbbT_C1ZHQYn1OlOc9vRVILvHs) split
- Using the [gazeta_ru val](https://drive.google.com/drive/folders/1BLiL3H0n56e8Q9jSuDgaH_3LLpmKxuVG) split
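The ria validation subsets are taken from the tail of the dataset, while training used the first 90%, so the two do not overlap. A minimal sketch of such a positional split is shown below; `split_by_position` and `tail` are hypothetical helpers, the integer list stands in for real records, and the exact boundaries and rounding used for the actual subsets are not stated in this card.

```python
def split_by_position(records, train_fraction=0.9):
    """Split a sequence into a leading training part and a trailing held-out part."""
    cut = int(len(records) * train_fraction)
    return records[:cut], records[cut:]


def tail(records, fraction):
    """Return the last `fraction` of the records (e.g. 0.01 for the last 1%)."""
    n = int(len(records) * fraction)
    return records[len(records) - n:]


# Stand-in data: 1000 integer "records" instead of real articles
records = list(range(1000))

train, held_out = split_by_position(records, train_fraction=0.9)
last_10_percent = tail(records, 0.10)
last_1_percent = tail(records, 0.01)

print(len(train), len(last_10_percent), len(last_1_percent))
```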