---
language:
- ru
tags:
- summarization
- mbart
license: apache-2.0
---

# MBARTRuSumGazeta

## Model description

This is a ported version of the [fairseq model](https://www.dropbox.com/s/fijtntnifbt9h0k/gazeta_mbart_v2_fairseq.tar.gz).

For more details, please see [Dataset for Automatic Summarization of Russian News](https://arxiv.org/abs/2006.11063).

## Intended uses & limitations

#### How to use

```python
from transformers import MBartTokenizer, MBartForConditionalGeneration

model_name = "IlyaGusev/mbart_ru_sum_gazeta"
tokenizer = MBartTokenizer.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name)

article_text = "..."

input_ids = tokenizer.prepare_seq2seq_batch(
    [article_text],
    src_lang="en_XX",  # en_XX is a leftover of the fairseq training setup; keep it for this checkpoint
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=600
)["input_ids"][0]

output_ids = model.generate(
    input_ids=input_ids.unsqueeze(0),
    max_length=162,
    no_repeat_ngram_size=3,
    num_beams=5,
    top_k=0,
    decoder_start_token_id=tokenizer.lang_code_to_id["ru_RU"]
)[0]

summary = tokenizer.decode(output_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(summary)
```

A batch-inference variant of this snippet is sketched at the end of this card.

#### Limitations and bias

- The model should work well on Gazeta.ru articles, but on text from other outlets it may suffer from domain shift

## Training data

- Dataset: https://github.com/IlyaGusev/gazeta

## Training procedure

- Fairseq training script: https://github.com/IlyaGusev/summarus/blob/master/external/bart_scripts/train.sh
- Porting notebook: https://colab.research.google.com/drive/13jXOlCpArV-lm4jZQ0VgOpj6nFBYrLAr

## Eval results

### BibTeX entry and citation info

```bibtex
@InProceedings{10.1007/978-3-030-59082-6_9,
  author="Gusev, Ilya",
  editor="Filchenkov, Andrey and Kauttonen, Janne and Pivovarova, Lidia",
  title="Dataset for Automatic Summarization of Russian News",
  booktitle="Artificial Intelligence and Natural Language",
  year="2020",
  publisher="Springer International Publishing",
  address="Cham",
  pages="122--134",
  isbn="978-3-030-59082-6"
}
```
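### Batch inference sketch

For summarizing several articles at once, here is a minimal sketch. It assumes a `transformers` version in which the tokenizer can be called directly on a list of texts (this call replaces the deprecated `prepare_seq2seq_batch` used above). The generation parameters mirror the single-example snippet; the article strings are placeholders.

```python
from transformers import MBartTokenizer, MBartForConditionalGeneration

model_name = "IlyaGusev/mbart_ru_sum_gazeta"
tokenizer = MBartTokenizer.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name)

# Placeholder article texts.
articles = ["...", "..."]

# en_XX is a leftover of the fairseq training setup; keep it for this checkpoint.
tokenizer.src_lang = "en_XX"

# Tokenize the whole batch; padding/truncation settings match the example above.
batch = tokenizer(
    articles,
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=600,
)

output_ids = model.generate(
    input_ids=batch["input_ids"],
    attention_mask=batch["attention_mask"],
    max_length=162,
    no_repeat_ngram_size=3,
    num_beams=5,
    decoder_start_token_id=tokenizer.lang_code_to_id["ru_RU"],
)

summaries = tokenizer.batch_decode(
    output_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
for summary in summaries:
    print(summary)
```

`top_k` is omitted here because it only affects sampling, not the beam search used in these examples.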