IlyaGusev committed
Commit ae42ffb
1 parent: 02955e7
Files changed (1)
  1. README.md +84 -0
README.md ADDED
@@ -0,0 +1,84 @@
+ ---
+ language:
+ - ru
+ tags:
+ - summarization
+ - mbart
+ license: apache-2.0
+ ---
+
+ # MBARTRuSumGazeta
+
+ ## Model description
+
+ This is a ported version of a [fairseq model](https://www.dropbox.com/s/fijtntnifbt9h0k/gazeta_mbart_v2_fairseq.tar.gz).
+
+ For more details, see [Dataset for Automatic Summarization of Russian News](https://arxiv.org/abs/2006.11063).
+
+ ## Intended uses & limitations
+
+ #### How to use
+
+ ```python
+ from transformers import MBartTokenizer, MBartForConditionalGeneration
+
+ article_text = "..."  # text of the article to summarize
+ model_name = "IlyaGusev/mbart_ru_sum_gazeta"
+ tokenizer = MBartTokenizer.from_pretrained(model_name)
+ model = MBartForConditionalGeneration.from_pretrained(model_name)
+
+ input_ids = tokenizer.prepare_seq2seq_batch(
+     [article_text],
+     src_lang="en_XX",
+     return_tensors="pt",
+     padding="max_length",
+     truncation=True,
+     max_length=600  # truncate the article to 600 tokens
+ )["input_ids"][0]
+
+ output_ids = model.generate(
+     input_ids=input_ids.unsqueeze(0),
+     max_length=162,  # upper bound on the summary length, in tokens
+     no_repeat_ngram_size=3,
+     num_beams=5,
+     top_k=0,
+     decoder_start_token_id=tokenizer.lang_code_to_id["ru_RU"]
+ )[0]
+ summary = tokenizer.decode(output_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
+ print(summary)
+ ```
+
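In recent `transformers` releases, `prepare_seq2seq_batch` is deprecated in favor of calling the tokenizer directly. The sketch below wraps the same steps in a function under that assumption. The names `summarize` and `GENERATION_KWARGS` are hypothetical helpers, not part of the model repository. `convert_tokens_to_ids("ru_RU")` is used here as a stand-in for the `lang_code_to_id` lookup, and `top_k=0` is left out on the assumption that it only affects sampling and not beam search. The sketch has not been checked against the outputs of the original fairseq model.

```python
# Generation settings copied from the snippet above (top_k omitted, see the note).
GENERATION_KWARGS = {
    "max_length": 162,
    "no_repeat_ngram_size": 3,
    "num_beams": 5,
}


def summarize(article_text, model_name="IlyaGusev/mbart_ru_sum_gazeta", max_input_length=600):
    """Hypothetical helper: return a summary string for a single article."""
    # Imported lazily so this module can be loaded without transformers installed.
    from transformers import MBartTokenizer, MBartForConditionalGeneration

    tokenizer = MBartTokenizer.from_pretrained(model_name)
    model = MBartForConditionalGeneration.from_pretrained(model_name)

    # Same source language code as in the snippet above.
    tokenizer.src_lang = "en_XX"
    batch = tokenizer(
        [article_text],
        return_tensors="pt",
        padding="max_length",
        truncation=True,
        max_length=max_input_length,
    )

    output_ids = model.generate(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        decoder_start_token_id=tokenizer.convert_tokens_to_ids("ru_RU"),
        **GENERATION_KWARGS,
    )[0]
    return tokenizer.decode(output_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
```

Calling `summarize("...")` downloads the checkpoint on first use. The function reloads the tokenizer and model on every call, so when summarizing many articles it would be preferable to load them once outside the function.
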
+ #### Limitations and bias
+
+ - The model should work well on Gazeta.ru articles, but on articles from other news agencies it can suffer from domain shift.
+
+
+ ## Training data
+
+ - Dataset: https://github.com/IlyaGusev/gazeta
+
+ ## Training procedure
+
+ - Fairseq training script: https://github.com/IlyaGusev/summarus/blob/master/external/bart_scripts/train.sh
+ - Porting: https://colab.research.google.com/drive/13jXOlCpArV-lm4jZQ0VgOpj6nFBYrLAr
+
+ ## Eval results
+
+
+ ### BibTeX entry and citation info
+
+ ```bibtex
+ @InProceedings{10.1007/978-3-030-59082-6_9,
+     author="Gusev, Ilya",
+     editor="Filchenkov, Andrey
+     and Kauttonen, Janne
+     and Pivovarova, Lidia",
+     title="Dataset for Automatic Summarization of Russian News",
+     booktitle="Artificial Intelligence and Natural Language",
+     year="2020",
+     publisher="Springer International Publishing",
+     address="Cham",
+     pages="122--134",
+     isbn="978-3-030-59082-6"
+ }
+ ```