Commit 38beee7 by AgaMiko (1 parent: b09240b): Create README.md
---
license: cc-by-4.0
language:
- pl
- en
datasets:
- Curlicat
pipeline_tag: text-classification
tags:
- keywords-generation
- text-classification
- other
widget:
- text: "Decays the learning rate of each parameter group by gamma every step_size epochs. Notice that such decay can happen simultaneously with other changes to the learning rate from outside this scheduler. When last_epoch=-1, sets initial lr as lr."
  example_title: "Keywords generation (English)"
- text: "Przełomem w dziedzinie sztucznej inteligencji i maszynowego uczenia się było powstanie systemu eksperckiego Dendral na Uniwersytecie Stanforda w 1965. System ten powstał w celu zautomatyzowania analizy i identyfikacji molekuł związków organicznych, które dotychczas nie były znane chemikom. Wyniki badań otrzymane dzięki systemowi Dendral były pierwszym w historii odkryciem dokonanym przez komputer, które zostały opublikowane w prasie specjalistycznej."
  example_title: "Keywords generation (Polish)"
- text: "El Padrão real (traducible al español como Patrón real) era una obra cartográfica de origen portugués producida secretamente y mantenida por la organización de la corte real en el siglo XVI. La obra estaba disponible para la élite científica de la época, siendo expuesta en la Casa da Índia (Casa de la India). En el Padrão real se añadieron constantemente los nuevos descubrimientos de los portugueses. El primer Padrão real fue producido en la época de Enrique el Navegante, antes de la existencia de la Casa de la India."
  example_title: "Keywords generation (Spanish)"
---
# Keyword Extraction from Short Texts with T5

Our vlT5 model is a keyword-generation model built on the encoder-decoder Transformer architecture presented by Google ([https://huggingface.co/t5-base](https://huggingface.co/t5-base)). The model's input is text preceded by a prefix, where the prefix defines the type of task (e.g. "Translate from Polish to English:"), and the output is the target text. vlT5 was trained on a corpus of scientific articles to predict a given set of keyphrases from the concatenation of each article's abstract and title. It generates precise, though not always complete, keyphrases describing the content of an article from its abstract alone.

The biggest advantage of vlT5 is its transferability: it works well across all domains and types of text. The downside is that the text length and number of keywords it handles mirror the training data: a text of roughly abstract length yields approximately 3 to 5 keywords. The model works both extractively and abstractively. Longer texts must be split into smaller chunks, which are then passed to the model one by one.
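
The chunking step mentioned above can be sketched as follows. This is a minimal illustration, not part of the released model code; the 200-word window is an assumption chosen to roughly match an abstract's length.

```python
def chunk_text(text: str, max_words: int = 200) -> list[str]:
    """Split text into word-bounded chunks of at most max_words words each."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]
```

Each chunk can then be prefixed with `"Keywords: "` and passed to the model separately, and the per-chunk keyword lists merged afterwards.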

# Corpus

The model was trained on the CURLICAT corpus:


| Domains | Documents | With keywords |
| -------------------------------------------------------- | --------: | :-----------: |
| Engineering and technical sciences | 58 974 | 57 165 |
| Social sciences | 58 166 | 41 799 |
| Agricultural sciences | 29 811 | 15 492 |
| Humanities | 22 755 | 11 497 |
| Exact and natural sciences | 13 579 | 9 185 |
| Humanities, Social sciences | 12 809 | 7 063 |
| Medical and health sciences | 6 030 | 3 913 |
| Medical and health sciences, Social sciences | 828 | 571 |
| Humanities, Medical and health sciences, Social sciences | 601 | 455 |
| Engineering and technical sciences, Humanities | 312 | 312 |

# Tokenizer

As in the original HerBERT implementation, the training dataset was tokenized into subwords using character-level byte-pair encoding (CharBPETokenizer) with a vocabulary size of 50k tokens. The tokenizer itself was trained with the tokenizers library.

We kindly encourage you to use the Fast version of the tokenizer, namely HerbertTokenizerFast.

# Usage

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

vlt5 = T5ForConditionalGeneration.from_pretrained("Voicelab/t5-base-keywords")
tokenizer = T5Tokenizer.from_pretrained("Voicelab/t5-base-keywords")

task_prefix = "Keywords: "
inputs = [
    "Christina Katrakis, who spoke to the BBC from Vorokhta in western Ukraine, relays the account of one family, who say Russian soldiers shot at their vehicles while they were leaving their village near Chernobyl in northern Ukraine. She says the cars had white flags and signs saying they were carrying children.",
    "Decays the learning rate of each parameter group by gamma every step_size epochs. Notice that such decay can happen simultaneously with other changes to the learning rate from outside this scheduler. When last_epoch=-1, sets initial lr as lr.",
    "Hello, I'd like to order a pizza with salami topping.",
]

for sample in inputs:
    input_sequences = [task_prefix + sample]
    input_ids = tokenizer(input_sequences, return_tensors="pt", truncation=True).input_ids
    # Generate with the loaded model (the original snippet called an undefined `model`)
    output = vlt5.generate(input_ids, no_repeat_ngram_size=3, num_beams=4)
    predicted = tokenizer.decode(output[0], skip_special_tokens=True)
    print(sample, "\n --->", predicted)
```
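
Assuming the decoded output is a single comma-separated keyword string (a common output format for this kind of model, but an assumption here rather than something documented above), a small post-processing helper can turn it into a clean list:

```python
def parse_keywords(decoded: str) -> list[str]:
    """Split a comma-separated keyword string into a deduplicated, ordered list."""
    seen = set()
    keywords = []
    for raw in decoded.split(","):
        kw = raw.strip().lower()
        if kw and kw not in seen:  # drop empty pieces and repeats
            seen.add(kw)
            keywords.append(kw)
    return keywords
```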

# Results

| Method | Rank | Micro P | Micro R | Micro F1 | Macro P | Macro R | Macro F1 |
| ----------- | ---: | :-------: | --------: | --------: | :---: | ----: | ----: |
| extremeText | 1 | 0.175 | 0.038 | 0.063 | 0.007 | 0.004 | 0.005 |
| | 3 | 0.117 | 0.077 | 0.093 | 0.011 | 0.011 | 0.011 |
| | 5 | 0.090 | 0.099 | 0.094 | 0.013 | 0.016 | 0.015 |
| | 10 | 0.060 | 0.131 | 0.082 | 0.015 | 0.025 | 0.019 |
| plT5kw | 1 | **0.345** | 0.076 | 0.124 | 0.054 | 0.047 | 0.050 |
| | 3 | 0.328 | 0.212 | 0.257 | 0.133 | 0.127 | 0.129 |
| | 5 | 0.318 | **0.237** | **0.271** | 0.143 | 0.140 | 0.141 |
| KeyBERT | 1 | 0.030 | 0.007 | 0.011 | 0.004 | 0.003 | 0.003 |
| | 3 | 0.015 | 0.010 | 0.012 | 0.006 | 0.004 | 0.005 |
| | 5 | 0.011 | 0.012 | 0.011 | 0.006 | 0.005 | 0.005 |
| TermoPL | 1 | 0.118 | 0.026 | 0.043 | 0.004 | 0.003 | 0.003 |
| | 3 | 0.070 | 0.046 | 0.056 | 0.006 | 0.005 | 0.006 |
| | 5 | 0.051 | 0.056 | 0.053 | 0.007 | 0.007 | 0.007 |
| | all | 0.025 | 0.339 | 0.047 | 0.017 | 0.030 | 0.022 |
| extremeText | 1 | 0.210 | 0.077 | 0.112 | 0.037 | 0.017 | 0.023 |
| | 3 | 0.139 | 0.152 | 0.145 | 0.045 | 0.042 | 0.043 |
| | 5 | 0.107 | 0.196 | 0.139 | 0.049 | 0.063 | 0.055 |
| | 10 | 0.072 | 0.262 | 0.112 | 0.041 | 0.098 | 0.058 |
| plT5kw | 1 | **0.377** | 0.138 | 0.202 | 0.119 | 0.071 | 0.089 |
| | 3 | 0.361 | 0.301 | 0.328 | 0.185 | 0.147 | 0.164 |
| | 5 | 0.357 | **0.316** | **0.335** | 0.188 | 0.153 | 0.169 |
| KeyBERT | 1 | 0.018 | 0.007 | 0.010 | 0.003 | 0.001 | 0.001 |
| | 3 | 0.009 | 0.010 | 0.009 | 0.004 | 0.001 | 0.002 |
| | 5 | 0.007 | 0.012 | 0.009 | 0.004 | 0.001 | 0.002 |
| TermoPL | 1 | 0.076 | 0.028 | 0.041 | 0.002 | 0.001 | 0.001 |
| | 3 | 0.046 | 0.051 | 0.048 | 0.003 | 0.001 | 0.002 |
| | 5 | 0.033 | 0.061 | 0.043 | 0.003 | 0.001 | 0.002 |
| | all | 0.021 | 0.457 | 0.040 | 0.004 | 0.008 | 0.005 |
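
For reference, micro- and macro-averaged precision/recall/F1 over predicted vs. gold keyword sets can be computed as in this sketch. It is a generic illustration of the metrics reported above, not the paper's evaluation script; in particular, the macro average here is taken per document, whereas the paper's macro average may be computed per keyword class.

```python
def micro_macro_prf(predictions: list[set[str]], references: list[set[str]]):
    """Micro/macro precision, recall, and F1 for keyword-set predictions."""
    def prf(tp, fp, fn):
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f1

    tp = fp = fn = 0
    per_doc = []
    for pred, gold in zip(predictions, references):
        d_tp, d_fp, d_fn = len(pred & gold), len(pred - gold), len(gold - pred)
        tp, fp, fn = tp + d_tp, fp + d_fp, fn + d_fn
        per_doc.append(prf(d_tp, d_fp, d_fn))

    micro = prf(tp, fp, fn)  # pool counts over all documents
    n = len(per_doc)
    macro = tuple(sum(m[i] for m in per_doc) / n for i in range(3))  # average per-document scores
    return micro, macro
```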

# License

CC BY 4.0

# Citation

If you use this model, please cite the following paper:

Piotr Pęzik, Agnieszka Mikołajczyk-Bareła, Adam Wawrzyński, Bartłomiej Nitoń, Maciej Ogrodniczuk, "Keyword Extraction from Short Texts with a Text-To-Text Transfer Transformer", ACIIDS 2022.

# Authors

The model was trained by the NLP Research Team at Voicelab.ai.

You can contact us [here](https://voicelab.ai/contact/).