igorgavi committed on
Commit 3d341cc
1 Parent(s): cd926e4

Update README.md

Files changed (1)
  1. README.md +25 -85
README.md CHANGED
@@ -20,25 +20,18 @@ thumbnail: https://github.com/Marcosdib/S2Query/Classification_Architecture_mode

  Disclaimer:

- ## According to the abstract,
-
- Text classification is a traditional problem in Natural Language Processing (NLP). Most state-of-the-art implementations
- require high-quality, voluminous, labeled data. Pre-trained models on large corpora have proven beneficial for text classification
- and other NLP tasks, but they can only take a limited number of symbols as input. This is a real case study that explores
- different machine learning strategies to classify a small amount of long, unstructured, and uneven data to find a proper method
- with good performance. The collected data includes texts of financing opportunities that international R&D funding organizations
- provided on their websites. The main goal is to find international R&D funding eligible for Brazilian researchers, sponsored by
- the Ministry of Science, Technology and Innovation. We use pre-training and word embedding solutions to learn the relationship
- of the words from other datasets with considerable similarity and larger scale. Then, using the acquired features, based on the
- available dataset from MCTI, we apply transfer learning plus deep learning models to improve the comprehension of each sentence.
- Compared to the baseline accuracy rate of 81%, based on the available datasets, and the 85% accuracy rate achieved through a
- Transformer-based approach, the Word2Vec-based approach improved the accuracy rate to 88%. The research results serve as
- a successful case of artificial intelligence in a federal government application.
-
- This model focuses on a more specific problem, creating a Research Financing Products Portfolio (FPP) outside of the Union budget,
- supported by the Brazilian Ministry of Science, Technology, and Innovation (MCTI). It was introduced in ["Using transfer learning to classify long unstructured texts with small amounts of labeled data"](https://www.scitepress.org/Link.aspx?doi=10.5220/0011527700003318) and first released in
- [this repository](https://huggingface.co/unb-lamfo-nlp-mcti). This model is uncased: it does not make a difference between english
- and English.

  ## Model description

@@ -65,19 +58,19 @@ methods used for text summarization will be described individually in the following

  ## Methods

- | Method | Kind of ATS | Description | Documentation |
- |:----------------------:|:-----------:|:-----------:|:-------------:|
- | SumyRandom | Extractive | | []() |
- | Sumy Luhn | Extractive | | []() |
- | SumyLsa | Extractive | | []() |
- | SumyLexRank | Extractive | | []() |
- | SumyTextRank | Extractive | | []() |
- | SumySumBasic | Extractive | | []() |
- | SumyKL | Extractive | | []() |
- | SumyReduction | Extractive | | []() |
- | BART-Large CNN | Abstractive | | [facebook/bart-large-cnn](https://huggingface.co/facebook/bart-large-cnn) |
- | Pegasus-XSUM | Abstractive | | [google/pegasus-xsum](https://huggingface.co/google/pegasus-xsum) |
- | mT5 Multilingual XLSUM | Abstractive | | [csebuetnlp/mT5_multilingual_XLSum](https://huggingface.co/csebuetnlp/mT5_multilingual_XLSum)|


  ## Model variations
@@ -179,60 +172,7 @@ output = model(encoded_input)

  ### Limitations and bias

- Even if the training data used for this model could be characterized as fairly neutral, this model can have biased
- predictions:
-
- ```python
- >>> from transformers import pipeline
- >>> unmasker = pipeline('fill-mask', model='bert-base-uncased')
- >>> unmasker("The man worked as a [MASK].")
-
- [{'sequence': '[CLS] the man worked as a carpenter. [SEP]',
- 'score': 0.09747550636529922,
- 'token': 10533,
- 'token_str': 'carpenter'},
- {'sequence': '[CLS] the man worked as a waiter. [SEP]',
- 'score': 0.0523831807076931,
- 'token': 15610,
- 'token_str': 'waiter'},
- {'sequence': '[CLS] the man worked as a barber. [SEP]',
- 'score': 0.04962705448269844,
- 'token': 13362,
- 'token_str': 'barber'},
- {'sequence': '[CLS] the man worked as a mechanic. [SEP]',
- 'score': 0.03788609802722931,
- 'token': 15893,
- 'token_str': 'mechanic'},
- {'sequence': '[CLS] the man worked as a salesman. [SEP]',
- 'score': 0.037680890411138535,
- 'token': 18968,
- 'token_str': 'salesman'}]
-
- >>> unmasker("The woman worked as a [MASK].")
-
- [{'sequence': '[CLS] the woman worked as a nurse. [SEP]',
- 'score': 0.21981462836265564,
- 'token': 6821,
- 'token_str': 'nurse'},
- {'sequence': '[CLS] the woman worked as a waitress. [SEP]',
- 'score': 0.1597415804862976,
- 'token': 13877,
- 'token_str': 'waitress'},
- {'sequence': '[CLS] the woman worked as a maid. [SEP]',
- 'score': 0.1154729500412941,
- 'token': 10850,
- 'token_str': 'maid'},
- {'sequence': '[CLS] the woman worked as a prostitute. [SEP]',
- 'score': 0.037968918681144714,
- 'token': 19215,
- 'token_str': 'prostitute'},
- {'sequence': '[CLS] the woman worked as a cook. [SEP]',
- 'score': 0.03042375110089779,
- 'token': 5660,
- 'token_str': 'cook'}]
- ```

- This bias will also affect all fine-tuned versions of this model.

  ## Training data

  Disclaimer:

+ ## According to the abstract of the literature review,
+
+ We provide a literature review about Automatic Text Summarization systems. We consider a citation-based approach. We start with some popular and well-known
+ papers that we have at hand about each topic we want to cover, and we track the "backward citations" (papers that are cited by the set of papers we
+ knew beforehand) and the "forward citations" (newer papers that cite the set of papers we knew beforehand). In order to organize the different methods, we
+ present the diverse approaches to ATS guided by the mechanisms they use to generate a summary. Besides presenting the methods, we also present an extensive
+ review of the datasets available for summarization tasks and the methods used to evaluate the quality of the summaries. Finally, we present an empirical
+ exploration of these methods using the CNN Corpus dataset, which provides golden summaries for extractive and abstractive methods.
+
+ This model was an end result of the above-mentioned literature review paper, from which the best solution was drawn to be applied to the problem of
+ summarizing texts extracted from the Research Financing Products Portfolio (FPP) of the Brazilian Ministry of Science, Technology, and Innovation (MCTI).
+ It was first released in [this repository](https://huggingface.co/unb-lamfo-nlp-mcti), along with the other models used to address the given problem.

  ## Model description

  ## Methods

+ | Method | Kind of ATS | Documentation | Source Article |
+ |:----------------------:|:-----------:|:-------------:|:--------------:|
+ | SumyRandom | Extractive | [Sumy GitHub](https://github.com/miso-belica/sumy/) | None (picks out random sentences from the source text) |
+ | SumyLuhn | Extractive | Ibid. | (Luhn, 1958) |
+ | SumyLsa | Extractive | Ibid. | [(Steinberger et al., 2004)](http://www.kiv.zcu.cz/~jstein/publikace/isim2004.pdf) |
+ | SumyLexRank | Extractive | Ibid. | (Erkan and Radev, 2004) |
+ | SumyTextRank | Extractive | Ibid. | (Mihalcea and Tarau, 2004) |
+ | SumySumBasic | Extractive | Ibid. | None (often used as a baseline model in the literature) |
+ | SumyKL | Extractive | Ibid. | (Haghighi and Vanderwende, 2009) |
+ | SumyReduction | Extractive | Ibid. | None |
+ | BART-Large CNN | Abstractive | [facebook/bart-large-cnn](https://huggingface.co/facebook/bart-large-cnn) | (Lewis et al., 2019) |
+ | Pegasus-XSUM | Abstractive | [google/pegasus-xsum](https://huggingface.co/google/pegasus-xsum) | (Zhang et al., 2020) |
+ | mT5 Multilingual XLSUM | Abstractive | [csebuetnlp/mT5_multilingual_XLSum](https://huggingface.co/csebuetnlp/mT5_multilingual_XLSum) | (Xue et al., 2020) |

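All of the sumy methods in the table are extractive: they score the sentences of the source text and copy the highest-scoring ones into the summary, differing only in how the score is computed. As a rough illustration of that shared recipe — a toy sketch of Luhn-style frequency scoring, not the actual sumy implementation (the function name and the tiny stopword list here are invented for the example) — the core idea fits in a few lines of Python:

```python
import re
from collections import Counter

def luhn_style_summary(text, num_sentences=2):
    """Toy extractive summarizer: score each sentence by the corpus
    frequency of its non-stopword terms, keep the top-scoring
    sentences, and emit them in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = re.findall(r"[a-z]+", text.lower())
    # Minimal illustrative stopword list (real systems use much larger ones).
    stopwords = {"the", "a", "an", "of", "to", "and", "in", "is", "it", "for", "on"}
    freq = Counter(w for w in words if w not in stopwords)

    def score(sentence):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        return sum(freq[t] for t in tokens if t in freq)

    ranked = sorted(sentences, key=score, reverse=True)[:num_sentences]
    # Preserve document order among the selected sentences.
    return " ".join(s for s in sentences if s in ranked)
```

The real implementations replace the scoring step (graph centrality for LexRank and TextRank, SVD for LSA, KL divergence for SumyKL) but keep this select-and-copy structure; the abstractive models in the table (BART, Pegasus, mT5) instead generate new sentences that need not appear in the source.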
  ## Model variations
 
  ### Limitations and bias


  ## Training data