Update README.md
[...]

review of the datasets available for summarization tasks and the methods used to evaluate the quality of the summaries. Finally, we present an empirical exploration of these methods using the CNN Corpus dataset that provides golden summaries for extractive and abstractive methods.

This model is the end result of the above-mentioned literature review paper, from which the best solution was drawn to be applied to the problem of summarizing texts extracted from the Research Financing Products Portfolio (FPP) of the Brazilian Ministry of Science, Technology, and Innovation (MCTI). It was first released in [this repository](https://huggingface.co/unb-lamfo-nlp-mcti), along with the other models used to address the given problem.

## Model description

This Automatic Text Summarization (ATS) model was developed in Python to be applied to the Research Financing Products Portfolio (FPP) of the Brazilian Ministry of Science, Technology, and Innovation. It was produced in parallel with the writing of a Systematic Literature Review paper, which discusses many summarization methods, datasets, and evaluators, and gives a brief overview of the nature of the task itself and the state of the art of its implementation.

The input of the model can be either a single text, a data frame, or a CSV file containing multiple texts (in the English language); its output is the summarized texts and their evaluation metrics. As an optional (although recommended) input, the model accepts gold-standard summaries for the texts, i.e., human-written (or extracted) summaries that are considered good representations of their contents. Evaluators like ROUGE, which in its many variations is the most widely used for this task, require gold-standard summaries as inputs. There are, however, evaluation methods that do not depend on the existence of a golden summary (e.g., the cosine similarity method and the Kullback–Leibler divergence method), which is why an evaluation can be made even when only the text is given as input to the model.
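To make the distinction concrete, the short sketch below (illustrative only, not the model's own code; it assumes the scikit-learn and rouge-score packages and uses made-up example strings) scores a candidate summary with ROUGE against a gold-standard reference and with cosine similarity against the source text alone:

```python
# Illustrative sketch only: reference-based vs. reference-free summary evaluation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from rouge_score import rouge_scorer

source = "Automatic text summarization condenses a long document into a shorter text ..."
candidate = "Summarization condenses a long document into a shorter text."
gold = "Automatic summarization shortens a document into a brief summary."

# Reference-free: cosine similarity between TF-IDF vectors of the source and the summary.
tfidf = TfidfVectorizer(stop_words="english").fit_transform([source, candidate])
print("cosine similarity:", float(cosine_similarity(tfidf[0], tfidf[1])[0, 0]))

# Reference-based: ROUGE requires the gold-standard summary as input.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
print("ROUGE:", scorer.score(gold, candidate))
```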

The text output is produced by a chosen method of ATS, which can be extractive (built from the most relevant sentences of the source document) or abstractive (written from scratch in an abstractive manner). The latter is achieved by means of transformers; the ones present in the model are the already existing and widely applied BART-Large CNN, Pegasus-XSUM, and mT5 Multilingual XLSUM. The extractive methods are taken from the Sumy Python library and include SumyRandom, SumyLuhn, SumyLsa, SumyLexRank, SumyTextRank, SumySumBasic, SumyKL, and SumyReduction. Each of the methods used for text summarization is described individually in the following sections.

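As a minimal sketch of how these two families of methods can be invoked from Python (illustrative only, not the repository's wrapper code; it assumes the transformers and sumy packages are installed and that NLTK tokenizer data is available):

```python
# Illustrative sketch: one abstractive and one extractive summary of the same text.
from transformers import pipeline                     # abstractive (BART-Large CNN)
from sumy.parsers.plaintext import PlaintextParser    # extractive (Sumy)
from sumy.nlp.tokenizers import Tokenizer             # relies on NLTK "punkt" data
from sumy.summarizers.lex_rank import LexRankSummarizer

text = "A long English document to be summarized ..."

# Abstractive: Hugging Face pipeline with the BART-Large CNN checkpoint.
abstractive = pipeline("summarization", model="facebook/bart-large-cnn")
print(abstractive(text, max_length=60, min_length=10)[0]["summary_text"])

# Extractive: LexRank from Sumy, keeping the two most relevant sentences.
parser = PlaintextParser.from_string(text, Tokenizer("english"))
for sentence in LexRankSummarizer()(parser.document, sentences_count=2):
    print(sentence)
```
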
## Methods

[...]

```python
import itertools as it
import more_itertools as mit
```

If any of the above-mentioned libraries are not installed on the user's machine, they will need to be installed through the command line (CMD) with the command:

```bash
pip install [LIBRARY]
```

To run the code on a given corpus of data, the following lines of code need to be inserted. If one or multiple corpora, summarizers, or evaluators are not to be applied, the user has to comment out the unwanted options.

```python
if __name__ == "__main__":
    ...
```
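Purely as a hypothetical illustration of that pattern (none of the identifiers below come from the repository, whose actual main block differs), the selection could look like this, with unwanted entries commented out:

```python
# Hypothetical illustration only: the identifiers below are made up and do not match
# the repository's real code. It shows the pattern described above, where unwanted
# corpora, summarizers, or evaluators are simply commented out before running.
if __name__ == "__main__":
    corpora = [
        "cnn_corpus",
        # "big_patent",       # commented out: skipped in this run
        "xsum",
    ]
    summarizers = [
        "SumyLexRank",
        "bart_large_cnn",
        # "pegasus_xsum",     # commented out: skipped in this run
    ]
    evaluators = ["rouge", "cosine_similarity"]

    for corpus in corpora:
        for summarizer in summarizers:
            print(f"Summarizing {corpus} with {summarizer}, evaluating with {evaluators}")
```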

[...]

### Preprocessing

The preprocessing of the given texts is done by the clean_text method in the Data class. It removes from the text (or normalizes) elements that would make it difficult to summarize, such as line breaks, double quotation marks, and apostrophes.

```python
class Data:
    ...
```
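As a rough sketch of the kind of normalization just described (not the repository's actual clean_text implementation; the exact rules and names may differ):

```python
import re

def clean_text(text: str) -> str:
    """Illustrative normalization: drop line breaks, double quotes, and apostrophes."""
    text = re.sub(r"[\r\n]+", " ", text)               # replace jump lines with spaces
    text = text.replace('"', "").replace("'", "")      # remove double quotes and apostrophes
    text = re.sub(r"\s{2,}", " ", text)                # collapse repeated whitespace
    return text.strip()

print(clean_text('First line.\n"Second" line, with John\'s quote.'))
# -> First line. Second line, with Johns quote.
```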

[...]

- **Scientific Papers (arXiv + PubMed)**: [Cohan et al. (2018)](https://arxiv.org/pdf/1804.05685) found that there were only datasets with short texts (600 words on average) or datasets with longer texts accompanied only by extractive human summaries. To fill this gap and provide a dataset of long documents for abstractive summarization, the authors compiled two new datasets of scientific papers from the arXiv and PubMed databases. Scientific papers are especially convenient for the kind of ATS the authors aim to achieve, due to their considerable length and the fact that each one contains an abstractive summary written by its author – i.e., the paper’s abstract.

[...]

Patents Public Datasets, where for each document there is one gold-standard summary, which is the patent’s original abstract. One advantage of this dataset is that it does not present the difficulties inherent to news summarization datasets, where summaries have a flattened discourse structure and the summary content arises at the beginning of the document.
- **CNN Corpus**: [Lins et al. (2019)](https://par.nsf.gov/servlets/purl/10185297) introduced the corpus to fill the gap left by the fact that most single-document news summarization datasets have fewer than 1,000 documents. The CNN-Corpus dataset thus contains 3,000 single documents with two gold-standard summaries each: one

[...]

around 400k news articles from the newspapers CNN and Daily Mail and evaluated what they considered to be the key aspect of understanding a text, namely answering somewhat complex questions about it. Even though ATS is not the authors' main focus, they took inspiration from it to develop their model and to include in their dataset the human-made summaries for each news article.
- **XSum**: [Narayan et al. (2018)](https://arxiv.org/pdf/1808.08745) introduced the single-document dataset, which focuses on a kind of summarization described by the authors as extreme summarization – an abstractive

[...]

in comparison with the gold-standard summaries. The following tables present the results of the many evaluation methods applied to the various datasets:

### Table 1: Scientific Papers (arXiv + PubMed) results
<img src="https://huggingface.co/unb-lamfo-nlp-mcti/NLP-ATS-MCTI/resolve/main/results_arxiv_pubmed.png" width="800">

### Table 2: BIGPATENT results
<img src="https://huggingface.co/unb-lamfo-nlp-mcti/NLP-ATS-MCTI/resolve/main/results_bigpatent.png" width="800">

### Table 3: CNN Corpus results
<img src="https://huggingface.co/unb-lamfo-nlp-mcti/NLP-ATS-MCTI/resolve/main/results_cnn_corpus.png" width="800">

### Table 4: CNN + Daily Mail results
<img src="https://huggingface.co/unb-lamfo-nlp-mcti/NLP-ATS-MCTI/resolve/main/results_cnn_dailymail.png" width="800">

### Table 5: XSum results
<img src="https://huggingface.co/unb-lamfo-nlp-mcti/NLP-ATS-MCTI/resolve/main/results_xsum.png" width="800">

## BibTeX entry and citation info

```bibtex
@conference{webist22,