r2nery committed
Commit 9a9b5c1 · 1 Parent(s): de6582f

Update README.md

Files changed (1): README.md (+25 −25)
review of the datasets available for summarization tasks and the methods used to evaluate the quality of the summaries. Finally, we present an empirical
exploration of these methods using the CNN Corpus dataset that provides golden summaries for extractive and abstractive methods.

This model was an end result of the above-mentioned literature review paper, from which the best solution was drawn to be applied to the problem of
summarizing texts extracted from the Research Financing Products Portfolio (FPP) of the Brazilian Ministry of Science, Technology, and Innovation (MCTI).
It was first released in [this repository](https://huggingface.co/unb-lamfo-nlp-mcti), along with the other models used to address the given problem.

## Model description

This Automatic Text Summarization (ATS) model was developed in Python to be applied to the Research Financing Products
Portfolio (FPP) of the Brazilian Ministry of Science, Technology, and Innovation. It was produced in parallel with the writing of a
Systematic Literature Review paper, in which there is a discussion concerning many summarization methods, datasets, and evaluators,
as well as a brief overview of the nature of the task itself and the state of the art of its implementation.

The input of the model can be a single text, a data frame, or a CSV file containing multiple texts (in English), and its output
is the summarized texts and their evaluation metrics. As an optional (although recommended) input, the model accepts gold-standard summaries
for the texts, i.e., human-written (or extracted) summaries of the texts which are considered to be good representations of their contents.
Evaluators like ROUGE, which in its many variations is the most widely used for the task, require gold-standard summaries as inputs. There are,
however, evaluation methods that do not depend on the existence of a gold-standard summary (e.g. the cosine similarity method and the Kullback-Leibler
divergence method), which is why an evaluation can be made even when only the text is given as input to the model.
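The difference between the two families of evaluators can be illustrated with a toy sketch (hypothetical helper names, not the model's own code): a ROUGE-1 recall scorer needs a gold-standard summary, while a cosine-similarity scorer only compares the summary against its source text.

```python
from collections import Counter
from math import sqrt


def rouge1_recall(summary: str, gold: str) -> float:
    """Gold-dependent metric: fraction of the gold summary's
    unigrams that also appear in the produced summary (ROUGE-1 recall)."""
    summary_counts = Counter(summary.lower().split())
    gold_counts = Counter(gold.lower().split())
    overlap = sum((summary_counts & gold_counts).values())
    return overlap / max(sum(gold_counts.values()), 1)


def cosine_similarity(summary: str, text: str) -> float:
    """Reference-free metric: cosine between the bag-of-words vectors
    of the summary and of the source text itself."""
    a, b = Counter(summary.lower().split()), Counter(text.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

In practice the model relies on established implementations of these metrics; the functions above only illustrate which inputs each family of evaluators requires.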
 
The text output is produced by a chosen ATS method, which can be extractive (built from the most relevant sentences of the source document)
or abstractive (written from scratch in an abstractive manner). The latter is achieved by means of transformers; the ones present in the
model are the already existing and widely applied BART-Large CNN, Pegasus-XSUM, and mT5 Multilingual XLSUM. The extractive methods are taken from
the Sumy Python library and include SumyRandom, SumyLuhn, SumyLsa, SumyLexRank, SumyTextRank, SumySumBasic, SumyKL, and SumyReduction. Each of the
methods used for text summarization will be described individually in the following sections.
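The extractive idea that these Sumy methods share — score each sentence by the words it contains and keep the top-ranked ones — can be sketched in plain Python. This is a simplified frequency scorer in the spirit of Luhn's method, not Sumy's actual implementation:

```python
from collections import Counter


def extractive_summary(text: str, n_sentences: int = 2) -> str:
    """Rank sentences by the total corpus frequency of their words and
    return the top-scoring ones, restored to their original order."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    freqs = Counter(w.lower() for s in sentences for w in s.split())
    # Indices of sentences, best score first (stable sort keeps ties deterministic).
    scored = sorted(
        range(len(sentences)),
        key=lambda i: sum(freqs[w.lower()] for w in sentences[i].split()),
        reverse=True,
    )
    keep = sorted(scored[:n_sentences])  # restore document order
    return ". ".join(sentences[i] for i in keep) + "."
```

Sumy's summarizers refine this idea with stop-word removal, stemming, and graph- or algebra-based sentence scoring, but the select-and-reorder skeleton is the same.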
 
## Methods
 
import itertools as it
import more_itertools as mit

```
If any of the above-mentioned libraries is not installed on the user's machine, it can be installed from the command line with:

```
pip install [LIBRARY]
```
 
To run the code with the given corpora, the following lines of code need to be inserted. If one or more
corpora, summarizers, or evaluators are not to be applied, the user has to comment out the unwanted options.

```python
if __name__ == "__main__":

### Preprocessing

The preprocessing of the given texts is done by the clean_text method present in the Data class. It removes from the text (or normalizes)
elements that would make it difficult to summarize, such as line breaks, double quotation marks, and apostrophes.

```python
class Data:
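The repository's Data class is truncated in this excerpt. As a rough sketch of the kind of normalization clean_text performs (hypothetical implementation, assuming only the behaviors described above):

```python
import re


class Data:
    """Minimal stand-in holding one source text (hypothetical sketch)."""

    def __init__(self, text: str):
        self.text = text

    def clean_text(self) -> str:
        """Normalize elements that hinder summarization:
        line breaks, double quotation marks, and apostrophes."""
        cleaned = self.text.replace("\n", " ")  # drop line breaks
        cleaned = cleaned.replace('"', "")      # drop double quotations
        cleaned = cleaned.replace("'", "")      # drop apostrophes
        return re.sub(r"\s+", " ", cleaned).strip()  # collapse whitespace
```

The actual method in the repository may normalize further characters; this sketch only covers the elements named in the description.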

- **Scientific Papers (arXiv + PubMed)**: [Cohan et al. (2018)](https://arxiv.org/pdf/1804.05685) found that there were only
datasets with short texts (with an average of 600 words) or datasets with longer texts with
extractive human summaries. In order to fill the gap and provide a dataset with long-text
documents for abstractive summarization, the authors compiled two new datasets with scientific
papers from the arXiv and PubMed databases. Scientific papers are especially convenient given the
desired kind of ATS the authors mean to achieve, and that is due to their large length and
the fact that each one contains an abstractive summary made by its author – i.e., the paper’s
abstract.
Patents Public Datasets, where for each document there is one gold-standard summary, which
is the patent’s original abstract. One advantage of this dataset is that it does not present
difficulties inherent to news summarization datasets, where summaries have a flattened discourse
structure and the summary content arises at the beginning of the document.
- **CNN Corpus**: [Lins et al. (2019)](https://par.nsf.gov/servlets/purl/10185297) introduced the corpus to fill the gap left by most news
summarization single-document datasets, which have fewer than 1,000 documents. The CNN-Corpus
dataset thus contains 3,000 single documents with two gold-standard summaries each: one
around 400k news articles from the newspapers CNN and Daily Mail and evaluated what they considered
to be the key aspect of understanding a text, namely the answering of somewhat complex
questions about it. Even though ATS is not the main focus of the authors, they took inspiration
from it to develop their model and included in their dataset the human-made summaries for each
news article.
- **XSum**: [Narayan et al. (2018)](https://arxiv.org/pdf/1808.08745) introduced the single-document dataset, which focuses on a
kind of summarization described by the authors as extreme summarization – an abstractive
in comparison with the gold-standard summaries. The following tables provide the results of the many evaluation methods applied
to the datasets:

### Table 1: Scientific Papers (arXiv + PubMed) results
<img src="https://huggingface.co/unb-lamfo-nlp-mcti/NLP-ATS-MCTI/resolve/main/results_arxiv_pubmed.png" width="800">

### Table 2: BIGPATENT results
<img src="https://huggingface.co/unb-lamfo-nlp-mcti/NLP-ATS-MCTI/resolve/main/results_bigpatent.png" width="800">

### Table 3: CNN Corpus results
<img src="https://huggingface.co/unb-lamfo-nlp-mcti/NLP-ATS-MCTI/resolve/main/results_cnn_corpus.png" width="800">

### Table 4: CNN + Daily Mail results
<img src="https://huggingface.co/unb-lamfo-nlp-mcti/NLP-ATS-MCTI/resolve/main/results_cnn_dailymail.png" width="800">

### Table 5: XSum results
<img src="https://huggingface.co/unb-lamfo-nlp-mcti/NLP-ATS-MCTI/resolve/main/results_xsum.png" width="800">

## BibTeX entry and citation info

```bibtex
@conference{webist22,