Update README.md
[...]

review of the datasets available for summarization tasks and the methods used to evaluate the quality of the summaries. Finally, we present an empirical exploration of these methods using the CNN Corpus dataset that provides golden summaries for extractive and abstractive methods.

This model is the end result of the above-mentioned literature review paper, from which the best solution was drawn to be applied to the problem of summarizing texts extracted from the Research Financing Products Portfolio (FPP) of the Brazilian Ministry of Science, Technology, and Innovation (MCTI). It was first released in [this repository](https://huggingface.co/unb-lamfo-nlp-mcti), along with the other models used to address the given problem.

## Model description

This Automatic Text Summarization (ATS) model was developed in Python to be applied to the Research Financing Products Portfolio (FPP) of the Brazilian Ministry of Science, Technology, and Innovation. It was produced in parallel with the writing of a Systematic Literature Review paper, which discusses many summarization methods, datasets, and evaluators, and gives a brief overview of the nature of the task itself and the state of the art of its implementation.

The input of the model can be either a single text, a data frame, or a CSV file containing multiple texts (in the English language); its output is the summarized texts and their evaluation metrics. As an optional (although recommended) input, the model accepts gold-standard summaries for the texts, i.e., human-written (or extracted) summaries that are considered good representations of their contents. Evaluators like ROUGE, which in its many variations is the most widely used for this task, require gold-standard summaries as inputs. There are, however, evaluation methods that do not depend on the existence of a golden summary (e.g., the cosine similarity method and the Kullback–Leibler divergence method), which is why an evaluation can be made even when only the text is given as input to the model.
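To make the distinction concrete, the short sketch below (illustrative only, not the model's own code; it assumes the scikit-learn and rouge-score packages and uses made-up example strings) scores a candidate summary with ROUGE against a gold-standard reference and with cosine similarity against the source text alone:

```python
# Illustrative sketch only: reference-based vs. reference-free summary evaluation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from rouge_score import rouge_scorer

source = "Automatic text summarization condenses a long document into a shorter text ..."
candidate = "Summarization condenses a long document into a shorter text."
gold = "Automatic summarization shortens a document into a brief summary."

# Reference-free: cosine similarity between TF-IDF vectors of the source and the summary.
tfidf = TfidfVectorizer(stop_words="english").fit_transform([source, candidate])
print("cosine similarity:", float(cosine_similarity(tfidf[0], tfidf[1])[0, 0]))

# Reference-based: ROUGE requires the gold-standard summary as input.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
print("ROUGE:", scorer.score(gold, candidate))
```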

The text output is produced by a chosen method of ATS, which can be extractive (built from the most relevant sentences of the source document) or abstractive (written from scratch in an abstractive manner). The latter is achieved by means of transformers; the ones present in the model are the already existing and widely applied BART-Large CNN, Pegasus-XSUM, and mT5 Multilingual XLSUM. The extractive methods are taken from the Sumy Python library and include SumyRandom, SumyLuhn, SumyLsa, SumyLexRank, SumyTextRank, SumySumBasic, SumyKL, and SumyReduction. Each of the methods used for text summarization is described individually in the following sections.

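As a minimal sketch of how these two families of methods can be invoked from Python (illustrative only, not the repository's wrapper code; it assumes the transformers and sumy packages are installed and that NLTK tokenizer data is available):

```python
# Illustrative sketch: one abstractive and one extractive summary of the same text.
from transformers import pipeline                     # abstractive (BART-Large CNN)
from sumy.parsers.plaintext import PlaintextParser    # extractive (Sumy)
from sumy.nlp.tokenizers import Tokenizer             # relies on NLTK "punkt" data
from sumy.summarizers.lex_rank import LexRankSummarizer

text = "A long English document to be summarized ..."

# Abstractive: Hugging Face pipeline with the BART-Large CNN checkpoint.
abstractive = pipeline("summarization", model="facebook/bart-large-cnn")
print(abstractive(text, max_length=60, min_length=10)[0]["summary_text"])

# Extractive: LexRank from Sumy, keeping the two most relevant sentences.
parser = PlaintextParser.from_string(text, Tokenizer("english"))
for sentence in LexRankSummarizer()(parser.document, sentences_count=2):
    print(sentence)
```
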
## Methods

[...]

```python
import itertools as it
import more_itertools as mit
```

If any of the above-mentioned libraries are not installed on the user's machine, they will need to be installed through the command line (CMD) with the command:

```bash
pip install [LIBRARY]
```

To run the code on a given corpus of data, the following lines of code need to be inserted. If one or multiple corpora, summarizers, or evaluators are not to be applied, the user has to comment out the unwanted options.

```python
if __name__ == "__main__":
    ...
```
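Purely as a hypothetical illustration of that pattern (none of the identifiers below come from the repository, whose actual main block differs), the selection could look like this, with unwanted entries commented out:

```python
# Hypothetical illustration only: the identifiers below are made up and do not match
# the repository's real code. It shows the pattern described above, where unwanted
# corpora, summarizers, or evaluators are simply commented out before running.
if __name__ == "__main__":
    corpora = [
        "cnn_corpus",
        # "big_patent",       # commented out: skipped in this run
        "xsum",
    ]
    summarizers = [
        "SumyLexRank",
        "bart_large_cnn",
        # "pegasus_xsum",     # commented out: skipped in this run
    ]
    evaluators = ["rouge", "cosine_similarity"]

    for corpus in corpora:
        for summarizer in summarizers:
            print(f"Summarizing {corpus} with {summarizer}, evaluating with {evaluators}")
```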

[...]

### Preprocessing

The preprocessing of the given texts is done by the clean_text method in the Data class. It removes from the text (or normalizes) elements that would make it difficult to summarize, such as line breaks, double quotation marks, and apostrophes.

```python
class Data:
    ...
```
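As a rough sketch of the kind of normalization just described (not the repository's actual clean_text implementation; the exact rules and names may differ):

```python
import re

def clean_text(text: str) -> str:
    """Illustrative normalization: drop line breaks, double quotes, and apostrophes."""
    text = re.sub(r"[\r\n]+", " ", text)               # replace jump lines with spaces
    text = text.replace('"', "").replace("'", "")      # remove double quotes and apostrophes
    text = re.sub(r"\s{2,}", " ", text)                # collapse repeated whitespace
    return text.strip()

print(clean_text('First line.\n"Second" line, with John\'s quote.'))
# -> First line. Second line, with Johns quote.
```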

[...]

- **Scientific Papers (arXiv + PubMed)**: [Cohan et al. (2018)](https://arxiv.org/pdf/1804.05685) found that there were only datasets with short texts (600 words on average) or datasets with longer texts accompanied only by extractive human summaries. To fill this gap and provide a dataset of long documents for abstractive summarization, the authors compiled two new datasets of scientific papers from the arXiv and PubMed databases. Scientific papers are especially convenient for the kind of ATS the authors aim to achieve, due to their considerable length and the fact that each one contains an abstractive summary written by its author – i.e., the paper’s abstract.

[...]

Patents Public Datasets, where for each document there is one gold-standard summary, which is the patent’s original abstract. One advantage of this dataset is that it does not present the difficulties inherent to news summarization datasets, where summaries have a flattened discourse structure and the summary content arises at the beginning of the document.
- **CNN Corpus**: [Lins et al. (2019)](https://par.nsf.gov/servlets/purl/10185297) introduced the corpus to fill the gap left by the fact that most single-document news summarization datasets have fewer than 1,000 documents. The CNN-Corpus dataset thus contains 3,000 single documents with two gold-standard summaries each: one

[...]

around 400k news articles from the newspapers CNN and Daily Mail and evaluated what they considered to be the key aspect of understanding a text, namely answering somewhat complex questions about it. Even though ATS is not the authors' main focus, they took inspiration from it to develop their model and to include in their dataset the human-made summaries for each news article.
- **XSum**: [Narayan et al. (2018)](https://arxiv.org/pdf/1808.08745) introduced the single-document dataset, which focuses on a kind of summarization described by the authors as extreme summarization – an abstractive

[...]

in comparison with the gold-standard summaries. The following tables present the results of the many evaluation methods applied to the various datasets:

### Table 1: Scientific Papers (arXiv + PubMed) results
<img src="https://huggingface.co/unb-lamfo-nlp-mcti/NLP-ATS-MCTI/resolve/main/results_arxiv_pubmed.png" width="800">

### Table 2: BIGPATENT results
<img src="https://huggingface.co/unb-lamfo-nlp-mcti/NLP-ATS-MCTI/resolve/main/results_bigpatent.png" width="800">

### Table 3: CNN Corpus results
<img src="https://huggingface.co/unb-lamfo-nlp-mcti/NLP-ATS-MCTI/resolve/main/results_cnn_corpus.png" width="800">

### Table 4: CNN + Daily Mail results
<img src="https://huggingface.co/unb-lamfo-nlp-mcti/NLP-ATS-MCTI/resolve/main/results_cnn_dailymail.png" width="800">

### Table 5: XSum results
<img src="https://huggingface.co/unb-lamfo-nlp-mcti/NLP-ATS-MCTI/resolve/main/results_xsum.png" width="800">

## BibTeX entry and citation info

```bibtex
@conference{webist22,