---
language: en
tags:
- Summarization
license: apache-2.0
datasets:
- scientific_papers
- big_patent
- cnn_corpus
- cnn_dailymail
- xsum
- MCTI_data
thumbnail: https://github.com/Marcosdib/S2Query/Classification_Architecture_model.png
---

![MCTIimg](https://antigo.mctic.gov.br/mctic/export/sites/institucional/institucional/entidadesVinculadas/conselhos/pag-old/RODAPE_MCTI.png)

# MCTI Automatic Text Summarization Task (uncased) DRAFT

Disclaimer:

## According to the abstract of the literature review

We provide a literature review about Automatic Text Summarization (ATS) systems. We consider a citation-based approach. We start with some popular and well-known papers that we have in hand about each topic we want to cover, and we track the "backward citations" (papers that are cited by the set of papers we knew beforehand) and the "forward citations" (newer papers that cite the set of papers we knew beforehand). In order to organize the different methods, we present the diverse approaches to ATS guided by the mechanisms they use to generate a summary. Besides presenting the methods, we also present an extensive review of the datasets available for summarization tasks and the methods used to evaluate the quality of the summaries. Finally, we present an empirical exploration of these methods using the CNN Corpus dataset, which provides golden summaries for extractive and abstractive methods.

This model is an end result of the above-mentioned literature review paper, from which the best solution was drawn to be applied to the problem of summarizing texts extracted from the Research Financing Products Portfolio (FPP) of the Brazilian Ministry of Science, Technology, and Innovation (MCTI). It was first released in [this repository](https://huggingface.co/unb-lamfo-nlp-mcti), along with the other models used to address the given problem.

## Model description

This Automatic Text Summarization (ATS) model was developed in Python to be applied to the Research Financing Products Portfolio (FPP) of the Brazilian Ministry of Science, Technology, and Innovation (MCTI). It was produced in parallel with the writing of a Systematic Literature Review paper, which discusses many summarization methods, datasets, and evaluators, and gives a brief overview of the nature of the task itself and the state of the art of its implementation.

The input of the model can be a single text, a dataframe, or a CSV file containing multiple texts (in English); its outputs are the summarized texts and their evaluation metrics. As an optional (although recommended) input, the model accepts gold-standard summaries for the texts, i.e., human-written (or human-extracted) summaries that are considered good representations of their contents. Evaluators like ROUGE, which in its many variations is the most widely used for this task, require gold-standard summaries as inputs. There are, however, evaluation methods that do not depend on the existence of a gold-standard summary (e.g., the cosine similarity method and the Kullback-Leibler divergence method), which is why an evaluation can be produced even when only the text itself is given as input.

The output text is produced by a chosen ATS method, which can be extractive (built from the most relevant sentences of the source document) or abstractive (written from scratch). The latter is achieved by means of transformers; the ones present in the model are the already existing and widely applied BART-Large CNN, Pegasus-XSUM, and mT5 Multilingual XLSUM. The extractive methods are taken from the Sumy Python library and include SumyRandom, SumyLuhn, SumyLsa, SumyLexRank, SumyTextRank, SumySumBasic, SumyKL, and SumyReduction. Each of the methods used for text summarization is described individually in the following sections.
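As a quick illustration of the extractive route, the sketch below runs one of the listed Sumy methods (LexRank) over a plain-text document. It follows Sumy's public API and is only a hedged usage sketch, not this repository's own wrapper code; the three-sentence summary length is an arbitrary choice.

```python
# Extractive summarization sketch with the Sumy library (LexRank),
# following Sumy's documented API; not this repository's wrapper code.
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

text = "Some long English document to be summarized..."

# Parse the raw string into Sumy's internal document representation.
parser = PlaintextParser.from_string(text, Tokenizer("english"))

# LexRank ranks sentences by graph centrality and keeps the top ones.
summarizer = LexRankSummarizer()
for sentence in summarizer(parser.document, 3):  # 3 sentences (arbitrary)
    print(sentence)
```

Any of the other Sumy classes listed in the table below (Luhn, LSA, TextRank, SumBasic, KL, Reduction) can be swapped in for `LexRankSummarizer` with the same calling convention.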
![architecture](https://github.com/marcosdib/S2Query/Classification_Architecture_model.png)

## Methods

Since there are many methods to choose from when performing the ATS task with this model, the following table presents useful information about each of them: the kind of ATS the method produces (extractive or abstractive), where to find the documentation necessary for its implementation, and the article from which it originated.

| Method                 | Kind of ATS | Documentation | Source Article |
|:----------------------:|:-----------:|:-------------:|:--------------:|
| SumyRandom             | Extractive  | [Sumy GitHub](https://github.com/miso-belica/sumy/) | None (picks out random sentences from the source text) |
| SumyLuhn               | Extractive  | Ibid. | [(Luhn, 1958)](http://www.di.ubi.pt/~jpaulo/competence/general/%281958%29Luhn.pdf) |
| SumyLsa                | Extractive  | Ibid. | [(Steinberger et al., 2004)](http://www.kiv.zcu.cz/~jstein/publikace/isim2004.pdf) |
| SumyLexRank            | Extractive  | Ibid. | [(Erkan and Radev, 2004)](http://tangra.si.umich.edu/~radev/lexrank/lexrank.pdf) |
| SumyTextRank           | Extractive  | Ibid. | [(Mihalcea and Tarau, 2004)](https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf) |
| SumySumBasic           | Extractive  | Ibid. | [(Vanderwende et al., 2007)](http://www.cis.upenn.edu/~nenkova/papers/ipm.pdf) |
| SumyKL                 | Extractive  | Ibid. | [(Haghighi and Vanderwende, 2009)](http://www.aclweb.org/anthology/N09-1041) |
| SumyReduction          | Extractive  | Ibid. | None. |
| BART-Large CNN         | Abstractive | [facebook/bart-large-cnn](https://huggingface.co/facebook/bart-large-cnn) | [(Lewis et al., 2019)](https://arxiv.org/pdf/1910.13461) |
| Pegasus-XSUM           | Abstractive | [google/pegasus-xsum](https://huggingface.co/google/pegasus-xsum) | [(Zhang et al., 2020)](http://proceedings.mlr.press/v119/zhang20ae/zhang20ae.pdf) |
| mT5 Multilingual XLSUM | Abstractive | [csebuetnlp/mT5_multilingual_XLSum](https://huggingface.co/csebuetnlp/mT5_multilingual_XLSum) | [(Raffel et al., 2019)](https://www.jmlr.org/papers/volume21/20-074/20-074.pdf) |

## Intended uses & limitations

This model is intended to produce extractive or abstractive summaries of English texts, in particular texts from the MCTI Research Financing Products Portfolio, and to score those summaries with the evaluation metrics described above. Note that the abstractive methods wrap already existing, general-purpose checkpoints (BART-Large CNN, Pegasus-XSUM, and mT5 Multilingual XLSUM) rather than checkpoints fine-tuned on the FPP data, so summary quality on domain-specific texts is bounded by those upstream models, while the extractive methods can only select sentences that already occur in the source document. ROUGE-based evaluation additionally requires gold-standard summaries; without them, only reference-free metrics (e.g., cosine similarity, Kullback-Leibler divergence) can be reported.

### How to use

The abstractive methods can be exercised directly through the `transformers` summarization pipeline with any of the three checkpoints listed in the table above. The snippet below is a usage sketch against the public `facebook/bart-large-cnn` checkpoint rather than this repository's own wrapper; the length parameters are illustrative and the output is elided:

```python
>>> from transformers import pipeline
>>> summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
>>> summarizer("An English text long enough to be worth summarizing...",
...            max_length=130, min_length=30, do_sample=False)
[{'summary_text': '...'}]
```
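When gold-standard summaries are available, ROUGE (described above as the most widely used evaluator for this task) can be computed with the off-the-shelf `rouge-score` package. This is an illustrative sketch of the metric itself, not this repository's own evaluation code:

```python
# Illustrative ROUGE computation with the rouge-score package
# (pip install rouge-score); not this repository's evaluation code.
from rouge_score import rouge_scorer

gold_summary = "A human-written reference summary."
generated = "A machine-generated candidate summary."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(gold_summary, generated)  # reference first, candidate second
print(scores["rougeL"].fmeasure)
```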
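For the reference-free route mentioned in the model description (e.g., cosine similarity), one common recipe is to compare the summary against the source document itself in TF-IDF space, so no gold-standard summary is required. The sketch below uses scikit-learn and is one plausible implementation, not necessarily the exact method coded in this model:

```python
# Reference-free evaluation sketch: cosine similarity between the source
# document and its summary in TF-IDF space (an assumed implementation;
# not necessarily the one used by this model).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

source_text = "Some long English document that was summarized..."
generated = "A machine-generated candidate summary."

tfidf = TfidfVectorizer().fit_transform([source_text, generated])
print(float(cosine_similarity(tfidf[0], tfidf[1])[0, 0]))  # closer to 1 = more similar
```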
[SEP]", 'score': 0.08774490654468536, 'token': 2535, 'token_str': 'role'}, {'sequence': "[CLS] hello i'm a new model. [SEP]", 'score': 0.05338378623127937, 'token': 2047, 'token_str': 'new'}, {'sequence': "[CLS] hello i'm a super model. [SEP]", 'score': 0.04667217284440994, 'token': 3565, 'token_str': 'super'}, {'sequence': "[CLS] hello i'm a fine model. [SEP]", 'score': 0.027095865458250046, 'token': 2986, 'token_str': 'fine'}] ``` Here is how to use this model to get the features of a given text in PyTorch: ```python from transformers import BertTokenizer, BertModel tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') model = BertModel.from_pretrained("bert-base-uncased") text = "Replace me by any text you'd like." encoded_input = tokenizer(text, return_tensors='pt') output = model(**encoded_input) ``` and in TensorFlow: ```python from transformers import BertTokenizer, TFBertModel tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') model = TFBertModel.from_pretrained("bert-base-uncased") text = "Replace me by any text you'd like." encoded_input = tokenizer(text, return_tensors='tf') output = model(encoded_input) ``` ## Training data The BERT model was pretrained on [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of 11,038 unpublished books and [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia) (excluding lists, tables and headers). ## Training procedure ### Preprocessing ## Evaluation results Table 2: Results from Pre-trained Longformer + ML models. | ML Model | Accuracy | F1 Score | Precision | Recall | |:--------:|:---------:|:---------:|:---------:|:---------:| | NN | 0.8269 | 0.8754 |0.7950 | 0.9773 | | DNN | 0.8462 | 0.8776 |0.8474 | 0.9123 | | CNN | 0.8462 | 0.8776 |0.8474 | 0.9123 | | LSTM | 0.8269 | 0.8801 |0.8571 | 0.9091 | ## Checkpoints - Examples - Implementation Notes - Usage Example - >>> - >>> ... ### BibTeX entry and citation info ```bibtex @conference{webist22, author ={Daniel O. Cajueiro and MaĆ­sa {Kely de Melo}. and Arthur G. Nery and Silvia A. dos Reis and Igor Tavares and Li Weigang and Victor R. R. Celestino.}, title ={A comprehensive review of automatic text summarization techniques: method, data, evaluation and coding}, booktitle ={Proceedings of the 18th International Conference on Web Information Systems and Technologies - WEBIST,}, year ={2022}, pages ={}, publisher ={}, organization ={}, doi ={}, isbn ={}, issn ={}, } ```