igorgavi committed
Commit a9ce3f2
1 Parent(s): 6b232ca

Update README.md

Files changed (1):
  1. README.md +35 -7
README.md CHANGED
@@ -76,7 +76,7 @@ its implementation and the article from which it originated.
 
 ## Limitations
 
-
+ [ASK ARTHUR]
 
 ### How to use
 
@@ -163,12 +163,40 @@ if __name__ == "__main__":
 
 ## Training data
 
- The BERT model was pretrained on [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of 11,038
- unpublished books and [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia) (excluding lists, tables and
- headers).
-
-
-
+ In order to train the model, its transformers were trained on five datasets:
+ - Scientific Papers (arXiv + PubMed): Cohan et al. (2018) found that existing datasets
+ contained either short texts (600 words on average) or longer texts paired only with
+ extractive human summaries. To fill this gap and provide a dataset of long documents for
+ abstractive summarization, the authors compiled two new datasets of scientific papers from
+ the arXiv and PubMed databases. Scientific papers are especially convenient for the kind of
+ ATS the authors aim at, owing to their length and the fact that each one already contains
+ an abstractive summary written by its author, i.e., the paper's abstract.
+ - BIGPATENT: Sharma et al. (2019) introduced the BIGPATENT dataset, which provides good
+ examples for the task of abstractive summarization. The dataset is built from Google
+ Patents Public Datasets, where each document comes with one gold-standard summary: the
+ patent's original abstract. One advantage of this dataset is that it avoids difficulties
+ inherent to news summarization datasets, where summaries have a flattened discourse
+ structure and the summary content appears at the beginning of the document.
+ - CNN Corpus: Lins et al. (2019b) introduced this corpus to address the fact that most
+ single-document news summarization datasets contain fewer than 1,000 documents. The
+ CNN-Corpus therefore contains 3,000 single documents, each with two gold-standard
+ summaries: one extractive and one abstractive. The inclusion of extractive gold-standard
+ summaries is also an advantage of this dataset over others with similar goals, which
+ usually contain only abstractive ones.
+ - CNN/Daily Mail: Hermann et al. (2015) set out to develop a consistent method for what
+ they called "teaching machines how to read", i.e., enabling a machine to comprehend text
+ via Natural Language Processing techniques. To do so, they collected around 400k news
+ articles from CNN and Daily Mail and evaluated what they considered the key aspect of
+ understanding a text, namely answering somewhat complex questions about it. Even though
+ ATS was not the authors' main focus, it inspired their model, and their dataset includes
+ the human-written summaries for each news article.
+ - XSum: Narayan et al. (2018b) introduced this single-document dataset, which focuses on a
+ kind of summarization the authors describe as extreme summarization: an abstractive kind
+ of ATS aimed at answering the question "What is the document about?". The data was
+ obtained from BBC articles, each accompanied by a short gold-standard summary, often
+ written by the article's own author.
 
 ## Training procedure
 
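
For reference, the sketch below shows one way the publicly hosted corpora named in the new section might be pulled with the Hugging Face `datasets` library. The dataset IDs, configurations, and split slices are assumptions for illustration, not the training setup actually used for this model, and the CNN Corpus (Lins et al., 2019b) is not assumed to be available on the Hub.

```python
# Illustrative sketch only: loading public Hub versions of four of the corpora above.
# Dataset IDs, configs, and slices are assumptions; depending on the installed `datasets`
# version, some of these script-based datasets may require trust_remote_code=True or a
# mirrored (Parquet) copy on the Hub.
from datasets import load_dataset

# Long scientific documents paired with their author-written abstracts (Cohan et al., 2018).
arxiv = load_dataset("scientific_papers", "arxiv", split="train[:1%]")
pubmed = load_dataset("scientific_papers", "pubmed", split="train[:1%]")

# Patent descriptions paired with their original abstracts (Sharma et al., 2019); large download.
big_patent = load_dataset("big_patent", "all", split="train[:1%]")

# News articles with human-written highlight summaries (Hermann et al., 2015).
cnn_dm = load_dataset("cnn_dailymail", "3.0.0", split="train[:1%]")

# BBC articles with short "extreme" abstractive summaries (Narayan et al., 2018).
xsum = load_dataset("xsum", split="train[:1%]")

# Each corpus pairs a source document with a gold-standard summary, e.g. for arXiv:
sample = arxiv[0]
print(sample["article"][:300])   # the paper body
print(sample["abstract"][:300])  # the abstractive reference summary
```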