igorgavi committed on
Commit 4b1346b
1 Parent(s): 97f5942

Update README.md

Files changed (1)
  1. README.md +5 -5
README.md CHANGED
@@ -173,7 +173,7 @@ Hey, look how easy it is to write LaTeX equations in here \\(Ax = b\\) or even $
 In order to evaluate the model, summaries were generated by each of its summarization methods, which
 used as source texts documents obtained from existing datasets. The chosen datasets for evaluation were the following:
 
- - **Scientific Papers (arXiv + PubMed)**: Cohan et al. (2018) found out that there were only
+ - **Scientific Papers (arXiv + PubMed)**: [Cohan et al. (2018)](https://arxiv.org/pdf/1804.05685) found out that there were only
 datasets with short texts (with an average of 600 words) or datasets with longer texts with
 extractive human summaries. In order to fill that gap and to provide a dataset with long text
 documents for abstractive summarization, the authors compiled two new datasets with scientific
@@ -181,19 +181,19 @@ papers from arXiv and PubMed databases. Scientific papers are specially convenie
 desired kind of ATS the authors mean to achieve, and that is due to their large length and
 the fact that each one contains an abstractive summary made by its author – i.e., the paper’s
 abstract.
- - **BIGPATENT**: Sharma et al. (2019) introduced the BIGPATENT dataset, which provides good
+ - **BIGPATENT**: [Sharma et al. (2019)](https://arxiv.org/pdf/1906.03741) introduced the BIGPATENT dataset, which provides good
 examples for the task of abstractive summarization. The dataset is built from the Google
 Patents Public Datasets, where for each document there is one gold-standard summary, which
 is the patent’s original abstract. One advantage of this dataset is that it does not present
 difficulties inherent to news summarization datasets, where summaries have a flattened discourse
 structure and the summary content arises at the beginning of the document.
- - **CNN Corpus**: Lins et al. (2019b) introduced the corpus to address the fact that most single-document
+ - **CNN Corpus**: [Lins et al. (2019)](https://par.nsf.gov/servlets/purl/10185297) introduced the corpus to address the fact that most single-document
 news summarization datasets have fewer than 1,000 documents. The CNN-Corpus
 dataset, thus, contains 3,000 single documents with two gold-standard summaries each: one
 extractive and one abstractive. The inclusion of extractive gold-standard summaries is
 also an advantage of this particular dataset over others with similar goals, which usually
 contain only abstractive ones.
- - **CNN/Daily Mail**: Hermann et al. (2015) intended to develop a consistent method for what
+ - **CNN/Daily Mail**: [Hermann et al. (2015)](https://proceedings.neurips.cc/paper/2015/file/afdec7005cc9f14302cd0474fd0f3c96-Paper.pdf) intended to develop a consistent method for what
 they called “teaching machines how to read”, i.e., making the machine able to comprehend a
 text via Natural Language Processing techniques. In order to perform that task, they collected
 around 400k news articles from the newspapers CNN and Daily Mail and evaluated what they considered
@@ -201,7 +201,7 @@ to be the key aspect in understanding a text, namely the answering of somewhat c
 questions about it. Even though ATS is not the main focus of the authors, they took inspiration
 from it to develop their model and include in their dataset the human-made summaries for each
 news article.
- - **XSum**: Narayan et al. (2018b) introduced the single-document dataset, which focuses on a
+ - **XSum**: [Narayan et al. (2018)](https://arxiv.org/pdf/1808.08745) introduced the single-document dataset, which focuses on a
 kind of summarization described by the authors as extreme summarization – an abstractive
 kind of ATS that is aimed at answering the question “What is the document about?”. The data
 was obtained from BBC articles and each one of them is accompanied by a short gold-standard
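
Most of the evaluation corpora listed in the diff above are also published on the Hugging Face Hub. As a minimal, illustrative sketch (not part of this commit; the dataset IDs, configuration names, and field names below are assumptions based on the public Hub), they could be loaded with the `datasets` library roughly like this. The CNN Corpus is omitted because it has no standard Hub release.

```python
# Illustrative only: dataset IDs, configs, and field names are assumptions
# based on the public Hugging Face Hub, not part of the commit above.
# Some of these are script-based datasets and may additionally require
# trust_remote_code=True on newer versions of the `datasets` library.
from datasets import load_dataset

arxiv     = load_dataset("scientific_papers", "arxiv", split="test")   # Cohan et al. (2018)
pubmed    = load_dataset("scientific_papers", "pubmed", split="test")  # Cohan et al. (2018)
bigpatent = load_dataset("big_patent", "all", split="test")            # Sharma et al. (2019)
cnn_dm    = load_dataset("cnn_dailymail", "3.0.0", split="test")       # Hermann et al. (2015)
xsum      = load_dataset("xsum", split="test")                         # Narayan et al. (2018)

# Each record pairs a source document with its gold-standard summary, e.g.:
print(cnn_dm[0]["article"][:300])   # source news article
print(cnn_dm[0]["highlights"])      # abstractive reference summary
print(xsum[0]["document"][:300], "->", xsum[0]["summary"])
```

Downloading all five corpora is heavy; in practice one would load only the splits actually needed for evaluation.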