igorgavi committed
Commit 44b6e9f
1 Parent(s): a9ce3f2

Update README.md

Files changed (1)
  1. README.md +8 -5
README.md CHANGED
@@ -164,7 +164,7 @@ if __name__ == "__main__":
  ## Training data
  
  In order to train the model, its transformers were trained on five datasets:
- - Scientific Papers (arXiv + PubMed): Cohan et al. (2018) found that there were only
+ - **Scientific Papers (arXiv + PubMed)**: Cohan et al. (2018) found that there were only
  datasets with short texts (with an average of 600 words) or datasets with longer texts with
  extractive human summaries. To fill this gap and provide a dataset with long text
  documents for abstractive summarization, the authors compiled two new datasets with scientific
@@ -172,19 +172,19 @@ papers from arXiv and PubMed databases. Scientific papers are specially convenie
  desired kind of ATS the authors mean to achieve, and that is due to their large length and
  the fact that each one contains an abstractive summary made by its author – i.e., the paper’s
  abstract.
- - BIGPATENT: Sharma et al. (2019) introduced the BIGPATENT dataset, which provides good
+ - **BIGPATENT**: Sharma et al. (2019) introduced the BIGPATENT dataset, which provides good
  examples for the task of abstractive summarization. The dataset is built using Google
  Patents Public Datasets, where for each document there is one gold-standard summary, which
  is the patent’s original abstract. One advantage of this dataset is that it does not present
  difficulties inherent to news summarization datasets, where summaries have a flattened discourse
  structure and the summary content appears at the beginning of the document.
- - CNN Corpus: Lins et al. (2019b) introduced this corpus to address the fact that most news
+ - **CNN Corpus**: Lins et al. (2019b) introduced this corpus to address the fact that most news
  summarization single-document datasets have fewer than 1,000 documents. The CNN-Corpus
  dataset thus contains 3,000 single documents with two gold-standard summaries each: one
  extractive and one abstractive. The inclusion of extractive gold-standard summaries is
  also an advantage of this particular dataset over others with similar goals, which usually only
  contain abstractive ones.
- - CNN/Daily Mail: Hermann et al. (2015) intended to develop a consistent method for what
+ - **CNN/Daily Mail**: Hermann et al. (2015) intended to develop a consistent method for what
  they called “teaching machines how to read”, i.e., making the machine able to comprehend a
  text via Natural Language Processing techniques. In order to perform that task, they collected
  around 400k news articles from CNN and Daily Mail and evaluated what they considered
@@ -192,7 +192,7 @@ to be the key aspect in understanding a text, namely the answering of somewhat c
  questions about it. Even though ATS is not the authors’ main focus, they took inspiration
  from it to develop their model and include in their dataset the human-made summaries for each
  news article.
- - XSum: Narayan et al. (2018b) introduced this single-document dataset, which focuses on a
+ - **XSum**: Narayan et al. (2018b) introduced this single-document dataset, which focuses on a
  kind of summarization described by the authors as extreme summarization – an abstractive
  kind of ATS that is aimed at answering the question “What is the document about?”. The data
  was obtained from BBC articles, each of which is accompanied by a short gold-standard
@@ -200,8 +200,11 @@ summary often written by its very author.
  
  ## Training procedure
  
+
  ### Preprocessing
+ [ASK ARTHUR]
  
+ Hey, look how easy it is to write LaTeX equations in here \\(Ax = b\\) or even $$Ax = b$$
  ## Evaluation results
  
  
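
For context on the training-data list in the diff above: the README does not show how these corpora were obtained. The following is a minimal, hypothetical sketch, not the authors' actual pipeline, assuming the public Hugging Face Hub copies of four of the five datasets (`scientific_papers`, `big_patent`, `cnn_dailymail`, `xsum`); the CNN Corpus (Lins et al., 2019b) is omitted because it has no public Hub release, and the configuration and column names below are those of the public releases, not necessarily the exact snapshots used for this model.

```python
# Hypothetical sketch: loading public copies of four of the five corpora named
# in the training-data section with the Hugging Face `datasets` library.
# This does not reproduce the authors' preprocessing; it only shows where
# comparable document/summary pairs can be obtained.
from datasets import load_dataset

# Scientific Papers (Cohan et al., 2018): two configurations, arXiv and PubMed.
arxiv = load_dataset("scientific_papers", "arxiv", split="train")
pubmed = load_dataset("scientific_papers", "pubmed", split="train")

# BIGPATENT (Sharma et al., 2019): the "all" configuration covers every CPC class.
big_patent = load_dataset("big_patent", "all", split="train")

# CNN/Daily Mail (Hermann et al., 2015): "3.0.0" is the non-anonymized release.
cnn_dm = load_dataset("cnn_dailymail", "3.0.0", split="train")

# XSum (Narayan et al., 2018b): BBC articles with one-sentence abstractive summaries.
xsum = load_dataset("xsum", split="train")

# Each corpus pairs a source document with a gold-standard abstractive summary,
# but the column names differ between datasets.
print(arxiv[0]["abstract"][:100])      # summary column for scientific_papers
print(cnn_dm[0]["highlights"][:100])   # summary column for cnn_dailymail
print(xsum[0]["summary"][:100])        # summary column for xsum
```

Depending on the installed `datasets` version, some of these script-based loaders may additionally require `trust_remote_code=True`.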