English · Summarization · 5 papers
igorgavi committed Commit 6b232ca
1 Parent(s): 92aefa9

Update README.md

Files changed (1):
  1. README.md +73 -50
README.md CHANGED
@@ -53,9 +53,6 @@ model are the already existing and vastly applied BART-Large CNN, Pegasus-XSUM a
  the Sumy Python Library and include SumyRandom, SumyLuhn, SumyLsa, SumyLexRank, SumyTextRank, SumySumBasic, SumyKL and SumyReduction. Each of the
  methods used for text summarization will be described individually in the following sections.
 
-
- ![architeru](https://github.com/marcosdib/S2Query/Classification_Architecture_model.png)
-
  ## Methods
 
  Since there are many methods to choose from in order to perform the ATS task using this model, the following table presents useful information
@@ -77,67 +74,91 @@ its implementation and the article from which it originated.
  | mT5 Multilingual XLSUM | Abstractive | [csebuetnlp/mT5_multilingual_XLSum](https://huggingface.co/csebuetnlp/mT5_multilingual_XLSum)| [(Raffel et al., 2019)](https://www.jmlr.org/papers/volume21/20-074/20-074.pdf?ref=https://githubhelp.com) |
 
 
- ## Intended uses & limitations
 
- You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to
- be fine-tuned on a downstream task. See the [model hub](https://www.google.com) to look for
- fine-tuned versions of a task that interests you.
 
- Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
- to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
- generation you should look at model like XXX.
 
  ### How to use
 
- You can use this model directly with a pipeline for masked language modeling:
  ```python
- >>> from transformers import pipeline
- >>> unmasker = pipeline('fill-mask', model='bert-base-uncased')
- >>> unmasker("Hello I'm a [MASK] model.")
-
- [{'sequence': "[CLS] hello i'm a fashion model. [SEP]",
-  'score': 0.1073106899857521,
-  'token': 4827,
-  'token_str': 'fashion'},
- {'sequence': "[CLS] hello i'm a role model. [SEP]",
-  'score': 0.08774490654468536,
-  'token': 2535,
-  'token_str': 'role'},
- {'sequence': "[CLS] hello i'm a new model. [SEP]",
-  'score': 0.05338378623127937,
-  'token': 2047,
-  'token_str': 'new'},
- {'sequence': "[CLS] hello i'm a super model. [SEP]",
-  'score': 0.04667217284440994,
-  'token': 3565,
-  'token_str': 'super'},
- {'sequence': "[CLS] hello i'm a fine model. [SEP]",
-  'score': 0.027095865458250046,
-  'token': 2986,
-  'token_str': 'fine'}]
- ```
 
- Here is how to use this model to get the features of a given text in PyTorch:
 
  ```python
- from transformers import BertTokenizer, BertModel
- tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
- model = BertModel.from_pretrained("bert-base-uncased")
- text = "Replace me by any text you'd like."
- encoded_input = tokenizer(text, return_tensors='pt')
- output = model(**encoded_input)
  ```
 
- and in TensorFlow:
 
  ```python
- from transformers import BertTokenizer, TFBertModel
- tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
- model = TFBertModel.from_pretrained("bert-base-uncased")
- text = "Replace me by any text you'd like."
- encoded_input = tokenizer(text, return_tensors='tf')
- output = model(encoded_input)
  ```
 
  ## Training data
@@ -147,6 +168,8 @@ unpublished books and [English Wikipedia](https://en.wikipedia.org/wiki/English_
  headers).
 
 
  ## Training procedure
 
  ### Preprocessing
 
  the Sumy Python Library and include SumyRandom, SumyLuhn, SumyLsa, SumyLexRank, SumyTextRank, SumySumBasic, SumyKL and SumyReduction. Each of the
  methods used for text summarization will be described individually in the following sections.
 
 
  ## Methods
 
  Since there are many methods to choose from in order to perform the ATS task using this model, the following table presents useful information
 
  | mT5 Multilingual XLSUM | Abstractive | [csebuetnlp/mT5_multilingual_XLSum](https://huggingface.co/csebuetnlp/mT5_multilingual_XLSum)| [(Raffel et al., 2019)](https://www.jmlr.org/papers/volume21/20-074/20-074.pdf?ref=https://githubhelp.com) |
 
+ ## Limitations
 
 
  ### How to use
 
83
+ Initially, some libraries will need to be imported in order for the program to work. The following lines
84
+ of code, then, are necessary:
85
 
86
  ```python
87
+ import threading
88
+ from alive_progress import alive_bar
89
+ from datasets import load_dataset
90
+ from bs4 import BeautifulSoup
91
+ import pandas as pd
92
+ import numpy as np
93
+ import shutil
94
+ import regex
95
+ import os
96
+ import re
97
+ import itertools as it
98
+ import more_itertools as mit
 
 
 
 
 
 
 
 
 
 
 
 
 
99
 
100
+ ```
+ If any of the above-mentioned libraries is not installed on the user's machine, it can be
+ installed from the command line with:
 
  ```bash
+ pip install [LIBRARY]
  ```
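A quick way to confirm the dependencies are present before running anything is a small stdlib-only check (this helper is illustrative and not part of the repository):

```python
import importlib.util

# Third-party modules imported by the script above (module names,
# which can differ from the pip package names, e.g. bs4/beautifulsoup4).
required = ["alive_progress", "datasets", "bs4", "pandas",
            "numpy", "regex", "more_itertools"]

# importlib.util.find_spec returns None when a module cannot be imported.
missing = [name for name in required if importlib.util.find_spec(name) is None]
print("missing:", missing if missing else "none")
```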
 
+ To run the code on a given corpus of data, the following lines need to be inserted. If one or
+ more of the corpora, summarizers or evaluation metrics should not be applied, the user has to
+ comment out the unwanted option.
 
  ```python
+ if __name__ == "__main__":
+ 
+     corpora = [
+         "mcti_data",
+         "cnn_dailymail",
+         "big_patent",
+         "cnn_corpus_abstractive",
+         "cnn_corpus_extractive",
+         "xsum",
+         "arxiv_pubmed",
+     ]
+ 
+     summarizers = [
+         "SumyRandom",
+         "SumyLuhn",
+         "SumyLsa",
+         "SumyLexRank",
+         "SumyTextRank",
+         "SumySumBasic",
+         "SumyKL",
+         "SumyReduction",
+         "Transformers-facebook/bart-large-cnn",
+         "Transformers-google/pegasus-xsum",
+         "Transformers-csebuetnlp/mT5_multilingual_XLSum",
+     ]
+ 
+     metrics = [
+         "rouge",
+         "gensim",
+         "nltk",
+         "sklearn",
+     ]
+ 
+     # Running methods and eval locally
+     reader = Data()
+     reader.show_available_databases()
+     for corpus in corpora:
+         data = reader.read_data(corpus, 50)
+         method = Method(data, corpus)
+         method.show_methods()
+         for summarizer in summarizers:
+             df = method.run(summarizer)
+             method.examples_to_csv()
+             evaluator = Evaluator(df, summarizer, corpus)
+             for metric in metrics:
+                 evaluator.run(metric)
+             evaluator.metrics_to_csv()
+         evaluator.join_all_results()
  ```
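The `sklearn` entry in `metrics` scores a generated summary against a reference. The `Evaluator` internals are not shown in this README, but cosine similarity over TF-IDF vectors, sketched below with hypothetical example strings, is one common way such a score is computed:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical reference summary and model-generated candidate.
reference = "The model produces short abstractive summaries of news articles."
candidate = "Short abstractive summaries of news articles are produced by the model."

# Vectorize both texts on a shared vocabulary, then compare the rows.
tfidf = TfidfVectorizer().fit_transform([reference, candidate])
score = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
print(round(score, 3))
```

A score near 1.0 indicates heavy word overlap; unlike ROUGE it ignores word order entirely, which is why the project reports several metrics side by side.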
 
  ## Training data
 
  headers).
 
+ 
+ 
  ## Training procedure
 
  ### Preprocessing