pinned: false
license: apache-2.0
---

# Auto-Research
![Auto-Research][logo]

[logo]: https://github.com/sidphbot/Auto-Research/blob/main/logo.png

A no-code utility to generate a detailed, well-cited survey with topic-clustered sections (draft paper format) and other interesting artifacts from a single research query.

Data Provider: [arXiv](https://arxiv.org/) Open Archive Initiative (OAI)

Requirements:
- python 3.7 or above
- poppler-utils - `sudo apt-get install build-essential libpoppler-cpp-dev pkg-config python-dev`
- the packages listed in requirements.txt - `cat requirements.txt | xargs pip install`
- 8GB disk space
- 13GB CUDA (GPU) memory - for a survey of 100 searched papers (`max_search`) and 25 selected papers (`num_papers`)

#### Demo:

Video Demo: https://drive.google.com/file/d/1-77J2L10lsW-bFDOGdTaPzSr_utY743g/view?usp=sharing

Kaggle Re-usable Demo: https://www.kaggle.com/sidharthpal/auto-research-generate-survey-from-query

(`[TIP]` click 'edit and run' to run the demo for your custom queries on a free GPU)

#### Installation:
```
sudo apt-get install build-essential poppler-utils libpoppler-cpp-dev pkg-config python-dev
pip install git+https://github.com/sidphbot/Auto-Research.git
```

#### Run Survey (CLI):
```
python survey.py [options] <your_research_query>
```

#### Run Survey (Streamlit web interface - new):
```
streamlit run app.py
```

#### Run Survey (Python API):
```
from survey import Surveyor
mysurveyor = Surveyor()
mysurveyor.survey('quantum entanglement')
```

### Research tools:

These are independent tools for your research or document text handling needs.

*[Tip]* : models can be changed in the defaults or passed in during init along with `refresh-models=True`

- `abstractive_summary` - takes a long text document (`string`) and returns a one-paragraph abstract or "abstractive" summary (`string`)

    Input:

    `longtext` : string

    Returns:

    `summary` : string

- `extractive_summary` - takes a long text document (`string`) and returns one paragraph of extracted highlights or an "extractive" summary (`string`)

    Input:

    `longtext` : string

    Returns:

    `summary` : string

- `generate_title` - takes a long text document (`string`) and returns a generated title (`string`)

    Input:

    `longtext` : string

    Returns:

    `title` : string

- `extractive_highlights` - takes a long text document (`string`) and returns a list of extracted highlights (`[string]`), a list of keywords (`[string]`) and key phrases (`[string]`)

    Input:

    `longtext` : string

    Returns:

    `highlights` : [string]
    `keywords` : [string]
    `keyphrases` : [string]

- `extract_images_from_file` - takes a pdf file name (`string`) and returns a list of image filenames (`[string]`).

    Input:

    `pdf_file` : string

    Returns:

    `images_files` : [string]

- `extract_tables_from_file` - takes a pdf file name (`string`) and returns a list of csv filenames (`[string]`).

    Input:

    `pdf_file` : string

    Returns:

    `csv_files` : [string]

- `cluster_lines` - takes a list of lines (`[string]`) and returns the topic-clustered sections (`dict(generated_title: [cluster_abstract])`) and clustered lines (`dict(cluster_id: [cluster_lines])`)

    Input:

    `lines` : [string]

    Returns:

    `sections` : dict(generated_title: [cluster_abstract])
    `clusters` : dict(cluster_id: [cluster_lines])

- `extract_headings` - *[for scientific texts - assumes an 'abstract' heading is present]* takes a text file name (`string`) and returns a list of headings (`[string]`) and refined lines (`[string]`).

    `[Tip 1]` : Use `extract_sections` as a wrapper (e.g. `extract_sections(extract_headings("/path/to/textfile"))`) to get heading-wise sectioned text with refined lines instead (`dict(heading: text)`)

    `[Tip 2]` : write the word 'abstract' at the start of the file text to get an extraction for non-scientific texts as well !!

    Input:

    `text_file` : string

    Returns:

    `refined` : [string]
    `headings` : [string]
    `sectioned_doc` : dict(heading: text) (Optional - wrapper case)

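To make the "extractive" idea above concrete, here is a toy, self-contained sketch. It is not the project's transformer-based implementation - the function name and the frequency-based scoring are invented for illustration - but it mirrors the string-in / string-out signature of `extractive_summary`:

```python
# Toy illustration of an "extractive" summary: score sentences by word
# frequency and return the top-scoring ones verbatim, in original order.
# This is NOT the project's implementation, which uses transformer models.
from collections import Counter
import re

def toy_extractive_summary(longtext: str, n_sentences: int = 2) -> str:
    # naive sentence split on terminal punctuation
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", longtext) if s.strip()]
    freq = Counter(re.findall(r"\w+", longtext.lower()))
    # score each sentence by the total corpus frequency of its words
    scored = sorted(sentences,
                    key=lambda s: -sum(freq[w] for w in re.findall(r"\w+", s.lower())))
    top = set(scored[:n_sentences])
    # keep the original ordering of the selected sentences
    return " ".join(s for s in sentences if s in top)
```

The real tools return the same shapes (`string` in, `string` out), so they can be swapped into any pipeline prototyped against a stub like this.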
## Access/Modify defaults:

- inside code

```
from survey.Surveyor import DEFAULTS
from pprint import pprint

pprint(DEFAULTS)
```

or,

- modify the static config file - `defaults.py`

or,

- at runtime (utility)

```
python survey.py --help
```
```
usage: survey.py [-h] [--max_search max_metadata_papers]
                 [--num_papers max_num_papers] [--pdf_dir pdf_dir]
                 [--txt_dir txt_dir] [--img_dir img_dir] [--tab_dir tab_dir]
                 [--dump_dir dump_dir] [--models_dir save_models_dir]
                 [--title_model_name title_model_name]
                 [--ex_summ_model_name extractive_summ_model_name]
                 [--ledmodel_name ledmodel_name]
                 [--embedder_name sentence_embedder_name]
                 [--nlp_name spacy_model_name]
                 [--similarity_nlp_name similarity_nlp_name]
                 [--kw_model_name kw_model_name]
                 [--refresh_models refresh_models] [--high_gpu high_gpu]
                 query_string

Generate a survey just from a query !!

positional arguments:
  query_string          your research query/keywords

optional arguments:
  -h, --help            show this help message and exit
  --max_search max_metadata_papers
                        maximum number of papers to gaze at - defaults to 100
  --num_papers max_num_papers
                        maximum number of papers to download and analyse -
                        defaults to 25
  --pdf_dir pdf_dir     pdf paper storage directory - defaults to
                        arxiv_data/tarpdfs/
  --txt_dir txt_dir     text-converted paper storage directory - defaults to
                        arxiv_data/fulltext/
  --img_dir img_dir     image storage directory - defaults to
                        arxiv_data/images/
  --tab_dir tab_dir     tables storage directory - defaults to
                        arxiv_data/tables/
  --dump_dir dump_dir   output directory for all artifacts - defaults to
                        arxiv_dumps/
  --models_dir save_models_dir
                        directory to save models (> 5GB) - defaults to
                        saved_models/
  --title_model_name title_model_name
                        title model name/tag in hugging-face, defaults to
                        'Callidior/bert2bert-base-arxiv-titlegen'
  --ex_summ_model_name extractive_summ_model_name
                        extractive summary model name/tag in hugging-face,
                        defaults to 'allenai/scibert_scivocab_uncased'
  --ledmodel_name ledmodel_name
                        led model (for abstractive summary) name/tag in
                        hugging-face, defaults to
                        'allenai/led-large-16384-arxiv'
  --embedder_name sentence_embedder_name
                        sentence embedder name/tag in hugging-face, defaults
                        to 'paraphrase-MiniLM-L6-v2'
  --nlp_name spacy_model_name
                        spacy model name/tag in hugging-face (if changed -
                        needs to be spacy-installed prior), defaults to
                        'en_core_sci_scibert'
  --similarity_nlp_name similarity_nlp_name
                        spacy downstream model (for similarity) name/tag in
                        hugging-face (if changed - needs to be spacy-installed
                        prior), defaults to 'en_core_sci_lg'
  --kw_model_name kw_model_name
                        keyword extraction model name/tag in hugging-face,
                        defaults to 'distilbert-base-nli-mean-tokens'
  --refresh_models refresh_models
                        refresh model downloads with the given names (needs
                        at least one model name param above), defaults to
                        False
  --high_gpu high_gpu   high GPU usage permitted, defaults to False

```

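The flags above map directly onto the `Surveyor` parameters listed next. As a rough sketch of that mapping, here is a simplified stand-in built with `argparse` - only a subset of the flags, and not the project's actual parser:

```python
# Illustrative subset of the CLI above; flag names and defaults are taken
# from the help text, but this is a hypothetical sketch, not survey.py itself.
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Generate a survey just from a query !!")
    parser.add_argument("query_string", help="your research query/keywords")
    parser.add_argument("--max_search", type=int, default=100,
                        help="maximum number of papers to gaze at")
    parser.add_argument("--num_papers", type=int, default=25,
                        help="maximum number of papers to download and analyse")
    parser.add_argument("--dump_dir", default="arxiv_dumps/",
                        help="output directory for all artifacts")
    return parser

# e.g. the equivalent of: python survey.py --num_papers 10 "quantum entanglement"
args = build_parser().parse_args(["quantum entanglement", "--num_papers", "10"])
```

Unspecified flags fall back to the documented defaults, exactly as in the help text.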
- At runtime (code)

> during surveyor object initialization with `surveyor_obj = Surveyor()`
- `pdf_dir`: String, pdf paper storage directory - defaults to `arxiv_data/tarpdfs/`
- `txt_dir`: String, text-converted paper storage directory - defaults to `arxiv_data/fulltext/`
- `img_dir`: String, image storage directory - defaults to `arxiv_data/images/`
- `tab_dir`: String, tables storage directory - defaults to `arxiv_data/tables/`
- `dump_dir`: String, output directory for all artifacts - defaults to `arxiv_dumps/`
- `models_dir`: String, directory to save the large models to - defaults to `saved_models/`
- `title_model_name`: String, title model name/tag in hugging-face - defaults to `Callidior/bert2bert-base-arxiv-titlegen`
- `ex_summ_model_name`: String, extractive summary model name/tag in hugging-face - defaults to `allenai/scibert_scivocab_uncased`
- `ledmodel_name`: String, led model (for abstractive summary) name/tag in hugging-face - defaults to `allenai/led-large-16384-arxiv`
- `embedder_name`: String, sentence embedder name/tag in hugging-face - defaults to `paraphrase-MiniLM-L6-v2`
- `nlp_name`: String, spacy model name/tag in hugging-face (if changed - needs to be spacy-installed prior) - defaults to `en_core_sci_scibert`
- `similarity_nlp_name`: String, spacy downstream trained model (for similarity) name/tag in hugging-face (if changed - needs to be spacy-installed prior) - defaults to `en_core_sci_lg`
- `kw_model_name`: String, keyword extraction model name/tag in hugging-face - defaults to `distilbert-base-nli-mean-tokens`
- `high_gpu`: Bool, high GPU usage permitted - defaults to `False`
- `refresh_models`: Bool, refresh model downloads with the given names (needs at least one model name param above) - defaults to `False`

> during survey generation with `surveyor_obj.survey(query="my_research_query")`
- `max_search`: int, maximum number of papers to gaze at - defaults to `100`
- `num_papers`: int, maximum number of papers to download and analyse - defaults to `25`

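The precedence implied above - static `defaults.py` values, overridden by anything passed at init or survey time - can be sketched as a simple dict merge. The keys below are a small invented subset, not the project's actual `DEFAULTS`:

```python
# Toy illustration of default-override precedence; not the project's code.
DEFAULTS = {"max_search": 100, "num_papers": 25, "high_gpu": False}

def resolve_config(**overrides):
    """Merge runtime overrides onto the static defaults, rejecting unknown keys."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown option(s): {sorted(unknown)}")
    # later dict wins, so runtime overrides shadow the static defaults
    return {**DEFAULTS, **overrides}
```
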

#### Artifacts generated (zipped):
- Detailed survey draft paper as a txt file
- A curated list of top 25+ papers as pdfs and txts
- Images extracted from the above papers as jpegs, bmps etc.
- Heading/section-wise highlights extracted from the above papers as a re-usable pure python joblib dump
- Tables extracted from the papers (optional)
- Corpus of metadata highlights/text of the top 100 papers as a re-usable pure python joblib dump


Please cite this repo if it helped you :)