Metadata-Version: 2.1
Name: Auto-Research
Version: 1.0
Summary: Generate a scientific survey with just a query
Home-page: https://github.com/sidphbot/Auto-Research
Author: Sidharth Pal
Author-email: sidharth.pal1992@gmail.com
License: UNKNOWN
Project-URL: Docs, https://github.com/example/example/README.md
Project-URL: Bug Tracker, https://github.com/sidphbot/Auto-Research/issues
Project-URL: Demo, https://www.kaggle.com/sidharthpal/auto-research-generate-survey-from-query
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: Console
Classifier: Environment :: Other Environment
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Other Audience
Classifier: Topic :: Education
Classifier: Topic :: Education :: Computer Aided Instruction (CAI)
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Scientific/Engineering :: Medical Science Apps.
Classifier: Topic :: Scientific/Engineering :: Physics
Classifier: Natural Language :: English
Classifier: License :: OSI Approved :: GNU General Public License (GPL)
Classifier: License :: OSI Approved :: GNU Library or Lesser General Public License (LGPL)
Classifier: License :: OSI Approved :: GNU Lesser General Public License v3 or later (LGPLv3+)
Classifier: License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)
Classifier: Operating System :: POSIX :: Linux
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Environment :: GPU
Classifier: Environment :: GPU :: NVIDIA CUDA
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.7
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Provides-Extra: spacy
License-File: LICENSE

# Auto-Research
![Auto-Research][logo]

[logo]: https://github.com/sidphbot/Auto-Research/blob/main/logo.png
A no-code utility that generates a detailed, well-cited survey with topic-clustered sections (draft paper format) and other useful artifacts from a single research query.

Data Provider: [arXiv](https://arxiv.org/) via the Open Archives Initiative (OAI)

Requirements:
 - Python 3.7 or above
 - poppler-utils - `sudo apt-get install build-essential libpoppler-cpp-dev pkg-config python-dev`
 - the packages listed in requirements.txt - `cat requirements.txt | xargs pip install`
 - 8GB disk space
 - 13GB CUDA (GPU) memory - for a survey of 100 searched papers (`max_search`) and 25 selected papers (`num_papers`)

#### Demo : 

Video Demo : https://drive.google.com/file/d/1-77J2L10lsW-bFDOGdTaPzSr_utY743g/view?usp=sharing

Kaggle Re-usable Demo : https://www.kaggle.com/sidharthpal/auto-research-generate-survey-from-query 

(`[TIP]` click 'edit and run' to run the demo for your custom queries on a free GPU)


#### Steps to run (pip coming soon):
```
apt install -y poppler-utils libpoppler-cpp-dev
git clone https://github.com/sidphbot/Auto-Research.git

cd Auto-Research/
pip install -r requirements.txt
python survey.py [options] <your_research_query>
```

#### Artifacts generated (zipped):
- Detailed survey draft paper as a txt file
- A curated list of the top 25+ papers as pdfs and txts
- Images extracted from the above papers as jpegs, bmps etc.
- Heading/section-wise highlights extracted from the above papers as a re-usable pure-Python joblib dump
- Tables extracted from the papers (optional)
- Corpus of metadata highlights/text of the top 100 papers as a re-usable pure-Python joblib dump
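
Since the artifacts arrive as a single zip, a quick way to see what a run produced is to group the archive members by file type. This is a minimal sketch using only the standard library; the README does not specify the archive name or its internal layout, so `group_artifacts` is a hypothetical helper and the path you pass in is an assumption.

```python
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath


def group_artifacts(zip_path):
    """Group member names of a zip archive by lower-cased file extension.

    Hypothetical helper, not part of Auto-Research.
    """
    groups = defaultdict(list)
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            # Zip member names always use forward slashes.
            ext = PurePosixPath(name).suffix.lower() or "(none)"
            groups[ext].append(name)
    return dict(groups)
```

Calling `group_artifacts("path/to/your_survey.zip")` (placeholder path) returns a dict such as `{".pdf": [...], ".txt": [...]}`, which makes it easy to pick out the survey draft, the papers, and the dumps.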

## Example run #1 - python utility

```
python survey.py 'multi-task representation learning'
```

## Example run #2 - python class

```
from survey import Surveyor
mysurveyor = Surveyor()
mysurveyor.survey('quantum entanglement')
```

### Research tools: 

These are independent tools for your research or document text handling needs.

`[Tip]` : models can be changed in the defaults or passed during init along with `refresh_models=True`

- `abstractive_summary` - takes a long text document (`string`) and returns a 1-paragraph abstract or “abstractive” summary (`string`)

	Input: 		
		
		`longtext` : string
		
	Returns: 		
		
		`summary` : string

- `extractive_summary` - takes a long text document (`string`) and returns a one-paragraph “extractive” summary made of extracted highlights (`string`)

	Input: 		
		
		`longtext` : string
		
	Returns: 		
		
		`summary` : string

- `generate_title` - takes a long text document (`string`) and returns a generated title (`string`)

	Input: 		
		
		`longtext` : string
		
	Returns: 		
		
		`title` : string

- `extractive_highlights` - takes a long text document (`string`) and returns a list of extracted highlights (`[string]`), a list of keywords (`[string]`) and key phrases (`[string]`)

	Input: 		
		
		`longtext` : string
		
	Returns: 		
		
		`highlights` : [string]
		`keywords` : [string]
		`keyphrases` : [string]

- `extract_images_from_file` - takes a pdf file name (`string`) and returns a list of image filenames (`[string]`).

	Input: 		
		
		`pdf_file` : string
		
	Returns: 		
		
		`images_files` : [string]

- `extract_tables_from_file` - takes a pdf file name (`string`) and returns a list of csv filenames (`[string]`).

	Input: 		
		
		`pdf_file` : string
		
	Returns: 		
		
		`csv_files` : [string]

- `cluster_lines` - takes a list of lines (`[string]`) and returns the topic-clustered sections (`dict(generated_title: [cluster_abstract])`) and clustered lines (`dict(cluster_id: [cluster_lines])`)

	Input: 		
		
		`lines` : [string]
		
	Returns: 		
		
		`sections` : dict(generated_title: [cluster_abstract])
		`clusters` : dict(cluster_id: [cluster_lines])

- `extract_headings` - *[for scientific texts - assumes an ‘abstract’ heading is present]* takes a text file name (`string`) and returns a list of headings (`[string]`) and refined lines (`[string]`).
    
    `[Tip 1]` : Use `extract_sections` as a wrapper (e.g. `extract_sections(extract_headings("/path/to/textfile"))`) to get heading-wise sectioned text with refined lines instead (`dict(heading: text)`)
    
    `[Tip 2]` : Write the word ‘abstract’ at the start of the file text to get an extraction for non-scientific texts as well.

	Input: 		
		
		`text_file` : string 		
		
	Returns: 
		
		`refined` : [string]
		`headings` : [string]
		`sectioned_doc` : dict(heading: text) (optional - wrapper case)
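
Tip 2 above can be automated with a small pre-processing step. The sketch below writes a copy of a text file that starts with the word ‘abstract’, so the copy can then be passed to `extract_headings`. `with_abstract_marker` and the `_abstract` file-name suffix are hypothetical and not part of Auto-Research; the check for an existing marker is a simple prefix test and is an assumption about what the extractor looks for.

```python
from pathlib import Path


def with_abstract_marker(text_file, out_file=None):
    """Return the path of a file whose text starts with the word 'abstract'.

    Hypothetical helper: if the marker is missing, a copy with the marker
    prepended is written next to the original (or to out_file).
    """
    src = Path(text_file)
    text = src.read_text(encoding="utf-8")
    if text.lstrip().lower().startswith("abstract"):
        return str(src)  # marker already present, reuse the original file
    if out_file:
        dst = Path(out_file)
    else:
        dst = src.with_name(src.stem + "_abstract" + src.suffix)
    dst.write_text("abstract\n" + text, encoding="utf-8")
    return str(dst)
```

The original file is left untouched, so the same document can still be used elsewhere without the added marker.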


## Access/Modify defaults:

- inside code 
```
from survey.Surveyor import DEFAULTS
from pprint import pprint

pprint(DEFAULTS)
```
or,

- Modify static config file - `defaults.py`

or,

- At runtime (utility)

```
python survey.py --help
```
```
usage: survey.py [-h] [--max_search max_metadata_papers]
                   [--num_papers max_num_papers] [--pdf_dir pdf_dir]
                   [--txt_dir txt_dir] [--img_dir img_dir] [--tab_dir tab_dir]
                   [--dump_dir dump_dir] [--models_dir save_models_dir]
                   [--title_model_name title_model_name]
                   [--ex_summ_model_name extractive_summ_model_name]
                   [--ledmodel_name ledmodel_name]
                   [--embedder_name sentence_embedder_name]
                   [--nlp_name spacy_model_name]
                   [--similarity_nlp_name similarity_nlp_name]
                   [--kw_model_name kw_model_name]
                   [--refresh_models refresh_models] [--high_gpu high_gpu]
                   query_string

Generate a survey just from a query !!

positional arguments:
  query_string          your research query/keywords

optional arguments:
  -h, --help            show this help message and exit
  --max_search max_metadata_papers
                        maximium number of papers to gaze at - defaults to 100
  --num_papers max_num_papers
                        maximium number of papers to download and analyse -
                        defaults to 25
  --pdf_dir pdf_dir     pdf paper storage directory - defaults to
                        arxiv_data/tarpdfs/
  --txt_dir txt_dir     text-converted paper storage directory - defaults to
                        arxiv_data/fulltext/
  --img_dir img_dir     image storage directory - defaults to
                        arxiv_data/images/
  --tab_dir tab_dir     tables storage directory - defaults to
                        arxiv_data/tables/
  --dump_dir dump_dir   all_output_dir - defaults to arxiv_dumps/
  --models_dir save_models_dir
                        directory to save models (> 5GB) - defaults to
                        saved_models/
  --title_model_name title_model_name
                        title model name/tag in hugging-face, defaults to
                        'Callidior/bert2bert-base-arxiv-titlegen'
  --ex_summ_model_name extractive_summ_model_name
                        extractive summary model name/tag in hugging-face,
                        defaults to 'allenai/scibert_scivocab_uncased'
  --ledmodel_name ledmodel_name
                        led model(for abstractive summary) name/tag in
                        hugging-face, defaults to 'allenai/led-
                        large-16384-arxiv'
  --embedder_name sentence_embedder_name
                        sentence embedder name/tag in hugging-face, defaults
                        to 'paraphrase-MiniLM-L6-v2'
  --nlp_name spacy_model_name
                        spacy model name/tag in hugging-face (if changed -
                        needs to be spacy-installed prior), defaults to
                        'en_core_sci_scibert'
  --similarity_nlp_name similarity_nlp_name
                        spacy downstream model(for similarity) name/tag in
                        hugging-face (if changed - needs to be spacy-installed
                        prior), defaults to 'en_core_sci_lg'
  --kw_model_name kw_model_name
                        keyword extraction model name/tag in hugging-face,
                        defaults to 'distilbert-base-nli-mean-tokens'
  --refresh_models refresh_models
                        Refresh model downloads with given names (needs
                        atleast one model name param above), defaults to False
  --high_gpu high_gpu   High GPU usage permitted, defaults to False

```
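
When the utility is driven from another script, the command line can be assembled programmatically. This is a minimal sketch that only uses flag names from the help text above; `build_survey_command` is a hypothetical helper, `survey.py` is assumed to be in the working directory, and how the utility parses boolean values such as `--high_gpu` is not shown in the help text, so passing `True`/`False` as strings is an assumption.

```python
import sys

# Flag names copied from the `survey.py --help` output.
KNOWN_FLAGS = {
    "max_search", "num_papers", "pdf_dir", "txt_dir", "img_dir", "tab_dir",
    "dump_dir", "models_dir", "title_model_name", "ex_summ_model_name",
    "ledmodel_name", "embedder_name", "nlp_name", "similarity_nlp_name",
    "kw_model_name", "refresh_models", "high_gpu",
}


def build_survey_command(query, script="survey.py", **options):
    """Return an argv list for survey.py (hypothetical helper)."""
    unknown = set(options) - KNOWN_FLAGS
    if unknown:
        raise ValueError(f"unknown option(s): {sorted(unknown)}")
    argv = [sys.executable, script]
    for name, value in options.items():
        argv += [f"--{name}", str(value)]
    argv.append(query)  # the positional query_string goes last
    return argv


if __name__ == "__main__":
    cmd = build_survey_command("multi-task representation learning",
                               max_search=50, num_papers=10)
    print(" ".join(cmd[1:]))
    # To launch the run: import subprocess; subprocess.run(cmd, check=True)
```

Rejecting unknown option names up front catches typos before a long model download or GPU run starts.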

- At runtime (code)

    > during surveyor object initialization with `surveyor_obj = Surveyor()`
    - `pdf_dir`: String, pdf paper storage directory - defaults to `arxiv_data/tarpdfs/`
    - `txt_dir`: String, text-converted paper storage directory - defaults to `arxiv_data/fulltext/`
    - `img_dir`: String, image storage directory - defaults to `arxiv_data/images/`
    - `tab_dir`: String, tables storage directory - defaults to `arxiv_data/tables/`
    - `dump_dir`: String, all_output_dir - defaults to `arxiv_dumps/`
    - `models_dir`: String, directory to save models (> 5GB) - defaults to `saved_models/`
    - `title_model_name`: String, title model name/tag in hugging-face, defaults to `Callidior/bert2bert-base-arxiv-titlegen`
    - `ex_summ_model_name`: String, extractive summary model name/tag in hugging-face, defaults to `allenai/scibert_scivocab_uncased`
    - `ledmodel_name`: String, led model(for abstractive summary) name/tag in hugging-face, defaults to `allenai/led-large-16384-arxiv`
    - `embedder_name`: String, sentence embedder name/tag in hugging-face, defaults to `paraphrase-MiniLM-L6-v2`
    - `nlp_name`: String, spacy model name/tag in hugging-face (if changed - needs to be spacy-installed prior), defaults to `en_core_sci_scibert`
    - `similarity_nlp_name`: String, spacy downstream trained model(for similarity) name/tag in hugging-face (if changed - needs to be spacy-installed prior), defaults to `en_core_sci_lg`
    - `kw_model_name`: String, keyword extraction model name/tag in hugging-face, defaults to `distilbert-base-nli-mean-tokens`
    - `high_gpu`: Bool, High GPU usage permitted, defaults to `False`
    - `refresh_models`: Bool, Refresh model downloads with given names (needs at least one model name param above), defaults to `False`
    
    > during survey generation with `surveyor_obj.survey(query="my_research_query")`
    - `max_search`: int, maximum number of papers to search - defaults to `100`
    - `num_papers`: int, maximum number of papers to download and analyse - defaults to `25`
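
Because some options belong to `Surveyor()` and others to `survey()`, it can help to keep them in one flat dict and split them at call time. This is a minimal sketch under these assumptions: the parameter names are the ones listed above, the import path is the one from Example run #2, and `split_options` and `run_survey` are hypothetical helpers. The return value of `survey()` is not documented here, so it is passed through unchanged.

```python
# Parameter names copied from the two lists above.
INIT_PARAMS = {
    "pdf_dir", "txt_dir", "img_dir", "tab_dir", "dump_dir", "models_dir",
    "title_model_name", "ex_summ_model_name", "ledmodel_name",
    "embedder_name", "nlp_name", "similarity_nlp_name", "kw_model_name",
    "high_gpu", "refresh_models",
}
SURVEY_PARAMS = {"max_search", "num_papers"}


def split_options(options):
    """Split a flat dict into (init kwargs, survey kwargs)."""
    unknown = set(options) - INIT_PARAMS - SURVEY_PARAMS
    if unknown:
        raise ValueError(f"unknown option(s): {sorted(unknown)}")
    init_kwargs = {k: v for k, v in options.items() if k in INIT_PARAMS}
    survey_kwargs = {k: v for k, v in options.items() if k in SURVEY_PARAMS}
    return init_kwargs, survey_kwargs


def run_survey(query, **options):
    """Create a Surveyor and run one survey (needs Auto-Research set up)."""
    # Imported lazily so split_options can be used without loading models.
    from survey import Surveyor

    init_kwargs, survey_kwargs = split_options(options)
    surveyor_obj = Surveyor(**init_kwargs)
    return surveyor_obj.survey(query=query, **survey_kwargs)
```

For example, `run_survey("quantum entanglement", high_gpu=True, num_papers=10)` would pass `high_gpu` to the constructor and `num_papers` to `survey()`.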