sidphbot committed on
Commit 395b423
1 Parent(s): 7af17d4

Update README.md

Files changed (1)
  1. README.md +0 -272
README.md CHANGED
@@ -9,275 +9,3 @@ app_file: app.py
  pinned: false
  license: apache-2.0
  ----
-
- # Auto-Research
- ![Auto-Research][logo]
-
- [logo]: https://github.com/sidphbot/Auto-Research/blob/main/logo.png
- A no-code utility to generate a detailed, well-cited survey with topic-clustered sections (draft paper format) and other interesting artifacts from a single research query.
-
- Data Provider: [arXiv](https://arxiv.org/) Open Archives Initiative (OAI)
-
- Requirements:
- - Python 3.7 or above
- - poppler-utils - `sudo apt-get install build-essential libpoppler-cpp-dev pkg-config python-dev`
- - the packages listed in requirements.txt - `cat requirements.txt | xargs pip install`
- - 8GB disk space
- - 13GB CUDA (GPU) memory - for a survey of 100 searched papers (`max_search`) and 25 selected papers (`num_papers`)
-
- #### Demo:
-
- Video demo: https://drive.google.com/file/d/1-77J2L10lsW-bFDOGdTaPzSr_utY743g/view?usp=sharing
-
- Kaggle re-usable demo: https://www.kaggle.com/sidharthpal/auto-research-generate-survey-from-query
-
- (`[TIP]` : click 'edit and run' to run the demo for your custom queries on a free GPU)
-
-
- #### Installation:
- ```
- sudo apt-get install build-essential poppler-utils libpoppler-cpp-dev pkg-config python-dev
- pip install git+https://github.com/sidphbot/Auto-Research.git
- ```
-
- #### Run Survey (cli):
- ```
- python survey.py [options] <your_research_query>
- ```
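-
- For example, combining options documented under `--help` below (the option values here are illustrative):
- ```
- python survey.py --max_search 50 --num_papers 10 'quantum entanglement'
- ```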
-
- #### Run Survey (Streamlit web-interface - new):
- ```
- streamlit run app.py
- ```
-
- #### Run Survey (Python API):
- ```
- from survey import Surveyor
- mysurveyor = Surveyor()
- mysurveyor.survey('quantum entanglement')
- ```
-
- ### Research tools:
-
- These are independent tools for your research or document text handling needs. A combined usage sketch follows this list.
-
- `[Tip]` : models can be changed in the defaults or passed in during init along with `refresh_models=True`
-
- - `abstractive_summary` - takes a long text document (`string`) and returns a 1-paragraph abstract or “abstractive” summary (`string`)
-
- Input:
-
- `longtext` : string
-
- Returns:
-
- `summary` : string
-
- - `extractive_summary` - takes a long text document (`string`) and returns a 1-paragraph “extractive” summary of extracted highlights (`string`)
-
- Input:
-
- `longtext` : string
-
- Returns:
-
- `summary` : string
-
- - `generate_title` - takes a long text document (`string`) and returns a generated title (`string`)
-
- Input:
-
- `longtext` : string
-
- Returns:
-
- `title` : string
-
- - `extractive_highlights` - takes a long text document (`string`) and returns a list of extracted highlights (`[string]`), a list of keywords (`[string]`) and a list of key phrases (`[string]`)
-
- Input:
-
- `longtext` : string
-
- Returns:
-
- `highlights` : [string]
- `keywords` : [string]
- `keyphrases` : [string]
-
- - `extract_images_from_file` - takes a pdf file name (`string`) and returns a list of image filenames (`[string]`).
-
- Input:
-
- `pdf_file` : string
-
- Returns:
-
- `images_files` : [string]
-
- - `extract_tables_from_file` - takes a pdf file name (`string`) and returns a list of csv filenames (`[string]`).
-
- Input:
-
- `pdf_file` : string
-
- Returns:
-
- `table_files` : [string]
-
- - `cluster_lines` - takes a list of lines (`[string]`) and returns the topic-clustered sections (`dict(generated_title: [cluster_abstract])`) and clustered lines (`dict(cluster_id: [cluster_lines])`)
-
- Input:
-
- `lines` : [string]
-
- Returns:
-
- `sections` : dict(generated_title: [cluster_abstract])
- `clusters` : dict(cluster_id: [cluster_lines])
-
- - `extract_headings` - *[for scientific texts - assumes an ‘abstract’ heading is present]* takes a text file name (`string`) and returns a list of headings (`[string]`) and refined lines (`[string]`).
-
- `[Tip 1]` : Use `extract_sections` as a wrapper (e.g. `extract_sections(extract_headings("/path/to/textfile"))`) to get heading-wise sectioned text with refined lines instead (`dict(heading: text)`)
-
- `[Tip 2]` : write the word ‘abstract’ at the start of the file text to get an extraction for non-scientific texts as well!
-
- Input:
-
- `text_file` : string
-
- Returns:
-
- `refined` : [string]
- `headings` : [string]
- `sectioned_doc` : dict(heading: text) (Optional - wrapper case)
-
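- A minimal combined sketch of the tools above. It assumes (as the Python API section suggests) that the tools are exposed as methods on a `Surveyor` instance, and that `extractive_highlights` returns its three lists as a tuple - both are assumptions, so check your installed version:
- ```
- from survey import Surveyor
-
- s = Surveyor()
- longtext = open('my_paper.txt').read()  # 'my_paper.txt' is a hypothetical input file
-
- summary = s.abstractive_summary(longtext)    # 1-paragraph abstract (string)
- title = s.generate_title(longtext)           # generated title (string)
- highlights, keywords, keyphrases = s.extractive_highlights(longtext)  # assumed tuple return
-
- # heading-wise sections via the wrapper from [Tip 1]:
- sections = s.extract_sections(s.extract_headings('my_paper.txt'))
- ```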
-
-
- ## Access/Modify defaults:
-
- - inside code
- ```
- from survey.Surveyor import DEFAULTS
- from pprint import pprint
-
- pprint(DEFAULTS)
- ```
- or,
-
- - Modify static config file - `defaults.py`
-
- or,
-
- - At runtime (utility)
-
- ```
- python survey.py --help
- ```
- ```
- usage: survey.py [-h] [--max_search max_metadata_papers]
-                  [--num_papers max_num_papers] [--pdf_dir pdf_dir]
-                  [--txt_dir txt_dir] [--img_dir img_dir] [--tab_dir tab_dir]
-                  [--dump_dir dump_dir] [--models_dir save_models_dir]
-                  [--title_model_name title_model_name]
-                  [--ex_summ_model_name extractive_summ_model_name]
-                  [--ledmodel_name ledmodel_name]
-                  [--embedder_name sentence_embedder_name]
-                  [--nlp_name spacy_model_name]
-                  [--similarity_nlp_name similarity_nlp_name]
-                  [--kw_model_name kw_model_name]
-                  [--refresh_models refresh_models] [--high_gpu high_gpu]
-                  query_string
-
- Generate a survey just from a query!!
-
- positional arguments:
-   query_string          your research query/keywords
-
- optional arguments:
-   -h, --help            show this help message and exit
-   --max_search max_metadata_papers
-                         maximum number of papers to gaze at - defaults to 100
-   --num_papers max_num_papers
-                         maximum number of papers to download and analyse -
-                         defaults to 25
-   --pdf_dir pdf_dir     pdf paper storage directory - defaults to
-                         arxiv_data/tarpdfs/
-   --txt_dir txt_dir     text-converted paper storage directory - defaults to
-                         arxiv_data/fulltext/
-   --img_dir img_dir     image storage directory - defaults to
-                         arxiv_data/images/
-   --tab_dir tab_dir     tables storage directory - defaults to
-                         arxiv_data/tables/
-   --dump_dir dump_dir   all_output_dir - defaults to arxiv_dumps/
-   --models_dir save_models_dir
-                         directory to save models (> 5GB) - defaults to
-                         saved_models/
-   --title_model_name title_model_name
-                         title model name/tag in hugging-face, defaults to
-                         'Callidior/bert2bert-base-arxiv-titlegen'
-   --ex_summ_model_name extractive_summ_model_name
-                         extractive summary model name/tag in hugging-face,
-                         defaults to 'allenai/scibert_scivocab_uncased'
-   --ledmodel_name ledmodel_name
-                         led model (for abstractive summary) name/tag in
-                         hugging-face, defaults to 'allenai/led-large-16384-arxiv'
-   --embedder_name sentence_embedder_name
-                         sentence embedder name/tag in hugging-face, defaults
-                         to 'paraphrase-MiniLM-L6-v2'
-   --nlp_name spacy_model_name
-                         spacy model name/tag in hugging-face (if changed -
-                         needs to be spacy-installed prior), defaults to
-                         'en_core_sci_scibert'
-   --similarity_nlp_name similarity_nlp_name
-                         spacy downstream model (for similarity) name/tag in
-                         hugging-face (if changed - needs to be spacy-installed
-                         prior), defaults to 'en_core_sci_lg'
-   --kw_model_name kw_model_name
-                         keyword extraction model name/tag in hugging-face,
-                         defaults to 'distilbert-base-nli-mean-tokens'
-   --refresh_models refresh_models
-                         Refresh model downloads with given names (needs at
-                         least one model name param above), defaults to False
-   --high_gpu high_gpu   High GPU usage permitted, defaults to False
-
- ```
-
- - At runtime (code)
-
- > during surveyor object initialization with `surveyor_obj = Surveyor()` (see the sketch after this list)
- - `pdf_dir`: String, pdf paper storage directory - defaults to `arxiv_data/tarpdfs/`
- - `txt_dir`: String, text-converted paper storage directory - defaults to `arxiv_data/fulltext/`
- - `img_dir`: String, image storage directory - defaults to `arxiv_data/images/`
- - `tab_dir`: String, tables storage directory - defaults to `arxiv_data/tables/`
- - `dump_dir`: String, all_output_dir - defaults to `arxiv_dumps/`
- - `models_dir`: String, directory to save the huge models, defaults to `saved_models/`
- - `title_model_name`: String, title model name/tag in hugging-face, defaults to `Callidior/bert2bert-base-arxiv-titlegen`
- - `ex_summ_model_name`: String, extractive summary model name/tag in hugging-face, defaults to `allenai/scibert_scivocab_uncased`
- - `ledmodel_name`: String, led model (for abstractive summary) name/tag in hugging-face, defaults to `allenai/led-large-16384-arxiv`
- - `embedder_name`: String, sentence embedder name/tag in hugging-face, defaults to `paraphrase-MiniLM-L6-v2`
- - `nlp_name`: String, spacy model name/tag in hugging-face (if changed - needs to be spacy-installed prior), defaults to `en_core_sci_scibert`
- - `similarity_nlp_name`: String, spacy downstream trained model (for similarity) name/tag in hugging-face (if changed - needs to be spacy-installed prior), defaults to `en_core_sci_lg`
- - `kw_model_name`: String, keyword extraction model name/tag in hugging-face, defaults to `distilbert-base-nli-mean-tokens`
- - `high_gpu`: Bool, High GPU usage permitted, defaults to `False`
- - `refresh_models`: Bool, refresh model downloads with given names (needs at least one model name param above), defaults to `False`
-
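- A minimal initialization sketch using the parameters listed above (the paths here are illustrative, not the defaults):
- ```
- from survey import Surveyor
-
- surveyor_obj = Surveyor(
-     pdf_dir='my_run/pdfs/',    # illustrative custom path
-     dump_dir='my_run/dumps/',  # illustrative custom path
-     high_gpu=True,             # permit high GPU usage
- )
- ```
-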
- > during survey generation with `surveyor_obj.survey(query="my_research_query")` (example below)
- - `max_search`: int, maximum number of papers to gaze at - defaults to `100`
- - `num_papers`: int, maximum number of papers to download and analyse - defaults to `25`
-
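- For example, assuming both knobs are accepted as keyword arguments by `survey()`, per the list above (values are illustrative):
- ```
- surveyor_obj.survey(query="my_research_query", max_search=50, num_papers=10)
- ```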
-
-
- #### Artifacts generated (zipped):
- - Detailed survey draft paper as a txt file
- - A curated list of top 25+ papers as pdfs and txts
- - Images extracted from the above papers as jpegs, bmps, etc.
- - Heading/section-wise highlights extracted from the above papers as a re-usable pure python joblib dump
- - Tables extracted from the papers (optional)
- - Corpus of metadata highlights/text of the top 100 papers as a re-usable pure python joblib dump (see the loading sketch below)
-
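- The joblib dumps are plain Python objects and can be reloaded directly; the file name below is illustrative - check the contents of your `dump_dir`:
- ```
- import joblib
-
- corpus = joblib.load('arxiv_dumps/corpus.dump')  # illustrative file name
- ```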
-
- Please cite this repo if it helped you :)
-
 