pinned: false
license: apache-2.0
---

# Auto-Research

![Auto-Research][logo]

[logo]: https://github.com/sidphbot/Auto-Research/blob/main/logo.png

A no-code utility to generate a detailed, well-cited survey with topic-clustered sections (draft paper format) and other interesting artifacts from a single research query.

Data Provider: [arXiv](https://arxiv.org/) Open Archive Initiative (OAI)

Requirements:

- Python 3.7 or above
- poppler-utils - `sudo apt-get install build-essential libpoppler-cpp-dev pkg-config python-dev`
- list of requirements in requirements.txt - `cat requirements.txt | xargs pip install`
- 8GB disk space
- 13GB CUDA (GPU) memory - for a survey of 100 searched papers (`max_search`) and 25 selected papers (`num_papers`)

#### Demo:

Video Demo : https://drive.google.com/file/d/1-77J2L10lsW-bFDOGdTaPzSr_utY743g/view?usp=sharing

Kaggle Re-usable Demo : https://www.kaggle.com/sidharthpal/auto-research-generate-survey-from-query

(`[TIP]` click 'edit and run' to run the demo for your custom queries on a free GPU)

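The disk requirement above can be verified programmatically before a long run; here is a minimal sketch using only the standard library (the 8 GB threshold mirrors the requirement above; the GPU-memory check is omitted since it would need `torch`, and the helper name is our own, not part of this repo):

```python
import shutil

def has_disk_space(path=".", required_gb=8):
    """Return True if the filesystem holding `path` has at least `required_gb` GiB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024 ** 3

if __name__ == "__main__":
    print("enough space:", has_disk_space(".", 8))
```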
#### Installation:

```
sudo apt-get install build-essential poppler-utils libpoppler-cpp-dev pkg-config python-dev
pip install git+https://github.com/sidphbot/Auto-Research.git
```

#### Run Survey (cli):

```
python survey.py [options] <your_research_query>
```

#### Run Survey (Streamlit web-interface - new):

```
streamlit run app.py
```

#### Run Survey (Python API):

```
from survey import Surveyor

mysurveyor = Surveyor()
mysurveyor.survey('quantum entanglement')
```

### Research tools:

These are independent tools for your research or document text handling needs.

```
*[Tip]* : (models can be changed in defaults or passed on during init along with `refresh-models=True`)
```

- `abstractive_summary` - takes a long text document (`string`) and returns a 1-paragraph abstract or “abstractive” summary (`string`)

  Input:
  `longtext` : string

  Returns:
  `summary` : string

- `extractive_summary` - takes a long text document (`string`) and returns a 1-paragraph “extractive” summary of extracted highlights (`string`)

  Input:
  `longtext` : string

  Returns:
  `summary` : string

- `generate_title` - takes a long text document (`string`) and returns a generated title (`string`)

  Input:
  `longtext` : string

  Returns:
  `title` : string

- `extractive_highlights` - takes a long text document (`string`) and returns a list of extracted highlights (`[string]`), a list of keywords (`[string]`) and key phrases (`[string]`)

  Input:
  `longtext` : string

  Returns:
  `highlights` : [string]
  `keywords` : [string]
  `keyphrases` : [string]

- `extract_images_from_file` - takes a pdf file name (`string`) and returns a list of image filenames (`[string]`).

  Input:
  `pdf_file` : string

  Returns:
  `images_files` : [string]

- `extract_tables_from_file` - takes a pdf file name (`string`) and returns a list of csv filenames (`[string]`).

  Input:
  `pdf_file` : string

  Returns:
  `csv_files` : [string]

- `cluster_lines` - takes a list of lines (`[string]`) and returns the topic-clustered sections (`dict(generated_title: [cluster_abstract])`) and clustered lines (`dict(cluster_id: [cluster_lines])`)

  Input:
  `lines` : [string]

  Returns:
  `sections` : dict(generated_title: [cluster_abstract])
  `clusters` : dict(cluster_id: [cluster_lines])

- `extract_headings` - *[for scientific texts - assumes an ‘abstract’ heading is present]* takes a text file name (`string`) and returns a list of headings (`[string]`) and refined lines (`[string]`).

  `[Tip 1]` : Use `extract_sections` as a wrapper (e.g. `extract_sections(extract_headings("/path/to/textfile"))`) to get heading-wise sectioned text with refined lines instead (`dict(heading: text)`)

  `[Tip 2]` : write the word ‘abstract’ at the start of the file text to get an extraction for non-scientific texts as well!

  Input:
  `text_file` : string

  Returns:
  `refined` : [string]
  `headings` : [string]
  `sectioned_doc` : dict(heading: text) (Optional - wrapper case)

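To make the return shapes documented above concrete, here is a self-contained sketch that consumes `cluster_lines`-style output; the sample dicts are hypothetical stand-ins, not output from the actual models:

```python
# Hypothetical output in the documented shapes:
#   sections : dict(generated_title: [cluster_abstract])
#   clusters : dict(cluster_id: [cluster_lines])
sections = {
    "Entanglement Measures": ["Abstract for the cluster on measures..."],
    "Bell Inequalities": ["Abstract for the cluster on Bell tests..."],
}
clusters = {
    0: ["line a", "line b"],
    1: ["line c"],
}

def render_draft(sections):
    """Flatten topic-clustered sections into a draft-paper-style text."""
    parts = []
    for title, abstracts in sections.items():
        parts.append(f"## {title}\n" + "\n".join(abstracts))
    return "\n\n".join(parts)

draft = render_draft(sections)
total_lines = sum(len(v) for v in clusters.values())
```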

## Access/Modify defaults:

- inside code

```
from survey.Surveyor import DEFAULTS
from pprint import pprint

pprint(DEFAULTS)
```

or,

- Modify static config file - `defaults.py`

or,

- At runtime (utility)

```
python survey.py --help
```

```
usage: survey.py [-h] [--max_search max_metadata_papers]
                 [--num_papers max_num_papers] [--pdf_dir pdf_dir]
                 [--txt_dir txt_dir] [--img_dir img_dir] [--tab_dir tab_dir]
                 [--dump_dir dump_dir] [--models_dir save_models_dir]
                 [--title_model_name title_model_name]
                 [--ex_summ_model_name extractive_summ_model_name]
                 [--ledmodel_name ledmodel_name]
                 [--embedder_name sentence_embedder_name]
                 [--nlp_name spacy_model_name]
                 [--similarity_nlp_name similarity_nlp_name]
                 [--kw_model_name kw_model_name]
                 [--refresh_models refresh_models] [--high_gpu high_gpu]
                 query_string

Generate a survey just from a query !!

positional arguments:
  query_string          your research query/keywords

optional arguments:
  -h, --help            show this help message and exit
  --max_search max_metadata_papers
                        maximum number of papers to gaze at - defaults to 100
  --num_papers max_num_papers
                        maximum number of papers to download and analyse -
                        defaults to 25
  --pdf_dir pdf_dir     pdf paper storage directory - defaults to
                        arxiv_data/tarpdfs/
  --txt_dir txt_dir     text-converted paper storage directory - defaults to
                        arxiv_data/fulltext/
  --img_dir img_dir     image storage directory - defaults to
                        arxiv_data/images/
  --tab_dir tab_dir     tables storage directory - defaults to
                        arxiv_data/tables/
  --dump_dir dump_dir   all_output_dir - defaults to arxiv_dumps/
  --models_dir save_models_dir
                        directory to save models (> 5GB) - defaults to
                        saved_models/
  --title_model_name title_model_name
                        title model name/tag in hugging-face, defaults to
                        'Callidior/bert2bert-base-arxiv-titlegen'
  --ex_summ_model_name extractive_summ_model_name
                        extractive summary model name/tag in hugging-face,
                        defaults to 'allenai/scibert_scivocab_uncased'
  --ledmodel_name ledmodel_name
                        led model (for abstractive summary) name/tag in
                        hugging-face, defaults to
                        'allenai/led-large-16384-arxiv'
  --embedder_name sentence_embedder_name
                        sentence embedder name/tag in hugging-face, defaults
                        to 'paraphrase-MiniLM-L6-v2'
  --nlp_name spacy_model_name
                        spacy model name/tag in hugging-face (if changed -
                        needs to be spacy-installed prior), defaults to
                        'en_core_sci_scibert'
  --similarity_nlp_name similarity_nlp_name
                        spacy downstream model (for similarity) name/tag in
                        hugging-face (if changed - needs to be spacy-installed
                        prior), defaults to 'en_core_sci_lg'
  --kw_model_name kw_model_name
                        keyword extraction model name/tag in hugging-face,
                        defaults to 'distilbert-base-nli-mean-tokens'
  --refresh_models refresh_models
                        refresh model downloads with given names (needs at
                        least one model name param above), defaults to False
  --high_gpu high_gpu   High GPU usage permitted, defaults to False
```

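The help text above maps onto a standard argparse parser; the following is a minimal sketch covering a few of the documented flags, not the actual `survey.py` source:

```python
import argparse

def build_parser():
    """Mirror a subset of the documented survey.py CLI flags."""
    parser = argparse.ArgumentParser(
        description="Generate a survey just from a query !!")
    parser.add_argument("query_string", help="your research query/keywords")
    parser.add_argument("--max_search", type=int, default=100,
                        help="maximum number of papers to gaze at")
    parser.add_argument("--num_papers", type=int, default=25,
                        help="maximum number of papers to download and analyse")
    parser.add_argument("--dump_dir", default="arxiv_dumps/",
                        help="all_output_dir")
    return parser

# Parse a sample command line (list form, so no shell is needed)
args = build_parser().parse_args(["--num_papers", "10", "quantum entanglement"])
```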

- At runtime (code)

> during surveyor object initialization with `surveyor_obj = Surveyor()`

- `pdf_dir`: String, pdf paper storage directory - defaults to `arxiv_data/tarpdfs/`
- `txt_dir`: String, text-converted paper storage directory - defaults to `arxiv_data/fulltext/`
- `img_dir`: String, image storage directory - defaults to `arxiv_data/images/`
- `tab_dir`: String, tables storage directory - defaults to `arxiv_data/tables/`
- `dump_dir`: String, all_output_dir - defaults to `arxiv_dumps/`
- `models_dir`: String, directory to save huge models - defaults to `saved_models/`
- `title_model_name`: String, title model name/tag in hugging-face, defaults to `Callidior/bert2bert-base-arxiv-titlegen`
- `ex_summ_model_name`: String, extractive summary model name/tag in hugging-face, defaults to `allenai/scibert_scivocab_uncased`
- `ledmodel_name`: String, led model (for abstractive summary) name/tag in hugging-face, defaults to `allenai/led-large-16384-arxiv`
- `embedder_name`: String, sentence embedder name/tag in hugging-face, defaults to `paraphrase-MiniLM-L6-v2`
- `nlp_name`: String, spacy model name/tag in hugging-face (if changed - needs to be spacy-installed prior), defaults to `en_core_sci_scibert`
- `similarity_nlp_name`: String, spacy downstream trained model (for similarity) name/tag in hugging-face (if changed - needs to be spacy-installed prior), defaults to `en_core_sci_lg`
- `kw_model_name`: String, keyword extraction model name/tag in hugging-face, defaults to `distilbert-base-nli-mean-tokens`
- `high_gpu`: Bool, High GPU usage permitted, defaults to `False`
- `refresh_models`: Bool, refresh model downloads with given names (needs at least one model name param above), defaults to `False`

> during survey generation with `surveyor_obj.survey(query="my_research_query")`

- `max_search`: int, maximum number of papers to gaze at - defaults to `100`
- `num_papers`: int, maximum number of papers to download and analyse - defaults to `25`
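The init-time overrides above follow an ordinary defaults-plus-kwargs pattern; here is an illustrative, self-contained sketch of that pattern (the dict below is a hypothetical subset, not the library's real `DEFAULTS`, and `resolve_config` is our own helper):

```python
# Hypothetical subset of the documented defaults
DEFAULTS = {
    "pdf_dir": "arxiv_data/tarpdfs/",
    "max_search": 100,
    "num_papers": 25,
    "high_gpu": False,
}

def resolve_config(**overrides):
    """Merge caller overrides onto the defaults, rejecting unknown keys."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    return {**DEFAULTS, **overrides}

cfg = resolve_config(num_papers=30, high_gpu=True)
```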

#### Artifacts generated (zipped):

- Detailed survey draft paper as a txt file
- A curated list of top 25+ papers as pdfs and txts
- Images extracted from the above papers as jpegs, bmps, etc.
- Heading/section-wise highlights extracted from the above papers as a re-usable pure-python joblib dump
- Tables extracted from the papers (optional)
- Corpus of metadata highlights/text of the top 100 papers as a re-usable pure-python joblib dump

Please cite this repo if it helped you :)
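Since the artifacts arrive zipped, the standard library is enough to inspect them; a self-contained sketch with hypothetical member names (the real dump's filenames may differ):

```python
import io
import zipfile

# Build a stand-in artifacts zip in memory; member names are hypothetical,
# the real dump's filenames may differ.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("survey_draft.txt", "Detailed survey draft paper...")
    zf.writestr("images/fig1.jpeg", "")
    zf.writestr("tables/tab1.csv", "a,b\n1,2")

# Inspect and read members, as one might with the real artifacts zip
with zipfile.ZipFile(buf) as zf:
    names = zf.namelist()
    draft = zf.read("survey_draft.txt").decode()
```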