---
title: Surveyor
emoji: 📊
colorFrom: gray
colorTo: pink
sdk: streamlit
sdk_version: 1.2.0
app_file: app.py
pinned: false
---
# Auto-Research
![Auto-Research][logo]
[logo]: https://raw.githubusercontent.com/sidphbot/Auto-Research/main/logo.png
A no-code utility to generate a detailed, well-cited survey with topic-clustered sections (in draft paper format) and other interesting artifacts from a single research query.

Data Provider: [arXiv](https://arxiv.org/) via the Open Archives Initiative (OAI)
Requirements:
- Python 3.7 or above
- poppler-utils - `sudo apt-get install build-essential poppler-utils libpoppler-cpp-dev pkg-config python-dev`
- the packages listed in requirements.txt - `pip install -r requirements.txt`
- 8 GB of disk space
- 13 GB of CUDA (GPU) memory - for a survey of 100 searched papers (`max_search`) and 25 selected papers (`num_papers`)
#### Demo:
Video demo: https://drive.google.com/file/d/1-77J2L10lsW-bFDOGdTaPzSr_utY743g/view?usp=sharing
Kaggle re-usable demo: https://www.kaggle.com/sidharthpal/auto-research-generate-survey-from-query
(`[Tip]`: click 'edit and run' to run the demo for your custom queries on a free GPU)
#### Installation:
```
sudo apt-get install build-essential poppler-utils libpoppler-cpp-dev pkg-config python-dev
pip install git+https://github.com/sidphbot/Auto-Research.git
```
#### Run Survey (CLI):
```
python survey.py [options] <your_research_query>
```
#### Run Survey (Streamlit web interface - new):
```
streamlit run app.py
```
#### Run Survey (Python API):
```
from survey import Surveyor
mysurveyor = Surveyor()
mysurveyor.survey('quantum entanglement')
```
### Research tools:
These are independent tools for your research or document text handling needs; a combined usage sketch follows the list below.

*[Tip]*: models can be changed in the defaults or passed in during init along with `refresh_models=True`.
- `abstractive_summary` - takes a long text document (`string`) and returns a one-paragraph abstract or "abstractive" summary (`string`)
  - Input: `longtext` : string
  - Returns: `summary` : string
- `extractive_summary` - takes a long text document (`string`) and returns a one-paragraph summary of extracted highlights, i.e. an "extractive" summary (`string`)
  - Input: `longtext` : string
  - Returns: `summary` : string
- `generate_title` - takes a long text document (`string`) and returns a generated title (`string`)
  - Input: `longtext` : string
  - Returns: `title` : string
- `extractive_highlights` - takes a long text document (`string`) and returns a list of extracted highlights (`[string]`), a list of keywords (`[string]`) and key phrases (`[string]`)
  - Input: `longtext` : string
  - Returns: `highlights` : [string], `keywords` : [string], `keyphrases` : [string]
- `extract_images_from_file` - takes a pdf file name (`string`) and returns a list of image filenames (`[string]`)
  - Input: `pdf_file` : string
  - Returns: `image_files` : [string]
- `extract_tables_from_file` - takes a pdf file name (`string`) and returns a list of csv filenames (`[string]`)
  - Input: `pdf_file` : string
  - Returns: `table_files` : [string]
- `cluster_lines` - takes a list of lines (`[string]`) and returns topic-clustered sections (`dict(generated_title: [cluster_abstract])`) and clustered lines (`dict(cluster_id: [cluster_lines])`)
  - Input: `lines` : [string]
  - Returns: `sections` : dict(generated_title: [cluster_abstract]), `clusters` : dict(cluster_id: [cluster_lines])
- `extract_headings` - *[for scientific texts - assumes an 'abstract' heading is present]* takes a text file name (`string`) and returns a list of headings (`[string]`) and refined lines (`[string]`)
  - `[Tip 1]`: use `extract_sections` as a wrapper (e.g. `extract_sections(extract_headings("/path/to/textfile"))`) to get heading-wise sectioned text with refined lines instead (`dict(heading: text)`)
  - `[Tip 2]`: write the word 'abstract' at the start of the file text to get an extraction for non-scientific texts as well!
  - Input: `text_file` : string
  - Returns: `refined` : [string], `headings` : [string], `sectioned_doc` : dict(heading: text) (optional - wrapper case)
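A minimal combined sketch of the tools above. It assumes the tools are exposed as methods on a `Surveyor` instance and that multi-value results come back as tuples in the listed order; the input file names are placeholders:

```
from survey import Surveyor

surveyor = Surveyor()  # loads the underlying models on init

# hypothetical input document - substitute your own
with open("paper.txt") as f:
    longtext = f.read()

# one-paragraph summaries and a generated title
abstract = surveyor.abstractive_summary(longtext)
summary = surveyor.extractive_summary(longtext)
title = surveyor.generate_title(longtext)

# extracted highlights, keywords and key phrases (assumed tuple return)
highlights, keywords, keyphrases = surveyor.extractive_highlights(longtext)

# figures and tables pulled out of a pdf (hypothetical file name)
image_files = surveyor.extract_images_from_file("paper.pdf")
table_files = surveyor.extract_tables_from_file("paper.pdf")

# topic-clustered sections and clustered lines (assumed tuple return)
sections, clusters = surveyor.cluster_lines(longtext.splitlines())

# heading-wise sectioned text via the Tip 1 wrapper
sectioned_doc = surveyor.extract_sections(surveyor.extract_headings("paper.txt"))
```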
## Access/Modify defaults:
- inside code
```
from survey.Surveyor import DEFAULTS
from pprint import pprint
pprint(DEFAULTS)
```
or,
- Modify static config file - `defaults.py`
or,
- At runtime (utility)
```
python survey.py --help
```
```
usage: survey.py [-h] [--max_search max_metadata_papers]
[--num_papers max_num_papers] [--pdf_dir pdf_dir]
[--txt_dir txt_dir] [--img_dir img_dir] [--tab_dir tab_dir]
[--dump_dir dump_dir] [--models_dir save_models_dir]
[--title_model_name title_model_name]
[--ex_summ_model_name extractive_summ_model_name]
[--ledmodel_name ledmodel_name]
[--embedder_name sentence_embedder_name]
[--nlp_name spacy_model_name]
[--similarity_nlp_name similarity_nlp_name]
[--kw_model_name kw_model_name]
[--refresh_models refresh_models] [--high_gpu high_gpu]
query_string
Generate a survey just from a query !!
positional arguments:
query_string your research query/keywords
optional arguments:
-h, --help show this help message and exit
--max_search max_metadata_papers
maximium number of papers to gaze at - defaults to 100
--num_papers max_num_papers
maximium number of papers to download and analyse -
defaults to 25
--pdf_dir pdf_dir pdf paper storage directory - defaults to
arxiv_data/tarpdfs/
--txt_dir txt_dir text-converted paper storage directory - defaults to
arxiv_data/fulltext/
--img_dir img_dir image storage directory - defaults to
arxiv_data/images/
--tab_dir tab_dir tables storage directory - defaults to
arxiv_data/tables/
--dump_dir dump_dir all_output_dir - defaults to arxiv_dumps/
--models_dir save_models_dir
directory to save models (> 5GB) - defaults to
saved_models/
--title_model_name title_model_name
title model name/tag in hugging-face, defaults to
'Callidior/bert2bert-base-arxiv-titlegen'
--ex_summ_model_name extractive_summ_model_name
extractive summary model name/tag in hugging-face,
defaults to 'allenai/scibert_scivocab_uncased'
--ledmodel_name ledmodel_name
led model(for abstractive summary) name/tag in
hugging-face, defaults to 'allenai/led-
large-16384-arxiv'
--embedder_name sentence_embedder_name
sentence embedder name/tag in hugging-face, defaults
to 'paraphrase-MiniLM-L6-v2'
--nlp_name spacy_model_name
spacy model name/tag in hugging-face (if changed -
needs to be spacy-installed prior), defaults to
'en_core_sci_scibert'
--similarity_nlp_name similarity_nlp_name
spacy downstream model(for similarity) name/tag in
hugging-face (if changed - needs to be spacy-installed
prior), defaults to 'en_core_sci_lg'
--kw_model_name kw_model_name
keyword extraction model name/tag in hugging-face,
defaults to 'distilbert-base-nli-mean-tokens'
--refresh_models refresh_models
Refresh model downloads with given names (needs
atleast one model name param above), defaults to False
--high_gpu high_gpu High GPU usage permitted, defaults to False
```
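For example, a lighter run on a hypothetical query, overriding only the paper limits documented above:

```
python survey.py --max_search 50 --num_papers 10 'federated learning'
```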
- At runtime (code)
> during surveyor object initialization with `surveyor_obj = Surveyor()`
- `pdf_dir`: String, pdf paper storage directory - defaults to `arxiv_data/tarpdfs/`
- `txt_dir`: String, text-converted paper storage directory - defaults to `arxiv_data/fulltext/`
- `img_dir`: String, image storage directory - defaults to `arxiv_data/images/`
- `tab_dir`: String, tables storage directory - defaults to `arxiv_data/tables/`
- `dump_dir`: String, all_output_dir - defaults to `arxiv_dumps/`
- `models_dir`: String, directory to save the huge models (> 5 GB), defaults to `saved_models/`
- `title_model_name`: String, title model name/tag in hugging-face, defaults to `Callidior/bert2bert-base-arxiv-titlegen`
- `ex_summ_model_name`: String, extractive summary model name/tag in hugging-face, defaults to `allenai/scibert_scivocab_uncased`
- `ledmodel_name`: String, led model(for abstractive summary) name/tag in hugging-face, defaults to `allenai/led-large-16384-arxiv`
- `embedder_name`: String, sentence embedder name/tag in hugging-face, defaults to `paraphrase-MiniLM-L6-v2`
- `nlp_name`: String, spacy model name/tag in hugging-face (if changed - needs to be spacy-installed prior), defaults to `en_core_sci_scibert`
- `similarity_nlp_name`: String, spacy downstream trained model(for similarity) name/tag in hugging-face (if changed - needs to be spacy-installed prior), defaults to `en_core_sci_lg`
- `kw_model_name`: String, keyword extraction model name/tag in hugging-face, defaults to `distilbert-base-nli-mean-tokens`
- `high_gpu`: Bool, High GPU usage permitted, defaults to `False`
- `refresh_models`: Bool, refresh model downloads with given names (needs at least one model name param above), defaults to `False`
> during survey generation with `surveyor_obj.survey(query="my_research_query")`
- `max_search`: int, maximum number of papers to gaze at - defaults to `100`
- `num_papers`: int, maximum number of papers to download and analyse - defaults to `25`
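For instance, combining both (parameter names as documented above; the non-default values are purely illustrative):

```
from survey import Surveyor

# init-time configuration - directories and model tags can be overridden here
surveyor_obj = Surveyor(
    models_dir='saved_models/',
    high_gpu=True,
)

# generation-time limits for a smaller, faster survey
surveyor_obj.survey(query='my_research_query', max_search=50, num_papers=10)
```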
#### Artifacts generated (zipped):
- Detailed survey draft paper as a txt file
- A curated list of top 25+ papers as pdfs and txts
- Images extracted from the above papers as jpegs, bmps, etc.
- Heading/section-wise highlights extracted from the above papers as a re-usable pure-python joblib dump
- Tables extracted from the papers (optional)
- Corpus of metadata highlights/text of the top 100 papers as a re-usable pure-python joblib dump
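The joblib dumps can be reloaded directly in Python; a minimal sketch, assuming one dump file from the output zip (the path below is a placeholder - check the generated archive for the actual file name):

```
import joblib

# load a dumped corpus back into plain python objects
corpus = joblib.load('arxiv_dumps/highlights_corpus.dump')
print(type(corpus))
```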
Please cite this repo if it helped you :)