---
license: apache-2.0
---
Use ColBERT as a search engine for the [ACL Anthology](https://aclanthology.org/). The pipeline parses any BibTeX file and stores the entries in a MySQL service.
# Setup
## Setup ColBERT
```sh
git clone https://huggingface.co/davidheineman/colbert-acl
# install dependencies
# torch==1.13.1 required (conda install -y -n [env] python=3.10)
pip install -r requirements.txt
brew install mysql
```
### (Optional) Parse & Index the Anthology
Feel free to skip this step: the parsed and indexed anthology is already included in this repo.
```sh
# get up-to-date abstracts in bibtex
curl -O https://aclanthology.org/anthology+abstracts.bib.gz
gunzip anthology+abstracts.bib.gz
mv anthology+abstracts.bib anthology.bib
# parse .bib -> .json
python parse.py
# index with ColBERT
# (note: indexing can fail silently if the C++ extensions are missing)
python index.py
```
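As a rough illustration of the `.bib -> .json` step, the sketch below reads a BibTeX file with `bibtexparser` (v1 API) and writes a reduced JSON file. It is not the contents of `parse.py`: the kept fields, the rule of skipping entries without an abstract, the output name `anthology.json`, and both function names are assumptions made for this example.

```python
import json

# Fields kept per paper. Assumed for illustration; parse.py may keep others.
FIELDS = ("title", "abstract", "author", "year", "url")


def entries_to_records(entries):
    """Reduce parsed BibTeX entries (a list of dicts) to FIELDS.

    Entries without an abstract are skipped (an assumption of this sketch).
    """
    records = []
    for entry in entries:
        if not entry.get("abstract"):
            continue
        records.append({field: entry.get(field, "") for field in FIELDS})
    return records


def bib_to_json(bib_path="anthology.bib", json_path="anthology.json"):
    """Parse bib_path and write the reduced records to json_path (hypothetical name)."""
    import bibtexparser  # third-party dependency, imported lazily

    with open(bib_path, encoding="utf-8") as f:
        database = bibtexparser.load(f)  # slow on the full anthology

    records = entries_to_records(database.entries)
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    return len(records)
```

Calling `bib_to_json()` from the directory that holds `anthology.bib` would return the number of records written. Keeping the filtering in a separate pure function makes it easy to check on a handful of entries without loading the full file.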
### Search with ColBERT
```sh
# start flask server
python server.py
# or start a production API endpoint
gunicorn -w 4 -b 0.0.0.0:8893 server:app
```
Then, to test, visit:
```
http://localhost:8893/api/search?query=Information retrieval with BERT
```
or for an interface:
```
http://localhost:8893
```
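The search endpoint can also be called from a script. The sketch below uses only the Python standard library and percent-encodes the query, since a raw URL with spaces works in a browser but not in most HTTP clients. The host and port come from the `gunicorn` command above; the helper names are hypothetical, and decoding the response as JSON is an assumption about what `server.py` returns.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Host and port taken from the gunicorn command; adjust if the server runs elsewhere.
BASE_URL = "http://localhost:8893"


def build_search_url(query, base_url=BASE_URL):
    """Return the /api/search URL with the query string safely encoded."""
    return f"{base_url}/api/search?{urlencode({'query': query})}"


def search(query, base_url=BASE_URL, timeout=30.0):
    """Call the search endpoint and decode the body as JSON (assumed format)."""
    with urlopen(build_search_url(query, base_url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `search("Information retrieval with BERT")` would issue the same request as the test URL above.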
### Deploy as a Docker App
```sh
docker-compose build --no-cache
docker-compose up --build
```
## Example notebooks
To see an example of search, visit:
[colab.research.google.com/drive/1-b90_8YSAK17KQ6C7nqKRYbCWEXQ9FGs](https://colab.research.google.com/drive/1-b90_8YSAK17KQ6C7nqKRYbCWEXQ9FGs?usp=sharing)
<!-- ## Notes
- See:
- https://github.com/stanford-futuredata/ColBERT/blob/main/colbert/index_updater.py
- https://github.com/stanford-futuredata/ColBERT/issues/111
- TODO:
- On UI
- Colors: make the colors resemble the ACL page more closely
- Smaller line spacing for abstract text
- Add "PDF" button
- Justify the result metadata (Year, venue, etc.) so the content all starts at the same vertical position
- Add an "Expand" button at the end of the abstract
- Make the results scrollable, without scrolling the rest of the page
- Put two sliders on the year range (and make the years selectable, with the years at both ends of the bar)
- If the user selects certain venues, remember these venues
- Add a dropdown under the "Workshop" box to select specific workshops
- Move code to github and index to hf, then use this to download the index:
from huggingface_hub import snapshot_download
# Download indexed repo at: https://huggingface.co/davidheineman/colbert-acl
!mkdir "acl"
index_name = snapshot_download(repo_id="davidheineman/colbert-acl", local_dir="acl")
- Make indexing much easier
(currently, the setup involves manually copying the CPP files because there is a silent failure; this should also be possible on Google Colab, or even MPS)
- Make index save in parent folder
- Fix "sanity check" in index.py
- Profile bibtexparser.load(f) (why so slow)
- Ship as a containerized service
- Scrape:
- https://proceedings.neurips.cc/
- https://dblp.uni-trier.de/db/conf/iclr/index.html
- openreview
-->