license: apache-2.0
This repo uses ColBERT as an information retrieval interface for the ACL Anthology. It uses a MySQL backend for storing paper data and a simple Flask front-end.
We have two methods for retrieving passage candidates: (i) using ColBERT directly, which may not scale well to extremely large datastores, and (ii) using OpenAI embeddings to select the top-k passages, on which ColBERT then performs the expensive re-ranking. For the OpenAI method, you must have an OpenAI API key and a MongoDB connection string for storing the vector entries.
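For intuition, here is a minimal sketch of the cheap-recall stage of method (ii). The database, collection, and index names, the embedding model, and the connection string are all illustrative assumptions, not this repo's actual code:

```python
# Sketch of method (ii): cheap top-k recall with OpenAI embeddings, then
# expensive ColBERT re-ranking. All names here are illustrative guesses.
from openai import OpenAI
from pymongo import MongoClient

oai = OpenAI()  # reads OPENAI_API_KEY from the environment
passages = MongoClient("<MONGO_CONNECTION_STRING>")["acl"]["passages"]

def recall_candidates(query: str, k: int = 25) -> list[dict]:
    vector = oai.embeddings.create(
        model="text-embedding-3-small", input=query
    ).data[0].embedding
    # top-k by vector similarity, via a MongoDB Atlas vector index
    # (assumes such an index exists on the `embedding` field)
    return list(passages.aggregate([{
        "$vectorSearch": {
            "index": "vector_index",
            "path": "embedding",
            "queryVector": vector,
            "numCandidates": 20 * k,
            "limit": k,
        }
    }]))

# the k candidates returned here are then handed to ColBERT for re-ranking
```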
Setup
Setup ColBERT
First, clone this repo, create a conda environment, and install the dependencies:
git clone https://huggingface.co/davidheineman/colbert-acl
# torch==1.13.1 required (conda install -y -n [env] python=3.10)
pip install bibtexparser "colbert-ir[torch,faiss-gpu]"
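As an optional sanity check that the environment matches the pinned versions:

```python
# optional sanity check: both packages import, and torch matches the pin above
import torch
import colbert

print(torch.__version__)  # expect 1.13.1
```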
Setup MySQL server
Install pip dependencies
pip install mysql-connector-python flask openai "pymongo[srv]"
Set up a local MySQL server:
brew install mysql
Run the database setup to copy the ACL entries:
python init_db.py
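For reference, the setup script presumably does something like the following; the database name, table schema, and JSON path are illustrative guesses, not necessarily what init_db.py actually does:

```python
# Hypothetical sketch of the database setup: create a database and copy
# the parsed paper entries into it. Names and schema are assumptions.
import json
import mysql.connector

db = mysql.connector.connect(host="localhost", user="root", password="")
cur = db.cursor()
cur.execute("CREATE DATABASE IF NOT EXISTS acl_anthology")
cur.execute("USE acl_anthology")
cur.execute("""
    CREATE TABLE IF NOT EXISTS paper (
        id VARCHAR(64) PRIMARY KEY,
        title TEXT,
        author TEXT,
        year INT,
        abstract TEXT
    )
""")

with open("anthology.json") as f:  # output of parse.py (path assumed)
    for entry in json.load(f):
        cur.execute(
            "REPLACE INTO paper (id, title, author, year, abstract) "
            "VALUES (%s, %s, %s, %s, %s)",
            (entry["id"], entry["title"], entry.get("author"),
             entry.get("year"), entry.get("abstract")),
        )
db.commit()
```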
Setup MongoDB server
First, make sure you have an OpenAI API key and a MongoDB connection string:
echo [OPEN_AI_KEY] > .openai-secret
echo [MONGO_DB_KEY] > .mongodb-secret
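At startup, the server can then read these files and construct the two clients. A minimal sketch (the filenames match the commands above; everything else is an assumption):

```python
# read the secrets written above and build the two clients (sketch only)
from pathlib import Path

from openai import OpenAI
from pymongo import MongoClient

openai_client = OpenAI(api_key=Path(".openai-secret").read_text().strip())
mongo_client = MongoClient(Path(".mongodb-secret").read_text().strip())
```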
(Optional) Step 1: Parse the Anthology
Feel free to skip steps 1 and 2, since the parsed and indexed anthology is already included in this repo.
To grab the up-to-date abstracts:
curl -O https://aclanthology.org/anthology+abstracts.bib.gz
gunzip anthology+abstracts.bib.gz
mv anthology+abstracts.bib anthology.bib
To parse the .bib file into .json:
python parse.py
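Conceptually, this step boils down to loading the .bib file with bibtexparser and dumping the entries as JSON; a sketch, with the output filename assumed (parse.py's actual field handling may differ):

```python
# Hypothetical sketch of the .bib -> .json conversion.
import json
import bibtexparser

with open("anthology.bib") as f:
    bib = bibtexparser.load(f)

# each entry is a dict of bibtex fields (title, author, abstract, url, ...)
with open("anthology.json", "w") as f:
    json.dump(bib.entries, f, indent=2)
```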
(Optional) Step 2: Index with ColBERT
python index.py
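Under the hood, indexing with ColBERT looks roughly like the following; the checkpoint, experiment name, index name, and collection path are assumptions rather than the exact contents of index.py:

```python
# Minimal sketch of ColBERT indexing. The collection is assumed to be a
# TSV of "pid \t passage" lines, per ColBERT's usual input format.
from colbert import Indexer
from colbert.infra import Run, RunConfig, ColBERTConfig

if __name__ == "__main__":
    with Run().context(RunConfig(nranks=1, experiment="acl")):
        config = ColBERTConfig(nbits=2)  # 2-bit residual compression
        indexer = Indexer(checkpoint="colbert-ir/colbertv2.0", config=config)
        indexer.index(name="acl.index", collection="collection.tsv", overwrite=True)
```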
Step 3: Search with ColBERT
To create a Flask server capable of serving search results, run:
python server.py
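Internally, server.py presumably wraps a ColBERT Searcher in a small Flask app; here is a minimal sketch under assumptions (the index name and response shape are guesses, though the port matches the URLs below):

```python
# Sketch of a Flask endpoint wrapping a ColBERT Searcher.
from flask import Flask, request, jsonify
from colbert import Searcher
from colbert.infra import Run, RunConfig

app = Flask(__name__)

with Run().context(RunConfig(experiment="acl")):
    searcher = Searcher(index="acl.index")  # index name assumed

@app.route("/api/search")
def api_search():
    query = request.args.get("query", "")
    k = int(request.args.get("k", 10))
    pids, ranks, scores = searcher.search(query, k=k)
    return jsonify([
        {"pid": pid, "rank": rank, "score": score}
        for pid, rank, score in zip(pids, ranks, scores)
    ])

if __name__ == "__main__":
    app.run(port=8893)
```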
Then, to test, visit:
http://localhost:8893/api/search?k=25&query=How to extend context windows?
or for an interface:
http://localhost:8893
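You can also hit the endpoint programmatically (assumes pip install requests and that the endpoint returns JSON):

```python
# query the search API from Python; parameters mirror the URL above
import requests

resp = requests.get(
    "http://localhost:8893/api/search",
    params={"k": 25, "query": "How to extend context windows?"},
)
print(resp.json())
```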
Example notebooks
To see an example of search, visit: https://colab.research.google.com/drive/1-b90_8YSAK17KQ6C7nqKRYbCWEXQ9FGs