license: apache-2.0

This project uses ColBERT as an information retrieval interface for the ACL Anthology. It uses a MySQL backend for storing paper data and a simple Flask front end.

There are two methods for retrieving passage candidates: (i) ColBERT alone, which may not scale well to extremely large datastores, and (ii) OpenAI embeddings, which select the top-k passages that ColBERT then re-ranks (the expensive step). The OpenAI method requires an OpenAI API key and a MongoDB key for storing the vector entries.
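The second method is a retrieve-then-re-rank pipeline. The following is a minimal sketch of that idea on toy vectors, not the code used in this repo: stage 1 keeps the top-k passages by cosine similarity of a single embedding per passage (standing in for the OpenAI embeddings), and stage 2 re-ranks them with a ColBERT-style MaxSim score over per-token embeddings. All function names, field names, and vectors here are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def maxsim(query_tokens, doc_tokens):
    """Late-interaction score: for each query token, take its best match
    among the passage tokens, then sum over query tokens."""
    return sum(max(cosine(q, d) for d in doc_tokens) for q in query_tokens)


def two_stage_search(query_vec, query_tokens, passages, k=2):
    """Stage 1: cheap top-k by single-vector similarity.
    Stage 2: re-rank only those k candidates with the costlier MaxSim."""
    candidates = sorted(
        passages, key=lambda p: cosine(query_vec, p["vec"]), reverse=True
    )[:k]
    return sorted(
        candidates, key=lambda p: maxsim(query_tokens, p["tokens"]), reverse=True
    )


# Toy data (hypothetical): one embedding plus per-token embeddings per passage.
passages = [
    {"id": "a", "vec": [1.0, 0.1], "tokens": [[1.0, 0.0]]},
    {"id": "b", "vec": [1.0, 0.5], "tokens": [[1.0, 0.0], [0.0, 1.0]]},
    {"id": "c", "vec": [0.0, 1.0], "tokens": [[0.0, 1.0]]},
]
query_vec = [1.0, 0.0]
query_tokens = [[1.0, 0.0], [0.0, 1.0]]

ranked = two_stage_search(query_vec, query_tokens, passages, k=2)
print([p["id"] for p in ranked])  # ['b', 'a']
```

In this example stage 1 drops passage `c`, and stage 2 moves `b` ahead of `a` because `b` matches both query tokens. The expensive scoring only ever runs on k passages, which is why this route is expected to scale better than running ColBERT over the whole datastore.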

Setup

Set up ColBERT

First, clone this repo, create a conda environment, and install the dependencies:

git clone https://huggingface.co/davidheineman/colbert-acl
# torch==1.13.1 required (conda install -y -n [env] python=3.10)
pip install bibtexparser "colbert-ir[torch,faiss-gpu]"

Set up MySQL server

Install the pip dependencies:

pip install mysql-connector-python flask openai "pymongo[srv]"

Set up a local MySQL server:

brew install mysql

Run the database setup to copy the ACL entries:

python init_db.py

Set up MongoDB server

First, make sure you have an OpenAI API key and a MongoDB key, and save each one to its secret file:

echo [OPEN_AI_KEY] > .opeani-secret
echo [MONGO_DB_KEY] > .mongodb-secret
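Because `echo` appends a newline, any code that reads these files should strip whitespace before using the key. A minimal sketch of such a reader follows; the helper name `read_secret` is hypothetical, and how the scripts in this repo actually load the keys is not shown here.

```python
from pathlib import Path


def read_secret(path):
    """Return the contents of a one-line secret file without the
    trailing newline that `echo` writes."""
    return Path(path).read_text(encoding="utf-8").strip()
```

For example, `read_secret(".mongodb-secret")` would return the MongoDB key written by the command above.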

(Optional) Step 1: Parse the Anthology

Feel free to skip steps 1 and 2, since the parsed/indexed anthology is contained in this repo.

To grab the up-to-date abstracts:

curl -O https://aclanthology.org/anthology+abstracts.bib.gz
gunzip anthology+abstracts.bib.gz
mv anthology+abstracts.bib anthology.bib

To parse the .bib file into .json:

python parse.py

(Optional) Step 2: Index with ColBERT

python index.py

Step 3: Search with ColBERT

To start a Flask server that serves search results, run:

python server.py

Then, to test, visit:

http://localhost:8893/api/search?k=25&query=How to extend context windows?

or for an interface:

http://localhost:8893
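The search endpoint can also be called from a script, where the query string needs to be URL-encoded (a browser does this automatically for the test URL above). The sketch below builds the same request with the standard library. It assumes the server is running locally on port 8893 and that `/api/search` returns JSON; the response format is not documented here, and the function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8893"  # host and port taken from the test URL


def build_search_url(query, k=25, base_url=BASE_URL):
    """Build the /api/search URL with URL-encoded `k` and `query` parameters."""
    return f"{base_url}/api/search?{urlencode({'k': k, 'query': query})}"


def search(query, k=25, base_url=BASE_URL):
    """Send the request and decode the body.
    Assumption: the endpoint responds with JSON."""
    with urlopen(build_search_url(query, k, base_url)) as resp:
        return json.load(resp)
```

With the server running, `search("How to extend context windows?")` sends the same request as the test URL, with spaces and the question mark encoded.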

Example notebooks

To see an example of search, visit: https://colab.research.google.com/drive/1-b90_8YSAK17KQ6C7nqKRYbCWEXQ9FGs