---
license: apache-2.0
---
## Setup ColBERT

First, clone this repo, create a conda environment, and install the dependencies:

```sh
git clone https://huggingface.co/davidheineman/colbert-acl

# torch==1.13.1 required (conda install -y -n [env] python=3.10)
# quote the extras so the shell doesn't try to glob the brackets
pip install bibtexparser "colbert-ir[torch,faiss-gpu]"
```
To grab the up-to-date abstracts:

```sh
curl -O https://aclanthology.org/anthology+abstracts.bib.gz
gunzip anthology+abstracts.bib.gz
mv anthology+abstracts.bib anthology.bib
```
## Setup server

Install the pip dependencies:

```sh
pip install mysql-connector-python flask
```

Set up a local MySQL server:

```sh
brew install mysql
```
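The MySQL server also needs to be running before the database setup. With Homebrew that typically looks like the following (the database name `acl` is an assumption, not necessarily what init_db.py expects):

```sh
brew services start mysql
mysql -u root -e "CREATE DATABASE IF NOT EXISTS acl"
```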
Run the database setup to copy the ACL entries:

```sh
python init_db.py
```
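For reference, a minimal sketch of what this setup step might look like with mysql-connector-python. The table name, column names, JSON filename, and credentials below are illustrative assumptions, not the actual schema used by init_db.py:

```python
import json
import mysql.connector

# Connect to the local MySQL server (credentials are placeholder assumptions)
db = mysql.connector.connect(host="localhost", user="root", password="")
cursor = db.cursor()

# Hypothetical database/table: one row of metadata per anthology entry
cursor.execute("CREATE DATABASE IF NOT EXISTS acl")
cursor.execute("USE acl")
cursor.execute("""
    CREATE TABLE IF NOT EXISTS paper (
        id VARCHAR(64) PRIMARY KEY,
        title TEXT,
        author TEXT,
        year TEXT,
        abstract TEXT
    )
""")

# Copy each parsed anthology entry into MySQL (anthology.json assumed from the parse step)
with open("anthology.json") as f:
    papers = json.load(f)

for paper in papers:
    cursor.execute(
        "REPLACE INTO paper (id, title, author, year, abstract) VALUES (%s, %s, %s, %s, %s)",
        (paper["ID"], paper.get("title"), paper.get("author"), paper.get("year"), paper.get("abstract")),
    )
db.commit()
```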
## (Optional) Step 1: Parse the Anthology

Feel free to skip steps 1 and 2, since the parsed/indexed anthology is contained in this repo. To parse the .bib file into .json:

```sh
python parse.py
```
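Conceptually, the parse step just reads the BibTeX entries and dumps them to JSON. A rough sketch assuming the bibtexparser 1.x API (the file names and the abstract filter are assumptions):

```python
import json
import bibtexparser

# Load the anthology BibTeX file (bibtexparser 1.x API)
with open("anthology.bib") as f:
    bib = bibtexparser.load(f)

# bib.entries is a list of dicts with keys like "ID", "title", "author", "abstract";
# keep only entries that actually have an abstract to index
entries = [e for e in bib.entries if "abstract" in e]

with open("anthology.json", "w") as f:
    json.dump(entries, f, indent=2)
```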
## (Optional) Step 2: Index with ColBERT

```sh
python index.py
```
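Under the hood this builds a ColBERTv2 index over the abstracts. A minimal sketch of the indexing call using the colbert-ir API; the experiment name, index name, checkpoint, and config values are assumptions:

```python
import json
from colbert import Indexer
from colbert.infra import Run, RunConfig, ColBERTConfig

if __name__ == "__main__":
    # One passage per anthology entry, in a fixed order (pids are offsets into this list)
    with open("anthology.json") as f:
        papers = json.load(f)
    collection = [p["abstract"] for p in papers]

    with Run().context(RunConfig(nranks=1, experiment="acl")):
        config = ColBERTConfig(nbits=2, doc_maxlen=300)
        indexer = Indexer(checkpoint="colbert-ir/colbertv2.0", config=config)
        indexer.index(name="acl.index", collection=collection, overwrite=True)
```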
## Step 3: Search with ColBERT

To start a Flask server that serves search results, run:

```sh
python server.py
```

Then, to test, visit:

```
http://localhost:8893/api/search?k=25&query=How to extend context windows?
```
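server.py essentially wraps a ColBERT Searcher in a Flask route. A rough sketch of that endpoint; the index/experiment names, the metadata source, and the response format are assumptions:

```python
import json
from flask import Flask, request, jsonify
from colbert import Searcher
from colbert.infra import Run, RunConfig

app = Flask(__name__)

# Load the index once at startup; search() returns pids that are offsets into the
# indexed collection, assumed here to line up with the entries in anthology.json
with Run().context(RunConfig(experiment="acl")):
    searcher = Searcher(index="acl.index")
with open("anthology.json") as f:
    papers = json.load(f)

@app.route("/api/search")
def api_search():
    query = request.args.get("query", "")
    k = int(request.args.get("k", 10))
    pids, ranks, scores = searcher.search(query, k=k)
    results = [
        {"rank": rank, "score": score, **papers[pid]}
        for pid, rank, score in zip(pids, ranks, scores)
    ]
    return jsonify(results)

if __name__ == "__main__":
    app.run(port=8893)
```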
## Example notebooks
To see an example of search, visit: colab.research.google.com/drive/1-b90_8YSAK17KQ6C7nqKRYbCWEXQ9FGs
## Notes

- It's possible to update the index without re-encoding the whole dataset: the IVF table is updated, but the centroids are not re-computed. This requires a large index to already exist (in our case it does).
- We'll need someone to manage the storage/saving of the index, so it can be updated in real-time.
- See:
- We also need a MySQL database that takes a document ID and returns its metadata, so the ColBERT index only stores the passage encodings rather than the full text (right now the server just loads the whole JSON into memory). A lookup sketch is included after this list.
- We may be able to offload the centroid calculation to a vector DB (check on this).
- We should have 2 people on UI, 1 on MySQL, 1 on the vector DB, and 1 on ColBERT.
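For the metadata-lookup note above, a sketch of the ID-to-metadata query; the table and column names are the same assumptions as in the init_db sketch earlier:

```python
import mysql.connector

# Placeholder connection details; reuse whatever init_db.py sets up
db = mysql.connector.connect(host="localhost", user="root", password="", database="acl")

def get_metadata(paper_id: str) -> dict | None:
    """Return the stored metadata for one anthology ID, or None if it isn't in the table."""
    cursor = db.cursor(dictionary=True)
    cursor.execute(
        "SELECT id, title, author, year, abstract FROM paper WHERE id = %s",
        (paper_id,),
    )
    return cursor.fetchone()
```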