---
license: apache-2.0
---

## Setup

First, clone this repo, create a conda environment, and install the dependencies:

```sh
git clone https://huggingface.co/davidheineman/colbert-acl

# torch==1.13.1 required (conda create -y -n [env] python=3.10)
pip install bibtexparser colbert-ir[torch,faiss-gpu]
```

To grab the up-to-date abstracts:

```sh
curl -O https://aclanthology.org/anthology+abstracts.bib.gz
gunzip anthology+abstracts.bib.gz
mv anthology+abstracts.bib anthology.bib
```

## (Optional) Step 1: Parse the Anthology

Feel free to skip steps 1 and 2, since the parsed/indexed anthology is already contained in this repo. To parse the `.bib` file into `.json`:

```sh
python parse.py
```
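`parse.py` is the source of truth here; as a rough sketch of what this step amounts to (the output filename and the fields kept below are assumptions, not necessarily what the script does), parsing with `bibtexparser` might look like:

```python
import json

import bibtexparser

# Load the full anthology .bib (this file is large, so this can take a while)
with open("anthology.bib") as f:
    bib = bibtexparser.load(f)

# Keep only entries that have an abstract to index (field names are assumptions)
papers = [
    {
        "id": entry.get("ID"),
        "title": entry.get("title", ""),
        "author": entry.get("author", ""),
        "year": entry.get("year", ""),
        "url": entry.get("url", ""),
        "abstract": entry.get("abstract", ""),
    }
    for entry in bib.entries
    if entry.get("abstract")
]

with open("anthology.json", "w") as f:
    json.dump(papers, f)
```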

## (Optional) Step 2: Index with ColBERT

```sh
python index.py
```
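`index.py` performs the actual indexing; for orientation, the core of a ColBERT indexing script usually follows the `colbert-ir` `Indexer` pattern below (the experiment name, index name, checkpoint, and `anthology.json` field names are assumptions):

```python
import json

from colbert import Indexer
from colbert.infra import Run, RunConfig, ColBERTConfig

if __name__ == "__main__":
    # Collection of passages to index: one abstract per passage (field name assumed)
    with open("anthology.json") as f:
        papers = json.load(f)
    collection = [p["abstract"] for p in papers]

    with Run().context(RunConfig(nranks=1, experiment="acl")):
        config = ColBERTConfig(nbits=2, doc_maxlen=300)
        indexer = Indexer(checkpoint="colbert-ir/colbertv2.0", config=config)
        indexer.index(name="acl", collection=collection)
```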

## Step 3: Search with ColBERT

To start a Flask server that serves search results, run:

```sh
python server.py
```
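`server.py` is the actual implementation; a minimal sketch of how a ColBERT-backed Flask endpoint like this is typically wired up (the index name, experiment name, and metadata fields below are assumptions) looks roughly like:

```python
import json

from flask import Flask, request
from colbert import Searcher
from colbert.infra import Run, RunConfig

app = Flask(__name__)

# Load the prebuilt index once at startup (index/experiment names are assumptions)
with Run().context(RunConfig(experiment="acl")):
    searcher = Searcher(index="acl")

# Passage-id -> metadata mapping from the parsed anthology (field names assumed)
with open("anthology.json") as f:
    papers = json.load(f)

@app.route("/api/search")
def api_search():
    query = request.args.get("query", "")
    k = int(request.args.get("k", 10))
    pids, ranks, scores = searcher.search(query, k=k)
    results = [
        {"rank": rank, "score": score, **papers[pid]}
        for pid, rank, score in zip(pids, ranks, scores)
    ]
    return {"query": query, "results": results}

if __name__ == "__main__":
    app.run(port=8893)
```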

Then, to test, visit:

```
http://localhost:8893/api/search?k=25&query=How to extend context windows?
```
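You can also hit the endpoint programmatically, e.g. with `requests` (the exact response shape depends on what `server.py` returns, so treat the printed structure as an assumption):

```python
import requests

resp = requests.get(
    "http://localhost:8893/api/search",
    params={"k": 25, "query": "How to extend context windows?"},
)
resp.raise_for_status()
print(resp.json())
```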

## Example notebooks

To see an example of search, visit: [colab.research.google.com/drive/1-b90_8YSAK17KQ6C7nqKRYbCWEXQ9FGs](https://colab.research.google.com/drive/1-b90_8YSAK17KQ6C7nqKRYbCWEXQ9FGs)

## Notes

- It's possible to update the index without re-computing the whole dataset: the IVF table is updated, but the centroids are not re-computed. This requires a large dataset to already exist (in our case it does).
  - We'll need someone to manage the storage and saving of the index so it can be updated in real time.
  - See:
- We also need a MySQL database which takes a document ID and returns its metadata, so the ColBERT index only stores the passage encodings, not the full text (right now the whole JSON is loaded into memory). A minimal sketch of such a lookup is shown after this list.
- We may be able to offload the centroid calculation to a vector DB (check on this).
- We should have 2 people on UI, 1 on MySQL, 1 on the vector DB, and 1 on ColBERT.
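For the document-ID-to-metadata lookup mentioned above, a minimal sketch (using SQLite as a stand-in for MySQL; the schema and field names are assumptions) could look like:

```python
import json
import sqlite3

# Build a small metadata store keyed by ColBERT passage id (schema is an assumption)
conn = sqlite3.connect("anthology_metadata.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS papers "
    "(pid INTEGER PRIMARY KEY, title TEXT, author TEXT, year TEXT, url TEXT)"
)

with open("anthology.json") as f:
    papers = json.load(f)

conn.executemany(
    "INSERT OR REPLACE INTO papers VALUES (?, ?, ?, ?, ?)",
    [
        (pid, p.get("title", ""), p.get("author", ""), p.get("year", ""), p.get("url", ""))
        for pid, p in enumerate(papers)
    ],
)
conn.commit()


def get_metadata(pid: int) -> dict:
    """Return the stored metadata for a ColBERT passage id."""
    row = conn.execute(
        "SELECT title, author, year, url FROM papers WHERE pid = ?", (pid,)
    ).fetchone()
    return dict(zip(["title", "author", "year", "url"], row)) if row else {}
```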