---
license: apache-2.0
---
## Setup ColBERT

First, clone this repo, create a conda environment, and install the dependencies:

```sh
git clone https://huggingface.co/davidheineman/colbert-acl

# torch==1.13.1 required (conda install -y -n [env] python=3.10)
# quote the extras so the shell doesn't try to glob the brackets
pip install bibtexparser "colbert-ir[torch,faiss-gpu]"
```
To grab the up-to-date abstracts:

```sh
curl -O https://aclanthology.org/anthology+abstracts.bib.gz
gunzip anthology+abstracts.bib.gz
mv anthology+abstracts.bib anthology.bib
```
## Setup server

Install the pip dependencies:

```sh
pip install mysql-connector-python flask
```

Set up a local MySQL server:

```sh
brew install mysql
```
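The MySQL server also needs to be running before the database setup. With Homebrew that typically looks like the following (the database name `acl` is an assumption, not necessarily what init_db.py expects):

```sh
brew services start mysql
mysql -u root -e "CREATE DATABASE IF NOT EXISTS acl"
```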
Run the database setup to copy the ACL entries:

```sh
python init_db.py
```
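For reference, a minimal sketch of what this setup step might look like with mysql-connector-python. The table name, column names, JSON filename, and credentials below are illustrative assumptions, not the actual schema used by init_db.py:

```python
import json
import mysql.connector

# Connect to the local MySQL server (credentials are placeholder assumptions)
db = mysql.connector.connect(host="localhost", user="root", password="")
cursor = db.cursor()

# Hypothetical database/table: one row of metadata per anthology entry
cursor.execute("CREATE DATABASE IF NOT EXISTS acl")
cursor.execute("USE acl")
cursor.execute("""
    CREATE TABLE IF NOT EXISTS paper (
        id VARCHAR(64) PRIMARY KEY,
        title TEXT,
        author TEXT,
        year TEXT,
        abstract TEXT
    )
""")

# Copy each parsed anthology entry into MySQL (anthology.json assumed from the parse step)
with open("anthology.json") as f:
    papers = json.load(f)

for paper in papers:
    cursor.execute(
        "REPLACE INTO paper (id, title, author, year, abstract) VALUES (%s, %s, %s, %s, %s)",
        (paper["ID"], paper.get("title"), paper.get("author"), paper.get("year"), paper.get("abstract")),
    )
db.commit()
```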
## (Optional) Step 1: Parse the Anthology

Feel free to skip steps 1 and 2, since the parsed/indexed anthology is contained in this repo. To parse the .bib file into .json:

```sh
python parse.py
```
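Conceptually, the parse step just reads the BibTeX entries and dumps them to JSON. A rough sketch assuming the bibtexparser 1.x API (the file names and the abstract filter are assumptions):

```python
import json
import bibtexparser

# Load the anthology BibTeX file (bibtexparser 1.x API)
with open("anthology.bib") as f:
    bib = bibtexparser.load(f)

# bib.entries is a list of dicts with keys like "ID", "title", "author", "abstract";
# keep only entries that actually have an abstract to index
entries = [e for e in bib.entries if "abstract" in e]

with open("anthology.json", "w") as f:
    json.dump(entries, f, indent=2)
```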
## (Optional) Step 2: Index with ColBERT

```sh
python index.py
```
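Under the hood this builds a ColBERTv2 index over the abstracts. A minimal sketch of the indexing call using the colbert-ir API; the experiment name, index name, checkpoint, and config values are assumptions:

```python
import json
from colbert import Indexer
from colbert.infra import Run, RunConfig, ColBERTConfig

if __name__ == "__main__":
    # One passage per anthology entry, in a fixed order (pids are offsets into this list)
    with open("anthology.json") as f:
        papers = json.load(f)
    collection = [p["abstract"] for p in papers]

    with Run().context(RunConfig(nranks=1, experiment="acl")):
        config = ColBERTConfig(nbits=2, doc_maxlen=300)
        indexer = Indexer(checkpoint="colbert-ir/colbertv2.0", config=config)
        indexer.index(name="acl.index", collection=collection, overwrite=True)
```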
## Step 3: Search with ColBERT

To start a Flask server that serves search results, run:

```sh
python server.py
```

Then, to test, visit:

```
http://localhost:8893/api/search?k=25&query=How to extend context windows?
```
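server.py essentially wraps a ColBERT Searcher in a Flask route. A rough sketch of that endpoint; the index/experiment names, the metadata source, and the response format are assumptions:

```python
import json
from flask import Flask, request, jsonify
from colbert import Searcher
from colbert.infra import Run, RunConfig

app = Flask(__name__)

# Load the index once at startup; search() returns pids that are offsets into the
# indexed collection, assumed here to line up with the entries in anthology.json
with Run().context(RunConfig(experiment="acl")):
    searcher = Searcher(index="acl.index")
with open("anthology.json") as f:
    papers = json.load(f)

@app.route("/api/search")
def api_search():
    query = request.args.get("query", "")
    k = int(request.args.get("k", 10))
    pids, ranks, scores = searcher.search(query, k=k)
    results = [
        {"rank": rank, "score": score, **papers[pid]}
        for pid, rank, score in zip(pids, ranks, scores)
    ]
    return jsonify(results)

if __name__ == "__main__":
    app.run(port=8893)
```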
## Example notebooks
To see an example of search, visit: colab.research.google.com/drive/1-b90_8YSAK17KQ6C7nqKRYbCWEXQ9FGs
## Notes

- It's possible to update the index without re-encoding the whole dataset: the IVF table is updated, but the centroids are not re-computed. This requires a large index to already exist (in our case it does).
- We'll need someone to manage the storage/saving of the index, so it can be updated in real-time.
- See:
- We also need a MySQL database that takes a document ID and returns its metadata, so the ColBERT index only stores the passage encodings rather than the full text (right now the server just loads the whole JSON into memory). A lookup sketch is included after this list.
- We may be able to offload the centroid calculation to a vector DB (check on this).
- We should have 2 people on UI, 1 on MySQL, 1 on the vector DB, and 1 on ColBERT.
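For the metadata-lookup note above, a sketch of the ID-to-metadata query; the table and column names are the same assumptions as in the init_db sketch earlier:

```python
import mysql.connector

# Placeholder connection details; reuse whatever init_db.py sets up
db = mysql.connector.connect(host="localhost", user="root", password="", database="acl")

def get_metadata(paper_id: str) -> dict | None:
    """Return the stored metadata for one anthology ID, or None if it isn't in the table."""
    cursor = db.cursor(dictionary=True)
    cursor.execute(
        "SELECT id, title, author, year, abstract FROM paper WHERE id = %s",
        (paper_id,),
    )
    return cursor.fetchone()
```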