---
license: apache-2.0
---

## Setup

First, clone this repo, create a conda environment, and install the dependencies:

```sh
git clone https://huggingface.co/davidheineman/colbert-acl

# torch==1.13.1 required (conda create -y -n [env] python=3.10)
pip install bibtexparser colbert-ir[torch,faiss-gpu]
```

To grab the up-to-date abstracts:

```sh
curl -O https://aclanthology.org/anthology+abstracts.bib.gz
gunzip anthology+abstracts.bib.gz
mv anthology+abstracts.bib anthology.bib
```

## (Optional) Step 1: Parse the Anthology

Feel free to skip steps 1 and 2, since the parsed/indexed anthology is already contained in this repo. To parse the `.bib` file into `.json`:

```sh
python parse.py
```
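`parse.py` is the source of truth here; as a rough sketch of what this step amounts to (the output filename and the fields kept below are assumptions, not necessarily what the script does), parsing with `bibtexparser` might look like:

```python
import json

import bibtexparser

# Load the full anthology .bib (this file is large, so this can take a while)
with open("anthology.bib") as f:
    bib = bibtexparser.load(f)

# Keep only entries that have an abstract to index (field names are assumptions)
papers = [
    {
        "id": entry.get("ID"),
        "title": entry.get("title", ""),
        "author": entry.get("author", ""),
        "year": entry.get("year", ""),
        "url": entry.get("url", ""),
        "abstract": entry.get("abstract", ""),
    }
    for entry in bib.entries
    if entry.get("abstract")
]

with open("anthology.json", "w") as f:
    json.dump(papers, f)
```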

## (Optional) Step 2: Index with ColBERT

```sh
python index.py
```
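`index.py` performs the actual indexing; for orientation, the core of a ColBERT indexing script usually follows the `colbert-ir` `Indexer` pattern below (the experiment name, index name, checkpoint, and `anthology.json` field names are assumptions):

```python
import json

from colbert import Indexer
from colbert.infra import Run, RunConfig, ColBERTConfig

if __name__ == "__main__":
    # Collection of passages to index: one abstract per passage (field name assumed)
    with open("anthology.json") as f:
        papers = json.load(f)
    collection = [p["abstract"] for p in papers]

    with Run().context(RunConfig(nranks=1, experiment="acl")):
        config = ColBERTConfig(nbits=2, doc_maxlen=300)
        indexer = Indexer(checkpoint="colbert-ir/colbertv2.0", config=config)
        indexer.index(name="acl", collection=collection)
```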

## Step 3: Search with ColBERT

To start a Flask server that serves search results, run:

```sh
python server.py
```
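`server.py` is the actual implementation; a minimal sketch of how a ColBERT-backed Flask endpoint like this is typically wired up (the index name, experiment name, and metadata fields below are assumptions) looks roughly like:

```python
import json

from flask import Flask, request
from colbert import Searcher
from colbert.infra import Run, RunConfig

app = Flask(__name__)

# Load the prebuilt index once at startup (index/experiment names are assumptions)
with Run().context(RunConfig(experiment="acl")):
    searcher = Searcher(index="acl")

# Passage-id -> metadata mapping from the parsed anthology (field names assumed)
with open("anthology.json") as f:
    papers = json.load(f)

@app.route("/api/search")
def api_search():
    query = request.args.get("query", "")
    k = int(request.args.get("k", 10))
    pids, ranks, scores = searcher.search(query, k=k)
    results = [
        {"rank": rank, "score": score, **papers[pid]}
        for pid, rank, score in zip(pids, ranks, scores)
    ]
    return {"query": query, "results": results}

if __name__ == "__main__":
    app.run(port=8893)
```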

Then, to test, visit:

```
http://localhost:8893/api/search?k=25&query=How to extend context windows?
```
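You can also hit the endpoint programmatically, e.g. with `requests` (the exact response shape depends on what `server.py` returns, so treat the printed structure as an assumption):

```python
import requests

resp = requests.get(
    "http://localhost:8893/api/search",
    params={"k": 25, "query": "How to extend context windows?"},
)
resp.raise_for_status()
print(resp.json())
```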

## Example notebooks

To see an example of search, visit: [colab.research.google.com/drive/1-b90_8YSAK17KQ6C7nqKRYbCWEXQ9FGs](https://colab.research.google.com/drive/1-b90_8YSAK17KQ6C7nqKRYbCWEXQ9FGs)

## Notes

- It's possible to update the index without re-computing the whole dataset: the IVF table is updated, but the centroids are not re-computed. This requires a large dataset to already exist (in our case it does).
  - We'll need someone to manage the storage and saving of the index so it can be updated in real time.
  - See:
- We also need a MySQL database which takes a document ID and returns its metadata, so the ColBERT index only stores the passage encodings, not the full text (right now the whole JSON is loaded into memory). A minimal sketch of such a lookup is shown after this list.
- We may be able to offload the centroid calculation to a vector DB (check on this).
- We should have 2 people on UI, 1 on MySQL, 1 on the vector DB, and 1 on ColBERT.
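For the document-ID-to-metadata lookup mentioned above, a minimal sketch (using SQLite as a stand-in for MySQL; the schema and field names are assumptions) could look like:

```python
import json
import sqlite3

# Build a small metadata store keyed by ColBERT passage id (schema is an assumption)
conn = sqlite3.connect("anthology_metadata.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS papers "
    "(pid INTEGER PRIMARY KEY, title TEXT, author TEXT, year TEXT, url TEXT)"
)

with open("anthology.json") as f:
    papers = json.load(f)

conn.executemany(
    "INSERT OR REPLACE INTO papers VALUES (?, ?, ?, ?, ?)",
    [
        (pid, p.get("title", ""), p.get("author", ""), p.get("year", ""), p.get("url", ""))
        for pid, p in enumerate(papers)
    ],
)
conn.commit()


def get_metadata(pid: int) -> dict:
    """Return the stored metadata for a ColBERT passage id."""
    row = conn.execute(
        "SELECT title, author, year, url FROM papers WHERE pid = ?", (pid,)
    ).fetchone()
    return dict(zip(["title", "author", "year", "url"], row)) if row else {}
```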