---
license: apache-2.0
---

## Setup

First, clone this repo, create a conda environment, and install the dependencies:

```sh
git clone https://huggingface.co/davidheineman/colbert-acl

# torch==1.13.1 required (conda install -y -n [env] python=3.10)
pip install bibtexparser colbert-ir[torch,faiss-gpu]
```

To grab the up-to-date abstracts:

```sh
curl -O https://aclanthology.org/anthology+abstracts.bib.gz
gunzip anthology+abstracts.bib.gz
mv anthology+abstracts.bib anthology.bib
```

### (Optional) Step 1: Parse the Anthology

Feel free to skip steps 1 and 2, since the parsed/indexed anthology is already contained in this repo. To parse the `.bib` file into `.json` (a sketch of what this involves is in the appendix below):

```sh
python parse.py
```

### (Optional) Step 2: Index with ColBERT

To index the parsed abstracts with ColBERT (see the indexing sketch in the appendix below):

```sh
python index.py
```

### Step 3: Search with ColBERT

To launch a Flask server that serves search results (see the server sketch in the appendix below), run:

```sh
python server.py
```

Then, to test, visit:

```
http://localhost:8893/api/search?k=25&query=How to extend context windows?
```

## Example notebooks

To see an example of search, visit: [colab.research.google.com/drive/1-b90_8YSAK17KQ6C7nqKRYbCWEXQ9FGs](https://colab.research.google.com/drive/1-b90_8YSAK17KQ6C7nqKRYbCWEXQ9FGs?usp=sharing)

## Notes

- It's possible to update the index without re-encoding the whole dataset: the IVF table is updated, but the centroids are not re-computed. This requires a large index to already exist (in our case it does); see the index-update sketch in the appendix below.
  - We'll need someone to manage the storage/saving of the index, so it can be updated in real time.
  - See:
    - https://github.com/stanford-futuredata/ColBERT/blob/main/colbert/index_updater.py
    - https://github.com/stanford-futuredata/ColBERT/issues/111
- We also need a MySQL database that takes a document ID and returns its metadata, so the ColBERT index stores only the passage encodings, not the full text (right now the server just loads the whole JSON into memory); a toy version of this lookup is sketched in the appendix below.
- We may be able to offload the centroid calculation to a vector DB (check on this).
- Should have 2 people on UI, 1 on MySQL, 1 on VectorDB, 1 on ColBERT.
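
## Appendix: API sketches

The sketches below are illustrative only, not the exact contents of the scripts in this repo; checkpoint names, index names, file paths, and field names are assumptions.

### Parsing (`parse.py`)

A minimal sketch of parsing the anthology `.bib` into `.json` with `bibtexparser` (the output filename and the abstract filter are assumptions):

```python
# Hypothetical sketch: parse anthology.bib into anthology.json.
# Field selection and filenames are assumptions, not necessarily what parse.py does.
import json
import bibtexparser

with open("anthology.bib") as f:
    bib = bibtexparser.load(f)  # BibDatabase; .entries is a list of field dicts

# Keep only entries that have an abstract to index
entries = [e for e in bib.entries if "abstract" in e]

with open("anthology.json", "w") as f:
    json.dump(entries, f, indent=2)
```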
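
### Indexing (`index.py`)

A minimal sketch of the indexing step using ColBERT's documented `Indexer` API (the checkpoint, experiment name, index name, and collection path are assumptions):

```python
# Hypothetical sketch: build a ColBERT index over the parsed abstracts.
from colbert import Indexer
from colbert.infra import Run, RunConfig, ColBERTConfig

if __name__ == "__main__":
    with Run().context(RunConfig(nranks=1, experiment="acl")):
        config = ColBERTConfig(nbits=2)  # 2-bit residual compression
        indexer = Indexer(checkpoint="colbert-ir/colbertv2.0", config=config)
        # collection.tsv: one "pid \t passage" (e.g. title + abstract) per line
        indexer.index(name="acl.nbits=2", collection="collection.tsv")
```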
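
### Serving (`server.py`)

A minimal sketch of a Flask endpoint matching the example URL above (the index name, experiment name, and response fields are assumptions):

```python
# Hypothetical sketch: serve ColBERT search results over HTTP.
from flask import Flask, request, jsonify

from colbert import Searcher
from colbert.infra import Run, RunConfig

app = Flask(__name__)

with Run().context(RunConfig(experiment="acl")):
    searcher = Searcher(index="acl.nbits=2")

@app.route("/api/search", methods=["GET"])
def api_search():
    query = request.args.get("query")
    k = int(request.args.get("k", 10))
    pids, ranks, scores = searcher.search(query, k=k)
    # The collection lookup here is what a metadata DB would eventually replace
    results = [
        {"pid": pid, "rank": rank, "score": score, "text": searcher.collection[pid]}
        for pid, rank, score in zip(pids, ranks, scores)
    ]
    return jsonify({"query": query, "results": results})

if __name__ == "__main__":
    app.run(port=8893)
```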
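
### Updating the index

For the first note above: ColBERT's `IndexUpdater` updates the IVF table for new passages without retraining centroids. A sketch following the usage documented in `index_updater.py` (exact arguments may differ across versions):

```python
# Hypothetical sketch: add/remove passages in an existing index, then persist.
from colbert import Searcher
from colbert.index_updater import IndexUpdater
from colbert.infra import Run, RunConfig, ColBERTConfig

with Run().context(RunConfig(experiment="acl")):
    config = ColBERTConfig()
    searcher = Searcher(index="acl.nbits=2", config=config)

    updater = IndexUpdater(config, searcher, checkpoint="colbert-ir/colbertv2.0")
    added_pids = updater.add(["A newly published abstract to append to the index."])
    updater.remove([42])       # remove passages by pid (42 is a placeholder)
    updater.persist_to_disk()  # write the updated index back to disk
```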
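
### Metadata lookup

For the MySQL note above, the interface is just pid -> metadata. A toy sketch with `sqlite3` standing in for MySQL (the table and column names are assumptions):

```python
# Toy sketch: map a ColBERT pid to document metadata, instead of keeping
# the full JSON in memory. sqlite3 stands in for the eventual MySQL database.
import sqlite3

conn = sqlite3.connect("anthology.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS papers ("
    "  pid INTEGER PRIMARY KEY, title TEXT, authors TEXT, year INTEGER, url TEXT)"
)

def get_metadata(pid: int) -> dict:
    row = conn.execute(
        "SELECT title, authors, year, url FROM papers WHERE pid = ?", (pid,)
    ).fetchone()
    if row is None:
        return {}
    title, authors, year, url = row
    return {"pid": pid, "title": title, "authors": authors, "year": year, "url": url}
```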