---
license: apache-2.0
---

This uses ColBERT as an information retrieval interface for the [ACL Anthology](https://aclanthology.org/). It uses a MySQL backend for storing paper data and a simple Flask front-end.

We have two methods for retrieving passage candidates: (i) using ColBERT, which may not scale well for extremely large datastores, and (ii) using OpenAI embeddings, which select the top-k passages for ColBERT to perform the expensive re-ranking. For OpenAI, you must have an API key and a MongoDB key for storing the vector entries.

# Setup

## Setup ColBERT

First, clone this repo, create a conda environment, and install the dependencies:

```sh
git clone https://huggingface.co/davidheineman/colbert-acl

# torch==1.13.1 required (conda install -y -n [env] python=3.10)
pip install bibtexparser "colbert-ir[torch,faiss-gpu]"
```

## Setup MySQL server

Install the pip dependencies:

```sh
pip install mysql-connector-python flask openai "pymongo[srv]"
```

Set up a local MySQL server:

```sh
brew install mysql
```

Run the database setup to copy the ACL entries:

```sh
python init_db.py
```

## Setup MongoDB server

First, make sure you have an OpenAI API key and a MongoDB API key:

```sh
echo [OPEN_AI_KEY] > .opeani-secret
echo [MONGO_DB_KEY] > .mongodb-secret
```

### (Optional) Step 1: Parse the Anthology

Feel free to skip steps 1 and 2, since the parsed/indexed anthology is contained in this repo.

To grab the up-to-date abstracts:

```sh
curl -O https://aclanthology.org/anthology+abstracts.bib.gz
gunzip anthology+abstracts.bib.gz
mv anthology+abstracts.bib anthology.bib
```

To parse the `.bib` file into `.json`:

```sh
python parse.py
```

### (Optional) Step 2: Index with ColBERT

```sh
python index.py
```

### Step 3: Search with ColBERT

To start a Flask server that serves search results, run:

```sh
python server.py
```

Then, to test, visit:

```
http://localhost:8893/api/search?k=25&query=How to extend context windows?
```

or for an interface:

```
http://localhost:8893
```

## Example notebooks

To see an example of search, visit: [colab.research.google.com/drive/1-b90_8YSAK17KQ6C7nqKRYbCWEXQ9FGs](https://colab.research.google.com/drive/1-b90_8YSAK17KQ6C7nqKRYbCWEXQ9FGs?usp=sharing)
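
The search endpoint from Step 3 can also be called from a script instead of a browser. The snippet below is a minimal sketch and not part of this repo: the helper names `build_search_url` and `search` are hypothetical, the host, port, and the `k` and `query` parameters are taken from the test URL in Step 3, and the response is returned as raw text because its format is not documented here. A browser percent-encodes the spaces and the `?` in the query for you; a script has to do it explicitly.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Host and port come from the URLs in Step 3; adjust if server.py is configured differently.
BASE_URL = "http://localhost:8893"


def build_search_url(query: str, k: int = 25, base_url: str = BASE_URL) -> str:
    """Return the /api/search URL with `k` and `query` percent-encoded (hypothetical helper)."""
    params = urlencode({"k": k, "query": query})
    return f"{base_url}/api/search?{params}"


def search(query: str, k: int = 25, timeout: float = 30.0) -> str:
    """Send the request and return the raw response body as text.

    Assumption: the response format is not specified in this README,
    so the body is returned unparsed.
    """
    with urlopen(build_search_url(query, k), timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Only builds and prints the URL; call search(...) once server.py is running.
    print(build_search_url("How to extend context windows?"))
    # prints: http://localhost:8893/api/search?k=25&query=How+to+extend+context+windows%3F
```

With `python server.py` running, `search("How to extend context windows?")` should return the same body the browser shows for the test URL.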