---
license: apache-2.0
---

## Setup
First, clone this repo, create a conda environment, and install the dependencies:
```sh
git clone https://huggingface.co/davidheineman/colbert-acl
# torch==1.13.1 is required (e.g., conda create -y -n [env] python=3.10)
pip install bibtexparser "colbert-ir[torch,faiss-gpu]"
```

To grab the up-to-date abstracts:
```sh
curl -O https://aclanthology.org/anthology+abstracts.bib.gz
gunzip anthology+abstracts.bib.gz
mv anthology+abstracts.bib anthology.bib
```

### (Optional) Step 1: Parse the Anthology

Feel free to skip steps 1 and 2, since the parsed and indexed anthology is already included in this repo. To parse the `.bib` file into `.json`:

```sh 
python parse.py
```
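
For reference, the core of this step looks roughly like the following sketch using `bibtexparser`. The output filename and the fields kept (`title`, `abstract`, `url`) are assumptions for illustration; `parse.py` in this repo is the authoritative version.

```python
# Sketch: convert anthology.bib entries into a JSON list of papers.
import json
import bibtexparser

with open("anthology.bib") as f:
    bib = bibtexparser.load(f)

papers = [
    {
        "id": entry.get("ID"),
        "title": entry.get("title", ""),
        "abstract": entry.get("abstract", ""),
        "url": entry.get("url", ""),
    }
    for entry in bib.entries
    if entry.get("abstract")  # keep only entries that have an abstract
]

with open("anthology.json", "w") as f:
    json.dump(papers, f, indent=2)
```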

### (Optional) Step 2: Index with ColBERT

```sh 
python index.py
```
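
For reference, indexing with the `colbert-ir` package generally follows the pattern below. The experiment name, index name, and the `colbert-ir/colbertv2.0` checkpoint are assumptions; `index.py` in this repo may use different settings.

```python
# Sketch: build a ColBERT index over the parsed abstracts.
import json
from colbert import Indexer
from colbert.infra import Run, RunConfig, ColBERTConfig

if __name__ == "__main__":
    with open("anthology.json") as f:
        papers = json.load(f)
    collection = [p["abstract"] for p in papers]

    with Run().context(RunConfig(nranks=1, experiment="acl")):
        config = ColBERTConfig(nbits=2, doc_maxlen=300)
        indexer = Indexer(checkpoint="colbert-ir/colbertv2.0", config=config)
        indexer.index(name="acl.index", collection=collection, overwrite=True)
```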

### Step 3: Search with ColBERT

To start a Flask server that serves search results, run:

```sh
python server.py 
```
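
A minimal version of such a server might look like the sketch below, which wraps a ColBERT `Searcher` in a Flask endpoint. The index name, experiment name, and the assumption that passage IDs line up with the order of `anthology.json` are illustrative; `server.py` in this repo is the authoritative version.

```python
# Sketch: a minimal Flask search endpoint over the ColBERT index.
import json
from flask import Flask, jsonify, request
from colbert import Searcher
from colbert.infra import Run, RunConfig

app = Flask(__name__)

with Run().context(RunConfig(experiment="acl")):
    searcher = Searcher(index="acl.index")

with open("anthology.json") as f:
    papers = json.load(f)  # pid indexes into the collection order used at indexing time

@app.route("/api/search")
def api_search():
    query = request.args.get("query", "")
    k = int(request.args.get("k", 10))
    pids, ranks, scores = searcher.search(query, k=k)
    results = [
        {"rank": rank, "score": score, **papers[pid]}
        for pid, rank, score in zip(pids, ranks, scores)
    ]
    return jsonify({"query": query, "results": results})

if __name__ == "__main__":
    app.run(port=8893)
```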

Then, to test, visit:
```
http://localhost:8893/api/search?k=25&query=How to extend context windows?
```
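
Or query the endpoint programmatically, for example with the `requests` library (not in the install list above), which URL-encodes the query string for you:

```python
# Sketch: query the running search server.
import requests

resp = requests.get(
    "http://localhost:8893/api/search",
    params={"k": 25, "query": "How to extend context windows?"},
)
print(resp.json())
```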

## Example notebooks

To see an example of search, visit:
[colab.research.google.com/drive/1-b90_8YSAK17KQ6C7nqKRYbCWEXQ9FGs](https://colab.research.google.com/drive/1-b90_8YSAK17KQ6C7nqKRYbCWEXQ9FGs?usp=sharing)

## Notes
- The index can be updated without re-computing over the whole dataset: new passages are added to the IVF table, but the centroids are not re-computed. This requires a large index to already exist (in our case it does); see the sketch after this list.
    - Someone will need to manage storing/saving the index so it can be updated in real time.
- See: 
    - https://github.com/stanford-futuredata/ColBERT/blob/main/colbert/index_updater.py
    - https://github.com/stanford-futuredata/ColBERT/issues/111
- We also need a MySQL database that takes a document ID and returns its metadata, so the ColBERT index only stores the passage encodings, not the full text (right now the whole JSON is loaded into memory).
- We may be able to offload the centroid computation to a vector DB (check on this).
- We should have 2 people on the UI, 1 on MySQL, 1 on the vector DB, and 1 on ColBERT.
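
Based on the linked `index_updater.py`, incremental updates could look roughly like the sketch below. The index name, checkpoint, and config values are assumptions, and the `IndexUpdater` API may differ across ColBERT versions.

```python
# Sketch: append new abstracts to an existing index without re-computing centroids.
from colbert import Searcher
from colbert.index_updater import IndexUpdater
from colbert.infra import Run, RunConfig, ColBERTConfig

if __name__ == "__main__":
    new_abstracts = ["..."]  # abstracts of newly published papers

    with Run().context(RunConfig(experiment="acl")):
        config = ColBERTConfig(nbits=2)
        searcher = Searcher(index="acl.index", config=config)
        updater = IndexUpdater(config, searcher, checkpoint="colbert-ir/colbertv2.0")
        new_pids = updater.add(new_abstracts)  # appends encodings to the IVF lists
        updater.persist_to_disk()              # save so the update survives restarts
        print(f"Added passage ids: {new_pids}")
```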