davidheineman committed
Commit • f9ad19d
1 Parent(s): 7f8aaec

improve readme

- README.md +14 -1
- knn_db_access.py +1 -2
- openai_embed.py +1 -2
README.md
CHANGED
@@ -2,6 +2,12 @@
 license: apache-2.0
 ---
 
+This uses ColBERT as an information retrieval interface for the [ACL Anthology](https://aclanthology.org/). It uses a MySQL backend for storing paper data and a simple Flask front-end.
+
+We have two methods for retrieving passage candidates: (i) ColBERT alone, which may not scale well to extremely large datastores, and (ii) OpenAI embeddings, which select the top-k passages for ColBERT to perform the expensive re-ranking. For OpenAI, you must have an API key and a MongoDB key for storing the vector entries.
+
+# Setup
+
 ## Setup ColBERT
 First, clone this repo and create a conda environment and install the dependencies:
 ```sh
@@ -10,7 +16,7 @@ git clone https://huggingface.co/davidheineman/colbert-acl
 pip install bibtexparser colbert-ir[torch,faiss-gpu]
 ```
 
-## Setup server
+## Setup MySQL server
 Install pip dependencies
 ```sh
 pip install mysql-connector-python flask openai pymongo[srv]
@@ -26,6 +32,13 @@ Run the database setup to copy the ACL entries:
 python init_db.py
 ```
 
+## Setup MongoDB server
+First, make sure you have an OpenAI and a MongoDB API key:
+```sh
+echo [OPEN_AI_KEY] > .openai-secret
+echo [MONGO_DB_KEY] > .mongodb-secret
+```
+
 ### (Optional) Step 1: Parse the Anthology
 
 Feel free to skip steps 1 and 2, since the parsed/indexed anthology is contained in this repo.
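The two-stage retrieval the README describes (a cheap embedding pass to preselect top-k candidates, then an expensive ColBERT re-ranking over only those candidates) can be sketched with toy vectors. The function names and numbers here are illustrative, not the repo's actual API:

```python
import math

def cosine(a, b):
    # Cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def preselect_top_k(query_vec, passage_vecs, k):
    # Cheap first stage: rank all passages by embedding similarity
    # and keep only the k best for the expensive re-ranker.
    ranked = sorted(range(len(passage_vecs)),
                    key=lambda i: cosine(query_vec, passage_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy example: three passage embeddings, one query embedding.
passages = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
query = [1.0, 0.1]
candidates = preselect_top_k(query, passages, k=2)
print(candidates)  # → [2, 0]: the two passages most similar to the query
```

Only these `k` candidates would then be passed to ColBERT for late-interaction scoring, which keeps the expensive step bounded regardless of datastore size.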
knn_db_access.py
CHANGED
@@ -7,8 +7,7 @@ OPENAI = QueryEmbedder()
 
 USER = "test"
 SERVER = "dbbackend.c9tcfpp"
-with open('.mongodb-secret', 'r') as f:
-    PASS = f.read()
+with open('.mongodb-secret', 'r') as f: PASS = f.read()
 
 
 class MongoDBAccess:
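The `USER`/`PASS`/`SERVER` values above are presumably assembled into a MongoDB connection URI elsewhere in the class. A minimal sketch of that assembly (the hostname here is a hypothetical stand-in, and credentials are percent-escaped with `urllib.parse.quote_plus`, as the pymongo documentation recommends):

```python
from urllib.parse import quote_plus

def mongo_uri(user, password, server):
    # Percent-escape credentials so characters like '@' or ':' in the
    # password do not break URI parsing.
    return f"mongodb+srv://{quote_plus(user)}:{quote_plus(password)}@{server}/"

# Hypothetical values; the real password comes from .mongodb-secret.
uri = mongo_uri("test", "p@ss:word", "example.mongodb.net")
print(uri)  # → mongodb+srv://test:p%40ss%3Aword@example.mongodb.net/
```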
openai_embed.py
CHANGED
@@ -1,8 +1,7 @@
 from openai import OpenAI
 
 
-with open('.openai-secret', 'r') as f:
-    OPENAI_API_KEY = f.read()
+with open('.openai-secret', 'r') as f: OPENAI_API_KEY = f.read()
 
 
 class QueryEmbedder:
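One caveat with both secret-reading snippets: since the README writes the secrets with `echo`, `f.read()` returns the key with a trailing newline, which some clients reject. A hedged variant that strips surrounding whitespace (a common safeguard, not what this commit does):

```python
import os
import tempfile

def read_secret(path):
    # Read an API key from a file, stripping the trailing newline
    # that `echo` appends when the secret is written.
    with open(path, "r") as f:
        return f.read().strip()

# Simulate a secret file written with `echo KEY > file`.
with tempfile.NamedTemporaryFile("w", suffix="-secret", delete=False) as tmp:
    tmp.write("sk-example-key\n")  # hypothetical key value
key = read_secret(tmp.name)
os.unlink(tmp.name)
print(repr(key))  # → 'sk-example-key', no trailing newline
```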