derek-thomas HF staff committed on
Commit 867c18d
1 Parent(s): c606718

Adding more info around LanceDB

Files changed (1)
  1. notebooks/05_vector_db.ipynb +40 -14
notebooks/05_vector_db.ipynb CHANGED
@@ -6,9 +6,25 @@
6
  "metadata": {},
7
  "source": [
8
  "# Approach\n",
 
9
  "There are a number of aspects of choosing a vector db that might be unique to your situation. You should think through your HW, utilization, latency requirements, scale, etc before choosing. \n",
10
  "\n",
11
- "Im targeting a demo (low utilization, latency can be relaxed) that will live on a huggingface space. I have a small scale that could even fit in memory. I like [Qdrant](https://qdrant.tech) for this. "
12
  ]
13
  },
14
  {
@@ -97,7 +113,7 @@
97
  },
98
  "source": [
99
  "# Setup\n",
100
- "Read in our list of dictionaries. This is the upper end for the machine Im using. This takes ~10GB of RAM. We could easily do this in batches of ~100k and be fine in most machines. "
101
  ]
102
  },
103
  {
@@ -115,14 +131,6 @@
115
  " document['vector'] = document.pop('embedding')"
116
  ]
117
  },
118
- {
119
- "cell_type": "markdown",
120
- "id": "98aec715-8d97-439e-99c0-0eff63df386b",
121
- "metadata": {},
122
- "source": [
123
- "Convert the dictionaries to `Documents`"
124
- ]
125
- },
126
  {
127
  "cell_type": "code",
128
  "execution_count": 6,
@@ -170,9 +178,7 @@
170
  "id": "676f644c-fb09-4d17-89ba-30c92aad8777",
171
  "metadata": {},
172
  "source": [
173
- "Instantiate our `DocumentStore`. Note that Im saving this to disk, this is for portability which is good considering I want to move from this ec2 instance into a Hugging Face Space. \n",
174
- "\n",
175
- "Note that if you are doing this at scale, you should use a proper instance and not saving to file. You should also take a [measured ingestion](https://qdrant.tech/documentation/tutorials/bulk-upload/) approach instead of using a convenient loader. "
176
  ]
177
  },
178
  {
@@ -187,11 +193,23 @@
187
  "from lancedb.embeddings.registry import EmbeddingFunctionRegistry\n",
188
  "from lancedb.embeddings.sentence_transformers import SentenceTransformerEmbeddings\n",
189
  "\n",
190
- "\n",
191
  "db = lancedb.connect(proj_dir/\".lancedb\")\n",
192
  "tbl = db.create_table('arabic-wiki', [document])"
193
  ]
194
  },
195
  {
196
  "cell_type": "code",
197
  "execution_count": 8,
@@ -818,6 +836,14 @@
818
  " "
819
  ]
820
  },
821
  {
822
  "cell_type": "code",
823
  "execution_count": 9,
 
6
  "metadata": {},
7
  "source": [
8
  "# Approach\n",
9
+ "## VectorDB\n",
10
  "There are a number of aspects of choosing a vector db that might be unique to your situation. You should think through your HW, utilization, latency requirements, scale, etc before choosing. \n",
11
  "\n",
12
+ "I've been hearing a lot about LanceDB and wanted to check it out. It's newer and may or may not be good for **your** use-case. I'm attracted by its fast ingestion, cuda assisted indexing, and portability. It has some drawbacks, it doesnt support hnsw yet and it could change significantly given how early it is.\n",
13
+ "\n",
14
+ "\n",
15
+ "You will be blown away on how fast ingestion + indexing is with LanceDB. \n",
16
+ "\n",
17
+ "## Ingestion Strategy\n",
18
+ "I used the ~100k document `.ndjson` files in sequence to upload. After uploading I index.\n",
19
+ "\n",
20
+ "## Indexing\n",
21
+ "The algorithm used is `IVF_PQ`. I ignore the `PQ` part because I want better recall. Recall is important since Jais only has a 2k context window, I can't put my top 10 documents for RAG in my prompt. It will be my top 3 (512\\*3 + query + instructions ~ 2k). For many use-cases its worth the trade-off as you get much faster retrieval with not much performance loss. \n",
22
+ "\n",
23
+ "More partitions means faster retrieval but slower indexing. I chose 384 sub_vectors to be equal to my embedding dimension size. \n",
24
+ "\n",
25
+ "```tbl.create_index(num_partitions=1024, num_sub_vectors=384, accelerator=\"cuda\")```\n",
26
+ "\n",
27
+ "Read more about it [here](https://lancedb.github.io/lancedb/ann_indexes/)."
28
  ]
29
  },
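To make the recall trade-off concrete: with an `IVF_PQ` index, recall can also be tuned at query time. A minimal sketch, not from the notebook, assuming a populated `tbl` and a 384-dim `query_vector`; the `nprobes` and `refine_factor` knobs come from the LanceDB ANN docs linked above:

```python
# Illustrative sketch (assumptions: `tbl` is the populated LanceDB table,
# `query_vector` is a 384-dim embedding of the user's query).
results = (
    tbl.search(query_vector)
       .nprobes(20)          # probe more IVF partitions: better recall, slower
       .refine_factor(10)    # re-rank extra candidates with exact distances
       .limit(3)             # top 3 documents fit Jais's 2k context window
       .to_list()
)
```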
30
  {
 
113
  },
114
  "source": [
115
  "# Setup\n",
116
+ "To work with LanceDB we want to create the table before ingesting the first batch. To create a table we need at least 1 doc."
117
  ]
118
  },
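A minimal sketch of preparing that first document, assuming each `.ndjson` file holds one JSON object per line with an `embedding` field (the file path here is hypothetical):

```python
import json

# Hypothetical path; the notebook iterates over its ~100k-document .ndjson files.
with open("data/chunk_000.ndjson") as f:
    document = json.loads(f.readline())

# LanceDB expects the vector column to be named 'vector'.
document["vector"] = document.pop("embedding")
```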
119
  {
 
131
  " document['vector'] = document.pop('embedding')"
132
  ]
133
  },
134
  {
135
  "cell_type": "code",
136
  "execution_count": 6,
 
178
  "id": "676f644c-fb09-4d17-89ba-30c92aad8777",
179
  "metadata": {},
180
  "source": [
181
+ "Here we create the db and the table."
182
  ]
183
  },
184
  {
 
193
  "from lancedb.embeddings.registry import EmbeddingFunctionRegistry\n",
194
  "from lancedb.embeddings.sentence_transformers import SentenceTransformerEmbeddings\n",
195
  "\n",
196
  "db = lancedb.connect(proj_dir/\".lancedb\")\n",
197
  "tbl = db.create_table('arabic-wiki', [document])"
198
  ]
199
  },
200
+ {
201
+ "cell_type": "markdown",
202
+ "id": "502f7cb9-32cf-4b32-8cb3-b021e02bd06c",
203
+ "metadata": {},
204
+ "source": [
205
+ "For each file we:\n",
206
+ "- Read the `ndjson` into a list of documents\n",
207
+ "- Replace 'embedding' with 'vector' to be compatible with LanceDB\n",
208
+ "- Write the docs to the table\n",
209
+ "\n",
210
+ "After that we index with a cuda accelerator."
211
+ ]
212
+ },
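A sketch of what that loop could look like; the file glob is an assumption, while `tbl.add` and the `create_index` call mirror the notebook:

```python
import json
from pathlib import Path

# Hypothetical location of the ~100k-document .ndjson files.
for path in sorted(Path("data").glob("*.ndjson")):
    with open(path) as f:
        docs = [json.loads(line) for line in f]
    # Rename 'embedding' -> 'vector' for LanceDB compatibility.
    for doc in docs:
        doc["vector"] = doc.pop("embedding")
    tbl.add(docs)

# Index once, after all batches are written, with CUDA acceleration.
tbl.create_index(num_partitions=1024, num_sub_vectors=384, accelerator="cuda")
```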
213
  {
214
  "cell_type": "code",
215
  "execution_count": 8,
 
836
  " "
837
  ]
838
  },
839
+ {
840
+ "cell_type": "markdown",
841
+ "id": "179af522-84ca-4985-9ca4-ffd1bde487eb",
842
+ "metadata": {},
843
+ "source": [
844
+ "It's crazy how fast it was. 42minutes to ingest and index >2M documents. Lets run a test to make sure it worked!"
845
+ ]
846
+ },
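The test cell itself is truncated in this diff; a sketch of what such a sanity check might look like, assuming a 384-dim sentence-transformers model compatible with the ingestion embeddings (the model name and query are placeholders):

```python
from sentence_transformers import SentenceTransformer

# Placeholder model: any encoder producing the same 384-dim space works.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

query_vector = model.encode("صلاح الدين الأيوبي")  # sample Arabic query

for hit in tbl.search(query_vector).limit(3).to_list():
    hit.pop("vector", None)  # drop the raw embedding for readability
    print(hit)
```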
847
  {
848
  "cell_type": "code",
849
  "execution_count": 9,