{ "cells": [ { "cell_type": "markdown", "id": "6a151ade-7d86-4a2e-bfe7-462089f4e04c", "metadata": {}, "source": [ "# Approach\n", "Several aspects of choosing a vector database may be unique to your situation. Think through your hardware, utilization, latency requirements, scale, etc. before choosing. \n", "\n", "I'm targeting a demo (low utilization, relaxed latency) that will live on a Hugging Face Space. My dataset is small enough that it could even fit in memory. I like [Qdrant](https://qdrant.tech) for this. " ] }, { "cell_type": "markdown", "id": "b1b28232-b65d-41ce-88de-fd70b93a528d", "metadata": {}, "source": [ "# Imports" ] }, { "cell_type": "code", "execution_count": 1, "id": "88408486-566a-4791-8ef2-5ee3e6941156", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Display the value of every top-level expression in a cell, not just the last one\n", "from IPython.core.interactiveshell import InteractiveShell\n", "InteractiveShell.ast_node_interactivity = 'all'" ] }, { "cell_type": "code", "execution_count": 2, "id": "abb5186b-ee67-4e1e-882d-3d8d5b4575d4", "metadata": { "tags": [] }, "outputs": [], "source": [ "from pathlib import Path\n", "import pickle\n", "\n", "from tqdm.notebook import tqdm\n", "from haystack.schema import Document\n", "from qdrant_haystack import QdrantDocumentStore" ] }, { "cell_type": "code", "execution_count": 3, "id": "c4b82ea2-8b30-4c2e-99f0-9a30f2f1bfb7", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/home/ec2-user/RAGDemo\n" ] } ], "source": [ "proj_dir = Path.cwd().parent\n", "print(proj_dir)" ] }, { "cell_type": "markdown", "id": "76119e74-f601-436d-a253-63c5a19d1c83", "metadata": {}, "source": [ "# Config" ] }, { "cell_type": "code", "execution_count": 4, "id": "f6f74545-54a7-4f41-9f02-96964e1417f0", "metadata": { "tags": [] }, "outputs": [], "source": [ "file_in = proj_dir / 'data/processed/simple_wiki_embeddings.pkl'" ] }, { "cell_type": "markdown", "id": "d2dd0df0-4274-45b3-9ee5-0205494e4d75", "metadata": { "tags": [] }, "source": [ "# 
Setup\n", "Read in our list of dictionaries. Loading it takes ~10 GB of RAM, which is the upper end of what the machine I'm using can handle. We could easily do this in batches of ~100k documents and be fine on most machines. " ] }, { "cell_type": "code", "execution_count": 5, "id": "3c08e039-3686-4eca-9f87-7c469e3f19bc", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 11.6 s, sys: 2.25 s, total: 13.9 s\n", "Wall time: 18.1 s\n" ] } ], "source": [ "%%time\n", "with open(file_in, 'rb') as handle:\n", "    documents = pickle.load(handle)" ] }, { "cell_type": "markdown", "id": "98aec715-8d97-439e-99c0-0eff63df386b", "metadata": {}, "source": [ "Convert the dictionaries to `Document` objects." ] }, { "cell_type": "code", "execution_count": 6, "id": "4821e3c1-697d-4b69-bae3-300168755df9", "metadata": { "tags": [] }, "outputs": [], "source": [ "documents = [Document.from_dict(d) for d in documents]" ] }, { "cell_type": "markdown", "id": "676f644c-fb09-4d17-89ba-30c92aad8777", "metadata": {}, "source": [ "Instantiate our `DocumentStore`. Note that I'm saving this to disk for portability, which is useful because I want to move it from this EC2 instance into a Hugging Face Space. \n", "\n", "If you are doing this at scale, you should use a proper Qdrant instance rather than saving to a file. You should also take a [measured ingestion](https://qdrant.tech/documentation/tutorials/bulk-upload/) approach instead of using a convenient loader. 
" ] }, { "cell_type": "code", "execution_count": 7, "id": "e51b6e19-3be8-4cb0-8b65-9d6f6121f660", "metadata": { "tags": [] }, "outputs": [], "source": [ "document_store = QdrantDocumentStore(\n", " path=str(proj_dir/'Qdrant'),\n", " index=\"RAGDemo\",\n", " embedding_dim=768,\n", " recreate_index=True,\n", " hnsw_config={\"m\": 16, \"ef_construct\": 64} # Optional\n", ")" ] }, { "cell_type": "code", "execution_count": 9, "id": "55fbcd5d-922c-4e93-a37a-974ba84464ac", "metadata": { "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "270000it [28:43, 156.68it/s] " ] }, { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 13min 23s, sys: 48.6 s, total: 14min 12s\n", "Wall time: 28min 43s\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "%%time\n", "document_store.write_documents(documents, batch_size=5_000)" ] }, { "cell_type": "code", "execution_count": null, "id": "9a073815-0191-48f7-890f-a4e4ecc0f9f1", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.9" } }, "nbformat": 4, "nbformat_minor": 5 }