{ "cells": [ { "cell_type": "markdown", "id": "6a151ade-7d86-4a2e-bfe7-462089f4e04c", "metadata": {}, "source": [ "# Approach\n", "## VectorDB\n", "Choosing a vector DB depends on several factors that may be unique to your situation. Think through your hardware, utilization, latency requirements, scale, etc. before choosing.\n", "\n", "I've been hearing a lot about LanceDB and wanted to check it out. It's newer and may or may not be good for **your** use-case. I'm attracted by its fast ingestion, CUDA-assisted indexing, and portability. It has some drawbacks: it doesn't support HNSW yet, and it could change significantly given how early it is.\n", "\n", "\n", "You will be blown away by how fast ingestion + indexing is with LanceDB.\n", "\n", "## Ingestion Strategy\n", "I upload the `.ndjson` files (~100k documents each) in sequence. After everything is uploaded, I build the index.\n", "\n", "## Indexing\n", "The algorithm used is `IVF_PQ`. I effectively ignore the `PQ` part because I want better recall. Recall is important because Jais only has a 2k context window, so I can't put my top 10 documents for RAG in my prompt; it will be my top 3 (512\\*3 + query + instructions ~ 2k). For many use-cases `PQ` is worth the trade-off, as you get much faster retrieval with little loss in recall.\n", "\n", "More partitions means faster retrieval but slower indexing. I set `num_sub_vectors` to 384, equal to my embedding dimension, so each sub-vector covers a single dimension and the quantization stays as fine-grained as possible.\n", "\n", "```tbl.create_index(num_partitions=1024, num_sub_vectors=384, accelerator=\"cuda\")```\n", "\n", "Read more about it [here](https://lancedb.github.io/lancedb/ann_indexes/)."
] }, { "cell_type": "markdown", "id": "b1b28232-b65d-41ce-88de-fd70b93a528d", "metadata": {}, "source": [ "# Imports" ] }, { "cell_type": "code", "execution_count": 1, "id": "88408486-566a-4791-8ef2-5ee3e6941156", "metadata": { "tags": [] }, "outputs": [], "source": [ "from IPython.core.interactiveshell import InteractiveShell\n", "InteractiveShell.ast_node_interactivity = 'all'" ] }, { "cell_type": "code", "execution_count": 2, "id": "abb5186b-ee67-4e1e-882d-3d8d5b4575d4", "metadata": { "tags": [] }, "outputs": [], "source": [ "from pathlib import Path\n", "import json\n", "\n", "from tqdm.notebook import tqdm\n", "import lancedb" ] }, { "cell_type": "code", "execution_count": 3, "id": "c4b82ea2-8b30-4c2e-99f0-9a30f2f1bfb7", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/home/ec2-user/arabic-wiki\n" ] } ], "source": [ "proj_dir = Path.cwd().parent\n", "print(proj_dir)" ] }, { "cell_type": "markdown", "id": "76119e74-f601-436d-a253-63c5a19d1c83", "metadata": {}, "source": [ "# Config" ] }, { "cell_type": "code", "execution_count": 4, "id": "f6f74545-54a7-4f41-9f02-96964e1417f0", "metadata": { "tags": [] }, "outputs": [], "source": [ "files_in = list((proj_dir / 'data/embedded/').glob('*.ndjson'))" ] }, { "cell_type": "markdown", "id": "d2dd0df0-4274-45b3-9ee5-0205494e4d75", "metadata": { "tags": [] }, "source": [ "# Setup\n", "To work with LanceDB we want to create the table before ingesting the first batch. To create a table we need at least 1 doc." 
] }, { "cell_type": "code", "execution_count": 5, "id": "3c08e039-3686-4eca-9f87-7c469e3f19bc", "metadata": { "tags": [] }, "outputs": [], "source": [ "with open(files_in[0], 'r') as f:\n", " first_line = f.readline().strip() # read only the first line\n", " document = json.loads(first_line)\n", " document['vector'] = document.pop('embedding')" ] }, { "cell_type": "code", "execution_count": 6, "id": "4821e3c1-697d-4b69-bae3-300168755df9", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "{'content': 'الماء مادةٌ شفافةٌ عديمة اللون والرائحة، وهو المكوّن الأساسي للجداول والبحيرات والبحار والمحيطات وكذلك للسوائل في جميع الكائنات الحيّة، وهو أكثر المركّبات الكيميائيّة انتشاراً على سطح الأرض. يتألّف جزيء الماء من ذرّة أكسجين مركزية ترتبط بها ذرّتا هيدروجين على طرفيها برابطة تساهميّة بحيث تكون صيغته الكيميائية H2O. عند الظروف القياسية من الضغط ودرجة الحرارة يكون الماء سائلاً؛ أمّا الحالة الصلبة فتتشكّل عند نقطة التجمّد، وتدعى بالجليد؛ أمّا الحالة الغازية فتتشكّل عند نقطة الغليان، وتسمّى بخار الماء.\\nإنّ الماء هو أساس وجود الحياة على كوكب الأرض، وهو يغطّي 71% من سطحها، وتمثّل مياه البحار والمحيطات أكبر نسبة للماء على الأرض، حيث تبلغ حوالي 96.5%. وتتوزّع النسب الباقية بين المياه الجوفيّة وبين جليد المناطق القطبيّة (1.7% لكليهما)، مع وجود نسبة صغيرة على شكل بخار ماء معلّق في الهواء على هيئة سحاب (غيوم)، وأحياناً أخرى على هيئة ضباب أو ندى، بالإضافة إلى الزخات المطريّة أو الثلجيّة. 
تبلغ نسبة الماء العذب حوالي 2.5% فقط من الماء الموجود على الأرض، وأغلب هذه الكمّيّة (حوالي 99%) موجودة في الكتل الجليديّة في المناطق القطبيّة، في حين تتواجد 0.3% من الماء العذب في الأنهار والبحيرات وفي الغلاف الجوّي.\\nأما في الطبيعة، فتتغيّر حالة الماء بين الحالات الثلاثة للمادة على سطح الأرض باستمرار من خلال ما يعرف باسم الدورة المائيّة (أو دورة الماء)، والتي تتضمّن حدوث تبخّر ونتح (نتح تبخّري) ثم تكثيف فهطول ثم جريان لتصل إلى المصبّ في المسطّحات المائيّة.\\n',\n", " 'content_type': 'text',\n", " 'score': None,\n", " 'meta': {'id': '7',\n", " 'revid': '2080427',\n", " 'url': 'https://ar.wikipedia.org/wiki?curid=7',\n", " 'title': 'ماء',\n", " '_split_id': 0,\n", " '_split_overlap': [{'doc_id': '725ec671057ef790ad582509a8653584',\n", " 'range': [887, 1347]}]},\n", " 'id_hash_keys': ['content'],\n", " 'id': '109a29bb227b1aaa5b784e972d8e1e3e',\n", " 'vector': [-0.07318115,\n", " 0.087646484,\n", " 0.03274536,\n", " 0.034942627,\n", " 0.097961426,\n", " '...']}" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "doc = document.copy()\n", "doc['vector'] = doc['vector'][:5] + ['...']\n", "doc" ] }, { "cell_type": "markdown", "id": "676f644c-fb09-4d17-89ba-30c92aad8777", "metadata": {}, "source": [ "Here we create the db and the table." 
] }, { "cell_type": "code", "execution_count": 7, "id": "78033b87-8f68-4a44-899e-36fa8167cacf", "metadata": { "tags": [] }, "outputs": [], "source": [ "from lancedb.embeddings.registry import EmbeddingFunctionRegistry\n", "from lancedb.embeddings.sentence_transformers import SentenceTransformerEmbeddings\n", "\n", "db = lancedb.connect(proj_dir/\".lancedb\")\n", "tbl = db.create_table('arabic-wiki', [document])" ] }, { "cell_type": "markdown", "id": "502f7cb9-32cf-4b32-8cb3-b021e02bd06c", "metadata": {}, "source": [ "For each file we:\n", "- Read the `ndjson` into a list of documents\n", "- Rename the `embedding` key to `vector` to be compatible with LanceDB\n", "- Write the docs to the table\n", "\n", "After that we build the index with the CUDA accelerator." ] }, { "cell_type": "code", "execution_count": 8, "id": "21d5fa58-519e-4a23-9fc6-eed31e4723b5", "metadata": { "tags": [] }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "789efc342218412aa31d5a5a74b34c52", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Wiki Files: 0%| | 0/23 [00:002M documents. Let's run a test to make sure it worked!"
] }, { "cell_type": "code", "execution_count": 9, "id": "8ad72ca5-6ca3-43e3-bf2c-7461906576b9", "metadata": { "tags": [] }, "outputs": [], "source": [ "from sentence_transformers import SentenceTransformer\n", "\n", "name=\"sentence-transformers/paraphrase-multilingual-minilm-l12-v2\"\n", "model = SentenceTransformer(name)\n", "\n", "# used for both training and querying\n", "def embed_func(batch):\n", " return [model.encode(sentence) for sentence in batch]" ] }, { "cell_type": "code", "execution_count": 11, "id": "41ab5a84-8984-4726-acd8-57ca0fce9e76", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "['بكين',\n", " 'كونمينغ',\n", " 'نينغشيا',\n", " 'تاي يوان',\n", " 'تشنغتشو',\n", " 'شانغهاي',\n", " 'سنغافورة',\n", " 'دلتا نهر يانغتسي',\n", " 'تشانغتشون',\n", " 'بكين']" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = \"What is the capital of China? I think it's Singapore.\"\n", "query_vector = embed_func([query])[0]\n", "[doc['meta']['title'] for doc in tbl.search(query_vector).limit(10).to_list()]" ] }, { "cell_type": "code", "execution_count": null, "id": "c0abad86-652a-4d7d-b118-21dc23a7a5c5", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.13" } }, "nbformat": 4, "nbformat_minor": 5 }