{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "a0f21cb1-fbc8-4282-b902-f47d92974df8",
   "metadata": {},
   "source": [
    "# Pre-requisites"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5f625807-0707-4e2f-a0e0-8fbcdf08c865",
   "metadata": {},
   "source": [
    "## Why TEI\n",
    "There are 2 **unsung** challenges with RAG at scale:\n",
    "1. Getting the embeddings efficiently\n",
    "1. Efficient ingestion into the vector DB\n",
    "\n",
    "The issue with `1.` is that efficient techniques exist, but they are not widely *applied*. TEI addresses several of them:\n",
    "- Token-based dynamic batching\n",
    "- Using the latest optimizations (Flash Attention, Candle and cuBLASLt)\n",
    "- Fast loading with safetensors\n",
    "\n",
    "The issue with `2.` is that it takes a bit of planning. We won't go much into that side of things here though."
   ]
  },
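  {
   "cell_type": "markdown",
   "id": "8c2f4b6e-1d3a-4f7b-9e05-6a1b2c3d4e5f",
   "metadata": {},
   "source": [
    "To make token-based dynamic batching more concrete, here is a minimal sketch of the idea. It is not TEI's actual implementation: `batch_by_token_budget` is a hypothetical helper, and counting whitespace-separated words stands in for a real tokenizer. The point is only that a batch is capped by its total number of tokens rather than by a fixed number of documents.\n",
    "\n",
    "```python\n",
    "def batch_by_token_budget(texts, max_batch_tokens, count_tokens=lambda t: len(t.split())):\n",
    "    # Hypothetical helper: greedily pack texts into batches whose\n",
    "    # total token count stays within max_batch_tokens.\n",
    "    batches, current, current_tokens = [], [], 0\n",
    "    for text in texts:\n",
    "        n_tokens = count_tokens(text)\n",
    "        # Start a new batch if adding this text would exceed the budget.\n",
    "        # A single text longer than the budget still gets its own batch.\n",
    "        if current and current_tokens + n_tokens > max_batch_tokens:\n",
    "            batches.append(current)\n",
    "            current, current_tokens = [], 0\n",
    "        current.append(text)\n",
    "        current_tokens += n_tokens\n",
    "    if current:\n",
    "        batches.append(current)\n",
    "    return batches\n",
    "\n",
    "\n",
    "texts = ['a b c', 'd e', 'f g h i j', 'k']\n",
    "print(batch_by_token_budget(texts, max_batch_tokens=6))\n",
    "```\n",
    "\n",
    "With a budget of `6`, the first two texts (3 + 2 tokens) share a batch, and the third (5 tokens) starts a new one that still has room for the last text. The `--max-batch-tokens` flag in the Docker command below plays a similar role on the server side."
   ]
  },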
  {
   "cell_type": "markdown",
   "id": "3102abce-ea42-4da6-8c98-c6dd4edf7f0b",
   "metadata": {},
   "source": [
    "## Start TEI Locally\n",
    "Run [TEI](https://github.com/huggingface/text-embeddings-inference#docker). I have it running in an nvidia-docker container, but you can install it as you like. Note that I ran this in a different terminal for monitoring and separation.\n",
    "\n",
    "Note that because of `--pull always`, each run pulls the latest image. TEI is at a very early stage at the time of writing.\n",
    "\n",
    "I chose [sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2) based on the STS ar-ar performance on [mteb/leaderboard](https://huggingface.co/spaces/mteb/leaderboard): it's the top performer and half the size of second place! TEI is fast, but a smaller model will make our life easier for storage and retrieval.\n",
    "\n",
    "I use `revision=refs/pr/8` because this pull request adds [safetensors](https://github.com/huggingface/safetensors) weights, which TEI requires. Check out the [pull request](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2/discussions/8) if you want to use a different embedding model and it doesn't have safetensors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "7e873652-8257-4aae-92bc-94e1bac54b73",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "# volume=$PWD/tei\n",
    "# model=sentence-transformers/paraphrase-multilingual-minilm-l12-v2\n",
    "# revision=refs/pr/8\n",
    "# docker run \\\n",
    "#     --gpus all \\\n",
    "#     -p 8080:80 \\\n",
    "#     -v $volume:/data \\\n",
    "#     -v /home/ec2-user/.cache/huggingface/token:/root/.cache/huggingface/token \\\n",
    "#     --pull always \\\n",
    "#     ghcr.io/huggingface/text-embeddings-inference:latest \\\n",
    "#     --model-id $model \\\n",
    "#     --revision $revision \\\n",
    "#     --pooling mean \\\n",
    "#     --max-batch-tokens 65536"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "51959ef4-186e-4a32-826a-731813eaf593",
   "metadata": {},
   "source": [
    "### Test Endpoint"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "52edfc97-5b6f-44f9-8d89-8578cf79fae9",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "# response_code=$(curl -s -o /dev/null -w \"%{http_code}\" 127.0.0.1:8080/embed \\\n",
    "#     -X POST \\\n",
    "#     -d '{\"inputs\":\"What is Deep Learning?\"}' \\\n",
    "#     -H 'Content-Type: application/json')\n",
    "\n",
    "# if [ \"$response_code\" -eq 200 ]; then\n",
    "#     echo \"passed\"\n",
    "# else\n",
    "#     echo \"failed\"\n",
    "# fi"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e9d6b54a-02bd-49aa-b180-27a7ab90154e",
   "metadata": {},
   "source": [
    "## Start TEI with Inference Endpoints\n",
    "Another option is to run TEI on [Inference Endpoints](https://huggingface.co/inference-endpoints). It's cheap and fast. It took me less than 5 minutes to get it up and running!\n",
    "\n",
    "Check here for a [comprehensive guide](https://huggingface.co/blog/inference-endpoints-embeddings#3-deploy-embedding-model-as-inference-endpoint). Make sure to set these options **IN ORDER**:\n",
    "1. Model Repository = `sentence-transformers/paraphrase-multilingual-minilm-l12-v2`\n",
    "1. Name your endpoint\n",
    "1. Choose a GPU. I chose `Nvidia A10G`, which is **$1.3/hr**.\n",
    "1. Advanced Configuration\n",
    "    1. Task = `Sentence Embeddings`\n",
    "    1. Revision (based on [this pull request for safetensors](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2/discussions/8)) = `a21e6630`\n",
    "    1. Container Type = `Text Embeddings Inference`\n",
    "    \n",
    "Set the other options as you prefer."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ec78c98a-6b7b-4689-8ef8-582c3fcdf66e",
   "metadata": {},
   "source": [
    "### Test Endpoint"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "a69e2ee1-67f2-4f0a-b496-02f5415a52ca",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdin",
     "output_type": "stream",
     "text": [
      "What is your API_URL? ········\n",
      "What is your BEARER TOKEN? Check your endpoint. ········\n"
     ]
    }
   ],
   "source": [
    "import getpass\n",
    "API_URL = getpass.getpass(prompt='What is your API_URL?')\n",
    "bearer_token = getpass.getpass(prompt='What is your BEARER TOKEN? Check your endpoint.')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "949d6bf8-804f-496b-a59a-834483cc7073",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Constants\n",
    "HEADERS = {\n",
    "\t\"Authorization\": f\"Bearer {bearer_token}\",\n",
    "\t\"Content-Type\": \"application/json\"\n",
    "}\n",
    "MAX_WORKERS = 512"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "d00b4af1-8fbc-4f7a-8a78-e1c52dd77a66",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0.0047912598, -0.03164673, -0.018051147, -0.057739258, -0.04498291]...\n"
     ]
    }
   ],
   "source": [
    "import requests\n",
    "\n",
    "\n",
    "def query(payload):\n",
    "\tresponse = requests.post(API_URL, headers=HEADERS, json=payload)\n",
    "\treturn response.json()\n",
    "\t\n",
    "output = query({\n",
    "\t\"inputs\": \"This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music!\",\n",
    "})\n",
    "print(f'{output[0][:5]}...')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b1b28232-b65d-41ce-88de-fd70b93a528d",
   "metadata": {},
   "source": [
    "# Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "abb5186b-ee67-4e1e-882d-3d8d5b4575d4",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import asyncio\n",
    "from pathlib import Path\n",
    "import json\n",
    "import time\n",
    "\n",
    "\n",
    "from aiohttp import ClientSession, ClientTimeout\n",
    "from tqdm.notebook import tqdm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "c4b82ea2-8b30-4c2e-99f0-9a30f2f1bfb7",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/home/ec2-user/arabic-wiki\n"
     ]
    }
   ],
   "source": [
    "proj_dir = Path.cwd().parent\n",
    "print(proj_dir)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "76119e74-f601-436d-a253-63c5a19d1c83",
   "metadata": {},
   "source": [
    "# Config"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "f6f74545-54a7-4f41-9f02-96964e1417f0",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "files_in = list((proj_dir / 'data/processed/').glob('*.ndjson'))\n",
    "folder_out = proj_dir / 'data/embedded/'\n",
    "folder_out.mkdir(parents=True, exist_ok=True)  # make sure the output folder exists\n",
    "folder_out_str = str(folder_out)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5e73235d-6274-4958-9e57-977afeeb5f1b",
   "metadata": {},
   "source": [
    "# Embed\n",
    "## Strategy\n",
    "TEI allows multiple concurrent requests, so it's important that we don't waste the potential we have. I used the default `max-concurrent-requests` value of `512`, so I want to use that many `MAX_WORKERS`.\n",
    "\n",
    "I'm making requests asynchronously with `aiohttp`, along with a nice progress bar."
   ]
  },
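  {
   "cell_type": "markdown",
   "id": "2a9d7c41-5e8b-4b36-a1f0-7d4c3b2a1e09",
   "metadata": {},
   "source": [
    "The concurrency limit can be tried in isolation before pointing it at a real endpoint. The sketch below is a toy under stated assumptions: `fake_request` and `run_bounded` are hypothetical names, and `asyncio.sleep` stands in for the HTTP call. It only shows that a semaphore keeps the number of in-flight requests at or below the chosen limit.\n",
    "\n",
    "```python\n",
    "import asyncio\n",
    "\n",
    "\n",
    "async def fake_request(i, semaphore, state):\n",
    "    # Stand-in for an HTTP call; only the concurrency limit is of interest.\n",
    "    async with semaphore:\n",
    "        state['active'] += 1\n",
    "        state['peak'] = max(state['peak'], state['active'])\n",
    "        await asyncio.sleep(0.01)\n",
    "        state['active'] -= 1\n",
    "        return i\n",
    "\n",
    "\n",
    "async def run_bounded(n_tasks, limit):\n",
    "    # The semaphore is created inside the running event loop.\n",
    "    semaphore = asyncio.Semaphore(limit)\n",
    "    state = {'active': 0, 'peak': 0}\n",
    "    coros = (fake_request(i, semaphore, state) for i in range(n_tasks))\n",
    "    # gather returns results in the order the coroutines were passed in.\n",
    "    results = await asyncio.gather(*coros)\n",
    "    return results, state['peak']\n",
    "```\n",
    "\n",
    "In a notebook you can run `results, peak = await run_bounded(20, 4)`; in a plain script, wrap the call in `asyncio.run(...)`. The reported peak should never exceed the limit."
   ]
  },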
  {
   "cell_type": "markdown",
   "id": "cf3da8cc-1651-4704-9091-39c2a1b835be",
   "metadata": {},
   "source": [
    "Note that I'm using `'truncate':True` because, even with our `350` word split earlier, there are always exceptions. It's important that as this scales we have as few issues as possible when embedding."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "e455dd52-aad3-4313-8738-03141ee5152a",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "async def request(document, semaphore):\n",
    "    # Semaphore guard\n",
    "    async with semaphore:\n",
    "        payload = {\n",
    "            \"inputs\": document['content'],\n",
    "            \"truncate\": True\n",
    "        }\n",
    "        \n",
    "        timeout = ClientTimeout(total=10)  # Set a timeout for requests (10 seconds here)\n",
    "\n",
    "        async with ClientSession(timeout=timeout, headers=HEADERS) as session:\n",
    "            async with session.post(API_URL, json=payload) as resp:\n",
    "                if resp.status != 200:\n",
    "                    raise RuntimeError(await resp.text())\n",
    "                result = await resp.json()\n",
    "                \n",
    "        document['embedding'] = result[0]  # One input was sent, so take the first (and only) embedding\n",
    "        return document\n",
    "\n",
    "async def main(documents):\n",
    "    # Semaphore to limit concurrent requests to MAX_WORKERS (defined with the constants above).\n",
    "    semaphore = asyncio.BoundedSemaphore(MAX_WORKERS)\n",
    "\n",
    "    # Creating a list of tasks\n",
    "    tasks = [request(document, semaphore) for document in documents]\n",
    "    \n",
    "    # Using tqdm to show progress. It's been integrated into the async loop.\n",
    "    for f in tqdm(asyncio.as_completed(tasks), total=len(documents)):\n",
    "        await f\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "f0d17264-72dc-40be-aa46-17cde38c8189",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "c4b7384336ad4c39a417a54a5a00a4ad",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "0it [00:00, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "0b034dc636df440594550f56dc152c8b",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/243068 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch 1: Embeddings = 243068 documents = 243068\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "0203531009644b75abb22725a38b3ace",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/104065 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch 2: Embeddings = 104065 documents = 104065\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "a4c781089c42466ba380b0b598b2f9e6",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/123154 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch 3: Embeddings = 123154 documents = 123154\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "66a0feea106145a0aadeb64fab48b6f8",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/135965 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch 4: Embeddings = 135965 documents = 135965\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "c6976832a78e48c5be335c5fef14bb5d",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/99138 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch 5: Embeddings = 99138 documents = 99138\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "764ceac837b040a39c2541074386e1f6",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/83678 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch 6: Embeddings = 83678 documents = 83678\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "6d268907dd9844cd8f81f48f8568f576",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/30573 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch 7: Embeddings = 30573 documents = 30573\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "53bb70b332774c3d867cdb1cb3c48958",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/78957 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch 8: Embeddings = 78957 documents = 78957\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "525ae3cf63af47b2acad508cb3c3efb7",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/86327 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch 9: Embeddings = 86327 documents = 86327\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "1a4c5103a1184ca1999b452c716131be",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/83111 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch 10: Embeddings = 83111 documents = 83111\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "9a06e2e21c6d4d12a04a55ca746594a4",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/92664 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch 11: Embeddings = 92664 documents = 92664\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "933d457b2f4f4f1fa3d20b469dc22d75",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/66404 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch 12: Embeddings = 66404 documents = 66404\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "0e4ce5ea591f431ca1ba6497ccf82b84",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/62844 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch 13: Embeddings = 62844 documents = 62844\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "1d267ec29d864694b9f89fbf15e3e34a",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/59349 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch 14: Embeddings = 59349 documents = 59349\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "bfb1ecea3b2143c1916beb446201fe7f",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/52554 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch 15: Embeddings = 52554 documents = 52554\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "cdb83413a46e4ba984eb261994d05cd3",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/34240 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch 16: Embeddings = 34240 documents = 34240\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "a6cb8095952f4db3ab6d31219c21087e",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/35933 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch 17: Embeddings = 35933 documents = 35933\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "0699bd10530c4a34aaaf9e88523ad5e6",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/64575 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch 18: Embeddings = 64575 documents = 64575\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "736895c24eb84a8a9f514d99c628bdc7",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/94244 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch 19: Embeddings = 94244 documents = 94244\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "632c1bfc4370488ab977bedc8c31d404",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/124472 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch 20: Embeddings = 124472 documents = 124472\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "d2c550797b5f4444b91b954c3f3958b1",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/121849 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch 21: Embeddings = 121849 documents = 121849\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "8918701c109f4ecdbb1d73e5fe97d6b5",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/147110 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch 22: Embeddings = 147110 documents = 147110\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "c69c13c3a1354f5c900d268500ffcb00",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/70322 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch 23: Embeddings = 70322 documents = 70322\n",
      "104 min 32.33 sec\n"
     ]
    }
   ],
   "source": [
    "start = time.perf_counter()\n",
    "for i, file_in in enumerate(tqdm(files_in)):\n",
    "\n",
    "    with open(file_in, 'r', encoding='utf-8') as f:\n",
    "        documents = [json.loads(line) for line in f]\n",
    "        \n",
    "    # Get embeddings\n",
    "    await main(documents)\n",
    "        \n",
    "    # Make sure we got it all\n",
    "    count = 0\n",
    "    for document in documents:\n",
    "        if document['embedding'] and len(document['embedding']) == 384:\n",
    "            count += 1\n",
    "    print(f'Batch {i+1}: Embeddings = {count} documents = {len(documents)}')\n",
    "\n",
    "    # Write to file\n",
    "    with open(folder_out/file_in.name, 'w', encoding='utf-8') as f:\n",
    "        for document in documents:\n",
    "            json_str = json.dumps(document, ensure_ascii=False)\n",
    "            f.write(json_str + '\\n')\n",
    "            \n",
    "# Print elapsed time\n",
    "elapsed_time = time.perf_counter() - start\n",
    "minutes, seconds = divmod(elapsed_time, 60)\n",
    "print(f\"{int(minutes)} min {seconds:.2f} sec\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3f0d9e6d-68f2-4086-9bcc-ffb27971fd63",
   "metadata": {},
   "source": [
    "Let's make sure that we still have all our documents:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "abc6dccc-0e5c-45e2-a269-b9f02cff2d05",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/home/ec2-user/arabic-wiki/data/embedded\n",
      "2094596\n"
     ]
    }
   ],
   "source": [
    "!echo \"$folder_out_str\" && cat \"$folder_out_str\"/*.ndjson | wc -l"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "93d6ab01-bd3b-479d-918d-2bdb30b00fac",
   "metadata": {},
   "source": [
    "# Performance and Cost Analysis\n",
    "You can see that we are quite cost effective!\n",
    "\n",
    "![Cost](https://huggingface.co/spaces/derek-thomas/arabic-RAG/resolve/main/media/arabic-rag-embeddings-cost.png)\n",
    "\n",
    "Note that the performance metrics cover only the last 30-minute window.\n",
    "\n",
    "Observations:\n",
    "- We have a throughput of `~333/s`\n",
    "- Our median latency per request is `~50ms`\n",
    "\n",
    "![Metrics](https://huggingface.co/spaces/derek-thomas/arabic-RAG/resolve/main/media/arabic-rag-embeddings-metrics.png)"
   ]
  },
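  {
   "cell_type": "markdown",
   "id": "4e1b6f93-7a2c-4d58-b0e6-9c8d7e6f5a41",
   "metadata": {},
   "source": [
    "As a rough cross-check, the overall throughput and GPU cost can be estimated from numbers already in this notebook: the document count, the elapsed time printed by the embedding loop, and the hourly price of the `Nvidia A10G`. This is a back-of-the-envelope estimate, assuming the endpoint is billed only for the duration of the run and ignoring startup and idle time.\n",
    "\n",
    "```python\n",
    "# Figures taken from this notebook's outputs and the endpoint setup above.\n",
    "total_documents = 2_094_596          # line count of the embedded files\n",
    "elapsed_seconds = 104 * 60 + 32.33   # '104 min 32.33 sec'\n",
    "price_per_hour = 1.3                 # Nvidia A10G price quoted above (USD)\n",
    "\n",
    "# Average documents embedded per second over the whole run.\n",
    "throughput = total_documents / elapsed_seconds\n",
    "\n",
    "# Assumption: billing is proportional to the run time only.\n",
    "estimated_cost = price_per_hour * elapsed_seconds / 3600\n",
    "\n",
    "print(f'{throughput:.0f} documents/s, about ${estimated_cost:.2f} of GPU time')\n",
    "```\n",
    "\n",
    "The average over the whole run comes out close to the `~333/s` seen in the metrics window."
   ]
  },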
  {
   "cell_type": "markdown",
   "id": "fc1e7cc5-b878-42bb-9fb4-e810f3f5006a",
   "metadata": {
    "tags": []
   },
   "source": [
    "# Next Steps\n",
    "We need to import these embeddings into a vector DB."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}