{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Basic embedding retrieval with Chroma\n",
    "\n",
    "This notebook demonstrates the most basic use of Chroma to store and retrieve information using embeddings. This core building block is at the heart of many powerful AI applications.\n",
    "\n",
    "## What are embeddings?\n",
    "\n",
    "Embeddings are the A.I-native way to represent any kind of data, making them the perfect fit for working with all kinds of A.I-powered tools and algorithms. They can represent text, images, and soon audio and video.\n",
    "\n",
    "To create an embedding, data is fed into an embedding model, which outputs vectors of numbers. The model is trained in such a way that 'similar' data, e.g. text with similar meanings, or images with similar content, will produce vectors which are nearer to one another, than those which are dissimilar.\n",
    "\n",
    "## Embeddings and retrieval\n",
    "\n",
    "We can use the similarity property of embeddings to search for and retrieve information. For example, we can find documents relevant to a particular topic, or images similar to a given image. Rather than searching for keywords or tags, we can search by finding data with similar semantic meaning.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Note: you may need to restart the kernel to use updated packages.\n"
     ]
    }
   ],
   "source": [
    "%pip install -Uq chromadb numpy datasets tqdm ipywidgets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Example Dataset\n",
    "\n",
    "As a demonstration we use the [SciQ dataset](https://arxiv.org/abs/1707.06209), available from [HuggingFace](https://huggingface.co/datasets/sciq).\n",
    "\n",
    "Dataset description, from HuggingFace:\n",
    "\n",
    "> The SciQ dataset contains 13,679 crowdsourced science exam questions about Physics, Chemistry and Biology, among others. The questions are in multiple-choice format with 4 answer options each. For the majority of the questions, an additional paragraph with supporting evidence for the correct answer is provided.\n",
    "\n",
    "In this notebook, we will demonstrate how to retrieve supporting evidence for a given question.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of questions with support:  10481\n"
     ]
    }
   ],
   "source": [
    "# Get the SciQ dataset from HuggingFace\n",
    "from datasets import load_dataset\n",
    "\n",
    "dataset = load_dataset(\"sciq\", split=\"train\")\n",
    "\n",
    "# Filter the dataset to only include questions with a support\n",
    "dataset = dataset.filter(lambda x: x[\"support\"] != \"\")\n",
    "\n",
    "print(\"Number of questions with support: \", len(dataset))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading the data into Chroma\n",
    "\n",
    "Chroma comes with a built-in embedding model, which makes it simple to load text. \n",
    "We can load the SciQ dataset into Chroma with just a few lines of code.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import Chroma and instantiate a client. The default Chroma client is ephemeral, meaning it will not save to disk.\n",
    "import chromadb\n",
    "\n",
    "client = chromadb.Client()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a new Chroma collection to store the supporting evidence. We don't need to specify an embedding fuction, and the default will be used.\n",
    "collection = client.create_collection(\"sciq_supports\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "6a36ed0079c34128bb4c007feacc6ad1",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Adding documents:   0%|          | 0/11 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from tqdm.notebook import tqdm\n",
    "\n",
    "# Load the supporting evidence in batches of 1000\n",
    "batch_size = 1000\n",
    "for i in tqdm(range(0, len(dataset), batch_size), desc=\"Adding documents\"):\n",
    "    collection.add(\n",
    "        ids=[\n",
    "            str(i) for i in range(i, min(i + batch_size, len(dataset)))\n",
    "        ],  # IDs are just strings\n",
    "        documents=dataset[\"support\"][i : i + batch_size],\n",
    "        metadatas=[\n",
    "            {\"type\": \"support\"} for _ in range(i, min(i + batch_size, len(dataset)))\n",
    "        ],\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Querying the data\n",
    "\n",
    "Once the data is loaded, we can use Chroma to find supporting evidence for the questions in the dataset.\n",
    "In this example, we retrieve the most relevant result according to the embedding similarity score.\n",
    "\n",
    "Chroma handles computing similarity and finding the most relevant results for you, so you can focus on building your application.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "results = collection.query(\n",
    "    query_texts=dataset[\"question\"][:10],\n",
    "    n_results=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "we display the query questions along with their retrieved supports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Question: What type of organism is commonly used in preparation of foods such as cheese and yogurt?\n",
      "Retrieved support: Bacteria can be used to make cheese from milk. The bacteria turn the milk sugars into lactic acid. The acid is what causes the milk to curdle to form cheese. Bacteria are also involved in producing other foods. Yogurt is made by using bacteria to ferment milk ( Figure below ). Fermenting cabbage with bacteria produces sauerkraut.\n",
      "\n",
      "Question: What phenomenon makes global winds blow northeast to southwest or the reverse in the northern hemisphere and northwest to southeast or the reverse in the southern hemisphere?\n",
      "Retrieved support: Without Coriolis Effect the global winds would blow north to south or south to north. But Coriolis makes them blow northeast to southwest or the reverse in the Northern Hemisphere. The winds blow northwest to southeast or the reverse in the southern hemisphere.\n",
      "\n",
      "Question: Changes from a less-ordered state to a more-ordered state (such as a liquid to a solid) are always what?\n",
      "Retrieved support: Solids that change to gases generally first pass through the liquid state. However, sometimes solids change directly to gases and skip the liquid state. The reverse can also occur. Sometimes gases change directly to solids.\n",
      "\n",
      "Question: What is the least dangerous radioactive decay?\n",
      "Retrieved support: All radioactive decay is dangerous to living things, but alpha decay is the least dangerous.\n",
      "\n",
      "Question: Kilauea in hawaii is the world’s most continuously active volcano. very active volcanoes characteristically eject red-hot rocks and lava rather than this?\n",
      "Retrieved support: Volcanoes can be active, dormant, or extinct.\n",
      "\n",
      "Question: When a meteoroid reaches earth, what is the remaining object called?\n",
      "Retrieved support: A meteoroid is dragged toward Earth by gravity and enters the atmosphere. Friction with the atmosphere heats the object quickly, so it starts to vaporize. As it flies through the atmosphere, it leaves a trail of glowing gases. The object is now a meteor. Most meteors vaporize in the atmosphere. They never reach Earth’s surface. Large meteoroids may not burn up entirely in the atmosphere. A small core may remain and hit Earth’s surface. This is called a meteorite .\n",
      "\n",
      "Question: What kind of a reaction occurs when a substance reacts quickly with oxygen?\n",
      "Retrieved support: A combustion reaction occurs when a substance reacts quickly with oxygen (O 2 ). You can see an example of a combustion reaction in Figure below . Combustion is commonly called burning. The substance that burns is usually referred to as fuel. The products of a combustion reaction include carbon dioxide (CO 2 ) and water (H 2 O). The reaction typically gives off heat and light as well. The general equation for a combustion reaction can be represented by:.\n",
      "\n",
      "Question: Organisms categorized by what species descriptor demonstrate a version of allopatric speciation and have limited regions of overlap with one another, but where they overlap they interbreed successfully?.\n",
      "Retrieved support: Allopatric speciation occurs when groups from the same species are geographically isolated for long periods. Imagine all the ways that plants or animals could be isolated from each other:.\n",
      "\n",
      "Question: Alpha emission is a type of what?\n",
      "Retrieved support: One type of radioactivity is alpha emission. What is an alpha particle? What happens to an alpha particle after it is emitted from an unstable nucleus?.\n",
      "\n",
      "Question: What is the stored food in a seed called?\n",
      "Retrieved support: The stored food in a seed is called endosperm . It nourishes the embryo until it can start making food on its own.\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Print the question and the corresponding support\n",
    "for i, q in enumerate(dataset['question'][:10]):\n",
    "    print(f\"Question: {q}\")\n",
    "    print(f\"Retrieved support: {results['documents'][i][0]}\")\n",
    "    print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What's next? \n",
    "\n",
    "Check out the Chroma documentation to [get started](https://docs.trychroma.com/getting-started) with building your own applications. \n",
    "\n",
    "The core embeddings based retrieval functionality demonstrated here is at the heart of many powerful AI applications, like using large language models with Chroma to [chat with your documents](https://github.com/chroma-core/chroma/tree/main/examples/examples/chat_with_your_documents), as well as memory for agents like [BabyAgi](https://github.com/yoheinakajima/babyagi) and [Voyager](https://github.com/MineDojo/Voyager).\n",
    "\n",
    "Chroma is already integrated with many popular AI applications frameworks, including [LangChain](https://python.langchain.com/docs/integrations/vectorstores/chroma) and [LlamaIndex](https://gpt-index.readthedocs.io/en/latest/examples/vector_stores/ChromaIndexDemo.html). \n",
    "\n",
    "Join our community to learn more and get help with your projects: [Discord](https://discord.gg/MMeYNTmh3x) | [Twitter](https://twitter.com/trychroma)\n",
    "\n",
    "We are [hiring](https://trychroma.notion.site/careers-chroma-9d017c3007c7478ebd85bad854101497?pvs=4)! "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}