{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Basic embedding retrieval with Chroma\n", "\n", "This notebook demonstrates the most basic use of Chroma to store and retrieve information using embeddings. This core building block is at the heart of many powerful AI applications.\n", "\n", "## What are embeddings?\n", "\n", "Embeddings are the AI-native way to represent any kind of data, making them the perfect fit for working with all kinds of AI-powered tools and algorithms. They can represent text, images, and soon audio and video.\n", "\n", "To create an embedding, data is fed into an embedding model, which outputs a vector of numbers. The model is trained so that 'similar' data, e.g. text with similar meanings or images with similar content, produces vectors that are nearer to one another than vectors produced from dissimilar data.\n", "\n", "## Embeddings and retrieval\n", "\n", "We can use the similarity property of embeddings to search for and retrieve information. For example, we can find documents relevant to a particular topic, or images similar to a given image. Rather than searching for keywords or tags, we can search by finding data with similar semantic meaning.\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "%pip install -Uq chromadb numpy datasets tqdm ipywidgets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example Dataset\n", "\n", "As a demonstration, we use the [SciQ dataset](https://arxiv.org/abs/1707.06209), available from [HuggingFace](https://huggingface.co/datasets/sciq).\n", "\n", "Dataset description, from HuggingFace:\n", "\n", "> The SciQ dataset contains 13,679 crowdsourced science exam questions about Physics, Chemistry and Biology, among others. 
The questions are in multiple-choice format with 4 answer options each. For the majority of the questions, an additional paragraph with supporting evidence for the correct answer is provided.\n", "\n", "In this notebook, we will demonstrate how to retrieve supporting evidence for a given question.\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of questions with support: 10481\n" ] } ], "source": [ "# Get the SciQ dataset from HuggingFace\n", "from datasets import load_dataset\n", "\n", "dataset = load_dataset(\"sciq\", split=\"train\")\n", "\n", "# Filter the dataset to only include questions that have a supporting paragraph\n", "dataset = dataset.filter(lambda x: x[\"support\"] != \"\")\n", "\n", "print(\"Number of questions with support: \", len(dataset))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading the data into Chroma\n", "\n", "Chroma comes with a built-in embedding model, which makes it simple to load text. \n", "We can load the SciQ dataset into Chroma with just a few lines of code.\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Import Chroma and instantiate a client. The default Chroma client is ephemeral, meaning it will not save to disk.\n", "import chromadb\n", "\n", "client = chromadb.Client()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# Create a new Chroma collection to store the supporting evidence. We don't need to specify an embedding function; the default will be used.\n", "collection = client.create_collection(\"sciq_supports\")" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "6a36ed0079c34128bb4c007feacc6ad1", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Adding documents: 0%| | 0/11 [00:00