{ "cells": [ { "cell_type": "markdown", "id": "5d9aca72-957a-4ee2-862f-e011b9cd3a62", "metadata": {}, "source": [ "---\n", "title: \"Inference Endpoints\"\n", "---\n", "\n", "# How to use Inference Endpoints to Embed Documents\n", "\n", "_Authored by: [Derek Thomas](https://huggingface.co/derek-thomas)_\n", "\n", "## Goal\n", "I have a dataset I want to embed for semantic search (or QA, or RAG). I want the easiest way to embed it and store the results in a new dataset.\n", "\n", "## Approach\n", "I'm using a dataset from my favorite subreddit [r/bestofredditorupdates](https://www.reddit.com/r/bestofredditorupdates/). Because it has long entries, I will use the new [jinaai/jina-embeddings-v2-base-en](https://huggingface.co/jinaai/jina-embeddings-v2-base-en) since it has an 8k context length. I will deploy this using [Inference Endpoints](https://huggingface.co/inference-endpoints) to save time and money. To follow this tutorial, you will need to **have already added a payment method**. If you haven't, you can add one in [billing](https://huggingface.co/docs/hub/billing#billing). To make it even easier, I'll make this fully API-based.\n", "\n", "To make this MUCH faster, I will use the [Text Embeddings Inference](https://github.com/huggingface/text-embeddings-inference) image. This has many benefits, like:\n", "- No model graph compilation step\n", "- Small docker images and fast boot times. 
Get ready for true serverless!\n", "- Token-based dynamic batching\n", "- Optimized transformers code for inference using Flash Attention, Candle and cuBLASLt\n", "- Safetensors weight loading\n", "- Production ready (distributed tracing with Open Telemetry, Prometheus metrics)\n", "\n", "![img](https://media.githubusercontent.com/media/huggingface/text-embeddings-inference/main/assets/bs1-tp.png)" ] }, { "cell_type": "markdown", "id": "3c830114-dd88-45a9-81b9-78b0e3da7384", "metadata": {}, "source": [ "## Requirements" ] }, { "cell_type": "code", "execution_count": null, "id": "35386f72-32cb-49fa-a108-3aa504e20429", "metadata": { "tags": [] }, "outputs": [], "source": [ "!pip install -q aiohttp==3.8.3 datasets==2.14.6 pandas==1.5.3 requests==2.31.0 tqdm==4.66.1 \"huggingface-hub>=0.20\"" ] }, { "cell_type": "markdown", "id": "b6f72042-173d-4a72-ade1-9304b43b528d", "metadata": {}, "source": [ "## Imports" ] }, { "cell_type": "code", "execution_count": 3, "id": "e2beecdd-d033-4736-bd45-6754ec53b4ac", "metadata": { "tags": [] }, "outputs": [], "source": [ "import asyncio\n", "from getpass import getpass\n", "import json\n", "from pathlib import Path\n", "import time\n", "from typing import Optional\n", "\n", "from aiohttp import ClientSession, ClientTimeout\n", "from datasets import load_dataset, Dataset, DatasetDict\n", "from huggingface_hub import notebook_login, create_inference_endpoint, list_inference_endpoints, whoami\n", "import numpy as np\n", "import pandas as pd\n", "import requests\n", "from tqdm.auto import tqdm" ] }, { "cell_type": "markdown", "id": "5eece903-64ce-435d-a2fd-096c0ff650bf", "metadata": {}, "source": [ "## Config\n", "`DATASET_IN` is where your text data is.\n", "\n", "`DATASET_OUT` is where your embeddings will be stored.\n", "\n", "Note that I used 5 for `MAX_WORKERS` since `jina-embeddings-v2` is quite memory hungry. 
" ] }, { "cell_type": "code", "execution_count": 4, "id": "df2f79f0-9f28-46e6-9fc7-27e9537ff5be", "metadata": { "tags": [] }, "outputs": [], "source": [ "DATASET_IN = 'derek-thomas/dataset-creator-reddit-bestofredditorupdates'\n", "DATASET_OUT = \"processed-subset-bestofredditorupdates\"\n", "ENDPOINT_NAME = \"boru-jina-embeddings-demo-ie\"\n", "\n", "MAX_WORKERS = 5  # Number of async workers; choose based on the model and hardware\n", "ROW_COUNT = 100  # Set to None to use all rows; I'm using 100 just for a demo" ] }, { "cell_type": "markdown", "id": "1e680f3d-4900-46cc-8b49-bb6ba3e27e2b", "metadata": {}, "source": [ "Hugging Face offers a number of GPUs that you can choose from in Inference Endpoints. Here they are in table form:\n", "\n", "| GPU | instanceType | instanceSize | vRAM |\n", "|---------------------|----------------|--------------|-------|\n", "| 1x Nvidia Tesla T4 | g4dn.xlarge | small | 16GB |\n", "| 4x Nvidia Tesla T4 | g4dn.12xlarge | large | 64GB |\n", "| 1x Nvidia A10G | g5.2xlarge | medium | 24GB |\n", "| 4x Nvidia A10G | g5.12xlarge | xxlarge | 96GB |\n", "| 1x Nvidia A100* | p4de | xlarge | 80GB |\n", "| 2x Nvidia A100* | p4de | 2xlarge | 160GB |\n", "\n", "\\*Note that for A100s you might get a note to email us to get access." ] }, { "cell_type": "code", "execution_count": 4, "id": "3c2106c1-2e5a-443a-9ea8-a3cd0e9c5a94", "metadata": { "tags": [] }, "outputs": [], "source": [ "# GPU Choice\n", "VENDOR = \"aws\"\n", "REGION = \"us-east-1\"\n", "INSTANCE_SIZE = \"medium\"\n", "INSTANCE_TYPE = \"g5.2xlarge\"" ] }, { "cell_type": "code", "execution_count": 5, "id": "0ca1140c-3fcc-4b99-9210-6da1505a27b7", "metadata": { "tags": [] }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "ee80821056e147fa9cabf30f64dc85a8", "version_major": 2, "version_minor": 0 }, "text/plain": [ "VBox(children=(HTML(value='