{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "Kq8_kBUjxY3B" }, "source": [ "# Dataset Search Client Documentation\n", "\n", "This notebook demonstrates how to use the [librarian-bots/dataset-column-search-api](https://huggingface.co/spaces/librarian-bots/dataset-column-search-api) API to search for Hugging Face datasets by their column names." ] }, { "cell_type": "markdown", "metadata": { "id": "ArdwzeQSxY3D" }, "source": [ "## Introduction\n", "\n", "The Hugging Face Hub hosts a vast collection of datasets for various machine learning tasks. These datasets often have different structures and column names. The [librarian-bots/dataset-column-search-api](https://huggingface.co/spaces/librarian-bots/dataset-column-search-api) API allows you to find datasets that match specific column structures, which can be incredibly useful for tasks like:\n", "\n", "1. Finding datasets suitable for specific machine learning tasks\n", "2. Identifying datasets with compatible structures for transfer learning or data augmentation\n", "3. Exploring the availability of datasets with certain features or labels\n", "\n", "By searching based on column names, you can quickly identify datasets that fit your specific needs without having to manually inspect each dataset's structure." ] }, { "cell_type": "markdown", "metadata": { "id": "5KeXd86UxY3D" }, "source": [ "## Setup\n", "\n", "First, let's import the necessary libraries and define a `DatasetSearchClient` class which we'll use to call the API (feel free to directly call the API if prefered)." ] }, { "cell_type": "code", "execution_count": 94, "metadata": { "id": "EyvEz03KxY3D" }, "outputs": [], "source": [ "import requests\n", "from typing import List, Dict, Any, Iterator\n", "\n", "class DatasetSearchClient:\n", " def __init__(self, base_url: str = \"https://librarian-bots-dataset-column-search-api.hf.space\"):\n", " self.base_url = base_url\n", "\n", " def search(self,\n", " columns: List[str],\n", " match_all: bool = False,\n", " page_size: int = 100) -> Iterator[Dict[str, Any]]:\n", " \"\"\"\n", " Search datasets using the provided API, automatically handling pagination.\n", "\n", " Args:\n", " columns (List[str]): List of column names to search for.\n", " match_all (bool, optional): If True, match all columns. If False, match any column. Defaults to False.\n", " page_size (int, optional): Number of results per page. Defaults to 100.\n", "\n", " Yields:\n", " Dict[str, Any]: Each dataset result from all pages.\n", "\n", " Raises:\n", " requests.RequestException: If there's an error with the HTTP request.\n", " ValueError: If the API returns an unexpected response format.\n", " \"\"\"\n", " page = 1\n", " total_results = None\n", "\n", " while total_results is None or (page - 1) * page_size < total_results:\n", " params = {\n", " \"columns\": columns,\n", " \"match_all\": str(match_all).lower(),\n", " \"page\": page,\n", " \"page_size\": page_size\n", " }\n", "\n", " try:\n", " response = requests.get(f\"{self.base_url}/search\", params=params)\n", " response.raise_for_status()\n", " data = response.json()\n", "\n", " if not {\"total\", \"page\", \"page_size\", \"results\"}.issubset(data.keys()):\n", " raise ValueError(\"Unexpected response format from the API\")\n", "\n", " if total_results is None:\n", " total_results = data['total']\n", "\n", " for dataset in data['results']:\n", " yield dataset\n", "\n", " page += 1\n", "\n", " except requests.RequestException as e:\n", " raise requests.RequestException(f\"Error connecting to the API: {str(e)}\")\n", " except ValueError as e:\n", " raise ValueError(f\"Error processing API response: {str(e)}\")\n", "\n", "# Create an instance of the client\n", "client = DatasetSearchClient()" ] }, { "cell_type": "markdown", "metadata": { "id": "mxVqxdCtxY3E" }, "source": [ "## Example 1: Searching for Text Classification Datasets\n", "\n", "Let's start by searching for datasets that have both \"text\" and \"label\" columns, which are common in text classification tasks:" ] }, { "cell_type": "code", "execution_count": 95, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "T2wyABxrxY3E", "outputId": "9541e61e-1e0d-4d8a-a5d7-1e2db117bf3c" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Datasets suitable for text classification (with 'text' and 'label' columns):\n", "1. mteb/amazon_counterfactual: ['text', 'label', 'label_text']\n", "2. dair-ai/emotion: ['text', 'label']\n", "3. stanfordnlp/imdb: ['text', 'label']\n", "4. 203427as321/articles: ['label', 'text', '__index_level_0__']\n", "5. indonlp/NusaX-senti: ['id', 'text', 'lang', 'label']\n", "\n", "Total datasets found: 1866\n" ] } ], "source": [ "text_classification_columns = [\"text\", \"label\"]\n", "results = client.search(text_classification_columns, match_all=True)\n", "\n", "print(\"Datasets suitable for text classification (with 'text' and 'label' columns):\")\n", "for i, dataset in enumerate(results, 1):\n", " print(f\"{i}. {dataset['hub_id']}: {dataset['column_names']}\")\n", " if i >= 5: # Print only the first 5 as a sample\n", " break\n", "\n", "total_results = len(list(client.search(text_classification_columns, match_all=True)))\n", "print(f\"\\nTotal datasets found: {total_results}\")" ] }, { "cell_type": "markdown", "metadata": { "id": "al0oo4yBxY3E" }, "source": [ "## Example 2: Searching for Question-Answering Datasets\n", "\n", "Now, let's search for datasets that could be used for question-answering tasks:" ] }, { "cell_type": "code", "execution_count": 97, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "WY9e3o0CxY3E", "outputId": "f46cb86a-9df9-405a-bca9-17cac3fe5faa" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Datasets suitable for question-answering tasks (with 'question', 'answer', and 'context' columns):\n", "1. hotpotqa/hotpot_qa: ['id', 'question', 'answer', 'type', 'level', 'supporting_facts', 'context']\n", "2. neural-bridge/rag-dataset-12000: ['context', 'question', 'answer']\n", "3. ryo0634/xquad-sampled: ['id', 'question', 'context', 'answer_sentence', 'answer']\n", "4. lcw99/wikipedia-korean-20240501-1million-qna: ['question', 'answer', 'context']\n", "5. virattt/financial-qa-10K: ['question', 'answer', 'context', 'ticker', 'filing']\n", "\n", "Total datasets found: 646\n" ] } ], "source": [ "qa_columns = [\"question\", \"answer\", \"context\"]\n", "results = client.search(qa_columns, match_all=True)\n", "\n", "print(\"Datasets suitable for question-answering tasks (with 'question', 'answer', and 'context' columns):\")\n", "for i, dataset in enumerate(results, 1):\n", " print(f\"{i}. {dataset['hub_id']}: {dataset['column_names']}\")\n", " if i >= 5: # Print only the first 5 as a sample\n", " break\n", "\n", "total_results = len(list(client.search(qa_columns, match_all=True)))\n", "print(f\"\\nTotal datasets found: {total_results}\")" ] }, { "cell_type": "markdown", "metadata": { "id": "kiU3-f-OxY3E" }, "source": [ "## Example 3: Searching for Instruction-Following Datasets\n", "\n", "Let's search for datasets that could be used for instruction-following tasks, which are common in training large language models:" ] }, { "cell_type": "code", "execution_count": 98, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "nt8SSWaRxY3F", "outputId": "42460b4b-6dac-48f1-a3b2-b1504bd16686" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Datasets suitable for instruction-following tasks (with 'instruction', 'input', and 'output' columns):\n", "1. garage-bAInd/Open-Platypus: ['input', 'output', 'instruction', 'data_source']\n", "2. HuggingFaceH4/databricks_dolly_15k: ['category', 'instruction', 'input', 'output']\n", "3. chargoddard/alpaca-gpt4-500: ['instruction', 'input', 'output', 'text', '__index_level_0__']\n", "4. vicgalle/alpaca-gpt4: ['instruction', 'input', 'output', 'text']\n", "5. llamafactory/alpaca_en: ['instruction', 'input', 'output']\n", "\n", "Total datasets found: 1937\n" ] } ], "source": [ "instruction_columns = [\"instruction\", \"input\", \"output\"]\n", "results = client.search(instruction_columns, match_all=True)\n", "\n", "print(\"Datasets suitable for instruction-following tasks (with 'instruction', 'input', and 'output' columns):\")\n", "for i, dataset in enumerate(results, 1):\n", " print(f\"{i}. {dataset['hub_id']}: {dataset['column_names']}\")\n", " if i >= 5: # Print only the first 5 as a sample\n", " break\n", "\n", "total_results = len(list(client.search(instruction_columns, match_all=True)))\n", "print(f\"\\nTotal datasets found: {total_results}\")" ] }, { "cell_type": "markdown", "source": [ "# Creating collections for common dataset formats\n", "\n", "We can also use the API to create a Hugging Face Collection based on our search. Let's use an alpaca formatted dataset as an example:\n", "\n", "alpaca\n", "```\n", "{\"instruction\": \"...\", \"input\": \"...\", \"output\": \"...\"}\n", "```\n" ], "metadata": { "id": "yRdaLtZ0AQlj" } }, { "cell_type": "code", "source": [ "alpaca = ['instruction', 'input', 'output']" ], "metadata": { "id": "kdB0wnEDDek8" }, "execution_count": 99, "outputs": [] }, { "cell_type": "code", "source": [ "results = list(client.search(alpaca, match_all=True))\n", "len(results)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "uh52VwKTQasR", "outputId": "c16e50ce-6799-42b9-9ae4-e9016d767c6f" }, "execution_count": 100, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "1937" ] }, "metadata": {}, "execution_count": 100 } ] }, { "cell_type": "markdown", "source": [ "We now import some functions from `huggingface_hub` to create a collection." ], "metadata": { "id": "BZ6LNKg3FdYs" } }, { "cell_type": "code", "source": [ "from huggingface_hub import login, create_collection, add_collection_item" ], "metadata": { "id": "eckH26s8w_U4" }, "execution_count": 25, "outputs": [] }, { "cell_type": "markdown", "source": [ "I have my HF_TOKEN stored as a Secret in Colab. You can also login by calling `login()` directly." ], "metadata": { "id": "nUIshM8bFhW3" } }, { "cell_type": "code", "source": [ "from google.colab import userdata" ], "metadata": { "id": "3ywhU4J7xGuE" }, "execution_count": 102, "outputs": [] }, { "cell_type": "code", "source": [ "login(userdata.get('HF_TOKEN'))" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "b0yRHNw0xCq7", "outputId": "1bcdbda5-34d9-4848-f315-2fc81772df38" }, "execution_count": 103, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.\n", "Token is valid (permission: write).\n", "Your token has been saved to /root/.cache/huggingface/token\n", "Login successful\n" ] } ] }, { "cell_type": "markdown", "source": [ "We create a collection using `create_collection`. WE" ], "metadata": { "id": "krcmAIyNFshv" } }, { "cell_type": "code", "source": [ "collection = create_collection(\"Probably Alpaca Style Datasets\", exists_ok=True)" ], "metadata": { "id": "fGpAnGOPxEWp" }, "execution_count": 108, "outputs": [] }, { "cell_type": "code", "source": [ "collection.title" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 36 }, "id": "Gt8rql39RC5R", "outputId": "4af9a2f0-6c20-43a9-f46f-1dc38c2cb480" }, "execution_count": 109, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "'Probably Alpaca Style Datasets'" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" } }, "metadata": {}, "execution_count": 109 } ] }, { "cell_type": "code", "source": [ "collection.slug" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 36 }, "id": "0OC5U8VeF_Zq", "outputId": "bf135fe4-cf65-4425-c541-eb285aaa86e6" }, "execution_count": 110, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "'davanstrien/probably-alpaca-style-datasets-667eead1bad3a964ea580e04'" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" } }, "metadata": {}, "execution_count": 110 } ] }, { "cell_type": "markdown", "source": [ "We now loop through our results and add them to the Collection." ], "metadata": { "id": "-GEpHrekGAx6" } }, { "cell_type": "code", "source": [ "for result in results:\n", " add_collection_item(collection.slug, result['hub_id'], item_type=\"dataset\", exists_ok=True)" ], "metadata": { "id": "Vb3hgnRBxW4T" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Since the results have some key metadata about the dataset you can also filter the results further before creating a Collection." ], "metadata": { "id": "vOdodAVcGI96" } } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" }, "colab": { "provenance": [] } }, "nbformat": 4, "nbformat_minor": 0 }