{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Automatic curation of the Hugging Face Hub using Collections and the `huggingface_hub` library\n", "\n", "In this short tutorial, we will see how to create a Hugging Face Collection automatically using the `huggingface_hub` library. We'll focus on creating a collection that curates the top 10% most used instruction tuning datasets on the Hub.\n", "\n", "If you are already familiar with Collections and the `huggingface_hub` library, you can skip ahead to the Install packages section.\n", "\n", "## What is a Hugging Face Collection?\n", "\n", "Collections are a recently added feature on the Hugging Face Hub that unlocks powerful new ways of curating what is on the Hub. With the Hub becoming the de facto platform for open-source machine learning models, it is important to be able to curate its content. Collections allow you to do just that.\n", "\n", "Collections can be used to organize models, datasets, Spaces, and papers on the Hub in many ways. You could, for example, create collections around a particular use case, topic, or model architecture, or around a combination of these. In this tutorial, we will create a collection that curates the top 10% most used instruction tuning datasets on the Hub, using the `huggingface_hub` library.\n", "\n", "## So what is the `huggingface_hub` library?\n", "\n", "The `huggingface_hub` library is a Python library that allows you to interact with the Hugging Face Hub. It allows you to do things like upload and download models, datasets, and Spaces. Recently, the library added support for creating and managing collections. This ability to programmatically create and manage collections unlocks a range of exciting new use cases. 
In this tutorial, we'll show a few of the things you can do with the `huggingface_hub` library and Collections, but we're excited to see what you will build with them!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Install packages\n", "\n", "For this tutorial, the only package we'll need outside of the Python standard library is the `huggingface_hub` library. The cell below installs it directly from the GitHub repository." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "HKyybBVZ1hBh", "outputId": "ceaa1d1a-85e6-4015-e4e1-04bbe88d05cf" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting git+https://github.com/huggingface/huggingface_hub\n", " Cloning https://github.com/huggingface/huggingface_hub to /private/var/folders/gf/nk18mwt53sb4d0zpvjzs40bw0000gn/T/pip-req-build-bdjiy2_a\n", " Running command git clone --filter=blob:none --quiet https://github.com/huggingface/huggingface_hub /private/var/folders/gf/nk18mwt53sb4d0zpvjzs40bw0000gn/T/pip-req-build-bdjiy2_a\n", " Resolved https://github.com/huggingface/huggingface_hub to commit c32d4b31b679c9e91b906709631901f6aa85324d\n", " Installing build dependencies ... \u001b[?25ldone\n", "\u001b[?25h Getting requirements to build wheel ... \u001b[?25ldone\n", "\u001b[?25h Preparing metadata (pyproject.toml) ... 
\u001b[?25ldone\n", "\u001b[?25hCollecting filelock (from huggingface-hub==0.18.0.dev0)\n", " Obtaining dependency information for filelock from https://files.pythonhosted.org/packages/5e/5d/97afbafd9d584ff1b45fcb354a479a3609bd97f912f8f1f6c563cb1fae21/filelock-3.12.4-py3-none-any.whl.metadata\n", " Using cached filelock-3.12.4-py3-none-any.whl.metadata (2.8 kB)\n", "Collecting fsspec>=2023.5.0 (from huggingface-hub==0.18.0.dev0)\n", " Obtaining dependency information for fsspec>=2023.5.0 from https://files.pythonhosted.org/packages/fe/d3/e1aa96437d944fbb9cc95d0316e25583886e9cd9e6adc07baad943524eda/fsspec-2023.9.2-py3-none-any.whl.metadata\n", " Using cached fsspec-2023.9.2-py3-none-any.whl.metadata (6.7 kB)\n", "Collecting requests (from huggingface-hub==0.18.0.dev0)\n", " Obtaining dependency information for requests from https://files.pythonhosted.org/packages/70/8e/0e2d847013cb52cd35b38c009bb167a1a26b2ce6cd6965bf26b47bc0bf44/requests-2.31.0-py3-none-any.whl.metadata\n", " Using cached requests-2.31.0-py3-none-any.whl.metadata (4.6 kB)\n", "Collecting tqdm>=4.42.1 (from huggingface-hub==0.18.0.dev0)\n", " Obtaining dependency information for tqdm>=4.42.1 from https://files.pythonhosted.org/packages/00/e5/f12a80907d0884e6dff9c16d0c0114d81b8cd07dc3ae54c5e962cc83037e/tqdm-4.66.1-py3-none-any.whl.metadata\n", " Using cached tqdm-4.66.1-py3-none-any.whl.metadata (57 kB)\n", "Collecting pyyaml>=5.1 (from huggingface-hub==0.18.0.dev0)\n", " Obtaining dependency information for pyyaml>=5.1 from https://files.pythonhosted.org/packages/28/09/55f715ddbf95a054b764b547f617e22f1d5e45d83905660e9a088078fe67/PyYAML-6.0.1-cp311-cp311-macosx_11_0_arm64.whl.metadata\n", " Using cached PyYAML-6.0.1-cp311-cp311-macosx_11_0_arm64.whl.metadata (2.1 kB)\n", "Collecting typing-extensions>=3.7.4.3 (from huggingface-hub==0.18.0.dev0)\n", " Obtaining dependency information for typing-extensions>=3.7.4.3 from 
https://files.pythonhosted.org/packages/24/21/7d397a4b7934ff4028987914ac1044d3b7d52712f30e2ac7a2ae5bc86dd0/typing_extensions-4.8.0-py3-none-any.whl.metadata\n", " Using cached typing_extensions-4.8.0-py3-none-any.whl.metadata (3.0 kB)\n", "Requirement already satisfied: packaging>=20.9 in ./.venv/lib/python3.11/site-packages (from huggingface-hub==0.18.0.dev0) (23.1)\n", "Collecting charset-normalizer<4,>=2 (from requests->huggingface-hub==0.18.0.dev0)\n", " Obtaining dependency information for charset-normalizer<4,>=2 from https://files.pythonhosted.org/packages/91/e6/8fa919fc84a106e9b04109de62bdf8526899e2754a64da66e1cd50ac1faa/charset_normalizer-3.2.0-cp311-cp311-macosx_11_0_arm64.whl.metadata\n", " Using cached charset_normalizer-3.2.0-cp311-cp311-macosx_11_0_arm64.whl.metadata (31 kB)\n", "Collecting idna<4,>=2.5 (from requests->huggingface-hub==0.18.0.dev0)\n", " Using cached idna-3.4-py3-none-any.whl (61 kB)\n", "Collecting urllib3<3,>=1.21.1 (from requests->huggingface-hub==0.18.0.dev0)\n", " Obtaining dependency information for urllib3<3,>=1.21.1 from https://files.pythonhosted.org/packages/37/dc/399e63f5d1d96bb643404ee830657f4dfcf8503f5ba8fa3c6d465d0c57fe/urllib3-2.0.5-py3-none-any.whl.metadata\n", " Using cached urllib3-2.0.5-py3-none-any.whl.metadata (6.6 kB)\n", "Collecting certifi>=2017.4.17 (from requests->huggingface-hub==0.18.0.dev0)\n", " Obtaining dependency information for certifi>=2017.4.17 from https://files.pythonhosted.org/packages/4c/dd/2234eab22353ffc7d94e8d13177aaa050113286e93e7b40eae01fbf7c3d9/certifi-2023.7.22-py3-none-any.whl.metadata\n", " Using cached certifi-2023.7.22-py3-none-any.whl.metadata (2.2 kB)\n", "Using cached fsspec-2023.9.2-py3-none-any.whl (173 kB)\n", "Using cached PyYAML-6.0.1-cp311-cp311-macosx_11_0_arm64.whl (167 kB)\n", "Using cached tqdm-4.66.1-py3-none-any.whl (78 kB)\n", "Using cached typing_extensions-4.8.0-py3-none-any.whl (31 kB)\n", "Using cached filelock-3.12.4-py3-none-any.whl (11 kB)\n", "Using cached 
requests-2.31.0-py3-none-any.whl (62 kB)\n", "Using cached certifi-2023.7.22-py3-none-any.whl (158 kB)\n", "Using cached charset_normalizer-3.2.0-cp311-cp311-macosx_11_0_arm64.whl (122 kB)\n", "Using cached urllib3-2.0.5-py3-none-any.whl (123 kB)\n", "Building wheels for collected packages: huggingface-hub\n", " Building wheel for huggingface-hub (pyproject.toml) ... \u001b[?25ldone\n", "\u001b[?25h Created wheel for huggingface-hub: filename=huggingface_hub-0.18.0.dev0-py3-none-any.whl size=298588 sha256=88b09ea2b9f009a9aeae12440af109575fc5b82e58a29b0b250cc9a95eaff3aa\n", " Stored in directory: /private/var/folders/gf/nk18mwt53sb4d0zpvjzs40bw0000gn/T/pip-ephem-wheel-cache-5yfewvyz/wheels/0d/44/01/c6da8315f53a5f367cd4bb3e00643c462c8df2065b29a67f4f\n", "Successfully built huggingface-hub\n", "Installing collected packages: urllib3, typing-extensions, tqdm, pyyaml, idna, fsspec, filelock, charset-normalizer, certifi, requests, huggingface-hub\n", "Successfully installed certifi-2023.7.22 charset-normalizer-3.2.0 filelock-3.12.4 fsspec-2023.9.2 huggingface-hub-0.18.0.dev0 idna-3.4 pyyaml-6.0.1 requests-2.31.0 tqdm-4.66.1 typing-extensions-4.8.0 urllib3-2.0.5\n", "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "%pip install git+https://github.com/huggingface/huggingface_hub --upgrade" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Authenticate\n", "\n", "In order to create and manage collections, you need to be authenticated. You can do this via the `huggingface_hub` library using the `notebook_login` function if you're using a notebook, or the `login` function if you're using a script. 
" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "id": "Qn9p5Bsz2NN5" }, "outputs": [], "source": [ "from huggingface_hub import notebook_login" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 145, "referenced_widgets": [ "428d3687eb4342e59d23318099afe34f", "18f533e671114b6385428a534364f10a", "e4c0e23001254742a94898203a222c6c", "9d970a88c8c04bc586473251393aaec7", "9f8288bb8cae4796a067580ff7afce69", "077012b6f63e4148848c9b9e8726fb18", "14f46443f97c4b4fb46c5967aec1178f", "5aaff54fecb84936a8dc9fee4393494d", "aee4b5a2e361451dae879f37222245f3", "9dfa5a7ee7794a5d8396674db2c0b683", "3634abd523b7477082a0a8135f1fa770", "023691d310634e6e83da20b9575759a2", "dded08e463404a53abb86ac605968626", "9d99d6e39a424145b017abff9021d9a0", "38e81f2aed79485498035e9c418165b4", "37807c4db5834365b86bc92c36835220", "847b9e4085814f958a147286be4f56eb", "6efac326ad7946e3a9ecc22b50568633", "9d043d68e16440899a6fc9b740f5970d", "88242ee08b884098ad743c1738b7dc97", "862cc50e401845fa98054e6bd015a074", "4fe39f9b54474fb4966f71c7df0cf93e", "d97e9b98cade45eaa2b9b526b1a4bb98", "b7f608ef35d84fd7a736260236025429", "09ca4ed6420340e0b76d64949301bed4", "b2261f5044db4af0bed02f76115d08f9", "055c0da20e264ec896963f9edf372a7d", "d0f55244ec614704a571f920ffa27bfd", "3b77fe48c5e44c879998de497be7a381", "15032b9578124624bcc42771cb5d5ad8", "eca43f65d6c1407bbb16cd26f60d5b7f", "934d899f6c604fa1bf4a8108aa09b190" ] }, "id": "Sv15J3mW2Ous", "outputId": "e537a566-8cdd-4316-bb69-72f5353345da" }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "79b9c67a0334432bad65c411b7560672", "version_major": 2, "version_minor": 0 }, "text/plain": [ "VBox(children=(HTML(value='