Create an image dataset

There are two methods for creating and sharing an image dataset. This guide will show you how to:

Create an image dataset with ImageFolder and some metadata. This is a no-code solution for quickly creating an image dataset with several thousand images.
Create an image dataset by writing a loading script. This method is a bit more involved, but you have greater flexibility over how a dataset is defined, downloaded, and generated, which can be useful for more complex or large-scale image datasets.

You can control access to your dataset by requiring users to share their contact information first. Check out the Gated datasets guide for more information about how to enable this feature on the Hub.

ImageFolder

The ImageFolder is a dataset builder designed to quickly load an image dataset with several thousand images without requiring you to write any code.

💡 Take a look at the Split pattern hierarchy to learn more about how ImageFolder creates dataset splits based on your dataset repository structure.

ImageFolder automatically infers the class labels of your dataset based on the directory name. Store your dataset in a directory structure like:

folder/train/dog/golden_retriever.png
folder/train/dog/german_shepherd.png
folder/train/dog/chihuahua.png

folder/train/cat/maine_coon.png
folder/train/cat/bengal.png
folder/train/cat/birman.png

Then users can load your dataset by specifying imagefolder in load_dataset() and the directory in data_dir:

>>> from datasets import load_dataset

>>> dataset = load_dataset("imagefolder", data_dir="/path/to/folder")

You can also use imagefolder to load datasets involving multiple splits. To do so, your dataset directory should have the following structure:

folder/train/dog/golden_retriever.png
folder/train/cat/maine_coon.png
folder/test/dog/german_shepherd.png
folder/test/cat/bengal.png
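
If your images currently sit in a flat directory, the layout above can be produced with a few lines of standard-library Python. This is only a minimal sketch: the helper name arrange_images and the (source_path, split, label) tuples are hypothetical, and empty placeholder files stand in for real images.

```python
import shutil
import tempfile
from pathlib import Path


def arrange_images(items, root):
    """Copy each image to root/<split>/<label>/<file name>.

    `items` is an iterable of (source_path, split, label) tuples
    (a hypothetical input format chosen for this sketch).
    """
    root = Path(root)
    for source_path, split, label in items:
        target_dir = root / split / label
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source_path, target_dir / Path(source_path).name)
    return root


with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw"
    raw.mkdir()
    items = [
        (raw / "golden_retriever.png", "train", "dog"),
        (raw / "maine_coon.png", "train", "cat"),
        (raw / "german_shepherd.png", "test", "dog"),
        (raw / "bengal.png", "test", "cat"),
    ]
    # Empty placeholder files stand in for real images.
    for source_path, _, _ in items:
        source_path.touch()

    folder = arrange_images(items, Path(tmp) / "folder")
    # Collect the resulting relative paths to check the layout.
    layout = sorted(p.relative_to(folder).as_posix() for p in folder.rglob("*.png"))

print(layout)
```

The resulting folder can then be passed as data_dir to load_dataset("imagefolder", ...).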

If all image files are contained in a single directory, or if they are not at the same level of the directory structure, the label column won’t be added automatically. If you need it, set drop_labels=False explicitly.

If there is additional information you’d like to include about your dataset, like text captions or bounding boxes, add it as a metadata.csv file in your folder. This lets you quickly create datasets for different computer vision tasks like text captioning or object detection. You can also use a JSONL file metadata.jsonl.

folder/train/metadata.csv
folder/train/0001.png
folder/train/0002.png
folder/train/0003.png

You can also zip your images:

folder/metadata.csv
folder/train.zip
folder/test.zip
folder/valid.zip

Your metadata.csv file must have a file_name column which links image files with their metadata:

file_name,additional_feature
0001.png,This is a first value of a text feature you added to your images
0002.png,This is a second value of a text feature you added to your images
0003.png,This is a third value of a text feature you added to your images

or using metadata.jsonl:

{"file_name": "0001.png", "additional_feature": "This is a first value of a text feature you added to your images"}
{"file_name": "0002.png", "additional_feature": "This is a second value of a text feature you added to your images"}
{"file_name": "0003.png", "additional_feature": "This is a third value of a text feature you added to your images"}
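
Both metadata files can be generated with Python's standard library. The sketch below assumes the captions live in a plain dictionary keyed by file name; the helper names write_metadata_csv and write_metadata_jsonl are hypothetical, and additional_feature is the example column from above.

```python
import csv
import json
import tempfile
from pathlib import Path

# Example values keyed by image file name (taken from the sample above).
rows = {
    "0001.png": "This is a first value of a text feature you added to your images",
    "0002.png": "This is a second value of a text feature you added to your images",
    "0003.png": "This is a third value of a text feature you added to your images",
}


def write_metadata_csv(rows, path, feature_name="additional_feature"):
    """Write a metadata.csv whose first column is the required file_name."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["file_name", feature_name])
        for file_name, value in rows.items():
            writer.writerow([file_name, value])


def write_metadata_jsonl(rows, path, feature_name="additional_feature"):
    """Write a metadata.jsonl with one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for file_name, value in rows.items():
            f.write(json.dumps({"file_name": file_name, feature_name: value}) + "\n")


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    write_metadata_csv(rows, folder / "metadata.csv")
    write_metadata_jsonl(rows, folder / "metadata.jsonl")
    # Read the files back to inspect what was written.
    csv_lines = (folder / "metadata.csv").read_text(encoding="utf-8").splitlines()
    jsonl_lines = (folder / "metadata.jsonl").read_text(encoding="utf-8").splitlines()

print(csv_lines[0])
print(jsonl_lines[0])
```

Using the csv module rather than joining strings by hand means captions that contain commas or quotes are escaped correctly.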

If metadata files are present, the inferred labels based on the directory name are dropped by default. To include those labels, set drop_labels=False in load_dataset.

Image captioning

Image captioning datasets have text describing an image. An example metadata.csv may look like:

file_name,text
0001.png,This is a golden retriever playing with a ball
0002.png,A german shepherd
0003.png,One chihuahua

Load the dataset with ImageFolder, and it will create a text column for the image captions:

>>> dataset = load_dataset("imagefolder", data_dir="/path/to/folder", split="train")
>>> dataset[0]["text"]
"This is a golden retriever playing with a ball"

Object detection

Object detection datasets have bounding boxes and categories identifying objects in an image. An example metadata.jsonl may look like:

{"file_name": "0001.png", "objects": {"bbox": [[302.0, 109.0, 73.0, 52.0]], "categories": [0]}}
{"file_name": "0002.png", "objects": {"bbox": [[810.0, 100.0, 57.0, 28.0]], "categories": [1]}}
{"file_name": "0003.png", "objects": {"bbox": [[160.0, 31.0, 248.0, 616.0], [741.0, 68.0, 202.0, 401.0]], "categories": [2, 2]}}

Load the dataset with ImageFolder, and it will create an objects column with the bounding boxes and the categories:

>>> dataset = load_dataset("imagefolder", data_dir="/path/to/folder", split="train")
>>> dataset[0]["objects"]
{"bbox": [[302.0, 109.0, 73.0, 52.0]], "categories": [0]}
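
Annotation tools often export one row per object rather than one record per image. As a rough sketch, assuming a hypothetical flat list of (file_name, bbox, category) tuples, the rows can be grouped into the metadata.jsonl lines shown above like this (to_metadata_lines is an illustrative helper, not part of the library):

```python
import json
from collections import defaultdict

# Hypothetical flat annotations: one (file_name, bbox, category) tuple per object.
annotations = [
    ("0001.png", [302.0, 109.0, 73.0, 52.0], 0),
    ("0002.png", [810.0, 100.0, 57.0, 28.0], 1),
    ("0003.png", [160.0, 31.0, 248.0, 616.0], 2),
    ("0003.png", [741.0, 68.0, 202.0, 401.0], 2),
]


def to_metadata_lines(annotations):
    """Group per-object rows into one JSON line per image."""
    grouped = defaultdict(lambda: {"bbox": [], "categories": []})
    for file_name, bbox, category in annotations:
        grouped[file_name]["bbox"].append(bbox)
        grouped[file_name]["categories"].append(category)
    return [
        json.dumps({"file_name": file_name, "objects": objects})
        for file_name, objects in grouped.items()
    ]


lines = to_metadata_lines(annotations)
for line in lines:
    print(line)
```

Writing these lines to folder/train/metadata.jsonl, one per row, gives the file used in the example above.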

Upload dataset to the Hub

Once you’ve created a dataset, you can share it to the Hub with the push_to_hub() method. Make sure you have the huggingface_hub library installed and you’re logged in to your Hugging Face account (see the Upload with Python tutorial for more details).

Upload your dataset with push_to_hub():

>>> from datasets import load_dataset

>>> dataset = load_dataset("imagefolder", data_dir="/path/to/folder", split="train")
>>> dataset.push_to_hub("stevhliu/my-image-captioning-dataset")

WebDataset

The WebDataset format is based on TAR archives and is suitable for big image datasets. You can group your images into TAR archives (e.g. 1GB of images per TAR archive) and have thousands of TAR archives:

folder/train/00000.tar
folder/train/00001.tar
folder/train/00002.tar
...

In the archives, each example is made of files sharing the same prefix:

e39871fd9fd74f55.jpg
e39871fd9fd74f55.json
f18b91585c4d3f3e.jpg
f18b91585c4d3f3e.json
ede6e66b2fb59aab.jpg
ede6e66b2fb59aab.json
ed600d57fcee4f94.jpg
ed600d57fcee4f94.json
...

You can store image labels, captions, or bounding boxes in JSON or text files, for example.

For more details on the WebDataset format and the python library, please check the WebDataset documentation.
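
A shard with this layout can be written with Python's built-in tarfile module. This is a minimal sketch rather than the WebDataset library's own writer: write_shard is a hypothetical helper, and the image bytes are placeholders rather than real JPEG data.

```python
import io
import json
import tarfile
import tempfile
from pathlib import Path


def write_shard(examples, tar_path):
    """Write (key, image_bytes, metadata) examples to one TAR shard.

    Files that belong to the same example share the same key as prefix.
    """
    with tarfile.open(tar_path, "w") as tar:
        for key, image_bytes, metadata in examples:
            payloads = {
                f"{key}.jpg": image_bytes,
                f"{key}.json": json.dumps(metadata).encode("utf-8"),
            }
            for name, payload in payloads.items():
                info = tarfile.TarInfo(name=name)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))


# Placeholder bytes stand in for real JPEG data.
examples = [
    ("e39871fd9fd74f55", b"<jpeg bytes>", {"bbox": [[302.0, 109.0, 73.0, 52.0]], "categories": [0]}),
    ("f18b91585c4d3f3e", b"<jpeg bytes>", {"bbox": [[810.0, 100.0, 57.0, 28.0]], "categories": [1]}),
]

with tempfile.TemporaryDirectory() as tmp:
    shard = Path(tmp) / "00000.tar"
    write_shard(examples, shard)
    # List the archive members to check the prefix grouping.
    with tarfile.open(shard) as tar:
        member_names = tar.getnames()

print(member_names)
```

Keeping the files of one example adjacent in the archive matters because TAR archives are read sequentially.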

Load your WebDataset, and it will create one column per file suffix (here “jpg” and “json”):

>>> from datasets import load_dataset

>>> dataset = load_dataset("webdataset", data_dir="/path/to/folder", split="train")
>>> dataset[0]["json"]
{"bbox": [[302.0, 109.0, 73.0, 52.0]], "categories": [0]}

Loading script

Write a dataset loading script to share a dataset. It defines a dataset’s splits and configurations, and handles downloading and generating a dataset. The script is located in the same folder or repository as the dataset and should have the same name.

my_dataset/
├── README.md
├── my_dataset.py
└── data/  # optional, may contain your images or TAR archives

This structure allows your dataset to be loaded in one line:

>>> from datasets import load_dataset
>>> dataset = load_dataset("path/to/my_dataset")

This guide will show you how to create a dataset loading script for image datasets, which is a bit different from creating a loading script for text datasets. You’ll learn how to:

Create a dataset builder class.
Create dataset configurations.
Add dataset metadata.
Download and define the dataset splits.
Generate the dataset.
Generate the dataset metadata (optional).
Upload the dataset to the Hub.

The best way to learn is to open up an existing image dataset loading script, like Food-101, and follow along!

To help you get started, we created a loading script template you can copy and use as a starting point!

Create a dataset builder class

GeneratorBasedBuilder is the base class for datasets generated from a dictionary generator. Within this class, there are three methods to help create your dataset:

info stores information about your dataset like its description, license, and features.
split_generators downloads the dataset and defines its splits.
generate_examples generates the images and labels for each split.

Start by creating your dataset class as a subclass of GeneratorBasedBuilder and add the three methods. Don’t worry about filling in each of these methods yet, you’ll develop those over the next few sections:

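import datasets


class Food101(datasets.GeneratorBasedBuilder):
    """Food-101 Images dataset"""

    def _info(self):
        pass  # filled in under "Add dataset metadata"

    def _split_generators(self, dl_manager):
        pass  # filled in under "Download and define the dataset splits"

    def _generate_examples(self, images, metadata_path):
        pass  # filled in under "Generate the dataset"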

Multiple configurations

In some cases, a dataset may have more than one configuration. For example, if you check out the Imagenette dataset, you’ll notice there are three subsets.

To create different configurations, use the BuilderConfig class to create a subclass for your dataset. Provide the links to download the images and labels in data_url and metadata_urls:

class Food101Config(datasets.BuilderConfig):
    """Builder Config for Food-101"""
 
    def __init__(self, data_url, metadata_urls, **kwargs):
        """BuilderConfig for Food-101.
        Args:
          data_url: `string`, url to download the zip file from.
          metadata_urls: dictionary with keys 'train' and 'validation' containing the archive metadata URLs
          **kwargs: keyword arguments forwarded to super.
        """
        super().__init__(version=datasets.Version("1.0.0"), **kwargs)
        self.data_url = data_url
        self.metadata_urls = metadata_urls

Now you can define your subsets at the top of GeneratorBasedBuilder. Imagine you want to create two subsets in the Food-101 dataset based on whether it is a breakfast or dinner food.

Define your subsets with Food101Config in a list in BUILDER_CONFIGS.
For each configuration, provide a name, description, and where to download the images and labels from.

class Food101(datasets.GeneratorBasedBuilder):
    """Food-101 Images dataset"""
 
    BUILDER_CONFIGS = [
        Food101Config(
            name="breakfast",
            description="Food types commonly eaten during breakfast.",
            data_url="https://link-to-breakfast-foods.zip",
            metadata_urls={
                "train": "https://link-to-breakfast-foods-train.txt", 
                "validation": "https://link-to-breakfast-foods-validation.txt"
            },
        ),
        Food101Config(
            name="dinner",
            description="Food types commonly eaten during dinner.",
            data_url="https://link-to-dinner-foods.zip",
            metadata_urls={
                "train": "https://link-to-dinner-foods-train.txt",
                "validation": "https://link-to-dinner-foods-validation.txt"
            },
        ),
        # ...
    ]

Now if users want to load the breakfast configuration, they can use the configuration name:

>>> from datasets import load_dataset
>>> ds = load_dataset("food101", "breakfast", split="train")

Add dataset metadata

Adding information about your dataset is useful for users to learn more about it. This information is stored in the DatasetInfo class which is returned by the info method. Users can access this information by:

>>> from datasets import load_dataset_builder
>>> ds_builder = load_dataset_builder("food101")
>>> ds_builder.info

There is a lot of information you can specify about your dataset, but some important ones to include are:

description provides a concise description of the dataset.
features specify the dataset column types. Since you’re creating an image loading script, you’ll need to include the Image feature.
supervised_keys specify the input feature and label.
homepage provides a link to the dataset homepage.
citation is a BibTeX citation of the dataset.
license states the dataset’s license.

You’ll notice a lot of the dataset information is defined earlier in the loading script, which makes it easier to read. There are also other Features you can input, so be sure to check out the full list for more details.

def _info(self):
    return datasets.DatasetInfo(
        description=_DESCRIPTION,
        features=datasets.Features(
            {
                "image": datasets.Image(),
                "label": datasets.ClassLabel(names=_NAMES),
            }
        ),
        supervised_keys=("image", "label"),
        homepage=_HOMEPAGE,
        citation=_CITATION,
        license=_LICENSE,
        task_templates=[ImageClassification(image_column="image", label_column="label")],
    )

Download and define the dataset splits

Now that you’ve added some information about your dataset, the next step is to download the dataset and generate the splits.

Use the DownloadManager.download() method to download the dataset and any other metadata you’d like to associate with it. This method accepts:
- the name of a file inside a Hub dataset repository (in other words, the data/ folder)
- a URL to a file hosted somewhere else
- a list or dictionary of file names or URLs

In the Food-101 loading script, you’ll notice again that the URLs are defined earlier in the script.

After you’ve downloaded the dataset, use the SplitGenerator to organize the images and labels in each split. Name each split with a standard name like: Split.TRAIN, Split.TEST, and Split.VALIDATION.

In the gen_kwargs parameter, specify the file paths to the images to iterate over and load. If necessary, you can use DownloadManager.iter_archive() to iterate over images in TAR archives. You can also specify the associated labels in metadata_path. Both images and metadata_path are passed on to the next step, where you’ll generate the dataset.

To stream a TAR archive file, you need to use DownloadManager.iter_archive()! The DownloadManager.download_and_extract() function does not support TAR archives in streaming mode.

def _split_generators(self, dl_manager):
    archive_path = dl_manager.download(_BASE_URL)
    split_metadata_paths = dl_manager.download(_METADATA_URLS)
    return [
        datasets.SplitGenerator(
            name=datasets.Split.TRAIN,
            gen_kwargs={
                "images": dl_manager.iter_archive(archive_path),
                "metadata_path": split_metadata_paths["train"],
            },
        ),
        datasets.SplitGenerator(
            name=datasets.Split.VALIDATION,
            gen_kwargs={
                "images": dl_manager.iter_archive(archive_path),
                "metadata_path": split_metadata_paths["test"],
            },
        ),
    ]

Generate the dataset

The last method in the GeneratorBasedBuilder class actually generates the images and labels in the dataset. It yields examples according to the structure specified in features from the info method. As you can see, generate_examples accepts the images and metadata_path from the previous method as arguments.

To stream a TAR archive file, the metadata_path needs to be opened and read first. TAR files are accessed and yielded sequentially. This means you need to have the metadata information in hand first so you can yield it with its corresponding image.

Now you can write a function for opening and loading examples from the dataset:

def _generate_examples(self, images, metadata_path):
    """Generate images and labels for splits."""
    with open(metadata_path, encoding="utf-8") as f:
        files_to_keep = set(f.read().split("\n"))
    for file_path, file_obj in images:
        if file_path.startswith(_IMAGES_DIR):
            if file_path[len(_IMAGES_DIR) : -len(".jpg")] in files_to_keep:
                label = file_path.split("/")[2]
                yield file_path, {
                    "image": {"path": file_path, "bytes": file_obj.read()},
                    "label": label,
                }
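
To see why the metadata has to be loaded before iterating, the same filtering logic can be tried without the library on a small in-memory TAR archive. This is only an illustration: iter_tar is a hypothetical stand-in built on Python's tarfile module, not the implementation of DownloadManager.iter_archive(), and the archive contents and the _IMAGES_DIR value are made up for the example.

```python
import io
import tarfile

_IMAGES_DIR = "food-101/images/"  # assumed archive prefix, for illustration only


def build_archive(files):
    """Build a small in-memory TAR archive from a {name: bytes} mapping."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


def iter_tar(fileobj):
    """Yield (path, file object) pairs in archive order, reading as a stream."""
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if member.isfile():
                yield member.name, tar.extractfile(member)


def generate_examples(images, files_to_keep):
    """Keep only the images listed in the split metadata; label = class directory."""
    for file_path, file_obj in images:
        if not file_path.startswith(_IMAGES_DIR):
            continue
        if file_path[len(_IMAGES_DIR) : -len(".jpg")] in files_to_keep:
            # Read the bytes now: a streamed archive cannot be rewound later.
            yield file_path, {
                "image": {"path": file_path, "bytes": file_obj.read()},
                "label": file_path.split("/")[2],
            }


archive = build_archive(
    {
        "food-101/images/apple_pie/1005649.jpg": b"train image",
        "food-101/images/apple_pie/1011328.jpg": b"held-out image",
        "food-101/meta/train.txt": b"apple_pie/1005649\n",
    }
)
# The split metadata is loaded first, before iterating over the archive.
files_to_keep = {"apple_pie/1005649"}
examples = list(generate_examples(iter_tar(archive), files_to_keep))
print([(key, example["label"]) for key, example in examples])
```

Because each member is visited exactly once and in order, the decision to keep or skip an image has to be possible at the moment it is read, which is why files_to_keep is built beforehand.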

Generate the dataset metadata (optional)

The dataset metadata can be generated and stored in the dataset card (README.md file).

Run the following command to generate your dataset metadata in README.md and make sure your new loading script works correctly:

datasets-cli test path/to/<your-dataset-loading-script> --save_info --all_configs

If your loading script passed the test, you should now have the dataset_info YAML fields in the header of the README.md file in your dataset folder.

Upload the dataset to the Hub

Once your script is ready, create a dataset card and upload it to the Hub.

Congratulations, you can now load your dataset from the Hub! 🥳

>>> from datasets import load_dataset
>>> load_dataset("<username>/my_dataset")