Create an image dataset
There are two methods for creating and sharing an image dataset. This guide will show you how to:
- Create an image dataset with
ImageFolder
and some metadata. This is a no-code solution for quickly creating an image dataset with several thousand images. - Create an image dataset by writing a loading script. This method is a bit more involved, but you have greater flexibility over how a dataset is defined, downloaded, and generated which can be useful for more complex or large scale image datasets.
You can control access to your dataset by requiring users to share their contact information first. Check out the Gated datasets guide for more information about how to enable this feature on the Hub.
ImageFolder
The ImageFolder
is a dataset builder designed to quickly load an image dataset with several thousand images without requiring you to write any code.
💡 Take a look at the Split pattern hierarchy to learn more about how ImageFolder
creates dataset splits based on your dataset repository structure.
ImageFolder
automatically infers the class labels of your dataset based on the directory name. Store your dataset in a directory structure like:
folder/train/dog/golden_retriever.png
folder/train/dog/german_shepherd.png
folder/train/dog/chihuahua.png
folder/train/cat/maine_coon.png
folder/train/cat/bengal.png
folder/train/cat/birman.png
Then users can load your dataset by specifying imagefolder
in load_dataset() and the directory in data_dir
:
>>> from datasets import load_dataset
>>> dataset = load_dataset("imagefolder", data_dir="/path/to/folder")
You can also use imagefolder
to load datasets involving multiple splits. To do so, your dataset directory should have the following structure:
folder/train/dog/golden_retriever.png
folder/train/cat/maine_coon.png
folder/test/dog/german_shepherd.png
folder/test/cat/bengal.png
If all image files are contained in a single directory or if they are not on the same level of directory structure, label
column won’t be added automatically. If you need it, set drop_labels=False
explicitly.
If there is additional information you’d like to include about your dataset, like text captions or bounding boxes, add it as a metadata.csv
file in your folder. This lets you quickly create datasets for different computer vision tasks like text captioning or object detection. You can also use a JSONL file metadata.jsonl
.
folder/train/metadata.csv
folder/train/0001.png
folder/train/0002.png
folder/train/0003.png
You can also zip your images:
folder/metadata.csv folder/train.zip folder/test.zip folder/valid.zip
Your metadata.csv
file must have a file_name
column which links image files with their metadata:
file_name,additional_feature
0001.png,This is a first value of a text feature you added to your images
0002.png,This is a second value of a text feature you added to your images
0003.png,This is a third value of a text feature you added to your images
or using metadata.jsonl
:
{"file_name": "0001.png", "additional_feature": "This is a first value of a text feature you added to your images"}
{"file_name": "0002.png", "additional_feature": "This is a second value of a text feature you added to your images"}
{"file_name": "0003.png", "additional_feature": "This is a third value of a text feature you added to your images"}
If metadata files are present, the inferred labels based on the directory name are dropped by default. To include those labels, set drop_labels=False
in load_dataset
.
Image captioning
Image captioning datasets have text describing an image. An example metadata.csv
may look like:
file_name,text
0001.png,This is a golden retriever playing with a ball
0002.png,A german shepherd
0003.png,One chihuahua
Load the dataset with ImageFolder
, and it will create a text
column for the image captions:
>>> dataset = load_dataset("imagefolder", data_dir="/path/to/folder", split="train")
>>> dataset[0]["text"]
"This is a golden retriever playing with a ball"
Object detection
Object detection datasets have bounding boxes and categories identifying objects in an image. An example metadata.jsonl
may look like:
{"file_name": "0001.png", "objects": {"bbox": [[302.0, 109.0, 73.0, 52.0]], "categories": [0]}}
{"file_name": "0002.png", "objects": {"bbox": [[810.0, 100.0, 57.0, 28.0]], "categories": [1]}}
{"file_name": "0003.png", "objects": {"bbox": [[160.0, 31.0, 248.0, 616.0], [741.0, 68.0, 202.0, 401.0]], "categories": [2, 2]}}
Load the dataset with ImageFolder
, and it will create a objects
column with the bounding boxes and the categories:
>>> dataset = load_dataset("imagefolder", data_dir="/path/to/folder", split="train")
>>> dataset[0]["objects"]
{"bbox": [[302.0, 109.0, 73.0, 52.0]], "categories": [0]}
Upload dataset to the Hub
Once you’ve created a dataset, you can share it to the Hub with the push_to_hub() method. Make sure you have the huggingface_hub library installed and you’re logged in to your Hugging Face account (see the Upload with Python tutorial for more details).
Upload your dataset with push_to_hub():
>>> from datasets import load_dataset
>>> dataset = load_dataset("imagefolder", data_dir="/path/to/folder", split="train")
>>> dataset.push_to_hub("stevhliu/my-image-captioning-dataset")
WebDataset
The WebDataset format is based on TAR archives and is suitable for big image datasets. Indeed you can group your images in TAR archives (e.g. 1GB of images per TAR archive) and have thousands of TAR archives:
folder/train/00000.tar
folder/train/00001.tar
folder/train/00002.tar
...
In the archives, each example is made of files sharing the same prefix:
e39871fd9fd74f55.jpg
e39871fd9fd74f55.json
f18b91585c4d3f3e.jpg
f18b91585c4d3f3e.json
ede6e66b2fb59aab.jpg
ede6e66b2fb59aab.json
ed600d57fcee4f94.jpg
ed600d57fcee4f94.json
...
You can put your images labels/captions/bounding boxes using JSON or text files for example.
For more details on the WebDataset format and the python library, please check the WebDataset documentation.
Load your WebDataset and it will create on column per file suffix (here “jpg” and “json”):
>>> from datasets import load_dataset
>>> dataset = load_dataset("webdataset", data_dir="/path/to/folder", split="train")
>>> dataset[0]["json"]
{"bbox": [[302.0, 109.0, 73.0, 52.0]], "categories": [0]}
Loading script
Write a dataset loading script to share a dataset. It defines a dataset’s splits and configurations, and handles downloading and generating a dataset. The script is located in the same folder or repository as the dataset and should have the same name.
my_dataset/
├── README.md
├── my_dataset.py
└── data/ # optional, may contain your images or TAR archives
This structure allows your dataset to be loaded in one line:
>>> from datasets import load_dataset
>>> dataset = load_dataset("path/to/my_dataset")
This guide will show you how to create a dataset loading script for image datasets, which is a bit different from creating a loading script for text datasets. You’ll learn how to:
- Create a dataset builder class.
- Create dataset configurations.
- Add dataset metadata.
- Download and define the dataset splits.
- Generate the dataset.
- Generate the dataset metadata (optional).
- Upload the dataset to the Hub.
The best way to learn is to open up an existing image dataset loading script, like Food-101, and follow along!
To help you get started, we created a loading script template you can copy and use as a starting point!
Create a dataset builder class
GeneratorBasedBuilder is the base class for datasets generated from a dictionary generator. Within this class, there are three methods to help create your dataset:
info
stores information about your dataset like its description, license, and features.split_generators
downloads the dataset and defines its splits.generate_examples
generates the images and labels for each split.
Start by creating your dataset class as a subclass of GeneratorBasedBuilder and add the three methods. Don’t worry about filling in each of these methods yet, you’ll develop those over the next few sections:
class Food101(datasets.GeneratorBasedBuilder):
"""Food-101 Images dataset"""
def _info(self):
def _split_generators(self, dl_manager):
def _generate_examples(self, images, metadata_path):
Multiple configurations
In some cases, a dataset may have more than one configuration. For example, if you check out the Imagenette dataset, you’ll notice there are three subsets.
To create different configurations, use the BuilderConfig class to create a subclass for your dataset. Provide the links to download the images and labels in data_url
and metadata_urls
:
class Food101Config(datasets.BuilderConfig):
"""Builder Config for Food-101"""
def __init__(self, data_url, metadata_urls, **kwargs):
"""BuilderConfig for Food-101.
Args:
data_url: `string`, url to download the zip file from.
metadata_urls: dictionary with keys 'train' and 'validation' containing the archive metadata URLs
**kwargs: keyword arguments forwarded to super.
"""
super(Food101Config, self).__init__(version=datasets.Version("1.0.0"), **kwargs)
self.data_url = data_url
self.metadata_urls = metadata_urls
Now you can define your subsets at the top of GeneratorBasedBuilder. Imagine you want to create two subsets in the Food-101 dataset based on whether it is a breakfast or dinner food.
- Define your subsets with
Food101Config
in a list inBUILDER_CONFIGS
. - For each configuration, provide a name, description, and where to download the images and labels from.
class Food101(datasets.GeneratorBasedBuilder):
"""Food-101 Images dataset"""
BUILDER_CONFIGS = [
Food101Config(
name="breakfast",
description="Food types commonly eaten during breakfast.",
data_url="https://link-to-breakfast-foods.zip",
metadata_urls={
"train": "https://link-to-breakfast-foods-train.txt",
"validation": "https://link-to-breakfast-foods-validation.txt"
},
,
Food101Config(
name="dinner",
description="Food types commonly eaten during dinner.",
data_url="https://link-to-dinner-foods.zip",
metadata_urls={
"train": "https://link-to-dinner-foods-train.txt",
"validation": "https://link-to-dinner-foods-validation.txt"
},
)...
]
Now if users want to load the breakfast
configuration, they can use the configuration name:
>>> from datasets import load_dataset
>>> ds = load_dataset("food101", "breakfast", split="train")
Add dataset metadata
Adding information about your dataset is useful for users to learn more about it. This information is stored in the DatasetInfo class which is returned by the info
method. Users can access this information by:
>>> from datasets import load_dataset_builder
>>> ds_builder = load_dataset_builder("food101")
>>> ds_builder.info
There is a lot of information you can specify about your dataset, but some important ones to include are:
description
provides a concise description of the dataset.features
specify the dataset column types. Since you’re creating an image loading script, you’ll need to include the Image feature.supervised_keys
specify the input feature and label.homepage
provides a link to the dataset homepage.citation
is a BibTeX citation of the dataset.license
states the dataset’s license.
You’ll notice a lot of the dataset information is defined earlier in the loading script which makes it easier to read. There are also other ~Datasets.Features
you can input, so be sure to check out the full list for more details.
def _info(self):
return datasets.DatasetInfo(
description=_DESCRIPTION,
features=datasets.Features(
{
"image": datasets.Image(),
"label": datasets.ClassLabel(names=_NAMES),
}
),
supervised_keys=("image", "label"),
homepage=_HOMEPAGE,
citation=_CITATION,
license=_LICENSE,
task_templates=[ImageClassification(image_column="image", label_column="label")],
)
Download and define the dataset splits
Now that you’ve added some information about your dataset, the next step is to download the dataset and generate the splits.
Use the DownloadManager.download() method to download the dataset and any other metadata you’d like to associate with it. This method accepts:
- a name to a file inside a Hub dataset repository (in other words, the
data/
folder) - a URL to a file hosted somewhere else
- a list or dictionary of file names or URLs
In the Food-101 loading script, you’ll notice again the URLs are defined earlier in the script.
- a name to a file inside a Hub dataset repository (in other words, the
After you’ve downloaded the dataset, use the SplitGenerator to organize the images and labels in each split. Name each split with a standard name like:
Split.TRAIN
,Split.TEST
, andSPLIT.Validation
.In the
gen_kwargs
parameter, specify the file paths to theimages
to iterate over and load. If necessary, you can use DownloadManager.iter_archive() to iterate over images in TAR archives. You can also specify the associated labels in themetadata_path
. Theimages
andmetadata_path
are actually passed onto the next step where you’ll actually generate the dataset.
To stream a TAR archive file, you need to use DownloadManager.iter_archive()! The DownloadManager.download_and_extract() function does not support TAR archives in streaming mode.
def _split_generators(self, dl_manager):
archive_path = dl_manager.download(_BASE_URL)
split_metadata_paths = dl_manager.download(_METADATA_URLS)
return [
datasets.SplitGenerator(
name=datasets.Split.TRAIN,
gen_kwargs={
"images": dl_manager.iter_archive(archive_path),
"metadata_path": split_metadata_paths["train"],
},
),
datasets.SplitGenerator(
name=datasets.Split.VALIDATION,
gen_kwargs={
"images": dl_manager.iter_archive(archive_path),
"metadata_path": split_metadata_paths["test"],
},
),
]
Generate the dataset
The last method in the GeneratorBasedBuilder class actually generates the images and labels in the dataset. It yields a dataset according to the stucture specified in features
from the info
method. As you can see, generate_examples
accepts the images
and metadata_path
from the previous method as arguments.
To stream a TAR archive file, the metadata_path
needs to be opened and read first. TAR files are accessed and yielded sequentially. This means you need to have the metadata information in hand first so you can yield it with its corresponding image.
Now you can write a function for opening and loading examples from the dataset:
def _generate_examples(self, images, metadata_path):
"""Generate images and labels for splits."""
with open(metadata_path, encoding="utf-8") as f:
files_to_keep = set(f.read().split("\n"))
for file_path, file_obj in images:
if file_path.startswith(_IMAGES_DIR):
if file_path[len(_IMAGES_DIR) : -len(".jpg")] in files_to_keep:
label = file_path.split("/")[2]
yield file_path, {
"image": {"path": file_path, "bytes": file_obj.read()},
"label": label,
}
Generate the dataset metadata (optional)
The dataset metadata can be generated and stored in the dataset card (README.md
file).
Run the following command to generate your dataset metadata in README.md
and make sure your new loading script works correctly:
datasets-cli test path/to/<your-dataset-loading-script> --save_info --all_configs
If your loading script passed the test, you should now have the dataset_info
YAML fields in the header of the README.md
file in your dataset folder.
Upload the dataset to the Hub
Once your script is ready, create a dataset card and upload it to the Hub.
Congratulations, you can now load your dataset from the Hub! 🥳
>>> from datasets import load_dataset
>>> load_dataset("<username>/my_dataset")