Hugging Face Dataset Upload Decision Guide
Decision guide for uploading datasets to the Hugging Face Hub. Optimized for Dataset Viewer compatibility and integration with the Hugging Face ecosystem.
Overview
Your goal is to help a user upload a dataset to the Hugging Face Hub. Ideally, the dataset should be compatible with the Dataset Viewer (and thus the load_dataset
function) to ensure easy access and usability. You should aim to meet the following criteria:
Criteria | Description | Priority |
---|---|---|
Respect repository limits | Ensure the dataset adheres to Hugging Face’s storage limits for file sizes, repository sizes, and file counts. See the Critical Constraints section below for specific limits. | Required |
Use hub-compatible formats | Use Parquet format when possible (best compression, rich typing, large dataset support). For smaller datasets (<several GB), JSON/JSONL or CSV are acceptable. Raw files work well for images/audio in smaller datasets while respecting repo limits. Use WebDataset (.tar) for large media collections. Domain-specific formats can be used when conversion is impractical. | Desired |
Dataset Viewer compatibility | Structure data to work with the automatic Dataset Viewer, enabling preview and easy exploration. This typically means using supported formats and proper file organization. Validation steps are provided later in this guide. | Desired |
Organize data sensibly | Use logical folder structures that match Hub conventions (e.g., train/test splits). Configs can be used to define different configurations of the dataset. This facilitates both human understanding and automatic data loading. | Desired |
Use appropriate Features | When using the datasets library, specify correct feature types (e.g., Image(), Audio(), ClassLabel()) to ensure proper data handling and viewer functionality. This enables type-specific optimizations and previews. | Required (when using datasets library) |
Document non-standard datasets | If conversion to hub-compatible formats is impossible and custom formats must be used, ensure repository limits are strictly followed and provide clear documentation on how to download and load the dataset. Include usage examples and any special requirements. | Required (when datasets library isn’t compatible) |
Working Without File Access
When you don’t have direct access to the user’s files (e.g., web interface), ask the user to run these commands to understand their dataset:
Dataset structure:
# Show directory tree (install with: apt install tree or brew install tree)
tree -L 3 --filelimit 20
# Alternative without tree:
find . -type f \( -name "*.csv" -o -name "*.json" -o -name "*.parquet" \) | head -20
Check file sizes:
# Total dataset size
du -sh .
# Individual file sizes
ls -lh data/
Peek at data format:
# First few lines of CSV/JSON
head -n 5 data/train.csv
# Check image folder structure
ls -la images/ | head -10
Quick file count:
# Count files by type
find . -name "*.jpg" | wc -l
Critical Constraints
Storage Limits:
# Machine-readable Hub limits
hub_limits:
  max_file_size_gb: 50            # absolute hard stop enforced by LFS
  recommended_file_size_gb: 20    # best-practice shard size
  max_files_per_folder: 10000     # Git performance threshold
  max_files_per_repo: 100000      # Repository file count limit
  recommended_repo_size_gb: 300   # public-repo soft cap; contact HF if larger
  viewer_row_size_mb: 2           # approximate per-row viewer limit
Human-readable summary:
- Free: 100GB private datasets
- Pro (for individuals) | Team or Enterprise (for organizations): 1TB+ private storage per seat (see pricing)
- Public: 300GB (contact datasets@huggingface.co for larger)
- Per file: 50GB max, 20GB recommended
- Per folder: <10k files
See https://huggingface.co/docs/hub/storage-limits#repository-limitations-and-recommendations for current recommendations on repository sizes and file counts.
Quick Reference by Data Type
Your Data | Recommended Approach | Quick Command |
---|---|---|
CSV/JSON files | Use built-in loaders (handles any size via memory mapping) | load_dataset("csv", data_files="data.csv").push_to_hub("username/dataset") |
Images in folders | Use imagefolder for automatic class detection | load_dataset("imagefolder", data_dir="./images").push_to_hub("username/dataset") |
Audio files | Use audiofolder for automatic organization | load_dataset("audiofolder", data_dir="./audio").push_to_hub("username/dataset") |
Video files | Use videofolder for automatic organization | load_dataset("videofolder", data_dir="./videos").push_to_hub("username/dataset") |
PDF documents | Use pdffolder for text extraction | load_dataset("pdffolder", data_dir="./pdfs").push_to_hub("username/dataset") |
Very large datasets (100GB+) | Use max_shard_size to control memory usage | dataset.push_to_hub("username/dataset", max_shard_size="5GB") |
Many files / directories (>10k) | Use upload_large_folder to avoid Git limitations | api.upload_large_folder(folder_path="./data", repo_id="username/dataset", repo_type="dataset") |
Streaming large media | WebDataset format for efficient streaming | Create .tar shards, then upload_large_folder() |
Scientific data (HDF5, NetCDF) | Convert to Parquet with Array features | See Scientific Data section |
Custom/proprietary formats | Document thoroughly if conversion impossible | upload_large_folder() with comprehensive README |
Upload Workflow
✓ Gather dataset information (if needed):
- What type of data? (images, text, audio, CSV, etc.)
- How is it organized? (folder structure, single file, multiple files)
- What’s the approximate size?
- What format are the files in?
- Any special requirements? (e.g., streaming, private access)
- Check for existing README or documentation files that describe the dataset
✓ Authenticate:
- CLI: huggingface-cli login
- Or use a token: HfApi(token="hf_...") or set the HF_TOKEN environment variable
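If you prefer to authenticate from Python rather than the CLI, a minimal sketch (the token value is a placeholder):

```python
from huggingface_hub import login

# Equivalent to `huggingface-cli login`; alternatively rely on the HF_TOKEN env var
login(token="hf_...")
```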
✓ Identify your data type: Check the Quick Reference table above
✓ Choose upload method:
- Small files (<1GB) in a hub-compatible format: can use the Hub UI for quick uploads
- Built-in loader available: use the loader + push_to_hub() (see Quick Reference table)
- Large datasets or many files: use upload_large_folder() for datasets >100GB or >10k files
- Custom formats: convert to a hub-compatible format if possible, otherwise document thoroughly
✓ Test locally (if using built-in loader):
# Validate your dataset loads correctly before uploading
dataset = load_dataset("loader_name", data_dir="./your_data")
print(dataset)
✓ Upload to Hub:
# Basic upload
dataset.push_to_hub("username/dataset-name")

# With options for large datasets
dataset.push_to_hub(
    "username/dataset-name",
    max_shard_size="5GB",  # Control memory usage
    private=True           # For private datasets
)
✓ Verify your upload:
- Check the Dataset Viewer: https://huggingface.co/datasets/username/dataset-name
- Test loading: load_dataset("username/dataset-name")
- If the viewer shows errors, check the Common Issues and Dataset Viewer Configuration sections below
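For a quick programmatic check after the upload, a small sketch (the repo id is a placeholder and a train split is assumed):

```python
from datasets import load_dataset
from huggingface_hub import HfApi

# Confirm the repository exists and is visible to your token
info = HfApi().dataset_info("username/dataset-name")
print(info.id)

# Round-trip the data and confirm the schema survived
dataset = load_dataset("username/dataset-name")
print(dataset)                    # splits and row counts
print(dataset["train"].features)  # assumes a "train" split
```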
Common Conversion Patterns
When built-in loaders don’t match your data structure, use the datasets library as a compatibility layer. Convert your data to a Dataset object, then use push_to_hub()
for maximum flexibility and Dataset Viewer compatibility.
From DataFrames
If you already have your data working in pandas, polars, or other dataframe libraries, you can convert directly:
# From pandas DataFrame
import pandas as pd
from datasets import Dataset
df = pd.read_csv("your_data.csv")
dataset = Dataset.from_pandas(df)
dataset.push_to_hub("username/dataset-name")
# From polars DataFrame (direct method)
import polars as pl
from datasets import Dataset
df = pl.read_csv("your_data.csv")
dataset = Dataset.from_polars(df) # Direct conversion
dataset.push_to_hub("username/dataset-name")
# From PyArrow Table (useful for scientific data)
import pyarrow as pa
from datasets import Dataset
# If you have a PyArrow table
table = pa.table({'data': [1, 2, 3], 'labels': ['a', 'b', 'c']})
dataset = Dataset(table)
dataset.push_to_hub("username/dataset-name")
# For Spark/Dask dataframes, see https://huggingface.co/docs/hub/datasets-libraries
Custom Format Conversion
When built-in loaders don’t match your data format, convert to Dataset objects following these principles:
Design Principles
1. Prefer wide/flat structures over joins
- Denormalize relational data into single rows for better usability
- Include all relevant information in each example
- Lean towards bigger but more usable data - Hugging Face’s infrastructure uses advanced deduplication (XetHub) and Parquet optimizations to handle redundancy efficiently (see the sketch after this list)
2. Use configs for logical dataset variations
- Beyond train/test/val splits, use configs for different subsets or views of your data
- Each config can have different features or data organization
- Example: language-specific configs, task-specific views, or data modalities
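For example, flattening relational data with pandas before upload; a minimal sketch assuming hypothetical users.csv and posts.csv tables:

```python
import pandas as pd
from datasets import Dataset

# Hypothetical normalized tables
users = pd.read_csv("users.csv")   # columns: user_id, country
posts = pd.read_csv("posts.csv")   # columns: post_id, user_id, text, label

# Denormalize: one wide row per post, carrying the user metadata along
flat = posts.merge(users, on="user_id", how="left")

Dataset.from_pandas(flat).push_to_hub("username/flat-dataset")
```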
Conversion Methods
Small datasets (fit in memory) - use Dataset.from_dict():
# Parse your custom format into a dictionary
data_dict = {
"text": ["example1", "example2"],
"label": ["positive", "negative"],
"score": [0.9, 0.2]
}
# Create dataset with appropriate features
from datasets import Dataset, Features, Value, ClassLabel
features = Features({
'text': Value('string'),
'label': ClassLabel(names=['negative', 'positive']),
'score': Value('float32')
})
dataset = Dataset.from_dict(data_dict, features=features)
dataset.push_to_hub("username/dataset")
Large datasets (memory-efficient) - use Dataset.from_generator():
def data_generator():
# Parse your custom format progressively
for item in parse_large_file("data.custom"):
yield {
"text": item["content"],
"label": item["category"],
"embedding": item["vector"]
}
# Specify features for Dataset Viewer compatibility
from datasets import Features, Value, ClassLabel, List
features = Features({
'text': Value('string'),
'label': ClassLabel(names=['cat1', 'cat2', 'cat3']),
'embedding': List(feature=Value('float32'), length=768)
})
dataset = Dataset.from_generator(data_generator, features=features)
dataset.push_to_hub("username/dataset", max_shard_size="1GB")
Tip: For large datasets, test with a subset first by adding a limit to your generator or using .select(range(100)) after creation.
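As a concrete version of that tip, one way to sanity-check a small sample before the full push (reusing the data_generator and features from the sketch above):

```python
import itertools

from datasets import Dataset

def limited_generator():
    # Only the first 100 items from the full generator
    yield from itertools.islice(data_generator(), 100)

sample = Dataset.from_generator(limited_generator, features=features)
print(sample[0])         # inspect one example
print(sample.features)   # confirm the schema looks right
```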
Using Configs for Dataset Variations
# Push different configurations of your dataset
dataset_en = Dataset.from_dict(english_data, features=features)
dataset_en.push_to_hub("username/multilingual-dataset", config_name="english")
dataset_fr = Dataset.from_dict(french_data, features=features)
dataset_fr.push_to_hub("username/multilingual-dataset", config_name="french")
# Users can then load specific configs
dataset = load_dataset("username/multilingual-dataset", "english")
Multi-modal Examples
Text + Audio (speech recognition):
from pathlib import Path

from datasets import Audio, Dataset, Features, Value

def speech_generator():
for audio_file in Path("audio/").glob("*.wav"):
transcript_file = audio_file.with_suffix(".txt")
yield {
"audio": str(audio_file),
"text": transcript_file.read_text().strip(),
"speaker_id": audio_file.stem.split("_")[0]
}
features = Features({
'audio': Audio(sampling_rate=16000),
'text': Value('string'),
'speaker_id': Value('string')
})
dataset = Dataset.from_generator(speech_generator, features=features)
dataset.push_to_hub("username/speech-dataset")
Multiple images per example:
from datasets import ClassLabel, Dataset, Features, Image

# Before/after images, medical imaging, etc.
data = {
"image_before": ["img1_before.jpg", "img2_before.jpg"],
"image_after": ["img1_after.jpg", "img2_after.jpg"],
"treatment": ["method_A", "method_B"]
}
features = Features({
'image_before': Image(),
'image_after': Image(),
'treatment': ClassLabel(names=['method_A', 'method_B'])
})
dataset = Dataset.from_dict(data, features=features)
dataset.push_to_hub("username/before-after-images")
Note: For text + images, consider using ImageFolder with metadata.csv which handles this automatically.
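For reference, a minimal ImageFolder-with-captions layout; the file names and the text column are illustrative:

```
my_dataset/
├── metadata.csv
├── img_0001.jpg
└── img_0002.jpg
```

metadata.csv (the file_name column is required; the other columns become dataset features):

```
file_name,text
img_0001.jpg,"A caption for the first image"
img_0002.jpg,"A caption for the second image"
```

Then load and push as usual: load_dataset("imagefolder", data_dir="my_dataset").push_to_hub("username/captioned-images").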
Essential Features
Features define the schema and data types for your dataset columns. Specifying correct features ensures:
- Proper data handling and type conversion
- Dataset Viewer functionality (e.g., image/audio previews)
- Efficient storage and loading
- Clear documentation of your data structure
For complete feature documentation, see: Dataset Features
Feature Types Overview
Basic Types:
- Value: Scalar values - string, int64, float32, bool, binary, and other numeric types
- ClassLabel: Categorical data with named classes
- Sequence: Lists of any feature type
- LargeList: For very large lists
Media Types (enable Dataset Viewer previews):
- Image(): Handles various image formats, returns PIL Image objects
- Audio(sampling_rate=16000): Audio with array data and optional sampling rate
- Video(): Video files
- Pdf(): PDF documents with text extraction
Array Types (for tensors/scientific data):
- Array2D, Array3D, Array4D, Array5D: Fixed or variable-length arrays
- Example: Array2D(shape=(224, 224), dtype='float32')
- First dimension can be None for variable length
Translation Types:
- Translation: For translation pairs with fixed languages
- TranslationVariableLanguages: For translations with varying language pairs
Note: New feature types are added regularly. Check the documentation for the latest additions.
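As an illustration, a schema combining several of these types (the column names are invented for the example):

```python
from datasets import Features, Value, ClassLabel, Sequence, Array2D, Translation

features = Features({
    'text': Value('string'),
    'label': ClassLabel(names=['negative', 'positive']),
    'tokens': Sequence(Value('string')),
    'spectrogram': Array2D(shape=(None, 128), dtype='float32'),  # variable-length first dimension
    'translation': Translation(languages=['en', 'fr']),
})
```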
Upload Methods
Dataset objects (use push_to_hub): Use when you’ve loaded/converted data using the datasets library
dataset.push_to_hub("username/dataset", max_shard_size="5GB")
Pre-existing files (use upload_large_folder): Use when you have hub-compatible files (e.g., Parquet files) already prepared and organized
from huggingface_hub import HfApi
api = HfApi()
api.upload_large_folder(folder_path="./data", repo_id="username/dataset", repo_type="dataset", num_workers=16)
Important: Before using upload_large_folder, verify the files meet repository limits:
- Check folder structure if you have file access: ensure no folder contains >10k files
- Ask the user to confirm: “Are your files in a hub-compatible format (Parquet/CSV/JSON) and organized appropriately?”
- For non-standard formats, consider converting to Dataset objects first to ensure compatibility
Validation
Consider small reformatting: If data is close to a built-in loader format, suggest minor changes:
- Rename columns (e.g., ‘filename’ → ‘file_name’ for ImageFolder)
- Reorganize folders (e.g., move images into class subfolders; see the sketch after this list)
- Rename files to match expected patterns (e.g., ‘data.csv’ → ‘train.csv’)
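A small sketch of the folder reorganization suggested above, assuming a hypothetical labels.csv with file_name and label columns:

```python
import shutil
from pathlib import Path

import pandas as pd

labels = pd.read_csv("labels.csv")  # hypothetical: file_name,label

for row in labels.itertuples():
    # ImageFolder infers the class from the subfolder name
    target_dir = Path("images/train") / row.label
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(Path("images") / row.file_name), str(target_dir / row.file_name))
```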
Pre-upload:
Test locally:
load_dataset("imagefolder", data_dir="./data")
Verify features work correctly:
# Test first example
print(dataset[0])

# For images: verify they load
if 'image' in dataset.features:
    dataset[0]['image']  # Should return PIL Image

# Check dataset size before upload
print(f"Size: {len(dataset)} examples")
Check metadata.csv has ‘file_name’ column
Verify relative paths, no leading slashes
Ensure no folder >10k files
Post-upload:
- Check viewer: https://huggingface.co/datasets/username/dataset
- Test loading: load_dataset("username/dataset")
- Verify features preserved: print(dataset.features)
Common Issues → Solutions
Issue | Solution |
---|---|
“Repository not found” | Run huggingface-cli login |
Memory errors | Use max_shard_size="500MB" |
Dataset viewer not working | Wait 5-10min, check README.md config |
Timeout errors | Use multi_commits=True |
Files >50GB | Split into smaller files |
“File not found” | Use relative paths in metadata |
Dataset Viewer Configuration
Note: This section is primarily for datasets uploaded directly to the Hub (via UI or upload_large_folder). Datasets uploaded with push_to_hub() typically configure the viewer automatically.
When automatic detection works
The Dataset Viewer automatically detects standard structures:
- Files named: train.csv, test.json, validation.parquet
- Directories named: train/, test/, validation/
- Split names with delimiters: test-data.csv ✓ (not testdata.csv ✗)
Manual configuration
For custom structures, add YAML to your README.md:
---
configs:
- config_name: default  # Required even for single config!
  data_files:
  - split: train
    path: "data/train/*.parquet"
  - split: test
    path: "data/test/*.parquet"
---
Multiple configurations example:
---
configs:
- config_name: english
  data_files: "en/*.parquet"
- config_name: french
  data_files: "fr/*.parquet"
---
Common viewer issues
- No viewer after upload: Wait 5-10 minutes for processing
- “Config names error”: Add the config_name field (required!)
- Files not detected: Check naming patterns (needs delimiters)
- Viewer disabled: Remove viewer: false from the README YAML
Quick Templates
# ImageFolder with metadata
dataset = load_dataset("imagefolder", data_dir="./images")
dataset.push_to_hub("username/dataset")
# Memory-efficient upload
dataset.push_to_hub("username/dataset", max_shard_size="500MB")
# Multiple CSV files
dataset = load_dataset('csv', data_files={'train': 'train.csv', 'test': 'test.csv'})
dataset.push_to_hub("username/dataset")
Documentation
Core docs: Adding datasets | Dataset viewer | Storage limits | Upload guide
Dataset Cards
Remind users to add a dataset card (README.md) with:
- Dataset description and usage
- License information
- Citation details
See Dataset Cards guide for details.
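A minimal README.md front-matter sketch for a dataset card (all values are placeholders):

```yaml
---
license: mit
language:
- en
pretty_name: My Dataset
tags:
- text-classification
---
```

The body of the README.md should then describe the dataset, how to load it, and how to cite it.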
Appendix: Special Cases
WebDataset Structure
For streaming large media datasets:
- Create 1-5GB tar shards
- Consistent internal structure
- Upload with upload_large_folder
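A minimal sketch of writing WebDataset-style shards with Python's standard library; the source folder, extensions, and shard size are illustrative:

```python
import tarfile
from pathlib import Path

images = sorted(Path("images").glob("*.jpg"))  # hypothetical source folder
samples_per_shard = 1000  # tune so each shard lands around 1-5GB

for start in range(0, len(images), samples_per_shard):
    shard_path = f"shard-{start // samples_per_shard:05d}.tar"
    with tarfile.open(shard_path, "w") as tar:
        for img in images[start:start + samples_per_shard]:
            # WebDataset groups files by a shared key (the stem), so the image
            # and its caption use the same base name inside the tar
            tar.add(img, arcname=f"{img.stem}.jpg")
            caption = img.with_suffix(".txt")
            if caption.exists():
                tar.add(caption, arcname=f"{img.stem}.txt")
```

Upload the resulting .tar files with upload_large_folder as described above.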
Scientific Data
- HDF5/NetCDF → Convert to Parquet with Array features
- Time series → Array2D(shape=(None, n))
- Complex metadata → Store as JSON strings
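A hedged sketch of the HDF5-to-Hub path using Dataset.from_generator, assuming h5py is installed and using hypothetical dataset keys:

```python
import h5py
from datasets import Array2D, Dataset, Features, Value

def hdf5_generator():
    # "measurements" and "labels" are hypothetical keys in data.h5
    with h5py.File("data.h5", "r") as f:
        for i in range(len(f["labels"])):
            yield {
                "measurement": f["measurements"][i],  # e.g. a (128, 16) float array
                "label": str(f["labels"][i]),
            }

features = Features({
    "measurement": Array2D(shape=(128, 16), dtype="float32"),
    "label": Value("string"),
})

dataset = Dataset.from_generator(hdf5_generator, features=features)
dataset.push_to_hub("username/scientific-dataset", max_shard_size="1GB")
```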
Community Resources
For very specialized or bespoke formats:
- Search the Hub for similar datasets: https://huggingface.co/datasets
- Ask for advice on the Hugging Face Forums
- Join the Hugging Face Discord for real-time help
- Many domain-specific formats already have examples on the Hub