Hugging Face Dataset Upload Decision Guide

This guide is primarily designed for LLMs to help users upload datasets to the Hugging Face Hub in the most compatible format. Users can also reference this guide to understand the upload process and best practices.

Decision guide for uploading datasets to the Hugging Face Hub, optimized for Dataset Viewer compatibility and integration with the Hugging Face ecosystem.

Overview

Your goal is to help a user upload a dataset to the Hugging Face Hub. Ideally, the dataset should be compatible with the Dataset Viewer (and thus the load_dataset function) to ensure easy access and usability. You should aim to meet the following criteria:

| Criteria | Description | Priority |
| --- | --- | --- |
| Respect repository limits | Ensure the dataset adheres to Hugging Face's storage limits for file sizes, repository sizes, and file counts. See the Critical Constraints section below for specific limits. | Required |
| Use hub-compatible formats | Use Parquet format when possible (best compression, rich typing, large dataset support). For smaller datasets (<several GB), JSON/JSONL or CSV are acceptable. Raw files work well for images/audio in smaller datasets while respecting repo limits. Use WebDataset (.tar) for large media collections. Domain-specific formats can be used when conversion is impractical. | Desired |
| Dataset Viewer compatibility | Structure data to work with the automatic Dataset Viewer, enabling preview and easy exploration. This typically means using supported formats and proper file organization. Validation steps are provided later in this guide. | Desired |
| Organize data sensibly | Use logical folder structures that match Hub conventions (e.g., train/test splits). Configs can be used to define different configurations of the dataset. This facilitates both human understanding and automatic data loading. | Desired |
| Use appropriate Features | When using the datasets library, specify correct feature types (e.g., Image(), Audio(), ClassLabel()) to ensure proper data handling and viewer functionality. This enables type-specific optimizations and previews. | Required (when using the datasets library) |
| Document non-standard datasets | If conversion to hub-compatible formats is impossible and custom formats must be used, ensure repository limits are strictly followed and provide clear documentation on how to download and load the dataset. Include usage examples and any special requirements. | Required (when the datasets library isn't compatible) |

Working Without File Access

When you don’t have direct access to the user’s files (e.g., web interface), ask the user to run these commands to understand their dataset:

Dataset structure:

# Show directory tree (install with: apt-get install tree or brew install tree)
tree -L 3 --filelimit 20

# Alternative without tree:
find . -type f \( -name "*.csv" -o -name "*.json" -o -name "*.parquet" \) | head -20

Check file sizes:

# Total dataset size
du -sh .

# Individual file sizes
ls -lh data/

Peek at data format:

# First few lines of CSV/JSON
head -n 5 data/train.csv

# Check image folder structure
ls -la images/ | head -10

Quick file count:

# Count files by type
find . -name "*.jpg" | wc -l

Critical Constraints

Storage Limits:

# Machine-readable Hub limits
hub_limits:
  max_file_size_gb: 50 # absolute hard stop enforced by LFS
  recommended_file_size_gb: 20 # best-practice shard size
  max_files_per_folder: 10000 # Git performance threshold
  max_files_per_repo: 100000 # Repository file count limit
  recommended_repo_size_gb: 300 # public-repo soft cap; contact HF if larger
  viewer_row_size_mb: 2 # approximate per-row viewer limit

Human-readable summary:

  • Free: 100GB private datasets
  • Pro (for individuals) | Team or Enterprise (for organizations): 1TB+ private storage per seat (see pricing)
  • Public: 300GB (contact datasets@huggingface.co for larger)
  • Per file: 50GB max, 20GB recommended
  • Per folder: <10k files

See https://huggingface.co/docs/hub/storage-limits#repository-limitations-and-recommendations for the current recommendations on repository sizes and file counts.

Quick Reference by Data Type

| Your Data | Recommended Approach | Quick Command |
| --- | --- | --- |
| CSV/JSON files | Use built-in loaders (handles any size via memory mapping) | load_dataset("csv", data_files="data.csv").push_to_hub("username/dataset") |
| Images in folders | Use imagefolder for automatic class detection | load_dataset("imagefolder", data_dir="./images").push_to_hub("username/dataset") |
| Audio files | Use audiofolder for automatic organization | load_dataset("audiofolder", data_dir="./audio").push_to_hub("username/dataset") |
| Video files | Use videofolder for automatic organization | load_dataset("videofolder", data_dir="./videos").push_to_hub("username/dataset") |
| PDF documents | Use pdffolder for text extraction | load_dataset("pdffolder", data_dir="./pdfs").push_to_hub("username/dataset") |
| Very large datasets (100GB+) | Use max_shard_size to control memory usage | dataset.push_to_hub("username/dataset", max_shard_size="5GB") |
| Many files / directories (>10k) | Use upload_large_folder to avoid Git limitations | api.upload_large_folder(folder_path="./data", repo_id="username/dataset", repo_type="dataset") |
| Streaming large media | WebDataset format for efficient streaming | Create .tar shards, then upload_large_folder() |
| Scientific data (HDF5, NetCDF) | Convert to Parquet with Array features | See Scientific Data section |
| Custom/proprietary formats | Document thoroughly if conversion impossible | upload_large_folder() with comprehensive README |

Upload Workflow

  1. Gather dataset information (if needed):

    • What type of data? (images, text, audio, CSV, etc.)
    • How is it organized? (folder structure, single file, multiple files)
    • What’s the approximate size?
    • What format are the files in?
    • Any special requirements? (e.g., streaming, private access)
    • Check for existing README or documentation files that describe the dataset
  2. Authenticate:

    • CLI: huggingface-cli login
    • Or use token: HfApi(token="hf_...") or set HF_TOKEN environment variable
  3. Identify your data type: Check the Quick Reference table above

  4. Choose upload method:

    • Small files (<1GB) with hub-compatible format: Can use Hub UI for quick uploads
    • Built-in loader available: Use the loader + push_to_hub() (see Quick Reference table)
    • Large datasets or many files: Use upload_large_folder() for datasets over roughly 100GB or with more than 10k files
    • Custom formats: Convert to hub-compatible format if possible, otherwise document thoroughly
  5. Test locally (if using built-in loader):

    # Validate your dataset loads correctly before uploading
    dataset = load_dataset("loader_name", data_dir="./your_data")
    print(dataset)
  6. Upload to Hub:

    # Basic upload
    dataset.push_to_hub("username/dataset-name")
    
    # With options for large datasets
    dataset.push_to_hub(
        "username/dataset-name",
        max_shard_size="5GB",  # Control memory usage
        private=True  # For private datasets
    )
  7. Verify your upload:

    • Check Dataset Viewer: https://huggingface.co/datasets/username/dataset-name
    • Test loading: load_dataset("username/dataset-name")
    • If viewer shows errors, check the Troubleshooting section

Common Conversion Patterns

When built-in loaders don’t match your data structure, use the datasets library as a compatibility layer. Convert your data to a Dataset object, then use push_to_hub() for maximum flexibility and Dataset Viewer compatibility.

From DataFrames

If you already have your data working in pandas, polars, or other dataframe libraries, you can convert directly:

# From pandas DataFrame
import pandas as pd
from datasets import Dataset

df = pd.read_csv("your_data.csv")
dataset = Dataset.from_pandas(df)
dataset.push_to_hub("username/dataset-name")

# From polars DataFrame (direct method)
import polars as pl
from datasets import Dataset

df = pl.read_csv("your_data.csv")
dataset = Dataset.from_polars(df)  # Direct conversion
dataset.push_to_hub("username/dataset-name")

# From PyArrow Table (useful for scientific data)
import pyarrow as pa
from datasets import Dataset

# If you have a PyArrow table
table = pa.table({'data': [1, 2, 3], 'labels': ['a', 'b', 'c']})
dataset = Dataset(table)
dataset.push_to_hub("username/dataset-name")

# For Spark/Dask dataframes, see https://huggingface.co/docs/hub/datasets-libraries

Custom Format Conversion

When built-in loaders don’t match your data format, convert to Dataset objects following these principles:

Design Principles

1. Prefer wide/flat structures over joins

  • Denormalize relational data into single rows for better usability
  • Include all relevant information in each example
  • Lean towards bigger but more usable data - Hugging Face’s infrastructure uses advanced deduplication (XetHub) and Parquet optimizations to handle redundancy efficiently

2. Use configs for logical dataset variations

  • Beyond train/test/val splits, use configs for different subsets or views of your data
  • Each config can have different features or data organization
  • Example: language-specific configs, task-specific views, or data modalities
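
As an example of the first principle, here is a minimal sketch of denormalizing relational sources with pandas before pushing (the file and column names are illustrative, not part of any required schema):

import pandas as pd
from datasets import Dataset

# Hypothetical relational sources: samples in one table, labels in another
samples = pd.read_csv("samples.csv")   # columns: sample_id, text
labels = pd.read_csv("labels.csv")     # columns: sample_id, label

# Denormalize into one wide table so each example is self-contained
wide = samples.merge(labels, on="sample_id", how="left")

dataset = Dataset.from_pandas(wide)
dataset.push_to_hub("username/denormalized-dataset")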

Conversion Methods

Small datasets (that fit in memory) - use Dataset.from_dict():

# Parse your custom format into a dictionary
data_dict = {
    "text": ["example1", "example2"],
    "label": ["positive", "negative"],
    "score": [0.9, 0.2]
}

# Create dataset with appropriate features
from datasets import Dataset, Features, Value, ClassLabel
features = Features({
    'text': Value('string'),
    'label': ClassLabel(names=['negative', 'positive']),
    'score': Value('float32')
})

dataset = Dataset.from_dict(data_dict, features=features)
dataset.push_to_hub("username/dataset")

Large datasets (memory-efficient) - use Dataset.from_generator():

def data_generator():
    # Parse your custom format progressively
    for item in parse_large_file("data.custom"):
        yield {
            "text": item["content"],
            "label": item["category"],
            "embedding": item["vector"]
        }

# Specify features for Dataset Viewer compatibility
from datasets import Dataset, Features, Value, ClassLabel, List
features = Features({
    'text': Value('string'),
    'label': ClassLabel(names=['cat1', 'cat2', 'cat3']),
    'embedding': List(feature=Value('float32'), length=768)
})

dataset = Dataset.from_generator(data_generator, features=features)
dataset.push_to_hub("username/dataset", max_shard_size="1GB")

Tip: For large datasets, test with a subset first by adding a limit to your generator or using .select(range(100)) after creation.
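
For example, continuing the generator example above, a quick sanity check on a small slice before the full upload might look like this (the preview repo name is illustrative):

from datasets import Dataset

# Reusing data_generator and features from the example above
dataset = Dataset.from_generator(data_generator, features=features)

sample = dataset.select(range(100))   # first 100 examples
print(sample[0])                      # spot-check one example
sample.push_to_hub("username/dataset-preview", private=True)  # optional preview upload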

Using Configs for Dataset Variations

# Push different configurations of your dataset
dataset_en = Dataset.from_dict(english_data, features=features)
dataset_en.push_to_hub("username/multilingual-dataset", config_name="english")

dataset_fr = Dataset.from_dict(french_data, features=features)
dataset_fr.push_to_hub("username/multilingual-dataset", config_name="french")

# Users can then load specific configs
dataset = load_dataset("username/multilingual-dataset", "english")

Multi-modal Examples

Text + Audio (speech recognition):

from pathlib import Path
from datasets import Dataset, Features, Value, Audio

def speech_generator():
    for audio_file in Path("audio/").glob("*.wav"):
        transcript_file = audio_file.with_suffix(".txt")
        yield {
            "audio": str(audio_file),
            "text": transcript_file.read_text().strip(),
            "speaker_id": audio_file.stem.split("_")[0]
        }

features = Features({
    'audio': Audio(sampling_rate=16000),
    'text': Value('string'),
    'speaker_id': Value('string')
})

dataset = Dataset.from_generator(speech_generator, features=features)
dataset.push_to_hub("username/speech-dataset")

Multiple images per example:

# Before/after images, medical imaging, etc.
from datasets import Dataset, Features, Image, ClassLabel

data = {
    "image_before": ["img1_before.jpg", "img2_before.jpg"],
    "image_after": ["img1_after.jpg", "img2_after.jpg"],
    "treatment": ["method_A", "method_B"]
}

features = Features({
    'image_before': Image(),
    'image_after': Image(),
    'treatment': ClassLabel(names=['method_A', 'method_B'])
})

dataset = Dataset.from_dict(data, features=features)
dataset.push_to_hub("username/before-after-images")

Note: For text + images, consider using ImageFolder with metadata.csv which handles this automatically.
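
For reference, a minimal sketch of that layout and how it loads (file names and the caption column are illustrative); ImageFolder expects the metadata file to have a file_name column with paths relative to metadata.csv:

# images/
# ├── metadata.csv        # columns: file_name,caption
# ├── img_0001.jpg
# └── img_0002.jpg
#
# metadata.csv contents:
# file_name,caption
# img_0001.jpg,"A cat on a sofa"
# img_0002.jpg,"A dog in the park"

from datasets import load_dataset

dataset = load_dataset("imagefolder", data_dir="./images")
dataset.push_to_hub("username/captioned-images")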

Essential Features

Features define the schema and data types for your dataset columns. Specifying correct features ensures:

  • Proper data handling and type conversion
  • Dataset Viewer functionality (e.g., image/audio previews)
  • Efficient storage and loading
  • Clear documentation of your data structure

For complete feature documentation, see: Dataset Features

Feature Types Overview

Basic Types:

  • Value: Scalar values - string, int64, float32, bool, binary, and other numeric types
  • ClassLabel: Categorical data with named classes
  • Sequence: Lists of any feature type
  • LargeList: For very large lists

Media Types (enable Dataset Viewer previews):

  • Image(): Handles various image formats, returns PIL Image objects
  • Audio(sampling_rate=16000): Audio with array data and optional sampling rate
  • Video(): Video files
  • Pdf(): PDF documents with text extraction

Array Types (for tensors/scientific data):

  • Array2D, Array3D, Array4D, Array5D: Fixed or variable-length arrays
  • Example: Array2D(shape=(224, 224), dtype='float32')
  • First dimension can be None for variable length

Translation Types:

  • Translation: For translation pairs with fixed languages
  • TranslationVariableLanguages: For translations with varying language pairs

Note: New feature types are added regularly. Check the documentation for the latest additions.
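
As a rough illustration of how these types combine into a single schema (the column names, class names, and shapes below are made up for the example):

from datasets import Features, Value, ClassLabel, Image, Array2D, Translation

features = Features({
    "id": Value("string"),
    "label": ClassLabel(names=["negative", "positive"]),
    "image": Image(),
    "spectrogram": Array2D(shape=(None, 128), dtype="float32"),  # variable-length first dimension
    "translation": Translation(languages=["en", "fr"]),
})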

Upload Methods

Dataset objects (use push_to_hub): Use when you’ve loaded/converted data using the datasets library

dataset.push_to_hub("username/dataset", max_shard_size="5GB")

Pre-existing files (use upload_large_folder): Use when you have hub-compatible files (e.g., Parquet files) already prepared and organized

from huggingface_hub import HfApi
api = HfApi()
api.upload_large_folder(folder_path="./data", repo_id="username/dataset", repo_type="dataset", num_workers=16)

Important: Before using upload_large_folder, verify the files meet repository limits:

  • Check folder structure if you have file access: ensure no folder contains >10k files
  • Ask the user to confirm: “Are your files in a hub-compatible format (Parquet/CSV/JSON) and organized appropriately?”
  • For non-standard formats, consider converting to Dataset objects first to ensure compatibility
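
If you do have file access, a quick way to flag folders over the 10k-file threshold before uploading is a short script like the following (standard library only; the ./data path is illustrative):

import os

# Count files per directory and flag any folder over the 10k threshold
for root, dirs, files in os.walk("./data"):
    if len(files) > 10_000:
        print(f"{root}: {len(files)} files - split this folder before uploading")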

Validation

Consider small reformatting: If data is close to a built-in loader format, suggest minor changes:

  • Rename columns (e.g., ‘filename’ → ‘file_name’ for ImageFolder)
  • Reorganize folders (e.g., move images into class subfolders)
  • Rename files to match expected patterns (e.g., ‘data.csv’ → ‘train.csv’)
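
As an example of the first fix, a minimal pandas sketch that renames a 'filename' column to the 'file_name' ImageFolder expects (the metadata path is illustrative):

import pandas as pd

# ImageFolder requires the column to be named exactly 'file_name'
meta = pd.read_csv("images/metadata.csv")
meta = meta.rename(columns={"filename": "file_name"})
meta.to_csv("images/metadata.csv", index=False)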

Pre-upload:

  • Test locally: load_dataset("imagefolder", data_dir="./data")

  • Verify features work correctly:

    # Test first example
    print(dataset[0])
    
    # For images: verify they load
    if 'image' in dataset.features:
        dataset[0]['image']  # Should return PIL Image
    
    # Check dataset size before upload
    print(f"Size: {len(dataset)} examples")
  • Check metadata.csv has ‘file_name’ column

  • Verify relative paths, no leading slashes

  • Ensure no folder >10k files

Post-upload:

  • Check viewer: https://huggingface.co/datasets/username/dataset
  • Test loading: load_dataset("username/dataset")
  • Verify features preserved: print(dataset.features)

Common Issues → Solutions

| Issue | Solution |
| --- | --- |
| “Repository not found” | Run huggingface-cli login |
| Memory errors | Use max_shard_size="500MB" |
| Dataset viewer not working | Wait 5-10 min, check README.md config |
| Timeout errors | Use multi_commits=True |
| Files >50GB | Split into smaller files |
| “File not found” | Use relative paths in metadata |

Dataset Viewer Configuration

Note: This section is primarily for datasets uploaded directly to the Hub (via UI or upload_large_folder). Datasets uploaded with push_to_hub() typically configure the viewer automatically.

When automatic detection works

The Dataset Viewer automatically detects standard structures:

  • Files named: train.csv, test.json, validation.parquet
  • Directories named: train/, test/, validation/
  • Split names with delimiters: test-data.csv ✓ (not testdata.csv ✗)

Manual configuration

For custom structures, add YAML to your README.md:

---
configs:
  - config_name: default # Required even for single config!
    data_files:
      - split: train
        path: "data/train/*.parquet"
      - split: test
        path: "data/test/*.parquet"
---

Multiple configurations example:

---
configs:
  - config_name: english
    data_files: "en/*.parquet"
  - config_name: french
    data_files: "fr/*.parquet"
---

Common viewer issues

  • No viewer after upload: Wait 5-10 minutes for processing
  • “Config names error”: Add config_name field (required!)
  • Files not detected: Check naming patterns (needs delimiters)
  • Viewer disabled: Remove viewer: false from README YAML

Quick Templates

# ImageFolder with metadata
dataset = load_dataset("imagefolder", data_dir="./images")
dataset.push_to_hub("username/dataset")

# Memory-efficient upload
dataset.push_to_hub("username/dataset", max_shard_size="500MB")

# Multiple CSV files
dataset = load_dataset('csv', data_files={'train': 'train.csv', 'test': 'test.csv'})
dataset.push_to_hub("username/dataset")

Documentation

Core docs: Adding datasets | Dataset viewer | Storage limits | Upload guide

Dataset Cards

Remind users to add a dataset card (README.md) with:

  • Dataset description and usage
  • License information
  • Citation details

See Dataset Cards guide for details.
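
A minimal, illustrative sketch of creating and pushing a card programmatically with huggingface_hub (the card content and repo name are placeholders):

from huggingface_hub import DatasetCard

card = DatasetCard("""---
license: mit
---

# My Dataset

Short description of the dataset, how it was collected, and how to load it:

    from datasets import load_dataset
    dataset = load_dataset("username/dataset-name")

## Citation

Add citation details here.
""")
card.push_to_hub("username/dataset-name")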


Appendix: Special Cases

WebDataset Structure

For streaming large media datasets:

  • Create 1-5GB tar shards
  • Consistent internal structure
  • Upload with upload_large_folder
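
A minimal standard-library sketch of packing paired image/caption files into uniformly named tar shards (directory names and the per-shard sample count are illustrative; the webdataset library's ShardWriter offers a higher-level alternative):

import tarfile
from pathlib import Path

# Pack paired image/caption files into fixed-count tar shards with a
# consistent internal naming scheme (e.g. 000123.jpg + 000123.txt per sample)
Path("shards").mkdir(exist_ok=True)
samples = sorted(Path("raw").glob("*.jpg"))
samples_per_shard = 1000  # tune so each shard lands in the 1-5GB range

for start in range(0, len(samples), samples_per_shard):
    shard_path = f"shards/data-{start // samples_per_shard:06d}.tar"
    with tarfile.open(shard_path, "w") as tar:
        for img in samples[start:start + samples_per_shard]:
            tar.add(img, arcname=img.name)  # image
            tar.add(img.with_suffix(".txt"), arcname=img.with_suffix(".txt").name)  # matching caption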

Scientific Data

  • HDF5/NetCDF → Convert to Parquet with Array features
  • Time series → Array2D(shape=(None, n))
  • Complex metadata → Store as JSON strings
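
As a rough sketch of the HDF5 route, assuming h5py is installed and the file contains a (num_samples, timesteps, 16) float dataset named "measurements" (names and shapes are illustrative; adjust to your file):

import h5py
from datasets import Dataset, Features, Value, Array2D

def hdf5_generator():
    # Stream samples out of the HDF5 file one at a time to keep memory low
    with h5py.File("experiment.h5", "r") as f:
        signals = f["measurements"]  # assumed shape: (num_samples, timesteps, 16)
        for i in range(signals.shape[0]):
            yield {
                "sample_id": str(i),
                "signal": signals[i].astype("float32"),  # one (timesteps, 16) array per example
            }

features = Features({
    "sample_id": Value("string"),
    "signal": Array2D(shape=(None, 16), dtype="float32"),  # variable-length time dimension
})

dataset = Dataset.from_generator(hdf5_generator, features=features)
dataset.push_to_hub("username/hdf5-converted", max_shard_size="1GB")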

Community Resources

For very specialized or bespoke formats:

  • Search the Hub for similar datasets: https://huggingface.co/datasets
  • Ask for advice on the Hugging Face Forums
  • Join the Hugging Face Discord for real-time help
  • Many domain-specific formats already have examples on the Hub