Hugging Face Dataset Upload Decision Guide
Decision guide for uploading datasets to the Hugging Face Hub. Optimized for Dataset Viewer compatibility and integration with the Hugging Face ecosystem.
Overview
Your goal is to help a user upload a dataset to the Hugging Face Hub. Ideally, the dataset should be compatible with the Dataset Viewer (and thus the load_dataset
function) to ensure easy access and usability. You should aim to meet the following criteria:
Criteria | Description | Priority |
---|---|---|
Respect repository limits | Ensure the dataset adheres to Hugging Face’s storage limits for file sizes, repository sizes, and file counts. See the Critical Constraints section below for specific limits. | Required |
Use hub-compatible formats | Use Parquet format when possible (best compression, rich typing, large dataset support). For smaller datasets (<several GB), JSON/JSONL or CSV are acceptable. Raw files work well for images/audio in smaller datasets while respecting repo limits. Use WebDataset (.tar) for large media collections. Domain-specific formats can be used when conversion is impractical. | Desired |
Dataset Viewer compatibility | Structure data to work with the automatic Dataset Viewer, enabling preview and easy exploration. This typically means using supported formats and proper file organization. Validation steps are provided later in this guide. | Desired |
Organize data sensibly | Use logical folder structures that match Hub conventions (e.g., train/test splits). Configs can be used to define different configurations of the dataset. This facilitates both human understanding and automatic data loading. | Desired |
Use appropriate Features | When using the datasets library, specify correct feature types (e.g., Image(), Audio(), ClassLabel()) to ensure proper data handling and viewer functionality. This enables type-specific optimizations and previews. | Required (when using datasets library) |
Document non-standard datasets | If conversion to hub-compatible formats is impossible and custom formats must be used, ensure repository limits are strictly followed and provide clear documentation on how to download and load the dataset. Include usage examples and any special requirements. | Required (when datasets library isn’t compatible) |
Working Without File Access
When you don’t have direct access to the user’s files (e.g., web interface), ask the user to run these commands to understand their dataset:
Dataset structure:
# Show directory tree (install with: apt install tree or brew install tree)
tree -L 3 --filelimit 20
# Alternative without tree:
find . -type f \( -name "*.csv" -o -name "*.json" -o -name "*.parquet" \) | head -20
Check file sizes:
# Total dataset size
du -sh .
# Individual file sizes
ls -lh data/
Peek at data format:
# First few lines of CSV/JSON
head -n 5 data/train.csv
# Check image folder structure
ls -la images/ | head -10
Quick file count:
# Count files by type
find . -name "*.jpg" | wc -l
Critical Constraints
Storage Limits:
# Machine-readable Hub limits
hub_limits:
  max_file_size_gb: 50            # absolute hard stop enforced by LFS
  recommended_file_size_gb: 20    # best-practice shard size
  max_files_per_folder: 10000     # Git performance threshold
  max_files_per_repo: 100000      # Repository file count limit
  recommended_repo_size_gb: 300   # public-repo soft cap; contact HF if larger
  viewer_row_size_mb: 2           # approximate per-row viewer limit
Human-readable summary:
- Free: 100GB private datasets
- Pro (for individuals) | Team or Enterprise (for organizations): 1TB+ private storage per seat (see pricing)
- Public: 300GB (contact datasets@huggingface.co for larger)
- Per file: 50GB max, 20GB recommended
- Per folder: <10k files
See https://huggingface.co/docs/hub/storage-limits#repository-limitations-and-recommendations for current recommendations on repository sizes and file counts.
Quick Reference by Data Type
Your Data | Recommended Approach | Quick Command |
---|---|---|
CSV/JSON files | Use built-in loaders (handles any size via memory mapping) | load_dataset("csv", data_files="data.csv").push_to_hub("username/dataset") |
Images in folders | Use imagefolder for automatic class detection | load_dataset("imagefolder", data_dir="./images").push_to_hub("username/dataset") |
Audio files | Use audiofolder for automatic organization | load_dataset("audiofolder", data_dir="./audio").push_to_hub("username/dataset") |
Video files | Use videofolder for automatic organization | load_dataset("videofolder", data_dir="./videos").push_to_hub("username/dataset") |
PDF documents | Use pdffolder for text extraction | load_dataset("pdffolder", data_dir="./pdfs").push_to_hub("username/dataset") |
Very large datasets (100GB+) | Use max_shard_size to control memory usage | dataset.push_to_hub("username/dataset", max_shard_size="5GB") |
Many files / directories (>10k) | Use upload_large_folder to avoid Git limitations | api.upload_large_folder(folder_path="./data", repo_id="username/dataset", repo_type="dataset") |
Streaming large media | WebDataset format for efficient streaming | Create .tar shards, then upload_large_folder() |
Scientific data (HDF5, NetCDF) | Convert to Parquet with Array features | See Scientific Data section |
Custom/proprietary formats | Document thoroughly if conversion impossible | upload_large_folder() with comprehensive README |
Upload Workflow
✓ Gather dataset information (if needed):
- What type of data? (images, text, audio, CSV, etc.)
- How is it organized? (folder structure, single file, multiple files)
- What’s the approximate size?
- What format are the files in?
- Any special requirements? (e.g., streaming, private access)
- Check for existing README or documentation files that describe the dataset
✓ Authenticate:
- CLI: huggingface-cli login
- Or use a token: HfApi(token="hf_...") or set the HF_TOKEN environment variable
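If you prefer to authenticate from Python rather than the CLI, a minimal sketch (the token value is a placeholder):

```python
from huggingface_hub import login

# Equivalent to `huggingface-cli login`; alternatively rely on the HF_TOKEN env var
login(token="hf_...")
```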
✓ Identify your data type: Check the Quick Reference table above
✓ Choose upload method:
- Small files (<1GB) in a hub-compatible format: can use the Hub UI for quick uploads
- Built-in loader available: use the loader + push_to_hub() (see Quick Reference table)
- Large datasets or many files: use upload_large_folder() for datasets >100GB or >10k files
- Custom formats: convert to a hub-compatible format if possible, otherwise document thoroughly
✓ Test locally (if using built-in loader):
# Validate your dataset loads correctly before uploading
dataset = load_dataset("loader_name", data_dir="./your_data")
print(dataset)
✓ Upload to Hub:
# Basic upload
dataset.push_to_hub("username/dataset-name")

# With options for large datasets
dataset.push_to_hub(
    "username/dataset-name",
    max_shard_size="5GB",  # Control memory usage
    private=True           # For private datasets
)
✓ Verify your upload:
- Check the Dataset Viewer: https://huggingface.co/datasets/username/dataset-name
- Test loading: load_dataset("username/dataset-name")
- If the viewer shows errors, check the Common Issues and Dataset Viewer Configuration sections below
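For a quick programmatic check after the upload, a small sketch (the repo id is a placeholder and a train split is assumed):

```python
from datasets import load_dataset
from huggingface_hub import HfApi

# Confirm the repository exists and is visible to your token
info = HfApi().dataset_info("username/dataset-name")
print(info.id)

# Round-trip the data and confirm the schema survived
dataset = load_dataset("username/dataset-name")
print(dataset)                    # splits and row counts
print(dataset["train"].features)  # assumes a "train" split
```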
Common Conversion Patterns
When built-in loaders don’t match your data structure, use the datasets library as a compatibility layer. Convert your data to a Dataset object, then use push_to_hub()
for maximum flexibility and Dataset Viewer compatibility.
From DataFrames
If you already have your data working in pandas, polars, or other dataframe libraries, you can convert directly:
# From pandas DataFrame
import pandas as pd
from datasets import Dataset
df = pd.read_csv("your_data.csv")
dataset = Dataset.from_pandas(df)
dataset.push_to_hub("username/dataset-name")
# From polars DataFrame (direct method)
import polars as pl
from datasets import Dataset
df = pl.read_csv("your_data.csv")
dataset = Dataset.from_polars(df) # Direct conversion
dataset.push_to_hub("username/dataset-name")
# From PyArrow Table (useful for scientific data)
import pyarrow as pa
from datasets import Dataset
# If you have a PyArrow table
table = pa.table({'data': [1, 2, 3], 'labels': ['a', 'b', 'c']})
dataset = Dataset(table)
dataset.push_to_hub("username/dataset-name")
# For Spark/Dask dataframes, see https://huggingface.co/docs/hub/datasets-libraries
Custom Format Conversion
When built-in loaders don’t match your data format, convert to Dataset objects following these principles:
Design Principles
1. Prefer wide/flat structures over joins
- Denormalize relational data into single rows for better usability
- Include all relevant information in each example
- Lean towards bigger but more usable data - Hugging Face’s infrastructure uses advanced deduplication (XetHub) and Parquet optimizations to handle redundancy efficiently (see the sketch after this list)
2. Use configs for logical dataset variations
- Beyond train/test/val splits, use configs for different subsets or views of your data
- Each config can have different features or data organization
- Example: language-specific configs, task-specific views, or data modalities
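For example, flattening relational data with pandas before upload; a minimal sketch assuming hypothetical users.csv and posts.csv tables:

```python
import pandas as pd
from datasets import Dataset

# Hypothetical normalized tables
users = pd.read_csv("users.csv")   # columns: user_id, country
posts = pd.read_csv("posts.csv")   # columns: post_id, user_id, text, label

# Denormalize: one wide row per post, carrying the user metadata along
flat = posts.merge(users, on="user_id", how="left")

Dataset.from_pandas(flat).push_to_hub("username/flat-dataset")
```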
Conversion Methods
Small datasets (fit in memory) - use Dataset.from_dict():
# Parse your custom format into a dictionary
data_dict = {
"text": ["example1", "example2"],
"label": ["positive", "negative"],
"score": [0.9, 0.2]
}
# Create dataset with appropriate features
from datasets import Dataset, Features, Value, ClassLabel
features = Features({
'text': Value('string'),
'label': ClassLabel(names=['negative', 'positive']),
'score': Value('float32')
})
dataset = Dataset.from_dict(data_dict, features=features)
dataset.push_to_hub("username/dataset")
Large datasets (memory-efficient) - use Dataset.from_generator():
def data_generator():
# Parse your custom format progressively
for item in parse_large_file("data.custom"):
yield {
"text": item["content"],
"label": item["category"],
"embedding": item["vector"]
}
# Specify features for Dataset Viewer compatibility
from datasets import Features, Value, ClassLabel, List
features = Features({
'text': Value('string'),
'label': ClassLabel(names=['cat1', 'cat2', 'cat3']),
'embedding': List(feature=Value('float32'), length=768)
})
dataset = Dataset.from_generator(data_generator, features=features)
dataset.push_to_hub("username/dataset", max_shard_size="1GB")
Tip: For large datasets, test with a subset first by adding a limit to your generator or using .select(range(100)) after creation.
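As a concrete version of that tip, one way to sanity-check a small sample before the full push (reusing the data_generator and features from the sketch above):

```python
import itertools

from datasets import Dataset

def limited_generator():
    # Only the first 100 items from the full generator
    yield from itertools.islice(data_generator(), 100)

sample = Dataset.from_generator(limited_generator, features=features)
print(sample[0])         # inspect one example
print(sample.features)   # confirm the schema looks right
```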
Using Configs for Dataset Variations
# Push different configurations of your dataset
dataset_en = Dataset.from_dict(english_data, features=features)
dataset_en.push_to_hub("username/multilingual-dataset", config_name="english")
dataset_fr = Dataset.from_dict(french_data, features=features)
dataset_fr.push_to_hub("username/multilingual-dataset", config_name="french")
# Users can then load specific configs
dataset = load_dataset("username/multilingual-dataset", "english")
Multi-modal Examples
Text + Audio (speech recognition):
from pathlib import Path

from datasets import Audio, Dataset, Features, Value

def speech_generator():
for audio_file in Path("audio/").glob("*.wav"):
transcript_file = audio_file.with_suffix(".txt")
yield {
"audio": str(audio_file),
"text": transcript_file.read_text().strip(),
"speaker_id": audio_file.stem.split("_")[0]
}
features = Features({
'audio': Audio(sampling_rate=16000),
'text': Value('string'),
'speaker_id': Value('string')
})
dataset = Dataset.from_generator(speech_generator, features=features)
dataset.push_to_hub("username/speech-dataset")
Multiple images per example:
from datasets import ClassLabel, Dataset, Features, Image

# Before/after images, medical imaging, etc.
data = {
"image_before": ["img1_before.jpg", "img2_before.jpg"],
"image_after": ["img1_after.jpg", "img2_after.jpg"],
"treatment": ["method_A", "method_B"]
}
features = Features({
'image_before': Image(),
'image_after': Image(),
'treatment': ClassLabel(names=['method_A', 'method_B'])
})
dataset = Dataset.from_dict(data, features=features)
dataset.push_to_hub("username/before-after-images")
Note: For text + images, consider using ImageFolder with metadata.csv which handles this automatically.
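For reference, a minimal ImageFolder-with-captions layout; the file names and the text column are illustrative:

```
my_dataset/
├── metadata.csv
├── img_0001.jpg
└── img_0002.jpg
```

metadata.csv (the file_name column is required; the other columns become dataset features):

```
file_name,text
img_0001.jpg,"A caption for the first image"
img_0002.jpg,"A caption for the second image"
```

Then load and push as usual: load_dataset("imagefolder", data_dir="my_dataset").push_to_hub("username/captioned-images").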
Essential Features
Features define the schema and data types for your dataset columns. Specifying correct features ensures:
- Proper data handling and type conversion
- Dataset Viewer functionality (e.g., image/audio previews)
- Efficient storage and loading
- Clear documentation of your data structure
For complete feature documentation, see: Dataset Features
Feature Types Overview
Basic Types:
- Value: Scalar values - string, int64, float32, bool, binary, and other numeric types
- ClassLabel: Categorical data with named classes
- Sequence: Lists of any feature type
- LargeList: For very large lists
Media Types (enable Dataset Viewer previews):
- Image(): Handles various image formats, returns PIL Image objects
- Audio(sampling_rate=16000): Audio with array data and optional sampling rate
- Video(): Video files
- Pdf(): PDF documents with text extraction
Array Types (for tensors/scientific data):
- Array2D, Array3D, Array4D, Array5D: Fixed or variable-length arrays
- Example: Array2D(shape=(224, 224), dtype='float32')
- First dimension can be None for variable length
Translation Types:
- Translation: For translation pairs with fixed languages
- TranslationVariableLanguages: For translations with varying language pairs
Note: New feature types are added regularly. Check the documentation for the latest additions.
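As an illustration, a schema combining several of these types (the column names are invented for the example):

```python
from datasets import Features, Value, ClassLabel, Sequence, Array2D, Translation

features = Features({
    'text': Value('string'),
    'label': ClassLabel(names=['negative', 'positive']),
    'tokens': Sequence(Value('string')),
    'spectrogram': Array2D(shape=(None, 128), dtype='float32'),  # variable-length first dimension
    'translation': Translation(languages=['en', 'fr']),
})
```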
Upload Methods
Dataset objects (use push_to_hub): Use when you’ve loaded/converted data using the datasets library
dataset.push_to_hub("username/dataset", max_shard_size="5GB")
Pre-existing files (use upload_large_folder): Use when you have hub-compatible files (e.g., Parquet files) already prepared and organized
from huggingface_hub import HfApi
api = HfApi()
api.upload_large_folder(folder_path="./data", repo_id="username/dataset", repo_type="dataset", num_workers=16)
Important: Before using upload_large_folder, verify the files meet repository limits:
- Check folder structure if you have file access: ensure no folder contains >10k files
- Ask the user to confirm: “Are your files in a hub-compatible format (Parquet/CSV/JSON) and organized appropriately?”
- For non-standard formats, consider converting to Dataset objects first to ensure compatibility
Validation
Consider small reformatting: If data is close to a built-in loader format, suggest minor changes:
- Rename columns (e.g., ‘filename’ → ‘file_name’ for ImageFolder)
- Reorganize folders (e.g., move images into class subfolders; see the sketch after this list)
- Rename files to match expected patterns (e.g., ‘data.csv’ → ‘train.csv’)
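A small sketch of the folder reorganization suggested above, assuming a hypothetical labels.csv with file_name and label columns:

```python
import shutil
from pathlib import Path

import pandas as pd

labels = pd.read_csv("labels.csv")  # hypothetical: file_name,label

for row in labels.itertuples():
    # ImageFolder infers the class from the subfolder name
    target_dir = Path("images/train") / row.label
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(Path("images") / row.file_name), str(target_dir / row.file_name))
```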
Pre-upload:
Test locally:
load_dataset("imagefolder", data_dir="./data")
Verify features work correctly:
# Test first example
print(dataset[0])

# For images: verify they load
if 'image' in dataset.features:
    dataset[0]['image']  # Should return PIL Image

# Check dataset size before upload
print(f"Size: {len(dataset)} examples")
Check metadata.csv has ‘file_name’ column
Verify relative paths, no leading slashes
Ensure no folder >10k files
Post-upload:
- Check viewer: https://huggingface.co/datasets/username/dataset
- Test loading: load_dataset("username/dataset")
- Verify features preserved: print(dataset.features)
Common Issues → Solutions
Issue | Solution |
---|---|
“Repository not found” | Run huggingface-cli login |
Memory errors | Use max_shard_size="500MB" |
Dataset viewer not working | Wait 5-10min, check README.md config |
Timeout errors | Use multi_commits=True |
Files >50GB | Split into smaller files |
“File not found” | Use relative paths in metadata |
Dataset Viewer Configuration
Note: This section is primarily for datasets uploaded directly to the Hub (via UI or upload_large_folder). Datasets uploaded with push_to_hub() typically configure the viewer automatically.
When automatic detection works
The Dataset Viewer automatically detects standard structures:
- Files named: train.csv, test.json, validation.parquet
- Directories named: train/, test/, validation/
- Split names with delimiters: test-data.csv ✓ (not testdata.csv ✗)
Manual configuration
For custom structures, add YAML to your README.md:
---
configs:
- config_name: default  # Required even for single config!
  data_files:
  - split: train
    path: "data/train/*.parquet"
  - split: test
    path: "data/test/*.parquet"
---
Multiple configurations example:
---
configs:
- config_name: english
  data_files: "en/*.parquet"
- config_name: french
  data_files: "fr/*.parquet"
---
Common viewer issues
- No viewer after upload: Wait 5-10 minutes for processing
- “Config names error”: Add the config_name field (required!)
- Files not detected: Check naming patterns (needs delimiters)
- Viewer disabled: Remove viewer: false from the README YAML
Quick Templates
# ImageFolder with metadata
dataset = load_dataset("imagefolder", data_dir="./images")
dataset.push_to_hub("username/dataset")
# Memory-efficient upload
dataset.push_to_hub("username/dataset", max_shard_size="500MB")
# Multiple CSV files
dataset = load_dataset('csv', data_files={'train': 'train.csv', 'test': 'test.csv'})
dataset.push_to_hub("username/dataset")
Documentation
Core docs: Adding datasets | Dataset viewer | Storage limits | Upload guide
Dataset Cards
Remind users to add a dataset card (README.md) with:
- Dataset description and usage
- License information
- Citation details
See Dataset Cards guide for details.
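A minimal README.md front-matter sketch for a dataset card (all values are placeholders):

```yaml
---
license: mit
language:
- en
pretty_name: My Dataset
tags:
- text-classification
---
```

The body of the README.md should then describe the dataset, how to load it, and how to cite it.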
Appendix: Special Cases
WebDataset Structure
For streaming large media datasets:
- Create 1-5GB tar shards
- Consistent internal structure
- Upload with upload_large_folder
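A minimal sketch of writing WebDataset-style shards with Python's standard library; the source folder, extensions, and shard size are illustrative:

```python
import tarfile
from pathlib import Path

images = sorted(Path("images").glob("*.jpg"))  # hypothetical source folder
samples_per_shard = 1000  # tune so each shard lands around 1-5GB

for start in range(0, len(images), samples_per_shard):
    shard_path = f"shard-{start // samples_per_shard:05d}.tar"
    with tarfile.open(shard_path, "w") as tar:
        for img in images[start:start + samples_per_shard]:
            # WebDataset groups files by a shared key (the stem), so the image
            # and its caption use the same base name inside the tar
            tar.add(img, arcname=f"{img.stem}.jpg")
            caption = img.with_suffix(".txt")
            if caption.exists():
                tar.add(caption, arcname=f"{img.stem}.txt")
```

Upload the resulting .tar files with upload_large_folder as described above.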
Scientific Data
- HDF5/NetCDF → Convert to Parquet with Array features
- Time series → Array2D(shape=(None, n))
- Complex metadata → Store as JSON strings
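A hedged sketch of the HDF5-to-Hub path using Dataset.from_generator, assuming h5py is installed and using hypothetical dataset keys:

```python
import h5py
from datasets import Array2D, Dataset, Features, Value

def hdf5_generator():
    # "measurements" and "labels" are hypothetical keys in data.h5
    with h5py.File("data.h5", "r") as f:
        for i in range(len(f["labels"])):
            yield {
                "measurement": f["measurements"][i],  # e.g. a (128, 16) float array
                "label": str(f["labels"][i]),
            }

features = Features({
    "measurement": Array2D(shape=(128, 16), dtype="float32"),
    "label": Value("string"),
})

dataset = Dataset.from_generator(hdf5_generator, features=features)
dataset.push_to_hub("username/scientific-dataset", max_shard_size="1GB")
```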
Community Resources
For very specialized or bespoke formats:
- Search the Hub for similar datasets: https://huggingface.co/datasets
- Ask for advice on the Hugging Face Forums
- Join the Hugging Face Discord for real-time help
- Many domain-specific formats already have examples on the Hub