LUCID-CC0 v2: Large-Scale Curated CC0 Training Dataset for Single-Image Super-Resolution
A large-scale, high-quality training dataset for single-image super-resolution (SISR), filtered from nyuuzyou/pxhere using the LUCID filtering pipeline. All source images are CC0-licensed (public domain).
Overview
| Property | Value |
|---|---|
| Source | PxHere (CC0) via WebDataset tars |
| Filtering | LUCID pipeline (ICNet complexity + signal filter + deduplication) |
| Tile size | 256×256 pixels |
| Multiscale | Yes (1.0×, 0.75×, 0.5×, 0.25× scales) |
| Complexity threshold | ≥ 0.6 (LUCID auto-calibrated) |
| Deduplication | Cosine similarity < 0.96 |
| License | CC0-1.0 (public domain) |
| Total tiles | 1,590,938 |
| Disk size | 199 GB |
Intended Use
This dataset is designed for training SISR models from scratch. It is aimed mainly at data-hungry architectures such as large transformers, but it applies to other model families and training frameworks as well:
- HAT (Hybrid Attention Transformer)
- HAT-L (Large variant)
- SwinIR
- RealESRGAN / traiNNer-redux
- Diffusion-based super-resolution models
- Any new architecture that benefits from diverse, high-quality training data
Recommended Training Strategy
This dataset is part of a three-stage training pipeline:
| Stage | Dataset | Purpose |
|---|---|---|
| 1. Pretrain from scratch | This dataset (lucid-cc0-v2) | Learn general image representations from diverse CC0 photos |
| 2. Finetune | lucid-cc0-v2-hc (high-complexity, 256×256) | Refine on highest-quality, most detailed tiles |
| 3. Second finetune | lucid-cc0-v2-hc-512 (high-complexity, 512×512) | Push quality further with the largest patch size |
Dataset Structure
lucid-cc0-v2/
├── train/
│   ├── 000/              # ≤10,000 tiles per subdirectory
│   │   ├── 00000.png
│   │   ├── 00001.png
│   │   └── ...
│   ├── 001/
│   └── ...
├── LR/
│   ├── x2/               # Bicubic downscaled ×2 (MATLAB-compatible)
│   └── x4/               # Bicubic downscaled ×4 (MATLAB-compatible)
├── batch_manifest.json
├── lineage_batch_*.csv
└── DATASET_NOTES.md
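As a rough illustration of how the layout above might be consumed, the sketch below pairs each HR tile with its LR counterpart. It assumes that `LR/x2/` and `LR/x4/` mirror the `train/<shard>/<name>.png` structure, which the tree does not confirm; `lr_path_for` and `iter_pairs` are hypothetical helper names, not part of the dataset tooling.

```python
from pathlib import Path


def lr_path_for(hr_path: Path, root: Path, scale: int) -> Path:
    """Map train/<shard>/<name>.png to LR/x<scale>/<shard>/<name>.png.

    ASSUMPTION: the LR folders mirror the train/ layout. Check the
    downloaded data before relying on this mapping.
    """
    rel = hr_path.relative_to(root / "train")
    return root / "LR" / f"x{scale}" / rel


def iter_pairs(root, scale=4):
    """Yield (hr_path, lr_path) for every HR tile whose LR file exists."""
    root = Path(root)
    for hr in sorted((root / "train").glob("*/*.png")):
        lr = lr_path_for(hr, root, scale)
        if lr.exists():
            yield hr, lr
```

A training loader could then iterate `iter_pairs("lucid-cc0-v2", scale=4)` and open both files of each pair. HR tiles without a matching LR file are skipped rather than raising an error.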
Filtering Pipeline
Images were filtered using LUCID with the following stages:
- Signal filter: removes low-information images (blurry, overexposed, underexposed, low-contrast)
- ICNet complexity scoring: a neural network estimates perceptual complexity; tiles below the threshold are removed
- Multiscale tiling: images are tiled at multiple scales (1.0×, 0.75×, 0.5×, 0.25×) to capture both fine detail and global structure
- Deduplication: perceptual cosine-similarity deduplication (threshold 0.96) removes near-duplicate tiles
- Tile extraction: 256×256 PNG tiles are saved with ≤10,000 files per subdirectory
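The deduplication stage can be pictured with a minimal sketch like the one below. It is not the LUCID implementation: the embedding model is left unspecified, the greedy first-come-first-kept strategy is an assumption, and `greedy_dedup` is a hypothetical name. Only the 0.96 threshold comes from the description above.

```python
import numpy as np


def greedy_dedup(embeddings, threshold=0.96):
    """Return the indices of rows to keep.

    A row is dropped when its cosine similarity to any already-kept row
    is >= threshold. ASSUMPTION: rows are non-zero feature vectors from
    some perceptual embedding model (not specified by the dataset card).
    """
    x = np.asarray(embeddings, dtype=np.float64)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)  # unit-normalize rows
    kept = []
    for i in range(len(x)):
        # Dot products of unit vectors are cosine similarities.
        if not kept or np.max(x[kept] @ x[i]) < threshold:
            kept.append(i)
    return kept
```

This version compares every candidate against all kept tiles, so its cost grows quadratically. At the scale of this dataset, a pipeline would more plausibly use an approximate nearest-neighbor index, but the keep/drop rule would stay the same.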
Source Data
- Repository: nyuuzyou/pxhere
- Description: ~1.1M CC0 images from PxHere, stored as WebDataset tars
- Content: Professional photography spanning landscapes, architecture, nature, objects, and more
- License: CC0-1.0 (public domain)
Bicubic Downscaling
LR (low-resolution) images are provided alongside HR tiles, downscaled using MATLAB-compatible bicubic interpolation (an anti-aliased cubic kernel with a = -0.5). This matches the downscaling convention used for standard SISR benchmarks (Set5, Set14, Urban100, etc.), so PSNR/SSIM values stay comparable with published results.
Scale factors: ×2 and ×4.
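For readers who want to see what "MATLAB-compatible" bicubic downscaling involves, here is a NumPy sketch modeled on the commonly described behavior of MATLAB's `imresize`: a cubic kernel with a = -0.5, widened by the scale factor for anti-aliasing, with symmetric border handling. It is an illustration under those assumptions, not the code used to produce the LR folders, and it skips the rounding and clipping that applies to uint8 images. The function names are hypothetical.

```python
import numpy as np


def cubic(x, a=-0.5):
    """Keys cubic convolution kernel (a = -0.5 gives the usual bicubic)."""
    ax = np.abs(x)
    ax2, ax3 = ax**2, ax**3
    near = (a + 2) * ax3 - (a + 3) * ax2 + 1          # |x| <= 1
    far = a * ax3 - 5 * a * ax2 + 8 * a * ax - 4 * a  # 1 < |x| <= 2
    return np.where(ax <= 1, near, np.where(ax <= 2, far, 0.0))


def weight_matrix(in_len, scale):
    """Dense (out_len, in_len) matrix of 1-D resampling weights."""
    out_len = int(np.ceil(in_len * scale))
    # Anti-aliasing: when shrinking, stretch the kernel by 1/scale.
    width = 4.0 / scale if scale < 1 else 4.0
    k_scale = scale if scale < 1 else 1.0
    u = (np.arange(out_len) + 0.5) / scale - 0.5      # centers in input coords
    left = np.floor(u - width / 2).astype(int)
    taps = int(np.ceil(width)) + 2
    idx = left[:, None] + np.arange(taps)[None, :]
    w = k_scale * cubic(k_scale * (u[:, None] - idx))
    w = w / w.sum(axis=1, keepdims=True)              # each row sums to 1
    # Symmetric (mirror) border handling for out-of-range indices.
    idx = np.mod(idx, 2 * in_len)
    idx = np.where(idx >= in_len, 2 * in_len - 1 - idx, idx)
    W = np.zeros((out_len, in_len))
    rows = np.repeat(np.arange(out_len), taps)
    np.add.at(W, (rows, idx.ravel()), w.ravel())
    return W


def downscale(img, factor):
    """Downscale an (H, W) or (H, W, C) array by an integer factor."""
    img = np.asarray(img, dtype=np.float64)
    Wh = weight_matrix(img.shape[0], 1.0 / factor)
    Ww = weight_matrix(img.shape[1], 1.0 / factor)
    out = np.tensordot(Wh, img, axes=(1, 0))          # resample rows
    out = np.tensordot(Ww, out, axes=(1, 1))          # resample columns
    return np.moveaxis(out, 0, 1)                     # back to (H, W, ...)
```

Calling `downscale(tile, 4)` on a 256×256 tile yields a 64×64 result. The result is floating point, so a real pipeline would still need to round and clip before saving as 8-bit PNG.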
Lineage
Each batch produces a lineage_batch_*.csv file tracking per-image complexity scores and tile counts for reproducibility.
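A small sketch of how the lineage files might be gathered for inspection follows. The column names are not documented here, so the code makes no assumption about them and simply returns each row as a dictionary keyed by whatever header the CSV provides; `load_lineage` and the added `_source_file` key are hypothetical.

```python
import csv
from pathlib import Path


def load_lineage(root):
    """Read every lineage_batch_*.csv under root into a list of dicts.

    Each row keeps the file's own column names and gains a
    '_source_file' key (added here for traceability only).
    """
    rows = []
    for path in sorted(Path(root).glob("lineage_batch_*.csv")):
        with path.open(newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                row["_source_file"] = path.name
                rows.append(row)
    return rows
```

Printing `rows[0].keys()` after loading is a quick way to discover the actual columns before filtering on complexity scores or tile counts.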
Citation
If you use this dataset, please cite:
@dataset{lucid_cc0_v2,
title={LUCID-CC0 v2: Large-Scale Curated CC0 Training Dataset for SISR},
author={Phips},
year={2026},
license={CC0-1.0},
url={https://huggingface.co/datasets/Phips/lucid-cc0-v2}
}