LUCID-CC0 v2: Large-Scale Curated CC0 Training Dataset for Single-Image Super-Resolution

A large-scale, high-quality training dataset for single-image super-resolution (SISR), filtered from nyuuzyou/pxhere using the LUCID filtering pipeline. All source images are CC0-licensed (public domain).

Overview

Property              Value
Source                PxHere (CC0) via WebDataset tars
Filtering             LUCID pipeline (ICNet complexity + signal filter + deduplication)
Tile size             256×256 pixels
Multiscale            Yes (1.0×, 0.75×, 0.5×, 0.25× scales)
Complexity threshold  ≥ 0.6 (LUCID auto-calibrated)
Deduplication         Cosine similarity < 0.96
License               CC0-1.0 (public domain)
Total tiles           1,590,938
Disk size             199 GB
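
As a rough sanity check on the figures above, the following sketch derives the minimum number of subdirectories and the average size per tile. It assumes that "GB" means 10^9 bytes and that the 199 GB is divided across the 1,590,938 tiles only; whether the LR folders count toward that total is not stated, so the per-tile figure is an upper-bound estimate.

```python
import math

total_tiles = 1_590_938   # "Total tiles" from the table above
max_per_dir = 10_000      # at most 10,000 tiles per subdirectory
disk_bytes = 199e9        # ASSUMPTION: decimal gigabytes (10^9 bytes)

# Fewest subdirectories that can hold every tile under the per-directory cap.
min_subdirs = math.ceil(total_tiles / max_per_dir)

# Average size per tile if the whole 199 GB were HR tiles (an assumption).
avg_kb_per_tile = disk_bytes / total_tiles / 1e3

print(min_subdirs)                # 160
print(round(avg_kb_per_tile, 1))  # 125.1
```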

Intended Use

This dataset is designed for training SISR models from scratch, particularly large, data-hungry architectures such as transformers. Example targets:

  • HAT (Hybrid Attention Transformer)
  • HAT-L (Large variant)
  • SwinIR
  • RealESRGAN / traiNNer-redux
  • Diffusion-based super-resolution models
  • Any new architecture that benefits from diverse, high-quality training data

Recommended Training Strategy

This dataset is part of a three-stage training pipeline:

Stage                     Dataset                                            Purpose
1. Pretrain from scratch  This dataset (lucid-cc0-v2)                        Learn general image representations from diverse CC0 photos
2. Finetune               lucid-cc0-v2-hc (high-complexity, 256×256)         Refine on the highest-quality, most detailed tiles
3. Second finetune        lucid-cc0-v2-hc-512 (high-complexity, 512×512)     Push quality further with the largest patch size

Dataset Structure

lucid-cc0-v2/
├── train/
│   ├── 000/          # ≤10,000 tiles per subdirectory
│   │   ├── 00000.png
│   │   ├── 00001.png
│   │   └── ...
│   ├── 001/
│   └── ...
├── LR/
│   ├── x2/           # Bicubic downscaled ×2 (MATLAB-compatible)
│   └── x4/           # Bicubic downscaled ×4 (MATLAB-compatible)
├── batch_manifest.json
├── lineage_batch_*.csv
└── DATASET_NOTES.md
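
A minimal sketch of pairing HR tiles with their LR counterparts for training. The tree above does not show the layout inside LR/x2 and LR/x4, so this assumes they mirror the train/ subdirectories and filenames (for example, train/000/00000.png pairs with LR/x4/000/00000.png). The helper name list_pairs is hypothetical; adjust the path mapping if the actual layout differs.

```python
from pathlib import Path

def list_pairs(root, scale=4):
    """Return (hr_path, lr_path) tuples for every HR tile that has an LR match.

    ASSUMPTION: LR/x{scale}/ mirrors the train/ subdirectory layout.
    """
    root = Path(root)
    hr_dir = root / "train"
    lr_dir = root / "LR" / f"x{scale}"
    pairs = []
    for hr_path in sorted(hr_dir.glob("*/*.png")):
        # Same relative path under the LR folder, e.g. 000/00000.png.
        lr_path = lr_dir / hr_path.relative_to(hr_dir)
        if lr_path.exists():
            pairs.append((hr_path, lr_path))
    return pairs
```

Calling list_pairs("lucid-cc0-v2", scale=2) would return the ×2 pairs under the same assumption. Tiles without an LR match are skipped rather than raising an error.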

Filtering Pipeline

Images were filtered using LUCID with the following stages:

  1. Signal filter: Removes low-information images (blurry, overexposed, underexposed, low-contrast)
  2. ICNet complexity scoring: Neural network estimates perceptual complexity; tiles below threshold are removed
  3. Multiscale tiling: Images tiled at multiple scales (1.0×, 0.75×, 0.5×, 0.25×) to capture both fine detail and global structure
  4. Deduplication: Perceptual cosine similarity deduplication (threshold 0.96) removes near-duplicate tiles
  5. Tile extraction: 256×256 PNG tiles saved with ≤10,000 files per subdirectory
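
To illustrate the deduplication stage, here is a sketch of greedy cosine-similarity filtering at the 0.96 threshold. It is not LUCID's implementation: the embedding model, the comparison order, and the function name greedy_dedup are all assumptions made for illustration. Each tile is assumed to already have a nonzero feature vector.

```python
import numpy as np

def greedy_dedup(embeddings, threshold=0.96):
    """Return indices of tiles to keep.

    A tile is kept only if its cosine similarity to every previously kept
    tile is below `threshold`. ASSUMPTION: rows are nonzero feature vectors.
    """
    emb = np.asarray(embeddings, dtype=np.float64)
    # Normalise rows so that a dot product equals cosine similarity.
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    kept = []
    for i, vec in enumerate(normed):
        if not kept or np.max(normed[kept] @ vec) < threshold:
            kept.append(i)
    return kept

# The second vector is nearly parallel to the first, so it is dropped.
print(greedy_dedup([[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]]))  # [0, 2]
```

This loop is quadratic in the number of kept tiles; at the scale of millions of tiles an approximate nearest-neighbour index would likely be needed instead.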

Source Data

  • Repository: nyuuzyou/pxhere
  • Description: ~1.1M CC0 images from PxHere, stored as WebDataset tars
  • Content: Professional photography spanning landscapes, architecture, nature, objects, and more
  • License: CC0-1.0 (public domain)

Bicubic Downscaling

LR (low-resolution) images are provided alongside the HR tiles, downscaled using MATLAB-compatible bicubic interpolation (cubic kernel with a = -0.5, antialiasing enabled). This matches the convention used for standard SISR benchmarks (Urban100, Set5, Set14, etc.), so PSNR/SSIM values stay comparable with published results.

Scale factors: ×2 and ×4.
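
For readers who want to see what antialiased bicubic downscaling involves, the NumPy sketch below follows the commonly described scheme: the Keys cubic kernel with a = -0.5 is stretched by the scale factor, weights are normalised per output pixel, and borders are mirrored. It is an illustration only. It has not been verified as bit-exact against MATLAB's imresize or against the tool used to build this dataset, it returns floats without rounding or clipping, and the function names are hypothetical.

```python
import numpy as np

def cubic_kernel(x, a=-0.5):
    """Keys cubic convolution kernel (a = -0.5)."""
    ax = np.abs(x)
    near = (a + 2) * ax**3 - (a + 3) * ax**2 + 1          # |x| <= 1
    far = a * ax**3 - 5 * a * ax**2 + 8 * a * ax - 4 * a  # 1 < |x| <= 2
    return np.where(ax <= 1, near, np.where(ax <= 2, far, 0.0))

def resize_axis(img, scale, axis):
    """Resize one axis of `img` by `scale` with an antialiased cubic filter."""
    in_len = img.shape[axis]
    out_len = int(np.ceil(in_len * scale))
    stretch = min(scale, 1.0)          # widen the kernel only when downscaling
    support = 4.0 / stretch
    taps = int(np.ceil(support)) + 2
    # Centre of each output pixel in 0-indexed input coordinates.
    centres = (np.arange(out_len) + 0.5) / scale - 0.5
    left = np.floor(centres - support / 2).astype(int)
    idx = left[:, None] + np.arange(taps)[None, :]
    weights = stretch * cubic_kernel(stretch * (centres[:, None] - idx))
    weights /= weights.sum(axis=1, keepdims=True)
    # Mirror out-of-range sample positions at the image borders.
    mirror = np.concatenate([np.arange(in_len), np.arange(in_len)[::-1]])
    idx = mirror[np.mod(idx, 2 * in_len)]
    moved = np.moveaxis(np.asarray(img, dtype=np.float64), axis, 0)
    w = weights.reshape(weights.shape + (1,) * (moved.ndim - 1))
    out = (moved[idx] * w).sum(axis=1)
    return np.moveaxis(out, 0, axis)

def bicubic_downscale(img, factor):
    """Downscale an H x W (x C) array by an integer factor such as 2 or 4."""
    scale = 1.0 / factor
    return resize_axis(resize_axis(img, scale, axis=0), scale, axis=1)
```

Under these assumptions a 256×256 tile yields a 128×128 array at factor 2 and a 64×64 array at factor 4.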

Lineage

Each batch produces a lineage_batch_*.csv file tracking per-image complexity scores and tile counts for reproducibility.
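
A minimal sketch for collecting all lineage files into one list of records. The column names are not documented here, so the code makes no assumption about them beyond each CSV having a header row; the helper name load_lineage and the added "_source_file" key are hypothetical.

```python
import csv
from pathlib import Path

def load_lineage(root):
    """Read every lineage_batch_*.csv under `root` into a list of dicts.

    ASSUMPTION: each file is UTF-8 and starts with a header row.
    """
    rows = []
    for path in sorted(Path(root).glob("lineage_batch_*.csv")):
        with path.open(newline="", encoding="utf-8") as fh:
            for row in csv.DictReader(fh):
                row["_source_file"] = path.name  # remember which batch it came from
                rows.append(row)
    return rows
```

Inspecting rows[0].keys() after loading shows which columns the files actually contain.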

Citation

If you use this dataset, please cite:

@dataset{lucid_cc0_v2,
  title={LUCID-CC0 v2: Large-Scale Curated CC0 Training Dataset for SISR},
  author={Phips},
  year={2026},
  license={CC0-1.0},
  url={https://huggingface.co/datasets/Phips/lucid-cc0-v2}
}

Acknowledgments

  • Source images from PxHere (CC0)
  • Filtering powered by LUCID
  • Inspired by the SISR community's need for large-scale, ethically-sourced training data