---
title: NAF Zero-Shot Feature Upsampling
emoji: 🎯
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 6.0.1
app_file: app.py
pinned: false
license: apache-2.0
---

# 🎯 NAF: Zero-Shot Feature Upsampling via Neighborhood Attention Filtering

This Space demonstrates NAF (Neighborhood Attention Filtering), a method for upsampling features from Vision Foundation Models to any resolution without model-specific training.

## 🚀 Features

- **Universal Upsampling**: Works with any Vision Foundation Model (DINOv2, DINOv3, RADIO, DINO, SigLIP, etc.)
- **Arbitrary Resolutions**: Upsample features to any target resolution while maintaining aspect ratio
- **Zero-Shot**: No model-specific training or fine-tuning required
- **Interactive Demo**: Upload your own images or try sample images from various domains
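
The exact resizing rule is not documented in this Space, but "any target resolution while maintaining aspect ratio" typically means scaling so the longer side matches the target. A minimal sketch of that convention (the helper name is hypothetical):

```python
def target_size(width: int, height: int, target: int) -> tuple[int, int]:
    """Scale (width, height) so the longer side equals `target`,
    preserving aspect ratio. Illustrative only; the demo's actual
    resizing rule may differ."""
    scale = target / max(width, height)
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For example, `target_size(1024, 768, 512)` returns `(512, 384)`: the 1024-px side is scaled to 512 and the 4:3 aspect ratio is preserved.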

## 🎨 How to Use

1. **Upload an Image**: Click "Upload Your Image" or select from the sample images
2. **Choose a Model**: Select a Vision Foundation Model from the dropdown
3. **Set Resolution**: Choose the target resolution for the upsampled features (64-512)
4. **Click "Upsample Features"**: See the comparison between low- and high-resolution features

## 📊 Visualization

The output shows three panels:

- **Left**: Your input image
- **Center**: Low-resolution features from the backbone (PCA visualization)
- **Right**: High-resolution features upsampled by NAF

Features are visualized with PCA, mapping the first three principal components to the RGB channels.
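
This PCA-to-RGB mapping can be sketched as follows. The function name and the min-max normalization choice are assumptions for illustration, not necessarily what this Space's `app.py` does:

```python
import numpy as np

def pca_rgb(features: np.ndarray) -> np.ndarray:
    """Project (H, W, C) features onto their first 3 principal
    components and map them to RGB values in [0, 1]."""
    h, w, c = features.shape
    flat = features.reshape(-1, c).astype(np.float64)
    flat -= flat.mean(axis=0)          # center before PCA
    # Principal directions from the SVD of the centered feature matrix.
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    proj = flat @ vt[:3].T             # (H*W, 3) projections
    # Min-max normalize each channel to [0, 1] for display.
    proj -= proj.min(axis=0)
    proj /= proj.max(axis=0) + 1e-8
    return proj.reshape(h, w, 3)
```

Applying the same three principal directions to both the low- and high-resolution feature maps keeps their colors comparable side by side.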

## 🔬 Supported Models

- **DINOv3**: Latest self-supervised vision models
- **RADIO v2.5**: High-performance vision backbones
- **DINOv2**: Self-supervised learning with registers
- **DINO**: Original self-supervised ViT
- **SigLIP**: Contrastive vision-language models

## 📖 Learn More

## 💡 Use Cases

NAF enables better feature representations for:

- Dense prediction tasks (segmentation, depth estimation)
- High-resolution visual understanding
- Feature matching and correspondence
- Vision-language alignment

βš™οΈ Technical Details

  • Input: Images up to 512px (maintains aspect ratio)
  • Processing: Backbone feature extraction β†’ NAF upsampling
  • Output: High-resolution features at target resolution
  • Device: Runs on CPU (free tier) or GPU (faster inference)
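
The two-stage flow above can be sketched with a placeholder upsampler. Here nearest-neighbor interpolation stands in for NAF (which instead predicts content-adaptive neighborhood attention weights), and `extract_features` is a hypothetical backbone call, not an API from this repo:

```python
import numpy as np

def upsample_nearest(feats: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Nearest-neighbor upsampling of (H, W, C) features to (out_h, out_w, C).
    A placeholder for the NAF upsampling stage, for illustration only."""
    h, w, _ = feats.shape
    ys = np.arange(out_h) * h // out_h   # source row for each output row
    xs = np.arange(out_w) * w // out_w   # source column for each output column
    return feats[ys][:, xs]

def demo_pipeline(image, extract_features, target: int = 512) -> np.ndarray:
    """Stage 1: backbone feature extraction (user-supplied callable).
    Stage 2: upsampling to the target resolution."""
    low_res = extract_features(image)    # e.g. (32, 32, C) patch features
    return upsample_nearest(low_res, target, target)
```

In the real demo, stage 2 is NAF itself, so the upsampled features follow image edges rather than the blocky patch grid a nearest-neighbor resize produces.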

## 🤝 Citation

If you use NAF in your research, please cite:

```bibtex
@article{chambon2025naf,
  title={NAF: Zero-Shot Feature Upsampling via Neighborhood Attention Filtering},
  author={Chambon, Lucas and others},
  journal={arXiv preprint arXiv:2501.01535},
  year={2025}
}
```

## 📜 License

This demo is released under the Apache 2.0 license.