arxiv:2606.14024

ViT-Up: Faithful Feature Upsampling for Vision Transformers

Published on Jun 12

Submitted by Wandel on Jun 18

Shanghai Jiao Tong University


Authors: Krispin Wandel

Abstract

ViT-Up is a feature upsampling framework for Vision Transformers that uses layer-wise query construction from hidden states to improve dense prediction tasks, outperforming existing image-guided methods.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Vision Transformers (ViTs) have become a dominant architecture for visual representation learning, providing exceptionally strong and broadly reusable backbone features. However, ViTs are commonly operated on relatively small patch-token grids due to the quadratic cost of global self-attention, which creates a persistent bottleneck for dense prediction tasks such as semantic segmentation and depth estimation. This has motivated the development of task-agnostic feature upsamplers. While recent state-of-the-art methods produce visually sharp dense representations, their reliance on shallow image encoders for guided upsampling can introduce feature leakage, fragmentation, and blur. We introduce ViT-Up, an implicit feature upsampling framework that replaces external image guidance with layer-wise query construction from intermediate ViT hidden states. This enables feature prediction at arbitrary continuous image coordinates while preserving alignment with the backbone feature space. Experiments demonstrate that ViT-Up consistently outperforms state-of-the-art image-guided upsamplers across dense prediction and semantic correspondence. On DINOv3-S+, ViT-Up improves over prior methods by up to +2.07 mIoU on Cityscapes and +4.17 PCK@0.10 on SPair-71k. With the larger DINOv3-B backbone, these gains increase to +3.36 mIoU and +8.09 PCK@0.10, demonstrating that ViT-Up scales favorably with backbone capacity.
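The resolution bottleneck described in the abstract can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes square inputs and a patch size of 16 (an assumption, not stated on this page), ignores any class or register tokens, and counts token-to-token interactions in a single global self-attention layer. The helper names `patch_grid` and `attention_pairs` are hypothetical.

```python
def patch_grid(image_size: int, patch_size: int) -> tuple[int, int]:
    """Return (tokens per side, total patch tokens) for a square image."""
    side = image_size // patch_size
    return side, side * side


def attention_pairs(num_tokens: int) -> int:
    """Token-to-token interactions in one global self-attention layer."""
    return num_tokens * num_tokens


# Assumed patch size of 16; doubling the input side quadruples the token
# count and multiplies the number of attention pairs by 16.
for size in (224, 448, 896):
    side, tokens = patch_grid(size, patch_size=16)
    pairs = attention_pairs(tokens)
    print(f"{size}px -> {side}x{side} grid, {tokens} tokens, {pairs:,} pairs")
```

Under these assumptions a 224-pixel image yields only a 14x14 feature grid, which is why dense prediction heads typically rely on some form of feature upsampling rather than on running the backbone at a much higher input resolution.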


Community

Krispin (paper author, paper submitter), about 6 hours ago

Introducing ViT-Up: A state-of-the-art task-agnostic feature upsampler for Vision Transformers.
ViT-Up predicts features at ⭐ arbitrary continuous image coordinates ⭐, enabling dense feature maps at any resolution and sample-aware vision pipelines that query features only where they are needed.
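To illustrate what querying features at continuous image coordinates means, here is a minimal sketch of the naive baseline: bilinear interpolation over a coarse patch-token grid. This is not ViT-Up's learned upsampler, whose architecture is not described on this page. The function name `bilinear_sample`, the nested-list layout, and the convention that coordinates are normalised to [0, 1] with patch centres at half-integer offsets are all assumptions made for the example.

```python
def bilinear_sample(grid, x, y):
    """Sample a feature vector at continuous image coordinates (x, y) in [0, 1].

    grid is an H x W x C nested list of patch features. Patch (i, j) is
    assumed to be centred at ((j + 0.5) / W, (i + 0.5) / H).
    """
    h, w = len(grid), len(grid[0])
    # Map to fractional patch-index space and clamp to the outermost centres.
    gx = min(max(x * w - 0.5, 0.0), w - 1.0)
    gy = min(max(y * h - 0.5, 0.0), h - 1.0)
    x0, y0 = int(gx), int(gy)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    tx, ty = gx - x0, gy - y0
    # Blend the four neighbouring patch features channel by channel.
    return [
        (1 - ty) * ((1 - tx) * a + tx * b) + ty * ((1 - tx) * c + tx * d)
        for a, b, c, d in zip(
            grid[y0][x0], grid[y0][x1], grid[y1][x0], grid[y1][x1]
        )
    ]


# A toy 2x2 grid with one channel; query only the points that are needed.
toy_grid = [[[0.0], [1.0]], [[2.0], [3.0]]]
queries = [(0.25, 0.25), (0.5, 0.5), (0.75, 0.25)]
features = [bilinear_sample(toy_grid, qx, qy) for qx, qy in queries]
```

A fixed interpolation kernel like this can only blur between neighbouring patches. A learned upsampler aims to replace that step with a prediction that stays in the backbone's feature space while keeping the same point-query interface.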

Pretrained through self-supervised feature distillation on over one million ImageNet-1K images, it supports data-constrained dense prediction and fine-grained correspondence by letting downstream heads operate directly on dense DINOv3 features.

ViT-Up outperforms prior state-of-the-art feature upsamplers across dense prediction and semantic correspondence benchmarks. On DINOv3-S+, ViT-Up improves over prior methods by up to:

+2.07 mIoU on Cityscapes
+4.17 PCK@0.10 on SPair-71k
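For readers unfamiliar with the two metrics, the sketch below shows how they are commonly defined. It is a simplified illustration, not the paper's evaluation code: PCK@α is computed here with a threshold of α times the longer side of the object bounding box, and mIoU is computed over flat label lists for a single sample, whereas benchmark protocols usually accumulate statistics over the whole dataset. The function names and argument layouts are hypothetical.

```python
import math


def pck(pred, gt, bbox_hw, alpha=0.10):
    """Fraction of keypoints within alpha * max(bbox height, width) of truth."""
    threshold = alpha * max(bbox_hw)
    correct = sum(math.dist(p, g) <= threshold for p, g in zip(pred, gt))
    return correct / len(gt)


def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union over classes present in pred or gt."""
    ious = []
    for c in range(num_classes):
        inter = sum(p == c and g == c for p, g in zip(pred, gt))
        union = sum(p == c or g == c for p, g in zip(pred, gt))
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious)


# Two keypoints, bounding box 100 x 50, so the PCK@0.10 threshold is 10 pixels.
score = pck([(0, 0), (10, 10)], [(0, 5), (10, 30)], bbox_hw=(100, 50))
# Four pixels, two classes.
iou = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

In the toy example one of the two keypoints falls within the threshold, and the two per-class IoU values are 1/2 and 2/3.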

The project page includes pretrained checkpoints, code for training and evaluation, quantitative results, qualitative comparisons, the arXiv preprint, and a Google Colab demo:

https://vitup.papers.discuna.com/


Get this paper in your agent:

hf papers read 2606.14024

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
