
Near, far: Patch-ordering enhances vision foundation models' scene understanding

Welcome to the Hugging Face repository for NeCo, an adapted vision encoder that captures the fine-grained details and structural information essential for key-point matching, semantic segmentation, and more. This repository hosts pretrained checkpoints for NeCo, enabling easy integration into your projects.

The accompanying paper: "Near, far: Patch-ordering enhances vision foundation models' scene understanding"
Valentinos Pariza, Mohammadreza Salehi, Gertjan J. Burghouts, Francesco Locatello, Yuki M. Asano

๐ŸŒ Project Page โŒจ๏ธ GitHub Repository ๐Ÿ“„ Read the Paper on arXiv

Model Details

Model Description

NeCo introduces a new self-supervised learning technique for enhancing spatial representations in vision transformers. By leveraging Patch Neighbor Consistency, NeCo captures fine-grained details and structural information that are crucial for various downstream tasks, such as semantic segmentation.
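As a rough illustration of the idea, the toy sketch below encourages a student encoder to rank a patch's neighbors by similarity in the same order as a teacher encoder. This is not the paper's actual objective (NeCo uses a differentiable sorting method; the tensor names, shapes, and the simple ranking regression here are assumptions for illustration only):

import torch
import torch.nn.functional as F

# Toy sketch of patch-neighbor consistency (NOT the paper's exact loss).
# student_patches / teacher_patches: (num_patches, dim) embeddings of the
# same image from a student and a teacher encoder (assumed setup).
def neighbor_consistency_sketch(student_patches, teacher_patches):
    s = F.normalize(student_patches, dim=-1)
    t = F.normalize(teacher_patches, dim=-1)
    # Patch-to-patch cosine similarities within each view.
    sim_s = s @ s.T
    sim_t = t @ t.T
    # For each patch, the teacher's neighbor ordering (nearest first).
    order = sim_t.argsort(dim=-1, descending=True)
    # Read the student's similarities in the teacher's neighbor order and
    # push them toward a descending profile (near neighbors high, far ones
    # low); the paper replaces this with a differentiable sorting objective.
    target = torch.linspace(1.0, 0.0, sim_s.size(-1), device=s.device)
    return F.mse_loss(sim_s.gather(-1, order), target.expand_as(sim_s))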

  • Model type: Vision Encoder (Dino, Dinov2, ...)
  • Language(s): Python
  • License: MIT
  • Finetuned from model: Dinov2, Dinov2R, Dino, ...

How to Get Started with the Model

To use NeCo models on downstream dense prediction tasks, you only need timm and torch installed; depending on which checkpoint you use, you can load the model as shown below.
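For example, with pip:

pip install torch timm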

The models can be downloaded from our NeCo Hugging Face repo.

Models after post-training dinov2 (following the dinov2 architecture)

NeCo on Dinov2
import torch
# Change to dinov2_vitb14 for the base model, as described in:
#    https://github.com/facebookresearch/dinov2
model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
path_to_checkpoint = "<your path to downloaded ckpt>"
state_dict = torch.load(path_to_checkpoint, map_location='cpu')  # load on CPU first
model.load_state_dict(state_dict, strict=False)
NeCo on Dinov2 with Registers
import torch
# Change to dinov2_vitb14_reg for the base model, as described in:
#    https://github.com/facebookresearch/dinov2
model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14_reg')
path_to_checkpoint = "<your path to downloaded ckpt>"
state_dict = torch.load(path_to_checkpoint, map_location='cpu')  # load on CPU first
model.load_state_dict(state_dict, strict=False)
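Once the checkpoint is loaded, dense patch features can be extracted through the standard DINOv2 interface (a minimal sketch; the forward_features call and the x_norm_patchtokens output key follow the facebookresearch/dinov2 implementation, and the dummy input is for illustration only):

import torch

# Minimal sketch: extract dense patch features from the loaded model.
# H and W must be multiples of the patch size (14).
images = torch.randn(1, 3, 518, 518)  # dummy batch; use real images in practice
with torch.no_grad():
    out = model.forward_features(images)
patch_tokens = out["x_norm_patchtokens"]  # (1, 37*37, embed_dim) dense features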

Models after post-training dino or similar (following the dino architecture)

timm vit-small and vit-base architectures
import torch
from timm.models.vision_transformer import vit_small_patch16_224, vit_base_patch16_224
# Change to vit_base_patch16_224() if you want to use our larger model
model = vit_small_patch16_224()
path_to_checkpoint = "<your path to downloaded ckpt>"
state_dict = torch.load(path_to_checkpoint, map_location='cpu')
model.load_state_dict(state_dict, strict=False)
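Dense features can then be read off the patch tokens (a minimal sketch assuming timm's standard ViT interface, where forward_features returns the full token sequence with the class token first; the dummy input is for illustration only):

import torch

# Minimal sketch: get patch tokens from the loaded timm ViT.
images = torch.randn(1, 3, 224, 224)  # dummy batch at the pretraining resolution
with torch.no_grad():
    tokens = model.forward_features(images)  # (1, 1 + 14*14, embed_dim)
patch_tokens = tokens[:, 1:]  # drop the class token -> (1, 196, embed_dim)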

Note: In case you want to load the model weights directly from a Hugging Face URL, please execute:

import torch
state_dict = torch.hub.load_state_dict_from_url("<url to the hugging face checkpoint>", map_location='cpu')
model.load_state_dict(state_dict, strict=False)

Training Details

Training Data

  • We have post-trained our models on the COCO Dataset.

Training Procedure

Please see our repository and read our paper for more details.

Environmental Impact

  • Hardware Type: NVIDIA A100 GPU
  • Hours used: 18 (per model)
  • Cloud Provider: Helma (NHR FAU, Germany) and Snellius (The Netherlands)
  • Compute Region: Europe (Germany & The Netherlands)

Citation

BibTeX:

@inproceedings{
   pariza2025near,
   title={Near, far: Patch-ordering enhances vision foundation models' scene understanding},
   author={Valentinos Pariza and Mohammadreza Salehi and Gertjan J. Burghouts and Francesco Locatello and Yuki M Asano},
   booktitle={The Thirteenth International Conference on Learning Representations},
   year={2025},
   url={https://openreview.net/forum?id=Qro97zWC29}
}