arxiv:2304.07193

DINOv2: Learning Robust Visual Features without Supervision

Published on Apr 14, 2023
Abstract

The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021) on most of the benchmarks at image and pixel levels.

Community

  • Proposes DINOv2: an upgrade to DINO with many integrations, a larger dataset (to accelerate and stabilize training at scale), and distilled models (from a ViT with ~1B parameters). Uses the curated LVD-142M dataset; dataset de-duplication by clustering/retrieval on embeddings (FAISS batch searches; see the deduplication sketch after this list); sourced from ImageNet, Google Landmarks, Mapillary SLS, Food-101, etc. (a collection of task datasets).
  • Trained using a combination of the DINO and iBOT losses with the centering of SwAV: a learnable student with an EMA teacher, cross-entropy on CLS tokens (global image representation) with multi-crop as in DINO; patch-level masking of student inputs with cross-entropy against the corresponding teacher patch tokens; untying the heads (separate weights for the image- and patch-level objectives) improves performance; replaces the teacher softmax-centering (from DINO and iBOT) with Sinkhorn-Knopp (SK) batch normalization from SwAV; adds the KoLeo regularizer on L2-normalized features for batch uniformity (see the SK/KoLeo sketch after this list); uses high-resolution images towards the end of pre-training.
  • Implementation includes FlashAttention (fast and memory-efficient), nested tensors from xFormers (fewer forward passes for global and local crops), stochastic depth that skips the computation of dropped residual branches instead of masking them (see the sketch after this list), and mixed-precision PyTorch FSDP instead of DDP; distillation into smaller models from a larger teacher with some modifications.
  • Ablations cover model selection (done on ImageNet-1K with kNN and linear probing; see the kNN-probe sketch after this list), data source (the curation strategy tested on four datasets), model-scale and data-size comparisons, loss objectives (KoLeo and the MIM patch objective), and distillation strategy/effectiveness (approaching the from-scratch numbers of a larger model).
  • Approaches weakly (text) supervised results on ImageNet (comparable to EVA-CLIP, better than SSL methods such as Mugs, EsViT, and iBOT), and is also tested on ImageNet domain generalization; linear-probe image classification on iNat and Places205 and video classification (comparable to or better than OpenCLIP); better instance recognition on Oxford, Paris, Met, and AmsterTime; better semantic segmentation on ADE20k, CityScapes, and Pascal VOC (using Mask2Former with ViT-Adapter on ViT-g approaches the fully-supervised absolute SOTA). Depth estimation uses a linear layer on frozen last-layer patch tokens with the CLS token (256-bin classification), also tries concatenating multiple ViT layers/blocks and regression over a DPT decoder (DPT + DINOv2 is better at depth estimation on NYUd and KITTI, and on SUN-RGBd for out-of-domain transfer from NYUd). Also shows qualitative results: PCA on patch features, thresholding the first component and keeping only positive values, extracts the image's subject/foreground (see the PCA sketch after this list); patch features can also be matched across images (semantic matching).
  • Also has fairness and carbon footprint analysis. Data processing, implementation details, and list of benchmarks in appendix. From Meta and Inria.
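
A minimal sketch of the embedding-based near-duplicate removal mentioned in the first bullet, assuming pre-computed image embeddings. The similarity threshold and the greedy keep/drop policy are hypothetical placeholders; the paper's pipeline uses large-scale FAISS batch searches and clustering rather than this toy loop.

```python
import faiss
import numpy as np

def deduplicate(embeddings: np.ndarray, threshold: float = 0.6) -> np.ndarray:
    """Return indices of images kept after dropping near-duplicates.
    `threshold` is an illustrative cosine-similarity cutoff, not the paper's value."""
    emb = embeddings.astype(np.float32)
    faiss.normalize_L2(emb)                      # cosine similarity via inner product
    index = faiss.IndexFlatIP(emb.shape[1])
    index.add(emb)
    sims, ids = index.search(emb, k=2)           # nearest neighbour besides self
    keep, removed = [], set()
    for i in range(emb.shape[0]):
        if i in removed:
            continue
        keep.append(i)
        neighbour, sim = int(ids[i, 1]), float(sims[i, 1])
        if sim > threshold:                      # greedily drop the near-duplicate
            removed.add(neighbour)
    return np.array(keep)
```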
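The second bullet's objective combines DINO/iBOT cross-entropy with Sinkhorn-Knopp centering of the teacher outputs and a KoLeo regularizer on L2-normalized features. Below is a rough PyTorch sketch of those two pieces; the temperature, iteration count, and epsilon are illustrative defaults, not the paper's settings.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sinkhorn_knopp(teacher_logits: torch.Tensor, temp: float = 0.04, n_iters: int = 3) -> torch.Tensor:
    """SwAV-style Sinkhorn-Knopp normalization of teacher scores (replaces softmax centering)."""
    Q = torch.exp(teacher_logits / temp).t()     # (prototypes, batch)
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True)          # normalize over prototypes
        Q /= K
        Q /= Q.sum(dim=0, keepdim=True)          # normalize over samples
        Q /= B
    return (Q * B).t()                           # each sample's assignment sums to 1

def koleo_loss(cls_feats: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """KoLeo regularizer: push each L2-normalized CLS feature away from its
    nearest neighbour in the batch, encouraging a uniform spread."""
    x = F.normalize(cls_feats, p=2, dim=-1)
    eye = torch.eye(x.shape[0], dtype=torch.bool, device=x.device)
    sim = (x @ x.t()).masked_fill(eye, -2.0)     # exclude self before taking the max
    nn_dist = torch.sqrt((2.0 - 2.0 * sim.max(dim=1).values).clamp_min(eps))
    return -torch.log(nn_dist + eps).mean()
```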
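The third bullet's efficiency trick, stochastic depth that actually skips the dropped computation, could look roughly like the wrapper below. The drop rate and the 1/(1-p) scaling are illustrative, and the real implementation operates inside the ViT blocks rather than through a generic module.

```python
import torch
import torch.nn as nn

class EfficientStochasticDepth(nn.Module):
    """Sketch: run the residual branch only on the samples that survive the drop,
    instead of computing the full branch and masking its output."""

    def __init__(self, branch: nn.Module, drop_prob: float = 0.2):
        super().__init__()
        self.branch = branch
        self.drop_prob = drop_prob

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, tokens, dim)
        if not self.training or self.drop_prob == 0.0:
            return x + self.branch(x)
        keep = torch.rand(x.shape[0], device=x.device) > self.drop_prob
        out = x.clone()
        if keep.any():
            # compute the branch on the kept subset only, rescale to keep expectation
            out[keep] = x[keep] + self.branch(x[keep]) / (1.0 - self.drop_prob)
        return out
```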
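The ablations rely on frozen-feature evaluation; a common weighted k-NN probe (of the kind used for model selection on ImageNet-1K) is sketched below, with k, the temperature, and the class count being typical defaults rather than the paper's exact values.

```python
import torch

@torch.no_grad()
def knn_classify(train_feats, train_labels, test_feats, k=20, temperature=0.07, num_classes=1000):
    """Weighted k-NN probe on frozen, L2-normalized features.
    For large datasets the similarity matrix should be computed in chunks."""
    train_feats = torch.nn.functional.normalize(train_feats, dim=1)
    test_feats = torch.nn.functional.normalize(test_feats, dim=1)
    sims = test_feats @ train_feats.t()                    # cosine similarities
    topk_sims, topk_idx = sims.topk(k, dim=1)
    topk_labels = train_labels[topk_idx]                   # (n_test, k)
    weights = (topk_sims / temperature).exp()
    votes = torch.zeros(test_feats.shape[0], num_classes, device=test_feats.device)
    votes.scatter_add_(1, topk_labels, weights)            # weighted class votes
    return votes.argmax(dim=1)
```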
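Finally, the qualitative PCA visualisation described in the last bullet (project patch features onto their first principal component and keep positive values to isolate the subject) can be sketched as follows; thresholding at zero follows the description, but the sign of the component is arbitrary in practice and may need flipping.

```python
import torch

def pca_foreground(patch_feats: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """patch_feats: (h*w, dim) patch tokens from the frozen backbone.
    Returns a boolean (h, w) mask where the first PCA component is positive."""
    centered = patch_feats - patch_feats.mean(dim=0, keepdim=True)
    _, _, v = torch.pca_lowrank(centered, q=1, center=False)  # first principal direction
    scores = centered @ v[:, 0]                                # projection per patch
    return (scores > 0).reshape(h, w)                          # positive side ~ subject/foreground
```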

Links: Website, Blog (Meta), PapersWithCode, HuggingFace Papers, HuggingFace Transformers, GitHub

Models citing this paper 34

Datasets citing this paper 0

Spaces citing this paper 8

Collections including this paper 2