nielsr (HF staff) committed on
Commit 04132fb
1 Parent(s): 79030f8

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -5,7 +5,7 @@ tags:
 - vision
 ---
 
-# Vision Transformer (base-sized model, patch size 16) trained using DINOv2
+# Vision Transformer (base-sized model) trained using DINOv2
 
 Vision Transformer (ViT) model trained using the DINOv2 method. It was introduced in the paper [DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/abs/2304.07193) by Oquab et al. and first released in [this repository](https://github.com/facebookresearch/dinov2).
 
@@ -15,7 +15,7 @@ Disclaimer: The team releasing DINOv2 did not write a model card for this model
 
 The Vision Transformer (ViT) is a transformer encoder model (BERT-like) pretrained on a large collection of images in a self-supervised fashion at a resolution of 224x224 pixels.
 
-Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. One also adds a [CLS] token to the beginning of a sequence to use it for classification tasks. One also adds absolute position embeddings before feeding the sequence to the layers of the Transformer encoder.
+Images are presented to the model as a sequence of fixed-size patches, which are linearly embedded. One also adds a [CLS] token to the beginning of a sequence to use it for classification tasks. One also adds absolute position embeddings before feeding the sequence to the layers of the Transformer encoder.
 
 Note that this model does not include any fine-tuned heads.
 
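For readers of the model card being edited above, a minimal feature-extraction sketch with the Transformers library is included below (it is not part of the commit). The checkpoint id `facebook/dinov2-base` is an assumption based on the card's title; substitute this repository's actual id if it differs.

```python
from transformers import AutoImageProcessor, AutoModel
from PIL import Image
import requests

# Load an example image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Checkpoint id is assumed; replace with the actual repo id of this model card
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = AutoModel.from_pretrained("facebook/dinov2-base")

# Resize/normalize the image and run the backbone
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)

# last_hidden_state contains the [CLS] token embedding followed by one embedding per image patch;
# since the model ships without a fine-tuned head, these features are typically used with a downstream
# classifier such as a linear probe
features = outputs.last_hidden_state
```

As the card notes, the checkpoint has no fine-tuned head, so the extracted `[CLS]` or patch features are the intended output.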