Matthijs committed on
Commit
a30ce04
1 Parent(s): f6ee0bf

add model card

Files changed (2)
  1. .DS_Store +0 -0
  2. README.md +95 -0
.DS_Store ADDED
Binary file (6.15 kB).
 
README.md ADDED
@@ -0,0 +1,95 @@
+ ---
+ license: other
+ tags:
+ - vision
+ - image-classification
+ datasets:
+ - imagenet-1k
+ widget:
+ - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg
+   example_title: Tiger
+ - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/teapot.jpg
+   example_title: Teapot
+ - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/palace.jpg
+   example_title: Palace
+ ---
+
+ # MobileViT (extra small-sized model)
+
+ MobileViT model pretrained on ImageNet-1k at resolution 256x256. It was introduced in [MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer](https://arxiv.org/abs/2110.02178) by Sachin Mehta and Mohammad Rastegari, and first released in [this repository](https://github.com/apple/ml-cvnets). It is released under the [Apple sample code license](https://github.com/apple/ml-cvnets/blob/main/LICENSE).
+
+ Disclaimer: The team releasing MobileViT did not write a model card for this model, so this model card has been written by the Hugging Face team.
+
+ ## Model description
+
+ MobileViT is a lightweight, low-latency convolutional neural network that combines MobileNetV2-style layers with a new block that replaces local processing in convolutions with global processing using transformers. As with ViT (Vision Transformer), the image data is converted into flattened patches before it is processed by the transformer layers. Afterwards, the patches are "unflattened" back into feature maps. This allows the MobileViT block to be placed anywhere inside a CNN. MobileViT does not require any positional embeddings.
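
The flatten and "unflatten" steps described above can be illustrated with a small sketch. This is not the MobileViT implementation: it works on a single-channel grid stored as nested lists, it omits the transformer layers that would run between the two steps, and the function names `unfold` and `fold` are chosen here for illustration only.

```python
def unfold(feature_map, patch):
    """Split an H x W grid into non-overlapping patch x patch tiles, each flattened."""
    height, width = len(feature_map), len(feature_map[0])
    assert height % patch == 0 and width % patch == 0, "grid must divide evenly"
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([
                feature_map[top + dy][left + dx]
                for dy in range(patch)
                for dx in range(patch)
            ])
    return patches


def fold(patches, height, width, patch):
    """Inverse of unfold: write each flattened tile back to its grid position."""
    grid = [[None] * width for _ in range(height)]
    tiles_per_row = width // patch
    for index, flat in enumerate(patches):
        top = (index // tiles_per_row) * patch
        left = (index % tiles_per_row) * patch
        for offset, value in enumerate(flat):
            grid[top + offset // patch][left + offset % patch] = value
    return grid


# A toy 4 x 4 "feature map" split into four 2 x 2 patches and restored.
feature_map = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = unfold(feature_map, patch=2)
restored = fold(patches, height=4, width=4, patch=2)
```

Because folding exactly reverses unfolding, the output has the same spatial layout as the input, which is the property that lets such a block sit between ordinary convolutional layers.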
+
+ ## Intended uses & limitations
+
+ You can use the raw model for image classification. See the [model hub](https://huggingface.co/models?search=mobilevit) to look for fine-tuned versions on a task that interests you.
+
+ ### How to use
+
+ Here is how to use this model to classify an image from the COCO 2017 dataset into one of the 1,000 ImageNet classes:
+
+ ```python
+ from transformers import MobileViTFeatureExtractor, MobileViTForImageClassification
+ from PIL import Image
+ import requests
+
+ # download a sample image from the COCO 2017 validation set
+ url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+ image = Image.open(requests.get(url, stream=True).raw)
+
+ # load the preprocessing pipeline and the pretrained classifier
+ feature_extractor = MobileViTFeatureExtractor.from_pretrained("apple/mobilevit-x-small")
+ model = MobileViTForImageClassification.from_pretrained("apple/mobilevit-x-small")
+
+ # resize, crop and convert the image to a PyTorch tensor
+ inputs = feature_extractor(images=image, return_tensors="pt")
+
+ outputs = model(**inputs)
+ logits = outputs.logits
+
+ # model predicts one of the 1000 ImageNet classes
+ predicted_class_idx = logits.argmax(-1).item()
+ print("Predicted class:", model.config.id2label[predicted_class_idx])
+ ```
+
+ Currently, both the feature extractor and model support PyTorch.
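
The snippet above prints only the single most likely class. A minimal sketch of how the same logits could be turned into a ranked list with probabilities follows. It uses plain Python lists so it runs without downloading the model; `toy_logits` and `toy_labels` are made-up stand-ins, and with the real model you would presumably pass `logits[0].tolist()` and `model.config.id2label` instead.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities, subtracting the max for stability."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, id2label, k=5):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Hypothetical four-class example; the real model has 1,000 classes.
toy_logits = [0.5, 2.0, -1.0, 1.0]
toy_labels = {0: "tiger", 1: "teapot", 2: "palace", 3: "tabby cat"}

for label, prob in top_k(toy_logits, toy_labels, k=3):
    print(f"{label}: {prob:.3f}")
```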
+
+ ## Training data
+
+ The MobileViT model was pretrained on [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k), a dataset consisting of about 1.28 million training images across 1,000 classes.
+
+ ## Training procedure
+
+ ### Preprocessing
+
+ Training requires only basic data augmentation, i.e. random resized cropping and horizontal flipping.
+
+ To learn multi-scale representations without requiring fine-tuning, a multi-scale sampler was used during training, with image sizes randomly sampled from: (160, 160), (192, 192), (256, 256), (288, 288), (320, 320).
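
As a rough illustration of the multi-scale idea, the sketch below draws one resolution per batch from the list above. It is an assumption-laden simplification, not the sampler from the original repository: the function name, the uniform sampling, and the fixed seed are all choices made here, and details such as how batch size might change with resolution are not covered in this card and are left out.

```python
import random

# Resolutions listed in this model card.
RESOLUTIONS = [(160, 160), (192, 192), (256, 256), (288, 288), (320, 320)]


def sample_batch_resolutions(num_batches, seed=0):
    """Pick one (height, width) per batch, uniformly at random (an assumption)."""
    rng = random.Random(seed)  # local generator keeps the sketch reproducible
    return [rng.choice(RESOLUTIONS) for _ in range(num_batches)]


# Every image in a given batch would be resized to that batch's resolution.
schedule = sample_batch_resolutions(num_batches=8)
```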
+
+ At inference time, images are resized to 288x288 and then center-cropped to 256x256.
+
+ Pixels are normalized to the range [0, 1]. Images are expected to be in BGR pixel order, not RGB.
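
The inference-time steps above reduce to some simple arithmetic, sketched here in plain Python. The feature extractor in the usage example is expected to apply these steps for you; the helper names below are hypothetical and exist only to make the numbers concrete.

```python
RESIZE = 288  # side length after resizing, per this card
CROP = 256    # side length after center cropping, per this card


def center_crop_box(width, height, crop):
    """Return (left, top, right, bottom) of a centered crop x crop window."""
    left = (width - crop) // 2
    top = (height - crop) // 2
    return left, top, left + crop, top + crop


def rgb_to_model_pixel(rgb):
    """Scale an 8-bit RGB pixel to [0, 1] and reorder its channels to BGR."""
    red, green, blue = rgb
    return (blue / 255.0, green / 255.0, red / 255.0)


box = center_crop_box(RESIZE, RESIZE, CROP)   # 16-pixel border removed on each side
pixel = rgb_to_model_pixel((255, 128, 0))     # an orange RGB pixel, reordered as BGR
```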
+
+ ### Pretraining
+
+ The MobileViT networks are trained from scratch for 300 epochs on ImageNet-1k on 8 NVIDIA GPUs with an effective batch size of 1024 and learning rate warmup for 3k steps, followed by cosine annealing. Training also used a label-smoothing cross-entropy loss and L2 weight decay. Training resolution varies from 160x160 to 320x320, using multi-scale sampling.
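
A warmup-then-cosine learning rate schedule of the kind described above could look like the following sketch. Only the 3k warmup steps come from this card. The linear shape of the warmup and the `start_lr`, `peak_lr` and `final_lr` values are illustrative placeholders, not settings confirmed here.

```python
import math


def learning_rate(step, total_steps, warmup_steps=3000,
                  start_lr=0.0002, peak_lr=0.002, final_lr=0.0002):
    """Linear warmup to peak_lr, then cosine annealing down to final_lr."""
    if step < warmup_steps:
        # Warmup phase: interpolate linearly from start_lr to peak_lr.
        fraction = step / warmup_steps
        return start_lr + fraction * (peak_lr - start_lr)
    # Annealing phase: progress runs from 0 to 1 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (peak_lr - final_lr) * cosine


# Hypothetical 10,000-step run, sampled at a few points.
for step in (0, 1500, 3000, 6500, 10000):
    print(step, round(learning_rate(step, total_steps=10000), 6))
```

The rate rises during warmup, peaks when warmup ends, and then decays smoothly, which avoids large updates while the randomly initialized weights are still unstable.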
+
+ ## Evaluation results
+
+ | Model | ImageNet top-1 accuracy | ImageNet top-5 accuracy | # params | URL |
+ |------------------|-------------------------|-------------------------|-----------|-------------------------------------------------|
+ | MobileViT-XXS | 69.0 | 88.9 | 1.3 M | https://huggingface.co/apple/mobilevit-xx-small |
+ | **MobileViT-XS** | **74.8** | **92.3** | **2.3 M** | https://huggingface.co/apple/mobilevit-x-small |
+ | MobileViT-S | 78.4 | 94.1 | 5.6 M | https://huggingface.co/apple/mobilevit-small |
+
+ ### BibTeX entry and citation info
+
+ ```bibtex
+ @inproceedings{vision-transformer,
+ title = {MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer},
+ author = {Sachin Mehta and Mohammad Rastegari},
+ year = {2022},
+ URL = {https://arxiv.org/abs/2110.02178}
+ }
+ ```