---
license: other
license_name: stem.ai.mtl
license_link: LICENSE
tags:
- vision
- image-classification
- STEM-AI-mtl/City_map
- Google
- ViT
- STEM-AI-mtl
datasets:
- STEM-AI-mtl/City_map

widget:
  - image: https://cdn.britannica.com/50/69550-050-B9DA3DCA/Central-New-York-City-borough-Manhattan-Park.jpg
    output:
      text: NYC
metrics:
- accuracy
---

# The fine-tuned ViT model that beats [Google's state-of-the-art model](https://huggingface.co/google/vit-base-patch16-224) and OpenAI's famous GPT-4

A fine-tuned image-classification model that identifies which city's map is shown in an input image.

The underlying Vision Transformer (ViT) base model is a transformer encoder (BERT-like) that was pretrained in a supervised fashion on ImageNet-21k, a large collection of images, at a resolution of 224x224 pixels. That base model was then fine-tuned on ImageNet (also referred to as ILSVRC2012), a dataset comprising 1 million images and 1,000 classes, also at a resolution of 224x224.



## How to use

[Inference script](https://github.com/STEM-ai/Vision/raw/7d92c8daa388eb74e8c336f2d0d3942722fec3c6/ViT_inference.py)

For more code examples, see the [documentation](https://huggingface.co/transformers/model_doc/vit.html#).
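
As a rough illustration of what inference with this kind of model can look like, here is a minimal sketch using the `transformers` ViT classes. It is not the linked inference script. The function names `top_k_labels` and `classify_map` are hypothetical, and the `model_id` argument is a placeholder for this model's Hub repository id, which is not stated here. It also assumes `torch`, `transformers` and `Pillow` are installed.

```python
import math


def top_k_labels(logits, id2label, k=3):
    """Apply a softmax to raw logits and return the k most probable (label, prob) pairs."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


def classify_map(image_path, model_id, k=3):
    """Hypothetical helper: classify one map image with a fine-tuned ViT checkpoint.

    `model_id` is an assumption: pass the Hub id or local path of the checkpoint.
    """
    # Heavy imports are kept local so top_k_labels works without them.
    import torch
    from PIL import Image
    from transformers import ViTForImageClassification, ViTImageProcessor

    processor = ViTImageProcessor.from_pretrained(model_id)
    model = ViTForImageClassification.from_pretrained(model_id)
    model.eval()

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")  # resizes and normalizes to 224x224
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    return top_k_labels(logits, model.config.id2label, k)
```

Calling `classify_map("some_map.jpg", model_id)` with the checkpoint's id would return the top three city labels with their probabilities, provided the checkpoint's config carries an `id2label` mapping.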

## Training data

This model is [Google's ViT-base-patch16-224](https://huggingface.co/google/vit-base-patch16-224) fine-tuned for city identification on the [STEM-AI-mtl/City_map dataset](https://huggingface.co/datasets/STEM-AI-mtl/City_map), which contains over 600 images of 45 different maps of cities around the world.

## Training procedure

Fine-tuning of [google/vit-base-patch16-224](https://huggingface.co/google/vit-base-patch16-224) was performed with Transformers on a 4 GB Nvidia GTX 1650 GPU.

[Training notebook](https://github.com/STEM-ai/Vision/raw/7d92c8daa388eb74e8c336f2d0d3942722fec3c6/Trainer_ViT.ipynb)

## Training evaluation results

The most accurate model was obtained with a learning rate of 1e-3. Training quality was evaluated on the training dataset, which gave the following metrics:

{'eval_loss': 1.3691096305847168,\
 'eval_accuracy': 0.6666666666666666,\
 'eval_runtime': 13.0277,\
 'eval_samples_per_second': 4.606,\
 'eval_steps_per_second': 0.154,\
 'epoch': 2.82}
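
The throughput figures above also imply how large the evaluation run was. The short sketch below derives this from the reported numbers. The derived quantities (sample count, step count, number of correct predictions) are inferred by arithmetic and are not stated in the original results, so treat them as estimates.

```python
# Reported evaluation metrics, copied from the results above.
metrics = {
    "eval_loss": 1.3691096305847168,
    "eval_accuracy": 0.6666666666666666,
    "eval_runtime": 13.0277,
    "eval_samples_per_second": 4.606,
    "eval_steps_per_second": 0.154,
    "epoch": 2.82,
}

# samples = runtime (s) * samples per second; steps likewise.
n_samples = round(metrics["eval_runtime"] * metrics["eval_samples_per_second"])
n_steps = round(metrics["eval_runtime"] * metrics["eval_steps_per_second"])

# Number of correct predictions implied by the accuracy.
n_correct = round(metrics["eval_accuracy"] * n_samples)

print(n_samples, n_steps, n_correct)
```

Under these assumptions the evaluation covered roughly 60 images in 2 batches, of which about 40 were classified correctly.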


## Model Card Authors

STEM.AI: stem.ai.mtl@gmail.com\
[William Harbec](https://www.linkedin.com/in/william-harbec-56a262248/)