---
license: other
license_name: stem.ai.mtl
license_link: LICENSE
tags:
- vision
- image-classification
- STEM-AI-mtl/City_map
- Google
- ViT
- STEM-AI-mtl
datasets:
- STEM-AI-mtl/City_map

---

# The fine-tuned ViT model that beats [Google's state-of-the-art model](https://huggingface.co/google/vit-base-patch16-224) and OpenAI's GPT-4 at identifying maps of cities around the world

A fine-tuned image-classification model that identifies which city's map is shown in an input image.

The Vision Transformer (ViT) base model is a transformer encoder model (BERT-like) pretrained on a large collection of images in a supervised fashion, namely ImageNet-21k, at a resolution of 224x224 pixels. Next, the model was fine-tuned on ImageNet (also referred to as ILSVRC2012), a dataset comprising 1 million images and 1,000 classes, also at resolution 224x224.

- **Developed by:** STEM.AI
- **Model type:** Image classification of maps of cities
- **Finetuned from model:** [google/vit-base-patch16-224](https://huggingface.co/google/vit-base-patch16-224)


## How to use

[Inference script](https://github.com/STEM-ai/Vision/blob/7d92c8daa388eb74e8c336f2d0d3942722fec3c6/ViT_inference.py)

For more code examples, refer to the [ViT documentation](https://huggingface.co/transformers/model_doc/vit.html#).
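
For a quick start, the `transformers` pipeline API can run the model in a few lines. A minimal sketch, assuming the fine-tuned weights are hosted on the Hub; the repo id `STEM-AI-mtl/City_map-vit` and the image filename are placeholders to replace with the actual values:

```python
from transformers import pipeline

# Placeholder repo id -- replace with the actual fine-tuned checkpoint on the Hub.
classifier = pipeline("image-classification", model="STEM-AI-mtl/City_map-vit")

# Classify a local image of a city map; returns the top labels with scores.
predictions = classifier("paris_map.png")
print(predictions[0]["label"], predictions[0]["score"])
```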

## Training data

This model was fine-tuned from [Google's ViT-base-patch16-224](https://huggingface.co/google/vit-base-patch16-224) on the [STEM-AI-mtl/City_map dataset](https://huggingface.co/datasets/STEM-AI-mtl/City_map), which contains over 600 images of maps of 45 different cities around the world.
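
Since the dataset is public on the Hub, it can be inspected directly with the `datasets` library. A minimal sketch, assuming a default `train` split with `image` and `label` columns:

```python
from datasets import load_dataset

# Load the city-map dataset from the Hugging Face Hub.
dataset = load_dataset("STEM-AI-mtl/City_map")
print(dataset)  # split names, features, and number of rows

# Inspect one example: a PIL image of a map and its city label (assumed columns).
example = dataset["train"][0]
print(example["image"].size, example["label"])
```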

## Training procedure

Fine-tuning was performed on [google/vit-base-patch16-224](https://huggingface.co/google/vit-base-patch16-224) using a 4 GB Nvidia GTX 1650 GPU.

[Training notebook](https://github.com/STEM-ai/Vision/raw/7d92c8daa388eb74e8c336f2d0d3942722fec3c6/Trainer_ViT.ipynb)
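
The notebook above holds the exact configuration. As a rough outline of the procedure, here is a hedged sketch using the `Trainer` API, following the common ViT fine-tuning recipe; apart from the 1e-3 learning rate reported below, the split/column names, batch size, and epoch count are assumptions, not values taken from the notebook:

```python
import torch
from datasets import load_dataset
from transformers import (AutoImageProcessor, AutoModelForImageClassification,
                          Trainer, TrainingArguments)

dataset = load_dataset("STEM-AI-mtl/City_map")
labels = dataset["train"].features["label"].names  # assumes a ClassLabel column

processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = AutoModelForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    num_labels=len(labels),
    ignore_mismatched_sizes=True,  # swap the 1,000-class ImageNet head for ours
)

def transform(batch):
    # Resize and normalize the map images to the 224x224 inputs ViT expects.
    inputs = processor([img.convert("RGB") for img in batch["image"]],
                       return_tensors="pt")
    inputs["labels"] = batch["label"]
    return inputs

def collate_fn(batch):
    # Stack the individually transformed examples back into a training batch.
    return {
        "pixel_values": torch.stack([x["pixel_values"] for x in batch]),
        "labels": torch.tensor([x["labels"] for x in batch]),
    }

args = TrainingArguments(
    output_dir="vit-city-map",
    learning_rate=1e-3,             # the rate the card reports worked best
    per_device_train_batch_size=8,  # assumed; chosen to fit a 4 GB GPU
    num_train_epochs=3,             # assumed
    remove_unused_columns=False,    # keep the raw "image" column for transform
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"].with_transform(transform),
    data_collator=collate_fn,
)
trainer.train()
```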

## Training evaluation results

The most accurate model was obtained with a learning rate of 1e-3. Training quality was evaluated on the training dataset, yielding the following metrics:

{'eval_loss': 1.3691096305847168,\
 'eval_accuracy': 0.6666666666666666,\
 'eval_runtime': 13.0277,\
 'eval_samples_per_second': 4.606,\
 'eval_steps_per_second': 0.154,\
 'epoch': 2.82}
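
These numbers have the shape of the dict returned by `Trainer.evaluate()`. Continuing from the training sketch above, a hedged sketch of how such metrics can be reproduced; the metric wiring is an assumption, not taken from the notebook:

```python
import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    # Take the highest-scoring class per image and compare to the true labels.
    preds = np.argmax(eval_pred.predictions, axis=1)
    return accuracy.compute(predictions=preds, references=eval_pred.label_ids)

# Pass compute_metrics=compute_metrics when constructing the Trainer; then
# evaluate() returns a dict like the one above (eval_loss, eval_accuracy, ...).
metrics = trainer.evaluate(dataset["train"].with_transform(transform))
print(metrics)
```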


## Model Card Authors

STEM.AI: stem.ai.mtl@gmail.com\
[William Harbec](https://www.linkedin.com/in/william-harbec-56a262248/)