license: other
license_name: stem.ai.mtl
license_link: LICENSE
tags:
- vision
- image-classification
- STEM-AI-mtl/City_map
- Google
- ViT
- STEM-AI-mtl
datasets:
- STEM-AI-mtl/City_map
widget:
- src: >-
https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg
example_title: Tiger
- src: >-
https://huggingface.co/datasets/mishig/sample_images/resolve/main/teapot.jpg
example_title: Teapot
- src: >-
https://huggingface.co/datasets/mishig/sample_images/resolve/main/palace.jpg
example_title: Palace
A fine-tuned ViT model that outperforms Google's base model and OpenAI's GPT-4
An image-classification model that identifies which city's map is depicted in an input image.
Model description
The Vision Transformer (ViT) is a transformer encoder model (BERT-like) pretrained on a large collection of images in a supervised fashion, namely ImageNet-21k, at a resolution of 224x224 pixels. That base checkpoint (google/vit-base-patch16-224, the starting point for this fine-tune) was then further trained on ImageNet (also referred to as ILSVRC2012), a dataset comprising 1 million images and 1,000 classes, also at resolution 224x224.
Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. A [CLS] token is added to the beginning of the sequence so it can be used for classification tasks, and absolute position embeddings are added before the sequence is fed to the layers of the Transformer encoder.
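As a quick worked example of the resulting sequence length: a 224x224 image split into 16x16 patches yields (224/16) x (224/16) = 196 patch embeddings, which together with the [CLS] token gives a sequence of 197 tokens fed to the encoder.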
Through pre-training, the model learns an inner representation of images that can then be used to extract features for downstream tasks: given a dataset of labeled images, for instance, a standard classifier can be trained by placing a linear layer on top of the pre-trained encoder. One typically places the linear layer on top of the [CLS] token, as the last hidden state of this token can be seen as a representation of the entire image; a minimal sketch of this setup follows below.
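The sketch below illustrates that idea only; it is not this repository's training code. The google/vit-base-patch16-224-in21k encoder and the 45-class head size are assumptions used for illustration.

import torch
from torch import nn
from transformers import ViTModel

# Pre-trained encoder only (no classification head)
encoder = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

num_labels = 45  # hypothetical: one label per city class
classifier = nn.Linear(encoder.config.hidden_size, num_labels)

pixel_values = torch.randn(1, 3, 224, 224)  # dummy batch of one image
outputs = encoder(pixel_values=pixel_values)
cls_embedding = outputs.last_hidden_state[:, 0]  # last hidden state of the [CLS] token
logits = classifier(cls_embedding)               # one score per class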
How to use
Here is how to use this model to classify an image of a city map into one of the city classes:
from transformers import ViTImageProcessor, ViTForImageClassification
from PIL import Image
import requests

# Load an example city-map image from a URL
url = 'https://assets.wfcdn.com/im/16661612/compr-r85/4172/41722749/new-york-city-map-on-paper-print.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# Load the image processor and the fine-tuned classification model
processor = ViTImageProcessor.from_pretrained('STEM-AI-mtl/City_map-vit-base-patch16-224')
model = ViTForImageClassification.from_pretrained('STEM-AI-mtl/City_map-vit-base-patch16-224')

# Preprocess the image and run a forward pass
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits

# The predicted class is the label with the highest logit
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
For more code examples, we refer to the documentation.
Training data
Google's ViT-base-patch16-224 model was fine-tuned on the STEM-AI-mtl/City_map dataset, containing over 600 images of maps of 45 different cities around the world.
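As a quick sketch, the dataset can be inspected with the datasets library (the split and column names below are assumptions, not verified against the dataset repository):

from datasets import load_dataset

dataset = load_dataset("STEM-AI-mtl/City_map")
print(dataset)                  # show available splits and columns
example = dataset["train"][0]   # assumes a "train" split exists
print(example.keys())           # e.g. an image column and a label column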
Training procedure
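The exact training configuration used for this model is not documented here. The following is a minimal sketch of how a comparable fine-tune could be set up with the transformers Trainer; the column names, hyperparameters, and 45-class head size are illustrative assumptions, not the values used for this model.

import torch
from datasets import load_dataset
from transformers import (ViTImageProcessor, ViTForImageClassification,
                          TrainingArguments, Trainer)

dataset = load_dataset("STEM-AI-mtl/City_map")
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")

def transform(batch):
    # Convert PIL images to the pixel_values tensors expected by ViT
    inputs = processor([img.convert("RGB") for img in batch["image"]], return_tensors="pt")
    inputs["labels"] = batch["label"]  # assumes an integer "label" column
    return inputs

prepared = dataset["train"].with_transform(transform)  # assumes a "train" split

model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    num_labels=45,                  # 45 city classes, per the dataset description
    ignore_mismatched_sizes=True,   # replace the original 1,000-class ImageNet head
)

def collate_fn(examples):
    return {
        "pixel_values": torch.stack([x["pixel_values"] for x in examples]),
        "labels": torch.tensor([x["labels"] for x in examples]),
    }

args = TrainingArguments(
    output_dir="city-map-vit",
    per_device_train_batch_size=16,  # illustrative hyperparameters only
    learning_rate=2e-5,
    num_train_epochs=3,
)

trainer = Trainer(model=model, args=args,
                  train_dataset=prepared, data_collator=collate_fn)
trainer.train()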
Training evaluation results
The quality of the training was evaluated against the training dataset, yielding the following metrics:
{'eval_loss': 1.3691096305847168,
'eval_accuracy': 0.6666666666666666,
'eval_runtime': 13.0277,
'eval_samples_per_second': 4.606,
'eval_steps_per_second': 0.154,
'epoch': 2.82}
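Metrics of this form (eval_loss, eval_accuracy, eval_runtime, and so on) are what trainer.evaluate() reports when a compute_metrics callback is supplied. A minimal sketch using the evaluate library's accuracy metric is shown below; this is an assumption about how such numbers can be reproduced, not necessarily how they were produced for this card.

import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    # eval_pred.predictions are logits; take the argmax as the predicted class
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=eval_pred.label_ids)

# Passed to the Trainer as compute_metrics=compute_metrics along with an eval_dataset;
# trainer.evaluate() then returns eval_loss, eval_accuracy, eval_runtime, etc.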
Model Card Authors
STEM.AI: stem.ai.mtl@gmail.com
William Harbec