
Fine-Tuned Vision Transformer (ViT) on Traffic Sign Recognition

Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224. It was introduced in the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Dosovitskiy et al. and first released in this repository. This checkpoint was then fine-tuned on the German Traffic Sign Recognition Benchmark (GTSRB) dataset.

Model description

  • Model Architecture: Vision Transformer (ViT) - google/vit-base-patch16-224-in21k.
  • Fine-tuning Objective: Classify traffic signs into 43 categories, including various speed limits, warning signs, and prohibitory or regulatory signs (see the setup sketch after this list).
  • Developer: Aleksandra Cvetanovska
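
The 43-way classification head is not part of the pre-trained checkpoint; it is attached when the backbone is loaded for classification. A minimal setup sketch, assuming the standard Hugging Face transformers API (this is not the author's exact training script):

from transformers import ViTForImageClassification

# Load the ImageNet-21k backbone and attach a freshly initialized
# 43-way classification head (GTSRB has 43 sign classes).
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",
    num_labels=43,
)
# The head is randomly initialized and learned during fine-tuning,
# while the transformer backbone starts from the pre-trained weights.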

Example Use

from transformers import ViTForImageClassification, ViTImageProcessor
from PIL import Image
import requests
import torch

# Fetch an example image (any traffic sign photo works)
url = 'https://images.unsplash.com/photo-1572670014853-1d3a3f22b40f?q=80&w=2942&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D'
image = Image.open(requests.get(url, stream=True).raw)

model_name = "cvetanovskaa/vit-base-patch16-224-in21k-gtsrb-tuned"
model = ViTForImageClassification.from_pretrained(model_name)
processor = ViTImageProcessor.from_pretrained(model_name)

# Resize and normalize the image, then run a forward pass
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# ViTForImageClassification returns classification logits, not hidden states
predicted_class_idx = outputs.logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
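
If you want confidence scores rather than only the top class, the logits can be converted to probabilities with a softmax. A small follow-up, reusing model and outputs from the snippet above:

# Softmax over the 43 logits gives per-class probabilities
probs = outputs.logits.softmax(-1)[0]
top5 = probs.topk(5)
for score, idx in zip(top5.values.tolist(), top5.indices.tolist()):
    print(f"{model.config.id2label[idx]}: {score:.3f}")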

Limitations and Bias

  • The model is trained exclusively on data from German traffic signs, which may not generalize well to signs in other regions due to differences in design and context.
  • Performance may vary under different lighting conditions or when signs are partially occluded; the accuracy check sketched below can help quantify this on your own data.
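
A minimal evaluation sketch, reusing model and processor from the example above and assuming a local test set with one folder per class (the path gtsrb_test is a placeholder; note that ImageFolder assigns labels in alphabetical folder order, which must match the model's id2label mapping):

import torch
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder

dataset = ImageFolder("gtsrb_test")  # placeholder: gtsrb_test/<class_name>/*.png

def collate(batch):
    # Batch PIL images through the ViT processor and keep the labels
    images, labels = zip(*batch)
    enc = processor(images=list(images), return_tensors="pt")
    enc["labels"] = torch.tensor(labels)
    return enc

loader = DataLoader(dataset, batch_size=32, collate_fn=collate)

model.eval()
correct = total = 0
with torch.no_grad():
    for batch in loader:
        labels = batch.pop("labels")
        preds = model(**batch).logits.argmax(-1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
print(f"Accuracy: {correct / total:.3f}")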

Intended uses & limitations

You can use the fine-tuned model for image classification of traffic signs across the 43 GTSRB classes.

Model size: 85.8M parameters · Tensor type: F32 · Format: Safetensors

Dataset used to train cvetanovskaa/vit-base-patch16-224-in21k-gtsrb-tuned: German Traffic Sign Recognition Benchmark (GTSRB)
