Model Overview:
The Vision Transformer (ViT) is a transformer encoder model designed for image recognition tasks. It was pretrained on a large dataset of 14 million images and 21,843 classes known as ImageNet-21k, and fine-tuned on ImageNet 2012, which consists of 1 million images across 1,000 classes.
How It Works:
Input Representation: Images are split into fixed-size patches (16x16 pixels), and each patch is flattened and linearly embedded. A special [CLS] token is prepended to the sequence so that its final hidden state can serve as a summary representation of the whole image for classification.
Transformer Encoder: The model uses a transformer encoder architecture, similar to BERT for text, to process the image patches. Absolute position embeddings are added to encode spatial information before inputting the sequence into transformer layers.
Classification: After processing through the transformer layers, the output from the [CLS] token is used for image classification. This token's final hidden state represents the entire image's features.
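To make this pipeline concrete, the following is a minimal PyTorch sketch of the patch embedding, [CLS] token, position embeddings, and classification head. The module names, layer counts, and dimensions are illustrative assumptions based on the base configuration (224x224 images, 16x16 patches, hidden size 768), not the exact internals of the Transformers implementation.

import torch
import torch.nn as nn

# Illustrative sketch of the ViT input pipeline and classification head.
# Shapes follow the base configuration; module names are hypothetical.
class ViTSketch(nn.Module):
    def __init__(self, image_size=224, patch_size=16, hidden_size=768, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2      # 14 * 14 = 196 patches
        # Patch embedding: a strided convolution cuts the image into 16x16
        # patches and linearly projects each one to hidden_size.
        self.patch_embed = nn.Conv2d(3, hidden_size, kernel_size=patch_size, stride=patch_size)
        # Learnable [CLS] token and absolute position embeddings (196 patches + 1 CLS = 197 positions).
        self.cls_token = nn.Parameter(torch.zeros(1, 1, hidden_size))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, hidden_size))
        # Standard transformer encoder, as in BERT.
        encoder_layer = nn.TransformerEncoderLayer(d_model=hidden_size, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=12)
        # Linear classification head on top of the [CLS] token.
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, pixel_values):                        # (batch, 3, 224, 224)
        x = self.patch_embed(pixel_values)                  # (batch, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)                    # (batch, 196, 768)
        cls = self.cls_token.expand(x.size(0), -1, -1)      # (batch, 1, 768)
        x = torch.cat([cls, x], dim=1) + self.pos_embed     # prepend [CLS], add positions
        x = self.encoder(x)                                 # (batch, 197, 768)
        return self.classifier(x[:, 0])                     # classify from the [CLS] state

With 224x224 inputs and 16x16 patches, the sequence is 14x14 = 196 patch embeddings plus one [CLS] token, i.e. 197 positions, and the head produces one logit per class.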
Intended Uses:
Image Classification: ViT can be used directly for image classification tasks. By adding a linear layer on top of the [CLS] token's final hidden state, the model classifies images into one of the 1,000 ImageNet classes.
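The checkpoint can also be adapted to a different label set by reloading it with a freshly initialized classification head. The snippet below is a sketch of that pattern; the three labels are placeholders, while num_labels, id2label, label2id, and ignore_mismatched_sizes are standard from_pretrained arguments.

from transformers import AutoModelForImageClassification

# Sketch: swap in a new classification head for a hypothetical 3-class problem.
# The label names are placeholders, not classes from this checkpoint.
id2label = {0: "cat", 1: "dog", 2: "bird"}
model = AutoModelForImageClassification.from_pretrained(
    "Sreekanth3096/vit-coco-image-classification",
    num_labels=len(id2label),
    id2label=id2label,
    label2id={name: idx for idx, name in id2label.items()},
    ignore_mismatched_sizes=True,  # replace the existing head with a newly initialized one
)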
Limitations:
Resolution Dependency: The model was fine-tuned on ImageNet at 224x224 resolution, and better performance is typically achieved at higher resolutions such as 384x384. Larger models generally yield better results but require more computational resources.
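To experiment with higher resolutions, one option is to override the processor's target size and let the model interpolate its pretrained position embeddings to the larger patch grid. This is a sketch assuming the size argument of the image processor and the interpolate_pos_encoding flag that ViT models in Transformers accept; verify both against your installed version.

from transformers import AutoImageProcessor, AutoModelForImageClassification
from PIL import Image
import requests
import torch

# Sketch: run inference at 384x384 instead of the 224x224 fine-tuning resolution.
processor = AutoImageProcessor.from_pretrained("Sreekanth3096/vit-coco-image-classification")
model = AutoModelForImageClassification.from_pretrained("Sreekanth3096/vit-coco-image-classification")

image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

# Ask the processor for 384x384 inputs and let the model interpolate its
# pretrained position embeddings to the larger patch grid.
inputs = processor(images=image, size={"height": 384, "width": 384}, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, interpolate_pos_encoding=True)
print(model.config.id2label[outputs.logits.argmax(-1).item()])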
Training Details:
Preprocessing: Images are resized to 224x224 pixels and normalized across the RGB channels.
Training: Pretraining was conducted on TPUv3 hardware with a batch size of 4096 and learning rate warmup. Gradient clipping was applied during training to enhance stability.
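The snippet below is only a schematic PyTorch illustration of the two stabilization techniques mentioned above, linear learning-rate warmup and gradient clipping; the model, data, and hyperparameters are placeholders rather than the actual pretraining setup.

import torch

# Schematic illustration of linear learning-rate warmup and gradient clipping.
# The model, data, and hyperparameters are placeholders, not the real recipe.
model = torch.nn.Linear(768, 1000)            # stand-in for the ViT classification head
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
warmup_steps = 100
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps)
)

for step in range(200):
    features = torch.randn(8, 768)            # dummy batch of [CLS] features
    labels = torch.randint(0, 1000, (8,))     # dummy labels
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(features), labels)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip gradients for stability
    optimizer.step()
    scheduler.step()                          # linearly ramp the learning rate during warmup

The following example shows how to use this checkpoint to classify an image fetched from a URL: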
from transformers import AutoImageProcessor, AutoModelForImageClassification
from PIL import Image
import requests
import torch

def predict_image_from_url(url):
    # Load the image from the URL
    image = Image.open(requests.get(url, stream=True).raw)

    # Load the image processor and fine-tuned model from the Hub
    processor = AutoImageProcessor.from_pretrained("Sreekanth3096/vit-coco-image-classification")
    model = AutoModelForImageClassification.from_pretrained("Sreekanth3096/vit-coco-image-classification")

    # Preprocess the image and run inference without tracking gradients
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Map the highest-scoring logit to its class label
    logits = outputs.logits
    predicted_class_idx = logits.argmax(-1).item()
    predicted_class = model.config.id2label[predicted_class_idx]
    return predicted_class

# Example usage
if __name__ == "__main__":
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    predicted_class = predict_image_from_url(url)
    print(f"Predicted class: {predicted_class}")
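If you want more than the single best class, a small optional extension (placed inside predict_image_from_url, where outputs and model are in scope) converts the logits to probabilities and lists the top five:

    # Optional: report the top-5 classes with their probabilities.
    probs = torch.softmax(outputs.logits, dim=-1)[0]
    top5 = torch.topk(probs, k=5)
    for score, idx in zip(top5.values.tolist(), top5.indices.tolist()):
        print(f"{model.config.id2label[idx]}: {score:.3f}")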
For more code examples, we refer to the documentation.
Training Data:
The ViT model was pretrained on ImageNet-21k, a dataset consisting of 14 million images and 21k classes, and fine-tuned on ImageNet, a dataset consisting of 1 million images and 1k classes.
Evaluation Results:
Performance: Detailed evaluation results on various benchmarks can be found in tables from the original paper. Fine-tuning the model on higher resolutions typically improves classification accuracy.