How to use zutaars1/vit-computer-vision-classification-model with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-classification", model="zutaars1/vit-computer-vision-classification-model")
pipe("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/parrots.png")

# Load model directly
from transformers import AutoImageProcessor, AutoModelForImageClassification

processor = AutoImageProcessor.from_pretrained("zutaars1/vit-computer-vision-classification-model")
model = AutoModelForImageClassification.from_pretrained("zutaars1/vit-computer-vision-classification-model")
```
vit-base-mini-food-3
This model is a fine-tuned version of google/vit-base-patch16-224-in21k on the Mini Food-3 custom dataset. It achieves the following results on the evaluation set:
- Loss: 0.5656
- Accuracy: 0.9067
Model description
This is a Vision Transformer (ViT) model fine-tuned for food image classification. The model was trained to classify three food categories: pizza, sushi, and ice cream.
Intended uses & limitations
This model is intended for classifying images of three food types:
- pizza
- sushi
- ice_cream
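It performs best on images similar to the training data, which is a subset of Food-101. Because the classifier has only three output labels, an image of any other food is still assigned one of them, and accuracy may drop under image conditions that differ from the training set.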
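Since the model always returns one of the three labels, a caller may want to treat low-confidence predictions as "unknown". The sketch below shows one way to do that on output shaped like the result of an `image-classification` pipeline. The helper name `top_prediction`, the `min_score` threshold of 0.5, and the example scores are illustrative assumptions, not part of this model card, and a softmax score is only a rough signal for out-of-domain inputs.

```python
def top_prediction(results, min_score=0.5):
    """Return the highest-scoring label, or None if its score is below min_score.

    `results` is a list of {"label": str, "score": float} dicts, the shape
    returned by a Transformers image-classification pipeline.
    `min_score` is a hypothetical threshold; tune it on held-out data.
    """
    best = max(results, key=lambda r: r["score"])
    return best["label"] if best["score"] >= min_score else None


# Made-up scores for illustration only.
example = [
    {"label": "pizza", "score": 0.81},
    {"label": "sushi", "score": 0.12},
    {"label": "ice_cream", "score": 0.07},
]
print(top_prediction(example))  # pizza
```

In practice `results` would come from `pipe(image)` as in the usage snippet for this model.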
Training and evaluation data
The model was trained on the Mini Food-3 dataset, derived from the Food-101 dataset:
| Split | Images per class | Total images |
|---|---|---|
| Train | 100 | 300 |
| Validation | 25 | 75 |
| Test | 25 | 75 |
| Total | 150 | 450 |
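The card does not say how the subset was drawn from Food-101. As a minimal sketch, assuming a seeded random sample per class, the split sizes in the table could be reproduced like this. The function name `split_class`, the use of seed 42 for sampling, and the file names are hypothetical.

```python
import random

# Images per class in each split, taken from the table above.
SPLIT_SIZES = {"train": 100, "validation": 25, "test": 25}


def split_class(paths, seed=42):
    """Shuffle one class's image paths and cut them into disjoint splits."""
    paths = sorted(paths)            # fixed starting order for reproducibility
    needed = sum(SPLIT_SIZES.values())
    if len(paths) < needed:
        raise ValueError(f"need at least {needed} images, got {len(paths)}")
    random.Random(seed).shuffle(paths)

    splits, start = {}, 0
    for name, size in SPLIT_SIZES.items():
        splits[name] = paths[start:start + size]
        start += size
    return splits


# Hypothetical file names; Food-101 provides 1,000 images per class.
demo_paths = [f"pizza/{i:04d}.jpg" for i in range(1000)]
demo_splits = split_class(demo_paths)
print({name: len(items) for name, items in demo_splits.items()})
# {'train': 100, 'validation': 25, 'test': 25}
```

Running this once per class gives 150 images per class and 450 in total, matching the last row of the table.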
Preprocessing
- Images converted to RGB
- Resizing and normalization using ViT image processor
- Automatic label encoding from folder structure
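The preprocessing steps above can be sketched as follows. This is an illustration under assumptions, not the author's script: it assumes labels are assigned by sorting the class folder names alphabetically (the actual `id2label` mapping in the model config may differ), and the helper names `build_label_maps` and `preprocess` are hypothetical. `processor` is expected to be the object returned by `AutoImageProcessor.from_pretrained(...)`.

```python
def build_label_maps(class_names):
    """Map folder names to integer ids, assuming alphabetical ordering."""
    labels = sorted(class_names)
    label2id = {name: idx for idx, name in enumerate(labels)}
    id2label = {idx: name for name, idx in label2id.items()}
    return label2id, id2label


def preprocess(image_path, processor):
    """Open an image, force RGB, and let the ViT processor resize and normalize it."""
    from PIL import Image  # imported here so the label helper works without Pillow

    image = Image.open(image_path).convert("RGB")
    # The processor returns a dict-like object; "pixel_values" is the model input.
    return processor(images=image, return_tensors="pt")["pixel_values"]


label2id, id2label = build_label_maps(["pizza", "sushi", "ice_cream"])
print(label2id)  # {'ice_cream': 0, 'pizza': 1, 'sushi': 2}
```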
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.00002
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: adamw_torch with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 3
- weight_decay: 0.01
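As a rough sketch, the hyperparameters above map onto `transformers.TrainingArguments` as shown below. The output directory name, the single-device setup (so that `train_batch_size` equals `per_device_train_batch_size`), and the absence of warmup and gradient accumulation are assumptions, since the card does not list them. The helper `linear_lr` is a hypothetical illustration of a linear schedule that decays to zero.

```python
import math

# Values copied from the hyperparameter list above.
HPARAMS = dict(
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    num_train_epochs=3,
    weight_decay=0.01,
)


def make_training_args(output_dir="vit-base-mini-food-3"):
    """Build TrainingArguments; output_dir is a hypothetical name."""
    from transformers import TrainingArguments  # imported lazily

    return TrainingArguments(output_dir=output_dir, **HPARAMS)


def linear_lr(step, total_steps, base_lr=HPARAMS["learning_rate"]):
    """Learning rate at `step` under linear decay to zero, assuming no warmup."""
    return base_lr * max(0.0, (total_steps - step) / total_steps)


# 300 training images at batch size 8, keeping the last partial batch.
steps_per_epoch = math.ceil(300 / HPARAMS["per_device_train_batch_size"])
total_steps = steps_per_epoch * HPARAMS["num_train_epochs"]
print(steps_per_epoch, total_steps)  # 38 114
```

Under these assumptions the step counts agree with the 38, 76, and 114 steps reported in the training results.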
Training results
| Training Loss | Epoch | Step | Validation Loss | Accuracy |
|---|---|---|---|---|
| 0.9543 | 1.0 | 38 | 0.8170 | 0.8667 |
| 0.6416 | 2.0 | 76 | 0.6260 | 0.9067 |
| 0.4879 | 3.0 | 114 | 0.5656 | 0.9067 |
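To put the accuracy column in perspective, the short calculation below converts each value into a number of correctly classified images. It assumes the metrics were computed on the 75-image validation split, which the card implies but does not state outright.

```python
N_EVAL = 75  # assumed size of the evaluation set (validation split)


def correct_count(accuracy, n_images=N_EVAL):
    """Number of correct predictions implied by a rounded accuracy value."""
    return round(accuracy * n_images)


for epoch, acc in [(1, 0.8667), (2, 0.9067), (3, 0.9067)]:
    k = correct_count(acc)
    print(f"epoch {epoch}: {k}/{N_EVAL} correct = {k / N_EVAL:.4f}")
# epoch 1: 65/75 correct = 0.8667
# epoch 2: 68/75 correct = 0.9067
# epoch 3: 68/75 correct = 0.9067
```

With only 75 images, each image is worth about 1.3 percentage points, so the gain between epochs 1 and 2 corresponds to three additional correct predictions.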
Framework versions
- Transformers 4.50.0
- PyTorch 2.0+
- Datasets 2.0+
- Tokenizers 0.13+
Model Comparison
This fine-tuned ViT model was compared against:
- CLIP (Zero-Shot): openai/clip-vit-large-patch14, no training required
- OpenAI Vision Model: LLM-based image classification
The fine-tuned ViT model outperforms both baseline approaches on this specific dataset due to task-specific training.
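The card does not describe how the CLIP baseline was run. One common way to set up such a comparison, shown here as a sketch only, is the Transformers `zero-shot-image-classification` pipeline with the three class names as candidate labels. The helper names and the choice to replace underscores with spaces in the prompts are assumptions.

```python
LABELS = ["pizza", "sushi", "ice_cream"]


def to_prompt_label(label):
    """Turn a dataset label into natural text for the CLIP prompt."""
    return label.replace("_", " ")


def classify_zero_shot(image, model_id="openai/clip-vit-large-patch14"):
    """Return the dataset label CLIP scores highest for `image` (path, URL, or PIL image)."""
    from transformers import pipeline  # imported lazily; downloads the model on first use

    clip = pipeline("zero-shot-image-classification", model=model_id)
    prompt_to_label = {to_prompt_label(name): name for name in LABELS}
    results = clip(image, candidate_labels=list(prompt_to_label))
    best = max(results, key=lambda r: r["score"])
    return prompt_to_label[best["label"]]


print([to_prompt_label(name) for name in LABELS])  # ['pizza', 'sushi', 'ice cream']
```

Running `classify_zero_shot` over the same test images and comparing against the true labels would give an accuracy figure comparable to the fine-tuned model's.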