---
license: mit
base_model:
- apple/aimv2-large-patch14-native
pipeline_tag: image-classification
tags:
- image-classification
- vision
library_name: transformers
---
# AIMv2-Large-Patch14-Native Image Classification
[Original AIMv2 Paper](https://arxiv.org/abs/2411.14402) | [BibTeX](#citation)
This repository contains an adapted version of the original AIMv2 model, modified to be compatible with the `AutoModelForImageClassification` class from Hugging Face Transformers. This adaptation enables seamless use of the model for image classification tasks.
**Note: this model has not been trained or fine-tuned; its classification head must be fine-tuned on your own labeled data before its predictions are meaningful (see the sketches below).**
## Introduction
We have adapted the original `apple/aimv2-large-patch14-native` model to work with `AutoModelForImageClassification`. The AIMv2 family consists of vision models pre-trained with a multimodal autoregressive objective, offering robust performance across various benchmarks.
Some highlights of the AIMv2 models include:
1. Outperforming OpenAI CLIP and SigLIP on the majority of multimodal understanding benchmarks.
2. Surpassing DINOv2 in open-vocabulary object detection and referring expression comprehension.
3. Demonstrating strong recognition performance, with AIMv2-3B achieving **89.5% on ImageNet using a frozen trunk**.
## Usage
### PyTorch
```python
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

# Load an example image from the COCO validation set.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained(
    "amaye15/aimv2-large-patch14-native-image-classification",
)
model = AutoModelForImageClassification.from_pretrained(
    "amaye15/aimv2-large-patch14-native-image-classification",
    trust_remote_code=True,
)

# Preprocess the image and run inference without tracking gradients.
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert logits to probabilities and pick the most likely class.
predictions = outputs.logits.softmax(dim=-1)
predicted_class = predictions.argmax(-1).item()
print(f"Predicted class: {model.config.id2label[predicted_class]}")
```
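Because the classification head is untrained, the prediction above will be arbitrary. Below is a minimal sketch of loading the model with your own label set before fine-tuning; the `labels` list is a placeholder for your dataset's classes, while `num_labels`, `id2label`, `label2id`, and `ignore_mismatched_sizes` are standard `from_pretrained` arguments.
```python
from transformers import AutoModelForImageClassification

# Hypothetical label set: replace with your own dataset's classes.
labels = ["cat", "dog", "bird"]
id2label = {i: name for i, name in enumerate(labels)}
label2id = {name: i for i, name in id2label.items()}

model = AutoModelForImageClassification.from_pretrained(
    "amaye15/aimv2-large-patch14-native-image-classification",
    trust_remote_code=True,
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True,  # re-initialize the head if its shape differs
)
```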
## Model Details
- **Model Name**: `amaye15/aimv2-large-patch14-native-image-classification`
- **Original Model**: `apple/aimv2-large-patch14-native`
- **Adaptation**: Modified to be compatible with `AutoModelForImageClassification` for direct use in image classification tasks.
- **Framework**: PyTorch
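## Fine-Tuning
The sketch below shows one possible training step. It assumes `model` was loaded with a custom label set as above and that the remote code follows the standard Transformers classification interface, where passing a `labels` keyword yields `outputs.loss`; the random tensors stand in for a real dataloader.
```python
import torch
from torch.optim import AdamW

# Dummy batch in place of a real dataset: two 224x224 RGB images
# (224 is divisible by the patch size of 14) and two class indices.
pixel_values = torch.randn(2, 3, 224, 224)
labels = torch.tensor([0, 1])

optimizer = AdamW(model.parameters(), lr=1e-4)

model.train()
outputs = model(pixel_values=pixel_values, labels=labels)  # assumed interface
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```
In practice you would iterate this step over a `DataLoader` of images preprocessed with the `processor` shown earlier; freezing the backbone and training only the head is a common, cheaper alternative.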
## Citation
If you use this model or find it helpful, please consider citing the original AIMv2 paper:
```bibtex
@article{fini2024multimodal,
  title={Multimodal Autoregressive Pre-training of Large Vision Encoders},
  author={Fini, Enrico and others},
  journal={arXiv preprint arXiv:2411.14402},
  year={2024}
}
```