---
license: mit
base_model:
- apple/aimv2-large-patch14-native
pipeline_tag: image-classification
tags:
- image-classification
- vision
library_name: transformers
---


# AIMv2-Large-Patch14-Native Image Classification

[Original AIMv2 Paper](https://arxiv.org/abs/2411.14402) | [BibTeX](#citation)

This repository contains an adapted version of the original AIMv2 model, modified to be compatible with the `AutoModelForImageClassification` class from Hugging Face Transformers. This adaptation enables seamless use of the model for image classification tasks.

**Note: this model has not been trained or fine-tuned for classification; its classification head is untrained, so predictions are meaningless until you fine-tune it on your own data.**
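
Because the head is untrained, you would typically attach your own label mapping before fine-tuning. A minimal sketch, assuming the standard `from_pretrained` config overrides (the two-class labels below are hypothetical placeholders):

```python
from transformers import AutoModelForImageClassification

# Hypothetical label set; replace with the classes of your own dataset.
id2label = {0: "cat", 1: "dog"}
label2id = {label: idx for idx, label in id2label.items()}

model = AutoModelForImageClassification.from_pretrained(
    "amaye15/aimv2-large-patch14-native-image-classification",
    trust_remote_code=True,
    num_labels=len(id2label),
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True,  # re-initialize the head for the new label count
)
```

From here the model can be fine-tuned with the training setup of your choice (for example, the `Trainer` API).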

## Introduction

We have adapted the original `apple/aimv2-large-patch14-native` model to work with `AutoModelForImageClassification`. The AIMv2 family consists of vision models pre-trained with a multimodal autoregressive objective, offering robust performance across various benchmarks.

Some highlights of the AIMv2 models include:

1. Outperforming OpenAI CLIP and SigLIP on the majority of multimodal understanding benchmarks.
2. Surpassing DINOv2 in open-vocabulary object detection and referring expression comprehension.
3. Demonstrating strong recognition performance, with AIMv2-3B achieving **89.5% on ImageNet using a frozen trunk**.

## Usage

### PyTorch

```python
import requests
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained(
    "amaye15/aimv2-large-patch14-native-image-classification",
)
model = AutoModelForImageClassification.from_pretrained(
    "amaye15/aimv2-large-patch14-native-image-classification",
    trust_remote_code=True,
)

inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)

# Convert logits to probabilities and pick the most likely class
predictions = outputs.logits.softmax(dim=-1)
predicted_class = predictions.argmax(-1).item()

print(f"Predicted class: {model.config.id2label[predicted_class]}")
```
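
Continuing from the snippet above, you can also inspect the top-5 probabilities rather than only the argmax (with the untrained head these values are arbitrary until the model is fine-tuned):

```python
import torch

# Top-5 predictions; reuses `predictions` and `model` from the snippet above.
top5 = torch.topk(predictions, k=5, dim=-1)
for prob, idx in zip(top5.values[0], top5.indices[0]):
    print(f"{model.config.id2label[idx.item()]}: {prob.item():.4f}")
```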

## Model Details

- **Model Name**: `amaye15/aimv2-large-patch14-native-image-classification`
- **Original Model**: `apple/aimv2-large-patch14-native`
- **Adaptation**: Modified to be compatible with `AutoModelForImageClassification` for direct use in image classification tasks.
- **Framework**: PyTorch

## Citation

If you use this model or find it helpful, please consider citing the original AIMv2 paper:

```bibtex
@misc{fini2024aimv2,
  title={Multimodal Autoregressive Pre-training of Large Vision Encoders},
  author={Fini, Enrico and Shukor, Mustafa and Li, Xiujun and Dufter, Philipp and Klein, Michal and others},
  year={2024},
  eprint={2411.14402},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```