---
library_name: aim
pipeline_tag: image-classification
license: other
license_name: apple-sample-code-license
license_link: LICENSE
datasets:
- imagenet-1k
metrics:
- accuracy
tags:
- large-scale-vision-models
- pytorch
- mlx
- jax
- vision
- ssl
- pre-training
- DFN
---

# AIM: Autoregressive Image Models

*Alaaeldin El-Nouby, Michal Klein, Shuangfei Zhai, Miguel Angel Bautista, Alexander Toshev, Vaishaal Shankar, Joshua M Susskind, and Armand Joulin*

This software project accompanies the research paper, *Scalable Pre-training of Large Autoregressive Image Models*.

We introduce **AIM**, a collection of vision models pre-trained with an autoregressive generative objective. We show that autoregressive pre-training of image features exhibits similar scaling properties to its textual counterpart (i.e., Large Language Models). Specifically, we highlight two findings:
1. the model capacity can be trivially scaled to billions of parameters, and
2. AIM effectively leverages large collections of uncurated image data.

## Installation

Please install PyTorch using the official [installation instructions](https://pytorch.org/get-started/locally/). Afterward, install the package as:

```commandline
pip install git+https://git@github.com/apple/ml-aim.git
```

We also offer [MLX](https://github.com/ml-explore/mlx) backend support for research and experimentation on Apple silicon. To enable MLX support, simply run:

```commandline
pip install mlx
```

## Usage

Below we provide an example of usage in [PyTorch](https://pytorch.org/):

```python
from PIL import Image

from aim.utils import load_pretrained
from aim.torch.data import val_transforms

img = Image.open(...)
model = load_pretrained("aim-600M-2B-imgs", backend="torch")
transform = val_transforms()

inp = transform(img).unsqueeze(0)
logits, _ = model(inp)
```
and in MLX:

```python
from PIL import Image
import mlx.core as mx

from aim.utils import load_pretrained
from aim.torch.data import val_transforms

img = Image.open(...)
model = load_pretrained("aim-600M-2B-imgs", backend="mlx")
transform = val_transforms()

inp = transform(img).unsqueeze(0)
inp = mx.array(inp.numpy())
logits, _ = model(inp)
```
and in JAX:

```python
from PIL import Image
import jax.numpy as jnp

from aim.utils import load_pretrained
from aim.torch.data import val_transforms

img = Image.open(...)
model, params = load_pretrained("aim-600M-2B-imgs", backend="jax")
transform = val_transforms()

inp = transform(img).unsqueeze(0)
inp = jnp.array(inp)
(logits, _), _ = model.apply(params, inp, mutable=['batch_stats'])
```
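In all three backends the model returns classification `logits` as its first output. Continuing from the PyTorch snippet above, here is a minimal sketch of turning them into top-5 predictions; it assumes the bundled probe head is the ImageNet-1k classifier described in the tables below:

```python
import torch

# `logits` comes from the PyTorch example above and is assumed to hold
# one score per ImageNet-1k class, i.e. shape (1, 1000).
probs = torch.softmax(logits, dim=-1)
top5_prob, top5_idx = probs.topk(5, dim=-1)

for p, idx in zip(top5_prob[0].tolist(), top5_idx[0].tolist()):
    # Map `idx` to a human-readable label with your own ImageNet-1k class list.
    print(f"class {idx}: {p:.3f}")
```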
## Pre-trained checkpoints

The pre-trained models can be accessed either via [Hugging Face](https://huggingface.co/collections/apple/aim-65aa3ce948c718a574f09eb7):

```python
# after running pip install git+https://git@github.com/apple/ml-aim.git
from aim.torch.models import AIMForImageClassification

aim_600m = AIMForImageClassification.from_pretrained("apple/aim-600M")
aim_1b = AIMForImageClassification.from_pretrained("apple/aim-1B")
aim_3b = AIMForImageClassification.from_pretrained("apple/aim-3B")
aim_7b = AIMForImageClassification.from_pretrained("apple/aim-7B")
```

or via [PyTorch Hub](https://pytorch.org/hub/):

```python
import torch

aim_600m = torch.hub.load("apple/ml-aim", "aim_600M")
aim_1b = torch.hub.load("apple/ml-aim", "aim_1B")
aim_3b = torch.hub.load("apple/ml-aim", "aim_3B")
aim_7b = torch.hub.load("apple/ml-aim", "aim_7B")
```
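The classification wrappers can be combined with the same validation transform used in the usage examples above. A minimal sketch, assuming (as in the earlier snippets) that the forward pass returns the ImageNet-1k logits together with the extracted features:

```python
from PIL import Image

from aim.torch.data import val_transforms
from aim.torch.models import AIMForImageClassification

img = Image.open(...)
model = AIMForImageClassification.from_pretrained("apple/aim-600M")
transform = val_transforms()

inp = transform(img).unsqueeze(0)
# Assumed output: ImageNet-1k logits and the backbone features.
logits, features = model(inp)
```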
### Pre-trained backbones

The following table contains the pre-trained backbones used in our paper.

| model | #params | attn probe top-1 (best layer) | backbone, SHA256 |
|---|---|---|---|
| AIM-0.6B | 0.6B | 79.4% | link, 0d6f6b8f |
| AIM-1B | 1B | 82.3% | link, d254ecd3 |
| AIM-3B | 3B | 83.3% | link, 8475ce4e |
| AIM-7B | 7B | 84.0% | link, 184ed94c |
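The SHA256 prefixes above can be used to sanity-check a downloaded backbone. Below is a minimal sketch using only the Python standard library; the checkpoint path is a placeholder:

```python
import hashlib

def sha256_prefix(path: str, prefix_len: int = 8) -> str:
    """Return the first `prefix_len` hex characters of the file's SHA256 digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()[:prefix_len]

# Should print "0d6f6b8f" for the AIM-0.6B backbone checkpoint listed above.
print(sha256_prefix("/path/to/backbone_ckpt.pth"))
```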
### Pre-trained attention heads

The table below contains the classification results on the ImageNet-1k validation set.
| model | top-1 IN-1k, last layer | top-1 IN-1k, best layer | attention head (last layer), SHA256 | attention head (best layer), SHA256 |
|---|---|---|---|---|
| AIM-0.6B | 78.5% | 79.4% | link, 5ce5a341 | link, ebd45c05 |
| AIM-1B | 80.6% | 82.3% | link, db3be2ad | link, f1ed7852 |
| AIM-3B | 82.2% | 83.3% | link, 5c057b30 | link, ad380e16 |
| AIM-7B | 82.4% | 84.0% | link, 1e5c99ba | link, 73ecd732 |
## Reproducing the IN-1k classification results

The command below reproduces the [attention probe results](#pre-trained-attention-heads) on the ImageNet-1k validation set. We run the evaluation using 1 node with 8 GPUs:

```commandline
torchrun --standalone --nnodes=1 --nproc-per-node=8 main_attnprobe.py \
  --model=aim-7B \
  --batch-size=64 \
  --data-path=/path/to/imagenet \
  --probe-layers=last \
  --backbone-ckpt-path=/path/to/backbone_ckpt.pth \
  --head-ckpt-path=/path/to/head_ckpt.pth
```

By default, we probe the last 6 layers. To change this, simply pass `--probe-layers=best`.
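For a quick, single-GPU sanity check that bypasses `main_attnprobe.py`, top-1 accuracy can also be estimated directly from the Hugging Face classification checkpoints. This is only an illustrative sketch: it assumes the ImageNet-1k validation set is laid out in `ImageFolder` format with class folders whose sorted order matches the model's class indices, and that the forward pass returns logits as in the examples above, so it is not a substitute for the official probe script.

```python
import torch
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder

from aim.torch.data import val_transforms
from aim.torch.models import AIMForImageClassification

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AIMForImageClassification.from_pretrained("apple/aim-600M").to(device).eval()

# Assumes /path/to/imagenet/val contains one sub-folder per ImageNet-1k class.
dataset = ImageFolder("/path/to/imagenet/val", transform=val_transforms())
loader = DataLoader(dataset, batch_size=64, num_workers=8)

correct = total = 0
with torch.no_grad():
    for images, targets in loader:
        logits, _ = model(images.to(device))
        correct += (logits.argmax(dim=-1).cpu() == targets).sum().item()
        total += targets.numel()

print(f"top-1 accuracy: {correct / total:.4f}")
```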