This repository is the entry point for all things AIM, a family of autoregressive models that push the boundaries of visual and multimodal learning:
- **Multimodal Autoregressive Pre-training of Large Vision Encoders** [BibTeX] [CVPR 2025 (Highlight)]
- **Scalable Pre-training of Large Autoregressive Image Models** [BibTeX] [ICML 2024]

*: Equal technical contribution
If you're looking for the original AIM model (AIMv1), please refer to the README here.
We introduce the AIMv2 family of vision models pre-trained with a multimodal autoregressive objective. AIMv2 pre-training is simple and straightforward to implement, and it scales effectively. Among its highlights, AIMv2-3B reaches 89.5% top-1 accuracy on ImageNet-1k at 448px resolution (see the tables below).
We share with the community AIMv2 pre-trained checkpoints of varying capacities and pre-training resolutions:
- AIMv2 with 224px
- AIMv2 with 336px
- AIMv2 with 448px
- AIMv2 with Native Resolution
- AIMv2 distilled ViT-Large (recommended for multimodal applications)
- Zero-shot Adapted AIMv2

Please install PyTorch using the official installation instructions. Afterward, install the package as:
```bash
pip install 'git+https://github.com/apple/ml-aim.git#subdirectory=aim-v1'
pip install 'git+https://github.com/apple/ml-aim.git#subdirectory=aim-v2'
```
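As an optional sanity check (not part of the upstream instructions), you can confirm the package imports cleanly; `load_pretrained` is the same entry point used in the usage examples below:

```python
# optional sanity check: import the entry point used in the usage examples below
from aim.v2.utils import load_pretrained

print("aim-v2 installed:", callable(load_pretrained))
```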
We also offer MLX backend support for research and experimentation on Apple silicon. To enable MLX support, simply run:
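```bash
# MLX is distributed on PyPI; this installs the optional MLX backend dependency
pip install mlx
```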
Image features can then be extracted with any of the three backends:

**PyTorch**

```python
from PIL import Image

from aim.v2.utils import load_pretrained
from aim.v1.torch.data import val_transforms

img = Image.open(...)
model = load_pretrained("aimv2-large-patch14-336", backend="torch")
transform = val_transforms(img_size=336)

inp = transform(img).unsqueeze(0)
features = model(inp)
```
**MLX**

```python
from PIL import Image

import mlx.core as mx

from aim.v2.utils import load_pretrained
from aim.v1.torch.data import val_transforms

img = Image.open(...)
model = load_pretrained("aimv2-large-patch14-336", backend="mlx")
transform = val_transforms(img_size=336)

inp = transform(img).unsqueeze(0)
inp = mx.array(inp.numpy())
features = model(inp)
```
**JAX**

```python
from PIL import Image

import jax.numpy as jnp

from aim.v2.utils import load_pretrained
from aim.v1.torch.data import val_transforms

img = Image.open(...)
model, params = load_pretrained("aimv2-large-patch14-336", backend="jax")
transform = val_transforms(img_size=336)

inp = transform(img).unsqueeze(0)
inp = jnp.array(inp)
features = model.apply({"params": params}, inp)
```
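The calls above return patch-level features. If a single global image embedding is needed, a simple option is to average-pool over the patch dimension. The snippet below is an illustrative sketch (PyTorch), not an official API of the package; it assumes `features` has shape `(batch, num_patches, dim)` and uses a random placeholder tensor in place of real model output:

```python
import torch

# placeholder standing in for `features` from the PyTorch example above,
# assumed shape: (batch, num_patches, dim)
features = torch.randn(1, 576, 1024)

# average-pool the patch tokens to get one embedding per image
image_embedding = features.mean(dim=1)  # (batch, dim)

# optional: L2-normalize for retrieval / similarity use cases
image_embedding = torch.nn.functional.normalize(image_embedding, dim=-1)

print(image_embedding.shape)  # torch.Size([1, 1024])
```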
The pre-trained models can be accessed via HuggingFace Hub as:
```python
from PIL import Image

from transformers import AutoImageProcessor, AutoModel

image = Image.open(...)
processor = AutoImageProcessor.from_pretrained("apple/aimv2-large-patch14-336")
model = AutoModel.from_pretrained("apple/aimv2-large-patch14-336", trust_remote_code=True)

inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
```

**AIMv2 with 224px**

| model_id | #params | IN-1k | HF Link | Backbone |
|---|---|---|---|---|
| aimv2-large-patch14-224 | 0.3B | 86.6 | 🤗 link | link |
| aimv2-huge-patch14-224 | 0.6B | 87.5 | 🤗 link | link |
| aimv2-1B-patch14-224 | 1.2B | 88.1 | 🤗 link | link |
| aimv2-3B-patch14-224 | 2.7B | 88.5 | 🤗 link | link |

**AIMv2 with 336px**

| model_id | #params | IN-1k | HF Link | Backbone |
|---|---|---|---|---|
| aimv2-large-patch14-336 | 0.3B | 87.6 | 🤗 link | link |
| aimv2-huge-patch14-336 | 0.6B | 88.2 | 🤗 link | link |
| aimv2-1B-patch14-336 | 1.2B | 88.7 | 🤗 link | link |
| aimv2-3B-patch14-336 | 2.7B | 89.2 | 🤗 link | link |

**AIMv2 with 448px**

| model_id | #params | IN-1k | HF Link | Backbone |
|---|---|---|---|---|
| aimv2-large-patch14-448 | 0.3B | 87.9 | 🤗 link | link |
| aimv2-huge-patch14-448 | 0.6B | 88.6 | 🤗 link | link |
| aimv2-1B-patch14-448 | 1.2B | 89.0 | 🤗 link | link |
| aimv2-3B-patch14-448 | 2.7B | 89.5 | 🤗 link | link |

**AIMv2 with Native Resolution**
We additionally provide an AIMv2-L checkpoint that is fine-tuned to process a wide range of image resolutions and aspect ratios. Regardless of the aspect ratio, the image is patchified (patch_size=14) and a 2D sinusoidal positional embedding is added to the linearly projected input patches. This checkpoint supports patch counts in the range [112, 4096].

| model_id | #params | IN-1k | HF Link | Backbone |
|---|---|---|---|---|
| aimv2-large-patch14-native | 0.3B | 87.3 | 🤗 link | link |
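For reference, here is the patch-count arithmetic implied by the paragraph above. The helper is illustrative only (not part of the package) and assumes the image dimensions are already multiples of the patch size:

```python
PATCH_SIZE = 14                       # patch size used by the native-resolution checkpoint
MIN_PATCHES, MAX_PATCHES = 112, 4096  # supported range of patch counts

def num_patches(height: int, width: int, patch_size: int = PATCH_SIZE) -> int:
    """Number of patches for an image whose sides are multiples of `patch_size`."""
    assert height % patch_size == 0 and width % patch_size == 0
    return (height // patch_size) * (width // patch_size)

# e.g. a 448x448 image yields 32 * 32 = 1024 patches, within [112, 4096]
n = num_patches(448, 448)
assert MIN_PATCHES <= n <= MAX_PATCHES, "resolution outside the supported patch range"
print(n)  # 1024
```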
**AIMv2 distilled ViT-Large**

We provide an AIMv2-L checkpoint distilled from AIMv2-3B that delivers remarkable performance on multimodal understanding benchmarks.
| Model | VQAv2 | GQA | OKVQA | TextVQA | DocVQA | InfoVQA | ChartQA | SciQA | MMEp |
|---|---|---|---|---|---|---|---|---|---|
| AIMv2-L | 80.2 | 72.6 | 60.9 | 53.9 | 26.8 | 22.4 | 20.3 | 74.5 | 1457 |
| AIMv2-L-distilled | 81.1 | 73.0 | 61.4 | 53.5 | 29.2 | 23.3 | 24.0 | 76.3 | 1627 |

| model_id | #params | Res. | HF Link | Backbone |
|---|---|---|---|---|
| aimv2-large-patch14-224-distilled | 0.3B | 224px | 🤗 link | link |
| aimv2-large-patch14-336-distilled | 0.3B | 336px | 🤗 link | link |

**Zero-shot Adapted AIMv2**

We provide the AIMv2-L vision and text encoders after LiT tuning to enable zero-shot recognition.

| model | #params | zero-shot IN-1k | Backbone |
|---|---|---|---|
| AIMv2-L | 0.3B | 77.0 | link |
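For context, LiT-tuned image/text encoders are typically used for zero-shot classification roughly as follows. This is a generic sketch of the idea only: the random tensors are placeholders standing in for the vision- and text-encoder outputs, and the function is not the API of the released checkpoints.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image_features: torch.Tensor,
                       text_features: torch.Tensor,
                       temperature: float = 0.01) -> torch.Tensor:
    """Generic CLIP/LiT-style zero-shot scoring (illustrative sketch).

    image_features: (batch, dim) embeddings from the vision encoder.
    text_features:  (num_classes, dim) embeddings of class-name prompts from the text encoder.
    Returns class probabilities of shape (batch, num_classes).
    """
    img = F.normalize(image_features, dim=-1)
    txt = F.normalize(text_features, dim=-1)
    logits = img @ txt.T / temperature  # cosine similarities scaled by a temperature
    return logits.softmax(dim=-1)

# placeholder tensors standing in for encoder outputs (dim chosen arbitrarily for the sketch)
probs = zero_shot_classify(torch.randn(2, 768), torch.randn(1000, 768))
print(probs.shape)  # torch.Size([2, 1000])
```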
If you find our work useful, please consider citing us as:
```bibtex
@misc{fini2024multimodal,
  title={Multimodal Autoregressive Pre-training of Large Vision Encoders},
  author={Enrico Fini and Mustafa Shukor and Xiujun Li and Philipp Dufter and Michal Klein and David Haldimann and Sai Aitharaju and Victor Guilherme Turrisi da Costa and Louis Béthune and Zhe Gan and Alexander T Toshev and Marcin Eichner and Moin Nabi and Yinfei Yang and Joshua M. Susskind and Alaaeldin El-Nouby},
  year={2024},
  eprint={2411.14402},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```
```bibtex
@InProceedings{pmlr-v235-el-nouby24a,
  title     = {Scalable Pre-training of Large Autoregressive Image Models},
  author    = {El-Nouby, Alaaeldin and Klein, Michal and Zhai, Shuangfei and Bautista, Miguel \'{A}ngel and Shankar, Vaishaal and Toshev, Alexander T and Susskind, Joshua M. and Joulin, Armand},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {12371--12384},
  year      = {2024},
}
```
Please check out the repository LICENSE before using the provided code and models.