
# Autoregressive Pre-training of Large Vision Encoders

This repository is the entry point for all things AIM, a family of autoregressive models that push the boundaries of visual and multimodal learning: AIMv1 (Scalable Pre-training of Large Autoregressive Image Models) and AIMv2 (Multimodal Autoregressive Pre-training of Large Vision Encoders).


If you're looking for the original AIM model (AIMv1), please refer to the README in the `aim-v1` subdirectory.

We introduce the AIMv2 family of vision models pre-trained with a multimodal autoregressive objective. The AIMv2 pre-training recipe is simple, and it trains and scales effectively. Some AIMv2 highlights include:

  1. Outperforms OAI CLIP and SigLIP on the majority of multimodal understanding benchmarks.
  2. Outperforms DINOv2 on open-vocabulary object detection and referring expression comprehension.
  3. Exhibits strong recognition performance with AIMv2-3B achieving 89.5% on ImageNet using a frozen trunk.

We share with the community AIMv2 pre-trained checkpoints of varying capacities and pre-training resolutions; see the tables below.

Please install PyTorch using the official installation instructions. Afterward, install the package as:

```bash
pip install 'git+https://github.com/apple/ml-aim.git#subdirectory=aim-v1'
pip install 'git+https://github.com/apple/ml-aim.git#subdirectory=aim-v2'
```

We also offer MLX backend support for research and experimentation on Apple silicon. To enable MLX support, install the `mlx` package (e.g., `pip install mlx`).

Usage examples for the PyTorch, MLX, and JAX backends:

**PyTorch:**

```python
from PIL import Image

from aim.v2.utils import load_pretrained
from aim.v1.torch.data import val_transforms

img = Image.open(...)
model = load_pretrained("aimv2-large-patch14-336", backend="torch")
transform = val_transforms(img_size=336)

inp = transform(img).unsqueeze(0)
features = model(inp)
```

**MLX:**

```python
from PIL import Image
import mlx.core as mx

from aim.v2.utils import load_pretrained
from aim.v1.torch.data import val_transforms

img = Image.open(...)
model = load_pretrained("aimv2-large-patch14-336", backend="mlx")
transform = val_transforms(img_size=336)

inp = transform(img).unsqueeze(0)
inp = mx.array(inp.numpy())
features = model(inp)
```

**JAX:**

```python
from PIL import Image
import jax.numpy as jnp

from aim.v2.utils import load_pretrained
from aim.v1.torch.data import val_transforms

img = Image.open(...)
model, params = load_pretrained("aimv2-large-patch14-336", backend="jax")
transform = val_transforms(img_size=336)

inp = transform(img).unsqueeze(0)
inp = jnp.array(inp)
features = model.apply({"params": params}, inp)
```
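
As a rough follow-up, here is a minimal sketch (PyTorch) of how the frozen trunk could feed a simple probe. It assumes `model` and `inp` come from the PyTorch example above and that the encoder returns patch features of shape `(batch, num_patches, embed_dim)`; the mean pooling and the embedding width are assumptions for illustration, not the repository's evaluation protocol.

```python
import torch

# Hedged sketch: a linear probe on top of frozen AIMv2 features.
# Assumes `model` and `inp` from the PyTorch example above, and that the encoder
# returns patch features of shape (batch, num_patches, embed_dim). The embedding
# width below is an assumption; read it from the loaded model's config instead.
embed_dim = 1024                      # assumed width for aimv2-large
num_classes = 1000                    # e.g., ImageNet-1k
probe = torch.nn.Linear(embed_dim, num_classes)

with torch.no_grad():                 # keep the trunk frozen
    patch_features = model(inp)       # (1, num_patches, embed_dim), assumed

pooled = patch_features.mean(dim=1)   # simple mean pooling over patches (assumption)
logits = probe(pooled)                # (1, num_classes)
```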

The pre-trained models can be accessed via HuggingFace Hub as:

```python
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

image = Image.open(...)
processor = AutoImageProcessor.from_pretrained("apple/aimv2-large-patch14-336")
model = AutoModel.from_pretrained("apple/aimv2-large-patch14-336", trust_remote_code=True)

inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
```
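
If you need the raw features from the Hugging Face model, the short sketch below assumes the remote-code model returns a standard Transformers output exposing `last_hidden_state`; verify the exact output type against the model card.

```python
# Hedged: assumes the output exposes `last_hidden_state`
# with shape (batch, num_patches, hidden_dim).
features = outputs.last_hidden_state
print(features.shape)
```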
**224px**

| model_id | #params | IN-1k | HF Link | Backbone |
|---|---|---|---|---|
| aimv2-large-patch14-224 | 0.3B | 86.6 | 🤗 link | link |
| aimv2-huge-patch14-224 | 0.6B | 87.5 | 🤗 link | link |
| aimv2-1B-patch14-224 | 1.2B | 88.1 | 🤗 link | link |
| aimv2-3B-patch14-224 | 2.7B | 88.5 | 🤗 link | link |

**336px**

| model_id | #params | IN-1k | HF Link | Backbone |
|---|---|---|---|---|
| aimv2-large-patch14-336 | 0.3B | 87.6 | 🤗 link | link |
| aimv2-huge-patch14-336 | 0.6B | 88.2 | 🤗 link | link |
| aimv2-1B-patch14-336 | 1.2B | 88.7 | 🤗 link | link |
| aimv2-3B-patch14-336 | 2.7B | 89.2 | 🤗 link | link |

**448px**

| model_id | #params | IN-1k | HF Link | Backbone |
|---|---|---|---|---|
| aimv2-large-patch14-448 | 0.3B | 87.9 | 🤗 link | link |
| aimv2-huge-patch14-448 | 0.6B | 88.6 | 🤗 link | link |
| aimv2-1B-patch14-448 | 1.2B | 89.0 | 🤗 link | link |
| aimv2-3B-patch14-448 | 2.7B | 89.5 | 🤗 link | link |

### AIMv2 with Native Resolution

We additionally provide an AIMv2-L checkpoint that is fine-tuned to process a wide range of image resolutions and aspect ratios. Regardless of the aspect ratio, the image is patchified (patch_size=14) and a 2D sinusoidal positional embedding is added to the linearly projected input patches. This checkpoint supports a number of patches in the range [112, 4096].
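
For intuition, the snippet below sketches the patch-count constraint described above; it assumes a plain floor(H/14) × floor(W/14) patch grid, which may differ from the checkpoint's exact resizing policy.

```python
# Sketch of the patch-count constraint (patch_size = 14).
# Assumes a plain floor(H/14) x floor(W/14) grid with no padding or resizing.
def num_patches(height: int, width: int, patch_size: int = 14) -> int:
    return (height // patch_size) * (width // patch_size)

n = num_patches(448, 672)             # 32 x 48 = 1536 patches
assert 112 <= n <= 4096, f"{n} patches is outside the supported range"
```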

| model_id | #params | IN-1k | HF Link | Backbone |
|---|---|---|---|---|
| aimv2-large-patch14-native | 0.3B | 87.3 | 🤗 link | link |

### AIMv2 distilled ViT-Large

We provide an AIMv2-L checkpoint distilled from AIMv2-3B that delivers remarkable performance on multimodal understanding benchmarks.

| Model | VQAv2 | GQA | OKVQA | TextVQA | DocVQA | InfoVQA | ChartQA | SciQA | MMEp |
|---|---|---|---|---|---|---|---|---|---|
| AIMv2-L | 80.2 | 72.6 | 60.9 | 53.9 | 26.8 | 22.4 | 20.3 | 74.5 | 1457 |
| AIMv2-L-distilled | 81.1 | 73.0 | 61.4 | 53.5 | 29.2 | 23.3 | 24.0 | 76.3 | 1627 |

| model_id | #params | Res. | HF Link | Backbone |
|---|---|---|---|---|
| aimv2-large-patch14-224-distilled | 0.3B | 224px | 🤗 link | link |
| aimv2-large-patch14-336-distilled | 0.3B | 336px | 🤗 link | link |

We provide the AIMv2-L vision and text encoders after LiT tuning to enable zero-shot recognition.

| model | #params | zero-shot IN-1k | Backbone |
|---|---|---|---|
| AIMv2-L | 0.3B | 77.0 | link |
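
To illustrate what the LiT-tuned encoders enable, here is a generic zero-shot classification sketch. `image_encoder`, `text_encoder`, and `tokenize` are hypothetical placeholders rather than this repository's API, and the prompt template is only an example of the standard similarity-softmax protocol.

```python
import torch
import torch.nn.functional as F

# Illustrative zero-shot recognition sketch, NOT the repository's actual API.
# `image_encoder`, `text_encoder`, and `tokenize` are hypothetical placeholders
# for the released LiT-tuned vision and text encoders; `inp` is a preprocessed image.
class_names = ["dog", "cat", "car"]
prompts = [f"a photo of a {name}" for name in class_names]

with torch.no_grad():
    img_emb = image_encoder(inp)                  # (1, dim), hypothetical call
    txt_emb = text_encoder(tokenize(prompts))     # (num_classes, dim), hypothetical call

img_emb = F.normalize(img_emb, dim=-1)
txt_emb = F.normalize(txt_emb, dim=-1)
probs = (img_emb @ txt_emb.T).softmax(dim=-1)     # image-text similarity -> class probabilities
```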

If you find our work useful, please consider citing us as:

```bibtex
@misc{fini2024multimodal,
    title={Multimodal Autoregressive Pre-training of Large Vision Encoders},
    author={Enrico Fini and Mustafa Shukor and Xiujun Li and Philipp Dufter and Michal Klein and David Haldimann and Sai Aitharaju and Victor Guilherme Turrisi da Costa and Louis Béthune and Zhe Gan and Alexander T Toshev and Marcin Eichner and Moin Nabi and Yinfei Yang and Joshua M. Susskind and Alaaeldin El-Nouby},
    year={2024},
    eprint={2411.14402},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

@InProceedings{pmlr-v235-el-nouby24a,
  title     = {Scalable Pre-training of Large Autoregressive Image Models},
  author    = {El-Nouby, Alaaeldin and Klein, Michal and Zhai, Shuangfei and Bautista, Miguel \'{A}ngel and Shankar, Vaishaal and Toshev, Alexander T and Susskind, Joshua M. and Joulin, Armand},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {12371--12384},
  year      = {2024},
}
```

Please check out the repository LICENSE before using the provided code and models.

