[Chinese Version]
InternImage: Large-Scale Vision Foundation Model
The official implementation of
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions.
[Paper] [Blog in Chinese]
- The strongest open-source universal vision backbone, with up to 3 billion parameters
- Achieved 90.1% Top-1 accuracy on ImageNet, the most accurate among open-source models
- Achieved 65.5 mAP on the COCO object detection benchmark, the only model to exceed 65.0 mAP
- Jan 22, 2024: Support DCNv4 in InternImage!
- Feb 28, 2023: InternImage is accepted to CVPR 2023!
- Nov 18, 2022: InternImage-XL merged into BEVFormer v2 achieves state-of-the-art performance of 63.4 NDS on nuScenes Camera Only.
- Nov 10, 2022: InternImage-H achieves a new record of 65.4 mAP on COCO detection test-dev and 62.9 mIoU on ADE20K, outperforming previous models by a large margin.
- Models for other downstream tasks
- Support CVPR 2023 Workshop on End-to-End Autonomous Driving, see here
- Support extracting intermediate features, see here
- Low-cost training with DeepSpeed, see here
- Compilation-free .whl package of the DCNv3 operator, see here
- InternImage-H(1B)/G(3B)
- TensorRT inference for classification/detection/segmentation models
- Classification code of the InternImage series
- InternImage-T/S/B/L/XL ImageNet-1K pretrained model
- InternImage-L/XL ImageNet-22K pretrained model
- InternImage-T/S/B/L/XL detection and instance segmentation model
- InternImage-T/S/B/L/XL semantic segmentation model
InternImage is an advanced vision foundation model developed by researchers from Shanghai AI Laboratory, Tsinghua University, and other institutions. Unlike models based on Transformers, InternImage employs DCNv3 as its core operator. This approach equips the model with dynamic and effective receptive fields required for downstream tasks like object detection and segmentation, while enabling adaptive spatial aggregation.
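To make the operator concrete, below is a minimal, self-contained PyTorch sketch of the idea behind DCNv3-style aggregation: each output location predicts its own sampling offsets plus softmax-normalized modulation weights, so the receptive field adapts to the input. This is an illustration written for this README (names such as `ToyDeformableAggregation` are hypothetical), not the repo's optimized CUDA `DCNv3` operator, which additionally uses channel groups and a fused kernel.

```python
# Illustrative, simplified DCNv3-style aggregation (NOT the repo's optimized op).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDeformableAggregation(nn.Module):
    """For each output location, predict K sampling offsets and K
    softmax-normalized modulation weights, then aggregate bilinearly
    sampled features -- the core idea behind DCNv3."""
    def __init__(self, channels: int, k: int = 9):
        super().__init__()
        self.k = k
        self.offset = nn.Conv2d(channels, 2 * k, 3, padding=1)  # (dx, dy) per sample point
        self.weight = nn.Conv2d(channels, k, 3, padding=1)      # modulation logits
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        offsets = self.offset(x).view(n, self.k, 2, h, w)       # pixel-space offsets
        weights = self.weight(x).softmax(dim=1)                 # (n, k, h, w)

        # Base sampling grid in normalized [-1, 1] coordinates, (x, y) order.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=x.device),
            torch.linspace(-1, 1, w, device=x.device),
            indexing="ij",
        )
        base = torch.stack((xs, ys), dim=-1)                    # (h, w, 2)

        out = torch.zeros_like(x)
        for i in range(self.k):
            # Convert pixel offsets to normalized coordinates.
            dx = offsets[:, i, 0] * 2 / max(w - 1, 1)
            dy = offsets[:, i, 1] * 2 / max(h - 1, 1)
            grid = base.unsqueeze(0) + torch.stack((dx, dy), dim=-1)
            sampled = F.grid_sample(x, grid, align_corners=True)  # (n, c, h, w)
            out = out + sampled * weights[:, i : i + 1]
        return self.proj(out)

feats = torch.randn(1, 64, 56, 56)
print(ToyDeformableAggregation(64)(feats).shape)  # torch.Size([1, 64, 56, 56])
```

Because the offsets and weights are predicted from the input itself, the effective receptive field is dynamic per location, which is what makes the operator well suited to dense prediction tasks such as detection and segmentation.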
Some other projects related to InternImage include the pretraining algorithm "M3I-Pretraining," the general-purpose decoder series "Uni-Perceiver," and the autonomous driving perception encoder series "BEVFormer."
- For image classification, InternImage achieved an impressive 90.1% Top-1 accuracy on the ImageNet benchmark using only publicly available data. Apart from two undisclosed models trained with additional datasets by Google and Microsoft, InternImage is the only open-source model that exceeds 90.0% Top-1 accuracy, and it is also the largest such model in scale.
- On the COCO object detection benchmark, InternImage reached a remarkable 65.5 mAP, outperforming all other models worldwide and making it the only model to surpass 65 mAP.
- InternImage also demonstrated the world's best performance on 16 other important visual benchmarks spanning classification, detection, and segmentation, making it the top-performing model across multiple domains.
Classification
| Task | Dataset | Result |
|:---|:---|:---|
| Image Classification | ImageNet | 90.1 |
| Scene Classification | Places365 | 61.2 |
| Scene Classification | Places 205 | 71.7 |
| Long-Tail Classification | iNaturalist 2018 | 92.6 |
Detection
| Task | Dataset | Result |
|:---|:---|:---|
| General Object Detection | COCO | 65.5 |
| General Object Detection | VOC 2007 | 94.0 |
| General Object Detection | VOC 2012 | 97.2 |
| General Object Detection | OpenImage | 74.1 |
| Long-Tail Object Detection | LVIS minival | 65.8 |
| Long-Tail Object Detection | LVIS val | 63.2 |
| Autonomous Driving Object Detection | BDD100K | 38.8 |
| Autonomous Driving Object Detection | nuScenes | 64.8 |
| Dense Object Detection | CrowdHuman | 97.2 |
Segmentation
| Task | Dataset | Result |
|:---|:---|:---|
| Semantic Segmentation | ADE20K | 62.9 |
| Semantic Segmentation | COCO Stuff-10K | 59.6 |
| Semantic Segmentation | Pascal Context | 70.3 |
| Street Segmentation | CityScapes | 87.0 |
| RGBD Segmentation | NYU Depth V2 | 68.1 |

Open-Source Visual Pretrained Models
| name | pretrain | resolution | #param | download |
|:---|:---|:---|:---|:---|
| InternImage-L | IN-22K | 384x384 | 223M | pth \| hf |
| InternImage-XL | IN-22K | 384x384 | 335M | pth \| hf |
| InternImage-H | Joint 427M -> IN-22K | 384x384 | 1.08B | pth \| hf |
| InternImage-G | Joint 427M -> IN-22K | 384x384 | 3B | pth \| hf |

ImageNet-1K Image Classification
| name | pretrain | resolution | acc@1 | #param | FLOPs | download |
|:---|:---|:---|:---|:---|:---|:---|
| InternImage-T | IN-1K | 224x224 | 83.5 | 30M | 5G | pth \| hf \| cfg |
| InternImage-S | IN-1K | 224x224 | 84.2 | 50M | 8G | pth \| hf \| cfg |
| InternImage-B | IN-1K | 224x224 | 84.9 | 97M | 16G | pth \| hf \| cfg |
| InternImage-L | IN-22K | 384x384 | 87.7 | 223M | 108G | pth \| hf \| cfg |
| InternImage-XL | IN-22K | 384x384 | 88.0 | 335M | 163G | pth \| hf \| cfg |
| InternImage-H | Joint 427M -> IN-22K | 640x640 | 89.6 | 1.08B | 1478G | pth \| hf \| cfg |
| InternImage-G | Joint 427M -> IN-22K | 512x512 | 90.1 | 3B | 2700G | pth \| hf \| cfg |
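After downloading one of the classification checkpoints above, a quick sanity check of its contents can save debugging time later. A minimal sketch, assuming a locally saved file (the filename and the "model" key are illustrative assumptions; adjust them to your download):

```python
# Minimal sanity check of a downloaded classification checkpoint.
# Assumptions: the local filename and the "model" key are illustrative;
# adjust them to the checkpoint you actually downloaded from the table above.
import torch

ckpt = torch.load("internimage_t_1k_224.pth", map_location="cpu")
state = ckpt.get("model", ckpt)  # some checkpoints nest weights under "model"

print(f"{len(state)} tensors")
for name, tensor in list(state.items())[:5]:
    print(name, tuple(tensor.shape))
```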
COCO Object Detection and Instance Segmentation

| backbone | method | schd | box mAP | mask mAP | #param | FLOPs | download |
|:---|:---|:---|:---|:---|:---|:---|:---|
| InternImage-T | Mask R-CNN | 1x | 47.2 | 42.5 | 49M | 270G | ckpt \| cfg |
| InternImage-T | Mask R-CNN | 3x | 49.1 | 43.7 | 49M | 270G | ckpt \| cfg |
| InternImage-S | Mask R-CNN | 1x | 47.8 | 43.3 | 69M | 340G | ckpt \| cfg |
| InternImage-S | Mask R-CNN | 3x | 49.7 | 44.5 | 69M | 340G | ckpt \| cfg |
| InternImage-B | Mask R-CNN | 1x | 48.8 | 44.0 | 115M | 501G | ckpt \| cfg |
| InternImage-B | Mask R-CNN | 3x | 50.3 | 44.8 | 115M | 501G | ckpt \| cfg |
| InternImage-L | Cascade | 1x | 54.9 | 47.7 | 277M | 1399G | ckpt \| cfg |
| InternImage-L | Cascade | 3x | 56.1 | 48.5 | 277M | 1399G | ckpt \| cfg |
| InternImage-XL | Cascade | 1x | 55.3 | 48.1 | 387M | 1782G | ckpt \| cfg |
| InternImage-XL | Cascade | 3x | 56.2 | 48.8 | 387M | 1782G | ckpt \| cfg |

| backbone | method | box mAP (val/test) | #param | download |
|:---|:---|:---|:---|:---|
| CB-InternImage-H | DINO (TTA) | 65.0 / 65.4 | 2.18B | ckpt \| cfg |
| CB-InternImage-G | DINO (TTA) | 65.3 / 65.5 | 6B | TODO |

ADE20K Semantic Segmentation
| backbone | method | resolution | mIoU (ss/ms) | #param | FLOPs | download |
|:---|:---|:---|:---|:---|:---|:---|
| InternImage-T | UperNet | 512x512 | 47.9 / 48.1 | 59M | 944G | ckpt \| cfg |
| InternImage-S | UperNet | 512x512 | 50.1 / 50.9 | 80M | 1017G | ckpt \| cfg |
| InternImage-B | UperNet | 512x512 | 50.8 / 51.3 | 128M | 1185G | ckpt \| cfg |
| InternImage-L | UperNet | 640x640 | 53.9 / 54.1 | 256M | 2526G | ckpt \| cfg |
| InternImage-XL | UperNet | 640x640 | 55.0 / 55.3 | 368M | 3142G | ckpt \| cfg |
| InternImage-H | UperNet | 896x896 | 59.9 / 60.3 | 1.12B | 3566G | ckpt \| cfg |
| InternImage-H | Mask2Former | 896x896 | 62.5 / 62.9 | 1.31B | 4635G | ckpt \| cfg |

Main Results of FPS
- Export classification model from PyTorch to TensorRT
- Export detection model from PyTorch to TensorRT
- Export segmentation model from PyTorch to TensorRT
| name | resolution | #param | FLOPs | batch 1 FPS (TensorRT) |
|:---|:---|:---|:---|:---|
| InternImage-T | 224x224 | 30M | 5G | 156 |
| InternImage-S | 224x224 | 50M | 8G | 129 |
| InternImage-B | 224x224 | 97M | 16G | 116 |
| InternImage-L | 384x384 | 223M | 108G | 56 |
| InternImage-XL | 384x384 | 335M | 163G | 47 |
Before using mmdeploy to convert our PyTorch models to TensorRT, please make sure the DCNv3 custom operator has been built correctly. You can build it with the following commands:
```shell
export MMDEPLOY_DIR=/the/root/path/of/MMDeploy

# prepare our custom ops, you can find them at InternImage/tensorrt/modulated_deform_conv_v3
cp -r modulated_deform_conv_v3 ${MMDEPLOY_DIR}/csrc/mmdeploy/backend_ops/tensorrt

# build custom ops
cd ${MMDEPLOY_DIR}
mkdir -p build && cd build
cmake -DCMAKE_CXX_COMPILER=g++-7 -DMMDEPLOY_TARGET_BACKENDS=trt -DTENSORRT_DIR=${TENSORRT_DIR} -DCUDNN_DIR=${CUDNN_DIR} ..
make -j$(nproc) && make install

# install mmdeploy after building the custom ops
cd ${MMDEPLOY_DIR}
pip install -e .
```
For more details on building custom ops, please refer to this document.
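After the build, the custom plugins must be registered with TensorRT before an engine that uses DCNv3 can be deserialized. A minimal sketch of one way to verify this from Python, assuming mmdeploy's default build layout (the library path below is an assumption; adjust it to your environment):

```python
# Preload mmdeploy's custom TensorRT plugin library (which now includes DCNv3)
# before deserializing an engine. The path is an assumption based on
# mmdeploy's default build layout; adjust it to your environment.
import ctypes
import os

MMDEPLOY_DIR = os.environ["MMDEPLOY_DIR"]
plugin_lib = os.path.join(MMDEPLOY_DIR, "build", "lib", "libmmdeploy_tensorrt_ops.so")

# Loading the shared library registers its plugins with TensorRT's registry.
ctypes.CDLL(plugin_lib)

import tensorrt as trt  # noqa: E402

registry = trt.get_plugin_registry()
print([c.name for c in registry.plugin_creator_list])  # should list the DCNv3 op
```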
- Uni-Perceiver: A Pre-training unified architecture for generic perception for zero-shot and few-shot tasks
- Uni-Perceiver v2: A generalist model for large-scale vision and vision-language tasks
- M3I-Pretraining: One-stage pre-training paradigm via maximizing multi-modal mutual information
- InternVL: A leading multimodal large language model excelling in tasks such as OCR, multimodal reasoning, and dialogue
- BEVFormer: A cutting-edge baseline for camera-based 3D detection
- BEVFormer v2: Adapting modern image backbones to Bird's-Eye-View recognition via perspective supervision
Application in Challenges
If this work is helpful for your research, please consider citing the following BibTeX entry.
```bibtex
@inproceedings{wang2023internimage,
  title={Internimage: Exploring large-scale vision foundation models with deformable convolutions},
  author={Wang, Wenhai and Dai, Jifeng and Chen, Zhe and Huang, Zhenhang and Li, Zhiqi and Zhu, Xizhou and Hu, Xiaowei and Lu, Tong and Lu, Lewei and Li, Hongsheng and others},
  booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition},
  pages={14408--14419},
  year={2023}
}
```