
[中文版本 (Chinese version)]

InternImage: Large-Scale Vision Foundation Model


The official implementation of

InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions.

[Paper] [Blog in Chinese]

Highlights

  • 👍 The strongest open-source universal vision backbone, with up to 3 billion parameters
  • 🏆 Achieved 90.1% Top-1 accuracy on ImageNet, the highest among open-source models
  • 🏆 Achieved 65.5 mAP on the COCO object detection benchmark, the only model to exceed 65.0 mAP

News

  • Jan 22, 2024: 🚀 Support DCNv4 in InternImage!
  • Feb 28, 2023: 🚀 InternImage is accepted to CVPR 2023!
  • Nov 18, 2022: 🚀 InternImage-XL, merged into BEVFormer v2, achieves state-of-the-art performance of 63.4 NDS on the nuScenes camera-only track.
  • Nov 10, 2022: 🚀 InternImage-H achieves a new record of 65.4 mAP on COCO detection test-dev and 62.9 mIoU on ADE20K, outperforming previous models by a large margin.

History

  • Models for other downstream tasks
  • Support CVPR 2023 Workshop on End-to-End Autonomous Driving, see here
  • Support extracting intermediate features, see here
  • Low-cost training with DeepSpeed, see here
  • Compilation-free .whl package of the DCNv3 operator, see here
  • InternImage-H(1B)/G(3B)
  • TensorRT inference for classification/detection/segmentation models
  • Classification code of the InternImage series
  • InternImage-T/S/B/L/XL ImageNet-1K pretrained model
  • InternImage-L/XL ImageNet-22K pretrained model
  • InternImage-T/S/B/L/XL detection and instance segmentation model
  • InternImage-T/S/B/L/XL semantic segmentation model

Introduction

InternImage is an advanced vision foundation model developed by researchers from Shanghai AI Laboratory, Tsinghua University, and other institutions. Unlike models based on Transformers, InternImage employs DCNv3 as its core operator. This approach equips the model with dynamic and effective receptive fields required for downstream tasks like object detection and segmentation, while enabling adaptive spatial aggregation.
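
To make the mechanism concrete, below is a much-simplified sketch of deformable aggregation in plain PyTorch: each output location samples a small set of input points at learned offsets and blends them with learned modulation weights. This is an illustration only, not the repo's optimized DCNv3 CUDA operator (which adds grouped aggregation, softmax-normalized modulation, and other refinements); the class and parameter names here are hypothetical.

```python
# Simplified illustration of DCN-style dynamic sampling; NOT the repo's DCNv3 op.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformableAggregation(nn.Module):
    """For each output pixel, sample K locations at learned offsets with
    bilinear interpolation and aggregate them with learned modulation weights."""
    def __init__(self, channels: int, kernel_points: int = 9):
        super().__init__()
        self.k = kernel_points
        # predict (dx, dy) per sampling point, plus a modulation scalar per point
        self.offset = nn.Conv2d(channels, 2 * kernel_points, 3, padding=1)
        self.mask = nn.Conv2d(channels, kernel_points, 3, padding=1)
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        offsets = self.offset(x)                      # (B, 2K, H, W), in pixels
        masks = torch.sigmoid(self.mask(x))           # (B, K, H, W)
        # base sampling grid in [-1, 1], following grid_sample's convention
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=x.device),
            torch.linspace(-1, 1, w, device=x.device), indexing="ij")
        base = torch.stack((xs, ys), dim=-1)          # (H, W, 2), x first
        out = 0
        for k in range(self.k):
            dxy = offsets[:, 2 * k:2 * k + 2].permute(0, 2, 3, 1)  # (B, H, W, 2)
            # convert pixel offsets to normalized grid units
            dxy = dxy / torch.tensor([w / 2, h / 2], device=x.device)
            sampled = F.grid_sample(x, base.unsqueeze(0) + dxy, align_corners=True)
            out = out + sampled * masks[:, k:k + 1]   # modulated aggregation
        return self.proj(out)
```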

Some other projects related to InternImage include the pretraining algorithm "M3I-Pretraining," the general-purpose decoder series "Uni-Perceiver," and the autonomous driving perception encoder series "BEVFormer."

Performance

  • On the ImageNet classification benchmark, InternImage achieves a Top-1 accuracy of 90.1% using only publicly available training data. Apart from two undisclosed models trained on additional private datasets by Google and Microsoft, it is the only open-source model to exceed 90.0% Top-1 accuracy, and it is also the largest vision model in scale worldwide.
  • On the COCO object detection benchmark, InternImage reaches 65.5 mAP, outperforming all other models and making it the only model to surpass 65 mAP.
  • InternImage also delivers the world's best performance on 16 other important visual benchmarks spanning classification, detection, and segmentation, making it the top-performing model across multiple domains.

Classification

| Dataset | Task | Result |
|:---|:---|:---:|
| ImageNet | Image Classification | 90.1 |
| Places365 | Scene Classification | 61.2 |
| Places205 | Scene Classification | 71.7 |
| iNaturalist 2018 | Long-Tail Classification | 92.6 |

Detection

| Dataset | Task | Result |
|:---|:---|:---:|
| COCO | General Object Detection | 65.5 |
| VOC 2007 | General Object Detection | 94.0 |
| VOC 2012 | General Object Detection | 97.2 |
| OpenImage | General Object Detection | 74.1 |
| LVIS minival | Long-Tail Object Detection | 65.8 |
| LVIS val | Long-Tail Object Detection | 63.2 |
| BDD100K | Autonomous Driving Object Detection | 38.8 |
| nuScenes | Autonomous Driving Object Detection | 64.8 |
| CrowdHuman | Dense Object Detection | 97.2 |

Segmentation

| Dataset | Task | Result |
|:---|:---|:---:|
| ADE20K | Semantic Segmentation | 62.9 |
| COCO Stuff-10K | Semantic Segmentation | 59.6 |
| Pascal Context | Semantic Segmentation | 70.3 |
| CityScapes | Street Segmentation | 87.0 |
| NYU Depth V2 | RGBD Segmentation | 68.1 |

Released Models

Open-Source Visual Pretrained Models

| name | pretrain | resolution | #param | download |
|:---|:---|:---:|:---:|:---:|
| InternImage-L | IN-22K | 384x384 | 223M | pth \| hf |
| InternImage-XL | IN-22K | 384x384 | 335M | pth \| hf |
| InternImage-H | Joint 427M -> IN-22K | 384x384 | 1.08B | pth \| hf |
| InternImage-G | Joint 427M -> IN-22K | 384x384 | 3B | pth \| hf |
ImageNet-1K Image Classification

| name | pretrain | resolution | acc@1 | #param | FLOPs | download |
|:---|:---|:---:|:---:|:---:|:---:|:---:|
| InternImage-T | IN-1K | 224x224 | 83.5 | 30M | 5G | pth \| hf \| cfg |
| InternImage-S | IN-1K | 224x224 | 84.2 | 50M | 8G | pth \| hf \| cfg |
| InternImage-B | IN-1K | 224x224 | 84.9 | 97M | 16G | pth \| hf \| cfg |
| InternImage-L | IN-22K | 384x384 | 87.7 | 223M | 108G | pth \| hf \| cfg |
| InternImage-XL | IN-22K | 384x384 | 88.0 | 335M | 163G | pth \| hf \| cfg |
| InternImage-H | Joint 427M -> IN-22K | 640x640 | 89.6 | 1.08B | 1478G | pth \| hf \| cfg |
| InternImage-G | Joint 427M -> IN-22K | 512x512 | 90.1 | 3B | 2700G | pth \| hf \| cfg |
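
Before wiring a downloaded classification checkpoint into the model builder, a quick sanity check is to load the .pth file and inspect a few parameter shapes. The file name below is illustrative, and treating the weights as possibly nested under a "model" key is an assumption about the released checkpoints.

```python
# Hedged sketch: inspect a released checkpoint before loading it into the model.
import torch

ckpt = torch.load("internimage_t_1k_224.pth", map_location="cpu")  # assumed file name
# released checkpoints often wrap weights under a "model" key; fall back otherwise
state_dict = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))
```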
COCO Object Detection and Instance Segmentation

| backbone | method | schd | box mAP | mask mAP | #param | FLOPs | download |
|:---|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| InternImage-T | Mask R-CNN | 1x | 47.2 | 42.5 | 49M | 270G | ckpt \| cfg |
| InternImage-T | Mask R-CNN | 3x | 49.1 | 43.7 | 49M | 270G | ckpt \| cfg |
| InternImage-S | Mask R-CNN | 1x | 47.8 | 43.3 | 69M | 340G | ckpt \| cfg |
| InternImage-S | Mask R-CNN | 3x | 49.7 | 44.5 | 69M | 340G | ckpt \| cfg |
| InternImage-B | Mask R-CNN | 1x | 48.8 | 44.0 | 115M | 501G | ckpt \| cfg |
| InternImage-B | Mask R-CNN | 3x | 50.3 | 44.8 | 115M | 501G | ckpt \| cfg |
| InternImage-L | Cascade Mask R-CNN | 1x | 54.9 | 47.7 | 277M | 1399G | ckpt \| cfg |
| InternImage-L | Cascade Mask R-CNN | 3x | 56.1 | 48.5 | 277M | 1399G | ckpt \| cfg |
| InternImage-XL | Cascade Mask R-CNN | 1x | 55.3 | 48.1 | 387M | 1782G | ckpt \| cfg |
| InternImage-XL | Cascade Mask R-CNN | 3x | 56.2 | 48.8 | 387M | 1782G | ckpt \| cfg |

| backbone | method | box mAP (val/test) | #param | download |
|:---|:---|:---:|:---:|:---:|
| CB-InternImage-H | DINO (TTA) | 65.0 / 65.4 | 2.18B | ckpt \| cfg |
| CB-InternImage-G | DINO (TTA) | 65.3 / 65.5 | 6B | TODO |
ADE20K Semantic Segmentation

| backbone | method | resolution | mIoU (ss/ms) | #param | FLOPs | download |
|:---|:---|:---:|:---:|:---:|:---:|:---:|
| InternImage-T | UperNet | 512x512 | 47.9 / 48.1 | 59M | 944G | ckpt \| cfg |
| InternImage-S | UperNet | 512x512 | 50.1 / 50.9 | 80M | 1017G | ckpt \| cfg |
| InternImage-B | UperNet | 512x512 | 50.8 / 51.3 | 128M | 1185G | ckpt \| cfg |
| InternImage-L | UperNet | 640x640 | 53.9 / 54.1 | 256M | 2526G | ckpt \| cfg |
| InternImage-XL | UperNet | 640x640 | 55.0 / 55.3 | 368M | 3142G | ckpt \| cfg |
| InternImage-H | UperNet | 896x896 | 59.9 / 60.3 | 1.12B | 3566G | ckpt \| cfg |
| InternImage-H | Mask2Former | 896x896 | 62.5 / 62.9 | 1.31B | 4635G | ckpt \| cfg |
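
Since the segmentation models are trained with the repo's mmsegmentation-based code, a released checkpoint can, as a sketch, be run through the standard mmsegmentation 0.x inference API. The config and checkpoint paths below are assumptions about the repo layout, and the repo's custom modules (the InternImage backbone and the DCNv3 op) must be importable for the config to resolve, e.g. by running from the repo's segmentation directory.

```python
# Hedged sketch using the mmsegmentation 0.x API the segmentation code builds on.
# Paths are illustrative; the repo's custom backbone/op modules must be importable.
from mmseg.apis import inference_segmentor, init_segmentor

config = "configs/ade20k/upernet_internimage_t_512_160k_ade20k.py"  # assumed path
checkpoint = "upernet_internimage_t_512_160k_ade20k.pth"            # assumed file name
model = init_segmentor(config, checkpoint, device="cuda:0")
result = inference_segmentor(model, "demo.jpg")  # list with one per-pixel label map
```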
Main Results of FPS

Export classification model from PyTorch to TensorRT

Export detection model from PyTorch to TensorRT

Export segmentation model from PyTorch to TensorRT

| name | resolution | #param | FLOPs | batch 1 FPS (TensorRT) |
|:---|:---:|:---:|:---:|:---:|
| InternImage-T | 224x224 | 30M | 5G | 156 |
| InternImage-S | 224x224 | 50M | 8G | 129 |
| InternImage-B | 224x224 | 97M | 16G | 116 |
| InternImage-L | 384x384 | 223M | 108G | 56 |
| InternImage-XL | 384x384 | 335M | 163G | 47 |
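
For reference, batch-1 FPS numbers like those above are typically obtained by warming a model up and then timing repeated forward passes. The sketch below shows that protocol in plain PyTorch with a stand-in model; the table's numbers come from compiled TensorRT engines, so this snippet only illustrates how the metric is defined, not how those exact values were produced.

```python
# Rough sketch of a batch-1 FPS measurement: warm up, then time N forward passes.
import time
import torch

@torch.no_grad()
def measure_fps(model, resolution=224, iters=100, device="cuda"):
    model = model.eval().to(device)
    x = torch.randn(1, 3, resolution, resolution, device=device)
    for _ in range(10):               # warm-up passes
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()          # wait for all kernels before stopping the clock
    return iters / (time.time() - start)

# e.g. with any stand-in backbone:
# import torchvision.models as tvm
# print(measure_fps(tvm.resnet50()))
```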

Before using mmdeploy to convert our PyTorch models to TensorRT, please make sure you have the DCNv3 custom operator built correctly. You can build it with the following command:

```shell
export MMDEPLOY_DIR=/the/root/path/of/MMDeploy

# prepare our custom ops, you can find it at InternImage/tensorrt/modulated_deform_conv_v3
cp -r modulated_deform_conv_v3 ${MMDEPLOY_DIR}/csrc/mmdeploy/backend_ops/tensorrt

# build custom ops
cd ${MMDEPLOY_DIR}
mkdir -p build && cd build
cmake -DCMAKE_CXX_COMPILER=g++-7 -DMMDEPLOY_TARGET_BACKENDS=trt -DTENSORRT_DIR=${TENSORRT_DIR} -DCUDNN_DIR=${CUDNN_DIR} ..
make -j$(nproc) && make install

# install mmdeploy after building the custom ops
cd ${MMDEPLOY_DIR}
pip install -e .
```

For more details on building custom ops, please refer to this document.

Related Projects

Foundation Models

  • Uni-Perceiver: A unified pre-training architecture for generic perception, covering zero-shot and few-shot tasks
  • Uni-Perceiver v2: A generalist model for large-scale vision and vision-language tasks
  • M3I-Pretraining: One-stage pre-training paradigm via maximizing multi-modal mutual information
  • InternVL: A leading multimodal large language model excelling in tasks such as OCR, multimodal reasoning, and dialogue

Autonomous Driving

  • BEVFormer: A cutting-edge baseline for camera-based 3D detection
  • BEVFormer v2: Adapting modern image backbones to Bird's-Eye-View recognition via perspective supervision

Application in Challenges

Citation

If this work is helpful for your research, please consider citing the following BibTeX entry.

```bibtex
@inproceedings{wang2023internimage,
  title={Internimage: Exploring large-scale vision foundation models with deformable convolutions},
  author={Wang, Wenhai and Dai, Jifeng and Chen, Zhe and Huang, Zhenhang and Li, Zhiqi and Zhu, Xizhou and Hu, Xiaowei and Lu, Tong and Lu, Lewei and Li, Hongsheng and others},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={14408--14419},
  year={2023}
}
```
