Figure 1: Comparison between a conventional pre-training framework (left) and the proposed integral pre-training framework (right). We use a feature pyramid as the unified neck module and apply masked feature modeling to pre-train the feature pyramid. The green and red blocks indicate network weights that are pre-trained and untrained (i.e., randomly initialized for fine-tuning), respectively.
26/Dec./2023
model | Params (M) | Pre-train | teacher | input/patch | 21K ft? | Acc on IN.1K | checkpoint | checkpoint (21K) |
---|---|---|---|---|---|---|---|---|
Fast-iTPN-T | 24 | IN.1K | CLIP-L | 224/16 | N | 85.1% | baidu/google | |
Fast-iTPN-T | 24 | IN.1K | CLIP-L | 384/16 | N | 86.2% | | |
Fast-iTPN-T | 24 | IN.1K | CLIP-L | 512/16 | N | 86.5% | | |
Fast-iTPN-S | 40 | IN.1K | CLIP-L | 224/16 | N | 86.4% | baidu/google | |
Fast-iTPN-S | 40 | IN.1K | CLIP-L | 384/16 | N | 86.95% | | |
Fast-iTPN-S | 40 | IN.1K | CLIP-L | 512/16 | N | 87.8% | | |
Fast-iTPN-B | 85 | IN.1K | CLIP-L | 224/16 | N | 87.4% | baidu/google | |
Fast-iTPN-B | 85 | IN.1K | CLIP-L | 512/16 | N | 88.5% | | |
Fast-iTPN-B | 85 | IN.1K | CLIP-L | 512/16 | Y | 88.75% | baidu/google | |
Fast-iTPN-L | 312 | IN.1K | CLIP-L | 640/16 | N | 89.5% | baidu/google | |
All the pre-trained Fast-iTPN models are now available (password: itpn)! To our knowledge, the tiny/small/base-scale models report the best performance on ImageNet-1K. Use them for your own tasks! See Details.
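The `input/patch` column can be read as a patch-grid size. The sketch below assumes the entry means "square input resolution / patch size" in pixels, which is our reading of the notation rather than something the table states. The helper name `patch_grid` is hypothetical and not part of the iTPN code base.

```python
def patch_grid(spec: str) -> tuple:
    """Parse an 'input/patch' entry such as '224/16'.

    Assumption: the first number is the square input resolution in pixels
    and the second is the patch size in pixels.
    Returns (patches_per_side, total_patches).
    """
    resolution_str, patch_str = spec.split("/")
    resolution, patch = int(resolution_str), int(patch_str)
    if resolution % patch != 0:
        raise ValueError(f"{resolution} is not divisible by {patch}")
    side = resolution // patch
    return side, side * side


if __name__ == "__main__":
    # Entries taken from the Fast-iTPN table above.
    for spec in ("224/16", "384/16", "512/16", "640/16"):
        side, total = patch_grid(spec)
        print(f"{spec}: {side} x {side} = {total} patch positions")
```

Under this reading, going from 224/16 to 512/16 raises the number of patch positions from 196 to 1024, so the higher-resolution rows process roughly five times as many positions per image.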
30/May/2023
model | Pre-train | teacher | input/patch | 21K ft? | Acc on IN.1K |
---|---|---|---|---|---|
EVA-02-B | IN.21K | EVA-CLIP-g | 196/14 | N | 87.0% |
EVA-02-B | IN.21K | EVA-CLIP-g | 448/14 | N | 88.3% |
EVA-02-B | IN.21K | EVA-CLIP-g | 448/14 | Y | 88.6% |
Fast-iTPN-B | IN.1K | CLIP-L | 224/16 | N | 87.4% |
Fast-iTPN-B | IN.1K | CLIP-L | 512/16 | N | 88.5% |
Fast-iTPN-B | IN.1K | CLIP-L | 512/16 | Y | 88.7% |
The Fast-iTPN models above are pre-trained only on ImageNet-1K; they will be available soon.
29/May/2023
The iTPN-L-CLIP/16 intermediate fine-tuned models are available (password: itpn): one pretrained on 21K and one fine-tuned on 1K. Evaluating the latter on ImageNet-1K obtains 89.2% accuracy.
28/Feb./2023
iTPN has been accepted to CVPR 2023!
08/Feb./2023
The iTPN-L-CLIP/16 model reaches 89.2% fine-tuning performance on ImageNet-1K.
configurations: intermediate fine-tuning on ImageNet-21K + 384 input size
21/Jan./2023
Our HiViT has been accepted to ICLR 2023!
HiViT: A Simpler and More Efficient Design of Hierarchical Vision Transformer
08/Dec./2022
Get checkpoints (password: abcd):
drive | iTPN-B-pixel | iTPN-B-CLIP | iTPN-L-pixel | iTPN-L-CLIP/16 |
---|---|---|---|---|
baidu drive | download | download | download | download |
google drive | download | download | download | download |
25/Nov./2022
The preprint is publicly available on arXiv.
- Ubuntu
- Python 3.7+
- CUDA 10.2+
- GCC 5+
- PyTorch 1.7+
- ImageNet-1K
- COCO2017
- ADE20K
Prepare the environment:
conda create --name itpn python=3.8 -y
conda activate itpn
git clone git@github.com:sunsmarterjie/iTPN.git
cd iTPN
pip install torch==1.7.1 torchvision==0.8.2 timm==0.3.2 tensorboard einops
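After installation, a quick sanity check of the environment can save debugging time later. The following is a minimal sketch, not part of the iTPN repository: it compares the running Python and the installed `torch` distribution against the minimums listed above (Python 3.7+, PyTorch 1.7+). The helper names `parse_version`, `meets_minimum`, and `check_environment` are hypothetical, and CUDA and GCC are not checked.

```python
import re
import sys
from importlib import metadata  # standard library from Python 3.8, as in the conda env above


def parse_version(text):
    """Turn a version string such as '1.7.1+cu102' into (1, 7, 1).

    Any local suffix after '+' is ignored, and parsing stops at the first
    component that does not start with a digit.
    """
    core = text.split("+", 1)[0]
    numbers = []
    for piece in core.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    return tuple(numbers)


def meets_minimum(installed, minimum):
    """Return True if `installed` is at least `minimum` (numeric comparison)."""
    return parse_version(installed) >= parse_version(minimum)


def check_environment():
    """Report whether Python and torch satisfy the minimums listed above."""
    python_version = ".".join(str(n) for n in sys.version_info[:3])
    report = {"python": meets_minimum(python_version, "3.7")}
    try:
        report["torch"] = meets_minimum(metadata.version("torch"), "1.7")
    except metadata.PackageNotFoundError:
        # torch is not installed in the active environment.
        report["torch"] = False
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'OK' if ok else 'missing or too old'}")
```

Versions are compared as integer tuples rather than strings, so that, for example, `1.13.0` is correctly treated as newer than `1.7`.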
iTPN supports pre-training with either pixels or CLIP as supervision. For the latter, please first download the CLIP models (we use the CLIP-B/16 and CLIP-L/14 models in the paper).
Table 1: Top-1 classification accuracy (%) by fine-tuning the pre-trained models on ImageNet-1K. We compare models of different levels and supervisions (e.g., with and without CLIP) separately.
Table 2: Visual recognition results (%) on COCO and ADE20K. Mask R-CNN (abbr. MR, 1x/3x) and Cascade Mask R-CNN (abbr. CMR, 1x) are used on COCO, and UPerHead with 512x512 input is used on ADE20K. For the base-level models, each cell of COCO results contains object detection (box) and instance segmentation (mask) APs. For the large-level models, the accuracy of 1x Mask R-CNN surpasses all existing methods.
iTPN is released under the License.
@inproceedings{tian2023integrally,
title={Integrally Pre-Trained Transformer Pyramid Networks},
author={Tian, Yunjie and Xie, Lingxi and Wang, Zhaozhi and Wei, Longhui and Zhang, Xiaopeng and Jiao, Jianbin and Wang, Yaowei and Tian, Qi and Ye, Qixiang},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={18610--18620},
year={2023}
}
@inproceedings{zhang2023hivit,
title={HiViT: A Simpler and More Efficient Design of Hierarchical Vision Transformer},
author={Zhang, Xiaosong and Tian, Yunjie and Xie, Lingxi and Huang, Wei and Dai, Qi and Ye, Qixiang and Tian, Qi},
booktitle={International Conference on Learning Representations},
year={2023}
}