This folder contains the implementation of Slide Transformer based on the PVT and Swin models for image classification.
- Python 3.9
- PyTorch == 1.11.0
- torchvision == 0.12.0
- numpy
- timm == 0.4.12
- einops
- yacs
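As a rough sketch, the pinned dependencies above could be installed with pip; the exact PyTorch/torchvision install command depends on your CUDA version, so adjust it as needed for your setup:

# Sketch of an environment setup (assumes an existing Python 3.9 environment);
# pick the torch/torchvision wheels matching your CUDA version.
pip install torch==1.11.0 torchvision==0.12.0
pip install timm==0.4.12 numpy einops yacs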
The ImageNet dataset should be prepared as follows:
$ tree data
imagenet
├── train
│   ├── class1
│   │   ├── img1.jpeg
│   │   ├── img2.jpeg
│   │   └── ...
│   ├── class2
│   │   ├── img3.jpeg
│   │   └── ...
│   └── ...
└── val
    ├── class1
    │   ├── img4.jpeg
    │   ├── img5.jpeg
    │   └── ...
    ├── class2
    │   ├── img6.jpeg
    │   └── ...
    └── ...
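As a quick sanity check (assuming the dataset root is data/imagenet as in the tree above), you can count the class directories; for ImageNet-1K, both train and val should contain 1000 of them:

# Count class folders under train and val (expects 1000 each for ImageNet-1K);
# adjust data/imagenet to wherever you actually place the dataset.
find data/imagenet/train -mindepth 1 -maxdepth 1 -type d | wc -l
find data/imagenet/val   -mindepth 1 -maxdepth 1 -type d | wc -l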
- To train Slide-PVT/Slide-Swin on ImageNet from scratch, run:
python -m torch.distributed.launch --nproc_per_node=8 main.py --cfg <path-to-config-file> --data-path <imagenet-path> --output <output-path>
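For example, an 8-GPU run might look like the following; the config file name here is only a hypothetical placeholder, so substitute one of the YAML files actually shipped with this repo:

# Hypothetical example: train a Slide-Swin model on 8 GPUs.
# configs/slide_swin_tiny.yaml is a placeholder name, not necessarily a real file in this repo.
python -m torch.distributed.launch --nproc_per_node=8 main.py \
    --cfg configs/slide_swin_tiny.yaml \
    --data-path data/imagenet \
    --output output/slide_swin_tiny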
If you find this repo helpful, please consider citing our work.
@inproceedings{pan2023slide,
  title={Slide-Transformer: Hierarchical Vision Transformer with Local Self-Attention},
  author={Pan, Xuran and Ye, Tianzhu and Xia, Zhuofan and Song, Shiji and Huang, Gao},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={2082--2091},
  year={2023}
}