PGSeg is a framework for learning semantic segmentation from image-text pairs alone. It introduces prototypical knowledge to provide explicit supervision for the group tokens, which perform bottom-up hierarchical spatial grouping of semantically related visual regions. This repository is the official implementation of PGSeg introduced in the paper:
Uncovering Prototypical Knowledge for Weakly Open-Vocabulary Semantic Segmentation, NeurIPS 2023,
Fei Zhang, Tianfei Zhou, Boyang Li, Hao He, Chaofan Ma, Tianjiao Zhang, Jiangchao Yao, Ya Zhang, Yanfeng Wang.
- Comparison with SAM
Since installing this environment can be quite troublesome, a conda-packed environment is also provided here, in case you run into strange bugs caused by an incompatible environment.
- Python 3.7
- PyTorch 1.8
- webdataset 0.1.103
- mmsegmentation 0.18.0
- timm 0.4.12
Instructions:
conda create -n groupvit python=3.7 -y
conda activate groupvit
conda install pytorch==1.8.0 torchvision==0.9.0 cudatoolkit=11.1 -c pytorch -c conda-forge
pip install mmcv-full==1.3.14 -f https://download.openmmlab.com/mmcv/dist/cu111/torch1.8.0/index.html
pip install mmsegmentation==0.18.0
pip install webdataset==0.1.103
pip install timm==0.4.12
git clone https://github.com/NVIDIA/apex
cd apex && pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
pip install opencv-python==4.4.0.46 termcolor==1.1.0 diffdist einops omegaconf
pip install nltk ftfy regex tqdm
Zero-shot segmentation performance (mIoU) and checkpoints:

| config | Pascal VOC | Pascal Context | COCO | Checkpoints |
| --- | --- | --- | --- | --- |
| GCC | 49.0 | 20.6 | 22.9 | - |
| GCC + RedCaps | 53.2 | 23.8 | 28.7 | pre-trained weights |
During training, we use webdataset for scalable data loading. To convert image-text pairs into the webdataset format, we use the img2dataset tool to download and preprocess the datasets.
For inference, we use mmsegmentation for semantic segmentation testing, evaluation and visualization on Pascal VOC, Pascal Context and COCO datasets.
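For reference, here is a minimal sketch of how image-text shards in this layout can be read with webdataset (assuming the `wds.WebDataset` pipeline API; the pinned 0.1.103 release may expose a slightly different interface, and the shard pattern below is only an example):

```python
import webdataset as wds

# Example shard pattern; adjust to the shards you actually downloaded.
shards = "local_data/gcc3m_shards/gcc-train-{000000..000436}.tar"

dataset = (
    wds.WebDataset(shards)
    .decode("pil")               # decode images into PIL.Image objects
    .to_tuple("jpg;png", "txt")  # yield (image, caption) pairs
)

for image, caption in dataset:
    print(image.size, caption[:60])
    break
```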
The overall file structure is as follows:
GroupViT
├── local_data
│ ├── gcc3m_shards
│ │ ├── gcc-train-000000.tar
│ │ ├── ...
│ │ ├── gcc-train-000436.tar
│ ├── gcc12m_shards
│ │ ├── gcc-conceptual-12m-000000.tar
│ │ ├── ...
│ │ ├── gcc-conceptual-12m-001943.tar
│ ├── yfcc14m_shards
│ │ ├── yfcc14m-000000.tar
│ │ ├── ...
│ │ ├── yfcc14m-001888.tar
│ ├── redcap12m_shards
│ │ ├── redcap12m-000000.tar
│ │ ├── ...
│ │ ├── redcap12m-001211.tar
│ ├── imagenet_shards
│ │ ├── imagenet-val-000000.tar
│ │ ├── ...
│ │ ├── imagenet-val-000049.tar
│ ├── VOCdevkit
│ │ ├── VOC2012
│ │ │ ├── JPEGImages
│ │ │ ├── SegmentationClass
│ │ │ ├── ImageSets
│ │ │ │ ├── Segmentation
│ │ ├── VOC2010
│ │ │ ├── JPEGImages
│ │ │ ├── SegmentationClassContext
│ │ │ ├── ImageSets
│ │ │ │ ├── SegmentationContext
│ │ │ │ │ ├── train.txt
│ │ │ │ │ ├── val.txt
│ │ │ ├── trainval_merged.json
│ │ ├── VOCaug
│ │ │ ├── dataset
│ │ │ │ ├── cls
│ ├── coco
│ │ ├── images
│ │ │ ├── train2017
│ │ │ ├── val2017
│ │ ├── annotations
│ │ │ ├── train2017
│ │ │ ├── val2017
The instructions for preparing each dataset are as follows.
Please download the training split annotation file from Conceptual Captions 3M and name it `gcc3m.tsv`.
Then run `img2dataset` to download the image-text pairs and save them in the webdataset format.
sed -i '1s/^/caption\turl\n/' gcc3m.tsv
img2dataset --url_list gcc3m.tsv --input_format "tsv" \
--url_col "url" --caption_col "caption" --output_format webdataset \
--output_folder local_data/gcc3m_shards \
--processes_count 16 --thread_count 64 \
--image_size 512 --resize_mode keep_ratio --resize_only_if_bigger True \
--enable_wandb True --save_metadata False --oom_shard_count 6
rename -d 's/^/gcc-train-/' local_data/gcc3m_shards/*
Please refer to img2dataset CC3M tutorial for more details.
Please download the annotation file from Conceptual Captions 12M and name it `gcc12m.tsv`.
Then run `img2dataset` to download the image-text pairs and save them in the webdataset format.
sed -i '1s/^/caption\turl\n/' gcc12m.tsv
img2dataset --url_list gcc12m.tsv --input_format "tsv" \
--url_col "url" --caption_col "caption" --output_format webdataset \
--output_folder local_data/gcc12m_shards \
--processes_count 16 --thread_count 64 \
--image_size 512 --resize_mode keep_ratio --resize_only_if_bigger True \
--enable_wandb True --save_metadata False --oom_shard_count 6
rename -d 's/^/gcc-conceptual-12m-/' local_data/gcc12m_shards/*
Please refer to img2dataset CC12M tutorial for more details.
Please follow the CLIP Data Preparation instructions to download the YFCC14M subset.
wget https://openaipublic.azureedge.net/clip/data/yfcc100m_subset_data.tsv.bz2
bunzip2 yfcc100m_subset_data.tsv.bz2
Then run the preprocessing script to create the subset SQLite database and annotation TSV file. This may take a while.
python convert_dataset/create_subset.py --input-dir . --output-dir . --subset yfcc100m_subset_data.tsv
This script will create two files: an SQLite database called `yfcc100m_dataset.sql` and an annotation TSV file called `yfcc14m_dataset.tsv`.
Then follow the YFCC100M Download Instructions to download the dataset and its metadata file.
pip install git+https://gitlab.com/jfolz/yfcc100m.git
mkdir -p yfcc100m_meta
python -m yfcc100m.convert_metadata . -o yfcc100m_meta --skip_verification
mkdir -p yfcc100m_zip
python -m yfcc100m.download yfcc100m_meta -o yfcc100m_zip
Finally convert the dataset into the webdataset format.
python convert_dataset/convert_yfcc14m.py --root yfcc100m_zip --info yfcc14m_dataset.tsv --shards yfcc14m_shards
Please download the annotation file from RedCaps.
wget https://www.dropbox.com/s/cqtdpsl4hewlli1/redcaps_v1.0_annotations.zip?dl=1
unzip redcaps_v1.0_annotations.zip
Then run the preprocessing script and `img2dataset` to download the image-text pairs and save them in the webdataset format.
python convert_dataset/process_redcaps.py annotations redcaps12m_meta/redcaps12m.parquet --num-split 16
img2dataset --url_list redcaps12m_meta/ --input_format "parquet" \
--url_col "URL" --caption_col "TEXT" --output_format webdataset \
--output_folder local_data/redcap12m_shards \
--processes_count 16 --thread_count 64 \
--image_size 512 --resize_mode keep_ratio --resize_only_if_bigger True \
--enable_wandb True --save_metadata False --oom_shard_count 6
rename -d 's/^/redcap12m-/' local_data/redcap12m_shards/*
Please follow the webdataset ImageNet Example to convert ImageNet into the webdataset format.
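If you would rather script the conversion yourself, the sketch below shows the general `ShardWriter` pattern (the paths, directory layout, class-index mapping, and shard size are only illustrative; the linked example is the authoritative recipe):

```python
import os
import webdataset as wds

# Illustrative paths; adapt to your local ImageNet validation layout
# (assumed here to be val/<wnid>/<image>.JPEG).
imagenet_val_dir = "path/to/imagenet/val"
output_pattern = "local_data/imagenet_shards/imagenet-val-%06d.tar"
os.makedirs("local_data/imagenet_shards", exist_ok=True)

# Map each synset directory name to an integer class index.
wnids = sorted(os.listdir(imagenet_val_dir))
wnid_to_idx = {wnid: idx for idx, wnid in enumerate(wnids)}

# 1000 images per shard gives 50 shards for the 50k validation images.
with wds.ShardWriter(output_pattern, maxcount=1000) as sink:
    for wnid in wnids:
        class_dir = os.path.join(imagenet_val_dir, wnid)
        for fname in sorted(os.listdir(class_dir)):
            with open(os.path.join(class_dir, fname), "rb") as f:
                image_bytes = f.read()
            sink.write({
                "__key__": os.path.splitext(fname)[0],
                "jpg": image_bytes,                      # raw JPEG bytes
                "cls": str(wnid_to_idx[wnid]).encode(),  # class index as bytes
            })
```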
Please follow the MMSegmentation Pascal VOC Preparation instructions to download and setup the Pascal VOC dataset.
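The evaluation code reads the dataset location from the MMSegmentation-style configs under `segmentation/configs/_base_/datasets`. As an illustration only (the actual filename and fields in this repo may differ), pointing a Pascal VOC config at the layout above could look roughly like this:

```python
# Illustrative excerpt of an MMSegmentation dataset config, e.g.
# segmentation/configs/_base_/datasets/pascal_voc12.py (actual file may differ).
dataset_type = 'PascalVOCDataset'
data_root = 'local_data/VOCdevkit/VOC2012'  # point this at your local copy

data = dict(
    test=dict(
        type=dataset_type,
        data_root=data_root,
        img_dir='JPEGImages',
        ann_dir='SegmentationClass',
        split='ImageSets/Segmentation/val.txt',
    ),
)
```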
Please refer to the MMSegmentation Pascal Context Preparation instructions to download and setup the Pascal Context dataset.
COCO is an object detection dataset with instance segmentation annotations. To evaluate PGSeg, we merge all the instance masks of each category and generate semantic segmentation maps. To generate the semantic segmentation maps, please follow MMSegmentation's documentation to download the COCO-Stuff-164k dataset first, and then run the following:
python convert_dataset/convert_coco.py local_data/data/coco/ -o local_data/data/coco/
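The script above is what we actually use; purely as an illustration of the underlying idea (painting every instance mask of a category into one semantic map), a pycocotools sketch could look like this (the annotation path is a placeholder):

```python
import numpy as np
from pycocotools.coco import COCO

# Placeholder annotation path; point this at the real instances file.
coco = COCO("local_data/coco/annotations/instances_val2017.json")

img_id = coco.getImgIds()[0]
info = coco.loadImgs(img_id)[0]

# 255 is used here as an "ignore" value for unlabeled pixels.
sem_map = np.full((info["height"], info["width"]), 255, dtype=np.uint8)

# Paint every instance mask with its category id; later instances overwrite
# earlier ones where they overlap.
for ann in coco.loadAnns(coco.getAnnIds(imgIds=img_id)):
    mask = coco.annToMask(ann).astype(bool)
    sem_map[mask] = ann["category_id"]
```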
ImageNet-S is derived from ImageNet and annotated by human labelers with dense segmentation masks. It comes in three versions that differ in the scale of the training set. Please refer to ImageNetS for further details.
Note that inference on this dataset is quite memory-intensive: without some modifications to the scripts in the `mmseg` package (mainly tricks to control CUDA memory usage), evaluation will run out of CUDA memory. If you would like to try it, please contact me here.
Remember to set the correct dataset paths in the config files under `segmentation/configs/_base_/datasets`!
We used 4 NVIDIA A100 GPUs (80GB) for pre-training in our paper.
Since the first 30 training epochs of PGSeg follow the same procedure as GroupViT, fine-tuning from a GroupViT pre-trained checkpoint also works for PGSeg, and you should observe much faster convergence; see the warm-start sketch below.
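A minimal sketch of such a warm start, assuming you load the GroupViT checkpoint into the model yourself before launching training (the helper below is hypothetical and not part of this repo's API):

```python
import torch
from torch import nn

def warm_start_from_groupvit(model: nn.Module, ckpt_path: str) -> None:
    """Copy GroupViT pre-trained weights into a PGSeg model, keeping only
    tensors whose names and shapes match the target model."""
    checkpoint = torch.load(ckpt_path, map_location='cpu')
    # GroupViT-style checkpoints often nest the weights under a 'model' key;
    # otherwise treat the file as a raw state dict.
    state_dict = checkpoint.get('model', checkpoint)
    model_state = model.state_dict()
    filtered = {
        k: v for k, v in state_dict.items()
        if k in model_state and v.shape == model_state[k].shape
    }
    missing, _ = model.load_state_dict(filtered, strict=False)
    print(f'loaded {len(filtered)} tensors; {len(missing)} keys left at init')
```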
In the commands below, `/path/to/config` is `configs/pgseg_specific.yml`.
Train on a single node:
(node0)$ ./tools/dist_launch.sh main_pg_seg.py /path/to/config $GPUS_PER_NODE
Train on multiple nodes:
(node0)$ ./tools/dist_mn_launch.sh main_pg_seg.py /path/to/config $NODE_RANK $NUM_NODES $GPUS_PER_NODE $MASTER_ADDR
(node1)$ ./tools/dist_mn_launch.sh main_pg_seg.py /path/to/config $NODE_RANK $NUM_NODES $GPUS_PER_NODE $MASTER_ADDR
Evaluate a trained checkpoint:
./tools/dist_launch.sh main_pg_seg.py /path/to/config $NUM_GPUS --resume /path/to/checkpoint --eval
Evaluate zero-shot segmentation on Pascal VOC (the default dataset in the config), Pascal Context, and COCO, respectively:
./tools/dist_launch.sh main_seg.py /path/to/config $NUM_GPUS --resume /path/to/checkpoint
./tools/dist_launch.sh main_seg.py /path/to/config $NUM_GPUS --resume /path/to/checkpoint --opts evaluate.seg.cfg segmentation/configs/_base_/datasets/pascal_context.py
./tools/dist_launch.sh main_seg.py /path/to/config $NUM_GPUS --resume /path/to/checkpoint --opts evaluate.seg.cfg segmentation/configs/_base_/datasets/coco.py
If you find our work useful in your research, please cite:
@article{zhang2023uncovering,
title={Uncovering Prototypical Knowledge for Weakly Open-Vocabulary Semantic Segmentation},
author={Zhang, Fei and Zhou, Tianfei and Li, Boyang and He, Hao and Ma, Chaofan and Zhang, Tianjiao and Yao, Jiangchao and Zhang, Ya and Wang, Yanfeng},
journal={arXiv preprint arXiv:2310.19001},
year={2023}
}
Many thanks to the GroupViT codebase, on which this implementation is built!