This is the official implementation of the paper PrivImage: Differentially Private Synthetic Image Generation using Diffusion Models with Semantic-Aware Pretraining, which was accepted at USENIX Security 2024. This repository contains PyTorch training code and evaluation code. PrivImage is a Differential Privacy (DP) image generation tool that leverages the DP technique to generate synthetic data to replace the sensitive data, allowing organizations to share and utilize synthetic images without privacy concerns.
Synthetic images from PrivImage on CIFAR-10 and CelebA32&64 with
- 10/07/2024: We communicated with the author of DPSDA and have added an explanation for why the FID scores in the DPSDA table are lower than those reported in the original paper. Please refer to Paper.
Differential Privacy (DP) image data synthesis leverages the DP technique to generate synthetic data to replace the sensitive data, allowing organizations to share and utilize synthetic images without privacy concerns. Previous methods incorporate the advanced techniques of generative models and pre-training on a public dataset to produce exceptional DP image data, but suffer from unstable training and massive computational resource demands. This paper proposes a novel DP image synthesis method, termed PrivImage, which meticulously selects pre-training data, promoting the efficient creation of DP datasets with high fidelity and utility. PrivImage first establishes a semantic query function using a public dataset. Then, this function assists in querying the semantic distribution of the sensitive dataset, facilitating the selection of data from the public dataset with analogous semantics for pre-training. Finally, we pre-train an image generative model using the selected data and then fine-tune this model on the sensitive dataset using Differentially Private Stochastic Gradient Descent (DP-SGD). PrivImage allows us to train a lightly parameterized generative model, reducing the noise in the gradient during DP-SGD training and enhancing training stability. Extensive experiments demonstrate that PrivImage uses only 1% of the public dataset for pre-training and 7.6% of the parameters in the generative model compared to the state-of-the-art method, while achieving superior synthetic performance and conserving more computational resources. On average, PrivImage achieves 6.8% lower FID and 13.2% higher Classification Accuracy than the state-of-the-art method.
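The DP-SGD fine-tuning step mentioned above can be illustrated with a minimal pure-Python sketch. This is not the repository's training code: the function names (`clip_gradient`, `dp_sgd_gradient`) and the toy gradients are assumptions made for illustration, and a real run would use per-sample gradients from the diffusion model.

```python
import math
import random


def clip_gradient(grad, clip_norm):
    """Scale one per-sample gradient so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_gradient(per_sample_grads, clip_norm, noise_multiplier, rng):
    """Return the noisy average gradient for one DP-SGD step.

    Each per-sample gradient is clipped, the clipped gradients are summed,
    Gaussian noise with std = noise_multiplier * clip_norm is added to every
    coordinate, and the result is divided by the batch size.
    """
    batch_size = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    clipped = [clip_gradient(g, clip_norm) for g in per_sample_grads]
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    std = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, std) for s in summed]
    return [v / batch_size for v in noisy]


rng = random.Random(0)  # fixed seed so the toy run is repeatable
grads = [[3.0, 4.0], [0.3, 0.4]]  # hypothetical per-sample gradients
print(dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=1.0, rng=rng))
```

Because independent noise is added to every coordinate, the norm of the noise vector grows roughly with the square root of the number of parameters, which is one way to read the abstract's point that a lightly parameterized model sees less gradient noise.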
We provide an example of how to reproduce the CIFAR-10 results from our paper. The commands below assume a machine with 4 GPUs.
To set up the PrivImage environment, we use conda to manage our dependencies. Our development was conducted using CUDA 11.8.
Run the following commands to install PrivImage:
conda create -n privimage python=3.8 -y && conda activate privimage
pip install --upgrade pip
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
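After installation, a quick check of the pinned versions can catch a mismatched environment early. The sketch below uses only the standard library; the helper names (`matches_pin`, `check_pins`) are hypothetical and not part of this repository. It assumes that wheels from the cu118 index may report a local version suffix such as `+cu118`, which is ignored in the comparison.

```python
from importlib import metadata

# Pins copied from the pip command above.
PINNED = {"torch": "2.0.1", "torchvision": "0.15.2", "torchaudio": "2.0.2"}


def matches_pin(installed, pinned):
    """Compare versions while ignoring a local suffix such as '+cu118'."""
    return installed.split("+")[0] == pinned


def check_pins(pins=PINNED):
    """Map each package name to (installed_version_or_None, matches_pin)."""
    report = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        ok = installed is not None and matches_pin(installed, pinned)
        report[name] = (installed, ok)
    return report


for name, (installed, ok) in check_pins().items():
    print(f"{name}: installed={installed} pinned={PINNED[name]} ok={ok}")
```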
Download the files listed in the table and arrange them according to the file tree below.
Dataset & Files | Download | Usage |
---|---|---|
data/ImageNet_ILSVRC2012 | Official Link | Pretraining dataset |
data/CIFAR-10 | Google Drive | Sensitive dataset |
|--src/
    |--data/
        |--ImageNet_ILSVRC2012/
            |--train/
                |--n01440764/
                |--n01443537/
                ...
            |--val/
                |--n01440764/
                |--n01443537/
                ...
        |--CIFAR-10
            |--cifar-10-python.tar.gz
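Before preprocessing, it can help to confirm that the files are where the later commands expect them. The following is a small sketch, not a script shipped with the repository: `missing_paths` is a hypothetical helper, and the `/src` root is taken from the paths used in the commands in this README.

```python
from pathlib import Path

# Relative paths taken from the file tree above.
EXPECTED = [
    "data/ImageNet_ILSVRC2012/train",
    "data/ImageNet_ILSVRC2012/val",
    "data/CIFAR-10/cifar-10-python.tar.gz",
]


def missing_paths(src_root):
    """Return the expected entries that do not exist under src_root."""
    root = Path(src_root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


# An empty list means the layout matches the tree; adjust "/src" to your checkout.
print(missing_paths("/src"))
```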
Preprocess the datasets for faster training.
cd /src/PRIVIMAGE+D
# preprocess CIFAR-10
python dataset_tool.py --source /src/data/CIFAR-10/cifar-10-python.tar.gz --dest /src/data/CIFAR-10/cifar10.zip
python compute_fid_statistics.py --path /src/data/CIFAR-10/cifar10.zip --fid_dir /src/data/CIFAR-10/ --file cifar10.npz
# preprocess ImageNet and save it as a folder /src/data/ImageNet32_ILSVRC2012
sh pd.sh
First, train a semantic query function on the public dataset ImageNet.
cd /src/SemanticQuery
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 --nnodes=1 train_imagenet_classifier.py
After training, the checkpoints will be saved along with their corresponding accuracy on the validation set. You can choose the checkpoint with the highest accuracy to query the semantics.
python query_semantics.py --weight_file weight_path --tar_dataset cifar10 --data_dir /src/data/CIFAR-10 --num_words 5 --sigma1 50 --tar_num_classes 10
The query result will be saved as a .pth file in the folder /QueryResults.
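To give a rough idea of what a noisy semantic query does, here is a minimal sketch under stated assumptions. It is not the logic of query_semantics.py, which may differ: it assumes the query counts which public classes the classifier predicts for the sensitive images, adds Gaussian noise to each count, and keeps the top classes. The function name `noisy_top_semantics` and its arguments are hypothetical, and mapping `sigma` and `num_words` onto the `--sigma1` and `--num_words` flags is a guess.

```python
import random
from collections import Counter


def noisy_top_semantics(predicted_labels, num_public_classes, num_words, sigma, rng):
    """Pick the num_words public classes with the largest noisy counts.

    predicted_labels: public-class index predicted for each sensitive image.
    sigma: std of the Gaussian noise added to every class count.
    """
    counts = Counter(predicted_labels)
    noisy = [counts.get(c, 0) + rng.gauss(0.0, sigma) for c in range(num_public_classes)]
    ranked = sorted(range(num_public_classes), key=lambda c: noisy[c], reverse=True)
    return ranked[:num_words]


rng = random.Random(0)  # fixed seed so the toy run is repeatable
labels = [2] * 500 + [7] * 300 + [1] * 200  # toy predictions, not real data
print(noisy_top_semantics(labels, num_public_classes=10, num_words=3, sigma=50.0, rng=rng))
```

A larger `sigma` hides more about the sensitive data but makes the selected classes less faithful to the true distribution, so the noise level trades privacy against the relevance of the pre-training subset.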
Second, pretrain the diffusion model with the query result. Change the data_dir parameters in /src/Pre-training/configs/cifar10_32/pretrain_s.yaml to match your setup.
cd /src/Pre-training
CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --mode train --worker_dir pt_dir
After training, the checkpoint will be saved as /src/Pre-training/pt_dir/checkpoints/final_checkpoint.pth.
Third, fine-tune the pretrained model on the sensitive dataset. Change the data_dir and ckpt parameters in /src/PRIVIMAGE+D/configs/cifar10_32/train_eps_10.0_s.yaml to match your setup.
cd /src/PRIVIMAGE+D
CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --mode train --worker_dir ft_dir
The FID of the synthetic images will be saved in /src/PRIVIMAGE+D/ft_dir/stdout.txt.
Use the trained PrivImage model to generate 50,000 images for training classifiers.
CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --mode eval --worker_dir ft_dir/sample50000 --config configs/cifar10_32/sample_s.yaml model.ckpt=/src/PRIVIMAGE+D/ft_dir/checkpoints/final_checkpoint.pth
cd /src/Evaluation
python downstream_classification.py --out_dir /src/PRIVIMAGE+D/ft_dir --train_dir /src/PRIVIMAGE+D/ft_dir/sample50000/samples --test_dir data_dir --dataset cifar10
The Classification Accuracy (CA) of the trained classifiers on the test set will be saved to /src/PRIVIMAGE+D/ft_dir/evaluation_downstream_acc_log.txt.
If you have any problems, please feel free to contact Kecen Li (likecen2023@ia.ac.cn) and Chen Gong (ChenG_abc@outlook.com).
The code for training the diffusion models with DP-SGD is based on DPDM.
@inproceedings{li2023privimage,
author = {Kecen Li and Chen Gong and Zhixiang Li and Yuzhong Zhao and Xinwen Hou and Tianhao Wang},
title = {{PrivImage}: Differentially Private Synthetic Image Generation using Diffusion Models with {Semantic-Aware} Pretraining},
booktitle = {33rd USENIX Security Symposium (USENIX Security 24)},
year = {2024},
isbn = {978-1-939133-44-1},
address = {Philadelphia, PA},
pages = {4837--4854}
}