VFedPCA+VFedAKPCA: Vertical Federated Principal Component Analysis and Its Kernel Extension on Feature-wise Distributed Data
This is the official source code for the Paper: Vertical Federated Principal Component Analysis and Its Kernel Extension on Feature-wise Distributed Data.
Please cite our paper if you find our code or paper useful:
@article{cheung2022vertical,
title={Vertical Federated Principal Component Analysis and Its Kernel Extension on Feature-wise Distributed Data},
author={Cheung, Yiu-ming and Jiang, Juyong and Yu, Feng and Lou, Jian},
journal={arXiv preprint arXiv:2203.01752},
year={2022}
}
Despite enormous research interest and rapid application of federated learning (FL) to various areas, existing studies mostly focus on supervised federated learning under the horizontally partitioned local dataset setting. This paper will study the unsupervised FL under the vertically partitioned dataset setting. Accordingly, we propose the federated principal component analysis for vertically partitioned dataset (VFedPCA) method, which reduces the dimensionality across the joint datasets over all the clients and extracts the principal component feature information for downstream data analysis. We further take advantage of the nonlinear dimensionality reduction and propose the vertical federated advanced kernel principal component analysis (VFedAKPCA) method, which can effectively and collaboratively model the nonlinear nature existing in many real datasets. In addition, we study two communication topologies. The first is a server-client topology where a semi-trusted server coordinates the federated training, while the second is the fully-decentralized topology which further eliminates the requirement of the server by allowing clients themselves to communicate with their neighbors. Extensive experiments conducted on five types of real-world datasets corroborate the efficacy of VFedPCA and VFedAKPCA under the vertically partitioned FL setting.
Figure 1.: Server-Clients Architecture
VFedPCA+VFedAKPCA
└── case // Case Studies
└── figs // Save experimental results' figures in '.eps' / '.png' format
├── img_name*.eps
└── img_name*.png
├── main.py
├── model.py
└── utils.py
├── dataset // Put downloaded dataset in this folder
└── figs // Save experimental results' figures in '.eps' / '.png' format
├── img_name*.eps
└── img_name*.png
├── README.md
├── main.py // Experiment on Structured Dataset
├── model.py
└── utils.py
- python = 3.8.8
- numpy = 1.20.1
- pandas = 1.2.4
- scipy = 1.6.2
- scikit-learn = 0.24.1
- imageio = 2.9.0
To demonstrate the superiority of our method, we utilized FIVE types of real-world datasets coming with distinct nature.
- structured datasets from different domains;
- medical image dataset;
- face image dataset;
- gait image dataset;
- person re-identification image dataset.
Step 1: Download Dataset from the Google Drive URL
Step 2: Specify Dataset Path by Command Argument
$ python main.py --data_path="./dataset/xxx"
We conduct extensive experiments on structured datasets to exmaines the effect of feature size, local iterations, warm-start power iterations, and weight scaling method on structed datasets. Furthermore, we investigate some case studies with image dataset to demonstrate the effectiveness of VFedPCA and VFedAKPCA.
The example is as follows. More details command usage can be check by python main.py --help
.
$ python main.py --data_path './dataset/College.csv' /
--p_list [3, 5, 10] /
--iter_list [10, 10, 10] /
--period_num 10 /
--sample_num 777 # use the whole by default
The example is as follows. More details command usage can be check by python main.py --help
.
$ python main.py --synthetic / # start use synthetic dataset
--pattern mixture / # or single
--shape [20000, 4000] # M features x N samples
The example is as follows. More details command usage can be check by python main.py --help
.
$ cd case # change into case folder
$ python main.py --data_path '../dataset/Image/DeepLesion' /
--client_num 8 /
--iterations 100 /
--re_size 512 /
--kernel sigmoid
The experimental result on structured datasets (GlaucomaM), the distance error after multiple communication has been significantly reduced and converge compared with the un-communicated situation at 0-th period.
Figure 2. The results of VFedPCA on structured datasets (GlaucomaM).
Here is some of our case studies comparative experiment results on six image datasets: YaleFace (center-light and surprised), CasiaGait (sequence 1 and sequence 10), DeepLesion and CUHK03 (from left to right), as shown in Figure 3.
Figure 3. The results of PCA(a)/AKPCA(b) on the un-splitted data, VFedPCA(a)/VFedAKPCA(b) on the isolated data, VFedAvgPCA(a)/VFedAvgAKPCA(b) (Without Weight Scaling Method) on the isolated data, PCA(a)/AKPCA(b) on the isolated data (from top to bottom).
Furthermore, we investigate three types of decentralized topologies in Vertical Federated Learning with PCA and AKPCA (ours). The architecture of decentralized topologies are (a) Fully Decentralized, (b) Ring Decentralized, and (c) Star Decentralized, as shown in Figure 4. The experimental results are demonstrated in Figure 5.
Figure 4. The three different types of decentralized topology.
Figure 5. The results of VFedPCA with the central coordination server, fully decentralized, ring decentralized, and start decentralized, from top to bottom respectively, on the isolated data from each clients.