We provide an example for training a linear model in vertical federated learning:
```bash
python3 ../main.py --cfg linear_model/baseline/vertical_fl.yaml
```
You can specify customized configurations in `vertical_fl.yaml`, such as `data.type` and `federate.total_round_num`.
More details of the provided example can be found in the Tutorial.
Note that FederatedScope only provides an `abstract_paillier`; users can refer to pyphe for a concrete implementation, or adopt other homomorphic encryption algorithms.
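For reference, the sketch below shows the additively homomorphic operations such an implementation relies on, using the `phe` (pyphe) package; the values and the 2048-bit key size are illustrative:

```python
# A minimal sketch using the `phe` (python-paillier) package,
# installable via `pip install phe`.
from phe import paillier

# The task party generates a keypair and shares only the public key.
public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

# Encrypt two gradient values; ciphertexts support addition with each
# other and multiplication by plaintext scalars, without decryption.
enc_g1 = public_key.encrypt(0.25)
enc_g2 = public_key.encrypt(-0.75)
enc_sum = enc_g1 + enc_g2   # homomorphic addition
enc_scaled = enc_g1 * 3     # plaintext-scalar multiplication

# Only the private-key holder (the task party) can decrypt.
print(private_key.decrypt(enc_sum))     # -> -0.5
print(private_key.decrypt(enc_scaled))  # -> 0.75
```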
FederatedScope-Tree is built for training tree-based models in vertical federated learning, such as XGB, GBDT, RF, etc. We provide a hands-on tutorial here.
We categorize tree-based models in vertical FL into two types, i.e., feature-gathering tree-based models and label-scattering tree-based models, according to their communication and computation protocols, and provide the corresponding implementations.
- Feature-gathering tree-based models: The data parties (i.e., the participants who don't hold labels) send the orders of their feature values to the task party (i.e., the participant who holds labels) for training, and then the task party calculates and compares the information gains to determine the split rules at the nodes.
- Label-scattering tree-based models: The task party sends the label-related information (such as gradient and hessian values) to the data parties for calculating the information gains, and all the gains are returned to the task party for comparison (see the sketch after this list).
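To make the two protocols concrete, here is a minimal message-flow sketch; the party objects and their methods are hypothetical, not the FederatedScope API:

```python
# Hypothetical message flow of one split-finding round in each protocol.

def feature_gathering_round(task_party, data_parties):
    # Data parties reveal (possibly protected) feature-value orders;
    # the task party does all the gain computation itself.
    orders = [party.send_feature_orders() for party in data_parties]
    return task_party.compute_and_compare_gains(orders)

def label_scattering_round(task_party, data_parties):
    # The task party scatters (typically encrypted) label-related
    # statistics; data parties compute candidate gains locally.
    stats = task_party.send_grad_and_hess()
    gains = [party.compute_information_gains(stats) for party in data_parties]
    return task_party.compare_returned_gains(gains)
```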
You can set the model and algorithm in the YAML file as below:
```yaml
use_gpu: False  # Whether to use GPU
device: 0  # Which GPU to use
backend: torch
# Federated learning related options
federate:
  mode: standalone  # `standalone` or `distributed`
  client_num: 2  # Number of clients
model:
  type: xgb_tree  # xgb_tree or gbdt_tree or random_forest
  # Related hyperparameters
  lambda_: 0.1
  gamma: 0
  num_of_trees: 10
  max_tree_depth: 6
# Dataset related options
data:
  root: data/  # Root directory where the data is stored
  type: abalone  # Dataset name
  splits: [0.8, 0.2]  # Splits for training and testing
dataloader:
  type: raw  # Personalized DataLoader
  batch_size: 4177
criterion:
  type: RegressionMSELoss  # CrossEntropyLoss for binary classification
# Trainer related options
trainer:
  # Trainer type
  type: verticaltrainer
# Vertical FL related options
vertical:
  use: True
  dims: [4, 8]  # Feature split for two clients: one holds features 0~3,
                # and the other holds features 4~7
  feature_subsample_ratio: 1.0  # Default = 1.0; the proportion of features
                                # used for training per client
  algo: 'xgb'  # 'xgb' or 'gbdt' or 'rf'
  data_size_for_debug: 0  # Use a subset for debugging in VFL;
                          # 0 means using the entire dataset (debug mode disabled)
# Evaluation related options
eval:
  # Frequency of evaluation
  freq: 3
  best_res_update_round_wise_key: test_loss
```
Users can specify `model.type` and `vertical.algo` to use different models. For example:
- XGBoost: `model.type = xgb_tree` and `vertical.algo = xgb`
- GBDT: `model.type = gbdt_tree` and `vertical.algo = gbdt`
- Random Forest: `model.type = random_forest` and `vertical.algo = rf`
For feature-gathering models, we provide two kinds of privacy protection algorithms to protect the order of feature values.
One protection method is differential privacy (DP). Users can add the following configurations:
```yaml
vertical:
  protect_object: 'feature_order'
  protect_method: 'dp'
  protect_args: [{'bucket_num': 50, 'epsilon': 3}]
  # protect_args: [{'bucket_num': 50}]
```
`'bucket_num': b` means that we partition the feature orders into $b$ buckets (i.e., only the bucket-level order is revealed), and `'epsilon'` is the privacy budget of DP (it can be omitted, as in the commented line above).
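A minimal sketch of the underlying bucketing idea (not the exact library implementation, and omitting the DP randomization controlled by `'epsilon'`): partition the sorted sample indices into `bucket_num` buckets and shuffle within each bucket, so that only the bucket-level order is revealed.

```python
import numpy as np

def protect_feature_order(feature_values, bucket_num, rng=None):
    """Reveal only bucket-level order: sort sample indices by feature
    value, cut them into `bucket_num` buckets, and shuffle inside each
    bucket. A sketch of the idea, not the library implementation."""
    rng = rng or np.random.default_rng()
    order = np.argsort(feature_values)           # exact order (sensitive)
    buckets = np.array_split(order, bucket_num)  # coarse bucket-level order
    for bucket in buckets:
        rng.shuffle(bucket)                      # hide the order inside a bucket
    return np.concatenate(buckets)

protected_order = protect_feature_order(np.random.rand(1000), bucket_num=50)
```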
Another protection method is `op_boost` (with `global` or `adjusting` variants) as follows:
```yaml
vertical:
  protect_object: 'feature_order'
  protect_method: 'op_boost'
  protect_args: [{'algo': 'global', 'lower_bound': lb, 'upper_bound': ub, 'epsilon': 2}]
  # protect_args: [{'algo': 'adjust', 'lower_bound': lb, 'upper_bound': ub, 'epsilon_prt': 2, 'epsilon_ner': 2, 'partition_num': pb}]
```
- `global` means we map the data into the integers in $[lb, ub]$ by an affine transformation. Each mapped value $x$ is then randomly re-mapped to $i\in[lb, ub]$ with probability $$p=\frac{e^{-|x-i|\cdot\epsilon/2}}{\sum_{j\in[lb, ub]} e^{-|x-j|\cdot\epsilon/2}}.$$
- `adjusting` means we map the data into the integers in $[lb, ub]$, and then partition $[lb, ub]$ evenly into $pb$ buckets. For a value $x$ inside the $m$-th bucket, we first randomly select a bucket $i$ with probability $$p=\frac{e^{-|m-i|\cdot\epsilon_{prt}/2}}{\sum_{j=1}^{pb} e^{-|m-j|\cdot\epsilon_{prt}/2}},$$ then randomly select a value $v$ in the selected bucket with probability $$p=\frac{e^{-|x-v|\cdot\epsilon_{ner}/2}}{\sum_{j\in \mathrm{bucket}\ i} e^{-|x-j|\cdot\epsilon_{ner}/2}}.$$

When $lb$ and $ub$ are close and $\epsilon$, $\epsilon_{prt}$, $\epsilon_{ner}$, and $pb$ are small, the privacy protection is strong but the model utility can be affected.
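As an illustration, the `global` mechanism can be prototyped by sampling directly from the distribution above; the function below is a hypothetical sketch, not the library implementation:

```python
import numpy as np

def opboost_global(x, lb, ub, epsilon, x_min, x_max, rng=None):
    """Re-map one raw value `x` following the `global` mechanism:
    affine-map [x_min, x_max] onto the integers in [lb, ub], then draw
    the output i with probability proportional to exp(-|x - i| * eps / 2)."""
    rng = rng or np.random.default_rng()
    mapped = lb + (x - x_min) / (x_max - x_min) * (ub - lb)  # affine transform
    candidates = np.arange(lb, ub + 1)
    weights = np.exp(-np.abs(mapped - candidates) * epsilon / 2)
    return rng.choice(candidates, p=weights / weights.sum())

print(opboost_global(x=3.7, lb=1, ub=10, epsilon=2.0, x_min=0.0, x_max=10.0))
```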
In `protect_args`, you can also add `bucket_num` to accelerate training, which is similar to the `hist` algorithm in XGBoost.
The above two protection methods were proposed in "FederBoost: Private Federated Learning for GBDT" and "OpBoost: A Vertical Federated Tree Boosting Framework Based on Order-Preserving Desensitization".
For label-scattering models, we provide the privacy protection algorithm proposed in "SecureBoost: A Lossless Federated Learning Framework". Users can add the following configurations:
```yaml
vertical:
  mode: 'label_based'
  protect_object: 'grad_and_hess'
  protect_method: 'he'
  key_size: ks
  protect_args: [{'bucket_num': b}]
```
Specifically, the task party encrypts the label-related information (such as grad and hess for XGBoost, grad and indicator vector for GBDT) and sends it to the data parties. Each data party sorts the encrypted information by the order of its feature values, partitions it into $b$ buckets, and returns the homomorphically aggregated sum of each bucket to the task party, which decrypts the sums to compute the information gains.
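Assuming the `phe` package as the HE backend (variable names and sizes are illustrative), the core exchange can be sketched as:

```python
import numpy as np
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

# Task party: encrypt the per-sample gradients and scatter the ciphertexts.
grads = np.random.randn(100)
enc_grads = [public_key.encrypt(float(g)) for g in grads]

# Data party: sort the ciphertexts by its local feature-value order, then
# sum each bucket homomorphically -- it never sees plaintext gradients.
order = np.argsort(np.random.rand(100))
enc_sorted = [enc_grads[i] for i in order]
bucket_num, bucket_size = 10, 10
enc_bucket_sums = []
for k in range(bucket_num):
    bucket = enc_sorted[k * bucket_size:(k + 1) * bucket_size]
    total = bucket[0]
    for ciphertext in bucket[1:]:
        total = total + ciphertext  # addition on ciphertexts
    enc_bucket_sums.append(total)

# Task party: decrypt only the bucket-level sums and compute gains from them.
bucket_sums = [private_key.decrypt(s) for s in enc_bucket_sums]
```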
For the inference procedure, we also provide different options. Users can specify `vertical.eval` to apply secret sharing (`ss`), homomorphic encryption (`he`), or no protection method (`''`).
`vertical.eval: ''` means the basic procedure: for each tree, when the task party performs inference, for each internal node it checks the owner of the split feature and sends a signal to that owner. The owner compares the test data with the split value to get the indicator vectors for the left and right children, then sends them to the task party. The task party continues with the next node until the leaf nodes are reached.
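A per-sample sketch of this interactive traversal, with hypothetical party and node objects (the actual protocol exchanges indicator vectors in batch):

```python
# Hypothetical objects: `node` has is_leaf/feature_owner/feature_idx/
# split_value/left/right/weight; `parties[k].compare(...)` answers whether
# the sample goes to the left child.

def predict_one_tree(root, parties, sample_id):
    node = root
    while not node.is_leaf:
        # The task party only learns which party owns the split feature
        # and signals that owner to perform the comparison locally.
        owner = parties[node.feature_owner]
        go_left = owner.compare(sample_id, node.feature_idx, node.split_value)
        node = node.left if go_left else node.right
    return node.weight
```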
`vertical.eval: 'ss'` (coming soon!) means that, for each tree, the task party first secret-shares the weight of each leaf node, and during inference the indicator vectors of the left and right children are also secret-shared. Via secret-sharing multiplication, each party ends up with a share of the testing results; the task party then collects the data parties' shares to reveal the exact result. The main advantage is that the indicator vectors are masked. This is adapted from "Large-Scale Secure XGB for Vertical Federated Learning".
`vertical.eval: 'he'` means that, for each tree, each party locally gets its leaf vector. Here, a leaf vector is a 0/1 indicator vector over the leaf nodes, marking which leaves the test sample could reach according to the split rules on that party's own features; combining all parties' leaf vectors determines the exact prediction, and homomorphic encryption protects the vectors during this combination.
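Ignoring the encryption step, the leaf-vector logic can be sketched as follows; the values are illustrative, and in practice the combination is computed on ciphertexts:

```python
import numpy as np

# Each party marks, from its own features alone, which of a tree's leaves a
# test sample could still reach (1 = reachable); splits on features it does
# not hold leave both children reachable. Example with 4 leaves:
leaf_vec_task = np.array([1, 1, 0, 0])  # task party's local view
leaf_vec_data = np.array([0, 1, 0, 1])  # data party's local view

# The element-wise product leaves exactly one reachable leaf; under HE this
# combination is performed on encrypted vectors instead.
indicator = leaf_vec_task * leaf_vec_data        # -> [0, 1, 0, 0]
leaf_weights = np.array([0.3, -0.1, 0.8, 0.5])
prediction = float(indicator @ leaf_weights)     # -> -0.1
```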
Several examples are provided in `federatedscope/tree_based_models/baseline/`. The following configuration (provided in `xgb_feature_gathering_dp_on_adult.yaml`) shows how to run XGBoost on the Adult dataset, using the feature-gathering tree-based model and applying differential privacy for protection.
```yaml
use_gpu: False
device: 0
backend: torch
federate:
  mode: standalone
  client_num: 2
model:
  type: xgb_tree
  lambda_: 0.1
  gamma: 0
  num_of_trees: 10
  max_tree_depth: 3
data:
  root: data/
  type: adult
  splits: [1.0, 0.0]
dataloader:
  type: raw
  batch_size: 2000
criterion:
  type: CrossEntropyLoss
trainer:
  type: verticaltrainer
train:
  optimizer:
    # Learning rate for the XGBoost model
    eta: 0.5
vertical:
  use: True
  dims: [7, 14]
  algo: 'xgb'
  protect_object: 'feature_order'
  protect_method: 'dp'
  protect_args: [{'bucket_num': 100, 'epsilon': 10}]
  data_size_for_debug: 2000
eval:
  freq: 3
  best_res_update_round_wise_key: test_loss
```