Hanyu Song, Peizhao Li, and Hongfu Liu. "Deep Clustering based Fair Outlier Detection", SIGKDD 2021.
In this paper, we focus on the fairness issues regarding unsupervised outlier detection. Traditional algorithms, without specific design for algorithmic fairness, could implicitly encode and propagate statistical bias in data and raise societal concerns. To correct such unfairness and deliver a fair set of potential outlier candidates, we propose Deep Clustering-based Fair Outlier Detection (DCFOD), which learns a good representation for utility maximization while enforcing the learnable representation to be subgroup-invariant on the sensitive attribute. Considering the coupled and reciprocal nature of clustering and outlier detection, we leverage deep clustering to discover the intrinsic cluster structure and out-of-structure instances. Meanwhile, adversarial training erases the sensitive pattern of instances for fairness adaptation. Technically, we propose an instance-level weighted representation learning strategy to enhance the joint deep clustering and outlier detection, where the dynamic weight module re-emphasizes contributions of likely-inliers while mitigating the negative impact from outliers. In experiments on eight datasets against 17 outlier detection algorithms, our DCFOD method consistently achieves superior performance on both outlier detection validity and two types of fairness notions in outlier detection.
numpy==1.7.1
torch==1.7.0
sklearn==0.22.0
pandas==1.0.5
cuda=10.1.243
Model training requires at least one GPU.
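A quick way to compare an environment against the pins above is to query the installed package metadata. The sketch below is an illustration and not part of the repository: `PINS` and `check_pins` are hypothetical names, and it assumes scikit-learn is installed under its PyPI distribution name `scikit-learn` rather than `sklearn`. CUDA is not a Python package, so it is not checked here.

```python
from importlib import metadata

# Pins copied from the list above; "scikit-learn" is the PyPI name for sklearn.
PINS = {
    "numpy": "1.7.1",
    "torch": "1.7.0",
    "scikit-learn": "0.22.0",
    "pandas": "1.0.5",
}


def check_pins(pins):
    """Return {package: (wanted, installed)}; installed is None if the package is missing."""
    report = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        report[name] = (wanted, installed)
    return report


if __name__ == "__main__":
    for name, (wanted, installed) in check_pins(PINS).items():
        # Ignore a local build suffix such as "+cu101" when comparing versions.
        matches = installed is not None and installed.split("+")[0] == wanted
        print(f"{name}: wanted {wanted}, found {installed} ({'ok' if matches else 'differs'})")
```

A mismatch is only a hint; whether a newer version still works with this code is not something the check can tell you.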
We compare performance on eight datasets: student, asd, obesity, cc, german, drug, adult, and kdd.
To prepare datasets before model training, run
python3 getDatasets.py
You should not need to modify any dataset or folder paths for model training.
To obtain DCFOD's performance on a specific dataset, run
python3 train.py *dataset_name* *GPU_index* *with_weight*
e.g.,
python3 train.py student 0 true
GPU_index is zero-based: to train on the i-th GPU, pass i-1. If you only have one GPU, simply pass 0.
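The zero-based GPU_index convention can be made explicit with a small helper that builds the train.py command line. This is a sketch, not part of the repository: `train_command` and `gpu_ordinal` are hypothetical names, and `"true"` is the only with_weight value shown in this README.

```python
# Dataset names as listed in this README.
DATASETS = ["student", "asd", "obesity", "cc", "german", "drug", "adult", "kdd"]


def train_command(dataset, gpu_ordinal=1, with_weight="true"):
    """Build the argument list for train.py; gpu_ordinal counts GPUs from 1."""
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset: {dataset}")
    if gpu_ordinal < 1:
        raise ValueError("gpu_ordinal starts at 1")
    gpu_index = gpu_ordinal - 1  # train.py expects a zero-based GPU_index
    return ["python3", "train.py", dataset, str(gpu_index), with_weight]


if __name__ == "__main__":
    # Print one command per dataset, all targeting the first GPU.
    for name in DATASETS:
        print(" ".join(train_command(name)))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root to launch the run.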
To obtain results for the DCOD variant, set the kf hyperparameter to 0 in the train method.
FairLOF requires the baseline LOF results, so first run
python3 LOF.py *dataset_name*
followed by
python3 get_Ws_for_FairLOF.py
This computes the Ws variables from the LOF results for all datasets; they are required for the FairLOF calculation.
If you don't have LOF results for all datasets, edit the datasets list in get_Ws_for_FairLOF.py accordingly.
Then run
python3 FairLOF.py *dataset_name*
The experiment runs on 4 GPUs; edit lines 21-25 in FairLOF.py to change the CUDA and GPU settings.
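Because the FairLOF steps must run in a fixed order, it can help to generate the command sequence programmatically. The helper below is a hypothetical sketch (`fairlof_plan` is not part of the repository); it only builds and prints the commands described above.

```python
def fairlof_plan(datasets):
    """Return the FairLOF commands in the order described above."""
    # 1. Baseline LOF results, one run per dataset.
    plan = [["python3", "LOF.py", name] for name in datasets]
    # 2. Collect the Ws variables once, after all LOF runs have finished.
    plan.append(["python3", "get_Ws_for_FairLOF.py"])
    # 3. FairLOF itself, one run per dataset.
    plan += [["python3", "FairLOF.py", name] for name in datasets]
    return plan


if __name__ == "__main__":
    for cmd in fairlof_plan(["student", "asd"]):
        print(" ".join(cmd))
```

When running only a subset like this, the datasets list in get_Ws_for_FairLOF.py still has to be edited to match, as noted above.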
We run FairOD with
python3 FairOD.py *dataset_name* *GPU_index* *fair_command*
First obtain a baseline result (fair_command f), e.g.,
python3 FairOD.py student 0 f
Then train the fair model (fair_command t)
python3 FairOD.py student 0 t
To save each method's outlier scores and compute the metric values on a specific dataset, run
python3 pyod_results.py *dataset_name*
If you have already run the command above, which saves the outlier scores, and only want to recompute the metric values, run
python3 Retriever.py *dataset_name*
We use AUC to measure detection validity, and Fgap and Frank to measure two types of fairness.
We calculate AUC with roc_auc_score from sklearn.metrics, and define Fgap and Frank in Retriever.py.
During model training, we obtain the fairness metrics with the fetch method in Retriever.py.
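For intuition about the AUC metric: with binary labels, AUC equals the probability that a randomly chosen outlier receives a higher score than a randomly chosen inlier, with ties counted as one half. The dependency-free sketch below illustrates that definition; `pairwise_auc` is a hypothetical helper, not the repository's code, and it assumes outliers are labeled 1 and inliers 0. The Fgap and Frank definitions live in Retriever.py and are not reproduced here.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (outlier, inlier) pairs ranked correctly; ties count 0.5."""
    outliers = [s for y, s in zip(labels, scores) if y == 1]
    inliers = [s for y, s in zip(labels, scores) if y == 0]
    if not outliers or not inliers:
        raise ValueError("need at least one outlier and one inlier")
    wins = 0.0
    for o in outliers:
        for i in inliers:
            if o > i:
                wins += 1.0
            elif o == i:
                wins += 0.5
    return wins / (len(outliers) * len(inliers))


if __name__ == "__main__":
    # Three of the four (outlier, inlier) pairs are ordered correctly.
    print(pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

This quadratic loop is only meant to show the definition; for real evaluations, `roc_auc_score(labels, scores)` computes the same quantity more efficiently.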
@inproceedings{10.1145/3447548.3467225,
author = {Song, Hanyu and Li, Peizhao and Liu, Hongfu},
title = {Deep Clustering Based Fair Outlier Detection},
year = {2021},
isbn = {9781450383325},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3447548.3467225},
doi = {10.1145/3447548.3467225},
booktitle = {Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery \& Data Mining},
pages = {1481–1489},
numpages = {9},
keywords = {fair representation learning, deep clustering on outlier detection, outlier detection},
location = {Virtual Event, Singapore},
series = {KDD '21}
}