A Decade's Battle on Dataset Bias: Are We There Yet?
Zhuang Liu and Kaiming He
Meta AI Research, FAIR
[arXiv] [code]
These images are sampled from three modern datasets: YFCC, CC, and DataComp. Can you specify which dataset each image is from? While these datasets appear to be less biased, we discover that neural networks can easily accomplish this “dataset classification” task with surprisingly high accuracy on the held-out validation set.
Answer:
YFCC: 1, 4, 7, 10, 13; CC: 2, 5, 8, 11, 14; DataComp: 3, 6, 9, 12, 15.
We use the code from ConvNeXt. Please follow the instructions there for setup.
Download images from each dataset and organize them as follows:
/path/to/datasets_root/
    train/
        dataset1/
            ...
        dataset2/
            ...
        dataset3/
            ...
    val/
        dataset1/
            ...
        dataset2/
            ...
        dataset3/
            ...
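Before launching training, it can help to confirm that the folders on disk match this layout. The following is a minimal sketch and not part of the released code: `list_classes`, `check_layout`, and `make_demo_root` are hypothetical helper names. It assumes that class indices are assigned from the sorted sub-folder names, which is the convention torchvision's `ImageFolder` follows. It checks that `train/` and `val/` contain the same dataset folders and that their count matches the value passed to `--nb_classes`.

```python
import os
import tempfile


def list_classes(split_dir):
    """Return the sorted names of the sub-folders of split_dir (one per dataset)."""
    return sorted(entry.name for entry in os.scandir(split_dir) if entry.is_dir())


def check_layout(root, nb_classes):
    """Check that train/ and val/ hold the same nb_classes dataset folders.

    Returns a dict mapping each folder name to a class index, assuming that
    indices follow sorted folder names (hypothetical helper, for illustration).
    """
    train_classes = list_classes(os.path.join(root, "train"))
    val_classes = list_classes(os.path.join(root, "val"))
    if train_classes != val_classes:
        raise ValueError(f"train/val folders differ: {train_classes} vs {val_classes}")
    if len(train_classes) != nb_classes:
        raise ValueError(f"expected {nb_classes} folders, found {len(train_classes)}")
    return {name: index for index, name in enumerate(train_classes)}


def make_demo_root(names):
    """Create a throwaway datasets_root with empty dataset folders (demo only)."""
    root = tempfile.mkdtemp()
    for split in ("train", "val"):
        for name in names:
            os.makedirs(os.path.join(root, split, name))
    return root


if __name__ == "__main__":
    # Demo on a temporary directory; replace with /path/to/datasets_root in practice.
    demo_root = make_demo_root(["dataset1", "dataset2", "dataset3"])
    print(check_layout(demo_root, nb_classes=3))
```

The sketch only inspects folder names; it does not open or validate the image files themselves.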
We give example commands for multi-node and single-machine training below.
Multi-node (4 nodes with 8 GPUs each, launched through run_with_submitit.py):
python run_with_submitit.py --nodes 4 --ngpus 8 \
    --model convnext_tiny --opt_betas 0.9 0.95 \
    --batch_size 128 --lr 1e-3 --update_freq 1 \
    --weight_decay 0.3 --reprob 0 \
    --data_set image_folder --nb_classes 3 \
    --data_path /path/to/datasets_root/train \
    --eval_data_path /path/to/datasets_root/val \
    --job_dir /path/to/save_results
Single-machine (one node with 8 GPUs):
python -m torch.distributed.launch --nproc_per_node=8 main.py \
    --model convnext_tiny --opt_betas 0.9 0.95 \
    --batch_size 128 --lr 1e-3 --update_freq 1 \
    --weight_decay 0.3 --reprob 0 \
    --data_set image_folder --nb_classes 3 \
    --data_path /path/to/datasets_root/train \
    --eval_data_path /path/to/datasets_root/val \
    --output_dir /path/to/save_results
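As a quick check on the two commands above, the effective batch size can be estimated as the product of the number of nodes, the GPUs per node, the per-GPU `--batch_size`, and `--update_freq` (gradient accumulation steps). The sketch below is plain arithmetic under that assumption; `effective_batch_size` is a hypothetical helper, not a function in the ConvNeXt code.

```python
def effective_batch_size(nodes, gpus_per_node, batch_size, update_freq):
    """Samples contributing to one optimizer step, assuming batch_size is per GPU."""
    return nodes * gpus_per_node * batch_size * update_freq


# Values copied from the two example commands.
multi_node = effective_batch_size(nodes=4, gpus_per_node=8, batch_size=128, update_freq=1)
single_machine = effective_batch_size(nodes=1, gpus_per_node=8, batch_size=128, update_freq=1)

print(multi_node)      # 4096
print(single_machine)  # 1024
```

As written, the two commands therefore differ by a factor of four in effective batch size. If the goal is to match the multi-node setting on one machine, raising `--update_freq` to 4 would be one option; this is an inference from the arithmetic, not something the commands above state.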
This project is released under the MIT license. Please see the LICENSE file for more information.
@article{liu2024decade,
title = {A Decade's Battle on Dataset Bias: Are We There Yet?},
author = {Zhuang Liu and Kaiming He},
year = {2024},
journal = {arXiv preprint arXiv:2403.08632},
}