A Decade's Battle on Dataset Bias: Are We There Yet?

Zhuang Liu and Kaiming He
Meta AI Research, FAIR
[arXiv] [code]

These images are sampled from three modern datasets: YFCC, CC, and DataComp. Can you specify which dataset each image is from? While these datasets appear to be less biased, we discover that neural networks can easily accomplish this “dataset classification” task with surprisingly high accuracy on the held-out validation set.

Answer

YFCC: 1, 4, 7, 10, 13; CC: 2, 5, 8, 11, 14; DataComp: 3, 6, 9, 12, 15.

Code

We use the training code from ConvNeXt. Please follow the instructions in that repository for setup.

Dataset Preparation

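Download images from each dataset and organize them as follows, with one subfolder per dataset under both train/ and val/ (each subfolder is one class, hence --nb_classes 3 in the training commands):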

/path/to/datasets_root/
  train/
    dataset1/
      ...
    dataset2/
      ...
    dataset3/
      ...
  val/
    dataset1/
      ...
    dataset2/
      ...
    dataset3/
      ...
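
Before launching a long run, it can help to confirm that the folder layout above is what the loader will see. The following is a minimal sketch, not part of the released code. It assumes that `--data_set image_folder` reads each split the way torchvision's `ImageFolder` does, that is, one class per subfolder with label indices assigned in sorted folder-name order. The helper names `summarize_split` and `check_layout` and the extension list `IMAGE_EXTS` are hypothetical.

```python
from pathlib import Path

# Assumption: a small set of common image extensions, not the loader's exact list.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".webp"}


def summarize_split(split_dir):
    """Return {class_name: (label_index, image_count)} for one split folder.

    Label indices follow sorted folder names, mirroring the assumed
    ImageFolder-style behavior.
    """
    split_dir = Path(split_dir)
    class_names = sorted(p.name for p in split_dir.iterdir() if p.is_dir())
    summary = {}
    for index, name in enumerate(class_names):
        count = sum(
            1
            for f in (split_dir / name).rglob("*")
            if f.is_file() and f.suffix.lower() in IMAGE_EXTS
        )
        summary[name] = (index, count)
    return summary


def check_layout(root, nb_classes=3):
    """Check train/ and val/ under root and return their summaries."""
    root = Path(root)
    train = summarize_split(root / "train")
    val = summarize_split(root / "val")
    for split_name, summary in (("train", train), ("val", val)):
        if len(summary) != nb_classes:
            raise ValueError(
                f"{split_name}/ has {len(summary)} class folders, expected {nb_classes}"
            )
        empty = [name for name, (_, count) in summary.items() if count == 0]
        if empty:
            raise ValueError(f"{split_name}/ has class folders without images: {empty}")
    if list(train) != list(val):
        raise ValueError("train/ and val/ must contain the same class folders")
    return train, val


# Example usage (the path is a placeholder):
# train, val = check_layout("/path/to/datasets_root", nb_classes=3)
# print(train)
```

If the check passes, the returned dictionaries show which integer label each dataset folder would receive under the sorted-name assumption, which is useful when reading per-class results.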

Training

We give example commands for multi-node and single-machine training below.

Multi-node

python run_with_submitit.py --nodes 4 --ngpus 8 \
--model convnext_tiny --opt_betas 0.9 0.95 \
--batch_size 128 --lr 1e-3 --update_freq 1 \
--weight_decay 0.3 --reprob 0 \
--data_set image_folder --nb_classes 3 \
--data_path /path/to/datasets_root/train \
--eval_data_path /path/to/datasets_root/val \
--job_dir /path/to/save_results

Single-machine

python -m torch.distributed.launch --nproc_per_node=8 main.py \
--model convnext_tiny --opt_betas 0.9 0.95 \
--batch_size 128 --lr 1e-3 --update_freq 1 \
--weight_decay 0.3 --reprob 0 \
--data_set image_folder --nb_classes 3 \
--data_path /path/to/datasets_root/train \
--eval_data_path /path/to/datasets_root/val \
--output_dir /path/to/save_results
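
When adapting these commands to different hardware, the effective batch size is worth checking. In the ConvNeXt code, `--batch_size` is the per-GPU batch size and `--update_freq` is the number of gradient accumulation steps, so the effective batch size is nodes × GPUs per node × batch size × update frequency. The sketch below applies that arithmetic to the two commands above; the function name `effective_batch_size` is hypothetical.

```python
def effective_batch_size(nodes, gpus_per_node, batch_size, update_freq=1):
    """Images contributing to each optimizer step across all GPUs."""
    return nodes * gpus_per_node * batch_size * update_freq


# Multi-node command: --nodes 4 --ngpus 8 --batch_size 128 --update_freq 1
multi_node = effective_batch_size(nodes=4, gpus_per_node=8, batch_size=128, update_freq=1)

# Single-machine command: --nproc_per_node=8 --batch_size 128 --update_freq 1
single_machine = effective_batch_size(nodes=1, gpus_per_node=8, batch_size=128, update_freq=1)

print(multi_node)      # 4096
print(single_machine)  # 1024
```

Under this arithmetic the two commands as written use different effective batch sizes. The README does not say whether that is intended. If the goal is to match the multi-node setting on one 8-GPU machine, raising `--update_freq` to 4 would give 4096 again, though that is an assumption rather than a documented recommendation.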

LICENSE

This project is released under the MIT license. Please see the LICENSE file for more information.

Citation

@article{liu2024decade,
  title   = {A Decade's Battle on Dataset Bias: Are We There Yet?},
  author  = {Zhuang Liu and Kaiming He},
  year    = {2024},
  journal = {arXiv preprint arXiv:2403.08632},
}
