
naver-ai/dtm


Taekyung Kim*, Byeongho Heo, Dongyoon Han*
(*equal contribution)
NAVER AI LAB

paper

Abstract

Masked image modeling (MIM) has emerged as a promising approach for training Vision Transformers (ViTs). The essence of MIM lies in the token-wise prediction of masked tokens, where the targets are either tokenized from images or generated by pre-trained models such as vision-language models. Although tokenizers and pre-trained models provide plausible MIM targets, those targets are often spatially inconsistent even across neighboring tokens, which makes it harder for models to learn unified and discriminative representations. Our pilot study identifies these spatial inconsistencies and suggests that resolving them can accelerate representation learning. Building upon this insight, we introduce a novel self-supervision signal called Dynamic Token Morphing (DTM), which dynamically aggregates contextually related tokens to yield contextualized targets, thereby mitigating spatial inconsistency. DTM is compatible with various SSL frameworks; we showcase improved MIM results by employing DTM while introducing barely any extra training cost. Our method facilitates training by using consistent targets, resulting in 1) faster training and 2) reduced losses. Experiments on ImageNet-1K and ADE20K demonstrate the superiority of our method over state-of-the-art, complex MIM methods. Comparative evaluations on the iNaturalist and fine-grained visual classification datasets further validate the transferability of our method to various downstream tasks.

Our Motivation: What is spatial consistency among visual tokens?

  • (a): input image; (b) and (c) display the predicted classes for each token within 4 example bounding boxes without/with token aggregations, respectively.
  • Shades of red and green represent the degree of incorrect and correct predictions, respectively.
  • The zero-shot accuracies shown in the figure indicate that spatial consistency is connected to the model's capability.
[Figure: token-wise predicted classes without and with token aggregation, panels (a) to (c)]
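
To make the notion of spatial consistency concrete, the sketch below scores a grid of per-token predicted classes by how often horizontally or vertically adjacent tokens agree. This is only an illustration: the function name `neighbor_agreement` and the metric itself are assumptions made for this example, not the measure used in the paper.

```python
def neighbor_agreement(labels):
    """Fraction of 4-connected neighboring token pairs with the same label.

    `labels` is a 2D list (rows x cols) of per-token predicted classes.
    Hypothetical helper for illustration; not taken from the DTM paper.
    """
    rows = len(labels)
    cols = len(labels[0]) if rows else 0
    agree = 0
    total = 0
    for r in range(rows):
        for c in range(cols):
            # Compare with the right neighbor, if any.
            if c + 1 < cols:
                total += 1
                agree += labels[r][c] == labels[r][c + 1]
            # Compare with the neighbor below, if any.
            if r + 1 < rows:
                total += 1
                agree += labels[r][c] == labels[r + 1][c]
    # A grid without any neighboring pair is treated as fully consistent.
    return agree / total if total else 1.0


noisy = [["cat", "cat"], ["cat", "dog"]]
uniform = [["cat", "cat"], ["cat", "cat"]]
noisy_score = neighbor_agreement(noisy)
uniform_score = neighbor_agreement(uniform)
```

Under this toy score, a box whose tokens all receive the same class is fully consistent, while scattered per-token predictions lower the value.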

Schematic Illustrations

  • Left: Dynamic Token Morphing (DTM), Right: Overview of Masked Image Modeling via DTM
[Figures: Dynamic Token Morphing (left) and overview of masked image modeling via DTM (right)]
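
Since the official code is not yet released, the following is a minimal sketch of the idea stated in the abstract: contextually related tokens are aggregated, and the loss is computed on the aggregated (contextualized) tokens. Everything here is an assumption for illustration, including the greedy cosine-similarity grouping, the mean aggregation, the cosine-distance loss, and the names `group_tokens`, `morph`, and `morphed_loss`; the actual DTM procedure may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mean_vector(tokens, indices):
    """Element-wise mean of the tokens selected by `indices`."""
    dim = len(tokens[indices[0]])
    return [sum(tokens[i][d] for i in indices) / len(indices) for d in range(dim)]


def group_tokens(tokens, num_groups):
    """Greedily merge the two most similar groups until `num_groups` remain.

    Assumed grouping rule for this sketch; not the authors' matching scheme.
    """
    groups = [[i] for i in range(len(tokens))]
    centers = [list(t) for t in tokens]
    while len(groups) > max(1, num_groups):
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                sim = cosine(centers[i], centers[j])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        groups[i] = groups[i] + groups[j]
        centers[i] = mean_vector(tokens, groups[i])
        del groups[j]
        del centers[j]
    return groups


def morph(tokens, groups):
    """Replace every token with the mean of its group."""
    out = [None] * len(tokens)
    for g in groups:
        mean = mean_vector(tokens, g)
        for i in g:
            out[i] = list(mean)
    return out


def morphed_loss(pred, target, num_groups):
    """Mean cosine distance between morphed predictions and morphed targets.

    Groups are derived from the target tokens and applied to both sides.
    """
    groups = group_tokens(target, num_groups)
    mp, mt = morph(pred, groups), morph(target, groups)
    return sum(1.0 - cosine(p, t) for p, t in zip(mp, mt)) / len(target)


targets = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
groups = group_tokens(targets, num_groups=2)
loss_same = morphed_loss(targets, targets, num_groups=2)
```

In this toy example the first two and last two target tokens point in nearly the same direction, so they end up in the same group and share one averaged target. A real implementation would operate on batched tensors and would choose the number of groups dynamically, which this sketch does not attempt.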

Updates

  • (2024/10/11): Code is under internal review.
  • (2024/10/11): Preprint has been updated.
