This repository contains practice code for integrating ToMe (Token Merging) into CLIP (Contrastive Language-Image Pre-training).
The main objective is to replace the timm- and swag-specific modules used by tome.patch with an attn_tome patch that employs the corresponding modules from the CLIP source code; a sketch of this patching approach follows the links below.
CLIP: https://github.com/OpenAI/CLIP
ToMe: https://github.com/facebookresearch/tome
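
Below is a minimal sketch of the idea, not this repo's actual attn_tome implementation: it wraps CLIP's ResidualAttentionBlock and merges r tokens per layer with ToMe's bipartite_soft_matching. The class name ToMeResidualAttentionBlock is hypothetical, and for simplicity the merge metric here is the token features themselves, whereas ToMe proper derives it from the attention keys.

```python
import torch
import torch.nn as nn
from tome.merge import bipartite_soft_matching, merge_wavg


class ToMeResidualAttentionBlock(nn.Module):
    """Hypothetical wrapper: merges r tokens after the attention
    sublayer of a clip.model.ResidualAttentionBlock."""

    def __init__(self, block: nn.Module, r: int):
        super().__init__()
        self.block = block  # the original CLIP block
        self.r = r          # number of tokens merged in this layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # CLIP's transformer runs in (seq_len, batch, dim) layout.
        x = x + self.block.attention(self.block.ln_1(x))

        if self.r > 0:
            x = x.permute(1, 0, 2)  # (batch, seq_len, dim) for ToMe
            # Simplification: token features serve as the similarity
            # metric; ToMe proper uses the attention keys and also
            # tracks token sizes for proportional attention.
            merge, _ = bipartite_soft_matching(x, self.r, class_token=True)
            x, _ = merge_wavg(merge, x)
            x = x.permute(1, 0, 2)  # back to (seq_len, batch, dim)

        x = x + self.block.mlp(self.block.ln_2(x))
        return x
```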
In my experiments, ToMe does indeed accelerate the vision tower in CLIP, but applying it to the text tower degrades performance catastrophically.
A plausible explanation is that text lacks the redundancy present in images, so merging text tokens discards important information. A usage example that accordingly patches only the vision tower is sketched below.
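
Assuming the ToMeResidualAttentionBlock wrapper sketched above, the following patches only model.visual and leaves the text tower untouched; the model name and per-layer r value are arbitrary choices for illustration.

```python
import clip
import torch
import torch.nn as nn

model, preprocess = clip.load("ViT-B/32", device="cpu")

# Patch the vision tower only; the text tower is left as-is,
# since merging text tokens hurts accuracy.
model.visual.transformer.resblocks = nn.Sequential(
    *[ToMeResidualAttentionBlock(b, r=2)
      for b in model.visual.transformer.resblocks]
)

image = torch.randn(1, 3, 224, 224)  # stand-in for preprocess(...) output
with torch.no_grad():
    features = model.encode_image(image)
print(features.shape)  # torch.Size([1, 512]); the output dim is unchanged
```

Merging r tokens in every layer shrinks the sequence progressively (for ViT-B/32, 50 tokens down to 26 with r=2 over 12 layers), which is where the speedup comes from; the class token is protected by class_token=True so encode_image still reads its output from position 0.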