Han-Hung Lee*1, Yiming Zhang*1 and Angel Xuan Chang1,2
* Equal Contribution 1 Simon Fraser University 2 Canada-CIFAR AI Chair, Amii
We introduce Duoduo CLIP, a model for 3D representation learning that learns shape encodings from multi-view images instead of point clouds. The choice of multi-view images allows us to leverage 2D priors from off-the-shelf CLIP models to facilitate fine-tuning with 3D data. Our approach not only generalizes better than existing point cloud methods, but also reduces GPU requirements and training time. In addition, we modify the model with cross-view attention to leverage information across multiple frames of the object, which further boosts performance. Compared to the current SOTA point cloud method, which requires 480 A100 hours to train 1 billion model parameters, we need only 57 A5000 hours and 87 million parameters. Multi-view images also provide more flexibility than point clouds: objects can be encoded with a variable number of images, with better performance as more views are used. This is in contrast to point cloud based methods, where an entire scan or model of an object is required. We showcase this flexibility with object retrieval from images of real-world objects. Our model also achieves better performance on more fine-grained text-to-shape retrieval, demonstrating better text-and-shape alignment than point cloud based models.
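For readers curious about the cross-view attention mentioned above, below is a minimal PyTorch sketch of the idea, not the exact implementation in `src/custom_clip`: patch tokens from all views of an object are folded into a single attention sequence so that each view can attend to the others. The class name, dimensions, and fusion choice here are illustrative assumptions.

```python
# Minimal sketch of cross-view attention (illustrative, not the repo's exact code).
# Patch tokens from all V views of one object are concatenated into a single
# sequence, so self-attention can mix information across views.
import torch
import torch.nn as nn

class CrossViewAttention(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, views, seq_len, dim) -- per-view patch tokens
        b, v, n, d = tokens.shape
        x = tokens.reshape(b, v * n, d)   # fold all views into one sequence
        out, _ = self.attn(x, x, x)       # every view attends to every other view
        return out.reshape(b, v, n, d)

# Example: 2 objects, each with 6 views of 50 tokens
feats = torch.randn(2, 6, 50, 768)
print(CrossViewAttention()(feats).shape)  # torch.Size([2, 6, 50, 768])
```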
This is the official initial release for the paper Duoduo CLIP: Efficient 3D Understanding with Multi-View Images. This release provides evaluation on the LVIS split of Objaverse as well as object retrieval from text. We will release the full data preparation and training code soon; the remaining items are listed in the TODO list below.
We use miniconda to manage system dependencies.
# create and activate the conda environment
conda create -n ddclip python=3.10
conda activate ddclip
# install PyTorch
conda install pytorch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 pytorch-cuda=12.1 -c pytorch -c nvidia
# install Python libraries
pip install -r requirements.txt
cd open_clip_mod
pip install .
# install Faiss
conda install -c pytorch -c nvidia faiss-gpu=1.8.0
- Download the Objaverse LVIS files for evaluation.
python preprocess/download_lvis.py
We also provide embeddings for every object in the entire Objaverse dataset, computed from 12 randomly rendered views per object.
- Download the shape embeddings.
python preprocess/download_embeddings.py
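The downloaded embeddings can be inspected directly. The snippet below is a rough sketch that assumes the embeddings are stored as a single NumPy array of unit-normalized, CLIP-aligned shape features; the file path is illustrative, not the actual output location of the script.

```python
import numpy as np

# Illustrative path; adjust to wherever download_embeddings.py places the data.
embeddings = np.load("data/objaverse_embeddings.npy")
print(embeddings.shape)  # expected (num_objects, embed_dim), one row per object

# If the embeddings are L2-normalized in the shared CLIP space, cosine similarity
# between two objects reduces to a dot product.
sim = embeddings[0] @ embeddings[1]
print(f"similarity between first two objects: {sim:.3f}")
```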
- Run the Objaverse LVIS evaluation over multiple view settings. The model used here was trained with 1 to 6 frames sampled per object and the last 6 layers trainable. A conceptual sketch of the multi-view evaluation follows the command below.
python test_objaverse_lvis.py ckpt_path=Four_1to6F_bs1600_LT6.ckpt
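The evaluation script handles everything end to end; conceptually, zero-shot classification with a multi-view encoder looks roughly like the sketch below, where per-view features are fused into one shape embedding and compared against text embeddings of the LVIS category names. The `encode_image`/`encode_text` methods, the mean fusion, and the prompt template are assumptions for illustration, not the repo's exact API.

```python
import torch

@torch.no_grad()
def zero_shot_classify(model, views, category_names, tokenizer):
    """Illustrative zero-shot classification with a multi-view shape encoder.

    views: (num_views, 3, H, W) images of one object; more views generally
    yield a better shape embedding. `model` is assumed to expose CLIP-style
    encode_image / encode_text methods (an assumption, not the repo's API).
    """
    # Encode all views and fuse them into a single shape embedding.
    img_feats = model.encode_image(views)            # (num_views, dim)
    shape_feat = img_feats.mean(dim=0)
    shape_feat = shape_feat / shape_feat.norm()

    # Encode one text prompt per LVIS category.
    prompts = tokenizer([f"a 3D model of a {c}" for c in category_names])
    txt_feats = model.encode_text(prompts)           # (num_classes, dim)
    txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)

    # Cosine similarity picks the predicted category.
    return (txt_feats @ shape_feat).argmax().item()
```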
- Retrieve Objaverse models using text as input. You can visualize the retrieved models here. A generic retrieval sketch using Faiss follows the command below.
python text_retrieval.py ckpt_path=Four_1to6F_bs1600_LT6.ckpt
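Retrieval of this kind can be served by the Faiss package installed earlier. The sketch below is a generic illustration of text-to-shape retrieval over the precomputed embeddings, not the contents of `text_retrieval.py`; the file path is illustrative and the query would in practice come from the model's text encoder rather than the random placeholder used here.

```python
import faiss
import numpy as np

# Precomputed shape embeddings (one row per Objaverse object); path is illustrative.
embeddings = np.load("data/objaverse_embeddings.npy").astype("float32")
faiss.normalize_L2(embeddings)

# Inner product on L2-normalized vectors equals cosine similarity.
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

# In practice the query comes from the text encoder, e.g. encoding "a wooden chair".
query = np.random.randn(1, embeddings.shape[1]).astype("float32")  # placeholder query
faiss.normalize_L2(query)

scores, ids = index.search(query, 5)
print("top-5 object indices:", ids[0], "scores:", scores[0])
```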
- Add data preparation code for Four, MVImgNet and Text2Shape.
- Add training code for all settings in the paper.
- Add evaluation scripts for MVPNet and Text2Shape.
OpenCLIP: Our model backbones and weights are based on the open-source implementation OpenCLIP. The folder open_clip_mod contains the same code as OpenCLIP, with minor modifications that expose additional functions from the package. The code in src/custom_clip modifies the OpenCLIP models to support the multi-view attention described in the paper.
OpenShape: Our training framework closely follows that of OpenShape. We also use the model IDs and text captions from their released dataset for training.
Zero123: A large portion of our rendered object images comes from Zero123; we also use their rendering script to render images for the remaining objects.
We thank the authors for their work and releasing their code and weights!
This work was funded by a CIFAR AI Chair and a NSERC Discovery Grant.