Han-Hung Lee*1, Yiming Zhang*1 and Angel Xuan Chang1,2
* Equal Contribution 1 Simon Fraser University 2 Canada-CIFAR AI Chair, Amii
We introduce Duoduo CLIP, a model for 3D representation learning that learns shape encodings from multi-view images instead of point clouds. The choice of multi-view images allows us to leverage 2D priors from off-the-shelf CLIP models to facilitate fine-tuning with 3D data. Our approach not only generalizes better than existing point cloud methods, but also reduces GPU requirements and training time. In addition, we modify the model with cross-view attention to leverage information across multiple frames of the object, which further boosts performance. Compared to the current SOTA point cloud method, which requires 480 A100 hours to train 1 billion model parameters, we need only 57 A5000 hours and 87 million parameters. Multi-view images also provide more flexibility than point clouds: objects can be encoded with a variable number of images, with better performance as more views are used. This is in contrast to point cloud based methods, where an entire scan or model of an object is required. We showcase this flexibility with object retrieval from images of real-world objects. Our model also achieves better performance on more fine-grained text-to-shape retrieval, demonstrating better text-and-shape alignment than point cloud based models.
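For readers curious about the cross-view attention mentioned above, below is a minimal PyTorch sketch of the idea, not the exact implementation in `src/custom_clip`: patch tokens from all views of an object are folded into a single attention sequence so that each view can attend to the others. The class name, dimensions, and fusion choice here are illustrative assumptions.

```python
# Minimal sketch of cross-view attention (illustrative, not the repo's exact code).
# Patch tokens from all V views of one object are concatenated into a single
# sequence, so self-attention can mix information across views.
import torch
import torch.nn as nn

class CrossViewAttention(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, views, seq_len, dim) -- per-view patch tokens
        b, v, n, d = tokens.shape
        x = tokens.reshape(b, v * n, d)   # fold all views into one sequence
        out, _ = self.attn(x, x, x)       # every view attends to every other view
        return out.reshape(b, v, n, d)

# Example: 2 objects, each with 6 views of 50 tokens
feats = torch.randn(2, 6, 50, 768)
print(CrossViewAttention()(feats).shape)  # torch.Size([2, 6, 50, 768])
```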
This is the official initial release for the paper Duoduo CLIP: Efficient 3D Understanding with Multi-View Images. This release provides evaluation on the LVIS split of Objaverse as well as object retrieval from text. We will release the full data preparation and training code soon; the remaining items are listed in the TODO list below.
We use miniconda to manage system dependencies.
# create and activate the conda environment
conda create -n ddclip python=3.10
conda activate ddclip
# install PyTorch
conda install pytorch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 pytorch-cuda=12.1 -c pytorch -c nvidia
# install Python libraries
pip install -r requirements.txt
cd open_clip_mod
pip install .
# install Faiss
conda install -c pytorch -c nvidia faiss-gpu=1.8.0
- Download the Objaverse LVIS files for evaluation.
python preprocess/download_lvis.py
We also provide embeddings for every object in the entire Objaverse dataset, computed from 12 randomly rendered views per object.
- Download the shape embeddings.
python preprocess/download_embeddings.py
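The downloaded embeddings can be inspected directly. The snippet below is a rough sketch that assumes the embeddings are stored as a single NumPy array of unit-normalized, CLIP-aligned shape features; the file path is illustrative, not the actual output location of the script.

```python
import numpy as np

# Illustrative path; adjust to wherever download_embeddings.py places the data.
embeddings = np.load("data/objaverse_embeddings.npy")
print(embeddings.shape)  # expected (num_objects, embed_dim), one row per object

# If the embeddings are L2-normalized in the shared CLIP space, cosine similarity
# between two objects reduces to a dot product.
sim = embeddings[0] @ embeddings[1]
print(f"similarity between first two objects: {sim:.3f}")
```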
- Run the Objaverse LVIS evaluation over multiple view settings. The model used here was trained with 1 to 6 frames sampled per object and the last 6 layers trainable. A conceptual sketch of the multi-view evaluation follows the command below.
python test_objaverse_lvis.py ckpt_path=Four_1to6F_bs1600_LT6.ckpt
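The evaluation script handles everything end to end; conceptually, zero-shot classification with a multi-view encoder looks roughly like the sketch below, where per-view features are fused into one shape embedding and compared against text embeddings of the LVIS category names. The `encode_image`/`encode_text` methods, the mean fusion, and the prompt template are assumptions for illustration, not the repo's exact API.

```python
import torch

@torch.no_grad()
def zero_shot_classify(model, views, category_names, tokenizer):
    """Illustrative zero-shot classification with a multi-view shape encoder.

    views: (num_views, 3, H, W) images of one object; more views generally
    yield a better shape embedding. `model` is assumed to expose CLIP-style
    encode_image / encode_text methods (an assumption, not the repo's API).
    """
    # Encode all views and fuse them into a single shape embedding.
    img_feats = model.encode_image(views)            # (num_views, dim)
    shape_feat = img_feats.mean(dim=0)
    shape_feat = shape_feat / shape_feat.norm()

    # Encode one text prompt per LVIS category.
    prompts = tokenizer([f"a 3D model of a {c}" for c in category_names])
    txt_feats = model.encode_text(prompts)           # (num_classes, dim)
    txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)

    # Cosine similarity picks the predicted category.
    return (txt_feats @ shape_feat).argmax().item()
```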
- Retrieve Objaverse models using text as input. You can visualize the retrieved models here. A generic retrieval sketch using Faiss follows the command below.
python text_retrieval.py ckpt_path=Four_1to6F_bs1600_LT6.ckpt
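Retrieval of this kind can be served by the Faiss package installed earlier. The sketch below is a generic illustration of text-to-shape retrieval over the precomputed embeddings, not the contents of `text_retrieval.py`; the file path is illustrative and the query would in practice come from the model's text encoder rather than the random placeholder used here.

```python
import faiss
import numpy as np

# Precomputed shape embeddings (one row per Objaverse object); path is illustrative.
embeddings = np.load("data/objaverse_embeddings.npy").astype("float32")
faiss.normalize_L2(embeddings)

# Inner product on L2-normalized vectors equals cosine similarity.
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

# In practice the query comes from the text encoder, e.g. encoding "a wooden chair".
query = np.random.randn(1, embeddings.shape[1]).astype("float32")  # placeholder query
faiss.normalize_L2(query)

scores, ids = index.search(query, 5)
print("top-5 object indices:", ids[0], "scores:", scores[0])
```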
- Add data preparation code for Four, MVImgNet and Text2Shape.
- Add training code for all settings in the paper.
- Add evaluation scripts for MVPNet and Text2Shape.
OpenCLIP: Our model backbones and weights are based on the open-source implementation OpenCLIP. The folder open_clip_mod contains the same code as OpenCLIP, with minor modifications that expose additional functions from the package. The code in src/custom_clip modifies the OpenCLIP models to support the multi-view attention described in the paper.
OpenShape: Our training framework closely follows that of OpenShape. We also use the model IDs and text captions from their released dataset for training.
Zero123: A large portion of our rendered object images comes from Zero123; we also use their rendering script to render images for the remaining objects.
We thank the authors for their work and releasing their code and weights!
This work was funded by a CIFAR AI Chair and a NSERC Discovery Grant.