Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding

Official implementation of the ECCV 2024 paper 'Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding'. The remaining code will be open-sourced soon!


Introduction

Large foundation models have recently emerged as a prominent focus of interest, attaining superior performance across a wide range of scenarios. Due to the scarcity of 3D data, many efforts have been made to adapt pre-trained transformers from vision to 3D domains. However, such 2D-to-3D approaches remain limited, due to the potential loss of spatial geometries and high computation cost. More importantly, their frameworks are mainly designed for 2D models and lack a general any-to-3D paradigm. In this paper, we introduce Any2Point, a parameter-efficient method to empower any-modality large models (vision, language, audio) for 3D understanding. Given a frozen transformer from any source modality, we propose a 3D-to-any (1D or 2D) virtual projection strategy that correlates the input 3D points with the original 1D or 2D positions within the source modality. This mechanism enables us to assign each 3D token a positional encoding paired with the pre-trained model, which avoids the 3D geometry loss caused by a true projection and better guides the transformer toward 3D learning with 1D/2D positional priors. Then, within each transformer block, we insert an any-to-3D guided adapter module for parameter-efficient fine-tuning. The adapter incorporates prior spatial knowledge from the source modality to guide the local feature aggregation of 3D tokens, promoting the semantic adaptation of any-modality transformers. We conduct extensive experiments to showcase the effectiveness and efficiency of our method.
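As a rough illustration of the 3D-to-2D virtual projection described above, the sketch below projects each 3D point onto a few virtual views, indexes the frozen model's pre-trained 2D positional-embedding grid at the resulting coordinates, and averages the gathered encodings into that token's positional encoding. This is a conceptual sketch only, not the released code: the function name, the three orthogonal views, and the grid size are illustrative assumptions.

```python
import torch

def virtual_2d_pos_encoding(points, pos_embed_2d, grid_size=14):
    """points: (B, N, 3) coordinates normalized to [-1, 1];
    pos_embed_2d: (grid_size * grid_size, C) frozen 2D positional-embedding table."""
    B, N, _ = points.shape
    C = pos_embed_2d.shape[-1]
    # Three orthogonal virtual views, each dropping one axis; nothing is rasterized,
    # so no 3D geometry is discarded by a true projection.
    views = [(0, 1), (1, 2), (0, 2)]
    pe = points.new_zeros(B, N, C)
    for u, v in views:
        # Map the two retained coordinates onto the pre-trained 2D grid.
        uv = (points[..., [u, v]] + 1.0) / 2.0 * (grid_size - 1)
        idx = uv.round().long().clamp(0, grid_size - 1)
        flat = idx[..., 0] * grid_size + idx[..., 1]   # (B, N) grid indices
        pe = pe + pos_embed_2d[flat]                   # gather per-view encodings
    return pe / len(views)                             # average across virtual views
```

Each 3D token then receives this averaged encoding as its positional prior, paired with the frozen source-modality transformer.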

Main Results

We report the pre-training modality (Pre-train), the number of learnable parameters (#Param), and the classification accuracy on the "PB-T50-RS" split of ScanObjectNN (SCAN.) and on ModelNet40 (MN.). * indicates use of the voting strategy.

| Method | Pre-train | #Param (M) | SCAN. (%) | MN. (%) |
| --- | --- | --- | --- | --- |
| PointNet | N/A | 3.5 | 68.0 | 89.2 |
| PointNet++ | N/A | 1.5 | 77.9 | 90.7 |
| DGCNN | N/A | 1.8 | 78.1 | 92.9 |
| PointMLP | N/A | 12.6 | 85.4 | 94.1 |
| Point-PN | N/A | 0.8 | 87.1 | 93.8 |
| PointNeXt | N/A | 1.4 | 87.7 | 94.0 |
| Point-BERT | 3D | 22.1 | 83.1 | 92.7 |
| Point-MAE | 3D | 22.1 | 85.2 | 93.2 |
| Point-M2AE | 3D | 15.3 | 86.4 | 93.4 |
| P2P-HorNet | 2D | 1.2 | 89.3 | 94.0* |
| ACT | 3D+2D | 22.1 | 88.2 | 93.7 |
| I2P-MAE | 3D+2D | 12.9 | 90.1 | 93.7 |
| ReCon | 3D+2D+Language | 43.6 | 90.6 | 94.1 |
| Any2Point (Audio) | Audio | 0.8 | 87.0 | 92.7 |
| Any2Point (2D) | 2D | 0.8 | 87.7 | 93.2 |
| Any2Point (Language) | Language | 0.9 | 91.9 | 94.3 |

Ckpt Release

Real-world shape classification on the PB-T50-RS split of ScanObjectNN:

| Method | Logs | Acc. | Ckpts |
| --- | --- | --- | --- |
| Any2Point-Lang-CLIP | Language_CLIP_Scan.log | 91.9% | Language_CLIP_Scan.pth |
| Any2Point-Vision-DINOV2 | Vision_DINOV2_Scan.log | 87.7% | Vision_DINOV2_Scan.pth |
| Any2Point-Audio-ImageBind | Audio_imagebind_scan.log | 87.0% | Audio_imagebind_scan.pth |

Synthetic shape classification on ModelNet40:

| Method | Logs | Acc. | Ckpts |
| --- | --- | --- | --- |
| Any2Point-Lang-CLIP | Language_CLIP_ModelNet.log | 94.3% | Language_CLIP_ModelNet.pth |
| Any2Point-Vision-DINOV2 | Vision_DINOV2_ModelNet.log | 93.2% | Vision_DINOV2_ModelNet.pth |
| Any2Point-Audio-ImageBind | Audio_imagebind_ModelNet.log | 92.7% | Audio_imagebind_ModelNet.pth |

Get Started

Installation

Create a conda environment and install basic dependencies:

git clone https://github.com/Ivan-Tang-3D/Any2Point_code.git
cd Any2Point_code

conda create -n Any2Point python=3.7
conda activate Any2Point

# Install the corresponding versions of torch, torchvision, and torchaudio
conda install pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=11.3 -c pytorch

conda install -c pyg pytorch-cluster pytorch-scatter pytorch-sparse -y
pip install torch-geometric==2.0

source install.sh
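After running install.sh, an optional sanity check like the one below confirms that the pinned versions were installed and that CUDA is visible. It only reads standard version attributes and is not part of the repo's scripts.

```python
import torch
import torchvision
import torch_geometric

print(torch.__version__)            # expected 1.10.1
print(torchvision.__version__)      # expected 0.11.2
print(torch.cuda.is_available())    # should be True with cudatoolkit 11.3
print(torch_geometric.__version__)  # expected 2.0.x
```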

Dataset

For pre-training and fine-tuning, please follow DATASET.md to download and prepare the ModelNet40, ScanObjectNN, and ShapeNetPart datasets, referring to Point-BERT. Specifically, put the unzipped folders under data/, as in the layout below.

The final directory structure should be:

│Any2Point_code/
├──cfgs/
├──data/
│   ├──ModelNet/
│   ├──ScanObjectNN/
├──...
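A minimal check (optional, not part of the repo) that the dataset folders sit where the tree above expects them; the folder names are taken directly from that tree.

```python
import os

# Folder names follow the directory tree shown above; adjust `root` if needed.
root = 'data'
for sub in ('ModelNet', 'ScanObjectNN'):
    path = os.path.join(root, sub)
    print(path, '-> found' if os.path.isdir(path) else '-> MISSING')
```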

Fine-tuning

Please download CLIP_pre-train.pth, DINOV2_pre-train.pth, and ImageBind_audio_pre-train.pth into the corresponding ckpts/ folder.

For the PB-T50-RS split of ScanObjectNN, run:

sh Finetune_cache_prompt_scan.sh
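To sanity-check a downloaded checkpoint before fine-tuning or evaluation, a short snippet such as the one below can list its top-level keys. The file name comes from the Ckpt Release table above; the wrapping key layout ('base_model', 'model', or a flat state dict) is an assumption and may differ in practice.

```python
import torch

# File name taken from the Ckpt Release table; adjust the path to your ckpts/ folder.
ckpt = torch.load('ckpts/Language_CLIP_Scan.pth', map_location='cpu')

# Checkpoints are often wrapped in a dict; the key names tried here are assumptions.
state = ckpt
if isinstance(ckpt, dict):
    state = ckpt.get('base_model', ckpt.get('model', ckpt))

print('top-level type:', type(ckpt).__name__, '| entries:', len(state))
for k in list(state)[:5]:
    v = state[k]
    print(k, tuple(v.shape) if torch.is_tensor(v) else type(v).__name__)
```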

Acknowledgement

This repo benefits from Pix4Point, Point-NN, PointTransformerV2, and Openpoints. Thanks for their wonderful work.
