Uni3DL: Unified Model for 3D and Language Understanding (ECCV2024)

by Xiang Li*, Jian Ding*, Zhaoyang Chen, Mohamed Elhoseiny.

🎶 Introduction

We present Uni3DL, a unified model for 3D and language understanding. Unlike existing unified 3D vision-language models, which support a limited variety of tasks and depend predominantly on projected multi-view images, Uni3DL operates directly on point clouds. This significantly expands the range of supported 3D tasks to cover both vision and vision-language tasks. At the core of Uni3DL, a query transformer learns task-agnostic semantic and mask outputs by attending to 3D visual features, and a task router selectively generates the task-specific outputs that each task requires. With this unified architecture, Uni3DL enjoys seamless task decomposition and substantial parameter sharing across tasks. We evaluate Uni3DL on diverse 3D vision-language understanding tasks: semantic segmentation, object detection, instance segmentation, visual grounding, 3D captioning, and text-3D cross-modal retrieval. Its performance is on par with or surpasses state-of-the-art (SOTA) task-specific models. We hope our benchmark and the Uni3DL model will serve as a solid step toward easing future research on unified models for 3D and language understanding.
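The task-router idea described above can be illustrated with a small toy sketch. It is not the Uni3DL implementation: the head names (`class`, `mask`, `caption`), the task-to-head table, and the list-of-floats "queries" are all hypothetical stand-ins chosen for illustration. It only shows the routing pattern, in which shared query features are passed to just the heads a given task needs.

```python
from typing import Callable, Dict, List

Queries = List[List[float]]  # toy stand-in for the query transformer's output

def class_head(queries: Queries) -> List[int]:
    # Toy classifier: index of the largest value in each query vector.
    return [max(range(len(q)), key=q.__getitem__) for q in queries]

def mask_head(queries: Queries) -> List[List[bool]]:
    # Toy mask: one boolean per element, True where the value is positive.
    return [[v > 0.0 for v in q] for q in queries]

def caption_head(queries: Queries) -> str:
    # Toy text output; a real captioning head would decode tokens.
    return f"<caption from {len(queries)} queries>"

# Hypothetical registry of heads and the heads each task uses.
HEADS: Dict[str, Callable[[Queries], object]] = {
    "class": class_head,
    "mask": mask_head,
    "caption": caption_head,
}
TASK_TO_HEADS: Dict[str, List[str]] = {
    "semantic_segmentation": ["class", "mask"],
    "instance_segmentation": ["class", "mask"],
    "captioning": ["caption"],
}

def route(task: str, queries: Queries) -> Dict[str, object]:
    """Run only the heads that `task` needs on the shared query features."""
    if task not in TASK_TO_HEADS:
        raise ValueError(f"unknown task: {task}")
    return {name: HEADS[name](queries) for name in TASK_TO_HEADS[task]}

queries = [[0.2, -0.1, 0.9], [1.5, 0.3, -0.4]]
print(route("instance_segmentation", queries))
print(route("captioning", queries))
```

Because every task reads the same query features, adding a task in this sketch means adding a row to the table (and possibly a head) rather than a separate model, which is the kind of parameter sharing the paragraph above describes.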

Getting Started

Installation

pip3 install torch==1.13.1 torchvision==0.14.1 --extra-index-url https://download.pytorch.org/whl/cu113
python -m pip install 'git+https://github.com/MaureenZOU/detectron2-xyz.git'
python -m pip install -r requirements.txt
sh install_cococapeval.sh

Pre-training

sbatch srun_joint.sh

Task-Specific Fine-tuning

Semantic Segmentation on S3DIS

sbatch srun_s3dis_sem.sh

Instance Segmentation on S3DIS

sbatch srun_s3dis_inst.sh
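The repository's top-level directory also contains launch scripts for other benchmarks (`srun_scannet_sem.sh`, `srun_scannet_inst.sh`, `srun_scanref.sh`, `srun_cap3d.sh`, `srun_text2shape.sh`). This README does not document them, so the following is only a sketch that assumes they are submitted with `sbatch` in the same way as the two S3DIS scripts. The helper name `sbatch_command` and the dictionary keys are hypothetical; the script prints commands rather than submitting jobs.

```python
from typing import Dict, List

# Script file names are taken from the repository's top-level files.
# The keys are just the file-name stems; what each script trains is not
# documented in this README.
LAUNCH_SCRIPTS: Dict[str, str] = {
    "s3dis_sem": "srun_s3dis_sem.sh",
    "s3dis_inst": "srun_s3dis_inst.sh",
    "scannet_sem": "srun_scannet_sem.sh",
    "scannet_inst": "srun_scannet_inst.sh",
    "scanref": "srun_scanref.sh",
    "cap3d": "srun_cap3d.sh",
    "text2shape": "srun_text2shape.sh",
}

def sbatch_command(task: str) -> List[str]:
    """Return the sbatch argument list for a task key (dry run, nothing is executed)."""
    try:
        script = LAUNCH_SCRIPTS[task]
    except KeyError:
        raise ValueError(f"no launch script known for task: {task}") from None
    return ["sbatch", script]

# Print the two commands documented above.
for task in ("s3dis_sem", "s3dis_inst"):
    print(" ".join(sbatch_command(task)))
```

To actually submit a job, the returned list could be passed to `subprocess.run` on a machine where Slurm is available.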

Results

Acknowledgement

We build our work on top of X-Decoder and Mask3D.
We appreciate the constructive discussion with Xueyan Zou.

Citation

@inproceedings{li2025uni3dl,
  title={Uni3DL: A Unified Model for 3D Vision-Language Understanding},
  author={Li, Xiang and Ding, Jian and Chen, Zhaoyang and Elhoseiny, Mohamed},
  booktitle={European Conference on Computer Vision},
  pages={74--92},
  year={2025},
  organization={Springer}
}

