Integrates the vision, touch, and common-sense information of foundational models, customized to the agent's perceptual needs.

ai4ce/FusionSense

Note: Documentation for VLM reasoning and the real-world experiments is still in progress, and this README still needs substantial cleanup and updating.

🆕 [2024-10-17] Installation for Hardware Integration/3D Printing Updated.
🆕 [2024-10-15] Installation for Robotics Software Updated.
🆕 [2024-10-11] Made Public.

FusionSense

[Page] | [Paper] | [Video]

This is the official implementation of FusionSense: Bridging Common Sense, Vision, and Touch for Robust Sparse-View Reconstruction

Irving Fang, Kairui Shi, Xujin He, Siqi Tan, Yifan Wang, Hanwen Zhao, Hung-Jui Huang, Wenzhen Yuan, Chen Feng, Jing Zhang

FusionSense is a novel 3D reconstruction framework that enables robots to fuse priors from foundation models with highly sparse observations from vision and tactile sensors. It enables visually and geometrically accurate scene and object reconstruction, even for conventionally challenging objects.

FusionSense Snapshot

Preparation

Step 0: Install Everything Robotics

We used a depth camera mounted on a robot arm powered by ROS2 to acquire pictures with accurate pose information. We also used a tactile sensor for Active Touch Selection.

If you do not need this part, skip to Step 1, which covers the 3D Gaussian pipeline for Robust Global Shape Representation and Local Geometric Optimization.

Step 1: Install 3D Gaussian Dependencies and Nerfstudio

Note: Because our major dependencies, Nerfstudio and Grounded-SAM-2, officially support two different CUDA versions (11.8 vs. 12.1), we have to create two separate environments. We hope to resolve this in the future, once Nerfstudio bumps its officially supported CUDA version.

git clone --recursive https://github.com/ai4ce/FusionSense.git
cd FusionSense
conda env create -f config.yml
conda activate fusionsense

Install compatible PyTorch and cuda-toolkit versions:

pip install torch==2.1.2+cu118 torchvision==0.16.2+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
conda install -c "nvidia/label/cuda-11.8.0" cuda-toolkit
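
Since the two environments target different CUDA versions, it can help to confirm which CUDA build of PyTorch ended up in the active environment. The following is a minimal sketch, not part of the FusionSense codebase: the helper names `cuda_tag` and `matches_expected` are hypothetical, and it assumes the installed wheel carries a `+cuXXX` suffix in its version string, as the `torch==2.1.2+cu118` wheel above does.

```python
import re


def cuda_tag(version: str):
    """Return the CUDA tag (e.g. 'cu118') of a PyTorch version string, or None."""
    match = re.search(r"\+(cu\d+)$", version)
    return match.group(1) if match else None


def matches_expected(version: str, expected: str = "cu118") -> bool:
    """True if the version string carries the expected CUDA tag."""
    return cuda_tag(version) == expected


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment.")
    else:
        print("torch version:", torch.__version__)
        print("built for CUDA:", torch.version.cuda)
        print("matches cu118:", matches_expected(torch.__version__))
```

Run it inside the `fusionsense` environment. For the `G-SAM-2` environment, pass `expected="cu121"` instead.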

Install tinycudann:

pip install ninja git+https://github.com/NVlabs/tiny-cuda-nn/#subdirectory=bindings/torch

Install FusionSense into the environment in editable mode:

pip install -e .

Step 2: Install Grounded-SAM-2

We use Grounded-SAM-2 for segmenting the foreground and background. Please make sure to use our modified submodule.

We recommend starting a separate Conda environment, since Grounded-SAM-2 requires CUDA 12.1, which is not yet officially supported by Nerfstudio.

cd Grounded-SAM2-for-masking
cd checkpoints
bash download_ckpts.sh
cd ../gdino_checkpoints
bash download_ckpts.sh
conda create -n G-SAM-2
conda activate G-SAM-2
conda install pip 
conda install opencv supervision transformers
pip install torch torchvision torchaudio
# select cuda version 12.1
export CUDA_HOME=/path/to/cuda-12.1/
# install Segment Anything 2
pip install -e . 
# install Grounding DINO
pip install --no-build-isolation -e grounding_dino

For further installation problems:

Usage

Select Frames

List the IDs of the images you want to use for training in train.txt.

Extract Mask

Switch to the G-SAM-2 conda environment first.
Then set your scene path and text prompt. The prompt must end with a period ('.'),
e.g. 'transparent white statue.'

conda activate G-SAM-2
cd Grounded-SAM2-for-masking
python grounded_sam2_hf_model_imgs_MaskExtract.py  --path {ABSOLUTE_PATH} --text {TEXT_PROMPT_FOR_TARGET_OBJ}
cd ..

The script above extracts the masks.

If num_no_detection is not 0, you need to select the frames again. Afterwards, the mask images are in /masks, and you can check the frames in /annotated to inspect the results more directly.
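
The two requirements on the arguments (an absolute scene path and a prompt ending in a period) are easy to get wrong when typing the command by hand. Below is a minimal sketch of a wrapper that assembles the command shown above. The helper names `normalize_prompt` and `build_mask_command` are hypothetical and not part of the repository. Only the script name and the `--path` and `--text` flags come from this README.

```python
import os
import shlex


def normalize_prompt(text: str) -> str:
    """Strip surrounding whitespace and make sure the prompt ends with '.'."""
    text = text.strip()
    return text if text.endswith(".") else text + "."


def build_mask_command(scene_path: str, prompt: str) -> list:
    """Assemble the mask-extraction command with an absolute scene path."""
    return [
        "python",
        "grounded_sam2_hf_model_imgs_MaskExtract.py",
        "--path", os.path.abspath(scene_path),
        "--text", normalize_prompt(prompt),
    ]


# Example: "datasets/ds_name" is a placeholder scene directory.
cmd = build_mask_command("datasets/ds_name", "transparent white statue")
print(shlex.join(cmd))
```

The printed command is meant to be run from inside Grounded-SAM2-for-masking with the G-SAM-2 environment active. It could also be passed to `subprocess.run(cmd, check=True)`.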

Run pipeline

You can change configs here: configs/config.py

conda activate fusionsense
python scripts/train.py --data_name {DATASET_NAME} --model_name {MODEL_NAME} --configs {CONFIG_PATH}

Render outputs

To render JPEG or MP4 outputs with Nerfstudio, we recommend installing ffmpeg in the conda environment:

conda install -c conda-forge x264=='1!161.3030' ffmpeg=4.3.2

To render outputs of pretrained models:

python scripts/render_video.py camera-path --load_config your-model-config --camera_path_filename camera_path.json --rendered_output_names rgb depth normal

See the Nerfstudio ns-render documentation for more details.

Dataset Format

datasets/
    ds_name/
    │
    ├── transforms.json # needed for training
    │
    ├── train.txt
    │
    ├── images/
    │   ├── rgb_1.png
    │   └── rgb_2.png
    │
    ├── realsense_depth/
    │   ├── depth_1.png
    │   └── depth_2.png
    │
    ├── tactile/
    │   ├── image
    │   ├── mask
    │   ├── normal
    │   └── patch
    │
    ├── model.stl       # needed for evaluation
    │
    ├── normals_from_pretrain/ # generated
    │   ├── rgb_1.png
    │   └── rgb_2.png
    │
    ├── foreground_pcd.ply
    │
    └── merged_pcd.ply
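
Before training, it may be worth checking that a dataset folder matches this layout. The sketch below is an illustration rather than part of the repository. The function name `missing_entries` is hypothetical, and the choice of which entries count as required is an assumption: it checks the inputs in the tree above and skips model.stl (evaluation only), normals_from_pretrain/ (generated), and the two .ply files, whose origin this README does not state.

```python
from pathlib import Path

# Assumption: these entries must exist before training starts.
REQUIRED = ["transforms.json", "train.txt", "images", "realsense_depth", "tactile"]
TACTILE_ENTRIES = ["image", "mask", "normal", "patch"]


def missing_entries(dataset_dir):
    """Return the relative paths from the expected layout that do not exist."""
    root = Path(dataset_dir)
    expected = REQUIRED + [f"tactile/{name}" for name in TACTILE_ENTRIES]
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    # "datasets/ds_name" is a placeholder path.
    missing = missing_entries("datasets/ds_name")
    print("missing:", missing if missing else "nothing, layout looks complete")
```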

Outputs Format

outputs/
    ds_name/
    │
    ├── MESH/
    │   └── mesh.ply
    │
    ├── nerfstudio_models/
    │   └── 30000.ckpt
    │
    ├── cluster_centers.npy
    │
    ├── config.yml
    │
    ├── high_grad_pts.pcd
    │
    ├── high_grad_pts_ascii.pcd
    │
    └── dataparser_transforms.json

eval/
    ds_name/ *evaluation results files*
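
When a run directory holds several checkpoints, a small helper can pick the most recent one. This is a minimal sketch under the assumption that checkpoint filenames contain the training step as a number, as 30000.ckpt does in the tree above. The function name `latest_checkpoint` is hypothetical and not part of the repository.

```python
import re
from pathlib import Path


def latest_checkpoint(run_dir):
    """Return the .ckpt file with the largest step number, or None if absent."""
    ckpt_dir = Path(run_dir) / "nerfstudio_models"
    if not ckpt_dir.is_dir():
        return None
    best, best_step = None, -1
    for path in ckpt_dir.glob("*.ckpt"):
        digits = re.findall(r"\d+", path.stem)
        if not digits:
            continue  # skip files without a step number in their name
        step = int(digits[-1])
        if step > best_step:
            best, best_step = path, step
    return best


if __name__ == "__main__":
    # "outputs/ds_name" is a placeholder run directory.
    print(latest_checkpoint("outputs/ds_name"))
```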