Hi,
I noticed that you used SAM and Grounding DINO to generate segmentation masks.
Could you please explain how you merge the outputs of SAM and Grounding DINO to create the ground-truth masks in the GranD-f dataset?
Additionally, could you describe the process of creating the final dense caption?
Is your method fully automated, or does it require manual verification?
I am interested in applying your method to the GranD dataset.
Thank you.