
3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination

Published on Jun 7 · Submitted by jedyang97 on Jun 13

Abstract

The integration of language and 3D perception is crucial for developing embodied agents and robots that comprehend and interact with the physical world. While large language models (LLMs) have demonstrated impressive language understanding and generation capabilities, their adaptation to 3D environments (3D-LLMs) remains in its early stages. A primary challenge is the absence of large-scale datasets that provide dense grounding between language and 3D scenes. In this paper, we introduce 3D-GRAND, a pioneering large-scale dataset comprising 40,087 household scenes paired with 6.2 million densely-grounded scene-language instructions. Our results show that instruction tuning with 3D-GRAND significantly enhances grounding capabilities and reduces hallucinations in 3D-LLMs. As part of our contributions, we propose a comprehensive benchmark 3D-POPE to systematically evaluate hallucination in 3D-LLMs, enabling fair comparisons among future models. Our experiments highlight a scaling effect between dataset size and 3D-LLM performance, emphasizing the critical role of large-scale 3D-text datasets in advancing embodied AI research. Notably, our results demonstrate early signals for effective sim-to-real transfer, indicating that models trained on large synthetic data can perform well on real-world 3D scans. Through 3D-GRAND and 3D-POPE, we aim to equip the embodied AI community with essential resources and insights, setting the stage for more reliable and better-grounded 3D-LLMs. Project website: https://3d-grand.github.io

Community

Paper author and submitter:

🔥 3D-LLMs go brrrr! 🚀 Excited to announce our latest research on scaling 3D-LLM training data to million-scale with dense grounding.

🌟 Introducing 3D-GRAND: a pioneering dataset featuring 40,087 household scenes paired with 6.2 million densely-grounded 3D-text pairs. 🏠💬 https://3d-grand.github.io

🚀 We envision 3D-GRAND to be the bedrock for future 3D-LLMs! 🏠💬

  • 6.2 million instructions + 40k 3D household scenes 🔥
  • Significantly enhances grounding & reduces hallucinations for 3D-LLMs 🌟
  • 3D-POPE: the first benchmark for systematic evaluation of hallucinations in 3D-LLMs (see the scoring sketch after this list) 🎯
  • Data scaling law and sim-to-real transfer provide strong early signals for a low-cost, scalable future for 3D-LLMs 📈
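
To make the 3D-POPE bullet concrete: assuming 3D-POPE follows a POPE-style protocol of yes/no object-existence questions scored against scene ground truth, a hallucination evaluation boils down to standard classification metrics plus a "yes ratio" that exposes a bias toward affirming objects that are not there. The sketch below is our own illustration under that assumption; `ProbeResult`, the field names, and the metric set are not code from the 3D-POPE release.

```python
# Hypothetical POPE-style scoring sketch (illustrative, not the official 3D-POPE code).
from dataclasses import dataclass

@dataclass
class ProbeResult:
    ground_truth: bool    # is the queried object actually present in the 3D scene?
    model_says_yes: bool  # did the 3D-LLM answer "yes" to the existence question?

def pope_style_metrics(results: list[ProbeResult]) -> dict[str, float]:
    tp = sum(r.ground_truth and r.model_says_yes for r in results)
    fp = sum((not r.ground_truth) and r.model_says_yes for r in results)
    tn = sum((not r.ground_truth) and (not r.model_says_yes) for r in results)
    fn = sum(r.ground_truth and (not r.model_says_yes) for r in results)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(results),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": (tp + fp) / len(results),  # tendency to hallucinate "yes"
    }
```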

[Figure: 3D-GRAND teaser]

🌟 What's special about this data? 🤔

  • Dense Grounding: Unlike traditional 3D-text datasets, ours connects every noun to an object in the 3D world (see the example record after this list). 🏠🔗
  • Large-scale: We provide million-scale data, bridging the gap between 3D and 2D datasets. 📊
  • Diverse Tasks: We curate 8 diverse tasks to cover future 3D-LLM challenges. 🌍
  • Hallucination: We took special care to curate a balanced dataset that helps reduce hallucinations, and we introduce a benchmark for evaluating hallucinations in 3D-LLMs. 🧠📏
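
As a concrete picture of the dense-grounding bullet above, here is a minimal sketch of what a densely-grounded record could look like, with each noun in the text tied to a specific scene object. The field names, inline bracket syntax, and object IDs are illustrative assumptions, not the released 3D-GRAND schema; see the project website for the actual format.

```python
# Illustrative densely-grounded record (hypothetical schema, not the real 3D-GRAND format).
example_record = {
    "scene_id": "synthetic_household_0001",   # hypothetical scene identifier
    "task": "grounded_scene_description",     # one of the dataset's task types
    "text": "A [wooden chair](obj_03) sits next to the [round table](obj_07).",
    "groundings": {
        "obj_03": {"label": "chair", "bbox": [1.2, 0.0, 2.5, 0.5, 0.9, 0.5]},
        "obj_07": {"label": "table", "bbox": [2.0, 0.0, 2.4, 1.1, 0.8, 1.1]},
    },
}

# Every noun phrase in `text` carries a reference to a scene object,
# so a 3D-LLM trained on it must tie language to concrete geometry.
for obj_id, obj in example_record["groundings"].items():
    print(obj_id, obj["label"], obj["bbox"])
```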

[Figure: 3D-GRAND data overview]

🚀 Results of 3D-LLMs trained on 3D-GRAND:

  • Stronger Grounding
  • Less Hallucination (huge improvement over previous 3D-LLMs)
  • Data Scaling Law: More data -> better performance (see the fitting sketch after this list). 📈
  • Sim-to-real Transfer: Trained on synthetic 3D scenes -> effective transfer to real 3D scans in ScanNet. 🌍
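
To illustrate the data-scaling bullet above: one common way to quantify such a trend is to train on increasing fractions of the dataset and fit the benchmark score against the log of the data size. The numbers below are placeholders for illustration only, not results from the paper.

```python
# Sketch of quantifying a data-scaling trend (placeholder numbers, NOT paper results).
import numpy as np

dataset_fraction = np.array([0.1, 0.25, 0.5, 1.0])    # fraction of training data used
grounding_score = np.array([30.0, 35.0, 40.0, 45.0])  # placeholder benchmark scores

# Fit score ~ a * log(fraction) + b to check for a roughly log-linear scaling trend.
a, b = np.polyfit(np.log(dataset_fraction), grounding_score, deg=1)
print(f"slope per log-unit of data: {a:.2f}, intercept: {b:.2f}")
```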


🙌 Let's build better 3D-LLMs together! 🙌

📄 Paper: http://arxiv.org/abs/2406.05132
🌐 Website & Data: http://3d-grand.github.io
💻 Demo: http://huggingface.co/spaces/jedyang97/3D-GRAND
📊 3D-POPE Leaderboard: http://huggingface.co/spaces/sled-umich/3D-POPE-leaderboard
🔧 Code: http://github.com/sled-group/3D-GRAND


Thanks for sharing all details!! This helps : )

