arxiv:2307.05663

Objaverse-XL: A Universe of 10M+ 3D Objects

Published on Jul 11, 2023

Abstract

Natural language processing and 2D vision models have attained remarkable proficiency on many tasks primarily by escalating the scale of training data. However, 3D vision tasks have not seen the same progress, in part due to the challenges of acquiring high-quality 3D data. In this work, we present Objaverse-XL, a dataset of over 10 million 3D objects. Our dataset comprises deduplicated 3D objects from a diverse set of sources, including manually designed objects, photogrammetry scans of landmarks and everyday items, and professional scans of historic and antique artifacts. Representing the largest scale and diversity in the realm of 3D datasets, Objaverse-XL enables significant new possibilities for 3D vision. Our experiments demonstrate the improvements enabled with the scale provided by Objaverse-XL. We show that by training Zero123 on novel view synthesis, utilizing over 100 million multi-view rendered images, we achieve strong zero-shot generalization abilities. We hope that releasing Objaverse-XL will enable further innovations in the field of 3D vision at scale.

Community

Introduces Objaverse-XL: a dataset of over 10M web-crawled 3D object scans and assets for scaling 3D foundation-model research (analogous to Common Crawl and large image datasets for LLMs and VLMs); shows improvements for Zero123 and PixelNeRF; much larger than ShapeNet and Objaverse 1.0, and useful for text- and image-conditioned 3D shape generation. Objects are mined from GitHub (public files with extensions such as glb, fbx, ply, and blend that are renderable in Blender), Thingiverse (STL files with randomized colors), Sketchfab (GLB files), Polycam (community scans), and the Smithsonian Institution (GLB scans of historic and cultural artifacts).

Per-object metadata records mesh properties (polygon, vertex, and edge counts, etc.) and covers a higher share of animated objects (with armatures/bones for Blender animations) than Objaverse 1.0; the metadata is further augmented with CLIP ViT-L/14 embeddings of 12 randomly rendered views, used for downstream predictions such as aesthetics, NSFW, and hole detection. Analyses cover NSFW content (very few objects score high across multi-view renders), face detection, and bad poly-meshes (a two-layer MLP trained on CLIP embeddings of manually labeled bad meshes, largely to detect holes).

Zero123-XL, fine-tuned on a high-quality aligned subset of Objaverse-XL, improves over the base model on PSNR, SSIM, LPIPS, and FID. PixelNeRF (NeRF from one or a few images) also improves when fine-tuned on Objaverse-XL, and the trends scale with dataset size: PixelNeRF PSNR increases and Zero123 LPIPS decreases as more data is used. The appendix gives Zero123-XL fine-tuning implementation details, novel-view synthesis results for Zero123-XL (better consistency than the original), a datasheet with FAQs, and quality filtering/categorization using LAION-Aesthetics V2 (most objects fall in the T2 bucket). From Allen Institute for AI, University of Washington, Columbia University, Stability AI, California Institute of Technology, and LAION. Hedged code sketches covering the dataset download, per-view CLIP embeddings, the bad-mesh MLP, and the novel-view synthesis metrics follow below.
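A minimal sketch of pulling a filtered slice of the dataset, assuming the `objaverse` pip package exposes an `objaverse.xl` module with `get_annotations()` and `download_objects()` and that the annotation columns include "source" and "fileType"; these names are assumptions, so check the linked GitHub repo for the current API before relying on this.

```python
# Hedged sketch: download a small Sketchfab/GLB slice of Objaverse-XL.
# API names and DataFrame column names are assumptions, not verified here.
import objaverse.xl as oxl

# Annotations arrive as a pandas DataFrame describing every crawled object.
annotations = oxl.get_annotations(download_dir="~/.objaverse")

# Keep only Sketchfab GLBs, e.g. to approximate an Objaverse-1.0-like subset,
# and sample 100 of them for a quick experiment.
subset = annotations[
    (annotations["source"] == "sketchfab") & (annotations["fileType"] == "glb")
].sample(100, random_state=0)

# Download the selected objects locally; the return value maps each object
# identifier to its downloaded file path (per the repo's documentation).
paths = oxl.download_objects(objects=subset, download_dir="~/.objaverse")
print(f"Downloaded {len(paths)} objects")
```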
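The metadata augmentation described above attaches CLIP ViT-L/14 embeddings of rendered views to each object. Below is a sketch of how such embeddings could be computed with the Hugging Face `transformers` CLIP implementation; the `renders/` directory and the mean-pooling across views are illustrative assumptions, not the paper's exact pipeline.

```python
# Sketch: CLIP ViT-L/14 image embeddings for multiple rendered views of one
# object, pooled into a single descriptor for downstream classifiers.
from pathlib import Path

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Hypothetical directory holding the 12 rendered views of a single object.
views = [Image.open(p).convert("RGB") for p in sorted(Path("renders").glob("*.png"))]

with torch.no_grad():
    inputs = processor(images=views, return_tensors="pt").to(device)
    feats = model.get_image_features(**inputs)        # (num_views, 768)
    feats = feats / feats.norm(dim=-1, keepdim=True)  # unit-normalize each view
    object_embedding = feats.mean(dim=0)              # simple multi-view pooling

print(object_embedding.shape)  # torch.Size([768])
```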
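The bad poly-mesh analysis uses a two-layer MLP over CLIP embeddings trained on manually labeled meshes. A minimal PyTorch sketch of that kind of classifier is below; the hidden size, loss, and training loop are assumptions rather than the paper's exact configuration.

```python
# Sketch of a lightweight mesh-quality classifier: two-layer MLP over pooled
# CLIP embeddings, trained with binary labels (bad mesh / holes vs. fine).
import torch
import torch.nn as nn

class MeshQualityMLP(nn.Module):
    def __init__(self, embed_dim: int = 768, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # single logit: "bad mesh" probability
        )

    def forward(self, clip_embedding: torch.Tensor) -> torch.Tensor:
        return self.net(clip_embedding)

model = MeshQualityMLP()
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Dummy batch standing in for labeled CLIP embeddings of good/bad meshes.
embeddings = torch.randn(32, 768)
labels = torch.randint(0, 2, (32, 1)).float()

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(embeddings), labels)
    loss.backward()
    optimizer.step()
```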

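The Zero123-XL and PixelNeRF comparisons are reported in PSNR, SSIM, LPIPS, and FID. As a reference for the first three, here is a sketch using `torchmetrics` (the LPIPS metric needs its extra dependencies installed); the random tensors only stand in for predicted and ground-truth novel views.

```python
# Sketch: image-quality metrics commonly used for novel-view synthesis.
import torch
from torchmetrics.image import PeakSignalNoiseRatio, StructuralSimilarityIndexMeasure
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity

psnr = PeakSignalNoiseRatio(data_range=1.0)
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
lpips = LearnedPerceptualImagePatchSimilarity(net_type="vgg", normalize=True)  # inputs in [0, 1]

pred = torch.rand(4, 3, 256, 256)    # placeholder predicted views
target = torch.rand(4, 3, 256, 256)  # placeholder ground-truth views

print("PSNR :", psnr(pred, target).item())
print("SSIM :", ssim(pred, target).item())
print("LPIPS:", lpips(pred, target).item())
```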
Links: website, Blog (Stability AI), PapersWithCode, HuggingFace Datasets, Colab Notebook, GitHub
