---
language:
- en
license: other
license_name: autodesk-non-commercial-3d-generative-v1.0
tags:
- wala
- text-to-multiview
pipeline_tag: text-to-image
---
Model Card for WaLa-MVDream-RGB4
This model is part of the Wavelet Latent Diffusion (WaLa) paper, capable of generating four-view RGB images from text descriptions to support text-to-3D generation.
Model Details
Model Description
WaLa-MVDream-RGB4 is a fine-tuned version of the MVDream model, adapted to generate four-view RGB images from text inputs. This model serves as an intermediate step in the text-to-3D generation pipeline of WaLa, producing multi-view images that are then used by the WaLa-RGB4-1B model to generate 3D shapes.
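The two-stage data flow described above can be sketched as follows. This is a hypothetical illustration, not the real WaLa API: both model calls are placeholder stubs, and the image resolution and 3D grid size are illustrative assumptions. Only the "text → four RGB views → 3D shape" hand-off reflects the card.

```python
# Hypothetical sketch of the WaLa two-stage text-to-3D data flow.
# generate_multiview() and generate_shape() are placeholder stubs, NOT the
# real WaLa API; they only illustrate the shapes passed between the stages.
import numpy as np

NUM_VIEWS = 4   # WaLa-MVDream-RGB4 produces four views
IMG_SIZE = 256  # illustrative resolution (assumption)

def generate_multiview(prompt: str) -> np.ndarray:
    # Stand-in for WaLa-MVDream-RGB4: text -> four RGB views.
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.random((NUM_VIEWS, IMG_SIZE, IMG_SIZE, 3), dtype=np.float32)

def generate_shape(views: np.ndarray) -> np.ndarray:
    # Stand-in for WaLa-RGB4-1B: four views -> a 3D representation
    # (placeholder grid; the real model uses compact wavelet encodings).
    assert views.shape[0] == NUM_VIEWS
    return np.zeros((64, 64, 64), dtype=np.float32)

views = generate_multiview("a wooden chair with armrests")
shape = generate_shape(views)
print(views.shape, shape.shape)
```

The point of the sketch is the interface: this model only produces the intermediate four-view images, and a separate model consumes them to produce geometry.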
- Developed by: Aditya Sanghi, Aliasghar Khani, Chinthala Pradyumna Reddy, Arianna Rampini, Derek Cheung, Kamal Rahimi Malekshan, Kanika Madan, Hooman Shayani
- Model type: Text-to-Image Generative Model
- License: Autodesk Non-Commercial (3D Generative) v1.0
For more information, please see the Project Page and the paper.
Model Sources
Uses
Direct Use
This model is released by Autodesk for academic and research purposes only, to support theoretical exploration and demonstration of the WaLa 3D generative framework. It is designed to be used together with WaLa-RGB4-1B for text-to-3D generation. Please see here for inference instructions.
Out-of-Scope Use
The model should not be used for:
- Commercial purposes
- Generation of inappropriate or offensive content
- Any use not in compliance with the license, in particular its "Acceptable Use" section
Bias, Risks, and Limitations
Bias
- The model may inherit biases present in the text-image datasets used for pre-training and fine-tuning.
- The model's performance may vary depending on the complexity and specificity of the input text descriptions.
Risks and Limitations
- The quality of the generated multi-view images may impact the subsequent 3D shape generation.
- The model may occasionally generate images that do not accurately represent the input text or maintain consistency across views.
How to Get Started with the Model
Please refer to the instructions here.
Training Details
Training Data
The model was fine-tuned using captions generated for the WaLa dataset. Captions were initially created using the InternVL 2.0 model and then augmented using LLaMA 3.1 to enhance diversity and richness.
Training Procedure
Preprocessing
Captions were generated for each 3D object in the dataset using four renderings and two distinct prompts. These captions were then augmented to increase diversity.
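The captioning scheme above can be sketched as a simple loop. This is a hypothetical illustration: `caption_model()` and `augment_model()` are placeholder stubs for InternVL 2.0 and LLaMA 3.1, the prompt texts are invented, and the assumption that each of the two prompts is applied to each of the four renderings (giving eight raw captions per object) is mine, not stated in the card.

```python
# Hypothetical sketch of the caption-generation preprocessing.
# caption_model() and augment_model() are placeholder stubs for
# InternVL 2.0 and LLaMA 3.1; prompt texts and the per-rendering
# application of prompts are illustrative assumptions.

PROMPTS = [
    "Describe this object in one sentence.",  # illustrative prompt
    "List the key visual attributes.",        # illustrative prompt
]

def caption_model(rendering: str, prompt: str) -> str:
    # Stand-in for InternVL 2.0 captioning a single rendering.
    return f"caption({rendering}, {prompt!r})"

def augment_model(caption: str) -> str:
    # Stand-in for LLaMA 3.1 rewriting a caption for diversity.
    return f"augmented: {caption}"

def captions_for_object(object_id: str) -> list[str]:
    renderings = [f"{object_id}_view{i}.png" for i in range(4)]  # four renderings
    raw = [caption_model(r, p) for r in renderings for p in PROMPTS]
    return [augment_model(c) for c in raw]

caps = captions_for_object("chair_001")
print(len(caps))  # 4 renderings x 2 prompts = 8 captions under this assumption
```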
Training Hyperparameters
- Training regime: Please refer to the paper.
Technical Specifications
Model Architecture and Objective
The model is based on the MVDream architecture, fine-tuned to generate four-view RGB images from text inputs. It is designed to work in tandem with the WaLa-RGB4-1B model for text-to-3D generation.
Compute Infrastructure
Hardware
The model was trained on NVIDIA H100 GPUs.
Citation
@misc{sanghi2024waveletlatentdiffusionwala,
  title={Wavelet Latent Diffusion (Wala): Billion-Parameter 3D Generative Model with Compact Wavelet Encodings},
  author={Aditya Sanghi and Aliasghar Khani and Pradyumna Reddy and Arianna Rampini and Derek Cheung and Kamal Rahimi Malekshan and Kanika Madan and Hooman Shayani},
  year={2024},
  eprint={2411.08017},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2411.08017},
}