|
DreamFusion |
|
=========== |
|
|
|
Model Introduction
------------------
|
|
|
DreamFusion :cite:`mm-models-df-poole2022dreamfusion` uses a pretrained text-to-image diffusion model to perform |
|
text-to-3D synthesis. The model uses a loss based on probability density distillation that enables the use of a 2D |
|
diffusion model as a prior for optimization of a parametric image generator. |
|
|
|
Using this loss in a DeepDream-like procedure, the model optimizes a randomly-initialized 3D model |
|
(a Neural Radiance Field, or NeRF) via gradient descent such that its 2D renderings from random angles achieve a low |
|
loss. The resulting 3D model of the given text can be viewed from any angle, relit by arbitrary illumination, or composited |
|
into any 3D environment. This approach requires no 3D training data and no modifications to the image diffusion model, |
|
demonstrating the effectiveness of pretrained image diffusion models as priors. |
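
The optimization is driven by the Score Distillation Sampling (SDS) loss. The PyTorch-style sketch below shows the shape of a single SDS update under stated assumptions: ``render_nerf``, ``encode_latents``, and ``diffusion_unet`` are hypothetical placeholders for the differentiable renderer, the latent encoder, and the frozen diffusion U-Net, not NeMo APIs.

.. code-block:: python

    import torch

    def sds_step(camera, text_emb, render_nerf, encode_latents, diffusion_unet,
                 alphas_cumprod, guidance_scale=100.0):
        """One Score Distillation Sampling update (conceptual sketch, not the NeMo code)."""
        # Render the current NeRF from a random camera and encode the image to latents.
        image = render_nerf(camera)                           # (1, 3, H, W), differentiable
        latents = encode_latents(image)                       # (1, 4, h, w)

        # Sample a diffusion timestep and perturb the latents accordingly.
        t = torch.randint(20, 981, (1,))
        noise = torch.randn_like(latents)
        a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
        noisy = a_t.sqrt() * latents + (1.0 - a_t).sqrt() * noise

        # Predict the noise with classifier-free guidance; the diffusion model stays frozen.
        with torch.no_grad():
            eps_cond = diffusion_unet(noisy, t, text_emb)
            eps_uncond = diffusion_unet(noisy, t, None)
        eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)

        # The SDS gradient flows through the renderer only; the U-Net Jacobian is skipped.
        grad = (1.0 - a_t) * (eps - noise)
        loss = (latents * grad.detach()).sum()                # surrogate whose gradient is `grad`
        loss.backward()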
|
|
|
DreamFusion models can be instantiated using the :class:`~nemo.collections.multimodal.models.nerf.dreamfusion.DreamFusion` class.
|
|
|
.. image:: images/dreamfusion_model_overview.png |
|
:align: center |
|
:width: 800px |
|
    :alt: DreamFusion model overview
|
|
|
|
|
Image Guidance |
|
^^^^^^^^^^^^^^ |
|
This component of DreamFusion covers the initial phase, in which the model interprets and translates the text input into visual concepts.
Using a diffusion-based text-to-image model, DreamFusion processes the text input, extracts key visual elements, and translates them into 2D images.
By conditioning the 2D images on the viewing angle, this process ensures that the generated 3D models are not only faithful to the text description but also visually coherent and detailed.
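
As an illustration, view-dependent conditioning can be as simple as appending an orientation suffix to the prompt based on the sampled camera azimuth. The helper below is a hypothetical sketch of that idea, not the NeMo implementation:

.. code-block:: python

    def view_dependent_prompt(prompt: str, azimuth_deg: float,
                              front: str = ", front view",
                              side: str = ", side view",
                              back: str = ", back view") -> str:
        """Pick an orientation suffix for the prompt from the camera azimuth (sketch)."""
        azimuth_deg = azimuth_deg % 360
        if azimuth_deg < 45 or azimuth_deg >= 315:
            return prompt + front
        if 135 <= azimuth_deg < 225:
            return prompt + back
        return prompt + side

    print(view_dependent_prompt("a hamburger", 180.0))  # "a hamburger, back view"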
|
|
|
|
|
NeRF (Foreground) Network |
|
^^^^^^^^^^^^^^^^^^^^^^^^^ |
|
The Neural Radiance Field (NeRF) network is at the heart of DreamFusion's 3D rendering capabilities.
|
In DreamFusion, the NeRF network takes the 2D images generated from the textual description and constructs a 3D model. |
|
This model is represented as a continuous volumetric scene function, which encodes the color and density of points in space, |
|
allowing for highly detailed and photorealistic renderings. |
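
Conceptually, the foreground network is a function that maps a 3D position to a density and a color. The minimal MLP below is a generic, self-contained sketch of that mapping; it stands in for (and is much simpler than) the hash-grid torch-ngp network configured later on this page:

.. code-block:: python

    import torch
    import torch.nn as nn

    class TinyNeRFField(nn.Module):
        """Maps 3D points to (density, RGB); a toy stand-in for the hash-grid NeRF."""

        def __init__(self, hidden: int = 64):
            super().__init__()
            self.sigma_net = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(),
                                           nn.Linear(hidden, 1 + hidden))
            self.color_net = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                           nn.Linear(hidden, 3))

        def forward(self, xyz: torch.Tensor):
            h = self.sigma_net(xyz)
            sigma = torch.exp(h[..., :1])           # 'exp' density activation, as in the config
            rgb = torch.sigmoid(self.color_net(h[..., 1:]))
            return sigma, rgb

    field = TinyNeRFField()
    sigma, rgb = field(torch.rand(1024, 3))         # densities and colors for 1024 sample points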
|
|
|
Background Layer |
|
^^^^^^^^^^^^^^^^ |
|
DreamFusion can leverage a background layer dedicated to background modeling. |
|
|
|
In scenarios where a dynamic background is needed, DreamFusion can be configured to use a secondary NeRF network to generate a background. |
|
This network functions in parallel to the primary NeRF network, focusing on creating a coherent and contextually appropriate backdrop for the main scene. |
|
It dynamically adjusts to lighting and perspective changes, maintaining consistency with the foreground model. |
|
|
|
Alternatively, DreamFusion supports a static background color, which is particularly useful when the focus is predominantly on the generated object and a non-distracting backdrop is desirable.
Implementing a static color background amounts to rendering a single uniform color behind the 3D model.
This approach simplifies the rendering process and can reduce computational load while keeping the focus on the primary object.
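
For example, compositing a rendered foreground over a static background color only requires the accumulated opacity of each ray. The snippet below is a generic sketch of that blend, not the NeMo module:

.. code-block:: python

    import torch

    def composite_static_background(fg_rgb: torch.Tensor, fg_alpha: torch.Tensor,
                                    bg_color=(0.0, 0.0, 1.0)) -> torch.Tensor:
        """Blend foreground colors (N, 3) with a uniform background using ray opacities (N, 1)."""
        bg = torch.tensor(bg_color, dtype=fg_rgb.dtype, device=fg_rgb.device)
        return fg_rgb + (1.0 - fg_alpha) * bg   # rays that miss the object show the background

    pixels = composite_static_background(torch.rand(4096, 3), torch.rand(4096, 1))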
|
|
|
Materials Network |
|
^^^^^^^^^^^^^^^^^ |
|
The materials network in DreamFusion is responsible for adding realism to the 3D models by accurately simulating the physical properties of different materials. |
|
This network takes into account various aspects like texture, reflectivity, and transparency. |
|
By doing so, it adds another layer of detail, making the objects generated by DreamFusion not just structurally accurate but also visually realistic.
|
|
|
|
|
Renderer Layer |
|
^^^^^^^^^^^^^^ |
|
The renderer layer functions as the culminating stage in DreamFusion's processing pipeline. |
|
It translates the synthesized volumetric data from the NeRF and material networks into perceptible imagery. |
|
Employing ray-marching volume rendering, this layer computes the interaction of light with the 3D scene,
producing images with dynamic lighting and perspective-correct renderings.
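
A sketch of the compositing the volume renderer performs: densities and colors sampled along each ray are alpha-composited front to back. This is a generic NeRF compositing routine under simplifying assumptions, not the torch-ngp CUDA ray marcher configured below:

.. code-block:: python

    import torch

    def composite_ray(sigmas: torch.Tensor, rgbs: torch.Tensor, deltas: torch.Tensor):
        """Front-to-back compositing along one ray.

        sigmas: (S,) densities, rgbs: (S, 3) colors, deltas: (S,) step sizes.
        """
        alphas = 1.0 - torch.exp(-sigmas * deltas)            # per-sample opacity
        trans = torch.cumprod(
            torch.cat([torch.ones_like(alphas[:1]), 1.0 - alphas + 1e-10]), dim=0)[:-1]
        weights = alphas * trans                              # contribution of each sample
        color = (weights[:, None] * rgbs).sum(dim=0)          # rendered pixel color
        opacity = weights.sum()                               # accumulated alpha for compositing
        return color, opacity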
|
|
|
|
|
|
|
Model Configuration
-------------------
|
|
|
|
|
DreamFusion models can be instantiated using the :class:`~nemo.collections.multimodal.models.nerf.dreamfusion.DreamFusion` class. |
|
The model configuration file is organized into the following sections: |
|
|
|
.. code-block:: yaml |
|
|
|
_target_: nemo.collections.multimodal.models.nerf.dreamfusion.DreamFusion |
|
defaults: |
|
- nerf: torchngp |
|
- background: static |
|
- material: basic_shading |
|
- renderer: torchngp_raymarching |
|
- guidance: sd_huggingface |
|
- optim: adan |
|
- loss: dreamfusion |
|
- data: data |
|
- _self_ |
|
|
|
### model options |
|
resume_from_checkpoint: |
|
prompt: 'a hamburger' |
|
negative_prompt: '' |
|
front_prompt: ', front view' |
|
side_prompt: ', side view' |
|
back_prompt: ', back view' |
|
update_extra_interval: 16 |
|
guidance_scale: 100 |
|
export_video: False |
|
|
|
iters: ${trainer.max_steps} |
|
latent_iter_ratio: 0.2 |
|
albedo_iter_ratio: 0.0 |
|
min_ambient_ratio: 0.1 |
|
textureless_ratio: 0.2 |
|
|
|
data: |
|
train_dataset: |
|
width: 64 |
|
height: 64 |
|
val_dataset: |
|
width: 800 |
|
height: 800 |
|
test_dataset: |
|
width: 800 |
|
height: 800 |
|
|
|
- ``defaults``: Defines default modules for different components like nerf, background, material, etc. |
|
- ``resume_from_checkpoint``: Path to a checkpoint file for initializing the model. |
|
- ``prompt``: Main textual input for the model describing the object to generate. |
|
- ``negative_prompt``: Textual input describing what to avoid in the generated object. |
|
- ``front_prompt``, ``side_prompt``, ``back_prompt``: Textual inputs that are appended to the prompts for more detailed orientation guidance. |
|
- ``update_extra_interval``: Interval for updating internal module parameters. |
|
- ``guidance_scale``: The guidance scale used with the diffusion model.
|
- ``export_video``: Boolean that determines whether to export a 360° video of the generated object.
|
- ``iters``, ``latent_iter_ratio``, ``albedo_iter_ratio``, ``min_ambient_ratio``, ``textureless_ratio``: Various ratios and parameters defining iteration behavior and visual characteristics of the output. |
|
- ``data``: Defines dataset dimensions for training, validation, and testing. |
|
|
|
The behavior of the pipeline can be precisely adjusted by tuning the parameters of the components listed in the ``defaults`` section.
Some components support different backends and implementations; the full catalog of components can be viewed in the config directory ``{NEMO_ROOT}/examples/multimodal/generative/nerf/conf/model``.
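
Every component carries a ``_target_`` entry, so the composed configuration can be turned into the corresponding Python objects with Hydra's instantiation utilities. The stand-alone example below only illustrates that mechanism on the static background config shown later; in practice the NeMo example scripts compose and instantiate the full model for you:

.. code-block:: python

    from hydra.utils import instantiate
    from omegaconf import OmegaConf

    # Minimal illustration of the _target_ mechanism used throughout the model config.
    cfg = OmegaConf.create({
        "_target_": "nemo.collections.multimodal.modules.nerf.background.static_background.StaticBackground",
        "background": [0, 0, 1],
    })
    background = instantiate(cfg)   # builds the StaticBackground module from the config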
|
|
|
Image Guidance |
|
^^^^^^^^^^^^^^ |
|
|
|
.. code-block:: yaml |
|
|
|
_target_: nemo.collections.multimodal.modules.nerf.guidance.stablediffusion_huggingface_pipeline.StableDiffusion |
|
precision: ${trainer.precision} |
|
model_key: stabilityai/stable-diffusion-2-1-base |
|
t_range: [0.02, 0.98] |
|
|
|
- ``precision``: Sets the precision of computations (e.g., FP32 or FP16). |
|
- ``model_key``: Specifies the pre-trained model to use for image guidance. |
|
- ``t_range``: Range of diffusion timesteps, expressed as fractions of the full noise schedule, that are sampled during guidance (illustrated below).
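
For instance, with a 1000-step noise schedule, ``t_range: [0.02, 0.98]`` restricts the sampled timesteps to roughly steps 20 through 980. The snippet below is a hypothetical illustration of that mapping:

.. code-block:: python

    import torch

    num_train_timesteps = 1000                     # length of the diffusion noise schedule
    t_range = (0.02, 0.98)
    t_min = int(num_train_timesteps * t_range[0])  # 20
    t_max = int(num_train_timesteps * t_range[1])  # 980

    t = torch.randint(t_min, t_max + 1, (1,))      # timestep used for one guidance update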
|
|
|
|
|
NeRF (Foreground) Network |
|
^^^^^^^^^^^^^^^^^^^^^^^^^ |
|
|
|
.. code-block:: yaml |
|
|
|
_target_: nemo.collections.multimodal.modules.nerf.geometry.torchngp_nerf.TorchNGPNerf |
|
num_input_dims: 3 |
|
bound: 1 |
|
density_activation: exp |
|
blob_radius: 0.2 |
|
blob_density: 5 |
|
normal_type: central_finite_difference |
|
|
|
encoder_cfg: |
|
encoder_type: 'hashgrid' |
|
encoder_max_level: |
|
log2_hashmap_size: 19 |
|
desired_resolution: 2048 |
|
interpolation: smoothstep |
|
|
|
sigma_net_num_output_dims: 1 |
|
sigma_net_cfg: |
|
num_hidden_dims: 64 |
|
num_layers: 3 |
|
bias: True |
|
|
|
features_net_num_output_dims: 3 |
|
features_net_cfg: |
|
num_hidden_dims: 64 |
|
num_layers: 3 |
|
bias: True |
|
|
|
Describes the NeRF network's architecture, including the density activation function, network configuration, and the specification of the sigma and features networks. |
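
The ``blob_radius`` and ``blob_density`` options bias the initial density toward a blob at the origin, so optimization starts from an object-centered scene rather than empty space. The sketch below shows one common form of such a bias; the exact functional form used by the NeMo/torch-ngp implementation may differ:

.. code-block:: python

    import torch

    def density_blob(xyz: torch.Tensor, blob_density: float = 5.0,
                     blob_radius: float = 0.2) -> torch.Tensor:
        """Gaussian density bias centered at the origin (illustrative form only)."""
        d2 = (xyz ** 2).sum(dim=-1, keepdim=True)
        return blob_density * torch.exp(-d2 / (2 * blob_radius ** 2))

    # Added to the raw sigma-network output before the density activation.
    bias = density_blob(torch.rand(1024, 3))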
|
|
|
Background Layer |
|
^^^^^^^^^^^^^^^^ |
|
|
|
.. code-block:: yaml |
|
|
|
_target_: nemo.collections.multimodal.modules.nerf.background.static_background.StaticBackground |
|
background: [0, 0, 1] |
|
|
|
Static background, where the ``background`` key specifies the RGB color.
|
|
|
.. code-block:: yaml |
|
|
|
_target_: nemo.collections.multimodal.modules.nerf.background.torchngp_background.TorchNGPBackground |
|
|
|
encoder_type: "frequency" |
|
encoder_input_dims: 3 |
|
encoder_multi_res: 6 |
|
|
|
num_output_dims: 3 |
|
net_cfg: |
|
num_hidden_dims: 32 |
|
num_layers: 2 |
|
bias: True |
|
|
|
Dynamic background, where the background is generated by a NeRF network. |
|
|
|
|
|
Materials Network |
|
^^^^^^^^^^^^^^^^^ |
|
|
|
.. code-block:: yaml |
|
|
|
_target_: nemo.collections.multimodal.modules.nerf.materials.basic_shading.BasicShading |
|
|
|
Defines the basic shading model for the materials network. The basic shading model supports textureless, Lambertian, and Phong shading.
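
A hedged sketch of the Lambertian case: the albedo predicted by the NeRF is modulated by an ambient term plus a diffuse term from the light direction (textureless shading would use a constant albedo instead). This illustrates the idea only and is not the NeMo module:

.. code-block:: python

    import torch

    def lambertian_shading(albedo: torch.Tensor, normals: torch.Tensor,
                           light_dir: torch.Tensor, ambient_ratio: float = 0.1) -> torch.Tensor:
        """Shade surface points: albedo (N, 3), unit normals (N, 3), unit light_dir (3,)."""
        diffuse = (normals * light_dir).sum(dim=-1, keepdim=True).clamp(min=0.0)
        return albedo * (ambient_ratio + (1.0 - ambient_ratio) * diffuse)

    rgb = lambertian_shading(torch.rand(1024, 3),
                             torch.nn.functional.normalize(torch.randn(1024, 3), dim=-1),
                             torch.nn.functional.normalize(torch.randn(3), dim=-1))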
|
|
|
|
|
Renderer Layer
|
^^^^^^^^^^^^^^ |
|
|
|
.. code-block:: yaml |
|
|
|
_target_: nemo.collections.multimodal.modules.nerf.renderers.torchngp_volume_renderer.TorchNGPVolumeRenderer |
|
bound: ${model.nerf.bound} |
|
update_interval: 16 |
|
grid_resolution: 128 |
|
density_thresh: 10 |
|
max_steps: 1024 |
|
dt_gamma: 0 |
|
|
|
Configures the renderer, specifying parameters like update interval, grid resolution, and rendering thresholds. |
|
|
|
|
|
DreamFusion-DMTet |
|
----------------- |
|
NeRF models integrate geometry and appearance through volume rendering. As a result, |
|
using NeRF for 3D modeling can be less effective at capturing both the intricate details of a surface and its material and texture.
|
|
|
DMTet fine-tuning disentangles the learning of geometry and appearance models, such that both a fine surface and a rich |
|
material/texture can be generated. To enable such a disentangled learning, a hybrid scene representation of |
|
`DMTet <https://nv-tlabs.github.io/DMTet/>`_ is used.
|
|
|
The DMTet model maintains a deformable tetrahedral grid that encodes a discretized signed distance function and a |
|
differentiable marching tetrahedra layer that converts the implicit signed distance representation to the explicit |
|
surface mesh representation. |
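
Conceptually, each tetrahedral edge whose two vertex SDF values have opposite signs contributes a surface vertex, placed by linear interpolation of the signed distances; the full differentiable marching tetrahedra layer applies this to all edges and assembles the resulting faces. The snippet below sketches the zero-crossing computation for a single edge:

.. code-block:: python

    import torch

    def edge_zero_crossing(v0: torch.Tensor, v1: torch.Tensor,
                           sdf0: torch.Tensor, sdf1: torch.Tensor) -> torch.Tensor:
        """Surface vertex on an edge whose endpoint SDF values change sign (sketch)."""
        w = sdf0 / (sdf0 - sdf1)          # interpolation weight where the SDF crosses zero
        return v0 + w * (v1 - v0)

    # Endpoints with SDF values -0.3 and +0.1 place the vertex 75% of the way along the edge.
    p = edge_zero_crossing(torch.zeros(3), torch.ones(3),
                           torch.tensor(-0.3), torch.tensor(0.1))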
|
|
|
|
|
Model Configuration |
|
^^^^^^^^^^^^^^^^^^^ |
|
|
|
DreamFusion-DMTet models can be instantiated using the same :class:`~nemo.collections.multimodal.models.nerf.dreamfusion.DreamFusion` class as DreamFusion.
|
However, the following changes to the training pipeline are necessary: |
|
|
|
.. code-block:: yaml |
|
|
|
_target_: nemo.collections.multimodal.models.nerf.dreamfusion.DreamFusion |
|
defaults: |
|
- nerf: torchngp |
|
- background: torchngp |
|
- material: basic_shading |
|
- renderer: nvdiffrast # (1) |
|
- guidance: sd_huggingface |
|
- optim: adan |
|
- loss: dmtet # (2) |
|
- data: data |
|
- _self_ |
|
|
|
### model options |
|
resume_from_checkpoint: "/results/DreamFusion/checkpoints/DreamFusion-step\\=10000-last.ckpt" # (3) |
|
prompt: 'a hamburger' |
|
negative_prompt: '' |
|
front_prompt: ', front view' |
|
side_prompt: ', side view' |
|
back_prompt: ', back view' |
|
update_extra_interval: 16 |
|
guidance_scale: 100 |
|
export_video: False |
|
|
|
iters: ${trainer.max_steps} |
|
latent_iter_ratio: 0.0 |
|
albedo_iter_ratio: 0 |
|
min_ambient_ratio: 0.1 |
|
textureless_ratio: 0.2 |
|
|
|
data: |
|
train_dataset: |
|
width: 512 # (4) |
|
height: 512 # (4) |
|
val_dataset: |
|
width: 800 |
|
height: 800 |
|
test_dataset: |
|
width: 800 |
|
height: 800 |
|
|
|
|
|
We note the following changes: |
|
1. The rendering module was updated from a volumetric-based approach to a rasterization-based one using nvdiffrast. |
|
2. The model loss is changed to account for the changes in the geometry representation. |
|
3. DreamFusion-DMTet fine-tunes a pretrained DreamFusion model; the pretrained checkpoint is provided via ``resume_from_checkpoint``.
|
4. The training image resolution is increased to 512x512.
|
|
|
|
|
References |
|
---------- |
|
|
|
.. bibliography:: ../mm_all.bib |
|
:style: plain |
|
:filter: docname in docnames |
|
:labelprefix: MM-MODELS-DF |
|
:keyprefix: mm-models-df- |
|
|