shaw committed
Commit 0b18231
2 Parent(s): ea3ddc8 47d0083

Merge branch 'ashawkey:main' into main

assets/update_logs.md CHANGED
@@ -1,3 +1,7 @@
+### 2022.10.9
+* The shading (partially) starts to work, at least it won't make scene empty. For some prompts, it shows better results (less severe Janus problem). The textureless rendering mode is still disabled.
+* Enable shading by default (--albedo_iters 1000).
+
 ### 2022.10.5
 * Basic reproduction finished.
 * Non --cuda_ray, --tcnn are not working, need to fix.
docker/Dockerfile ADDED
@@ -0,0 +1,53 @@
+FROM nvidia/cuda:11.6.2-cudnn8-devel-ubuntu20.04
+
+# Remove any third-party apt sources to avoid issues with expiring keys.
+RUN rm -f /etc/apt/sources.list.d/*.list
+
+RUN apt-get update
+
+RUN DEBIAN_FRONTEND=noninteractive TZ=Europe/MADRID apt-get install -y tzdata
+
+# Install some basic utilities
+RUN apt-get install -y \
+    curl \
+    ca-certificates \
+    sudo \
+    git \
+    bzip2 \
+    libx11-6 \
+    python3 \
+    python3-pip \
+    libglfw3-dev \
+    libgles2-mesa-dev \
+    libglib2.0-0 \
+ && rm -rf /var/lib/apt/lists/*
+
+
+# Create a working directory
+RUN mkdir /app
+WORKDIR /app
+
+RUN cd /app
+RUN git clone https://github.com/ashawkey/stable-dreamfusion.git
+
+
+RUN pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
+
+WORKDIR /app/stable-dreamfusion
+
+RUN pip3 install -r requirements.txt
+RUN pip3 install git+https://github.com/NVlabs/nvdiffrast/
+
+# Needs nvidia runtime, if you have "No CUDA runtime is found" error: https://stackoverflow.com/questions/59691207/docker-build-with-nvidia-runtime, first answer
+RUN pip3 install git+https://github.com/NVlabs/tiny-cuda-nn/#subdirectory=bindings/torch
+
+RUN pip3 install git+https://github.com/openai/CLIP.git
+RUN bash scripts/install_ext.sh
+
+
+
+
+
+# Set the default command to python3
+#CMD ["python3"]
+
docker/README.md ADDED
@@ -0,0 +1,80 @@
+### Docker installation
+
+## Build image
+To build the docker image on your own machine, which may take 15-30 mins:
+```
+docker build -t stable-dreamfusion:latest .
+```
+
+If you have the error **No CUDA runtime is found** when building the wheels for tiny-cuda-nn you need to setup the nvidia-runtime for docker.
+```
+sudo apt-get install nvidia-container-runtime
+```
+Then edit `/etc/docker/daemon.json` and add the default-runtime:
+```
+{
+    "runtimes": {
+        "nvidia": {
+            "path": "nvidia-container-runtime",
+            "runtimeArgs": []
+        }
+    },
+    "default-runtime": "nvidia"
+}
+```
+And restart docker:
+```
+sudo systemctl restart docker
+```
+Now you can build tiny-cuda-nn inside docker.
+
+## Download image
+To download the image (~6GB) instead:
+```
+docker pull supercabb/stable-dreamfusion:3080_0.0.1
+docker tag supercabb/stable-dreamfusion:3080_0.0.1 stable-dreamfusion
+```
+
+## Use image
+
+You can launch an interactive shell inside the container:
+
+```
+docker run --gpus all -it --rm -v $(cd ~ && pwd):/mnt stable-dreamfusion /bin/bash
+```
+From this shell, all the code in the repo should work.
+
+To run any single command `<command...>` inside the docker container:
+```
+docker run --gpus all -it --rm -v $(cd ~ && pwd):/mnt stable-dreamfusion /bin/bash -c "<command...>"
+```
+To train:
+```
+export TOKEN="#HUGGING FACE ACCESS TOKEN#"
+docker run --gpus all -it --rm -v $(cd ~ && pwd):/mnt stable-dreamfusion /bin/bash -c "echo ${TOKEN} > TOKEN \
+&& python3 main.py --text \"a hamburger\" --workspace trial -O"
+
+```
+Run test without gui:
+```
+export PATH_TO_WORKSPACE="#PATH_TO_WORKSPACE#"
+docker run --gpus all -it --rm -e DISPLAY=$DISPLAY -v /tmp/.X11-unix:/tmp/.X11-unix:ro -v $(cd ~ && pwd):/mnt \
+-v $(cd ${PATH_TO_WORKSPACE} && pwd):/app/stable-dreamfusion/trial stable-dreamfusion /bin/bash -c "python3 \
+main.py --workspace trial -O --test"
+```
+Run test with gui:
+```
+export PATH_TO_WORKSPACE="#PATH_TO_WORKSPACE#"
+xhost +
+docker run --gpus all -it --rm -e DISPLAY=$DISPLAY -v /tmp/.X11-unix:/tmp/.X11-unix:ro -v $(cd ~ && pwd):/mnt \
+-v $(cd ${PATH_TO_WORKSPACE} && pwd):/app/stable-dreamfusion/trial stable-dreamfusion /bin/bash -c "python3 \
+main.py --workspace trial -O --test --gui"
+xhost -
+```
+
+
+
+
+
+
+
gradio_app.py ADDED
@@ -0,0 +1,227 @@
+import torch
+import argparse
+
+from nerf.provider import NeRFDataset
+from nerf.utils import *
+
+import gradio as gr
+import gc
+
+print(f'[INFO] loading options..')
+
+# fake config object, this should not be used in CMD, only allow change from gradio UI.
+parser = argparse.ArgumentParser()
+parser.add_argument('--text', default=None, help="text prompt")
+# parser.add_argument('-O', action='store_true', help="equals --fp16 --cuda_ray --dir_text")
+# parser.add_argument('-O2', action='store_true', help="equals --fp16 --dir_text")
+parser.add_argument('--test', action='store_true', help="test mode")
+parser.add_argument('--save_mesh', action='store_true', help="export an obj mesh with texture")
+parser.add_argument('--eval_interval', type=int, default=10, help="evaluate on the valid set every interval epochs")
+parser.add_argument('--workspace', type=str, default='trial_gradio')
+parser.add_argument('--guidance', type=str, default='stable-diffusion', help='choose from [stable-diffusion, clip]')
+parser.add_argument('--seed', type=int, default=0)
+
+### training options
+parser.add_argument('--iters', type=int, default=10000, help="training iters")
+parser.add_argument('--lr', type=float, default=1e-3, help="initial learning rate")
+parser.add_argument('--ckpt', type=str, default='latest')
+parser.add_argument('--cuda_ray', action='store_true', help="use CUDA raymarching instead of pytorch")
+parser.add_argument('--max_steps', type=int, default=1024, help="max num steps sampled per ray (only valid when using --cuda_ray)")
+parser.add_argument('--num_steps', type=int, default=64, help="num steps sampled per ray (only valid when not using --cuda_ray)")
+parser.add_argument('--upsample_steps', type=int, default=64, help="num steps up-sampled per ray (only valid when not using --cuda_ray)")
+parser.add_argument('--update_extra_interval', type=int, default=16, help="iter interval to update extra status (only valid when using --cuda_ray)")
+parser.add_argument('--max_ray_batch', type=int, default=4096, help="batch size of rays at inference to avoid OOM (only valid when not using --cuda_ray)")
+parser.add_argument('--albedo_iters', type=int, default=1000, help="training iters that only use albedo shading")
+# model options
+parser.add_argument('--bg_radius', type=float, default=1.4, help="if positive, use a background model at sphere(bg_radius)")
+parser.add_argument('--density_thresh', type=float, default=10, help="threshold for density grid to be occupied")
+# network backbone
+parser.add_argument('--fp16', action='store_true', help="use amp mixed precision training")
+parser.add_argument('--backbone', type=str, default='grid', help="nerf backbone, choose from [grid, tcnn, vanilla]")
+# rendering resolution in training, decrease this if CUDA OOM.
+parser.add_argument('--w', type=int, default=64, help="render width for NeRF in training")
+parser.add_argument('--h', type=int, default=64, help="render height for NeRF in training")
+parser.add_argument('--jitter_pose', action='store_true', help="add jitters to the randomly sampled camera poses")
+
+### dataset options
+parser.add_argument('--bound', type=float, default=1, help="assume the scene is bounded in box(-bound, bound)")
+parser.add_argument('--dt_gamma', type=float, default=0, help="dt_gamma (>=0) for adaptive ray marching. set to 0 to disable, >0 to accelerate rendering (but usually with worse quality)")
+parser.add_argument('--min_near', type=float, default=0.1, help="minimum near distance for camera")
+parser.add_argument('--radius_range', type=float, nargs='*', default=[1.0, 1.5], help="training camera radius range")
+parser.add_argument('--fovy_range', type=float, nargs='*', default=[40, 70], help="training camera fovy range")
+parser.add_argument('--dir_text', action='store_true', help="direction-encode the text prompt, by appending front/side/back/overhead view")
+parser.add_argument('--angle_overhead', type=float, default=30, help="[0, angle_overhead] is the overhead region")
+parser.add_argument('--angle_front', type=float, default=60, help="[0, angle_front] is the front region, [180, 180+angle_front] the back region, otherwise the side region.")
+
+parser.add_argument('--lambda_entropy', type=float, default=1e-4, help="loss scale for alpha entropy")
+parser.add_argument('--lambda_opacity', type=float, default=0, help="loss scale for alpha value")
+parser.add_argument('--lambda_orient', type=float, default=1e-2, help="loss scale for orientation")
+
+### GUI options
+parser.add_argument('--gui', action='store_true', help="start a GUI")
+parser.add_argument('--W', type=int, default=800, help="GUI width")
+parser.add_argument('--H', type=int, default=800, help="GUI height")
+parser.add_argument('--radius', type=float, default=3, help="default GUI camera radius from center")
+parser.add_argument('--fovy', type=float, default=60, help="default GUI camera fovy")
+parser.add_argument('--light_theta', type=float, default=60, help="default GUI light direction in [0, 180], corresponding to elevation [90, -90]")
+parser.add_argument('--light_phi', type=float, default=0, help="default GUI light direction in [0, 360), azimuth")
+parser.add_argument('--max_spp', type=int, default=1, help="GUI rendering max sample per pixel")
+
+opt = parser.parse_args()
+
+# default to use -O !!!
+opt.fp16 = True
+opt.dir_text = True
+opt.cuda_ray = True
+# opt.lambda_entropy = 1e-4
+# opt.lambda_opacity = 0
+
+if opt.backbone == 'vanilla':
+    from nerf.network import NeRFNetwork
+elif opt.backbone == 'tcnn':
+    from nerf.network_tcnn import NeRFNetwork
+elif opt.backbone == 'grid':
+    from nerf.network_grid import NeRFNetwork
+else:
+    raise NotImplementedError(f'--backbone {opt.backbone} is not implemented!')
+
+print(opt)
+
+device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+
+print(f'[INFO] loading models..')
+
+if opt.guidance == 'stable-diffusion':
+    from nerf.sd import StableDiffusion
+    guidance = StableDiffusion(device)
+elif opt.guidance == 'clip':
+    from nerf.clip import CLIP
+    guidance = CLIP(device)
+else:
+    raise NotImplementedError(f'--guidance {opt.guidance} is not implemented.')
+
+train_loader = NeRFDataset(opt, device=device, type='train', H=opt.h, W=opt.w, size=100).dataloader()
+valid_loader = NeRFDataset(opt, device=device, type='val', H=opt.H, W=opt.W, size=5).dataloader()
+test_loader = NeRFDataset(opt, device=device, type='test', H=opt.H, W=opt.W, size=100).dataloader()
+
+print(f'[INFO] everything loaded!')
+
+trainer = None
+model = None
+
+# define UI
+
+with gr.Blocks(css=".gradio-container {max-width: 512px; margin: auto;}") as demo:
+
+    # title
+    gr.Markdown('[Stable-DreamFusion](https://github.com/ashawkey/stable-dreamfusion) Text-to-3D Example')
+
+    # inputs
+    prompt = gr.Textbox(label="Prompt", max_lines=1, value="a DSLR photo of a koi fish")
+    iters = gr.Slider(label="Iters", minimum=1000, maximum=20000, value=5000, step=100)
+    seed = gr.Slider(label="Seed", minimum=0, maximum=2147483647, step=1, randomize=True)
+    button = gr.Button('Generate')
+
+    # outputs
+    image = gr.Image(label="image", visible=True)
+    video = gr.Video(label="video", visible=False)
+    logs = gr.Textbox(label="logging")
+
+    # gradio main func
+    def submit(text, iters, seed):
+
+        global trainer, model
+
+        # seed
+        opt.seed = seed
+        opt.text = text
+        opt.iters = iters
+
+        seed_everything(seed)
+
+        # clean up
+        if trainer is not None:
+            del model
+            del trainer
+            gc.collect()
+            torch.cuda.empty_cache()
+            print('[INFO] clean up!')
+
+        # simply reload everything...
+        model = NeRFNetwork(opt)
+        optimizer = lambda model: torch.optim.Adam(model.get_params(opt.lr), betas=(0.9, 0.99), eps=1e-15)
+        scheduler = lambda optimizer: optim.lr_scheduler.LambdaLR(optimizer, lambda iter: 0.1 ** min(iter / opt.iters, 1))
+
+        trainer = Trainer('df', opt, model, guidance, device=device, workspace=opt.workspace, optimizer=optimizer, ema_decay=0.95, fp16=opt.fp16, lr_scheduler=scheduler, use_checkpoint=opt.ckpt, eval_interval=opt.eval_interval, scheduler_update_every_step=True)
+
+        # train (every ep only contain 8 steps, so we can get some vis every ~10s)
+        STEPS = 8
+        max_epochs = np.ceil(opt.iters / STEPS).astype(np.int32)
+
+        # we have to get the explicit training loop out here to yield progressive results...
+        loader = iter(valid_loader)
+
+        start_t = time.time()
+
+        for epoch in range(max_epochs):
+
+            trainer.train_gui(train_loader, step=STEPS)
+
+            # manual test and get intermediate results
+            try:
+                data = next(loader)
+            except StopIteration:
+                loader = iter(valid_loader)
+                data = next(loader)
+
+            trainer.model.eval()
+
+            if trainer.ema is not None:
+                trainer.ema.store()
+                trainer.ema.copy_to()
+
+            with torch.no_grad():
+                with torch.cuda.amp.autocast(enabled=trainer.fp16):
+                    preds, preds_depth = trainer.test_step(data, perturb=False)
+
+            if trainer.ema is not None:
+                trainer.ema.restore()
+
+            pred = preds[0].detach().cpu().numpy()
+            # pred_depth = preds_depth[0].detach().cpu().numpy()
+
+            pred = (pred * 255).astype(np.uint8)
+
+            yield {
+                image: gr.update(value=pred, visible=True),
+                video: gr.update(visible=False),
+                logs: f"training iters: {epoch * STEPS} / {iters}, lr: {trainer.optimizer.param_groups[0]['lr']:.6f}",
+            }
+
+
+        # test
+        trainer.test(test_loader)
+
+        results = glob.glob(os.path.join(opt.workspace, 'results', '*rgb*.mp4'))
+        assert results is not None, "cannot retrieve results!"
+        results.sort(key=lambda x: os.path.getmtime(x)) # sort by mtime
+
+        end_t = time.time()
+
+        yield {
+            image: gr.update(visible=False),
+            video: gr.update(value=results[-1], visible=True),
+            logs: f"Generation Finished in {(end_t - start_t)/ 60:.4f} minutes!",
+        }
+
+
+    button.click(
+        submit,
+        [prompt, iters, seed],
+        [image, video, logs]
+    )
+
+# concurrency_count: only allow ONE running progress, else GPU will OOM.
+demo.queue(concurrency_count=1)
+
+demo.launch()
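For orientation, gradio_app.py streams intermediate renders by using a Python generator as the button callback and serialising jobs with `demo.queue(concurrency_count=1)`. A minimal, self-contained sketch of just that pattern, assuming the Gradio 3.x API used above (the `fake_train` function and its random frames are placeholders, not part of the repo):

```python
import time

import gradio as gr
import numpy as np

# Stand-in for the training loop: yield a preview image every "step".
def fake_train(steps):
    for _ in range(int(steps)):
        frame = (np.random.rand(64, 64, 3) * 255).astype(np.uint8)  # placeholder render
        time.sleep(0.2)
        yield frame  # each yield refreshes the Image output while the callback keeps running

with gr.Blocks() as demo:
    steps = gr.Slider(1, 20, value=5, step=1, label="Steps")
    image = gr.Image(label="preview")
    gr.Button("Run").click(fake_train, inputs=steps, outputs=image)

# Generator callbacks require the queue; a single concurrent job avoids GPU OOM, as the comment above notes.
demo.queue(concurrency_count=1)
demo.launch()
```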
main.py CHANGED
@@ -23,16 +23,16 @@ if __name__ == '__main__':
     parser.add_argument('--seed', type=int, default=0)
 
     ### training options
-    parser.add_argument('--iters', type=int, default=15000, help="training iters")
+    parser.add_argument('--iters', type=int, default=10000, help="training iters")
     parser.add_argument('--lr', type=float, default=1e-3, help="initial learning rate")
     parser.add_argument('--ckpt', type=str, default='latest')
     parser.add_argument('--cuda_ray', action='store_true', help="use CUDA raymarching instead of pytorch")
     parser.add_argument('--max_steps', type=int, default=1024, help="max num steps sampled per ray (only valid when using --cuda_ray)")
-    parser.add_argument('--num_steps', type=int, default=128, help="num steps sampled per ray (only valid when not using --cuda_ray)")
-    parser.add_argument('--upsample_steps', type=int, default=0, help="num steps up-sampled per ray (only valid when not using --cuda_ray)")
+    parser.add_argument('--num_steps', type=int, default=64, help="num steps sampled per ray (only valid when not using --cuda_ray)")
+    parser.add_argument('--upsample_steps', type=int, default=64, help="num steps up-sampled per ray (only valid when not using --cuda_ray)")
     parser.add_argument('--update_extra_interval', type=int, default=16, help="iter interval to update extra status (only valid when using --cuda_ray)")
     parser.add_argument('--max_ray_batch', type=int, default=4096, help="batch size of rays at inference to avoid OOM (only valid when not using --cuda_ray)")
-    parser.add_argument('--albedo_iters', type=int, default=15000, help="training iters that only use albedo shading")
+    parser.add_argument('--albedo_iters', type=int, default=1000, help="training iters that only use albedo shading")
     # model options
     parser.add_argument('--bg_radius', type=float, default=1.4, help="if positive, use a background model at sphere(bg_radius)")
     parser.add_argument('--density_thresh', type=float, default=10, help="threshold for density grid to be occupied")
@@ -40,8 +40,9 @@ if __name__ == '__main__':
     parser.add_argument('--fp16', action='store_true', help="use amp mixed precision training")
     parser.add_argument('--backbone', type=str, default='grid', help="nerf backbone, choose from [grid, tcnn, vanilla]")
     # rendering resolution in training, decrease this if CUDA OOM.
-    parser.add_argument('--w', type=int, default=128, help="render width for NeRF in training")
-    parser.add_argument('--h', type=int, default=128, help="render height for NeRF in training")
+    parser.add_argument('--w', type=int, default=64, help="render width for NeRF in training")
+    parser.add_argument('--h', type=int, default=64, help="render height for NeRF in training")
+    parser.add_argument('--jitter_pose', action='store_true', help="add jitters to the randomly sampled camera poses")
 
     ### dataset options
     parser.add_argument('--bound', type=float, default=1, help="assume the scene is bounded in box(-bound, bound)")
@@ -51,9 +52,10 @@ if __name__ == '__main__':
     parser.add_argument('--fovy_range', type=float, nargs='*', default=[40, 70], help="training camera fovy range")
     parser.add_argument('--dir_text', action='store_true', help="direction-encode the text prompt, by appending front/side/back/overhead view")
     parser.add_argument('--angle_overhead', type=float, default=30, help="[0, angle_overhead] is the overhead region")
-    parser.add_argument('--angle_front', type=float, default=30, help="[0, angle_front] is the front region, [180, 180+angle_front] the back region, otherwise the side region.")
+    parser.add_argument('--angle_front', type=float, default=60, help="[0, angle_front] is the front region, [180, 180+angle_front] the back region, otherwise the side region.")
 
     parser.add_argument('--lambda_entropy', type=float, default=1e-4, help="loss scale for alpha entropy")
+    parser.add_argument('--lambda_opacity', type=float, default=0, help="loss scale for alpha value")
     parser.add_argument('--lambda_orient', type=float, default=1e-2, help="loss scale for orientation")
 
     ### GUI options
@@ -71,10 +73,16 @@ if __name__ == '__main__':
     if opt.O:
         opt.fp16 = True
         opt.dir_text = True
+        # use occupancy grid to prune ray sampling, faster rendering.
        opt.cuda_ray = True
+        # opt.lambda_entropy = 1e-4
+        # opt.lambda_opacity = 0
+
     elif opt.O2:
         opt.fp16 = True
         opt.dir_text = True
+        opt.lambda_entropy = 1e-4 # necessary to keep non-empty
+        opt.lambda_opacity = 3e-3 # no occupancy grid, so use a stronger opacity loss.
 
     if opt.backbone == 'vanilla':
         from nerf.network import NeRFNetwork
@@ -98,7 +106,7 @@ if __name__ == '__main__':
     if opt.test:
         guidance = None # no need to load guidance model at test
 
-        trainer = Trainer('ngp', opt, model, guidance, device=device, workspace=opt.workspace, fp16=opt.fp16, use_checkpoint=opt.ckpt)
+        trainer = Trainer('df', opt, model, guidance, device=device, workspace=opt.workspace, fp16=opt.fp16, use_checkpoint=opt.ckpt)
 
         if opt.gui:
             gui = NeRFGUI(opt, trainer)
@@ -127,10 +135,10 @@ if __name__ == '__main__':
 
         train_loader = NeRFDataset(opt, device=device, type='train', H=opt.h, W=opt.w, size=100).dataloader()
 
-        # decay to 0.01 * init_lr at last iter step
-        scheduler = lambda optimizer: optim.lr_scheduler.LambdaLR(optimizer, lambda iter: 0.01 ** min(iter / opt.iters, 1))
+        scheduler = lambda optimizer: optim.lr_scheduler.LambdaLR(optimizer, lambda iter: 0.1 ** min(iter / opt.iters, 1))
+        # scheduler = lambda optimizer: optim.lr_scheduler.OneCycleLR(optimizer, max_lr=opt.lr, total_steps=opt.iters, pct_start=0.1)
 
-        trainer = Trainer('ngp', opt, model, guidance, device=device, workspace=opt.workspace, optimizer=optimizer, ema_decay=0.95, fp16=opt.fp16, lr_scheduler=scheduler, use_checkpoint=opt.ckpt, eval_interval=opt.eval_interval)
+        trainer = Trainer('df', opt, model, guidance, device=device, workspace=opt.workspace, optimizer=optimizer, ema_decay=None, fp16=opt.fp16, lr_scheduler=scheduler, use_checkpoint=opt.ckpt, eval_interval=opt.eval_interval, scheduler_update_every_step=True)
 
         if opt.gui:
             trainer.train_loader = train_loader # attach dataloader to trainer
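One change above is easy to miss: the LambdaLR factor moves from `0.01 ** min(iter / opt.iters, 1)` to `0.1 ** min(iter / opt.iters, 1)`, so the learning rate now decays to 0.1x (rather than 0.01x) of its initial value by the last iteration, and the scheduler is stepped every training step. A quick standalone check of that decay (the `torch.nn.Linear` model is just a placeholder):

```python
import torch
from torch import optim

iters = 10000  # matches the new --iters default
model = torch.nn.Linear(2, 2)  # placeholder parameters
optimizer = optim.Adam(model.parameters(), lr=1e-3)
scheduler = optim.lr_scheduler.LambdaLR(optimizer, lambda it: 0.1 ** min(it / iters, 1))

for _ in range(iters):
    optimizer.step()   # normally preceded by a backward pass
    scheduler.step()   # mirrors scheduler_update_every_step=True in the Trainer call above

print(optimizer.param_groups[0]['lr'])  # ~1e-4, i.e. 0.1 * the initial 1e-3
```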
nerf/network.py CHANGED
@@ -52,7 +52,7 @@ class NeRFNetwork(NeRFRenderer):
         if self.bg_radius > 0:
             self.num_layers_bg = num_layers_bg
             self.hidden_dim_bg = hidden_dim_bg
-            self.encoder_bg, self.in_dim_bg = get_encoder('frequency', input_dim=2)
+            self.encoder_bg, self.in_dim_bg = get_encoder('frequency', input_dim=3)
             self.bg_net = MLP(self.in_dim_bg, 3, hidden_dim_bg, num_layers_bg, bias=True)
 
         else:
@@ -80,7 +80,7 @@ class NeRFNetwork(NeRFRenderer):
         return sigma, albedo
 
     # ref: https://github.com/zhaofuq/Instant-NSR/blob/main/nerf/network_sdf.py#L192
-    def finite_difference_normal(self, x, epsilon=5e-4):
+    def finite_difference_normal(self, x, epsilon=1e-2):
         # x: [N, 3]
         dx_pos, _ = self.common_forward((x + torch.tensor([[epsilon, 0.00, 0.00]], device=x.device)).clamp(-self.bound, self.bound))
         dx_neg, _ = self.common_forward((x + torch.tensor([[-epsilon, 0.00, 0.00]], device=x.device)).clamp(-self.bound, self.bound))
@@ -148,10 +148,9 @@ class NeRFNetwork(NeRFRenderer):
         }
 
 
-    def background(self, x, d):
-        # x: [N, 2], in [-1, 1]
+    def background(self, d):
 
-        h = self.encoder_bg(x) # [N, C]
+        h = self.encoder_bg(d) # [N, C]
 
         h = self.bg_net(h)
 
nerf/network_grid.py CHANGED
@@ -57,7 +57,7 @@ class NeRFNetwork(NeRFRenderer):
 
             # use a very simple network to avoid it learning the prompt...
             # self.encoder_bg, self.in_dim_bg = get_encoder('tiledgrid', input_dim=2, num_levels=4, desired_resolution=2048)
-            self.encoder_bg, self.in_dim_bg = get_encoder('frequency', input_dim=2)
+            self.encoder_bg, self.in_dim_bg = get_encoder('frequency', input_dim=3)
 
             self.bg_net = MLP(self.in_dim_bg, 3, hidden_dim_bg, num_layers_bg, bias=True)
 
@@ -87,7 +87,7 @@ class NeRFNetwork(NeRFRenderer):
         return sigma, albedo
 
     # ref: https://github.com/zhaofuq/Instant-NSR/blob/main/nerf/network_sdf.py#L192
-    def finite_difference_normal(self, x, epsilon=5e-4):
+    def finite_difference_normal(self, x, epsilon=1e-2):
         # x: [N, 3]
         dx_pos, _ = self.common_forward((x + torch.tensor([[epsilon, 0.00, 0.00]], device=x.device)).clamp(-self.bound, self.bound))
         dx_neg, _ = self.common_forward((x + torch.tensor([[-epsilon, 0.00, 0.00]], device=x.device)).clamp(-self.bound, self.bound))
@@ -155,10 +155,9 @@ class NeRFNetwork(NeRFRenderer):
         }
 
 
-    def background(self, x, d):
-        # x: [N, 2], in [-1, 1]
+    def background(self, d):
 
-        h = self.encoder_bg(x) # [N, C]
+        h = self.encoder_bg(d) # [N, C]
 
         h = self.bg_net(h)
 
nerf/network_tcnn.py CHANGED
@@ -4,6 +4,7 @@ import torch.nn.functional as F
 
 from activation import trunc_exp
 from .renderer import NeRFRenderer
+from encoding import get_encoder
 
 import numpy as np
 import tinycudann as tcnn
@@ -65,19 +66,9 @@ class NeRFNetwork(NeRFRenderer):
             self.num_layers_bg = num_layers_bg
             self.hidden_dim_bg = hidden_dim_bg
 
-            self.encoder_bg = tcnn.Encoding(
-                n_input_dims=2,
-                encoding_config={
-                    "otype": "HashGrid",
-                    "n_levels": 4,
-                    "n_features_per_level": 2,
-                    "log2_hashmap_size": 16,
-                    "base_resolution": 16,
-                    "per_level_scale": 1.5,
-                },
-            )
-
-            self.bg_net = MLP(8, 3, hidden_dim_bg, num_layers_bg, bias=True)
+            self.encoder_bg, self.in_dim_bg = get_encoder('frequency', input_dim=3)
+
+            self.bg_net = MLP(self.in_dim_bg, 3, hidden_dim_bg, num_layers_bg, bias=True)
 
         else:
             self.bg_net = None
@@ -156,11 +147,10 @@ class NeRFNetwork(NeRFRenderer):
         }
 
 
-    def background(self, x, d):
+    def background(self, d):
         # x: [N, 2], in [-1, 1]
 
-        h = (x + 1) / (2 * 1) # to [0, 1]
-        h = self.encoder_bg(h) # [N, C]
+        h = self.encoder_bg(d) # [N, C]
 
         h = self.bg_net(h)
 
nerf/provider.py CHANGED
@@ -55,7 +55,7 @@ def get_view_direction(thetas, phis, overhead, front):
     return res
 
 
-def rand_poses(size, device, radius_range=[1, 1.5], theta_range=[0, 150], phi_range=[0, 360], return_dirs=False, angle_overhead=30, angle_front=60):
+def rand_poses(size, device, radius_range=[1, 1.5], theta_range=[0, 100], phi_range=[0, 360], return_dirs=False, angle_overhead=30, angle_front=60, jitter=False):
     ''' generate random poses from an orbit camera
     Args:
         size: batch size of generated poses.
@@ -82,16 +82,23 @@ def rand_poses(size, device, radius_range=[1, 1.5], theta_range=[0, 150], phi_ra
         radius * torch.sin(thetas) * torch.cos(phis),
     ], dim=-1) # [B, 3]
 
+    targets = 0
+
     # jitters
-    centers = centers + (torch.rand_like(centers) * 0.2 - 0.1)
-    targets = torch.randn_like(centers) * 0.2
+    if jitter:
+        centers = centers + (torch.rand_like(centers) * 0.2 - 0.1)
+        targets = targets + torch.randn_like(centers) * 0.2
 
     # lookat
     forward_vector = safe_normalize(targets - centers)
     up_vector = torch.FloatTensor([0, -1, 0]).to(device).unsqueeze(0).repeat(size, 1)
     right_vector = safe_normalize(torch.cross(forward_vector, up_vector, dim=-1))
+
+    if jitter:
+        up_noise = torch.randn_like(up_vector) * 0.02
+    else:
+        up_noise = 0
 
-    up_noise = torch.randn_like(up_vector) * 0.02
     up_vector = safe_normalize(torch.cross(right_vector, forward_vector, dim=-1) + up_noise)
 
     poses = torch.eye(4, dtype=torch.float, device=device).unsqueeze(0).repeat(size, 1, 1)
@@ -170,7 +177,7 @@ class NeRFDataset:
 
         if self.training:
             # random pose on the fly
-            poses, dirs = rand_poses(B, self.device, radius_range=self.radius_range, return_dirs=self.opt.dir_text, angle_overhead=self.opt.angle_overhead, angle_front=self.opt.angle_front)
+            poses, dirs = rand_poses(B, self.device, radius_range=self.radius_range, return_dirs=self.opt.dir_text, angle_overhead=self.opt.angle_overhead, angle_front=self.opt.angle_front, jitter=self.opt.jitter_pose)
 
             # random focal
             fov = random.random() * (self.fovy_range[1] - self.fovy_range[0]) + self.fovy_range[0]
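The net effect of the provider change: camera perturbations are now opt-in via `--jitter_pose` (off by default), and the default elevation range tightens from [0, 150] to [0, 100] degrees. A condensed, standalone sketch of the new jitter branch (not the repo function itself):

```python
import torch

def jitter_cameras(centers, jitter=False):
    # centers: [B, 3] camera positions on the orbit sphere; targets default to the origin.
    targets = torch.zeros_like(centers)
    up_noise = 0
    if jitter:
        centers = centers + (torch.rand_like(centers) * 0.2 - 0.1)  # uniform +-0.1 jitter of the camera center
        targets = targets + torch.randn_like(centers) * 0.2         # gaussian jitter of the look-at point
        up_noise = torch.randn_like(centers) * 0.02                 # small tilt added to the up vector later
    return centers, targets, up_noise
```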
nerf/renderer.py CHANGED
@@ -420,8 +420,8 @@ class NeRFRenderer(nn.Module):
         # mix background color
         if self.bg_radius > 0:
             # use the bg model to calculate bg_color
-            sph = raymarching.sph_from_ray(rays_o, rays_d, self.bg_radius) # [N, 2] in [-1, 1]
-            bg_color = self.background(sph, rays_d.reshape(-1, 3)) # [N, 3]
+            # sph = raymarching.sph_from_ray(rays_o, rays_d, self.bg_radius) # [N, 2] in [-1, 1]
+            bg_color = self.background(rays_d.reshape(-1, 3)) # [N, 3]
         elif bg_color is None:
             bg_color = 1
 
@@ -526,8 +526,8 @@
         if self.bg_radius > 0:
 
             # use the bg model to calculate bg_color
-            sph = raymarching.sph_from_ray(rays_o, rays_d, self.bg_radius) # [N, 2] in [-1, 1]
-            bg_color = self.background(sph, rays_d) # [N, 3]
+            # sph = raymarching.sph_from_ray(rays_o, rays_d, self.bg_radius) # [N, 2] in [-1, 1]
+            bg_color = self.background(rays_d) # [N, 3]
 
         elif bg_color is None:
             bg_color = 1
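Taken together with the nerf/network*.py edits above, the background model now conditions only on the ray direction: the 2D sphere-intersection input (`sph_from_ray`) is dropped in favour of a frequency encoding of the 3D direction. A rough sketch of the new data flow, with hypothetical stand-ins for the repo's `get_encoder('frequency', input_dim=3)` and `MLP` helpers (the sigmoid output activation is an assumption):

```python
import torch

def frequency_encode(d, n_freqs=6):
    # d: [N, 3] ray directions -> [N, 3 + 3 * 2 * n_freqs]; stand-in for the frequency encoder
    outs = [d]
    for i in range(n_freqs):
        outs += [torch.sin((2 ** i) * d), torch.cos((2 ** i) * d)]
    return torch.cat(outs, dim=-1)

def background(bg_net, rays_d):
    h = frequency_encode(rays_d)  # direction-only input, no sph_from_ray any more
    h = bg_net(h)                 # small MLP -> [N, 3]
    return torch.sigmoid(h)       # assumed RGB activation

rays_d = torch.nn.functional.normalize(torch.randn(8, 3), dim=-1)
bg_net = torch.nn.Linear(3 + 3 * 2 * 6, 3)  # placeholder for the repo's MLP
print(background(bg_net, rays_d).shape)      # torch.Size([8, 3])
```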
nerf/sd.py CHANGED
@@ -17,10 +17,10 @@ class StableDiffusion(nn.Module):
         try:
             with open('./TOKEN', 'r') as f:
                 self.token = f.read().replace('\n', '') # remove the last \n!
-                print(f'[INFO] successfully loaded hugging face user token!')
+                print(f'[INFO] loaded hugging face access token from ./TOKEN!')
         except FileNotFoundError as e:
-            print(e)
-            print(f'[INFO] Please first create a file called TOKEN and copy your hugging face access token into it to download stable diffusion checkpoints.')
+            self.token = True
+            print(f'[INFO] try to load hugging face access token from the default place, make sure you have run `huggingface-cli login`.')
 
         self.device = device
         self.num_train_timesteps = 1000
@@ -94,9 +94,9 @@ class StableDiffusion(nn.Module):
         noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
         noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
 
-        # w(t), alpha_t * sigma_t^2
-        # w = (1 - self.alphas[t])
-        w = self.alphas[t] ** 0.5 * (1 - self.alphas[t])
+        # w(t), sigma_t^2
+        w = (1 - self.alphas[t])
+        # w = self.alphas[t] ** 0.5 * (1 - self.alphas[t])
         grad = w * (noise_pred - noise)
 
         # clip grad for stable training?
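The other substantive change in nerf/sd.py is the SDS gradient weighting: w(t) switches from `alphas[t] ** 0.5 * (1 - alphas[t])` to `1 - alphas[t]` (proportional to sigma_t^2). A small standalone comparison of the two schedules; the scaled-linear beta schedule below mirrors the usual Stable Diffusion defaults, and `alphas` here plays the role of the cumulative product the guidance class stores (both are assumptions made for illustration):

```python
import torch

num_train_timesteps = 1000
betas = torch.linspace(0.00085 ** 0.5, 0.012 ** 0.5, num_train_timesteps) ** 2  # assumed scaled-linear schedule
alphas = torch.cumprod(1.0 - betas, dim=0)  # cumulative alpha_bar_t

t = torch.tensor([50, 500, 950])
w_new = 1 - alphas[t]                       # weighting enabled by this commit
w_old = alphas[t] ** 0.5 * (1 - alphas[t])  # previous weighting, now commented out
print(w_new)
print(w_old)  # smaller everywhere, with the gap widest at late (noisy) timesteps
```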
nerf/utils.py CHANGED
@@ -195,9 +195,6 @@ class Trainer(object):
         self.scheduler_update_every_step = scheduler_update_every_step
         self.device = device if device is not None else torch.device(f'cuda:{local_rank}' if torch.cuda.is_available() else 'cpu')
         self.console = Console()
-
-        # text prompt
-        ref_text = self.opt.text
 
         model.to(self.device)
         if self.world_size > 1:
@@ -208,20 +205,13 @@ class Trainer(object):
         # guide model
         self.guidance = guidance
 
+        # text prompt
         if self.guidance is not None:
-            assert ref_text is not None, 'Training must provide a text prompt!'
-
+
             for p in self.guidance.parameters():
                 p.requires_grad = False
 
-            if not self.opt.dir_text:
-                self.text_z = self.guidance.get_text_embeds([ref_text])
-            else:
-                self.text_z = []
-                for d in ['front', 'side', 'back', 'side', 'overhead', 'bottom']:
-                    text = f"{ref_text}, {d} view"
-                    text_z = self.guidance.get_text_embeds([text])
-                    self.text_z.append(text_z)
+            self.prepare_text_embeddings()
 
         else:
             self.text_z = None
@@ -257,7 +247,7 @@ class Trainer(object):
             "results": [], # metrics[0], or valid_loss
             "checkpoints": [], # record path of saved ckpt, to automatically remove old ckpt
             "best_result": None,
-        }
+        }
 
         # auto fix
         if len(metrics) == 0 or self.use_loss_as_metric:
@@ -297,6 +287,23 @@ class Trainer(object):
             self.log(f"[INFO] Loading {self.use_checkpoint} ...")
             self.load_checkpoint(self.use_checkpoint)
 
+    # calculate the text embs.
+    def prepare_text_embeddings(self):
+
+        if self.opt.text is None:
+            self.log(f"[WARN] text prompt is not provided.")
+            self.text_z = None
+            return
+
+        if not self.opt.dir_text:
+            self.text_z = self.guidance.get_text_embeds([self.opt.text])
+        else:
+            self.text_z = []
+            for d in ['front', 'side', 'back', 'side', 'overhead', 'bottom']:
+                text = f"{self.opt.text}, {d} view"
+                text_z = self.guidance.get_text_embeds([text])
+                self.text_z.append(text_z)
+
     def __del__(self):
         if self.log_ptr:
             self.log_ptr.close()
@@ -330,11 +337,11 @@ class Trainer(object):
         if rand > 0.8:
             shading = 'albedo'
             ambient_ratio = 1.0
-        elif rand > 0.4:
-            shading = 'lambertian'
-            ambient_ratio = 0.1
+        # elif rand > 0.4:
+        #     shading = 'textureless'
+        #     ambient_ratio = 0.1
         else:
-            shading = 'textureless'
+            shading = 'lambertian'
             ambient_ratio = 0.1
 
         # _t = time.time()
@@ -343,6 +350,9 @@ class Trainer(object):
         pred_rgb = outputs['image'].reshape(B, H, W, 3).permute(0, 3, 1, 2).contiguous() # [1, 3, H, W]
         # torch.cuda.synchronize(); print(f'[TIME] nerf render {time.time() - _t:.4f}s')
 
+        # print(shading)
+        # torch_vis_2d(pred_rgb[0])
+
         # text embeddings
         if self.opt.dir_text:
             dirs = data['dir'] # [B,]
@@ -352,22 +362,24 @@ class Trainer(object):
 
         # encode pred_rgb to latents
         # _t = time.time()
-        loss_guidance = self.guidance.train_step(text_z, pred_rgb)
+        loss = self.guidance.train_step(text_z, pred_rgb)
        # torch.cuda.synchronize(); print(f'[TIME] total guiding {time.time() - _t:.4f}s')
 
         # occupancy loss
         pred_ws = outputs['weights_sum'].reshape(B, 1, H, W)
-        # mask_ws = outputs['mask'].reshape(B, 1, H, W) # near < far
 
-        # loss_ws = (pred_ws ** 2 + 0.01).sqrt().mean()
+        if self.opt.lambda_opacity > 0:
+            loss_opacity = (pred_ws ** 2).mean()
+            loss = loss + self.opt.lambda_opacity * loss_opacity
 
-        alphas = (pred_ws).clamp(1e-5, 1 - 1e-5)
-        # alphas = alphas ** 2 # skewed entropy, favors 0 over 1
-        loss_entropy = (- alphas * torch.log2(alphas) - (1 - alphas) * torch.log2(1 - alphas)).mean()
-
-        loss = loss_guidance + self.opt.lambda_entropy * loss_entropy
+        if self.opt.lambda_entropy > 0:
+            alphas = (pred_ws).clamp(1e-5, 1 - 1e-5)
+            # alphas = alphas ** 2 # skewed entropy, favors 0 over 1
+            loss_entropy = (- alphas * torch.log2(alphas) - (1 - alphas) * torch.log2(1 - alphas)).mean()
+
+            loss = loss + self.opt.lambda_entropy * loss_entropy
 
-        if 'loss_orient' in outputs:
+        if self.opt.lambda_orient > 0 and 'loss_orient' in outputs:
             loss_orient = outputs['loss_orient']
             loss = loss + self.opt.lambda_orient * loss_orient
 
@@ -442,6 +454,9 @@ class Trainer(object):
     ### ------------------------------
 
     def train(self, train_loader, valid_loader, max_epochs):
+
+        assert self.text_z is not None, 'Training must provide a text prompt!'
+
         if self.use_tensorboardX and self.local_rank == 0:
             self.writer = tensorboardX.SummaryWriter(os.path.join(self.workspace, "run", self.name))
 
raymarching/src/raymarching.cu CHANGED
@@ -905,7 +905,7 @@ __global__ void kernel_composite_rays(
 }
 
 
-void composite_rays(const uint32_t n_alive, const uint32_t n_step, const float T_thresh, at::Tensor rays_alive, at::Tensor rays_t, const at::Tensor sigmas, const at::Tensor rgbs, const at::Tensor deltas, at::Tensor weights, at::Tensor depth, at::Tensor image) {
+void composite_rays(const uint32_t n_alive, const uint32_t n_step, const float T_thresh, at::Tensor rays_alive, at::Tensor rays_t, at::Tensor sigmas, at::Tensor rgbs, at::Tensor deltas, at::Tensor weights, at::Tensor depth, at::Tensor image) {
     static constexpr uint32_t N_THREAD = 128;
     AT_DISPATCH_FLOATING_TYPES_AND_HALF(
         image.scalar_type(), "composite_rays", ([&] {
readme.md CHANGED
@@ -17,13 +17,13 @@ This project is a **work-in-progress**, and contains lots of differences from th
 
 
 ## Notable differences from the paper
-* Since the Imagen model is not publicly available, we use [Stable Diffusion](https://github.com/CompVis/stable-diffusion) to replace it (implementation from [diffusers](https://github.com/huggingface/diffusers)). Different from Imagen, Stable-Diffusion is a latent diffusion model, which diffuses in a latent space instead of the original image space. Therefore, we need the loss to propagate back from the VAE's encoder part too, which introduces extra time cost in training. Currently, 15000 training steps take about 5 hours to train on a V100.
+* Since the Imagen model is not publicly available, we use [Stable Diffusion](https://github.com/CompVis/stable-diffusion) to replace it (implementation from [diffusers](https://github.com/huggingface/diffusers)). Different from Imagen, Stable-Diffusion is a latent diffusion model, which diffuses in a latent space instead of the original image space. Therefore, we need the loss to propagate back from the VAE's encoder part too, which introduces extra time cost in training. Currently, 10000 training steps take about 3 hours to train on a V100.
 * We use the [multi-resolution grid encoder](https://github.com/NVlabs/instant-ngp/) to implement the NeRF backbone (implementation from [torch-ngp](https://github.com/ashawkey/torch-ngp)), which enables much faster rendering (~10FPS at 800x800).
 * We use the Adam optimizer with a larger initial learning rate.
 
 
 ## TODOs
-* The normal evaluation & shading part.
+* Alleviate the multi-face [Janus problem](https://twitter.com/poolio/status/1578045212236034048).
 * Better mesh (improve the surface quality).
 
 # Install
@@ -33,7 +33,9 @@ git clone https://github.com/ashawkey/stable-dreamfusion.git
 cd stable-dreamfusion
 ```
 
-**Important**: To download the Stable Diffusion model checkpoint, you should create a file called `TOKEN` under this directory (i.e., `stable-dreamfusion/TOKEN`) and copy your hugging face [access token](https://huggingface.co/docs/hub/security-tokens) into it.
+**Important**: To download the Stable Diffusion model checkpoint, you should provide your [access token](https://huggingface.co/settings/tokens). You could choose either of the following ways:
+* Run `huggingface-cli login` and enter your token.
+* Create a file called `TOKEN` under this directory (i.e., `stable-dreamfusion/TOKEN`) and copy your token into it.
 
 ### Install with pip
 ```bash
@@ -71,14 +73,30 @@ First time running will take some time to compile the CUDA extensions.
 
 ```bash
 ### stable-dreamfusion setting
-## train with text prompt
+## train with text prompt (with the default settings)
 # `-O` equals `--cuda_ray --fp16 --dir_text`
+# `--cuda_ray` enables instant-ngp-like occupancy grid based acceleration.
+# `--fp16` enables half-precision training.
+# `--dir_text` enables view-dependent prompting.
 python main.py --text "a hamburger" --workspace trial -O
 
+# if the above command fails to generate things (learns an empty scene), maybe try:
+# 1. disable random lambertian shading, simply use albedo as color:
+python main.py --text "a hamburger" --workspace trial -O --albedo_iters 10000 # i.e., set --albedo_iters >= --iters, which is default to 10000
+# 2. use a smaller density regularization weight:
+python main.py --text "a hamburger" --workspace trial -O --lambda_entropy 1e-5
+
+# you can also train in a GUI to visualize the training progress:
+python main.py --text "a hamburger" --workspace trial -O --gui
+
+# A Gradio GUI is also possible (with less options):
+python gradio_app.py # open in web browser
+
 ## after the training is finished:
-# test (exporting 360 video, and an obj mesh with png texture)
+# test (exporting 360 video)
 python main.py --workspace trial -O --test
-
+# also save a mesh (with obj, mtl, and png texture)
+python main.py --workspace trial -O --test --save_mesh
 # test with a GUI (free view control!)
 python main.py --workspace trial -O --test --gui
 
@@ -101,7 +119,7 @@ pred_rgb_512 = F.interpolate(pred_rgb, (512, 512), mode='bilinear', align_corner
 latents = self.encode_imgs(pred_rgb_512)
 ... # timestep sampling, noise adding and UNet noise predicting
 # 3. the SDS loss, since UNet part is ignored and cannot simply audodiff, we manually set the grad for latents.
-w = (1 - self.scheduler.alphas_cumprod[t]).to(self.device)
+w = self.alphas[t] ** 0.5 * (1 - self.alphas[t])
 grad = w * (noise_pred - noise)
 latents.backward(gradient=grad, retain_graph=True)
 ```
@@ -117,7 +135,6 @@ latents.backward(gradient=grad, retain_graph=True)
   Training is faster if only sample 128 points uniformly per ray (5h --> 2.5h).
   More testing is needed...
 * Shading & normal evaluation: `./nerf/network*.py > NeRFNetwork > forward`. Current implementation harms training and is disabled.
-    * use `--albedo_iters 1000` to enable random shading mode after 1000 steps from albedo, lambertian, and textureless.
 * light direction: current implementation use a plane light source, instead of a point light source...
 * View-dependent prompting: `./nerf/provider.py > get_view_direction`.
   * ues `--angle_overhead, --angle_front` to set the border. How to better divide front/back/side regions?
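Tying the readme tips back to the nerf/utils.py change: the training loss is now the SDS guidance term plus regularizers that are individually gated by their weights (`--lambda_opacity`, `--lambda_entropy`, `--lambda_orient`), which is why lowering `--lambda_entropy` is suggested when a scene collapses to empty. A compact sketch of that composition (shapes follow the comments in `train_step`; `loss_guidance` stands in for `self.guidance.train_step(...)`):

```python
import torch

def compose_loss(loss_guidance, pred_ws, lambda_opacity=0.0, lambda_entropy=1e-4,
                 lambda_orient=1e-2, loss_orient=None):
    # pred_ws: [B, 1, H, W] accumulated ray weights (per-pixel opacity)
    loss = loss_guidance

    if lambda_opacity > 0:
        loss = loss + lambda_opacity * (pred_ws ** 2).mean()  # push stray opacity towards zero

    if lambda_entropy > 0:
        alphas = pred_ws.clamp(1e-5, 1 - 1e-5)
        entropy = (-alphas * torch.log2(alphas) - (1 - alphas) * torch.log2(1 - alphas)).mean()
        loss = loss + lambda_entropy * entropy                # favour opacities near 0 or 1

    if lambda_orient > 0 and loss_orient is not None:
        loss = loss + lambda_orient * loss_orient             # normal-vs-view orientation penalty

    return loss
```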