Spaces:

kxic
/

EscherNet

Running on Zero

App Files Files Community

EscherNet / README.md

kxhit

app

0054ddf 5 months ago

preview code

raw

history blame

6.28 kB

	---
	title: EscherNet
	app_file: app.py
	sdk: gradio
	sdk_version: 4.19.2
	---
	[comment]: <> (# EscherNet: A Generative Model for Scalable View Synthesis)

	<!-- PROJECT LOGO -->

	<p align="center">

	<h1 align="center">EscherNet: A Generative Model for Scalable View Synthesis</h1>
	<p align="center">
	<a href="https://kxhit.github.io"><strong>Xin Kong</strong></a>
	·
	<a href="https://shikun.io"><strong>Shikun Liu</strong></a>
	·
	<a href="https://shawlyu.github.io/"><strong>Xiaoyang Lyu</strong></a>
	·
	<a href="https://marwan99.github.io/"><strong>Marwan Taher</strong></a>
	·
	<a href="https://xjqi.github.io/"><strong>Xiaojuan Qi</strong></a>
	·
	<a href="https://www.doc.ic.ac.uk/~ajd/"><strong>Andrew J. Davison</strong></a>
	</p>

	[comment]: <> ( <h2 align="center">PAPER</h2>)
	<h3 align="center"><a href="https://arxiv.org/abs/2402.03908">Paper</a> \| <a href="https://kxhit.github.io/EscherNet">Project Page</a></h3>
	<div align="center"></div>

	<p align="center">
	<a href="">
	<img src="./scripts/teaser.png" alt="Logo" width="80%">
	</a>
	</p>
	<p align="center">
	EscherNet is a <strong>multi-view conditioned</strong> diffusion model for view synthesis. EscherNet learns implicit and generative 3D representations coupled with the <strong>camera positional encoding (CaPE)</strong>, allowing precise and continuous relative control of the camera transformation between an <strong>arbitrary number of reference and target views</strong>.
	</p>
	<br>

	## Install
	```
	conda env create -f environment.yml -n eschernet
	conda activate eschernet
	```

	## Demo
	Run demo to generate randomly sampled 25 novel views from (1,2,3,5,10) reference views:
	```commandline
	bash eval_eschernet.sh
	```

	## Camera Positional Encoding (CaPE)
	CaPE is applied in self/cross-attention for encoding camera pose info into transformers. The main modification is in `diffusers/models/attention_processor.py`.

	To quickly check the implementation of CaPE (6DoF and 4DoF), run:
	```
	python CaPE.py
	```

	## Training
	### Objaverse 1.0 Dataset
	Download Zero123's Objaverse Rendering data:
	```commandline
	wget https://tri-ml-public.s3.amazonaws.com/datasets/views_release.tar.gz
	```
	Filter Zero-1-to-3 rendered views (empty images):
	```commandline
	cd scripts
	python objaverse_filter.py --path /data/objaverse/views_release
	```

	### Launch training
	Configure accelerator (8 A100 GPUs, bf16):
	```commandline
	accelerate config
	```

	Choose 4DoF or 6DoF CaPE (Camera Positional Encoding):
	```commandline
	cd 4DoF or 6DoF
	```

	Launch training:

	```commandline
	accelerate launch train_eschernet.py --train_data_dir /data/objectverse/views_release --pretrained_model_name_or_path runwayml/stable-diffusion-v1-5 --train_batch_size 256 --dataloader_num_workers 16 --mixed_precision bf16 --gradient_checkpointing --T_in 3 --T_out 3 --T_in_val 10 --output_dir logs_N3M3B256_SD1.5 --push_to_hub --hub_model_id *** --hub_token hf_***************** --tracker_project_name eschernet
	```

	For monitoring training progress, we recommand [wandb](https://wandb.ai/site) for its simplicity and powerful features.
	```commandline
	wandb login
	```
	Offline mode:
	```commandline
	WANDB_MODE=offline python xxx.py
	```


	## Evaluation
	We provide [raw results](https://huggingface.co/datasets/kxic/EscherNet-Results) and two checkpoints [4DoF](https://huggingface.co/kxic/eschernet-4dof) and [6DoF](https://huggingface.co/kxic/eschernet-6dof) for easier comparison.

	### Datasets
	##### [GSO Google Scanned Objects](https://app.gazebosim.org/GoogleResearch/fuel/collections/Scanned%20Objects%20by%20Google%20Research)
	[GSO30](https://huggingface.co/datasets/kxic/EscherNet-Dataset/tree/main): We select 30 objects from GSO dataset and render 25 randomly sampled novel views for each object for both NVS and 3D reconstruction evaluation.

	##### [RTMV](https://drive.google.com/drive/folders/1cUXxUp6g25WwzHnm_491zNJJ4T7R_fum)
	We use the 10 scenes from `google_scanned.tar` under folder `40_scenes` for NVS evaluation.

	##### [NeRF_Synthetic](https://drive.google.com/drive/folders/1JDdLGDruGNXWnM1eqY1FNL9PlStjaKWi)
	We use the all 8 NeRF objects for 2D NVS evaluation.

	##### [Franka16](https://huggingface.co/datasets/kxic/EscherNet-Dataset/tree/main)
	We collected 16 real world object-centric recordings using a Franka Emika Panda robot arm with RealSense D435i Camera for real world NVS evaluation.

	##### [Text2Img](https://huggingface.co/datasets/kxic/EscherNet-Dataset/tree/main)
	We collected Text2Img generation results from internet, [Stable Diffusion XL](https://github.com/Stability-AI/generative-models) (1 view) and [MVDream](https://github.com/bytedance/MVDream) (4 views: front, right, back, left) for NVS evaluation.

	### Novel View Synthesis (NVS)
	To get 2D Novel View Synthesis (NVS) results, set `cape_type, checkpoint, data_type, data_dir` and run:
	```commandline
	bash ./eval_eschernet.sh
	```
	Evaluate 2D metrics (PSNR, SSIM, LPIPS):
	```commandline
	cd metrics
	python eval_2D_NVS.py
	```

	### 3D Reconstruction
	We firstly generate 36 novel views with `data_type=GSO3D` by:
	```commandline
	bash ./eval_eschernet.sh
	```
	Then we adopt [NeuS](https://github.com/Totoro97/NeuS) for 3D reconstruction:
	```commandline
	export CUDA_HOME=/usr/local/cuda-11.8
	pip install git+https://github.com/NVlabs/tiny-cuda-nn/#subdirectory=bindings/torch
	cd 3drecon
	python run_NeuS.py
	```

	Evaluate 3D metrics (Chamfer Distance, IoU):
	```commandline
	cd metrics
	python eval_3D_GSO.py
	```


	## Gradio Demo
	TODO.

	To build locally:
	```commandline
	python gradio_eschernet.py
	```

	## Acknowledgement
	We have intensively borrow codes from the following repositories. Many thanks to the authors for sharing their codes.

	- [Zero-1-to-3](https://github.com/cvlab-columbia/zero123)
	- [SyncDreamer](https://github.com/liuyuan-pal/SyncDreamer)
	- [MVDream](https://github.com/bytedance/MVDream)
	- [NeuS](https://github.com/Totoro97/NeuS)


	## Citation
	If you find this work useful, a citation will be appreciated via:

	```
	@article{kong2024eschernet,
	title={EscherNet: A Generative Model for Scalable View Synthesis},
	author={Kong, Xin and Liu, Shikun and Lyu, Xiaoyang and Taher, Marwan and Qi, Xiaojuan and Davison, Andrew J},
	journal={arXiv preprint arXiv:2402.03908},
	year={2024}
	}
	```