---
license: mit
language:
- en
---
# Model Card for 3D Diffuser Actor
A robot manipulation policy that marries diffusion modeling with 3D scene representations.
3D Diffuser Actor is trained and evaluated on the [RLBench](https://github.com/stepjam/RLBench) and [CALVIN](https://github.com/mees/calvin) simulation benchmarks.
We release all code, checkpoints, and details involved in training these models.
## Model Details
The models released are the following:
| Benchmark | Embedding dimension | Diffusion timesteps |
|------|------|------|
| [RLBench (PerAct)](https://huggingface.co/katefgroup/3d_diffuser_actor/blob/main/diffuser_actor_peract.pth) | 120 | 100 |
| [RLBench (GNFactor)](https://huggingface.co/katefgroup/3d_diffuser_actor/blob/main/diffuser_actor_gnfactor.pth) | 120 | 100 |
| [CALVIN](https://huggingface.co/katefgroup/3d_diffuser_actor/blob/main/diffuser_actor_calvin.pth) | 192 | 25 |
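The checkpoints can be fetched programmatically with `huggingface_hub`. A minimal sketch (this loads only the raw weights; the model class and its constructor arguments live in the GitHub repository linked below):
```
import torch
from huggingface_hub import hf_hub_download

# Download one of the released checkpoints from this repository.
ckpt_path = hf_hub_download(
    repo_id="katefgroup/3d_diffuser_actor",
    filename="diffuser_actor_peract.pth",
)

# Inspect the raw checkpoint; instantiating the model requires the GitHub code base.
state_dict = torch.load(ckpt_path, map_location="cpu")
```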
### Model Description
- **Developed by:** Katerina Group at CMU
- **Model type:** a diffusion policy conditioned on 3D scene representations
- **License:** The code and models are released under the MIT License
- **Contact:** ngkanats@andrew.cmu.edu
### Model Sources
- **Project Page:** https://3d-diffuser-actor.github.io
- **Repository:** https://github.com/nickgkan/3d_diffuser_actor.git
- **Paper:** [3D Diffuser Actor: Policy Diffusion with 3D Scene Representations](https://arxiv.org/abs/2402.10885)
## Uses
### Input format
3D Diffuser Actor takes the following inputs (a sketch that constructs dummy versions of them follows this list):
1. `RGB observations`: a tensor of shape (batch_size, num_cameras, 3, H, W). Pixel values are in the range [0, 1].
2. `Point cloud observations`: a tensor of shape (batch_size, num_cameras, 3, H, W).
3. `Instruction encodings`: a tensor of shape (batch_size, max_instruction_length, C). In this code base, the embedding dimension `C` is set to 512.
4. `curr_gripper`: a tensor of shape (batch_size, history_length, 7), where the last dimension encodes the end-effector position (3D) and rotation as a quaternion (4D).
5. `trajectory_mask`: a tensor of shape (batch_size, trajectory_length), used only to indicate the length of each trajectory. To predict keyposes, set its shape to (batch_size, 1).
6. `gt_trajectory`: a tensor of shape (batch_size, trajectory_length, 7), where the last dimension encodes the end-effector position (3D) and rotation as a quaternion (4D). This input is only used during training.
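For illustration, the inputs can be mocked with random tensors of the right shapes. A minimal sketch; all sizes here are arbitrary placeholders, and the variable names match the usage examples below:
```
import torch

# Placeholder sizes for illustration only.
batch_size, num_cameras, H, W = 1, 4, 256, 256
history_length, trajectory_length = 3, 20
max_instruction_length, C = 16, 512
device = "cuda" if torch.cuda.is_available() else "cpu"

rgb_obs = torch.rand(batch_size, num_cameras, 3, H, W).to(device)    # RGB in [0, 1]
pcd_obs = torch.rand(batch_size, num_cameras, 3, H, W).to(device)    # xyz per pixel
instruction = torch.rand(batch_size, max_instruction_length, C).to(device)
curr_gripper = torch.rand(batch_size, history_length, 7).to(device)  # xyz + quaternion
trajectory_mask = torch.full((batch_size, trajectory_length), False).to(device)
gt_trajectory = torch.rand(batch_size, trajectory_length, 7).to(device)
```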
### Output format
The model returns the diffusion loss when `run_inference=False`; when `run_inference=True`, it returns a pose trajectory of shape (batch_size, trajectory_length, 8).
### Usage
For training, forward 3D Diffuser Actor with `run_inference=False`:
```
loss = model.forward(gt_trajectory,
                     trajectory_mask,
                     rgb_obs,
                     pcd_obs,
                     instruction,
                     curr_gripper,
                     run_inference=False)
```
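The returned loss is a scalar tensor, so it plugs into a standard PyTorch training step. A minimal sketch, assuming `model` has already been constructed from the GitHub repository and using the dummy inputs above (the optimizer and learning rate are placeholders, not the repository's training configuration):
```
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # placeholder hyperparameters

optimizer.zero_grad()
loss = model.forward(gt_trajectory, trajectory_mask,
                     rgb_obs, pcd_obs, instruction, curr_gripper,
                     run_inference=False)
loss.backward()
optimizer.step()
```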
For evaluation, forward 3D Diffuser Actor with `run_inference=True`:
```
# The ground-truth trajectory is not used at inference; pass a float placeholder.
fake_gt_trajectory = torch.zeros(1, trajectory_length, 7).to(device)
trajectory_mask = torch.full((1, trajectory_length), False).to(device)
trajectory = model.forward(fake_gt_trajectory,
                           trajectory_mask,
                           rgb_obs,
                           pcd_obs,
                           instruction,
                           curr_gripper,
                           run_inference=True)
```
Alternatively, you can call the model's `compute_trajectory` function directly:
```
trajectory_mask = torch.full((1, trajectory_length), False).to(device)
trajectory = model.compute_trajectory(trajectory_mask,
                                      rgb_obs,
                                      pcd_obs,
                                      instruction,
                                      curr_gripper)
```
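To predict a single keypose rather than a full trajectory, set the mask shape to (batch_size, 1) as noted in the input format. A minimal sketch; splitting the 8-dimensional output into position, quaternion, and a gripper open/close channel is our reading of the shapes documented above, not an API guaranteed by the repository:
```
# Keypose prediction: a trajectory of length 1.
keypose_mask = torch.full((1, 1), False).to(device)
keypose = model.compute_trajectory(keypose_mask,
                                   rgb_obs,
                                   pcd_obs,
                                   instruction,
                                   curr_gripper)  # shape (1, 1, 8)

position = keypose[..., :3]   # end-effector xyz (3D)
rotation = keypose[..., 3:7]  # quaternion (4D)
gripper = keypose[..., 7:]    # assumed gripper open/close channel
```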
## Evaluation
Success rates (%) of our model, trained and evaluated on RLBench with the PerAct setup:
| RLBench (PerAct) | 3D Diffuser Actor | [RVT](https://github.com/NVlabs/RVT) |
| --------------------------------- | -------- | -------- |
| average | 81.3 | 62.9 |
| open drawer | 89.6 | 71.2 |
| slide block | 97.6 | 81.6 |
| sweep to dustpan | 84.0 | 72.0 |
| meat off grill | 96.8 | 88.0 |
| turn tap | 99.2 | 93.6 |
| put in drawer | 96.0 | 88.0 |
| close jar | 96.0 | 52.0 |
| drag stick | 100.0 | 99.2 |
| stack blocks | 68.3 | 28.8 |
| screw bulbs | 82.4 | 48.0 |
| put in safe | 97.6 | 91.2 |
| place wine | 93.6 | 91.0 |
| put in cupboard | 85.6 | 49.6 |
| sort shape | 44.0 | 36.0 |
| push buttons | 98.4 | 100.0 |
| insert peg | 65.6 | 11.2 |
| stack cups | 47.2 | 26.4 |
| place cups | 24.0 | 4.0 |
Success rates (%) of our model, trained and evaluated on RLBench with the GNFactor setup:
| RLBench (GNFactor) | 3D Diffuser Actor | [GNFactor](https://github.com/YanjieZe/GNFactor) |
| --------------------------------- | -------- | -------- |
| average | 78.4 | 31.7 |
| open drawer | 89.3 | 76.0 |
| sweep to dustpan | 94.7 | 25.0 |
| close jar | 82.7 | 25.3 |
| meat off grill | 88.0 | 57.3 |
| turn tap | 80.0 | 50.7 |
| slide block | 92.0 | 20.0 |
| put in drawer | 77.3 | 0.0 |
| drag stick | 98.7 | 37.3 |
| push buttons | 69.3 | 18.7 |
| stack blocks | 12.0 | 4.0 |
Success rates (%) of our model, trained and evaluated on CALVIN (trained on environments A, B, and C; tested on environment D):
| CALVIN (ABC→D) | 3D Diffuser Actor | [GR-1](https://gr1-manipulation.github.io/) | [SuSIE](https://rail-berkeley.github.io/susie/) |
| --------------------------------- | -------- | -------- | -------- |
| 1 task in a row | 92.2 | 85.4 | 87.0 |
| 2 tasks in a row | 78.7 | 71.2 | 69.0 |
| 3 tasks in a row | 63.9 | 59.6 | 49.0 |
| 4 tasks in a row | 51.2 | 49.7 | 38.0 |
| 5 tasks in a row | 41.2 | 40.1 | 26.0 |
## Citation
**BibTeX:**
```
@article{ke2024diffuseractor,
  title={3D Diffuser Actor: Policy Diffusion with 3D Scene Representations},
  author={Ke, Tsung-Wei and Gkanatsios, Nikolaos and Fragkiadaki, Katerina},
  journal={arXiv preprint arXiv:2402.10885},
  year={2024}
}
```
## Model Card Contact
For errors in this model card, contact Nikos or Tsung-Wei, {ngkanats, tsungwek} at andrew dot cmu dot edu.