Commit
5ddfc8b
1 Parent(s): f7f4ca2

Update README.md (#2)


- Update README.md (fa19845f9de6de5f670562f61fe6b2f6242f2e10)


Co-authored-by: Ethan <Ethan-pooh@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +92 -6
README.md CHANGED
@@ -1,9 +1,95 @@
- ---
  tags:
- - pytorch_model_hub_mixin
- - model_hub_mixin
  ---

- This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
- - Library: https://huggingface.co/robotics-diffusion-transformer/rdt-1b
- - Docs: [More Information Needed]
+ ---
+ license: mit
+ language:
+ - en
+ pipeline_tag: robotics
+ library_name: diffusers
  tags:
+ - robotics
+ - pytorch
+ - diffusers
+ - multimodal
+ - pretraining
+ - vla
+ - diffusion
+ - rdt
  ---
+ # RDT-1B
+
+ RDT-1B is a 1B-parameter imitation-learning Diffusion Transformer pre-trained on 1M+ multi-robot episodes. Given a language instruction and RGB images from up to three views, RDT can predict the next 64 robot actions.
+ RDT is compatible with almost all modern mobile manipulators: single-arm or dual-arm, joint-space or EEF control, position or velocity control, and even platforms with a mobile chassis.
+
+ All the [code](https://github.com/GeneralEmbodiedSystem/RoboticsDiffusionTransformer/tree/main?tab=readme-ov-file) and pre-trained model weights are licensed under the MIT license.
+
+ Please refer to our [project page](https://rdt-robotics.github.io/rdt-robotics/) and [paper]() for more information.
+
+ ## Model Details
+
+ - **Developed by:** The RDT team, consisting of researchers from the [TSAIL group](https://ml.cs.tsinghua.edu.cn/) at Tsinghua University
+ - **Task Type:** Vision-Language-Action (language, image => robot actions)
+ - **Model Type:** Diffusion Policy with Transformers
+ - **License:** MIT
+ - **Language(s) (NLP):** en
+ - **Multi-Modal Encoders:**
+   - **Vision Backbone:** [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384)
+   - **Language Model:** [t5-v1_1-xxl](https://huggingface.co/google/t5-v1_1-xxl)
+ - **Pre-Training Datasets:** 46 datasets, including the [RT-1 Dataset](https://robotics-transformer1.github.io/), [RH20T](https://rh20t.github.io/), [DROID](https://droid-dataset.github.io/), [BridgeData V2](https://rail-berkeley.github.io/bridgedata/), [RoboSet](https://robopen.github.io/roboset/), and a subset of [Open X-Embodiment](https://robotics-transformer-x.github.io/). See [todo]() for a detailed list.
+ - **Repository:** [repo_url]
+ - **Paper:** [paper_url]
+ - **Project Page:** https://rdt-robotics.github.io/rdt-robotics/
+
+ ## Uses
+
+ RDT takes a language instruction, RGB images (from up to three views), the control frequency (if any), and proprioception as input, and predicts the next 64 robot actions in the form of the unified action space vector.
+ The unified action space vector includes all the main physical quantities of a robot manipulator (e.g., end-effector and joint positions and velocities, and base movement).
+ To deploy RDT on your robot platform, you need to pick the relevant quantities from the unified vector, as in the sketch below. See our repository for more information.
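+
+ The sketch below is purely illustrative: the real index layout of the unified vector is defined in the repository, and the names and indices used here are hypothetical.
+ ```python
+ import numpy as np
+
+ # Hypothetical mapping from robot-specific quantities to indices of the
+ # unified action space vector; consult the repository for the real layout.
+ MY_ROBOT_INDICES = {
+     'arm_joint_positions': list(range(0, 6)),  # assumption: a 6-DoF arm
+     'gripper_open': [6],                       # assumption: a 1-D gripper command
+ }
+
+ def extract_robot_actions(unified_actions: np.ndarray) -> dict:
+     """Select this robot's quantities from a (chunk_size, unified_dim) action array."""
+     return {name: unified_actions[:, idx] for name, idx in MY_ROBOT_INDICES.items()}
+ ```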
+
+ **Out-of-Scope**: Due to the embodiment gap, RDT cannot yet generalize to new robot platforms (those not seen in the pre-training datasets).
+ In this case, we recommend collecting a small dataset on the target robot and then using it to fine-tune RDT.
+ See our repository for a tutorial.
+
+ Here's an example of how to use the RDT-1B model for inference on a robot:
+ ```python
+ # Clone the repository, install its dependencies, and run this from the repository root.
+ from typing import List
+
+ import torch
+ from PIL import Image
+
+ from scripts.agilex_model import create_model
+
+ # Names of cameras used for visual input
+ CAMERA_NAMES = ['cam_high', 'cam_right_wrist', 'cam_left_wrist']
+ config = {
+     'episode_len': 1000,  # Max length of one episode
+     'state_dim': 14,      # Dimension of the robot's state
+     'chunk_size': 64,     # Number of actions to predict in one step
+     'camera_names': CAMERA_NAMES,
+ }
+ pretrained_vision_encoder_name_or_path = "google/siglip-so400m-patch14-384"
+ # Create the policy with the specified configuration
+ policy = create_model(
+     args=config,
+     dtype=torch.bfloat16,
+     pretrained_vision_encoder_name_or_path=pretrained_vision_encoder_name_or_path,
+     control_frequency=25,
+ )
+ # Start the inference process
+ # Load pre-computed language embeddings
+ lang_embeddings_path = 'your/language/embedding/path'
+ text_embedding = torch.load(lang_embeddings_path)['embeddings']
+ images: List[Image.Image] = ...  # The images from the last two frames
+ proprio = ...  # The current robot state (proprioception)
+ # Perform inference to predict the next chunk_size actions
+ actions = policy.step(
+     proprio=proprio,
+     images=images,
+     text_embeds=text_embedding,
+ )
+ ```
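+
+ The example above loads pre-computed language embeddings from disk; see the repository for the exact procedure used to produce them. The following is only a rough sketch of the idea, assuming the embeddings are the last hidden states of the T5-XXL text encoder and are saved under the `'embeddings'` key that the loading code above expects:
+ ```python
+ import torch
+ from transformers import AutoTokenizer, T5EncoderModel
+
+ # Load the T5-v1.1-XXL encoder listed under Model Details (a large download).
+ tokenizer = AutoTokenizer.from_pretrained("google/t5-v1_1-xxl")
+ encoder = T5EncoderModel.from_pretrained("google/t5-v1_1-xxl", torch_dtype=torch.bfloat16)
+ encoder.eval()
+
+ instruction = "Pick up the red block and place it in the box."
+ tokens = tokenizer(instruction, return_tensors="pt")
+ with torch.no_grad():
+     embeddings = encoder(**tokens).last_hidden_state  # (1, seq_len, hidden_dim)
+
+ # Save in the format that the inference example loads.
+ torch.save({"embeddings": embeddings}, "your/language/embedding/path")
+ ```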
+
+ <!-- RDT-1B supports fine-tuning on custom datasets, deployment and inference on real robots, as well as retraining the model.
+ Please refer to [our repository](https://github.com/GeneralEmbodiedSystem/RoboticsDiffusionTransformer/blob/main/docs/pretrain.md) for all the above guides. -->
+
+
+ ## Citation
+
+ <!-- If there is a paper or blog post introducing the model, the APA and BibTeX information for that should go in this section. -->
+
+ **BibTeX:**
 
+ [More Information Needed]