DreamWorld: DROID Action-Conditioned World Models (RGB / RGBD)

Fine-tuned checkpoints of the Wan-T2V-1.3B backbone for action-conditioned video prediction on the DROID dataset, trained with VAE-encoded latent shards.

What's in this repo

Six fine-tuning runs, each with its final-step model.safetensors and states.pt:

| Run | Modality | Conditioning | Final step | Notes |
|---|---|---|---|---|
| arm_a_rgb_latent/ | RGB | observed action (executed) | 85000 | base RGB run |
| arm_a_rgb_latent_ctrlworld_obs/ | RGB | observation state + executed action | 80000 | ctrlworld-style |
| arm_a_rgb_latent_commanded_action/ | RGB | commanded action (target) | 55000 | command-not-obs |
| arm_b_rgbd_latent/ | RGBD | observed action | 55000 | depth-conditioned |
| arm_b_rgbd_latent_ctrlworld_obs/ | RGBD | observation state + executed action | 80000 | RGBD + ctrlworld |
| arm_b_rgbd_latent_commanded_action/ | RGBD | commanded action | 55000 | RGBD + command |
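
The run listing above can be mirrored in a small registry for picking a checkpoint by modality and fetching only its files. This is a minimal sketch, not the authors' tooling: the names `Run`, `RUNS`, `select_runs`, `allow_patterns`, and `download_run` are hypothetical, the repo id is assumed from the hosting page, and the exact file layout inside each run directory is not confirmed, so the patterns match the file names at any depth under the run folder.

```python
from dataclasses import dataclass

# Assumption: repo id taken from the hosting page, not stated in the README text.
REPO_ID = "huiliu123/dreamworld-trained-models"


@dataclass(frozen=True)
class Run:
    name: str          # directory name in the repo
    modality: str      # "RGB" or "RGBD"
    conditioning: str  # conditioning signal, as listed in the README
    final_step: int    # final training step of the released checkpoint


# Values copied from the run listing in this README.
RUNS = (
    Run("arm_a_rgb_latent", "RGB", "observed action (executed)", 85000),
    Run("arm_a_rgb_latent_ctrlworld_obs", "RGB", "observation state + executed action", 80000),
    Run("arm_a_rgb_latent_commanded_action", "RGB", "commanded action (target)", 55000),
    Run("arm_b_rgbd_latent", "RGBD", "observed action", 55000),
    Run("arm_b_rgbd_latent_ctrlworld_obs", "RGBD", "observation state + executed action", 80000),
    Run("arm_b_rgbd_latent_commanded_action", "RGBD", "commanded action", 55000),
)


def select_runs(modality=None):
    """Return all runs, or only those with the given modality."""
    return [r for r in RUNS if modality is None or r.modality == modality]


def allow_patterns(run, with_states=False):
    """Build glob patterns for one run's files.

    Assumption: the files sit somewhere under the run directory; the
    leading '*' tolerates an intermediate step folder if there is one.
    """
    files = ["model.safetensors"] + (["states.pt"] if with_states else [])
    return [f"{run.name}/*{f}" for f in files]


def download_run(run, local_dir, with_states=False):
    """Download one run's files (needs network access and huggingface_hub)."""
    # Imported lazily so the registry is usable without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=allow_patterns(run, with_states),
        local_dir=local_dir,
    )
```

For example, `select_runs("RGBD")` returns the three `arm_b_*` runs, and `download_run(RUNS[0], "./ckpt")` would fetch only the base RGB weights rather than the whole repo.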

Each model.safetensors file holds the EMA / final transformer weights produced by finetune/trainer/sft_trainer/trainer.py.
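
Before loading a multi-gigabyte checkpoint, it can help to check which tensors it contains. The sketch below reads only the safetensors header (an 8-byte little-endian length followed by a JSON table of tensor names, dtypes, and shapes), so no weights are loaded into memory. The helper names `read_safetensors_header` and `count_parameters` are hypothetical and use only the standard library; they are not part of the training code.

```python
import json
import struct


def read_safetensors_header(path):
    """Return {tensor_name: {"dtype", "shape", "data_offsets"}} for a .safetensors file."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian 64-bit size of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # Optional free-form metadata is not a tensor entry; drop it.
    header.pop("__metadata__", None)
    return header


def count_parameters(header):
    """Sum the element counts of all tensors described by the header."""
    total = 0
    for info in header.values():
        n = 1
        for dim in info["shape"]:
            n *= dim
        total += n
    return total
```

Calling `count_parameters(read_safetensors_header("arm_a_rgb_latent/model.safetensors"))` on a downloaded checkpoint gives a quick sanity check against the expected size of the 1.3B backbone; the path here assumes the file sits directly in the run directory.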

Eval artifacts

eval_results/ contains sample inference outputs and precheck results referenced in the project notes (see companion repo).

Training code

Code, configs, and dataset preprocessing pipeline live in private repos owned by huiliu0424. The training scripts that produced these checkpoints are at script/training/training_arm_{a,b}_latent[_*].json.

Latent shards (training data)

The VAE-encoded latent shards used during training (~900 GB total) are not included in this repo. They were derived from the DROID dataset, processed with depth (FoundationStereo) and 2D point-flow (CoTracker) signals. Reach out if you need access.

Citation

If you use these models, please cite the upstream Wan-T2V work and the DROID dataset.
