DreamWorld — DROID Action-Conditioned World Models (RGB / RGBD)

Fine-tuned checkpoints of the Wan-T2V-1.3B backbone for action-conditioned video prediction on the DROID dataset, trained with VAE-encoded latent shards.

What's in this repo

Six fine-tuning runs, each with its final-step model.safetensors and states.pt:

| Run | Modality | Conditioning | Final step | Notes |
| --- | --- | --- | --- | --- |
| `arm_a_rgb_latent/` | RGB | observed action (executed) | 85000 | base RGB run |
| `arm_a_rgb_latent_ctrlworld_obs/` | RGB | observation state + executed action | 80000 | ctrlworld-style |
| `arm_a_rgb_latent_commanded_action/` | RGB | commanded action (target) | 55000 | command-not-obs |
| `arm_b_rgbd_latent/` | RGBD | observed action | 55000 | depth-conditioned |
| `arm_b_rgbd_latent_ctrlworld_obs/` | RGBD | observation state + executed action | 80000 | RGBD + ctrlworld |
| `arm_b_rgbd_latent_commanded_action/` | RGBD | commanded action | 55000 | RGBD + command |
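
To fetch one run rather than all six, one option is to filter the download by run directory. The sketch below assumes the repo id shown on this model page (`huiliu123/dreamworld-trained-models`), that `huggingface_hub` is installed, and that each run's files sit under the directory named in the table. `RUNS`, `patterns_for`, and `download_run` are illustrative names, not part of the project.

```python
from pathlib import Path

# Repo id as shown on the model page (assumption: unchanged since publication).
REPO_ID = "huiliu123/dreamworld-trained-models"

# Run directory -> (modality, conditioning, final step), copied from the table.
RUNS = {
    "arm_a_rgb_latent": ("RGB", "observed action (executed)", 85000),
    "arm_a_rgb_latent_ctrlworld_obs": ("RGB", "observation state + executed action", 80000),
    "arm_a_rgb_latent_commanded_action": ("RGB", "commanded action (target)", 55000),
    "arm_b_rgbd_latent": ("RGBD", "observed action", 55000),
    "arm_b_rgbd_latent_ctrlworld_obs": ("RGBD", "observation state + executed action", 80000),
    "arm_b_rgbd_latent_commanded_action": ("RGBD", "commanded action", 55000),
}


def patterns_for(run: str) -> list[str]:
    """Return glob patterns that select a single run directory."""
    if run not in RUNS:
        raise KeyError(f"unknown run: {run!r}")
    # The trailing slash keeps "arm_a_rgb_latent" from also matching
    # "arm_a_rgb_latent_ctrlworld_obs".
    return [f"{run}/*"]


def download_run(run: str, local_dir: str = "checkpoints") -> Path:
    """Download one run's files; needs network access and huggingface_hub."""
    from huggingface_hub import snapshot_download  # imported lazily on purpose

    path = snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=patterns_for(run),
        local_dir=local_dir,
    )
    return Path(path)
```

Calling `download_run("arm_b_rgbd_latent")` would then place that run's files under `checkpoints/`, provided the assumed layout holds.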

Each model.safetensors holds the EMA / final transformer weights produced by finetune/trainer/sft_trainer/trainer.py.
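
Before loading a checkpoint onto a GPU, it can help to check which tensors it contains. A minimal sketch using only the standard library follows; it relies on the safetensors file layout (an 8-byte little-endian header length followed by a JSON header) and makes no assumption about the tensor names in these checkpoints. The function names are illustrative.

```python
import json
import struct


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without reading weights."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian 64-bit header length.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # Optional free-form metadata entry; it does not describe a tensor.
    header.pop("__metadata__", None)
    return header


def summarize(path):
    """Return (number of tensors, total number of elements) for a checkpoint."""
    header = read_safetensors_header(path)
    total = 0
    for info in header.values():
        count = 1
        for dim in info["shape"]:
            count *= dim
        total += count
    return len(header), total
```

To load the weights themselves, `safetensors.torch.load_file(path)` returns a dict mapping tensor names to tensors; how that dict maps onto the transformer class depends on the training code, which is not part of this repo.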

Eval artifacts

eval_results/ contains sample inference outputs and precheck results referenced in the project notes (see companion repo).

Training code

Code, configs, and dataset preprocessing pipeline live in private repos owned by huiliu0424. The training scripts that produced these checkpoints are at script/training/training_arm_{a,b}_latent[_*].json.
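
The path shorthand above can be read as shell-style brace alternation (`a` or `b`) plus an optional underscore-prefixed suffix. The sketch below encodes that reading as a regular expression. The actual file names live in the private repos and were not checked, so the suffix in the usage example is hypothetical, as is the helper name `parse_config_name`.

```python
import re

# One reading of training_arm_{a,b}_latent[_*].json:
# arm is "a" or "b"; the "_<suffix>" part is optional.
CONFIG_RE = re.compile(r"^training_arm_(a|b)_latent(_.+)?\.json$")


def parse_config_name(name):
    """Return (arm, suffix) for a matching file name, else None."""
    match = CONFIG_RE.match(name)
    if match is None:
        return None
    arm, suffix = match.group(1), match.group(2)
    # Drop the leading underscore; an absent suffix becomes "".
    return arm, (suffix[1:] if suffix else "")
```

For example, `parse_config_name("training_arm_b_latent_ctrlworld_obs.json")` yields `("b", "ctrlworld_obs")`, while a name with any other arm letter yields `None`.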

Latent shards (training data)

The VAE-encoded latent shards used during training (~900 GB total) are not included in this repo. They were derived from the DROID dataset after processing it with depth (FoundationStereo) and 2D point-flow (CoTracker) signals. Reach out if you need access.

Citation

If you use these models, please cite the upstream Wan-T2V work and the DROID dataset.

Base model

Wan-AI/Wan2.1-T2V-1.3B (huiliu123/dreamworld-trained-models is a fine-tune of this model).