
Octo Base

See https://github.com/octo-models/octo for instructions on using this model.

Octo Base is trained with an observation window size of 2, predicting 7-dimensional actions 4 steps into the future using a diffusion policy. The model is a Transformer with 93M parameters (comparable to a ViT-B). Images are tokenized by preprocessing them with a lightweight convolutional encoder and grouping the result into 16x16 patches. Language is tokenized with the T5 tokenizer and then embedded with the T5-Base language encoder.
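A minimal loading sketch, assuming the octo package from the repository above is installed and that the checkpoint is published under the rail-berkeley/octo-base-1.5 Hugging Face path (the path and the get_pretty_spec helper follow the repository README and may change between versions):

from octo.model.octo_model import OctoModel

# Download and load the pretrained checkpoint from the Hugging Face Hub.
model = OctoModel.load_pretrained("hf://rail-berkeley/octo-base-1.5")

# Print the observation/task spec the model expects (reproduced below).
print(model.get_pretty_spec())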

Observations and tasks conform to the following spec:

Observations:

{
    image_primary: ('batch', 'history_window', 256, 256, 3),
    image_wrist: ('batch', 'history_window', 128, 128, 3),
}

Tasks:

{
    image_primary: ('batch', 256, 256, 3),
    image_wrist: ('batch', 128, 128, 3),
    language_instruction: {
        attention_mask: ('batch', 16),
        input_ids: ('batch', 16),
    },
}

At inference time, you may pass in any subset of these observation and task keys, with a history window of up to 2 timesteps.
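As a rough sketch of a single inference call, assuming the model loaded above; the instruction text is illustrative, and the timestep_pad_mask key name follows the v1.5 repository README (earlier versions used pad_mask):

import jax
import numpy as np

# Dummy third-person camera frames: batch 1, history window 2, 256x256 RGB.
observation = {
    "image_primary": np.zeros((1, 2, 256, 256, 3), dtype=np.uint8),
    # Marks which history timesteps are real (True) rather than padding.
    "timestep_pad_mask": np.array([[True, True]]),
}

# Build a language-conditioned task; goal images can be supplied instead.
task = model.create_tasks(texts=["pick up the spoon"])

# Sample a chunk of 4 future 7-dimensional actions from the diffusion head.
# Actions stay in the model's normalized action space unless unnormalization
# statistics are passed to sample_actions.
actions = model.sample_actions(observation, task, rng=jax.random.PRNGKey(0))
print(actions.shape)  # (1, 4, 7)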

This model was trained on a mix of datasets from the Open X-Embodiment dataset.

| Dataset | Proportion of batch |
|---|---|
| Fractal (Brohan et al., 2022) | 17.0% |
| Kuka (Kalashnikov et al., 2018) | 17.0% |
| Bridge (Walke et al., 2023) | 17.0% |
| BC-Z (Jang et al., 2022) | 9.1% |
| Stanford Hydra Dataset (Belkhale et al., 2023) | 6.0% |
| Language Table (Lynch et al., 2023) | 5.9% |
| Taco Play (Rosete-Beas et al., 2022; Mees et al., 2023) | 3.6% |
| Furniture Bench Dataset (Heo et al., 2023) | 3.3% |
| UTAustin Mutex (Shah et al., 2023) | 3.0% |
| Austin Sailor Dataset (Nasiriany et al., 2022) | 2.9% |
| Roboturk (Mandlekar et al., 2018) | 2.8% |
| Toto (Zhou et al., 2023) | 2.4% |
| Austin Sirius Dataset (Liu et al., 2023) | 2.3% |
| Berkeley Autolab UR5 (Chen et al.) | 1.5% |
| IAMLab CMU Pickup Insert (Saxena et al., 2023) | 1.2% |
| Viola (Zhu et al., 2023) | 1.2% |
| Berkeley Fanuc Manipulation (Zhu et al., 2023) | 1.0% |
| NYU Franka Play Dataset (Cui et al., 2022) | 0.9% |
| UCSD Kitchen Dataset (Ge Yan and Wang, 2023) | <0.1% |
| Jaco Play (Dass et al., 2023) | 0.6% |
| Berkeley Cable Routing (Luo et al., 2023) | 0.3% |
| Austin Buds Dataset (Zhu et al., 2022) | 0.3% |
| CMU Stretch (Mendonca et al., 2023) | 0.2% |
| NYU Door Opening (Pari et al., 2021) | 0.1% |
| DLR EDAN Shared Control (Quere et al., 2020) | 0.1% |

Updates for Version 1.5

  • Language task tokens are now repeated at every timestep in the context window.
  • Augmented the language instructions in the data with rephrasings from GPT-3.5.
  • Bug fixes:
    • Turned off dropout in the diffusion head due to incompatibility with layer norm.
    • Fixed an off-by-one error with the attention mask.
    • Fixed an issue where different image augmentations did not get fresh random seeds.