LAPA: Latent Action Pretraining from Videos

LAPA is the first unsupervised approach for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels.
LAPA outperforms the current state-of-the-art VLA model trained with ground-truth actions, establishing a new state of the art among VLA models.
LAPA achieves over 30x greater pretraining efficiency compared to conventional VLA pretraining.

Model Summary

Developed by: The LAPA team consisting of researchers from KAIST, UW, Microsoft, NVIDIA, and AI2.
Model type: Vision-language-action (language, image => robot actions)
Language(s) (NLP): en
License: MIT
Finetuned from: LWM-Chat-1M-Jax, a VLM trained from:
- Vision Backbone: VQGAN
- Language Model: Llama-2
Pretraining Dataset: Open X-Embodiment
Website: https://latentactionpretraining.github.io/
Paper: https://arxiv.org/abs/2410.11758
Code: https://github.com/LatentActionPretraining/LAPA

Primary Use Cases

Our model is designed to accelerate research on unsupervised methods for building vision-language-action models and to serve as a building block for generative-AI-powered features.

Use Case Considerations

Our models are not specifically designed or evaluated for all downstream purposes. Developers should consider common limitations of multimodal language models as they select use cases, and should evaluate and mitigate for accuracy, safety, and fairness before using the model within a specific downstream use case, particularly for high-risk scenarios. Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case.

Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under.

Usage

Latent Inference

To analyze the output of the model, which is a sequence of latent actions (8^4), run the following commands:

conda create -n lapa python=3.10 -y
conda activate lapa
git clone https://github.com/LatentActionPretraining/LAPA.git
cd LAPA
pip install -r requirements.txt
mkdir lapa_checkpoints && cd lapa_checkpoints
wget https://huggingface.co/latent-action-pretraining/LAPA-7B-openx/resolve/main/tokenizer.model
wget https://huggingface.co/latent-action-pretraining/LAPA-7B-openx/resolve/main/vqgan
wget https://huggingface.co/latent-action-pretraining/LAPA-7B-openx/resolve/main/params
cd ..
python -m latent_pretraining.inference
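
For orientation when inspecting the inference output, the following is a minimal sketch of how the size of the latent action space can be reasoned about. It assumes that "8^4" means each latent action is a sequence of 4 discrete tokens, each drawn from a codebook of 8 entries. The function names and the base-8 packing are illustrative only and are not part of the LAPA codebase.

```python
# Assumption: "8^4" = 4 latent tokens per action, each from a codebook of size 8.
CODEBOOK_SIZE = 8
SEQUENCE_LENGTH = 4


def latent_to_index(tokens):
    """Pack a latent action (4 token ids) into one base-8 integer (hypothetical helper)."""
    if len(tokens) != SEQUENCE_LENGTH:
        raise ValueError(f"expected {SEQUENCE_LENGTH} tokens, got {len(tokens)}")
    index = 0
    for token in tokens:
        if not 0 <= token < CODEBOOK_SIZE:
            raise ValueError(f"token {token} is outside [0, {CODEBOOK_SIZE})")
        index = index * CODEBOOK_SIZE + token
    return index


def index_to_latent(index):
    """Inverse of latent_to_index: unpack an integer into 4 token ids."""
    if not 0 <= index < CODEBOOK_SIZE ** SEQUENCE_LENGTH:
        raise ValueError(f"index {index} is out of range")
    tokens = []
    for _ in range(SEQUENCE_LENGTH):
        index, token = divmod(index, CODEBOOK_SIZE)
        tokens.append(token)
    return tuple(reversed(tokens))


# Total number of distinct latent actions under the assumption above.
num_latent_actions = CODEBOOK_SIZE ** SEQUENCE_LENGTH
print(num_latent_actions)
print(index_to_latent(latent_to_index((1, 2, 3, 4))))
```

Under this reading, the model chooses among 4096 possible latent actions per step, and a single integer index is a compact way to count or histogram the actions the model emits.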

Fine-tuning

Since the released checkpoint is trained with the latent pretraining objective, its outputs are not real actions that can be executed in the real world. To make the model output executable actions, fine-tune it on a small set of trajectories that contain ground-truth actions (~150 trajectories) so that it learns to map the latent action space to the actual action space.

To finetune the model on SIMPLER, run the following command:

./scripts/finetune_simpler.sh

To finetune the model on a custom dataset, run the following commands:

python data/finetune_preprocess.py --input_path "/path_to_json_file" --output_filename "data/real_finetune.jsonl" --csv_filename "data/real_finetune.csv"
./scripts/finetune_real.sh

Benchmarks

To understand its capabilities, we compare LAPA with a set of models on a variety of benchmarks. The following is a high-level overview of model quality on representative benchmarks:

Real-World Experiments

Task	Scratch	OpenVLA (Bridge)	ActionVLA (Bridge)	LAPA (Bridge)	OpenVLA (OpenX)	LAPA (OpenX)	LAPA (Sthv2)
Knock	13.9	33.3	25.0	25.0	38.9	52.8	30.6
Cover	38.7	42.3	47.8	42.4	38.6	51.7	47.9
Pick and Place	11.1	22.2	19.4	43.4	54.2	45.8	23.6
Average	21.2	32.6	30.8	36.8	43.9	50.1	34.0
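
As a quick arithmetic check on the table above, the sketch below recomputes each column's mean from the three task rows and compares it with the reported Average row. The numbers are copied from the table. The variable names and the 0.2 tolerance for rounding are choices made for this illustration.

```python
# Per-task scores copied from the Real-World Experiments table.
models = [
    "Scratch", "OpenVLA (Bridge)", "ActionVLA (Bridge)", "LAPA (Bridge)",
    "OpenVLA (OpenX)", "LAPA (OpenX)", "LAPA (Sthv2)",
]
scores = {
    "Knock":          [13.9, 33.3, 25.0, 25.0, 38.9, 52.8, 30.6],
    "Cover":          [38.7, 42.3, 47.8, 42.4, 38.6, 51.7, 47.9],
    "Pick and Place": [11.1, 22.2, 19.4, 43.4, 54.2, 45.8, 23.6],
}
reported_average = [21.2, 32.6, 30.8, 36.8, 43.9, 50.1, 34.0]

# Column-wise mean over the three tasks.
recomputed = [sum(column) / len(column) for column in zip(*scores.values())]

for name, mean, reported in zip(models, recomputed, reported_average):
    print(f"{name:20s} recomputed={mean:5.2f}  reported={reported:5.1f}")

# Model with the highest reported average.
best_model = max(zip(reported_average, models))[1]
print("Best average:", best_model)
```

Small gaps between the recomputed and reported averages are expected, since the per-task entries in the table are themselves rounded.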

Training

Model


Developer	LAPA Team
Architecture	LAPA has 7B parameters and builds on the Large World Model (LWM) architecture. The model consists of a pretrained Llama-2 language model and a VQGAN vision encoder.
Inputs	Text and Image
Context length	4K tokens
GPUs	8 H100-80G
Training time	34 hours
Training data	7.0B tokens
Outputs	Generated latent actions in response to the input
Dates	Trained in Sep 2024
Status	This is a static model trained on an offline dataset (Open X-Embodiment) of publicly available data. Future versions of the tuned models may be released as we improve them.
Supported languages	English
Release date	Oct 2024
License	MIT
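
The compute figures in the table above imply a total training budget that can be worked out directly. The sketch below uses only the listed numbers (8 GPUs, 34 hours, 7.0B tokens). The derived throughput is a back-of-the-envelope average that assumes all 34 hours were spent processing tokens, and it is not a figure reported by the authors.

```python
# Values copied from the Model table.
NUM_GPUS = 8
TRAINING_HOURS = 34
TRAINING_TOKENS = 7.0e9

# Total compute budget in GPU-hours: 8 * 34 = 272.
gpu_hours = NUM_GPUS * TRAINING_HOURS

# Average throughput (assumption: uniform over the whole run).
tokens_per_second = TRAINING_TOKENS / (TRAINING_HOURS * 3600)
tokens_per_gpu_second = tokens_per_second / NUM_GPUS

print(f"GPU-hours: {gpu_hours}")
print(f"Average tokens/s (all GPUs): {tokens_per_second:,.0f}")
print(f"Average tokens/s per GPU:    {tokens_per_gpu_second:,.0f}")
```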

Training Datasets

Our training data comes from the Open X-Embodiment dataset. From the full dataset, we use a mixture of subsets similar to the one used by OpenVLA.

Responsible AI Considerations

The LAPA model can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include:

Quality of Service: LAPA is trained without using any ground-truth action labels during pretraining. Therefore, it might fall short on complex tasks that require fine-grained motion planning.
Inappropriate or Offensive Content: Since the model is based on a vision-language model, it inherits the limitations of the backbone model. This model may produce other types of inappropriate or offensive content, which may make it inappropriate to deploy for sensitive contexts without additional mitigations that are specific to the use case.
Information Reliability: The latent action generated by the model may not be accurate since it is trained on a limited amount of pretraining data.

Developers should apply responsible AI best practices and are responsible for ensuring that a specific use-case complies with relevant laws and regulations (e.g. privacy, trade, etc.). Important areas for consideration include:

High-Risk Scenarios: Developers should assess suitability of using models in high-risk scenarios where unfair, unreliable or offensive outputs might be extremely costly or lead to harm. This includes the model showing unintended behavior after fine-tuning. Additional safeguards should be implemented at the application level according to the deployment context.
Generation of Harmful Content: Developers should assess outputs for their context and use available safety classifiers or custom solutions appropriate for their use case.
Misuse: Other forms of misuse such as fraud, spam, or malware production may be possible, and developers should ensure that their applications do not violate applicable laws and regulations.
Copyrighted content: The model might generate content that infringes on copyright protections. Developers should implement measures to detect and filter copyrighted material, and end-users should be informed about the potential for unintended copyright violations and the importance of verifying original sources to avoid legal complications.

License

The model is licensed under the MIT license.
