---
license: mit
license_link: https://huggingface.co/microsoft/Phi-3-vision-128k-instruct/resolve/main/LICENSE
language:
- multilingual
pipeline_tag: text-generation
tags:
- nlp
- code
- vision
widget:
- messages:
  - role: user
    content: <|image_1|>\nWhat action should the robot take to {lang}?
---

## TraceVLA-7B

``TraceVLA-7B`` is a vision-language-action model obtained by finetuning the base [OpenVLA](https://huggingface.co/openvla/openvla-7b) model with the [visual trace prompting](https://arxiv.org/abs/2412.10345) technique.
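
Concretely, visual trace prompting overlays the recent 2D tracks of the robot end effector and moving objects onto the current observation and feeds the policy both the original and the overlaid image (see the prompt template in the inference code below). The snippet below is only a minimal sketch of what such an overlay might look like; in practice the released `TraceProcessor` (used in the inference example) produces the overlaid image from CoTracker point tracks, and the `overlay_trace` helper and its arguments here are hypothetical.

```python
import numpy as np
from PIL import Image, ImageDraw


def overlay_trace(image: Image.Image, track: np.ndarray) -> Image.Image:
    """Draw a historical 2D point track of shape (T, 2), e.g. of the robot end
    effector, on top of the current observation (illustration only)."""
    overlaid = image.copy()
    draw = ImageDraw.Draw(overlaid)
    points = [tuple(p) for p in track.astype(float)]
    # Connect consecutive tracked points to visualize the motion history.
    draw.line(points, fill=(255, 0, 0), width=3)
    # Mark the most recent position of the tracked point.
    x, y = points[-1]
    draw.ellipse([x - 4, y - 4, x + 4, y + 4], outline=(255, 0, 0), width=2)
    return overlaid
```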

### Results on SimplerEnv (Fractal + Bridge):

#### Fractal:

| Policy/Settings | Pick up Coke | Move near | Open/Close Drawer | Put in Drawer | Average Success Rate |
|:------:|:------------:|:---------:|:------------:|:-----------:|:-------:|
| (Visual Matching) OpenVLA-7B | 23.7% | **65.0%** | 57.4% | 0.0% | 36.5% |
| (Visual Matching) TraceVLA-7B | **45.0%** | 63.8% | **63.1%** | **11.1%** | **45.8%** |
| (Variant Aggregation) OpenVLA-7B | 61.3% | 55.8% | 24.9% | 1.0% | 35.8% |
| (Variant Aggregation) TraceVLA-7B | **64.3%** | **60.6%** | **61.6%** | **12.5%** | **49.8%** |

#### Bridge:

| Policy/Settings | Put Spoon | Put Carrot | Stack Block | Put Eggplant | Average Success Rate |
|:------:|:------------:|:---------:|:------------:|:-----------:|:-------:|
| OpenVLA-7B | 8.3% | 8.3% | 4.2% | 45.8% | 16.7% |
| TraceVLA-7B | **12.5%** | **16.6%** | **16.6%** | **65.0%** | **27.7%** |

### Sample Inference Code

Here is sample inference code for the TraceVLA-7B model:

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_path = "furonghuang-lab/tracevla_7b"

# Load Processor & VLA
processor = AutoProcessor.from_pretrained(
    model_path,
    trust_remote_code=True,
    num_crops=1,
)

vla = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    _attn_implementation='flash_attention_2',
    use_cache=True,
).to(device='cuda')

# Load Visual Trace Processor
# cotracker_model_path is the path to your downloaded scaled_offline.pth checkpoint
from prismatic.eval.trace_processor import TraceProcessor
trace_processor = TraceProcessor(cotracker_model_path)

# Grab image input & format prompt
# If the visual trace returned by CoTracker is not valid, we fall back to the default OpenVLA prompt.
openvla_prompt_template = "In: What action should the robot take to {task_description}?\nOut:"
tracevla_prompt_template = "In: You are given two images: one with the original robot observation, and another one marked with historical traces of the robot end effector and moving objects, separated by a special separator token. What action should the robot take to {task_description}?\nOut:"

image: Image.Image = get_from_camera(...)
image_overlaid, has_trace = trace_processor.process_image(image)

if not has_trace:
    prompt = openvla_prompt_template.format(task_description=task_description)
    inputs = processor(prompt, [image, image]).to(device='cuda', dtype=torch.bfloat16)
else:
    prompt = tracevla_prompt_template.format(task_description=task_description)
    inputs = processor(prompt, [image, image_overlaid]).to(device='cuda', dtype=torch.bfloat16)

# Predict the action
with torch.inference_mode():
    action = vla.predict_action(**inputs)

# Execute the action
robot.act(action, ...)
```
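
In a full episode, this snippet is typically wrapped in a closed control loop that re-runs the trace processor on every new frame before querying the policy. The sketch below illustrates that pattern with the objects defined above; `get_from_camera`, `robot`, `task_description`, and `max_steps` are placeholders standing in for your own camera, robot interface, language instruction, and episode horizon, not part of the released API.

```python
# Hedged sketch of a closed-loop rollout built on the snippet above.
def get_action(image, task_description):
    image_overlaid, has_trace = trace_processor.process_image(image)
    if not has_trace:
        # No valid visual trace: fall back to the plain OpenVLA prompt.
        prompt = openvla_prompt_template.format(task_description=task_description)
        images = [image, image]
    else:
        prompt = tracevla_prompt_template.format(task_description=task_description)
        images = [image, image_overlaid]
    inputs = processor(prompt, images).to(device='cuda', dtype=torch.bfloat16)
    with torch.inference_mode():
        return vla.predict_action(**inputs)


for _ in range(max_steps):
    image = get_from_camera(...)
    action = get_action(image, task_description)
    robot.act(action, ...)
```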

For more examples, including scripts for finetuning TraceVLA models on your own robot demonstration datasets, check out our [repository](https://github.com/FrankZheng2022/tracevla).

### Citation

If you find our code or models useful in your work, please cite [our paper](https://arxiv.org/abs/2412.10345):

```bibtex
@misc{zheng2024tracevlavisualtraceprompting,
      title={TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies},
      author={Ruijie Zheng and Yongyuan Liang and Shuaiyi Huang and Jianfeng Gao and Hal Daumé III and Andrey Kolobov and Furong Huang and Jianwei Yang},
      year={2024},
      eprint={2412.10345},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2412.10345},
}
```