license: apache-2.0

A Touch, Vision, and Language Dataset for Multimodal Alignment

by Max (Letian) Fu, Gaurav Datta*, Huang Huang*, William Chung-Ho Panitch*, Jaimyn Drake*, Joseph Ortiz, Mustafa Mukadam, Mike Lambeta, Roberto Calandra, Ken Goldberg at UC Berkeley, Meta AI, TU Dresden, and CeTI (*equal contribution).

[Paper] | [Project Page] | [Citation]

This repo contains the official checkpoints for A Touch, Vision, and Language Dataset for Multimodal Alignment.

The tactile encoders come in three sizes: ViT-Tiny, ViT-Small, and ViT-Base, all of which are stored in

ckpt/tvl_enc

The TVL-LLaMA checkpoints, the generative counterparts, are stored in

ckpt/tvl_llama
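After downloading the repo, it can help to confirm which checkpoint files are actually present. The following is a minimal sketch, assuming the repo has been cloned or downloaded to a local folder; `list_checkpoints` and `repo_root` are hypothetical names used for illustration, and no particular checkpoint file names are assumed.

```python
from pathlib import Path

# The two checkpoint folders named in this README.
CHECKPOINT_DIRS = ("ckpt/tvl_enc", "ckpt/tvl_llama")


def list_checkpoints(repo_root="."):
    """Return {folder: sorted file names} for each checkpoint folder.

    `repo_root` is a hypothetical local path to the downloaded repo.
    A folder that does not exist maps to an empty list.
    """
    root = Path(repo_root)
    found = {}
    for rel in CHECKPOINT_DIRS:
        folder = root / rel
        if folder.is_dir():
            found[rel] = sorted(p.name for p in folder.iterdir() if p.is_file())
        else:
            found[rel] = []
    return found
```

Calling `list_checkpoints(".")` from the repo root returns a dictionary with one entry per folder, which makes a missing or partial download easy to spot.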

Inference

For zero-shot classification, we require OpenCLIP with the following configuration:

CLIP_VISION_MODEL = "ViT-L-14"
CLIP_PRETRAIN_DATA = "datacomp_xl_s13b_b90k"
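As a rough illustration of how this configuration might be used, the sketch below loads the OpenCLIP model with `open_clip.create_model_and_transforms` and `open_clip.get_tokenizer`, then scores a query embedding against label embeddings by cosine similarity and softmax. The helper names `load_openclip` and `zero_shot_probs`, and the `logit_scale` default of 100, are assumptions for illustration; the GitHub repo's evaluation code may differ.

```python
import math

CLIP_VISION_MODEL = "ViT-L-14"
CLIP_PRETRAIN_DATA = "datacomp_xl_s13b_b90k"


def load_openclip():
    """Load the OpenCLIP model, its eval transform, and its tokenizer."""
    # Imported lazily so the scoring helper below works without the
    # dependency; requires `pip install open_clip_torch`.
    import open_clip

    model, _, preprocess = open_clip.create_model_and_transforms(
        CLIP_VISION_MODEL, pretrained=CLIP_PRETRAIN_DATA
    )
    tokenizer = open_clip.get_tokenizer(CLIP_VISION_MODEL)
    return model.eval(), preprocess, tokenizer


def _normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def zero_shot_probs(query, label_embeddings, logit_scale=100.0):
    """Softmax over scaled cosine similarities (hypothetical helper).

    `query` is one embedding (for example, from a tactile or vision
    encoder); `label_embeddings` holds one text embedding per label.
    `logit_scale=100.0` is an assumed temperature, not a repo setting.
    """
    q = _normalize(query)
    logits = [
        logit_scale * sum(a * b for a, b in zip(q, _normalize(e)))
        for e in label_embeddings
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]
```

In practice, the label embeddings would come from encoding the tokenized label prompts with the loaded model, and the predicted label is the one with the highest probability.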

For TVL-LLaMA, please request access to the pre-trained LLaMA-2 weights from this form. In particular, we use llama-2-7b as the base model. For ease of loading, the weights here contain the trained adapter, the tactile encoder, and the vision encoder.

For complete details, see the GitHub repo, which has instructions on pretraining, fine-tuning, and evaluation with these models.