license: apache-2.0
A Touch, Vision, and Language Dataset for Multimodal Alignment
by Max (Letian) Fu, Gaurav Datta*, Huang Huang*, William Chung-Ho Panitch*, Jaimyn Drake*, Joseph Ortiz, Mustafa Mukadam, Mike Lambeta, Roberto Calandra, Ken Goldberg at UC Berkeley, Meta AI, TU Dresden, and CeTI (*equal contribution).
[Paper] | [Project Page] | [Citation]
This repo contains the official checkpoints for A Touch, Vision, and Language Dataset for Multimodal Alignment.
The tactile encoders come in three sizes: ViT-Tiny, ViT-Small, and ViT-Base, all of which are stored in
ckpt/tvl_enc
The TVL-LLaMA checkpoints, the generative counterparts of the tactile encoders, are stored in
ckpt/tvl_llama
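As a quick sanity check after downloading, the two checkpoint directories above can be listed with a small helper. This is only a sketch: `list_checkpoints` and `CKPT_DIRS` are hypothetical names introduced here, and the helper makes no assumption about the file names or extensions inside each directory.

```python
from pathlib import Path

# Directory layout taken from this README; the dictionary keys are illustrative labels.
CKPT_DIRS = {
    "tvl_enc": Path("ckpt/tvl_enc"),      # tactile encoders (ViT-Tiny / ViT-Small / ViT-Base)
    "tvl_llama": Path("ckpt/tvl_llama"),  # TVL-LLaMA checkpoints
}


def list_checkpoints(root="."):
    """Return the file names found in each checkpoint directory under `root`.

    A directory that does not exist maps to an empty list instead of raising.
    """
    found = {}
    for name, rel_dir in CKPT_DIRS.items():
        directory = Path(root) / rel_dir
        if directory.is_dir():
            found[name] = sorted(p.name for p in directory.iterdir() if p.is_file())
        else:
            found[name] = []
    return found
```

Calling `list_checkpoints(".")` from the directory where the repo was downloaded should show which encoder sizes and TVL-LLaMA files are present locally.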
Inference
Zero-shot classification requires OpenCLIP with the following configuration:
CLIP_VISION_MODEL = "ViT-L-14"
CLIP_PRETRAIN_DATA = "datacomp_xl_s13b_b90k"
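A minimal sketch of how this configuration might be passed to OpenCLIP, assuming the `open_clip_torch` package is installed. The wrapper `load_openclip` is a hypothetical helper and not part of the official code; how the tactile encoder is combined with the CLIP model is described in the GitHub repo and is not shown here.

```python
CLIP_VISION_MODEL = "ViT-L-14"
CLIP_PRETRAIN_DATA = "datacomp_xl_s13b_b90k"


def load_openclip(device="cpu"):
    """Load the OpenCLIP model, image preprocessing transform, and tokenizer.

    The import is kept inside the function so the constants above can be
    reused without OpenCLIP installed. The first call downloads the
    pretrained weights.
    """
    import open_clip  # pip install open_clip_torch

    model, _, preprocess = open_clip.create_model_and_transforms(
        CLIP_VISION_MODEL, pretrained=CLIP_PRETRAIN_DATA
    )
    tokenizer = open_clip.get_tokenizer(CLIP_VISION_MODEL)
    model = model.to(device).eval()  # inference only
    return model, preprocess, tokenizer
```

Typical usage would be `model, preprocess, tokenizer = load_openclip()`, followed by `model.encode_text(tokenizer(["a text prompt"]))` to obtain text embeddings for the class prompts.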
For TVL-LLaMA, please request access to the pre-trained LLaMA-2 weights from this form. In particular, we use llama-2-7b
as the base model. For ease of loading, the weights here contain the trained adapter, the tactile encoder, and the vision encoder.
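To see which components a downloaded checkpoint bundles, one option is to group its parameter names by their leading prefix. The sketch below assumes the checkpoint is a standard PyTorch file; the `"model"` wrapper key and the helper names are assumptions made for illustration, and the actual key layout should be confirmed against the GitHub repo.

```python
from collections import Counter


def summarize_keys(keys, depth=1):
    """Count parameter names by their first `depth` dot-separated components."""
    counts = Counter(".".join(key.split(".")[:depth]) for key in keys)
    return dict(counts)


def load_checkpoint_keys(path):
    """Return the parameter names stored in a PyTorch checkpoint file.

    Assumption: the state dict sits either at the top level or under a
    "model" key; adjust this if the checkpoint is organized differently.
    """
    import torch  # imported lazily so summarize_keys works without torch

    checkpoint = torch.load(path, map_location="cpu")
    state = checkpoint.get("model", checkpoint) if isinstance(checkpoint, dict) else checkpoint
    return list(state.keys())
```

For example, `summarize_keys(load_checkpoint_keys(path))` gives a rough count of parameters per top-level module, which helps tell adapter weights apart from encoder weights.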
For complete details, see the GitHub repo, which provides instructions for pretraining, fine-tuning, and evaluation with these models.