---
license: apache-2.0
---


Model Card for SpaceLLaVA-lite

SpaceLLaVA-lite fine-tunes MobileVLM on a dataset synthesized with VQASynth to enhance spatial reasoning, following the approach of SpatialVLM.

Model Details

Model Description

This model uses data synthesis techniques and publicly available models to reproduce the work described in SpatialVLM, enhancing the spatial reasoning of multimodal models. A pipeline of expert models infers spatial relationships between objects in a scene to create a VQA dataset for spatial reasoning.
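
For a sense of what the synthesized data looks like, the record below is a minimal sketch of a spatial-VQA sample, assuming a simple image/question/answer layout; the field names and values are illustrative, not the actual VQASynth schema.

# Hypothetical spatial-VQA training record (illustrative only;
# not the actual VQASynth output format).
sample = {
    "image": "scene_000123.jpg",
    "question": "How far apart are the chair and the coffee table?",
    "answer": "The chair is roughly 1.2 meters from the coffee table.",
}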

  • Developed by: remyx.ai
  • Model type: Multimodal model, vision-language model, MobileVLM
  • License: Apache-2.0
  • Finetuned from model: MobileVLM

Model Sources

Uses

Use this model to query spatial relationships between objects in a scene.

Run it using the MobileVLM inference code:

# Assumes the current working directory is /path/to/MobileVLM/
from scripts.inference import inference_once

model_path = "/path/to/SpaceLLaVA-lite"
image_file = "/path/to/your-image.jpg"
prompt_str = "For each object in the scene, describe the distance between objects in meters"

# Pack the options inference_once expects into a simple namespace-like object.
args = type('Args', (), {
    "model_path": model_path,
    "image_file": image_file,
    "prompt": prompt_str,
    "conv_mode": "v1",
    "temperature": 0,        # greedy, deterministic decoding
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 512,
    "load_8bit": False,      # set True for 8-bit quantized loading
    "load_4bit": False,      # set True for 4-bit quantized loading
})()

inference_once(args)
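
A note on the settings above: temperature 0 with a single beam makes decoding greedy and deterministic, which suits measurement-style answers; the load_8bit and load_4bit flags enable quantized model loading to reduce memory use on constrained hardware, at some cost in output quality.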

Try it on Discord: http://discord.gg/b2yGuCNpuC

Citation

@article{chen2024spatialvlm,
  title={SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities},
  author={Chen, Boyuan and Xu, Zhuo and Kirmani, Sean and Ichter, Brian and Driess, Danny and Florence, Pete and Sadigh, Dorsa and Guibas, Leonidas and Xia, Fei},
  journal={arXiv preprint arXiv:2401.12168},
  year={2024},
  url={https://arxiv.org/abs/2401.12168}
}

@article{chu2023mobilevlm,
  title={MobileVLM: A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices},
  author={Chu, Xiangxiang and Qiao, Limeng and Lin, Xinyang and Xu, Shuang and Yang, Yang and Hu, Yiming and Wei, Fei and Zhang, Xinyu and Zhang, Bo and Wei, Xiaolin and others},
  journal={arXiv preprint arXiv:2312.16886},
  year={2023}
}

@article{chu2024mobilevlm,
  title={MobileVLM V2: Faster and Stronger Baseline for Vision Language Model},
  author={Chu, Xiangxiang and Qiao, Limeng and Zhang, Xinyu and Xu, Shuang and Wei, Fei and Yang, Yang and Sun, Xiaofei and Hu, Yiming and Lin, Xinyang and Zhang, Bo and others},
  journal={arXiv preprint arXiv:2402.03766},
  year={2024}
}