---
license: apache-2.0
---

![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/647777304ae93470ffc28913/XQv9iMSZeLYVkXdxassGb.jpeg)

# Model Card for SpaceLLaVA-lite

**SpaceLLaVA-lite** fine-tunes [MobileVLM](https://github.com/Meituan-AutoML/MobileVLM) on a dataset built with [VQASynth](https://github.com/remyxai/VQASynth/tree/main) to enhance spatial reasoning, following [SpatialVLM](https://spatial-vlm.github.io/).

## Model Details

### Model Description

This model uses data synthesis techniques and publicly available models to reproduce the work described in SpatialVLM and enhance the spatial reasoning of multimodal models. A pipeline of expert models infers spatial relationships between objects in a scene, producing a VQA dataset for spatial reasoning.

- **Developed by:** remyx.ai
- **Model type:** Multimodal Model, Vision Language Model, MobileVLM
- **License:** Apache-2.0
- **Finetuned from model:** MobileVLM

### Model Sources

- **Repository:** [VQASynth](https://github.com/remyxai/VQASynth/tree/main)
- **Paper:** [SpatialVLM](https://arxiv.org/abs/2401.12168)

## Uses

Use this model to query spatial relationships between objects in a scene. Run it using the [MobileVLM inference](https://github.com/Meituan-AutoML/MobileVLM/tree/main?tab=readme-ov-file#example-for-mobilevlmmobilevlm-v2-model-inference) code:

```python
# assuming cwd is /path/to/MobileVLM/
from scripts.inference import inference_once

model_path = "/path/to/SpaceLLaVA-lite"
image_file = "/path/to/your-image.jpg"
prompt_str = "For each object in the scene, describe the distance between objects in meters"

args = type('Args', (), {
    "model_path": model_path,
    "image_file": image_file,
    "prompt": prompt_str,
    "conv_mode": "v1",
    "temperature": 0,
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 512,
    "load_8bit": False,
    "load_4bit": False,
})()

inference_once(args)
```

A sketch for running several spatial prompts over one image appears at the end of this card.

Try it on Discord: http://discord.gg/b2yGuCNpuC

## Citation

```
@article{chen2024spatialvlm,
  title   = {SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities},
  author  = {Chen, Boyuan and Xu, Zhuo and Kirmani, Sean and Ichter, Brian and Driess, Danny and Florence, Pete and Sadigh, Dorsa and Guibas, Leonidas and Xia, Fei},
  journal = {arXiv preprint arXiv:2401.12168},
  year    = {2024},
  url     = {https://arxiv.org/abs/2401.12168},
}

@article{chu2023mobilevlm,
  title   = {Mobilevlm: A fast, reproducible and strong vision language assistant for mobile devices},
  author  = {Chu, Xiangxiang and Qiao, Limeng and Lin, Xinyang and Xu, Shuang and Yang, Yang and Hu, Yiming and Wei, Fei and Zhang, Xinyu and Zhang, Bo and Wei, Xiaolin and others},
  journal = {arXiv preprint arXiv:2312.16886},
  year    = {2023}
}

@article{chu2024mobilevlm,
  title   = {MobileVLM V2: Faster and Stronger Baseline for Vision Language Model},
  author  = {Chu, Xiangxiang and Qiao, Limeng and Zhang, Xinyu and Xu, Shuang and Wei, Fei and Yang, Yang and Sun, Xiaofei and Hu, Yiming and Lin, Xinyang and Zhang, Bo and others},
  journal = {arXiv preprint arXiv:2402.03766},
  year    = {2024}
}
```
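
## Example: Multiple Spatial Prompts

A minimal sketch, reusing the same `inference_once` helper from the MobileVLM example above to ask a few spatial questions about one image. The image path and prompts are illustrative placeholders, and the upstream script handles printing the generated answer itself, so capture or redirect output as needed for your workflow.

```python
# assuming cwd is /path/to/MobileVLM/
from scripts.inference import inference_once

model_path = "/path/to/SpaceLLaVA-lite"
image_file = "/path/to/your-image.jpg"  # placeholder image path

# Illustrative spatial-reasoning questions (not drawn from the training set)
prompts = [
    "How far apart are the chair and the table, in meters?",
    "Which object is closest to the camera?",
    "Is the mug to the left or the right of the laptop?",
]

for prompt_str in prompts:
    # Same argument object as the single-prompt example above
    args = type('Args', (), {
        "model_path": model_path,
        "image_file": image_file,
        "prompt": prompt_str,
        "conv_mode": "v1",
        "temperature": 0,
        "top_p": None,
        "num_beams": 1,
        "max_new_tokens": 512,
        "load_8bit": False,
        "load_4bit": False,
    })()
    # Note: the upstream inference_once may reload the model on each call,
    # so this simple loop trades speed for clarity.
    inference_once(args)
```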