---
license: apache-2.0
---
![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/647777304ae93470ffc28913/XQv9iMSZeLYVkXdxassGb.jpeg)
# Model Card for SpaceLLaVA-lite
**SpaceLLaVA-lite** fine-tunes [MobileVLM](https://github.com/Meituan-AutoML/MobileVLM) on a dataset designed with [VQASynth](https://github.com/remyxai/VQASynth/tree/main) to enhance spatial reasoning, following the approach of [SpatialVLM](https://spatial-vlm.github.io/).
## Model Details
### Model Description
This model uses data synthesis techniques and publicly available models to reproduce the work described in SpatialVLM, enhancing the spatial reasoning of multimodal models.
With a pipeline of expert models, we infer spatial relationships between objects in a scene to create a VQA dataset for spatial reasoning.
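For illustration, here is a minimal sketch of the kind of question-answer record such a pipeline produces; the field names and values below are hypothetical and do not reflect the actual VQASynth output schema:
```python
# Hypothetical example of a synthesized spatial-reasoning VQA record.
# Field names are illustrative only, not the actual VQASynth format.
sample = {
    "image": "warehouse_scene.jpg",
    "question": "How far is the pallet from the loading dock door?",
    "answer": "The pallet is approximately 1.2 meters from the loading dock door.",
}
```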
- **Developed by:** remyx.ai
- **Model type:** Multimodal model, vision-language model (MobileVLM architecture)
- **License:** Apache-2.0
- **Finetuned from model:** MobileVLM
### Model Sources
- **Repository:** [VQASynth](https://github.com/remyxai/VQASynth/tree/main)
- **Paper:** [SpatialVLM](https://arxiv.org/abs/2401.12168)
## Uses
Use this model to query spatial relationships between objects in a scene.
Run it using [MobileVLM inference](https://github.com/Meituan-AutoML/MobileVLM/tree/main?tab=readme-ov-file#example-for-mobilevlmmobilevlm-v2-model-inference) code:
```python
# run from the root of the MobileVLM repository, i.e. /path/to/MobileVLM/
from scripts.inference import inference_once

model_path = "/path/to/SpaceLLaVA-lite"
image_file = "/path/to/your-image.jpg"
prompt_str = "For each object in the scene, describe the distance between objects in meters"

# build a lightweight args object matching the MobileVLM inference interface
args = type('Args', (), {
    "model_path": model_path,
    "image_file": image_file,
    "prompt": prompt_str,
    "conv_mode": "v1",        # conversation template
    "temperature": 0,         # 0 disables sampling (greedy decoding)
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 512,
    "load_8bit": False,       # set True to quantize and reduce memory usage
    "load_4bit": False,
})()

inference_once(args)
```
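Note that with `temperature` set to 0 the inference code falls back to greedy decoding, so responses are deterministic for a given image and prompt. If memory is constrained, setting `load_8bit` or `load_4bit` to `True` quantizes the model at load time, trading some output quality for a smaller footprint.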
Try it on Discord: http://discord.gg/b2yGuCNpuC
## Citation
```
@article{chen2024spatialvlm,
  title   = {SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities},
  author  = {Chen, Boyuan and Xu, Zhuo and Kirmani, Sean and Ichter, Brian and Driess, Danny and Florence, Pete and Sadigh, Dorsa and Guibas, Leonidas and Xia, Fei},
  journal = {arXiv preprint arXiv:2401.12168},
  year    = {2024},
  url     = {https://arxiv.org/abs/2401.12168}
}
@article{chu2023mobilevlm,
  title   = {MobileVLM: A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices},
  author  = {Chu, Xiangxiang and Qiao, Limeng and Lin, Xinyang and Xu, Shuang and Yang, Yang and Hu, Yiming and Wei, Fei and Zhang, Xinyu and Zhang, Bo and Wei, Xiaolin and others},
  journal = {arXiv preprint arXiv:2312.16886},
  year    = {2023}
}
@article{chu2024mobilevlm,
  title   = {MobileVLM V2: Faster and Stronger Baseline for Vision Language Model},
  author  = {Chu, Xiangxiang and Qiao, Limeng and Zhang, Xinyu and Xu, Shuang and Wei, Fei and Yang, Yang and Sun, Xiaofei and Hu, Yiming and Lin, Xinyang and Zhang, Bo and others},
  journal = {arXiv preprint arXiv:2402.03766},
  year    = {2024}
}
```