remyxai/SpaceLLaVA


	---
	license: apache-2.0
	---


	![image/png](https://cdn-uploads.huggingface.co/production/uploads/647777304ae93470ffc28913/iVKgqK6vTzCpCLVnWxmjA.png)

	# Model Card for SpaceLLaVA

	SpaceLLaVA uses LoRA to fine-tune [LLaVA](https://github.com/haotian-liu/LLaVA/tree/main) on a dataset designed with [VQASynth](https://github.com/remyxai/VQASynth/tree/main) to enhance spatial reasoning, as in [SpatialVLM](https://spatial-vlm.github.io/).
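
For readers new to LoRA, here is a minimal sketch of the idea in plain Python. It is not the training code used for SpaceLLaVA. The base weight matrix `W` stays frozen, and a low-rank product `B A`, scaled by `alpha / r`, is added to its output. The matrix sizes, the values, and the helper names `matvec` and `lora_forward` are all illustrative assumptions.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(W, A, B, x, alpha):
    """Return W x + (alpha / r) * B (A x), where r is the adapter rank."""
    r = len(A)  # A has shape (r, d_in); B has shape (d_out, r)
    base = matvec(W, x)               # frozen pretrained path
    update = matvec(B, matvec(A, x))  # trainable low-rank path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Toy sizes: d_in = 3, d_out = 2, rank r = 1 (illustrative values only).
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]
A = [[0.5, 0.5, 0.0]]
B_init = [[0.0], [0.0]]      # B starts at zero, so the adapter is initially a no-op
B_trained = [[2.0], [-1.0]]  # stand-in for values learned during fine-tuning
x = [1.0, 1.0, 1.0]

y_base = lora_forward(W, A, B_init, x, alpha=2.0)
y_tuned = lora_forward(W, A, B_trained, x, alpha=2.0)
```

Only `A` and `B` are updated during fine-tuning, so the number of trainable parameters grows with the rank `r` rather than with the full size of `W`.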

	## Model Details

	### Model Description

	This model uses data synthesis techniques and publicly available models to reproduce the approach described in SpatialVLM for enhancing the spatial reasoning of multimodal models.
	With a pipeline of expert models, we can infer spatial relationships between objects in a scene to create a VQA dataset for spatial reasoning.
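
As a rough illustration of the last step of such a pipeline, the sketch below turns estimated 3D object positions into question-answer pairs. It is not VQASynth's actual API. The `SceneObject` record, the function names, the question templates, and the camera-frame convention (+x points right, units in meters) are all hypothetical. In a real pipeline the captions and centroids would come from upstream expert models such as captioning, segmentation, and depth estimation.

```python
import math
from dataclasses import dataclass


@dataclass
class SceneObject:
    """Hypothetical record for one detected object."""
    caption: str
    centroid_m: tuple  # (x, y, z) in meters, camera frame (assumed: +x is right)


def make_distance_qa(a, b):
    """Build a distance question and answer from two object centroids."""
    d = math.dist(a.centroid_m, b.centroid_m)  # Euclidean distance
    return {
        "question": f"How far is the {a.caption} from the {b.caption}?",
        "answer": f"The {a.caption} is about {d:.2f} meters from the {b.caption}.",
    }


def make_left_right_qa(a, b):
    """Build a left/right question by comparing x coordinates."""
    is_left = a.centroid_m[0] < b.centroid_m[0]
    side = "left" if is_left else "right"
    prefix = "Yes" if is_left else "No"
    return {
        "question": f"Is the {a.caption} to the left of the {b.caption}?",
        "answer": f"{prefix}, the {a.caption} is to the {side} of the {b.caption}.",
    }


# Illustrative positions, not real data.
chair = SceneObject("chair", (0.0, 0.0, 2.0))
table = SceneObject("table", (3.0, 0.0, 6.0))

distance_qa = make_distance_qa(chair, table)
side_qa = make_left_right_qa(chair, table)
```

Pairs like these, generated across many images, make up the kind of spatial VQA dataset the description refers to.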


	- Developed by: remyx.ai
	- Model type: MultiModal Model, Vision Language Model, LLaVA
	- License: Apache-2.0
	- Finetuned from model: LLaVA

	### Model Sources

	- Repository: [VQASynth](https://github.com/remyxai/VQASynth/tree/main)
	- Paper: [SpatialVLM](https://arxiv.org/abs/2401.12168)

	## Uses

	Use this model to query spatial relationships between objects in a scene.

	[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1WPE7Br5A5ERSij8BL1M22EoEMLVkD8EP?usp=sharing)


	Try it on Discord: http://discord.gg/b2yGuCNpuC


	![image/png](https://cdn-uploads.huggingface.co/production/uploads/647777304ae93470ffc28913/Rsu5VpDgdZh9jemw97w8T.png)

	## Citation
	```
	@article{chen2024spatialvlm,
	title = {SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities},
	author = {Chen, Boyuan and Xu, Zhuo and Kirmani, Sean and Ichter, Brian and Driess, Danny and Florence, Pete and Sadigh, Dorsa and Guibas, Leonidas and Xia, Fei},
	journal = {arXiv preprint arXiv:2401.12168},
	year = {2024},
	url = {https://arxiv.org/abs/2401.12168},
	}

	@misc{liu2023llava,
	title={Visual Instruction Tuning},
	author={Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
	publisher={NeurIPS},
	year={2023},
	}
	```