---
license: apache-2.0
tags:
- MobileVLM
---
|
## Model Summary
|
MobileVLM is a competent multimodal vision language model (MMVLM) designed to run on mobile devices. It combines a set of mobile-oriented architectural designs and techniques: language models at the 1.4B and 2.7B parameter scale trained from scratch, a multimodal vision model pre-trained in the CLIP fashion, and cross-modality interaction via an efficient projector. We evaluate MobileVLM on several typical VLM benchmarks, where our models perform on par with a few much larger models. More importantly, we measure inference speed on both a Qualcomm Snapdragon 888 CPU and an NVIDIA Jetson Orin GPU, obtaining state-of-the-art speeds of 21.5 tokens per second and 65.3 tokens per second, respectively.
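To illustrate the efficient-projector idea mentioned above, here is a minimal PyTorch sketch. It is not the exact MobileVLM projector; the `EfficientProjector` name, layer choices, and dimensions are illustrative assumptions. The sketch maps visual patch features into the language model's embedding space while a strided depthwise convolution reduces the number of visual tokens handed to the language model:

```python
import torch
import torch.nn as nn

class EfficientProjector(nn.Module):
    """Hypothetical sketch of a lightweight vision-to-LLM projector."""

    def __init__(self, vision_dim=1024, llm_dim=2560):
        super().__init__()
        # Pointwise layers align the vision channels with the LLM width.
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        # A stride-2 depthwise convolution downsamples the token grid,
        # cutting the number of visual tokens fed to the language model.
        self.dw_conv = nn.Conv2d(
            llm_dim, llm_dim, kernel_size=3, stride=2, padding=1, groups=llm_dim
        )

    def forward(self, x):  # x: (batch, num_patches, vision_dim)
        b, n, _ = x.shape
        h = w = int(n ** 0.5)  # assumes a square patch grid
        x = self.mlp(x)                              # (b, n, llm_dim)
        x = x.transpose(1, 2).reshape(b, -1, h, w)   # to (b, c, h, w)
        x = self.dw_conv(x)                          # downsample token grid
        return x.flatten(2).transpose(1, 2)          # (b, n', llm_dim)

# Example: a 24x24 grid of patch features (576 tokens) becomes 144 tokens.
feats = torch.randn(1, 576, 1024)
print(EfficientProjector()(feats).shape)  # torch.Size([1, 144, 2560])
```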
|
|
|
MobileVLM-3B is built on our [MobileLLaMA-2.7B-Chat](https://huggingface.co/mtgv/MobileLLaMA-2.7B-Chat) to facilitate off-the-shelf deployment.
|
|
|
## Model Sources

- Repository: https://github.com/Meituan-AutoML/MobileVLM
- Paper: https://arxiv.org/abs/2312.16886
|
|
|
## How to Get Started with the Model

Inference examples can be found in the [MobileVLM GitHub repository](https://github.com/Meituan-AutoML/MobileVLM).
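As a minimal sketch, the MobileLLaMA-2.7B-Chat language backbone can be loaded directly with Hugging Face Transformers; full multimodal (image + text) inference requires the code in the repository above:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mtgv/MobileLLaMA-2.7B-Chat"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    # Half precision on GPU; full precision keeps CPU inference stable.
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)

prompt = "What is a vision language model?"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```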
|
|
|
## Training Details

Please refer to our paper: [MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices](https://arxiv.org/pdf/2312.16886.pdf).
|
|