MobileVLM: A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices
Abstract
We present MobileVLM, a competent multimodal vision language model (MMVLM) targeted to run on mobile devices. It brings together a variety of mobile-oriented architectural designs and techniques: a set of language models at the 1.4B and 2.7B parameter scales trained from scratch, a multimodal vision model pre-trained in the CLIP fashion, and cross-modality interaction via an efficient projector. We evaluate MobileVLM on several typical VLM benchmarks, where our models perform on par with a few much larger models. More importantly, we measure inference speed on both a Qualcomm Snapdragon 888 CPU and an NVIDIA Jetson Orin GPU, obtaining state-of-the-art speeds of 21.5 tokens per second and 65.3 tokens per second, respectively. Our code will be made available at: https://github.com/Meituan-AutoML/MobileVLM.
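The abstract names three components (a from-scratch language model, a CLIP-style vision encoder, and an efficient projector) but does not spell out how they connect. The sketch below illustrates one plausible wiring in PyTorch; all class names, dimensions, and the downsampling projector design are illustrative assumptions, not the paper's exact implementation, so refer to the linked repository for the actual architecture.

```python
# A minimal, illustrative sketch of a MobileVLM-style pipeline: a CLIP-style vision
# encoder, an efficient projector, and a compact language model. Module sizes, the
# projector design, and class names are assumptions for illustration only.
import torch
import torch.nn as nn


class EfficientProjector(nn.Module):
    """Maps vision-encoder patch tokens into the LLM embedding space.

    Assumption: a pointwise projection followed by a depthwise convolution that
    downsamples the token grid 2x, reducing the number of visual tokens handed
    to the language model (a common trick for mobile-friendly VLMs).
    """

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.pointwise = nn.Linear(vision_dim, llm_dim)
        self.depthwise = nn.Conv2d(llm_dim, llm_dim, kernel_size=3, stride=2,
                                   padding=1, groups=llm_dim)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, C) with N = H*W patch tokens from the vision encoder.
        b, n, _ = patch_tokens.shape
        h = w = int(n ** 0.5)
        x = self.pointwise(patch_tokens)             # (B, N, llm_dim)
        x = x.transpose(1, 2).reshape(b, -1, h, w)   # (B, llm_dim, H, W)
        x = self.depthwise(x)                        # 2x spatial downsampling
        return x.flatten(2).transpose(1, 2)          # (B, N/4, llm_dim)


class MobileVLMSketch(nn.Module):
    """Glue model: projected visual tokens are prepended to the text embeddings."""

    def __init__(self, vision_encoder: nn.Module, projector: nn.Module,
                 language_model: nn.Module, text_embedding: nn.Embedding):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.projector = projector
        self.language_model = language_model
        self.text_embedding = text_embedding

    def forward(self, image: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
        visual_tokens = self.projector(self.vision_encoder(image))
        text_tokens = self.text_embedding(input_ids)
        fused = torch.cat([visual_tokens, text_tokens], dim=1)
        return self.language_model(fused)


if __name__ == "__main__":
    # Smoke test with dummy components; dimensions are placeholders.
    class DummyVisionEncoder(nn.Module):
        def forward(self, image):
            return torch.randn(image.shape[0], 576, 1024)  # 24x24 patch tokens

    model = MobileVLMSketch(
        vision_encoder=DummyVisionEncoder(),
        projector=EfficientProjector(vision_dim=1024, llm_dim=2048),
        language_model=nn.Identity(),        # stand-in for a 1.4B/2.7B LLM
        text_embedding=nn.Embedding(32000, 2048),
    )
    out = model(torch.randn(1, 3, 336, 336), torch.randint(0, 32000, (1, 16)))
    print(out.shape)  # (1, 144 + 16, 2048)
```

The intuition behind the assumed projector is token reduction: downsampling the visual token grid shrinks the sequence the language model must process, which is the dominant factor for on-device decoding latency.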
Community
The following papers were recommended by the Semantic Scholar API:
- InfMLLM: A Unified Framework for Visual-Language Tasks (2023)
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks (2023)
- Contrastive Vision-Language Alignment Makes Efficient Instruction Learner (2023)
- VILA: On Pre-training for Visual Language Models (2023)
- Generative Multimodal Models are In-Context Learners (2023)