arxiv:2402.14905

MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases

Published on Feb 22 · Submitted by akhaliq on Feb 26

#1 Paper of the day

Authors: Zechun Liu, Changsheng Zhao, Forrest Iandola, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Liangzhen Lai, Vikas Chandra

Abstract

This paper addresses the growing need for efficient large language models (LLMs) on mobile devices, driven by increasing cloud costs and latency concerns. We focus on designing top-quality LLMs with fewer than a billion parameters, a practical choice for mobile deployment. Contrary to the prevailing belief emphasizing the pivotal role of data and parameter quantity in determining model quality, our investigation underscores the significance of model architecture for sub-billion scale LLMs. Leveraging deep and thin architectures, coupled with embedding sharing and grouped-query attention mechanisms, we establish a strong baseline network denoted as MobileLLM, which attains a remarkable 2.7%/4.3% accuracy boost over preceding 125M/350M state-of-the-art models. Additionally, we propose an immediate block-wise weight sharing approach with no increase in model size and only marginal latency overhead. The resultant models, denoted as MobileLLM-LS, demonstrate a further accuracy enhancement of 0.7%/0.8% over MobileLLM 125M/350M. Moreover, the MobileLLM model family shows significant improvements compared to previous sub-billion models on chat benchmarks, and demonstrates close correctness to LLaMA-v2 7B in API calling tasks, highlighting the capability of small models for common on-device use cases.
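As a rough illustration of the block-wise weight sharing idea mentioned in the abstract, the sketch below runs each block several times in a row before moving to the next one, so the effective depth grows while the number of stored blocks stays the same. It is a toy under stated assumptions, not the authors' implementation: `make_block` is a hypothetical stand-in for a transformer block (an elementwise affine map with a residual connection), and `forward_immediate_sharing` and `repeats` are names invented here.

```python
from typing import Callable, List, Sequence

Block = Callable[[List[float]], List[float]]


def make_block(scale: float, shift: float) -> Block:
    """Hypothetical stand-in for a transformer block: residual + affine map."""
    def block(x: List[float]) -> List[float]:
        return [v + (scale * v + shift) for v in x]
    return block


def forward_immediate_sharing(blocks: Sequence[Block],
                              x: List[float],
                              repeats: int = 2) -> List[float]:
    """Apply every stored block `repeats` times consecutively.

    The same weights are reused for adjacent layers, so the model stores
    len(blocks) blocks but computes len(blocks) * repeats layers.
    """
    for block in blocks:
        for _ in range(repeats):
            x = block(x)
    return x


def effective_depth(blocks: Sequence[Block], repeats: int = 2) -> int:
    """Number of layer applications performed in one forward pass."""
    return len(blocks) * repeats


if __name__ == "__main__":
    stored = [make_block(0.5, 1.0), make_block(0.1, 0.0)]
    print("stored blocks:", len(stored))
    print("effective depth:", effective_depth(stored))
    print("output:", forward_immediate_sharing(stored, [2.0, -1.0]))
```

Because each block is reused immediately, its weights only need to be loaded once for two consecutive layer computations, which is consistent with the abstract's claim of unchanged model size and a small latency overhead. Compute still scales with the effective depth.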

Community

Bachstelze

Feb 26

It would be interesting to see a comparison to small encoder-decoder models like instructionRoBERTa or flan-T5.

Jenish-23

Feb 26

As someone who is GPU-poor, I find this paper interesting and am excited to try the models out.
My questions are:

Have you considered distilling the Phi-2 (2.7B) model into a smaller 350M model?
How do the design changes affect the in-context learning ability of these models?
Do existing toolchains such as PEFT and LoRA, and optimization techniques such as AWQ, EXL2, and GPTQ, work on these models?

ogimgio

15 days ago

Why not distill from a larger model?

Kernel

Feb 26

Code and weight release?

jennasu

Feb 27

If it can be downloaded, I would like to test it on my device.

weidu

Feb 28

Looking forward to trying it! Layer sharing saves only memory, not computation, so here is a thought on combining it with LoRA: fine-tune the shared layers with a per-layer low-rank update. Each layer then has different effective weights, while the parameter count increases only slightly.
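A minimal sketch of this suggestion, assuming a single square weight matrix per layer and plain Python lists instead of a tensor library. The class `SharedLinearWithLoRA` and its method names are hypothetical and are not part of PEFT or any other library. Each layer's effective weight is the shared matrix plus its own low-rank product B_i A_i, with B_i initialised to zero so training would start from the purely shared model.

```python
import random
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product (rows of a times columns of b)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a: Matrix, b: Matrix) -> Matrix:
    """Elementwise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def rand_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class SharedLinearWithLoRA:
    """One shared dim x dim weight plus a rank-r update per layer (hypothetical)."""

    def __init__(self, dim: int, rank: int, num_layers: int, seed: int = 0):
        rng = random.Random(seed)
        self.dim, self.rank, self.num_layers = dim, rank, num_layers
        self.shared = rand_matrix(dim, dim, rng)  # stored once, reused by all layers
        self.A = [rand_matrix(rank, dim, rng) for _ in range(num_layers)]
        self.B = [[[0.0] * rank for _ in range(dim)] for _ in range(num_layers)]

    def effective_weight(self, layer: int) -> Matrix:
        """Shared weight plus this layer's low-rank delta B_i @ A_i."""
        return add(self.shared, matmul(self.B[layer], self.A[layer]))

    def num_parameters(self) -> int:
        return self.dim * self.dim + self.num_layers * 2 * self.dim * self.rank

    def num_parameters_unshared(self) -> int:
        """Cost of giving every layer its own full weight matrix instead."""
        return self.num_layers * self.dim * self.dim


if __name__ == "__main__":
    model = SharedLinearWithLoRA(dim=64, rank=4, num_layers=8)
    print("shared + low-rank parameters:", model.num_parameters())
    print("fully unshared parameters:  ", model.num_parameters_unshared())
```

The saving depends on the rank being much smaller than the layer width. With very small widths the low-rank factors can cost nearly as much as separate layers.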

jonathanjordan21

Feb 29

Interesting. If the findings hold true for all small LLMs, then it is very possible to cut down encoder-decoder model size by applying layer sharing to the decoder part of the model. Model size has always been an issue for encoder-decoder models.

maveriq

Feb 29

Could someone share a model config that reproduces the paper's parameter counts from the number of layers, heads, key-value heads, and embedding dimension given in the paper?

I used a Llama config with tie_word_embeddings=True, but I don't get the same number of parameters. Am I missing something?
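One way to narrow down such a mismatch is to count the parameters by hand and compare each term with what the instantiated model reports. The sketch below assumes a Llama-style decoder: no bias terms, grouped-query attention, a three-matrix gated MLP, two RMSNorm weight vectors per block, and one final norm. The function name `llama_param_count` is hypothetical, and the example values (30 layers, width 576, 9 heads, 3 key-value heads, FFN width 1536, vocabulary 32000) are assumptions for illustration that do not come from this page, so substitute the numbers from the paper's tables.

```python
def llama_param_count(num_layers: int,
                      dim: int,
                      num_heads: int,
                      num_kv_heads: int,
                      ffn_dim: int,
                      vocab_size: int,
                      tie_embeddings: bool = True) -> int:
    """Parameter count of a bias-free Llama-style decoder (assumed layout)."""
    head_dim = dim // num_heads
    kv_dim = head_dim * num_kv_heads            # width of the K and V projections
    attention = 2 * dim * dim + 2 * dim * kv_dim  # q, o + k, v projections
    mlp = 3 * dim * ffn_dim                     # gate, up, and down projections
    norms = 2 * dim                             # two RMSNorm weights per block
    per_layer = attention + mlp + norms
    embeddings = vocab_size * dim               # input token embedding table
    lm_head = 0 if tie_embeddings else vocab_size * dim
    final_norm = dim
    return embeddings + num_layers * per_layer + final_norm + lm_head


if __name__ == "__main__":
    # Assumed example values; replace with the configuration from the paper.
    cfg = dict(num_layers=30, dim=576, num_heads=9, num_kv_heads=3,
               ffn_dim=1536, vocab_size=32000)
    print("tied embeddings:  ", llama_param_count(**cfg, tie_embeddings=True))
    print("untied embeddings:", llama_param_count(**cfg, tie_embeddings=False))
```

The usual sources of a mismatch are the FFN width (intermediate_size), the vocabulary size, and the number of key-value heads, none of which follow from the layer count, head count, and embedding dimension alone.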

Secondly, the authors didn't mention the pretraining dataset they used. IMHO, controlling for that would be a better setup to measure the effect of model parameters.