arxiv:2402.14905

MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases

Published on Feb 22
· Featured in Daily Papers on Feb 26

Abstract

This paper addresses the growing need for efficient large language models (LLMs) on mobile devices, driven by increasing cloud costs and latency concerns. We focus on designing top-quality LLMs with fewer than a billion parameters, a practical choice for mobile deployment. Contrary to the prevailing belief emphasizing the pivotal role of data and parameter quantity in determining model quality, our investigation underscores the significance of model architecture for sub-billion scale LLMs. Leveraging deep and thin architectures, coupled with embedding sharing and grouped-query attention mechanisms, we establish a strong baseline network denoted as MobileLLM, which attains a remarkable 2.7%/4.3% accuracy boost over preceding 125M/350M state-of-the-art models. Additionally, we propose an immediate block-wise weight sharing approach with no increase in model size and only marginal latency overhead. The resultant models, denoted as MobileLLM-LS, demonstrate a further accuracy enhancement of 0.7%/0.8% over MobileLLM 125M/350M. Moreover, the MobileLLM model family shows significant improvements compared to previous sub-billion models on chat benchmarks, and demonstrates correctness close to that of LLaMA-v2 7B in API calling tasks, highlighting the capability of small models for common on-device use cases.

Community

It would be interesting to see a comparison to small encoder-decoder models like instructionRoBERTa or flan-T5.

As someone who is GPU-poor, I find this paper interesting, and I am excited to try these models out.
My questions are:

  • Have you considered distilling knowledge from the Phi-2 2.7B model into a smaller 350M model?

  • How do the design changes affect the in-context learning ability of these models?

  • Do existing toolchains such as PEFT and LoRA, and optimization techniques like AWQ, EXL2, and GPTQ, work on these models?

Code and weight release?


If it can be downloaded, I would like to test it on my device.


Looking forward to trying it! Layer sharing saves only memory, not computation, so here is a thought on combining it with LoRA: fine-tune the shared layers with a separate low-rank update for each layer. Each layer then has different effective weights, while the parameter count increases only slightly.
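The idea in this comment can be sketched in a few lines. The following is a minimal pure-Python illustration, not the paper's method: it reduces a shared transformer block to a single shared weight matrix `W` and gives each layer its own low-rank pair `A[i]`, `B[i]`, so that layer `i` effectively applies `W + B[i] A[i]`. The class name, dimensions, and initialization are assumptions made for this example.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class SharedLayerWithLoRA:
    """Hypothetical sketch: one weight matrix shared by all layers,
    plus a separate low-rank update (B[i] @ A[i]) for each layer."""

    def __init__(self, dim, rank, num_layers, seed=0):
        rng = random.Random(seed)
        self.dim, self.rank, self.num_layers = dim, rank, num_layers
        # Shared weights: stored once, reused by every layer.
        self.W = [[rng.gauss(0, 0.02) for _ in range(dim)] for _ in range(dim)]
        # Per-layer factors. A is small random and B is zero, so every
        # layer starts out identical to the shared layer (common LoRA init).
        self.A = [[[rng.gauss(0, 0.02) for _ in range(dim)] for _ in range(rank)]
                  for _ in range(num_layers)]
        self.B = [[[0.0] * rank for _ in range(dim)] for _ in range(num_layers)]

    def forward_layer(self, i, x):
        """Apply layer i: shared projection plus that layer's low-rank term."""
        base = matvec(self.W, x)
        low_rank = matvec(self.B[i], matvec(self.A[i], x))
        return [b + l for b, l in zip(base, low_rank)]

    def num_params(self):
        """Shared matrix plus two rank-r factors per layer."""
        return self.dim * self.dim + self.num_layers * 2 * self.dim * self.rank


model = SharedLayerWithLoRA(dim=16, rank=1, num_layers=3)
unshared = model.num_layers * model.dim * model.dim
print("shared + LoRA params:", model.num_params())
print("fully separate layers:", unshared)
```

In this toy setting the per-layer cost is `2 * dim * rank` parameters instead of `dim * dim`, which is where the saving would come from when `rank` is much smaller than `dim`. As the comment notes, compute is not reduced: every layer still runs a full forward pass.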

Interesting. If the findings hold true for all small LLMs, then it is very possible to cut down encoder-decoder model size by applying layer sharing to the decoder part of the model. Model size has always been an issue for encoder-decoder models.

Could someone share a model config that reproduces the parameter counts, using the number of layers, heads, key-value heads, and embedding dimension given in the paper?

I used a Llama config and additionally set tie_word_embeddings=True, but I don't get the same number of parameters. I am probably missing something.
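One way to debug a mismatch like this is to count the parameters by hand. The sketch below assumes a Llama-style decoder (bias-free linear layers, a gated MLP with three projections, two RMSNorm weights per layer plus a final norm). Note that the count also depends on the feed-forward hidden size and the vocabulary size, neither of which is in the list above. The function name and all numeric values in the example call are illustrative assumptions, not taken from this page; substitute the values from the paper's table.

```python
def llama_style_param_count(num_layers, dim, num_heads, num_kv_heads,
                            ffn_hidden, vocab_size, tie_embeddings=True):
    """Count parameters of a Llama-style decoder (hypothetical helper).

    Assumes no biases, a gated MLP (gate, up, down projections), and
    RMSNorm layers that carry a single weight vector of size `dim`.
    """
    head_dim = dim // num_heads
    kv_dim = head_dim * num_kv_heads  # grouped-query attention shrinks K and V

    attention = (dim * dim        # query projection
                 + 2 * dim * kv_dim  # key and value projections
                 + dim * dim)     # output projection
    mlp = 3 * dim * ffn_hidden    # gate, up, and down projections
    norms = 2 * dim               # pre-attention and pre-MLP RMSNorm
    per_layer = attention + mlp + norms

    embeddings = vocab_size * dim
    # With tied embeddings the output head reuses the input embedding matrix.
    lm_head = 0 if tie_embeddings else vocab_size * dim
    final_norm = dim

    return embeddings + num_layers * per_layer + final_norm + lm_head


# Illustrative values only (assumptions for this example).
config = dict(num_layers=30, dim=576, num_heads=9, num_kv_heads=3,
              ffn_hidden=1536, vocab_size=32000)
tied = llama_style_param_count(**config, tie_embeddings=True)
untied = llama_style_param_count(**config, tie_embeddings=False)
print(f"tied:   {tied:,}")
print(f"untied: {untied:,}")
```

Comparing the tied and untied totals, and varying `ffn_hidden` and `vocab_size`, should show which of the unlisted settings accounts for the difference.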

Secondly, the authors didn't mention the pretraining dataset they used. IMHO, controlling for that would be a better setup to measure the effect of model parameters.
