arxiv:2403.03853

ShortGPT: Layers in Large Language Models are More Redundant Than You Expect

Published on Mar 6
· Featured in Daily Papers on Mar 7
Authors:

Abstract

As Large Language Models (LLMs) continue to advance in performance, their size has escalated significantly, with current LLMs containing billions or even trillions of parameters. However, in this study, we discovered that many layers of LLMs exhibit high similarity, and some layers play a negligible role in network functionality. Based on this observation, we define a metric called Block Influence (BI) to gauge the significance of each layer in LLMs. We then propose a straightforward pruning approach: layer removal, in which we directly delete the redundant layers in LLMs based on their BI scores. Experiments demonstrate that our method, which we call ShortGPT, significantly outperforms previous state-of-the-art (SOTA) methods in model pruning. Moreover, ShortGPT is orthogonal to quantization-like methods, enabling further reduction in parameters and computation. The ability to achieve better results through simple layer removal, as opposed to more complex pruning techniques, suggests a high degree of redundancy in the model architecture.
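
For intuition, below is a minimal, hypothetical PyTorch sketch of the two steps the abstract describes, assuming a Llama-style model from transformers: score each layer by Block Influence, read here as one minus the average cosine similarity between the layer's input and output hidden states, then delete the lowest-BI layers. The model name, calibration text, and pruning budget are placeholders, and this is a sketch of the idea rather than the authors' implementation.

```python
# Minimal sketch (not the authors' released code): score layers by Block Influence
# and delete the lowest-scoring ones. Model name, calibration text, and the
# pruning budget n_remove are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder Llama-style checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

inputs = tok("Calibration text for estimating layer redundancy.", return_tensors="pt")
with torch.no_grad():
    # hidden_states[i] feeds decoder layer i; hidden_states[i + 1] is its output
    # (the final entry also passes through the model's last norm).
    hs = model(**inputs, output_hidden_states=True).hidden_states

# Block Influence of layer i: 1 - mean cosine similarity between its input and
# output hidden states; a low score means the layer barely changes the hidden state.
bi = [
    1.0 - torch.nn.functional.cosine_similarity(
        hs[i][0].float(), hs[i + 1][0].float(), dim=-1
    ).mean().item()
    for i in range(len(hs) - 1)
]

# Layer removal: drop the n_remove lowest-BI layers in one shot.
n_remove = 8  # assumed pruning budget
drop = set(sorted(range(len(bi)), key=bi.__getitem__)[:n_remove])
model.model.layers = torch.nn.ModuleList(
    [layer for i, layer in enumerate(model.model.layers) if i not in drop]
)
```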

Community

Many papers taking the transformer-block-pruning approach have been released recently.
SLEB focuses on the advantages of block pruning over width pruning or 2:4 structured pruning from the perspective of speedup.
https://huggingface.co/papers/2402.09025
Shortened LLaMA reports the performance of block-pruned LLMs after parameter-efficient retraining: https://huggingface.co/papers/2402.02834

·

We are all contemporaries striving towards the same goals in this field.

Curious why they didn't compare to greedy or beam search over layer-removal sequences (scored by downstream perplexity), or even block-influence-greedy (remove the layer with the lowest Block Influence, then rescore and recompute Block Influence).
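
For concreteness, here is a rough, hypothetical sketch of that greedy loop, reusing the cosine-similarity reading of Block Influence and again assuming a Llama-style model exposing `model.model.layers` (this is the commenter's proposal, not something evaluated in the paper):

```python
# Hypothetical block-influence-greedy loop: repeatedly drop the lowest-BI layer,
# then re-run the calibration pass and recompute BI on the shrunken model.
import torch

def compute_bi(model, inputs):
    """Per-layer Block Influence: 1 - mean cosine similarity between a layer's
    input and output hidden states (hidden_states[i] feeds layer i)."""
    with torch.no_grad():
        hs = model(**inputs, output_hidden_states=True).hidden_states
    return [
        1.0 - torch.nn.functional.cosine_similarity(
            hs[i][0].float(), hs[i + 1][0].float(), dim=-1
        ).mean().item()
        for i in range(len(hs) - 1)
    ]

def block_influence_greedy(model, inputs, n_remove):
    """Greedily remove n_remove layers, recomputing BI after each removal."""
    removed_ids = []
    for _ in range(n_remove):
        bi = compute_bi(model, inputs)                    # one score per current layer
        worst = min(range(len(bi)), key=lambda i: bi[i])  # least influential layer
        layers = list(model.model.layers)                 # Llama-style layer list
        layers.pop(worst)
        model.model.layers = torch.nn.ModuleList(layers)
        removed_ids.append(worst)                         # index within the current stack
    return model, removed_ids
```

The extra cost comes from re-running the calibration forward pass after every removal, which is presumably part of the computational concern raised in the reply below.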

·

Your suggestion is intriguing and definitely merits consideration. However, I have a couple of reservations:

Scoring removals by downstream-task perplexity specializes the pruned model to that task, which may not generalize well and could compromise overall performance.
Additionally, beam search and block-influence-greedy, while appealing, add significant implementation complexity and incur considerable computational cost.

Any open source implementation?

·

We have provided a detailed description of our layer removal setup in the appendix of our paper, where it can be thoroughly examined.

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face, check out this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

·

All I have to do is just mention @librarian-bot?

I love the block importance concept and I think it's valuable for how we understand transformers. I'm not sure if/when I would use this in production, though, given the trade-offs.

·

What are the trade-offs? I haven't seen them mentioned anywhere explicitly.

Could you share which dataset is used to calculate Block Influence, along with other details (sequence length, whether the average over all tokens or a sampled token is used, ...)?
