arxiv:2403.03853

ShortGPT: Layers in Large Language Models are More Redundant Than You Expect

Published on Mar 6
· Featured in Daily Papers on Mar 7
Authors:

Abstract

As Large Language Models (LLMs) continue to advance in performance, their size has escalated significantly, with current LLMs containing billions or even trillions of parameters. However, in this study, we discovered that many layers of LLMs exhibit high similarity, and some layers play a negligible role in network functionality. Based on this observation, we define a metric called Block Influence (BI) to gauge the significance of each layer in LLMs. We then propose a straightforward pruning approach: layer removal, in which we directly delete the redundant layers in LLMs based on their BI scores. Experiments demonstrate that our method, which we call ShortGPT, significantly outperforms previous state-of-the-art (SOTA) methods in model pruning. Moreover, ShortGPT is orthogonal to quantization-like methods, enabling further reduction in parameters and computation. The ability to achieve better results through simple layer removal, as opposed to more complex pruning techniques, suggests a high degree of redundancy in the model architecture.
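
For intuition, below is a minimal, hypothetical PyTorch sketch of the two steps the abstract describes, assuming a Llama-style model from transformers: score each layer by Block Influence, read here as one minus the average cosine similarity between the layer's input and output hidden states, then delete the lowest-BI layers. The model name, calibration text, and pruning budget are placeholders, and this is a sketch of the idea rather than the authors' implementation.

```python
# Minimal sketch (not the authors' released code): score layers by Block Influence
# and delete the lowest-scoring ones. Model name, calibration text, and the
# pruning budget n_remove are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder Llama-style checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

inputs = tok("Calibration text for estimating layer redundancy.", return_tensors="pt")
with torch.no_grad():
    # hidden_states[i] feeds decoder layer i; hidden_states[i + 1] is its output
    # (the final entry also passes through the model's last norm).
    hs = model(**inputs, output_hidden_states=True).hidden_states

# Block Influence of layer i: 1 - mean cosine similarity between its input and
# output hidden states; a low score means the layer barely changes the hidden state.
bi = [
    1.0 - torch.nn.functional.cosine_similarity(
        hs[i][0].float(), hs[i + 1][0].float(), dim=-1
    ).mean().item()
    for i in range(len(hs) - 1)
]

# Layer removal: drop the n_remove lowest-BI layers in one shot.
n_remove = 8  # assumed pruning budget
drop = set(sorted(range(len(bi)), key=bi.__getitem__)[:n_remove])
model.model.layers = torch.nn.ModuleList(
    [layer for i, layer in enumerate(model.model.layers) if i not in drop]
)
```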

Community

Many papers taking the transformer-block-pruning approach have been released recently.
SLEB focuses on the advantages of block pruning over width pruning or 2:4 structured pruning from the perspective of speedup.
https://huggingface.co/papers/2402.09025
Shortened LLaMA reports the performance of block-pruned LLMs after parameter-efficient retraining: https://huggingface.co/papers/2402.02834

·

We are all contemporaries striving towards the same goals in this field.

Curious why they didn't compare to greedy or beam search over layer-removal sequences (scored by downstream perplexity), or even block-influence-greedy (remove the layer with the lowest Block Influence, then rescore and recompute Block Influence).
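
For concreteness, here is a rough, hypothetical sketch of that greedy loop, reusing the cosine-similarity reading of Block Influence and again assuming a Llama-style model exposing `model.model.layers` (this is the commenter's proposal, not something evaluated in the paper):

```python
# Hypothetical block-influence-greedy loop: repeatedly drop the lowest-BI layer,
# then re-run the calibration pass and recompute BI on the shrunken model.
import torch

def compute_bi(model, inputs):
    """Per-layer Block Influence: 1 - mean cosine similarity between a layer's
    input and output hidden states (hidden_states[i] feeds layer i)."""
    with torch.no_grad():
        hs = model(**inputs, output_hidden_states=True).hidden_states
    return [
        1.0 - torch.nn.functional.cosine_similarity(
            hs[i][0].float(), hs[i + 1][0].float(), dim=-1
        ).mean().item()
        for i in range(len(hs) - 1)
    ]

def block_influence_greedy(model, inputs, n_remove):
    """Greedily remove n_remove layers, recomputing BI after each removal."""
    removed_ids = []
    for _ in range(n_remove):
        bi = compute_bi(model, inputs)                    # one score per current layer
        worst = min(range(len(bi)), key=lambda i: bi[i])  # least influential layer
        layers = list(model.model.layers)                 # Llama-style layer list
        layers.pop(worst)
        model.model.layers = torch.nn.ModuleList(layers)
        removed_ids.append(worst)                         # index within the current stack
    return model, removed_ids
```

The extra cost comes from re-running the calibration forward pass after every removal, which is presumably part of the computational concern raised in the reply below.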

·

Your suggestion is intriguing and definitely merits consideration. However, I have a couple of reservations:

Scoring removals by downstream-task perplexity specializes the pruned model to that task, which may not generalize well and could compromise overall performance.
Additionally, beam search and block-influence-greedy, while appealing, add significant implementation complexity and incur considerable computational cost.

Any open source implementation?

·

We have provided a detailed description of our layer removal setup in the appendix of our paper, where it can be thoroughly examined.

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face, check out this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

·

All I have to do is just mention @librarian-bot?

I love the block importance concept and I think it's valuable for how we understand transformers. I'm not sure if/when I would use this in production, though, given the trade-offs.

·

What are the trade-offs? I haven't seen them mentioned anywhere explicitly.

Could you share which dataset is used to calculate Block Influence, along with other details (sequence length, whether the average over all tokens or a sampled token is used, ...)?
