The Unreasonable Ineffectiveness of the Deeper Layers

Published on Mar 26
· Featured in Daily Papers on Mar 27


We empirically study a simple layer-pruning strategy for popular families of open-weight pretrained LLMs, finding minimal degradation of performance on different question-answering benchmarks until after a large fraction (up to half) of the layers are removed. To prune these models, we identify the optimal block of layers to prune by considering similarity across layers; then, to "heal" the damage, we perform a small amount of finetuning. In particular, we use parameter-efficient finetuning (PEFT) methods, specifically quantization and Low Rank Adapters (QLoRA), such that each of our experiments can be performed on a single A100 GPU. From a practical perspective, these results suggest that layer pruning methods can complement other PEFT strategies to further reduce computational resources of finetuning on the one hand, and can improve the memory and latency of inference on the other hand. From a scientific perspective, the robustness of these LLMs to the deletion of layers implies either that current pretraining methods are not properly leveraging the parameters in the deeper layers of the network or that the shallow layers play a critical role in storing knowledge.
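The block-selection step described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the similarity measure is the angular distance between last-token representations (the measure the comments below refer to), averaged over a few samples, and that the block whose input and output representations are closest is the one to drop. The names `angular_distance`, `best_block_start`, and `hidden_states` are hypothetical, and the vectors are assumed to be nonzero.

```python
import math


def angular_distance(u, v):
    """Angular distance in [0, 1]: arccos of the cosine similarity, divided by pi."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to guard against tiny floating-point overshoots outside [-1, 1].
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos) / math.pi


def best_block_start(hidden_states, n):
    """Pick the start of the n-layer block whose removal changes representations least.

    hidden_states[l][s] is the (assumed) last-token vector entering layer l for
    sample s; the final entry holds the output of the last layer, so there are
    num_layers + 1 entries in total.
    """
    num_layers = len(hidden_states) - 1
    best_start, best_dist = None, float("inf")
    for start in range(num_layers - n + 1):
        # Compare the input of layer `start` with the input of layer `start + n`.
        dists = [
            angular_distance(a, b)
            for a, b in zip(hidden_states[start], hidden_states[start + n])
        ]
        mean_dist = sum(dists) / len(dists)
        if mean_dist < best_dist:
            best_start, best_dist = start, mean_dist
    return best_start, best_dist


# Toy example: 4 layers, 1 sample, 2-dimensional hidden vectors.
toy_states = [[[1.0, 0.0]], [[0.0, 1.0]], [[0.1, 1.0]], [[0.0, 1.0]], [[-1.0, 0.0]]]
start, dist = best_block_start(toy_states, n=2)
```

In this toy setup, layers `start` through `start + n - 1` would be removed, after which the healing finetune described in the abstract would be applied to the remaining layers.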


  • See DistilBERT for more on this 😂



Thanks for sharing your work! I was able to demonstrate the model healing process here while using ShortGPT's block influence metric for layer removal/pruning.
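For readers who have not met the block influence (BI) metric mentioned in this comment, here is a rough sketch. It rests on an assumption about the ShortGPT definition, namely that a layer's BI is one minus the average cosine similarity between the hidden states entering and leaving that layer, so layers that barely change their input score close to zero. The function names and the `hidden_states` layout are hypothetical and not taken from any released code.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def block_influence(inputs, outputs):
    """Assumed BI definition: 1 - mean cosine similarity across positions.

    inputs[t] and outputs[t] are the hidden vectors entering and leaving
    one layer at position t.
    """
    sims = [cosine_similarity(x, y) for x, y in zip(inputs, outputs)]
    return 1.0 - sum(sims) / len(sims)


def rank_layers_by_influence(hidden_states):
    """Return (layer indices sorted least influential first, per-layer scores).

    hidden_states[l] is the list of vectors entering layer l; the final
    entry is the output of the last layer.
    """
    scores = [
        block_influence(hidden_states[l], hidden_states[l + 1])
        for l in range(len(hidden_states) - 1)
    ]
    order = sorted(range(len(scores)), key=scores.__getitem__)
    return order, scores


# Toy example: 3 layers, 1 position, 2-dimensional hidden vectors.
toy_states = [[[1.0, 0.0]], [[0.0, 1.0]], [[0.0, 2.0]], [[-1.0, 0.0]]]
order, scores = rank_layers_by_influence(toy_states)
```

Unlike the block-level similarity search in the abstract, this scores each layer individually, so the lowest-scoring layers need not be contiguous.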

@gromovand @kushaltirumala @hassansh @pglo @danintheory super cool! any plans to release the code?


I attempted to implement their angular distance and healing here. Let me know if you catch anything wrong, hope it can help!

We tried to replicate the results, and they seem to hold: deeper layers can be removed, and the resulting model can still generate text.


Cool! Is this the same as @shivr's implementation?

Working on reproducing this and similar pruning criteria here:
A linear approximation of the last token is implemented, along with angular distances, BI score, etc.

The goal of the library: choose your distance (layer importance metric), get cropped model. :rocket:
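The "choose your distance, get cropped model" idea can be sketched as a function that takes any block-scoring callable. This is a hypothetical illustration and not the API of the library mentioned in this comment: `crop_model` and `block_score` are made-up names, the layers are represented as a plain Python sequence, and a lower score is assumed to mean a more redundant block.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Layer = TypeVar("Layer")


def crop_model(
    layers: Sequence[Layer],
    n: int,
    block_score: Callable[[int, int], float],
) -> Tuple[List[Layer], int]:
    """Drop the contiguous n-layer block with the lowest score.

    block_score(start, n) is any user-supplied importance metric for the
    block layers[start:start + n]; lower is assumed to mean more redundant.
    Returns the remaining layers and the chosen start index.
    """
    if not 0 < n < len(layers):
        raise ValueError("n must be between 1 and len(layers) - 1")
    candidates = range(len(layers) - n + 1)
    start = min(candidates, key=lambda s: block_score(s, n))
    kept = list(layers[:start]) + list(layers[start + n:])
    return kept, start


# Toy example: six named layers and a lookup table standing in for a metric.
toy_layers = ["L0", "L1", "L2", "L3", "L4", "L5"]
toy_scores = {0: 0.9, 1: 0.4, 2: 0.1, 3: 0.3}
kept, start = crop_model(toy_layers, 3, lambda s, n: toy_scores[s])
```

Passing the metric as a callable keeps the cropping logic independent of whether the score comes from angular distance, BI, or something else.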

