
Shortened LLM Model Card

Shortened LLM is a depth-pruned version of large language models for efficient text generation.

Compression Method

  • After identifying unimportant Transformer blocks, we perform one-shot pruning.
  • To recover quality, we retrain the pruned models with continued pretraining (CPT), which updates all parameters on a large-scale pretraining corpus.
  • Once CPT is completed, the model in this card is further finetuned with low-rank adaptation (LoRA) on an instruction tuning dataset.
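
The first step above (rank blocks, then prune once) can be sketched in a few lines. This is only an illustration, not the authors' implementation: the card does not spell out how the PPL criterion scores a block, so the sketch assumes a block's score is the perplexity measured with that block removed, with a lower value meaning a less important block. The scores and the helper names `select_blocks_to_prune` and `prune_one_shot` are hypothetical.

```python
def select_blocks_to_prune(ppl_without_block, num_to_prune):
    """Pick the blocks whose removal leaves perplexity lowest.

    Assumption: ppl_without_block[i] is the perplexity of the model
    evaluated with Transformer block i removed.
    """
    ranked = sorted(ppl_without_block, key=ppl_without_block.get)
    return sorted(ranked[:num_to_prune])


def prune_one_shot(blocks, pruned_indices):
    """Remove all selected blocks in a single pass (no iterative re-scoring)."""
    drop = set(pruned_indices)
    return [block for i, block in enumerate(blocks) if i not in drop]


# Hypothetical scores for a toy 8-block model.
ppl_without_block = {
    0: 950.0, 1: 41.2, 2: 18.9, 3: 17.4,
    4: 17.1, 5: 19.6, 6: 23.0, 7: 88.5,
}
blocks = [f"block_{i}" for i in range(8)]

pruned = select_blocks_to_prune(ppl_without_block, num_to_prune=2)
kept = prune_one_shot(blocks, pruned)
print("pruned:", pruned)
print("kept:", kept)
```

In a real model the list of strings would be the list of decoder layers, and the retraining steps (CPT, then LoRA) would follow the pruning.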

Models from Aggressive Pruning & CPT Retraining (arXiv-v2):

| Source Model | Pruning Ratio | Pruning Criterion | Retraining Method | HF Models Link |
|---|---|---|---|---|
| Vicuna-v1.3-7B | 20% | PPL | CPT | nota-ai/cpt_st-vicuna-v1.3-5.5b-ppl |
| Vicuna-v1.3-7B | 45% | PPL | CPT | nota-ai/cpt_st-vicuna-v1.3-3.7b-ppl |
| Vicuna-v1.3-7B | 60% | PPL | CPT | nota-ai/cpt_st-vicuna-v1.3-2.7b-ppl |
| Vicuna-v1.3-7B | 80% | PPL | CPT | nota-ai/cpt_st-vicuna-v1.3-1.5b-ppl |
| Vicuna-v1.3-7B | 20% | PPL | CPT⇒LoRA | nota-ai/cpt-lora_st-vicuna-v1.3-5.5b-ppl |
| Vicuna-v1.3-7B | 45% | PPL | CPT⇒LoRA | nota-ai/cpt-lora_st-vicuna-v1.3-3.7b-ppl |
| Vicuna-v1.3-7B | 60% | PPL | CPT⇒LoRA | nota-ai/cpt-lora_st-vicuna-v1.3-2.7b-ppl |
| Vicuna-v1.3-7B | 80% | PPL | CPT⇒LoRA | nota-ai/cpt-lora_st-vicuna-v1.3-1.5b-ppl |

Results:
  • Evaluated with EleutherAI/lm-evaluation-harness version 3326c54.
[Figure: results]
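
A minimal sketch of how one of the checkpoints listed above might be loaded with the Hugging Face `transformers` library. It assumes the repositories load through the standard `AutoTokenizer` and `AutoModelForCausalLM` classes, which this card does not confirm; the helper names `repo_id` and `generate_text` are hypothetical.

```python
def repo_id(size, retraining="cpt-lora"):
    """Build a repository name following the pattern used in the table.

    size: one of "5.5b", "3.7b", "2.7b", "1.5b"
    retraining: "cpt" or "cpt-lora"
    """
    if size not in {"5.5b", "3.7b", "2.7b", "1.5b"}:
        raise ValueError(f"unknown model size: {size}")
    if retraining not in {"cpt", "cpt-lora"}:
        raise ValueError(f"unknown retraining method: {retraining}")
    return f"nota-ai/{retraining}_st-vicuna-v1.3-{size}-ppl"


def generate_text(repo, prompt, max_new_tokens=32):
    """Download the checkpoint and generate a continuation.

    Assumption: the checkpoint works with the generic Auto classes.
    The import is local so the module can be used without transformers
    installed until generation is actually requested.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo)
    model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype="auto")
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


print(repo_id("5.5b"))
```

Calling `generate_text(repo_id("5.5b"), "What is depth pruning?")` would download the weights on first use, so it is left out of the block itself.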

Experimental Setup for CPT of Pruned Vicuna-7B

  • Dataset: SlimPajama-627B
  • Training using 8 NVIDIA H100 GPUs.
    • 5.5B parameters: 37B training tokens (for 6 days)
    • 3.7B parameters: 74B tokens (for 8 days)
    • 2.7B parameters: 150B tokens (for 12 days)
    • 1.5B parameters: 271B tokens (for 11 days)
  • AdamW optimizer with (β1, β2)=(0.9, 0.95); a learning rate of 0.0001; a weight decay of 0.1.
  • Global batch size: 512 (micro-batch size of 2 × 32 gradient accumulation steps × 8 GPUs).
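
The batch-size arithmetic and the token budgets above can be checked with a short script. The throughput figures it prints are rough estimates only, since the durations in this card are given in whole days; the dictionary layout and function name are illustrative choices, not part of the training code.

```python
MICRO_BATCH = 2       # sequences per GPU per forward/backward pass
GRAD_ACCUM_STEPS = 32
NUM_GPUS = 8

# Sequences contributing to each optimizer step.
global_batch = MICRO_BATCH * GRAD_ACCUM_STEPS * NUM_GPUS

# Optimizer settings as listed in this card.
optimizer_config = {
    "name": "AdamW",
    "betas": (0.9, 0.95),
    "lr": 1e-4,
    "weight_decay": 0.1,
}

# (training tokens, training days) per pruned model size, from the list above.
cpt_runs = {
    "5.5B": (37e9, 6),
    "3.7B": (74e9, 8),
    "2.7B": (150e9, 12),
    "1.5B": (271e9, 11),
}


def tokens_per_second_per_gpu(tokens, days, num_gpus=NUM_GPUS):
    """Average throughput implied by a token budget and a wall-clock duration."""
    seconds = days * 24 * 60 * 60
    return tokens / seconds / num_gpus


print("global batch size:", global_batch)
for size, (tokens, days) in cpt_runs.items():
    rate = tokens_per_second_per_gpu(tokens, days)
    print(f"{size}: ~{rate:,.0f} tokens/s per GPU")
```

Smaller models process more tokens in a comparable time, which is consistent with the larger token budgets listed for the more aggressively pruned models.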
Learning curve:

[Figure: results]

Zero-shot performance over the course of training for models from Vicuna-7B-v1.3 at different pruning ratios. For each model size, the CPT duration was limited to a two-week period, but additional training could further improve the quality.

Experimental Setup for LoRA Instruction Tuning

  • Dataset: Refined Alpaca
  • Training using 1 NVIDIA A100 GPU.
    • The retraining costs are low, with the entire process being executed on a single GPU.
    • For example, LoRA retraining of a 20%-pruned model from 7B parameters requires about 2 hours and 22GB VRAM.
  • A LoRA rank of 8; AdamW optimizer with a learning rate of 0.0001.
  • A batch size of 64 over 2 epochs.
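
To give a sense of why rank-8 LoRA retraining is cheap, the sketch below counts the extra trainable parameters LoRA adds to one weight matrix. It relies on the standard LoRA formulation, in which a frozen weight of shape d_out × d_in is augmented by two low-rank factors of shapes d_out × r and r × d_in. The hidden size of 4096 is an assumption based on the LLaMA-7B architecture, and the card does not state which matrices are adapted, so only a single square projection is shown.

```python
def lora_extra_params(d_in, d_out, rank):
    """Trainable parameters added by LoRA factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_in + d_out)


HIDDEN = 4096  # assumption: hidden size of a LLaMA-7B-style model
RANK = 8       # LoRA rank from this card

full_params = HIDDEN * HIDDEN
extra_params = lora_extra_params(HIDDEN, HIDDEN, RANK)
fraction = extra_params / full_params

print("frozen weights in one projection:", full_params)
print("LoRA trainable weights:", extra_params)
print(f"trainable fraction: {fraction:.4%}")
```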

Models from Moderate Pruning & LoRA Retraining (arXiv-v1):

| Source Model | Pruning Ratio | Pruning Criterion | HF Models Link |
|---|---|---|---|
| LLaMA-1-7B | 20% | PPL | nota-ai/st-llama-1-5.5b-ppl |
| LLaMA-1-7B | 20% | Taylor+ | nota-ai/st-llama-1-5.5b-taylor |
| Vicuna-v1.3-7B | 20% | PPL | nota-ai/st-vicuna-v1.3-5.5b-ppl |
| Vicuna-v1.3-7B | 20% | Taylor+ | nota-ai/st-vicuna-v1.3-5.5b-taylor |
| Vicuna-v1.3-13B | 21% | PPL | nota-ai/st-vicuna-v1.3-10.5b-ppl |
| Vicuna-v1.3-13B | 21% | Taylor+ | nota-ai/st-vicuna-v1.3-10.5b-taylor |

Results:
  • Evaluated with EleutherAI/lm-evaluation-harness version 3326c54.
[Figure: results]
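
Because depth pruning removes whole Transformer blocks, the size of a pruned model follows directly from the number of blocks kept. The sketch below counts parameters for a LLaMA-7B-style architecture. The dimensions (hidden size 4096, MLP size 11008, vocabulary 32000, 32 blocks, untied output head) are assumptions based on the standard LLaMA-7B configuration, and the choice of 26 kept blocks is an assumption that happens to reproduce the "5.5b" size in the model names; neither is stated in this card.

```python
def llama_param_count(num_blocks, hidden=4096, intermediate=11008, vocab=32000):
    """Parameter count of a LLaMA-style decoder with num_blocks blocks.

    Assumed layout per block: four hidden x hidden attention projections,
    three hidden x intermediate MLP projections, two RMSNorm weight vectors.
    """
    attention = 4 * hidden * hidden
    mlp = 3 * hidden * intermediate
    norms = 2 * hidden
    per_block = attention + mlp + norms

    embeddings = 2 * vocab * hidden  # input embedding plus untied output head
    final_norm = hidden
    return num_blocks * per_block + embeddings + final_norm


original = llama_param_count(32)
pruned = llama_param_count(26)  # assumption: 6 of 32 blocks removed

print(f"32 blocks: {original / 1e9:.2f}B parameters")
print(f"26 blocks: {pruned / 1e9:.2f}B parameters")
```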

License

  • All rights related to this repository and the compressed models are reserved by Nota Inc.
  • The intended use is strictly limited to research and non-commercial projects.

Acknowledgments

Citation

@article{kim2024shortened,
  title={Shortened LLaMA: Depth Pruning for Large Language Models with Comparison of Retraining Methods},
  author={Kim, Bo-Kyeong and Kim, Geonmin and Kim, Tae-Ho and Castells, Thibault and Choi, Shinkook and Shin, Junho and Song, Hyoung-Kyu},
  journal={arXiv preprint arXiv:2402.02834},
  year={2024},
  url={https://arxiv.org/abs/2402.02834}
}
@article{kim2024mefomo,
  title={Shortened LLaMA: A Simple Depth Pruning for Large Language Models},
  author={Kim, Bo-Kyeong and Kim, Geonmin and Kim, Tae-Ho and Castells, Thibault and Choi, Shinkook and Shin, Junho and Song, Hyoung-Kyu},
  journal={ICLR Workshop on Mathematical and Empirical Understanding of Foundation Models (ME-FoMo)},
  year={2024},
  url={https://openreview.net/forum?id=18VGxuOdpu}
}

Model size: 5.52B params (Safetensors). Tensor type: FP16.