llama-3.2-3b-attn-drop-6
Model Description
This model is a surgically optimized version of meta-llama/Llama-3.2-3B, created as part of Chapter 8 in the book "Rearchitecting LLMs".
- Book: Rearchitecting LLMs
- Framework: OptiPFair
- Technique: Attention Optimization (Physical Attention Layer Removal)
- Chapter: Chapter 8 - Attention Optimization
- Notebook: CH08_NB04_uploadHF
- Paper: What Matters in Transformers? Not All Attention is Needed (He et al., 2024)
Implementation
How It Works
Unlike KV cache quantization, which compresses cached keys and values but keeps every layer, this technique permanently removes the least important attention modules from the model architecture. Six attention layers were removed (indices 18, 20, 21, 22, 23, 24).
The importance metric is:
S = 1 - CosineSim(X_A, Y_A)
Where X_A is the hidden state entering the attention sublayer (captured before input_layernorm) and Y_A is the hidden state after the attention computation and its residual connection (X_A + Attention(LayerNorm(X_A))).
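As a minimal sketch of the metric above, the following pure-Python function computes S for a single pair of vectors. The function name `attention_importance` and the toy inputs are illustrative assumptions, not part of OptiPFair; a real run would apply the same formula to hidden-state tensors, and how scores are aggregated across tokens and calibration samples is not specified here.

```python
import math

def attention_importance(x_a, y_a):
    """Return S = 1 - CosineSim(x_a, y_a) for two equal-length vectors."""
    if len(x_a) != len(y_a):
        raise ValueError("x_a and y_a must have the same length")
    dot = sum(x * y for x, y in zip(x_a, y_a))
    norm_x = math.sqrt(sum(x * x for x in x_a))
    norm_y = math.sqrt(sum(y * y for y in y_a))
    return 1.0 - dot / (norm_x * norm_y)

# Toy case: the attention sublayer barely changes the hidden state,
# so the score is close to 0 (the sublayer looks redundant).
score_small = attention_importance([1.0, 2.0, 3.0], [1.1, 2.0, 2.9])

# Toy case: the output is orthogonal to the input, so the score is 1.
score_large = attention_importance([1.0, 0.0], [0.0, 1.0])
```

A score near 0 means the sublayer output points in almost the same direction as its input, which is why the lowest-scoring layers are the removal candidates.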
Calibration Data
Importance scores were computed over 400 samples from Cosmopedia, weighted to cover the same range of tasks as the evaluation benchmarks:
| Subset | Weight | Samples |
|---|---|---|
| stories | 0.300 | 120 |
| web_samples_v2 | 0.200 | 80 |
| web_samples_v1 | 0.150 | 60 |
| wikihow | 0.150 | 60 |
| openstax | 0.125 | 50 |
| stanford | 0.075 | 30 |
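The sample counts in the table follow directly from the weights and the 400-sample budget. The short check below reproduces them; the variable names are illustrative, and the rounding step is an assumption about how fractional counts would be handled.

```python
TOTAL_SAMPLES = 400

# Subset weights copied from the table above.
weights = {
    "stories": 0.300,
    "web_samples_v2": 0.200,
    "web_samples_v1": 0.150,
    "wikihow": 0.150,
    "openstax": 0.125,
    "stanford": 0.075,
}

# round() guards against floating-point error in the products.
samples = {name: round(w * TOTAL_SAMPLES) for name, w in weights.items()}
total = sum(samples.values())
```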
Layer Importance Scores
Attention importance scores (ascending; lowest = most redundant):
Layer   Score
------  ----------
21      0.009053  <- dropped
22      0.010405  <- dropped
23      0.012417  <- dropped
20      0.012783  <- dropped
18      0.014297  <- dropped
24      0.014580  <- dropped
25      0.015039
19      0.019702
...
1       0.139336
0       0.287197
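The selection step can be sketched as a sort followed by a slice. The snippet below uses only the scores listed above; the elided rows are left out rather than guessed, which should not change the result because the listing is ascending and they fall above layer 19. The variable names other than `NUM_LAYERS_TO_DROP` are illustrative.

```python
NUM_LAYERS_TO_DROP = 6

# Only the scores shown above; the elided middle of the listing is omitted.
scores = {
    21: 0.009053, 22: 0.010405, 23: 0.012417, 20: 0.012783,
    18: 0.014297, 24: 0.014580, 25: 0.015039, 19: 0.019702,
    1: 0.139336, 0: 0.287197,
}

# Rank layer indices by ascending score, then keep the lowest-scoring ones.
ranked = sorted(scores, key=scores.get)
layers_to_drop = sorted(ranked[:NUM_LAYERS_TO_DROP])
```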
Physical Deletion
The attention sublayers of layers 18, 20, 21, 22, 23, and 24 were physically removed. For each selected LlamaDecoderLayer, self_attn and input_layernorm are deleted, and the layer's forward() is patched to route hidden states directly to the MLP block:
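def forward_no_attn(self, hidden_states, *args, **kwargs):
    # Attention is gone, so any extra arguments the caller passes
    # (masks, position information, cache objects) are accepted and ignored.
    residual = hidden_states
    hidden_states = self.post_attention_layernorm(hidden_states)
    hidden_states = self.mlp(hidden_states)
    hidden_states = residual + hidden_states
    return hidden_states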
Parameters removed: 151,013,376 (~302 MB in FP16, 4.7% reduction)
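The parameter count above can be reproduced with simple arithmetic. The configuration values below (hidden size 3072, 24 query heads, 8 key/value heads, head dimension 128, no projection biases) are assumptions about the base model's configuration and are not stated in this card; with them, the result matches the reported figure.

```python
# Assumed Llama-3.2-3B configuration values (not stated in this card).
HIDDEN_SIZE = 3072
NUM_ATTENTION_HEADS = 24
NUM_KEY_VALUE_HEADS = 8
HEAD_DIM = 128
LAYERS_DROPPED = 6

# Weight matrices of one attention block (assuming no bias terms).
q_proj = HIDDEN_SIZE * NUM_ATTENTION_HEADS * HEAD_DIM
k_proj = HIDDEN_SIZE * NUM_KEY_VALUE_HEADS * HEAD_DIM
v_proj = HIDDEN_SIZE * NUM_KEY_VALUE_HEADS * HEAD_DIM
o_proj = NUM_ATTENTION_HEADS * HEAD_DIM * HIDDEN_SIZE
input_layernorm = HIDDEN_SIZE  # one weight vector per norm

per_layer = q_proj + k_proj + v_proj + o_proj + input_layernorm
params_removed = per_layer * LAYERS_DROPPED
fp16_megabytes = params_removed * 2 / 1e6  # 2 bytes per FP16 parameter
```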
Loading the Model
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained(
    "oopere/llama-3.2-3b-attn-drop-6",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("oopere/llama-3.2-3b-attn-drop-6")
Note: This model was developed and tested on a Google Colab T4 GPU (free tier). Use torch.float16 on T4; switch to torch.bfloat16 on Ampere-class GPUs or newer.
Benchmarks
| Benchmark | Metric | Baseline | Pruned | Δ (relative) |
|---|---|---|---|---|
| ARC Easy | acc_norm | 0.7180 | 0.7197 | +0.24% |
| HellaSwag | acc_norm | 0.7405 | 0.7254 | -2.04% |
| LAMBADA OpenAI | accuracy | 0.6969 | 0.5639 | -19.08% |
| PIQA | acc_norm | 0.7813 | 0.7715 | -1.25% |
| WinoGrande | accuracy | 0.6961 | 0.6827 | -1.93% |
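The last column of the table above is the relative change from baseline to pruned, not the absolute difference in score. The check below recomputes it from the two score columns; the helper name `relative_change_pct` is illustrative.

```python
# (baseline, pruned) pairs copied from the table above.
results = {
    "ARC Easy": (0.7180, 0.7197),
    "HellaSwag": (0.7405, 0.7254),
    "LAMBADA OpenAI": (0.6969, 0.5639),
    "PIQA": (0.7813, 0.7715),
    "WinoGrande": (0.6961, 0.6827),
}

def relative_change_pct(baseline, pruned):
    """Percentage change of the pruned score relative to the baseline."""
    return (pruned - baseline) / baseline * 100.0

deltas = {
    name: round(relative_change_pct(baseline, pruned), 2)
    for name, (baseline, pruned) in results.items()
}
```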
Intended Use
This model is intended as a learning artifact for readers of Rearchitecting LLMs (Chapter 8 Hands-On Lab). It demonstrates that a non-trivial fraction of attention layers in a modern LLM can be physically removed with small loss on most benchmarks: ARC Easy, HellaSwag, PIQA, and WinoGrande each move by about 2% or less, while LAMBADA OpenAI drops by 19.08%. It is not intended for production use.
Generated on 2026-06-21 · NUM_LAYERS_TO_DROP=6 · throughput: 9.27 tok/s