llama-3.2-3b-attn-drop-6

Model Description

This model is a surgically optimized version of meta-llama/Llama-3.2-3B, created as part of Chapter 8 in the book "Rearchitecting LLMs".

Book: Rearchitecting LLMs
Framework: OptiPFair
Technique: Attention Optimization (Physical Attention Layer Removal)
Chapter: Chapter 8 - Attention Optimization
Notebook: CH08_NB04_uploadHF
Paper: What Matters in Transformers? Not All Attention is Needed — He et al., 2024

Implementation

How It Works

Unlike KV cache quantization, this technique permanently removes the least important attention modules from the model architecture. Six attention layers were removed (indices 18, 20, 21, 22, 23, 24).

The importance metric is:

S = 1 - CosineSim(X_A, Y_A)

Where X_A is the hidden state entering the attention sublayer (captured before input_layernorm) and Y_A is the hidden state after the attention computation and its residual connection (X_A + Attention(LayerNorm(X_A))).
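
As a rough illustration of the metric, the sketch below computes S for one layer from plain Python lists. It assumes the score is one minus the per-token cosine similarity, averaged over token positions. The averaging scheme and the names cosine_sim and attention_importance are illustrative assumptions, not taken from the OptiPFair code, and the toy numbers are made up.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def attention_importance(x_a, y_a):
    """S = 1 - CosineSim(X_A, Y_A), averaged over token positions (assumed).

    x_a: per-token hidden states entering the attention sublayer.
    y_a: per-token hidden states after attention plus the residual.
    """
    sims = [cosine_sim(x, y) for x, y in zip(x_a, y_a)]
    return 1.0 - sum(sims) / len(sims)

# Toy data: two tokens, hidden size 3 (made-up numbers).
x_a = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
attn_out = [[0.0, 0.1, 0.0], [0.0, 0.0, 0.0]]  # hypothetical attention outputs
# Y_A = X_A + attention output (the residual connection).
y_a = [[xi + ai for xi, ai in zip(x, a)] for x, a in zip(x_a, attn_out)]

score = attention_importance(x_a, y_a)
```

When the attention output barely changes the direction of the hidden state, the score stays close to 0, which is why the lowest scores mark the most redundant layers.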

Calibration Data

Importance scores were computed over 400 samples from Cosmopedia, weighted to cover the same range of tasks as the evaluation benchmarks:

Subset	Weight	Samples
stories	0.300	120
web_samples_v2	0.200	80
web_samples_v1	0.150	60
wikihow	0.150	60
openstax	0.125	50
stanford	0.075	30
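
The sample counts in the table follow from the weights and the 400-sample budget. A minimal sketch of that arithmetic, assuming each count is simply the weight times the total, rounded to the nearest integer (the rounding rule is an assumption):

```python
TOTAL_SAMPLES = 400

# Subset weights as listed in the table above.
weights = {
    "stories": 0.300,
    "web_samples_v2": 0.200,
    "web_samples_v1": 0.150,
    "wikihow": 0.150,
    "openstax": 0.125,
    "stanford": 0.075,
}

# Samples per subset = weight * total, rounded to the nearest integer.
samples = {name: round(w * TOTAL_SAMPLES) for name, w in weights.items()}
total = sum(samples.values())
```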

Layer Importance Scores

Attention importance scores (ascending — lowest = most redundant):

Layer   Score
------  ----------
   21    0.009053   ← dropped
   22    0.010405   ← dropped
   23    0.012417   ← dropped
   20    0.012783   ← dropped
   18    0.014297   ← dropped
   24    0.014580   ← dropped
   25    0.015039
   19    0.019702
  ...
    1    0.139336
    0    0.287197
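
Selecting the layers to drop amounts to taking the NUM_LAYERS_TO_DROP lowest scores. The sketch below reproduces that selection using only the scores shown in the table. The rows elided by "..." are left out here as well, which does not affect the result because the table is sorted in ascending order. The variable names are illustrative.

```python
NUM_LAYERS_TO_DROP = 6

# Scores copied from the table above; rows elided there ("...") are omitted.
scores = {
    21: 0.009053,
    22: 0.010405,
    23: 0.012417,
    20: 0.012783,
    18: 0.014297,
    24: 0.014580,
    25: 0.015039,
    19: 0.019702,
    1: 0.139336,
    0: 0.287197,
}

# Rank layer indices by ascending score, then keep the lowest-scoring ones.
ranked = sorted(scores, key=scores.get)
layers_to_drop = sorted(ranked[:NUM_LAYERS_TO_DROP])
```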

Physical Deletion

The attention sublayers of decoder layers 18, 20, 21, 22, 23, and 24 were physically removed, while the MLP blocks of those layers are kept. For each selected LlamaDecoderLayer, self_attn and input_layernorm are deleted and the layer's forward() is patched to route hidden states directly to the MLP block:

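def forward_no_attn(self, hidden_states, *args, **kwargs):
    # Attention is gone: skip input_layernorm/self_attn and run only the MLP sublayer.
    # Remaining forward arguments (mask, positions, cache, etc.) are accepted and ignored.
    residual      = hidden_states
    hidden_states = self.post_attention_layernorm(hidden_states)
    hidden_states = self.mlp(hidden_states)
    hidden_states = residual + hidden_states
    return hidden_states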
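
To show the mechanics of the patch without loading a model, here is a dependency-free sketch. ToyDecoderLayer is a hypothetical stand-in for LlamaDecoderLayer in which hidden states are plain floats and each submodule is a simple function. The helper drop_attention is likewise an assumption for illustration, not the OptiPFair API.

```python
import types

class ToyDecoderLayer:
    """Hypothetical stand-in for a decoder layer; hidden states are floats."""

    def __init__(self):
        self.input_layernorm = lambda h: h
        self.self_attn = lambda h: 0.5 * h
        self.post_attention_layernorm = lambda h: h
        self.mlp = lambda h: 2.0 * h

    def forward(self, hidden_states, *args, **kwargs):
        # Attention sublayer with its residual connection.
        hidden_states = hidden_states + self.self_attn(
            self.input_layernorm(hidden_states)
        )
        # MLP sublayer with its residual connection.
        return hidden_states + self.mlp(
            self.post_attention_layernorm(hidden_states)
        )

def forward_no_attn(self, hidden_states, *args, **kwargs):
    # MLP half of forward() only; the attention half is skipped.
    residual = hidden_states
    hidden_states = self.mlp(self.post_attention_layernorm(hidden_states))
    return residual + hidden_states

def drop_attention(layer):
    """Delete the attention submodules and rebind forward on this instance."""
    del layer.self_attn
    del layer.input_layernorm
    layer.forward = types.MethodType(forward_no_attn, layer)
    return layer

layer = ToyDecoderLayer()
before = layer.forward(1.0)  # attention + MLP path
drop_attention(layer)
after = layer.forward(1.0)   # MLP-only path
```

Binding with types.MethodType patches only the selected instance, so the remaining layers keep their original forward().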

Parameters removed: 151,013,376 (~302 MB in FP16, 4.7% reduction)
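
The removed-parameter figure can be checked by hand. The sketch below assumes the public Llama-3.2-3B configuration (hidden size 3072, 24 query heads, 8 key/value heads, head dimension 128, no attention biases) and a total of 3,212,749,824 parameters for the base model. Those values come from the base model's configuration rather than from this card, so treat them as assumptions.

```python
# Llama-3.2-3B attention shapes (assumed from the public model config).
HIDDEN_SIZE = 3072
NUM_HEADS = 24
NUM_KV_HEADS = 8
HEAD_DIM = 128
LAYERS_DROPPED = 6
TOTAL_PARAMS = 3_212_749_824  # assumed size of the unpruned base model

# Weight matrices of one attention block (no bias terms assumed).
q_proj = HIDDEN_SIZE * NUM_HEADS * HEAD_DIM
k_proj = HIDDEN_SIZE * NUM_KV_HEADS * HEAD_DIM
v_proj = HIDDEN_SIZE * NUM_KV_HEADS * HEAD_DIM
o_proj = NUM_HEADS * HEAD_DIM * HIDDEN_SIZE
input_layernorm = HIDDEN_SIZE  # one scale vector, deleted with the block

per_layer = q_proj + k_proj + v_proj + o_proj + input_layernorm
removed = per_layer * LAYERS_DROPPED
size_mb_fp16 = removed * 2 / 1e6          # 2 bytes per FP16 parameter
reduction_pct = 100 * removed / TOTAL_PARAMS
```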

Loading the Model

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "oopere/llama-3.2-3b-attn-drop-6",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("oopere/llama-3.2-3b-attn-drop-6")

Note: This model was developed and tested on a Google Colab T4 GPU (free tier). Use torch.float16 on T4; switch to torch.bfloat16 on Ampere-class GPUs or newer.
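
One way to automate the dtype choice from the note above is to branch on the GPU's CUDA compute capability. The sketch assumes Ampere corresponds to compute capability 8.x and that the T4 (Turing) reports 7.5. The helper name pick_dtype_name is hypothetical, and the capability values are hard-coded so the example runs without a GPU.

```python
def pick_dtype_name(compute_capability_major):
    """Return "bfloat16" for Ampere (8.x) or newer GPUs, else "float16"."""
    return "bfloat16" if compute_capability_major >= 8 else "float16"

# With PyTorch installed, the major version could be read from
# torch.cuda.get_device_capability()[0]; it is hard-coded here instead.
t4_dtype = pick_dtype_name(7)    # T4 is Turing, compute capability 7.5
a100_dtype = pick_dtype_name(8)  # A100 is Ampere, compute capability 8.0
```

The returned name can then be resolved with getattr(torch, name) and passed as torch_dtype when loading the model.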

Benchmarks

Benchmark	Metric	Baseline	Pruned	Δ (relative)
ARC Easy	acc_norm	0.7180	0.7197	+0.24%
HellaSwag	acc_norm	0.7405	0.7254	-2.04%
LAMBADA OpenAI	accuracy	0.6969	0.5639	-19.08%
PIQA	acc_norm	0.7813	0.7715	-1.25%
WinoGrande	accuracy	0.6961	0.6827	-1.93%
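
The Δ column is consistent with the relative change of the pruned score with respect to the baseline, (pruned - baseline) / baseline. A short sketch that recomputes it from the rounded scores in the table (variable names are illustrative):

```python
# (baseline, pruned) scores copied from the table above.
results = {
    "ARC Easy": (0.7180, 0.7197),
    "HellaSwag": (0.7405, 0.7254),
    "LAMBADA OpenAI": (0.6969, 0.5639),
    "PIQA": (0.7813, 0.7715),
    "WinoGrande": (0.6961, 0.6827),
}

# Relative change in percent, rounded to two decimals.
deltas = {
    name: round(100 * (pruned - baseline) / baseline, 2)
    for name, (baseline, pruned) in results.items()
}
```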

Intended Use

This model is intended as a learning artifact for readers of Rearchitecting LLMs (Chapter 8 Hands-On Lab). It demonstrates that a non-trivial fraction of the attention layers in a modern LLM (here, 6 of 28) can be physically removed with a relative loss of roughly 2% or less on ARC Easy, HellaSwag, PIQA, and WinoGrande, although LAMBADA accuracy drops by about 19%. It is not intended for production use.

Generated on 2026-06-21 — NUM_LAYERS_TO_DROP=6 — throughput: 9.27 tok/s