llama-3.2-3b-attn-drop-6

Model Description

This model is a surgically optimized version of meta-llama/Llama-3.2-3B, created as part of Chapter 8 in the book "Rearchitecting LLMs".



Implementation

How It Works

Unlike KV cache quantization, this technique permanently removes the least important attention modules from the model architecture. Six attention sublayers were removed (layer indices: 18, 20, 21, 22, 23, 24).

The importance metric is:

S = 1 - CosineSim(X_A, Y_A)

Where X_A is the hidden state entering the attention sublayer (captured before input_layernorm) and Y_A is the hidden state after the attention computation and its residual connection (X_A + Attention(LayerNorm(X_A))).
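The metric can be illustrated with a minimal sketch on toy vectors. This is not the Chapter 8 code: the function names are hypothetical, the two-element lists stand in for real hidden states, and `attn_out` plays the role of Attention(LayerNorm(X_A)). How scores are aggregated over tokens and samples is not stated above and is left out here.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def attention_importance(x_a, attn_out):
    """S = 1 - CosineSim(X_A, Y_A), with Y_A = X_A + attn_out (residual add)."""
    y_a = [x + d for x, d in zip(x_a, attn_out)]
    return 1.0 - cosine_similarity(x_a, y_a)

x = [1.0, 0.0]
# Attention contributes nothing: Y_A equals X_A, so the score is 0.
score_noop = attention_importance(x, [0.0, 0.0])
# Attention rotates the state to an orthogonal direction: the score is 1.
score_rotate = attention_importance(x, [-1.0, 1.0])
```

A score near 0 means the sublayer barely changes the direction of the hidden state, which is why the lowest-scoring layers are treated as the most redundant.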

Calibration Data

Importance scores were computed over 400 samples from Cosmopedia, weighted to cover the same range of tasks as the evaluation benchmarks:

Subset            Weight   Samples
--------------    ------   -------
stories           0.300    120
web_samples_v2    0.200     80
web_samples_v1    0.150     60
wikihow           0.150     60
openstax          0.125     50
stanford          0.075     30
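The per-subset sample counts follow directly from the weights and the 400-sample budget. The short check below reproduces them; the variable names are illustrative and the rounding step is an assumption about how the counts were derived.

```python
TOTAL_SAMPLES = 400

# Weights as listed in the calibration table.
weights = {
    "stories": 0.300,
    "web_samples_v2": 0.200,
    "web_samples_v1": 0.150,
    "wikihow": 0.150,
    "openstax": 0.125,
    "stanford": 0.075,
}

# Assumption: each count is the weight times the budget, rounded to an integer.
samples = {name: round(w * TOTAL_SAMPLES) for name, w in weights.items()}
total = sum(samples.values())
```

The weights sum to 1.0, so the counts add up to exactly 400 with no remainder to redistribute.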

Layer Importance Scores

Attention importance scores (ascending; lowest = most redundant):

Layer   Score
------  ----------
   21    0.009053   ← dropped
   22    0.010405   ← dropped
   23    0.012417   ← dropped
   20    0.012783   ← dropped
   18    0.014297   ← dropped
   24    0.014580   ← dropped
   25    0.015039
   19    0.019702
  ...
    1    0.139336
    0    0.287197
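Selecting the layers to drop amounts to taking the lowest-scoring entries. The sketch below uses only the rows printed above (the elided rows are not filled in) and assumes a simple "lowest N" rule, which is consistent with the rows marked as dropped.

```python
NUM_LAYERS_TO_DROP = 6

# Scores copied from the table; rows hidden behind "..." are omitted.
scores = {
    21: 0.009053, 22: 0.010405, 23: 0.012417, 20: 0.012783,
    18: 0.014297, 24: 0.014580, 25: 0.015039, 19: 0.019702,
    1: 0.139336, 0: 0.287197,
}

# Rank layer indices by ascending score and keep the N most redundant.
ranked = sorted(scores, key=scores.get)
to_drop = sorted(ranked[:NUM_LAYERS_TO_DROP])
print(to_drop)  # [18, 20, 21, 22, 23, 24]
```

Layer 19 scores higher than its neighbours, so it survives even though layers 18 and 20 on either side of it are dropped.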

Physical Deletion

The attention sublayers of layers 18, 20, 21, 22, 23, and 24 were physically removed; the MLP sublayers of those layers are kept. For each selected LlamaDecoderLayer, self_attn and input_layernorm are deleted and the layer's forward() is patched to route hidden states directly to the MLP block:

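def forward_no_attn(self, hidden_states, *args, **kwargs):
    # Extra arguments (attention mask, position ids, cache, ...) are accepted and ignored.
    residual      = hidden_states
    # With attention removed, the MLP sublayer is the only computation left.
    hidden_states = self.post_attention_layernorm(hidden_states)
    hidden_states = self.mlp(hidden_states)
    hidden_states = residual + hidden_states
    return hidden_states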
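The delete-and-patch pattern can be tried without loading a model. The sketch below uses a hypothetical `ToyDecoderLayer` whose sublayers are plain Python callables on lists; it only illustrates deleting two attributes and rebinding `forward` on selected instances, and is not the code used to build this checkpoint.

```python
import types

class ToyDecoderLayer:
    """Hypothetical stand-in for a decoder layer (no torch required)."""
    def __init__(self):
        self.input_layernorm = lambda x: x
        self.self_attn = lambda x: [v * 0.5 for v in x]
        self.post_attention_layernorm = lambda x: x
        self.mlp = lambda x: [v * 2.0 for v in x]

    def forward(self, hidden_states):
        # Attention sublayer with residual connection.
        residual = hidden_states
        hidden_states = self.self_attn(self.input_layernorm(hidden_states))
        hidden_states = [r + h for r, h in zip(residual, hidden_states)]
        # MLP sublayer with residual connection.
        residual = hidden_states
        hidden_states = self.mlp(self.post_attention_layernorm(hidden_states))
        return [r + h for r, h in zip(residual, hidden_states)]

def toy_forward_no_attn(self, hidden_states):
    # Only the MLP sublayer and its residual remain.
    residual = hidden_states
    hidden_states = self.mlp(self.post_attention_layernorm(hidden_states))
    return [r + h for r, h in zip(residual, hidden_states)]

def drop_attention(layer):
    """Delete the attention sublayer and rebind forward on this instance only."""
    del layer.self_attn
    del layer.input_layernorm
    layer.forward = types.MethodType(toy_forward_no_attn, layer)
    return layer

layers = [ToyDecoderLayer() for _ in range(4)]
drop_attention(layers[2])               # patch one selected layer
full = layers[0].forward([1.0, 2.0])    # untouched layer: [4.5, 9.0]
pruned = layers[2].forward([1.0, 2.0])  # patched layer:   [3.0, 6.0]
```

Binding the replacement to the instance rather than the class leaves every other layer on its original forward path.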

Parameters removed: 151,013,376 (~302 MB in FP16, 4.7% reduction)
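The parameter count can be reproduced from the model's dimensions. The values below are assumptions taken from the published Llama-3.2-3B configuration (they are not stated in this card): hidden size 3072, 24 query heads, 8 key/value heads, head dimension 128, MLP intermediate size 8192, vocabulary 128256, 28 decoder layers, tied input/output embeddings, and no attention biases.

```python
HIDDEN = 3072
NUM_HEADS, NUM_KV_HEADS, HEAD_DIM = 24, 8, 128
INTERMEDIATE, VOCAB, NUM_LAYERS = 8192, 128256, 28
NUM_DROPPED = 6

# Attention projections (q, k, v, o), all without bias terms.
q_proj = HIDDEN * NUM_HEADS * HEAD_DIM
k_proj = HIDDEN * NUM_KV_HEADS * HEAD_DIM
v_proj = k_proj
o_proj = NUM_HEADS * HEAD_DIM * HIDDEN
attn_params = q_proj + k_proj + v_proj + o_proj

# Each dropped layer loses its attention weights plus input_layernorm.
per_layer_removed = attn_params + HIDDEN
removed = per_layer_removed * NUM_DROPPED

# Baseline total: embeddings (tied), 28 full layers, final norm.
mlp_params = 3 * HIDDEN * INTERMEDIATE
layer_params = attn_params + mlp_params + 2 * HIDDEN
baseline_total = VOCAB * HIDDEN + NUM_LAYERS * layer_params + HIDDEN

fp16_megabytes = removed * 2 / 1e6          # 2 bytes per FP16 parameter
reduction_pct = removed / baseline_total * 100
print(f"{removed:,} params, {fp16_megabytes:.0f} MB, {reduction_pct:.1f}%")
```

Under these assumptions the script gives 151,013,376 parameters, about 302 MB in FP16, and a 4.7% reduction, matching the figures above.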


Loading the Model

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "oopere/llama-3.2-3b-attn-drop-6",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("oopere/llama-3.2-3b-attn-drop-6")

Note: This model was developed and tested on a Google Colab T4 GPU (free tier). Use torch.float16 on T4; switch to torch.bfloat16 on Ampere-class GPUs or newer.
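The dtype rule in the note can be written as a small helper. This is a sketch, not part of the model's code: `pick_dtype_name` is a hypothetical name, and it relies on the fact that Ampere GPUs start at CUDA compute capability 8.0 while the T4 (Turing) reports 7.5.

```python
def pick_dtype_name(compute_capability):
    """Return 'bfloat16' for Ampere-class GPUs or newer, else 'float16'.

    compute_capability is a (major, minor) tuple, e.g. (7, 5) for a T4.
    """
    major, _minor = compute_capability
    return "bfloat16" if major >= 8 else "float16"

t4_choice = pick_dtype_name((7, 5))    # 'float16'
a100_choice = pick_dtype_name((8, 0))  # 'bfloat16'
```

With PyTorch installed, `torch.cuda.get_device_capability()` returns such a tuple for the current device, and `getattr(torch, name)` turns the returned name into the dtype to pass as `torch_dtype`.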


Benchmarks

Benchmark         Metric     Baseline   Pruned   Δ
--------------    --------   --------   ------   -------
ARC Easy          acc_norm   0.7180     0.7197    +0.24%
HellaSwag         acc_norm   0.7405     0.7254    -2.04%
LAMBADA OpenAI    accuracy   0.6969     0.5639   -19.08%
PIQA              acc_norm   0.7813     0.7715    -1.25%
WinoGrande        accuracy   0.6961     0.6827    -1.93%
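The Δ column is consistent with a relative change, (pruned - baseline) / baseline, rather than a difference in percentage points. The check below recomputes it from the two score columns; the function name is illustrative.

```python
# (baseline, pruned) pairs copied from the benchmark table.
results = {
    "ARC Easy": (0.7180, 0.7197),
    "HellaSwag": (0.7405, 0.7254),
    "LAMBADA OpenAI": (0.6969, 0.5639),
    "PIQA": (0.7813, 0.7715),
    "WinoGrande": (0.6961, 0.6827),
}

def relative_change_pct(baseline, pruned):
    """Relative change of the pruned score with respect to the baseline, in percent."""
    return (pruned - baseline) / baseline * 100

deltas = {
    name: round(relative_change_pct(base, pruned), 2)
    for name, (base, pruned) in results.items()
}
for name, delta in deltas.items():
    print(f"{name:<16}{delta:+.2f}%")
```

Read this way, the LAMBADA OpenAI drop of 19.08% corresponds to 13.3 accuracy points (0.6969 to 0.5639).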

Intended Use

This model is intended as a learning artifact for readers of Rearchitecting LLMs (Chapter 8 Hands-On Lab). It demonstrates that a non-trivial fraction of attention layers in a modern LLM can be physically removed with small loss on most benchmarks: four of the five benchmarks above change by about 2% or less, while LAMBADA OpenAI drops by 19.08%. It is not intended for production use.

Generated on 2026-06-21; NUM_LAYERS_TO_DROP=6; throughput: 9.27 tok/s
