You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Gemma4-lite-round2 GPTQ INT4 (W4A16)

Model Overview

This model is a pruned and quantized version of google/gemma-4-E4B-it. Through FLAP structured pruning of the FFN layers followed by GPTQ INT4 weight-only quantization, it significantly reduces parameter count and inference memory while maintaining model performance.

Base Model Information

Base Model: google/gemma-4-E4B-it
Model Type: Multimodal Large Language Model (Text + Image + Audio)
Architecture: Gemma4ForConditionalGeneration
Layers: 42 layers
Data Type: bfloat16 (unquantized parts), INT4 (quantized parts)

Compression Methods

1. FLAP Pruning (Fluctuation-based Adaptive Structured Pruning)

This model follows the FLAP method for FFN structured pruning.

Bias Compensation
- Adds bias compensation for the average contribution of pruned channels, maintaining good performance without retraining
- Original Gemma4's down_proj has no bias; bias parameters are added after pruning

2. GPTQ INT4 Quantization (W4A16, weight-only)

This model uses GPTQ INT4 weight-only quantization via GPTQModel for efficient inference with vLLM.

Quantization Method: GPTQ (GPT-QModel) W4A16 weight-only
Weight Quantization: 4-bit symmetric group-wise quantization
Activation: Not quantized (remains in bfloat16)
Quantization Format: GPTQ v1 (gptq checkpoint format)
Quantization Tool: GPTQModel 7.0.0

Quantization Configuration

Parameter	Value	Description
`bits`	4	Quantization bit-width (INT4)
`group_size`	128	Quantization group size
`sym`	True	Symmetric quantization
`desc_act`	True	Activation descending order (improves quantization accuracy)
`format`	gptq	GPTQ v1 format, compatible with vLLM
`damp_percent`	0.01	Hessian diagonal damping percentage
`calibration_samples`	256	Number of calibration samples (WikiText-2)
`calibration_seq_len`	2048	Calibration sequence length

Quantized Modules

The following modules in the language model are quantized to INT4:

Module Type	Description
`self_attn.o_proj`	Attention output projection
`mlp.gate_proj`	MLP gate projection
`mlp.up_proj`	MLP up projection
`mlp.down_proj`	MLP down projection

Total: 168 quantized modules (42 layers × 4 modules per layer)

Unquantized Modules (kept in bfloat16)

The following modules are explicitly excluded from quantization via GPTQModel's dynamic configuration and remain in bfloat16:

Module Pattern	Reason
`self_attn.q_proj`	QKV projections are sensitive to quantization error; keeping bf16 significantly improves accuracy
`self_attn.k_proj`	Same as above
`self_attn.v_proj`	Same as above
`per_layer_projection`	PLE module weights are small (~1.25MB bf16 per layer) but suffer from large quantization error
`per_layer_input_gate`	Same as above
Vision tower weights (`model.vision_tower.*`)	Vision encoder typically does not need quantization
Audio branch weights (`model.audio_tower.*`)	Audio encoder typically does not need quantization
Embeddings (`embed_tokens.weight`, `embed_tokens_per_layer.weight`)	Embedding layers are not suitable for quantization
LayerNorm/RMSNorm weights	Normalization layers have minimal parameters, no need for quantization
All bias tensors (including FLAP `down_proj.bias`)	Bias terms kept at original precision
Language model head (`lm_head`)	Output projection kept at original precision

GPTQ Quantization Principle

GPTQ employs a layer-wise quantization strategy that minimizes quantization error based on approximate second-order information (Hessian matrix):

For each layer:
  1. Compute the Hessian matrix H⁻¹ for the layer (based on calibration data activations)
  2. Quantize weights column by column, using H⁻¹ to correct unquantized columns
     to compensate for quantization error:
     δ_w = -w_q_err · (H⁻¹_{jj} / [H⁻¹]_{j,.})
  3. When desc_act=True, process columns in descending order of activation magnitude,
     prioritizing important weights

Quantized weight: W_int4 = quantize(W_bf16, scale, zero_point)
Dequantized:      W_bf16 ≈ W_int4 · scale + zero_point
Where scale and zero_point are computed per group of group_size=128

Pruning Configuration

This model adopts a non-uniform pruning strategy, with differentiated processing for Gemma4's YOCO architecture:

Layer Range	Role	Pruning Ratio	intermediate_size
0-3	sliding_attention	20%	8192
4	sliding_attention	0%	10240
5-6	full_attention & sliding_attention	20%	8192
7-8	sliding_attention	0%	10240
9	sliding_attention	10%	9216
10-11	sliding_attention & full_attention	0%	10240
12-13	sliding_attention	20%	8192
14	sliding_attention	0%	10240
15-16	sliding_attention	20%	8192
17	full_attention	0%	10240
18-19	sliding_attention	20%	8192
20-21	sliding_attention	20%	8192
22-23	sliding_attention & full_attention	0%	10240
24-27	sliding_attention	20%	8192
28-29	sliding_attention & full_attention	0%	10240
30-31	sliding_attention	20%	8192
32	sliding_attention	0%	10240
33-34	sliding_attention	20%	8192
35-41	full_attention & sliding_attention	0%	10240

intermediate_size Distribution After Pruning

intermediate_size	Layer Count	Description
10240 (original)	19 layers	Unpruned
9216 (10% pruned)	1 layer	Lightly pruned
8192 (20% pruned)	22 layers	Pruned

FFN Parameter Compression: ~15%

Model Structure Changes

Configuration Changes

{
  "text_config": {
    "intermediate_size": 10240,
    "intermediate_sizes": [8192, 8192, ..., 10240, 10240],
    "flap_pruned": true
  },
  "quantization_config": {
    "quant_method": "gptq",
    "bits": 4,
    "group_size": 128,
    "sym": true,
    "desc_act": true,
    "format": "gptq",
    "checkpoint_format": "gptq",
    "dynamic": {
      "-:.*per_layer_projection": {},
      "-:.*per_layer_input_gate": {},
      "-:.*self_attn.q_proj$": {},
      "-:.*self_attn.k_proj$": {},
      "-:.*self_attn.v_proj$": {}
    }
  }
}

intermediate_sizes: Actual intermediate_size per layer (added after non-uniform pruning)
flap_pruned: Indicates the model has undergone FLAP pruning
quantization_config: GPTQ INT4 quantization configuration

Weight Changes

gate_proj / up_proj: Rows corresponding to pruned channels are removed
down_proj:
- Columns corresponding to pruned channels are removed
- New bias parameter added (bias compensation values)
Quantized weights: Stored as INT4 packed weights (qweight) with corresponding scale/zero-point tensors (scales, qzeros, g_idx)

Usage

Deployment command

vLLM Deployment

# **Required** Download the model with all files (**including plugin files**) to local storage
MODEL_DIR=$(python -c "from huggingface_hub import snapshot_download; print(snapshot_download('ISCASRGL/gemma4-lite-round2'))")

# Set PYTHONPATH to include the plugin (required due to model modifications from pruning)
export PYTHONPATH="$MODEL_DIR:$PYTHONPATH"

# Start vLLM service
vllm serve ISCASRGL/gemma4-lite-round2 --config $MODEL_DIR/vllm_config.yaml

vLLM Plugin Description

This model includes vllm_flap_plugin for direct deployment of FLAP-pruned models in vLLM:

per-layer intermediate_size: Supports different FFN widths per layer after non-uniform pruning
FLAP bias compensation: Adds bias support for down_proj
Conditional patch: Only activates when config.flap_pruned=True, does not affect non-FLAP models

Technical Details

Bias Compensation Principle

Before pruning: FFN(x) = down_proj(act(gate_proj(x)) * up_proj(x))
After pruning:  FFN'(x) = down_proj_pruned(h_pruned) + output_bias

Where:
output_bias = Σ_{j∈pruned} E[h_j] × W_down[:, j]
            = (E[h] * ~mask) @ W_down.T

GPTQ INT4 Quantization Details

Quantization process:
  1. Collect calibration data activations, compute Hessian matrix H
  2. Quantize column by column: w_q = round(w / scale) - zero_point
  3. Error compensation: correct remaining unquantized column weights
  4. desc_act: process columns in descending order of |H_jj|

Dequantization: W_bf16 ≈ (W_int4 - zero_point) × scale
Group quantization: scale and zero_point computed per group of group_size=128
Symmetric quantization: zero_point = 0, W_bf16 ≈ W_int4 × scale

File Structure

├── config.json                       # Model configuration (with quantization_config)
├── model-00001-of-00003.safetensors  # Model weights (shard 1)
├── model-00002-of-00003.safetensors  # Model weights (shard 2)
├── model-00003-of-00003.safetensors  # Model weights (shard 3)
├── model.safetensors.index.json      # Weight index
├── quantize_config.json              # GPTQ quantization configuration
├── quant_log.csv                     # Quantization log (per-layer quantization error)
├── generation_config.json            # Generation configuration
├── processor_config.json             # Multimodal processor configuration
├── tokenizer.json                    # Tokenizer
├── tokenizer_config.json             # Tokenizer configuration
├── chat_template.jinja               # Chat template
├── vllm_flap_plugin.egg-info         # Plugin metadata
└── vllm_flap_plugin/                 # vLLM compatibility plugin
    ├── __init__.py
    └── README.md

Compression Summary

Compression Stage	Method	Reduction
FLAP Pruning	Non-uniform FFN pruning	~15% FFN parameters
GPTQ INT4 Quantization	Weight-only 4-bit quantization	~25% memory for quantized modules

Total Model Size: ~9.9 GB (compared to ~16 GB for the original bfloat16 model, approximately 38% reduction)

Downloads last month: 285

Safetensors

Model size

8B params

Tensor type

BF16

I32

F16

Inference Providers NEW

Any-to-Any

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for ISCASRGL/gemma4-lite-round2

Base model

google/gemma-4-E4B

Finetuned

google/gemma-4-E4B-it

Quantized

(242)

this model