You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Log in or Sign Up to review the conditions and access this model content.

Gemma4-lite-round2 GPTQ INT4 (W4A16)

Model Overview

This model is a pruned and quantized version of google/gemma-4-E4B-it. Through FLAP structured pruning of the FFN layers followed by GPTQ INT4 weight-only quantization, it significantly reduces parameter count and inference memory while maintaining model performance.

Base Model Information

  • Base Model: google/gemma-4-E4B-it
  • Model Type: Multimodal Large Language Model (Text + Image + Audio)
  • Architecture: Gemma4ForConditionalGeneration
  • Layers: 42 layers
  • Data Type: bfloat16 (unquantized parts), INT4 (quantized parts)

Compression Methods

1. FLAP Pruning (Fluctuation-based Adaptive Structured Pruning)

This model follows the FLAP method for FFN structured pruning.

  1. Bias Compensation
    • Adds bias compensation for the average contribution of pruned channels, maintaining good performance without retraining
    • Original Gemma4's down_proj has no bias; bias parameters are added after pruning

2. GPTQ INT4 Quantization (W4A16, weight-only)

This model uses GPTQ INT4 weight-only quantization via GPTQModel for efficient inference with vLLM.

  • Quantization Method: GPTQ (GPT-QModel) W4A16 weight-only
  • Weight Quantization: 4-bit symmetric group-wise quantization
  • Activation: Not quantized (remains in bfloat16)
  • Quantization Format: GPTQ v1 (gptq checkpoint format)
  • Quantization Tool: GPTQModel 7.0.0

Quantization Configuration

Parameter Value Description
bits 4 Quantization bit-width (INT4)
group_size 128 Quantization group size
sym True Symmetric quantization
desc_act True Activation descending order (improves quantization accuracy)
format gptq GPTQ v1 format, compatible with vLLM
damp_percent 0.01 Hessian diagonal damping percentage
calibration_samples 256 Number of calibration samples (WikiText-2)
calibration_seq_len 2048 Calibration sequence length

Quantized Modules

The following modules in the language model are quantized to INT4:

Module Type Description
self_attn.o_proj Attention output projection
mlp.gate_proj MLP gate projection
mlp.up_proj MLP up projection
mlp.down_proj MLP down projection

Total: 168 quantized modules (42 layers × 4 modules per layer)

Unquantized Modules (kept in bfloat16)

The following modules are explicitly excluded from quantization via GPTQModel's dynamic configuration and remain in bfloat16:

Module Pattern Reason
self_attn.q_proj QKV projections are sensitive to quantization error; keeping bf16 significantly improves accuracy
self_attn.k_proj Same as above
self_attn.v_proj Same as above
per_layer_projection PLE module weights are small (~1.25MB bf16 per layer) but suffer from large quantization error
per_layer_input_gate Same as above
Vision tower weights (model.vision_tower.*) Vision encoder typically does not need quantization
Audio branch weights (model.audio_tower.*) Audio encoder typically does not need quantization
Embeddings (embed_tokens.weight, embed_tokens_per_layer.weight) Embedding layers are not suitable for quantization
LayerNorm/RMSNorm weights Normalization layers have minimal parameters, no need for quantization
All bias tensors (including FLAP down_proj.bias) Bias terms kept at original precision
Language model head (lm_head) Output projection kept at original precision

GPTQ Quantization Principle

GPTQ employs a layer-wise quantization strategy that minimizes quantization error based on approximate second-order information (Hessian matrix):

For each layer:
  1. Compute the Hessian matrix H⁻¹ for the layer (based on calibration data activations)
  2. Quantize weights column by column, using H⁻¹ to correct unquantized columns
     to compensate for quantization error:
     δ_w = -w_q_err · (H⁻¹_{jj} / [H⁻¹]_{j,.})
  3. When desc_act=True, process columns in descending order of activation magnitude,
     prioritizing important weights

Quantized weight: W_int4 = quantize(W_bf16, scale, zero_point)
Dequantized:      W_bf16 ≈ W_int4 · scale + zero_point
Where scale and zero_point are computed per group of group_size=128

Pruning Configuration

This model adopts a non-uniform pruning strategy, with differentiated processing for Gemma4's YOCO architecture:

Layer Range Role Pruning Ratio intermediate_size
0-3 sliding_attention 20% 8192
4 sliding_attention 0% 10240
5-6 full_attention & sliding_attention 20% 8192
7-8 sliding_attention 0% 10240
9 sliding_attention 10% 9216
10-11 sliding_attention & full_attention 0% 10240
12-13 sliding_attention 20% 8192
14 sliding_attention 0% 10240
15-16 sliding_attention 20% 8192
17 full_attention 0% 10240
18-19 sliding_attention 20% 8192
20-21 sliding_attention 20% 8192
22-23 sliding_attention & full_attention 0% 10240
24-27 sliding_attention 20% 8192
28-29 sliding_attention & full_attention 0% 10240
30-31 sliding_attention 20% 8192
32 sliding_attention 0% 10240
33-34 sliding_attention 20% 8192
35-41 full_attention & sliding_attention 0% 10240

intermediate_size Distribution After Pruning

intermediate_size Layer Count Description
10240 (original) 19 layers Unpruned
9216 (10% pruned) 1 layer Lightly pruned
8192 (20% pruned) 22 layers Pruned

FFN Parameter Compression: ~15%

Model Structure Changes

Configuration Changes

{
  "text_config": {
    "intermediate_size": 10240,
    "intermediate_sizes": [8192, 8192, ..., 10240, 10240],
    "flap_pruned": true
  },
  "quantization_config": {
    "quant_method": "gptq",
    "bits": 4,
    "group_size": 128,
    "sym": true,
    "desc_act": true,
    "format": "gptq",
    "checkpoint_format": "gptq",
    "dynamic": {
      "-:.*per_layer_projection": {},
      "-:.*per_layer_input_gate": {},
      "-:.*self_attn.q_proj$": {},
      "-:.*self_attn.k_proj$": {},
      "-:.*self_attn.v_proj$": {}
    }
  }
}
  • intermediate_sizes: Actual intermediate_size per layer (added after non-uniform pruning)
  • flap_pruned: Indicates the model has undergone FLAP pruning
  • quantization_config: GPTQ INT4 quantization configuration

Weight Changes

  1. gate_proj / up_proj: Rows corresponding to pruned channels are removed
  2. down_proj:
    • Columns corresponding to pruned channels are removed
    • New bias parameter added (bias compensation values)
  3. Quantized weights: Stored as INT4 packed weights (qweight) with corresponding scale/zero-point tensors (scales, qzeros, g_idx)

Usage

Deployment command

vLLM Deployment

# **Required** Download the model with all files (**including plugin files**) to local storage
MODEL_DIR=$(python -c "from huggingface_hub import snapshot_download; print(snapshot_download('ISCASRGL/gemma4-lite-round2'))")

# Set PYTHONPATH to include the plugin (required due to model modifications from pruning)
export PYTHONPATH="$MODEL_DIR:$PYTHONPATH"

# Start vLLM service
vllm serve ISCASRGL/gemma4-lite-round2 --config $MODEL_DIR/vllm_config.yaml 

vLLM Plugin Description

This model includes vllm_flap_plugin for direct deployment of FLAP-pruned models in vLLM:

  • per-layer intermediate_size: Supports different FFN widths per layer after non-uniform pruning
  • FLAP bias compensation: Adds bias support for down_proj
  • Conditional patch: Only activates when config.flap_pruned=True, does not affect non-FLAP models

Technical Details

Bias Compensation Principle

Before pruning: FFN(x) = down_proj(act(gate_proj(x)) * up_proj(x))
After pruning:  FFN'(x) = down_proj_pruned(h_pruned) + output_bias

Where:
output_bias = Σ_{j∈pruned} E[h_j] × W_down[:, j]
            = (E[h] * ~mask) @ W_down.T

GPTQ INT4 Quantization Details

Quantization process:
  1. Collect calibration data activations, compute Hessian matrix H
  2. Quantize column by column: w_q = round(w / scale) - zero_point
  3. Error compensation: correct remaining unquantized column weights
  4. desc_act: process columns in descending order of |H_jj|

Dequantization: W_bf16 ≈ (W_int4 - zero_point) × scale
Group quantization: scale and zero_point computed per group of group_size=128
Symmetric quantization: zero_point = 0, W_bf16 ≈ W_int4 × scale

File Structure

├── config.json                       # Model configuration (with quantization_config)
├── model-00001-of-00003.safetensors  # Model weights (shard 1)
├── model-00002-of-00003.safetensors  # Model weights (shard 2)
├── model-00003-of-00003.safetensors  # Model weights (shard 3)
├── model.safetensors.index.json      # Weight index
├── quantize_config.json              # GPTQ quantization configuration
├── quant_log.csv                     # Quantization log (per-layer quantization error)
├── generation_config.json            # Generation configuration
├── processor_config.json             # Multimodal processor configuration
├── tokenizer.json                    # Tokenizer
├── tokenizer_config.json             # Tokenizer configuration
├── chat_template.jinja               # Chat template
├── vllm_flap_plugin.egg-info         # Plugin metadata
└── vllm_flap_plugin/                 # vLLM compatibility plugin
    ├── __init__.py
    └── README.md

Compression Summary

Compression Stage Method Reduction
FLAP Pruning Non-uniform FFN pruning ~15% FFN parameters
GPTQ INT4 Quantization Weight-only 4-bit quantization ~25% memory for quantized modules

Total Model Size: ~9.9 GB (compared to ~16 GB for the original bfloat16 model, approximately 38% reduction)

Downloads last month
285
Safetensors
Model size
8B params
Tensor type
BF16
·
I32
·
F16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for ISCASRGL/gemma4-lite-round2

Quantized
(242)
this model