You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Log in or Sign Up to review the conditions and access this model content.

Gemma4-lite-v1

Model Overview

This model is a pruned and quantized version of google/gemma-4-E4B-it. Through structured pruning of the FFN layers and FP8 quantization, it significantly reduces parameter count and inference overhead while maintaining model performance.

Base Model Information

  • Base Model: google/gemma-4-E4B-it
  • Model Type: Multimodal Large Language Model (Text + Image + Audio)
  • Architecture: Gemma4ForConditionalGeneration
  • Layers: 42 layers
  • Data Type: bfloat16 (unquantized parts), FP8 (quantized parts)

Compression Methods

1. FLAP Pruning (Fluctuation-based Adaptive Structured Pruning)

This model follows the FLAP method for FFN structured pruning.

  1. Bias Compensation
    • Adds bias compensation for the average contribution of pruned channels, maintaining good performance without retraining
    • Original Gemma4's down_proj has no bias; bias parameters are added after pruning

2. FP8 Quantization (compressed-tensors format)

This model uses FP8 (8-bit floating point) quantization following the compressed-tensors format for efficient inference with vLLM.

  • Quantization Method: FP8 E4M3FN (float8_e4m3fn)
  • Weight Quantization: Channel-wise symmetric quantization
  • Activation Quantization: Dynamic token-wise quantization
  • Quantization Format: compressed-tensors

Quantized Modules

The following modules in the language model are quantized to FP8:

Module Type Description
self_attn.o_proj Attention output projection
mlp.gate_proj MLP gate projection
mlp.up_proj MLP up projection
mlp.down_proj MLP down projection

Total: 168 quantized modules (42 layers × 4 modules per layer)

Unquantized Modules

The following modules remain in bfloat16:

  • Vision tower weights (model.vision_tower.*)
  • Audio branch weights (model.audio_tower.*)
  • Query/Key/Value projections (self_attn.q_proj, self_attn.k_proj, self_attn.v_proj)
  • Embeddings (embed_tokens.weight, embed_tokens_per_layer.weight)
  • LayerNorm/RMSNorm weights
  • All bias tensors (including FLAP down_proj.bias)
  • Language model head (lm_head)

Pruning Configuration

This model adopts a non-uniform pruning strategy, with differentiated processing for Gemma4's YOCO architecture:

Layer Range Role Pruning Ratio intermediate_size
0-11 self-decoder 0% 10240
12-16 sliding_attention 20% 8192
17 full_attention 0% 10240
18-22 sliding_attention 20% 8192
23 self-decoder (full_attention) 0% 10240
24-27 sliding_attention 20% 8192
28-29 the first kv_shared_layer and full_attention 0% 10240
30-34 sliding_attention 20% 8192
35-41 sliding_attention & full_attention 0% 10240

intermediate_size Distribution After Pruning

intermediate_size Layer Count Description
10240 (original) 23 layers Unpruned
8192 (20% pruned) 19 layers Pruned

FFN Parameter Compression: ~15%

Model Structure Changes

Configuration Changes

{
  "text_config": {
    "intermediate_size": 10240,
    "intermediate_sizes": [10240, 10240, ..., 8192, 8192, ...],
    "flap_pruned": true
  },
  "quantization_config": {
    "quant_method": "compressed-tensors",
    "format": "float-quantized",
    "config_groups": {
      "group_0": {
        "weights": {
          "num_bits": 8,
          "type": "float",
          "strategy": "channel"
        },
        "input_activations": {
          "num_bits": 8,
          "type": "float",
          "strategy": "token",
          "dynamic": true
        }
      }
    }
  }
}
  • intermediate_sizes: Actual intermediate_size per layer (added after non-uniform pruning)
  • flap_pruned: Indicates the model has undergone FLAP pruning
  • quantization_config: FP8 quantization configuration in compressed-tensors format

Weight Changes

  1. gate_proj / up_proj: Rows corresponding to pruned channels are removed
  2. down_proj:
    • Columns corresponding to pruned channels are removed
    • New bias parameter added (bias compensation values)
  3. Quantized weights: Stored as FP8 E4M3FN with corresponding scale tensors (*_scale)

Usage

vLLM Deployment

# Download the model with all files to local storage
MODEL_DIR=$(python -c "from huggingface_hub import snapshot_download; print(snapshot_download('ISCASRGL/gemma4-lite-v1'))")

# Set PYTHONPATH to include the plugin (required due to model modifications from pruning)
export PYTHONPATH="$MODEL_DIR:$PYTHONPATH"

# Start vLLM service
vllm serve ISCASRGL/gemma4-lite-v1 --config $MODEL_DIR/vllm_config.yaml 

vLLM Plugin Description

This model includes vllm_flap_plugin for direct deployment of FLAP-pruned models in vLLM:

  • per-layer intermediate_size: Supports different FFN widths per layer after non-uniform pruning
  • FLAP bias compensation: Adds bias support for down_proj
  • Conditional patch: Only activates when config.flap_pruned=True, does not affect non-FLAP models

Technical Details

Bias Compensation Principle

Before pruning: FFN(x) = down_proj(act(gate_proj(x)) * up_proj(x))
After pruning: FFN'(x) = down_proj_pruned(h_pruned) + output_bias

Where:
output_bias = Σ_{j∈pruned} E[h_j] × W_down[:, j]
            = (E[h] * ~mask) @ W_down.T

FP8 Quantization Details

Quantized weight: W_fp8 = clamp(W_f32 / scale, -FP8_MAX, FP8_MAX).to(float8_e4m3fn)
Scale: scale = max(|W|, dim=1) / FP8_MAX

Where FP8_MAX = 448.0 (max value of float8_e4m3fn)

File Structure

gemma4-lite-v1/
├── config.json                    # Model configuration (with quantization_config)
├── model-00001-of-00004.safetensors  # Model weights (sharded)
├── model-00002-of-00004.safetensors
├── model-00003-of-00004.safetensors
├── model-00004-of-00004.safetensors
├── model.safetensors.index.json   # Weight index
├── flap_bias_info.json            # FLAP pruning metadata
├── tokenizer.json                 # Tokenizer
├── tokenizer_config.json
├── generation_config.json
├── processor_config.json          # Multimodal processor config
├── chat_template.jinja            # Chat template
├── vllm_config.yaml               # vLLM configuration
├── vllm_flap_plugin.egg-info      # Plugin info
└── vllm_flap_plugin/              # vLLM compatibility plugin
    ├── __init__.py
    └── README.md

Compression Summary

Compression Stage Method Reduction
FLAP Pruning Non-uniform FFN pruning ~15% FFN parameters
FP8 Quantization Weight + Activation quantization ~50% memory for quantized modules

Total Compression: Approximately 30-40% reduction in model size compared to the original bfloat16 model.

Downloads last month
98
Safetensors
Model size
8B params
Tensor type
BF16
·
F8_E4M3
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for ISCASRGL/gemma4-lite-v1

Quantized
(202)
this model