You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Gemma4-lite-v1

Model Overview

This model is a pruned and quantized version of google/gemma-4-E4B-it. Through structured pruning of the FFN layers and FP8 quantization, it significantly reduces parameter count and inference overhead while maintaining model performance.

Base Model Information

Base Model: google/gemma-4-E4B-it
Model Type: Multimodal Large Language Model (Text + Image + Audio)
Architecture: Gemma4ForConditionalGeneration
Layers: 42 layers
Data Type: bfloat16 (unquantized parts), FP8 (quantized parts)

Compression Methods

1. FLAP Pruning (Fluctuation-based Adaptive Structured Pruning)

This model follows the FLAP method for FFN structured pruning.

Bias Compensation
- Adds bias compensation for the average contribution of pruned channels, maintaining good performance without retraining
- Original Gemma4's down_proj has no bias; bias parameters are added after pruning

2. FP8 Quantization (compressed-tensors format)

This model uses FP8 (8-bit floating point) quantization following the compressed-tensors format for efficient inference with vLLM.

Quantization Method: FP8 E4M3FN (float8_e4m3fn)
Weight Quantization: Channel-wise symmetric quantization
Activation Quantization: Dynamic token-wise quantization
Quantization Format: compressed-tensors

Quantized Modules

The following modules in the language model are quantized to FP8:

Module Type	Description
`self_attn.o_proj`	Attention output projection
`mlp.gate_proj`	MLP gate projection
`mlp.up_proj`	MLP up projection
`mlp.down_proj`	MLP down projection

Total: 168 quantized modules (42 layers × 4 modules per layer)

Unquantized Modules

The following modules remain in bfloat16:

Vision tower weights (model.vision_tower.*)
Audio branch weights (model.audio_tower.*)
Query/Key/Value projections (self_attn.q_proj, self_attn.k_proj, self_attn.v_proj)
Embeddings (embed_tokens.weight, embed_tokens_per_layer.weight)
LayerNorm/RMSNorm weights
All bias tensors (including FLAP down_proj.bias)
Language model head (lm_head)

Pruning Configuration

This model adopts a non-uniform pruning strategy, with differentiated processing for Gemma4's YOCO architecture:

Layer Range	Role	Pruning Ratio	intermediate_size
0-11	self-decoder	0%	10240
12-16	sliding_attention	20%	8192
17	full_attention	0%	10240
18-22	sliding_attention	20%	8192
23	self-decoder (full_attention)	0%	10240
24-27	sliding_attention	20%	8192
28-29	the first kv_shared_layer and full_attention	0%	10240
30-34	sliding_attention	20%	8192
35-41	sliding_attention & full_attention	0%	10240

intermediate_size Distribution After Pruning

intermediate_size	Layer Count	Description
10240 (original)	23 layers	Unpruned
8192 (20% pruned)	19 layers	Pruned

FFN Parameter Compression: ~15%

Model Structure Changes

Configuration Changes

{
  "text_config": {
    "intermediate_size": 10240,
    "intermediate_sizes": [10240, 10240, ..., 8192, 8192, ...],
    "flap_pruned": true
  },
  "quantization_config": {
    "quant_method": "compressed-tensors",
    "format": "float-quantized",
    "config_groups": {
      "group_0": {
        "weights": {
          "num_bits": 8,
          "type": "float",
          "strategy": "channel"
        },
        "input_activations": {
          "num_bits": 8,
          "type": "float",
          "strategy": "token",
          "dynamic": true
        }
      }
    }
  }
}

intermediate_sizes: Actual intermediate_size per layer (added after non-uniform pruning)
flap_pruned: Indicates the model has undergone FLAP pruning
quantization_config: FP8 quantization configuration in compressed-tensors format

Weight Changes

gate_proj / up_proj: Rows corresponding to pruned channels are removed
down_proj:
- Columns corresponding to pruned channels are removed
- New bias parameter added (bias compensation values)
Quantized weights: Stored as FP8 E4M3FN with corresponding scale tensors (*_scale)

Usage

vLLM Deployment

# Download the model with all files to local storage
MODEL_DIR=$(python -c "from huggingface_hub import snapshot_download; print(snapshot_download('ISCASRGL/gemma4-lite-v1'))")

# Set PYTHONPATH to include the plugin (required due to model modifications from pruning)
export PYTHONPATH="$MODEL_DIR:$PYTHONPATH"

# Start vLLM service
vllm serve ISCASRGL/gemma4-lite-v1 --config $MODEL_DIR/vllm_config.yaml

vLLM Plugin Description

This model includes vllm_flap_plugin for direct deployment of FLAP-pruned models in vLLM:

per-layer intermediate_size: Supports different FFN widths per layer after non-uniform pruning
FLAP bias compensation: Adds bias support for down_proj
Conditional patch: Only activates when config.flap_pruned=True, does not affect non-FLAP models

Technical Details

Bias Compensation Principle

Before pruning: FFN(x) = down_proj(act(gate_proj(x)) * up_proj(x))
After pruning: FFN'(x) = down_proj_pruned(h_pruned) + output_bias

Where:
output_bias = Σ_{j∈pruned} E[h_j] × W_down[:, j]
            = (E[h] * ~mask) @ W_down.T

FP8 Quantization Details

Quantized weight: W_fp8 = clamp(W_f32 / scale, -FP8_MAX, FP8_MAX).to(float8_e4m3fn)
Scale: scale = max(|W|, dim=1) / FP8_MAX

Where FP8_MAX = 448.0 (max value of float8_e4m3fn)

File Structure

gemma4-lite-v1/
├── config.json                    # Model configuration (with quantization_config)
├── model-00001-of-00004.safetensors  # Model weights (sharded)
├── model-00002-of-00004.safetensors
├── model-00003-of-00004.safetensors
├── model-00004-of-00004.safetensors
├── model.safetensors.index.json   # Weight index
├── flap_bias_info.json            # FLAP pruning metadata
├── tokenizer.json                 # Tokenizer
├── tokenizer_config.json
├── generation_config.json
├── processor_config.json          # Multimodal processor config
├── chat_template.jinja            # Chat template
├── vllm_config.yaml               # vLLM configuration
├── vllm_flap_plugin.egg-info      # Plugin info
└── vllm_flap_plugin/              # vLLM compatibility plugin
    ├── __init__.py
    └── README.md

Compression Summary

Compression Stage	Method	Reduction
FLAP Pruning	Non-uniform FFN pruning	~15% FFN parameters
FP8 Quantization	Weight + Activation quantization	~50% memory for quantized modules

Total Compression: Approximately 30-40% reduction in model size compared to the original bfloat16 model.

Downloads last month: 98

Safetensors

Model size

8B params

Tensor type

BF16

F8_E4M3

Inference Providers NEW

Any-to-Any

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for ISCASRGL/gemma4-lite-v1

Base model

google/gemma-4-E4B

Finetuned

google/gemma-4-E4B-it

Quantized

(202)

this model