Instructions to use ISCASRGL/gemma4-lite-v1 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use ISCASRGL/gemma4-lite-v1 with Transformers:
# Load model directly from transformers import AutoProcessor, AutoModelForImageTextToText processor = AutoProcessor.from_pretrained("ISCASRGL/gemma4-lite-v1") model = AutoModelForImageTextToText.from_pretrained("ISCASRGL/gemma4-lite-v1") - Notebooks
- Google Colab
- Kaggle
Gemma4-lite-v1
Model Overview
This model is a pruned and quantized version of google/gemma-4-E4B-it. Through structured pruning of the FFN layers and FP8 quantization, it significantly reduces parameter count and inference overhead while maintaining model performance.
Base Model Information
- Base Model: google/gemma-4-E4B-it
- Model Type: Multimodal Large Language Model (Text + Image + Audio)
- Architecture: Gemma4ForConditionalGeneration
- Layers: 42 layers
- Data Type: bfloat16 (unquantized parts), FP8 (quantized parts)
Compression Methods
1. FLAP Pruning (Fluctuation-based Adaptive Structured Pruning)
This model follows the FLAP method for FFN structured pruning.
- Bias Compensation
- Adds bias compensation for the average contribution of pruned channels, maintaining good performance without retraining
- Original Gemma4's
down_projhas no bias; bias parameters are added after pruning
2. FP8 Quantization (compressed-tensors format)
This model uses FP8 (8-bit floating point) quantization following the compressed-tensors format for efficient inference with vLLM.
- Quantization Method: FP8 E4M3FN (float8_e4m3fn)
- Weight Quantization: Channel-wise symmetric quantization
- Activation Quantization: Dynamic token-wise quantization
- Quantization Format: compressed-tensors
Quantized Modules
The following modules in the language model are quantized to FP8:
| Module Type | Description |
|---|---|
self_attn.o_proj |
Attention output projection |
mlp.gate_proj |
MLP gate projection |
mlp.up_proj |
MLP up projection |
mlp.down_proj |
MLP down projection |
Total: 168 quantized modules (42 layers × 4 modules per layer)
Unquantized Modules
The following modules remain in bfloat16:
- Vision tower weights (
model.vision_tower.*) - Audio branch weights (
model.audio_tower.*) - Query/Key/Value projections (
self_attn.q_proj,self_attn.k_proj,self_attn.v_proj) - Embeddings (
embed_tokens.weight,embed_tokens_per_layer.weight) - LayerNorm/RMSNorm weights
- All bias tensors (including FLAP
down_proj.bias) - Language model head (
lm_head)
Pruning Configuration
This model adopts a non-uniform pruning strategy, with differentiated processing for Gemma4's YOCO architecture:
| Layer Range | Role | Pruning Ratio | intermediate_size |
|---|---|---|---|
| 0-11 | self-decoder | 0% | 10240 |
| 12-16 | sliding_attention | 20% | 8192 |
| 17 | full_attention | 0% | 10240 |
| 18-22 | sliding_attention | 20% | 8192 |
| 23 | self-decoder (full_attention) | 0% | 10240 |
| 24-27 | sliding_attention | 20% | 8192 |
| 28-29 | the first kv_shared_layer and full_attention | 0% | 10240 |
| 30-34 | sliding_attention | 20% | 8192 |
| 35-41 | sliding_attention & full_attention | 0% | 10240 |
intermediate_size Distribution After Pruning
| intermediate_size | Layer Count | Description |
|---|---|---|
| 10240 (original) | 23 layers | Unpruned |
| 8192 (20% pruned) | 19 layers | Pruned |
FFN Parameter Compression: ~15%
Model Structure Changes
Configuration Changes
{
"text_config": {
"intermediate_size": 10240,
"intermediate_sizes": [10240, 10240, ..., 8192, 8192, ...],
"flap_pruned": true
},
"quantization_config": {
"quant_method": "compressed-tensors",
"format": "float-quantized",
"config_groups": {
"group_0": {
"weights": {
"num_bits": 8,
"type": "float",
"strategy": "channel"
},
"input_activations": {
"num_bits": 8,
"type": "float",
"strategy": "token",
"dynamic": true
}
}
}
}
}
intermediate_sizes: Actual intermediate_size per layer (added after non-uniform pruning)flap_pruned: Indicates the model has undergone FLAP pruningquantization_config: FP8 quantization configuration in compressed-tensors format
Weight Changes
- gate_proj / up_proj: Rows corresponding to pruned channels are removed
- down_proj:
- Columns corresponding to pruned channels are removed
- New
biasparameter added (bias compensation values)
- Quantized weights: Stored as FP8 E4M3FN with corresponding scale tensors (
*_scale)
Usage
vLLM Deployment
# Download the model with all files to local storage
MODEL_DIR=$(python -c "from huggingface_hub import snapshot_download; print(snapshot_download('ISCASRGL/gemma4-lite-v1'))")
# Set PYTHONPATH to include the plugin (required due to model modifications from pruning)
export PYTHONPATH="$MODEL_DIR:$PYTHONPATH"
# Start vLLM service
vllm serve ISCASRGL/gemma4-lite-v1 --config $MODEL_DIR/vllm_config.yaml
vLLM Plugin Description
This model includes vllm_flap_plugin for direct deployment of FLAP-pruned models in vLLM:
- per-layer intermediate_size: Supports different FFN widths per layer after non-uniform pruning
- FLAP bias compensation: Adds bias support for
down_proj - Conditional patch: Only activates when
config.flap_pruned=True, does not affect non-FLAP models
Technical Details
Bias Compensation Principle
Before pruning: FFN(x) = down_proj(act(gate_proj(x)) * up_proj(x))
After pruning: FFN'(x) = down_proj_pruned(h_pruned) + output_bias
Where:
output_bias = Σ_{j∈pruned} E[h_j] × W_down[:, j]
= (E[h] * ~mask) @ W_down.T
FP8 Quantization Details
Quantized weight: W_fp8 = clamp(W_f32 / scale, -FP8_MAX, FP8_MAX).to(float8_e4m3fn)
Scale: scale = max(|W|, dim=1) / FP8_MAX
Where FP8_MAX = 448.0 (max value of float8_e4m3fn)
File Structure
gemma4-lite-v1/
├── config.json # Model configuration (with quantization_config)
├── model-00001-of-00004.safetensors # Model weights (sharded)
├── model-00002-of-00004.safetensors
├── model-00003-of-00004.safetensors
├── model-00004-of-00004.safetensors
├── model.safetensors.index.json # Weight index
├── flap_bias_info.json # FLAP pruning metadata
├── tokenizer.json # Tokenizer
├── tokenizer_config.json
├── generation_config.json
├── processor_config.json # Multimodal processor config
├── chat_template.jinja # Chat template
├── vllm_config.yaml # vLLM configuration
├── vllm_flap_plugin.egg-info # Plugin info
└── vllm_flap_plugin/ # vLLM compatibility plugin
├── __init__.py
└── README.md
Compression Summary
| Compression Stage | Method | Reduction |
|---|---|---|
| FLAP Pruning | Non-uniform FFN pruning | ~15% FFN parameters |
| FP8 Quantization | Weight + Activation quantization | ~50% memory for quantized modules |
Total Compression: Approximately 30-40% reduction in model size compared to the original bfloat16 model.
- Downloads last month
- 98