ANEMLL
Gemma 3 4B QAT, quantized for the Apple Neural Engine.
The Apple Neural Engine requires FP16 inference, but Gemma 3 models trained in BF16 can produce residual-stream activations that exceed FP16's representable range (±65,504). This causes:
- Overflow to infinity in FP16 computation
- NaN propagation through subsequent layers
- Complete model failure on ANE (which uses FP16)
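The failure mode is easy to reproduce in isolation. A minimal NumPy demonstration of FP16 overflow and the NaN propagation that follows:

```python
import numpy as np

x = np.float16(60000.0)
y = x * np.float16(2.0)  # 120000 exceeds the FP16 max of 65504 -> inf
print(y)                 # inf
print(y - y)             # inf - inf -> nan, which then propagates
```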
Model Quality Benchmarks
FP16 Scaling for ANE Compatibility
Gemma 3 4B QAT models produce activations that exceed the FP16 range (±65,504) during inference. We apply weight scaling (α=0.1875) to prevent overflow:
- Embedding weights scaled by α=0.1875 (3/16)
- LM head logits divided by α to restore original scale
- Zero runtime overhead - transformation applied at conversion time
- 100% token match with BF16 reference
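Conceptually, the transformation looks like the sketch below. The tensor and function names are illustrative, not ANEMLL's actual conversion code:

```python
import numpy as np

ALPHA = 0.1875  # 3/16, baked into the weights at conversion time

def scale_for_fp16(embed_weight: np.ndarray) -> np.ndarray:
    # Scaling the embedding weights keeps the residual stream
    # inside FP16's ±65,504 range during inference.
    return (embed_weight * ALPHA).astype(np.float16)

def restore_logits(lm_head_logits: np.ndarray) -> np.ndarray:
    # The LM head sees alpha-scaled hidden states, so dividing its
    # logits by ALPHA restores the original scale (hence the 100%
    # token match with the BF16 reference).
    return lm_head_logits / ALPHA
```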
Quantization Results
| Configuration | KL Divergence | Correlation | Match Rate | Notes |
|---|---|---|---|---|
| FP16 baseline (no LUT) | 0.0006 | 0.995 | 99.86% | Best quality |
| FFN LUT4,4 + LM LUT6,4 | 0.196 | 0.959 | 90% | This model |
| FFN LUT4,8 only | 0.284 | 0.971 | 87% | Larger size |
| FFN LUT4,8 + LM LUT6,4 | 0.279 | 0.970 | 86% | - |
Metric Guidelines
| Metric | Healthy | Concerning |
|---|---|---|
| KL Divergence | < 0.3 | > 0.5 |
| Correlation | > 0.95 | < 0.90 |
| Match Rate | > 85% | < 75% |
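The evaluation harness itself isn't included in this distribution, but metrics of this kind can be computed from paired logits along the following lines (a sketch; the exact KL direction and averaging used for the tables above aren't specified here):

```python
import numpy as np
from scipy.special import log_softmax, softmax

def logit_metrics(ref_logits: np.ndarray, test_logits: np.ndarray):
    """Compare reference and quantized logits of shape (positions, vocab)."""
    p = softmax(ref_logits, axis=-1)
    log_p = log_softmax(ref_logits, axis=-1)
    log_q = log_softmax(test_logits, axis=-1)
    kl = float(np.mean(np.sum(p * (log_p - log_q), axis=-1)))  # KL(ref || quantized)
    corr = float(np.corrcoef(ref_logits.ravel(), test_logits.ravel())[0, 1])
    match = float(np.mean(ref_logits.argmax(-1) == test_logits.argmax(-1)))
    return kl, corr, match
```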
Reference
- HF Model: google/gemma-3-4b-it-qat-int4-unquantized
- Scaling: α=0.1875 (FP16 overflow prevention)
- Context: 4096 tokens
- Sliding Window: 1024
LUT Quantization
ANEMLL applies LUT (Lookup Table) palettization to compress weights. This model uses per-channel grouping (per_channel=4) where every 4 output channels share one lookup table for finer granularity and better quality.
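As a rough sketch, a palettization of this shape can be expressed with coremltools 8.x as below; the actual ANEMLL conversion pipeline may configure it differently, and the input path is hypothetical:

```python
import coremltools as ct
from coremltools.optimize.coreml import (
    OpPalettizerConfig,
    OptimizationConfig,
    palettize_weights,
)

# 4-bit LUT, one lookup table shared per group of 4 output channels
op_config = OpPalettizerConfig(
    mode="kmeans",
    nbits=4,
    granularity="per_grouped_channel",
    group_size=4,
)
config = OptimizationConfig(global_op_config=op_config)

mlmodel = ct.models.MLModel("ffn_chunk.mlpackage")  # hypothetical input path
compressed = palettize_weights(mlmodel, config)
compressed.save("ffn_chunk_lut4.mlpackage")
```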
Model Architecture
4-Function Chunk Design
Each FFN chunk contains 4 CoreML functions to support Gemma 3's sliding window attention:
| Function | Purpose |
|---|---|
| `infer` | Single-token inference (position < sliding_window) |
| `prefill` | Batch prefill processing (position < sliding_window) |
| `infer_rotate` | Single-token inference after KV rotation (position ≥ sliding_window) |
| `prefill_rotate` | Batch prefill after KV rotation (position ≥ sliding_window) |
Why 4 functions? When the position reaches the sliding window boundary (1024), the KV cache must be rotated. The `*_rotate` functions handle inference after rotation, enabling efficient sliding window attention without recomputing the entire cache.
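On the host side, the dispatch reduces to a position check. A sketch (names are illustrative, not the chat.py implementation):

```python
import numpy as np

SLIDING_WINDOW = 1024

def select_function(position: int, prefill: bool) -> str:
    """Pick which of the chunk's four CoreML functions to invoke."""
    rotated = position >= SLIDING_WINDOW
    if prefill:
        return "prefill_rotate" if rotated else "prefill"
    return "infer_rotate" if rotated else "infer"

def rotate_kv(cache: np.ndarray) -> np.ndarray:
    """Shift the sliding-window cache left by one slot, freeing the last
    position for the newest token. The (layers, window, heads, dim)
    layout is hypothetical."""
    return np.roll(cache, shift=-1, axis=1)
```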
Gemma 3 Hybrid Local-Global Attention
Gemma 3 uses a hybrid attention architecture that combines local and global attention:
Local Attention (Sliding Window)
- Applied to most layers
- Window size: 1024 tokens
- Efficient O(n × w) complexity where w = window size
- Bounded memory usage for long sequences
Global Attention
- Applied every 6th layer (5:1 local-to-global interleave)
- Full context access up to 4096 tokens
- Preserves long-range dependencies
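The interleave can be described as a simple layer-index rule. A sketch, with the local-to-global ratio left as a parameter since the exact value comes from the model's configuration:

```python
def attention_kind(layer_idx: int, global_every: int) -> str:
    """Return 'global' for full-context layers, 'local' for sliding-window layers."""
    return "global" if (layer_idx + 1) % global_every == 0 else "local"

# e.g. the pattern for the first 12 layers with one global layer per block of 6
print([attention_kind(i, global_every=6) for i in range(12)])
```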
Advantages:
- Memory Efficiency: ~75% smaller KV cache compared to full attention
- Speed: Faster inference for long sequences
- Quality: Global attention layers maintain document-level coherence
- Scalability: Efficiently handles longer contexts than pure global attention
iOS/macOS Distribution
This folder contains pre-compiled CoreML models (.mlmodelc directories) ready for iOS and macOS deployment. No zip extraction required.
Requirements
- macOS 15 (Sequoia) or later with Apple Silicon
- iOS 17+ for mobile deployment
- 8GB+ RAM recommended
- Python 3.9+ for testing
- CoreML Tools 8.x+ and HuggingFace Transformers
Installation
```bash
# Install Git LFS (required for large model files)
brew install git-lfs
git lfs install

# Install Python dependencies
pip install coremltools transformers

# Clone the repository
git clone https://huggingface.co/anemll/anemll-google-gemma-3-4b-it-qat-int4-unquantized-ctx4096_0.3.5

# Navigate to the iOS/macOS distribution folder
cd anemll-google-gemma-3-4b-it-qat-int4-unquantized-ctx4096_0.3.5/ios

# Verify models are ready (you should see .mlmodelc directories)
ls -la *.mlmodelc
```
You should see four .mlmodelc directories:
```
gemma3_embeddings.mlmodelc
gemma3_FFN_PF_lut4_chunk_01of02.mlmodelc
gemma3_FFN_PF_lut4_chunk_02of02.mlmodelc
gemma3_lm_head_lut6.mlmodelc
```
Test Inference
```bash
# Basic chat interface
python chat.py --meta ./meta.yaml

# Or with conversation history
python chat_full.py --meta ./meta.yaml
```
Controls:
- `Ctrl-D` to exit
- `Ctrl-C` to interrupt generation
Note: First load takes time as macOS places the model on ANE. Subsequent loads are instant.
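For a quick programmatic check outside the chat scripts, the compiled `.mlmodelc` directories can be loaded directly with coremltools (version ≥7 provides `CompiledMLModel`); the prediction inputs depend on each model's interface, so the call below is left illustrative:

```python
import coremltools as ct

# Load a compiled .mlmodelc directly; the first load is slow while macOS
# specializes the network for the Neural Engine, later loads are cached.
model = ct.models.CompiledMLModel(
    "gemma3_embeddings.mlmodelc",
    compute_units=ct.ComputeUnit.CPU_AND_NE,
)

# out = model.predict({"input_ids": ...})  # inputs depend on the model's interface
```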
Model Specifications
| Parameter | Value |
|---|---|
| Base Model | google/gemma-3-4b-it-qat-int4-unquantized |
| Context Length | 4096 tokens |
| Sliding Window | 1024 tokens |
| Batch Size | 64 |
| Number of Chunks | 2 |
| FFN Quantization | LUT4 (4-bit, per_channel=4) |
| LM Head Quantization | LUT6 (6-bit, per_channel=4) |
| Embeddings | FP16 (unquantized) |
| FP16 Scaling | α=0.1875 |
Model Files
| File | Size | Description |
|---|---|---|
| `gemma3_embeddings.mlmodelc` | 1.3 GB | Token embeddings (FP16) |
| `gemma3_FFN_PF_lut4_chunk_01of02.mlmodelc` | 788 MB | FFN layers 1-13 + prefill |
| `gemma3_FFN_PF_lut4_chunk_02of02.mlmodelc` | 788 MB | FFN layers 14-26 |
| `gemma3_lm_head_lut6.mlmodelc` | 488 MB | Language model head |
Total Size: ~3.4 GB
iOS/macOS App Integration
- Copy `.mlmodelc` directories to your Xcode project
- Include `config.json` for the offline tokenizer
- Reference: ANEMLL ChatBot TestFlight

iOS: Requires unzipped `.mlmodelc` (this distribution) + `config.json`
macOS: Supports both zipped and unzipped formats
License
- ANEMLL Pipeline: MIT License
- Gemma Model: Google Gemma Terms of Use