ANEMLL

QAT version of Gemma 3 4B, quantized for the Apple Neural Engine (ANE).

Gemma 3 models trained in BF16 can produce residual-stream activations that exceed FP16's representable range (±65,504). Because the ANE only runs FP16 inference, this causes the following failures (demonstrated after the list):

  • Overflow to inf in FP16 computation
  • NaN propagation through subsequent layers
  • Complete model failure on the ANE
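
A minimal NumPy sketch of the failure mode (the values are illustrative, not taken from the model):

```python
import numpy as np

# 60,000 and 1.25 are both representable in FP16, but their
# product (75,000) exceeds the FP16 maximum of 65,504 -> inf
x = np.float16(60000.0) * np.float16(1.25)
print(x)      # inf

# Downstream arithmetic on inf readily produces NaN, which then
# propagates through every subsequent layer
print(x - x)  # nan
```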

Model Quality Benchmarks

FP16 Scaling for ANE Compatibility

Gemma 3 4B QAT models produce activations that exceed the FP16 range (±65,504) during inference. We apply weight scaling (α=0.1875) to prevent overflow (see the sketch after this list):

  • Embedding weights scaled by α=0.1875 (3/16)
  • LM head logits divided by α to restore original scale
  • Zero runtime overhead - transformation applied at conversion time
  • 100% token match with BF16 reference
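
A minimal sketch of the transformation, assuming PyTorch-style weight tensors (illustrative, not ANEMLL's actual conversion code):

```python
import torch

ALPHA = 0.1875  # 3/16 -- exactly representable in FP16

def scale_embeddings(embed_weight: torch.Tensor) -> torch.Tensor:
    """Applied once at conversion time: shrinking the embedding
    weights shrinks the residual stream, keeping activations
    inside FP16's ±65,504 range."""
    return embed_weight * ALPHA

def rescale_logits(logits: torch.Tensor) -> torch.Tensor:
    """Folded into the LM head: dividing by the same factor
    restores the original logit scale, so greedy decoding can
    match the BF16 reference token-for-token."""
    return logits / ALPHA
```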

Quantization Results

| Configuration | KL Divergence | Correlation | Match Rate | Notes |
|---|---|---|---|---|
| FP16 baseline (no LUT) | 0.0006 | 0.995 | 99.86% | Best quality |
| FFN LUT4,4 + LM LUT6,4 | 0.196 | 0.959 | 90% | This model |
| FFN LUT4,8 only | 0.284 | 0.971 | 87% | Larger size |
| FFN LUT4,8 + LM LUT6,4 | 0.279 | 0.970 | 86% | - |

Metric Guidelines

| Metric | Healthy | Concerning |
|---|---|---|
| KL Divergence | < 0.3 | > 0.5 |
| Correlation | > 0.95 | < 0.90 |
| Match Rate | > 85% | < 75% |
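
These metrics compare per-token logits from the quantized model against a BF16 reference. A sketch, assuming KL divergence is measured on the output distributions, correlation on raw logits, and match rate on greedy (argmax) tokens (the exact definitions used above are an assumption):

```python
import numpy as np
from scipy.special import log_softmax

def quality_metrics(ref_logits: np.ndarray, test_logits: np.ndarray) -> dict:
    """Compare (num_tokens, vocab_size) logits from a BF16 reference
    and a quantized model."""
    log_p = log_softmax(ref_logits, axis=-1)   # reference distribution
    log_q = log_softmax(test_logits, axis=-1)  # quantized distribution
    kl = float(np.mean(np.sum(np.exp(log_p) * (log_p - log_q), axis=-1)))
    corr = float(np.corrcoef(ref_logits.ravel(), test_logits.ravel())[0, 1])
    match = float(np.mean(ref_logits.argmax(-1) == test_logits.argmax(-1)))
    return {"kl_divergence": kl, "correlation": corr, "match_rate": match}
```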

Reference

  • HF Model: google/gemma-3-4b-it-qat-int4-unquantized
  • Scaling: α=0.1875 (FP16 overflow prevention)
  • Context: 4096 tokens
  • Sliding Window: 1024

LUT Quantization

ANEMLL applies LUT (lookup table) palettization to compress weights. This model uses per-channel grouping (per_channel=4), where every 4 output channels share one lookup table, giving finer granularity and better quality than a single per-tensor table.
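
A hedged sketch of equivalent palettization using the coremltools 8 optimize API (the package path is hypothetical, and ANEMLL's actual pipeline settings may differ):

```python
import coremltools as ct
from coremltools.optimize.coreml import (
    OpPalettizerConfig,
    OptimizationConfig,
    palettize_weights,
)

# 4-bit k-means LUT with one table per group of 4 output channels,
# mirroring this model's FFN setting (the LM head uses nbits=6)
op_config = OpPalettizerConfig(
    mode="kmeans",
    nbits=4,
    granularity="per_grouped_channel",
    group_size=4,
)

mlmodel = ct.models.MLModel("gemma3_ffn_chunk.mlpackage")  # hypothetical path
compressed = palettize_weights(mlmodel, OptimizationConfig(global_config=op_config))
compressed.save("gemma3_ffn_chunk_lut4.mlpackage")
```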


Model Architecture

4-Function Chunk Design

Each FFN chunk contains 4 CoreML functions to support Gemma 3's sliding window attention:

| Function | Purpose |
|---|---|
| infer | Single-token inference (position < sliding_window) |
| prefill | Batch prefill processing (position < sliding_window) |
| infer_rotate | Single-token inference after KV rotation (position ≥ sliding_window) |
| prefill_rotate | Batch prefill after KV rotation (position ≥ sliding_window) |

Why 4 functions? When the position reaches the sliding-window boundary (1024), the KV cache must be rotated. The *_rotate functions handle inference after rotation, enabling efficient sliding-window attention without recomputing the entire cache.
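
A sketch of the dispatch logic a runtime might use (the function name strings match the table above; everything else is illustrative, not the actual ANEMLL runtime API):

```python
SLIDING_WINDOW = 1024

def select_function(position: int, is_prefill: bool) -> str:
    """Choose which of the 4 CoreML functions to call for this step."""
    rotated = position >= SLIDING_WINDOW  # KV cache has been rotated
    if is_prefill:
        return "prefill_rotate" if rotated else "prefill"
    return "infer_rotate" if rotated else "infer"
```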

Gemma 3 Hybrid Local-Global Attention

Gemma 3 uses a hybrid attention architecture that combines local and global attention:

Local Attention (Sliding Window)

  • Applied to most layers
  • Window size: 1024 tokens
  • Efficient O(n × w) complexity where w = window size
  • Bounded memory usage for long sequences
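
To make the local pattern concrete, a minimal NumPy sketch of a sliding-window causal mask (illustrative, not the converter's implementation):

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int = 1024) -> np.ndarray:
    """True where query position q may attend to key position k:
    causal (k <= q) and within the last `window` positions."""
    q = np.arange(seq_len)[:, None]
    k = np.arange(seq_len)[None, :]
    return (k <= q) & (k > q - window)
```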

Global Attention

  • Applied every 4th layer (layers 4, 8, 12, ...)
  • Full context access up to 4096 tokens
  • Preserves long-range dependencies

Advantages:

  1. Memory Efficiency: ~75% smaller KV cache compared to full attention
  2. Speed: Faster inference for long sequences
  3. Quality: Global attention layers maintain document-level coherence
  4. Scalability: Efficiently handles longer contexts than pure global attention
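
A back-of-the-envelope check of the memory figure: at full context, a local-attention layer caches at most `window` positions instead of `context` positions (assuming the ~75% number is this per-local-layer comparison):

```python
CTX, WINDOW = 4096, 1024

# Per local-attention layer, the KV cache is bounded by the window
per_layer_reduction = 1 - WINDOW / CTX
print(f"{per_layer_reduction:.0%}")  # 75%
```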

iOS/macOS Distribution

This folder contains pre-compiled CoreML models (.mlmodelc directories) ready for iOS and macOS deployment. No zip extraction required.

Requirements

  • macOS 15 (Sequoia) or later with Apple Silicon
  • iOS 17+ for mobile deployment
  • 8GB+ RAM recommended
  • Python 3.9+ for testing
  • CoreML Tools 8.x+ and HuggingFace Transformers

Installation

```bash
# Install Git LFS (required for large model files)
brew install git-lfs
git lfs install

# Install Python dependencies
pip install coremltools transformers

# Clone the repository
git clone https://huggingface.co/anemll/anemll-google-gemma-3-4b-it-qat-int4-unquantized-ctx4096_0.3.5

# Navigate to the iOS/macOS distribution folder
cd anemll-google-gemma-3-4b-it-qat-int4-unquantized-ctx4096_0.3.5/ios

# Verify models are ready (you should see .mlmodelc directories)
ls -la *.mlmodelc
```

You should see four .mlmodelc directories:

```
gemma3_embeddings.mlmodelc
gemma3_FFN_PF_lut4_chunk_01of02.mlmodelc
gemma3_FFN_PF_lut4_chunk_02of02.mlmodelc
gemma3_lm_head_lut6.mlmodelc
```
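
To verify from Python that a compiled model loads (a minimal sketch; chat.py handles loading for you):

```python
import coremltools as ct

# Load a compiled .mlmodelc directly; CPU_AND_NE asks CoreML to
# prefer the Neural Engine for supported ops
model = ct.models.CompiledMLModel(
    "gemma3_embeddings.mlmodelc",
    compute_units=ct.ComputeUnit.CPU_AND_NE,
)
print("loaded:", type(model).__name__)
```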

Test Inference

```bash
# Basic chat interface
python chat.py --meta ./meta.yaml

# Or with conversation history
python chat_full.py --meta ./meta.yaml
```

Controls:

  • Ctrl-D to exit
  • Ctrl-C to interrupt generation

Note: The first load takes time while macOS compiles and caches the model for the ANE. Subsequent loads are near-instant.


Model Specifications

| Parameter | Value |
|---|---|
| Base Model | google/gemma-3-4b-it-qat-int4-unquantized |
| Context Length | 4096 tokens |
| Sliding Window | 1024 tokens |
| Batch Size | 64 |
| Number of Chunks | 2 |
| FFN Quantization | LUT4 (4-bit, per_channel=4) |
| LM Head Quantization | LUT6 (6-bit, per_channel=4) |
| Embeddings | FP16 (unquantized) |
| FP16 Scaling | α=0.1875 |

Model Files

| File | Size | Description |
|---|---|---|
| gemma3_embeddings.mlmodelc | 1.3 GB | Token embeddings (FP16) |
| gemma3_FFN_PF_lut4_chunk_01of02.mlmodelc | 788 MB | FFN layers 1-13 + prefill |
| gemma3_FFN_PF_lut4_chunk_02of02.mlmodelc | 788 MB | FFN layers 14-26 |
| gemma3_lm_head_lut6.mlmodelc | 488 MB | Language model head |

Total Size: ~3.4 GB


iOS/macOS App Integration

  1. Copy .mlmodelc directories to your Xcode project
  2. Include config.json for offline tokenizer
  3. Reference: ANEMLL ChatBot TestFlight

  • iOS: requires unzipped .mlmodelc directories (this distribution) plus config.json
  • macOS: supports both zipped and unzipped formats

