gemma-4-e2b-flowcast-v3 · `flowcast-sota-v3`

Flowcast v3 is the production LoRA fine-tune of Gemma 4 E2B Text-int4 for macOS voice-agent desktop automation. It supersedes flowcast-sota-v1 with improved expanded-benchmark coverage while maintaining 100% on the core production gate.

Say it. Plan it. Do it.

What changed in v3

v3 is a surgical refine that targets expanded-benchmark failures (browser commands, out-of-distribution (OOD) retail/dev, paraphrases). Training resumes from the gen-refine adapter with targeted repair examples; it is not a full retrain.

Gate	v1	v3	Δ
Core hard eval (117)	100%	100%	—
Core held-out (27)	100%	100%	—
Expanded hard quality (170)	98.2%	99.4%	+1.2 pp
Expanded held-out (39)	97.4%	100%	+2.6 pp
Generalization suite	91.4%	97.1%	+5.7 pp
Core p50 latency	~1028ms	~1002ms	~same

Hard quality = task accuracy excluding latency SLA. v3 is the only variant that passes both production gates (100% core, ≥99% expanded quality).
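
The gate rule above can be written as a small check. This is a minimal sketch, not the gemmaflow-tune evaluation code: `GateScores` and `passes_production_gates` are hypothetical names, the thresholds come from the note above, and it is an assumption that the core gate applies to both the core hard eval and the core held-out set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateScores:
    """Accuracy percentages copied from the benchmark table (hypothetical container)."""
    core_hard: float
    core_held_out: float
    expanded_hard_quality: float


def passes_production_gates(s: GateScores) -> bool:
    # Core gate: 100% (assumed here to cover both core sets).
    core_ok = s.core_hard == 100.0 and s.core_held_out == 100.0
    # Expanded gate: hard quality of at least 99%.
    expanded_ok = s.expanded_hard_quality >= 99.0
    return core_ok and expanded_ok


# Scores taken from the v1 and v3 columns of the table.
v1 = GateScores(core_hard=100.0, core_held_out=100.0, expanded_hard_quality=98.2)
v3 = GateScores(core_hard=100.0, core_held_out=100.0, expanded_hard_quality=99.4)

print("v1 passes:", passes_production_gates(v1))
print("v3 passes:", passes_production_gates(v3))
```

With the table's numbers, v1 clears the core gate but falls short of the 99% expanded threshold, while v3 clears both.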

Quick start (MLX, Apple Silicon)

pip install mlx-lm huggingface_hub

from huggingface_hub import snapshot_download
from mlx_lm import load, generate

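# Download the int4 base model and the LoRA adapter from the Hugging Face Hub.
base = snapshot_download("mlx-community/Gemma4-E2B-IT-Text-int4")
adapter = snapshot_download("nsalerni/gemma-4-e2b-flowcast-v3")

# Load the base model with the adapter applied on top.
model, tokenizer = load(base, adapter_path=adapter)

# Format the request with the model's chat template, then generate.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Classify: go to gmail"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(generate(model, tokenizer, prompt=prompt, max_tokens=128))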

GemmaFlow / flowcast integration

from huggingface_hub import snapshot_download
from gemmaflow_tune.production import create_production_runner
from gemmaflow_tune.prompts import build_automation_planning_for_case

adapter = snapshot_download("nsalerni/gemma-4-e2b-flowcast-v3")
runner = create_production_runner(adapter_path=adapter)

case = {"transcript": "go to gmail", "tags": ["web_routing"]}
prompt = build_automation_planning_for_case(case, "hybrid_slim")
result = runner.generate("", user_content=prompt, max_tokens=256)
print(result.text)

Files

File	Description
`adapters.safetensors`	LoRA weights (checkpoint `0000030`)
`adapter_config.json`	LoRA config + base model reference
`inference_config.json`	Recommended runtime settings + benchmark scores

Recommended inference settings

{
  "prompt_mode": "hybrid_slim",
  "json_early_stop": true,
  "use_prompt_kv_cache": true,
  "temperature": 0.02,
  "top_p": 0.85,
  "max_tokens": 256
}
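
One way to consume these settings is to validate them and separate the sampling parameters from the runtime flags. This is a minimal sketch under assumptions: the JSON is embedded as a string so the example is self-contained (the exact layout of `inference_config.json` is not shown here), and `split_settings` is a hypothetical helper, not part of gemmaflow-tune or mlx-lm.

```python
import json

# Settings copied verbatim from the recommended block above.
RECOMMENDED = """
{
  "prompt_mode": "hybrid_slim",
  "json_early_stop": true,
  "use_prompt_kv_cache": true,
  "temperature": 0.02,
  "top_p": 0.85,
  "max_tokens": 256
}
"""

# Keys that control token sampling; everything else is treated as a runtime flag.
SAMPLING_KEYS = {"temperature", "top_p", "max_tokens"}


def split_settings(raw: str) -> tuple[dict, dict]:
    """Parse the settings and return (sampling, runtime) dictionaries."""
    settings = json.loads(raw)

    # Basic sanity checks on the sampling parameters.
    if settings["temperature"] < 0:
        raise ValueError("temperature must be non-negative")
    if not 0 < settings["top_p"] <= 1:
        raise ValueError("top_p must be in (0, 1]")
    if settings["max_tokens"] <= 0:
        raise ValueError("max_tokens must be positive")

    sampling = {k: v for k, v in settings.items() if k in SAMPLING_KEYS}
    runtime = {k: v for k, v in settings.items() if k not in SAMPLING_KEYS}
    return sampling, runtime


sampling, runtime = split_settings(RECOMMENDED)
print("sampling:", sampling)
print("runtime:", runtime)
```

The temperature of 0.02 keeps decoding close to greedy, which suits structured output where run-to-run variation is undesirable.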

Training lineage

Fine-tuned with gemmaflow-tune:

Base: mlx-community/Gemma4-E2B-IT-Text-int4 (0.7B active, ~2.5 GB disk)
Method: LoRA (rank 16, 16 layers, attn projections)
Pipeline: v1 SFT → v2 gen-refine → v3 expanded surgical refine (30 iters)
Predecessor adapter: flowcast-sota-v1 → flowcast-v2-gen-refine

Citation

@misc{gemma4e2bflowcastv32026,
  title  = {gemma-4-e2b-flowcast-v3: Voice Desktop Automation for GemmaFlow},
  author = {Salerni, Nicola},
  year   = {2026},
  url    = {https://huggingface.co/nsalerni/gemma-4-e2b-flowcast-v3}
}

License

Apache 2.0. Base model subject to Google Gemma license.

Model tree for nsalerni/gemma-4-e2b-flowcast-v3

Base model: google/gemma-4-E2B
Finetuned: google/gemma-4-E2B-it