gemma-4-e2b-flowcast-v3 · flowcast-sota-v3
Flowcast v3 is the production LoRA fine-tune of Gemma 4 E2B Text-int4 for macOS voice-agent desktop automation. It supersedes flowcast-sota-v1 with improved expanded-benchmark coverage while maintaining 100% on the core production gate.
Say it. Plan it. Do it.
What changed in v3
Surgical refine targeting expanded-benchmark failures (browser commands, OOD retail/dev, paraphrases). Training resumes from the gen-refine adapter with targeted repair examples; it is not a full retrain.
| Gate | v1 | v3 | Δ |
|---|---|---|---|
| Core hard eval (117) | 100% | 100% | no change |
| Core held-out (27) | 100% | 100% | no change |
| Expanded hard quality (170) | 98.2% | 99.4% | +1.2% |
| Expanded held-out (39) | 97.4% | 100% | +2.6% |
| Generalization suite | 91.4% | 97.1% | +5.7% |
| Core p50 latency | ~1028ms | ~1002ms | ~same |
Hard quality = task accuracy excluding latency SLA. v3 is the only variant that passes both production gates (100% core, ≥99% expanded quality).
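The gate rule stated above can be sketched as a small check. This is a minimal illustration, not the project's actual gating code: the scores are copied from the table, the function name and dictionary keys are hypothetical, and it assumes the "core" gate covers both the core hard eval and the core held-out set.

```python
# Scores (percent) copied from the benchmark table above.
SCORES = {
    "v1": {"core_hard": 100.0, "core_held_out": 100.0, "expanded_hard_quality": 98.2},
    "v3": {"core_hard": 100.0, "core_held_out": 100.0, "expanded_hard_quality": 99.4},
}


def passes_production_gates(scores, core_required=100.0, expanded_min=99.0):
    """Return True when both gates hold (hypothetical helper).

    Assumption: the core gate requires 100% on both core sets, and the
    expanded gate requires at least 99% expanded hard quality.
    """
    core_ok = (
        scores["core_hard"] >= core_required
        and scores["core_held_out"] >= core_required
    )
    expanded_ok = scores["expanded_hard_quality"] >= expanded_min
    return core_ok and expanded_ok


for variant, scores in SCORES.items():
    print(variant, passes_production_gates(scores))
# v1 False
# v3 True
```

Under these assumptions v1 fails only on the expanded gate (98.2% < 99%), which matches the statement that v3 is the only variant passing both.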
Quick start (MLX, Apple Silicon)
pip install mlx-lm huggingface_hub
from huggingface_hub import snapshot_download
from mlx_lm import load, generate
base = snapshot_download("mlx-community/Gemma4-E2B-IT-Text-int4")
adapter = snapshot_download("nsalerni/gemma-4-e2b-flowcast-v3")
model, tokenizer = load(base, adapter_path=adapter)
prompt = tokenizer.apply_chat_template(
[{"role": "user", "content": "Classify: go to gmail"}],
tokenize=False,
add_generation_prompt=True,
)
print(generate(model, tokenizer, prompt=prompt, max_tokens=128))
GemmaFlow / flowcast integration
from huggingface_hub import snapshot_download
from gemmaflow_tune.production import create_production_runner
from gemmaflow_tune.prompts import build_automation_planning_for_case
adapter = snapshot_download("nsalerni/gemma-4-e2b-flowcast-v3")
runner = create_production_runner(adapter_path=adapter)
case = {"transcript": "go to gmail", "tags": ["web_routing"]}
prompt = build_automation_planning_for_case(case, "hybrid_slim")
result = runner.generate("", user_content=prompt, max_tokens=256)
print(result.text)
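The recommended settings include a `json_early_stop` flag, which suggests the model's reply is a JSON object. Assuming that is the case, the sketch below shows one way to pull the first JSON object out of generated text using only the standard library. The helper name and the example field names (`action`, `url`) are hypothetical; the actual output schema is not documented here.

```python
import json


def first_json_object(text):
    """Return the first decodable JSON object in `text`, or None.

    Hypothetical helper: scans for '{' and tries to decode from there,
    so surrounding prose or trailing tokens are ignored.
    """
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = text.find("{", start + 1)
    return None


# Example with made-up output text (field names are illustrative only).
sample = 'Plan: {"action": "open_url", "url": "https://mail.google.com"} done'
print(first_json_object(sample))
```

With the integration snippet above, the same helper could be applied to `result.text`; returning `None` rather than raising keeps the caller in control of how malformed replies are handled.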
Files
| File | Description |
|---|---|
| `adapters.safetensors` | LoRA weights (checkpoint 0000030) |
| `adapter_config.json` | LoRA config + base model reference |
| `inference_config.json` | Recommended runtime settings + benchmark scores |
Recommended inference settings
{
"prompt_mode": "hybrid_slim",
"json_early_stop": true,
"use_prompt_kv_cache": true,
"temperature": 0.02,
"top_p": 0.85,
"max_tokens": 256
}
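A rough sketch of applying these settings with plain `mlx-lm` follows. It assumes a recent `mlx-lm` release in which `mlx_lm.sample_utils.make_sampler` accepts `temp` and `top_p` and `generate` accepts a `sampler` argument. The keys `prompt_mode`, `json_early_stop`, and `use_prompt_kv_cache` are treated here as GemmaFlow runtime options rather than `mlx-lm` parameters, which is an assumption; the helper names are hypothetical.

```python
import json

# The recommended settings from above, parsed as JSON.
RECOMMENDED = json.loads("""
{
  "prompt_mode": "hybrid_slim",
  "json_early_stop": true,
  "use_prompt_kv_cache": true,
  "temperature": 0.02,
  "top_p": 0.85,
  "max_tokens": 256
}
""")

SAMPLING_KEYS = ("temperature", "top_p")


def split_settings(cfg):
    """Split settings into (sampling, max_tokens, runtime_only).

    Assumption: only temperature, top_p and max_tokens map onto mlx-lm;
    the remaining keys are consumed by the GemmaFlow runner.
    """
    sampling = {k: cfg[k] for k in SAMPLING_KEYS if k in cfg}
    runtime_only = {
        k: v for k, v in cfg.items()
        if k not in SAMPLING_KEYS and k != "max_tokens"
    }
    return sampling, cfg.get("max_tokens"), runtime_only


def generate_with_settings(model, tokenizer, prompt, cfg=RECOMMENDED):
    """Generate with the sampling subset of the settings (needs mlx-lm)."""
    # Imported lazily so the rest of this sketch runs without mlx-lm.
    from mlx_lm import generate
    from mlx_lm.sample_utils import make_sampler

    sampling, max_tokens, _ = split_settings(cfg)
    sampler = make_sampler(temp=sampling["temperature"], top_p=sampling["top_p"])
    return generate(
        model, tokenizer, prompt=prompt, max_tokens=max_tokens, sampler=sampler
    )


print(split_settings(RECOMMENDED))
```

The very low temperature (0.02) makes decoding close to greedy, which fits a classification and planning task where repeatable output matters more than variety.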
Training lineage
Fine-tuned with gemmaflow-tune:
- Base: `mlx-community/Gemma4-E2B-IT-Text-int4` (0.7B active, ~2.5 GB disk)
- Method: LoRA (rank 16, 16 layers, attention projections)
- Pipeline: v1 SFT → v2 gen-refine → v3 expanded surgical refine (30 iters)
- Predecessor adapter: `flowcast-sota-v1` → `flowcast-v2-gen-refine`
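To give a feel for what "rank 16" means in terms of adapter size, the sketch below counts the trainable parameters LoRA adds to a single projection. A rank-r adapter on a `d_in × d_out` weight trains two low-rank factors with `r * (d_in + d_out)` parameters in total. The 2048 × 2048 projection size is a made-up illustration, not the base model's actual dimension.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_in x d_out projection.

    LoRA learns A (rank x d_in) and B (d_out x rank), so the count is
    rank * d_in + d_out * rank.
    """
    return rank * (d_in + d_out)


# Hypothetical projection size, chosen only for illustration.
D_IN, D_OUT, RANK = 2048, 2048, 16

adapter_params = lora_param_count(D_IN, D_OUT, RANK)
full_params = D_IN * D_OUT

print(adapter_params)                # 65536
print(adapter_params / full_params)  # 0.015625
```

At these illustrative sizes the adapter trains under 2% of the parameters of the projection it modifies, which is why a 30-iteration targeted refine is practical.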
Citation
@misc{gemma4e2bflowcastv32026,
title = {gemma-4-e2b-flowcast-v3: Voice Desktop Automation for GemmaFlow},
author = {Salerni, Nicola},
year = {2026},
url = {https://huggingface.co/nsalerni/gemma-4-e2b-flowcast-v3}
}
License
Apache 2.0. The base model is subject to the Google Gemma license.