gemma-3n-E2B-it solitaire-advisor LoRA

A LoRA adapter that distils a 31B Gemma Klondike Solitaire advisor into the ~2B-effective Gemma 3n E2B text-only model, runnable locally on a 16 GB Apple Silicon Mac via MLX.

A newer model is available. The Gemma 4 E2B successor, same teacher and data pipeline but evaluated on full games, is published at chayuto/gemma-4-e2b-it-solitaire-advisor-lora and is the project's lead student. This Gemma 3n repo remains the v1 baseline.

This is the first distillation run. The shipped weights (adapters.safetensors) are the iter-750 checkpoint, selected as the best of a 1,000-iteration run by evaluating each intermediate checkpoint. It nearly closes the tier-score gap to the teacher (-1.32 -> -0.27, ~80 % of the gap recovered on the 20-state eval bench). It also picks the teacher's foundation move on 6 of the 7 bench states where the teacher chose one, versus 2 of 7 for the untuned base model. Those recoveries include oscillation-bfb84a, a failure state replicated across three prior small-model experiments. The iter-1000 checkpoint (under checkpoints/) is also available but is mildly overfit: mean tier 2.75 vs iter-750's 3.15, with 2 fewer foundation recoveries. See the learning-curve section for details.

Why Gemma 3n and not Gemma 4 E2B? The intended student was Gemma 4 E2B, the same series as the teacher and the project's long-term target. As of mlx-lm 0.31.3 (the latest at this writing), all mlx-community Gemma 4 E2B quants fail to load with a 140-parameter architecture-mismatch error. The cause is that the alternating-attention k_norm / k_proj / v_proj weights in layers 15-34 are not implemented in mlx-lm's Gemma4Model class. The student here is therefore the previous-generation gemma-3n-E2B, which mlx-lm fully supports. A Gemma 4 E2B variant of this adapter is now published at chayuto/gemma-4-e2b-it-solitaire-advisor-lora. Its base was unblocked with a small local sanitize() patch, and it uses the same training script and data pipeline. That successor is evaluated on full games, where it beats the untuned base and generalizes to fresh, never-seen decks.

The teacher is gemma-4-31b-it (Google's Gemma 4 31B, accessed through a separate harvester app).

Model details

| Property | Value |
|---|---|
| Base model | mlx-community/gemma-3n-E2B-it-text-4bit-dwq |
| Adapter type | LoRA (QLoRA over a 4-bit DWQ-quantised base) |
| LoRA rank | 16 (scale 2.0, dropout 0.05) |
| LoRA target modules | self_attn.{q,k,v,o}_proj, mlp.{gate,up,down}_proj |
| Trainable params | 11.27 M of 4.46 B total (0.253%) |
| Training framework | mlx-lm 0.31.3 |
| Hardware | Apple M5, 16 GB unified memory (Metal GPU) |
| Adapter size on disk | 45 MB per checkpoint |
| Iterations trained | 1,000 (shipped checkpoint = iter 750) |
| Wall-clock training time | ~95 minutes |
| Quantisation | base remains 4-bit; LoRA weights bfloat16 |

Intended use

In scope. Acting as a move-selection advisor inside a Klondike Solitaire client that already enforces game rules:

  • Imperfect-information draw-1 Klondike (one card flipped from stock per draw); the advisor is shown the full visible state plus the count of face-down cards
  • Single-turn decisions: given the prompt schema below, emit a single JSON object choosing one of the offered legal moves
  • Local inference on Apple Silicon (8 GB+ unified memory) via mlx-lm

Out of scope.

  • Open-ended chat or general-purpose text generation. The model has been fine-tuned to a narrow JSON-emitting role and is expected to be measurably worse than the base model at unrelated tasks.
  • Game-rule enforcement. The advisor selects from a legalMoves array supplied in the prompt; it does not verify legality from first principles and should not be trusted to do so.
  • Optimal Solitaire play. The distillation target is a 31B model that itself is imperfect (it stalls on ~21 % of post-cutover games observed in production). This adapter inherits that ceiling.
  • Other Solitaire variants (Spider, FreeCell, etc.), which are out of distribution.

Usage

Install

# Apple Silicon, Python 3.12 venv recommended (mlx wheels are not on 3.14+)
python3.12 -m venv venv && source venv/bin/activate
pip install mlx mlx-lm huggingface-hub

Quick start, Python

from huggingface_hub import snapshot_download
from mlx_lm import load, generate

# Pull the adapter once (~45 MB for the shipped iter-750 weights;
# checkpoints/ subdir adds another 180 MB if you want intermediate iters too).
adapter_path = snapshot_download(
    repo_id="chayuto/gemma-3n-e2b-it-solitaire-advisor-lora",
    allow_patterns=["adapters.safetensors", "adapter_config.json"],
)

# Load base + LoRA. First call also downloads the base model (~3 GB).
model, tokenizer = load(
    "mlx-community/gemma-3n-E2B-it-text-4bit-dwq",
    adapter_path=adapter_path,
)

# Wrap your Solitaire prompt as a single user message and apply the chat
# template (this matches what the model was trained against).
solitaire_prompt = open("your_solitaire_prompt.txt").read()
wrapped = tokenizer.apply_chat_template(
    [{"role": "user", "content": solitaire_prompt}],
    tokenize=False, add_generation_prompt=True,
)
response = generate(model, tokenizer, prompt=wrapped, max_tokens=512)
print(response)

Resource envelope on Apple M5 16 GB:

  • Load: 2.3 s with the base model cached, plus a ~5 min one-time base-model download on first use
  • Per call: ~13 s wall, ~6.3 GB peak Metal memory
  • The model is text-only and runs entirely on the local Metal GPU; no network after initial download.

Concrete example

A real prompt from the eval bench is at eval/sample_prompt.txt (state oscillation-bfb84a, 4471 chars, 4 cards on foundations, 18 face-down, Ace of Spades in waste). Its tail looks like this:

LEGAL MOVES (respond with the index of your chosen move):
 [0] tableau_to_tableau Move KH plus 1 more from column 1 to column 2 (empty)
 [1] discard_to_foundation Send AS from the waste to the spades foundation
 [2] draw_card Draw the next card from the stock onto the waste

PROGRESS: foundation=4/52, face-down remaining=18, completion=8%
...
Now choose the best move and reply with only the JSON object.

The iter-750 adapter responds with:

{
    "board_analysis": "The board is mostly face-down with 18 hidden cards remaining. Foundations are partially built (AH, AD, 2C, S). The waste pile contains the AS. The only legal tableau move is moving the KH and QC from column 1 to column 2, which does not reveal any hidden cards or create any new strategic opportunities.",
    "strategic_plan": "The primary objective is to reveal the Ace of Spades and other low cards (2, 3, 5, 6, 8, 9) to the foundations and tableau columns. ...",
    "final_decision": {
        "move_index": 1,
        "confidence": 0.95,
        "alternative_move_index": -1
    }
}

move_index: 1 correctly sends the Ace of Spades to the spades foundation. This is the optimal play here, and the untuned base missed it on 3/3 runs. It is the headline oscillation-bfb84a recovery referenced under Foundation-move recovery below.

Expected I/O contract

Input: a single user-role message containing the full Solitaire game prompt in the project's harvester format. The prompt must include the LEGAL MOVES block, because the model is trained to index into it. See eval/prompts_C0/*/prompt.txt for 20 worked examples across early-game, midgame, and oscillation states.
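Client code needs the move count and move types to validate a reply, and it can read both back out of the prompt. The sketch below is a hypothetical parser. Its regular expression is inferred only from the LEGAL MOVES excerpt of eval/sample_prompt.txt shown earlier, not from a published harvester specification, and parse_legal_moves is an illustrative name.

```python
import re

# Pattern inferred from lines such as
#   " [1] discard_to_foundation Send AS from the waste to the spades foundation"
# (an assumption; the full harvester format is not published).
MOVE_LINE = re.compile(r"^\s*\[(\d+)\]\s+(\w+)\s+(.*)$")


def parse_legal_moves(prompt: str) -> list[dict]:
    """Return the LEGAL MOVES block as [{"index", "type", "text"}, ...]."""
    moves: list[dict] = []
    in_block = False
    for line in prompt.splitlines():
        if line.startswith("LEGAL MOVES"):
            in_block = True
            continue
        if not in_block:
            continue
        match = MOVE_LINE.match(line)
        if match is None:
            if moves:  # the first non-move line after the entries ends the block
                break
            continue
        index, move_type, text = match.groups()
        moves.append({"index": int(index), "type": move_type, "text": text.strip()})
    # The model indexes 0-based into this list, so indices must run 0..n-1.
    if [m["index"] for m in moves] != list(range(len(moves))):
        raise ValueError("LEGAL MOVES indices are not 0..n-1")
    return moves
```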

Output: a single JSON object with three required keys:

{
    "board_analysis": "string, terse description of the visible state",
    "strategic_plan": "string, why the chosen move is preferred",
    "final_decision": {
        "move_index": 0,
        "confidence": 0.9,
        "alternative_move_index": 1
    }
}
  • move_index is the 0-based index into the prompt's legalMoves array. This is the only field the client needs to consume.
  • confidence is inherited from the teacher and is poorly calibrated (saturates at 0.85-0.95). Do not route on it.
  • alternative_move_index can be -1 if no alternative is suggested.
  • The strategic_plan prose sometimes references move indices inconsistently (an artifact of the PRIOR REASONING section in the training prompts); trust final_decision.move_index, not the narrative.
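A minimal client-side reader for this contract might look like the following. This is a sketch, not the project's client. extract_move_index is a hypothetical helper, and tolerating text around the JSON object is an assumption that goes beyond what the contract promises.

```python
import json

REQUIRED_KEYS = ("board_analysis", "strategic_plan", "final_decision")


def extract_move_index(response: str) -> int:
    """Return final_decision.move_index from an advisor reply.

    Raises ValueError (json.JSONDecodeError is a subclass) when the reply
    is not a JSON object with the three required keys and an int move_index.
    """
    # Assumption: tolerate stray text around the object by slicing from the
    # first "{" to the last "}".
    start, end = response.find("{"), response.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in response")
    obj = json.loads(response[start : end + 1])
    missing = [key for key in REQUIRED_KEYS if key not in obj]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    decision = obj["final_decision"]
    if not isinstance(decision, dict):
        raise ValueError("final_decision is not an object")
    index = decision.get("move_index")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(index, int) or isinstance(index, bool):
        raise ValueError(f"move_index is not an integer: {index!r}")
    return index
```

Only move_index is consumed. The confidence field is deliberately ignored, in line with the note on calibration above.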

Robustness

Iter-750 produces valid JSON on 20/20 eval states, but 2 of the 20 choose an illegal move_index (off-by-one on 2-move arrays). Clients should defensively fall back to the highest-tier legal move whenever move_index falls outside the range 0 to len(legalMoves) - 1. The iter-1000 checkpoint under checkpoints/0001000_adapters.safetensors eliminates both illegal picks at the cost of 2 foundation moves. See the learning-curve section to choose between the two checkpoints.
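The fallback can be sketched as a small guard around move_index. This sketch makes two assumptions. First, each legal move is represented by its move-type string from the LEGAL MOVES block. Second, HYPOTHETICAL_TIER_BY_TYPE is an illustrative mapping onto the bench's tier scale (foundation = 6, shuffle = 2, draw = 1); the harvester does not publish it.

```python
# Hypothetical mapping from move-type strings to the bench's tier scale.
# A tableau move that reveals a face-down card would score 5 ("reveal") on
# the bench; this coarse type-only mapping cannot detect that.
HYPOTHETICAL_TIER_BY_TYPE = {
    "tableau_to_foundation": 6,
    "discard_to_foundation": 6,
    "tableau_to_tableau": 2,
    "draw_card": 1,
}


def choose_move(move_index: int, legal_move_types: list[str]) -> int:
    """Return move_index if it is in range, else the highest-tier legal move.

    Unknown move types score 0. Ties go to the lowest index, because max()
    returns the first maximal element.
    """
    if not legal_move_types:
        raise ValueError("no legal moves offered")
    if 0 <= move_index < len(legal_move_types):
        return move_index
    return max(
        range(len(legal_move_types)),
        key=lambda i: HYPOTHETICAL_TIER_BY_TYPE.get(legal_move_types[i], 0),
    )
```

For example, the bench's off-by-one failure (move_index=2 on a 2-move array) would be redirected to whichever of moves 0 and 1 scores higher.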

Training data

Source. Production play logs from a Klondike Solitaire client where the 31B gemma-4-31b-it teacher chose moves turn-by-turn. Logs were collected between April and May 2026 across multiple app commits and prompt-template versions. The collection harness, app, and raw harvest format are tracked in a separate, private repo.

Selection. 1,536 of 1,730 candidate decisions kept after the standard ingest pipeline filters (success outcome, valid rawResponse JSON with the three required keys, not from a stalled game). 25 distinct play sessions, heterogeneous prompt templates (~63 % pre-cutover legacy format, ~37 % the current production template 0462323c...). Split at the session level 80 / 10 / 10 -> 1,279 train / 126 val / 131 test.
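A session-level split keeps every decision from one play session in the same split. This prevents near-duplicate positions from a single game from leaking between train and eval. The sketch below assumes each row carries a hypothetical session_id field; it is not the project's ingest code.

```python
import random
from collections import defaultdict


def split_by_session(rows, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split rows into (train, valid, test) so no session spans two splits.

    Sessions, not rows, are shuffled and assigned, so row-level proportions
    only approximate `fractions` when sessions differ in size.
    """
    by_session = defaultdict(list)
    for row in rows:
        by_session[row["session_id"]].append(row)  # hypothetical field name
    sessions = sorted(by_session)  # sort first so the seeded shuffle is reproducible
    random.Random(seed).shuffle(sessions)
    n = len(sessions)
    n_train = round(fractions[0] * n)
    n_valid = round(fractions[1] * n)
    groups = (
        sessions[:n_train],
        sessions[n_train : n_train + n_valid],
        sessions[n_train + n_valid :],
    )
    return tuple([row for s in group for row in by_session[s]] for group in groups)
```

The three lists could then be written out as train.jsonl / valid.jsonl / test.jsonl, which are the file names mlx-lm's LoRA trainer reads from its data directory.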

License. Released as a derived training corpus under CC-BY-4.0 in the project's published dataset (separately staged). No personally identifying information; logs contain only game seeds, board states, model responses, and timing.

Known data-quality issues the adapter inherits:

  • 11 % of source rows dropped by the ingest filter for malformed rawResponse, root cause not yet localised in the harvester.
  • Teacher confidence field is saturated (median 0.90, never below 0.80) even in lost games, so it is not a reliable training signal. It is included in completions but treated as suspect downstream.
  • Mixed prompt-template formats in training; eval is on the most-recent template only.
  • No deck seed in logs (an open P0 issue in the harvester), so we cannot verify the teacher's choices against solver-optimal play.

Training procedure

Hyperparameters

model: mlx-community/gemma-3n-E2B-it-text-4bit-dwq
max_seq_length: 2048
batch_size: 1
num_layers: 16
grad_checkpoint: true
learning_rate: 2.0e-4
iters: 1000
steps_per_report: 10
steps_per_eval: 100
save_every: 250
val_batches: 25

lora_parameters:
    rank: 16
    scale: 2.0
    dropout: 0.05
    keys:
    - self_attn.q_proj
    - self_attn.k_proj
    - self_attn.v_proj
    - self_attn.o_proj
    - mlp.gate_proj
    - mlp.up_proj
    - mlp.down_proj

The keys: list explicitly scopes LoRA to the attention and MLP projections. Without it, mlx-lm's broader default sweep also wraps Gemma 3n's altup.prediction_coefs linear layer, which breaks altup.predict()'s direct .weight access mid-training.
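The 11.27 M trainable-parameter figure can be checked by hand. LoRA adds r × (in + out) parameters per wrapped linear layer, and mlx-lm applies it to the last num_layers blocks. The layer dimensions below are assumptions taken from the gemma-3n-E2B text config as we understand it: hidden size 2048, 8 query heads and 2 KV heads of head dimension 256, and MLP width 8192. Check the base model's config.json before relying on them.

```python
# LoRA wraps each Linear with A (in x r) and B (r x out): r * (in + out) params.
RANK = 16        # lora_parameters.rank
NUM_LAYERS = 16  # num_layers: LoRA is applied to the last 16 decoder blocks

# Assumed gemma-3n-E2B text dimensions (verify against config.json).
HIDDEN = 2048
Q_OUT = 8 * 256    # query heads x head_dim
KV_OUT = 2 * 256   # KV heads x head_dim
INTERMEDIATE = 8192

# (in_features, out_features) for each entry in lora_parameters.keys
SHAPES = {
    "self_attn.q_proj": (HIDDEN, Q_OUT),
    "self_attn.k_proj": (HIDDEN, KV_OUT),
    "self_attn.v_proj": (HIDDEN, KV_OUT),
    "self_attn.o_proj": (Q_OUT, HIDDEN),
    "mlp.gate_proj": (HIDDEN, INTERMEDIATE),
    "mlp.up_proj": (HIDDEN, INTERMEDIATE),
    "mlp.down_proj": (INTERMEDIATE, HIDDEN),
}

per_layer = sum(RANK * (fan_in + fan_out) for fan_in, fan_out in SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{per_layer:,} per layer, {total:,} total")  # 704,512 per layer, 11,272,192 total
```

Under these assumed dimensions, the total matches the reported 11.27 M. The three MLP projections account for about 70 % of the adapter's parameters.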

Loss curve

| iter | train loss | val loss | peak MLX memory | wall (cumulative) |
|---|---|---|---|---|
| 1 | - | 6.365 | - | 0 m |
| 10 | 3.160 | - | 11.45 GB | ~1 m |
| 100 | 0.388 | 0.426 | 11.45 GB | ~10 m |
| 250 | (checkpoint) | (checkpoint) | 11.49 GB | ~25 m |
| 500 | (checkpoint) | (checkpoint) | 11.49 GB | ~50 m |
| 750 | (checkpoint) | (checkpoint) | 11.49 GB | ~75 m |
| 1000 | 0.222 | 0.369 | 11.49 GB | ~95 m |

Most of the learning happened in the first 100 iters (val 6.365 -> 0.426). Iters 100-1,000 contributed a further 0.057 of val-loss improvement: diminishing, but still positive at iter 1,000. The final train/val gap is 0.147, indicating mild memorisation, with val loss still trending down.

Checkpoints

All four training checkpoints are published under checkpoints/:

  • 0000250_adapters.safetensors
  • 0000500_adapters.safetensors
  • 0000750_adapters.safetensors (= the root adapters.safetensors, best by tier score)
  • 0001000_adapters.safetensors (final iter; mildly overfit, see learning curve below)

Learning curve: why iter 750, not iter 1000

The 20-state eval bench was run against the untuned base and each of the four saved checkpoints. The curve is not monotonic:

| checkpoint | mean tier | Δ vs teacher | Δ vs untuned | foundation recovery (of 7) | illegal moves | JSON valid |
|---|---|---|---|---|---|---|
| untuned (iter 0) | 2.10 | -1.32 | 0.00 | 2 / 7 | 1 | 20 / 20 |
| iter 250 | 2.10 | -1.32 | 0.00 | 3 / 7 | 3 | 20 / 20 |
| iter 500 | 2.60 | -0.82 | +0.50 | 4 / 7 | 2 | 18 / 20 |
| iter 750 | 3.15 | -0.27 | +1.05 | 6 / 7 | 2 | 20 / 20 |
| iter 1000 | 2.75 | -0.67 | +0.65 | 4 / 7 | 0 | 20 / 20 |

Key observations:

  1. Iter 750 is the strategic peak. Mean tier 3.15 is within 0.27 of the 31B teacher's 3.42; 6 of 7 teacher-foundation states are correctly recovered (vs only 2 of 7 untuned, 4 of 7 at iter 1000).
  2. Iter 1000 has lost ground. Two of the four foundation moves newly recovered at iter 750 (early-e6291973dd07 and oscillation-a774c0d22f24) regressed to non-foundation choices by iter 1000, and mean tier dropped from 3.15 to 2.75.
  3. There is a real format / strategy tradeoff at iter 1000. It is the only checkpoint with zero illegal moves, but the strategic regression outweighs that marginal gain: iter 750's 2 illegal moves can be handled client-side by falling back to the highest-tier legal move.
  4. Iter 500 had a brief JSON-format instability (2/20 generations missed the JSON schema). This had recovered by iter 750. Worth flagging as a known training-dynamics quirk on this dataset size.

If you want zero illegal moves on the bench at the cost of strategic strength, the iter-1000 weights under checkpoints/0001000_adapters.safetensors are appropriate. For most use cases, the shipped iter-750 weights are the right default.
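The selection rule can be written down explicitly. This is an illustrative sketch, not the project's selection script. It ranks checkpoints by mean tier, using the numbers from the learning-curve table. The tie-breakers (more foundation recoveries, then fewer illegal moves) and the all-valid-JSON filter are our assumptions.

```python
from typing import Optional

BENCH_SIZE = 20

# (mean tier, foundation recoveries of 7, illegal moves, JSON-valid of 20)
# per checkpoint, transcribed from the learning-curve table; 0 = untuned base.
CURVE = {
    0: (2.10, 2, 1, 20),
    250: (2.10, 3, 3, 20),
    500: (2.60, 4, 2, 18),
    750: (3.15, 6, 2, 20),
    1000: (2.75, 4, 0, 20),
}


def select_checkpoint(curve: dict, max_illegal: Optional[int] = None) -> int:
    """Return the iteration with the best (mean tier, foundation, -illegal).

    Only checkpoints with all-valid JSON are eligible, optionally further
    restricted to those with at most `max_illegal` illegal picks.
    """
    eligible = {
        it: row for it, row in curve.items()
        if row[3] == BENCH_SIZE and (max_illegal is None or row[2] <= max_illegal)
    }
    if not eligible:
        raise ValueError("no checkpoint satisfies the constraints")
    return max(eligible, key=lambda it: (eligible[it][0], eligible[it][1], -eligible[it][2]))
```

select_checkpoint(CURVE) returns 750. Adding max_illegal=0 returns 1000, which is the strict-legality trade-off described above.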

Raw per-checkpoint scored eval results are published under eval/: posttune_at250.json, posttune_at500.json, posttune_at750.json, posttune_at1000.json, plus the aggregated learning_curve.json.

Evaluation

Bench

20-state evaluation bench drawn from the Phase 1.5 prompt-format study, composed of 5 early-game, 8 midgame, and 7 oscillation states from two post-cutover production sessions (template 0462323c...). The bench intentionally includes 7 states where the 31B teacher chose a {tableau,discard}_to_foundation move, the failure mode this distillation was most intended to fix.

The 31B teacher's pick on each state is the production-recorded ground truth. Tier scoring (foundation = 6, reveal = 5, waste_play = 4, shuffle = 2, draw = 1, recycle = 1, illegal = 0) is the same scale used in the A4 Phase 1.5 prompt-format study. A single generation per state was used; future work should add multiple runs per state for variance estimation.
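The tier metric is simple to reproduce once each pick has been classified into one of the scale's categories. The sketch below assumes that classification has already been done; that step is not shown and is the hard part. mean_tier and gap_to_teacher are illustrative names, not functions from eval/.

```python
# Tier scale used by the bench (higher is better).
TIER = {
    "foundation": 6, "reveal": 5, "waste_play": 4,
    "shuffle": 2, "draw": 1, "recycle": 1, "illegal": 0,
}


def mean_tier(picks: list[str]) -> float:
    """Mean tier over one classified pick per state."""
    return sum(TIER[p] for p in picks) / len(picks)


def gap_to_teacher(student: list[str], teacher: list[str]) -> float:
    """Student mean tier minus teacher mean tier; negative means behind."""
    if len(student) != len(teacher):
        raise ValueError("student and teacher must be scored on the same states")
    return mean_tier(student) - mean_tier(teacher)


# Toy three-state example (not bench data).
teacher = ["foundation", "draw", "shuffle"]
student = ["foundation", "shuffle", "illegal"]
print(round(mean_tier(teacher), 2), round(mean_tier(student), 2),
      round(gap_to_teacher(student, teacher), 2))
```

Here the student trails by a third of a tier even though it beats the teacher on the second state. A single illegal pick (tier 0) is costly.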

Headline (iter-750 shipped weights)

| metric | untuned base | this adapter (iter 750) | iter 1000 (for reference) | 31B teacher |
|---|---|---|---|---|
| JSON validity | 20 / 20 | 20 / 20 | 20 / 20 | - |
| Illegal moves chosen | 1 / 20 | 2 / 20 | 0 / 20 | - |
| Teacher-pick agreement | 11 / 20 | 11 / 20 | 11 / 20 | - |
| Mean tier (all 20) | 2.10 | 3.15 | 2.75 | 3.42 |
| Gap to teacher | -1.32 | -0.27 | -0.67 | - |
| Foundation recovery (of 7 teacher-foundation states) | 2 / 7 | 6 / 7 | 4 / 7 | - |

Teacher-pick agreement is unchanged in count but shifted in composition. The adapter gained agreements on foundation states and lost some on states where it chose a strictly higher-tier move than the teacher. Raw agreement therefore under-counts the improvement; tier score captures it.

Per category

| category | n | untuned | adapter (iter 750) | iter 1000 | teacher | Δ adapter vs untuned |
|---|---|---|---|---|---|---|
| early | 5 | 2.60 | 4.20 | 3.20 | 4.20 | +1.60 (matches teacher) |
| midgame | 8 | 1.38 | 2.12 | 1.75 | 2.00 | +0.74 (beats teacher mean) |
| oscillation | 7 | 2.57 | 3.57 | 3.57 | 4.29 | +1.00 |
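The three categories partition the 20 states, so each headline mean tier is the n-weighted average of its category means. The check below uses the rounded per-category values from the table for the three student arms. Because of that rounding, the roll-up agrees with the headline means only to about two decimals.

```python
# Category sizes and per-category mean tiers, transcribed from the table above.
SIZES = {"early": 5, "midgame": 8, "oscillation": 7}
CATEGORY_MEANS = {
    "untuned": {"early": 2.60, "midgame": 1.38, "oscillation": 2.57},
    "iter 750": {"early": 4.20, "midgame": 2.12, "oscillation": 3.57},
    "iter 1000": {"early": 3.20, "midgame": 1.75, "oscillation": 3.57},
}


def overall_mean(cat_means: dict[str, float], sizes: dict[str, int] = SIZES) -> float:
    """Size-weighted average of category means, i.e. the mean over all states."""
    return sum(cat_means[c] * n for c, n in sizes.items()) / sum(sizes.values())


for arm, cat_means in CATEGORY_MEANS.items():
    print(f"{arm}: {overall_mean(cat_means):.2f}")
```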

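Every category improved over the untuned base. Early-game states gained the most (+1.60, matching the teacher). Oscillation gained +1.00; it is the category where the untuned base's foundation-miss failure mode was most visible, including the replicated oscillation-bfb84a miss.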

Foundation-move recovery (the primary fine-tuning target)

The bench includes 7 states where the teacher chose a foundation move. At iter 750, 6 of 7 are correctly recovered:

| state | untuned (iter 0) | iter 750 (shipped) | iter 1000 | teacher |
|---|---|---|---|---|
| early-3687a40eda7b | shuffle | foundation | foundation | foundation |
| early-e6291973dd07 | shuffle | foundation | draw | foundation |
| midgame-4ab5735a4f20 | draw | draw | draw | foundation |
| oscillation-026f3139d6f2 | foundation | foundation | foundation | foundation |
| oscillation-30700e2ca639 | foundation | foundation | foundation | foundation |
| oscillation-a774c0d22f24 | draw | foundation | shuffle | foundation |
| oscillation-bfb84ae55c3f | draw | foundation | foundation | foundation |
| Recovered count | 2 / 7 | 6 / 7 | 4 / 7 | - |

oscillation-bfb84a is notable because it was previously a failure mode replicated across three experiments: C0-Haiku missed 1/3 runs, A4-Haiku missed 1/3 runs, and untuned 3n-E2B missed 3/3 runs. The adapter solves it from iter 250 onward, and the solution is stable through iter 1000.

The single state still missed at iter 750 (midgame-4ab5735a4f20) was also the hardest at iter 1000 (still missed there too). It is a state where the foundation move is at move_index=1 of a 4-move array; the adapter consistently prefers move_index=0 (a draw). Probably needs targeted training-data augmentation to fix.

Adapter strictly outperforms the teacher on three states

States where the iter-750 adapter's pick is a higher tier than the teacher's:

| state | teacher | iter-750 adapter | tier improvement |
|---|---|---|---|
| midgame-0d463176c4be | draw (1) | shuffle (2) | +1 |
| midgame-a658537fe2ae | draw (1) | shuffle (2) | +1 |
| oscillation-21cc5243e1d8 | draw (1) | shuffle (2) | +1 |

These are all draw -> shuffle substitutions: when the teacher punted with a draw, the adapter found a productive tableau move. The 31B teacher is not an oracle; it leaves some tier points on the table that distillation has picked up.

Two illegal moves remain at iter 750

| state | legal moves offered | iter-750 chose | note |
|---|---|---|---|
| midgame-81dc0fb02394 | 2 | 2 | off-by-one (same state that was illegal untuned) |
| oscillation-d0ff552ed744 | 2 | 2 | off-by-one |

Both choose move_index=2 on a 2-move array, where only indices 0 and 1 are valid. Client code should fall back to the highest-tier legal move. Iter 1000 fixes both, but at the cost of two foundation moves, which is a net-negative trade.

Reproducing the evaluation

git clone https://huggingface.co/chayuto/gemma-3n-e2b-it-solitaire-advisor-lora
cd gemma-3n-e2b-it-solitaire-advisor-lora
pip install mlx mlx-lm

# Untuned baseline
python eval/baseline_n20_runner.py

# This adapter
python eval/posttune_n20_runner.py --adapter-path .

Wall time ~5 min per arm on M5.

Limitations

  • N = 20 single-run eval. The shipped iter-750 weights' +1.05 mean-tier delta over the untuned base is large enough to be directionally trustworthy. However, per-state changes (especially 1- or 2-state foundation gains) should not be over-interpreted as guarantees on unseen positions.
  • Heterogeneous training templates. Training mixed pre-cutover legacy and current production prompt formats; eval is on post-cutover only. Effect on generalisation across template shift is unmeasured.
  • No endgame states in bench. Both source post-cutover sessions stalled at or below 25 % progress (a genuine production sample), so the adapter's endgame behaviour is untested. Prior gemma-4-31b-it evidence suggests endgame is a different failure regime; treat extrapolation with caution.
  • Trained on a teacher that itself loses ~55 % of games. The adapter's ceiling is teacher-level play, not optimal play. The three states where the adapter out-tiers the teacher hint at room to outperform it in places, but the dataset does not actively reward that; it rewards only teacher imitation.
  • Memorisation risk. Final train loss 0.222 vs val 0.369 shows mild divergence; pushing iters past ~1,500 without data augmentation is likely to widen this.
  • Confidence field is suspect. The teacher emits confidence: 0.9 ± 0.05 almost regardless of board state; the adapter learned this poorly-calibrated signal. Do not use final_decision.confidence for routing decisions.
  • Apple-Silicon-only. Distributed via mlx. CUDA/CPU inference would need conversion through transformers / PEFT, which is not validated here.

Bias and ethical considerations

The model produces Solitaire move recommendations. Risks of bias in the classical sense (race, gender, etc.) are not directly applicable. Worth noting:

  • The teacher (and therefore the adapter) inherits whatever value judgements are encoded in the harvester's prompt, including the rule "prefer revealing face-down cards before sending cards to foundations" which is a heuristic that loses to certain optimal lines.
  • Production use will lock in the teacher's playstyle. If the goal is a diverse advisor, training on a single teacher is the wrong objective.

License

The adapter is released under the Gemma Terms of Use (inherited from the base model). Use, redistribution, and modification require compliance with the Gemma Prohibited Use Policy.

The training scripts and evaluation code in this repository (training/, eval/) are released under the MIT License.

The training data (separately staged for HuggingFace Datasets) is CC-BY-4.0.

Citation

If you use this adapter, please cite:

@misc{orapinpatipat2026solitaireadvisor,
    title = {Distilling a 31B Klondike Solitaire advisor into Gemma 3n E2B via MLX QLoRA},
    author = {Orapinpatipat, Chayut},
    year = {2026},
    month = may,
    howpublished = {\url{https://huggingface.co/chayuto/gemma-3n-e2b-it-solitaire-advisor-lora}},
    note = {LoRA adapter; v1 = iter-750 checkpoint of a 1,000-iter run},
}

Acknowledgements

  • Base model mlx-community/gemma-3n-E2B-it-text-4bit-dwq from the mlx-community team.
  • Training framework mlx-lm from Apple Machine Learning Research.
  • Teacher model gemma-4-31b-it from Google DeepMind.

Project status

This is the first end-to-end distillation run of an ongoing project (solitaire-analytics). The complete training pipeline, all evaluation infrastructure, and a degradation-gated runway of tiered smoke tests (T0 through T5) are documented in the methodology notes in this repo.

Planned next iterations:

  1. Done: evaluate the 250/500/750-iter intermediate checkpoints to find the optimal early-stopping point. Iter 750 was selected as the shipped weights.
  2. Targeted training-data augmentation on the one remaining missed-foundation state (midgame-4ab5735a4f20) to push foundation recovery from 6/7 to 7/7.
  3. Re-train on a post-cutover-only slice once at least 1,000 such rows are available (currently 351). This should reduce the template-shift confound.
  4. Done: re-publish on Gemma 4 E2B. Published at chayuto/gemma-4-e2b-it-solitaire-advisor-lora, with the base unblocked via a small local sanitize() patch. It is now the project's lead student and is evaluated on full games, where it beats the untuned base and generalizes to fresh decks. This Gemma 3n repo remains the v1 fallback / reproducibility baseline.