Effect-v4 Qwen3.6-35B-A3B — Champion (GGUF, llama.cpp)

A local model fine-tuned to write idiomatic, compiling Effect v4 (effect@4.0.0-beta.80) TypeScript. Built $0-local on a single Apple M5 Max (48 GB): continued-pretraining → instruction SFT (LoRA), fused into the base, then converted to GGUF for portable CPU/GPU inference with llama.cpp. Same champion weights (v7s43_i200) as the MLX release — this repo is the portable GGUF build.

Why this exists: effect@4.0.0-beta.80 is a beta that postdates the pretraining of essentially every LLM — its exact API surface is absent from base models, so they hallucinate v3-isms. That sparsity is the whole point: this is a small, honest domain expert for a library the big models haven't seen.

Honest framing first. These are the fine-tuned weights. On a frozen, real-tsc --strict compile gate (24 held-out tasks) they are a genuine but limited expert: single-greedy + RAG ≈ 9.7/24 mean (best checkpoint 13/24). The headline ~23/24 number is the full serving pipeline (best-of-16 sampling + retrieval + a deterministic import-resolver + a tsc verifier), not the bare weights — see How to actually get 23/24. Treat this as a research artifact, strongest when paired with retrieval and a compiler-in-the-loop.

What it is

  • Base: mlx-community/Qwen3.6-35B-A3B-4bit — the text tower of the Qwen3.6 hybrid GatedDeltaNet MoE (qwen3_5_moe, 35.9B total / ~3B active). Vision tower dropped (text code model).
  • Fine-tune (champion v7s43_i200): CPT on a curated Effect-v4 source corpus (effect-smol, EffectPatterns, examples) → instruction SFT (rank-8 LoRA, 423 gate-validated instruction→code pairs, every target compiled under the exact tsc gate).
  • This repo: the champion LoRA fused into the base, dequantized to bf16, converted to GGUF and quantized with llama.cpp (mainline). Verified to generate coherent Effect TypeScript on a raw greedy CPU smoke test before release.

Conversion note (GatedDeltaNet + MoE): converting this hybrid arch from MLX-origin weights required one non-obvious fix — mlx-lm bakes the +1 zero-centered-RMSNorm shift into its saved norm weights, and convert_hf_to_gguf.py adds +1 again, so the norms must be un-shifted before conversion or every layer is double-shifted into garbage. With that corrected, the GGUF matches the MLX model's behavior. (The earlier …-v3-gguf repo predates this fix and is broken — use this repo instead.)

Files / quants

file quant size notes
effect-qwen36-35b-champion-q4_k_m.gguf Q4_K_M ~20 GB recommended default — small, fast, smoke-verified
effect-qwen36-35b-champion-q8_0.gguf Q8_0 ~36 GB near-lossless, for max fidelity

Eval (real tsc --strict, frozen 24-task held-out benchmark)

Raw single-greedy + RAG — honest, same-harness, multi-seed flat mean (never a cherry-picked run). This is the bare-weights number; the ~23/24 headline is the serving pipeline below, not this table:

config compile@24
base model (no fine-tune) 3 / 24
this model, single-greedy + RAG (flat mean) 9.67 / 24
this model, best checkpoint single point 13 / 24

The dominant residual failure is decoding discipline (the model knows the API — best-of-N reaches 22–24/24 — but greedy decoding sometimes omits a namespace import). This is closed by tooling, not by more training: every $0 in-weights lever (more data, self-distillation, external real-repo data, RAG-tuning, decode-time constraints) was tested and plateaus here. Pushing raw ≥15 needs RL-with-compiler-reward (out of $0-local scope).

How to actually get 23/24

The production pipeline (open-source in the training repo, serve/serve.py) wraps these weights with:

  1. best-of-16 sampling (temp 0.8 / top-p 0.95) — tsc is a perfect verifier; keep any sample that compiles,
  2. targeted RAG over tsc-gated Effect-v4 idiom snippets,
  3. a deterministic import-resolver (fixes TS2307/TS2304 namespace imports),
  4. an optional 1-pass tsc-feedback repair.

That stack reaches ~23/24 on the broad served product. This repo gives you the expert weights; add your own best-of-N + a compiler check for production use.

Usage (llama.cpp)

# build/download llama.cpp, then:

# one-shot raw completion
llama-completion -m effect-qwen36-35b-champion-q4_k_m.gguf -no-cnv -n 256 --temp 0 \
  -p 'import { Effect } from "effect"'

# chat (Qwen chat template is embedded in the GGUF)
llama-cli -m effect-qwen36-35b-champion-q4_k_m.gguf \
  -p 'Write Effect v4 code: a Schema.Struct for a User with a branded UserId.'

# OpenAI-compatible server
llama-server -m effect-qwen36-35b-champion-q4_k_m.gguf --port 8080

Tip: for best results, prepend a few real Effect-v4 example snippets (RAG) and sample N times keeping the first that compiles under tsc.

Limitations

  • Single-greedy compile rate is ~⅓–½ of hard tasks; pair with RAG + best-of-N + a tsc gate.
  • effect@4.0.0-beta.80 only; later betas may shift APIs.
  • Reasoning/thinking is disabled — it's a direct code generator.
  • Quantized (Q4_K_M / Q8_0). For the native-precision Apple-Silicon build see the MLX repo.

Supersedes jrad123777/effect-qwen36-35b-v3-gguf (an earlier, weaker checkpoint from a broken pipeline).

Built $0-local. Trained, evaluated against the installed .d.ts with tsc as the only arbiter, and documented honestly.

Downloads last month
169
GGUF
Model size
35B params
Architecture
qwen35moe
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for jrad123777/effect-qwen36-35b-gguf

Adapter
(4)
this model