Effect-v4 Qwen3.6-35B-A3B — Champion (GGUF, llama.cpp)
A local model fine-tuned to write idiomatic, compiling Effect v4
(effect@4.0.0-beta.80) TypeScript. Built $0-local on a single Apple M5 Max (48 GB):
continued-pretraining → instruction SFT (LoRA), fused into the base, then converted to GGUF for
portable CPU/GPU inference with llama.cpp. Same champion weights (v7s43_i200) as the
MLX release — this repo is the portable
GGUF build.
Why this exists:
`effect@4.0.0-beta.80` is a beta that postdates the pretraining of essentially every LLM; its exact API surface is absent from base models, so they hallucinate v3-isms. That sparsity is the whole point: this is a small, honest domain expert for a library the big models haven't seen.
Honest framing first. These are the fine-tuned weights. On a frozen, real `tsc --strict` compile gate (24 held-out tasks) they are a genuine but limited expert: single-greedy + RAG ≈ 9.7/24 mean (best checkpoint 13/24). The headline ~23/24 number is the full serving pipeline (best-of-16 sampling + retrieval + a deterministic import-resolver + a `tsc` verifier), not the bare weights; see How to actually get 23/24. Treat this as a research artifact, strongest when paired with retrieval and a compiler-in-the-loop.
What it is
- Base: `mlx-community/Qwen3.6-35B-A3B-4bit`, the text tower of the Qwen3.6 hybrid GatedDeltaNet MoE (`qwen3_5_moe`, 35.9B total / ~3B active). Vision tower dropped (text code model).
- Fine-tune (champion `v7s43_i200`): CPT on a curated Effect-v4 source corpus (effect-smol, EffectPatterns, examples) → instruction SFT (rank-8 LoRA, 423 gate-validated instruction→code pairs, every target compiled under the exact `tsc` gate).
- This repo: the champion LoRA fused into the base, dequantized to bf16, converted to GGUF and quantized with `llama.cpp` (mainline). Verified to generate coherent Effect TypeScript on a raw greedy CPU smoke test before release.
Conversion note (GatedDeltaNet + MoE): converting this hybrid arch from MLX-origin weights required one non-obvious fix. mlx-lm bakes the `+1` zero-centered-RMSNorm shift into its saved norm weights, and `convert_hf_to_gguf.py` adds `+1` again, so the norms must be un-shifted before conversion or every layer is double-shifted into garbage. With that corrected, the GGUF matches the MLX model's behavior. (The earlier `…-v3-gguf` repo predates this fix and is broken; use this repo instead.)
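The double shift is easy to see numerically. The following is a minimal sketch, not the conversion script used for this release: it uses plain Python lists in place of real tensors, and the `norm.weight` suffix, the tensor names, and the helper names `unshift_norm_weights` and `converter_shift` are all hypothetical.

```python
from typing import Dict, List

# Hypothetical suffix used to pick out RMSNorm weights; real checkpoints may
# name these tensors differently, so adjust the predicate to your weights.
NORM_SUFFIX = "norm.weight"


def unshift_norm_weights(tensors: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Subtract the baked-in +1 from norm weights, leaving other tensors alone."""
    fixed = {}
    for name, values in tensors.items():
        if name.endswith(NORM_SUFFIX):
            fixed[name] = [v - 1.0 for v in values]
        else:
            fixed[name] = list(values)
    return fixed


def converter_shift(values: List[float]) -> List[float]:
    """Stand-in for the +1 the converter applies to zero-centered norm weights."""
    return [v + 1.0 for v in values]


# Toy checkpoint: the zero-centered weight is 0.25, saved with the +1 baked in.
saved = {"layers.0.input_norm.weight": [1.25], "layers.0.mlp.weight": [0.5]}

# Converting the saved weights directly applies the shift a second time.
double_shifted = converter_shift(saved["layers.0.input_norm.weight"])

# Un-shifting first means the converter's +1 restores the intended scale.
corrected = converter_shift(unshift_norm_weights(saved)["layers.0.input_norm.weight"])
```

In this toy case `double_shifted` ends up one unit too large on every norm weight, while `corrected` equals the scale the MLX model actually applies.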
Files / quants
| file | quant | size | notes |
|---|---|---|---|
| effect-qwen36-35b-champion-q4_k_m.gguf | Q4_K_M | ~20 GB | recommended default: small, fast, smoke-verified |
| effect-qwen36-35b-champion-q8_0.gguf | Q8_0 | ~36 GB | near-lossless, for max fidelity |
Eval (real tsc --strict, frozen 24-task held-out benchmark)
Raw single-greedy + RAG — honest, same-harness, multi-seed flat mean (never a cherry-picked run). This is the bare-weights number; the ~23/24 headline is the serving pipeline below, not this table:
| config | compile@24 |
|---|---|
| base model (no fine-tune) | 3 / 24 |
| this model, single-greedy + RAG (flat mean) | 9.67 / 24 |
| this model, best checkpoint single point | 13 / 24 |
The dominant residual failure is decoding discipline (the model knows the API — best-of-N reaches 22–24/24 — but greedy decoding sometimes omits a namespace import). This is closed by tooling, not by more training: every $0 in-weights lever (more data, self-distillation, external real-repo data, RAG-tuning, decode-time constraints) was tested and plateaus here. Pushing raw ≥15 needs RL-with-compiler-reward (out of $0-local scope).
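One way to see why sampling can close a gap that greedy decoding leaves open: if a single sample compiles with probability p, then under the simplifying assumption that samples are independent, at least one of N samples compiles with probability 1 - (1 - p)^N. The sketch below is illustrative arithmetic only. The function name and the 0.25 per-sample rate are hypothetical, real tasks differ in difficulty, and this is not how the benchmark numbers above were computed.

```python
def pass_at_least_once(p: float, n: int) -> float:
    """Probability that at least one of n independent samples compiles."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if n < 1:
        raise ValueError("n must be >= 1")
    return 1.0 - (1.0 - p) ** n


# Hypothetical per-sample compile rate for one hard task.
for n in (1, 4, 16):
    print(n, round(pass_at_least_once(0.25, n), 3))
```

A task the model never solves (p = 0) stays unsolved at any N, which is consistent with the point that sampling helps only when the model already knows the API.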
How to actually get 23/24
The production pipeline (open-source in the training repo, serve/serve.py) wraps these weights with:
- best-of-16 sampling (temp 0.8 / top-p 0.95): `tsc` is a perfect verifier; keep any sample that compiles,
- targeted RAG over tsc-gated Effect-v4 idiom snippets,
- a deterministic import-resolver (fixes `TS2307`/`TS2304` namespace imports),
- an optional 1-pass `tsc`-feedback repair.
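The import-resolver idea can be approximated with plain text processing. The following is a minimal sketch and not the resolver in serve/serve.py: it assumes every missing name is a namespace importable from the `"effect"` package root (some modules may live under other paths), it ignores aliased imports, and the `KNOWN_NAMESPACES` set and `add_missing_imports` helper are hypothetical names chosen for illustration.

```python
import re

# Hypothetical allow-list; extend it to match the modules your code uses.
KNOWN_NAMESPACES = {"Effect", "Schema", "Layer", "Option"}

# Matches `import { A, B } from "effect"` and captures the name list.
IMPORT_RE = re.compile(r'import\s*\{([^}]*)\}\s*from\s*"effect"')
# Matches capitalised identifiers used as a namespace, e.g. `Schema.` in `Schema.Struct`.
USAGE_RE = re.compile(r"\b([A-Z][A-Za-z]*)\.")


def add_missing_imports(code: str) -> str:
    """Prepend an import for known namespaces that are used but not imported."""
    imported = set()
    for match in IMPORT_RE.finditer(code):
        imported.update(name.strip() for name in match.group(1).split(",") if name.strip())
    used = {name for name in USAGE_RE.findall(code) if name in KNOWN_NAMESPACES}
    missing = sorted(used - imported)
    if not missing:
        return code
    return f'import {{ {", ".join(missing)} }} from "effect"\n' + code


sample = 'import { Effect } from "effect"\nconst User = Schema.Struct({})\nEffect.succeed(1)'
patched = add_missing_imports(sample)
```

Because the fix is deterministic, it can run on every sample before the compile check, so a sample that only lacks a namespace import is not thrown away.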
That stack reaches ~23/24 on the broad served product. This repo gives you the expert weights; add your own best-of-N + a compiler check for production use.
Usage (llama.cpp)
# build/download llama.cpp, then:
# one-shot raw completion
llama-completion -m effect-qwen36-35b-champion-q4_k_m.gguf -no-cnv -n 256 --temp 0 \
-p 'import { Effect } from "effect"'
# chat (Qwen chat template is embedded in the GGUF)
llama-cli -m effect-qwen36-35b-champion-q4_k_m.gguf \
-p 'Write Effect v4 code: a Schema.Struct for a User with a branded UserId.'
# OpenAI-compatible server
llama-server -m effect-qwen36-35b-champion-q4_k_m.gguf --port 8080
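Once `llama-server` is running, any OpenAI-style client can talk to it. The following is a minimal standard-library sketch, assuming the server started by the command above is listening on localhost:8080 and serves llama.cpp's OpenAI-compatible `/v1/chat/completions` route. The helper names and the `model` label are placeholders, not part of this repo.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # assumes --port 8080 as in the command above


def build_request(prompt: str, temperature: float = 0.0, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": "effect-qwen36-35b-champion-q4_k_m",  # placeholder label
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


def complete(prompt: str, **kwargs) -> str:
    """Send the prompt to the local server and return the generated text."""
    body = json.dumps(build_request(prompt, **kwargs)).encode("utf-8")
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]


# Example call (requires the server to be running):
# print(complete("Write Effect v4 code: a Schema.Struct for a User with a branded UserId."))
```

Passing `temperature=0.0` mirrors the greedy setting of the one-shot example; a higher value is what a sampling loop would use.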
Tip: for best results, prepend a few real Effect-v4 example snippets (RAG) and sample N times keeping the
first that compiles under tsc.
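The tip above amounts to a short loop. The following is a minimal sketch, not the pipeline in serve/serve.py: `generate` and `compiles` are caller-supplied callables (hypothetical names), so the same loop works with any client and any compile check. The `tsc_compiles` helper assumes `npx tsc` resolves inside a project directory that has `effect` installed; that setup, and the exact compiler flags, are assumptions and may need adjusting for your project.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Callable, Optional, Sequence


def build_prompt(task: str, snippets: Sequence[str]) -> str:
    """Prepend retrieved example snippets (RAG) to the task description."""
    examples = "\n\n".join(snippets)
    return f"{examples}\n\n{task}" if examples else task


def tsc_compiles(code: str, project_dir: str = ".") -> bool:
    """Return True if the code type-checks under `tsc --strict --noEmit`.

    Assumes `npx tsc` works in project_dir; extra flags (for example module
    resolution settings) may be needed depending on the project.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".ts", dir=project_dir, delete=False) as f:
        f.write(code)
        path = Path(f.name).resolve()
    try:
        result = subprocess.run(
            ["npx", "tsc", "--strict", "--noEmit", str(path)],
            cwd=project_dir,
            capture_output=True,
        )
        return result.returncode == 0
    finally:
        path.unlink()  # always remove the temporary file


def first_compiling(
    generate: Callable[[str], str],
    compiles: Callable[[str], bool],
    prompt: str,
    n: int = 16,
) -> Optional[str]:
    """Sample up to n candidates and return the first that compiles, else None."""
    for _ in range(n):
        candidate = generate(prompt)
        if compiles(candidate):
            return candidate
    return None
```

Here `generate` would typically wrap a call to the local server with sampling enabled (the card's pipeline uses temp 0.8 / top-p 0.95), and `compiles` would be `tsc_compiles` or a stricter project-specific check. Returning `None` when nothing compiles lets the caller decide whether to retry or fall back.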
Limitations
- Single-greedy compile rate is ~⅓–½ of hard tasks; pair with RAG + best-of-N + a `tsc` gate.
- `effect@4.0.0-beta.80` only; later betas may shift APIs.
- Reasoning/thinking is disabled; it's a direct code generator.
- Quantized (Q4_K_M / Q8_0). For the native-precision Apple-Silicon build see the MLX repo.
Supersedes jrad123777/effect-qwen36-35b-v3-gguf (an earlier, weaker checkpoint from a broken pipeline).
Built $0-local. Trained, evaluated against the installed .d.ts with tsc as the only arbiter, and
documented honestly.