prism-coder:9b — Prism Memory Tool Router (Default Tier)
QLoRA fine-tuned Qwen3.5-9B for MCP tool routing in the Prism Coder system. Replaces the previous 14B tier with 36% smaller footprint and higher accuracy.
Quick Start
ollama pull dcostenco/prism-coder:9b
BFCL Benchmark — 100% × 3 seeds
64/64 × 3 shuffled runs = 100.0%, 0 hallucinations
| Category | Count | Accuracy |
|---|---|---|
| simple | 10 | 100% |
| relevance_detection | 10 | 100% |
| hallucination | 10 | 100% |
| disambiguation | 8 | 100% |
| format_sensitivity | 5 | 100% |
| ast_parameter | 5 | 100% |
| edge_case | 8 | 100% |
| multi_turn_chain | 8 | 100% |
vs Previous 14B (Qwen 3)
| Metric | 14B (Qwen 3) | 9B (Qwen 3.5) |
|---|---|---|
| BFCL Overall | 90.3% | 100.0% |
| Model Size | 9.0 GB | 5.8 GB |
| Hallucinations | 0 | 0 |
Training
- Method: QLoRA (4-bit base + bf16 adapters) on Apple Silicon M5 48GB
- Base:
mlx-community/Qwen3.5-9B-MLX-4bitfor training,Qwen/Qwen3.5-9Bbf16 for merge - LoRA: rank=32, alpha=64, 16 layers
- Corpus: 26K rows (40% AAC / 12% abstention / 12% safety / 36% tool-use)
- Iterations: 2000 at LR 1e-4
- Layer 3: Inference-time rules for tool remapping, param normalization, multi-turn chain parsing
Architecture
Qwen3.5-9B uses a hybrid attention architecture:
- Linear attention layers (Gated DeltaNet) — O(n) inference, pattern matching
- Full attention layers (standard softmax) — precise retrieval and reasoning
Fleet Position
| Model | Ollama tag | Size | BFCL | Role |
|---|---|---|---|---|
| Qwen3.5-2B Q3_K_M | dcostenco/prism-coder:2b |
2.3 GB | 99.1% | iPhone / mobile |
| Qwen3.5-4B Q4_K_M | dcostenco/prism-coder:4b |
3.4 GB | 100% | Verifier / 8 GB+ |
| Qwen3.5-9B Q4_K_M | dcostenco/prism-coder:9b |
5.8 GB | 100% | Default router |
| prism-coder:32b | dcostenco/prism-coder:32b |
19 GB | 100% | Complex tasks |
Links
- Ollama model page — pull and run
- Prism MCP Server — the MCP server
- Qwen3.5-9B base — upstream model
- Downloads last month
- -