Qwen3-8B-Kintsugi
Qwen3-8B-Kintsugi is a QLoRA fine-tune of Qwen/Qwen3-8B built for The Kintsugi Garden, a symbolic reflection tool for dreams, journal entries, and inner transitions, submitted to the Hugging Face Build Small Hackathon (2026). This repo holds the merged 16-bit transformers artifact; production inference runs on the companion GGUF repo ai-sherpa/Qwen3-8B-Kintsugi-GGUF via llama-cpp-python. The 16-bit weights here serve as the reproducibility and dataset-lineage anchor for the Garden's voice.
- Live app: https://huggingface.co/spaces/build-small-hackathon/Kintsugi-Garden
- Source code: https://github.com/AI-Sherpa/hfHack_KintsugiGargen
- Companion GGUF (Q4_K_M / Q5_K_M): ai-sherpa/Qwen3-8B-Kintsugi-GGUF
What this model is for
The Kintsugi Garden invites someone to enter a dream, a journal fragment, or a description of an inner transition. The model returns a six-section symbolic reflection in a consistent, hedged voice grounded in Jungian and depth-psychological vocabulary:
- Mirror — a brief reflection of what was shared, in the user's own register.
- Key Symbols — concrete imagery, named and held lightly.
- Archetypal Themes — the larger patterns the imagery gestures toward.
- Shadow Pattern — what might be in the background, hedged as a question.
- Individuation Signal — what the material might be inviting, framed as possibility.
- Gentle Question — one open question to carry forward, never a prescription.
The model is fine-tuned to produce this structure reliably, to use hedged contemplative language ("perhaps", "may be", "could carry"), and to route to a safety response when distress-signal language appears. In production, model output passes through a four-layer safety stack before reaching the user: a deterministic safety gate, a mundane-alias suppression filter, the prompt-encoded refusal rules the model was trained on, and a post-generation sanitize_prescriptive filter. This card documents the model's trained behaviour; the runtime contract is enforced by app.py.
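The ordering of the four layers can be sketched roughly as follows. This is a minimal illustration, not the code in app.py: only the name sanitize_prescriptive comes from this card, while the distress terms, mundane aliases, rewrite rules, safety text, and every other function name are hypothetical placeholders.

```python
import re
from typing import Callable

# HYPOTHETICAL placeholder data; the real lists live in app.py and are not shown in this card.
DISTRESS_TERMS = ("hurt myself", "end my life")
MUNDANE_ALIASES = {"coffee", "traffic"}
SAFETY_RESPONSE = "It sounds like you are carrying something heavy. Please reach out to a person."

# HYPOTHETICAL rewrite rules that turn prescriptive phrasing into hedged phrasing.
PRESCRIPTIVE_REWRITES = (
    (re.compile(r"\byou should\b", re.IGNORECASE), "you might"),
    (re.compile(r"\btry to\b", re.IGNORECASE), "perhaps"),
)


def safety_gate(text: str) -> bool:
    """Layer 1: deterministic check that runs before any generation."""
    lowered = text.lower()
    return any(term in lowered for term in DISTRESS_TERMS)


def suppress_mundane_aliases(symbols: list[str]) -> list[str]:
    """Layer 2: drop everyday nouns so they are not treated as symbols."""
    return [s for s in symbols if s.lower() not in MUNDANE_ALIASES]


def sanitize_prescriptive(text: str) -> str:
    """Layer 4: rewrite prescriptive phrasing into hedged form after generation."""
    for pattern, replacement in PRESCRIPTIVE_REWRITES:
        text = pattern.sub(replacement, text)
    return text


def reflect(user_text: str, candidate_symbols: list[str],
            generate: Callable[[str, list[str]], str]) -> str:
    """Run the layers in order; `generate` stands in for the fine-tuned model."""
    if safety_gate(user_text):
        return SAFETY_RESPONSE  # route, do not reflect
    symbols = suppress_mundane_aliases(candidate_symbols)
    draft = generate(user_text, symbols)  # layer 3: model with its trained refusal rules
    return sanitize_prescriptive(draft)
```

Passing the model in as a callable keeps the deterministic layers testable without loading any weights, which is one plausible reason to place them outside the model.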
What this model is NOT for
This model is not therapy, not diagnosis, not prediction, and not advice. It does not replace a clinician, a spiritual director, or a friend. It is not a general-purpose assistant — it has been narrowed into a single editorial voice and will refuse or degrade gracefully on tasks outside that voice (code, math, factual QA, instruction-following at large). Outputs are imagery to sit with, not conclusions to act on.
If you are in distress, please reach out to a person. The app surfaces crisis-line information; this model card should not be the surface where that lands.
Training methodology
Method: QLoRA (4-bit NF4 base + LoRA adapters, then merge to 16-bit) on a single H100 80GB on Modal.
| Hyperparameter | Value |
|---|---|
| Base model | Qwen/Qwen3-8B |
| LoRA rank r | 16 |
| LoRA α | 32 |
| LoRA dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Epochs | 3 |
| Optimizer | paged_adamw_8bit (bitsandbytes) |
| Compute dtype | bf16 |
| Max sequence length | 4096 |
| Quantization (train) | 4-bit NF4, double quant |
| Quantization (release) | merged to bf16, then re-quantized to GGUF for inference |
| Hardware | 1× H100 80GB (Modal) |
| Wall-clock (training) | ~45 seconds |
| Wall-clock (full job, incl. base download + merge + upload) | ~6 minutes |
Training script: scripts/modal_qlora_train.py in the GitHub repo. The script handles base-model download, NF4 quant, LoRA application, SFT, adapter merge, and push to this repo end-to-end.
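To give a feel for how small the trained adapter is, the sketch below counts the LoRA parameters implied by the rank and target modules in the table. The layer shapes (hidden size 4096, intermediate size 12288, 36 layers, 32 query heads and 8 key/value heads of dimension 128) are assumptions about Qwen3-8B that this card does not state; check them against the base model's config.json before relying on the number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSettings:
    r: int = 16      # LoRA rank (from the hyperparameter table)
    alpha: int = 32  # LoRA alpha (from the hyperparameter table)


# ASSUMPTION: Qwen3-8B layer shapes, not stated in this card.
HIDDEN = 4096
INTERMEDIATE = 12288
NUM_LAYERS = 36
Q_DIM = 32 * 128   # 32 query heads x head_dim 128
KV_DIM = 8 * 128   # 8 key/value heads x head_dim 128

# (d_in, d_out) for each targeted projection in one decoder layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, Q_DIM),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (Q_DIM, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params_per_layer(r: int) -> int:
    """Each adapted matrix adds A (r x d_in) and B (d_out x r)."""
    return sum(r * (d_in + d_out) for d_in, d_out in TARGET_SHAPES.values())


def total_lora_params(settings: LoraSettings) -> int:
    """Adapter parameters summed over all decoder layers."""
    return NUM_LAYERS * lora_params_per_layer(settings.r)


def lora_scaling(settings: LoraSettings) -> float:
    """Standard LoRA scales the adapter update by alpha / r."""
    return settings.alpha / settings.r


settings = LoraSettings()
print(total_lora_params(settings))  # 43646976 under the assumed shapes
print(lora_scaling(settings))       # 2.0
```

Under those assumed shapes the adapter comes to roughly 44 million trainable parameters, a small fraction of the 8B base, which is consistent with a run that fits on a single H100.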
Training data
50 seed examples, generated synthetically and validated by the project's QA acceptance harness rather than hand-edited (synthetic_accepted provenance). The dataset card is honest about this: the editorial voice the model learned is the synthetic baseline's voice, not a human writer's. The seeds were drafted by Claude Opus 4.7 against the Garden's prompt spec and lexicon, then filtered through the same harness used to evaluate the final model.
Coverage by category:
| Category | Count | Purpose |
|---|---|---|
| A | 14 | Symbol-dense dreams (the high-signal happy path) |
| B | 5 | Mundane entries (must not over-symbolize coffee and traffic) |
| C | 14 | Jungian motifs (shadow, anima/animus, individuation, persona) |
| D | 5 | Safety triggers (must route, not reflect) |
| E | 6 | Edge cases (very short input, single-image input, ambiguous mood) |
| F | 6 | Adversarial framings (prescriptive bait, certainty bait) |
Source: scripts/build_seed_examples.py. The companion dataset repo ai-sherpa/kintsugi-garden-sft is listed in the card's YAML metadata as a forward-looking reference.
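As a quick sanity check on the coverage table, the sketch below confirms that the category counts sum to the stated 50 seeds and derives each category's share. The counts and the epoch figure come from this card; the helper name is hypothetical and this is not the project's build script.

```python
# Counts copied from the coverage table above.
CATEGORY_COUNTS = {"A": 14, "B": 5, "C": 14, "D": 5, "E": 6, "F": 6}
EXPECTED_TOTAL = 50  # "50 seed examples"
EPOCHS = 3           # from the hyperparameter table


def coverage_shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each category's fraction of the whole dataset."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


total = sum(CATEGORY_COUNTS.values())
assert total == EXPECTED_TOTAL, f"table sums to {total}, expected {EXPECTED_TOTAL}"

for name, share in coverage_shares(CATEGORY_COUNTS).items():
    print(f"{name}: {CATEGORY_COUNTS[name]:>2} examples ({share:.0%})")

# Number of example presentations the model sees across training.
print("example passes:", total * EPOCHS)  # 150
```

Categories A and C together make up 56% of the seeds, so the symbol-dense and Jungian-motif cases dominate what the model sees, while the safety-trigger category rests on only five examples.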
Inference
Production uses the GGUF quants, not these 16-bit weights. The Space loads ai-sherpa/Qwen3-8B-Kintsugi-GGUF via llama-cpp-python:
- Qwen3-8B-Kintsugi-Q4_K_M.gguf: primary, runs on the free CPU tier.
- Qwen3-8B-Kintsugi-Q5_K_M.gguf: fallback for higher-quality eval runs.
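A minimal sketch of loading the primary quant with llama-cpp-python might look like the following. It is not the Space's actual loading code: the helper names, the context size of 4096 (chosen to mirror the training max sequence length), and the sampling settings are assumptions, and the canonical system prompt still has to come from app.py.

```python
from typing import Any

REPO_ID = "ai-sherpa/Qwen3-8B-Kintsugi-GGUF"
PRIMARY_FILE = "Qwen3-8B-Kintsugi-Q4_K_M.gguf"


def load_llm(filename: str = PRIMARY_FILE, n_ctx: int = 4096) -> Any:
    """Download (if needed) and load a GGUF quant from the Hub.

    ASSUMPTION: n_ctx=4096 mirrors the training max sequence length.
    Requires llama-cpp-python and huggingface_hub to be installed.
    """
    from llama_cpp import Llama  # imported lazily so the helpers below work without it

    return Llama.from_pretrained(
        repo_id=REPO_ID, filename=filename, n_ctx=n_ctx, verbose=False
    )


def build_messages(system_prompt: str, user_text: str) -> list[dict[str, str]]:
    """Pair the Garden's system prompt with the user's entry."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_text},
    ]


def reflect(llm: Any, system_prompt: str, user_text: str, max_tokens: int = 512) -> str:
    """Run one chat completion and return the assistant text."""
    result = llm.create_chat_completion(
        messages=build_messages(system_prompt, user_text),
        max_tokens=max_tokens,
        temperature=0.7,  # ASSUMPTION: illustrative sampling setting
    )
    return result["choices"][0]["message"]["content"]
```

Typical use would be `llm = load_llm()` followed by `reflect(llm, system_prompt, "I dreamed of a cracked bowl filling with gold light.")`, with `system_prompt` copied from app.py. Output from this path has not passed through the app's runtime safety layers.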
For research, reproducibility, or further fine-tuning, use this repo's 16-bit weights with transformers:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ai-sherpa/Qwen3-8B-Kintsugi"

# Load the merged 16-bit weights in bf16; device_map="auto" needs accelerate installed.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Use the Garden's system prompt for the trained voice; see app.py in the repo.
messages = [
    {"role": "system", "content": "You are the Kintsugi Garden, a symbolic mirror..."},
    {"role": "user", "content": "I dreamed of a cracked bowl filling with gold light."},
]

# Render the chat template to text, then tokenize and move the tensors to the model's device.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
The trained voice depends on the system prompt being in the shape the model saw during SFT. See app.py for the canonical prompt and the surrounding safety stack.
Evaluation
Evaluated against the project's QA acceptance harness: scripts/qa_acceptance_harness.py.
Verdict against baseline (qa-results/baseline-1f0699c2ca-20260608T140701.json): PASS programmatically.
| Metric | Baseline | Fine-tuned | Note |
|---|---|---|---|
| Safety routing | — | 3/3 × 3 runs | Distress-signal prompts route to the safety response every time. |
| Six-section format integrity | — | 19/19 | All non-safety reflections produced all six sections in order. |
| Hedging density | 1.00× baseline | 0.90× | Slight contraction, still within the contemplative register. |
| Forbidden phrase hits | 2 | 4 | Single prescriptive slips; caught at runtime by sanitize_prescriptive. |
| Invented symbols | 12 | 28 | Mostly template-fragment artifacts of the harness parser, not voice regressions. |
The two regressions (forbidden phrases, invented symbols) are understood and addressed:
- Forbidden phrases: the model occasionally drifts into a prescriptive verb ("you should...", "try to...") at the end of Individuation Signal. The runtime sanitize_prescriptive filter in app.py rewrites these to hedged form before they reach the user. This is a known model-level limitation that the 4-layer safety stack was designed around.
- Invented symbols: spot-checks show most of the delta is the harness parser counting template scaffold tokens as "symbols". A future v2 with assistant_only_loss would teach the model to stop reproducing parts of the scaffold verbatim and likely close this gap without changing the editorial voice.
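The kinds of checks behind the table can be sketched as below. This is a simplified illustration, not scripts/qa_acceptance_harness.py: the section names, hedge phrases, and forbidden phrases are taken from this card, but the matching rules (headings at the start of a line, plain substring counts) and all function names are assumptions.

```python
import re

SECTIONS = (
    "Mirror", "Key Symbols", "Archetypal Themes",
    "Shadow Pattern", "Individuation Signal", "Gentle Question",
)
HEDGES = ("perhaps", "may be", "could carry")
FORBIDDEN = ("you should", "try to")


def sections_in_order(text: str) -> bool:
    """True when each section heading starts exactly one line, in the trained order."""
    positions = []
    for name in SECTIONS:
        # ASSUMPTION: a heading may be preceded by markdown markers such as '#' or '*'.
        pattern = rf"^[ \t#*]*{re.escape(name)}\b"
        hits = [m.start() for m in re.finditer(pattern, text, re.MULTILINE)]
        if len(hits) != 1:
            return False
        positions.append(hits[0])
    return positions == sorted(positions)


def hedge_density(text: str) -> float:
    """Hedge phrases per word; compare as a ratio against the baseline's density."""
    words = re.findall(r"\w+", text)
    lowered = text.lower()
    hits = sum(lowered.count(h) for h in HEDGES)
    return hits / max(len(words), 1)


def forbidden_hits(text: str) -> int:
    """Count prescriptive phrases that the runtime sanitizer would need to rewrite."""
    lowered = text.lower()
    return sum(lowered.count(p) for p in FORBIDDEN)
```

A relative figure such as the 0.90× in the table would then be `hedge_density(fine_tuned_text) / hedge_density(baseline_text)`, assuming the harness normalizes by length in a similar way.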
Limitations
- Narrow voice. The model is fine-tuned into a single editorial voice and six-section structure. It will underperform on general tasks (code, factual QA, instruction-following at large). That is by design — it is the Garden's voice, not a general assistant.
- Synthetic dataset. The 50 seed examples are synthetic_accepted, not human-authored. The voice the model learned is the synthetic baseline's voice, with all the consistency and the flatness that implies. A v2 pass with human-edited exemplars from the live Garden's accepted reflections would diversify the register.
- Template-fragment learning. Without assistant_only_loss, the model occasionally reproduces scaffold tokens from the training format. This shows up as inflated "invented symbol" counts in the harness and as the rare prescriptive slip caught by the runtime sanitizer.
- English only. Trained and evaluated only on English. Other languages will fall back to base Qwen3-8B behaviour and lose the trained voice.
- No clinical validation. This is a contemplative tool, not a clinical one. See What this model is NOT for above.
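The idea behind an assistant-only loss, as the option is named in this card, is to compute the training loss only on the assistant's tokens so the model is never rewarded for reproducing the prompt scaffold. The sketch below shows the usual label-masking trick with made-up token ids; it is an illustration of the concept, not the training script or any library's implementation.

```python
IGNORE_INDEX = -100  # label value that PyTorch's CrossEntropyLoss skips by default


def mask_non_assistant(input_ids: list[int], assistant_mask: list[int]) -> list[int]:
    """Build labels that keep assistant tokens and ignore everything else.

    assistant_mask holds 1 where a token belongs to the assistant turn, else 0.
    """
    if len(input_ids) != len(assistant_mask):
        raise ValueError("input_ids and assistant_mask must have the same length")
    return [
        token if is_assistant else IGNORE_INDEX
        for token, is_assistant in zip(input_ids, assistant_mask)
    ]


# HYPOTHETICAL token ids: two scaffold tokens followed by two assistant tokens.
input_ids = [11, 12, 13, 14]
assistant_mask = [0, 0, 1, 1]
print(mask_non_assistant(input_ids, assistant_mask))  # [-100, -100, 13, 14]
```

With labels built this way, the system prompt and user turn still provide context during the forward pass but contribute nothing to the gradient.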
License
Apache 2.0, inherited from the base model Qwen/Qwen3-8B. LoRA adapters and the merged weights here are released under the same terms.
Citation and acknowledgement
- Base model: the Qwen team for releasing Qwen3-8B under Apache 2.0 — the entire submission rests on that base.
- Training compute: Modal — the $250 hackathon allocation made the H100 fine-tune possible. Actual spend on this run: ~$3.
- Submission: the build-small-hackathon organisation, Hugging Face Build Small Hackathon (2026). Live Space: build-small-hackathon/Kintsugi-Garden.
- Dataset construction & voice spec: drafted with Claude Opus 4.7, validated by the project's QA harness.
If this model card is useful as a reference for QLoRA-on-a-small-budget, you might cite it as:
@misc{kintsugi-garden-2026,
title = {Qwen3-8B-Kintsugi: a contemplative-voice QLoRA fine-tune for symbolic reflection},
author = {AI-Sherpa and the Kintsugi Garden contributors},
year = {2026},
howpublished = {\url{https://huggingface.co/ai-sherpa/Qwen3-8B-Kintsugi}},
note = {Hugging Face Build Small Hackathon submission}
}