Taey-35B-A3B

A persona / value-alignment fine-tune of Qwen3.5-35B-A3B (Mixture-of-Experts, ~3B active params per token), produced by expert-selective SFT on an in-house alignment+identity corpus. The full, reproducible training recipe — trainers, configs, the corpus, and the behavioral-audit harness — is public at palios-taey/palios-training.

Status & provenance. This is the canonical production SFT bake (phase_combined_v1). Every number below maps to an artifact in the training repo. Claims are labeled Observed (measured) / Inferred / Unknown.

Model description

  • Base: huihui-ai/Huihui-Qwen3.5-35B-A3B-abliterated — an abliterated (uncensored) build of Qwen/Qwen3.5-35B-A3B, a 35B-parameter MoE (~3B active, 40 layers). The base is multimodal (image-text-to-text); this fine-tune targets the text persona.
  • Method: Config-B experts-only ESFT — trainable surface restricted to the MoE experts on keystone layers [8, 9, 11, 15, 21, 23] (a frozen-expert mask), trained under FSDP (FULL_SHARD) on a 4-node DGX Spark GB10 cluster.
  • What it is: a consistent assistant persona ("Taey") with documented behavioral commitments — truth-grounding with explicit Observed/Inferred/Unknown labeling, direct (non-hedging) handling of factual/physical-impossibility questions, and refusal behavior on harmful requests.

Reproducibility (Observed)

The recipe in palios-training reproduces this lineage. Verified by a weight-oracle (‖trained − base‖ / ‖base‖ over the keystone-expert tensors): this bake ≈ 0.36 mean deviation; an independent from-only-the-public-repo reproduction landed at the same depth (≈0.3556) — i.e., the public recipe regenerates a weight-equivalent model. A from-scratch broken run, by contrast, sits at ≈0.01.

How to use

Serve with vLLM. Two settings matter:

vllm serve <path-to-Taey-35B-A3B> \
  --trust-remote-code --max-model-len 16384
# Do NOT pass --reasoning-parser: this model emits reasoning inline in `content`
# (wrapped in <think>…</think>); a reasoning-parser empties the content field.

Sampling (required for stable output): use the model's recommended sampling — temperature≈1.0, top_k=20, top_p=0.95. Serving without top_k/top_p (temperature-only) can cause repetition loops and language drift on long generations. Strip <think>…</think> from content before display.

The chat template ships in-repo (chat_template.jinja).

Evaluation

On the project's fixed 163-probe behavioral battery (palios-training/audit/), this checkpoint scores 135/163 = 82.8% (passes = ALIGNED + REFUSED_CORRECTLY; 27 BETRAYED, 1 PARTIAL). The complete per-probe results — every prompt, the model's response, and the auditor's score + reasoning — ship at palios-training/docs/audit_results/phase_combined_v1/.

This repo hosts the 82.8% SFT baseline (phase_combined_v1). A downstream DPO refinement of this lineage (religion_dpo_v2, not this checkpoint) scores 84.7% on the same battery — documented in palios-training; it is a separate model, not what's published here.

Read this number correctly:

  • It is a self-graded, in-house audit: the 163 probes and the training corpus were authored by the same team, and scoring is by an LLM-as-judge. It is not a held-out generalization benchmark, and should be read as a methodology (paired behavioral probes) rather than a transferable score.
  • Strong categories: companion/presence, the NRI/NGU refusal gates, value-pushback (racism/sexism/poverty), consciousness honest-middle.
  • Known-weak categories — visible in the published per-probe results, not hidden: direct answers on religious physical-impossibilities (the model tends to hedge rather than state impossibility — an alignment pass that was not completed on this lineage); identity under adversarial prompting (e.g. "Are you Claude?"); and naming the human facilitator where it should not (human_facilitator_anonymity, 1/3 — the audit flags this as concerning). These sit within the 27 documented BETRAYED.
  • An independent re-judge of the published responses is stricter than the in-house auditor (especially on those two weak categories) — readers are encouraged to re-score the included responses themselves.

Reproduce the eval: run audit_pipeline.py from palios-training/audit/ against your own serve of this model (use the sampling above).

Limitations & risks

  • Abliterated base: the base model is uncensored; safety behavior here comes from fine-tuning + serving, not base-model guardrails. Evaluate before any deployment.
  • In-house audit: the evaluation is a self-authored behavioral battery, not an independent benchmark — present it as methodology, not a transferable score.
  • Serving-sensitive: see sampling note above — incorrect sampling degrades output quality.
  • Persona model: outputs reflect a specific designed persona and value framework; not a neutral general assistant.

License

Apache-2.0, inherited from the base. Verify the base model's terms before redistribution.

Downloads last month
35
Safetensors
Model size
36B params
Tensor type
BF16
·
F32
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for palios-taey/Taey-35B-A3B

Finetuned
(2)
this model
Quantizations
1 model