rosettia-chanka-4b-alpha160

A 4B Spanish → Quechua Chanka (quy_Latn) translation model. Full-weight merge of the team's Chanka-specialized 4B base (Qwen3.5-4B → broad-Quechua LoRA SFT → merge → full FT on clean Chanka) plus the v13 compact-mixed LoRA loaded at lora_alpha=160 (the inference-time α-scaling tweak that won our study). Built for #HACKATHONSomosNLP 2026 as part of the Rosettia project.

This is a single self-contained model — no PEFT required at inference. Load with AutoModelForCausalLM.

Headline result

Metric (158-row clean Chanka held-out)	This model	4B baseline (no compact-mixed)	Δ
chrF++	56.94	43.49	+13.45
BLEU	30.76	16.14	+14.62
token F1	46.43	28.94	+17.49
TER (↓)	62.21	82.49	−20.28

Beats every off-the-shelf zero-shot MT system we tested on Chanka (NLLB-200 600M/1.3B/3.3B, TranslateGemma 4B, Gemma 4 E4B, Hy-MT2 7B, T5Gemma) by 28-50 chrF++.
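The token F1 row in the table above is not defined in this card. As a rough illustration only, the sketch below assumes a whitespace-tokenized, bag-of-tokens overlap F1 per sentence; the function name `token_f1` and the tokenization are assumptions, and the project's eval script may differ.

```python
from collections import Counter


def token_f1(hypothesis: str, reference: str) -> float:
    """Bag-of-tokens F1 between one hypothesis and one reference.

    Assumption: tokens are whitespace-separated and matched exactly.
    """
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A corpus-level score in the 0-100 range would then be 100 times the mean of the per-sentence values, again under the assumptions stated above.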

How to use

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

MODEL_ID = "Thermostatic/rosettia-chanka-4b-alpha160"
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto")

source = "Yo vivo en Quinua"
messages = [
    {"role": "system", "content": "Eres un traductor profesional español-quechua chanka."},
    {"role": "user", "content": (
        "Traduce del español al quechua chanka. Usa una traducción directa, "
        "natural y fiel. Conserva nombres, números y entidades; evita copiar "
        "el español salvo cuando sea necesario.\n\n"
        f"Español: {source}"
    )},
]
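# Render the chat prompt with thinking mode disabled, then tokenize it.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
inputs = tok(prompt, return_tensors="pt").to(model.device)
# Greedy decoding; 96 new tokens matches the eval recipe's --max-completion-length.
out = model.generate(**inputs, max_new_tokens=96, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True))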
# Expected: "Quinuapim tiyani"

For the best held-out numbers, also append the project's terminology glossary to the user prompt as a short bullet list, matching entries by source word and limiting the list to top-k=1 glossary entry (see the Rosettia dataset card for the parquet). The eval recipe used --terminology-file clean_chanka/manual_quechua_chanka_glossary_simple_terms.parquet --terminology-top-k 1 --max-completion-length 96.
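A minimal sketch of the glossary step, under stated assumptions: the glossary is represented here as a plain dict with placeholder values (TERM_A, TERM_B are not real Chanka terms), the parquet's column names are not shown in this card, the "Glosario:" header is hypothetical, and top_k is interpreted as the maximum number of matched entries added to the prompt. The helper names `match_glossary` and `build_user_prompt` are illustrative, not part of the project's code.

```python
import re

BASE_INSTRUCTION = (
    "Traduce del español al quechua chanka. Usa una traducción directa, "
    "natural y fiel. Conserva nombres, números y entidades; evita copiar "
    "el español salvo cuando sea necesario."
)


def match_glossary(source, glossary, top_k=1):
    """Return up to top_k (term, translation) pairs, in order of appearance."""
    hits = []
    seen = set()
    for word in re.findall(r"\w+", source.lower()):
        if word in glossary and word not in seen:
            seen.add(word)
            hits.append((word, glossary[word]))
    return hits[:top_k]


def build_user_prompt(source, glossary, top_k=1):
    """Build the user message, appending matched glossary entries as bullets."""
    prompt = f"{BASE_INSTRUCTION}\n\nEspañol: {source}"
    hits = match_glossary(source, glossary, top_k)
    if hits:
        bullets = "\n".join(f"- {term}: {target}" for term, target in hits)
        # Assumption: the header text and bullet layout are illustrative.
        prompt += f"\n\nGlosario:\n{bullets}"
    return prompt


# Placeholder glossary; the real entries live in the project's parquet file.
example_glossary = {"juez": "TERM_A", "demanda": "TERM_B"}
example_prompt = build_user_prompt("El juez admite la demanda", example_glossary)
```

The resulting string would replace the "content" of the user message in the snippet above; sentences with no glossary match fall back to the plain prompt.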

Training recipe (3-stage SFT + free α-scaling)

The LoRA component is the product of a 3-stage continuation chain on top of the team's Chanka-specialized base; the α-scaling tweak is applied at merge time (no separate inference step):

Stage	Data	Steps	LR	Resulting chrF++
v11	`self_verifiable_compact_mixed_sft.jsonl` (1,055 Chanka pairs × {direct, compact-thinking})	512	5e-6	54.06
v12	same	128	1e-6 (continuation)	55.47
v13	same	32	5e-7 (continuation)	55.76
Inference α-scaling (now baked in)	—	—	—	56.94

LoRA config: r=64, α=128 (trained) → α=160 (merged in this model), dropout 0, target modules [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]. Trained with Unsloth on an L40S. The 4B Chanka-specialized base was: raw unsloth/Qwen3.5-4B → broad Quechua LoRA SFT (~768 steps on 169k AmericasNLP + SomosNLP pairs) → merge LoRA into full model → 48 steps full FT on clean Chanka.

The "compact-mixed" dataset construction: for each reviewed Chanka pair, we created two SFT rows — one with prompt_mode=direct (target = the Chanka translation only) and one with prompt_mode=compact (target = Analisis: ... Final: ... Puntaje: \boxed{...} — a DeepSeekMath-V2-inspired self-verification format). Multi-task training on this mix contributes ~+2 chrF++ over pure direct training at matched step count.
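The two-rows-per-pair construction described above can be sketched as follows. This is an illustration under assumptions: the field names (`source`, `target`), the newline separators inside the compact target, and the content of the analysis and score fields are hypothetical; only the prompt_mode values and the Analisis / Final / Puntaje layout come from the description above.

```python
def make_compact_mixed_rows(spanish, chanka, analysis, score):
    """Return the two SFT rows built from one reviewed (spanish, chanka) pair."""
    # Row 1: direct mode, the target is the Chanka translation only.
    direct = {"prompt_mode": "direct", "source": spanish, "target": chanka}
    # Row 2: compact mode, the target wraps the same translation in the
    # self-verification format. Separators are an assumption.
    compact_target = (
        f"Analisis: {analysis}\n"
        f"Final: {chanka}\n"
        f"Puntaje: \\boxed{{{score}}}"
    )
    compact = {"prompt_mode": "compact", "source": spanish, "target": compact_target}
    return [direct, compact]


rows = make_compact_mixed_rows(
    "Yo vivo en Quinua", "Quinuapim tiyani", analysis="...", score="1.0"
)
```

Both rows share the same gold translation, so the compact row adds an auxiliary objective without introducing new parallel data.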

The α-scaling discovery: loading the LoRA at inference with lora_alpha set to 1.25× the trained value (α=160 instead of 128) captures headroom that gradient descent left on the table. We swept α ∈ {64, 96, 128, 160, 192, 224, 256} on the held-out set and found a clean unimodal peak at α=160. Above 1.5× the trained value (α > 192) the model degrades quickly.
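To make the effect of α concrete, here is a toy sketch of a LoRA merge in pure Python. It assumes the standard (non-rank-stabilized) LoRA scaling, where the adapter update is multiplied by α / r before being added to the base weight; the function names are illustrative and this is not the project's merge script.

```python
def lora_scale(lora_alpha, r):
    """Standard LoRA scaling factor applied to the low-rank update B @ A."""
    return lora_alpha / r


def merge_lora(W, A, B, lora_alpha):
    """Return W + (lora_alpha / r) * (B @ A) for small list-of-lists matrices.

    W is d_out x d_in, B is d_out x r, A is r x d_in.
    """
    r = len(A)
    s = lora_scale(lora_alpha, r)
    merged = []
    for i in range(len(W)):
        row = []
        for j in range(len(W[0])):
            delta = sum(B[i][k] * A[k][j] for k in range(r))
            row.append(W[i][j] + s * delta)
        merged.append(row)
    return merged


# With r=64, the trained and merged alphas give these scaling factors.
trained_scale = lora_scale(128, 64)  # 2.0
merged_scale = lora_scale(160, 64)   # 2.5
```

Under this assumption, raising α from 128 to 160 leaves the learned A and B untouched and simply multiplies the adapter's contribution by 2.5 / 2.0 = 1.25 at merge time, which is why it costs no extra training.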

What didn't help (the negative results)

A summary of the approaches we tried that did not outperform this model:

GSPO with the team's learned verifier as RL reward — regresses chrF++ ~1.5 from any strong SFT base.
Self-scored Best-of-N — self-scores saturate at 1.0 with 100% false-confidence rate; equivalent to random.
Linear listwise text reranker on K=16/32 sampled candidates — captures 0% of the 12-chrF oracle gap.
Mergekit-style task-vector amplification — peak chrF++ 55.89, slightly below LoRA α-scaling, and loses 3 chrF in the merge round-trip.
Activation-diff-guided targeted LoRA (top-8 layers of 32) — loses ~9 chrF vs full-layer training.
Engineered reasoning traces from DS Flash (verbose / compact / natural CoT, 668-793 accepted of 838): all regressed or matched without lift. DS Flash confabulated morphology (e.g. claiming kallpa=calle; real Chanka is ñan).
Grounded reasoning traces from parallel Claude agents using the QHESWA Cuzco-Collao grammar manual + AMLQ Cuzco dictionary as RAG context (841 high-quality traces with 4.5-7.3 morpheme citations each): even rigorously grounded supervision did not outperform plain direct SFT. The model internalizes Chanka morphology implicitly from (source, gold) pairs faster than from explicit symbolic reasoning.
Native Qwen3.5 <think> block training — both with enable_thinking=True (double-wrap bug, chrF++ < 3) and with literal <think> text + enable_thinking=False (chrF++ 44.55, barely beats baseline).
Scaling to raw Qwen3.5-9B + compact-mixed (skipping the Chanka pretraining stage) — plateaus at chrF++ 32.08, well below the 4B chain.
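The reranker item above reports how much of the oracle gap was captured. One way to read that figure is sketched below, assuming per-sentence quality scores for each of the K sampled candidates, a baseline equal to the expected score of a random pick, and an oracle that always picks the best candidate. These definitions and the function name are assumptions, and averaging sentence-level scores only approximates a corpus-level chrF++.

```python
def oracle_gap_captured(candidate_scores, picked_indices):
    """Return (baseline, oracle, fraction of the oracle gap captured).

    candidate_scores: one list of K scores per source sentence.
    picked_indices: the candidate index the reranker chose per sentence.
    """
    n = len(candidate_scores)
    # Baseline: mean score of a uniformly random candidate.
    baseline = sum(sum(s) / len(s) for s in candidate_scores) / n
    # Oracle: always select the best-scoring candidate.
    oracle = sum(max(s) for s in candidate_scores) / n
    picked = sum(s[i] for s, i in zip(candidate_scores, picked_indices)) / n
    gap = oracle - baseline
    captured = (picked - baseline) / gap if gap > 0 else 0.0
    return baseline, oracle, captured


toy_scores = [[40.0, 60.0, 50.0], [30.0, 40.0, 50.0]]
print(oracle_gap_captured(toy_scores, [2, 1]))  # reranker no better than random
```

A captured fraction near 0 means the reranker's picks score about the same as random selection, even though better candidates exist among the samples.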

Data, leakage protections

Training data is the Thermostatic/rosettia-chanka-data clean Chanka subset (1,055 reviewed judicial-domain Spanish-Chanka pairs from the public manual Quechua Chanka Administración Justicia 2014). After a deterministic eval split with validation_fraction=0.15, seed=3407, the 158 eval-set pairs are excluded from training. We also filter out 56 train rows whose Spanish source also appears in eval (a side effect of slash-alternative splits in the dataset), leaving 841 strictly leak-free training pairs. All reported metrics are on the 158-row held-out set.
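The split-then-filter procedure can be sketched as below. This is a minimal illustration, not the project's split code: the shuffling method, the truncating eval-size computation, and the lowercase/strip normalization used to compare Spanish sources are all assumptions, so it will not reproduce the exact 158 rows.

```python
import random


def normalize(text):
    """Assumed normalization for comparing Spanish sources."""
    return text.strip().lower()


def split_and_filter(pairs, validation_fraction=0.15, seed=3407):
    """Split (spanish, chanka) pairs, then drop train rows that leak into eval."""
    rng = random.Random(seed)  # fixed seed makes the split deterministic
    order = list(range(len(pairs)))
    rng.shuffle(order)
    n_eval = int(len(pairs) * validation_fraction)
    eval_idx = set(order[:n_eval])
    eval_pairs = [pairs[i] for i in sorted(eval_idx)]
    eval_sources = {normalize(es) for es, _ in eval_pairs}
    # Keep a train row only if its Spanish source never appears in eval.
    train_pairs = [
        pair
        for i, pair in enumerate(pairs)
        if i not in eval_idx and normalize(pair[0]) not in eval_sources
    ]
    return train_pairs, eval_pairs
```

With the numbers from this card, int(1055 × 0.15) = 158 eval rows, and 1055 − 158 − 56 = 841 training rows remain after the source-overlap filter.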

Limitations and intended use

Domain: training data is judicial / administrative Spanish-Chanka. Out-of-domain performance is not characterized.
Variant: Chanka (Ayacucho/Apurímac/Huancavelica, quy_Latn). Not appropriate for Cuzco-Collao (quz), Bolivian (quh), Northern (qup), or other Quechua varieties without further adaptation.
Capacity ceiling: this model is at the empirical limit of what we extracted from 1,055 reviewed Chanka pairs. More data is the most likely path to better numbers.

Citation / attribution

Built for the #HACKATHONSomosNLP 2026 project Rosettia – Quechua by the Thermostatic team. Trained on top of unsloth/Qwen3.5-4B with Unsloth and PEFT. Compact-mixed training data construction inspired by DeepSeekMath-V2 (we adapted its self-verification format as an auxiliary SFT objective, not as an RL reward — see the full negative-results study for why the RL variant failed for this task).

The source dataset PDF (the 2014 judicial manual) is in the public domain and reproduction is permitted with citation.