rosettia-chanka-4b-alpha160

A 4B-parameter Spanish → Quechua Chanka (quy_Latn) translation model. It is a full-weight merge of the team's Chanka-specialized 4B base (Qwen3.5-4B → broad-Quechua LoRA SFT → merge → full FT on clean Chanka) with the v13 compact-mixed LoRA applied at lora_alpha=160 (the α-scaling tweak that won our study, now baked into the weights). Built for #HACKATHONSomosNLP 2026 as part of the Rosettia project.

This is a single self-contained model — no PEFT required at inference. Load with AutoModelForCausalLM.

Headline result

| Metric (158-row clean Chanka held-out) | This model | 4B baseline (no compact-mixed) | Δ |
|---|---|---|---|
| chrF++ | 56.94 | 43.49 | +13.45 |
| BLEU | 30.76 | 16.14 | +14.62 |
| token F1 | 46.43 | 28.94 | +17.49 |
| TER (↓) | 62.21 | 82.49 | −20.28 |

Beats every off-the-shelf zero-shot MT system we tested on Chanka (NLLB-200 600M/1.3B/3.3B, TranslateGemma 4B, Gemma 4 E4B, Hy-MT2 7B, T5Gemma) by 28 to 50 chrF++ points.

[Figure: chrF++ across models]

[Figure: multi-metric comparison]

How to use

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

MODEL_ID = "Thermostatic/rosettia-chanka-4b-alpha160"
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto")

source = "Yo vivo en Quinua"
messages = [
    {"role": "system", "content": "Eres un traductor profesional español-quechua chanka."},
    {"role": "user", "content": (
        "Traduce del español al quechua chanka. Usa una traducción directa, "
        "natural y fiel. Conserva nombres, números y entidades; evita copiar "
        "el español salvo cuando sea necesario.\n\n"
        f"Español: {source}"
    )},
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=96, do_sample=False)
print(tok.decode(out[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
# Expected: "Quinuapim tiyani"

For best held-out numbers, also pass the project's terminology glossary (see the Rosettia dataset card for the parquet). Glossary entries are matched by source word, and the top-k=1 match is appended to the user prompt as a small bullet list. The eval recipe used --terminology-file clean_chanka/manual_quechua_chanka_glossary_simple_terms.parquet --terminology-top-k 1 --max-completion-length 96.
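The glossary step described above can be sketched roughly as follows. This is a minimal illustration, not the project's eval code: the helper names (`match_glossary`, `build_user_prompt`), the "Glosario:" header, the bullet format, and the toy entries are all assumptions, and the real entries come from the parquet file.

```python
import re

# Toy entries for illustration only; the real glossary lives in the project's parquet.
TOY_GLOSSARY = {"casa": "wasi", "agua": "yaku"}

INSTRUCTION = (
    "Traduce del español al quechua chanka. Usa una traducción directa, "
    "natural y fiel. Conserva nombres, números y entidades; evita copiar "
    "el español salvo cuando sea necesario."
)


def match_glossary(source, glossary, top_k=1):
    """Return up to top_k (spanish, chanka) entries whose Spanish term occurs in source.

    Matching is by lowercased whole word, in order of first appearance
    (an assumed tie-breaking rule).
    """
    hits = []
    for word in re.findall(r"\w+", source.lower()):
        entry = (word, glossary.get(word))
        if entry[1] is not None and entry not in hits:
            hits.append(entry)
    return hits[:top_k]


def build_user_prompt(source, glossary, top_k=1):
    """Build the user message, appending matched glossary entries as a bullet list."""
    prompt = f"{INSTRUCTION}\n\nEspañol: {source}"
    hits = match_glossary(source, glossary, top_k)
    if hits:
        bullets = "\n".join(f"- {es} = {quy}" for es, quy in hits)
        prompt += f"\n\nGlosario:\n{bullets}"
    return prompt


print(build_user_prompt("La casa tiene agua", TOY_GLOSSARY))
```

The returned string would replace the "content" of the user message in the snippet above; with top_k=1 only the first matched term is listed.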

Training recipe (3-stage SFT + free α-scaling)

The LoRA component is the product of a 3-stage continuation chain on top of the team's Chanka-specialized base; the α-scaling tweak is applied at merge time (no separate inference step):

| Stage | Data | Steps | LR | Resulting chrF++ |
|---|---|---|---|---|
| v11 | self_verifiable_compact_mixed_sft.jsonl (1,055 Chanka pairs × {direct, compact-thinking}) | 512 | 5e-6 | 54.06 |
| v12 | same | 128 | 1e-6 (continuation) | 55.47 |
| v13 | same | 32 | 5e-7 (continuation) | 55.76 |
| Inference α-scaling (now baked in) | | | | 56.94 |

LoRA config: r=64, α=128 (trained) → α=160 (merged in this model), dropout 0, target modules [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]. Trained with Unsloth on an L40S. The 4B Chanka-specialized base was: raw unsloth/Qwen3.5-4B → broad Quechua LoRA SFT (~768 steps on 169k AmericasNLP + SomosNLP pairs) → merge LoRA into full model → 48 steps full FT on clean Chanka.

The "compact-mixed" dataset construction: for each reviewed Chanka pair, we created two SFT rows. One has prompt_mode=direct (target = the Chanka translation only) and the other has prompt_mode=compact (target = Analisis: ... Final: ... Puntaje: \boxed{...}, a DeepSeekMath-V2-inspired self-verification format). Multi-task training on this mix contributes roughly +2 chrF++ over pure direct training at matched step count.
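As a rough sketch of the two-rows-per-pair construction: only `prompt_mode` and the Analisis / Final / Puntaje layout come from the description above. The function name, the other field names, the newline separators, and the score value are hypothetical, and the analysis text is left as a placeholder.

```python
def make_compact_mixed_rows(spanish, chanka, analysis, score):
    """Expand one reviewed pair into a direct row and a compact self-verification row."""
    direct_row = {
        "prompt_mode": "direct",
        "source": spanish,
        "target": chanka,  # translation only
    }
    # Doubled braces emit literal braces around the score inside \boxed{...}.
    compact_target = (
        f"Analisis: {analysis}\n"
        f"Final: {chanka}\n"
        f"Puntaje: \\boxed{{{score}}}"
    )
    compact_row = {
        "prompt_mode": "compact",
        "source": spanish,
        "target": compact_target,
    }
    return [direct_row, compact_row]


rows = make_compact_mixed_rows(
    spanish="Yo vivo en Quinua",
    chanka="Quinuapim tiyani",
    analysis="...",   # placeholder; the real analysis text is not shown in this card
    score="1.0",      # hypothetical score value
)
for row in rows:
    print(row["prompt_mode"], "->", repr(row["target"]))
```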

The α-scaling discovery: applying the LoRA with lora_alpha set to 1.25× the trained value (160 instead of 128) recovers quality that gradient descent left on the table. We swept α ∈ {64, 96, 128, 160, 192, 224, 256} on the held-out set and found a clean unimodal peak at α=160. Above 1.5× the trained value (α > 192), the model degrades quickly.
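To make the arithmetic concrete, here is a small pure-Python sketch of what changing α does at merge time. It assumes the standard LoRA scaling factor α/r (not the rsLoRA variant α/√r, which this card does not mention), and the 2×2 rank-1 matrices are toy values rather than real weights.

```python
TRAINED_ALPHA = 128
MERGED_ALPHA = 160
RANK = 64
SWEEP = [64, 96, 128, 160, 192, 224, 256]


def lora_scale(alpha, r):
    """Standard LoRA scaling factor applied to the low-rank update B @ A."""
    return alpha / r


def matmul(a, b):
    """Plain nested-list matrix product, enough for a toy example."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(w, lora_a, lora_b, scale):
    """Return W + scale * (B @ A), the merged full-weight matrix."""
    delta = matmul(lora_b, lora_a)
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


# Multipliers relative to the trained alpha for each point in the sweep.
multipliers = [alpha / TRAINED_ALPHA for alpha in SWEEP]
print(multipliers)  # [0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0]

# Toy weights: W is 2x2, B is 2x1, A is 1x2 (rank 1 for readability).
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]
A = [[0.5, 0.0]]

print(merge_lora(W, A, B, lora_scale(TRAINED_ALPHA, RANK)))  # scale 2.0
print(merge_lora(W, A, B, lora_scale(MERGED_ALPHA, RANK)))   # scale 2.5
```

Under this assumption, moving from α=128 to α=160 multiplies the same learned update by 1.25 without any retraining, which is why the change can be baked into the merged weights.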

What didn't help (the negative results)

A study-quality summary of what we tried that did not outperform this model:

  • GSPO with the team's learned verifier as RL reward — regresses chrF++ ~1.5 from any strong SFT base.
  • Self-scored Best-of-N — self-scores saturate at 1.0 with 100% false-confidence rate; equivalent to random.
  • Linear listwise text reranker on K=16/32 sampled candidates — captures 0% of the 12-chrF oracle gap.
  • Mergekit-style task-vector amplification — peak chrF++ 55.89, slightly below LoRA α-scaling, and loses 3 chrF in the merge round-trip.
  • Activation-diff-guided targeted LoRA (top-8 layers of 32) — loses ~9 chrF vs full-layer training.
  • Engineered reasoning traces from DS Flash (verbose / compact / natural CoT, 668-793 accepted of 838): all regressed or matched without lift. DS Flash confabulated morphology (e.g. claiming kallpa=calle; real Chanka is ñan).
  • Grounded reasoning traces from parallel Claude agents using the QHESWA Cuzco-Collao grammar manual + AMLQ Cuzco dictionary as RAG context (841 high-quality traces with 4.5-7.3 morpheme citations each): even rigorously grounded supervision did not outperform plain direct SFT. The model internalizes Chanka morphology implicitly from (source, gold) pairs faster than from explicit symbolic reasoning.
  • Native Qwen3.5 <think> block training — both with enable_thinking=True (double-wrap bug, chrF++ < 3) and with literal <think> text + enable_thinking=False (chrF++ 44.55, barely beats baseline).
  • Scaling to raw Qwen3.5-9B + compact-mixed (skipping the Chanka pretraining stage) — plateaus at chrF++ 32.08, well below the 4B chain.

Data, leakage protections

Training data is the Thermostatic/rosettia-chanka-data clean Chanka subset (1,055 reviewed judicial-domain Spanish-Chanka pairs from the public manual Quechua Chanka Administración Justicia 2014). After a deterministic eval split with validation_fraction=0.15, seed=3407, the 158 eval-set Chanka sources are excluded from training, leaving 897 candidate training rows. We also filter out 56 train rows whose Spanish source also appears in eval (from slash-alternative splits in the dataset), giving 841 strictly leak-free training pairs. All metrics reported are on the 158-row held-out set.
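The split-then-filter logic can be sketched as below. This is a stand-in written with the standard library, not the project's actual splitting code, so it will not reproduce the exact same 158 rows. The function names, the floor rounding, and the lowercase/whitespace normalization are assumptions.

```python
import random


def normalize(text):
    """Assumed normalization for comparing Spanish sources: lowercase, collapse spaces."""
    return " ".join(text.lower().split())


def split_and_filter(pairs, validation_fraction=0.15, seed=3407):
    """Deterministically split (spanish, chanka) pairs, then drop leaky train rows."""
    indices = list(range(len(pairs)))
    random.Random(seed).shuffle(indices)

    n_eval = int(len(pairs) * validation_fraction)  # floor: 1,055 pairs -> 158
    eval_rows = [pairs[i] for i in indices[:n_eval]]
    train_rows = [pairs[i] for i in indices[n_eval:]]

    # Remove any train row whose Spanish source also appears in the eval set.
    eval_sources = {normalize(es) for es, _ in eval_rows}
    clean_train = [(es, quy) for es, quy in train_rows if normalize(es) not in eval_sources]
    return clean_train, eval_rows


# Synthetic placeholder pairs with some repeated sources, mimicking slash-alternative splits.
pairs = [(f"fuente {i % 1000}", f"destino {i}") for i in range(1055)]
clean_train, eval_rows = split_and_filter(pairs)
print(len(eval_rows), len(clean_train))
```

With 1,055 input pairs the eval set has 158 rows, matching the numbers above; how many train rows the filter removes depends on the data.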

Limitations and intended use

  • Domain: training data is judicial / administrative Spanish-Chanka. Out-of-domain performance is not characterized.
  • Variant: Chanka (Ayacucho/Apurímac/Huancavelica, quy_Latn). Not appropriate for Cuzco-Collao (quz), Bolivian (quh), Northern (qup), or other Quechua varieties without further adaptation.
  • Capacity ceiling: this model is at the empirical limit of what we extracted from 1,055 reviewed Chanka pairs. More data is the most likely path to better numbers.

Citation / attribution

Built for the #HACKATHONSomosNLP 2026 project Rosettia – Quechua by the Thermostatic team. Trained on top of unsloth/Qwen3.5-4B with Unsloth and PEFT. Compact-mixed training data construction inspired by DeepSeekMath-V2 (we adapted its self-verification format as an auxiliary SFT objective, not as an RL reward — see the full negative-results study for why the RL variant failed for this task).

The source dataset PDF (the 2014 judicial manual) is in the public domain and reproduction is permitted with citation.
