TinyAya: Turkish⇄Hindi Speech-to-Speech Translation (v0.2)

⚠️ Research preview: this checkpoint overfit. It is released for transparency and to document the full training trajectory. The recommended weights are best_by_val (step 1,000), not the final step-15,000 checkpoint. See Evaluation & the overfitting story.

⚠️ Dataset disclosure: honest correction

This checkpoint was trained on tiny-aya-translate/fleurs-tr-hi-mimi-encoded, which is Mimi-encoded FLEURS Turkish↔Hindi read speech (~27% real FLEURS audio + ~73% multi-voice TTS over FLEURS text; ≈8.3k train / 929 val).

This was not the dataset we intended to train on. Our accompanying write-up describes a different corpus: the project's synthetic data pipeline (FLORES + OPUS-100 + machine-translated conversational text, rendered with multi-voice TTS into tiny-aya-translate/tr-hi-mimi-encoded, ~1.3M clips). The training launcher's HF_DATASET default pointed at the fleurs- sibling repo, so v0.2 silently trained on FLEURS read speech instead of the synthetic conversational corpus. We are disclosing this openly rather than quietly re-labeling the run.
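One way to guard against this kind of silent default is to make the dataset repo an explicit, validated input. The sketch below is illustrative only: `resolve_dataset_repo` and the `ALLOWED_DATASETS` mapping are hypothetical names, not part of the training launcher. Only the `HF_DATASET` variable name and the two repo ids come from this card.

```python
import os

# Repo ids are taken from this card; the mapping itself is a hypothetical helper.
ALLOWED_DATASETS = {
    "synthetic": "tiny-aya-translate/tr-hi-mimi-encoded",
    "fleurs": "tiny-aya-translate/fleurs-tr-hi-mimi-encoded",
}


def resolve_dataset_repo(expected, env=None):
    """Return the dataset repo id, refusing to fall back to any default."""
    env = os.environ if env is None else env
    repo = env.get("HF_DATASET")
    if not repo:
        raise ValueError("HF_DATASET must be set explicitly; no default is assumed")
    if repo != ALLOWED_DATASETS[expected]:
        raise ValueError(
            f"HF_DATASET={repo!r} does not match the expected "
            f"{expected!r} corpus ({ALLOWED_DATASETS[expected]!r})"
        )
    return repo
```

Calling `resolve_dataset_repo("synthetic")` at launch would fail fast if the variable is unset or still points at the fleurs- sibling repo, instead of training on the wrong corpus.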

v0.3 corrects the data source (→ tr-hi-mimi-encoded) together with the codebase fixes (parallel-stream collator, regularization, deep-codebook weighting). See tr-hi-s2st-v0.3.

Moshi-style simultaneous speech-to-speech translation for Turkish ⇄ Hindi: a LoRA-fine-tuned Cohere2 backbone fused with a frozen Moshi depth decoder, operating on Mimi audio codes in a parallel two-stream format.

  • Developed by: tiny-aya-translate
  • Funded by: Google TPU Research Cloud (TRC); see Acknowledgements
  • Model type: parallel two-stream S2ST (Cohere2 + LoRA → CB0; frozen Moshi depth decoder → CB1–7)
  • Languages: Turkish (tr), Hindi (hi)
  • License: Apache-2.0 (for this adapter + training code; see the base-model caveat below)
  • Finetuned from: CohereLabs/tiny-aya-base (+ frozen Moshi / Mimi from Kyutai)

Provenance: sweep → production run

This recipe was selected by a proxy-first W&B hyperparameter sweep (8 Bayesian + hyperband trials), then trained to 15k steps on a single TPU v6e-8.

Training procedure

  • Hardware: Google TPU v6e-8 (1 host × 8 chips), SPMD / FSDPv2-LoRA, bf16
  • Duration: 15,000 steps in ~24.9 h (~5.1 s/step), single continuous run
  • Effective batch: 256 (batch 8 × grad-accum 4 × 8 chips), max_frames 300
  • Stability: 0 non-finite / NaN / loss-spike alerts across the whole run
  • Recipe: lora.r=64, lora.alpha=128, lr_lora=4.6e-4, lr_depth=1.1e-4, text_weight=0.2, warmup=500, weight_decay=0.01
  • Data: tiny-aya-translate/fleurs-tr-hi-mimi-encoded (Mimi-encoded FLEURS TR↔HI read speech). ⚠️ Not the intended synthetic corpus (see Dataset disclosure above)
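The batch and timing figures above can be cross-checked with a few lines of arithmetic. This is a sketch, not output from the training code. It assumes that max_frames counts Mimi frames at Mimi's published 12.5 Hz frame rate, which this card does not state. The wall-clock average comes out near 6.0 s/step, somewhat above the quoted ~5.1 s/step. One plausible reading (an assumption) is that the quoted figure excludes evaluation and checkpointing time.

```python
# Values quoted in the training procedure above.
PER_CHIP_BATCH = 8
GRAD_ACCUM = 4
NUM_CHIPS = 8
STEPS = 15_000
WALL_CLOCK_HOURS = 24.9
MAX_FRAMES = 300

# Assumption: frames are Mimi frames at 12.5 Hz (not stated in this card).
MIMI_FRAME_RATE_HZ = 12.5

effective_batch = PER_CHIP_BATCH * GRAD_ACCUM * NUM_CHIPS
wall_clock_s_per_step = WALL_CLOCK_HOURS * 3600 / STEPS
max_clip_seconds = MAX_FRAMES / MIMI_FRAME_RATE_HZ

print(effective_batch)                  # 256
print(round(wall_clock_s_per_step, 2))  # 5.98
print(max_clip_seconds)                 # 24.0
```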

Evaluation & the overfitting story

The run was mechanically flawless but the recipe overfit: validation loss bottomed at step 1,000 and rose monotonically while train loss kept falling.

| step | train loss | val/loss | cb0 val acc |
|---|---|---|---|
| 1000 | 5.475 | 2.859 (best) | 13.9% |
| 5000 | 3.231 | 3.376 | 14.3% |
| 10000 | 1.887 | 4.060 | 13.8% |
| 15000 | 1.566 | 4.197 (worst) | 13.8% |

Final per-codebook val accuracy: cb0 13.8%, cb1 3.9%, cb2 1.9%, cb3 1.4%, cb4 0.8%, cb5 0.8%, cb6 0.6%, cb7 0.5%. The text stream was effectively memorized (train text loss → 0.39); audio is the bottleneck. Likely cause: the proxy sweep optimized short-horizon val/audio_loss, which favored high capacity (lora.r=64), and that capacity then memorized the train set over 15k steps.
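The trajectory above can be summarized programmatically. The sketch below copies the four reported rows and picks the best-by-val step, then prints the val-minus-train gap at each point. The helper names are illustrative. The train and validation losses may not be defined identically (for example, the train loss may include weighted text terms), so the sign change and widening of the gap are more informative than its absolute size.

```python
# (step, train_loss, val_loss) rows copied from the evaluation table above.
TRAJECTORY = [
    (1_000, 5.475, 2.859),
    (5_000, 3.231, 3.376),
    (10_000, 1.887, 4.060),
    (15_000, 1.566, 4.197),
]


def best_by_val(rows):
    """Return the step with the lowest validation loss."""
    return min(rows, key=lambda row: row[2])[0]


def generalization_gaps(rows):
    """Return (step, val_loss - train_loss) for each evaluated step."""
    return [(step, round(val - train, 3)) for step, train, val in rows]


print(best_by_val(TRAJECTORY))  # 1000
for step, gap in generalization_gaps(TRAJECTORY):
    print(step, gap)
```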

Downstream metrics (to be measured)

No speech-quality metrics have been computed yet. The planned evaluation will use the best_by_val checkpoint (step 1,000):

| Metric | Measures | Status |
|---|---|---|
| ASR-BLEU (Whisper-Lv3 → SacreBLEU) | translation quality | ☐ TODO |
| ASR-chrF / chrF++ | quality, more ASR-roundtrip-robust than BLEU | ☐ TODO |
| BLASER 2.0 | text-free S2ST quality | ☐ optional |
| COMET / COMET-Kiwi | semantic adequacy | ☐ optional |
| DNSMOS / UTMOS / NISQA | audio naturalness (MOS) | ☐ TODO |
| WER | intelligibility | ☐ TODO |

Reporting protocol (when filled): ASR backend + version, decoding (greedy/beam), seed, text normalization, and number of eval samples will be stated.
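As an illustration of the simplest planned metric, WER can be computed as word-level edit distance divided by the reference length. This is a minimal sketch, not the project's evaluation code: `normalize` is a placeholder (lowercasing plus whitespace splitting), and the real text normalization for Turkish and Hindi is exactly what the reporting protocol above says will be stated.

```python
def normalize(text):
    # Placeholder normalization (an assumption). Python's lower() is not
    # Turkish-locale-aware ("I" becomes "i", not "ı"), so a real protocol
    # would need language-specific handling.
    return text.lower().split()


def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the previous ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,                          # deletion
                curr[j - 1] + 1,                      # insertion
                prev[j - 1] + (ref_word != hyp_word), # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


print(wer("bugün hava çok güzel", "bugün hava güzel"))  # 0.25
```

In the planned pipeline the hypothesis would be an ASR transcript of the generated audio, so the score also reflects ASR errors.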

Checkpoints in this repo

Because of keep_last_n rotation during the run, only the checkpoints below survived (steps 2,000–12,000 were not retained; this is fixed for future runs):

| Folder | Step | val/loss | Use |
|---|---|---|---|
| best_by_val/ | 1,000 | 2.859 | ✅ recommended (full resumable state) |
| checkpoints/step_13000/ | 13,000 | ~4.19 | overfit (trajectory) |
| checkpoints/step_14000/ | 14,000 | ~4.19 | overfit (trajectory) |
| checkpoints/step_15000/ | 15,000 | 4.197 | overfit (trajectory) |
| checkpoints/step_15000_final/ | 15,000 | 4.197 | canonical final (overfit) |
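A retention policy that avoids losing the middle of the trajectory could combine the rolling window with periodic milestones. The sketch below is hypothetical and does not describe the fix actually adopted for future runs. `steps_to_keep`, `milestone_every`, and the value `keep_last_n=3` (inferred from the three surviving step folders) are all assumptions.

```python
def steps_to_keep(saved_steps, keep_last_n, milestone_every, best_step=None):
    """Return the sorted checkpoint steps to retain.

    Keeps the most recent `keep_last_n` steps, every step that is a multiple
    of `milestone_every`, and the best-by-val step if one is given.
    """
    ordered = sorted(saved_steps)
    keep = set(ordered[-keep_last_n:]) if keep_last_n > 0 else set()
    keep |= {step for step in ordered if step % milestone_every == 0}
    if best_step is not None and best_step in ordered:
        keep.add(best_step)
    return sorted(keep)


saved = range(1_000, 16_000, 1_000)  # one checkpoint per 1,000 steps (assumed)
print(steps_to_keep(saved, keep_last_n=3, milestone_every=5_000, best_step=1_000))
# [1000, 5000, 10000, 13000, 14000, 15000]
```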

Each checkpoint contains the composite components (depth_decoder.pt, text_embed.pt, audio_heads.pt, model_audio_embed.pt, projection.pt, metadata.json) plus the LoRA adapter under peft_adapter/ (adapter_model.safetensors + adapter_config.json). train_15k.log is the full training log.
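Before loading, it can help to confirm that a downloaded checkpoint folder is complete. The following sketch checks for the file names listed above. `missing_files` is an illustrative helper, not part of the training repo.

```python
from pathlib import Path

# File names as listed in this card.
COMPONENT_FILES = [
    "depth_decoder.pt",
    "text_embed.pt",
    "audio_heads.pt",
    "model_audio_embed.pt",
    "projection.pt",
    "metadata.json",
]
ADAPTER_FILES = ["adapter_model.safetensors", "adapter_config.json"]


def missing_files(checkpoint_dir):
    """Return the expected files (relative paths) absent from a checkpoint folder."""
    root = Path(checkpoint_dir)
    expected = [root / name for name in COMPONENT_FILES]
    expected += [root / "peft_adapter" / name for name in ADAPTER_FILES]
    return [path.relative_to(root).as_posix() for path in expected if not path.is_file()]
```

An empty list from `missing_files("best_by_val")` indicates that every component and adapter file is present.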

Intended uses & limitations

  • Intended: research, reproduction, and studying the TR⇄HI S2ST + overfitting trajectory. A teaching artifact for sweep→run→diagnosis.
  • Out of scope / limitations: not production-ready (overfit, low per-codebook accuracy, no human eval, narrow domain of FLEURS read speech). May hallucinate or produce low-naturalness audio. Do not deploy for real translation without re-training (see below).

How to use

This is a composite model (custom architecture), not a drop-in transformers pipeline. Load via the training repo's src/model/composite.py: base Cohere2 backbone + the LoRA adapter (peft_adapter/) + the .pt components, then decode Mimi codes to audio. See the repo README for the loading + inference path. Use the best_by_val folder.
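Fetching only the recommended folder might look like the sketch below. It covers the download step alone. The loading API in src/model/composite.py is not described in this card, so it is deliberately left out. The repo id is taken from the citation URL, and `download_recommended` is an illustrative name.

```python
REPO_ID = "tiny-aya-translate/tr-hi-s2st-v0.2"  # from the citation URL in this card
RECOMMENDED_FOLDER = "best_by_val"


def download_recommended(local_dir="tr-hi-s2st-v0.2"):
    """Download only the best_by_val folder and return the local snapshot path."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        # The pattern also matches nested files such as peft_adapter/*.
        allow_patterns=[f"{RECOMMENDED_FOLDER}/*"],
        local_dir=local_dir,
    )
```

Restricting the download with `allow_patterns` skips the overfit step_13000 to step_15000 folders, which are only needed for studying the trajectory.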

Recommended next run (fixing the overfit)

Lower capacity + regularization + early stopping: lora_r 16–32, higher weight_decay/dropout, stop at ~1–2k steps (or use best-N checkpoint averaging), and/or more/augmented data.
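The early-stopping part of this recommendation can be sketched as a small patience-based tracker. `EarlyStopper` is an illustrative class, not code from the training repo, and the patience value is an assumption. The usage replays the validation losses reported in this card, which are sparse. A real run would evaluate more often.

```python
class EarlyStopper:
    """Stop when validation loss has not improved for `patience` evaluations."""

    def __init__(self, patience, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.best_step = None
        self.bad_evals = 0

    def update(self, step, val_loss):
        """Record one evaluation; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best, self.best_step, self.bad_evals = val_loss, step, 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


stopper = EarlyStopper(patience=2)
for step, val_loss in [(1_000, 2.859), (5_000, 3.376), (10_000, 4.060), (15_000, 4.197)]:
    if stopper.update(step, val_loss):
        print(f"stop at step {step}; best was step {stopper.best_step}")
        break
# stop at step 10000; best was step 1000
```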

Bias, risks & limitations

Trained on FLEURS (read speech, limited speakers/domains); quality and fairness across dialects, accents, code-switching, and spontaneous speech are untested. Speech translation can mistranslate, omit, or fabricate content, so outputs must not be relied upon for high-stakes communication.

License caveat (important)

The Apache-2.0 license here covers this LoRA adapter and the training code. The model is built on CohereLabs/tiny-aya-base and Moshi/Mimi, so your use of the assembled model is governed by those upstream licenses (the Aya family is often released under non-commercial terms). Check the base-model and Moshi/Mimi licenses before any commercial use.

Acknowledgements

This model was trained on Cloud TPU v6e-8 hardware generously provided by Google's TPU Research Cloud (TRC) program. We thank the TRC team for supporting this research.

Citation

@misc{tinyaya_tr_hi_s2st_v0_2,
  title  = {TinyAya: Turkish-Hindi Speech-to-Speech Translation (v0.2)},
  author = {tiny-aya-translate},
  year   = {2026},
  note   = {Cohere2 + frozen Moshi depth decoder, LoRA, trained on Google TRC TPU v6e-8},
  url    = {https://huggingface.co/tiny-aya-translate/tr-hi-s2st-v0.2}
}