SubtitleEN2TW-0.6B (Q5_K_M GGUF)

A fine-tuned, quantized model for cue-level, context-aware English → Taiwan Traditional Chinese subtitle translation, optimized for real-time local inference.

Model Details

| Item | Value |
|---|---|
| Base model | NiuTrans/LMT-60-0.6B |
| Fine-tuning method | Supervised Fine-Tuning (SFT), causal LM loss on assistant turn only |
| Task | EN → TW subtitle translation, cue-level, 0–3 cues of context |
| Quantization | Q5_K_M (GGUF, via llama.cpp) |
| Best checkpoint | Step 22,000 · eval loss 1.4804 |
| Precision | BF16 during training |
| Context window | 256 tokens (subtitle cues are short) |
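"Causal LM loss on assistant turn only" usually means that prompt tokens are masked out of the loss. The sketch below illustrates that idea with plain Python lists. It is not the training code used for this model: the helper name and the toy token IDs are hypothetical, and the value -100 is assumed only because it is the default ignore index of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # assumption: default ignore_index of PyTorch's CrossEntropyLoss


def mask_prompt_labels(prompt_ids, response_ids):
    """Return (input_ids, labels) where only response tokens contribute to the loss."""
    input_ids = list(prompt_ids) + list(response_ids)
    # Prompt positions get IGNORE_INDEX so the loss skips them.
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


# Toy token IDs, purely illustrative (not real tokenizer output).
input_ids, labels = mask_prompt_labels([11, 12, 13], [21, 22])
print(input_ids)  # [11, 12, 13, 21, 22]
print(labels)     # [-100, -100, -100, 21, 22]
```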

Intended Use

This model is designed for one task only:

Given recent English subtitle cues as context and the current English subtitle cue, output only the Taiwan Traditional Chinese translation of the current cue.

In scope:

  • Real-time local subtitle overlay
  • Cue-level streaming inference
  • English → Taiwan Traditional Chinese (繁體中文・台灣用語)

Out of scope:

  • General-purpose translation
  • Long-document translation
  • Simplified Chinese output
  • General instruction following

Input / Output Format

CTX:
<0–3 previous English subtitle cues>

CUR:
<current English subtitle cue>

Output: Taiwan Traditional Chinese translation of CUR only.

Example:

CTX:
I sent you the file.
Check your inbox.

CUR:
You got it?

Expected output:

你收到了嗎?

Usage (llama.cpp)

# -e (--escape) makes llama-cli turn the \n sequences in the prompt into real newlines
./llama-cli \
  -m SubtitleEN2TW-0.6B-Q5_K_M.gguf \
  --temp 0.0 \
  -e \
  -p "CTX:\nI sent you the file.\nCheck your inbox.\n\nCUR:\nYou got it?" \
  -n 64

Usage (Python · llama-cpp-python)

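from llama_cpp import Llama

# n_ctx matches the 256-token context window the model was trained with
llm = Llama(model_path="SubtitleEN2TW-0.6B-Q5_K_M.gguf", n_ctx=256)

# 0-3 previous English cues, oldest first, followed by the cue to translate
ctx_cues = ["I sent you the file.", "Check your inbox."]
cur_cue  = "You got it?"

# Build the CTX/CUR prompt described under Input / Output Format
ctx_block = "\n".join(ctx_cues)
prompt = f"CTX:\n{ctx_block}\n\nCUR:\n{cur_cue}"

# Greedy decoding; stop at the first blank line so only the current cue is returned
out = llm(prompt, max_tokens=64, temperature=0.0, stop=["\n\n"])
print(out["choices"][0]["text"].strip())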
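For cue-level streaming, the context has to roll forward as new cues arrive. The following is a minimal sketch of one way to do that, not part of the released model or pipeline: `StreamingTranslator`, `build_prompt`, and the `generate` callable are hypothetical names introduced here, and `generate` is assumed to be any function that maps a prompt string to the model's text output.

```python
from collections import deque

MAX_CTX_CUES = 3  # the model expects 0-3 context cues


def build_prompt(ctx_cues, cur_cue):
    """Format context and current cue in the CTX/CUR layout."""
    ctx_block = "\n".join(ctx_cues)
    return f"CTX:\n{ctx_block}\n\nCUR:\n{cur_cue}"


class StreamingTranslator:
    """Keeps the most recent English cues and translates one cue at a time."""

    def __init__(self, generate, max_ctx=MAX_CTX_CUES):
        self.generate = generate              # hypothetical: prompt str -> output str
        self.history = deque(maxlen=max_ctx)  # oldest cues fall off automatically

    def translate(self, cue):
        prompt = build_prompt(list(self.history), cue)
        translation = self.generate(prompt).strip()
        self.history.append(cue)  # store the English source, not the translation
        return translation

    def reset(self):
        """Clear context, e.g. when a new video or episode starts."""
        self.history.clear()


# Possible wiring with a llama-cpp-python model object named `llm`:
# translator = StreamingTranslator(
#     lambda p: llm(p, max_tokens=64, temperature=0.0, stop=["\n\n"])["choices"][0]["text"]
# )
```

Calling `reset()` at a known episode boundary avoids feeding cues from the previous episode as context.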

Training Data

The SFT dataset was built from two public subtitle corpora. The actual dataset is not redistributed due to upstream licensing constraints. Pipeline details and the diagnostic test set are available at Aiden1020/SubtitleEN2TW-SFT-Pipeline.

| Source | Format | Language pair | Scale (before filtering) |
|---|---|---|---|
| OpenSubtitles v2024 (OPUS) | Moses parallel text | en-zh_TW | ~18.6 M sentence pairs |
| TVSub | Timestamped subtitle cues | en-zh | ~2.2 M cue pairs |

Filtering steps applied:

  1. Text cleaning: remove HTML/ASS/VTT tags, music symbols, encoding-corruption markers (PUA chars, kana, Cyrillic), subtitle-group watermarks
  2. Simplified Chinese rejection: drop samples where OpenCC s2t conversion changes more than 3% of the Chinese text (diff ratio > 3%)
  3. CTX-leakage filter: discard samples where the Chinese target is disproportionately longer than the English cue, or closely matches the previous Chinese cue
  4. English-echo filter: discard samples where the English source appears verbatim in the Chinese output
  5. Length validation: English 1–200 chars, Chinese 1–120 chars
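Steps 2 to 5 above can be sketched as simple predicates. This is an illustration under stated assumptions, not the pipeline's actual code: all function names are hypothetical, the leakage thresholds (`max_len_ratio`, `max_similarity`) are placeholder values not given in this card, and the diff ratio is assumed to be the share of characters changed by conversion. The converter is passed in as a callable so that a real one, for example `opencc.OpenCC("s2t").convert`, can be swapped for the toy mapping used here.

```python
import difflib


def valid_lengths(en, zh):
    """Step 5: English 1-200 chars, Chinese 1-120 chars."""
    return 1 <= len(en) <= 200 and 1 <= len(zh) <= 120


def s2t_diff_ratio(zh, s2t_convert):
    """Step 2: fraction of characters changed by Simplified -> Traditional conversion."""
    if not zh:
        return 0.0
    converted = s2t_convert(zh)
    changed = sum(1 for a, b in zip(zh, converted) if a != b)
    return changed / len(zh)


def looks_like_ctx_leak(en, zh, prev_zh, max_len_ratio=1.5, max_similarity=0.8):
    """Step 3: target far longer than source, or near-copy of the previous target.

    Both thresholds are hypothetical placeholders.
    """
    too_long = len(zh) > max_len_ratio * len(en)
    near_duplicate = bool(prev_zh) and (
        difflib.SequenceMatcher(None, zh, prev_zh).ratio() >= max_similarity
    )
    return too_long or near_duplicate


def echoes_english(en, zh):
    """Step 4: the English source appears verbatim in the Chinese target."""
    src = en.strip()
    return bool(src) and src in zh


def keep_pair(en, zh, prev_zh, s2t_convert):
    """Return True if the pair passes all four checks."""
    if not valid_lengths(en, zh):
        return False
    if s2t_diff_ratio(zh, s2t_convert) > 0.03:
        return False
    if looks_like_ctx_leak(en, zh, prev_zh):
        return False
    return not echoes_english(en, zh)


# Toy stand-in for a real Simplified -> Traditional converter (assumption).
_TOY_S2T = str.maketrans({"这": "這", "么": "麼"})


def toy_convert(text):
    return text.translate(_TOY_S2T)


print(keep_pair("You got it?", "你收到了嗎?", "檢查你的收件匣。", toy_convert))  # True
print(keep_pair("What is this?", "这是什么?", None, toy_convert))              # False
```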

Context size distribution in the training set: samples with 0, 1, 2, and 3 context cues make up 30%, 30%, 25%, and 15% of the data, respectively.
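One way to reproduce such a distribution when building samples is weighted random sampling. The sketch below assumes the four percentages map, in order, to 0, 1, 2, and 3 context cues. The function name and the capping rule are hypothetical and not taken from the pipeline.

```python
import random

# Assumed mapping: context size -> share of training samples.
CTX_SIZE_WEIGHTS = {0: 0.30, 1: 0.30, 2: 0.25, 3: 0.15}


def sample_ctx_size(rng, available):
    """Draw a context size, capped by how many previous cues actually exist."""
    sizes = list(CTX_SIZE_WEIGHTS)
    weights = [CTX_SIZE_WEIGHTS[s] for s in sizes]
    k = rng.choices(sizes, weights=weights, k=1)[0]
    return min(k, available)


rng = random.Random(0)  # fixed seed so runs are repeatable
draws = [sample_ctx_size(rng, available=3) for _ in range(10_000)]
for size in CTX_SIZE_WEIGHTS:
    print(size, draws.count(size) / len(draws))
```

Capping by `available` slightly shifts the distribution toward smaller contexts for the first few cues of each file.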

A small manually curated diagnostic set (~3% mix-in) was included to cover leakage probes, short responses, ambiguous phrases, and Taiwan terminology.

Limitations

  • Trained on subtitle-domain data only; may produce unnatural output for other domains.
  • The training data comes primarily from OpenSubtitles, which contains crowd-sourced subtitles, so quality varies.
  • Output targets Taiwan Traditional Chinese style but may occasionally produce Hong Kong or neutral Traditional Chinese forms.
  • Context crossing episode boundaries is treated as acceptable noise; no document-level segmentation is applied.
  • Not suitable for general chat or instruction following.

Citation

If you use this model, please also cite the base model:

@misc{luoyf2025lmt,
  title={NiuTrans.LMT: Toward Inclusive and Scalable Multilingual Machine Translation with LLMs},
  author={Yingfeng Luo and Ziqiang Xu and Yuxuan Ouyang and Murun Yang and Dingyang Lin and Kaiyan Chang and Tong Zheng and Bei Li and Peinan Feng and Quan Du and Tong Xiao and Jingbo Zhu},
  year={2025},
  eprint={2511.07003},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2511.07003}
}

License

This model is released under the Apache License 2.0.

See NOTICE for third-party attribution.

GGUF

Model size: 0.6B params
Architecture: qwen3