HY-MT 1.8B LoRA — Chinese→Vietnamese Xianxia/Cultivation Translation (Phase 7)

LoRA adapter (r=16, ~74 MB) on top of tencent/HY-MT1.5-1.8B, fine-tuned for Chinese→Vietnamese translation of xianxia / cultivation web novels. Specializes in:

Sino-Vietnamese (Hán-Việt) rendering of cultivation terms, proper nouns, sect/martial-art names.
Long narrative chunks (up to ~400-500 ZH chars per chunk) without truncation.
Faithful preservation of canonical entity names from a glossary-driven training pipeline.

Quick start

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_id = "tencent/HY-MT1.5-1.8B"
adapter_id = "DanVP/hy-mt-xianxia-lora-vi"

tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
base = AutoModelForCausalLM.from_pretrained(
    base_model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)
model = PeftModel.from_pretrained(base, adapter_id)
model = model.merge_and_unload()  # bake adapter for inference speed
model.eval()

zh = "他望着天边最后一抹残霞，仿佛看见了千年前那场惊天动地的一战……"
messages = [{
    "role": "user",
    "content": "将以下文本翻译为越南语，注意只需要输出翻译后的结果，不要额外解释：\n\n" + zh,
}]
inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    out = model.generate(
        inputs, max_new_tokens=512, do_sample=False, num_beams=1, repetition_penalty=1.05,
    )
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True).strip())
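
For chapters longer than one chunk, the source text has to be split before it is passed to the snippet above. The following is a minimal sketch of sentence-boundary chunking, not the author's pipeline: the helper names `split_sentences` and `chunk_zh` are hypothetical, and the 400-character default is an assumption taken from the "chunk-400" setting mentioned in the audit section.

```python
import re

# Sentence-final punctuation (plus any closing quotes/brackets), or a run of newlines.
_BOUNDARY = re.compile(r"([。！？…]+[”’」』）]*|\n+)")

def split_sentences(text: str) -> list[str]:
    """Split Chinese text into sentences, keeping every character of the input."""
    # With one capture group, re.split returns [text, sep, text, sep, ..., text].
    parts = _BOUNDARY.split(text)
    sentences = []
    for i in range(0, len(parts), 2):
        sep = parts[i + 1] if i + 1 < len(parts) else ""
        piece = parts[i] + sep
        if piece:
            sentences.append(piece)
    return sentences

def chunk_zh(text: str, max_chars: int = 400) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole as an oversized chunk
    rather than being cut mid-sentence.
    """
    chunks, current = [], ""
    for sentence in split_sentences(text):
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be assigned to `zh` in the quick-start code. Packing whole sentences tends to keep a short title line together with the sentences that follow it, which matters for the chapter-title failure mode listed under "Known failure modes".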

Intended use

Translation of long-form Chinese xianxia/cultivation web novel chapters into Vietnamese with Sino-Vietnamese (Hán-Việt) proper nouns and natural narrative flow.
Suitable as a backbone for human-in-the-loop translation workflows where a glossary post-fix step (drift normalization) sits downstream.

Out-of-scope

General-domain Chinese→Vietnamese translation (modern news, technical docs): the adapter is specialized for novel narrative and xianxia vocabulary.
Translation between other language pairs.
Non-narrative content (code, structured data, tables).

Training


| Setting | Value |
|---|---|
| Base model | `tencent/HY-MT1.5-1.8B` |
| Adapter type | LoRA |
| Rank `r` | 16 |
| Alpha | 32 |
| Dropout | 0.05 |
| Target modules | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |
| Trainable params | 19.4 M (1.07 % of full 1.81 B) |
| Precision | bf16 |
| Effective batch | 4 × 8 grad-accum = 32 |
| Epochs | 2 |
| Learning rate | 1e-4 (cosine schedule, 3 % warmup) |
| Weight decay | 0.01 |
| Sequence length | 1536 |
| Total updates | 4192 |
| Final eval loss | 1.764 |
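
The reported figures can be cross-checked with a few lines of arithmetic. This is only a consistency sketch: it assumes the last partial batch of each epoch counts as one optimizer update, which is one reading that reproduces the reported total. The trainer's actual rounding is not stated in this card.

```python
import math

num_pairs = 67_052        # training pairs reported under "Training data"
per_device_batch = 4
grad_accum = 8
epochs = 2

effective_batch = per_device_batch * grad_accum
# Assumption: a trailing partial batch still produces one update per epoch.
steps_per_epoch = math.ceil(num_pairs / effective_batch)
total_updates = steps_per_epoch * epochs

# Trainable share: 19.4 M LoRA parameters over 1.81 B total parameters.
trainable_fraction = 19.4e6 / 1.81e9

print(effective_batch, steps_per_epoch, total_updates)  # 32 2096 4192
print(f"{trainable_fraction:.2%}")                      # 1.07%
```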

Training data

67,052 ZH→VI chat-format pairs:

~46k base mixture: xianxia translation pairs, filtered to a Vietnamese/Chinese character ratio of at least 2.5 to remove truncated targets.
~18k booster: high-quality xianxia narrative chunks.
~5k topstyle: stylistic / dialogue-heavy chunks.
45 hard-negative pairs mined via Gemini 3.1 Flash Lite Preview from prior model failures (entity dropout + truncation chunks).
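
The length-ratio filter on the base mixture can be sketched as below. The function name `keep_pair` is hypothetical, and measuring length in Unicode code points after stripping whitespace is an assumption; the card only states the 2.5 threshold.

```python
def keep_pair(zh: str, vi: str, min_ratio: float = 2.5) -> bool:
    """Return True when the Vietnamese target is long enough relative to the source.

    A very short target for a given source usually indicates a truncated
    translation, so such pairs are dropped.
    """
    zh_len = len(zh.strip())
    if zh_len == 0:
        return False  # nothing to translate; discard the pair
    return len(vi.strip()) / zh_len >= min_ratio


pairs = [
    ("他走了。", "Hắn đã rời đi."),
    ("他望着天边最后一抹残霞。", "Hắn nhìn"),  # target cut off early
]
filtered = [(zh, vi) for zh, vi in pairs if keep_pair(zh, vi)]
```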

Hardware

1× NVIDIA A100 80GB (Vertex AI Custom Training, SPOT, us-central1)
Training duration: 4 h 38 m
Approx cost: ~US $8 spot

Evaluation

Blind eval set (200 ZH→VI pairs from held-out novels)

| Metric | This adapter | Prior baseline (Phase 5 Job A) |
|---|---|---|
| BLEU | 33.40 | 33.71 |
| chrF | 51.96 | 51.53 |
| Length ratio (pred/gold) | 0.999 | 0.979 |
| Han leak (ZH chars in VI output) | 0 | 0 |
| Empty outputs | 0 | 0 |

The adapter trades ~0.3 BLEU points for better character-level quality (chrF +0.43), better length match (0.999 vs 0.979), and dramatically lower entity-dropout / truncation rates (see below).
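
The non-BLEU/chrF rows of the table can be approximated with the standard library alone. This is a sketch under stated assumptions, not the evaluation script used for the card: "Han" is taken to mean the CJK Unified Ideographs basic block plus Extension A, Han leak is counted per output rather than per character, and the length ratio is computed at corpus level in characters. The function name `sanity_metrics` is hypothetical.

```python
import re

# Assumption: basic CJK Unified Ideographs + Extension A count as "Han" characters.
_HAN = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff]")

def sanity_metrics(preds: list[str], golds: list[str]) -> dict:
    """Count outputs containing Han characters, empty outputs, and the length ratio."""
    han_leak = sum(1 for p in preds if _HAN.search(p))
    empty = sum(1 for p in preds if not p.strip())
    pred_len = sum(len(p) for p in preds)
    gold_len = sum(len(g) for g in golds)
    ratio = pred_len / gold_len if gold_len else float("nan")
    return {"han_leak": han_leak, "empty": empty, "length_ratio": ratio}
```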

Entity dropout audit (sonhai + meiqian, chunk-400 inference)

Results after the expanded glossary repair and the dynamic-cap fix (max_new_tokens set to 3× the input token count, the default in local_inference.py):

| Audit | This adapter | Prior baseline (Phase 5 Job A) |
|---|---|---|
| Sonhai overall dropout | 1.8 % | ~21 % |
| Sonhai key-entity dropout (申尤昆 / 祁自如 / 山海提灯) | 0 % each | 19 % / high / high |
| Meiqian overall dropout | 1.9 % | (no comparable baseline) |
| Meiqian key entities (周天翊 / 昆墟 / 白真真) | 0 % each | — |
| Sonhai truncation (any kind) | 3.3 % (2 chunks: id 0 chapter-title fragment + id 20 EOS edge) | 13.1 % |
| Meiqian truncation (any kind: cap, ratio<1.5, early-stop) | 0 % | 12.1 % |
| Han-leak (Chinese chars in Vietnamese output) | 0 | 0 |

The vi_drift_fix glossaries (glossary_sonhai_repair.json and glossary_meiqian_repair.json) are required to reach these numbers. The model itself produces predictable variants (e.g. Kỳ Tự Du / Kỳ Tự Nhất for 祁自如), which a string-replacement layer normalizes to the canonical form.
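
The card does not spell out how entity dropout is computed, so the sketch below uses one plausible definition: for every chunk whose Chinese source contains a glossary entity, the entity counts as dropped when its canonical Vietnamese form is missing from the (repaired) output. The function name, the per-chunk granularity, and the `{zh_name: canonical_vi}` glossary shape are all assumptions.

```python
import unicodedata

def _norm(s: str) -> str:
    # NFC + lowercase so composed/decomposed diacritics and case differences
    # do not register as false misses.
    return unicodedata.normalize("NFC", s).lower()

def entity_dropout_rate(chunks, glossary) -> float:
    """chunks: iterable of (zh_source, vi_output); glossary: {zh_name: canonical_vi}."""
    expected = dropped = 0
    for zh_source, vi_output in chunks:
        vi_norm = _norm(vi_output)
        for zh_name, vi_name in glossary.items():
            if zh_name in zh_source:
                expected += 1
                if _norm(vi_name) not in vi_norm:
                    dropped += 1
    return dropped / expected if expected else 0.0


glossary = {"山海提灯": "Sơn Hải Đề Đăng", "请神": "Thỉnh Thần"}
chunks = [
    ("《山海提灯》第一章", "Sơn Hải Đề Đăng, chương một"),
    ("他开始请神。", "Hắn bắt đầu cầu thần."),  # canonical form missing
]
rate = entity_dropout_rate(chunks, glossary)
```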

Inference performance


| Setting | Value |
|---|---|
| Mode | bf16 (merged adapter) |
| Test GPU | RTX 5070 Ti Mobile 12 GB |
| Speed | ~27–29 tokens/s |
| Peak VRAM | ~4 GB |
| Recommended `max_new_tokens` | 512–3072 (dynamic by input length) |
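
A length-dependent generation budget could look like the sketch below. It combines the 3× input-token rule mentioned in the audit section with the recommended 512–3072 range. The function name is hypothetical, and the floor and ceiling are assumptions taken from that range; the clamping that local_inference.py actually applies is not shown in this card.

```python
def dynamic_max_new_tokens(
    n_input_tokens: int,
    factor: int = 3,
    floor: int = 512,
    ceiling: int = 3072,
) -> int:
    """Scale the output budget with input length, clamped to [floor, ceiling]."""
    return max(floor, min(ceiling, factor * n_input_tokens))
```

With the quick-start variables, this would be passed as `max_new_tokens=dynamic_max_new_tokens(inputs.shape[-1])`. Note that `inputs` there also includes the chat-template tokens, so the budget is slightly more generous than 3× the raw text length.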

Quantization caveat

bitsandbytes int8/int4 inference shows a severe entity-dropout regression: int4 reaches ≈ 53 % dropout, and int8 runs ≈ 4× slower with worse quality on Blackwell laptop GPUs because of MatMul8bitLt fp16 cast issues. Only bf16 is recommended for production.

Limitations

Production-candidate, not yet final: this adapter passes 4 of 5 strict gates from the Phase 7 plan; sonhai truncation sits at 3.3 % (limit 3.0 %) due to two specific chunks. See "Known failure modes" below.
The adapter expects the HY-MT chat template; using a non-HY-MT base model requires retemplating the user content.
The downstream vi_drift_fix glossary repair is required for the entity-dropout numbers above. Without it, ~6-7 % of entity mentions show recognizable drift variants.

Known failure modes

Chapter-title fragments (e.g. 12-char "《山海提灯》\n第一章春" alone): model transliterates literally instead of using canonical Sino-Vietnamese ("Mai Sơn Đỉnh Đèn" instead of "Sơn Hải Đề Đăng"). Workaround: feed at least one full sentence of context.
A closing quotation mark (”) in the middle of a paragraph can cause the model to emit EOS early (sonhai chunk 20: 90 output tokens of a 462-token budget; the problem persists with a 2048-token budget). A fix is planned for a focused Phase 7.1 retrain.
Concept nouns that look like proper nouns (e.g. 请神 "summoning a god"): model sometimes outputs cầu thần or Tà Thần instead of canonical Thỉnh Thần. Glossary vi_drift_fix covers the common variants.
Modern proper nouns (e.g. 嵩阳高中 = a high school) occasionally drop or substitute one morpheme (e.g. trường Đào Dương instead of Trường cấp ba Tung Dương). Glossary covers the observed variants.

How to apply with downstream glossary repair

The training pipeline assumes a downstream "drift repair" step that normalizes minor variant spellings of canonical entity names against a glossary (e.g. "Thân Du Côn" → "Thân Vưu Côn"). After generation, run a string-replacement pass with a per-novel glossary; this pushes effective entity hit rate from ~89-92 % up to ~96 %+ on tested corpora.
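
A minimal sketch of such a string-replacement pass follows. It assumes the per-novel glossary is a flat `{variant: canonical}` mapping; the actual schema of the glossary JSON files is not documented here, and `build_repairer` is a hypothetical name.

```python
import re
import unicodedata

def build_repairer(variant_to_canonical: dict[str, str]):
    """Return a function that rewrites known drift variants to canonical names."""
    # Normalize to NFC so composed and decomposed diacritics compare equal.
    table = {
        unicodedata.normalize("NFC", variant): unicodedata.normalize("NFC", canonical)
        for variant, canonical in variant_to_canonical.items()
    }
    if not table:
        return lambda text: unicodedata.normalize("NFC", text)
    # Longest variants first, so a long variant wins over a shorter one it starts with.
    pattern = re.compile(
        "|".join(re.escape(v) for v in sorted(table, key=len, reverse=True))
    )

    def repair(text: str) -> str:
        text = unicodedata.normalize("NFC", text)
        # Single pass: replaced text is never re-scanned, so rules cannot chain.
        return pattern.sub(lambda m: table[m.group(0)], text)

    return repair


repair = build_repairer({"Thân Du Côn": "Thân Vưu Côn", "Tà Thần": "Thỉnh Thần"})
fixed = repair("Thân Du Côn bắt đầu Tà Thần.")
```

Doing all replacements in one regex pass, rather than chaining `str.replace` calls, keeps one rule's output from being rewritten by another rule.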

Citation

If you use this adapter, please cite the base model:

@misc{tencent2025hymt,
  author       = {Tencent},
  title        = {HY-MT1.5-1.8B},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/tencent/HY-MT1.5-1.8B}},
}

License

This adapter is a Model Derivative of tencent/HY-MT1.5-1.8B and is therefore distributed under the Tencent HY Community License Agreement (the same license that applies to the base model). See the full license text.

Notable restrictions from the base license that flow through to this adapter:

Geographic restriction: the license does not apply in the European Union, United Kingdom, and South Korea. Users in those territories cannot use this adapter under this license.
MAU > 100 M: services with more than 100 million monthly active users (as of the base model release date 2025-12-30) need a separate license from Tencent.
Encouraged: marking products built on this adapter as "Powered by Tencent HY".
The base model's Acceptable Use Policy (Exhibit A) applies.

Tencent retains all rights to the "Tencent HY" trademark.

Framework versions

PEFT: 0.15.2
transformers (training): 4.x
torch: 2.11+cu128 (inference tested)
