zhtw-en

This model translates Traditional Chinese sentences into English, with a focus on understanding the Traditional Chinese used in Taiwan and producing more accurate English translations.

This model is a fine-tuned version of Helsinki-NLP/opus-mt-zh-en on the zetavg/coct-en-zh-tw-translations-twp-300k dataset.

It achieves the following results on the evaluation set (see the perplexity note below):

  • Loss: 2.4350
  • Num Input Tokens Seen: 55,653,732
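
Assuming the reported loss is the mean token-level cross-entropy used when fine-tuning MarianMT models (an assumption; the card does not state the loss type), it corresponds to a validation perplexity of roughly exp(2.4350) ≈ 11.4:

import math

eval_loss = 2.4350                 # evaluation loss reported above
perplexity = math.exp(eval_loss)   # ≈ 11.4, assuming token-level cross-entropy
print(round(perplexity, 1))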

Intended Uses & Limitations

Intended Use Cases

  • Translating single sentences from Chinese to English.
  • Applications requiring understanding of the Chinese language as spoken in Taiwan.

Limitations

  • Designed for single-sentence translation, so it will not perform well on longer texts unless they are pre-processed into sentences (see the sketch after this list).
  • May hallucinate or omit information, especially on very short or very long inputs.
  • Further fine-tuning may mitigate these issues.
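
Because the model targets single sentences, longer passages should be segmented before translation. Below is a minimal sketch of such pre-processing; the punctuation-based splitter and the translate_passage helper are illustrative, not part of the released model:

import re
from transformers import pipeline

translator = pipeline("translation", model="agentlans/zhtw-en")

def translate_passage(text):
    # Naive segmentation on Chinese sentence-final punctuation;
    # swap in a proper sentence segmenter for production use.
    sentences = [s for s in re.split(r"(?<=[。！？])", text) if s.strip()]
    results = translator(sentences)
    return " ".join(r["translation_text"] for r in results)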

Training and Evaluation Data

This model was trained and evaluated on the Corpus of Contemporary Taiwanese Mandarin (COCT) translations dataset.

  • Training Data: 80% of the COCT dataset
  • Validation Data: 20% of the COCT dataset (one way to reproduce the split is sketched below)
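
The exact split procedure is not documented here; the sketch below shows one way to reproduce an 80/20 split with the datasets library (the split name and seed are assumptions):

from datasets import load_dataset

# Parallel corpus used for fine-tuning (assuming a single "train" split).
dataset = load_dataset("zetavg/coct-en-zh-tw-translations-twp-300k", split="train")

# 80% training / 20% validation; the seed is illustrative.
splits = dataset.train_test_split(test_size=0.2, seed=42)
train_data, eval_data = splits["train"], splits["test"]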

Example

from transformers import pipeline

model_checkpoint = "agentlans/zhtw-en"
translator = pipeline("translation", model=model_checkpoint)

# From Chinese Wikipedia's article of the day
translator("《阿奇大戰鐵血戰士》是2015年4至7月黑馬漫畫和阿奇漫畫在美國發行的四期限量連環漫畫圖書,由亞歷克斯·德坎皮創作,費爾南多·魯伊斯繪圖,屬跨公司跨界作品。")[0]['translation_text']

# Output
# Acer's Iron Blood Fighter is a four-year series of comic books published in the United States by Black Horse and Ah Chi comics from April to July of that year. The book was created by Alexander d'Campie and painted by Philnanto Ruiz. It is a cross-firm work.

# Compare with my own gold-standard translation:
# "Archie vs. Predator" is a limited four-issue comic book series published by Black Horse and Archie Comics from April to July 2015. It was created by Alex de Campi and drawn by Fernando Ruiz. It's a crossover work.

Training Procedure

Training Hyperparameters

The following hyperparameters were used during training (a Seq2SeqTrainingArguments sketch follows the list):

  • Learning Rate: 5e-05
  • Train Batch Size: 8
  • Eval Batch Size: 8
  • Seed: 42
  • Optimizer: adamw_torch with betas=(0.9,0.999) and epsilon=1e-08
  • LR Scheduler Type: linear
  • Number of Epochs: 3.0
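
A sketch of how the listed hyperparameters map onto Seq2SeqTrainingArguments from transformers; anything not listed above (such as output_dir or evaluation settings) is an assumption:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="zhtw-en",            # output path is an assumption
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    num_train_epochs=3.0,
)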

Training Results

Training and validation losses recorded during fine-tuning
(columns: Training Loss, Epoch, Step, Validation Loss, Input Tokens Seen)
3.2254 0.0804 2500 2.9105 1493088
3.0946 0.1608 5000 2.8305 2990968
3.0473 0.2412 7500 2.7737 4477792
2.9633 0.3216 10000 2.7307 5967560
2.9355 0.4020 12500 2.6843 7463192
2.9076 0.4824 15000 2.6587 8950264
2.8714 0.5628 17500 2.6304 10443344
2.8716 0.6433 20000 2.6025 11951096
2.7989 0.7237 22500 2.5822 13432464
2.7941 0.8041 25000 2.5630 14919424
2.7692 0.8845 27500 2.5497 16415080
2.757 0.9649 30000 2.5388 17897832
2.7024 1.0453 32500 2.6006 19384812
2.7248 1.1257 35000 2.6042 20876844
2.6764 1.2061 37500 2.5923 22372340
2.6854 1.2865 40000 2.5793 23866100
2.683 1.3669 42500 2.5722 25348084
2.6871 1.4473 45000 2.5538 26854100
2.6551 1.5277 47500 2.5443 28332612
2.661 1.6081 50000 2.5278 29822156
2.6497 1.6885 52500 2.5266 31319476
2.6281 1.7689 55000 2.5116 32813220
2.6067 1.8494 57500 2.5047 34298052
2.6112 1.9298 60000 2.4935 35783604
2.5207 2.0102 62500 2.4946 37281092
2.4799 2.0906 65000 2.4916 38768588
2.4727 2.1710 67500 2.4866 40252972
2.4719 2.2514 70000 2.4760 41746300
2.4738 2.3318 72500 2.4713 43241188
2.4629 2.4122 75000 2.4630 44730244
2.4524 2.4926 77500 2.4575 46231060
2.435 2.5730 80000 2.4553 47718964
2.4621 2.6534 82500 2.4475 49209724
2.4492 2.7338 85000 2.4440 50712980
2.4536 2.8142 87500 2.4394 52204380
2.4148 2.8946 90000 2.4360 53695620
2.4243 2.9750 92500 2.4350 55190020

Framework Versions

  • Transformers 4.48.1
  • Pytorch 2.3.0+cu121
  • Datasets 3.2.0
  • Tokenizers 0.21.0