bart-translation-zh-yue

Simplified Chinese to Cantonese translation model, fine-tuned from indiejoseph/bart-base-cantonese on an LLM-generated dataset.

It achieves the following results on the evaluation set:

Loss: 0.5042
Bleu: 36.3458
Gen Len: 19.8785

Model description

The base model, indiejoseph/bart-base-cantonese, was further pre-trained from fnlp/bart-base-chinese, so it inherits that model's whitespace tokenizer issue: the outputs contain a space delimiter between every individual Chinese character. To address this problem, I have created a translation pipeline that uses SequenceBiasLogitsProcessor from the transformers library to mitigate inconsistent Simplified Chinese output.
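As an illustration of the whitespace issue, here is a minimal sketch of a post-processing step that removes the spaces left between Chinese characters. The function name strip_cjk_spaces and the chosen Unicode ranges are assumptions made for this example; the actual translation_pipeline implementation may handle this differently.

```python
import re

# Assumed character ranges: CJK Unified Ideographs (plus Extension A),
# CJK symbols and punctuation, and fullwidth forms.
_CJK = r"\u3400-\u4dbf\u4e00-\u9fff\u3000-\u303f\uff00-\uffef"

# Match whitespace that sits directly before or directly after a CJK character.
_CJK_SPACE = re.compile(rf"\s+(?=[{_CJK}])|(?<=[{_CJK}])\s+")


def strip_cjk_spaces(text: str) -> str:
    """Remove whitespace adjacent to CJK characters, keeping spaces between Latin words."""
    return _CJK_SPACE.sub("", text).strip()


print(strip_cjk_spaces("近 年 成 為 New York 嘅 城 市"))  # 近年成為New York嘅城市
```

Spaces inside Latin-script spans such as "New York" are preserved because neither neighbour of that space is a CJK character.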

Usage

from translation_pipeline import TranslationPipeline

pipe = TranslationPipeline(device=0)

print(pipe('近年成为许多港人热门移居地的英国中部城巿诺定咸（又译诺丁汉，Nottingham），多年来一直面对财政困境，市议会周三（11月29日）宣布破产，是继英国第二大城市伯明翰今年9月宣布破产后，近期「爆煲」的另一个英国主要城市。诺定咸除了维持法例规定必须提供的服务外，巿政府将暂停所有非必要的公共开支。', max_length=300))

# Output: 近年成為好多港人熱門移居地嘅英國中部城巿諾定咸（又譯諾丁漢，Nottingham），多年來一直面對財政困境，市議會喺11月29號宣佈破產，係繼英國第二大城市伯明翰今年9月宣布破產後，近期「爆煲」嘅另一個英國主要城市。諾定鹹除咗維持法例規定必須提供嘅服務之外，巿政府將暫停所有非必要嘅公共開支。
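The translation_pipeline module imported above is not reproduced on this page. As a rough sketch of how SequenceBiasLogitsProcessor could be configured to discourage unwanted characters, the snippet below builds the dictionary of token-id tuples to bias values that the sequence_bias generation argument accepts. The helper build_sequence_bias, the toy vocabulary, the character list, and the bias value of -10.0 are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, Tuple


def build_sequence_bias(
    vocab: Dict[str, int],
    discouraged_chars: Iterable[str],
    bias: float = -10.0,
) -> Dict[Tuple[int, ...], float]:
    """Map each discouraged character found in the vocabulary to a negative bias."""
    return {(vocab[ch],): bias for ch in discouraged_chars if ch in vocab}


# Toy vocabulary for illustration; a real one would come from tokenizer.get_vocab().
toy_vocab = {"为": 101, "為": 102, "国": 103, "國": 104}

# "们" is not in the toy vocabulary, so it is skipped.
sequence_bias = build_sequence_bias(toy_vocab, ["为", "国", "们"])
print(sequence_bias)  # {(101,): -10.0, (103,): -10.0}

# With a loaded model and tokenizer, the dictionary could then be passed as
# model.generate(**inputs, sequence_bias=sequence_bias), which applies
# SequenceBiasLogitsProcessor during decoding.
```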

Intended uses

Chinese to Cantonese Translation: The model can be used to translate written Chinese text into Cantonese, enabling communication and understanding across different linguistic backgrounds.
Language Learning: The model can assist language learners in understanding and translating Cantonese Chinese texts, aiding in the acquisition of Cantonese language skills.

Limitations

Domain Specificity: The model's performance may vary when translating texts that contain domain-specific or technical terminology. It is trained on general language data and may struggle with specialized vocabulary.
Accuracy and Fluency: While the model strives to provide accurate and fluent translations, it may occasionally produce errors or less natural-sounding output. Post-editing or human review may be necessary for critical or high-stakes translations.
Cultural Nuances: Translations generated by the model might not capture the full range of cultural nuances and contextual meanings present in the original text. Human interpretation and cultural understanding are essential for accurate translations in sensitive or culturally specific contexts.
Potential for Harmful or Hate Speech: The training dataset was generated by large language models (LLMs) and may inadvertently include instances of harmful or hate speech. While efforts have been made to filter and mitigate such content, the model's output may still occasionally contain offensive or inappropriate language. Exercise caution and apply appropriate content moderation when using the model, so that the generated translations align with ethical standards and community guidelines.

Training and evaluation data

The training and evaluation datasets were generated by ChatGPT and Palm2.

Over 4,000 Chinese and Cantonese phrase pairs, gathered from various websites and dictionaries, served as the foundation for generating seed sentences in Chinese with ChatGPT. The Palm2 API was then used to translate all Chinese sentences into Cantonese, followed by manual correction of typos and edits to improve fluency and linguistic variety.

Each collected phrase pair was used to generate ten unique sentences, giving a dataset of approximately 40,000 sentences for training the translation model.

The evaluation dataset was built with the same methodology as the training dataset, so that it reflects comparable quality, diversity, and linguistic nuance.

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-05
train_batch_size: 32
eval_batch_size: 32
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
num_epochs: 4.0
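To make the linear scheduler concrete, here is a minimal sketch of how the learning rate would decay over training. It assumes zero warmup steps (warmup is not stated in the list above) and takes the total step count of 14084 from the final row of the training results; the function name linear_lr is made up for this example.

```python
def linear_lr(
    step: int,
    base_lr: float = 5e-05,
    total_steps: int = 14084,
    warmup_steps: int = 0,  # assumption: no warmup is listed in the hyperparameters
) -> float:
    """Learning rate at a given optimizer step under a linear warmup/decay schedule."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Linear decay from base_lr down to 0 at total_steps.
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)


# Learning rate at the end of each epoch (3521 steps per epoch).
for epoch in range(5):
    print(epoch, linear_lr(epoch * 3521))
```

Under these assumptions the learning rate is halved by the end of epoch 2 and reaches zero at the final step.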

Training results

Training Loss	Epoch	Step	Validation Loss	Bleu	Gen Len
0.135	1.0	3521	0.4865	35.3577	19.8859
0.0983	2.0	7042	0.4813	36.0938	19.8796
0.072	3.0	10563	0.4847	36.193	19.8817
0.0552	4.0	14084	0.5042	36.3458	19.8785
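Which row counts as the best checkpoint depends on the selection metric. The short sketch below re-enters the table values by hand and compares selection by validation loss with selection by BLEU; the variable names are illustrative only.

```python
# (epoch, step, validation_loss, bleu), copied from the table above
rows = [
    (1, 3521, 0.4865, 35.3577),
    (2, 7042, 0.4813, 36.0938),
    (3, 10563, 0.4847, 36.1930),
    (4, 14084, 0.5042, 36.3458),
]

best_by_loss = min(rows, key=lambda r: r[2])  # lowest validation loss
best_by_bleu = max(rows, key=lambda r: r[3])  # highest BLEU

print(best_by_loss[0])  # 2
print(best_by_bleu[0])  # 4
```

The results reported at the top of this card match the epoch 4 row, which has the highest BLEU, although validation loss is lowest at epoch 2.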

Framework versions

Transformers 4.35.0.dev0
Pytorch 2.1.1+cu121
Datasets 2.14.6
Tokenizers 0.14.1
