---
tags:
- traditional chinese
- zh-tw
- zh-hant
- taiwan
widget:
- text: |-
    <|system|>
    對於輸入內容的中文文字，請將中國用語轉成台灣的用語，其他非中文文字或非中國用語都維持不變。

    範例：
    Input: ```這個視頻的質量真高啊```
    Output: ```這個影片的品質真高啊```</s>
    <|user|>
    Input: ```這個軟件的質量真高啊```</s>
    <|assistant|>
    Output: 
- text: |-
    <|system|>
    對於輸入內容的中文文字，請將中國用語轉成台灣的用語，其他非中文文字或非中國用語都維持不變。

    範例：
    Input: ```這個視頻的質量真高啊```
    Output: ```這個影片的品質真高啊```</s>
    <|user|>
    Input: ```我們建立了數據庫，用來儲存和管理線上服務的信息```</s>
    <|assistant|>
    Output: 
license: agpl-3.0
datasets:
- MBZUAI/Bactrian-X
language:
- zh
---

# Taiwan Words Translator 繁體中文台灣化翻譯器 by LLMs

Source code: https://github.com/SuJiaKuan/llm_tw_word

The model rewrites Chinese text so that mainland China vocabulary is replaced with the equivalent Taiwan vocabulary; non-Chinese text and words that are not China-specific are left unchanged. Example:
- Input: `這個軟件的質量真高啊`
- Output: `這個軟體的品質真高啊`

#### This Model

This model is fine-tuned from [TinyLlama/TinyLlama-1.1B-Chat-v1.0](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0) using instruction fine-tuning. The training data is drawn from [MBZUAI/Bactrian-X](https://huggingface.co/datasets/MBZUAI/Bactrian-X) and automatically labeled with [繁化姬](https://zhconvert.org).

#### How to use
Follow the example below, or see [translate.py](https://github.com/SuJiaKuan/llm_tw_word/blob/main/llm_tw_word/translate.py) for how to wrap the model in a Python class.

```python
import torch
from transformers import pipeline

SYSTEM_PROMPT = """\
對於輸入內容的中文文字，請將中國用語轉成台灣的用語，其他非中文文字或非中國用語都維持不變。

範例：
Input: ```這個視頻的質量真高啊```
Output: ```這個影片的品質真高啊```\
"""

text_trad = "這個軟件的質量真高啊"

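# Use a distinct name so the imported `pipeline` factory is not shadowed.
pipe = pipeline(
    "text-generation",
    model="feabries/TaiwanWordTranslator-v0.1",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Wrap the input in triple backticks, matching the example in the system prompt.
prompt = "Input: ```{}```".format(text_trad)
messages = [{
    "role": "system",
    "content": SYSTEM_PROMPT,
}, {
    "role": "user",
    "content": prompt,
}]
# Render the chat messages into the model's prompt format as a plain string.
input_text = pipe.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
# Greedy decoding keeps the translation deterministic.
outputs = pipe(
    input_text,
    do_sample=False,
    max_new_tokens=2048,
)
# The generated text contains the full prompt followed by the model's reply.
print(outputs[0]["generated_text"])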
# <|system|>
# 對於輸入內容的中文文字，請將中國用語轉成台灣的用語，其他非中文文字或非中國用語都維持不變。
# 
# 範例：
# Input: ```這個視頻的質量真高啊```
# Output: ```這個影片的品質真高啊```</s>
# <|user|>
# Input: ```這個軟件的質量真高啊```</s>
# <|assistant|>
# Output: ```這個軟體的品質真高啊```
```
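
The generated text above includes the whole prompt, so the translated sentence still has to be pulled out of it. Below is a minimal sketch of one way to do that, assuming the reply follows the `Output:` format shown in the example (the translation wrapped in triple backticks after the last `<|assistant|>` tag). The helper name `extract_translation` and the constants are hypothetical and not part of the repository.

```python
import re

# Assumed markers, taken from the example output above.
ASSISTANT_TAG = "<|assistant|>"
FENCE = "`" * 3  # the triple-backtick delimiter used around Input/Output text


def extract_translation(generated_text: str) -> str:
    """Return the translated sentence from the pipeline's full generated text.

    Hypothetical helper: it assumes the model's reply comes after the last
    assistant tag and wraps the translation in triple backticks.
    """
    # Keep only what follows the last assistant tag, i.e. the model's reply.
    reply = generated_text.rsplit(ASSISTANT_TAG, 1)[-1]
    # Look for the first fenced span inside the reply.
    pattern = re.escape(FENCE) + r"(.*?)" + re.escape(FENCE)
    match = re.search(pattern, reply, flags=re.DOTALL)
    if match:
        return match.group(1).strip()
    # Fallback when the closing fence is missing: drop the label and stray backticks.
    return reply.replace("Output:", "", 1).strip().strip("`").strip()
```

With the pipeline example above, `extract_translation(outputs[0]["generated_text"])` would return only the translated sentence, provided the model keeps to the expected format.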