cantonese-qwen3-8b-instruct

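A multi-task instruct large language model built around a purely Cantonese phonological manifold (language manifold). The model is derived from cantonese-qwen3-8b-base through large-scale supervised fine-tuning (SFT) and focuses on a romanized Cantonese writing system: Liujgoj (溜歌粵語).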

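By aligning phonemes with semantics, the model is meant to free Cantonese text from the constraints that traditional Chinese characters impose, and to provide a text-semantic foundation for speech-native AI.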

🚀 Model Highlights

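Phonology-first: built on the Liujgoj romanized orthography (a tone-as-letter scheme using the tone letters j, r, x, q, h), which bypasses the ideographic constraints of Chinese characters with the goal of more efficient semantic vector modeling.
Curated training data: dialogue from more than 60 classic Hong Kong films and a high-frequency Cantonese lexicon (about 13,000 words), curated into more than 140,000 lines of multi-task dialogues and instruction pairs for full-parameter / large-window fine-tuning.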
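
To make the tone-as-letter idea concrete, here is a minimal sketch that separates a trailing tone letter from a romanized syllable. It assumes the tone letter is simply the last character of a syllable, which matches the monosyllabic words in this card's example prompt but is not confirmed by the card; the helper names `split_tone` and `tokenize` are hypothetical, and which tone each letter denotes is not specified here.

```python
import re
from typing import List, Optional, Tuple

# Tone letters named in this card; their tone values are not specified here.
TONE_LETTERS = "jrxqh"


def split_tone(syllable: str) -> Tuple[str, Optional[str]]:
    """Split one syllable into (base, tone_letter).

    Assumption: the tone letter, if present, is the final character.
    Returns None as the tone letter when the syllable carries none.
    """
    s = syllable.lower()
    if len(s) > 1 and s[-1] in TONE_LETTERS:
        return s[:-1], s[-1]
    return s, None


def tokenize(text: str) -> List[str]:
    """Extract lowercase alphabetic words, dropping punctuation."""
    return re.findall(r"[a-z]+", text.lower())


# Monosyllabic words taken from the example prompt in this card.
words = tokenize("Neiq hour, zouh dij mej?")
pairs = [split_tone(w) for w in words]
print(pairs)
```

This naive splitter only handles one syllable per word; a polysyllabic word such as "horyiq" would need real syllabification rules from the Liujgoj specification.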

🛠️ Quick Deployment and Usage

1. Download with the latest official Hugging Face `hf` CLI

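Because the model is served through Hugging Face's latest high-performance transfer backend, downloading with the latest `hf` tool is recommended for faster transfers: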

```bash
export HF_XET_HIGH_PERFORMANCE=1
hf download Yvthyvq/cantonese-qwen3-8b-instruct --local-dir ./cantonese-qwen3-8b-instruct
```
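
The same download can also be scripted from Python. The sketch below uses `snapshot_download` from the `huggingface_hub` package; the wrapper name `download_model` is hypothetical, and setting the environment flag before the import assumes the library reads it at import time.

```python
import os

REPO_ID = "Yvthyvq/cantonese-qwen3-8b-instruct"
LOCAL_DIR = "./cantonese-qwen3-8b-instruct"


def download_model(repo_id: str = REPO_ID, local_dir: str = LOCAL_DIR) -> str:
    """Download every file in the repo into local_dir and return the local path."""
    # Assumption: the flag must be set before huggingface_hub is first imported.
    os.environ.setdefault("HF_XET_HIGH_PERFORMANCE", "1")
    # Imported lazily so the flag above is already in place.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)
```

Calling `download_model()` fetches the full BF16 checkpoint, so make sure enough disk space is available. The returned path can be passed to `from_pretrained` in place of the repo id.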

---

## 🚀 Quick Start

### 1. Installation and environment variables
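```bash
# Enable the high-performance transfer mode for downloads
export HF_XET_HIGH_PERFORMANCE=1
# Optional: route Hub traffic through the hf-mirror.com mirror endpoint
export HF_ENDPOINT=https://hf-mirror.com
```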

### 2. Python inference (loading with Transformers)
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Yvthyvq/cantonese-qwen3-8b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Prompts work best in Liujgoj Cantonese spelling or idiomatic colloquial Cantonese
messages = [
    {"role": "user", "content": "Neiq hour, neiq horyiq zouh dij mej?"}  # example prompt
]
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([prompt], return_tensors="pt").to(model.device)

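# do_sample=True makes the temperature setting take effect
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the echoed prompt
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```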

Model size: 8B params (Safetensors, tensor type BF16)

Model tree for Yvthyvq/cantonese-qwen3-8b-instruct: base model Qwen/Qwen3-8B-Base, fine-tuned as Yvthyvq/cantonese-qwen3-8b-base, which was in turn fine-tuned into this model. Quantizations: 1 model.
