Translating to Traditional Chinese?

#2
by exoplanet - opened

Hi,
Great work. I'm wondering whether this model would work with input in Simplified Chinese to give output in Traditional Chinese. If not, are there any models you'd recommend for this step?
Cheers!

exoplanet changed discussion status to closed

Just found out about OpenCC. Would that be the mechanism to use if the model prefers Traditional Chinese as input?
Thx

exoplanet changed discussion status to open

this model would work with input in Simplified Chinese to give output in Traditional Chinese?

No. If the input is in Simplified Chinese, the output will typically also be in Simplified Chinese, unless the task explicitly involves translation from Simplified to Traditional Chinese (e.g., 幫我把 xxx 翻譯為繁體中文, "help me translate xxx into Traditional Chinese").

Just found out about OpenCC, would that be the mechanism to use, if the model prefers Traditional Chinese as input?

It depends. OpenCC is a solid choice for character-, word-, and phrase-level conversion between Simplified and Traditional Chinese. However, it doesn't handle rule-based or culturally contextual differences. For example, you can convert 请问民法第一条是什么? ("What is Article 1 of the Civil Code?") to 請問民法第一條是什麼? using OpenCC. But when this converted prompt is sent to an LLM, the model might interpret it in a Traditional Chinese legal context (e.g., Article 1 of Taiwan's Civil Code: 「民事,法律所未規定者,依習慣;無習慣者,依法理。」) rather than a Simplified Chinese one (e.g., Article 1 of the PRC Civil Code: 「为了保护民事主体的合法权益……制定本法。」).

Therefore, if your use case is not sensitive to cultural context, converting the input to Traditional Chinese may work. But in cases where context matters, the conversion may lead to unexpected responses. In those cases, an LLM optimized for Simplified Chinese, such as Qwen or GLM, might work better.
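For reference, here's a minimal sketch of that conversion step with the opencc Python package. The s2twp config (Taiwan-standard characters plus common phrase substitutions) is my pick here; plain s2t also works if you only need character-level conversion.

```python
# pip install opencc
from opencc import OpenCC

# "s2twp": Simplified -> Traditional (Taiwan standard, with phrase substitutions).
# Depending on the package version, the config may need to be written as "s2twp.json".
converter = OpenCC("s2twp")

prompt_simplified = "请问民法第一条是什么?"
prompt_traditional = converter.convert(prompt_simplified)

print(prompt_traditional)  # 請問民法第一條是什麼?
```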

Thanks for you detailed answer. In fact, I'd rather work with Traditional Chinese for both input and output, however I've not been able to find the following two resources as open-weights: 1) a model to translate from/to English 2) a model to classify toxic content. Do you have any pointers for these? Cheers!

I see, the choice of model really depends on the translation quality you're aiming for.

If you’re looking for fast and reasonably readable translations, you might want to try kyara-1.5-2b (from this repo) or kyara-2.5-9b. Despite the Chinese tag in the repo name, we maintain a balanced dataset between Chinese and English, as demonstrated in the benchmarks. If the content is culturally specific to Taiwan, you might consider culture-optimized models like Taide or Taiwan LLaMA.

If high-quality translation is your priority, then models like Qwen3, GLM, Mistral 3.1, or Gemma 3 could be better fits. Qwen and GLM are strong in Chinese overall, but they might introduce Simplified Chinese bias. On the other hand, Mistral and Gemma perform slightly weaker in Chinese, but their output tends to align more with Taiwan’s linguistic and cultural norms.

We're also currently working on a model called Loyang, which is designed for reasoning and writing tasks. It has shown strong performance, on par with or better than the Gemma 3 series, and might be released in the coming weeks.

As for toxic comment classification: all the models mentioned above can handle this to some degree. But if your task involves only simple binary labels (e.g., toxic vs. non-toxic), training a lightweight BERT classifier might be more efficient than using a full LLM.
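If you go the classifier route, a minimal fine-tuning sketch with Hugging Face transformers could look like the one below. The base model (bert-base-chinese) and the toy data are placeholders; any small Chinese encoder and any labeled toxic / non-toxic corpus would work.

```python
# pip install transformers datasets
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Toy data for illustration only: 1 = toxic, 0 = non-toxic.
data = Dataset.from_dict({
    "text": ["你這個白痴", "今天天氣真好"],
    "label": [1, 0],
})

model_name = "bert-base-chinese"  # placeholder; any small Chinese encoder works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

data = data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="toxicity-bert",
        num_train_epochs=3,
        per_device_train_batch_size=16,
    ),
    train_dataset=data,
)
trainer.train()
```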

The smaller model (2b) would be our preferred size, in q8 quantization, although I couldn't find the URL for its weights.

Later when we have more resources available, we would want to switch to your 9b model to increase quality, although vanilla Gemma3 / Qwen3 could also be the alternatives to test against as you mentioned. However, we do not have the language skills or the tooling to make these evaluations at the moment, so your advice is the true north for me.

Looking forward to the release of Loyang. Is it based on Gemma 3's 4B or the new Qwen 3? If it's the former, a QAT release would make it a viable option for us given our memory constraints.

That's also a good idea; any pointers to a toxicity dataset for Taiwanese Mandarin could get the ball rolling for a dedicated classifier.

The smaller model (2b) would be our preferred size, in q8 quantization, although I couldn't find the URL for its weights.

We haven’t released an official quantized version of the 2B model. However, Q8 quantization—either dynamic or offline—can be applied using inference engines such as vLLM. You can refer to the tutorial for more information.
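As a rough sketch of the dynamic option (assuming a GPU with FP8 support; replace the model path with the full repo id):

```python
# pip install vllm
from vllm import LLM, SamplingParams

# Quantize the weights to 8-bit (FP8) on the fly at load time;
# no pre-quantized checkpoint is needed.
llm = LLM(
    model="gemma-2-2b-it-chinese-kyara-dpo",  # replace with the full Hugging Face repo id
    quantization="fp8",
    max_model_len=4096,
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["請問台灣的民法第一條是什麼?"], params)
print(outputs[0].outputs[0].text)
```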

Later when we have more resources available, we would want to switch to your 9b model to increase quality, although vanilla Gemma3 / Qwen3 could also be the alternatives to test against as you mentioned. However, we do not have the language skills or the tooling to make these evaluations at the moment, so your advice is the true north for me. Looking forward to the release of Loyang, is it based on Gemma 3's 4B, or the new Qwen 3 ? If it's the former, a QAT release would make it a viable option for us given our memory constraints.

Unfortunately, it's based on Gemma-3-27B, and we currently don’t have the resources to train models at smaller scales.

That's also a good idea, any pointers to a toxicity dataset for Taiwanese Mandarin could get the ball running for a dedicated classifier.

A potential starting point might be the PTT Gossiping datasets. There are several open-source variants available on Hugging Face and GitHub that could serve as a useful baseline. You can also check the toxic comment classification dataset created by Jigsaw, which has a multilingual version with fine-grained labels.
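As a rough sketch of how such a corpus could be adapted, assuming a CSV export with Jigsaw-style columns (the file and column names below are placeholders), you could collapse the fine-grained labels into a binary target and convert the text to Traditional Chinese with OpenCC:

```python
# pip install pandas opencc
import pandas as pd
from opencc import OpenCC

converter = OpenCC("s2twp")  # Simplified -> Traditional (Taiwan standard)

# Placeholder file and column names, modeled on the Jigsaw toxic comment dataset.
df = pd.read_csv("toxic_comments.csv")
label_cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Collapse the fine-grained labels into a single binary toxic / non-toxic target.
df["label"] = (df[label_cols].sum(axis=1) > 0).astype(int)

# Convert any Simplified Chinese text to Traditional Chinese.
df["text"] = df["comment_text"].apply(converter.convert)

df[["text", "label"]].to_csv("toxic_comments_zh_tw.csv", index=False)
```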

Great information, thanks so much.
I suppose by kyara-1.5-2b you meant the gemma-2-2b-it-chinese-kyara-dpo repo, right? What's the maximum quantization you'd recommend on the 2b gemma model?
We have a use-case for your Gemma-3-27b based model as well, feel free to ping upon releasing it.
Using one or both of these datasets would be a step in the right direction for the toxicity classification need we have. Still hopeful something ready-made pops up somewhere.

I suppose by kyara-1.5-2b you meant the gemma-2-2b-it-chinese-kyara-dpo repo, right?

Yes, that's correct.

What's the maximum quantization you'd recommend on the 2b gemma model?

I haven’t benchmarked the quantized version of the 2B model yet, so you may want to evaluate it within your specific scenario.
However, if you're not running on edge devices or facing VRAM constraints, I think quantization may be unnecessary for a 2B model.

Sounds good. As I don't speak the language, I'll use your 9b model as a judge to assess whether a quantization performs well enough for prod. Looking forward to the 27b beast you're working on.

exoplanet changed discussion status to closed
