hon9kon9ize/CantoneseLLMChat-v0.5

indiejoseph committed 9 days ago

Commit 6a66a97 • 1 parent: 5c00a83

Update README.md

Files changed (1)

README.md +2 -5

@@ -4,15 +4,12 @@ language:
 - yue
 ---
-**This is a preview version, and this repository will be deleted once the new version is released. We are currently in the process of finding the balance between overfitting and generalization in DPO training. For more details about the problems we encountered in this version, please refer to the Limitation section. Please join our [Discord server](https://discord.gg/gG6GPp8XxQ) to give us your feedback**
-Continual pretraining model of the [Yi-6B](https://huggingface.co/01-ai/Yi-6B) model on a Cantonese corpus, which consisted of translated Hong Kong news, Wikipedia articles, subtitles, and open-sourced dialogue corpora. Additionally, we extended the vocabulary to include common Cantonese words.
-The goal of this model was to evaluate whether we could train a language model that is fluent in Cantonese with limited resources (200 million tokens). Surprisingly, the outcome was quite good. However, there are still some issues with mirror misalignment between written Chinese and Cantonese, as well as knowledge transfer across different languages.
+Continual pretraining model of the [Yi-6B](https://huggingface.co/01-ai/Yi-1.5-6B) model on a Cantonese corpus, which consisted of translated Hong Kong news, Wikipedia articles, subtitles, and open-sourced dialogue corpora. Additionally, we extended the vocabulary to include common Cantonese words.
+The goal of this model was to evaluate whether we could train a language model that is fluent in Cantonese with limited resources (400 million tokens). Surprisingly, the outcome was quite good. However, there are still some issues with mirror misalignment between written Chinese and Cantonese, as well as knowledge transfer across different languages.
 Here is a space you can interact with [CantoneseLLMChat](https://huggingface.co/spaces/hon9kon9ize/CantoneseLLMChat)
-[Technical Report](https://hon9kon9ize.com/posts/2024-04-28-cantonesellm_tech_report)
 ### Result