indiejoseph/bart-base-cantonese

indiejoseph committed on Nov 29, 2023

Commit d8dbbba (1 parent: b814aaf)

Update README.md

Files changed (1)

README.md +16 -4

README.md CHANGED

@@ -1,9 +1,23 @@
 ---
 tags:
 - generated_from_trainer
 model-index:
 - name: bart-base-cantonese
   results: []
 ---
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
@@ -11,11 +25,9 @@ should probably proofread and complete it, then remove this comment. -->
 # bart-base-cantonese
-This model was trained from scratch on the None dataset.
-## Model description
-More information needed
 ## Intended uses & limitations

 ---
+base_model: fnlp/bart-base-chinese
 tags:
 - generated_from_trainer
 model-index:
 - name: bart-base-cantonese
   results: []
+datasets:
+- indiejoseph/wikipedia-zh-yue-filtered
+- indiejoseph/cc100-yue
+- indiejoseph/ted-transcriptions-cantonese
+- indiejoseph/c4-cantonese-filtered
+- mozilla-foundation/common_voice_13_0
+- jed351/rthk_news
+- jed351/shikoto_zh_hk
+widget:
+- text: "今日去咗旺角[MASK]"
+  example_title: "Mong Kok"
+- text: "今時今日香港係一個[MASK]。"
+  example_title: "Hong Kong"
 ---
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 # bart-base-cantonese
+This model is a continued pre-training of [fnlp/bart-base-chinese](https://huggingface.co/fnlp/bart-base-chinese) on a filtered Cantonese Common Crawl dataset with 950M tokens.
+The tokenizer extends the BERT tokenizer from fnlp/bart-base-chinese with 500 additional Chinese characters commonly found in Cantonese.
 ## Intended uses & limitations
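
The tokenizer note added in this commit says the base vocabulary was extended with extra Cantonese characters. The commit does not show how that was done, so the following is only a minimal sketch of the general idea: characters missing from a vocabulary are appended at the next free ids, and characters already present are skipped. The function name `extend_vocab`, the toy `base_vocab`, and the sample character list are all hypothetical and are not taken from the model's actual vocabulary.

```python
def extend_vocab(vocab, new_chars):
    """Return a copy of `vocab` with unseen characters appended at the next free ids.

    Hypothetical helper for illustration; not the author's actual procedure.
    """
    extended = dict(vocab)  # leave the caller's vocabulary untouched
    next_id = max(extended.values(), default=-1) + 1
    added = []
    for ch in new_chars:
        if ch not in extended:  # skip characters the base vocabulary already covers
            extended[ch] = next_id
            added.append(ch)
            next_id += 1
    return extended, added


# Toy base vocabulary (assumed for illustration only).
base_vocab = {"[PAD]": 0, "[UNK]": 1, "[MASK]": 2, "今": 3, "日": 4, "去": 5}

# A few characters common in written Cantonese, plus one that already exists.
cantonese_chars = ["咗", "係", "嘅", "日"]

extended_vocab, added_chars = extend_vocab(base_vocab, cantonese_chars)
print(added_chars)
print(len(base_vocab), len(extended_vocab))
```

With Hugging Face Transformers, the analogous step is usually `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))` so the embedding matrix matches the larger vocabulary. Whether this model was prepared that way is not stated in the commit.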