indiejoseph committed on
Commit
d8dbbba
1 Parent(s): b814aaf

Update README.md

Files changed (1)
  1. README.md +16 -4
README.md CHANGED
@@ -1,9 +1,23 @@
 ---
+base_model: fnlp/bart-base-chinese
 tags:
 - generated_from_trainer
 model-index:
 - name: bart-base-cantonese
   results: []
+datasets:
+- indiejoseph/wikipedia-zh-yue-filtered
+- indiejoseph/cc100-yue
+- indiejoseph/ted-transcriptions-cantonese
+- indiejoseph/c4-cantonese-filtered
+- mozilla-foundation/common_voice_13_0
+- jed351/rthk_news
+- jed351/shikoto_zh_hk
+widget:
+- text: "今日去咗旺角[MASK]"
+  example_title: "Mong Kok"
+- text: "今時今日香港係一個[MASK]。"
+  example_title: "Hong Kong"
 ---
 
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
@@ -11,11 +25,9 @@ should probably proofread and complete it, then remove this comment. -->
 
 # bart-base-cantonese
 
-This model was trained from scratch on the None dataset.
+This model is a continued pre-training of [fnlp/bart-base-chinese](https://huggingface.co/fnlp/bart-base-chinese) on a filtered Cantonese Common Crawl dataset of 950M tokens.
 
-## Model description
-
-More information needed
+The tokenizer extends the BERT tokenizer from fnlp/bart-base-chinese with 500 additional Chinese characters commonly found in Cantonese.
 
 ## Intended uses & limitations
 
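
The README change says the tokenizer was extended with extra Cantonese characters, but the commit does not show how. The sketch below only illustrates the general idea of appending unseen characters to an existing vocabulary at fresh ids. It is not the author's procedure: `extend_vocab`, the toy `base_vocab`, and the `cantonese_chars` list are all hypothetical names and data chosen for the example.

```python
def extend_vocab(vocab, new_chars):
    """Return a copy of `vocab` with unseen characters appended at fresh ids."""
    extended = dict(vocab)
    # New ids continue after the largest id already in use.
    next_id = max(extended.values(), default=-1) + 1
    for ch in new_chars:
        if ch not in extended:  # skip characters the vocabulary already covers
            extended[ch] = next_id
            next_id += 1
    return extended


# Hypothetical toy vocabulary; a real tokenizer vocabulary is far larger.
base_vocab = {"[PAD]": 0, "[UNK]": 1, "[MASK]": 2, "今": 3, "日": 4, "去": 5}

# Hypothetical candidates: three Cantonese-specific characters plus one
# character ("日") that the base vocabulary already contains.
cantonese_chars = ["咗", "係", "嘅", "日"]

extended = extend_vocab(base_vocab, cantonese_chars)
print(len(extended) - len(base_vocab), "characters added")
```

With the Hugging Face `transformers` library, the same step is usually done with `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`, so that the embedding matrix gains rows for the new ids before pre-training continues. Whether this commit's tokenizer was built that way is an assumption.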