system committed on
Commit
019ac78
1 Parent(s): ea7baa7

Update README.md

Files changed (1)
  1. README.md +70 -0
README.md ADDED
@@ -0,0 +1,70 @@
1
+ ---
2
+ language: yue
3
+ license: apache-2.0
4
+ metrics:
5
+ - DRCD
6
+ - openrice-senti
7
+ - lihkg-cat
8
+ - wordshk-sem
9
+ ---
10
+
11
+ # ELECTRA Hongkongese Small
12
+
13
+ ## Model description
14
+
15
+ ELECTRA trained exclusively on data from Hong Kong. A significant amount of Hongkongese/Cantonese/Yue is included in the training data.
16
+
17
+ ## Intended uses & limitations
18
+
19
+ This model is an alternative to Chinese models. It may offer better performance on tasks catering to the language usage of Hong Kongers. Because the Yue Wikipedia used for training is much smaller than the Chinese Wikipedia, this model lacks the breadth of knowledge of other Chinese models.
20
+
21
+ #### How to use
22
+
23
+ This is the small model, trained using the official repo. Further fine-tuning is needed before use on downstream tasks. Other model sizes are also available.
24
+
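As a rough illustration of the fine-tuning starting point described above, the following sketch loads a checkpoint with the Hugging Face `transformers` library and attaches a fresh classification head. The model ID is a hypothetical placeholder (this README does not state the hub ID), and the choice of a sequence-classification head with two labels is an assumption for the example only.

```python
# Hypothetical placeholder: replace with the actual hub ID of this model.
MODEL_ID = "your-namespace/electra-hongkongese-small-discriminator"


def load_for_classification(model_id: str = MODEL_ID, num_labels: int = 2):
    """Load the tokenizer and the model with an untrained classification head."""
    # Imported lazily so the module can be inspected without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # The classification head is randomly initialised and must be fine-tuned.
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id, num_labels=num_labels
    )
    return tokenizer, model
```

Calling `load_for_classification()` downloads the weights, so it needs network access and a valid model ID. The returned model is only a starting point for fine-tuning on a downstream task.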
25
+ #### Limitations and bias
26
+
27
+ The training data consists mostly of news articles and blogs. There is probably a bias towards formal language usage.
28
+
29
+ ## Training data
30
+
31
+ The following is the list of data sources. The corpus totals about 507M characters.
32
+
33
+ | Data | % |
34
+ | ------------------------------------------------- | --: |
35
+ | News Articles / Blogs | 58% |
36
+ | Yue Wikipedia / EVCHK | 18% |
37
+ | Restaurant Reviews | 12% |
38
+ | Forum Threads | 12% |
39
+ | Online Fiction | 1% |
40
+
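To give a feel for the absolute sizes behind the percentages above, this small sketch converts each share into an approximate character count. It assumes the total is exactly 507M characters, which the README gives only as an approximation, and the rounded shares add up to 101%, so the figures are estimates.

```python
TOTAL_CHARS = 507_000_000  # "about 507M" in the README; treated as exact here

# Shares in percent, copied from the data source table.
SOURCE_SHARE_PCT = {
    "News Articles / Blogs": 58,
    "Yue Wikipedia / EVCHK": 18,
    "Restaurant Reviews": 12,
    "Forum Threads": 12,
    "Online Fiction": 1,
}


def approx_chars(total: int, shares_pct: dict) -> dict:
    """Return the approximate number of characters per data source."""
    return {name: total * pct // 100 for name, pct in shares_pct.items()}


approx = approx_chars(TOTAL_CHARS, SOURCE_SHARE_PCT)
for name, chars in approx.items():
    print(f"{name}: ~{chars / 1e6:.1f}M characters")
```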
41
+ The following is the distribution of different languages within the corpus.
42
+
43
+ | Language | % |
44
+ | ------------------------------------------------- | --: |
45
+ | Standard Chinese | 62% |
46
+ | Hongkongese | 30% |
47
+ | English | 8% |
48
+
49
+ ## Training procedure
50
+
51
+ The model was trained on a single TPUv3 using the official repo with the default parameters.
52
+
53
+ | Parameter | Value |
54
+ | ------------------------------------------------ | ----: |
55
+ | Batch Size | 384 |
56
+ | Max Sequence Size | 512 |
57
+ | Generator Hidden Size | 1.0 |
58
+ | Vocab Size | 30000 |
59
+
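The parameters above could be passed to a pretraining script as a JSON string of hyperparameter overrides. The sketch below builds such a string. The key names are an assumption: they are chosen to follow the naming used in the official ELECTRA repo's pretraining configuration, and this README does not confirm the exact settings file that was used.

```python
import json

# Values copied from the parameter table; key names are assumed, not confirmed.
hparams = {
    "model_size": "small",
    "train_batch_size": 384,
    "max_seq_length": 512,
    "generator_hidden_size": 1.0,  # generator width as a fraction of the discriminator's
    "vocab_size": 30000,
}

hparams_json = json.dumps(hparams)
print(hparams_json)
```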
60
+ *Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC)*
61
+
62
+ ## Eval results
63
+
64
+ Evaluation results are averaged over 10 runs. The comparison uses the original repo's model and code. The Chinese models are available from [Joint Laboratory of HIT and iFLYTEK Research (HFL)](https://huggingface.co/hfl).
65
+
66
+ | Model | DRCD (EM/F1) | openrice-senti | lihkg-cat | wordshk-sem |
67
+ |:-----------:|:------------:|:--------------:|:---------:|:-----------:|
68
+ | Chinese | 78.5 / 85.6 | 77.9 | 63.7 | 79.2 |
69
+ | Hongkongese | 76.7 / 84.4 | 79.0 | 62.6 | 80.0 |
70
+
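To make the comparison in the table easier to read, the sketch below computes the per-task difference between the Hongkongese and Chinese models. The scores are copied from the table. Splitting DRCD into separate EM and F1 entries is a presentation choice for this example.

```python
RESULTS = {
    "Chinese": {
        "DRCD EM": 78.5, "DRCD F1": 85.6,
        "openrice-senti": 77.9, "lihkg-cat": 63.7, "wordshk-sem": 79.2,
    },
    "Hongkongese": {
        "DRCD EM": 76.7, "DRCD F1": 84.4,
        "openrice-senti": 79.0, "lihkg-cat": 62.6, "wordshk-sem": 80.0,
    },
}


def deltas(results: dict, model: str, baseline: str) -> dict:
    """Return model score minus baseline score per task, rounded to 0.1."""
    return {
        task: round(results[model][task] - results[baseline][task], 1)
        for task in results[model]
    }


diff = deltas(RESULTS, "Hongkongese", "Chinese")
for task, delta in diff.items():
    print(f"{task}: {delta:+.1f}")
```

Positive values mark tasks where the Hongkongese model scores higher (openrice-senti and wordshk-sem), and negative values mark tasks where the Chinese model scores higher.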