
Contributed by toastynews (Toasty News)

How to use this model directly from the 🤗/transformers library:

from transformers import AutoTokenizer, ElectraForPreTraining

# This checkpoint is an ELECTRA discriminator, so it is loaded with the
# replaced-token-detection head rather than a language-modeling head.
tokenizer = AutoTokenizer.from_pretrained("toastynews/electra-hongkongese-small-discriminator")
model = ElectraForPreTraining.from_pretrained("toastynews/electra-hongkongese-small-discriminator")

ELECTRA Hongkongese Small

Model description

ELECTRA trained exclusively with data from Hong Kong. A significant amount of Hongkongese/Cantonese/Yue is included in the training data.

Intended uses & limitations

This model is an alternative to Chinese models. It may offer better performance on tasks that cater to the language usage of Hong Kongers. Because the training data uses Yue Wikipedia, which is much smaller than Chinese Wikipedia, this model will lack the breadth of knowledge of other Chinese models.

How to use

This is the small model, trained using the official repo. Further fine-tuning is needed before use on downstream tasks. Other model sizes are also available.
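As a starting point for the fine-tuning mentioned above, the following is a minimal sketch of loading the checkpoint with a classification head through the 🤗/transformers Auto classes. The helper names `load_for_classification` and `encode` are hypothetical, the choice of sequence classification as the downstream task is an assumption, and the 512-token limit is taken from the Max Sequence Size in the training parameters.

```python
MODEL_ID = "toastynews/electra-hongkongese-small-discriminator"


def load_for_classification(num_labels):
    """Load the tokenizer and the checkpoint with a new classification head (hypothetical helper)."""
    # Imported lazily so this module can be inspected without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # The classification head is freshly initialised and still has to be trained.
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_ID, num_labels=num_labels
    )
    return tokenizer, model


def encode(tokenizer, texts, max_length=512):
    """Tokenize a batch of strings, truncating to the assumed maximum sequence size."""
    return tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
```

Calling `load_for_classification(num_labels=2)` downloads the weights, after which `encode(tokenizer, [...])` produces PyTorch tensors that can be passed to the model or to a training loop of your choice.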

Limitations and bias

The training data consists mostly of news articles and blogs, so there is probably a bias towards formal language usage.

Training data

The following is the list of data sources. The corpus totals about 507M characters.

| Data | % |
| --- | --- |
| News Articles / Blogs | 58% |
| Yue Wikipedia / EVCHK | 18% |
| Restaurant Reviews | 12% |
| Forum Threads | 12% |
| Online Fiction | 1% |

The following is the distribution of different languages within the corpus.

| Language | % |
| --- | --- |
| Standard Chinese | 62% |
| Hongkongese | 30% |
| English | 8% |
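To give a feel for the scale behind these percentages, here is a small sketch that converts the shares listed above into rough character counts. It assumes the total is exactly 507M characters, although the text only says "about 507M", so the results are approximations. The data source shares add up to 101%, presumably because of rounding in the original figures.

```python
TOTAL_CHARS = 507_000_000  # assumption: "about 507M" treated as exactly 507M

# Percentages copied from the data source and language lists above.
SOURCES = {
    "News Articles / Blogs": 58,
    "Yue Wikipedia / EVCHK": 18,
    "Restaurant Reviews": 12,
    "Forum Threads": 12,
    "Online Fiction": 1,
}
LANGUAGES = {
    "Standard Chinese": 62,
    "Hongkongese": 30,
    "English": 8,
}


def approx_chars(shares, total=TOTAL_CHARS):
    """Convert whole-number percentage shares into approximate character counts."""
    return {name: total * pct // 100 for name, pct in shares.items()}


for name, count in approx_chars(LANGUAGES).items():
    print(f"{name}: ~{count / 1e6:.0f}M characters")
```

Under that assumption, the Hongkongese portion of the corpus comes to roughly 152M characters.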

Training procedure

The model was trained on a single TPUv3 using the official repo with the default parameters.

| Parameter | Value |
| --- | --- |
| Batch Size | 384 |
| Max Sequence Size | 512 |
| Generator Hidden Size | 1.0 |
| Vocab Size | 30000 |
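The parameters above could be expressed as a hyperparameter override for a pre-training run. The sketch below builds such an override as a JSON string. The key names (`model_size`, `train_batch_size`, `max_seq_length`, `generator_hidden_size`, `vocab_size`) are assumptions about the official repo's configuration and should be checked against the repo before use; only the values come from this page.

```python
import json

# Values from the parameter list above; key names are assumed, not confirmed by this page.
hparams = {
    "model_size": "small",
    "train_batch_size": 384,
    "max_seq_length": 512,
    "generator_hidden_size": 1.0,
    "vocab_size": 30000,
}

# Serialise once so the same string can be reused on a command line or saved to a file.
hparams_json = json.dumps(hparams)
print(hparams_json)
```

Keeping the overrides in one dictionary makes it easy to compare a run against the values reported here.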

This research was supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).

Eval results

Results are averaged over 10 runs for each evaluation task. The comparison uses the original repo model and code. Chinese models are available from the Joint Laboratory of HIT and iFLYTEK Research (HFL).

| Model | DRCD (EM / F1) | openrice-senti | lihkg-cat | wordshk-sem |
| --- | --- | --- | --- | --- |
| Chinese | 78.5 / 85.6 | 77.9 | 63.7 | 79.2 |
| Hongkongese | 76.7 / 84.4 | 79.0 | 62.6 | 80.0 |
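To make the comparison easier to read, the following sketch computes the per-task difference between the two rows above. The scores are copied from the results as reported. The helper name `deltas` is hypothetical, and splitting DRCD into separate EM and F1 entries is a presentation choice made here.

```python
# Scores copied from the evaluation results above.
RESULTS = {
    "Chinese": {
        "DRCD EM": 78.5, "DRCD F1": 85.6,
        "openrice-senti": 77.9, "lihkg-cat": 63.7, "wordshk-sem": 79.2,
    },
    "Hongkongese": {
        "DRCD EM": 76.7, "DRCD F1": 84.4,
        "openrice-senti": 79.0, "lihkg-cat": 62.6, "wordshk-sem": 80.0,
    },
}


def deltas(results, model="Hongkongese", baseline="Chinese"):
    """Return model-minus-baseline score per task, rounded to one decimal place."""
    return {
        task: round(results[model][task] - results[baseline][task], 1)
        for task in results[baseline]
    }


for task, diff in deltas(RESULTS).items():
    print(f"{task}: {diff:+.1f}")
```

Positive values mean the Hongkongese model scored higher. On these numbers it is ahead on openrice-senti and wordshk-sem, and behind on DRCD and lihkg-cat.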