
bert-large-cantonese

Description

This model was trained from scratch on Cantonese text. It is a BERT model with a large architecture (24 layers, hidden size 1024, 16 attention heads, 326M parameters).
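As a rough sanity check on the architecture figures above, the sketch below counts the parameters in the Transformer encoder stack alone. It assumes the standard BERT layout (a feed-forward size of 4 × hidden, bias terms on every linear layer, two LayerNorms per layer), which this card does not state explicitly. The function name `bert_encoder_params` is made up for illustration.

```python
def bert_encoder_params(num_layers: int, hidden: int, intermediate: int = 0) -> int:
    """Count parameters in a BERT-style encoder stack (embeddings and heads excluded).

    Assumption: standard BERT layout; intermediate size defaults to 4 * hidden.
    """
    if intermediate <= 0:
        intermediate = 4 * hidden

    # Self-attention: query, key, value, and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two LayerNorms per layer, each with a weight and a bias vector.
    layer_norms = 2 * (2 * hidden)
    # Feed-forward block: hidden -> intermediate -> hidden, both with biases.
    feed_forward = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)

    return num_layers * (attention + layer_norms + feed_forward)


encoder = bert_encoder_params(num_layers=24, hidden=1024)
print(f"encoder parameters: {encoder:,}")
```

Under these assumptions the encoder accounts for roughly 302M of the 326M parameters. The remainder sits in the embeddings and the masked-language-modeling head, whose size depends on the vocabulary size, which is not given here.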

Training ran in two stages. The first stage pre-trained the model on sequences of length 128 with a batch size of 512 for 1 epoch. The second stage continued pre-training on sequences of length 512 with a batch size of 512 for one more epoch.
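To make the difference between the two stages concrete, the short sketch below computes the upper bound on tokens processed per optimizer step, using only the sequence lengths and batch size given above. It assumes every sequence is filled to its maximum length, so real batches with padding would carry fewer tokens. The names `STAGES` and `max_tokens_per_step` are illustrative only.

```python
# Stage settings taken from the description above.
STAGES = [
    {"name": "stage 1", "seq_len": 128, "batch_size": 512},
    {"name": "stage 2", "seq_len": 512, "batch_size": 512},
]


def max_tokens_per_step(seq_len: int, batch_size: int) -> int:
    """Upper bound on tokens per optimizer step, assuming fully packed sequences."""
    return seq_len * batch_size


for stage in STAGES:
    tokens = max_tokens_per_step(stage["seq_len"], stage["batch_size"])
    print(f"{stage['name']}: up to {tokens:,} tokens per step")
```

With the same batch size, each step in the second stage can cover up to four times as many tokens as a step in the first.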

How to use

You can use this model directly with a pipeline for masked language modeling:

from transformers import pipeline

mask_filler = pipeline(
    "fill-mask",
    model="hon9kon9ize/bert-large-cantonese"
)

mask_filler("雞蛋六隻,糖呢就兩茶匙,仲有[MASK]橙皮添。")

# [{'score': 0.08160534501075745,
#   'token': 943,
#   'token_str': '個',
#   'sequence': '雞 蛋 六 隻 , 糖 呢 就 兩 茶 匙 , 仲 有 個 橙 皮 添 。'},
#  {'score': 0.06182105466723442,
#   'token': 1576,
#   'token_str': '啲',
#   'sequence': '雞 蛋 六 隻 , 糖 呢 就 兩 茶 匙 , 仲 有 啲 橙 皮 添 。'},
#  {'score': 0.04600336775183678,
#   'token': 1646,
#   'token_str': '嘅',
#   'sequence': '雞 蛋 六 隻 , 糖 呢 就 兩 茶 匙 , 仲 有 嘅 橙 皮 添 。'},
#  {'score': 0.03743772581219673,
#   'token': 3581,
#   'token_str': '橙',
#   'sequence': '雞 蛋 六 隻 , 糖 呢 就 兩 茶 匙 , 仲 有 橙 橙 皮 添 。'},
#  {'score': 0.031560592353343964,
#   'token': 5148,
#   'token_str': '紅',
#   'sequence': '雞 蛋 六 隻 , 糖 呢 就 兩 茶 匙 , 仲 有 紅 橙 皮 添 。'}]
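The `sequence` field in the output above comes back with a space between every token. The following is a minimal sketch of post-processing such a result list: it picks the highest-scoring candidate and strips the spaces. The helpers `join_tokens` and `best_prediction` are hypothetical names, not part of `transformers`, and removing every space is only safe for text without space-delimited words (for example, no embedded English). The sample data is the first two entries of the output shown above.

```python
def join_tokens(sequence: str) -> str:
    """Remove the spaces the tokenizer inserts between tokens."""
    return sequence.replace(" ", "")


def best_prediction(predictions):
    """Return (token, sentence) for the highest-scoring fill-mask candidate."""
    top = max(predictions, key=lambda p: p["score"])
    return top["token_str"], join_tokens(top["sequence"])


# First two entries copied from the pipeline output shown above.
predictions = [
    {"score": 0.08160534501075745, "token": 943, "token_str": "個",
     "sequence": "雞 蛋 六 隻 , 糖 呢 就 兩 茶 匙 , 仲 有 個 橙 皮 添 。"},
    {"score": 0.06182105466723442, "token": 1576, "token_str": "啲",
     "sequence": "雞 蛋 六 隻 , 糖 呢 就 兩 茶 匙 , 仲 有 啲 橙 皮 添 。"},
]

token, sentence = best_prediction(predictions)
print(token, sentence)
```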

Training hyperparameters

The following hyperparameters were used during the first training stage:

  • Batch size: 512
  • Learning rate: 1e-4
  • Learning rate scheduler: linear decay
  • Epochs: 1
  • Warmup ratio: 0.1

Loss plot on W&B

The following hyperparameters were used during the second training stage:

  • Batch size: 512
  • Learning rate: 5e-5
  • Learning rate scheduler: linear decay
  • Epochs: 1
  • Warmup ratio: 0.1

Loss plot on W&B
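The learning rate settings listed for both stages can be illustrated with a small pure-Python sketch. It assumes the common interpretation of "linear decay" with a warmup ratio: the rate rises linearly from 0 to the peak over the first 10% of steps, then falls linearly to 0 at the final step. The card does not confirm the exact scheduler implementation, and the total step count used below (1,000) is a placeholder, since the real number of steps is not stated.

```python
def linear_schedule_lr(step: int, total_steps: int, peak_lr: float,
                       warmup_ratio: float = 0.1) -> float:
    """Learning rate at a given step: linear warmup, then linear decay to zero.

    Assumption: warmup starts at 0 and decay ends at 0 on the last step.
    """
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: scale up proportionally to the step index.
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: scale down proportionally to the steps remaining.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


TOTAL_STEPS = 1000  # placeholder; the actual step count is not given in this card

for name, peak in [("stage 1", 1e-4), ("stage 2", 5e-5)]:
    samples = [linear_schedule_lr(s, TOTAL_STEPS, peak) for s in (0, 50, 100, 550, 1000)]
    print(name, ["%.2e" % lr for lr in samples])
```

The peak is reached at the end of warmup (step 100 of 1,000 here), and the second stage follows the same shape at half the peak rate.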
