metadata

library_name: transformers
license: cc-by-4.0
base_model: indiejoseph/bert-base-cantonese
tags:
  - generated_from_trainer
model-index:
  - name: bert-base-cantonese
    results: []

bert-base-cantonese

This model is a continuation of indiejoseph/bert-base-cantonese, a BERT-based model pre-trained on a large corpus of Cantonese text. The corpus was drawn from news articles, social media posts, and web pages. The text was segmented into sentences, one per line, each containing 11 to 460 tokens. MinHash LSH was used to remove near-duplicate sentences, leaving a final dataset of 161,338,273 tokens. Training used the run_mlm.py script from the transformers library.
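The length filter and near-duplicate removal described above can be sketched as follows. This is a minimal standard-library illustration, not the pipeline actually used for this model: the shingle size, the number of hash functions, the band count, and the use of character count as a stand-in for token count are all assumptions; only the 11 to 460 length bounds come from the text.

```python
import hashlib
from typing import Callable, Iterable, List

NUM_PERM = 64   # number of seeded hash functions (assumed, not from the card)
BANDS = 16      # LSH bands; rows per band = NUM_PERM // BANDS (assumed)
MIN_TOKENS, MAX_TOKENS = 11, 460  # length bounds stated in the card


def shingles(text: str, n: int = 3) -> set:
    """Character n-grams, a simple unit for unsegmented Cantonese text."""
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}


def minhash(text: str) -> List[int]:
    """MinHash signature: per seeded hash, the minimum value over all shingles."""
    grams = shingles(text)
    signature = []
    for seed in range(NUM_PERM):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for g in grams
        ))
    return signature


def deduplicate(sentences: Iterable[str],
                token_count: Callable[[str], int] = len) -> List[str]:
    """Keep in-range sentences that share no LSH band with an earlier kept one.

    token_count defaults to len (characters) as a rough proxy; pass a real
    tokenizer-based counter to match the card's token limits.
    """
    rows = NUM_PERM // BANDS
    seen_buckets = set()
    kept = []
    for sentence in sentences:
        if not MIN_TOKENS <= token_count(sentence) <= MAX_TOKENS:
            continue  # outside the stated length bounds
        sig = minhash(sentence)
        keys = [(b, tuple(sig[b * rows:(b + 1) * rows])) for b in range(BANDS)]
        if any(key in seen_buckets for key in keys):
            continue  # matches an earlier sentence in at least one band
        seen_buckets.update(keys)
        kept.append(sentence)
    return kept
```

More bands with fewer rows each make the filter more aggressive, so these two values would need tuning against a sample of the corpus.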

WandB

Intended uses & limitations

This model is intended for fine-tuning on downstream Cantonese tasks.
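As a starting point for fine-tuning, the checkpoint can be loaded with the transformers Auto classes. The sketch below assumes a sequence-classification task with two labels. `MODEL_ID` is set to the base model named in the metadata and should be replaced with this checkpoint's own repository id. The helper names `load_for_classification` and `encode` are illustrative, not part of any library.

```python
# Base model id from the card's metadata; replace with this checkpoint's id.
MODEL_ID = "indiejoseph/bert-base-cantonese"


def load_for_classification(model_id: str = MODEL_ID, num_labels: int = 2):
    """Load the tokenizer and encoder with a newly initialised classification head."""
    # Imported lazily so this module can be read without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id, num_labels=num_labels
    )
    return tokenizer, model


def encode(tokenizer, texts, max_length: int = 128):
    """Tokenize a batch of sentences into padded PyTorch tensors."""
    return tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
```

Calling `load_for_classification()` downloads the weights on first use. The classification head is randomly initialised, so the model needs task-specific training before its predictions mean anything.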

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-05
train_batch_size: 180
eval_batch_size: 8
seed: 42
gradient_accumulation_steps: 8
total_train_batch_size: 1440
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.1
num_epochs: 5.0
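Two of the values above are derived rather than set directly, and a short calculation makes the relationships explicit. The sketch below reproduces the total batch size and a cosine schedule with linear warmup of the kind the transformers Trainer applies for `lr_scheduler_type: cosine`. The single-device assumption is inferred from 180 × 8 = 1440 and is not stated in the card, and any `total_steps` value passed in is purely illustrative.

```python
import math

LEARNING_RATE = 5e-5     # learning_rate
TRAIN_BATCH_SIZE = 180   # train_batch_size (per device)
GRAD_ACCUM_STEPS = 8     # gradient_accumulation_steps
WARMUP_RATIO = 0.1       # lr_scheduler_warmup_ratio


def effective_batch_size(per_device: int, accum: int, num_devices: int = 1) -> int:
    """Sequences contributing to each optimizer update."""
    return per_device * accum * num_devices


def cosine_lr(step: int, total_steps: int,
              base_lr: float = LEARNING_RATE,
              warmup_ratio: float = WARMUP_RATIO) -> float:
    """Linear warmup to base_lr, then half-cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # One device assumed; matches total_train_batch_size in the list above.
    print(effective_batch_size(TRAIN_BATCH_SIZE, GRAD_ACCUM_STEPS))
    # Hypothetical 1000-step run: the rate peaks when warmup ends at step 100.
    for step in (0, 50, 100, 550, 1000):
        print(step, cosine_lr(step, total_steps=1000))
```

With a warmup ratio of 0.1, the first tenth of the optimizer steps ramp the rate up linearly, and the remaining steps decay it along a half cosine.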

Framework versions

Transformers 4.45.0
PyTorch 2.4.1+cu121
Datasets 2.20.0
Tokenizers 0.20.0
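To compare a local environment against the versions listed above, a small check along these lines may help. It is a sketch using only the standard library. The distribution names are the usual PyPI names for these packages, and ignoring the local build tag (such as `+cu121`) when comparing is a choice made here for convenience.

```python
from importlib import metadata

# Versions listed in the card, keyed by PyPI distribution name.
EXPECTED = {
    "transformers": "4.45.0",
    "torch": "2.4.1+cu121",
    "datasets": "2.20.0",
    "tokenizers": "0.20.0",
}


def base_version(version: str) -> str:
    """Drop a local build tag such as '+cu121'."""
    return version.split("+", 1)[0]


def check_versions(expected=EXPECTED, installed_version=metadata.version) -> dict:
    """Map each package to 'ok', 'not installed', or a mismatch message."""
    report = {}
    for name, wanted in expected.items():
        try:
            found = installed_version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        if base_version(found) == base_version(wanted):
            report[name] = "ok"
        else:
            report[name] = f"found {found}, card lists {wanted}"
    return report


if __name__ == "__main__":
    for package, status in check_versions().items():
        print(f"{package}: {status}")
```

A mismatch does not necessarily mean the model will fail to load; the listed versions are simply those used during training.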