hon9kon9ize/bert-base-cantonese


	---
	library_name: transformers
	license: cc-by-4.0
	base_model: indiejoseph/bert-base-cantonese
	tags:
	- generated_from_trainer
	model-index:
	- name: bert-base-cantonese
	results: []
	---


	# bert-base-cantonese

	This model is a continuation of [indiejoseph/bert-base-cantonese](https://huggingface.co/indiejoseph/bert-base-cantonese), a BERT-based model pre-trained on a substantial corpus of Cantonese text. The corpus was collected from news articles, social media posts, and web pages. The text was split into sentences, one per line, each containing 11 to 460 tokens. To ensure data quality, MinHash LSH was used to remove near-duplicate sentences, leaving a final dataset of 161,338,273 tokens. Training used the `run_mlm.py` script from the `transformers` library.
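The near-duplicate filtering described above can be illustrated with a small, dependency-free sketch. It is not the pipeline used for this model: the shingle size, number of hash functions, band count, and similarity threshold are all assumptions chosen for illustration, and a production run would more likely rely on a dedicated MinHash library.

```python
import hashlib
from collections import defaultdict

# All four values below are illustrative assumptions, not the settings used for this model.
NUM_PERM = 64    # hash functions per signature
BANDS = 16       # LSH bands (16 bands x 4 rows = 64 hashes)
NGRAM = 3        # character n-gram size used for shingling
THRESHOLD = 0.7  # estimated Jaccard similarity treated as a near-duplicate


def shingles(text, n=NGRAM):
    """Return the set of character n-grams in `text`."""
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def seeded_hash(gram, seed):
    """Hash one shingle to a 64-bit integer, using `seed` as the BLAKE2b salt."""
    digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(4, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash(text, num_perm=NUM_PERM):
    """MinHash signature: the smallest shingle hash under each seeded hash function."""
    grams = shingles(text)
    return [min(seeded_hash(g, seed) for g in grams) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(sentences, num_perm=NUM_PERM, bands=BANDS, threshold=THRESHOLD):
    """Keep the first sentence of each near-duplicate group, preserving order."""
    rows = num_perm // bands
    buckets = defaultdict(list)  # (band index, band values) -> indices into `kept`
    signatures = []              # signatures of the kept sentences
    kept = []
    for sentence in sentences:
        sig = minhash(sentence, num_perm)
        keys = [(b, tuple(sig[b * rows:(b + 1) * rows])) for b in range(bands)]
        # LSH step: only sentences sharing at least one band are compared.
        candidates = {i for key in keys for i in buckets.get(key, [])}
        if any(estimated_jaccard(sig, signatures[i]) >= threshold for i in candidates):
            continue
        index = len(kept)
        kept.append(sentence)
        signatures.append(sig)
        for key in keys:
            buckets[key].append(index)
    return kept
```

Calling `deduplicate(lines)` on a list of sentences returns the list with later near-duplicates dropped. Banding keeps the comparison count low, because a sentence is only checked against earlier sentences that share at least one band of its signature.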

	Training logs: [WandB](https://wandb.ai/indiejoseph/public/runs/wy2ja88z/workspace?nw=nwuserindiejoseph)


	## Intended uses & limitations

	This model is intended as a starting point for further fine-tuning on Cantonese downstream tasks.
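As a starting point for fine-tuning, the checkpoint can be loaded with the `transformers` Auto classes. The sketch below assumes the Hub repository ID is `hon9kon9ize/bert-base-cantonese` (taken from the page header) and uses sequence classification as a hypothetical downstream task. The helper name `load_for_classification` is illustrative only.

```python
# Assumed Hub repository ID, taken from the page header; adjust if it differs.
MODEL_ID = "hon9kon9ize/bert-base-cantonese"


def load_for_classification(num_labels, model_id=MODEL_ID):
    """Load the tokenizer and the encoder with a fresh classification head."""
    # Imported lazily so this file can be imported without `transformers` installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # The classification head is newly initialised and must be trained
    # on the downstream task before the model is useful.
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id, num_labels=num_labels
    )
    return tokenizer, model
```

For example, `tokenizer, model = load_for_classification(num_labels=3)` downloads the weights on first use. The returned objects can then be passed to `Trainer` or a custom training loop.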

	## Training procedure

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 5e-05
	- train_batch_size: 180
	- eval_batch_size: 8
	- seed: 42
	- gradient_accumulation_steps: 8
	- total_train_batch_size: 1440
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: cosine
	- lr_scheduler_warmup_ratio: 0.1
	- num_epochs: 5.0
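The relationship between these values can be checked with a short sketch. It assumes a single training device (the listed total of 1440 equals 180 × 8, which only holds with one device) and a hypothetical `total_steps`, since the number of optimizer steps is not stated in this card. The schedule follows the usual linear-warmup-then-cosine-decay shape.

```python
import math

# Values taken from the hyperparameter list above.
LEARNING_RATE = 5e-5
TRAIN_BATCH_SIZE = 180        # per-device batch size
GRAD_ACCUM_STEPS = 8
WARMUP_RATIO = 0.1

NUM_DEVICES = 1               # assumption: not stated in the card


def total_train_batch_size(per_device=TRAIN_BATCH_SIZE,
                           accum_steps=GRAD_ACCUM_STEPS,
                           devices=NUM_DEVICES):
    """Number of examples that contribute to one optimizer step."""
    return per_device * accum_steps * devices


def cosine_lr(step, total_steps, base_lr=LEARNING_RATE, warmup_ratio=WARMUP_RATIO):
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


if __name__ == "__main__":
    print(total_train_batch_size())  # 180 * 8 * 1 = 1440
    total_steps = 1000               # hypothetical, for illustration only
    for step in (0, 50, 100, 550, 1000):
        print(step, cosine_lr(step, total_steps))
```

With a warmup ratio of 0.1, the learning rate climbs linearly over the first tenth of training, peaks at 5e-05, and then decays along a half cosine to zero at the final step.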


	### Framework versions

	- Transformers 4.45.0
	- PyTorch 2.4.1+cu121
	- Datasets 2.20.0
	- Tokenizers 0.20.0