---
library_name: transformers
license: apache-2.0
datasets:
- liswei/zhtw-news-and-articles-2B
base_model: apple/OpenELM-270M
language:
- zh
---
# Model Card for Chinese-OpenELM-270M
Continually pre-trained from [apple/OpenELM-270M](https://huggingface.co/apple/OpenELM-270M) on [liswei/zhtw-news-and-articles-2B](https://huggingface.co/datasets/liswei/zhtw-news-and-articles-2B):
* Extended vocabulary from 32000 to 61758 tokens with additional Traditional Chinese characters.
* The tokenizer was trained on [liswei/zhtw-news-and-articles-2B](https://huggingface.co/datasets/liswei/zhtw-news-and-articles-2B) and pruned from 96000 to 61758 tokens while maintaining 95% coverage of the pre-training dataset.
* Additional token embeddings are initialized with the mean vector of existing embeddings.
* Traditional Chinese perplexity = 1.6871 on a held-out evaluation dataset.
* Applied [GaLore](https://arxiv.org/abs/2403.03507) for efficient training with the following hyperparameters:
  * Rank: 1024
  * Scale: 4.0
  * Update interval: 200
  * Layer-wise training: False
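
The mean-vector initialization of the additional token embeddings can be illustrated with a minimal sketch. It uses plain Python lists rather than the model's real embedding tensor, and `extend_with_mean_rows` is a hypothetical helper written for this example, not the training code used for this model. For this model the same idea would add 61758 - 32000 = 29758 rows.

```python
def extend_with_mean_rows(embeddings, new_vocab_size):
    """Append rows equal to the column-wise mean of the existing rows.

    `embeddings` is a list of equal-length float lists, one row per token.
    Hypothetical helper for illustration only.
    """
    old_vocab_size = len(embeddings)
    if new_vocab_size < old_vocab_size:
        raise ValueError("new_vocab_size must not be smaller than the current vocabulary")
    dim = len(embeddings[0])
    # Mean vector over all existing token embeddings.
    mean_row = [sum(row[j] for row in embeddings) / old_vocab_size for j in range(dim)]
    # Copy the original rows unchanged, then add one mean row per new token.
    extended = [list(row) for row in embeddings]
    extended += [list(mean_row) for _ in range(new_vocab_size - old_vocab_size)]
    return extended


# Toy example: 3 existing tokens with 2-dimensional embeddings, extended to 5 tokens.
old = [[0.0, 2.0], [1.0, 4.0], [2.0, 6.0]]
new = extend_with_mean_rows(old, 5)
print(new[3])  # [1.0, 4.0]
```

Starting every new token at the mean keeps its initial logits close to those of an "average" existing token, so the extended model does not begin training with arbitrary outputs for the new vocabulary.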
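
For readers who want to try comparable settings, the sketch below shows one way the GaLore hyperparameters listed above might be expressed as keyword arguments for `transformers.TrainingArguments`, which accepts GaLore options through `optim`, `optim_target_modules`, and `optim_args`. This card does not say which training code was used, so the mapping is an assumption, and the target module patterns are hypothetical placeholders.

```python
# GaLore settings from this card, using the parameter names that the GaLore
# optimizer expects ("update_proj_gap" corresponds to the update interval).
galore_hparams = {"rank": 1024, "update_proj_gap": 200, "scale": 4.0}

training_kwargs = {
    # "galore_adamw_layerwise" would enable layer-wise training instead;
    # the card lists layer-wise training as False.
    "optim": "galore_adamw",
    # Assumed module-name patterns; adjust to the actual layer names of the model.
    "optim_target_modules": ["attn", "ffn"],
    # Trainer parses this as a comma-separated list of key=value pairs.
    "optim_args": ", ".join(f"{k}={v}" for k, v in galore_hparams.items()),
}

print(training_kwargs["optim_args"])  # rank=1024, update_proj_gap=200, scale=4.0
```

These keyword arguments could then be passed as `TrainingArguments(output_dir=..., **training_kwargs)` in an environment where `transformers` and `galore-torch` are installed.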