---
license: mit
datasets:
- p208p2002/wudao
language:
- zh
---

# Chinese TinyLlama

A demo project that pretrains a TinyLlama on Chinese corpora, with minimal modification to the Hugging Face Transformers code. It serves as a use case demonstrating how to use the Hugging Face version of [TinyLlama](https://github.com/whyNLP/tinyllama) to pretrain a model on a large corpus. See the [Github Repo](https://github.com/whyNLP/tinyllama-zh) for more details.

## Usage

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("whynlp/tinyllama-zh", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("whynlp/tinyllama-zh")
```

## Model Details

### Model Description

This model is trained on [WuDaoCorpora Text](https://www.scidb.cn/en/detail?dataSetId=c6a3fe684227415a9db8e21bac4a15ab). The dataset contains about 45B tokens and the model is trained for 2 epochs; training takes about 6 days on 8 A100 GPUs. The model uses the `THUDM/chatglm3-6b` tokenizer from Hugging Face.

- **Model type:** Llama
- **Language(s) (NLP):** Chinese
- **License:** MIT
- **Finetuned from model:** TinyLlama-2.5T checkpoint

## Uses

The model does not perform very well: its CMMLU score is only slightly above 25, i.e. barely better than random guessing on a four-choice benchmark. For better performance, one may use a better corpus (e.g. [WanJuan](https://opendatalab.org.cn/OpenDataLab/WanJuan1_dot_0)). Again, this project only serves as a demonstration of how to pretrain a TinyLlama on a large corpus.
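
As a quick sanity check after loading, one can run a short text generation with the checkpoint from the Usage section. The sketch below is illustrative only: the Chinese prompt and the decoding parameters are arbitrary choices, not part of the original project.

```python
# Minimal generation sketch for whynlp/tinyllama-zh.
# Prompt and sampling settings are illustrative, not prescribed by the project.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("whynlp/tinyllama-zh", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("whynlp/tinyllama-zh")
model.eval()

prompt = "人工智能是"  # "Artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=True,
        top_p=0.9,
        temperature=0.8,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Note that this is a base language model, so it will simply continue the prompt rather than follow instructions.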