---
license: llama2
datasets:
- pkupie/mc2_corpus
language:
- bo
- ug
- mn
- kk
---

# MC^2Llama-13B

[Github Repo](https://github.com/luciusssss/mc2_corpus)

We continually pretrain [llama_chinese_13b](https://huggingface.co/quzhe/llama_chinese_13B) on [MC^2](https://huggingface.co/datasets/pkupie/mc2_corpus), a corpus covering Tibetan, Uyghur, Kazakh in the Kazakh Arabic script, and Mongolian in the traditional Mongolian script. See the [paper](https://arxiv.org/abs/2311.08348) for details.

*We have also released another model trained on MC^2: [MC^2XLMR-large](https://huggingface.co/pkupie/mc2-xlmr-large).*

## Usage

The model and tokenizer can be loaded via:

```python
from transformers import LlamaForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("pkupie/mc2-llama-13b")
model = LlamaForCausalLM.from_pretrained("pkupie/mc2-llama-13b")
```

## Citation

```
@article{zhang2024mc,
  title={MC$^2$: Towards Transparent and Culturally-Aware NLP for Minority Languages in China},
  author={Zhang, Chen and Tao, Mingxu and Huang, Quzhe and Lin, Jiuheng and Chen, Zhibin and Feng, Yansong},
  journal={arXiv preprint arXiv:2311.08348},
  year={2024}
}
```
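
## Example Generation

Once loaded as above, the model can be used for plain causal-LM generation. Below is a minimal sketch: the Tibetan prompt and the `max_new_tokens` value are illustrative assumptions, not settings from the paper.

```python
# Minimal generation sketch. The prompt ("Tibet" in Tibetan) and
# max_new_tokens are arbitrary choices for illustration.
inputs = tokenizer("བོད་", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```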