whynlp committed on
Commit
bb24762
1 Parent(s): 9ffb4b7

Update README.md

Files changed (1)
  1. README.md +36 -0
README.md CHANGED
@@ -1,3 +1,39 @@
  ---
  license: mit
+ datasets:
+ - p208p2002/wudao
+ language:
+ - zh
  ---
+ # Chinese TinyLlama
+
+ A demo project that pretrains a TinyLlama on Chinese corpora with minimal modifications to the Hugging Face transformers code. It serves as a use case demonstrating how to use the Hugging Face version of [TinyLlama](https://github.com/whyNLP/tinyllama) to pretrain a model on a large corpus.
+
+ See the [GitHub repo](https://github.com/whyNLP/tinyllama-zh) for more details.
+
+ ## Usage
+
+ ```python
+ # Load model directly
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ tokenizer = AutoTokenizer.from_pretrained("whynlp/tinyllama-zh", trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained("whynlp/tinyllama-zh")
+ ```
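+
+ A minimal end-to-end generation sketch (the prompt and generation settings are illustrative assumptions, not part of the original commit):
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ # trust_remote_code is needed because the model ships a custom (ChatGLM3-based) tokenizer
+ tokenizer = AutoTokenizer.from_pretrained("whynlp/tinyllama-zh", trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained("whynlp/tinyllama-zh")
+
+ # Tokenize a short Chinese prompt and greedily decode a continuation
+ inputs = tokenizer("中国的首都是", return_tensors="pt")
+ outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```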
+
+ ## Model Details
+
+ ### Model Description
+
+ This model was trained on [WuDaoCorpora Text](https://www.scidb.cn/en/detail?dataSetId=c6a3fe684227415a9db8e21bac4a15ab). The dataset contains about 45B tokens, and the model was trained for 2 epochs (roughly 90B tokens in total). Training took about 6 days on 8 A100 GPUs.
+
+ The model uses the `THUDM/chatglm3-6b` tokenizer from Hugging Face.
+
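+ A quick, illustrative way to check the tokenizer claim (this comparison is a sketch, not from the original README):
+
+ ```python
+ from transformers import AutoTokenizer
+
+ # Both loads need trust_remote_code because the ChatGLM3 tokenizer ships custom code
+ zh_tok = AutoTokenizer.from_pretrained("whynlp/tinyllama-zh", trust_remote_code=True)
+ glm_tok = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)
+
+ text = "今天天气不错"
+ print(zh_tok.tokenize(text))
+ print(glm_tok.tokenize(text))  # expected to match the line above
+ ```
+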
+ - **Model type:** Llama
+ - **Language(s) (NLP):** Chinese
+ - **License:** MIT
+ - **Finetuned from model:** TinyLlama-2.5T checkpoint
+
+ ## Uses
+
+ The model does not perform very well: its CMMLU score is only slightly above 25, barely better than random guessing on the four-choice questions. For better performance, one may use a better corpus (e.g. [WanJuan](https://opendatalab.org.cn/OpenDataLab/WanJuan1_dot_0)). Again, this project only serves as a demonstration of how to pretrain a TinyLlama on a large corpus.