jay68 committed on
Commit e1f1553
1 Parent(s): e95fbbd

Update README.md

Files changed (1)
  1. README.md +130 -1
README.md CHANGED
@@ -1,3 +1,132 @@
  ---
- license: apache-2.0
+ license: gpl-3.0
+ tags:
+ - text2text-generation
+ pipeline_tag: text2text-generation
+ language:
+ - zh
+ - en
  ---
+
+ Considering LLaMA's license constraints, this model is for research and learning only.
+ Please strictly respect LLaMA's usage policy. We are not allowed to publish the LLaMA weights, even finetuned ones, but there is no problem publishing the difference: a patch that we suggest applying to the original files.
+ The encryption is a simple XOR between files, ensuring that only people who have access to the original weights (from completely legal sources, of course) can transform them into finetuned weights.
+ You can find the decryption code at https://github.com/LianjiaTech/BELLE/tree/main/models .
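+
+ As a rough illustration of the XOR idea (this is not the official decrypt.py; the script in the BELLE repository also handles file naming, sizes and integrity checks), note that XOR is its own inverse, so the published patch only becomes meaningful again when combined with the original bytes:
+
+ ```python
+ # Minimal sketch of XOR patching, for illustration only.
+ def xor_bytes(a: bytes, b: bytes) -> bytes:
+     # Byte-wise XOR; assumes equal-length inputs.
+     return bytes(x ^ y for x, y in zip(a, b))
+
+ original  = b"original LLaMA bytes...."   # stands in for the legally obtained weights
+ finetuned = b"finetuned BELLE bytes..."   # stands in for the finetuned weights
+
+ patch = xor_bytes(finetuned, original)           # safe to publish: meaningless on its own
+ assert xor_bytes(patch, original) == finetuned   # XOR with the original recovers the weights
+ ```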
+
+
+ # Model Card for Model ID
+
+ ## Welcome
+ If you find this model helpful, please *like* this model and star us on https://github.com/LianjiaTech/BELLE !
+
+ ## Model description
+ We release our base model described in the paper
+ [Towards Better Instruction Following Language Models for Chinese](https://github.com/LianjiaTech/BELLE/blob/main/docs/Towards%20Better%20Instruction%20Following%20Language%20Models%20for%20Chinese.pdf).
+
+ We extend the original LLaMA vocabulary for more efficient tokenization of Chinese.
+ This model is derived through the following steps:
+ 1. Train a tokenizer with a vocabulary of 50K tokens on 12M lines of Chinese text.
+ 2. Merge the trained vocabulary with the original LLaMA vocabulary, resulting in a new vocabulary of 79,458 tokens.
+ 3. Resize the word embeddings and further pretrain LLaMA on 3.4B Chinese words with all other parameters fixed (see the sketch below).
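+
+ Step 3 corresponds roughly to the following call (a sketch with placeholder paths, not the exact training code):
+
+ ```python
+ from transformers import LlamaForCausalLM, LlamaTokenizer
+
+ tokenizer = LlamaTokenizer.from_pretrained("/path/to_extended_tokenizer")   # merged 79,458-token vocabulary
+ model = LlamaForCausalLM.from_pretrained("/path/to_original_llama_7B_hf")
+ model.resize_token_embeddings(len(tokenizer))   # grow the embedding matrix to the new vocabulary size
+ ```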
+
+ We tested the extended and original tokenizers on 5,000 lines of Chinese text: the average number of tokens per line drops from 733 to 291.
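+
+ A minimal sketch of how such a comparison can be reproduced (the paths and sample file are placeholders, and it assumes the original LLaMA tokenizer is also available in Hugging Face format):
+
+ ```python
+ from transformers import LlamaTokenizer
+
+ original = LlamaTokenizer.from_pretrained("/path/to_original_llama_7B_hf")
+ extended = LlamaTokenizer.from_pretrained("/path/to_finetuned_model")
+
+ # zh_sample.txt stands in for the 5,000 lines of Chinese evaluation text.
+ with open("zh_sample.txt", encoding="utf-8") as f:
+     lines = [line.strip() for line in f if line.strip()]
+
+ for name, tok in [("original", original), ("extended", extended)]:
+     avg = sum(len(tok.encode(line)) for line in lines) / len(lines)
+     print(f"{name}: {avg:.1f} tokens per line on average")
+ ```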
+
+
+ ## Download, Convert & Check
+ 1. After cloning this model with git, the encrypted files should have the following MD5 checksums:
+ ```
+ md5sum ./*
+ 228a21b7bf927f7ffd44c16c88256684 ./config.json.fb090219f6fed69687ab8f9c902f7802cff8060b08007ca0e5af177a8f9613d5.enc
+ f9b33d359f17a437f6c24b4de6f2272e ./generation_config.json.fd7ff399e5568cc21a0a8414f43df88ef7c424995b9b97a90563165d2cf79efd.enc
+ 1c12c5bb95b1d191779ef160624a622a ./pytorch_model-00001-of-00002.bin.3b0666c50d7fd55d5116e788ec51aa96a34ba6816e86ffbee1dbe983bf511b4b.enc
+ 1a67804dbdfd2168ef30ec077b73e90d ./pytorch_model-00002-of-00002.bin.763b336a89ef37327716d9c097835720662da656bdc27afde27daec9d0873284.enc
+ 0d6db7f247a51589f3dd6d08dbfe64ce ./pytorch_model.bin.index.json.4f08b269e18619675bc3fd62f6efb3a8d59f9d54fa50f5625d0bba7adabaf90e.enc
+ 34696bfce7b27548cfc2410e2b55762e ./special_tokens_map.json.96bdbb8504d9967606e5f661ccc7cbbac44a3661af863a7a58614670a0ccab33.enc
+ 6014cf2235521f974c8d9fb69b6cf07e ./tokenizer_config.json.7078cc180b3d35e7ccd06b49ede4a7fef85f2572bda40c1fe2fc8f9ab25418d3.enc
+ 56724a79091f3d1877cca65c6412d646 ./tokenizer.model.0b716a618c9e7c45648f91d997431eba3b0ff111b17ce7b777280ed771a49f95.enc
+ ```
+
+ 2. Decrypt the files using the scripts in https://github.com/LianjiaTech/BELLE/tree/main/models
+
+ You can use the following command in Bash.
+ Please replace "/path/to_encrypted" with the path where you stored the encrypted files,
+ replace "/path/to_original_llama_7B" with the path where you stored the original LLaMA 7B weights,
+ and replace "/path/to_finetuned_model" with the path where you want to save the final finetuned model.
+
+ ```bash
+ mkdir /path/to_finetuned_model
+ for f in "/path/to_encrypted"/*; do
+   if [ -f "$f" ]; then
+     python3 decrypt.py "$f" "/path/to_original_llama_7B/consolidated.00.pth" "/path/to_finetuned_model/"
+   fi
+ done
+ ```
+
+ After executing the aforementioned command, you will obtain the following files.
+
+ ```
+ ./config.json
+ ./generation_config.json
+ ./pytorch_model-00001-of-00002.bin
+ ./pytorch_model-00002-of-00002.bin
+ ./pytorch_model.bin.index.json
+ ./special_tokens_map.json
+ ./tokenizer_config.json
+ ./tokenizer.model
+ ```
+
+ 3. Check md5sum
+
+ You can verify the integrity of these files by computing their MD5 checksums to confirm they were recovered completely.
+ Here are the MD5 checksums for the relevant files:
+ ```
+ md5sum ./*
+ df363050c4ded5c3136270cef715a7d1 ./config.json
+ 2917a1cafb895cf57e746cfd7696bfe5 ./generation_config.json
+ a88865ce42f45c0c88cd4f7f8ecd75ea ./pytorch_model-00001-of-00002.bin
+ ce23ee57ecc73a78b0117e38a68f8d84 ./pytorch_model-00002-of-00002.bin
+ e5385004e4876ea6b93d6126e845a82f ./pytorch_model.bin.index.json
+ 15f7a943faa91a794f38dd81a212cb01 ./special_tokens_map.json
+ 08f6f621dba90b2a23c6f9f7af974621 ./tokenizer_config.json
+ 6ffe559392973a92ea28032add2a8494 ./tokenizer.model
+ ```
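+
+ If you prefer to verify from Python, a small helper like the following does the same check (the expected hashes are copied from the table above; the model directory is a placeholder):
+
+ ```python
+ import hashlib
+ from pathlib import Path
+
+ EXPECTED = {
+     "config.json": "df363050c4ded5c3136270cef715a7d1",
+     "generation_config.json": "2917a1cafb895cf57e746cfd7696bfe5",
+     "pytorch_model-00001-of-00002.bin": "a88865ce42f45c0c88cd4f7f8ecd75ea",
+     "pytorch_model-00002-of-00002.bin": "ce23ee57ecc73a78b0117e38a68f8d84",
+     "pytorch_model.bin.index.json": "e5385004e4876ea6b93d6126e845a82f",
+     "special_tokens_map.json": "15f7a943faa91a794f38dd81a212cb01",
+     "tokenizer_config.json": "08f6f621dba90b2a23c6f9f7af974621",
+     "tokenizer.model": "6ffe559392973a92ea28032add2a8494",
+ }
+
+ def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
+     # Hash in chunks so the multi-GB checkpoint shards fit in memory.
+     h = hashlib.md5()
+     with path.open("rb") as f:
+         for block in iter(lambda: f.read(chunk_size), b""):
+             h.update(block)
+     return h.hexdigest()
+
+ model_dir = Path("/path/to_finetuned_model")
+ for name, expected in EXPECTED.items():
+     digest = file_md5(model_dir / name)
+     print(name, "OK" if digest == expected else f"MISMATCH ({digest})")
+ ```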
+
+ ## Use model
+ This model is a pre-trained language model and has not been instruction-tuned.
+ To obtain good instruction-following capabilities, please finetune it on your own instruction data.
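+
+ A minimal usage sketch, assuming the decrypted files were saved to /path/to_finetuned_model (replace with your own path). Because this is a base model, it continues text rather than following instructions:
+
+ ```python
+ import torch
+ from transformers import LlamaForCausalLM, LlamaTokenizer
+
+ model_dir = "/path/to_finetuned_model"
+ tokenizer = LlamaTokenizer.from_pretrained(model_dir)
+ model = LlamaForCausalLM.from_pretrained(
+     model_dir,
+     torch_dtype=torch.float16,
+     device_map="auto",   # requires the accelerate package
+ )
+
+ # "The weather is really nice today," used as a plain continuation prompt.
+ inputs = tokenizer("今天天气真好，", return_tensors="pt").to(model.device)
+ outputs = model.generate(**inputs, max_new_tokens=64)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```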
+
+
+ ## Limitations
+ A few issues remain in models trained on the current base model and data:
+
+ 1. The model may produce factual errors when asked to follow instructions involving facts.
+
+ 2. The model occasionally generates harmful responses, since it still struggles to identify potentially harmful instructions.
+
+ 3. Reasoning and coding abilities still need improvement.
+
+ Since the model still has these limitations, we require that developers use the open-sourced code, data, model and any other artifacts generated by this project for research purposes only. Commercial use and other potentially harmful use cases are not allowed.
+
+
+ ## Citation
+
+ Please cite our paper and GitHub repository when using our code, data or model.
+
+ ```
+ @misc{ji2023better,
+   title={Towards Better Instruction Following Language Models for Chinese: Investigating the Impact of Training Data and Evaluation},
+   author={Yunjie Ji and Yan Gong and Yong Deng and Yiping Peng and Qiang Niu and Baochang Ma and Xiangang Li},
+   year={2023},
+   eprint={2304.07854},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL}
+ }
+
+ @misc{BELLE,
+   author = {Yunjie Ji and Yong Deng and Yan Gong and Yiping Peng and Qiang Niu and Baochang Ma and Xiangang Li},
+   title = {BELLE: Be Everyone's Large Language model Engine},
+   year = {2023},
+   publisher = {GitHub},
+   journal = {GitHub repository},
+   howpublished = {\url{https://github.com/LianjiaTech/BELLE}},
+ }
+ ```