---
license: gpl-3.0
tags:
- text2text-generation
pipeline_tag: text2text-generation
language:
- zh
- en
---

Because of LLaMA's license constraints, this model is for research and learning purposes only.
Please strictly respect LLaMA's usage policy. We are not allowed to publish LLaMA weights, even finetuned ones, but publishing the difference, a patch to be applied to the original files, is not a problem.
The encryption is a simple XOR between files, which ensures that only people who have access to the original weights (from completely legal sources, of course) can turn them into the finetuned weights.
You can find the decryption code at https://github.com/LianjiaTech/BELLE/tree/main/models .
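
To illustrate why XOR makes the original weights act as the key, here is a minimal sketch. It is not the project's `decrypt.py`; the function name `xor_bytes` and the choice to repeat a shorter key are assumptions made for this example only.

```python
from itertools import cycle


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR each byte of `data` with `key`, repeating the key if it is shorter.

    Repeating the key is an assumption of this sketch, not a statement
    about how the real decrypt.py handles files of different sizes.
    """
    if not key:
        raise ValueError("key must not be empty")
    return bytes(d ^ k for d, k in zip(data, cycle(key)))


# Toy stand-ins for the real files.
original = b"original weights"
finetuned = b"finetuned weights (toy stand-in)"

# Publishing `encrypted` reveals nothing useful without `original`.
encrypted = xor_bytes(finetuned, original)

# XOR is its own inverse: applying the same key again restores the data.
recovered = xor_bytes(encrypted, original)
assert recovered == finetuned
```

Because XOR is its own inverse, the same operation serves for both encryption and decryption.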


# Model Card for BELLE-LLAMA-7B-2M-enc

## Welcome
If you find this model helpful, please *like* this model and star us on https://github.com/LianjiaTech/BELLE !

## Update
A new checkpoint trained with a learning rate of 5e-6 has been uploaded.
In our evaluation, LLAMA trained with the smaller learning rate achieved better performance.

## Model description
BELLE-LLAMA-7B-2M-enc is based on LLAMA 7B and finetuned on 2M pieces of Chinese data combined with 50,000 pieces of English data from the open-source Stanford-Alpaca project, which gives it good Chinese instruction understanding and response generation capabilities.

The code for Chinese data generation and other details can be found in our GitHub project repository: https://github.com/LianjiaTech/BELLE.


## Training hyper-parameters
| Parameter | Value |
| ------ | ------ |
| Batch size | 16 |
| Learning rate | 5e-6 |
| Epochs | 3 |
| Weight decay | 0.0 |
| Warmup rate | 0.03 |
| LR scheduler | cosine |
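
As a rough illustration of how the learning rate, warmup rate, and cosine scheduler in the table interact, the sketch below computes the learning rate at a given step. It assumes a linear warmup over the first 3% of steps followed by cosine decay to zero; the function name `lr_at_step` and the value of `total_steps` are hypothetical, and the actual training code may differ.

```python
import math


def lr_at_step(step: int, total_steps: int,
               peak_lr: float = 5e-6, warmup_rate: float = 0.03) -> float:
    """Learning rate with linear warmup followed by cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_rate))
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Fraction of the post-warmup phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Half-cosine decay from the peak down to 0.
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical run length, chosen only for illustration.
total_steps = 1000
for step in (0, 15, 30, 500, 1000):
    print(step, lr_at_step(step, total_steps))
```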

## Download, Convert & Check
1. After you git clone this model, check the MD5 checksums of the encrypted files:
```
md5sum ./*
45afa71e3067de5119233a57ef9d093d  ./config.json.99a4ef2a26cb38c7f684cb83ed9343f660c561dd5a02a97d1b34b47419324dc5.enc
f9b33d359f17a437f6c24b4de6f2272e  ./generation_config.json.fd7ff399e5568cc21a0a8414f43df88ef7c424995b9b97a90563165d2cf79efd.enc
172013287b452114abf5c0e64936f45b  ./pytorch_model-00001-of-00002.bin.166879223b7504f1632d72b1577d57bceaa8fdeee1857c61119e575c50a4aae5.enc
384f8dc3b6da063c5f7554c52c531c44  ./pytorch_model-00002-of-00002.bin.2319db050dc286cb22c6e08a51a4ec0d9377017a7182a20a12c39eb658f39c80.enc
2ac1e5262eefd012918724d68813d03e  ./pytorch_model.bin.index.json.f56e69fedde5d28e4f37f2b62f74e8522bbfa13395a6d696d1ef99222a431ab7.enc
c066b68b4139328e87a694020fc3a6c3  ./special_tokens_map.json.ca3d163bab055381827226140568f3bef7eaac187cebd76878e0b63e9e442356.enc
2d5d4156fd237fceae85f28d06751020  ./tokenizer_config.json.a672113277a674d753b5cdcfa6bfc860dc69bfcc5511bdccb0c6af3ed08873a0.enc
39ec1b33fbf9a0934a8ae0f9a24c7163  ./tokenizer.model.9e556afd44213b6bd1be2b850ebbbd98f5481437a8021afaf58ee7fb1818d347.enc
```

2. Decrypt the files using the scripts in https://github.com/LianjiaTech/BELLE/tree/main/models

You can use the following Bash commands.
Replace "/path/to_encrypted" with the directory that holds the encrypted files,
"/path/to_original_llama_7B" with the directory that holds your original LLAMA 7B files,
and "/path/to_finetuned_model" with the directory where the decrypted finetuned model should be saved.

```bash
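# Create the output directory (no error if it already exists).
mkdir -p /path/to_finetuned_model
# Decrypt every regular file in the encrypted directory against the original LLAMA checkpoint.
for f in "/path/to_encrypted"/*; do
    if [ -f "$f" ]; then
        python3 decrypt.py "$f" "/path/to_original_llama_7B/consolidated.00.pth" "/path/to_finetuned_model/"
    fi
done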
```

After running the commands above, you will obtain the following files.

```
./config.json
./generation_config.json
./pytorch_model-00001-of-00002.bin
./pytorch_model-00002-of-00002.bin
./pytorch_model.bin.index.json
./special_tokens_map.json
./tokenizer_config.json
./tokenizer.model
```
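
The encrypted file names listed in step 1 appear to follow the pattern `<original name>.<64 hex characters>.enc`. Assuming that pattern holds, the sketch below maps an encrypted file name to the name you should expect after decryption. The helper `decrypted_name` is hypothetical and is not part of `decrypt.py`.

```python
import re

# Assumed pattern: <original name>.<64 lowercase hex characters>.enc
ENC_NAME = re.compile(r"^(?P<name>.+)\.(?P<digest>[0-9a-f]{64})\.enc$")


def decrypted_name(encrypted_name: str) -> str:
    """Return the original file name embedded in an encrypted file name."""
    match = ENC_NAME.match(encrypted_name)
    if match is None:
        raise ValueError(f"unexpected file name: {encrypted_name!r}")
    return match.group("name")


example = (
    "config.json."
    "99a4ef2a26cb38c7f684cb83ed9343f660c561dd5a02a97d1b34b47419324dc5.enc"
)
print(decrypted_name(example))
```

Running a check like this over the encrypted directory is a quick way to confirm that no expected output file is missing after decryption.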

3. Check md5sum

You can confirm that the files were recovered completely by comparing their MD5 checksums with the values below:
```
md5sum ./*
a57bf2d0d7ec2590740bc4175262610b  ./config.json
2917a1cafb895cf57e746cfd7696bfe5  ./generation_config.json
252143e5ed0f0073dc5c04159a0f78c2  ./pytorch_model-00001-of-00002.bin
3f71478bd783685f0a45fc742af85042  ./pytorch_model-00002-of-00002.bin
d5230ae5fb3bfd12df98af123be53cf5  ./pytorch_model.bin.index.json
8a80554c91d9fca8acb82f023de02f11  ./special_tokens_map.json
414f52220807d1300ad700283141de69  ./tokenizer_config.json
eeec4125e9c7560836b4873b6f8e3025  ./tokenizer.model
```
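
If `md5sum` is not available on your system, the same check can be sketched in Python with the standard `hashlib` module. The helper names `md5_of_file` and `find_mismatches` are hypothetical, and `EXPECTED_MD5` copies only two entries from the list above for brevity; extend it with the remaining files as needed.

```python
import hashlib
from pathlib import Path

# Two of the checksums listed above; add the remaining files as needed.
EXPECTED_MD5 = {
    "config.json": "a57bf2d0d7ec2590740bc4175262610b",
    "tokenizer.model": "eeec4125e9c7560836b4873b6f8e3025",
}


def md5_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(model_dir, expected: dict) -> list:
    """Return the names of files that are missing or have a different MD5."""
    mismatches = []
    for name, want in expected.items():
        path = Path(model_dir) / name
        if not path.is_file() or md5_of_file(path) != want:
            mismatches.append(name)
    return mismatches
```

For example, `find_mismatches("/path/to_finetuned_model", EXPECTED_MD5)` returns an empty list when every listed file matches.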

## Use model
Please note that the input should be formatted as follows for both **training** and **inference**.
```
Human: {input} \n\nAssistant:
```
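
A small helper can keep the prompt format consistent between training and inference. The sketch below is illustrative only: `PROMPT_TEMPLATE` copies the format shown above, while the function names `build_prompt` and `extract_response` are hypothetical and not part of the BELLE code base.

```python
# The template mirrors the format above, including the space before "\n\n".
PROMPT_TEMPLATE = "Human: {input} \n\nAssistant:"


def build_prompt(user_input: str) -> str:
    """Wrap a user instruction in the Human/Assistant prompt format."""
    return PROMPT_TEMPLATE.format(input=user_input.strip())


def extract_response(output: str, prompt: str) -> str:
    """Remove the echoed prompt from decoded model output, if present."""
    if output.startswith(prompt):
        output = output[len(prompt):]
    return output.strip()


prompt = build_prompt("Write a short poem about spring.")
print(repr(prompt))
```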

To load BELLE-LLAMA-7B-2M-enc with Hugging Face transformers, please install the main branch, as the latest stable release did not support LLAMA as of March 26, 2023.
```bash
pip install git+https://github.com/huggingface/transformers
```

After you decrypt the files, BELLE-LLAMA-7B-2M can be easily loaded with LlamaForCausalLM.
``` python
from transformers import LlamaForCausalLM, AutoTokenizer
import torch

ckpt = '/path/to_finetuned_model/'
device = torch.device('cuda')
# device_map='auto' (requires the accelerate package) places the weights automatically.
model = LlamaForCausalLM.from_pretrained(ckpt, device_map='auto', low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(ckpt)
# The prompt asks, in Chinese: "Write a Chinese song praising nature."
prompt = "Human: 写一首中文歌曲,赞美大自然 \n\nAssistant: "
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
generate_ids = model.generate(
    input_ids, max_new_tokens=500, do_sample=True, top_k=30, top_p=0.85, temperature=0.5,
    repetition_penalty=1.0, eos_token_id=2, bos_token_id=1, pad_token_id=0,
)
output = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
# Drop the echoed prompt and keep only the generated reply.
response = output[len(prompt):]
```

## Limitations
A few issues remain in the model trained on the current base model and data:

1. The model might generate factual errors when asked to follow instructions related to facts.

2. The model occasionally generates harmful responses, since it still struggles to identify potentially harmful instructions.

3. Its reasoning and coding abilities need improvement.

Since the model still has these limitations, we require that developers use the open-sourced code, data, model, and any other artifacts generated by this project for research purposes only. Commercial use and other potentially harmful uses are not allowed.


## Citation

Please cite us when using our code, data or model.

```
@misc{BELLE,
  author = {Yunjie Ji and Yong Deng and Yan Gong and Yiping Peng and Qiang Niu and Baochang Ma and Xiangang Li},
  title = {BELLE: Be Everyone's Large Language model Engine},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/LianjiaTech/BELLE}},
}
```