Because of LLaMA's license constraints, this model is for research and learning only. Please strictly respect LLaMA's usage policy. We are not allowed to publish LLaMA weights, even finetuned ones, but we can publish the difference: a patch to apply to the original files. The encryption is a simple XOR between files, so only people who have access to the original weights (from completely legal sources, of course) can transform them into the finetuned weights. The decryption code is available at https://github.com/LianjiaTech/BELLE/tree/main/models .
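To illustrate why an XOR patch is only useful to someone who already holds the original file, here is a minimal sketch. It is not the project's decrypt.py; the function name `xor_bytes`, the toy byte strings, and the choice to repeat the key when it is shorter than the data are all assumptions made for illustration.

```python
from itertools import cycle


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR each byte of `data` with `key`, repeating the key if it is shorter.

    Repeating the key is an assumption of this sketch, not a documented
    property of the real script.
    """
    if not key:
        raise ValueError("key must not be empty")
    return bytes(d ^ k for d, k in zip(data, cycle(key)))


# Toy stand-ins for real weight files (hypothetical contents).
original = b"original weights"
finetuned = b"finetuned weights!"

# The patch is what could be published; it is meaningless without `original`.
patch = xor_bytes(finetuned, original)

# Anyone holding `original` can rebuild the finetuned bytes, because
# (a XOR b) XOR b == a.
recovered = xor_bytes(patch, original)
assert recovered == finetuned
```

Applying the same function twice with the same key returns the input, which is why one routine can serve for both directions.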
Model Card for BELLE-LLaMA-EXT-13B
Welcome
If you find this model helpful, please like this model and star us on https://github.com/LianjiaTech/BELLE !
Model description
This model comes from a two-phase training of the original LLaMA 13B.
- Extending the vocabulary with an additional 50K tokens specific to Chinese and further pretraining these word embeddings on a Chinese corpus.
- Full-parameter finetuning the model with 4M high-quality instruction-following examples.
Download, Convert & Check
- After you git clone this model, check the MD5 checksums of the encrypted files:
md5sum ./*
211b6252c73e638cb87e04edef1c91c6 config.json.7b4504868ddce248768954077a76ffe29a34c6cc2b4510426b4da77d1e9afb4c.enc
f9b33d359f17a437f6c24b4de6f2272e generation_config.json.fd7ff399e5568cc21a0a8414f43df88ef7c424995b9b97a90563165d2cf79efd.enc
07efffcfb738722f00c9b7ac81044bb9 pytorch_model-00001-of-00003.bin.1a523c0d01807d7fcde8d73537f09e346ff303a4769b8a6659114358621fc838.enc
fe66f8672c07e9e5bdfec4dd45e1e093 pytorch_model-00002-of-00003.bin.98e48fb6812bb87843c7276a85ed34124f67df5654d8cf0b6bb9302ecfe3a37f.enc
b3b4a0f1d6b399543d3d7ac50f9ce936 pytorch_model-00003-of-00003.bin.79921900f30a9ec501177fca2f593f90cb9f5ab235c05863cc4d384450cf3f6f.enc
7aef01bb265647be2a9acd1c7ea69bd8 pytorch_model.bin.index.json.af10ab40cc0368fba37018148447e3dcd9b72829a38e26c9eaf3eda3a7850b56.enc
34696bfce7b27548cfc2410e2b55762e special_tokens_map.json.96bdbb8504d9967606e5f661ccc7cbbac44a3661af863a7a58614670a0ccab33.enc
24e4f14cc3330576dcd1fd12760d35f3 tokenizer_config.json.2e333c3e1c77e7e9c6ceb573b02355deaf303ca8180bbac40f1d0405209ee457.enc
56724a79091f3d1877cca65c6412d646 tokenizer.model.0b716a618c9e7c45648f91d997431eba3b0ff111b17ce7b777280ed771a49f95.enc
- Decrypt the files using the scripts in https://github.com/LianjiaTech/BELLE/tree/main/models
You can use the following Bash command. Replace "/path/to_encrypted" with the directory holding the encrypted files, "/path/to_original_llama_7B" with the directory holding your original LLaMA weights, and "/path/to_finetuned_model" with the directory where the decrypted finetuned model should be saved.
mkdir /path/to_finetuned_model
for f in "/path/to_encrypted"/*; \
do if [ -f "$f" ]; then \
python3 decrypt.py "$f" "/path/to_original_llama_7B/consolidated.00.pth" "/path/to_finetuned_model/"; \
fi; \
done
After running the command above, you will obtain the following files.
./config.json
./generation_config.json
./pytorch_model-00001-of-00003.bin
./pytorch_model-00002-of-00003.bin
./pytorch_model-00003-of-00003.bin
./pytorch_model.bin.index.json
./special_tokens_map.json
./tokenizer_config.json
./tokenizer.model
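For readers who prefer to drive the decryption step from Python rather than Bash, the loop above can be sketched as follows. It assumes decrypt.py takes the same three positional arguments shown in the Bash command; the helper names `build_decrypt_commands` and `run_all` are hypothetical and not part of the BELLE repository.

```python
import subprocess
import sys
from pathlib import Path


def build_decrypt_commands(encrypted_dir, key_file, output_dir, script="decrypt.py"):
    """Build one decrypt command per regular file in `encrypted_dir`.

    Mirrors the Bash loop: decrypt.py <encrypted file> <key file> <output dir>/
    """
    out = str(output_dir).rstrip("/") + "/"  # the Bash example passes a trailing slash
    return [
        [sys.executable, script, str(path), str(key_file), out]
        for path in sorted(Path(encrypted_dir).iterdir())
        if path.is_file()  # skip sub-directories, like `[ -f "$f" ]`
    ]


def run_all(commands):
    """Run the commands in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

A typical call would create the output directory first, for example with `Path("/path/to_finetuned_model").mkdir(parents=True, exist_ok=True)`, and then pass the result of `build_decrypt_commands("/path/to_encrypted", "/path/to_original_llama_7B/consolidated.00.pth", "/path/to_finetuned_model")` to `run_all`.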
- Check md5sum
To confirm that the files were recovered completely, compare their MD5 checksums against the values below:
md5sum ./*
1e28fe60969b1d4dcc3f97586082c5e5 config.json
2917a1cafb895cf57e746cfd7696bfe5 generation_config.json
2a8deacda3e22be63fe854da92006203 pytorch_model-00001-of-00003.bin
1bab042c86403f440517c8ae958716ed pytorch_model-00002-of-00003.bin
6fbd17996033fb5ec0263cdb07131de7 pytorch_model-00003-of-00003.bin
5762c0c9a1ca9366500390d0d335b2b6 pytorch_model.bin.index.json
15f7a943faa91a794f38dd81a212cb01 special_tokens_map.json
b87fab00f218c984135af5a0db353f22 tokenizer_config.json
6ffe559392973a92ea28032add2a8494 tokenizer.model
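The comparison can also be automated. The following is a minimal sketch using Python's standard hashlib module; the helper names `md5_of_file` and `find_mismatches` are hypothetical, and the `EXPECTED` table contains only two of the entries listed above, so it would need to be extended with the rest.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(directory, expected):
    """Return the names that are missing or whose MD5 differs from `expected`."""
    bad = []
    for name, want in expected.items():
        path = Path(directory) / name
        if not path.is_file() or md5_of_file(path) != want:
            bad.append(name)
    return bad


# Two entries copied from the list above; add the remaining files as needed.
EXPECTED = {
    "config.json": "1e28fe60969b1d4dcc3f97586082c5e5",
    "tokenizer.model": "6ffe559392973a92ea28032add2a8494",
}
```

Calling `find_mismatches("/path/to_finetuned_model", EXPECTED)` returns an empty list when every listed file is present and matches.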
Use model
Please note that the input should be formatted as follows in both training and inference.
Human: {input} \n\nBelle:
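As a small illustration of the format above, the template can be wrapped in a helper so that every call produces the same layout. The name `build_prompt` is hypothetical, and the sketch assumes that `\n\n` in the template stands for two newline characters.

```python
# Template copied from the format above; "\n\n" is taken to mean two newlines.
PROMPT_TEMPLATE = "Human: {input} \n\nBelle:"


def build_prompt(user_input: str) -> str:
    """Wrap a user instruction in the Human/Belle format."""
    return PROMPT_TEMPLATE.format(input=user_input)


prompt = build_prompt("Hello")
```

Note that the loading example in this card appends a trailing space after `Belle:`; which variant matches training is not stated here, so it is worth verifying before relying on either.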
After you decrypt the files, BELLE-LLaMA-EXT-13B can be easily loaded with LlamaForCausalLM.
from transformers import LlamaForCausalLM, AutoTokenizer
import torch
ckpt = '/path/to_finetuned_model/'
device = torch.device('cuda')
model = LlamaForCausalLM.from_pretrained(ckpt, device_map='auto', low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(ckpt)
# The prompt means: "Write a Chinese song praising nature"
prompt = "Human: 写一首中文歌曲,赞美大自然 \n\nBelle: "
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
generate_ids = model.generate(
    input_ids, max_new_tokens=300, do_sample=True, top_k=30, top_p=0.85,
    temperature=0.5, repetition_penalty=1.2, eos_token_id=2, bos_token_id=1, pad_token_id=0,
)
output = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
response = output[len(prompt):]
print(response)
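Slicing the decoded text by `len(prompt)` works only when the decoded output begins with the prompt exactly as typed. A slightly more defensive variant is sketched below; the helper name `extract_response` and the fallback on the `Belle:` marker are assumptions for illustration, and unlike the example above it also strips surrounding whitespace.

```python
def extract_response(output: str, prompt: str, marker: str = "Belle:") -> str:
    """Return the model's reply from the decoded text.

    Prefer cutting off the exact prompt; if the decoded text does not start
    with it, fall back to everything after the first `marker`.
    """
    if output.startswith(prompt):
        return output[len(prompt):].strip()
    _, found, tail = output.partition(marker)
    return tail.strip() if found else output.strip()
```

With the variables from the example above, this would be called as `extract_response(output, prompt)`.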
Limitations
A few issues remain in the model trained on the current base model and data:
- The model might generate factual errors when asked to follow instructions related to facts.
- It occasionally generates harmful responses, since it still struggles to identify potentially harmful instructions.
- Its reasoning and coding abilities need improvement.
Since the model still has these limitations, we require that developers use the open-sourced code, data, model, and any other artifacts generated by this project for research purposes only. Commercial use and other potentially harmful uses are not allowed.
Citation
Please cite our paper and GitHub repository when using our code, data, or model.
@misc{ji2023better,
title={Towards Better Instruction Following Language Models for Chinese: Investigating the Impact of Training Data and Evaluation},
author={Yunjie Ji and Yan Gong and Yong Deng and Yiping Peng and Qiang Niu and Baochang Ma and Xiangang Li},
year={2023},
eprint={2304.07854},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@misc{BELLE,
author = {BELLEGroup},
title = {BELLE: Be Everyone's Large Language model Engine},
year = {2023},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/LianjiaTech/BELLE}},
}