For optimal performance, we refrain from fine-tuning the model's identity. Thus, inquiries such as "Who are you" or "Who developed you" may yield random responses that are not necessarily accurate.
Updates
- [July 25, 2024] We now introduce shenzhi-wang/Llama3.1-70B-Chinese-Chat! Compared to the original Meta-Llama-3.1-70B-Instruct model, our llama3.1-70B-Chinese-Chat model significantly reduces the issues of "Chinese questions with English answers" and the mixing of Chinese and English in responses. The training dataset contains >100K preference pairs, and the model shows significant improvements, especially in roleplay, function calling, and math capabilities!
- We provide the official q3_k_m, q4_k_m, q8_0, and f16 GGUF versions of Llama3.1-70B-Chinese-Chat at https://huggingface.co/shenzhi-wang/Llama3.1-70B-Chinese-Chat/tree/main/gguf!
- We provide the official ollama version of Llama3.1-70B-Chinese-Chat at https://ollama.com/wangshenzhi/llama3.1_70b_chinese_chat! Quick use: `ollama run wangshenzhi/llama3.1_70b_chinese_chat`.
Model Summary
llama3.1-70B-Chinese-Chat is an instruction-tuned language model for Chinese and English users, built upon the Meta-Llama-3.1-70B-Instruct model, with abilities such as roleplaying and tool use.
- Developers: Shenzhi Wang*, Yaowei Zheng*, Guoyin Wang (in.ai), Shiji Song, Gao Huang. (*: Equal Contribution)
- License: Llama-3.1 License
- Base Model: Meta-Llama-3.1-70B-Instruct
- Model Size: 70.6B
- Context length: 128K (as reported for the Meta-Llama-3.1-70B-Instruct model; untested for our Chinese model)
1. Introduction
This is the first model specifically fine-tuned for Chinese & English users based on the Meta-Llama-3.1-70B-Instruct model. The fine-tuning algorithm used is ORPO [1].
Compared to the original Meta-Llama-3.1-70B-Instruct model, our llama3.1-70B-Chinese-Chat model significantly reduces the issues of "Chinese questions with English answers" and the mixing of Chinese and English in responses.
[1] Hong, Jiwoo, Noah Lee, and James Thorne. "Reference-free Monolithic Preference Optimization with Odds Ratio." arXiv preprint arXiv:2403.07691 (2024).
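As a rough illustration of what the orpo beta in the training details weights: the ORPO objective in [1] adds an odds-ratio penalty to the supervised loss on the chosen response, L = L_SFT + λ · L_OR, where L_OR = -log σ(log(odds(y_w|x) / odds(y_l|x))) and odds(y|x) = P(y|x) / (1 - P(y|x)). The sketch below evaluates that formula for a single preference pair. It is a minimal sketch, not the LLaMA-Factory implementation; the function names are hypothetical, and treating the inputs as length-normalised sequence log-likelihoods is an assumption made for illustration.

```python
import math


def log_odds(logp: float) -> float:
    """Return log(p / (1 - p)) given logp = log p, with logp < 0."""
    return logp - math.log1p(-math.exp(logp))


def orpo_loss(logp_chosen: float, logp_rejected: float, beta: float = 0.05) -> float:
    """ORPO-style loss for one preference pair.

    logp_chosen / logp_rejected: assumed length-normalised log-likelihoods
    of the chosen and rejected responses (both strictly negative).
    beta: weight of the odds-ratio term (lambda in the ORPO paper).
    """
    # Supervised term: negative log-likelihood of the chosen response.
    l_sft = -logp_chosen
    # Log odds ratio between the chosen and the rejected response.
    log_odds_ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    # -log(sigmoid(x)) written as log(1 + exp(-x)).
    l_or = math.log1p(math.exp(-log_odds_ratio))
    return l_sft + beta * l_or


if __name__ == "__main__":
    # The chosen response is more likely than the rejected one ...
    print(orpo_loss(math.log(0.6), math.log(0.2)))
    # ... versus the reverse, which is penalised more heavily.
    print(orpo_loss(math.log(0.2), math.log(0.6)))
```

With beta at 0.05, the odds-ratio term acts as a small correction on top of the supervised loss rather than dominating it.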
Training framework: LLaMA-Factory.
Training details:
- epochs: 3
- learning rate: 1.5e-6
- learning rate scheduler type: cosine
- warmup ratio: 0.1
- cutoff len (i.e. context length): 8192
- orpo beta (i.e. $\lambda$ in the ORPO paper): 0.05
- global batch size: 128
- fine-tuning type: full parameters
- optimizer: paged_adamw_32bit
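To make the schedule-related settings above concrete, the sketch below derives step counts and a learning-rate curve from them. It assumes a linear warmup followed by cosine decay to zero, which is a common shape for a "cosine" scheduler with a warmup ratio; whether the training run used exactly this shape is an assumption. `NUM_PAIRS` is a hypothetical dataset size, since only ">100K preference pairs" is stated, and `lr_at_step` is an illustrative helper, not a library function.

```python
import math

PEAK_LR = 1.5e-6         # learning rate from the training details
WARMUP_RATIO = 0.1       # warmup ratio from the training details
GLOBAL_BATCH_SIZE = 128  # global batch size from the training details
EPOCHS = 3               # epochs from the training details
NUM_PAIRS = 100_000      # hypothetical: the card only states ">100K"


def lr_at_step(step: int, total_steps: int,
               peak_lr: float = PEAK_LR,
               warmup_ratio: float = WARMUP_RATIO) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


steps_per_epoch = math.ceil(NUM_PAIRS / GLOBAL_BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)

if __name__ == "__main__":
    print(f"steps per epoch: {steps_per_epoch}, total: {total_steps}, warmup: {warmup_steps}")
    for s in (0, warmup_steps, total_steps // 2, total_steps):
        print(s, lr_at_step(s, total_steps))
```

Under these assumptions the whole run is only a few thousand optimizer steps, and the learning rate reaches its peak about a tenth of the way through.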
2. Usage
2.1 Usage of Our BF16 Model
- Please upgrade the `transformers` package to ensure it supports Llama3.1 models. The current version we are using is `4.43.0`.
- Use the following Python script to download our BF16 model:
from huggingface_hub import snapshot_download
snapshot_download(repo_id="shenzhi-wang/Llama3.1-70B-Chinese-Chat", ignore_patterns=["*.gguf"]) # Download our BF16 model without downloading GGUF models.
- Inference with the BF16 model
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "/Your/Local/Path/to/Llama3.1-70B-Chinese-Chat"
dtype = torch.bfloat16

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype=dtype,
)

# Prompt: "Write a poem about machine learning."
chat = [
    {"role": "user", "content": "写一首关于机器学习的诗。"},
]
# Apply the chat template and append the assistant generation prompt.
input_ids = tokenizer.apply_chat_template(
    chat, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=8192,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
# Keep only the newly generated tokens (drop the prompt).
response = outputs[0][input_ids.shape[-1] :]
print(tokenizer.decode(response, skip_special_tokens=True))
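Before running the inference script, it may help to estimate how much memory the BF16 weights alone require. The sketch below is back-of-the-envelope arithmetic that assumes roughly 70 billion parameters (taken from the model name) and 2 bytes per BF16 parameter; it ignores activations and the KV cache. `weight_memory_gib` is an illustrative helper, not part of any library.

```python
def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


NUM_PARAMS = 70e9      # assumption: about 70B parameters, per the model name
BF16_BYTES = 2         # bfloat16 stores each parameter in 2 bytes

bf16_gib = weight_memory_gib(NUM_PARAMS, BF16_BYTES)

if __name__ == "__main__":
    print(f"BF16 weights: about {bf16_gib:.0f} GiB")
```

If the result exceeds the memory of a single GPU, one option is to pass `device_map="auto"` to `from_pretrained` (this requires the `accelerate` package), which spreads the layers across the visible devices instead of placing everything on one.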
2.2 Usage of Our GGUF Models
- Download our GGUF models from the gguf folder (https://huggingface.co/shenzhi-wang/Llama3.1-70B-Chinese-Chat/tree/main/gguf);
- Use the GGUF models with LM Studio;
- You can also follow the instructions at https://github.com/ggerganov/llama.cpp/tree/master#usage to use the GGUF models.
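To fetch only one quantization rather than the whole repository, the `allow_patterns` argument of `snapshot_download` can restrict the download to matching files. The sketch below is a minimal example under stated assumptions: that the GGUF files sit under a `gguf/` folder in the repository and that each file name contains its quantization tag in lowercase (for example `q4_k_m`). Neither the exact file names nor the helper names `gguf_patterns` and `download_gguf` come from the model card.

```python
from typing import List

REPO_ID = "shenzhi-wang/Llama3.1-70B-Chinese-Chat"


def gguf_patterns(quant: str) -> List[str]:
    """Glob patterns selecting files under gguf/ whose name contains `quant`.

    The folder layout and naming scheme are assumptions; check the
    repository's file list and adjust the pattern if they differ.
    """
    return [f"gguf/*{quant}*"]


def download_gguf(quant: str, local_dir: str = "./Llama3.1-70B-Chinese-Chat-gguf") -> str:
    """Download only the files matching one quantization tag; returns the local path."""
    # Imported lazily so the pattern helper can be used without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=gguf_patterns(quant),
        local_dir=local_dir,
    )


if __name__ == "__main__":
    # Downloads several tens of gigabytes; run only when you intend to.
    print(download_gguf("q4_k_m"))
```

The downloaded file can then be loaded in LM Studio or with llama.cpp as described above.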