--- license: other language: - en pipeline_tag: text-generation inference: false tags: - transformers - gguf - imatrix - neural-chat-7b-v3-3 --- Quantizations of https://huggingface.co/Intel/neural-chat-7b-v3-3 # From original readme ## How To Use Context length for this model: 8192 tokens (same as https://huggingface.co/mistralai/Mistral-7B-v0.1) ### Reproduce the model Here is the sample code to reproduce the model: [GitHub sample code](https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/neural_chat/examples/finetuning/finetune_neuralchat_v3). Here is the documentation to reproduce building the model: ```bash git clone https://github.com/intel/intel-extension-for-transformers.git cd intel-extension-for-transformers docker build --no-cache ./ --target hpu --build-arg REPO=https://github.com/intel/intel-extension-for-transformers.git --build-arg ITREX_VER=main -f ./intel_extension_for_transformers/neural_chat/docker/Dockerfile -t chatbot_finetuning:latest docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host chatbot_finetuning:latest # after entering docker container cd examples/finetuning/finetune_neuralchat_v3 ``` We select the latest pretrained mistralai/Mistral-7B-v0.1 and the open source dataset Open-Orca/SlimOrca to conduct the experiment. The below script use deepspeed zero2 to lanuch the training with 8 cards Gaudi2. In the `finetune_neuralchat_v3.py`, the default `use_habana=True, use_lazy_mode=True, device="hpu"` for Gaudi2. And if you want to run it on NVIDIA GPU, you can set them `use_habana=False, use_lazy_mode=False, device="auto"`. ```python deepspeed --include localhost:0,1,2,3,4,5,6,7 \ --master_port 29501 \ finetune_neuralchat_v3.py ``` Merge the LoRA weights: ```python python apply_lora.py \ --base-model-path mistralai/Mistral-7B-v0.1 \ --lora-model-path finetuned_model/ \ --output-path finetuned_model_lora ``` ### Use the model ### FP32 Inference with Transformers ```python import transformers model_name = 'Intel/neural-chat-7b-v3-3' model = transformers.AutoModelForCausalLM.from_pretrained(model_name) tokenizer = transformers.AutoTokenizer.from_pretrained(model_name) def generate_response(system_input, user_input): # Format the input using the provided template prompt = f"### System:\n{system_input}\n### User:\n{user_input}\n### Assistant:\n" # Tokenize and encode the prompt inputs = tokenizer.encode(prompt, return_tensors="pt", add_special_tokens=False) # Generate a response outputs = model.generate(inputs, max_length=1000, num_return_sequences=1) response = tokenizer.decode(outputs[0], skip_special_tokens=True) # Extract only the assistant's response return response.split("### Assistant:\n")[-1] # Example usage system_input = "You are a math expert assistant. Your mission is to help users understand and solve various math problems. You should provide step-by-step solutions, explain reasonings and give the correct answer." user_input = "calculate 100 + 520 + 60" response = generate_response(system_input, user_input) print(response) # expected response """ To calculate the sum of 100, 520, and 60, we will follow these steps: 1. Add the first two numbers: 100 + 520 2. Add the result from step 1 to the third number: (100 + 520) + 60 Step 1: Add 100 and 520 100 + 520 = 620 Step 2: Add the result from step 1 to the third number (60) (620) + 60 = 680 So, the sum of 100, 520, and 60 is 680. """ ``` ### BF16 Inference with Intel Extension for Transformers and Intel Extension for Pytorch ```python from transformers import AutoTokenizer, TextStreamer import torch from intel_extension_for_transformers.transformers import AutoModelForCausalLM import intel_extension_for_pytorch as ipex model_name = "Intel/neural-chat-7b-v3-3" prompt = "Once upon a time, there existed a little girl," tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True) inputs = tokenizer(prompt, return_tensors="pt").input_ids streamer = TextStreamer(tokenizer) model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16) model = ipex.optimize(model.eval(), dtype=torch.bfloat16, inplace=True, level="O1", auto_kernel_selection=True) outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300) ``` ### INT4 Inference with Transformers and Intel Extension for Transformers ```python from transformers import AutoTokenizer, TextStreamer from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig model_name = "Intel/neural-chat-7b-v3-3" # for int8, should set weight_dtype="int8" config = WeightOnlyQuantConfig(compute_dtype="bf16", weight_dtype="int4") prompt = "Once upon a time, there existed a little girl," tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True) inputs = tokenizer(prompt, return_tensors="pt").input_ids streamer = TextStreamer(tokenizer) model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=config) outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300) ```