duyntnet
/

neural-chat-7b-v3-3-imatrix-GGUF

+---
+license: other
+language:
+- en
+pipeline_tag: text-generation
+inference: false
+tags:
+- transformers
+- gguf
+- imatrix
+- neural-chat-7b-v3-3
+---
+Quantizations of https://huggingface.co/Intel/neural-chat-7b-v3-3
+# From original readme
+## How To Use
+Context length for this model: 8192 tokens (same as https://huggingface.co/mistralai/Mistral-7B-v0.1)
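Since the prompt and the generated text share the same 8192-token window, it can help to cap the number of new tokens before calling a generation function. The following is a minimal sketch, not part of the original model card: `clamp_new_tokens` is a hypothetical helper name, and the token counts in the usage line are made-up illustrative values.

```python
CONTEXT_LENGTH = 8192  # context length stated above for this model


def clamp_new_tokens(prompt_tokens: int, requested_new_tokens: int,
                     context_length: int = CONTEXT_LENGTH) -> int:
    """Limit generation so that prompt plus completion fit in the context window."""
    if prompt_tokens >= context_length:
        raise ValueError("prompt alone fills or exceeds the context window")
    return min(requested_new_tokens, context_length - prompt_tokens)


# Illustrative values: an 8000-token prompt leaves room for 192 new tokens.
print(clamp_new_tokens(prompt_tokens=8000, requested_new_tokens=300))  # prints 192
```

The returned value could be passed as `max_new_tokens` (or the equivalent setting in your runtime) so that generation stays within the window.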
+### Reproduce the model
+The sample code for reproducing the model is available on GitHub: [GitHub sample code](https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/neural_chat/examples/finetuning/finetune_neuralchat_v3). The following commands build the Docker image and start a container for fine-tuning:
+```bash
+git clone https://github.com/intel/intel-extension-for-transformers.git
+cd intel-extension-for-transformers
+docker build --no-cache ./ --target hpu --build-arg REPO=https://github.com/intel/intel-extension-for-transformers.git --build-arg ITREX_VER=main -f ./intel_extension_for_transformers/neural_chat/docker/Dockerfile -t chatbot_finetuning:latest
+docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host chatbot_finetuning:latest
+# after entering docker container
+cd examples/finetuning/finetune_neuralchat_v3
+```
+The experiment fine-tunes the pretrained mistralai/Mistral-7B-v0.1 on the open-source dataset Open-Orca/SlimOrca.
+The script below uses DeepSpeed ZeRO-2 to launch training on 8 Gaudi2 cards. In `finetune_neuralchat_v3.py`, the defaults are `use_habana=True, use_lazy_mode=True, device="hpu"` for Gaudi2. To run on an NVIDIA GPU instead, set `use_habana=False, use_lazy_mode=False, device="auto"`.
+```bash
+deepspeed --include localhost:0,1,2,3,4,5,6,7 \
+    --master_port 29501 \
+    finetune_neuralchat_v3.py
+```
+Merge the LoRA weights:
+```bash
+python apply_lora.py \
+    --base-model-path mistralai/Mistral-7B-v0.1 \
+    --lora-model-path finetuned_model/ \
+    --output-path finetuned_model_lora
+```
+### Use the model
+### FP32 Inference with Transformers
+```python
+import transformers
+model_name = 'Intel/neural-chat-7b-v3-3'
+model = transformers.AutoModelForCausalLM.from_pretrained(model_name)
+tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
+def generate_response(system_input, user_input):
+    # Format the input using the provided template
+    prompt = f"### System:\n{system_input}\n### User:\n{user_input}\n### Assistant:\n"
+    # Tokenize and encode the prompt
+    inputs = tokenizer.encode(prompt, return_tensors="pt", add_special_tokens=False)
+    # Generate a response
+    outputs = model.generate(inputs, max_length=1000, num_return_sequences=1)
+    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
+    # Extract only the assistant's response
+    return response.split("### Assistant:\n")[-1]
+# Example usage
+system_input = "You are a math expert assistant. Your mission is to help users understand and solve various math problems. You should provide step-by-step solutions, explain reasonings and give the correct answer."
+user_input = "calculate 100 + 520 + 60"
+response = generate_response(system_input, user_input)
+print(response)
+# expected response
+"""
+To calculate the sum of 100, 520, and 60, we will follow these steps:
+1. Add the first two numbers: 100 + 520
+2. Add the result from step 1 to the third number: (100 + 520) + 60
+Step 1: Add 100 and 520
+100 + 520 = 620
+Step 2: Add the result from step 1 to the third number (60)
+(620) + 60 = 680
+So, the sum of 100, 520, and 60 is 680.
+"""
+```
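The prompt template used in `generate_response` can be exercised on its own, without loading the model. The following is a minimal sketch, not part of the original model card: `build_prompt` and `extract_assistant_reply` are hypothetical helper names, and the `decoded` string is a hand-written stand-in for real model output.

```python
ASSISTANT_MARKER = "### Assistant:\n"


def build_prompt(system_input: str, user_input: str) -> str:
    """Format a single-turn prompt using the template from the example above."""
    return f"### System:\n{system_input}\n### User:\n{user_input}\n{ASSISTANT_MARKER}"


def extract_assistant_reply(decoded: str) -> str:
    """Return the text that follows the last assistant marker."""
    return decoded.split(ASSISTANT_MARKER)[-1]


prompt = build_prompt("You are a math expert assistant.", "calculate 100 + 520 + 60")

# Stand-in for tokenizer.decode(...): the decoded output of generate() contains
# the prompt followed by the completion, so the reply is whatever comes after the marker.
decoded = prompt + "The sum is 680."
print(extract_assistant_reply(decoded))  # prints "The sum is 680."
```

Splitting on the last marker keeps only the newest assistant turn, which matters if the same template is later extended to multi-turn conversations.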
+### BF16 Inference with Intel Extension for Transformers and Intel Extension for PyTorch
+```python
+from transformers import AutoTokenizer, TextStreamer
+import torch
+from intel_extension_for_transformers.transformers import AutoModelForCausalLM
+import intel_extension_for_pytorch as ipex
+model_name = "Intel/neural-chat-7b-v3-3"
+prompt = "Once upon a time, there existed a little girl,"
+tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+inputs = tokenizer(prompt, return_tensors="pt").input_ids
+streamer = TextStreamer(tokenizer)
+model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
+model = ipex.optimize(model.eval(), dtype=torch.bfloat16, inplace=True, level="O1", auto_kernel_selection=True)
+outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
+```
+### INT4 Inference with Transformers and Intel Extension for Transformers
+```python
+from transformers import AutoTokenizer, TextStreamer
+from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig
+model_name = "Intel/neural-chat-7b-v3-3"
+# For INT8, set weight_dtype="int8" instead
+config = WeightOnlyQuantConfig(compute_dtype="bf16", weight_dtype="int4")
+prompt = "Once upon a time, there existed a little girl,"
+tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+inputs = tokenizer(prompt, return_tensors="pt").input_ids
+streamer = TextStreamer(tokenizer)
+model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=config)
+outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
+```