---
license: other
language:
- en
pipeline_tag: text-generation
inference: false
tags:
- transformers
- gguf
- imatrix
- neural-chat-7b-v3-3
---
Quantizations of https://huggingface.co/Intel/neural-chat-7b-v3-3
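
These are GGUF files (including imatrix variants), so any llama.cpp-based runtime should be able to load them. Below is a minimal sketch using llama-cpp-python; the file name `neural-chat-7b-v3-3-Q4_K_M.gguf` is only a hypothetical example, so substitute whichever quantization file you actually download from this repo.

```python
# Minimal sketch, assuming llama-cpp-python is installed (pip install llama-cpp-python)
# and that a GGUF file such as neural-chat-7b-v3-3-Q4_K_M.gguf (hypothetical example
# name) has been downloaded from this repo.
from llama_cpp import Llama

llm = Llama(
    model_path="neural-chat-7b-v3-3-Q4_K_M.gguf",  # replace with the file you downloaded
    n_ctx=8192,  # matches the 8192-token context length noted below
)

# Prompt template from the original readme (see the FP32 example further down)
prompt = (
    "### System:\nYou are a helpful assistant.\n"
    "### User:\nWhat is 100 + 520 + 60?\n"
    "### Assistant:\n"
)

output = llm(prompt, max_tokens=256, stop=["### User:"])
print(output["choices"][0]["text"])
```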


# From original readme

## How To Use

Context length for this model: 8192 tokens (same as https://huggingface.co/mistralai/Mistral-7B-v0.1)

### Reproduce the model
The sample code to reproduce the model is available here: [GitHub sample code](https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/neural_chat/examples/finetuning/finetune_neuralchat_v3). The following steps reproduce the environment for building the model:

```bash
git clone https://github.com/intel/intel-extension-for-transformers.git
cd intel-extension-for-transformers

docker build --no-cache ./ --target hpu --build-arg REPO=https://github.com/intel/intel-extension-for-transformers.git --build-arg ITREX_VER=main -f ./intel_extension_for_transformers/neural_chat/docker/Dockerfile -t chatbot_finetuning:latest

docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host chatbot_finetuning:latest

# after entering the docker container
cd examples/finetuning/finetune_neuralchat_v3
```

We select the latest pretrained mistralai/Mistral-7B-v0.1 and the open-source dataset Open-Orca/SlimOrca to conduct the experiment.

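The dataset can be inspected before launching the run; below is a small sketch using the `datasets` library. The `conversations` field name reflects SlimOrca's ShareGPT-style layout and is an assumption worth checking against the dataset card.

```python
# Sketch only: peek at a few Open-Orca/SlimOrca records before finetuning.
# Streaming avoids downloading the full dataset just to look at it.
from datasets import load_dataset

ds = load_dataset("Open-Orca/SlimOrca", split="train", streaming=True)

for i, example in enumerate(ds):
    # SlimOrca stores ShareGPT-style turns; the exact field name is an
    # assumption here and should be verified on the dataset card.
    print(example["conversations"])
    if i >= 2:
        break
```
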
The script below uses DeepSpeed ZeRO-2 to launch training on 8 Gaudi2 cards. In `finetune_neuralchat_v3.py`, the defaults are `use_habana=True, use_lazy_mode=True, device="hpu"` for Gaudi2. If you want to run on an NVIDIA GPU instead, set `use_habana=False, use_lazy_mode=False, device="auto"` (see the sketch after the launch command below).

```bash
deepspeed --include localhost:0,1,2,3,4,5,6,7 \
    --master_port 29501 \
    finetune_neuralchat_v3.py
```
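
For the NVIDIA GPU case mentioned above, the settings boil down to the following; this is only a sketch restating the prose, since the actual layout of `finetune_neuralchat_v3.py` may differ.

```python
# Hypothetical sketch of the settings described above, not the literal contents
# of finetune_neuralchat_v3.py. On Gaudi2 the defaults are the HPU values;
# switch to the GPU values when running on NVIDIA hardware.
use_habana = False      # True on Gaudi2 (HPU), False on NVIDIA GPU
use_lazy_mode = False   # HPU lazy-mode execution; disable off-HPU
device = "auto"         # "hpu" on Gaudi2; "auto" selects CUDA/CPU elsewhere
```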

Merge the LoRA weights:

```bash
python apply_lora.py \
    --base-model-path mistralai/Mistral-7B-v0.1 \
    --lora-model-path finetuned_model/ \
    --output-path finetuned_model_lora
```

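As a point of reference, a LoRA merge of this kind is typically performed with `peft`'s `merge_and_unload`; the sketch below is an assumption about what `apply_lora.py` does, not its actual source.

```python
# Sketch of a typical LoRA merge with peft; apply_lora.py above may differ in detail.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", torch_dtype=torch.bfloat16
)
merged = PeftModel.from_pretrained(base, "finetuned_model/")  # LoRA adapter directory
merged = merged.merge_and_unload()  # fold LoRA weights into the base weights

merged.save_pretrained("finetuned_model_lora")
AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1").save_pretrained("finetuned_model_lora")
```
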
### Use the model

### FP32 Inference with Transformers

```python
import transformers

model_name = 'Intel/neural-chat-7b-v3-3'
model = transformers.AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

def generate_response(system_input, user_input):
    # Format the input using the provided template
    prompt = f"### System:\n{system_input}\n### User:\n{user_input}\n### Assistant:\n"

    # Tokenize and encode the prompt
    inputs = tokenizer.encode(prompt, return_tensors="pt", add_special_tokens=False)

    # Generate a response
    outputs = model.generate(inputs, max_length=1000, num_return_sequences=1)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Extract only the assistant's response
    return response.split("### Assistant:\n")[-1]


# Example usage
system_input = "You are a math expert assistant. Your mission is to help users understand and solve various math problems. You should provide step-by-step solutions, explain reasonings and give the correct answer."
user_input = "calculate 100 + 520 + 60"
response = generate_response(system_input, user_input)
print(response)

# expected response
"""
To calculate the sum of 100, 520, and 60, we will follow these steps:

1. Add the first two numbers: 100 + 520
2. Add the result from step 1 to the third number: (100 + 520) + 60

Step 1: Add 100 and 520
100 + 520 = 620

Step 2: Add the result from step 1 to the third number (60)
(620) + 60 = 680

So, the sum of 100, 520, and 60 is 680.
"""
```

### BF16 Inference with Intel Extension for Transformers and Intel Extension for PyTorch
```python
from transformers import AutoTokenizer, TextStreamer
import torch
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
import intel_extension_for_pytorch as ipex

model_name = "Intel/neural-chat-7b-v3-3"
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model = ipex.optimize(model.eval(), dtype=torch.bfloat16, inplace=True, level="O1", auto_kernel_selection=True)

outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
```

### INT4 Inference with Transformers and Intel Extension for Transformers
```python
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig

model_name = "Intel/neural-chat-7b-v3-3"

# for int8, should set weight_dtype="int8"
config = WeightOnlyQuantConfig(compute_dtype="bf16", weight_dtype="int4")
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=config)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
```
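
Per the comment in the block above, INT8 weight-only quantization only requires changing `weight_dtype`; a minimal self-contained sketch:

```python
# Sketch: INT8 weight-only variant of the INT4 example above; only weight_dtype changes.
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig

model_name = "Intel/neural-chat-7b-v3-3"
config = WeightOnlyQuantConfig(compute_dtype="bf16", weight_dtype="int8")  # int8 instead of int4

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer("Once upon a time, there existed a little girl,", return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=config)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
```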