Intel/neural-chat-7b-v3-1

@@ -247,6 +247,67 @@ The model was submitted to the [LLM Leaderboard](https://huggingface.co/spaces/H
 | [Intel/neural-chat-7b-v3](https://huggingface.co/Intel/neural-chat-7b-v3) | **57.31** | 67.15 | 83.29 | 62.26  | 58.77 | 78.06 | 1.21 | 50.43 |
 | [Intel/neural-chat-7b-v3-1](https://huggingface.co/Intel/neural-chat-7b-v3-1) | **59.06** | 66.21 | 83.64 | 62.37  | 59.65 | 78.14 | 19.56 | 43.84 |
+## Testing Model Quantizability
+The following code block tests whether a PyTorch model is amenable to quantization.
+One caveat: the Intel Extension for PyTorch uses optimum ipex, which is pre-release and needs further testing.
+To install the dependencies, first install the Intel Extension for PyTorch and then pip install each of the following:
+- torch
+- optimum.intel
+- optimum[ipex]
+- transformers
+- intel-extension-for-transformers (imported by the script below)
+### Intel Extension for PyTorch method:
+In this case, we test whether neural-chat-7b-v3-1 can be quantized. The test also reports the change in model size.
+For example, when the base type is set to torch.bfloat16 and load_in_4bit=True is also specified, so that only the weights are quantized, the test produces output like the following:
+- **model_quantize_internal: model size  = 27625.02 MB**
+- **model_quantize_internal: quant size  =  4330.80 MB**
+Run this code from a Python script, for example ipex_test.py:
+```python
+import os
+
+import torch
+from transformers import AutoTokenizer
+from intel_extension_for_transformers.transformers import AutoModelForCausalLM
+
+model_name = "Intel/neural-chat-7b-v3-1"
+prompt = "Once upon a time, there existed a little girl,"
+
+tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+inputs = tokenizer(prompt, return_tensors="pt").input_ids
+
+typ = torch.bfloat16
+result = {typ: "failed"}
+outputs = None  # stays None if quantization or generation fails
+try:
+    # load_in_4bit=True quantizes the weights only
+    model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True, torch_dtype=typ)
+    outputs = model.generate(inputs, max_new_tokens=20)
+    result[typ] = f"passed, {os.stat(model.bin_file).st_size}"
+except Exception as exc:
+    result[typ] = f"failed: {exc}"
+
+print("\n\nResults of quantizing: ")
+# Look for the quantization size lines in the log written by `tee` (see the run command below)
+with open("output.log", "r") as fp:
+    for line in fp:
+        if "model_quantize_internal" in line:
+            print(line, end="")
+
+print("\n\nExecution results ")
+for k, v in result.items():
+    print(k, v)
+
+print("\n\nModel Output: ")
+if outputs is not None:
+    print(tokenizer.decode(outputs[0], skip_special_tokens=True).strip())
+```
+Run the code as follows from a bash terminal:
+```bash
+python ipex_test.py 2>&1 | tee output.log
+```
+The entire output is captured in output.log. The script then prints a summary:
+the quantization size lines, whether quantization passed or failed, and the model output for the given prompt.
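The two `model_quantize_internal` size lines can also be turned into a single compression ratio. The following is a minimal sketch, not part of the test script: it assumes the log lines keep exactly the format shown in the example output above, and `parse_sizes` is a hypothetical helper name introduced here for illustration.

```python
import re

# Assumed log format, taken from the example output in this section:
#   "model_quantize_internal: model size  = 27625.02 MB"
SIZE_RE = re.compile(
    r"model_quantize_internal:\s+(model|quant) size\s*=\s*([\d.]+)\s*MB"
)


def parse_sizes(lines):
    """Hypothetical helper: map 'model' and 'quant' to their sizes in MB."""
    sizes = {}
    for line in lines:
        match = SIZE_RE.search(line)
        if match:
            sizes[match.group(1)] = float(match.group(2))
    return sizes


# The two lines quoted in the example output above.
sample = [
    "model_quantize_internal: model size  = 27625.02 MB",
    "model_quantize_internal: quant size  =  4330.80 MB",
]
sizes = parse_sizes(sample)
ratio = sizes["model"] / sizes["quant"]
print(f"compression ratio: {ratio:.2f}x")  # compression ratio: 6.38x
```

To apply this to a real run, the lines of output.log could be passed to `parse_sizes` in place of `sample`.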
 ## Ethical Considerations and Limitations
 Neural-chat-7b-v3-1 can produce factually incorrect output, and should not be relied on to produce factually accurate information. Because of the limitations of the pretrained model and the finetuning datasets, it is possible that this model could generate lewd, biased or otherwise offensive outputs.