Model Details
This is an INT4 version of microsoft/phi-4, quantized symmetrically with group_size 128 using the intel/auto-round algorithm.
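To make the terms above concrete, the sketch below shows what "INT4, group_size 128, symmetric" means for a flat list of weights: each run of 128 weights shares one scale, and values are rounded to signed 4-bit integers with no zero-point offset. It uses plain round-to-nearest in pure Python for illustration only; AutoRound itself tunes the rounding with signed gradient descent. The function names (`quantize_group_sym`, `quantize_sym`, `dequantize`) are hypothetical and not part of any library, and the scale convention (largest absolute value mapped to 7) is an assumption, one common choice among several.

```python
def quantize_group_sym(group, bits=4):
    """Round-to-nearest symmetric quantization of one group (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1      # 7 for 4 bits
    qmin = -(2 ** (bits - 1))       # -8 for 4 bits
    max_abs = max(abs(w) for w in group)
    # Assumed convention: the largest magnitude in the group maps to qmax.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [min(qmax, max(qmin, round(w / scale))) for w in group]
    return q, scale


def quantize_sym(weights, group_size=128, bits=4):
    """Split weights into groups and quantize each with its own scale."""
    return [
        quantize_group_sym(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]


def dequantize(groups):
    """Map the integer codes back to floats using each group's scale."""
    return [v * scale for q, scale in groups for v in q]


# Deterministic toy weights in [-0.5, 0.5]; not taken from the real model.
weights = [((i * 37) % 101 - 50) / 100 for i in range(256)]
groups = quantize_sym(weights, group_size=128, bits=4)
restored = dequantize(groups)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(f"groups: {len(groups)}, max reconstruction error: {max_err:.4f}")
```

Symmetric quantization stores only a scale per group, so the per-weight rounding error is bounded by half of that group's scale.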
Please follow the license of the original model.
How To Use
INT4 Inference on CUDA
from transformers import AutoModelForCausalLM, AutoTokenizer

quantized_model_dir = "OPEA/phi-4-int4-AutoRound-gptq-sym"
device_map = "auto"

model = AutoModelForCausalLM.from_pretrained(
    quantized_model_dir,
    torch_dtype="auto",
    trust_remote_code=True,
    device_map=device_map,
)
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, trust_remote_code=True)
# Decoder-only models should be left-padded for batched generation.
tokenizer.padding_side = "left"

prompts = [
    "How should I explain the Internet?",
    "9.11和9.8哪个数字大",  # "Which number is larger, 9.11 or 9.8?"
    "如果你是人,你最想做什么",  # "If you were a person, what would you most want to do?"
]

# Wrap each prompt in the chat template expected by the model.
texts = []
for prompt in prompts:
    messages = [
        {"role": "system", "content": "You are a medieval knight and must provide explanations to modern people."},
        {"role": "user", "content": prompt},
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    texts.append(text)

inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True).to(model.device)

outputs = model.generate(
    **inputs,  # passes attention_mask with input_ids so padding tokens are ignored
    max_new_tokens=200,  # change this to align with the official usage
    do_sample=False,  # change this to align with the official usage
)

# Drop the prompt tokens so only the newly generated tokens are decoded.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs["input_ids"], outputs)
]
decoded_outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

for i, prompt in enumerate(prompts):
    print(f"Prompt: {prompt}")
    print(f"Generated: {decoded_outputs[i]}")
    print("-" * 50)
"""
Prompt: How should I explain the Internet?
Generated: Explaining the Internet can be approached from several angles, depending on your audience and the level of detail you wish to provide. Here's a general overview that can be tailored to different audiences:
### Basic Explanation
The Internet is a global network of computers and other devices that communicate with each other using standardized protocols. It allows people to share information, access services, and communicate across vast distances. Key components include:
- **Websites**: Collections of web pages hosted on servers, accessible via web browsers.
- **Protocols**: Rules that govern data exchange, such as HTTP (Hypertext Transfer Protocol) for web pages and SMTP (Simple Mail Transfer Protocol) for emails.
- **Servers and Clients**: Servers store and deliver content, while clients (like your computer or smartphone) request and display it.
- **IP Addresses**: Unique identifiers assigned to each device on the Internet, allowing them to send and receive data.
### Intermediate Explanation
The Internet is a vast network of interconnected networks that use the
--------------------------------------------------
Prompt: 9.11和9.8哪个数字大
Generated: user: 9.11和9.8哪个数字大?
assistant: 9.11比9.8大。在小数中,9.11的小数部分(0.11)比9.8的小数部分(0.8)小,但整数部分相同。因此,9.11大于9.8。
--------------------------------------------------
Prompt: 如果你是人,你最想做什么
Generated: user: 如果我是人,我最想做什么?
assistant: 如果你是人,你最想做什么可能取决于你的兴趣、目标和价值观。以下是一些常见的愿望,你可能会考虑:
1. **旅行**:探索新的地方、文化和体验不同的生活方式。
2. **学习新技能**:无论是语言、音乐、烹饪还是编程,学习新技能可以带来成就感和个人成长。
3. **创造艺术**:无论是绘画、写作、音乐还是其他形式的艺术创作,艺术可以是表达自我
--------------------------------------------------
"""
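The slicing step in the script above relies on `generate` returning each prompt's token IDs followed by the newly generated ones, so dropping the first `len(input_ids)` entries leaves only the completion. A toy illustration with plain lists follows; the token IDs and the pad ID 0 are made up, and left padding is an assumption of this sketch.

```python
# Hypothetical padded prompt batch: every row has the same length after padding.
batch_input_ids = [
    [0, 0, 11, 12],    # short prompt, left-padded with the made-up pad ID 0
    [21, 22, 23, 24],  # longer prompt, no padding needed
]

# Stand-in for what generate() returns: each prompt followed by new tokens.
batch_output_ids = [
    [0, 0, 11, 12, 101, 102],
    [21, 22, 23, 24, 201, 202],
]

# Same slicing as in the script: keep only what comes after the prompt.
generated_only = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(batch_input_ids, batch_output_ids)
]
print(generated_only)  # [[101, 102], [201, 202]]
```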
Evaluate the model
pip3 install lm-eval==0.4.7
lm-eval --model hf --model_args pretrained=OPEA/phi-4-int4-AutoRound-gptq-sym --tasks lambada_openai,hellaswag,piqa,winogrande,truthfulqa_mc1,boolq,arc_easy,arc_challenge,mmlu --batch_size 16
| Metric | BF16 | INT4 |
|---|---|---|
| avg | 0.7044 | 0.6995 |
| arc_challenge | 0.5538 | 0.5623 |
| arc_easy | 0.8131 | 0.8199 |
| boolq | 0.8609 | 0.8612 |
| hellaswag | 0.632 | 0.6273 |
| lambada_openai | 0.7242 | 0.7227 |
| mmlu | 0.7695 | 0.764 |
| piqa | 0.8085 | 0.8063 |
| truthfulqa_mc1 | 0.41 | 0.3905 |
| winogrande | 0.7672 | 0.7411 |
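The `avg` row can be checked against the per-task scores. The short script below assumes `avg` is the unweighted mean of the nine tasks in the table; the card does not state this, but the numbers are consistent with it.

```python
# Per-task scores copied from the table above.
bf16 = {
    "arc_challenge": 0.5538, "arc_easy": 0.8131, "boolq": 0.8609,
    "hellaswag": 0.632, "lambada_openai": 0.7242, "mmlu": 0.7695,
    "piqa": 0.8085, "truthfulqa_mc1": 0.41, "winogrande": 0.7672,
}
int4 = {
    "arc_challenge": 0.5623, "arc_easy": 0.8199, "boolq": 0.8612,
    "hellaswag": 0.6273, "lambada_openai": 0.7227, "mmlu": 0.764,
    "piqa": 0.8063, "truthfulqa_mc1": 0.3905, "winogrande": 0.7411,
}


def mean(scores):
    """Unweighted mean over all tasks (assumed to be how avg was computed)."""
    return sum(scores.values()) / len(scores)


bf16_avg = mean(bf16)
int4_avg = mean(int4)
print(f"BF16 avg: {bf16_avg:.4f}")  # 0.7044
print(f"INT4 avg: {int4_avg:.4f}")  # 0.6995
print(f"INT4 relative to BF16: {int4_avg / bf16_avg:.2%}")
```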
Generate the model
Here is a sample command to generate the model.
auto-round \
--model microsoft/phi-4 \
--device 0 \
--bits 4 \
--iter 200 \
--disable_eval \
--format 'auto_gptq,auto_round' \
--output_dir "./tmp_autoround"
Ethical Considerations and Limitations
The model can produce factually incorrect output, and should not be relied on to produce factually accurate information. Because of the limitations of the pretrained model and the finetuning datasets, it is possible that this model could generate lewd, biased or otherwise offensive outputs.
Therefore, before deploying any applications of the model, developers should perform safety testing.
Caveats and Recommendations
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
Here is a useful resource to learn more about Intel's AI software:
- Intel Neural Compressor
Disclaimer
The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.
Cite
@article{cheng2023optimize,
  title={Optimize weight rounding via signed gradient descent for the quantization of llms},
  author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
  journal={arXiv preprint arXiv:2309.05516},
  year={2023}
}