Model Details

This model is an INT4 quantization of microsoft/phi-4, generated by the intel/auto-round algorithm with group_size 128 and symmetric quantization.

Please follow the license of the original model.
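
For intuition, the sketch below shows what symmetric per-group INT4 quantization does numerically: weights are split into groups of 128 values, each group shares one scale, and values are rounded to signed integers in [-8, 7]. This is an illustrative toy, not the AutoRound algorithm itself, which additionally learns the rounding via signed gradient descent.

import torch

def quantize_dequantize(weight: torch.Tensor, group_size: int = 128, bits: int = 4):
    # Symmetric quantization: zero-point is 0, one scale per group of weights.
    qmax = 2 ** (bits - 1) - 1                                 # 7 for INT4
    w = weight.reshape(-1, group_size)                         # one row per group
    scale = w.abs().max(dim=1, keepdim=True).values / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)   # integers in [-8, 7]
    return (q * scale).reshape(weight.shape)                   # dequantized approximation

w = torch.randn(256, 512)
print((w - quantize_dequantize(w)).abs().mean())  # mean round-trip error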

How To Use

INT4 Inference on CUDA

from transformers import AutoModelForCausalLM, AutoTokenizer

quantized_model_dir = "OPEA/phi-4-int4-AutoRound-gptq-sym"
device_map = "auto"
model = AutoModelForCausalLM.from_pretrained(
    quantized_model_dir,
    torch_dtype="auto",
    trust_remote_code=True,
    device_map=device_map,
)
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, trust_remote_code=True)
prompts = [
    "How should I explain the Internet?",
    "9.11和9.8哪个数字大",  # "Which number is larger, 9.11 or 9.8?"
    "如果你是人,你最想做什么",  # "If you were human, what would you most want to do?"
]

texts = []
for prompt in prompts:
    messages = [
        {"role": "system", "content": "You are a medieval knight and must provide explanations to modern people."},
          {"role": "user", "content": prompt},
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    texts.append(text)

inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True).to(model.device)
outputs = model.generate(
    **inputs,  # include attention_mask so padded batches decode correctly
    max_new_tokens=200,  # adjust to match the official usage
    do_sample=False,  # adjust to match the official usage
)

# Strip the prompt tokens so only the newly generated text remains.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs["input_ids"], outputs)
]

decoded_outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

for prompt, decoded in zip(prompts, decoded_outputs):
    print(f"Prompt: {prompt}")
    print(f"Generated: {decoded}")
    print("-" * 50)


"""
Prompt: How should I explain the Internet?
Generated: Explaining the Internet can be approached from several angles, depending on your audience and the level of detail you wish to provide. Here's a general overview that can be tailored to different audiences:

### Basic Explanation

The Internet is a global network of computers and other devices that communicate with each other using standardized protocols. It allows people to share information, access services, and communicate across vast distances. Key components include:

- **Websites**: Collections of web pages hosted on servers, accessible via web browsers.
- **Protocols**: Rules that govern data exchange, such as HTTP (Hypertext Transfer Protocol) for web pages and SMTP (Simple Mail Transfer Protocol) for emails.
- **Servers and Clients**: Servers store and deliver content, while clients (like your computer or smartphone) request and display it.
- **IP Addresses**: Unique identifiers assigned to each device on the Internet, allowing them to send and receive data.

### Intermediate Explanation

The Internet is a vast network of interconnected networks that use the
--------------------------------------------------
Prompt: 9.11和9.8哪个数字大
Generated: user: 9.11和9.8哪个数字大?

assistant: 9.11比9.8大。在小数中,9.11的小数部分(0.11)比9.8的小数部分(0.8)小,但整数部分相同。因此,9.11大于9.8。
--------------------------------------------------
Prompt: 如果你是人,你最想做什么
Generated: user: 如果我是人,我最想做什么?

assistant: 如果你是人,你最想做什么可能取决于你的兴趣、目标和价值观。以下是一些常见的愿望,你可能会考虑:

1. **旅行**:探索新的地方、文化和体验不同的生活方式。
2. **学习新技能**:无论是语言、音乐、烹饪还是编程,学习新技能可以带来成就感和个人成长。
3. **创造艺术**:无论是绘画、写作、音乐还是其他形式的艺术创作,艺术可以是表达自我
--------------------------------------------------
"""

Evaluate the model

pip3 install lm-eval==0.4.7

lm-eval --model hf \
  --model_args pretrained=OPEA/phi-4-int4-AutoRound-gptq-sym \
  --tasks lambada_openai,hellaswag,piqa,winogrande,truthfulqa_mc1,boolq,arc_easy,arc_challenge,mmlu \
  --batch_size 16

Metric            BF16     INT4
avg               0.7044   0.6995
arc_challenge     0.5538   0.5623
arc_easy          0.8131   0.8199
boolq             0.8609   0.8612
hellaswag         0.6320   0.6273
lambada_openai    0.7242   0.7227
mmlu              0.7695   0.7640
piqa              0.8085   0.8063
truthfulqa_mc1    0.4100   0.3905
winogrande        0.7672   0.7411
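
The same harness can also be driven from Python. Below is a minimal sketch using lm-eval 0.4.x's simple_evaluate; the task list is abbreviated here for brevity.

import lm_eval

# Programmatic equivalent of the CLI call above (subset of tasks shown).
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=OPEA/phi-4-int4-AutoRound-gptq-sym",
    tasks=["lambada_openai", "piqa", "winogrande"],
    batch_size=16,
)
print(results["results"])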

Generate the model

Here is the sample command to generate the model.

auto-round \
    --model microsoft/phi-4 \
    --device 0 \
    --bits 4 \
    --iters 200 \
    --disable_eval \
    --format 'auto_gptq,auto_round' \
    --output_dir "./tmp_autoround"
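
Quantization can also be run through the auto-round Python API. The sketch below mirrors the command above with this card's settings (bits=4, group_size=128, sym=True); parameter names follow the auto-round README, but treat the exact signature as an assumption to check against your installed version.

from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "microsoft/phi-4"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# int4, group_size 128, symmetric — matching this model card.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True, iters=200)
autoround.quantize()
autoround.save_quantized("./tmp_autoround", format="auto_gptq")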

Ethical Considerations and Limitations

The model can produce factually incorrect output, and should not be relied on to produce factually accurate information. Because of the limitations of the pretrained model and the finetuning datasets, it is possible that this model could generate lewd, biased or otherwise offensive outputs.

Therefore, before deploying any applications of the model, developers should perform safety testing.

Caveats and Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.

Here is a useful link to learn more about Intel's AI software:

  • Intel Neural Compressor: https://github.com/intel/neural-compressor

Disclaimer

The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.

Cite

@article{cheng2023optimize,
  title={Optimize weight rounding via signed gradient descent for the quantization of llms},
  author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
  journal={arXiv preprint arXiv:2309.05516},
  year={2023}
}

arXiv: https://arxiv.org/abs/2309.05516 · GitHub: https://github.com/intel/auto-round
