Phi-4 Mini Instruct Q8_0 GGUF

Overview

This repository contains a Post-Training Quantized (PTQ) GGUF version of Microsoft's Phi-4 Mini Instruct model.

The original model was converted from Hugging Face Safetensors format to GGUF (F16) and subsequently quantized to Q8_0 using llama.cpp to enable efficient CPU-based inference while significantly reducing storage requirements.

Base Model Information

Item Value
Base Model microsoft/Phi-4-mini-instruct
Original Author Microsoft
Original License MIT
Original Format Safetensors
Quantized Format GGUF
Quantization Method Post-Training Quantization (PTQ)
Quantization Type Q8_0

Original model:

https://huggingface.co/microsoft/Phi-4-mini-instruct


Quantization Pipeline

The following workflow was used to create this model:

Phi-4 Mini Instruct (Safetensors)
                โ†“
      GGUF Conversion (F16)
                โ†“
 Post-Training Quantization (Q8_0)
                โ†“
      Optimized GGUF Model

Conversion Process

  1. Downloaded Phi-4 Mini Instruct from Hugging Face.
  2. Converted the original Safetensors weights to GGUF (F16) using llama.cpp.
  3. Generated an intermediate F16 GGUF model.
  4. Applied Q8_0 Post-Training Quantization.
  5. Verified model functionality using llama.cpp inference.
  6. Validated compatibility with local deployment frameworks.

Quantization Results

Metric Value
Original GGUF (F16) Size 7.15 GB
Quantized GGUF (Q8_0) Size 4.08 GB
Storage Reduction ~43%
GPU Required No
CPU Inference Supported Yes
Quantization Backend llama.cpp

Hardware Used

  • Intel Core i7-1165G7
  • Windows 11
  • CPU-only quantization workflow
  • No NVIDIA GPU required

Repository Contents

File Description
phi4-q8_0.gguf Quantized GGUF model
README.md Documentation and usage instructions
LICENSE Original MIT License from Microsoft

Using with llama.cpp

Run directly with llama.cpp:

llama-cli -m phi4-q8_0.gguf

Example:

llama-cli -m phi4-q8_0.gguf -p "Explain post-training quantization."

Using with Ollama

Create a file named:

Modelfile

Contents:

FROM ./phi4-q8_0.gguf

Create the model:

ollama create phi4-mini-q8 -f Modelfile

Run:

ollama run phi4-mini-q8

Using with Python (llama-cpp-python)

Install:

pip install llama-cpp-python

Example:

from llama_cpp import Llama

llm = Llama(
    model_path="phi4-q8_0.gguf",
    n_ctx=4096
)

response = llm(
    "Explain quantization.",
    max_tokens=200
)

print(response["choices"][0]["text"])

Intended Use

This model is suitable for:

  • Local LLM deployment
  • CPU-only inference
  • Educational and research purposes
  • Edge AI applications
  • Resource-constrained environments
  • GGUF-compatible inference engines

License

This repository contains a quantized conversion of Microsoft's Phi-4 Mini Instruct model.

The original model is distributed under the MIT License by Microsoft. The included LICENSE file is retained from the original model repository.

All rights, ownership, model architecture, training methodology, and intellectual property remain with Microsoft.

This repository only provides a GGUF conversion and Q8_0 post-training quantized version of the original model.


Acknowledgements

  • Microsoft for the Phi-4 Mini Instruct model.
  • llama.cpp for GGUF conversion and quantization tooling.
  • Hugging Face for model hosting and distribution.

Quantization Author

K VIGNESH

Performed:

  • GGUF conversion
  • Q8_0 Post-Training Quantization
  • Validation and testing
  • Local deployment verification
  • CPU inference benchmarking

using llama.cpp and open-source tooling.

Downloads last month
24
GGUF
Model size
4B params
Architecture
phi3
Hardware compatibility
Log In to add your hardware

8-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for kvignesh/phi4-mini-q8_0-gguf

Quantized
(154)
this model