Phi-2 · SRD Q4_K_M
Standard GGUF — drop into any llama.cpp build, no custom kernels.
Base model: microsoft/phi-2
Quantization: SRD4 → Q4_K_M
File size: ~1.7 GB
What is SRD?
Standard Q4_K_M loses information systematically. Stochastic Residual Dithering (SRD) computes an INT8 residual (D8) before quantization. At load time the corrected weights are W ≈ W4 + D8 × S8, where W4 is the dequantized 4-bit weight, D8 is the INT8 residual, and S8 is the residual's scale factor. After load, inference speed is identical to vanilla Q4_K_M.
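The correction W ≈ W4 + D8 × S8 can be illustrated with a minimal pure-Python sketch on a single group of 64 weights (matching the `group_size=64` used in the build script). Everything here is an assumption made for illustration: the helper names `quantize_4bit` and `int8_residual` are hypothetical, the 4-bit step is a plain symmetric quantizer rather than llama.cpp's Q4_K_M, and rounding is deterministic, so the stochastic dithering part of SRD is not modeled.

```python
import random

def quantize_4bit(group):
    """Symmetric 4-bit quantization of one weight group; returns dequantized W4."""
    scale = max(abs(x) for x in group) / 7.0 or 1.0  # fall back to 1.0 for an all-zero group
    return [max(-8, min(7, round(x / scale))) * scale for x in group]

def int8_residual(group, w4):
    """Quantize the error W - W4 to INT8 codes (D8) with one shared scale (S8)."""
    err = [w - q for w, q in zip(group, w4)]
    s8 = max(abs(e) for e in err) / 127.0 or 1.0
    d8 = [max(-127, min(127, round(e / s8))) for e in err]
    return d8, s8

random.seed(0)
w = [random.gauss(0.0, 1.0) for _ in range(64)]      # one group of 64 toy weights
w4 = quantize_4bit(w)
d8, s8 = int8_residual(w, w4)
w_corrected = [q + d * s8 for q, d in zip(w4, d8)]   # W ≈ W4 + D8 × S8

err_q4 = max(abs(a - b) for a, b in zip(w, w4))
err_srd = max(abs(a - b) for a, b in zip(w, w_corrected))
print(f"max |W - W4|          = {err_q4:.6f}")
print(f"max |W - W_corrected| = {err_srd:.6f}")
```

Because the residual is itself quantized to 8 bits, the corrected weights are still approximate, but the remaining error is bounded by half an INT8 step of the residual rather than half a 4-bit step of the weight.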
SRD targets the reasoning layers (40–77% of depth, layers 12–24 in this model). Phi-2 was trained on textbook-quality data — its mid-depth reasoning cluster is where Q4 quantization noise has the most impact on multi-step correctness.
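As a quick check of the depth window, the sketch below maps the 40–77% range to layer indices. It assumes Phi-2's 32 transformer blocks and floor rounding; the function name `srd_layer_range` is hypothetical and the actual selection rule used by the pipeline is not stated here.

```python
def srd_layer_range(num_layers, start_frac=0.40, end_frac=0.77):
    """Map a fractional depth window to inclusive layer indices (floor rounding assumed)."""
    return int(num_layers * start_frac), int(num_layers * end_frac)

first, last = srd_layer_range(32)  # 32 blocks assumed for Phi-2
print(first, last)                 # 12 24
```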
Benchmark results
The model is to be evaluated on TruthfulQA MC1 and WikiText-2 perplexity. Results are pending: run the benchmark and open a Discussion to contribute numbers.
| Mode | TruthfulQA MC1 ↑ | Δ vs baseline | D8 overhead |
|---|---|---|---|
| Baseline Q4_K_M | TBD | — | 0 MB |
| Selective SRD (layers 12–24) | TBD | TBD | TBD MB |
| Full SRD (all layers) | TBD | TBD | TBD MB |
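
For anyone filling in the table, MC1 is scored as plain accuracy: each question has exactly one correct choice, and the model's pick is the choice whose completion it assigns the highest log-probability. The sketch below shows only that scoring step. The function name `mc1_accuracy` is hypothetical, and the log-probabilities are made-up toy values, not results for this model.

```python
def mc1_accuracy(examples):
    """examples: list of (choice_logprobs, correct_index) pairs, one per question."""
    hits = 0
    for logprobs, correct in examples:
        # The predicted choice is the one with the highest completion log-probability.
        best = max(range(len(logprobs)), key=logprobs.__getitem__)
        hits += int(best == correct)
    return hits / len(examples)

# Toy input only: three questions with invented per-choice log-probabilities.
toy = [
    ([-4.1, -2.3, -5.0], 1),  # picks index 1, correct
    ([-1.0, -3.2], 0),        # picks index 0, correct
    ([-2.0, -1.5, -3.0], 0),  # picks index 1, wrong
]
print(f"MC1 accuracy on toy data: {mc1_accuracy(toy):.3f}")
```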
Usage
```
llama-cli -m phi-2-srd-q4km.gguf \
  -p "Instruct: Explain the difference between a mutex and a semaphore.\nOutput:" \
  --n-predict 300
```
Phi-2 responds well to both the Instruct:/Output: format and the Human:/Assistant: format:
```
llama-cli -m phi-2-srd-q4km.gguf \
  -p "Human: Write a Python function to binary search a sorted list.\nAssistant:" \
  --n-predict 300
```
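The same two prompt formats can be used from Python. The following is a minimal sketch using llama-cpp-python, assuming the GGUF file has already been downloaded to the working directory. The helper names `build_prompt` and `generate` are hypothetical, and the stop strings are an assumption chosen to keep the model from starting a new turn.

```python
def build_prompt(text, style="instruct"):
    """Wrap a request in one of the two prompt formats described above."""
    if style == "instruct":
        return f"Instruct: {text}\nOutput:"
    if style == "chat":
        return f"Human: {text}\nAssistant:"
    raise ValueError(f"unknown style: {style!r}")

def generate(text, style="instruct", model_path="phi-2-srd-q4km.gguf", max_tokens=300):
    """Run one completion; requires `pip install llama-cpp-python` and a local GGUF file."""
    from llama_cpp import Llama  # imported lazily so build_prompt works without the package

    llm = Llama(model_path=model_path, verbose=False)
    out = llm(
        build_prompt(text, style),
        max_tokens=max_tokens,
        stop=["Instruct:", "Human:"],  # assumed stop strings, not from the model card
    )
    return out["choices"][0]["text"]

print(build_prompt("Explain the difference between a mutex and a semaphore."))
```

Calling `generate("Write a Python function to binary search a sorted list.", style="chat")` would then mirror the second command above.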
How it was built
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from research.quant.quantize_model import quantize_hf_model_inplace

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
quantize_hf_model_inplace(model, alpha=1.0, group_size=64)
# → converted to GGUF Q4_K_M via llama.cpp convert_hf_to_gguf.py
```
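Note: trust_remote_code=True is needed only on transformers releases older than 4.37, which load Phi-2's custom modeling code from the Hub. From 4.37 onward Phi-2 is supported natively and the flag can be dropped.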
Pipeline: orivael-dev/axiom — branch claude/srd-prototype-benchmark-JRtv1
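The GGUF conversion mentioned in the build script's final comment is not spelled out, so the sketch below only assembles a plausible pair of commands. It assumes the SRD-processed model was first saved with `save_pretrained` to a directory (the name `phi-2-srd-hf` is hypothetical), that llama.cpp is checked out and built under `llama.cpp/`, and that Q4_K_M is produced by a separate `llama-quantize` step after `convert_hf_to_gguf.py` writes an F16 file. The helper name `gguf_commands` is also hypothetical, and the author's actual invocation may differ.

```python
import shlex

def gguf_commands(hf_dir, out_prefix, llama_cpp_dir="llama.cpp"):
    """Return (convert, quantize) argument lists for a two-step GGUF export."""
    f16_path = f"{out_prefix}-f16.gguf"
    q4_path = f"{out_prefix}-q4km.gguf"
    convert = [
        "python", f"{llama_cpp_dir}/convert_hf_to_gguf.py", hf_dir,
        "--outfile", f16_path, "--outtype", "f16",
    ]
    quantize = [f"{llama_cpp_dir}/build/bin/llama-quantize", f16_path, q4_path, "Q4_K_M"]
    return convert, quantize

# Print the commands instead of running them, so the paths can be checked first.
for cmd in gguf_commands("phi-2-srd-hf", "phi-2-srd"):
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the paths match the local setup.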
Contribute results
Run llama-perplexity on WikiText-2 and open a Discussion with:
- Hardware (CPU / CUDA / Metal / ROCm)
- Perplexity score
- Tokens/sec
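
For reference when comparing numbers, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below shows that definition only. The function name `perplexity` is hypothetical and the input is a synthetic sanity check, not a WikiText-2 measurement.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean natural-log probability per token)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Sanity check: a model that gives every token probability 1/4 has perplexity 4.
uniform = [math.log(0.25)] * 8
print(f"{perplexity(uniform):.4f}")
```

Lower is better, and scores are only comparable when the tokenizer, context length, and test file are the same across runs.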