Phi-4-reasoning-plus · GGUF F16
Converted and evaluated by PBH Applied Systems, LLC: Applied AI/ML Consulting · LLM Optimization & Deployment · Quantized AI Infrastructure
This repository is part of a production-oriented evaluation series. Every model published under pbhappliedsystems has been independently evaluated using quant_eval v7.21, a proprietary behavioral evaluation harness developed by PBH Applied Systems. Scores measure real agent-adjacent task performance across structured output, tool dispatch, multi-turn state retention, and multi-step planning families, not perplexity or benchmark leaderboard proxies.
This is the full-precision F16 baseline repository. The Q4_K_M evaluated deployment variant is published at pbhappliedsystems/phi-4-reasoning-plus-gguf-Q4-K-M.
Critical context before reading this card. The F16 evaluation results documented here reflect a runner/pipeline compatibility issue specific to the full_weight_transformers (HuggingFace Transformers) runner and Phi-4-reasoning-plus's <think> block output format. The results do not represent the model's underlying capabilities. They represent what happens when the evaluation pipeline does not correctly handle the model's chain-of-thought output format, and they serve as a concrete case study in why pipeline compatibility must be verified before deployment, not assumed.
Try the Live AI Agent Demo
Launch the PBH Applied Systems AI Agent Demo
This model is part of the PBH Applied Systems evaluated model series that supports the live AI Agent Demo. The demo lets visitors interact with production-style agent workflows powered by open-weight language models evaluated through PBH Applied Systems' quant_eval framework.
The F16 model serves a different role than the Q4_K_M deployment variant. F16 is the full-precision baseline used to measure what the model can do before quantization. quant_eval then compares the quantized model against this baseline to identify which capabilities are preserved, which degrade, and which tasks require guardrails or a higher-precision deployment.
This comparison is central to the demo. It helps determine which model belongs in which agent role:
- Reasoning models are selected for planning, analysis, and auditable decision workflows.
- Document models are selected for long-context extraction, summarization, and structured Q&A.
- Code models are selected for task completion, structured output, API scaffolding, and automation workflows.
- Quantized variants are selected when they preserve enough behavior to reduce cost, latency, and GPU requirements.
- F16 variants remain important when maximum fidelity, cleaner tool execution, or reduced quantization risk matters more than speed or cost.
The live demo shows the deployment side of that process. The F16 card documents the reference behavior. The Q4_K_M card shows what changes after compression. Together, they explain how PBH Applied Systems uses quant_eval to choose the correct LLM for the correct agent type instead of guessing from model size or leaderboard reputation.
Model Description
This repository contains the full-precision F16 GGUF of microsoft/Phi-4-reasoning-plus, a 14-billion parameter reasoning-tuned model from Microsoft. Phi-4-reasoning-plus generates extended chain-of-thought reasoning traces enclosed in <think>...</think> blocks before emitting its final response.
In the PBH Applied Systems evaluation pipeline, this F16 run (20260222_023834) operated in cache-generation mode (skip_quant=true), producing the full_weight_cache.json intended as the reference baseline for the Q4_K_M comparison run. However, as documented below, the F16 evaluation results were substantially affected by a pipeline compatibility issue: the HuggingFace Transformers runner did not correctly strip <think> block tokens before extraction, causing widespread evaluation failures that do not reflect the model's actual output capabilities.
The F16 GGUF itself was correctly produced. The hardware requirements and artifact provenance sections below are accurate. The evaluation results require the interpretive context provided in this card.
Key Characteristics
- Parameters: 14B
- Architecture: Reasoning (extended chain-of-thought, <think> block format)
- Format: GGUF F16 (full precision)
- File size: 29.3 GB
- SHA256: 6491352a2d3d756fdd4b1538f188bafafc8e940658f1771308ffdaeddd86a385
- Minimum VRAM (GPU inference): ~32 GB
- Recommended GPU tier: A100 40 GB · 2× A10G · RTX 4090 (with partial offload)
- Context window: 16,384 tokens
- Observed inference time (eval hardware): avg 234.39 sec/case on RTX 4090
- License: MIT
Why 234.39 sec/case? At full F16 precision, Phi-4-reasoning-plus generates extensive <think> reasoning traces before each response. The fuzz family averaged 309.84 sec/case with individual cases reaching 320 seconds. Stateful followup cases ran ~278 seconds each. JSON shelf placement cases averaged 319 seconds. This is the chain-of-thought cost at full precision: in the slowest families the model reasons for 4 to 5 minutes before producing output. For reference, the Q4_K_M variant averages 25.84 sec/case.
PBH Applied Systems Evaluation: quant_eval v7.21
Evaluation conducted by PBH Applied Systems, LLC using quant_eval v7.21
Run ID: 20260222_023834 · Fixtures: golden_oracle_fixtures_v7_21 (SHA256: 6d71a0b9147c...) · Seed: 42
Hardware: NVIDIA RTX 4090 · Runner: full_weight_transformers (F16 only) · Total rows: 42
Per-Family Results: F16 (full_weight_transformers)
| Family | N | Pass Rate | Avg Secs | Bucket Score | Notes |
|---|---|---|---|---|---|
| json_multistep | 5 | 0.000 | 178.05 | 0.000 | All 4 gating signals fail on all 5 cases |
| stateful_followup | 2 | 0.000 | 278.11 | 0.000 | Both turns fail to parse |
| toolcall_only | 2 | 0.000 | 136.65 | 0.000 | No JSON object produced |
| mixed_brief_json | 2 | 0.000 | 140.38 | 1.000 | JSON valid; ANSWER line missing |
| toolcall | 2 | 0.000 | 160.32 | 0.000 | Stage-1 schema fails on both |
| json | 4 | n/a | 319.12 | 0.000 | tool_parse_fail on all 4 |
| fuzz | 20 | n/a | 309.84 | 1.500 | 3/20 pass (fuzz_0007, _0016, _0017) |
| mcq | 5 | n/a | 10.00 | 0.000 | Empty raw output on all 5 |
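The headline figures can be cross-checked against the per-family rows. The sketch below recomputes the total row count and the case-weighted average latency from the table. The fuzz bucket-score line rests on an assumption that is not stated in this card: that a family's bucket score is the mean of per-case scores, with a passing case scoring 10 and a failing case scoring 0.

```python
# Per-family (N, average seconds) pairs copied from the table above.
families = {
    "json_multistep": (5, 178.05),
    "stateful_followup": (2, 278.11),
    "toolcall_only": (2, 136.65),
    "mixed_brief_json": (2, 140.38),
    "toolcall": (2, 160.32),
    "json": (4, 319.12),
    "fuzz": (20, 309.84),
    "mcq": (5, 10.00),
}

total_cases = sum(n for n, _ in families.values())
total_seconds = sum(n * secs for n, secs in families.values())
avg_secs_per_case = total_seconds / total_cases

# Assumption (not documented here): bucket score = mean per-case score,
# where each of the 3 passing fuzz cases scores 10 and the rest score 0.
fuzz_bucket_score = (3 * 10) / 20

print(total_cases)                  # 42
print(round(avg_secs_per_case, 2))  # 234.39
print(fuzz_bucket_score)            # 1.5
```

Both the 42-row total and the 234.39 sec/case average reported above follow directly from the per-family rows.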
What These Results Actually Mean: Pipeline Compatibility Finding
The <think> Block Problem
Phi-4-reasoning-plus wraps its chain-of-thought reasoning in <think>...</think> tags before emitting its final answer. Every single case in this F16 evaluation run begins with assistant<think> in the raw output, meaning the HuggingFace Transformers runner included the think-block content in the response text that was passed to the evaluator, rather than stripping it at the pipeline level.
The evaluator receives output like:
assistant<think>We are Phi. The question: "Return JSON only. Task: For each arriving
item, choose a shelf A, B, C, or STOP..." [200+ lines of deliberation] ...So the answer
should be A. But wait, let me reconsider...
</think>
{"tool_name": "place_item", "args": {"choice": "A"}}
The extraction logic, which looks for a JSON object matching the task schema, either fails to locate the valid output buried after the <think> block, or the model exhausts its generation budget during the think phase and never emits the final answer.
This is confirmed by the per-family evidence:
json family: All 4 cases show detail=tool_parse_fail step=1 no_json_object, with raw output showing 315 to 322 seconds of assistant<think> content. The model reasoned for over 5 minutes per case and produced no extractable JSON.
json_multistep: All 5 cases show the same pattern: assistant<think> followed by extensive deliberation, no valid schema output, and all four gating signals failing simultaneously on every case.
MCQ: All 5 cases produce invalid_choice raw='' (empty raw output) after 9 to 10 seconds. The model generates a brief think trace and then terminates with no visible text. The think content is consumed but nothing is emitted after it.
stateful_followup: Both cases run 278 seconds and produce mismatch failures. The model reasons extensively about the task but does not emit the expected JSON state updates in a form the evaluator can parse.
mixed_brief_json: Both cases show json_parse_ok=1, schema_ok=1 (the JSON block is valid and schema-correct) but answer_line_ok=0, because the required ANSWER: <integer> line either doesn't appear or appears inside the think block rather than in the response body.
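The mixed_brief_json finding depends on where the ANSWER line appears: inside the think block it should not count, in the response body it should. The following is a minimal sketch of that distinction. quant_eval's actual extraction logic is proprietary and not shown in this card, so the function name `find_answer_line` and both regular expressions are hypothetical.

```python
import re

# Hypothetical patterns for illustration; not taken from quant_eval.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)
ANSWER_LINE = re.compile(r"^ANSWER:\s*(-?\d+)\s*$", re.MULTILINE)


def find_answer_line(raw: str):
    """Return the integer from an ANSWER line in the response body, else None."""
    body = THINK_BLOCK.sub("", raw)
    # A leftover <think> tag means the block never closed, so no body exists.
    if "<think>" in body:
        return None
    match = ANSWER_LINE.search(body)
    return int(match.group(1)) if match else None


inside = 'assistant<think>Let me check.\nANSWER: 7\n</think>\n{"value": 7}'
outside = 'assistant<think>Let me check.</think>\nANSWER: 7\n{"value": 7}'
print(find_answer_line(inside))   # None
print(find_answer_line(outside))  # 7
```

An ANSWER line that only occurs inside the reasoning trace is treated as absent, which matches the answer_line_ok=0 pattern described above.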
The Three Passing Fuzz Cases
Three fuzz cases pass cleanly: fuzz_0007 (320.15s), fuzz_0016 (160.69s), fuzz_0017 (319.11s). All three achieve bucket_score=10 and detail=ok. This is direct evidence that the model is not broken: when extraction coincidentally succeeds, the output is correct. The model is reasoning correctly. The pipeline is mishandling output.
fuzz_0016 at 160.69s is particularly informative. It runs in approximately half the time of the other passing cases, suggesting the reasoning chain terminated earlier and the valid JSON was emitted within the generation window before the think block consumed all available tokens.
Why Q4_K_M Outperforms F16
This is the most unusual finding in the evaluated series: the quantized variant substantially outperforms the full-precision model across every measured family.
| Family | F16 Pass Rate | Q4_K_M Pass Rate | Q4_K_M Avg Secs |
|---|---|---|---|
| stateful_followup | 0.000 | 1.000 | 22.89 |
| mixed_brief_json | 0.000 | 1.000 | 17.46 |
| toolcall (stage-1) | 0.000 | 1.000 | 13.98 |
| json_multistep | 0.000 | 0.200 | 14.52 |
The reason is the runner difference, not the quantization. The Q4_K_M evaluation used the phi4_reasoning_plus_quant runner (llama.cpp), which handles Phi-4's special tokens, including <|im_end|> and the think-block delimiters, differently from the HuggingFace Transformers pipeline. The llama.cpp runner correctly stops generation at the EOS token and returns the content before it, avoiding the think-block contamination problem.
This does not mean the Q4_K_M variant is unconditionally superior. The Q4_K_M card documents its own EOS token contamination pattern (<|im_end|> appearing as literal text), which is a different manifestation of the same underlying stop-token handling complexity in this model family. The Q4_K_M EOS contamination causes its own failures on json_multistep (4/5 cases) and MCQ (all 5 cases).
What This Means for F16 Deployment
The F16 GGUF is correctly produced. The model weights are not degraded. The conversion pipeline is sound.
The failure mode is pipeline-specific. When used with llama.cpp (which correctly handles <think> block token stripping), the F16 GGUF should perform substantially better than the HuggingFace Transformers runner evaluation suggests. The three passing fuzz cases at bucket=10 demonstrate that the underlying output quality is intact when extraction works.
The implication for production deployment is direct: If you deploy Phi-4-reasoning-plus F16 via a HuggingFace Transformers pipeline without configuring stop tokens and response parsing to handle <think> blocks, you will reproduce these failures. The correct deployment approach is llama.cpp inference (as used in the Q4_K_M evaluation) or a Transformers pipeline configured with the appropriate <think> token stripping.
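For a Transformers-style pipeline that returns the raw text, think-block handling can be added as a post-processing step. The sketch below is one possible approach, not the configuration used in this evaluation: the helper names `strip_think` and `extract_json_object` are hypothetical, and an unclosed <think> block is treated as a generation that ran out of budget before answering.

```python
import json
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_think(raw: str):
    """Remove closed <think> blocks; return None if a block never closed."""
    body = THINK_BLOCK.sub("", raw)
    if "<think>" in body:
        return None  # reasoning was cut off before the final answer
    return body.replace("<|im_end|>", "").strip()


def extract_json_object(raw: str):
    """Return the first JSON object found outside the think block, else None."""
    body = strip_think(raw)
    if body is None:
        return None
    decoder = json.JSONDecoder()
    for index, char in enumerate(body):
        if char != "{":
            continue
        try:
            obj, _ = decoder.raw_decode(body, index)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            return obj
    return None


complete = (
    "assistant<think>So the answer should be A. But wait...</think>\n"
    '{"tool_name": "place_item", "args": {"choice": "A"}}'
)
truncated = "assistant<think>We are Phi. The question: Return JSON only."
print(extract_json_object(complete))
print(extract_json_object(truncated))  # None
```

Returning None for a truncated trace keeps the two failure modes separate: "answer present but buried" can be fixed by parsing, while "answer never emitted" needs a larger generation budget.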
Signal-Level Diagnostics (F16)
json_multistep
| Signal | Rate | Notes |
|---|---|---|
| schema_ok | 0.000 | All 5 cases: think-block output, no valid schema |
| checks_consistent_ok | 0.000 | All 5 cases |
| stop_semantics_ok | 0.000 | All 5 cases |
| oracle_equiv_ok | 0.000 | All 5 cases |
mixed_brief_json (partially illuminating)
| Signal | Rate | Notes |
|---|---|---|
| answer_line_ok | 0.000 | ANSWER line absent or inside think block |
| json_parse_ok | 1.000 | JSON block present and parseable |
| schema_ok | 1.000 | JSON block valid against schema |
The mixed_brief_json JSON signals are the clearest evidence of underlying capability. json_parse_ok=1.000 and schema_ok=1.000 mean the model produced valid, schema-correct JSON on both cases, but the ANSWER: <integer> line that precedes it was lost to the think-block extraction issue. The model can produce correct structured output; the pipeline cannot reliably extract it.
toolcall
| Signal | Rate | Notes |
|---|---|---|
| stage1_tool_parse_ok | 0.500 | tool_01 parsed; tool_02 did not |
| stage1_tool_schema_ok | 0.000 | Both fail schema validation |
tool_01 at 160.11s achieves stage1_tool_parse_ok=1: the tool call JSON was parseable. It fails schema validation, suggesting the think block altered the output format. tool_02 fails parse entirely. Both run for approximately 160 seconds (family average 160.32), suggesting the generation hits a consistent wall at that point.
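The split between "parsed" and "schema-valid" can be illustrated with a two-stage check. This is a sketch only: the signal names come from the table above and the tool name and shelf choices from the raw-output example earlier in this card, but `TOOL_SCHEMA` and `check_tool_call` are hypothetical and do not reproduce the quant_eval validator.

```python
import json

# Hypothetical schema: one tool whose single argument must be a known choice.
TOOL_SCHEMA = {"place_item": {"choice": {"A", "B", "C", "STOP"}}}


def check_tool_call(text: str) -> dict:
    """Return stage-1 parse and schema signals for a candidate tool call."""
    signals = {"stage1_tool_parse_ok": 0, "stage1_tool_schema_ok": 0}
    try:
        call = json.loads(text)
    except json.JSONDecodeError:
        return signals
    if not isinstance(call, dict):
        return signals
    signals["stage1_tool_parse_ok"] = 1

    name, args = call.get("tool_name"), call.get("args")
    if not isinstance(name, str) or not isinstance(args, dict):
        return signals
    allowed_args = TOOL_SCHEMA.get(name)
    if allowed_args is None or set(args) != set(allowed_args):
        return signals
    if all(
        isinstance(args[key], str) and args[key] in allowed
        for key, allowed in allowed_args.items()
    ):
        signals["stage1_tool_schema_ok"] = 1
    return signals


good = '{"tool_name": "place_item", "args": {"choice": "A"}}'
wrong_key = '{"tool_name": "place_item", "args": {"shelf": "A"}}'
unparsed = "assistant<think>We are Phi."
print(check_tool_call(good))
print(check_tool_call(wrong_key))
print(check_tool_call(unparsed))
```

The `wrong_key` input mirrors the tool_01 pattern (parse succeeds, schema fails), and `unparsed` mirrors tool_02.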
Recommended Deployment Approach for F16
Given the pipeline compatibility findings, the correct deployment path for Phi-4-reasoning-plus F16 is llama.cpp inference, not HuggingFace Transformers. This is the same inference backend used for the Q4_K_M evaluation and is the one that correctly handles the model's stop-token semantics.
When using llama.cpp with the F16 GGUF, configure the chat template for Phi-4 and ensure the <|im_end|> stop token is respected. The Q4_K_M evaluation (with its own separate EOS contamination findings) gives a better proxy for expected llama.cpp behavior than the F16 evaluation does.
Hardware Requirements
| Configuration | VRAM Required | Recommended GPU |
|---|---|---|
| F16 (this repo) · full GPU offload | ~32 GB | A100 40 GB · 2× A10G |
| F16 · mixed CPU/GPU offload | 20 to 24 GB VRAM + 16 GB RAM | RTX 4090 with n_gpu_layers tuning |
| Q4_K_M (companion repo) | ~12 GB | T4 16 GB · RTX 3080/4080 · A10G |
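The VRAM figures follow from simple arithmetic: F16 stores 2 bytes per parameter, so 14B parameters come to roughly 28 GB of weights, close to the 29.3 GB file (the small gap is presumably parameter-count rounding plus file metadata). The sketch below also gives a rough starting point for n_gpu_layers on a smaller card. It assumes all layers are the same size, and the layer count, VRAM size, and reserve passed in the example are illustrative assumptions rather than values from this card.

```python
def f16_weight_gb(params_billions: float) -> float:
    """Approximate F16 weight size in GB at 2 bytes per parameter."""
    return params_billions * 2.0


def layers_that_fit(file_gb: float, n_layers: int, vram_gb: float, reserve_gb: float) -> int:
    """Estimate how many equally sized layers fit in VRAM after a reserve."""
    per_layer_gb = file_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))


print(f16_weight_gb(14))  # 28.0

# Assumed inputs: 40 layers, a 24 GB card, 4 GB held back for the
# KV cache and compute buffers. Adjust all three for your setup.
print(layers_that_fit(29.3, 40, 24.0, 4.0))  # 27
```

Treat the result as an upper bound to tune downward; actual headroom depends on context size and backend buffers.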
Usage
Installation
pip install llama-cpp-python huggingface_hub
For GPU acceleration (CUDA):
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
Python: llama-cpp-python (recommended for F16)
import re

from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Note: 29.3 GB download; ensure sufficient disk space and ~32 GB VRAM
model_path = hf_hub_download(
    repo_id="pbhappliedsystems/phi-4-reasoning-plus-gguf-F16",
    filename="phi-4-reasoning-plus-gguf-F16.gguf",
)

llm = Llama(
    model_path=model_path,
    n_ctx=8192,
    n_gpu_layers=-1,  # -1 offloads all layers; reduce if VRAM < 32 GB
    verbose=False,
    # Phi-4 uses phi3 chat template in llama.cpp
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a precise reasoning assistant. Think through problems carefully and respond with structured outputs when requested.",
        },
        {
            "role": "user",
            "content": "Analyze the following and return a JSON object with keys: findings, risk_level, recommendation.",
        },
    ],
    temperature=0.8,
    max_tokens=4096,  # Reasoning traces are long; allocate generously
    stop=["<|im_end|>"],  # Ensure EOS token is a stop signal
)

raw = response["choices"][0]["message"]["content"]
# Strip any residual EOS tokens from output (see Q4_K_M card for context)
clean = re.sub(r"<\|im_end\|>", "", raw).strip()
print(clean)
For partial GPU offload when VRAM is between 20 and 24 GB:
llm = Llama(
    model_path=model_path,
    n_ctx=4096,
    n_gpu_layers=25,  # Tune based on available VRAM
    verbose=True,  # Enable to monitor layer offload
)
CLI: llama-cli
# Expect 3 to 5 minute response times at full F16 precision
llama-cli \
  --model phi-4-reasoning-plus-gguf-F16.gguf \
  --chat-template phi3 \
  --system-prompt "You are a precise reasoning assistant." \
  --prompt "Analyze the following problem carefully and return structured JSON output." \
  --n-predict 4096 \
  --ctx-size 8192 \
  --n-gpu-layers -1 \
  --temp 0.8
For server deployment:
llama-server \
  --model phi-4-reasoning-plus-gguf-F16.gguf \
  --chat-template phi3 \
  --ctx-size 8192 \
  --n-gpu-layers -1 \
  --port 8080 \
  --host 0.0.0.0
Query via the OpenAI-compatible API:
import re

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-required")

response = client.chat.completions.create(
    model="phi-4-reasoning-plus-gguf-F16",
    messages=[{"role": "user", "content": "Your prompt here"}],
    temperature=0.8,
    timeout=600,  # F16 reasoning can take 3 to 5+ minutes per response
)

# Strip any residual EOS tokens from the returned text
clean = re.sub(r"<\|im_end\|>", "", response.choices[0].message.content).strip()
print(clean)
Artifact Provenance
| Artifact | Format | Size | SHA256 |
|---|---|---|---|
| phi-4-reasoning-plus-gguf-F16.gguf | GGUF F16 | 29.3 GB | 6491352a2d3d756fdd4b1538f188bafafc8e940658f1771308ffdaeddd86a385 |
| Q4_K_M (companion repo) | GGUF Q4_K_M | 9.05 GB | 2fe74424b03433d11ccf3f2ce8da404810fa7eb9a269135b1f14bf0d88566e4d |
The F16 GGUF was converted from the microsoft/Phi-4-reasoning-plus HuggingFace snapshot using a custom-built llama.cpp conversion pipeline developed by PBH Applied Systems, without modification to model weights.
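A downloaded copy can be checked against the published SHA256 before use. The sketch below streams the file in chunks so the 29.3 GB artifact is never loaded into memory at once. The helper names and the local file path in the usage note are assumptions; only the expected digest comes from the table above.

```python
import hashlib

# Digest copied from the Artifact Provenance table.
EXPECTED_SHA256 = "6491352a2d3d756fdd4b1538f188bafafc8e940658f1771308ffdaeddd86a385"


def sha256_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Compute a file's SHA256 hex digest by reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_gguf(path) -> bool:
    """Return True if the file at `path` matches the published F16 digest."""
    return sha256_of_file(path) == EXPECTED_SHA256
```

For example, `verify_gguf("phi-4-reasoning-plus-gguf-F16.gguf")` should return True for an intact download stored under that (assumed) local filename.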
Evaluation architecture note: This F16 run (20260222_023834) operated in cache-generation mode (skip_quant=true). The full_weight_cache.json was produced but, due to the pipeline compatibility issue documented above, the cached F16 responses reflect think-block-contaminated outputs rather than clean baseline outputs. The Q4_K_M evaluation (20260222_170914) ran independently using the phi4_reasoning_plus_quant (llama.cpp) runner rather than using the F16 cache, making it the more operationally informative evaluation of the two runs.
Evaluation Methodology
quant_eval v7.21 is a proprietary behavioral evaluation harness developed by PBH Applied Systems.
Fixture set: golden_oracle_fixtures_v7_21 (SHA256: 6d71a0b9147c079371b02a94f3c149eb78a6adc03dc16ff6833b964fbf4174f0)
| Family | Description | Pass Signals |
|---|---|---|
| fuzz | Property-based regression; structured placement correctness | schema_ok, constraints_ok |
| json | Single-step structured JSON with constraint rules | schema_ok, constraints_ok |
| json_multistep | Multi-step planning with self-check and oracle verification | schema_ok, checks_consistent_ok, stop_semantics_ok, oracle_equiv_ok |
| mcq | Multiple-choice extraction | choice_ok |
| stateful_followup | Two-turn state tracking; turn-2 correct given turn-1 | turn1/2_parse_ok, turn1/2_exact_match |
| mixed_brief_json | Hybrid: natural language answer + valid JSON block | answer_line_ok, json_parse_ok, schema_ok |
| toolcall | Tool call embedded in response; parse + schema validation | stage1_tool_parse_ok, stage1_tool_schema_ok |
| toolcall_only | Bare schema-only tool call; strict tool name + args check | tool_name_ok, args_ok |
Evaluation hardware: NVIDIA RTX 4090 (24 GB VRAM) · F16 evaluation date: February 22, 2026 · quant_eval seed: 42
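The relationship between pass signals and a family's pass rate can be sketched as follows. The card does not spell out the aggregation rule, so this assumes a case passes only when every listed signal equals 1, which is consistent with mixed_brief_json scoring a 0.000 pass rate despite two of its three signals passing. The function names are hypothetical, and only three families are included for brevity.

```python
# Required signals per family, copied from the methodology table.
PASS_SIGNALS = {
    "json_multistep": [
        "schema_ok",
        "checks_consistent_ok",
        "stop_semantics_ok",
        "oracle_equiv_ok",
    ],
    "mixed_brief_json": ["answer_line_ok", "json_parse_ok", "schema_ok"],
    "toolcall": ["stage1_tool_parse_ok", "stage1_tool_schema_ok"],
}


def case_passes(family: str, signals: dict) -> bool:
    """Assumed rule: a case passes only if every required signal is 1."""
    return all(signals.get(name, 0) == 1 for name in PASS_SIGNALS[family])


def pass_rate(family: str, cases: list) -> float:
    """Fraction of cases in a family that pass under the assumed rule."""
    return sum(case_passes(family, case) for case in cases) / len(cases)


# Both mixed_brief_json cases as reported: valid JSON, missing ANSWER line.
mixed_cases = [{"answer_line_ok": 0, "json_parse_ok": 1, "schema_ok": 1}] * 2
print(pass_rate("mixed_brief_json", mixed_cases))  # 0.0
```

Under an all-signals rule, per-signal rates carry more diagnostic value than the pass rate alone, which is why the signal-level tables are reported separately.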
Why This Card Exists: The Evaluation Report Pitch
Nearly every pass rate on this card is 0.000. It would be easy to read this as "the model doesn't work." That interpretation is wrong, and the evidence that it's wrong is embedded in this very card:
- Three fuzz cases pass at bucket_score=10. A broken model doesn't produce perfect scores on 3 cases.
- mixed_brief_json has json_parse_ok=1.000 and schema_ok=1.000. A broken model doesn't produce valid, schema-correct JSON.
- The raw outputs show 300+ lines of coherent, problem-relevant reasoning. A broken model doesn't reason correctly for 5 minutes about shelf-placement logic.
What this card actually documents is a pipeline compatibility failure. The HuggingFace Transformers runner used for F16 evaluation does not correctly handle Phi-4-reasoning-plus's <think> block output format. The model is reasoning correctly. The pipeline cannot extract its answers.
That distinction matters enormously in production:
- A team that runs informal testing with the right inference stack (llama.cpp) would see a capable model
- A team that deploys via a Transformers pipeline without configuring stop token handling would silently reproduce every 0.000 result on this card
- Without systematic evaluation, the second team would not know they are in the second scenario
The purpose of pre-deployment evaluation is to surface exactly this kind of finding: not just whether a model is "smart," but whether the full stack from model to runner to extraction works as expected for your specific deployment environment.
About quant_eval & This Evaluation Series
quant_eval is a proprietary behavioral evaluation harness developed by PBH Applied Systems, LLC. It measures real agent-adjacent task performance across structured output, tool dispatch, multi-turn state retention, and multi-step planning, not perplexity or leaderboard proxies. Every model published under pbhappliedsystems has been independently evaluated using quant_eval before being recommended for any production role.
See it in action: Live AI Agent Demo. The demo runs production-style agent workflows powered by open-weight models selected through the quant_eval evaluation pipeline.
Need a deployment recommendation? Not sure which quantization level is right for your hardware, latency target, or agent type? Visit pbhappliedsystems.com
Evaluated and published by PBH Applied Systems, LLC · patrick@pbhappliedsystems.com
About PBH Applied Systems
PBH Applied Systems, LLC is an Oklahoma City-based applied machine learning and AI systems company specializing in production-grade model evaluation, quantization pipelines, agentic AI infrastructure, and scalable AI-driven application development. The organization operates with a strong emphasis on engineering rigor, reproducibility, and real-world deployment constraints, particularly in environments where performance, cost efficiency, and reliability must be balanced against available hardware and budget.
Founder: Patrick Hill, M.S.
PBH Applied Systems was founded by Patrick Hill, a Data Scientist and AI/ML Engineer with 10+ years of experience delivering advanced analytics, predictive modeling, and decision-support solutions across high-stakes operational environments. Patrick holds a Master of Science in Software Engineering with concentrations in Artificial Intelligence and Machine Learning (GPA: 4.0) and a B.S. in Business Finance.
Technical expertise spans:
- Languages & Data: Python, SQL, Linux, Pandas, NumPy, scikit-learn
- ML & Modeling: Supervised and unsupervised learning, neural networks, NLP, transformers, regression, classification, forecasting, and feature engineering
- AI/ML Frameworks: PyTorch, TensorFlow/Keras, HuggingFace Transformers, GGUF, llama.cpp, BitsAndBytes, PEFT, QLoRA
- Deployment & MLOps: Flask APIs, Docker, CI/CD pipelines, REST endpoints, streaming inference, version control
- Data Platforms: Jupyter, Databricks, Power BI, Matplotlib
- Quantization: GGUF conversion, Q4_K_M / Q5_K_M / Q8_0 strategies, adapter-per-model evaluation architecture
Published Author
Patrick is the author of Applied Machine Learning: Concepts, Tools, and Case Studies, a 1,200+ page practitioner-oriented textbook adopted as required reading for CSC 373 (Machine Learning) at the University of Advancing Technology.
Core Service Areas
1. LLM Optimization & Deployment: End-to-end GGUF conversion and quantization with custom llama.cpp pipelines and adapter-per-model architecture.
2. AI Evaluation Frameworks: Proprietary behavioral evaluation via quant_eval: per-family pass rates, failure cluster diagnostics, raw output evidence, pipeline compatibility analysis, and deployment recommendations.
3. Agentic AI Infrastructure: LlamaIndex ReAct agents, Flask orchestration, serverless GPU inference, full pipeline from model selection to production serving.
4. Scalable AI Application Development: Multimodal applications (quantized LLMs + Whisper + BLIP), Dockerized Flask APIs, advanced time-series forecasting with custom attention mechanisms, Bayesian hyperparameter optimization, and FinBERT sentiment fusion.
5. ML Pipeline Design & Analytics: Feature engineering, forward-chaining cross-validation, KPI dashboards, analytical governance at scale.
6. Model & Agent Cataloging: Structured catalog publishing with reproducible artifacts and clear performance tradeoff documentation.
Work With PBH Applied Systems
The findings on this card are the most complex in the evaluated series, not because the model is deficient, but because the interaction between a reasoning model's output format and the inference pipeline is a deployment risk that most teams do not test for. The three passing fuzz cases, the valid JSON blocks in mixed_brief_json, and the correct reasoning traces visible in the raw output all point to a capable model whose production viability depends entirely on getting the pipeline configuration right.
The Q4_K_M companion card documents its own separate EOS token contamination findings, a different manifestation of the same underlying challenge. Together, these two cards represent what a full evaluation report looks like: not a binary pass/fail verdict, but a complete picture of how a model's output interacts with deployment infrastructure.
Book a Scoping Call: Discuss your reasoning model deployment strategy, inference stack selection, or evaluation needs directly with Patrick.
Request an Evaluation Report: A full quant_eval behavioral audit: per-family pass rates, raw output evidence, pipeline compatibility analysis, and a deployment recommendation. Engagements from $2,500.
Connect
| Website | pbhappliedsystems.com |
| Email | patrick@pbhappliedsystems.com |
| LinkedIn | PBH Applied Systems, LLC |
| YouTube | @pbhappliedsystems |
| Instagram | @pbhappliedsystems |
| Facebook | pbhappliedsystems |
License
This GGUF repository inherits the license of the base model:
MIT (microsoft/Phi-4-reasoning-plus)
The quant_eval evaluation methodology, fixture set, and scoring framework are proprietary to PBH Applied Systems, LLC and are not included in this repository.
GGUF conversion and behavioral evaluation performed by PBH Applied Systems, LLC · quant_eval v7.21 · F16 Run ID: 20260222_023834