Phi-4-reasoning-plus · GGUF F16

Converted and evaluated by PBH Applied Systems, LLC — Applied AI/ML Consulting · LLM Optimization & Deployment · Quantized AI Infrastructure

🔬 This repository is part of a production-oriented evaluation series. Every model published under pbhappliedsystems has been independently evaluated using quant_eval v7.21 — a proprietary behavioral evaluation harness developed by PBH Applied Systems. Scores measure real agent-adjacent task performance across structured output, tool dispatch, multi-turn state retention, and multi-step planning families — not perplexity or benchmark leaderboard proxies.

📌 This is the full-precision F16 baseline repository. The Q4_K_M evaluated deployment variant is published at pbhappliedsystems/phi-4-reasoning-plus-gguf-Q4-K-M.

⚠️ Critical context before reading this card. The F16 evaluation results documented here reflect a runner/pipeline compatibility issue specific to the full_weight_transformers (HuggingFace Transformers) runner and Phi-4-reasoning-plus's <think> block output format. The results do not represent the model's underlying capabilities. They represent what happens when the evaluation pipeline does not correctly handle the model's chain-of-thought output format — and they serve as a concrete case study in why pipeline compatibility must be verified before deployment, not assumed.

Try the Live AI Agent Demo

Launch the PBH Applied Systems AI Agent Demo →

This model is part of the PBH Applied Systems evaluated model series that supports the live AI Agent Demo. The demo lets visitors interact with production-style agent workflows powered by open-weight language models evaluated through PBH Applied Systems' quant_eval framework.

The F16 model serves a different role than the Q4_K_M deployment variant. F16 is the full-precision baseline used to measure what the model can do before quantization. quant_eval then compares the quantized model against this baseline to identify which capabilities are preserved, which degrade, and which tasks require guardrails or a higher-precision deployment.

This comparison is central to the demo. It helps determine which model belongs in which agent role:

Reasoning models are selected for planning, analysis, and auditable decision workflows.
Document models are selected for long-context extraction, summarization, and structured Q&A.
Code models are selected for task completion, structured output, API scaffolding, and automation workflows.
Quantized variants are selected when they preserve enough behavior to reduce cost, latency, and GPU requirements.
F16 variants remain important when maximum fidelity, cleaner tool execution, or reduced quantization risk matters more than speed or cost.

The live demo shows the deployment side of that process. The F16 card documents the reference behavior. The Q4_K_M card shows what changes after compression. Together, they explain how PBH Applied Systems uses quant_eval to choose the correct LLM for the correct agent type instead of guessing from model size or leaderboard reputation.

Model Description

This repository contains the full-precision F16 GGUF of microsoft/Phi-4-reasoning-plus, a 14-billion parameter reasoning-tuned model from Microsoft. Phi-4-reasoning-plus generates extended chain-of-thought reasoning traces enclosed in <think>...</think> blocks before emitting its final response.

In the PBH Applied Systems evaluation pipeline, this F16 run (20260222_023834) operated in cache-generation mode (skip_quant=true), producing the full_weight_cache.json intended as the reference baseline for the Q4_K_M comparison run. However, as documented below, the F16 evaluation results were substantially affected by a pipeline compatibility issue — the HuggingFace Transformers runner did not correctly strip <think> block tokens before extraction, causing widespread evaluation failures that do not reflect the model's actual output capabilities.

The F16 GGUF itself was correctly produced. The hardware requirements and artifact provenance sections below are accurate. The evaluation results require the interpretive context provided in this card.

Key Characteristics

Parameters: 14B
Architecture: Reasoning (extended chain-of-thought, <think> block format)
Format: GGUF F16 (full precision)
File size: 29.3 GB
SHA256: 6491352a2d3d756fdd4b1538f188bafafc8e940658f1771308ffdaeddd86a385
Minimum VRAM (GPU inference): ~32 GB
Recommended GPU tier: A100 40 GB · 2× A10G · RTX 4090 (with partial offload)
Context window: 16,384 tokens
Observed inference time (eval hardware): avg 234.39 sec/case on RTX 4090
License: MIT

Why 234.39 sec/case? At full F16 precision, Phi-4-reasoning-plus generates extensive <think> reasoning traces before each response. The fuzz family averaged 309.84 sec/case, with individual cases reaching 320 seconds. Stateful followup cases ran ~278 seconds each. JSON shelf placement cases averaged 319 seconds. This is the chain-of-thought cost observed in this run: on the slowest families the model reasons for roughly five minutes before producing output. For reference, the Q4_K_M variant averages 25.84 sec/case.

PBH Applied Systems Evaluation — quant_eval v7.21

Evaluation conducted by PBH Applied Systems, LLC using quant_eval v7.21
Run ID: 20260222_023834 · Fixtures: golden_oracle_fixtures_v7_21 (SHA256: 6d71a0b9147c...) · Seed: 42
Hardware: NVIDIA RTX 4090 · Runner: full_weight_transformers (F16 only) · Total rows: 42

Per-Family Results — F16 (`full_weight_transformers`)

Family	N	Pass Rate	Avg Secs	Bucket Score	Notes
json_multistep	5	0.000	178.05	0.000	All 4 gating signals fail on all 5 cases
stateful_followup	2	0.000	278.11	0.000	Both turns fail to parse
toolcall_only	2	0.000	136.65	0.000	No JSON object produced
mixed_brief_json	2	0.000	140.38	1.000	JSON valid; ANSWER line missing
toolcall	2	0.000	160.32	0.000	Stage-1 schema fails on both
json	4	n/a	319.12	0.000	tool_parse_fail on all 4
fuzz	20	n/a	309.84	1.500	3/20 pass (fuzz_0007, _0016, _0017)
mcq	5	n/a	10.00	0.000	Empty raw output on all 5
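The overall figure of 234.39 sec/case can be cross-checked against the per-family rows. The sketch below assumes the overall average is the case-weighted mean of the Avg Secs column; the variable and function names are illustrative and are not part of quant_eval.

```python
# (family, case count N, average seconds per case), copied from the table above
FAMILY_TIMINGS = [
    ("json_multistep", 5, 178.05),
    ("stateful_followup", 2, 278.11),
    ("toolcall_only", 2, 136.65),
    ("mixed_brief_json", 2, 140.38),
    ("toolcall", 2, 160.32),
    ("json", 4, 319.12),
    ("fuzz", 20, 309.84),
    ("mcq", 5, 10.00),
]


def weighted_average_seconds(rows):
    """Return (total cases, case-weighted mean seconds per case)."""
    total_cases = sum(n for _, n, _ in rows)
    total_seconds = sum(n * avg for _, n, avg in rows)
    return total_cases, total_seconds / total_cases


total_cases, avg_secs = weighted_average_seconds(FAMILY_TIMINGS)
print(total_cases, round(avg_secs, 2))  # 42 234.39
```

Both values agree with the card: 42 total rows and an average of 234.39 sec/case.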

What These Results Actually Mean — Pipeline Compatibility Finding

The `<think>` Block Problem

Phi-4-reasoning-plus wraps its chain-of-thought reasoning in <think>...</think> tags before emitting its final answer. Every single case in this F16 evaluation run begins with assistant<think> in the raw output — meaning the HuggingFace Transformers runner included the think-block content in the response text that was passed to the evaluator, rather than stripping it at the pipeline level.

The evaluator receives output like:

assistant<think>We are Phi. The question: "Return JSON only. Task: For each arriving
item, choose a shelf A, B, C, or STOP..." [200+ lines of deliberation] ...So the answer
should be A. But wait, let me reconsider...
</think>
{"tool_name": "place_item", "args": {"choice": "A"}}

The extraction logic — looking for a JSON object matching the task schema — either fails to locate the valid output buried after the <think> block, or the model exhausts its generation budget during the think phase and never emits the final answer.
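The two failure paths described above can be told apart in post-processing. The following is a minimal sketch, not the quant_eval extractor: it assumes the raw text has the shape shown in the example (a leaked `assistant` role prefix, one `<think>` block, then the answer), and the function names are hypothetical.

```python
import json
import re


def split_reasoning(raw: str):
    """Return (reasoning, answer, truncated) for one raw generation.

    truncated is True when a <think> block opens but never closes,
    i.e. the generation budget ran out during the reasoning phase.
    """
    text = re.sub(r"^\s*assistant", "", raw, count=1)  # drop leaked role prefix
    if "<think>" not in text:
        return "", text.strip(), False
    _, _, rest = text.partition("<think>")
    reasoning, close, answer = rest.partition("</think>")
    if not close:
        return reasoning.strip(), "", True
    return reasoning.strip(), answer.strip(), False


def first_json_object(text: str):
    """Return the first decodable JSON object in text, or None."""
    decoder = json.JSONDecoder()
    for match in re.finditer(r"\{", text):
        try:
            obj, _ = decoder.raw_decode(text, match.start())
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            return obj
    return None


raw = (
    "assistant<think>We are Phi. So the answer should be A.\n</think>\n"
    '{"tool_name": "place_item", "args": {"choice": "A"}}'
)
reasoning, answer, truncated = split_reasoning(raw)
print(truncated, first_json_object(answer))
```

Searching only the text after `</think>` keeps JSON fragments quoted inside the deliberation from being mistaken for the final answer, and the `truncated` flag separates "answer never emitted" from "answer emitted but not found".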

This is confirmed by the per-family evidence:

json family: All 4 cases show detail=tool_parse_fail step=1 no_json_object. Each case ran for 315–322 seconds, and the raw output consists of assistant<think> content. The model reasoned for over 5 minutes per case and produced no extractable JSON.

json_multistep: All 5 cases show the same pattern — assistant<think> followed by extensive deliberation, no valid schema output, all four gating signals fail simultaneously on every case.

MCQ: All 5 cases produce invalid_choice raw='' — empty raw output after 9–10 seconds. The model generates a brief think trace and then terminates with no visible text. The think content is consumed but nothing is emitted after it.

stateful_followup: Both cases run 278 seconds, produce mismatch failures. The model reasons extensively about the task but does not emit the expected JSON state updates in a form the evaluator can parse.

mixed_brief_json: Both cases show json_parse_ok=1, schema_ok=1 — the JSON block is valid and schema-correct — but answer_line_ok=0 because the required ANSWER: <integer> line either doesn't appear or appears inside the think block rather than in the response body.

The Three Passing Fuzz Cases

Three fuzz cases pass cleanly: fuzz_0007 (320.15s), fuzz_0016 (160.69s), fuzz_0017 (319.11s). All three achieve bucket_score=10 and detail=ok. This is direct evidence that the model is not broken — when extraction coincidentally succeeds, the output is correct. The model is reasoning correctly. The pipeline is mishandling output.

fuzz_0016 at 160.69s is particularly informative — it runs in approximately half the time of the other passing cases, suggesting the reasoning chain terminated earlier and the valid JSON was emitted within the generation window before the think block consumed all available tokens.

Why Q4_K_M Outperforms F16

This is the most unusual finding in the evaluated series: the quantized variant substantially outperforms the full-precision model on each of the four families compared below.

Family	Q4_K_M Pass Rate	Q4_K_M Avg Secs
stateful_followup	1.000	22.89
mixed_brief_json	1.000	17.46
toolcall (stage-1)	1.000	13.98
json_multistep	0.200	14.52

The reason is the runner difference, not the quantization. The Q4_K_M evaluation used the phi4_reasoning_plus_quant runner (llama.cpp), which handles Phi-4's special tokens — including <|im_end|> and the think-block delimiters — differently from the HuggingFace Transformers pipeline. The llama.cpp runner correctly stops generation at the EOS token and returns the content before it, avoiding the think-block contamination problem.

This does not mean the Q4_K_M variant is unconditionally superior. The Q4_K_M card documents its own EOS token contamination pattern (<|im_end|> appearing as literal text), which is a different manifestation of the same underlying stop-token handling complexity in this model family. The Q4_K_M EOS contamination causes its own failures on json_multistep (4/5 cases) and MCQ (all 5 cases).

What This Means for F16 Deployment

The F16 GGUF is correctly produced. The model weights are not degraded. The conversion pipeline is sound.

The failure mode is pipeline-specific. When used with llama.cpp (which correctly handles <think> block token stripping), the F16 GGUF should perform substantially better than the HuggingFace Transformers runner evaluation suggests. That expectation is an inference rather than a measurement: the only F16 run reported on this card used the full_weight_transformers runner. The three passing fuzz cases at bucket=10 demonstrate that the underlying output quality is intact when extraction works.

The implication for production deployment is direct: If you deploy Phi-4-reasoning-plus F16 via a HuggingFace Transformers pipeline without configuring stop tokens and response parsing to handle <think> blocks, you will reproduce these failures. The correct deployment approach is llama.cpp inference (as used in the Q4_K_M evaluation) or a Transformers pipeline configured with the appropriate <think> token stripping.

Signal-Level Diagnostics (F16)

json_multistep

Signal	Rate	Notes
schema_ok	0.000	All 5 cases: think-block output, no valid schema
checks_consistent_ok	0.000	All 5 cases
stop_semantics_ok	0.000	All 5 cases
oracle_equiv_ok	0.000	All 5 cases

mixed_brief_json (partially illuminating)

Signal	Rate	Notes
answer_line_ok	0.000	ANSWER line absent or inside think block
json_parse_ok	1.000	JSON block present and parseable
schema_ok	1.000	JSON block valid against schema

The mixed_brief_json JSON signals are the clearest evidence of underlying capability. json_parse_ok=1.000 and schema_ok=1.000 mean the model produced valid, schema-correct JSON on both cases — but the ANSWER: <integer> line that precedes it was lost to the think-block extraction issue. The model can produce correct structured output; the pipeline cannot reliably extract it.

toolcall

Signal	Rate	Notes
stage1_tool_parse_ok	0.500	tool_01 parsed; tool_02 did not
stage1_tool_schema_ok	0.000	Both fail schema validation

tool_01 at 160.11s achieves stage1_tool_parse_ok=1, meaning the tool call JSON was parseable. It fails schema validation, suggesting the think block altered the output format. tool_02 fails to parse entirely. Both cases run for roughly 160 seconds (family average 160.32), suggesting that generation hits a consistent limit at that point.
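The gap between a parse pass and a schema pass can be illustrated with a two-stage check. This is a sketch only: it assumes the `{"tool_name": ..., "args": {...}}` shape shown in the raw-output example earlier on this card, and `check_tool_call` and `allowed_tools` are hypothetical names, not the quant_eval validator.

```python
import json


def check_tool_call(text: str, allowed_tools):
    """Return (parse_ok, schema_ok) for a candidate tool-call string."""
    try:
        call = json.loads(text)
    except json.JSONDecodeError:
        return False, False  # stage 1 failed: not parseable JSON
    schema_ok = (
        isinstance(call, dict)
        and set(call) == {"tool_name", "args"}
        and isinstance(call["tool_name"], str)
        and call["tool_name"] in allowed_tools
        and isinstance(call["args"], dict)
    )
    return True, schema_ok


tools = {"place_item"}
print(check_tool_call('{"tool_name": "place_item", "args": {"choice": "A"}}', tools))
print(check_tool_call('{"tool": "place_item", "choice": "A"}', tools))
print(check_tool_call("assistant<think>We need to call a tool", tools))
```

The second call parses but fails the schema, which is the tool_01 pattern; the third fails at the parse stage, which is the tool_02 pattern.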

Recommended Deployment Approach for F16

Given the pipeline compatibility findings, the correct deployment path for Phi-4-reasoning-plus F16 is llama.cpp inference, not HuggingFace Transformers. This is the same inference backend used for the Q4_K_M evaluation and is the one that correctly handles the model's stop-token semantics.

When using llama.cpp with the F16 GGUF, configure the chat template for Phi-4 and ensure the <|im_end|> stop token is respected. The Q4_K_M evaluation (with its own separate EOS contamination findings) gives a better proxy for expected llama.cpp behavior than the F16 evaluation does.

Hardware Requirements

Configuration	VRAM Required	Recommended GPU
F16 (this repo) · full GPU offload	~32 GB	A100 40 GB · 2× A10G
F16 · mixed CPU/GPU offload	20–24 GB VRAM + 16 GB RAM	RTX 4090 with `n_gpu_layers` tuning
Q4_K_M (companion repo)	~12 GB	T4 16 GB · RTX 3080/4080 · A10G
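The figures in this table can be sanity-checked with simple arithmetic. The sketch below treats the GGUF as weights only, at 2 bytes per weight, and assumes equal-sized layers; the layer count of 40 is an assumption that is not stated on this card, and the estimate ignores embeddings, KV cache, and runtime buffers.

```python
FILE_SIZE_GB = 29.3     # F16 GGUF size from this card (decimal GB)
BYTES_PER_WEIGHT = 2    # F16 stores each weight in 16 bits
ASSUMED_LAYERS = 40     # assumption: transformer block count, not stated on this card


def implied_parameters_billion(size_gb: float, bytes_per_weight: int) -> float:
    """Parameter count (in billions) implied by a weights-only file size."""
    return size_gb / bytes_per_weight


def offloaded_weight_gb(size_gb: float, total_layers: int, gpu_layers: int) -> float:
    """Approximate GB of weights on the GPU when gpu_layers blocks are offloaded."""
    return size_gb * gpu_layers / total_layers


params_b = implied_parameters_billion(FILE_SIZE_GB, BYTES_PER_WEIGHT)
partial_gb = offloaded_weight_gb(FILE_SIZE_GB, ASSUMED_LAYERS, 25)
print(round(params_b, 2), round(partial_gb, 1))
```

Under these assumptions the file size implies roughly 14.65 billion parameters, and offloading 25 layers places about 18.3 GB of weights on the GPU, which leaves headroom for the KV cache within the 20–24 GB row.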

Usage

Installation

pip install llama-cpp-python huggingface_hub

For GPU acceleration (CUDA):

CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir

Python — llama-cpp-python (recommended for F16)

from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Note: 29.3 GB download — ensure sufficient disk space and ~32 GB VRAM
model_path = hf_hub_download(
    repo_id="pbhappliedsystems/phi-4-reasoning-plus-gguf-F16",
    filename="phi-4-reasoning-plus-gguf-F16.gguf"
)

llm = Llama(
    model_path=model_path,
    n_ctx=8192,
    n_gpu_layers=-1,   # -1 offloads all layers; reduce if VRAM < 32 GB
    verbose=False,
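    # The GGUF architecture is phi3, but Phi-4 chat turns end with <|im_end|>;
    # with no chat_format set, the template stored in the GGUF metadata is used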
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a precise reasoning assistant. Think through problems carefully and respond with structured outputs when requested."
        },
        {
            "role": "user",
            "content": "Analyze the following and return a JSON object with keys: findings, risk_level, recommendation."
        }
    ],
    temperature=0.8,
    max_tokens=4096,  # Reasoning traces are long — allocate generously
    stop=["<|im_end|>"],  # Ensure EOS token is a stop signal
)

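import re
raw = response["choices"][0]["message"]["content"]
# Drop any <think>...</think> reasoning trace that reaches the response text
answer = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)
# Strip any residual EOS tokens from output (see Q4_K_M card for context)
clean = answer.replace("<|im_end|>", "").strip()
print(clean)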

For partial GPU offload when VRAM is between 20 and 24 GB:

llm = Llama(
    model_path=model_path,
    n_ctx=4096,
    n_gpu_layers=25,   # Tune based on available VRAM
    verbose=True,      # Enable to monitor layer offload
)

CLI — llama-cli

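# Expect 3–5 minute response times at full F16 precision
# No --chat-template override: the built-in phi3 template ends turns with <|end|>,
# while this model ends turns with <|im_end|>. Without the flag, llama.cpp uses
# the chat template stored in the GGUF metadata.
llama-cli \
  --model phi-4-reasoning-plus-gguf-F16.gguf \
  --system-prompt "You are a precise reasoning assistant." \
  --prompt "Analyze the following problem carefully and return structured JSON output." \
  --n-predict 4096 \
  --ctx-size 8192 \
  --n-gpu-layers -1 \
  --temp 0.8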

For server deployment:

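# No --chat-template override: the template stored in the GGUF metadata is used,
# which matches the <|im_end|> turn delimiter discussed on this card
llama-server \
  --model phi-4-reasoning-plus-gguf-F16.gguf \
  --ctx-size 8192 \
  --n-gpu-layers -1 \
  --port 8080 \
  --host 0.0.0.0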

Query via the OpenAI-compatible API:

from openai import OpenAI
import re

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-required")

response = client.chat.completions.create(
    model="phi-4-reasoning-plus-gguf-F16",
    messages=[{"role": "user", "content": "Your prompt here"}],
    temperature=0.8,
    timeout=600,  # F16 reasoning can take 3–5+ minutes per response
)
clean = re.sub(r'<\|im_end\|>', '', response.choices[0].message.content).strip()
print(clean)

Artifact Provenance

Artifact	Format	Size	SHA256
`phi-4-reasoning-plus-gguf-F16.gguf`	GGUF F16	29.3 GB	`6491352a2d3d756fdd4b1538f188bafafc8e940658f1771308ffdaeddd86a385`
Q4_K_M (companion repo)	GGUF Q4_K_M	9.05 GB	`2fe74424b03433d11ccf3f2ce8da404810fa7eb9a269135b1f14bf0d88566e4d`

The F16 GGUF was converted from the microsoft/Phi-4-reasoning-plus HuggingFace snapshot using a custom-built llama.cpp conversion pipeline developed by PBH Applied Systems, without modification to model weights.
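A download can be checked against the SHA256 listed in the table above before loading it. This is a minimal sketch using only the Python standard library; the function names are illustrative, and the file is read in chunks because a 29.3 GB artifact should not be loaded into memory at once.

```python
import hashlib

# SHA256 of phi-4-reasoning-plus-gguf-F16.gguf, copied from the table above
EXPECTED_SHA256 = "6491352a2d3d756fdd4b1538f188bafafc8e940658f1771308ffdaeddd86a385"


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file incrementally and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """Return True when the file at path matches the expected digest."""
    return sha256_of_file(path) == expected.lower()
```

For example, `verify_artifact(model_path)` can be called on the path returned by `hf_hub_download` in the Usage section; a `False` result indicates a corrupted or different file.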

Evaluation architecture note: This F16 run (20260222_023834) operated in cache-generation mode (skip_quant=true). The full_weight_cache.json was produced but, due to the pipeline compatibility issue documented above, the cached F16 responses reflect think-block-contaminated outputs rather than clean baseline outputs. The Q4_K_M evaluation (20260222_170914) ran independently using the phi4_reasoning_plus_quant (llama.cpp) runner rather than using the F16 cache, making it the more operationally informative evaluation of the two runs.

Evaluation Methodology

quant_eval v7.21 is a proprietary behavioral evaluation harness developed by PBH Applied Systems.

Fixture set: golden_oracle_fixtures_v7_21 (SHA256: 6d71a0b9147c079371b02a94f3c149eb78a6adc03dc16ff6833b964fbf4174f0)

Family	Description	Pass Signals
`fuzz`	Property-based regression; structured placement correctness	schema_ok, constraints_ok
`json`	Single-step structured JSON with constraint rules	schema_ok, constraints_ok
`json_multistep`	Multi-step planning with self-check and oracle verification	schema_ok, checks_consistent_ok, stop_semantics_ok, oracle_equiv_ok
`mcq`	Multiple-choice extraction	choice_ok
`stateful_followup`	Two-turn state tracking; turn-2 correct given turn-1	turn1/2_parse_ok, turn1/2_exact_match
`mixed_brief_json`	Hybrid: natural language answer + valid JSON block	answer_line_ok, json_parse_ok, schema_ok
`toolcall`	Tool call embedded in response; parse + schema validation	stage1_tool_parse_ok, stage1_tool_schema_ok
`toolcall_only`	Bare schema-only tool call; strict tool name + args check	tool_name_ok, args_ok
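The relationship between signals and pass rates can be sketched as follows. quant_eval is proprietary, so this is an assumption about its gating rule rather than its implementation: it supposes that a case passes only when every pass signal for its family equals 1. The signal names come from the table above; the function names and the sample data layout are hypothetical.

```python
# Pass signals per family, copied from the table above (subset shown)
PASS_SIGNALS = {
    "json_multistep": [
        "schema_ok", "checks_consistent_ok", "stop_semantics_ok", "oracle_equiv_ok",
    ],
    "mixed_brief_json": ["answer_line_ok", "json_parse_ok", "schema_ok"],
    "toolcall": ["stage1_tool_parse_ok", "stage1_tool_schema_ok"],
    "toolcall_only": ["tool_name_ok", "args_ok"],
}


def case_passes(family: str, signals: dict) -> bool:
    """Assumed rule: a case passes only if all of its family's signals are 1."""
    return all(signals.get(name, 0) == 1 for name in PASS_SIGNALS[family])


def family_pass_rate(family: str, cases: list) -> float:
    """Fraction of cases in a family that pass under the assumed rule."""
    return sum(case_passes(family, case) for case in cases) / len(cases)


# Two mixed_brief_json cases shaped like the F16 run: valid JSON, no ANSWER line
f16_cases = [
    {"answer_line_ok": 0, "json_parse_ok": 1, "schema_ok": 1},
    {"answer_line_ok": 0, "json_parse_ok": 1, "schema_ok": 1},
]
print(family_pass_rate("mixed_brief_json", f16_cases))  # 0.0
```

Under this rule a single failing signal zeroes the case, which is consistent with mixed_brief_json reporting a 0.000 pass rate despite json_parse_ok and schema_ok both at 1.000.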

Evaluation hardware: NVIDIA RTX 4090 (24 GB VRAM)
F16 evaluation date: February 22, 2026
quant_eval seed: 42

Why This Card Exists — The Evaluation Report Pitch

Every family-level pass rate reported on this card is 0.000. It would be easy to read this as "the model doesn't work." That interpretation is wrong, and the evidence that it is wrong is on this very card:

Three fuzz cases pass at bucket_score=10. A broken model doesn't produce perfect scores on 3 cases.
mixed_brief_json has json_parse_ok=1.000 and schema_ok=1.000. A broken model doesn't produce valid, schema-correct JSON.
The raw outputs show 300+ lines of coherent, problem-relevant reasoning. A broken model doesn't reason correctly for 5 minutes about shelf-placement logic.

What this card actually documents is a pipeline compatibility failure. The HuggingFace Transformers runner used for F16 evaluation does not correctly handle Phi-4-reasoning-plus's <think> block output format. The model is reasoning correctly. The pipeline cannot extract its answers.

That distinction matters enormously in production:

A team that runs informal testing with the right inference stack (llama.cpp) would see a capable model
A team that deploys via a Transformers pipeline without configuring stop token handling would silently reproduce every 0.000 result on this card
Without systematic evaluation, the second team would not know they are in the second scenario

The purpose of pre-deployment evaluation is to surface exactly this kind of finding — not just whether a model is "smart," but whether the full stack from model to runner to extraction works as expected for your specific deployment environment.

🔬 About quant_eval & This Evaluation Series

quant_eval is a proprietary behavioral evaluation harness developed by PBH Applied Systems, LLC. It measures real agent-adjacent task performance across structured output, tool dispatch, multi-turn state retention, and multi-step planning — not perplexity or leaderboard proxies. Every model published under pbhappliedsystems has been independently evaluated using quant_eval before being recommended for any production role.

See it in action: Live AI Agent Demo → The demo runs production-style agent workflows powered by open-weight models selected through the quant_eval evaluation pipeline.

Need a deployment recommendation? Not sure which quantization level is right for your hardware, latency target, or agent type? → pbhappliedsystems.com

Evaluated and published by PBH Applied Systems, LLC · patrick@pbhappliedsystems.com

About PBH Applied Systems

PBH Applied Systems, LLC is an Oklahoma City–based applied machine learning and AI systems company specializing in production-grade model evaluation, quantization pipelines, agentic AI infrastructure, and scalable AI-driven application development. The organization operates with a strong emphasis on engineering rigor, reproducibility, and real-world deployment constraints — particularly in environments where performance, cost efficiency, and reliability must be balanced against available hardware and budget.

Founder — Patrick Hill, M.S.

PBH Applied Systems was founded by Patrick Hill, a Data Scientist and AI/ML Engineer with 10+ years of experience delivering advanced analytics, predictive modeling, and decision-support solutions across high-stakes operational environments. Patrick holds a Master of Science in Software Engineering with concentrations in Artificial Intelligence and Machine Learning (GPA: 4.0) and a B.S. in Business Finance.

Technical expertise spans:

Languages & Data: Python, SQL, Linux, Pandas, NumPy, scikit-learn
ML & Modeling: Supervised and unsupervised learning, neural networks, NLP, transformers, regression, classification, forecasting, and feature engineering
AI/ML Frameworks: PyTorch, TensorFlow/Keras, HuggingFace Transformers, GGUF, llama.cpp, BitsAndBytes, PEFT, QLoRA
Deployment & MLOps: Flask APIs, Docker, CI/CD pipelines, REST endpoints, streaming inference, version control
Data Platforms: Jupyter, Databricks, Power BI, Matplotlib
Quantization: GGUF conversion, Q4_K_M / Q5_K_M / Q8_0 strategies, adapter-per-model evaluation architecture

Published Author

Patrick is the author of Applied Machine Learning: Concepts, Tools, and Case Studies — a 1,200+ page practitioner-oriented textbook adopted as required reading for CSC 373 – Machine Learning at the University of Advancing Technology.

Core Service Areas

1. LLM Optimization & Deployment — End-to-end GGUF conversion and quantization with custom llama.cpp pipelines and adapter-per-model architecture.

2. AI Evaluation Frameworks — Proprietary behavioral evaluation via quant_eval: per-family pass rates, failure cluster diagnostics, raw output evidence, pipeline compatibility analysis, and deployment recommendations.

3. Agentic AI Infrastructure — LlamaIndex ReAct agents, Flask orchestration, serverless GPU inference, full pipeline from model selection to production serving.

4. Scalable AI Application Development — Multimodal applications (quantized LLMs + Whisper + BLIP), Dockerized Flask APIs, advanced time-series forecasting with custom attention mechanisms, Bayesian hyperparameter optimization, and FinBERT sentiment fusion.

5. ML Pipeline Design & Analytics — Feature engineering, forward-chaining cross-validation, KPI dashboards, analytical governance at scale.

6. Model & Agent Cataloging — Structured catalog publishing with reproducible artifacts and clear performance tradeoff documentation.

📞 Work With PBH Applied Systems

The findings on this card are the most complex in the evaluated series — not because the model is deficient, but because the interaction between a reasoning model's output format and the inference pipeline is a deployment risk that most teams do not test for. The three passing fuzz cases, the valid JSON blocks in mixed_brief_json, and the correct reasoning traces visible in the raw output all point to a capable model whose production viability depends entirely on getting the pipeline configuration right.

The Q4_K_M companion card documents its own separate EOS token contamination findings — a different manifestation of the same underlying challenge. Together, these two cards represent what a full evaluation report looks like: not a binary pass/fail verdict, but a complete picture of how a model's output interacts with deployment infrastructure.

👉 Book a Scoping Call — Discuss your reasoning model deployment strategy, inference stack selection, or evaluation needs directly with Patrick.

👉 Request an Evaluation Report — A full quant_eval behavioral audit: per-family pass rates, raw output evidence, pipeline compatibility analysis, and a deployment recommendation. Engagements from $2,500.

Connect


🌐 Website	pbhappliedsystems.com
📧 Email	patrick@pbhappliedsystems.com
💼 LinkedIn	PBH Applied Systems, LLC
▶️ YouTube	@pbhappliedsystems
📸 Instagram	@pbhappliedsystems
👍 Facebook	pbhappliedsystems

License

This GGUF repository inherits the license of the base model: MIT — microsoft/Phi-4-reasoning-plus

The quant_eval evaluation methodology, fixture set, and scoring framework are proprietary to PBH Applied Systems, LLC and are not included in this repository.

GGUF conversion and behavioral evaluation performed by PBH Applied Systems, LLC · quant_eval v7.21 · F16 Run ID: 20260222_023834
