Instructions for using Asystemoffields/Ministral-3-8B-Instruct-PMRA-GGUF with libraries and local apps.

Libraries

How to use Asystemoffields/Ministral-3-8B-Instruct-PMRA-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="Asystemoffields/Ministral-3-8B-Instruct-PMRA-GGUF",
	filename="ministral3_8b_pmra_knapsack_3p2.gguf",
)

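response = llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

# The result follows the OpenAI chat-completion schema; print the reply text.
print(response["choices"][0]["message"]["content"])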

Local Apps

llama.cpp

How to use Asystemoffields/Ministral-3-8B-Instruct-PMRA-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Asystemoffields/Ministral-3-8B-Instruct-PMRA-GGUF
# Run inference directly in the terminal:
llama-cli -hf Asystemoffields/Ministral-3-8B-Instruct-PMRA-GGUF

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Asystemoffields/Ministral-3-8B-Instruct-PMRA-GGUF
# Run inference directly in the terminal:
llama-cli -hf Asystemoffields/Ministral-3-8B-Instruct-PMRA-GGUF

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf Asystemoffields/Ministral-3-8B-Instruct-PMRA-GGUF
# Run inference directly in the terminal:
./llama-cli -hf Asystemoffields/Ministral-3-8B-Instruct-PMRA-GGUF

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf Asystemoffields/Ministral-3-8B-Instruct-PMRA-GGUF
# Run inference directly in the terminal:
./build/bin/llama-cli -hf Asystemoffields/Ministral-3-8B-Instruct-PMRA-GGUF

Use Docker

docker model run hf.co/Asystemoffields/Ministral-3-8B-Instruct-PMRA-GGUF

vLLM

How to use Asystemoffields/Ministral-3-8B-Instruct-PMRA-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Asystemoffields/Ministral-3-8B-Instruct-PMRA-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Asystemoffields/Ministral-3-8B-Instruct-PMRA-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
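The curl call above can also be issued from Python. The following is a minimal sketch using only the standard library; it assumes the server listens on `http://localhost:8000/v1` as in the curl example, and the helper names `build_chat_request` and `chat` are illustrative, not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "Asystemoffields/Ministral-3-8B-Instruct-PMRA-GGUF"


def build_chat_request(prompt, base_url="http://localhost:8000/v1", model=MODEL_ID):
    """Build the same POST request as the curl command, without sending it."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    # OpenAI-compatible servers put the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

Once the server is running, `chat("What is the capital of France?")` should return the reply as a string. Passing `base_url="http://localhost:8080/v1"` points the same helper at a llama-server instance on the port used in the Pi and Hermes examples.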

Use Docker

docker model run hf.co/Asystemoffields/Ministral-3-8B-Instruct-PMRA-GGUF

Ollama
How to use Asystemoffields/Ministral-3-8B-Instruct-PMRA-GGUF with Ollama:
```
ollama run hf.co/Asystemoffields/Ministral-3-8B-Instruct-PMRA-GGUF
```

Unsloth Studio

How to use Asystemoffields/Ministral-3-8B-Instruct-PMRA-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Asystemoffields/Ministral-3-8B-Instruct-PMRA-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Asystemoffields/Ministral-3-8B-Instruct-PMRA-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for Asystemoffields/Ministral-3-8B-Instruct-PMRA-GGUF to start chatting

Pi

How to use Asystemoffields/Ministral-3-8B-Instruct-PMRA-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf Asystemoffields/Ministral-3-8B-Instruct-PMRA-GGUF

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "Asystemoffields/Ministral-3-8B-Instruct-PMRA-GGUF"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use Asystemoffields/Ministral-3-8B-Instruct-PMRA-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf Asystemoffields/Ministral-3-8B-Instruct-PMRA-GGUF

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default Asystemoffields/Ministral-3-8B-Instruct-PMRA-GGUF

Run Hermes

hermes

Docker Model Runner
How to use Asystemoffields/Ministral-3-8B-Instruct-PMRA-GGUF with Docker Model Runner:
```
docker model run hf.co/Asystemoffields/Ministral-3-8B-Instruct-PMRA-GGUF
```

Lemonade

How to use Asystemoffields/Ministral-3-8B-Instruct-PMRA-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull Asystemoffields/Ministral-3-8B-Instruct-PMRA-GGUF

Run and chat with the model

lemonade run user.Ministral-3-8B-Instruct-PMRA-GGUF-{{QUANT_TAG}}

List all available models

lemonade list

Ministral 3 8B Instruct · PMRA mixed-precision GGUF

Two mixed-precision GGUFs of Mistral AI's Ministral 3 8B Instruct: a primary build at the IQ3_XS size budget and a leaner 3.2-bpw build for tight-RAM machines. Both beat plain IQ3_XS on a held-out test split: the primary by ~0.18 NLL at the same size, the compact one by ~0.12 NLL while being ~311 MB smaller. Both are standard GGUF files for text generation with llama.cpp or Ollama.

The model

Ministral 3 8B Instruct is the instruction-tuned member of Mistral AI's Ministral 3 family — designed for edge and on-device deployment, fitting in 24 GB of VRAM at BF16 and under ~12 GB once quantized. It's natively multimodal (an 8.4B language model paired with a 0.4B vision encoder) and multilingual across dozens of languages (English, French, Spanish, German, Italian, Portuguese, Dutch, Chinese, Japanese, Korean, Arabic, …), with strong instruction-following and system-prompt adherence.

Scope of this artifact: these GGUFs target the text stack for text generation in llama.cpp; image input is not exercised here. The build was calibrated and measured on English.

Why this build (PMRA)

A normal GGUF quant uses one format for nearly every tensor, paying the same bit-rate everywhere regardless of importance. Production Mixed-Rate Allocation (PMRA) measures each tensor's contribution to quality and spends bits where they help most: starting from a low-bit IQ2_M floor, it promotes the groups that matter to stronger formats under a fixed byte budget. The selection is frozen on calibration data, then re-scored on a held-out test split so the reported gain reflects generalization rather than overfitting to the calibration data.
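The idea of promoting groups under a byte budget can be illustrated with a small greedy sketch. This is not the PMRA selector, whose algorithm this card does not describe; it is a simple gain-per-byte heuristic, and the `Promotion` type, the group names, and all numbers below are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Promotion:
    group: str        # tensor group name (hypothetical)
    fmt: str          # stronger format the group would be promoted to
    extra_bytes: int  # bytes added relative to the floor format (made up)
    nll_gain: float   # calibration NLL reduction from the promotion (made up)


def allocate(promotions, floor_bytes, budget_bytes):
    """Greedy heuristic: accept promotions by gain per extra byte while they fit."""
    chosen = {}
    used = floor_bytes
    ranked = sorted(promotions, key=lambda p: p.nll_gain / p.extra_bytes, reverse=True)
    for p in ranked:
        if p.group in chosen:
            continue  # each group gets at most one format
        if used + p.extra_bytes <= budget_bytes:
            chosen[p.group] = p.fmt
            used += p.extra_bytes
    return chosen, used


candidates = [
    Promotion("attn_v", "IQ4_XS", 40, 0.20),
    Promotion("attn_v", "Q3_K_M", 25, 0.10),
    Promotion("ffn_down", "Q3_K_M", 60, 0.15),
    Promotion("ffn_up", "Q3_K_S", 50, 0.02),
    Promotion("token_embd", "Q3_K_S", 30, 0.09),
]
plan, total = allocate(candidates, floor_bytes=1000, budget_bytes=1100)
print(plan, total)
```

With these toy numbers the heuristic promotes `attn_v` and `token_embd` and leaves the other groups at the floor, because the remaining promotions no longer fit the budget. A greedy pass like this is not guaranteed to find the best allocation; an exact knapsack solver would search the combinations instead.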

Headline (held-out Wikitext-2 test, lower NLL is better):

| Build | NLL | Payload size | vs `IQ3_XS` |
| --- | --- | --- | --- |
| PMRA primary (IQ3_XS budget) | 4.537 | 3.706 GB | −0.185 NLL, same size |
| PMRA compact (3.2 bpw) | 4.601 | 3.396 GB | −0.122 NLL, −311 MB |
| plain `IQ3_XS` | 4.722 | 3.706 GB | n/a |

Both decisions: GO.

Which file?

ministral3_8b_pmra_knapsack_iq3xs_budget.gguf — primary quality build; pick this if you have the RAM.
ministral3_8b_pmra_knapsack_3p2.gguf — the 3.2-bpw build; for ~8 GB machines, start here, close memory-heavy apps, and keep the context small.

Quick start

llama-cli -m ministral3_8b_pmra_knapsack_3p2.gguf \
  -p "Write a short hello from PMRA." -n 80 --ctx-size 2048

Needs a recent llama.cpp build (or Ollama) with Ministral 3 support.

Footprint

| File | Selector | Size (bytes) | Payload bpw | SHA-256 |
| --- | --- | --- | --- | --- |
| `ministral3_8b_pmra_knapsack_iq3xs_budget.gguf` | `c2_calib_knapsack_mixed` | `3,713,801,312` | `3.492210` | `7f88294593cf419a5b39b4da2c7df356fee9528de947d6547b9d11d60a84ac5d` |
| `ministral3_8b_pmra_knapsack_3p2.gguf` | `c2_calib_knapsack_bpw_3p200_mixed` | `3,403,422,816` | `3.199730` | `ff95384e68f211b238767e1783d20ce0b4a8be8a56ac8b906756c481831421a3` |

Both files were materialized and reloaded by the artifact builder with 0 tensor mismatches.
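A downloaded file can be checked against the SHA-256 values listed in the table. The sketch below uses Python's standard `hashlib`; the helper names `sha256_of` and `verify` are illustrative, and it assumes the file is stored locally under its original name.

```python
import hashlib
import os

# Digests copied from the Footprint table.
EXPECTED_SHA256 = {
    "ministral3_8b_pmra_knapsack_iq3xs_budget.gguf":
        "7f88294593cf419a5b39b4da2c7df356fee9528de947d6547b9d11d60a84ac5d",
    "ministral3_8b_pmra_knapsack_3p2.gguf":
        "ff95384e68f211b238767e1783d20ce0b4a8be8a56ac8b906756c481831421a3",
}


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so multi-GB GGUFs never load fully into RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path):
    """Return True if the file's digest matches the expected value for its name."""
    expected = EXPECTED_SHA256[os.path.basename(path)]
    return sha256_of(path) == expected
```

For example, `verify("ministral3_8b_pmra_knapsack_3p2.gguf")` should return `True` for an intact download and `False` for a truncated or corrupted one.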

Benchmarks

Calibration: Wikitext-2-raw train (12 prompts). Selector eval: Wikitext-2-raw validation (128 prompts). Held-out eval: Wikitext-2-raw test (512 prompts); calibration/eval prompt overlap audited to 0. Lower NLL is better.

Held-out Wikitext-2 test:

| Variant | NLL | Payload bpw | Payload bytes |
| --- | --- | --- | --- |
| fp16 reference | `2.393904` | `16.000000` | `16,979,107,840` |
| `IQ2_M` | `4.963936` | `2.920126` | `3,098,820,608` |
| `IQ3_XS` (target / control) | `4.722369` | `3.492735` | `3,706,470,400` |
| `Q3_K_S` | `4.757542` | `3.636073` | `3,858,579,456` |
| PMRA knapsack | `4.537475` | `3.492210` | `3,705,913,344` |
| PMRA knapsack 3.2 bpw | `4.600533` | `3.199730` | `3,395,534,848` |
| same-budget random | `4.912780` | `3.492210` | `3,705,913,344` |
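The payload bpw column can be reproduced from the payload bytes column. The sketch below assumes that the fp16 reference stores every payload weight in 2 bytes, which is what its 16.000000 bpw entry implies; the function name `payload_bpw` is illustrative.

```python
FP16_PAYLOAD_BYTES = 16_979_107_840

# At 16 bits per weight, each weight takes 2 bytes.
PAYLOAD_WEIGHTS = FP16_PAYLOAD_BYTES // 2


def payload_bpw(payload_bytes):
    """Average bits per weight for a given payload size in bytes."""
    return payload_bytes * 8 / PAYLOAD_WEIGHTS


for name, size in [
    ("IQ3_XS", 3_706_470_400),
    ("PMRA knapsack", 3_705_913_344),
    ("PMRA knapsack 3.2 bpw", 3_395_534_848),
]:
    print(f"{name}: {payload_bpw(size):.6f} bpw")
```

The printed values should agree with the table to the six decimals shown there.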

Selector validation split (Wikitext-2 validation): PMRA knapsack 4.456880 vs IQ3_XS 4.649152 — consistent.

primary vs IQ3_XS: −0.184894 NLL, −557,056 bytes · vs Q3_K_S: −0.220067 NLL, −152,666,112 bytes · vs random: −0.375305 NLL · decision GO
compact vs IQ3_XS: −0.121836 NLL, −310,935,552 bytes · vs Q3_K_S: −0.157010 NLL
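The deltas quoted above are plain differences between rows of the held-out table. A minimal sketch that recomputes them from the published values follows; the dictionary and function names are illustrative.

```python
# (NLL, payload bytes) copied from the held-out Wikitext-2 test table.
RESULTS = {
    "IQ3_XS": (4.722369, 3_706_470_400),
    "Q3_K_S": (4.757542, 3_858_579_456),
    "PMRA knapsack": (4.537475, 3_705_913_344),
    "PMRA knapsack 3.2 bpw": (4.600533, 3_395_534_848),
    "same-budget random": (4.912780, 3_705_913_344),
}


def delta(variant, baseline):
    """Return (NLL difference, byte difference) of variant relative to baseline."""
    v_nll, v_bytes = RESULTS[variant]
    b_nll, b_bytes = RESULTS[baseline]
    return v_nll - b_nll, v_bytes - b_bytes


for variant in ("PMRA knapsack", "PMRA knapsack 3.2 bpw"):
    for baseline in ("IQ3_XS", "Q3_K_S"):
        d_nll, d_bytes = delta(variant, baseline)
        print(f"{variant} vs {baseline}: {d_nll:+.6f} NLL, {d_bytes:+,} bytes")
```

Because the table values are rounded to six decimals, a recomputed delta can differ from the quoted one in the last digit, as appears to happen for compact vs Q3_K_S.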

How it was built

base: mistralai/Ministral-3-8B-Instruct-2512-BF16
GGUF sources: bartowski/mistralai_Ministral-3-8B-Instruct-2512-GGUF
tensor profile mistral3 · group mode tensor · selector c2_calib_knapsack_mixed
low source IQ2_M → target/control IQ3_XS; promotion menu Q2_K, Q2_K_L, Q3_K_S, Q3_K_M, IQ4_XS

Files

ministral3_8b_pmra_knapsack_iq3xs_budget.gguf, ministral3_8b_pmra_knapsack_3p2.gguf — the models
artifact_report*.json / .md, selector_result.json / .md
public_eval_wikitext_test_result.json / .md — the held-out evaluation
MINISTRAL3_8B_INSTRUCT_PMRA.md — release card