Instructions for using LocusForge/VariantAssist-Gemma4-31B-GGUF with libraries and local apps.

Libraries

How to use LocusForge/VariantAssist-Gemma4-31B-GGUF with llama-cpp-python:

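# !pip install llama-cpp-python huggingface_hub

from llama_cpp import Llama

# Download and load a single-file quantization from the Hub.
# Q4_K_M is listed below as the practical default; the BF16
# "00002-of-00002" file is only the second shard of a split export.
llm = Llama.from_pretrained(
    repo_id="LocusForge/VariantAssist-Gemma4-31B-GGUF",
    filename="VA-Gemma4-31B-Q4_K_M.gguf",
)

response = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)
# The result follows the OpenAI chat-completion layout.
print(response["choices"][0]["message"]["content"])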

Local Apps

llama.cpp

How to use LocusForge/VariantAssist-Gemma4-31B-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf LocusForge/VariantAssist-Gemma4-31B-GGUF:UD-Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf LocusForge/VariantAssist-Gemma4-31B-GGUF:UD-Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf LocusForge/VariantAssist-Gemma4-31B-GGUF:UD-Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf LocusForge/VariantAssist-Gemma4-31B-GGUF:UD-Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf LocusForge/VariantAssist-Gemma4-31B-GGUF:UD-Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf LocusForge/VariantAssist-Gemma4-31B-GGUF:UD-Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf LocusForge/VariantAssist-Gemma4-31B-GGUF:UD-Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf LocusForge/VariantAssist-Gemma4-31B-GGUF:UD-Q4_K_M

Use Docker

docker model run hf.co/LocusForge/VariantAssist-Gemma4-31B-GGUF:UD-Q4_K_M


vLLM

How to use LocusForge/VariantAssist-Gemma4-31B-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "LocusForge/VariantAssist-Gemma4-31B-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "LocusForge/VariantAssist-Gemma4-31B-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/LocusForge/VariantAssist-Gemma4-31B-GGUF:UD-Q4_K_M

Ollama
How to use LocusForge/VariantAssist-Gemma4-31B-GGUF with Ollama:
```
ollama run hf.co/LocusForge/VariantAssist-Gemma4-31B-GGUF:UD-Q4_K_M
```

Unsloth Studio

How to use LocusForge/VariantAssist-Gemma4-31B-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for LocusForge/VariantAssist-Gemma4-31B-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for LocusForge/VariantAssist-Gemma4-31B-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for LocusForge/VariantAssist-Gemma4-31B-GGUF to start chatting

Pi

How to use LocusForge/VariantAssist-Gemma4-31B-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf LocusForge/VariantAssist-Gemma4-31B-GGUF:UD-Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "LocusForge/VariantAssist-Gemma4-31B-GGUF:UD-Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use LocusForge/VariantAssist-Gemma4-31B-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf LocusForge/VariantAssist-Gemma4-31B-GGUF:UD-Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default LocusForge/VariantAssist-Gemma4-31B-GGUF:UD-Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use LocusForge/VariantAssist-Gemma4-31B-GGUF with Docker Model Runner:
```
docker model run hf.co/LocusForge/VariantAssist-Gemma4-31B-GGUF:UD-Q4_K_M
```

Lemonade

How to use LocusForge/VariantAssist-Gemma4-31B-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull LocusForge/VariantAssist-Gemma4-31B-GGUF:UD-Q4_K_M

Run and chat with the model

lemonade run user.VariantAssist-Gemma4-31B-GGUF-UD-Q4_K_M

List all available models

lemonade list

variantassist.com · GitHub · License

Compatibility note: these VariantAssist-tuned GGUF models are currently intended only for Level-1 Annotation. For other VariantAssist workflow stages, use the original Q8 model rather than these tuned quantizations.

VariantAssist Gemma 4 31B GGUF

VariantAssist Gemma 4 31B GGUF is the local-inference release of the VariantAssist Gemma 4 31B LoRA model. The files in this repository are produced by merging the VariantAssist LoRA adapter with Gemma 4 31B IT and converting/quantizing the merged model for llama.cpp-compatible runtimes.

VariantAssist is designed to support structured clinical genetic variant review. It is not a diagnostic device and must not replace a clinician, medical geneticist, laboratory director, or ACMG/AMP-trained reviewer.

Evaluation Protocol

All model scores below are evaluated after the VariantAssist 3-to-5 consensus procedure. For each variant, the model is first run three times. If all three runs return the same pathogenicity level, that level is accepted. If any run differs, two additional runs are performed; a result is accepted only if one pathogenicity level appears at least three times across the five runs. If no level reaches that threshold, the result is marked as no consensus and may be rerun.
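The 3-to-5 consensus procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's own implementation: `run_once` is a hypothetical callable that performs one model run for a variant and returns its pathogenicity level as a string, and the labels used in the example (`"LP"`, `"VUS"`, `"P"`) are placeholders rather than the release's actual output vocabulary.

```python
from collections import Counter
from typing import Callable, List, Optional


def consensus_level(run_once: Callable[[], str]) -> Optional[str]:
    """Return the accepted pathogenicity level, or None for no consensus.

    `run_once` is a hypothetical callable: one model run for one variant.
    """
    # Stage 1: three initial runs; accept only if all three agree.
    levels = [run_once() for _ in range(3)]
    if len(set(levels)) == 1:
        return levels[0]

    # Stage 2: two additional runs, five in total.
    levels += [run_once() for _ in range(2)]

    # Accept a level only if it appears at least three times out of five.
    level, count = Counter(levels).most_common(1)[0]
    return level if count >= 3 else None


def scripted(levels: List[str]) -> Callable[[], str]:
    """Test helper (hypothetical): replay a fixed sequence of run results."""
    iterator = iter(levels)
    return lambda: next(iterator)


if __name__ == "__main__":
    print(consensus_level(scripted(["LP", "LP", "LP"])))
    print(consensus_level(scripted(["LP", "VUS", "LP", "LP", "VUS"])))
    print(consensus_level(scripted(["P", "LP", "VUS", "P", "LP"])))
```

A `None` result corresponds to the no-consensus outcome; whether and how to rerun such a variant is left to the caller.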

No variant ended without consensus in this benchmark. In practical use, no-consensus outcomes have been observed for roughly 1 in 5,000 variants.

Available GGUF Files

| File | Size | Match (ATP7B, of 100) | Quant | Role |
| --- | --- | --- | --- | --- |
| `VA-Gemma4-31B-UD-Q8_0.gguf` | 31 GB | 86 | UDQ | Best current benchmark result |
| `VA-Gemma4-31B-Q4_K_M.gguf` | 18 GB | 85 | LQ | Practical default |
| `VA-Gemma4-31B-Q8_0.gguf` | 31 GB | 83 | LQ | Classic Q8 variant |
| `VA-Gemma4-31B-UD-Q4_K_M.gguf` | 18 GB | 82 | UDQ | Smaller UDQ variant |
| `VA-Gemma4-31B-F16.gguf` | 58 GB | 81 | F16 | Reference GGUF |
| `VA-Gemma4-31B-BF16-00002-of-00002.gguf` | 11 GB | - | BF16 | BF16 export shard |
| `VA-Gemma4-31B-BF16-mmproj.gguf` | 1.2 GB | - | MMProj | Not needed for text-only runs |

UDQ = Unsloth dynamic quantization. LQ = classic llama.cpp quantization. The Unsloth quantized variants were selected/validated on examples with the correct VariantAssist Level-1 input/output structure.

Benchmark Results

The ATP7B benchmark contains 100 Wilson disease variants with consensus labels from five independent expert annotations. The primary ground truth is strict majority consensus.
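To make the scoring concrete, the sketch below derives a strict-majority label from five annotations and counts exact matches against it. It rests on assumptions the card does not spell out: "strict majority" is read as more than half of the annotators (at least three of five), and the label strings and function names are illustrative only. How variants without a strict majority are treated in the real benchmark is not stated here, so the sketch simply returns `None` for them.

```python
from collections import Counter
from typing import Optional, Sequence


def strict_majority(annotations: Sequence[str]) -> Optional[str]:
    """Label chosen by more than half of the annotators, else None."""
    label, count = Counter(annotations).most_common(1)[0]
    return label if count > len(annotations) / 2 else None


def exact_matches(predictions: Sequence[Optional[str]],
                  truth: Sequence[Optional[str]]) -> int:
    """Count variants whose predicted label equals the reference label."""
    return sum(p == t for p, t in zip(predictions, truth))


if __name__ == "__main__":
    # Toy data with placeholder labels, not benchmark data.
    expert_labels = [
        ["P", "P", "P", "LP", "VUS"],
        ["LP", "LP", "LP", "LP", "P"],
        ["VUS", "VUS", "LP", "LP", "P"],
    ]
    truth = [strict_majority(row) for row in expert_labels]
    model = ["P", "P", "VUS"]
    print(truth, exact_matches(model, truth))
```

On a 100-variant benchmark the exact-match count is directly the score reported per file, for example 86 for 86/100.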

Figure: ATP7B benchmark accuracy versus reasoning-token budget.

Reasoning budget is usually an important quality driver for classic quantized models. In this benchmark, the VariantAssist-tuned quantized runs improve accuracy while also reducing the reasoning-token budget compared with the original quantized baseline.

Current highlighted result:

VariantAssist UD-Q8: 86/100 exact matches on the ATP7B benchmark.
No strong errors in the selected released-model comparison.
Expert-consensus reference: on average, an individual expert disagreed with the consensus label on 15 of the 100 variants, equivalent to 85/100 agreement.

Figure: VariantAssist UD-Q8 ATP7B confusion matrix.

Prompts, Schema, And Reproducibility

Use the public prompt archive for reproducible evaluation. It contains the system prompt, schema, annotation rules, and per-variant prompts used for benchmark-style evaluation.

Runtime

Recommended runtime is llama-server from a recent llama.cpp build with Gemma 4 reasoning support.

Recommended server command:

llama-server \
  -m /path/to/VA-Gemma4-31B-Q4_K_M.gguf \
  --no-mmproj \
  --jinja \
  -ngl auto \
  -c 32768 \
  -fa on \
  --swa-full \
  -np 1 \
  --cache-prompt \
  --cache-reuse 256 \
  --slot-prompt-similarity 0.10 \
  --ctx-checkpoints 1 \
  --checkpoint-every-n-tokens 4096 \
  --cache-ram 2048 \
  --kv-unified \
  --cache-type-k f16 \
  --cache-type-v f16 \
  -b 2048 \
  -ub 512 \
  --no-cont-batching \
  --perf \
  --metrics \
  --host 127.0.0.1 \
  --port 8091 \
  --reasoning on \
  --reasoning-budget 8192 \
  -t 24 \
  -tb 24

Small-machine optimization:

-c 8192 --reasoning-budget 4096

What to change:

-m: select the GGUF file.
--host / --port: set your serving endpoint.
-t / -tb: match your CPU thread budget.
-c and --reasoning-budget: reduce on smaller machines if needed.

What to keep for VariantAssist Level-1 runs:

--reasoning on: benchmarked runs use reasoning mode.
--jinja: uses the Gemma chat template.
--no-mmproj: this release is text-only.
--cache-type-k f16 --cache-type-v f16: keeps KV cache quality stable.
--no-cont-batching: keeps single-review behavior predictable.
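For scripted deployments, the split between adjustable and fixed flags can be captured in a small helper. The sketch below is one possible arrangement, not an official launcher: the function name and its parameters are hypothetical, it uses only flags and default values that appear in the recommended command, and it deliberately emits a reduced flag set (the cache, batching-size, and metrics options from the full command are left out and can be appended as needed).

```python
import shlex
from typing import List


def llama_server_command(model_path: str,
                         host: str = "127.0.0.1",
                         port: int = 8091,
                         threads: int = 24,
                         context: int = 32768,
                         reasoning_budget: int = 8192) -> List[str]:
    """Build a reduced llama-server argv from flags listed in this card."""
    return [
        "llama-server",
        # Adjustable per machine ("What to change"):
        "-m", model_path,
        "--host", host, "--port", str(port),
        "-t", str(threads), "-tb", str(threads),
        "-c", str(context),
        "--reasoning-budget", str(reasoning_budget),
        # Kept fixed for Level-1 runs ("What to keep"):
        "--reasoning", "on",
        "--jinja",
        "--no-mmproj",
        "--cache-type-k", "f16", "--cache-type-v", "f16",
        "--no-cont-batching",
    ]


if __name__ == "__main__":
    # Small-machine settings from this card: -c 8192 --reasoning-budget 4096.
    command = llama_server_command("/path/to/VA-Gemma4-31B-Q4_K_M.gguf",
                                   threads=8, context=8192,
                                   reasoning_budget=4096)
    print(shlex.join(command))
```

The returned list can be passed to `subprocess.run` as-is; printing it with `shlex.join` gives a copy-pasteable shell line for review first.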

Reasoning should remain enabled for VariantAssist-style review. In our workflow, no-reasoning runs could generate shorter single responses, but were less reliable in the completed 3-to-5 consensus process and could require reruns.
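Once a server started as described in this section is listening, it can be queried over its OpenAI-compatible chat endpoint. The following is a minimal standard-library sketch under a few assumptions: the host and port match the recommended command (`127.0.0.1:8091`), the system and per-variant prompts are supplied by the caller (for example from the public prompt archive), and no `model` field is sent because a single model is loaded with `-m`. The function names are hypothetical.

```python
import json
import urllib.request


def build_chat_request(system_prompt: str, user_prompt: str,
                       host: str = "127.0.0.1",
                       port: int = 8091) -> urllib.request.Request:
    """Prepare a POST to the server's /v1/chat/completions endpoint."""
    payload = {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    }
    return urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 600.0) -> str:
    """Send the request and return the assistant message text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # Placeholder prompts; real runs would use the archived prompts.
    request = build_chat_request("<system prompt>", "<per-variant prompt>")
    print(request.full_url)
    # print(send(request))  # requires a running llama-server
```

The generous timeout is a guess that allows for long reasoning traces; adjust it to the reasoning budget and hardware in use.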

Intended Use

Use this release for:

local-first VariantAssist review workflows;
structured evidence synthesis for expert review;
JSON-oriented draft outputs;
reproducible local benchmarking with the public ATP7B prompt archive.

Out Of Scope

Do not use this model for:

autonomous diagnosis;
direct patient-facing medical advice;
final ACMG/AMP classification without expert review;
clinical interpretation outside the supplied evidence context;
high-stakes clinical workflows without local validation.

Training Data

The full fine-tuning corpus is not distributed with this release because it may include clinical-context and literature-derived materials requiring separate privacy and licensing review. Public benchmark data, prompt templates, response schema, and de-identified examples are provided separately to support reproducible evaluation.

Model tree for LocusForge/VariantAssist-Gemma4-31B-GGUF

Base model: google/gemma-4-31B
Finetuned: google/gemma-4-31B-it
Adapter: LocusForge/VariantAssist-Gemma4-31B-LoRA
Quantized: LocusForge/VariantAssist-Gemma4-31B-GGUF (this model)