VariantAssist

variantassist.com · GitHub · License

Compatibility note: these VariantAssist-tuned GGUF models are currently intended only for Level-1 Annotation. For other VariantAssist workflow stages, use the original Q8 model rather than these tuned quantizations.

VariantAssist Gemma 4 31B GGUF

VariantAssist Gemma 4 31B GGUF is the local-inference release of the VariantAssist Gemma 4 31B LoRA model. The files in this repository are produced by merging the VariantAssist LoRA adapter with Gemma 4 31B IT and converting/quantizing the merged model for llama.cpp-compatible runtimes.

VariantAssist is designed to support structured clinical genetic variant review. It is not a diagnostic device and must not replace a clinician, medical geneticist, laboratory director, or ACMG/AMP-trained reviewer.

Evaluation Protocol

All model scores below are evaluated after the VariantAssist 3-to-5 consensus procedure. For each variant, the model is first run three times. If all three runs return the same pathogenicity level, that level is accepted. If any run differs, two additional runs are performed; a result is accepted only if one pathogenicity level appears at least three times across the five runs. If no level reaches that threshold, the result is marked as no consensus and may be rerun.

No dissensus/no-consensus cases occurred in this benchmark. In practical use, no-consensus cases have been observed at roughly 1 in 5000 variants.

Available GGUF Files

File Size Match Quant Role
VA-Gemma4-31B-UD-Q8_0.gguf 31 GB 86 UDQ Best current benchmark result
VA-Gemma4-31B-Q4_K_M.gguf 18 GB 85 LQ Practical default
VA-Gemma4-31B-Q8_0.gguf 31 GB 83 LQ Classic Q8 variant
VA-Gemma4-31B-UD-Q4_K_M.gguf 18 GB 82 UDQ Smaller UDQ variant
VA-Gemma4-31B-F16.gguf 58 GB 81 F16 Reference GGUF
VA-Gemma4-31B-BF16-00002-of-00002.gguf 11 GB - BF16 BF16 export shard
VA-Gemma4-31B-BF16-mmproj.gguf 1.2 GB - MMProj Not needed for text-only runs

UDQ = Unsloth dynamic quantization. LQ = classic llama.cpp quantization. The Unsloth quantized variants were selected/validated on examples with the correct VariantAssist Level-1 input/output structure.

Benchmark Results

The ATP7B benchmark contains 100 Wilson disease variants with consensus labels from five independent expert annotations. The primary ground truth is strict majority consensus.

ATP7B benchmark accuracy versus reasoning-token budget

Reasoning budget is usually an important quality driver for classic quantized models. In this benchmark, the VariantAssist-tuned quantized runs improve accuracy while also reducing the reasoning-token budget compared with the original quantized baseline.

Current highlighted result:

  • VariantAssist UD-Q8: 86/100 exact matches on the ATP7B benchmark.
  • No strong errors in the selected released-model comparison.
  • Expert-consensus reference: 15 average expert disagreements, equivalent to 85/100 agreement.

VariantAssist UD-Q8 ATP7B confusion matrix

Prompts, Schema, And Reproducibility

Use the public prompt archive for reproducible evaluation:

That archive contains the system prompt, schema, annotation rules, and per-variant prompts used for benchmark-style evaluation.

Runtime

Recommended runtime is llama-server from a recent llama.cpp build with Gemma 4 reasoning support.

Recommended server command:

llama-server \
  -m /path/to/VA-Gemma4-31B-Q4_K_M.gguf \
  --no-mmproj \
  --jinja \
  -ngl auto \
  -c 32768 \
  -fa on \
  --swa-full \
  -np 1 \
  --cache-prompt \
  --cache-reuse 256 \
  --slot-prompt-similarity 0.10 \
  --ctx-checkpoints 1 \
  --checkpoint-every-n-tokens 4096 \
  --cache-ram 2048 \
  --kv-unified \
  --cache-type-k f16 \
  --cache-type-v f16 \
  -b 2048 \
  -ub 512 \
  --no-cont-batching \
  --perf \
  --metrics \
  --host 127.0.0.1 \
  --port 8091 \
  --reasoning on \
  --reasoning-budget 8192 \
  -t 24 \
  -tb 24

Small-machine optimization:

-c 8192 --reasoning-budget 4096

What to change:

  • -m: select the GGUF file.
  • --host / --port: set your serving endpoint.
  • -t / -tb: match your CPU thread budget.
  • -c and --reasoning-budget: reduce on smaller machines if needed.

What to keep for VariantAssist Level-1 runs:

  • --reasoning on: benchmarked runs use reasoning mode.
  • --jinja: uses the Gemma chat template.
  • --no-mmproj: this release is text-only.
  • --cache-type-k f16 --cache-type-v f16: keeps KV cache quality stable.
  • --no-cont-batching: keeps single-review behavior predictable.

Reasoning should remain enabled for VariantAssist-style review. In our workflow, no-reasoning runs could generate shorter single responses, but were less reliable in the completed 3-to-5 consensus process and could require reruns.

Intended Use

Use this release for:

  • local-first VariantAssist review workflows;
  • structured evidence synthesis for expert review;
  • JSON-oriented draft outputs;
  • reproducible local benchmarking with the public ATP7B prompt archive.

Out Of Scope

Do not use this model for:

  • autonomous diagnosis;
  • direct patient-facing medical advice;
  • final ACMG/AMP classification without expert review;
  • clinical interpretation outside the supplied evidence context;
  • high-stakes clinical workflows without local validation.

Training Data

The full fine-tuning corpus is not distributed with this release because it may include clinical-context and literature-derived materials requiring separate privacy and licensing review. Public benchmark data, prompt templates, response schema, and de-identified examples are provided separately to support reproducible evaluation.

Links

Downloads last month
358
GGUF
Model size
31B params
Architecture
gemma4
Hardware compatibility
Log In to add your hardware

4-bit

8-bit

16-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for LocusForge/VariantAssist-Gemma4-31B-GGUF

Quantized
(1)
this model