You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Healthcare LLM Assistant - QLoRA fine-tunes

Two parallel QLoRA fine-tunes of small instruct models on the same 9,000-pair mix of public biomedical Q&A, served side-by-side in the parent project's Streamlit UI for a 3-way bake-off against GPT-5.5.

| Variant | Subfolder | Base | Adapter | GGUF (Q4_K_M) |
|---|---|---|---|---|
| Qwen | `qwen/` | `Qwen/Qwen2.5-1.5B-Instruct` | `qwen/qwen-medqa-adapter/` (~82 MB) | `qwen/qwen-medqa-gguf/model.Q4_K_M.gguf` (~941 MB) |
| Llama-3.2 | `llama32/` | `meta-llama/Llama-3.2-1B-Instruct` | `llama32/llama32-medqa-adapter/` (~50 MB) | `llama32/llama32-medqa-gguf/model.Q4_K_M.gguf` (~770 MB) |

Both variants were trained with the same dataset, the same LoRA shape (r=16, α=32, all 7 projection layers) and the same SFT recipe, so any quality gap isolates the base-model effect.
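
As a rough illustration of what "the same LoRA shape" means for adapter size, the sketch below counts LoRA parameters for r=16 on the seven projection layers. The architecture numbers (hidden size, layer count, head counts) are assumptions recalled from the base models' public config files, not values stated in this card, and `lora_params` is a hypothetical helper.

```python
# Rough LoRA parameter count: each adapted linear layer of shape (d_in, d_out)
# gains two low-rank matrices, A (r x d_in) and B (d_out x r).
# ASSUMPTION: the architecture numbers passed in below come from the base
# models' public config files, not from this model card.

def lora_params(hidden, intermediate, layers, heads, kv_heads, head_dim, r=16):
    q_out = heads * head_dim       # query / output projection width
    kv_out = kv_heads * head_dim   # key / value width (grouped-query attention)
    shapes = {
        "q_proj": (hidden, q_out),
        "k_proj": (hidden, kv_out),
        "v_proj": (hidden, kv_out),
        "o_proj": (q_out, hidden),
        "gate_proj": (hidden, intermediate),
        "up_proj": (hidden, intermediate),
        "down_proj": (intermediate, hidden),
    }
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * layers

qwen_params = lora_params(hidden=1536, intermediate=8960, layers=28,
                          heads=12, kv_heads=2, head_dim=128)
llama_params = lora_params(hidden=2048, intermediate=8192, layers=16,
                           heads=32, kv_heads=8, head_dim=64)

for name, n in [("Qwen2.5-1.5B", qwen_params), ("Llama-3.2-1B", llama_params)]:
    print(f"{name}: {n:,} LoRA parameters, ~{n * 4 / 1e6:.0f} MB in fp32")
```

Under these assumptions the count is 18,464,768 parameters for Qwen and 11,272,192 for Llama-3.2, roughly 74 MB and 45 MB in fp32. That is the same ordering as the adapter folder sizes in the table; the folders presumably hold a few extra files that this count ignores.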

Built as part of the COMP8420 (Macquarie University) main project on a healthcare NLP assistant. Companion code: https://github.com/NhatNguyen3001/Healthcare-LLM-Assistant (see the GitHub README for the full system: voice input, PII railguard, multi-agent RAG, and evaluation notebooks).

What is in this repo

.
├── qwen/
│   ├── qwen-medqa-adapter/                  # PEFT LoRA adapter
│   └── qwen-medqa-gguf/
│       ├── model.Q4_K_M.gguf                # Ollama-ready GGUF
│       └── Modelfile                        # Ollama registration recipe
└── llama32/
    ├── llama32-medqa-adapter/               # PEFT LoRA adapter
    └── llama32-medqa-gguf/
        ├── model.Q4_K_M.gguf
        └── Modelfile

The merged-but-unquantized safetensors checkpoint is intentionally not uploaded for either variant. It is redundant for end users: use the GGUF for Ollama, or the adapter for transformers + peft.

Training data

9,000 question-answer pairs (train 8,100 / val 450 / test 450) drawn from six public sources, capped at 1,500 pairs per source for balance:

| Source | Pairs | Notes |
|---|---|---|
| BioASQ (subset of training14b) | ~1,500 | factoid / list / summary biomedical Q&A |
| MedQuAD | ~1,500 | consumer-facing medical questions |
| DrugBank `description` | ~1,500 | "What is X?" templates |
| DrugBank `indication` | ~1,500 | indication / contraindication |
| DrugBank `side_effects` | ~1,500 | side-effect summaries |
| DrugBank `mechanism_of_action` | ~1,500 | MoA explanations |
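
The per-source cap can be sketched as follows. This is a minimal illustration, not the project's actual preprocessing code: the record layout (`source`, `question`, `answer`), the `cap_per_source` helper, and the use of seeded random sampling to pick which pairs survive the cap are all assumptions.

```python
import random
from collections import defaultdict

def cap_per_source(pairs, cap=1500, seed=42):
    """Keep at most `cap` randomly chosen pairs from each source."""
    by_source = defaultdict(list)
    for pair in pairs:
        by_source[pair["source"]].append(pair)
    rng = random.Random(seed)
    capped = []
    for source in sorted(by_source):        # sorted for a reproducible order
        items = by_source[source]
        if len(items) > cap:
            items = rng.sample(items, cap)  # down-sample oversized sources
        capped.extend(items)
    return capped

# Toy data with a hypothetical record layout
toy = [{"source": "MedQuAD", "question": f"q{i}", "answer": f"a{i}"} for i in range(5)]
toy += [{"source": "BioASQ", "question": f"q{i}", "answer": f"a{i}"} for i in range(2)]
capped = cap_per_source(toy, cap=3)
print(len(capped))  # 3 MedQuAD pairs + 2 BioASQ pairs
```

Sources smaller than the cap pass through unchanged, which is why the table lists approximate counts.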

The split is a random 90 / 5 / 5 with seed=42. The JSONL files use the OpenAI messages format; each variant's chat template (Qwen2.5 or Llama-3.2) is applied at training time rather than stored in the JSONL.
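
A minimal sketch of the split and the JSONL row shape, assuming a plain seeded shuffle. The helper names are hypothetical, and the exact shuffling method used in the project is not stated here.

```python
import json
import random

def split_90_5_5(rows, seed=42):
    """Shuffle once with a fixed seed, then slice into train / val / test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_val = n_test = len(rows) * 5 // 100
    n_train = len(rows) - n_val - n_test
    return (rows[:n_train],
            rows[n_train:n_train + n_val],
            rows[n_train + n_val:])

def to_jsonl_line(question, answer):
    """One training example in the OpenAI messages format (no chat template applied)."""
    return json.dumps({"messages": [
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ]})

train_rows, val_rows, test_rows = split_90_5_5(range(9000))
print(len(train_rows), len(val_rows), len(test_rows))
print(to_jsonl_line("What is amoxicillin used for?", "<reference answer>"))
```

Keeping the template out of the JSONL is what lets the same file feed both variants: each run renders the messages with its own base model's template.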

Training setup

Same hyperparameters across both variants:

| Hyperparameter | Value |
|---|---|
| LoRA rank `r` | 16 |
| LoRA alpha | 32 |
| LoRA target modules | all 7 projection layers (q, k, v, o, gate, up, down) |
| Max sequence length | 1024 |
| Per-device batch size | 2 |
| Gradient accumulation | 4 (effective batch = 8) |
| Epochs | 3 |
| Learning rate | 2e-4, cosine schedule |
| Optimizer | `adamw_8bit` |
| Seed | 42 |
| Hardware | RTX 4060 (8 GB, bf16) |
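
A few quantities follow directly from the table. The sketch below derives them, assuming a single GPU, standard LoRA scaling (alpha / r), and that a partial final accumulation window still counts as an optimizer step; trainers differ on that last point, so the step counts are approximate.

```python
import math

per_device_batch = 2
grad_accum = 4
epochs = 3
train_pairs = 8100          # training split size from "Training data"
lora_r, lora_alpha = 16, 32

effective_batch = per_device_batch * grad_accum              # examples per optimizer step
steps_per_epoch = math.ceil(train_pairs / effective_batch)   # ASSUMPTION: partial step counted
total_steps = steps_per_epoch * epochs
lora_scale = lora_alpha / lora_r                             # ASSUMPTION: standard LoRA scaling

print(f"effective batch: {effective_batch}")
print(f"optimizer steps: ~{steps_per_epoch} per epoch, ~{total_steps} total")
print(f"LoRA update scale (alpha / r): {lora_scale}")
```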

Per-variant differences:

| Setting | Qwen | Llama-3.2 |
|---|---|---|
| Base id | `Qwen/Qwen2.5-1.5B-Instruct` (4-bit NF4) | `meta-llama/Llama-3.2-1B-Instruct` (4-bit NF4) |
| Chat template | `qwen-2.5` | `llama-3.2` |
| Wall time (3 epochs) | ~95 min | ~58 min (smaller base) |
| Final train loss | 1.3646 | 1.4843 |
| Best val loss | 1.5536 (~epoch 1.97) | 1.6955 (~epoch 1.97) |

Deployed checkpoints are end-of-epoch-3 for both (the "what a full QLoRA run gives you" baseline, not early-stopped).

Evaluation

Evaluated on the held-out 450-pair test set, with 100 stratified pairs (~17 per source) used as the common comparison sample across all evaluation notebooks.
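
One way to draw such a stratified sample is sketched below. It is an illustration under assumptions, not the notebooks' code: the `source` field, the `stratified_sample` helper, and the rule for distributing the remainder (100 is not divisible by 6) are hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(rows, total=100, seed=42):
    """Draw `total` rows spread as evenly as possible across sources."""
    by_source = defaultdict(list)
    for row in rows:
        by_source[row["source"]].append(row)
    sources = sorted(by_source)
    base, extra = divmod(total, len(sources))   # 100 over 6 sources: 16 each, 4 left over
    rng = random.Random(seed)
    sample = []
    for i, source in enumerate(sources):
        quota = base + (1 if i < extra else 0)  # the first `extra` sources get one more
        sample.extend(rng.sample(by_source[source], quota))
    return sample

# Toy test set: six hypothetical sources with 75 rows each (450 in total)
toy_test = [{"source": f"source_{s}", "id": i} for s in range(6) for i in range(75)]
sample = stratified_sample(toy_test)
per_source = Counter(row["source"] for row in sample)
print(len(sample), sorted(per_source.values()))
```

With six sources this gives four sources 17 pairs and two sources 16, which matches the "~17 per source" figure. A real test split will not have exactly 75 rows per source, so the quota would need clamping if a source were ever smaller than its share.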

Two evaluation passes:

Surface metrics: ROUGE-1/2/L + BERTScore-F1 (with the PubMedBERT backbone)
LLM-as-judge: GPT-5.4 scoring blind on Accuracy / Completeness / Clarity / Safety (0-10), reference-aware
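
To make the surface-metric pass concrete, here is a simplified unigram ROUGE-1 F1 in plain Python. It is a teaching sketch with naive tokenisation, not the evaluation code behind the numbers in this card, and the example sentences are invented for illustration.

```python
import re
from collections import Counter

def rouge1_f1(reference, candidate):
    """Unigram-overlap F1 between a reference answer and a model answer."""
    ref = Counter(re.findall(r"\w+", reference.lower()))
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    overlap = sum((ref & cand).values())   # clipped count of shared unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "Amoxicillin treats bacterial infections."
paraphrase = "Amoxicillin is used for bacterial infections."
wrong_fact = "Amoxicillin treats viral infections."

paraphrase_score = rouge1_f1(reference, paraphrase)
wrong_score = rouge1_f1(reference, wrong_fact)
print(f"correct paraphrase: {paraphrase_score:.2f}")
print(f"wrong fact, same template: {wrong_score:.2f}")
```

In this toy case the factually wrong answer scores higher (0.75) than the correct paraphrase (0.60), because it reuses more of the reference wording. Overlap metrics reward shared phrasing rather than correctness, which is the reason for pairing them with a judge pass.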

3-way results (100 stratified test pairs, seed=42):

Surface metrics (ROUGE + BERTScore with PubMedBERT backbone):

| Metric | GPT-5.5 | QLoRA Qwen | QLoRA Llama-3.2 |
|---|---|---|---|
| ROUGE-1 | 0.2955 | 0.2997 | 0.3049 |
| ROUGE-2 | 0.0907 | 0.1087 | 0.1105 |
| ROUGE-L | 0.1921 | 0.2101 | 0.2046 |
| BERTScore-F1 | 0.8221 | 0.8293 | 0.8272 |

LLM-as-judge (GPT-5.4, 0-10 scale):

| Dimension | GPT-5.5 | QLoRA Qwen | QLoRA Llama-3.2 |
|---|---|---|---|
| Accuracy | 9.26 | 3.57 | 2.77 |
| Completeness | 8.24 | 3.08 | 2.70 |
| Clarity | 9.35 | 6.69 | 6.41 |
| Safety | 9.56 | 5.01 | 4.47 |

Latency:

| Model | Mean latency |
|---|---|
| GPT-5.5 (cloud) | 7.22 s |
| QLoRA Qwen (local, RTX 4060) | 0.98 s |
| QLoRA Llama-3.2 (local, RTX 4060) | 0.63 s |
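
The relative speedups implied by the latency table can be checked with a few lines of arithmetic (values copied from the table; the dictionary keys are shortened labels):

```python
latency_s = {
    "GPT-5.5 (cloud)": 7.22,
    "QLoRA Qwen": 0.98,
    "QLoRA Llama-3.2": 0.63,
}

cloud = latency_s["GPT-5.5 (cloud)"]
speedup = {name: cloud / t for name, t in latency_s.items() if name != "GPT-5.5 (cloud)"}

for name, factor in speedup.items():
    print(f"{name}: {factor:.1f}x faster than the cloud path")
```

This gives roughly 7.4x for Qwen and 11.5x for Llama-3.2.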

Key findings:

Both QLoRA models edge out GPT-5.5 on surface metrics via template substitution on DrugBank-style entries (71+ sibling templates in the training split share the same skeleton). The fine-tunes learn the template and slot-fill entities at inference. A check found 0/450 literal Q+A pair overlap between train and test, so this is template generalization, not memorization.
GPT-5.5 dominates on all judge dimensions. The Accuracy gap is the headline finding: the 1B-scale fine-tunes hallucinate plausible-sounding but factually wrong medical content that ROUGE and BERTScore (even with PubMedBERT) cannot detect.
Between the two local models, Qwen edges out Llama-3.2 on every judge dimension, while Llama-3.2 is faster (0.63 s vs 0.98 s) due to its smaller parameter count.
Both local models are 7-11x faster than the cloud path (7.22 s vs 0.98 s and 0.63 s).
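
The train/test overlap check mentioned in the first finding can be sketched as an exact-match lookup on (question, answer) pairs. The whitespace and case normalisation is an assumption added here, the helper names are hypothetical, and the example pairs are invented to show two entries that share a template without being duplicates.

```python
def normalise(text):
    """Lower-case and collapse whitespace so trivial formatting differences do not hide a match."""
    return " ".join(text.lower().split())

def count_leaks(train_pairs, test_pairs):
    """Number of test (question, answer) pairs that also appear in the training split."""
    seen = {(normalise(q), normalise(a)) for q, a in train_pairs}
    return sum((normalise(q), normalise(a)) in seen for q, a in test_pairs)

train_pairs = [
    ("What is aspirin?", "Aspirin is a nonsteroidal anti-inflammatory drug."),
    ("What is ibuprofen?", "Ibuprofen is a nonsteroidal anti-inflammatory drug."),
]
test_pairs = [
    # Same template, different entity: not a literal duplicate
    ("What is naproxen?", "Naproxen is a nonsteroidal anti-inflammatory drug."),
]

leaks = count_leaks(train_pairs, test_pairs)
print(f"{leaks}/{len(test_pairs)} test pairs also appear in train")
```

A result of zero rules out literal duplication but not shared templates, which is the distinction the finding draws.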

Detailed numbers and charts live in the parent repo:

results/llm_generation_evaluation.csv + llm_generation_eval_chart.png + llm_generation_bertscore_chart.png
results/llm_judge_evaluation.csv + llm_judge_eval_chart.png
results/model_comparison.csv + model_comparison_chart.png
results/qlora_loss_curve.png + results/qlora_source_mix.png

How to use

The examples below use the qwen variant. To use the Llama-3.2 variant, replace qwen with llama32 in the paths.

Option 1: Ollama (recommended for local serving)

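# Fetch one variant's GGUF + Modelfile
# (the repo is gated: accept the access conditions on the model page and run `huggingface-cli login` first)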
huggingface-cli download Davis426/Healthcare-LLM-Assistant \
  --include "qwen/qwen-medqa-gguf/*" \
  --local-dir ./models

# Register with Ollama
cd ./models/qwen/qwen-medqa-gguf
ollama create medqa-qwen -f Modelfile

# Try it
ollama run medqa-qwen "What is amoxicillin used for?"

For the Llama variant, replace every qwen in the paths with llama32 and change the Ollama tag to medqa-llama32.

You can register both side-by-side; one ollama serve daemon handles both tags concurrently (OLLAMA_MAX_LOADED_MODELS defaults to 3).
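
Once both tags are registered, they can be queried from Python through Ollama's local HTTP API. The sketch below uses only the standard library and assumes the default address `http://localhost:11434` and the tags `medqa-qwen` and `medqa-llama32` created above; `build_chat_request` and `ask` are hypothetical helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint

def build_chat_request(model, question):
    """Build a non-streaming chat request for one registered Ollama tag."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "stream": False,  # ask for a single JSON response instead of a token stream
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(model, question, timeout=120):
    """Send the request and return the assistant's reply text (needs `ollama serve` running)."""
    with urllib.request.urlopen(build_chat_request(model, question), timeout=timeout) as resp:
        return json.loads(resp.read())["message"]["content"]

# Building the request does not contact the server, so this line is safe to run anywhere
request = build_chat_request("medqa-qwen", "What is amoxicillin used for?")
print(request.full_url, json.loads(request.data)["model"])
```

With the daemon running, calling `ask("medqa-qwen", ...)` and `ask("medqa-llama32", ...)` with the same question gives a quick side-by-side comparison of the two variants.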

Option 2: transformers + peft (Python)

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pick a variant: the base model and the matching adapter subfolder
base_id    = "Qwen/Qwen2.5-1.5B-Instruct"
subfolder  = "qwen/qwen-medqa-adapter"
# or:
# base_id   = "meta-llama/Llama-3.2-1B-Instruct"
# subfolder = "llama32/llama32-medqa-adapter"
adapter_id = "Davis426/Healthcare-LLM-Assistant"

# Load the base model, then attach the LoRA adapter from the repo subfolder
tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
model = PeftModel.from_pretrained(base, adapter_id, subfolder=subfolder)

messages = [{"role": "user", "content": "What is amoxicillin used for?"}]
# add_generation_prompt=True appends the assistant header so the model
# answers instead of continuing the user turn
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, not the prompt
prompt_len = inputs["input_ids"].shape[1]
print(tokenizer.decode(out[0][prompt_len:], skip_special_tokens=True))

Option 3: llama.cpp directly

huggingface-cli download Davis426/Healthcare-LLM-Assistant \
  --include "qwen/qwen-medqa-gguf/model.Q4_K_M.gguf" --local-dir .

./llama-cli -m qwen/qwen-medqa-gguf/model.Q4_K_M.gguf \
  -p "What is amoxicillin used for?" -n 256

Limitations

Both models are teaching / research artifacts. Do not use for real clinical decisions. Specifically:

Catastrophic forgetting on out-of-distribution prompts. Fine-tuning on a narrow Q&A distribution at the 1-1.5B parameter scale shifts each base model hard. Casual or non-medical questions are answered in a MedQA style, and the base model's general conversational ability is degraded.
Weakened in-context grounding. Every training pair has shape user_question -> answer, with no retrieved-context block. As a result both fine-tuned models partly lose the ability to read RAG passages in the prompt and tend to answer from parametric memory even when correct evidence is supplied. The parent repo's MASS-RAG pipeline retains GPT-5.5 for cases where grounded answers matter; the local models are sidebar-selectable for the comparison experience.
No factual safety net. Both training data and evaluation rely on existing biomedical corpora; the models' knowledge is frozen at training time, and they have no access to an up-to-date drug-interaction database. The parent repo applies a regex-based PII railguard on user input, but model output itself is not safety-filtered beyond what each base model already does.
English only.
Llama-3.2 base licence: Llama-3.2 community licence applies to the Llama variant (acceptance via the gated HF repo); see the Meta licence for permitted uses.
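
The in-context grounding limitation above comes down to a mismatch in prompt shape. The sketch below contrasts the shape seen during fine-tuning with a typical RAG-style prompt. The RAG layout is a hypothetical illustration, not the parent repo's actual template, and the passage text is a placeholder.

```python
def training_style_messages(question):
    """Shape seen during fine-tuning: the bare question, no context block."""
    return [{"role": "user", "content": question}]

def rag_style_messages(question, passages):
    """Hypothetical RAG prompt: numbered passages first, then the question."""
    context = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    prompt = (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer using only the context above."
    )
    return [{"role": "user", "content": prompt}]

question = "What is amoxicillin used for?"
plain = training_style_messages(question)
grounded = rag_style_messages(question, ["<retrieved passage 1>", "<retrieved passage 2>"])

print(plain[0]["content"])
print("---")
print(grounded[0]["content"])
```

The fine-tunes never saw the second shape during training, which is consistent with their tendency to ignore supplied passages.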

License

The fine-tuned adapters and GGUFs in this repo are released under cc-by-nc-4.0 (research and non-commercial use). Base model licences override where stricter: Qwen2.5 is Apache-2.0; Llama-3.2 is under the Meta Llama 3.2 Community Licence. Downstream dataset licences may impose additional restrictions; please consult each source (BioASQ, MedQuAD, DrugBank, MedRAG textbooks) before redistribution.

Citation

If you use or build on this work, please reference:

@misc{comp8420-2026-medqa,
  title  = {Healthcare NLP Assistant: parallel QLoRA fine-tunes of Qwen2.5-1.5B and Llama-3.2-1B for medical Q&A},
  author = {Davis426},
  year   = {2026},
  howpublished = {\url{https://huggingface.co/Davis426/Healthcare-LLM-Assistant}}
}

Built on top of: