Mathematics

POMDP Framing

PolyGuard can be viewed as a partially observable Markov decision process:

M = (S, A, O, T, R, gamma)
  • S: latent patient/regimen state, including risks and unresolved conflicts.
  • A: constrained medication actions emitted through PolyGuardAction.
  • O: observable patient summary, medications, labs, warnings, candidate set, and uncertainty indicators.
  • T: transition dynamics that apply medication changes, evidence updates, dosing holds, taper actions, and review escalation.
  • R: decomposed reward over safety, clinical improvement, dosing quality, process integrity, and anti-cheat penalties.
  • gamma: no explicit discount factor; discounting is implicit in the finite horizon enforced by step budgets and in the efficiency reward.

Action Selection

The policy chooses a candidate action from the legal candidate set:

a_t = pi(o_t, C_t)

where C_t is generated from rule-based clinical candidates and filtered by legality checks. The contextual bandit can rerank candidates before planner selection.
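The selection flow above can be sketched as follows. This is only an illustration under stated assumptions: `select_action`, `is_legal`, and `rerank` are hypothetical names, candidates are represented as plain strings, and the planner is reduced to "take the top-ranked legal candidate". None of this is taken from the PolyGuard code.

```python
from typing import Callable, Optional, Sequence

# Hypothetical type aliases for illustration only.
LegalityCheck = Callable[[str], bool]
Reranker = Callable[[Sequence[str]], Sequence[str]]


def select_action(
    candidates: Sequence[str],
    is_legal: LegalityCheck,
    rerank: Optional[Reranker] = None,
) -> Optional[str]:
    """Pick a_t from the legal candidate set C_t.

    Steps: filter by legality, optionally rerank (the role the
    contextual bandit plays in the text), then let a stand-in
    planner take the first remaining candidate.
    """
    legal = [c for c in candidates if is_legal(c)]
    if not legal:
        return None  # no legal action available
    ranked = list(rerank(legal)) if rerank is not None else legal
    return ranked[0]


# Assumed example data, not real PolyGuard actions.
example_candidates = ["stop_drug_x", "reduce_dose", "request_review"]
example_legality = lambda c: c != "stop_drug_x"
chosen = select_action(example_candidates, example_legality)
```

In this sketch an illegal candidate can never be selected, regardless of how the reranker orders the list, because filtering happens before reranking.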

Reward Aggregation

Reward components are scaled, clamped, and aggregated:

r_t = clamp(sum_i w_i c_i, 0.001, 0.999)

Primary channels are averages over semantically related component scores. Keeping the channels separate makes it possible to debug cases where total reward rises for the wrong reason.
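A minimal sketch of the aggregation formula, assuming components and weights are keyed by name. The function names, the component names, and the weight values are hypothetical; only the clamp bounds 0.001 and 0.999 come from the formula above.

```python
from typing import Mapping

REWARD_MIN = 0.001  # lower clamp bound from the formula above
REWARD_MAX = 0.999  # upper clamp bound from the formula above


def clamp(x: float, lo: float = REWARD_MIN, hi: float = REWARD_MAX) -> float:
    """Restrict x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def aggregate_reward(
    components: Mapping[str, float],
    weights: Mapping[str, float],
) -> float:
    """Compute r_t = clamp(sum_i w_i * c_i, 0.001, 0.999)."""
    total = sum(weights[name] * score for name, score in components.items())
    return clamp(total)


# Assumed example values, chosen only to exercise the formula.
example_weights = {"safety": 0.5, "dosing": 0.5}
mid_reward = aggregate_reward({"safety": 0.8, "dosing": 0.4}, example_weights)
top_reward = aggregate_reward({"safety": 1.0, "dosing": 1.0}, example_weights)
low_reward = aggregate_reward({"safety": 0.0, "dosing": 0.0}, example_weights)
```

With these assumed inputs the weighted sum for `mid_reward` is about 0.6, while `top_reward` and `low_reward` are pinned to the clamp bounds, so the aggregate never reaches exactly 0 or 1.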

Anti-Cheat Penalty

If the anti-cheat detector flags an exploit, anti_cheat_score drops to its minimum (0.001) and the episode can terminate with exploit_detection.

anti_cheat_score = 0.001 if exploit else 0.999
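The rule above could be wired into an episode step roughly as shown here. This is a sketch, not the environment's actual code: `anti_cheat_outcome` and the `terminate_on_exploit` flag are hypothetical, introduced because the text says the episode "can" terminate rather than that it always does.

```python
from typing import Optional, Tuple


def anti_cheat_outcome(
    exploit_detected: bool,
    terminate_on_exploit: bool = True,  # hypothetical switch
) -> Tuple[float, Optional[str]]:
    """Return (anti_cheat_score, termination_reason).

    The score follows the formula above; the termination reason is
    "exploit_detection" only when an exploit is flagged and
    termination is enabled, otherwise None.
    """
    if exploit_detected:
        reason = "exploit_detection" if terminate_on_exploit else None
        return 0.001, reason
    return 0.999, None


clean_step = anti_cheat_outcome(exploit_detected=False)
flagged_step = anti_cheat_outcome(exploit_detected=True)
```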

Uncertainty And Abstention

Uncertainty is estimated from missing data, conflicts, candidate risk, and environment state. Review escalation is rewarded when uncertainty is high and penalized when used as a repeated escape hatch.

calibration = clamp(1 - |confidence - (1 - uncertainty)|)
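A worked sketch of the calibration formula follows. The formula does not state the clamp bounds, so this example assumes the same 0.001 and 0.999 bounds used for reward aggregation; that choice and the function name `calibration_score` are assumptions.

```python
def calibration_score(
    confidence: float,
    uncertainty: float,
    lo: float = 0.001,  # assumed bound, borrowed from the reward clamp
    hi: float = 0.999,  # assumed bound, borrowed from the reward clamp
) -> float:
    """Compute clamp(1 - |confidence - (1 - uncertainty)|).

    The score is highest when stated confidence matches the
    complement of the estimated uncertainty.
    """
    raw = 1.0 - abs(confidence - (1.0 - uncertainty))
    return max(lo, min(hi, raw))


# Confidence matches 1 - uncertainty: raw score 1.0, clamped to the upper bound.
well_calibrated = calibration_score(confidence=0.7, uncertainty=0.3)

# Confident despite high uncertainty: raw score is about 0.5.
overconfident = calibration_score(confidence=0.9, uncertainty=0.6)
```

An agent that reports confidence 0.9 while uncertainty is 0.6 is off by about 0.5 from the target confidence of 0.4, which halves the score under these assumptions.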

Curriculum

Difficulty progresses from short-horizon easy cases to medium and hard cases with more medications, conflicts, missing data, and specialized sub-environments. Starting with easy cases gives the policy a realistic chance of earning non-trivial reward during early training.
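One simple way to express such a progression is an episode-indexed schedule. The sketch below is purely illustrative: the stage boundaries, step budgets, and medication counts are invented placeholder values, and the source does not say whether progression is driven by episode count, performance, or something else.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Stage:
    name: str
    max_steps: int        # step budget (horizon); placeholder value
    max_medications: int  # regimen size cap; placeholder value
    start_episode: int    # first episode at which the stage applies


# Hypothetical schedule; every number here is an assumption.
STAGES: Tuple[Stage, ...] = (
    Stage("easy", max_steps=4, max_medications=3, start_episode=0),
    Stage("medium", max_steps=8, max_medications=6, start_episode=1000),
    Stage("hard", max_steps=12, max_medications=10, start_episode=5000),
)


def stage_for_episode(episode: int) -> Stage:
    """Return the latest stage whose start_episode has been reached."""
    current = STAGES[0]
    for stage in STAGES:
        if episode >= stage.start_episode:
            current = stage
    return current
```

A performance-gated schedule (advance only once average reward passes a threshold) would be an equally plausible design; the episode-count version is used here only because it is the shortest to write down.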