Kidney-Exchange PPO Policy (Nadhi long-horizon agent)

A PPO policy and its Gymnasium environment for the fair and robust kidney exchange problem, produced autonomously by Nadhi, a long-horizon research agent. The agent ran a multi-iteration reinforcement learning campaign and learned a cycle-selection policy that targets three things at once: optimal expected transplants, max-min fairness across sensitization classes, and robustness to edge failure. We call this the (O,F,R) objective.

This artifact is the empirical achievability result. It is not a proof of anything; it is a learned policy that beats a correctly implemented greedy heuristic on small bounded pools. Every number here came from a real run.

What is in the box

File                    What it is
rl_env.py               the KidneyExchangeEnv Gymnasium environment
policy.zip              the trained Stable-Baselines3 PPO model
run.py                  single entry point, train and infer subcommands
train.py                the original training script
eval.py                 the original evaluation script
learning_curve.png      mean return vs training timesteps, from the real run
train_metrics.json      the logged learning-curve points
rl_achievability.json   the recorded evaluation summary

The environment

An instance is a directed compatibility graph on n = 20 pairs. Each vertex has a sensitization class in {0, 1, 2, 3}, with one class kept small and highly sensitized so fairness actually matters. Each edge realizes independently with probability p = 0.7, and cycles are bounded to length two and three.
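To make the bounded-cycle constraint concrete, here is a minimal sketch of enumerating every directed cycle of length two or three in a compatibility graph. The function name `enumerate_cycles` and the edge-set representation are assumptions made for illustration; rl_env.py may build its candidate list differently.

```python
import itertools

def enumerate_cycles(n, edges):
    """List each directed 2-cycle and 3-cycle once (hypothetical helper).

    n     -- number of patient-donor pairs (vertices 0..n-1)
    edges -- set of directed compatibility edges (u, v)
    Each cycle is a tuple that starts at its smallest vertex.
    """
    cycles = []
    # 2-cycles: both directions must be compatible.
    for u, v in itertools.combinations(range(n), 2):
        if (u, v) in edges and (v, u) in edges:
            cycles.append((u, v))
    # 3-cycles: a vertex triple can be traversed in two orientations.
    for u, v, w in itertools.combinations(range(n), 3):
        if (u, v) in edges and (v, w) in edges and (w, u) in edges:
            cycles.append((u, v, w))
        if (u, w) in edges and (w, v) in edges and (v, u) in edges:
            cycles.append((u, w, v))
    return cycles

# Tiny hand-built graph: one 2-cycle (0 <-> 1) and one 3-cycle (0 -> 1 -> 2 -> 0).
example_edges = {(0, 1), (1, 0), (1, 2), (2, 0)}
example_cycles = enumerate_cycles(3, example_edges)
```

At n = 20 this brute-force scan touches 190 vertex pairs and 1,140 vertex triples, so it stays cheap at the pool size used here.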

  • Action space: Discrete(K + 1) with K = 10. Actions 0 to K-1 select one of the top candidate cycles, action K is STOP.
  • Observation: a flat vector of per-class unmatched tallies, per-cycle class histograms, a K x K conflict matrix, and availability flags.
  • Reward: len(cycle) * p ** len(cycle) per committed cycle. A cycle yields transplants only if all of its edges realize, so this is the expected number of realized transplants and carries optimality and robustness at once. An episode-end fairness bonus of beta * (minimum over classes of the matched fraction) is added on top.
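The reward terms in the list above can be sketched as two small functions. This is an illustration, not the code in rl_env.py: the function names, the treatment of empty classes, and the reading of "matched fraction" as matched count divided by class size are assumptions, and the value of beta is not stated in this document.

```python
def cycle_reward(cycle_len, p=0.7):
    """Expected realized transplants for one committed cycle.

    All cycle_len edges must realize (probability p each, independently),
    so the cycle succeeds with probability p ** cycle_len.
    """
    return cycle_len * p ** cycle_len

def fairness_bonus(matched, totals, beta):
    """Episode-end bonus: beta times the worst-off class's matched fraction.

    matched, totals -- per-class counts of matched and total vertices.
    Classes with no members are skipped (an assumption of this sketch).
    """
    fractions = [m / t for m, t in zip(matched, totals) if t > 0]
    return beta * min(fractions) if fractions else 0.0

two_cycle = cycle_reward(2)    # 2 * 0.7**2, about 0.98
three_cycle = cycle_reward(3)  # 3 * 0.7**3, about 1.029
```

With p = 0.7 a three-cycle is worth only slightly more than a two-cycle (about 1.029 vs 0.98) even though it covers three pairs instead of two, which is how the per-cycle term penalizes fragile long cycles.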

Results

The published policy.zip comes from the pipeline validation run: PPO with an MLP policy, trained for 200k timesteps on CPU and evaluated over five seeds.

Policy                       Mean return
PPO agent                    3.69 +/- 0.62
Random                       -1.39
Greedy (as evaluated then)   -12.98

Honest note. In that original evaluation the greedy baseline was misimplemented: it did not respect the action mask and kept taking overlapping invalid cycles, which is why it scored so badly. After the environment was fixed so the greedy baseline is flawless by construction (always commit the highest-expected-value valid cycle, fall back to STOP), the corrected comparison from the strongest iteration of the long run was:

Policy                       Mean return
PPO agent                    7.59 +/- 0.46
Greedy (flawless)            6.59
Random                       1.18

run.py infer uses the corrected greedy, so you can reproduce a fair comparison directly. Treat the headline as empirical achievability on small pools; the tractability question for the problem in general is still open.
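The corrected greedy rule described in the honest note (commit the highest-expected-value valid cycle, otherwise STOP) can be sketched as follows. The function name `greedy_action` and its arguments are hypothetical; the implementation in run.py may differ in how it represents the mask and breaks ties.

```python
def greedy_action(cycle_values, valid_mask, stop_action):
    """Mask-respecting greedy choice (illustrative sketch).

    cycle_values -- expected value of each of the K candidate cycles
    valid_mask   -- True where the candidate is still available and does
                    not overlap an already committed cycle
    stop_action  -- index of the STOP action (K in this environment)
    Ties go to the lowest index, an assumption of this sketch.
    """
    best, best_value = stop_action, float("-inf")
    for idx, (value, ok) in enumerate(zip(cycle_values, valid_mask)):
        if ok and value > best_value:
            best, best_value = idx, value
    return best

# Candidate 0 has the top value but is masked out, so candidate 2 is chosen.
chosen = greedy_action([1.029, 0.98, 1.029], [False, True, True], stop_action=10)
# With nothing valid left, the baseline falls back to STOP.
fallback = greedy_action([1.029, 0.98], [False, False], stop_action=10)
```

Ignoring `valid_mask` here reproduces the failure mode of the original baseline, which kept selecting overlapping invalid cycles.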

Usage

pip install -r requirements.txt

# inference against random and corrected greedy baselines
python run.py infer --model policy.zip --seeds 42 100 2023 --episodes 10

# train a fresh policy
python run.py train --timesteps 200000

How it was made

Nadhi is a long-horizon research agent. It fetched and indexed a corpus of related papers, ran a roundtable of sub-agents that argue with each other and converge to a consensus, then drove a staged pipeline that designs the setup, builds and validates a real Gymnasium environment, trains with a real library, and evaluates against baselines. A static preflight rejects toy or fabricated code, and a persistence guard keeps the campaign anchored to the original question. The reasoning roles are driven by Gemini 3.1 Pro with custom tool calling. The full run took about six hours and twenty-one iterations.

Citation

@misc{nandakishor2026kidneyexchange,
  title  = {A Long-Horizon AI Agent Tackles a 19-Year-Old Kidney Exchange Problem},
  author = {Nandakishor M},
  year   = {2026},
  note   = {Convai Innovations}
}

Author: Nandakishor M, Convai Innovations, nandakishor@convaiinnovations.com
