Kidney-Exchange PPO Policy (Nadhi long-horizon agent)

A PPO policy and its Gymnasium environment for the fair and robust kidney exchange problem, produced autonomously by Nadhi, a long-horizon research agent. The agent ran a multi-iteration reinforcement learning campaign and learned a cycle-selection policy that targets three things at once: optimal expected transplants, max-min fairness across sensitization classes, and robustness to edge failure. We call this the (O,F,R) objective.

This artifact is an empirical achievability result, not a proof: it is a learned policy that beats a correctly implemented greedy heuristic on small bounded pools. Every number here came from a real run.

What is in the box

File	What it is
`rl_env.py`	the `KidneyExchangeEnv` Gymnasium environment
`policy.zip`	the trained Stable-Baselines3 PPO model
`run.py`	single entry point, `train` and `infer` subcommands
`train.py`	the original training script
`eval.py`	the original evaluation script
`learning_curve.png`	mean return vs training timesteps, from the real run
`train_metrics.json`	the logged learning-curve points
`rl_achievability.json`	the recorded evaluation summary

The environment

An instance is a directed compatibility graph on n = 20 pairs. Each vertex has a sensitization class in {0, 1, 2, 3}, with one class kept small and highly sensitized so fairness actually matters. Each edge realizes independently with probability p = 0.7, and cycles are bounded to length two and three.
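The instance description above can be sketched in a few lines. This is a minimal illustration, not the code in `rl_env.py`: the names `sample_instance` and `enumerate_cycles`, the edge density, and the class weights are all assumptions made for the example. It samples a random directed graph with a small fourth class and lists every directed cycle of length two or three.

```python
import random


def sample_instance(n=20, edge_density=0.15, seed=0):
    """Hypothetical generator: random classes plus a random directed graph."""
    rng = random.Random(seed)
    # Assumed weights: class 3 plays the small, highly sensitized class.
    classes = rng.choices([0, 1, 2, 3], weights=[4, 4, 3, 1], k=n)
    edges = {
        (u, v)
        for u in range(n)
        for v in range(n)
        if u != v and rng.random() < edge_density
    }
    return classes, edges


def enumerate_cycles(n, edges):
    """List directed cycles of length 2 and 3, each rooted at its smallest vertex."""
    succ = {u: [] for u in range(n)}
    for u, v in edges:
        succ[u].append(v)
    cycles = []
    for a in range(n):
        for b in succ[a]:
            if b <= a:
                continue  # root every cycle at its smallest vertex
            if (b, a) in edges:
                cycles.append((a, b))  # two-cycle a -> b -> a
            for c in succ[b]:
                if c > a and c != b and (c, a) in edges:
                    cycles.append((a, b, c))  # three-cycle a -> b -> c -> a
    return cycles


classes, edges = sample_instance()
print(len(enumerate_cycles(len(classes), edges)), "candidate cycles")
```

Rooting each cycle at its smallest vertex means every cycle is reported exactly once, which keeps the candidate list free of rotated duplicates.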

Action space: Discrete(K + 1) with K = 10. Actions 0 to K-1 select one of the top candidate cycles, action K is STOP.
Observation: a flat vector of per-class unmatched tallies, per-cycle class histograms, a K x K conflict matrix, and availability flags.
Reward: len(cycle) * p ** len(cycle) per committed cycle. This is the expected number of realized transplants, so it carries optimality and robustness at once. At episode end, a fairness bonus of beta * (min over classes of matched fraction) is added.
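The reward terms can be written out directly. The sketch below is illustrative rather than the environment's actual code: the function names are hypothetical, the value of beta is not stated in this README (1.0 is used as a placeholder), and the fairness term is assumed to range only over classes that appear in the instance.

```python
def cycle_reward(cycle, p=0.7):
    """Expected realized transplants for one committed cycle."""
    k = len(cycle)
    return k * p ** k


def fairness_bonus(classes, matched, beta=1.0):
    """beta * min over classes of the fraction of that class's pairs matched.

    `classes[v]` is the class of pair v; `matched` is the set of matched pairs.
    beta=1.0 is a placeholder, not the value used in the real run.
    """
    totals, hits = {}, {}
    for v, c in enumerate(classes):
        totals[c] = totals.get(c, 0) + 1
        if v in matched:
            hits[c] = hits.get(c, 0) + 1
    return beta * min(hits.get(c, 0) / totals[c] for c in totals)


print(round(cycle_reward((0, 1)), 3))     # two-cycle
print(round(cycle_reward((0, 1, 2)), 3))  # three-cycle
```

With p = 0.7 a two-cycle is worth 2 * 0.49 = 0.98 and a three-cycle 3 * 0.343 = 1.029, so a three-cycle is only marginally more valuable than a two-cycle despite using one more pair.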

Results

The published policy.zip comes from the pipeline validation run: PPO with an MLP policy, trained for 200k timesteps on CPU and evaluated over five seeds.

Policy	Mean return
PPO agent	3.69 +/- 0.62
Random	-1.39
Greedy (as evaluated then)	-12.98

Honest note. In that original evaluation the greedy baseline was misimplemented: it did not respect the action mask and kept taking overlapping, invalid cycles, which is why it scored so badly. The environment was then fixed so that the greedy baseline is flawless by construction (always commit the highest-expected-value valid cycle, fall back to STOP). The corrected comparison, from the strongest iteration of the long run, was:

Policy	Mean return
PPO agent	7.59 +/- 0.46
Greedy (flawless)	6.59
Random	1.18

`run.py infer` uses the corrected greedy, so you can reproduce a fair comparison directly. Treat the headline as empirical achievability on small pools; the tractability question for the problem in general is still open.
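For readers who want to see what a mask-respecting greedy looks like, here is a minimal sketch of the rule described above. It is not the baseline shipped in `run.py`: `greedy_episode` is a hypothetical name, the sketch scans all candidate cycles rather than the environment's top K = 10, and it leaves out the episode-end fairness bonus.

```python
def greedy_episode(cycles, p=0.7):
    """Commit the highest expected-value valid cycle until none remain, then STOP.

    A cycle is valid only if it shares no pair with an already committed cycle,
    which is the constraint the action mask encodes.
    """
    used = set()        # pairs already matched
    committed = []      # cycles chosen so far, in order
    total = 0.0         # accumulated per-cycle reward
    while True:
        valid = [c for c in cycles if used.isdisjoint(c)]
        if not valid:
            break       # nothing valid left: fall back to STOP
        best = max(valid, key=lambda c: len(c) * p ** len(c))
        committed.append(best)
        used.update(best)
        total += len(best) * p ** len(best)
    return committed, total


chosen, value = greedy_episode([(0, 1), (1, 2, 3), (4, 5)])
print(chosen, round(value, 3))
```

In this toy call the three-cycle is taken first, which rules out (0, 1) because both use pair 1. A baseline that ignores validity would keep selecting overlapping cycles, which is the failure described in the note above.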

Usage

pip install -r requirements.txt

# inference against random and corrected greedy baselines
python run.py infer --model policy.zip --seeds 42 100 2023 --episodes 10

# train a fresh policy
python run.py train --timesteps 200000

How it was made

Nadhi is a long-horizon research agent. It fetched and indexed a corpus of related papers, ran a roundtable of sub-agents that argue with each other and converge to a consensus, then drove a staged pipeline that designs the setup, builds and validates a real Gymnasium environment, trains with a real library, and evaluates against baselines. A static preflight rejects toy or fabricated code, and a persistence guard keeps the campaign anchored to the original question. The reasoning roles are driven by Gemini 3.1 Pro with custom tool calling. The full run took about six hours and twenty-one iterations.

Citation

@misc{nandakishor2026kidneyexchange,
  title  = {A Long-Horizon AI Agent Tackles a 19-Year-Old Kidney Exchange Problem},
  author = {Nandakishor M},
  year   = {2026},
  note   = {Convai Innovations}
}

Author: Nandakishor M, Convai Innovations, nandakishor@convaiinnovations.com
