Kidney-Exchange PPO Policy (Nadhi long-horizon agent)
A PPO policy and its Gymnasium environment for the fair and robust kidney exchange problem, produced autonomously by Nadhi, a long-horizon research agent. The agent ran a multi-iteration reinforcement learning campaign and learned a cycle-selection policy that targets three things at once: optimal expected transplants, max-min fairness across sensitization classes, and robustness to edge failure. We call this the (O,F,R) objective.
This artifact is the empirical achievability result. It is not a proof of anything; it is a learned policy that beats a correctly implemented greedy heuristic on small bounded pools. Every number here came from a real run.
What is in the box
| File | What it is |
|---|---|
| `rl_env.py` | the KidneyExchangeEnv Gymnasium environment |
| `policy.zip` | the trained Stable-Baselines3 PPO model |
| `run.py` | single entry point, train and infer subcommands |
| `train.py` | the original training script |
| `eval.py` | the original evaluation script |
| `learning_curve.png` | mean return vs training timesteps, from the real run |
| `train_metrics.json` | the logged learning-curve points |
| `rl_achievability.json` | the recorded evaluation summary |
The environment
An instance is a directed compatibility graph on n = 20 pairs. Each vertex
has a sensitization class in {0, 1, 2, 3}, with one class kept small and
highly sensitized so fairness actually matters. Each edge realizes
independently with probability p = 0.7, and cycles are bounded to length two
and three.
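As a rough illustration of the instance described above, the sketch below samples a directed compatibility graph, enumerates its two- and three-cycles, and simulates edge failure. It is not the code in rl_env.py: the function names, the `density` parameter, and the class weights are assumptions made for illustration, since the card does not state how compatibility edges or class sizes are drawn.

```python
import itertools
import random


def sample_instance(n=20, density=0.2, seed=0):
    """Sample vertex classes and a random directed compatibility graph.

    `density` and the class weights are illustrative assumptions.
    """
    rng = random.Random(seed)
    # Class 3 is made rare to mimic the small, highly sensitized group.
    classes = rng.choices([0, 1, 2, 3], weights=[3, 3, 3, 1], k=n)
    edges = {
        (u, v)
        for u in range(n)
        for v in range(n)
        if u != v and rng.random() < density
    }
    return classes, edges


def bounded_cycles(n, edges):
    """Enumerate directed cycles of length two and three, each exactly once."""
    cycles = []
    for u, v in itertools.combinations(range(n), 2):
        if (u, v) in edges and (v, u) in edges:
            cycles.append((u, v))
    for u, v, w in itertools.combinations(range(n), 3):
        # A vertex triple admits two orientations of a 3-cycle.
        if (u, v) in edges and (v, w) in edges and (w, u) in edges:
            cycles.append((u, v, w))
        if (u, w) in edges and (w, v) in edges and (v, u) in edges:
            cycles.append((u, w, v))
    return cycles


def cycle_survives(cycle, p=0.7, rng=random):
    """A cycle of length k has k edges; it succeeds only if all of them realize."""
    return all(rng.random() < p for _ in cycle)


classes, edges = sample_instance()
cycles = bounded_cycles(20, edges)
print(len(cycles), "candidate cycles")
```

Drawing each cycle's edges separately in `cycle_survives` matches independent edge realization only for vertex-disjoint cycles, which share no edges.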
- Action space: `Discrete(K + 1)` with `K = 10`. Actions `0` to `K-1` select one of the top candidate cycles; action `K` is STOP.
- Observation: a flat vector of per-class unmatched tallies, per-cycle class histograms, a `K x K` conflict matrix, and availability flags.
- Reward: `len(cycle) * p ** len(cycle)` per committed cycle, which is the expected realized transplants and carries optimality and robustness at once, plus an episode-end fairness bonus `beta * min over classes of matched fraction`.
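To make the reward terms concrete, here is a small worked sketch of the two formulas above. The function names are hypothetical, and `beta = 1.0` is an assumed placeholder because the card does not give the value used in training.

```python
def cycle_reward(length, p=0.7):
    """Expected realized transplants for one committed cycle: length * p**length."""
    return length * p ** length


def fairness_bonus(matched, totals, beta=1.0):
    """Episode-end bonus: beta times the minimum matched fraction over classes.

    `beta` is an assumed value; empty classes are skipped.
    """
    fractions = [m / t for m, t in zip(matched, totals) if t > 0]
    return beta * min(fractions) if fractions else 0.0


print(round(cycle_reward(2), 3))  # 0.98
print(round(cycle_reward(3), 3))  # 1.029
# One class left completely unmatched drives the bonus to zero.
print(fairness_bonus(matched=[4, 4, 4, 0], totals=[6, 6, 6, 2]))  # 0.0
```

At p = 0.7 a three-cycle is worth slightly more in total than a two-cycle (1.029 vs 0.98) but less per pair consumed (0.343 vs 0.49), so the per-cycle term already penalizes longer, more fragile cycles.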
Results
The published policy.zip comes from the pipeline validation run: PPO with an MLP
policy, trained for 200k timesteps on CPU and evaluated over five seeds.
| Policy | Mean return |
|---|---|
| PPO agent | 3.69 +/- 0.62 |
| Random | -1.39 |
| Greedy (as evaluated then) | -12.98 |
Honest note. In that original evaluation the greedy baseline was misimplemented: it did not respect the action mask and kept taking overlapping invalid cycles, which is why it scored so badly. After the environment was fixed so the greedy baseline is flawless by construction (always commit the highest expected-value valid cycle, fall back to STOP), the corrected comparison from the strongest iteration of the long run was:
| Policy | Mean return |
|---|---|
| PPO agent | 7.59 +/- 0.46 |
| Greedy (flawless) | 6.59 |
| Random | 1.18 |
run.py infer uses the corrected greedy, so you can reproduce a fair
comparison directly. Treat the headline as empirical achievability on small
pools; the tractability question for the problem in general is still open.
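The corrected greedy rule (commit the highest expected-value valid cycle, otherwise STOP) can be sketched as follows. This is a minimal illustration, not the implementation in run.py: the function name and the representation of cycles as vertex tuples with a set of already matched vertices are assumptions.

```python
def greedy_action(cycles, used, p=0.7, k=10):
    """Return the index of the best valid candidate cycle, or k for STOP.

    cycles: up to k candidate cycles, each a tuple of vertex ids.
    used:   set of vertices already matched by committed cycles.
    """
    best_idx, best_val = k, 0.0
    for idx, cycle in enumerate(cycles[:k]):
        # Respect the action mask: skip cycles overlapping committed vertices.
        if any(v in used for v in cycle):
            continue
        val = len(cycle) * p ** len(cycle)
        if val > best_val:
            best_idx, best_val = idx, val
    return best_idx


candidates = [(0, 1), (1, 2, 3), (4, 5)]
print(greedy_action(candidates, used=set()))            # 1
print(greedy_action(candidates, used={1, 2, 3}))        # 2
print(greedy_action(candidates, used={1, 2, 3, 4, 5}))  # 10 (STOP)
```

Skipping masked cycles is exactly the step the original baseline missed; without it, the rule keeps selecting overlapping cycles.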
Usage
```bash
pip install -r requirements.txt

# inference against random and corrected greedy baselines
python run.py infer --model policy.zip --seeds 42 100 2023 --episodes 10

# train a fresh policy
python run.py train --timesteps 200000
```
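To load the published checkpoint from Python rather than through run.py, a sketch along these lines should work, assuming the `huggingface_sb3` and `stable-baselines3` packages are installed and that the checkpoint in the repository is named policy.zip as in the file list above. The helper name `load_policy` is hypothetical.

```python
def load_policy(repo_id="convaiinnovations/kidney-exchange-ppo",
                filename="policy.zip"):
    """Download the checkpoint from the Hugging Face Hub and load it as a PPO model."""
    # Imported lazily so this module can be imported without the optional packages.
    from huggingface_sb3 import load_from_hub
    from stable_baselines3 import PPO

    # load_from_hub returns the local path of the downloaded file.
    checkpoint_path = load_from_hub(repo_id=repo_id, filename=filename)
    return PPO.load(checkpoint_path)


if __name__ == "__main__":
    model = load_policy()
    print(model.policy)
```

Running the model for inference also needs an observation from KidneyExchangeEnv in rl_env.py; its constructor arguments are not documented here, so that step is left out.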
How it was made
Nadhi is a long-horizon research agent. It fetched and indexed a corpus of related papers, ran a roundtable of sub-agents that argue with each other and converge to a consensus, then drove a staged pipeline that designs the setup, builds and validates a real Gymnasium environment, trains with a real library, and evaluates against baselines. A static preflight rejects toy or fabricated code, and a persistence guard keeps the campaign anchored to the original question. The reasoning roles are driven by Gemini 3.1 Pro with custom tool calling. The full run took about six hours and twenty-one iterations.
Citation
@misc{nandakishor2026kidneyexchange,
title = {A Long-Horizon AI Agent Tackles a 19-Year-Old Kidney Exchange Problem},
author = {Nandakishor M},
year = {2026},
note = {Convai Innovations}
}
Author: Nandakishor M, Convai Innovations, nandakishor@convaiinnovations.com