# Idea document and participant guide — implementation map

This document maps your polypharmacy / OpenEnv design notes and the typical hackathon submission requirements to files in this repository.
## Submission narrative (required bullets)
| Requirement | Status | Where |
|---|---|---|
| Problem statement | Documented + implemented | Root README.md, polyguard-rl/README.md, docs/safety.md |
| Environment (agent operates here) | Implemented | PolyGuardEnv, app/env/env_core.py, app/env/fastapi_app.py, openenv.yaml, server/app.py |
| Agent capabilities | Implemented | app/agents/, docs/agents.md |
| Tasks | Implemented | Scenario JSONL under data/scenarios/, presets in app/env/catalog.py |
| Reward / evaluation logic | Implemented | app/env/reward_router.py, app/env/verifier.py, configs/rewards.yaml, docs/reward_design.md, docs/evaluation.md |
| Post-training / self-improvement | Implemented | scripts/train_sft_trl.py, scripts/train_grpo_trl.py, app/training/grpo_trl.py, docs/training.md |
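The environment contract referenced above (reset / step with a per-sub-env step cap) can be sketched as follows. This is a minimal illustration, not the actual `PolyGuardEnv` API; the class, action format, and reward values here are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    observation: dict
    reward: float
    done: bool


class ToyRegimenEnv:
    """Illustrative reset/step environment with a max-step safety cap."""

    def __init__(self, max_steps: int = 8):
        self.max_steps = max_steps
        self._steps = 0
        self._regimen: list[str] = []

    def reset(self, scenario: dict) -> dict:
        # Load the scenario's starting regimen and clear the step counter.
        self._steps = 0
        self._regimen = list(scenario.get("drugs", []))
        return {"regimen": list(self._regimen), "steps_left": self.max_steps}

    def step(self, action: dict) -> StepResult:
        self._steps += 1
        if action.get("op") == "remove" and action.get("drug") in self._regimen:
            self._regimen.remove(action["drug"])
            reward = 1.0  # e.g. deprescribing a flagged drug
        else:
            reward = 0.0
        # Episode ends when the step budget is exhausted or nothing remains.
        done = self._steps >= self.max_steps or not self._regimen
        return StepResult({"regimen": list(self._regimen)}, reward, done)
```

The step cap mirrors the "max steps per sub-env" safety guard listed in the plan table below: the episode terminates unconditionally once the budget is spent.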
## Your “Plan” sections vs. the codebase
| Plan item | Status | Notes |
|---|---|---|
| OpenEnv reset / step / state, timeouts, safety | Done | env_core.py, fastapi_app.py, max steps per sub-env, anti_cheat.py |
| Local + remote execution | Done | Local FastAPI + docker-compose.yml, HF Space via scripts/deploy_space_api.py, Dockerfile.space, docker/space/ |
| Specific envs: DDI, bandit mining, regimen risk | Done | SubEnvironment enum, transitions in app/env/transition.py |
| Precision dosing, deprescribing, web search, alternatives, new drug (hard) | Done | Matching enum values + scenario tracks; “new drug” is NEW_DRUG_DECOMPOSITION |
| Multiple reward functions + anti-hacking | Done | 13 components → 4 channels; anti-cheat and tests in tests/ |
| TRL + Unsloth, metrics, generations | Done | TRL scripts + reports; Unsloth optional (--use-unsloth); app/training/metrics.py |
| Post-training + inference | Done | merge + test_inference_postsave.py, active manifest / API path |
| Product / Space demo, UI | Done | FastAPI app/api/, React app/ui/frontend/, Space deployment scripts |
| Benchmarks + plots + sample generations | Done | scripts/evaluate_*.py, docs/results/, scripts/generate_submission_evidence.py |
| Deploy: OpenEnv, container, HF Space | Done | See docs/deployment.md |
| Easy / medium / hard | Done | scenarios_easy.jsonl, scenarios_medium.jsonl, scenarios_hard.jsonl |
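The "13 components → 4 channels" reward routing in the plan table can be sketched like this. The component names, channel names, and weights below are invented for illustration; the authoritative definitions live in configs/rewards.yaml and app/env/reward_router.py.

```python
# Hypothetical mapping of fine-grained reward components onto channels.
CHANNELS = {
    "safety":      ["ddi_flagged", "contraindication", "overdose_guard"],
    "correctness": ["right_dose", "right_alternative"],
    "process":     ["used_verifier", "step_efficiency"],
    "anti_hack":   ["format_penalty", "repeat_penalty"],
}


def route_rewards(components: dict[str, float],
                  weights: dict[str, float]) -> dict[str, float]:
    """Sum each channel's components, then apply a per-channel weight."""
    out = {}
    for channel, names in CHANNELS.items():
        total = sum(components.get(n, 0.0) for n in names)
        out[channel] = weights.get(channel, 1.0) * total
    return out
```

Keeping channels separate (rather than collapsing to one scalar up front) makes it easier to spot reward hacking: a run that scores well overall but negatively on the anti-hack channel is immediately visible.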
## Themes (world modeling, multi-agent, self-improvement)
| Theme | Status | Notes |
|---|---|---|
| World modeling / professional tasks | Primary fit | Stateful regimen, verifiers, tool-like actions |
| Multi-agent | Partial | Supervisor/orchestrator and policy stack (app/agents/orchestrator.py, supervisor_agent.py); not a separate multi-player env |
| Self-improving systems | Via GRPO | Environment-backed RLVR-style training, not online self-play |
## “What to submit” checklist
| Deliverable | Status |
|---|---|
| GitHub repo + URLs in README | Root + polyguard-rl/README.md |
| HF Space URL | In README |
| Points from doc | docs/participant_guide_traceability.md, this file |
| Colab | PolyGuard_SFT_GRPO_One_Run_Runner.ipynb, notebooks/09_training_loop.ipynb |
| Video or blog | README links the blog; publish the draft in docs/hf_blog_draft.md or swap in the final URL |
## Future ideas from your notes (not claimed as done)
- Medicine images / barcodes: listed under Future Work in README.
- Web search agents: the sub-env WEB_SEARCH_MISSING_DATA exists; a “full web agent product” is beyond current scope.
## Fresh clone reminder
Generated data and many outputs/ reports are produced by scripts (see scripts/bootstrap_data.py and the REQUIRED_ARTIFACTS list in scripts/acceptance_gate.py). Run the bootstrap/build pipeline before expecting the strict POLYGUARD_ENFORCE_SUBMISSION_LINKS=true acceptance gate to pass on a fresh clone.