Entropy-TRPO Model Weights
PyTorch checkpoints for the TRPO ablation study in Disentangling Entropy Regularization and Experience Replay in Trust Region Policy Optimization.
Repository layout
Each checkpoint directory contains:
| File | Description |
|---|---|
| `policy.pt` | Policy network state dict |
| `value.pt` | Value network state dict |
| `config.json` | Training hyperparameters |
| `metadata.json` | Paper source, variant flags, final metrics |
Hub path: {env_id}/{variant}/latest/ (e.g. CartPole-v1/entrpo/latest/).
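As a minimal sketch of the layout above, the helper below builds the Hub-relative paths of the four files in one checkpoint directory. The name `checkpoint_files` is hypothetical and not part of the repository; only the path pattern and the file names are taken from this README.

```python
from pathlib import PurePosixPath

# File names listed in the repository layout table.
CHECKPOINT_FILES = ("policy.pt", "value.pt", "config.json", "metadata.json")


def checkpoint_files(env_id: str, variant: str) -> dict[str, str]:
    """Map each file name to its Hub path {env_id}/{variant}/latest/{file}.

    Hypothetical helper, for illustration only.
    """
    base = PurePosixPath(env_id) / variant / "latest"
    return {name: str(base / name) for name in CHECKPOINT_FILES}


if __name__ == "__main__":
    for name, path in checkpoint_files("CartPole-v1", "entrpo").items():
        print(f"{name:14s} {path}")
```

Each returned path could then be passed as the `filename` argument of `huggingface_hub.hf_hub_download`, with `repo_id` set to the repository you configured.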
The repo README is updated automatically during training with a Training progress table
(epoch n/N, eval return, best return, KL) from results/summary.md, plus a JSON index of
available checkpoints.
Variant definitions
All trust-region variants share the TRPO surrogate $\max_\theta \mathbb{E}_t[\rho_t(\theta) A_t]$ with $\rho_t(\theta) = \pi_\theta(a_t|s_t)/\pi_{\theta_{\text{old}}}(a_t|s_t)$ and GAE advantages $A_t$, unless noted below.
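The shared surrogate can be estimated from a batch of samples as sketched below. This is an illustration in plain Python, not the repository's implementation; the function name `surrogate` and the list-based inputs are assumptions made for the example.

```python
import math


def surrogate(logp_new, logp_old, advantages):
    """Sample estimate of E_t[rho_t(theta) * A_t].

    rho_t = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t) is computed from
    log-probabilities, which avoids dividing by very small numbers.
    """
    if not advantages:
        raise ValueError("need at least one sample")
    if not (len(logp_new) == len(logp_old) == len(advantages)):
        raise ValueError("inputs must have the same length")
    ratios = [math.exp(n - o) for n, o in zip(logp_new, logp_old)]
    return sum(r * a for r, a in zip(ratios, advantages)) / len(advantages)
```

When the new and old policies coincide, every ratio is 1 and the estimate reduces to the mean advantage.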
| Key | Paper name | Objective |
|---|---|---|
| `trpo` | TRPO | KL trust region $\bar D_{\mathrm{KL}} \le \delta$ (Schulman et al., 2015) |
| `entrpo_entropy` | EnTRPO-Entropy | $A_t \leftarrow A_t + \beta\,\mathcal{H}(\pi_\theta(\cdot \mid s_t))$ (Roostaie ablation) |
| `ero_trpo` | ERO-TRPO | Surrogate includes $\beta\,\mathcal{H}(\pi_\theta)$ (Xu et al., 2024) |
| `erc_trpo` | ERC-TRPO | Relaxed KL: $\bar D_{\mathrm{KL}} \le \delta + \alpha\,\Delta\mathcal{H}$ (Xu et al., 2024) |
| `entrpo_buffer` | EnTRPO-Buffer | Roostaie on-policy replay buffer only |
| `pitis_trpo` | Pitis-TRPO | Importance-weighted replay (Pitis et al., 2019) |
| `entrpo` | EnTRPO | Entropy in advantage + Roostaie buffer (full method) |
| `ppo` | PPO | Clipped surrogate + entropy (Schulman et al., 2017) |
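Two of the objectives in the table can be written out in a few lines. The sketch below assumes a discrete (categorical) policy and uses hypothetical function names; it is not the repository's code. How $\Delta\mathcal{H}$ is computed is not stated in this README, so it is taken as an input rather than derived.

```python
import math


def categorical_entropy(probs):
    """Shannon entropy H(pi(.|s)) in nats of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_augmented_advantages(advantages, action_probs, beta):
    """A_t <- A_t + beta * H(pi_theta(.|s_t)), as in the EnTRPO-Entropy row."""
    return [a + beta * categorical_entropy(p)
            for a, p in zip(advantages, action_probs)]


def relaxed_kl_bound(delta, alpha, delta_entropy):
    """delta + alpha * Delta H, as in the ERC-TRPO row.

    ASSUMPTION: delta_entropy is supplied by the caller; its definition
    is not given in this README.
    """
    return delta + alpha * delta_entropy
```

A deterministic policy has zero entropy, so its advantages are unchanged, while a uniform policy over two actions adds $\beta \ln 2$ to each advantage.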
Older Hub folders (`trpo_entropy`, `trpo_buffer`, …) remain valid; training resumes from them automatically.
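One way such a fallback could work is sketched below. The mapping from legacy to current folder names is an assumption inferred from the names alone, it covers only the two legacy folders named above, and `resolve_variant_folder` is a hypothetical helper rather than the training code's actual logic.

```python
# ASSUMPTION: this legacy-to-current mapping is a guess based on the names;
# the README only states that older folders remain valid.
LEGACY_TO_CURRENT = {
    "trpo_entropy": "entrpo_entropy",
    "trpo_buffer": "entrpo_buffer",
}


def resolve_variant_folder(variant, available):
    """Return the folder to resume from: the current key if present,
    otherwise a legacy folder that maps to it, otherwise None."""
    if variant in available:
        return variant
    for legacy, current in LEGACY_TO_CURRENT.items():
        if current == variant and legacy in available:
            return legacy
    return None
```

Checking the current key first means a legacy folder is only used when no folder under the new name exists.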
Variants and paper sources
| Variant | Paper |
|---|---|
| `trpo` | Schulman et al. (2015), Trust Region Policy Optimization, ICML |
| `entrpo_entropy` | Roostaie & Ebadzadeh (2021), EnTRPO – entropy-in-advantage ablation |
| `entrpo_buffer` | Roostaie & Ebadzadeh (2021), EnTRPO – replay-buffer ablation |
| `entrpo` | Roostaie & Ebadzadeh (2021), EnTRPO – full method |
| `ero_trpo` | Xu et al. (2024), ERO-TRPO |
| `erc_trpo` | Xu et al. (2024), ERC-TRPO |
| `pitis_trpo` | Pitis et al. (2019), replay-buffer TRPO |
| `ppo` | Schulman et al. (2017), Proximal Policy Optimization |
See metadata.json in each folder for full author names and URLs.
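After a checkpoint folder has been downloaded, its two JSON files can be inspected with the standard library alone. The sketch below is illustrative: `load_checkpoint_json` is a hypothetical helper, and it makes no assumption about which keys the files contain.

```python
import json
from pathlib import Path


def load_checkpoint_json(checkpoint_dir):
    """Read config.json and metadata.json from a local checkpoint folder.

    Returns (config, metadata) as plain Python objects; no key names
    are assumed.
    """
    folder = Path(checkpoint_dir)
    with open(folder / "config.json", encoding="utf-8") as f:
        config = json.load(f)
    with open(folder / "metadata.json", encoding="utf-8") as f:
        metadata = json.load(f)
    return config, metadata
```

Printing `metadata` with `json.dumps(metadata, indent=2)` is a quick way to see the paper source and variant flags recorded for a run.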
Usage
Training and evaluation code: GitHub – entropy-trpo (update URL when published).
```bash
git clone https://github.com/your-username/entropy-trpo.git
cd entropy-trpo
make setup   # install deps + create .env
# edit .env with HF_TOKEN and HF_REPO_ID
make download-weights
make eval-checkpoints
```
Citation
```bibtex
@article{entropytrpo2025,
  title   = {Disentangling Entropy Regularization and Experience Replay in TRPO},
  author  = {Anonymous},
  journal = {IEEE Transactions},
  year    = {2025}
}

@article{roostaie2021entrpo,
  title   = {EnTRPO: Trust Region Policy Optimization Method with Entropy Regularization},
  author  = {Roostaie, Sahar and Ebadzadeh, Mohammad Mehdi},
  journal = {arXiv:2110.13373},
  year    = {2021}
}
```
Training progress
Last updated: 2026-06-22 05:06:48 UTC
- Device: `cpu`
- Config: `configs/cpu.yaml`
- Jobs complete: 13/40
- Running: 1
CartPole-v1 (1M benchmark)
| Variant | Status | Epoch | Timesteps | Eval return | Best | KL |
|---|---|---|---|---|---|---|
| TRPO | done | 10/10 | 50,000 | 252.6 ± 72.4 | 293.4 | 0.0049 |
| EnTRPO-Entropy | done | 97/10 | 49,664 | 199.4 ± 11.8 | 251.7 | 0.0080 |
| ERO-TRPO | done | 97/10 | 49,664 | 228.4 ± 169.1 | 464.7 | -0.0025 |
| ERC-TRPO | done | 97/10 | 49,664 | 19.8 ± 4.0 | 45.2 | -0.0000 |
| EnTRPO-Buffer | done | 97/10 | 49,664 | 43.5 ± 35.2 | 141.4 | 0.0095 |
| Pitis-TRPO | done | 97/50 | 49,664 | 179.0 ± 12.6 | 322.2 | 0.0099 |
| EnTRPO | done | 97/10 | 49,664 | 54.2 ± 22.7 | 137.4 | 0.0050 |
| PPO | done | 97/10 | 49,664 | 204.4 ± 65.0 | 442.8 | 0.0051 |
Humanoid-v5 (1M benchmark)
| Variant | Status | Epoch | Timesteps | Eval return | Best | KL |
|---|---|---|---|---|---|---|
| TRPO | done | 488/488 | 999,424 | 264.6 ± 47.9 | 325.0 | 0.0000 |
| EnTRPO-Entropy | running | 181/488 | 370,688 | 160.2 ± 53.9 | 202.3 | 0.0042 |
| ERO-TRPO | done | 488/488 | 999,424 | 303.4 ± 71.0 | 356.9 | 0.0060 |
| ERC-TRPO | done | 488/488 | 999,424 | 263.5 ± 48.7 | 302.2 | -0.0000 |
| EnTRPO-Buffer | pending | 0/488 | 0 | – | – | – |
| Pitis-TRPO | pending | 0/488 | 0 | – | – | – |
| EnTRPO | done | 488/488 | 999,424 | 288.8 ± 72.0 | 348.8 | 0.0030 |
| PPO | done | 488/488 | 999,424 | 328.5 ± 61.7 | 439.1 | 0.1304 |
HumanoidStandup-v5 (1M benchmark)
| Variant | Status | Epoch | Timesteps | Eval return | Best | KL |
|---|---|---|---|---|---|---|
| TRPO | pending | 236/488 | 483,328 | 45401.5 ± 2320.2 | 48008.0 | 0.0020 |
| EnTRPO-Entropy | pending | 0/488 | 0 | – | – | – |
| ERO-TRPO | pending | 0/488 | 0 | – | – | – |
| ERC-TRPO | pending | 0/488 | 0 | – | – | – |
| EnTRPO-Buffer | pending | 0/488 | 0 | – | – | – |
| Pitis-TRPO | pending | 0/488 | 0 | – | – | – |
| EnTRPO | pending | 0/488 | 0 | – | – | – |
| PPO | pending | 0/488 | 0 | – | – | – |
Humanoid-v5 (10M benchmark)
| Variant | Status | Epoch | Timesteps | Eval return | Best | KL |
|---|---|---|---|---|---|---|
| TRPO | pending | 0/4882 | 0 | – | – | – |
| EnTRPO-Entropy | pending | 0/4882 | 0 | – | – | – |
| ERO-TRPO | pending | 0/4882 | 0 | – | – | – |
| ERC-TRPO | pending | 0/4882 | 0 | – | – | – |
| EnTRPO-Buffer | pending | 0/4882 | 0 | – | – | – |
| Pitis-TRPO | pending | 0/4882 | 0 | – | – | – |
| EnTRPO | pending | 0/4882 | 0 | – | – | – |
| PPO | pending | 0/4882 | 0 | – | – | – |
HumanoidStandup-v5 (10M benchmark)
| Variant | Status | Epoch | Timesteps | Eval return | Best | KL |
|---|---|---|---|---|---|---|
| TRPO | pending | 0/4882 | 0 | – | – | – |
| EnTRPO-Entropy | pending | 0/4882 | 0 | – | – | – |
| ERO-TRPO | pending | 0/4882 | 0 | – | – | – |
| ERC-TRPO | pending | 0/4882 | 0 | – | – | – |
| EnTRPO-Buffer | pending | 0/4882 | 0 | – | – | – |
| Pitis-TRPO | pending | 0/4882 | 0 | – | – | – |
| EnTRPO | pending | 0/4882 | 0 | – | – | – |
| PPO | pending | 0/4882 | 0 | – | – | – |
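The epoch totals in the Humanoid and HumanoidStandup tables are consistent with dividing the timestep budget by a fixed number of steps per epoch and rounding down. The sketch below reproduces that arithmetic; the value 2,048 is inferred from the table (999,424 / 488) rather than read from the training config, and `total_epochs` is a hypothetical helper.

```python
def total_epochs(timestep_budget: int, steps_per_epoch: int) -> int:
    """Number of whole epochs that fit in the timestep budget."""
    if steps_per_epoch <= 0:
        raise ValueError("steps_per_epoch must be positive")
    return timestep_budget // steps_per_epoch


# ASSUMPTION: 2048 steps per epoch, inferred from 999,424 / 488 in the table.
STEPS_PER_EPOCH = 2048

print(total_epochs(1_000_000, STEPS_PER_EPOCH))   # 488
print(total_epochs(10_000_000, STEPS_PER_EPOCH))  # 4882
print(488 * STEPS_PER_EPOCH)                      # 999424
```

Because of the rounding, a finished 1M run reports 999,424 timesteps rather than exactly 1,000,000.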
Available checkpoints
```json
{
  "CartPole-v1": [
    "entrpo",
    "entrpo_buffer",
    "entrpo_entropy",
    "erc_trpo",
    "ero_trpo",
    "pitis_trpo",
    "ppo",
    "trpo"
  ],
  "Humanoid-v5": [
    "entrpo",
    "entrpo_entropy",
    "erc_trpo",
    "ero_trpo",
    "ppo",
    "trpo",
    "trpo_buffer",
    "trpo_replay_pitis"
  ],
  "HumanoidStandup-v5": [
    "trpo"
  ]
}
```
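The index above can be checked against the eight variant keys to see which checkpoints are still missing per environment. This is a small illustrative sketch; `missing_variants` is a hypothetical helper, and it compares exact folder names only.

```python
import json

# The eight variant keys from the "Variant definitions" table.
VARIANTS = ["trpo", "entrpo_entropy", "ero_trpo", "erc_trpo",
            "entrpo_buffer", "pitis_trpo", "entrpo", "ppo"]


def missing_variants(index):
    """For each environment, list current variant keys with no folder."""
    result = {}
    for env, found in index.items():
        present = set(found)
        result[env] = [v for v in VARIANTS if v not in present]
    return result


if __name__ == "__main__":
    index = json.loads('{"HumanoidStandup-v5": ["trpo"]}')
    print(missing_variants(index))
```

Folders stored under older names are reported as missing by this check, since it does not translate legacy names to current keys.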