Entropy-TRPO Model Weights

PyTorch checkpoints for the TRPO ablation study in "Disentangling Entropy Regularization and Experience Replay in Trust Region Policy Optimization".

Repository layout

Each checkpoint directory contains:

| File | Description |
| --- | --- |
| policy.pt | Policy network state dict |
| value.pt | Value network state dict |
| config.json | Training hyperparameters |
| metadata.json | Paper source, variant flags, final metrics |

Hub path: {env_id}/{variant}/latest/ (e.g. CartPole-v1/entrpo/latest/).
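The layout above can be turned into Hub file paths mechanically. The following is a minimal sketch, not the repository's own loader: the helper names `checkpoint_paths` and `download_checkpoint` are hypothetical, and `repo_id` is left as a parameter to be filled in with the Hub repository you are using. Only `hf_hub_download` from `huggingface_hub` is assumed, and it is imported lazily so the path helper works without it.

```python
from pathlib import PurePosixPath

# Files listed in the repository layout table.
CHECKPOINT_FILES = ("policy.pt", "value.pt", "config.json", "metadata.json")


def checkpoint_paths(env_id: str, variant: str) -> dict[str, str]:
    """Map each checkpoint file name to its path under {env_id}/{variant}/latest/."""
    base = PurePosixPath(env_id) / variant / "latest"
    return {name: str(base / name) for name in CHECKPOINT_FILES}


def download_checkpoint(repo_id: str, env_id: str, variant: str) -> dict[str, str]:
    """Download all four files and return their local cache paths.

    Hypothetical helper: requires `pip install huggingface_hub` and network access.
    """
    from huggingface_hub import hf_hub_download  # lazy import, optional dependency

    return {
        name: hf_hub_download(repo_id=repo_id, filename=path)
        for name, path in checkpoint_paths(env_id, variant).items()
    }


print(checkpoint_paths("CartPole-v1", "entrpo")["policy.pt"])
```

The downloaded `policy.pt` and `value.pt` are state dicts, so they would typically be read with `torch.load(path, map_location="cpu")` and passed to `load_state_dict` on networks built by the training code; the network classes themselves are not described here.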

The repo README is updated automatically during training with a Training progress table (epoch n/N, eval return, best return, KL) from results/summary.md, plus a JSON index of available checkpoints.

Variant definitions

All trust-region variants share the TRPO surrogate $\max_\theta \mathbb{E}_t[\rho_t(\theta) A_t]$ with $\rho_t(\theta) = \pi_\theta(a_t \mid s_t)/\pi_{\theta_{\text{old}}}(a_t \mid s_t)$ and GAE advantages $A_t$, unless noted below.

| Key | Paper name | Objective |
| --- | --- | --- |
| trpo | TRPO | KL trust region $\bar D_{\mathrm{KL}} \le \delta$ (Schulman et al., 2015) |
| entrpo_entropy | EnTRPO-Entropy | $A_t \leftarrow A_t + \beta\,\mathcal{H}(\pi_\theta(\cdot \mid s_t))$ (Roostaie ablation) |
| ero_trpo | ERO-TRPO | Surrogate includes $\beta\,\mathcal{H}(\pi_\theta)$ (Xu et al., 2024) |
| erc_trpo | ERC-TRPO | Relaxed KL: $\bar D_{\mathrm{KL}} \le \delta + \alpha\,\Delta\mathcal{H}$ (Xu et al., 2024) |
| entrpo_buffer | EnTRPO-Buffer | Roostaie on-policy replay buffer only |
| pitis_trpo | Pitis-TRPO | Importance-weighted replay (Pitis et al., 2019) |
| entrpo | EnTRPO | Entropy in advantage + Roostaie buffer (full method) |
| ppo | PPO | Clipped surrogate + entropy (Schulman et al., 2017) |
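To make the entropy-in-advantage objective concrete, here is a minimal sketch of $A_t \leftarrow A_t + \beta\,\mathcal{H}(\pi_\theta(\cdot \mid s_t))$ followed by the sample estimate of the surrogate $\mathbb{E}_t[\rho_t A_t]$. It assumes a discrete (categorical) policy and plain Python lists; the function names and the value of `beta` are illustrative and are not taken from the training code.

```python
import math


def categorical_entropy(probs):
    """Shannon entropy H(pi(.|s)) in nats; zero-probability actions contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_augmented_advantages(advantages, action_probs, beta):
    """Apply A_t <- A_t + beta * H(pi(.|s_t)) to every timestep.

    advantages:   list of A_t (e.g. GAE estimates)
    action_probs: list of per-state action distributions pi(.|s_t)
    beta:         entropy coefficient (illustrative value in the usage below)
    """
    return [a + beta * categorical_entropy(p) for a, p in zip(advantages, action_probs)]


def surrogate(ratios, advantages):
    """Sample mean of rho_t * A_t, the quantity the trust-region step maximises."""
    return sum(r * a for r, a in zip(ratios, advantages)) / len(ratios)


# Two timesteps: a uniform two-action policy and a deterministic one.
advs = entropy_augmented_advantages(
    advantages=[1.0, -0.5],
    action_probs=[[0.5, 0.5], [1.0, 0.0]],
    beta=0.1,
)
print(advs, surrogate([1.0, 1.0], advs))
```

The uniform state receives the largest bonus ($\beta \ln 2$ for two actions), while the deterministic state receives none, so the augmented advantage favours updates that keep the policy stochastic.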

Older Hub folders (trpo_entropy, trpo_buffer, …) remain valid; training resumes from them automatically.

Variants and paper sources

| Variant | Paper |
| --- | --- |
| trpo | Schulman et al. (2015), Trust Region Policy Optimization, ICML |
| entrpo_entropy | Roostaie & Ebadzadeh (2021), EnTRPO (entropy-in-advantage ablation) |
| entrpo_buffer | Roostaie & Ebadzadeh (2021), EnTRPO (replay-buffer ablation) |
| entrpo | Roostaie & Ebadzadeh (2021), EnTRPO (full method) |
| ero_trpo | Xu et al. (2024), ERO-TRPO |
| erc_trpo | Xu et al. (2024), ERC-TRPO |
| pitis_trpo | Pitis et al. (2019), replay-buffer TRPO |
| ppo | Schulman et al. (2017), Proximal Policy Optimization |

See metadata.json in each folder for full author names and URLs.

Usage

Training and evaluation code: entropy-trpo on GitHub (update URL when published).

git clone https://github.com/your-username/entropy-trpo.git
cd entropy-trpo
make setup          # install deps + create .env
# edit .env with HF_TOKEN and HF_REPO_ID
make download-weights
make eval-checkpoints

Citation

@article{entropytrpo2025,
  title   = {Disentangling Entropy Regularization and Experience Replay in TRPO},
  author  = {Anonymous},
  journal = {IEEE Transactions},
  year    = {2025}
}
@article{roostaie2021entrpo,
  title   = {EnTRPO: Trust Region Policy Optimization Method with Entropy Regularization},
  author  = {Roostaie, Sahar and Ebadzadeh, Mohammad Mehdi},
  journal = {arXiv:2110.13373},
  year    = {2021}
}

Training progress

Last updated: 2026-06-22 05:06:48 UTC

  • Device: cpu
  • Config: configs/cpu.yaml
  • Jobs complete: 13/40
  • Running: 1

CartPole-v1 (1M benchmark)

| Variant | Status | Epoch | Timesteps | Eval return | Best return | KL |
| --- | --- | --- | --- | --- | --- | --- |
| TRPO | done | 10/10 | 50,000 | 252.6 ± 72.4 | 293.4 | 0.0049 |
| EnTRPO-Entropy | done | 97/10 | 49,664 | 199.4 ± 11.8 | 251.7 | 0.0080 |
| ERO-TRPO | done | 97/10 | 49,664 | 228.4 ± 169.1 | 464.7 | -0.0025 |
| ERC-TRPO | done | 97/10 | 49,664 | 19.8 ± 4.0 | 45.2 | -0.0000 |
| EnTRPO-Buffer | done | 97/10 | 49,664 | 43.5 ± 35.2 | 141.4 | 0.0095 |
| Pitis-TRPO | done | 97/50 | 49,664 | 179.0 ± 12.6 | 322.2 | 0.0099 |
| EnTRPO | done | 97/10 | 49,664 | 54.2 ± 22.7 | 137.4 | 0.0050 |
| PPO | done | 97/10 | 49,664 | 204.4 ± 65.0 | 442.8 | 0.0051 |

Humanoid-v5 (1M benchmark)

| Variant | Status | Epoch | Timesteps | Eval return | Best return | KL |
| --- | --- | --- | --- | --- | --- | --- |
| TRPO | done | 488/488 | 999,424 | 264.6 ± 47.9 | 325.0 | 0.0000 |
| EnTRPO-Entropy | running | 181/488 | 370,688 | 160.2 ± 53.9 | 202.3 | 0.0042 |
| ERO-TRPO | done | 488/488 | 999,424 | 303.4 ± 71.0 | 356.9 | 0.0060 |
| ERC-TRPO | done | 488/488 | 999,424 | 263.5 ± 48.7 | 302.2 | -0.0000 |
| EnTRPO-Buffer | pending | 0/488 | 0 | - | - | - |
| Pitis-TRPO | pending | 0/488 | 0 | - | - | - |
| EnTRPO | done | 488/488 | 999,424 | 288.8 ± 72.0 | 348.8 | 0.0030 |
| PPO | done | 488/488 | 999,424 | 328.5 ± 61.7 | 439.1 | 0.1304 |

HumanoidStandup-v5 (1M benchmark)

| Variant | Status | Epoch | Timesteps | Eval return | Best return | KL |
| --- | --- | --- | --- | --- | --- | --- |
| TRPO | pending | 236/488 | 483,328 | 45401.5 ± 2320.2 | 48008.0 | 0.0020 |
| EnTRPO-Entropy | pending | 0/488 | 0 | - | - | - |
| ERO-TRPO | pending | 0/488 | 0 | - | - | - |
| ERC-TRPO | pending | 0/488 | 0 | - | - | - |
| EnTRPO-Buffer | pending | 0/488 | 0 | - | - | - |
| Pitis-TRPO | pending | 0/488 | 0 | - | - | - |
| EnTRPO | pending | 0/488 | 0 | - | - | - |
| PPO | pending | 0/488 | 0 | - | - | - |

Humanoid-v5 (10M benchmark)

| Variant | Status | Epoch | Timesteps | Eval return | Best return | KL |
| --- | --- | --- | --- | --- | --- | --- |
| TRPO | pending | 0/4882 | 0 | - | - | - |
| EnTRPO-Entropy | pending | 0/4882 | 0 | - | - | - |
| ERO-TRPO | pending | 0/4882 | 0 | - | - | - |
| ERC-TRPO | pending | 0/4882 | 0 | - | - | - |
| EnTRPO-Buffer | pending | 0/4882 | 0 | - | - | - |
| Pitis-TRPO | pending | 0/4882 | 0 | - | - | - |
| EnTRPO | pending | 0/4882 | 0 | - | - | - |
| PPO | pending | 0/4882 | 0 | - | - | - |

HumanoidStandup-v5 (10M benchmark)

| Variant | Status | Epoch | Timesteps | Eval return | Best return | KL |
| --- | --- | --- | --- | --- | --- | --- |
| TRPO | pending | 0/4882 | 0 | - | - | - |
| EnTRPO-Entropy | pending | 0/4882 | 0 | - | - | - |
| ERO-TRPO | pending | 0/4882 | 0 | - | - | - |
| ERC-TRPO | pending | 0/4882 | 0 | - | - | - |
| EnTRPO-Buffer | pending | 0/4882 | 0 | - | - | - |
| Pitis-TRPO | pending | 0/4882 | 0 | - | - | - |
| EnTRPO | pending | 0/4882 | 0 | - | - | - |
| PPO | pending | 0/4882 | 0 | - | - | - |
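The Epoch and Timesteps columns can be cross-checked with a little arithmetic. The sketch below infers the number of environment steps per epoch from a few Humanoid-v5 (1M benchmark) rows and projects the total for the 4882-epoch runs. The per-epoch step count is derived from the tables, not read from `config.json`, so treat it as an inference rather than a documented hyperparameter; `steps_per_epoch` is a hypothetical helper.

```python
# (variant, epochs completed, timesteps) copied from the Humanoid-v5 (1M benchmark) table.
HUMANOID_1M_ROWS = [
    ("TRPO", 488, 999_424),
    ("EnTRPO-Entropy", 181, 370_688),
    ("ERO-TRPO", 488, 999_424),
]


def steps_per_epoch(epochs_done: int, timesteps: int) -> int:
    """Return timesteps / epochs, insisting that the division is exact."""
    if epochs_done <= 0:
        raise ValueError("no completed epochs to divide by")
    steps, remainder = divmod(timesteps, epochs_done)
    if remainder:
        raise ValueError("timesteps is not a whole multiple of epochs")
    return steps


batch_sizes = {name: steps_per_epoch(e, t) for name, e, t in HUMANOID_1M_ROWS}
print(batch_sizes)  # every row gives 2048 steps per epoch

# If the 10M runs use the same per-epoch step count (an assumption),
# 4882 epochs correspond to this many timesteps:
planned_10m_timesteps = 4882 * batch_sizes["TRPO"]
print(planned_10m_timesteps)  # 9998336
```

The same check applied to the CartPole-v1 rows gives 512 steps per epoch for the 97-epoch runs (49,664 / 97) and 5,000 for the 10-epoch TRPO run, which suggests those runs used different settings.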

Available checkpoints

{
  "CartPole-v1": [
    "entrpo",
    "entrpo_buffer",
    "entrpo_entropy",
    "erc_trpo",
    "ero_trpo",
    "pitis_trpo",
    "ppo",
    "trpo"
  ],
  "Humanoid-v5": [
    "entrpo",
    "entrpo_entropy",
    "erc_trpo",
    "ero_trpo",
    "ppo",
    "trpo",
    "trpo_buffer",
    "trpo_replay_pitis"
  ],
  "HumanoidStandup-v5": [
    "trpo"
  ]
}
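The index above can be compared against the variant keys from the definitions table to see what is missing and which folders still use older names. This is a minimal sketch with hypothetical helper names; it only reports differences and does not guess how an older folder name maps onto a current key.

```python
import json

# Variant keys from the "Variant definitions" table.
CANONICAL_VARIANTS = {
    "trpo", "entrpo_entropy", "ero_trpo", "erc_trpo",
    "entrpo_buffer", "pitis_trpo", "entrpo", "ppo",
}

# The checkpoint index as published above.
INDEX_JSON = """
{
  "CartPole-v1": ["entrpo", "entrpo_buffer", "entrpo_entropy", "erc_trpo",
                  "ero_trpo", "pitis_trpo", "ppo", "trpo"],
  "Humanoid-v5": ["entrpo", "entrpo_entropy", "erc_trpo", "ero_trpo",
                  "ppo", "trpo", "trpo_buffer", "trpo_replay_pitis"],
  "HumanoidStandup-v5": ["trpo"]
}
"""


def non_canonical_folders(index):
    """Folders per environment whose names are not current variant keys."""
    return {env: sorted(set(v) - CANONICAL_VARIANTS) for env, v in index.items()}


def missing_variants(index):
    """Current variant keys per environment with no folder under that name."""
    return {env: sorted(CANONICAL_VARIANTS - set(v)) for env, v in index.items()}


index = json.loads(INDEX_JSON)
print(non_canonical_folders(index)["Humanoid-v5"])
print(missing_variants(index)["Humanoid-v5"])
```

For Humanoid-v5 this reports `trpo_buffer` and `trpo_replay_pitis` as older-style folder names, and `entrpo_buffer` and `pitis_trpo` as keys without a folder of that exact name.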