Entropy-TRPO Model Weights

PyTorch checkpoints for the TRPO ablation study in the paper "Disentangling Entropy Regularization and Experience Replay in Trust Region Policy Optimization".

Repository layout

Each checkpoint directory contains:

File	Description
`policy.pt`	Policy network state dict
`value.pt`	Value network state dict
`config.json`	Training hyperparameters
`metadata.json`	Paper source, variant flags, final metrics

Hub path: {env_id}/{variant}/latest/ (e.g. CartPole-v1/entrpo/latest/).
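A minimal sketch of fetching one checkpoint directory, assuming the `huggingface_hub` and `torch` packages are installed. The helper names `checkpoint_files` and `load_checkpoint` are hypothetical and not part of the training code; the repo id is expected to come from `HF_REPO_ID`. Only state dicts are loaded, since the network classes live in the training repository.

```python
import json

CHECKPOINT_FILES = ("policy.pt", "value.pt", "config.json", "metadata.json")


def checkpoint_files(env_id: str, variant: str) -> list[str]:
    """Build in-repo paths following the {env_id}/{variant}/latest/ layout."""
    return [f"{env_id}/{variant}/latest/{name}" for name in CHECKPOINT_FILES]


def load_checkpoint(repo_id: str, env_id: str, variant: str) -> dict:
    """Download the four files and return state dicts plus parsed JSON.

    Hypothetical helper. Imports are local so that checkpoint_files() works
    even when huggingface_hub and torch are not installed.
    """
    import torch
    from huggingface_hub import hf_hub_download

    loaded = {}
    for repo_path in checkpoint_files(env_id, variant):
        name = repo_path.rsplit("/", 1)[-1]
        local_path = hf_hub_download(repo_id=repo_id, filename=repo_path)
        if name.endswith(".pt"):
            # Plain state dict; instantiate the matching network separately.
            loaded[name] = torch.load(local_path, map_location="cpu")
        else:
            with open(local_path, encoding="utf-8") as fh:
                loaded[name] = json.load(fh)
    return loaded


print(checkpoint_files("CartPole-v1", "entrpo")[0])
# CartPole-v1/entrpo/latest/policy.pt
```

A typical call would be `load_checkpoint(os.environ["HF_REPO_ID"], "CartPole-v1", "entrpo")`, which needs network access and, for private repos, a valid `HF_TOKEN`.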

The repo README is updated automatically during training with a Training progress table (epoch n/N, eval return, best return, KL) from results/summary.md, plus a JSON index of available checkpoints.

Variant definitions

All trust-region variants share the TRPO surrogate $\max_\theta \mathbb{E}_t[\rho_t(\theta) A_t]$ with $\rho_t(\theta) = \pi_\theta(a_t \mid s_t)/\pi_{\theta_{\text{old}}}(a_t \mid s_t)$ and GAE advantages $A_t$, unless noted below.
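As an illustration, the shared surrogate can be estimated from a batch of log-probabilities and advantages. This is a plain-Python sketch of the sample mean only, not the training code; the function name `surrogate` and the example numbers are made up.

```python
import math


def surrogate(logp_new, logp_old, advantages):
    """Sample mean of rho_t * A_t, with rho_t = exp(logp_new - logp_old)."""
    ratios = [math.exp(n - o) for n, o in zip(logp_new, logp_old)]
    return sum(r * a for r, a in zip(ratios, advantages)) / len(advantages)


# At theta = theta_old every ratio equals 1, so the surrogate reduces to the
# mean advantage of the batch.
logp = [-0.7, -1.2, -0.3]
adv = [1.0, -0.5, 2.0]
print(surrogate(logp, logp, adv))
```

Raising the log-probability of an action with positive advantage increases the estimate, which is the direction the policy update follows inside the trust region.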

Key	Paper name	Objective
`trpo`	TRPO	KL trust region $\bar D_{\mathrm{KL}} \le \delta$ (Schulman et al., 2015)
`entrpo_entropy`	EnTRPO-Entropy	$A_t \leftarrow A_t + \beta\,\mathcal{H}(\pi_\theta(\cdot \mid s_t))$ (Roostaie ablation)
`ero_trpo`	ERO-TRPO	Surrogate includes $\beta\,\mathcal{H}(\pi_\theta)$ (Xu et al., 2024)
`erc_trpo`	ERC-TRPO	Relaxed KL: $\bar D_{\mathrm{KL}} \le \delta + \alpha\,\Delta\mathcal{H}$ (Xu et al., 2024)
`entrpo_buffer`	EnTRPO-Buffer	Roostaie on-policy replay buffer only
`pitis_trpo`	Pitis-TRPO	Importance-weighted replay (Pitis et al., 2019)
`entrpo`	EnTRPO	Entropy in advantage + Roostaie buffer (full method)
`ppo`	PPO	Clipped surrogate + entropy (Schulman et al., 2017)
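The two entropy-related modifications in the table can be sketched for a discrete action space as follows. This is an illustration under assumptions: the values of `beta`, `alpha`, and `delta` are placeholders, the function names are hypothetical, and $\Delta\mathcal{H}$ is passed in as a number because this card does not define how it is computed.

```python
import math


def categorical_entropy(probs):
    """Shannon entropy H(pi(.|s)) of a discrete action distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_augmented_advantage(advantage, probs, beta):
    """entrpo_entropy row: A_t <- A_t + beta * H(pi(.|s_t))."""
    return advantage + beta * categorical_entropy(probs)


def relaxed_kl_bound(delta, alpha, delta_entropy):
    """erc_trpo row: trust-region radius delta + alpha * Delta H."""
    return delta + alpha * delta_entropy


# Uniform two-action policy: H = ln 2, so the advantage gains beta * ln 2.
print(entropy_augmented_advantage(1.0, [0.5, 0.5], beta=0.01))
print(relaxed_kl_bound(delta=0.01, alpha=0.1, delta_entropy=0.05))
```

With `beta = 0` or `alpha = 0` both helpers fall back to the plain TRPO quantities, which is what makes the variants comparable as ablations.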

Older Hub folders (trpo_entropy, trpo_buffer, …) remain valid; training resumes from them automatically.
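One way such a fallback could work is sketched below. The legacy folder names appear in this card, but the pairing with current keys is an assumption made for illustration and should be checked against the training code; `LEGACY_NAMES` and `resolve_variant` are hypothetical names.

```python
# Assumed legacy -> current folder names. Only the legacy names themselves are
# taken from this card; the pairing is a guess, not confirmed by the source.
LEGACY_NAMES = {
    "trpo_entropy": "entrpo_entropy",
    "trpo_buffer": "entrpo_buffer",
    "trpo_replay_pitis": "pitis_trpo",
}


def resolve_variant(requested, available):
    """Return the folder to resume from, or None if no checkpoint exists.

    Prefers the current key and falls back to a legacy alias.
    """
    if requested in available:
        return requested
    for legacy, current in LEGACY_NAMES.items():
        if current == requested and legacy in available:
            return legacy
    return None


print(resolve_variant("pitis_trpo", ["trpo", "trpo_replay_pitis"]))
# trpo_replay_pitis
```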

Variants and paper sources

Variant	Paper
`trpo`	Schulman et al. (2015), Trust Region Policy Optimization, ICML
`entrpo_entropy`	Roostaie & Ebadzadeh (2021), EnTRPO — entropy-in-advantage ablation
`entrpo_buffer`	Roostaie & Ebadzadeh (2021), EnTRPO — replay-buffer ablation
`entrpo`	Roostaie & Ebadzadeh (2021), EnTRPO — full method
`ero_trpo`	Xu et al. (2024), ERO-TRPO
`erc_trpo`	Xu et al. (2024), ERC-TRPO
`pitis_trpo`	Pitis et al. (2019), replay-buffer TRPO
`ppo`	Schulman et al. (2017), Proximal Policy Optimization

See metadata.json in each folder for full author names and URLs.

Usage

Training and evaluation code: GitHub — entropy-trpo (update URL when published).

git clone https://github.com/your-username/entropy-trpo.git
cd entropy-trpo
make setup          # install deps + create .env
# edit .env with HF_TOKEN and HF_REPO_ID
make download-weights
make eval-checkpoints

Citation

@article{entropytrpo2025,
  title   = {Disentangling Entropy Regularization and Experience Replay in TRPO},
  author  = {Anonymous},
  journal = {IEEE Transactions},
  year    = {2025}
}

@article{roostaie2021entrpo,
  title   = {EnTRPO: Trust Region Policy Optimization Method with Entropy Regularization},
  author  = {Roostaie, Sahar and Ebadzadeh, Mohammad Mehdi},
  journal = {arXiv:2110.13373},
  year    = {2021}
}

Training progress

Last updated: 2026-06-22 05:06:48 UTC

Device: cpu
Config: configs/cpu.yaml
Jobs complete: 13/40
Running: 1

CartPole-v1 (1M benchmark)

Variant	Status	Epoch	Timesteps	Eval return	Best	KL
TRPO	done	10/10	50,000	252.6 ± 72.4	293.4	0.0049
EnTRPO-Entropy	done	97/10	49,664	199.4 ± 11.8	251.7	0.0080
ERO-TRPO	done	97/10	49,664	228.4 ± 169.1	464.7	-0.0025
ERC-TRPO	done	97/10	49,664	19.8 ± 4.0	45.2	-0.0000
EnTRPO-Buffer	done	97/10	49,664	43.5 ± 35.2	141.4	0.0095
Pitis-TRPO	done	97/50	49,664	179.0 ± 12.6	322.2	0.0099
EnTRPO	done	97/10	49,664	54.2 ± 22.7	137.4	0.0050
PPO	done	97/10	49,664	204.4 ± 65.0	442.8	0.0051

Humanoid-v5 (1M benchmark)

Variant	Status	Epoch	Timesteps	Eval return	Best	KL
TRPO	done	488/488	999,424	264.6 ± 47.9	325.0	0.0000
EnTRPO-Entropy	running	181/488	370,688	160.2 ± 53.9	202.3	0.0042
ERO-TRPO	done	488/488	999,424	303.4 ± 71.0	356.9	0.0060
ERC-TRPO	done	488/488	999,424	263.5 ± 48.7	302.2	-0.0000
EnTRPO-Buffer	pending	0/488	0	—	—	—
Pitis-TRPO	pending	0/488	0	—	—	—
EnTRPO	done	488/488	999,424	288.8 ± 72.0	348.8	0.0030
PPO	done	488/488	999,424	328.5 ± 61.7	439.1	0.1304

HumanoidStandup-v5 (1M benchmark)

Variant	Status	Epoch	Timesteps	Eval return	Best	KL
TRPO	pending	236/488	483,328	45401.5 ± 2320.2	48008.0	0.0020
EnTRPO-Entropy	pending	0/488	0	—	—	—
ERO-TRPO	pending	0/488	0	—	—	—
ERC-TRPO	pending	0/488	0	—	—	—
EnTRPO-Buffer	pending	0/488	0	—	—	—
Pitis-TRPO	pending	0/488	0	—	—	—
EnTRPO	pending	0/488	0	—	—	—
PPO	pending	0/488	0	—	—	—

Humanoid-v5 (10M benchmark)

Variant	Status	Epoch	Eval return	Best	KL
TRPO	pending	0/4882	—	—	—
EnTRPO-Entropy	pending	0/4882	—	—	—
ERO-TRPO	pending	0/4882	—	—	—
ERC-TRPO	pending	0/4882	—	—	—
EnTRPO-Buffer	pending	0/4882	—	—	—
Pitis-TRPO	pending	0/4882	—	—	—
EnTRPO	pending	0/4882	—	—	—
PPO	pending	0/4882	—	—	—

HumanoidStandup-v5 (10M benchmark)

Variant	Status	Epoch	Eval return	Best	KL
TRPO	pending	0/4882	—	—	—
EnTRPO-Entropy	pending	0/4882	—	—	—
ERO-TRPO	pending	0/4882	—	—	—
ERC-TRPO	pending	0/4882	—	—	—
EnTRPO-Buffer	pending	0/4882	—	—	—
Pitis-TRPO	pending	0/4882	—	—	—
EnTRPO	pending	0/4882	—	—	—
PPO	pending	0/4882	—	—	—
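The progress rows above are tab-separated, so they can be read back into typed fields with a few lines of Python. The sketch below is illustrative (`parse_row` is a hypothetical helper, not part of the repo) and also recovers the rollout size per epoch from a finished run.

```python
def parse_row(header: str, row: str) -> dict:
    """Parse one tab-separated progress row into typed fields."""
    cells = dict(zip(header.split("\t"), row.split("\t")))
    done, total = (int(x) for x in cells["Epoch"].split("/"))
    out = {
        "variant": cells["Variant"],
        "status": cells["Status"],
        "epoch": done,
        "epochs_total": total,
    }
    # The 10M tables have no Timesteps column, so this field is optional.
    if "Timesteps" in cells:
        out["timesteps"] = int(cells["Timesteps"].replace(",", ""))
    # Pending rows show a dash instead of "mean ± std".
    if "±" in cells["Eval return"]:
        mean, std = (float(x) for x in cells["Eval return"].split("±"))
        out["eval_mean"], out["eval_std"] = mean, std
    return out


header = "Variant\tStatus\tEpoch\tTimesteps\tEval return\tBest\tKL"
row = "TRPO\tdone\t488/488\t999,424\t264.6 ± 47.9\t325.0\t0.0000"
parsed = parse_row(header, row)
print(parsed["timesteps"] // parsed["epoch"])
# 2048
```

The same arithmetic gives 512 timesteps per epoch for the CartPole-v1 rows (49,664 / 97).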

Available checkpoints

{
  "CartPole-v1": [
    "entrpo",
    "entrpo_buffer",
    "entrpo_entropy",
    "erc_trpo",
    "ero_trpo",
    "pitis_trpo",
    "ppo",
    "trpo"
  ],
  "Humanoid-v5": [
    "entrpo",
    "entrpo_entropy",
    "erc_trpo",
    "ero_trpo",
    "ppo",
    "trpo",
    "trpo_buffer",
    "trpo_replay_pitis"
  ],
  "HumanoidStandup-v5": [
    "trpo"
  ]
}
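The index can be consumed directly to check whether a checkpoint exists before downloading. A small sketch, using an abbreviated copy of the index above; `has_checkpoint` is a hypothetical helper name.

```python
import json

# Abbreviated copy of the index for illustration.
index_json = """
{
  "CartPole-v1": ["entrpo", "ppo", "trpo"],
  "HumanoidStandup-v5": ["trpo"]
}
"""


def has_checkpoint(index: dict, env_id: str, variant: str) -> bool:
    """True if the index lists a checkpoint folder for this env and variant."""
    return variant in index.get(env_id, [])


index = json.loads(index_json)
print(has_checkpoint(index, "CartPole-v1", "ppo"))         # True
print(has_checkpoint(index, "HumanoidStandup-v5", "ppo"))  # False
```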


Paper for pre63/entropy-trpo-weights: "EnTRPO: Trust Region Policy Optimization Method with Entropy Regularization" (arXiv:2110.13373, published Oct 26, 2021).