PPO Agent playing LunarLander-v2

This is a trained model of a PPO agent playing LunarLander-v2 using the stable-baselines3 library.

Training

When I first started training, I experimented with different hyperparameter values to see if I could find settings that gave better results. I ended up just using the defaults provided by Hugging Face (HF), as the differences in results between those defaults and the defaults from Stable Baselines3 (SB3) were not that large in my experiments.

| Defaults name | n_steps | batch_size | n_epochs | gamma | gae_lambda | ent_coef |
| --- | --- | --- | --- | --- | --- | --- |
| Hugging Face Defaults (hf_defaults) | 1,024 | 64 | 8 | 0.999 | 0.98 | 0.01 |
| SB3 Defaults (sb3_defaults) | 2,048 | 64 | 10 | 0.99 | 0.95 | 0.0 |
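
As a rough sketch, the hf_defaults row corresponds to constructing PPO with roughly these arguments (the environment setup and the MlpPolicy choice here are just illustrative, not my exact training script):

import gymnasium as gym
from stable_baselines3 import PPO

# Illustrative only: PPO configured with the hf_defaults hyperparameters from the table above.
env = gym.make("LunarLander-v2")
model = PPO(
    policy="MlpPolicy",
    env=env,
    n_steps=1024,
    batch_size=64,
    n_epochs=8,
    gamma=0.999,
    gae_lambda=0.98,
    ent_coef=0.01,
    verbose=1,
)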

Models

I decided to train and upload four models to test the following assumptions: 1,000,000 (1M) timesteps would be insufficient, 123,456,789 (123M) timesteps would be excessively time-consuming without a significant improvement in results, and 10,000,000 (10M) timesteps would offer a reasonable balance between training duration and outcome. For the 10M runs I used the defaults from both Hugging Face and Stable Baselines3.

| Number | Model name | Timesteps | Defaults |
| --- | --- | --- | --- |
| 1 | ppo-LunarLander-v2_001_000_000_hf_defaults | 1,000,000 | hf_defaults |
| 2 | ppo-LunarLander-v2_010_000_000_hf_defaults | 10,000,000 | hf_defaults |
| 3 | ppo-LunarLander-v2_010_000_000_sb3_defaults | 10,000,000 | sb3_defaults |
| 4 | ppo-LunarLander-v2_123_456_789_hf_defaults | 123,456,789 | hf_defaults |
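
Training a model then comes down to calling learn with the corresponding number of timesteps and saving the result. A minimal sketch for model 2, reusing the PPO instance constructed above (the save name simply mirrors the table):

# Illustrative: train for 10M timesteps (model 2) and save under its table name.
model.learn(total_timesteps=10_000_000)
model.save("ppo-LunarLander-v2_010_000_000_hf_defaults")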

Evaluation

I evaluated the four models using two approaches:

  • Search: search through a lot of different random environments for a good seed
  • Average: average over a lot of different random environments

The code in evaluate.py shows how the models are evaluated and how the results are stored. All results are included in the evaluation_results.csv file. The reported result is mean_reward - std_reward, but I also store mean_reward, std_reward, seed, and n_envs.
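
To make the stored quantities concrete, a single evaluation for one seed / n_envs combination boils down to something like this (a sketch only, not the actual contents of evaluate.py; the example values and n_eval_episodes=10 are illustrative, and model is a loaded PPO model as in the Usage section below):

from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy

# One evaluation run; `model` is a loaded PPO model (see Usage below).
seed, n_envs = 42, 4  # example values only
eval_env = make_vec_env("LunarLander-v2", n_envs=n_envs, seed=seed)
mean_reward, std_reward = evaluate_policy(
    model, eval_env, n_eval_episodes=10, deterministic=True
)
print(
    {
        "seed": seed,
        "n_envs": n_envs,
        "mean_reward": mean_reward,
        "std_reward": std_reward,
        "result": mean_reward - std_reward,
    }
)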

Results

| Model name | Number of results | Min | Max | Average |
| --- | --- | --- | --- | --- |
| ppo-LunarLander-v2_001_000_000_hf_defaults | 4136 | 144.712 | 269.721 | 240.895 |
| ppo-LunarLander-v2_010_000_000_hf_defaults | 4136 | 130.43 | 305.384 | 270.451 |
| ppo-LunarLander-v2_010_000_000_sb3_defaults | 4136 | 87.9966 | 298.898 | 269.568 |
| ppo-LunarLander-v2_123_456_789_hf_defaults | 4136 | 141.814 | 302.567 | 268.735 |
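
A summary like the one above can be computed from evaluation_results.csv with a few lines of pandas (the column names used here are illustrative; see the CSV header for the actual ones):

import pandas as pd

# Illustrative column names; the actual header is in evaluation_results.csv.
df = pd.read_csv("evaluation_results.csv")
df["result"] = df["mean_reward"] - df["std_reward"]
summary = df.groupby("model_name")["result"].agg(["count", "min", "max", "mean"])
print(summary)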

Conclusion

As suspected, the 1M model performed the worst. I don't think there are significant differences between the two 10M models and the 123M model.

Disclaimer regarding the evaluation result

I don't really like the randomness introduced by the current method of evaluating the model. As you can see, I tested the same model with different seeds and numbers of parallel environments and got quite varying results. I have not manually edited the score upwards, nor used a lower number for n_eval_episodes. The latter would give a better result, as there would be fewer episodes to average over. But, as can be seen in evaluation_results.csv, I did "mine" for a good seed before sharing.

A better way to evaluate the models?

Perhaps we should average over more environments? Wouldn't that give a result less prone to the randomness of the environments? When averaging over environments, the result is much more stable, so I think this could be a better way of evaluating results for use in a leaderboard. In short: use n_eval_episodes=10 and average over at least 10 different random environments, as sketched below.
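
A minimal sketch of that protocol, assuming model is a PPO model loaded as in the Usage section below (the specific seeds are arbitrary):

import numpy as np
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy

# Average the mean_reward - std_reward score over 10 differently seeded environments.
results = []
for seed in range(10):
    eval_env = make_vec_env("LunarLander-v2", n_envs=1, seed=seed)
    mean_reward, std_reward = evaluate_policy(
        model, eval_env, n_eval_episodes=10, deterministic=True
    )
    results.append(mean_reward - std_reward)
    eval_env.close()

print(f"average over {len(results)} environments: {np.mean(results):.2f}")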

Usage (with Stable-baselines3)

import gymnasium as gym
from huggingface_sb3 import load_from_hub
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor

env_id = "LunarLander-v2"

# Download the trained checkpoint from the Hugging Face Hub.
model_fp = load_from_hub(
    "jostyposty/drl-course-unit-01-lunar-lander-v2",
    "ppo-LunarLander-v2_010_000_000_hf_defaults.zip",
)

# Load the PPO model and evaluate it on a monitored LunarLander-v2 environment.
model = PPO.load(model_fp, print_system_info=True)
eval_env = Monitor(gym.make(env_id))
mean_reward, std_reward = evaluate_policy(
    model, eval_env, n_eval_episodes=10, deterministic=True
)
print(f"result: {mean_reward - std_reward:.2f}")
print(f"mean_reward: {mean_reward:.2f} +/- {std_reward:.2f}")