
# DQN Agent playing LunarLander-v2

This is a trained model of a DQN agent playing LunarLander-v2 using the stable-baselines3 library.

I optimized the model's hyperparameters with Optuna, using mean_rew - std_rew as the objective. However, I made a mistake: I evaluated with the same number of envs as used for training, which gave better-looking results than evaluating on a single env as done below. I still used those tuned hyperparameters for the final training of the model (the code below), but this time I also added an EvalCallback to save the best model. I had to run the training twice, because the first time it got trapped in a quite bad local minimum.
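For illustration, here is a minimal sketch of such an Optuna search, not the exact script I used; the per-trial budget, the sampled ranges, and the single-env evaluation (i.e. the setup without the mistake) are assumptions:

```python
import gym
import optuna
from stable_baselines3 import DQN
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy


def objective(trial: optuna.Trial) -> float:
    # Sample a few of the hyperparameters discussed below (ranges are illustrative).
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True),
        "max_grad_norm": trial.suggest_float("max_grad_norm", 0.3, 5.0),
        "target_update_interval": trial.suggest_int("target_update_interval", 100, 10_000),
    }
    env = make_vec_env("LunarLander-v2", n_envs=16)
    model = DQN("MlpPolicy", env, **params, verbose=0)
    model.learn(total_timesteps=50_000)  # short per-trial budget (assumption)

    # Evaluate on a single env to avoid the inflated scores mentioned above.
    eval_env = gym.make("LunarLander-v2")
    mean_rew, std_rew = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
    return mean_rew - std_rew  # the objective used for the search


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
```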

The three most important hyperparameters were:

  1. Learning rate (perhaps because I specified too wide a range, 1e-5 to 1e-2; I didn't see much difference between 1e-4 and 1e-3, though).
  2. Max grad norm - this one is indeed crucial: it stabilizes training, and lower values don't let the network diverge after one unlucky step.
  3. Target update interval - another important parameter, IMO. It controls the trade-off between how frequently you update the target network (aka the network you use to estimate the TD target, so the closer its parameters are to the online network's, the closer the estimate) and how stable the training is. Setting this value too low leads to unstable training, because the moment your policy network gets close to matching the target's Q-value estimates, you copy its parameters into the target network, and the target moves away again (the TD target is written out below the list).
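
For reference, this is the standard DQN TD target that the target network provides; $\theta^{-}$ denotes the target network's parameters, which are copied from the online network every `target_update_interval` environment steps:

$$
y = r + \gamma \max_{a'} Q_{\theta^{-}}(s', a')
$$

A short interval keeps $Q_{\theta^{-}}$ close to the online network (a better estimate), but it also means the target $y$ keeps moving, which is the instability described above.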

## Usage (with Stable-baselines3)

```python
import gym
import torch

from huggingface_sb3 import load_from_hub, package_to_hub, push_to_hub
from huggingface_hub import notebook_login  # Log in to the Hugging Face account to be able to upload models to the Hub.

from stable_baselines3 import PPO, DQN
from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.env_util import make_vec_env

# Hyperparameters found by the Optuna search described above
params = {
    'learning_rate': 0.0001599504838637104, 'buffer_size': 683593, 'batch_size': 128, 'train_freq': 14,
    'exploration_final_eps': 0.07019679001836276, 'target_update_interval': 183, 'max_grad_norm': 0.314826407057672,
    'learning_starts': 0,
    'gradient_steps': -1,
    'exploration_fraction': 0.2,
    'gamma': 0.99,
    'policy_kwargs': {
        'net_arch': [256] * 2,
        'activation_fn': torch.nn.ReLU
    }
}
# 16 parallel envs for training, as in the Optuna search
env = make_vec_env('LunarLander-v2', n_envs=16)
model = DQN('MlpPolicy', env, **params, verbose=1)

# Separate single env for evaluation; EvalCallback saves the best checkpoint
eval_env = gym.make('LunarLander-v2')
eval_callback = EvalCallback(eval_env, n_eval_episodes=10, deterministic=True,
                             best_model_save_path="./logs/best_model", eval_freq=1250)
model.learn(total_timesteps=300_000, callback=eval_callback, progress_bar=True)
model_name = "dqn-LunarLander-v2"
model.save(model_name)
model = DQN.load('logs/best_model/best_model')  # reload the best checkpoint saved by EvalCallback
...
```
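
For completeness, a minimal sketch of evaluating the reloaded best model on a single env (the honest evaluation setup mentioned above); the episode count is an assumption:

```python
import gym
from stable_baselines3 import DQN
from stable_baselines3.common.evaluation import evaluate_policy

# Assumed path: the best checkpoint saved by EvalCallback during training.
model = DQN.load("logs/best_model/best_model")

# Evaluate on a single env, matching the evaluation setup described above.
eval_env = gym.make("LunarLander-v2")
mean_rew, std_rew = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print(f"mean_reward={mean_rew:.2f} +/- {std_rew:.2f}")
```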