|
--- |
|
library_name: stable-baselines3 |
|
tags: |
|
- LunarLander-v2 |
|
- deep-reinforcement-learning |
|
- reinforcement-learning |
|
- stable-baselines3 |
|
model-index: |
|
- name: DQN |
|
results: |
|
- task: |
|
type: reinforcement-learning |
|
name: reinforcement-learning |
|
dataset: |
|
name: LunarLander-v2 |
|
type: LunarLander-v2 |
|
metrics: |
|
- type: mean_reward |
|
value: 261.75 +/- 24.62 |
|
name: mean_reward |
|
verified: false |
|
--- |
|
|
|
# **DQN** Agent playing **LunarLander-v2** |
|
This is a trained model of a **DQN** agent playing **LunarLander-v2** |
|
using the [stable-baselines3 library](https://github.com/DLR-RM/stable-baselines3). |
|
|
|
Some of the hyperparameters used are listed below: |
|
|
|
| Hyperparameters | Value | |
|
| --- | --- | |
|
| `Learning rate` | 0.0002 | |
|
| `Batch size` | 128 | |
|
| `Buffer size` | 100000 | |
|
|
|
## Usage (with Stable-baselines3) |
|
|
|
|
|
```python |
|
# ---------------------- Libraries ------------------------------ |
|
from huggingface_sb3 import load_from_hub, package_to_hub, push_to_hub |
|
from huggingface_hub import ( |
|
notebook_login, |
|
) # To log to our Hugging Face account to be able to upload models to the Hub. |
|
|
|
import gym  # Classic Gym API; needed for gym.make() in the evaluation step |
|
from stable_baselines3 import DQN |
|
from stable_baselines3.common.evaluation import evaluate_policy |
|
from stable_baselines3.common.env_util import make_vec_env |
|
from stable_baselines3.common.utils import set_random_seed |
|
|
|
# ---------------------------- Main -------------------------------------- |
|
|
|
# Give each newly trained model its own name |
|
model_name = "dqn-LunarLander-v2-seed42-12" |
|
|
|
# Set the seed for the environments and the random number generators |
|
Seed = 42 |
|
set_random_seed(Seed) |
|
|
|
|
|
# Create the training environment |
|
|
|
env = make_vec_env("LunarLander-v2", n_envs=16)  # Vectorized env with 16 environment copies |
|
env.seed(Seed)  # Seed the training environment |
|
|
|
# Define the agent and its hyperparameters |
|
|
|
# When tuning, focus on: learning rate, buffer size, batch size |
|
model = DQN( |
|
    policy="MlpPolicy", |
|
    env=env, |
|
    learning_rate=0.0002, |
|
    learning_starts=0,  # Steps collected before learning starts |
|
    batch_size=128,  # Minibatch size for each gradient update |
|
    buffer_size=100000,  # Size of the replay buffer |
|
    gamma=0.99,  # Discount factor (0.99 is the default) |
|
    train_freq=4,  # Update the model every 4 steps |
|
    target_update_interval=15,  # Update the target network every 15 environment steps |
|
    gradient_steps=4,  # Gradient steps to take after each rollout |
|
    exploration_fraction=0.08,  # Fraction of training over which the exploration rate is reduced |
|
    exploration_final_eps=0.05,  # Final probability of taking a random action |
|
    verbose=1,  # Logging verbosity (0, 1 or 2) |
|
    optimize_memory_usage=False, |
|
    seed=Seed, |
|
) |
|
|
|
# Train the agent |
|
|
|
model.learn(total_timesteps=5000000) |
|
|
|
# Save the model |
|
model.save(path="Historial/" + model_name) |
|
|
|
# Create the evaluation environment |
|
eval_env = gym.make("LunarLander-v2") |
|
eval_env.seed(2 * Seed)  # Evaluation seed, different from the training seed |
|
|
|
# Evaluate the trained agent |
|
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True) |
|
|
|
# Save the evaluation results |
|
with open("Historial/" + model_name + ".txt", "w") as f: |
|
    f.write(f"mean_reward={mean_reward:.2f} +/- {std_reward:.2f}") |
|
``` |
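
The `exploration_fraction` and `exploration_final_eps` settings above define how quickly the agent moves from random to greedy actions. The following is a minimal pure-Python sketch of a linear decay with those values. It assumes the starting epsilon is 1.0 (not set explicitly in the script), and the function name `linear_epsilon` is hypothetical, used only for illustration.

```python
def linear_epsilon(step, total_timesteps=5_000_000, exploration_fraction=0.08,
                   initial_eps=1.0, final_eps=0.05):
    """Epsilon after `step` timesteps under a linear decay schedule.

    initial_eps=1.0 is an assumption; the other defaults come from the script.
    """
    progress = step / total_timesteps
    if progress >= exploration_fraction:
        # Decay phase is over: epsilon stays at its final value
        return final_eps
    # Linear interpolation between the initial and final epsilon
    return initial_eps + progress * (final_eps - initial_eps) / exploration_fraction


for step in (0, 200_000, 400_000, 5_000_000):
    print(f"step {step:>9,}: epsilon = {linear_epsilon(step):.3f}")
```

Under these assumptions the decay ends after 8% of the 5,000,000 timesteps, that is, after 400,000 timesteps, and epsilon stays at 0.05 for the rest of training.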
|
|
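
To get a feel for how `n_envs`, `train_freq`, `gradient_steps` and `batch_size` interact, the sketch below works through the arithmetic for the values in the script. It assumes that `train_freq` is counted in vectorized steps and that each such step adds one transition per environment copy; the function name `training_budget` is hypothetical.

```python
def training_budget(total_timesteps=5_000_000, n_envs=16, train_freq=4,
                    gradient_steps=4, batch_size=128):
    """Rough update counts for the training run (assumption-based sketch)."""
    # Each vectorized step yields one transition per environment copy
    vec_steps = total_timesteps // n_envs
    # A rollout is `train_freq` vectorized steps, followed by a training phase
    rollouts = vec_steps // train_freq
    new_transitions = n_envs * train_freq
    sampled_transitions = gradient_steps * batch_size
    return {
        "rollouts": rollouts,
        "gradient_updates": rollouts * gradient_steps,
        "new_transitions_per_rollout": new_transitions,
        "sampled_transitions_per_rollout": sampled_transitions,
        "replay_ratio": sampled_transitions / new_transitions,
    }


budget = training_budget()
for key, value in budget.items():
    print(f"{key}: {value}")
```

With these numbers, each rollout adds 64 new transitions to the buffer while the four gradient steps sample 512, so on average a stored transition is replayed about 8 times.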