---
language: en
library_name: pytorch
license: apache-2.0
pipeline_tag: reinforcement-learning
tags:
- reinforcement-learning
- Generative Model
- GenerativeRL
- LunarLanderContinuous-v2
benchmark_name: Box2d
task_name: LunarLanderContinuous-v2
model-index:
- name: QGPO
  results:
  - task:
      type: reinforcement-learning
      name: reinforcement-learning
    dataset:
      name: LunarLanderContinuous-v2
      type: LunarLanderContinuous-v2
    metrics:
    - type: mean_reward
      value: '200.0'
      name: mean_reward
      verified: false
---

# Play **LunarLanderContinuous-v2** with **QGPO** Policy

## Model Description
<!-- Provide a longer summary of what this model is. -->

This implementation applies **QGPO** to the Box2d **LunarLanderContinuous-v2** environment using [GenerativeRL](https://github.com/opendilab/GenerativeRL/).

## Model Usage
### Install the Dependencies
<details close>
<summary>(Click for Details)</summary>

```shell
# install GenerativeRL with huggingface support
pip3 install GenerativeRL[huggingface]
# install environment dependencies if needed
pip3 install gym[box2d]==0.23.1
```
</details>

### Download Model from Hugging Face and Run the Model

<details close>
<summary>(Click for Details)</summary>

```shell
# run the trained model
python3 -u run.py
```
**run.py**
```python
import gym
import imageio.v3 as imageio
import numpy as np

from grl.algorithms.qgpo import QGPOAlgorithm
from grl.datasets import QGPOCustomizedTensorDictDataset
from grl.utils.huggingface import pull_model_from_hub


def qgpo_pipeline():
    # Download the trained weights and the matching configuration from the Hub.
    policy_state_dict, config = pull_model_from_hub(
        repo_id="zjowowen/LunarLanderContinuous-v2-QGPO",
    )

    qgpo = QGPOAlgorithm(
        config,
        dataset=QGPOCustomizedTensorDictDataset(
            numpy_data_path="./data.npz",
            action_augment_num=config.train.parameter.action_augment_num,
        ),
    )

    qgpo.model.load_state_dict(policy_state_dict)

    # ---------------------------------------
    # Customized train code ↓
    # ---------------------------------------
    # qgpo.train()
    # ---------------------------------------
    # Customized train code ↑
    # ---------------------------------------

    # ---------------------------------------
    # Customized deploy code ↓
    # ---------------------------------------
    agent = qgpo.deploy()
    env = gym.make(config.deploy.env.env_id)
    observation = env.reset()
    images = [env.render(mode="rgb_array")]
    for _ in range(config.deploy.num_deploy_steps):
        observation, reward, done, _ = env.step(agent.act(observation))
        images.append(env.render(mode="rgb_array"))
        if done:
            # Start a new episode instead of stepping a finished one.
            observation = env.reset()

    # Save the rendered frames as an mp4 file.
    imageio.imwrite("replay.mp4", np.array(images), fps=30, quality=8)
    # ---------------------------------------
    # Customized deploy code ↑
    # ---------------------------------------


if __name__ == "__main__":
    qgpo_pipeline()
```
</details>
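
The front matter reports a `mean_reward` of 200.0 with `verified: false`, and this card does not say how that number was measured. The helper below is a minimal sketch of one way to estimate an average episode return. It assumes the Gym 0.23.1 API used in `run.py` (`reset()` returns an observation, `step()` returns a 4-tuple) and a policy callable such as `agent.act`. The name `evaluate_mean_reward` and its parameters are hypothetical and are not part of GenerativeRL.

```python
from typing import Callable, List


def evaluate_mean_reward(env, act: Callable, episodes: int = 10, max_steps: int = 1000) -> float:
    """Average undiscounted return over `episodes` rollouts.

    `env` is expected to follow the Gym 0.23.1 interface and `act` maps an
    observation to an action. Both are assumptions of this sketch.
    """
    returns: List[float] = []
    for _ in range(episodes):
        observation = env.reset()
        total = 0.0
        for _ in range(max_steps):
            observation, reward, done, _info = env.step(act(observation))
            total += float(reward)
            if done:
                # Stop at the end of the episode so returns are per-episode.
                break
        returns.append(total)
    return sum(returns) / len(returns)
```

Inside `qgpo_pipeline()` this could be called as `evaluate_mean_reward(gym.make(config.deploy.env.env_id), agent.act)`. Results will vary with the seed and the number of episodes.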

## Model Training

### Train the Model and Push to Hugging Face Hub

<details close>
<summary>(Click for Details)</summary>

```shell
# train your own agent
python3 -u train.py
```
**train.py**
```python
import gym

from grl.algorithms.qgpo import QGPOAlgorithm
from grl.datasets import QGPOCustomizedTensorDictDataset
from grl.utils.log import log
from grl_pipelines.diffusion_model.configurations.lunarlander_continuous_qgpo import (
    config,
)


def qgpo_pipeline(config):
    qgpo = QGPOAlgorithm(
        config,
        dataset=QGPOCustomizedTensorDictDataset(
            numpy_data_path="./data.npz",
            action_augment_num=config.train.parameter.action_augment_num,
        ),
    )

    # ---------------------------------------
    # Customized train code ↓
    # ---------------------------------------
    qgpo.train()
    # ---------------------------------------
    # Customized train code ↑
    # ---------------------------------------

    # ---------------------------------------
    # Customized deploy code ↓
    # ---------------------------------------
    agent = qgpo.deploy()
    env = gym.make(config.deploy.env.env_id)
    observation = env.reset()
    for _ in range(config.deploy.num_deploy_steps):
        env.render()
        observation, reward, done, _ = env.step(agent.act(observation))
        if done:
            # Start a new episode instead of stepping a finished one.
            observation = env.reset()
    # ---------------------------------------
    # Customized deploy code ↑
    # ---------------------------------------


if __name__ == "__main__":
    log.info("config: \n{}".format(config))
    qgpo_pipeline(config)
```
</details>

**Configuration**
<details close>
<summary>(Click for Details)</summary>

```python
{'train': {'project': 'LunarLanderContinuous-v2-QGPO-VPSDE', 'device': 'cuda', 'wandb': {'project': 'IQL-LunarLanderContinuous-v2-QGPO-VPSDE'}, 'simulator': {'type': 'GymEnvSimulator', 'args': {'env_id': 'LunarLanderContinuous-v2'}}, 'model': {'QGPOPolicy': {'device': 'cuda', 'critic': {'device': 'cuda', 'q_alpha': 1.0, 'DoubleQNetwork': {'backbone': {'type': 'ConcatenateMLP', 'args': {'hidden_sizes': [10, 256, 256], 'output_size': 1, 'activation': 'relu'}}}}, 'diffusion_model': {'device': 'cuda', 'x_size': 2, 'alpha': 1.0, 'solver': {'type': 'DPMSolver', 'args': {'order': 2, 'device': 'cuda', 'steps': 17}}, 'path': {'type': 'linear_vp_sde', 'beta_0': 0.1, 'beta_1': 20.0}, 'reverse_path': {'type': 'linear_vp_sde', 'beta_0': 0.1, 'beta_1': 20.0}, 'model': {'type': 'noise_function', 'args': {'t_encoder': {'type': 'GaussianFourierProjectionTimeEncoder', 'args': {'embed_dim': 32, 'scale': 30.0}}, 'backbone': {'type': 'TemporalSpatialResidualNet', 'args': {'hidden_sizes': [512, 256, 128], 'output_dim': 2, 't_dim': 32, 'condition_dim': 8, 'condition_hidden_dim': 32, 't_condition_hidden_dim': 128}}}}, 'energy_guidance': {'t_encoder': {'type': 'GaussianFourierProjectionTimeEncoder', 'args': {'embed_dim': 32, 'scale': 30.0}}, 'backbone': {'type': 'ConcatenateMLP', 'args': {'hidden_sizes': [42, 256, 256], 'output_size': 1, 'activation': 'silu'}}}}}}, 'parameter': {'behaviour_policy': {'batch_size': 1024, 'learning_rate': 0.0001, 'epochs': 500}, 'action_augment_num': 16, 'fake_data_t_span': None, 'energy_guided_policy': {'batch_size': 256}, 'critic': {'stop_training_epochs': 500, 'learning_rate': 0.0001, 'discount_factor': 0.99, 'update_momentum': 0.005}, 'energy_guidance': {'epochs': 1000, 'learning_rate': 0.0001}, 'evaluation': {'evaluation_interval': 50, 'guidance_scale': [0.0, 1.0, 2.0]}, 'checkpoint_path': './LunarLanderContinuous-v2-QGPO'}}, 'deploy': {'device': 'cuda', 'env': {'env_id': 'LunarLanderContinuous-v2', 'seed': 0}, 'num_deploy_steps': 1000, 't_span': None}}
```

```json
{
    "train": {
        "project": "LunarLanderContinuous-v2-QGPO-VPSDE",
        "device": "cuda",
        "wandb": {
            "project": "IQL-LunarLanderContinuous-v2-QGPO-VPSDE"
        },
        "simulator": {
            "type": "GymEnvSimulator",
            "args": {
                "env_id": "LunarLanderContinuous-v2"
            }
        },
        "model": {
            "QGPOPolicy": {
                "device": "cuda",
                "critic": {
                    "device": "cuda",
                    "q_alpha": 1.0,
                    "DoubleQNetwork": {
                        "backbone": {
                            "type": "ConcatenateMLP",
                            "args": {
                                "hidden_sizes": [
                                    10,
                                    256,
                                    256
                                ],
                                "output_size": 1,
                                "activation": "relu"
                            }
                        }
                    }
                },
                "diffusion_model": {
                    "device": "cuda",
                    "x_size": 2,
                    "alpha": 1.0,
                    "solver": {
                        "type": "DPMSolver",
                        "args": {
                            "order": 2,
                            "device": "cuda",
                            "steps": 17
                        }
                    },
                    "path": {
                        "type": "linear_vp_sde",
                        "beta_0": 0.1,
                        "beta_1": 20.0
                    },
                    "reverse_path": {
                        "type": "linear_vp_sde",
                        "beta_0": 0.1,
                        "beta_1": 20.0
                    },
                    "model": {
                        "type": "noise_function",
                        "args": {
                            "t_encoder": {
                                "type": "GaussianFourierProjectionTimeEncoder",
                                "args": {
                                    "embed_dim": 32,
                                    "scale": 30.0
                                }
                            },
                            "backbone": {
                                "type": "TemporalSpatialResidualNet",
                                "args": {
                                    "hidden_sizes": [
                                        512,
                                        256,
                                        128
                                    ],
                                    "output_dim": 2,
                                    "t_dim": 32,
                                    "condition_dim": 8,
                                    "condition_hidden_dim": 32,
                                    "t_condition_hidden_dim": 128
                                }
                            }
                        }
                    },
                    "energy_guidance": {
                        "t_encoder": {
                            "type": "GaussianFourierProjectionTimeEncoder",
                            "args": {
                                "embed_dim": 32,
                                "scale": 30.0
                            }
                        },
                        "backbone": {
                            "type": "ConcatenateMLP",
                            "args": {
                                "hidden_sizes": [
                                    42,
                                    256,
                                    256
                                ],
                                "output_size": 1,
                                "activation": "silu"
                            }
                        }
                    }
                }
            }
        },
        "parameter": {
            "behaviour_policy": {
                "batch_size": 1024,
                "learning_rate": 0.0001,
                "epochs": 500
            },
            "action_augment_num": 16,
            "fake_data_t_span": null,
            "energy_guided_policy": {
                "batch_size": 256
            },
            "critic": {
                "stop_training_epochs": 500,
                "learning_rate": 0.0001,
                "discount_factor": 0.99,
                "update_momentum": 0.005
            },
            "energy_guidance": {
                "epochs": 1000,
                "learning_rate": 0.0001
            },
            "evaluation": {
                "evaluation_interval": 50,
                "guidance_scale": [
                    0.0,
                    1.0,
                    2.0
                ]
            },
            "checkpoint_path": "./LunarLanderContinuous-v2-QGPO"
        }
    },
    "deploy": {
        "device": "cuda",
        "env": {
            "env_id": "LunarLanderContinuous-v2",
            "seed": 0
        },
        "num_deploy_steps": 1000,
        "t_span": null
    }
}
```

</details>
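
The first entries of the two `ConcatenateMLP` `hidden_sizes` lists (10 and 42) line up with the other dimensions in the configuration. One plausible reading, which is an assumption and not something this card states, is that the first entry is the width of the concatenated inputs: state plus action for the critic, and state plus action plus time embedding for the energy guidance network. The sketch below only checks that arithmetic. `concat_input_dim` is a hypothetical helper, not a GenerativeRL function.

```python
# Values copied from the configuration above.
state_dim = 8        # diffusion_model.model.args.backbone.args.condition_dim
action_dim = 2       # diffusion_model.x_size
time_embed_dim = 32  # energy_guidance.t_encoder.args.embed_dim

critic_hidden_sizes = [10, 256, 256]    # critic.DoubleQNetwork.backbone.args.hidden_sizes
guidance_hidden_sizes = [42, 256, 256]  # energy_guidance.backbone.args.hidden_sizes


def concat_input_dim(*parts: int) -> int:
    """Width of a vector formed by concatenating inputs of the given sizes."""
    return sum(parts)


# Assumed reading: hidden_sizes[0] is the concatenated input width.
critic_in = concat_input_dim(state_dim, action_dim)
guidance_in = concat_input_dim(state_dim, action_dim, time_embed_dim)

assert critic_in == critic_hidden_sizes[0]
assert guidance_in == guidance_hidden_sizes[0]
print(critic_in, guidance_in)
```

If you adapt the configuration to an environment with different observation or action sizes, these first entries would presumably need to change together with `condition_dim`, `x_size`, and `output_dim`.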

**Training Procedure**
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
- **Weights & Biases (wandb):** [monitor link](https://wandb.ai/zjowowen/IQL-LunarLanderContinuous-v2-QGPO-VPSDE)
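
The configuration selects a `linear_vp_sde` path with `beta_0 = 0.1` and `beta_1 = 20.0`. As a rough illustration of what such a path does to an action during training, the sketch below uses the textbook linear variance-preserving SDE, where `beta(t) = beta_0 + t * (beta_1 - beta_0)` and a clean sample is perturbed as `x_t = alpha(t) * x_0 + sigma(t) * noise`. That GenerativeRL's `linear_vp_sde` follows exactly this parameterization is an assumption, and the function names are hypothetical.

```python
import math
from typing import List

BETA_0, BETA_1 = 0.1, 20.0  # path.beta_0 and path.beta_1 from the configuration


def beta(t: float) -> float:
    """Linear noise rate at time t in [0, 1]."""
    return BETA_0 + t * (BETA_1 - BETA_0)


def alpha(t: float) -> float:
    """Signal scale: exp(-0.5 * integral of beta(s) ds from 0 to t)."""
    return math.exp(-0.5 * (BETA_0 * t + 0.5 * (BETA_1 - BETA_0) * t * t))


def sigma(t: float) -> float:
    """Noise scale, chosen so that alpha(t)**2 + sigma(t)**2 == 1."""
    return math.sqrt(1.0 - alpha(t) ** 2)


def perturb(x0: List[float], noise: List[float], t: float) -> List[float]:
    """Mix a clean sample with Gaussian noise at time t."""
    return [alpha(t) * x + sigma(t) * n for x, n in zip(x0, noise)]


for t in (0.0, 0.5, 1.0):
    print(f"t={t:.1f}  alpha={alpha(t):.4f}  sigma={sigma(t):.4f}")
```

Under these assumptions the sample is untouched at `t = 0` and is almost pure noise at `t = 1`.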

## Model Information
<!-- Provide the basic links for the model. -->
- **GitHub Repository:** [repo link](https://github.com/opendilab/GenerativeRL/)
- **Doc:** [Algorithm link](https://opendilab.github.io/GenerativeRL/)
- **Configuration:** [config link](https://huggingface.co/OpenDILabCommunity/LunarLanderContinuous-v2-QGPO/blob/main/policy_config.json)
- **Demo:** [video](https://huggingface.co/OpenDILabCommunity/LunarLanderContinuous-v2-QGPO/blob/main/replay.mp4)
<!-- Provide the size information for the model. -->
- **Parameters total size:** 8799.79 KB
- **Last Update Date:** 2024-12-04

## Environments
<!-- Address questions around what environment the model is intended to be trained and deployed at, including the necessary information needed to be provided for future users. -->
- **Benchmark:** Box2d
- **Task:** LunarLanderContinuous-v2
- **Gym version:** 0.23.1
- **GenerativeRL version:** v0.0.1
- **PyTorch version:** 2.4.1+cu121
- **Doc:** [Environments link](https://www.gymlibrary.dev/environments/box2d/lunar_lander/)
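
To compare a local setup against the versions listed above, a small check with the standard library's `importlib.metadata` can help. This is a sketch: the distribution names `gym` and `torch` are the usual PyPI names, the helper functions are hypothetical, and GenerativeRL is left out because its installed distribution name is not stated in this card.

```python
from importlib import metadata
from typing import Dict, Optional

# Versions taken from the Environments list above.
EXPECTED = {"gym": "0.23.1", "torch": "2.4.1+cu121"}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_mismatches(expected: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Map each package whose installed version differs to the version found."""
    found = {name: installed_version(name) for name in expected}
    return {name: ver for name, ver in found.items() if ver != expected[name]}


for name, found in version_mismatches(EXPECTED).items():
    print(f"{name}: expected {EXPECTED[name]}, found {found}")
```

A mismatch is not necessarily a problem, but the Gym API changed in later releases, so the `reset()` and `step()` calls in `run.py` and `train.py` may need adjusting on newer versions.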