sgoodfriend committed on Feb 9, 2023

Commit 85e4a43 • 1 Parent(s): 0ca6846

PPO playing CarRacing-v0 from https://github.com/sgoodfriend/rl-algo-impls/tree/fbc943f151b95afc4905a67a3835fb6b18c6a5e4

Files changed (25)

README.md +16 -14
benchmark_publish.py +4 -5
benchmarks/benchmark_test.sh +32 -0
benchmarks/colab_pybullet.sh +1 -1
benchmarks/train_loop.sh +1 -3
colab_requirements.txt +2 -1
compare_runs.py +179 -0
dqn/policy.py +1 -1
dqn/q_net.py +1 -1
huggingface_publish.py +5 -0
hyperparams/dqn.yml +5 -17
hyperparams/ppo.yml +5 -12
lambda_labs/benchmark.sh +1 -2
lambda_labs/lambda_requirements.txt +2 -1
poetry.lock +83 -28
ppo/policy.py +11 -16
pyproject.toml +3 -0
replay.meta.json +1 -1
replay.mp4 +0 -0
runner/running_utils.py +2 -0
saved_models/ppo-CarRacing-v0-S1-best/model.pth +3 -0
shared/policy/on_policy.py +16 -1
shared/policy/policy.py +0 -1
train.py +8 -9
vpg/policy.py +8 -2

README.md CHANGED

@@ -10,7 +10,7 @@ model-index:
   results:
   - metrics:
     - type: mean_reward
-      value: 621.48 +/- 140.74
       name: mean_reward
     task:
       type: reinforcement-learning
@@ -23,17 +23,17 @@ model-index:
 This is a trained model of a **PPO** agent playing **CarRacing-v0** using the [/sgoodfriend/rl-algo-impls](https://github.com/sgoodfriend/rl-algo-impls) repo.
-All models trained at this commit can be found at https://api.wandb.ai/links/sgoodfriend/6p2sjqtn.
 ## Training Results
-This model was trained from 3 trainings of **PPO** agents using different initial seeds. These agents were trained by checking out [5598ebc](https://github.com/sgoodfriend/rl-algo-impls/tree/5598ebc4b03054f16eebe76792486ba7bcacfc5c). The best and last models were kept from each training. This submission has loaded the best models from each training, reevaluates them, and selects the best model from these latest evaluations (mean - std).
 | algo   | env          |   seed |   reward_mean |   reward_std |   eval_episodes | best   | wandb_url                                                                    |
 |:-------|:-------------|-------:|--------------:|-------------:|----------------:|:-------|:-----------------------------------------------------------------------------|
-| ppo    | CarRacing-v0 |      4 |       635.901 |      267.357 |              16 |        | [wandb](https://wandb.ai/sgoodfriend/rl-algo-impls-benchmarks/runs/1af869b6) |
-| ppo    | CarRacing-v0 |      5 |       621.48  |      140.74  |              16 | *      | [wandb](https://wandb.ai/sgoodfriend/rl-algo-impls-benchmarks/runs/2isthqpm) |
-| ppo    | CarRacing-v0 |      6 |       663.161 |      184.276 |              16 |        | [wandb](https://wandb.ai/sgoodfriend/rl-algo-impls-benchmarks/runs/rhymy2k8) |
 ### Prerequisites: Weights & Biases (WandB)
@@ -53,10 +53,10 @@ login`.
 Note: While the model state dictionary and hyperaparameters are saved, the latest
 implementation could be sufficiently different to not be able to reproduce similar
 results. You might need to checkout the commit the agent was trained on:
-[5598ebc](https://github.com/sgoodfriend/rl-algo-impls/tree/5598ebc4b03054f16eebe76792486ba7bcacfc5c).
 ```
 # Downloads the model, sets hyperparameters, and runs agent for 3 episodes
-python enjoy.py --wandb-run-path=sgoodfriend/rl-algo-impls-benchmarks/2isthqpm
 ```
 Setup hasn't been completely worked out yet, so you might be best served by using Google
@@ -68,11 +68,11 @@ notebook.
 ## Training
 If you want the highest chance to reproduce these results, you'll want to checkout the
-commit the agent was trained on: [5598ebc](https://github.com/sgoodfriend/rl-algo-impls/tree/5598ebc4b03054f16eebe76792486ba7bcacfc5c). While
 training is deterministic, different hardware will give different results.
 ```
-python train.py --algo ppo --env CarRacing-v0 --seed 5
 ```
 Setup hasn't been completely worked out yet, so you might be best served by using Google
@@ -83,7 +83,7 @@ notebook.
 ## Benchmarking (with Lambda Labs instance)
-This and other models from https://api.wandb.ai/links/sgoodfriend/6p2sjqtn were generated by running a script on a Lambda
 Labs instance. In a Lambda Labs instance terminal:
 ```
 git clone git@github.com:sgoodfriend/rl-algo-impls.git
@@ -127,16 +127,18 @@ n_timesteps: 4000000
 policy_hyperparams:
   activation_fn: relu
   cnn_feature_dim: 256
   init_layers_orthogonal: false
   log_std_init: -2
   share_features_extractor: false
   use_sde: true
-seed: 5
 use_deterministic_algorithms: true
 wandb_entity: null
 wandb_project_name: rl-algo-impls-benchmarks
 wandb_tags:
-- benchmark_5598ebc
-- host_192-9-145-26
 ```

   results:
   - metrics:
     - type: mean_reward
+      value: 865.72 +/- 58.15
       name: mean_reward
     task:
       type: reinforcement-learning
 This is a trained model of a **PPO** agent playing **CarRacing-v0** using the [/sgoodfriend/rl-algo-impls](https://github.com/sgoodfriend/rl-algo-impls) repo.
+All models trained at this commit can be found at https://api.wandb.ai/links/sgoodfriend/448odm37.
 ## Training Results
+This model was trained from 3 trainings of **PPO** agents using different initial seeds. These agents were trained by checking out [fbc943f](https://github.com/sgoodfriend/rl-algo-impls/tree/fbc943f151b95afc4905a67a3835fb6b18c6a5e4). The best and last models were kept from each training. This submission has loaded the best models from each training, reevaluates them, and selects the best model from these latest evaluations (mean - std).
 | algo   | env          |   seed |   reward_mean |   reward_std |   eval_episodes | best   | wandb_url                                                                    |
 |:-------|:-------------|-------:|--------------:|-------------:|----------------:|:-------|:-----------------------------------------------------------------------------|
+| ppo    | CarRacing-v0 |      1 |       865.725 |      58.1454 |              16 | *      | [wandb](https://wandb.ai/sgoodfriend/rl-algo-impls-benchmarks/runs/8vyb0q44) |
+| ppo    | CarRacing-v0 |      2 |       693.464 |     236.712  |              16 |        | [wandb](https://wandb.ai/sgoodfriend/rl-algo-impls-benchmarks/runs/a3ld38qf) |
+| ppo    | CarRacing-v0 |      3 |       815.26  |     141.502  |              16 |        | [wandb](https://wandb.ai/sgoodfriend/rl-algo-impls-benchmarks/runs/zah43or2) |
 ### Prerequisites: Weights & Biases (WandB)
 Note: While the model state dictionary and hyperaparameters are saved, the latest
 implementation could be sufficiently different to not be able to reproduce similar
 results. You might need to checkout the commit the agent was trained on:
+[fbc943f](https://github.com/sgoodfriend/rl-algo-impls/tree/fbc943f151b95afc4905a67a3835fb6b18c6a5e4).
 ```
 # Downloads the model, sets hyperparameters, and runs agent for 3 episodes
+python enjoy.py --wandb-run-path=sgoodfriend/rl-algo-impls-benchmarks/8vyb0q44
 ```
 Setup hasn't been completely worked out yet, so you might be best served by using Google
 ## Training
 If you want the highest chance to reproduce these results, you'll want to checkout the
+commit the agent was trained on: [fbc943f](https://github.com/sgoodfriend/rl-algo-impls/tree/fbc943f151b95afc4905a67a3835fb6b18c6a5e4). While
 training is deterministic, different hardware will give different results.
 ```
+python train.py --algo ppo --env CarRacing-v0 --seed 1
 ```
 Setup hasn't been completely worked out yet, so you might be best served by using Google
 ## Benchmarking (with Lambda Labs instance)
+This and other models from https://api.wandb.ai/links/sgoodfriend/448odm37 were generated by running a script on a Lambda
 Labs instance. In a Lambda Labs instance terminal:
 ```
 git clone git@github.com:sgoodfriend/rl-algo-impls.git
 policy_hyperparams:
   activation_fn: relu
   cnn_feature_dim: 256
+  hidden_sizes:
+  - 256
   init_layers_orthogonal: false
   log_std_init: -2
   share_features_extractor: false
   use_sde: true
+seed: 1
 use_deterministic_algorithms: true
 wandb_entity: null
 wandb_project_name: rl-algo-impls-benchmarks
 wandb_tags:
+- benchmark_fbc943f
+- host_150-230-44-105
 ```
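
The README text above says the published model is the one with the best `mean - std` over the re-evaluations. A minimal sketch of that selection rule, using the three rows of the new results table, might look like the following. `EvalRow` and `selection_score` are hypothetical names chosen for illustration; the repo's actual evaluation code is not shown in this diff.

```python
from dataclasses import dataclass


@dataclass
class EvalRow:
    # Hypothetical container for one row of the README results table
    seed: int
    reward_mean: float
    reward_std: float

    def selection_score(self) -> float:
        # "mean - std": favors a high average reward, penalizes high variance
        return self.reward_mean - self.reward_std


# Values copied from the new README results table
rows = [
    EvalRow(seed=1, reward_mean=865.725, reward_std=58.1454),
    EvalRow(seed=2, reward_mean=693.464, reward_std=236.712),
    EvalRow(seed=3, reward_mean=815.26, reward_std=141.502),
]

# Pick the row with the highest mean - std
best = max(rows, key=lambda r: r.selection_score())
print(f"best seed: {best.seed}, score: {best.selection_score():.2f}")
```

Under this rule seed 1 is selected, which matches the `*` in the table's `best` column.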

benchmark_publish.py CHANGED

@@ -44,11 +44,10 @@ if __name__ == "__main__":
         default=3,
         help="How many publish jobs can run in parallel",
     )
-    parser.set_defaults(
-        wandb_tags=["benchmark_5598ebc", "host_192-9-145-26"],
-        wandb_report_url="https://api.wandb.ai/links/sgoodfriend/6p2sjqtn",
-        envs=["CartPole-v1", "Acrobot-v1"],
-    )
     args = parser.parse_args()
     print(args)

         default=3,
         help="How many publish jobs can run in parallel",
     )
+    # parser.set_defaults(
+    #     wandb_tags=["benchmark_5598ebc", "host_192-9-145-26"],
+    #     wandb_report_url="https://api.wandb.ai/links/sgoodfriend/6p2sjqtn",
+    # )
     args = parser.parse_args()
     print(args)

benchmarks/benchmark_test.sh ADDED

@@ -0,0 +1,32 @@

+source benchmarks/train_loop.sh
+export WANDB_PROJECT_NAME="rl-algo-impls"
+BENCHMARK_MAX_PROCS="${BENCHMARK_MAX_PROCS:-3}"
+ALGOS=(
+    # "vpg"
+    "dqn"
+    # "ppo"
+)
+ENVS=(
+    # Basic
+    "CartPole-v1"
+    "MountainCar-v0"
+    # "MountainCarContinuous-v0"
+    "Acrobot-v1"
+    "LunarLander-v2"
+    # # PyBullet
+    # "HalfCheetahBulletEnv-v0"
+    # "AntBulletEnv-v0"
+    # "HopperBulletEnv-v0"
+    # "Walker2DBulletEnv-v0"
+    # # CarRacing
+    # "CarRacing-v0"
+    # Atari
+    "PongNoFrameskip-v4"
+    "BreakoutNoFrameskip-v4"
+    "SpaceInvadersNoFrameskip-v4"
+    "QbertNoFrameskip-v4"
+)
+train_loop "${ALGOS[*]}" "${ENVS[*]}" | xargs -I CMD -P $BENCHMARK_MAX_PROCS bash -c CMD

benchmarks/colab_pybullet.sh CHANGED

@@ -1,5 +1,5 @@
 source benchmarks/train_loop.sh
 ALGOS="ppo"
-ENVS="HalfCheetahBulletEnv-v0 AntBulletEnv-v0 Walker2DBulletEnv-v0 HopperBulletEnv-v0"
 BENCHMARK_MAX_PROCS="${BENCHMARK_MAX_PROCS:-3}"
 train_loop $ALGOS "$ENVS" | xargs -I CMD -P $BENCHMARK_MAX_PROCS bash -c CMD

 source benchmarks/train_loop.sh
 ALGOS="ppo"
+ENVS="HalfCheetahBulletEnv-v0 AntBulletEnv-v0 HopperBulletEnv-v0 Walker2DBulletEnv-v0"
 BENCHMARK_MAX_PROCS="${BENCHMARK_MAX_PROCS:-3}"
 train_loop $ALGOS "$ENVS" | xargs -I CMD -P $BENCHMARK_MAX_PROCS bash -c CMD

benchmarks/train_loop.sh CHANGED

@@ -4,13 +4,11 @@ train_loop () {
     local env
     local seed
     local WANDB_PROJECT_NAME="${WANDB_PROJECT_NAME:-rl-algo-impls-benchmarks}"
-    local args=()
-    (( VIRTUAL_DISPLAY == 1)) && args+=("--virtual-display")
     local SEEDS="${SEEDS:-1 2 3}"
     for algo in $(echo $1); do
         for env in $(echo $2); do
             for seed in $SEEDS; do
-                echo python train.py --algo $algo --env $env --seed $seed --pool-size 1 --wandb-tags $WANDB_TAGS --wandb-project-name $WANDB_PROJECT_NAME ${args[@]}
             done
         done
     done

     local env
     local seed
     local WANDB_PROJECT_NAME="${WANDB_PROJECT_NAME:-rl-algo-impls-benchmarks}"
     local SEEDS="${SEEDS:-1 2 3}"
     for algo in $(echo $1); do
         for env in $(echo $2); do
             for seed in $SEEDS; do
+                echo python train.py --algo $algo --env $env --seed $seed --pool-size 1 --wandb-tags $WANDB_TAGS --wandb-project-name $WANDB_PROJECT_NAME
             done
         done
     done

colab_requirements.txt CHANGED

@@ -6,4 +6,5 @@ wandb >= 0.13.9, < 0.14
 pyvirtualdisplay == 3.0
 pybullet >= 3.2.5, < 3.3
 tabulate >= 0.9.0, < 0.10
-huggingface-hub >= 0.12.0, < 0.13

 pyvirtualdisplay == 3.0
 pybullet >= 3.2.5, < 3.3
 tabulate >= 0.9.0, < 0.10
+huggingface-hub >= 0.12.0, < 0.13
+numexpr >= 2.8.4, < 2.9

compare_runs.py ADDED

@@ -0,0 +1,179 @@

+import argparse
+import itertools
+import numpy as np
+import pandas as pd
+import wandb
+import wandb.apis.public
+from collections import defaultdict
+from dataclasses import dataclass
+from typing import Dict, Iterable, List, TypeVar
+from benchmark_publish import RunGroup
+@dataclass
+class Comparison:
+    control_values: List[float]
+    experiment_values: List[float]
+    def mean_diff_percentage(self) -> float:
+        return self._diff_percentage(
+            np.mean(self.control_values).item(), np.mean(self.experiment_values).item()
+        )
+    def median_diff_percentage(self) -> float:
+        return self._diff_percentage(
+            np.median(self.control_values).item(),
+            np.median(self.experiment_values).item(),
+        )
+    def _diff_percentage(self, c: float, e: float) -> float:
+        if c == e:
+            return 0
+        elif c == 0:
+            return float("inf") if e > 0 else float("-inf")
+        return 100 * (e - c) / c
+    def score(self) -> float:
+        return (
+            np.sum(
+                np.sign((self.mean_diff_percentage(), self.median_diff_percentage()))
+            ).item()
+            / 2
+        )
+RunGroupRunsSelf = TypeVar("RunGroupRunsSelf", bound="RunGroupRuns")
+class RunGroupRuns:
+    def __init__(
+        self,
+        run_group: RunGroup,
+        control: List[str],
+        experiment: List[str],
+        summary_stats: List[str] = ["best_eval", "eval", "train_rolling"],
+        summary_metrics: List[str] = ["mean", "result"],
+    ) -> None:
+        self.algo = run_group.algo
+        self.env = run_group.env_id
+        self.control = set(control)
+        self.experiment = set(experiment)
+        self.summary_stats = summary_stats
+        self.summary_metrics = summary_metrics
+        self.control_runs = []
+        self.experiment_runs = []
+    def add_run(self, run: wandb.apis.public.Run) -> None:
+        wandb_tags = set(run.config.get("wandb_tags", []))
+        if self.control & wandb_tags:
+            self.control_runs.append(run)
+        elif self.experiment & wandb_tags:
+            self.experiment_runs.append(run)
+    def comparisons_by_metric(self) -> Dict[str, Comparison]:
+        c_by_m = {}
+        for metric in (
+            f"{s}_{m}"
+            for s, m in itertools.product(self.summary_stats, self.summary_metrics)
+        ):
+            c_by_m[metric] = Comparison(
+                [c.summary[metric] for c in self.control_runs],
+                [e.summary[metric] for e in self.experiment_runs],
+            )
+        return c_by_m
+    @staticmethod
+    def data_frame(rows: Iterable[RunGroupRunsSelf]) -> pd.DataFrame:
+        results = defaultdict(list)
+        for r in rows:
+            results["algo"].append(r.algo)
+            results["env"].append(r.env)
+            results["control"].append(r.control)
+            results["experiment"].append(r.experiment)
+            c_by_m = r.comparisons_by_metric()
+            results["score"].append(
+                sum(m.score() for m in c_by_m.values()) / len(c_by_m)
+            )
+            for m, c in c_by_m.items():
+                results[f"{m}_mean"].append(c.mean_diff_percentage())
+                results[f"{m}_median"].append(c.median_diff_percentage())
+        return pd.DataFrame(results)
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "-p",
+        "--wandb-project-name",
+        type=str,
+        default="rl-algo-impls-benchmarks",
+        help="WandB project name to load runs from",
+    )
+    parser.add_argument(
+        "--wandb-entity",
+        type=str,
+        default=None,
+        help="WandB team. None uses default entity",
+    )
+    parser.add_argument(
+        "-n",
+        "--wandb-hostname-tag",
+        type=str,
+        help="WandB tag for hostname (e.g. host_192-9-145-26)",
+    )
+    parser.add_argument(
+        "-c",
+        "--wandb-control-tag",
+        type=str,
+        nargs="+",
+        help="WandB tag for control commit (e.g. benchmark_5598ebc)",
+    )
+    parser.add_argument(
+        "-e",
+        "--wandb-experiment-tag",
+        type=str,
+        nargs="+",
+        help="WandB tag for experiment commit (e.g. benchmark_5540e1f)",
+    )
+    parser.add_argument(
+        "--exclude_envs",
+        type=str,
+        nargs="*",
+        help="Environments to exclude from comparison",
+    )
+    parser.set_defaults(
+        wandb_hostname_tag="host_192-9-145-26",
+        wandb_control_tag=["benchmark_e4d1ed6", "benchmark_5598ebc"],
+        wandb_experiment_tag=["benchmark_680043d", "benchmark_5540e1f"],
+        exclude_envs=["CarRacing-v0"]
+    )
+    args = parser.parse_args()
+    print(args)
+    api = wandb.Api()
+    all_runs = api.runs(
+        path=f"{args.wandb_entity or api.default_entity}/{args.wandb_project_name}",
+        order="+created_at",
+    )
+    runs_by_run_group: Dict[RunGroup, RunGroupRuns] = {}
+    for r in all_runs:
+        wandb_tags = r.config.get("wandb_tags", [])
+        if not wandb_tags or args.wandb_hostname_tag not in wandb_tags:
+            continue
+        rg = RunGroup(r.config["algo"], r.config["env"])
+        if args.exclude_envs and rg.env_id in args.exclude_envs:
+            continue
+        if rg not in runs_by_run_group:
+            runs_by_run_group[rg] = RunGroupRuns(
+                rg, args.wandb_control_tag, args.wandb_experiment_tag
+            )
+        runs_by_run_group[rg].add_run(r)
+    df = RunGroupRuns.data_frame(runs_by_run_group.values()).round(decimals=2)
+    print(f"**Total Score: {sum(df.score)}**")
+    df.loc["mean"] = df.mean(numeric_only=True)
+    print(df.to_markdown())
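
To make the scoring in `compare_runs.py` easier to follow, here is a minimal standalone sketch of the same idea: compute the percentage change of the experiment's mean and median relative to the control, then average the signs of the two changes. It uses the standard-library `statistics` module in place of NumPy, and `diff_percentage`, `sign`, and `score` are free functions written for illustration rather than the methods in the diff. The input numbers are made up.

```python
from statistics import mean, median
from typing import List


def diff_percentage(c: float, e: float) -> float:
    # Percentage change of experiment (e) relative to control (c),
    # mirroring the branches of Comparison._diff_percentage
    if c == e:
        return 0.0
    if c == 0:
        return float("inf") if e > 0 else float("-inf")
    return 100 * (e - c) / c


def sign(x: float) -> int:
    # -1, 0, or 1, like np.sign for a scalar
    return (x > 0) - (x < 0)


def score(control: List[float], experiment: List[float]) -> float:
    # +1 if both mean and median improved, -1 if both regressed,
    # 0 if they disagree (or both are unchanged)
    mean_diff = diff_percentage(mean(control), mean(experiment))
    median_diff = diff_percentage(median(control), median(experiment))
    return (sign(mean_diff) + sign(median_diff)) / 2


# Made-up per-run values for one metric
control = [100.0, 110.0, 120.0]
experiment = [90.0, 130.0, 200.0]
print(score(control, experiment))
```

One property of this formula worth keeping in mind: because the difference is divided by the control value itself, a negative control flips the sign. Going from -100 to -80 is reported as -20%, so environments with negative rewards may need separate handling when reading the score.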

dqn/policy.py CHANGED

@@ -15,7 +15,7 @@ class DQNPolicy(Policy):
     def __init__(
         self,
         env: VecEnv,
-        hidden_sizes: Sequence[int],
         **kwargs,
     ) -> None:
         super().__init__(env, **kwargs)

     def __init__(
         self,
         env: VecEnv,
+        hidden_sizes: Sequence[int] = [],
         **kwargs,
     ) -> None:
         super().__init__(env, **kwargs)

dqn/q_net.py CHANGED

@@ -13,7 +13,7 @@ class QNetwork(nn.Module):
         self,
         observation_space: gym.Space,
         action_space: gym.Space,
-        hidden_sizes: Sequence[int],
         activation: Type[nn.Module] = nn.ReLU,  # Used by stable-baselines3
     ) -> None:
         super().__init__()

         self,
         observation_space: gym.Space,
         action_space: gym.Space,
+        hidden_sizes: Sequence[int] = [],
         activation: Type[nn.Module] = nn.ReLU,  # Used by stable-baselines3
     ) -> None:
         super().__init__()

huggingface_publish.py CHANGED

@@ -14,6 +14,8 @@ from typing import List, Optional
 from huggingface_hub.hf_api import HfApi, upload_folder
 from huggingface_hub.repocard import metadata_save
 from publish.markdown_format import EvalTableData, model_card_text
 from runner.evaluate import EvalArgs, evaluate_model
 from runner.env import make_eval_env
@@ -27,6 +29,9 @@ def publish(
     huggingface_user: Optional[str] = None,
     huggingface_token: Optional[str] = None,
 ) -> None:
     api = wandb.Api()
     runs = [api.run(rp) for rp in wandb_run_paths]
     algo = runs[0].config["algo"]

 from huggingface_hub.hf_api import HfApi, upload_folder
 from huggingface_hub.repocard import metadata_save
+from pyvirtualdisplay.display import Display
 from publish.markdown_format import EvalTableData, model_card_text
 from runner.evaluate import EvalArgs, evaluate_model
 from runner.env import make_eval_env
     huggingface_user: Optional[str] = None,
     huggingface_token: Optional[str] = None,
 ) -> None:
+    virtual_display = Display(visible=False, size=(1400, 900))
+    virtual_display.start()
     api = wandb.Api()
     runs = [api.run(rp) for rp in wandb_run_paths]
     algo = runs[0].config["algo"]

hyperparams/dqn.yml CHANGED

@@ -1,7 +1,6 @@
 CartPole-v1: &cartpole-defaults
   n_timesteps: !!float 5e4
   env_hyperparams:
-    n_envs: 1
     rolling_length: 50
   policy_hyperparams:
     hidden_sizes: [256, 256]
@@ -18,8 +17,6 @@ CartPole-v1: &cartpole-defaults
     exploration_final_eps: 0.04
   eval_params:
     step_freq: !!float 1e4
-    n_episodes: 10
-    save_best: true
 CartPole-v0:
   <<: *cartpole-defaults
@@ -46,7 +43,7 @@ MountainCar-v0:
 Acrobot-v1:
   n_timesteps: !!float 1e5
   env_hyperparams:
-    rolling_length: 10
   policy_hyperparams:
     hidden_sizes: [256, 256]
   algo_hyperparams:
@@ -64,7 +61,7 @@ Acrobot-v1:
 LunarLander-v2:
   n_timesteps: !!float 5e5
   env_hyperparams:
-    rolling_length: 10
   policy_hyperparams:
     hidden_sizes: [256, 256]
   algo_hyperparams:
@@ -81,19 +78,15 @@ LunarLander-v2:
     max_grad_norm: 0.5
   eval_params:
     step_freq: 25_000
-    n_episodes: 10
-    save_best: true
-SpaceInvadersNoFrameskip-v4: &atari-defaults
   n_timesteps: !!float 1e7
   env_hyperparams:
     frame_stack: 4
     no_reward_timeout_steps: 1_000
     n_envs: 8
     vec_env_class: "subproc"
-    rolling_length: 20
-  policy_hyperparams:
-    hidden_sizes: [512]
   algo_hyperparams:
     buffer_size: 100000
     learning_rate: !!float 1e-4
@@ -105,12 +98,7 @@ SpaceInvadersNoFrameskip-v4: &atari-defaults
     exploration_fraction: 0.1
     exploration_final_eps: 0.01
   eval_params:
-    step_freq: 100_000
-    n_episodes: 10
-    save_best: true
-BreakoutNoFrameskip-v4:
-  <<: *atari-defaults
 PongNoFrameskip-v4:
   <<: *atari-defaults

 CartPole-v1: &cartpole-defaults
   n_timesteps: !!float 5e4
   env_hyperparams:
     rolling_length: 50
   policy_hyperparams:
     hidden_sizes: [256, 256]
     exploration_final_eps: 0.04
   eval_params:
     step_freq: !!float 1e4
 CartPole-v0:
   <<: *cartpole-defaults
 Acrobot-v1:
   n_timesteps: !!float 1e5
   env_hyperparams:
+    rolling_length: 50
   policy_hyperparams:
     hidden_sizes: [256, 256]
   algo_hyperparams:
 LunarLander-v2:
   n_timesteps: !!float 5e5
   env_hyperparams:
+    rolling_length: 50
   policy_hyperparams:
     hidden_sizes: [256, 256]
   algo_hyperparams:
     max_grad_norm: 0.5
   eval_params:
     step_freq: 25_000
+atari: &atari-defaults
   n_timesteps: !!float 1e7
   env_hyperparams:
     frame_stack: 4
     no_reward_timeout_steps: 1_000
+    no_reward_fire_steps: 500
     n_envs: 8
     vec_env_class: "subproc"
   algo_hyperparams:
     buffer_size: 100000
     learning_rate: !!float 1e-4
     exploration_fraction: 0.1
     exploration_final_eps: 0.01
   eval_params:
+    deterministic: false
 PongNoFrameskip-v4:
   <<: *atari-defaults

hyperparams/ppo.yml CHANGED

@@ -15,8 +15,6 @@ CartPole-v1: &cartpole-defaults
     clip_range_decay: linear
   eval_params:
     step_freq: !!float 2.5e4
-    n_episodes: 10
-    save_best: true
 CartPole-v0:
   <<: *cartpole-defaults
@@ -39,9 +37,10 @@ MountainCarContinuous-v0:
   env_hyperparams:
     normalize: true
     n_envs: 4
-  policy_hyperparams:
-    init_layers_orthogonal: false
-    # log_std_init: -3.29
   algo_hyperparams:
     n_steps: 512
     batch_size: 256
@@ -53,11 +52,8 @@ MountainCarContinuous-v0:
     gae_lambda: 0.9
     max_grad_norm: 5
     vf_coef: 0.19
-    # use_sde: true
   eval_params:
     step_freq: 5000
-    n_episodes: 10
-    save_best: true
 Acrobot-v1:
   n_timesteps: !!float 1e6
@@ -84,10 +80,6 @@ LunarLander-v2:
     ent_coef: 0.01
     ent_coef_decay: linear
     normalize_advantage: false
-  eval_params:
-    step_freq: !!float 5e4
-    n_episodes: 10
-    save_best: true
 CarRacing-v0:
   n_timesteps: !!float 4e6
@@ -101,6 +93,7 @@ CarRacing-v0:
     activation_fn: relu
     share_features_extractor: false
     cnn_feature_dim: 256
   algo_hyperparams:
     n_steps: 512
     batch_size: 128

     clip_range_decay: linear
   eval_params:
     step_freq: !!float 2.5e4
 CartPole-v0:
   <<: *cartpole-defaults
   env_hyperparams:
     normalize: true
     n_envs: 4
+  # policy_hyperparams:
+  #   init_layers_orthogonal: false
+  #   log_std_init: -3.29
+  #   use_sde: true
   algo_hyperparams:
     n_steps: 512
     batch_size: 256
     gae_lambda: 0.9
     max_grad_norm: 5
     vf_coef: 0.19
   eval_params:
     step_freq: 5000
 Acrobot-v1:
   n_timesteps: !!float 1e6
     ent_coef: 0.01
     ent_coef_decay: linear
     normalize_advantage: false
 CarRacing-v0:
   n_timesteps: !!float 4e6
     activation_fn: relu
     share_features_extractor: false
     cnn_feature_dim: 256
+    hidden_sizes: [256]
   algo_hyperparams:
     n_steps: 512
     batch_size: 128

lambda_labs/benchmark.sh CHANGED

@@ -1,7 +1,6 @@
 source benchmarks/train_loop.sh
 # export WANDB_PROJECT_NAME="rl-algo-impls"
-export VIRTUAL_DISPLAY=1
 BENCHMARK_MAX_PROCS="${BENCHMARK_MAX_PROCS:-6}"
@@ -20,8 +19,8 @@ ENVS=(
     # PyBullet
     "HalfCheetahBulletEnv-v0"
     "AntBulletEnv-v0"
-    "Walker2DBulletEnv-v0"
     "HopperBulletEnv-v0"
     # CarRacing
     "CarRacing-v0"
     # Atari

 source benchmarks/train_loop.sh
 # export WANDB_PROJECT_NAME="rl-algo-impls"
 BENCHMARK_MAX_PROCS="${BENCHMARK_MAX_PROCS:-6}"
     # PyBullet
     "HalfCheetahBulletEnv-v0"
     "AntBulletEnv-v0"
     "HopperBulletEnv-v0"
+    "Walker2DBulletEnv-v0"
     # CarRacing
     "CarRacing-v0"
     # Atari

lambda_labs/lambda_requirements.txt CHANGED

@@ -8,4 +8,5 @@ wandb >= 0.13.9, < 0.14
 pyvirtualdisplay == 3.0
 pybullet >= 3.2.5, < 3.3
 tabulate >= 0.9.0, < 0.10
-huggingface-hub >= 0.12.0, < 0.13

 pyvirtualdisplay == 3.0
 pybullet >= 3.2.5, < 3.3
 tabulate >= 0.9.0, < 0.10
+huggingface-hub >= 0.12.0, < 0.13
+numexpr >= 2.8.4, < 2.9

poetry.lock CHANGED

@@ -787,47 +787,47 @@ files = [
 [[package]]
 name = "cryptography"
-version = "39.0.0"
 description = "cryptography is a package which provides cryptographic recipes and primitives to Python developers."
 category = "main"
 optional = false
 python-versions = ">=3.6"
 files = [
-    {file = "cryptography-39.0.0-cp36-abi3-macosx_10_12_universal2.whl", hash = "sha256:c52a1a6f81e738d07f43dab57831c29e57d21c81a942f4602fac7ee21b27f288"},
-    {file = "cryptography-39.0.0-cp36-abi3-macosx_10_12_x86_64.whl", hash = "sha256:80ee674c08aaef194bc4627b7f2956e5ba7ef29c3cc3ca488cf15854838a8f72"},
-    {file = "cryptography-39.0.0-cp36-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.manylinux_2_24_aarch64.whl", hash = "sha256:887cbc1ea60786e534b00ba8b04d1095f4272d380ebd5f7a7eb4cc274710fad9"},
-    {file = "cryptography-39.0.0-cp36-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:6f97109336df5c178ee7c9c711b264c502b905c2d2a29ace99ed761533a3460f"},
-    {file = "cryptography-39.0.0-cp36-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:1a6915075c6d3a5e1215eab5d99bcec0da26036ff2102a1038401d6ef5bef25b"},
-    {file = "cryptography-39.0.0-cp36-abi3-manylinux_2_24_x86_64.whl", hash = "sha256:76c24dd4fd196a80f9f2f5405a778a8ca132f16b10af113474005635fe7e066c"},
-    {file = "cryptography-39.0.0-cp36-abi3-manylinux_2_28_aarch64.whl", hash = "sha256:bae6c7f4a36a25291b619ad064a30a07110a805d08dc89984f4f441f6c1f3f96"},
-    {file = "cryptography-39.0.0-cp36-abi3-manylinux_2_28_x86_64.whl", hash = "sha256:875aea1039d78557c7c6b4db2fe0e9d2413439f4676310a5f269dd342ca7a717"},
-    {file = "cryptography-39.0.0-cp36-abi3-musllinux_1_1_aarch64.whl", hash = "sha256:f6c0db08d81ead9576c4d94bbb27aed8d7a430fa27890f39084c2d0e2ec6b0df"},
-    {file = "cryptography-39.0.0-cp36-abi3-musllinux_1_1_x86_64.whl", hash = "sha256:f3ed2d864a2fa1666e749fe52fb8e23d8e06b8012e8bd8147c73797c506e86f1"},
-    {file = "cryptography-39.0.0-cp36-abi3-win32.whl", hash = "sha256:f671c1bb0d6088e94d61d80c606d65baacc0d374e67bf895148883461cd848de"},
-    {file = "cryptography-39.0.0-cp36-abi3-win_amd64.whl", hash = "sha256:e324de6972b151f99dc078defe8fb1b0a82c6498e37bff335f5bc6b1e3ab5a1e"},
-    {file = "cryptography-39.0.0-pp38-pypy38_pp73-macosx_10_12_x86_64.whl", hash = "sha256:754978da4d0457e7ca176f58c57b1f9de6556591c19b25b8bcce3c77d314f5eb"},
-    {file = "cryptography-39.0.0-pp38-pypy38_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:1ee1fd0de9851ff32dbbb9362a4d833b579b4a6cc96883e8e6d2ff2a6bc7104f"},
-    {file = "cryptography-39.0.0-pp38-pypy38_pp73-manylinux_2_24_x86_64.whl", hash = "sha256:fec8b932f51ae245121c4671b4bbc030880f363354b2f0e0bd1366017d891458"},
-    {file = "cryptography-39.0.0-pp38-pypy38_pp73-manylinux_2_28_x86_64.whl", hash = "sha256:407cec680e811b4fc829de966f88a7c62a596faa250fc1a4b520a0355b9bc190"},
-    {file = "cryptography-39.0.0-pp38-pypy38_pp73-win_amd64.whl", hash = "sha256:7dacfdeee048814563eaaec7c4743c8aea529fe3dd53127313a792f0dadc1773"},
-    {file = "cryptography-39.0.0-pp39-pypy39_pp73-macosx_10_12_x86_64.whl", hash = "sha256:ad04f413436b0781f20c52a661660f1e23bcd89a0e9bb1d6d20822d048cf2856"},
-    {file = "cryptography-39.0.0-pp39-pypy39_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:50386acb40fbabbceeb2986332f0287f50f29ccf1497bae31cf5c3e7b4f4b34f"},
-    {file = "cryptography-39.0.0-pp39-pypy39_pp73-manylinux_2_24_x86_64.whl", hash = "sha256:e5d71c5d5bd5b5c3eebcf7c5c2bb332d62ec68921a8c593bea8c394911a005ce"},
-    {file = "cryptography-39.0.0-pp39-pypy39_pp73-manylinux_2_28_x86_64.whl", hash = "sha256:844ad4d7c3850081dffba91cdd91950038ee4ac525c575509a42d3fc806b83c8"},
-    {file = "cryptography-39.0.0-pp39-pypy39_pp73-win_amd64.whl", hash = "sha256:e0a05aee6a82d944f9b4edd6a001178787d1546ec7c6223ee9a848a7ade92e39"},
-    {file = "cryptography-39.0.0.tar.gz", hash = "sha256:f964c7dcf7802d133e8dbd1565914fa0194f9d683d82411989889ecd701e8adf"},
 ]
 [package.dependencies]
 cffi = ">=1.12"
 [package.extras]
-docs = ["sphinx (>=1.6.5,!=1.8.0,!=3.1.0,!=3.1.1,!=5.2.0,!=5.2.0.post0)", "sphinx-rtd-theme"]
 docstest = ["pyenchant (>=1.6.11)", "sphinxcontrib-spelling (>=4.0.1)", "twine (>=1.12.0)"]
-pep8test = ["black", "ruff"]
 sdist = ["setuptools-rust (>=0.11.4)"]
 ssh = ["bcrypt (>=3.1.5)"]
-test = ["hypothesis (>=1.11.4,!=3.79.2)", "iso8601", "pretend", "pytest (>=6.2.0)", "pytest-benchmark", "pytest-cov", "pytest-subtests", "pytest-xdist", "pytz"]
 [[package]]
 name = "cycler"
@@ -2250,6 +2250,49 @@ jupyter-server = ">=1.8,<3"
 [package.extras]
 test = ["pytest", "pytest-console-scripts", "pytest-tornasync"]
 [[package]]
 name = "numpy"
 version = "1.24.1"
@@ -3002,6 +3045,18 @@ files = [
     {file = "pytz-2022.7.tar.gz", hash = "sha256:7ccfae7b4b2c067464a6733c6261673fdb8fd1be905460396b97a073e9fa683a"},
 ]
 [[package]]
 name = "pywin32"
 version = "305"
@@ -4198,4 +4253,4 @@ testing = ["flake8 (<5)", "func-timeout", "jaraco.functools", "jaraco.itertools"
 [metadata]
 lock-version = "2.0"
 python-versions = "~3.10"
-content-hash = "89d4861857be881d3c6cb591d17fb98396b8c117b24a8d4ce4b6593ac8048670"

 [[package]]
 name = "cryptography"
+version = "39.0.1"
 description = "cryptography is a package which provides cryptographic recipes and primitives to Python developers."
 category = "main"
 optional = false
 python-versions = ">=3.6"
 files = [
+    {file = "cryptography-39.0.1-cp36-abi3-macosx_10_12_universal2.whl", hash = "sha256:6687ef6d0a6497e2b58e7c5b852b53f62142cfa7cd1555795758934da363a965"},
+    {file = "cryptography-39.0.1-cp36-abi3-macosx_10_12_x86_64.whl", hash = "sha256:706843b48f9a3f9b9911979761c91541e3d90db1ca905fd63fee540a217698bc"},
+    {file = "cryptography-39.0.1-cp36-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.manylinux_2_24_aarch64.whl", hash = "sha256:5d2d8b87a490bfcd407ed9d49093793d0f75198a35e6eb1a923ce1ee86c62b41"},
+    {file = "cryptography-39.0.1-cp36-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:83e17b26de248c33f3acffb922748151d71827d6021d98c70e6c1a25ddd78505"},
+    {file = "cryptography-39.0.1-cp36-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:e124352fd3db36a9d4a21c1aa27fd5d051e621845cb87fb851c08f4f75ce8be6"},
+    {file = "cryptography-39.0.1-cp36-abi3-manylinux_2_24_x86_64.whl", hash = "sha256:5aa67414fcdfa22cf052e640cb5ddc461924a045cacf325cd164e65312d99502"},
+    {file = "cryptography-39.0.1-cp36-abi3-manylinux_2_28_aarch64.whl", hash = "sha256:35f7c7d015d474f4011e859e93e789c87d21f6f4880ebdc29896a60403328f1f"},
+    {file = "cryptography-39.0.1-cp36-abi3-manylinux_2_28_x86_64.whl", hash = "sha256:f24077a3b5298a5a06a8e0536e3ea9ec60e4c7ac486755e5fb6e6ea9b3500106"},
+    {file = "cryptography-39.0.1-cp36-abi3-musllinux_1_1_aarch64.whl", hash = "sha256:f0c64d1bd842ca2633e74a1a28033d139368ad959872533b1bab8c80e8240a0c"},
+    {file = "cryptography-39.0.1-cp36-abi3-musllinux_1_1_x86_64.whl", hash = "sha256:0f8da300b5c8af9f98111ffd512910bc792b4c77392a9523624680f7956a99d4"},
+    {file = "cryptography-39.0.1-cp36-abi3-win32.whl", hash = "sha256:fe913f20024eb2cb2f323e42a64bdf2911bb9738a15dba7d3cce48151034e3a8"},
+    {file = "cryptography-39.0.1-cp36-abi3-win_amd64.whl", hash = "sha256:ced4e447ae29ca194449a3f1ce132ded8fcab06971ef5f618605aacaa612beac"},
+    {file = "cryptography-39.0.1-pp38-pypy38_pp73-macosx_10_12_x86_64.whl", hash = "sha256:807ce09d4434881ca3a7594733669bd834f5b2c6d5c7e36f8c00f691887042ad"},
+    {file = "cryptography-39.0.1-pp38-pypy38_pp73-win_amd64.whl", hash = "sha256:96f1157a7c08b5b189b16b47bc9db2332269d6680a196341bf30046330d15388"},
+    {file = "cryptography-39.0.1-pp39-pypy39_pp73-macosx_10_12_x86_64.whl", hash = "sha256:e422abdec8b5fa8462aa016786680720d78bdce7a30c652b7fadf83a4ba35336"},
+    {file = "cryptography-39.0.1-pp39-pypy39_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.manylinux_2_24_aarch64.whl", hash = "sha256:b0afd054cd42f3d213bf82c629efb1ee5f22eba35bf0eec88ea9ea7304f511a2"},
+    {file = "cryptography-39.0.1-pp39-pypy39_pp73-manylinux_2_24_x86_64.whl", hash = "sha256:6f8ba7f0328b79f08bdacc3e4e66fb4d7aab0c3584e0bd41328dce5262e26b2e"},
+    {file = "cryptography-39.0.1-pp39-pypy39_pp73-manylinux_2_28_aarch64.whl", hash = "sha256:ef8b72fa70b348724ff1218267e7f7375b8de4e8194d1636ee60510aae104cd0"},
+    {file = "cryptography-39.0.1-pp39-pypy39_pp73-manylinux_2_28_x86_64.whl", hash = "sha256:aec5a6c9864be7df2240c382740fcf3b96928c46604eaa7f3091f58b878c0bb6"},
+    {file = "cryptography-39.0.1-pp39-pypy39_pp73-win_amd64.whl", hash = "sha256:fdd188c8a6ef8769f148f88f859884507b954cc64db6b52f66ef199bb9ad660a"},
+    {file = "cryptography-39.0.1.tar.gz", hash = "sha256:d1f6198ee6d9148405e49887803907fe8962a23e6c6f83ea7d98f1c0de375695"},
 ]
 [package.dependencies]
 cffi = ">=1.12"
 [package.extras]
+docs = ["sphinx (>=5.3.0)", "sphinx-rtd-theme (>=1.1.1)"]
 docstest = ["pyenchant (>=1.6.11)", "sphinxcontrib-spelling (>=4.0.1)", "twine (>=1.12.0)"]
+pep8test = ["black", "check-manifest", "mypy", "ruff", "types-pytz", "types-requests"]
 sdist = ["setuptools-rust (>=0.11.4)"]
 ssh = ["bcrypt (>=3.1.5)"]
+test = ["hypothesis (>=1.11.4,!=3.79.2)", "iso8601", "pretend", "pytest (>=6.2.0)", "pytest-benchmark", "pytest-cov", "pytest-shard (>=0.1.2)", "pytest-subtests", "pytest-xdist", "pytz"]
+test-randomorder = ["pytest-randomly"]
+tox = ["tox"]
 [[package]]
 name = "cycler"
 [package.extras]
 test = ["pytest", "pytest-console-scripts", "pytest-tornasync"]
+[[package]]
+name = "numexpr"
+version = "2.8.4"
+description = "Fast numerical expression evaluator for NumPy"
+category = "main"
+optional = false
+python-versions = ">=3.7"
+files = [
+    {file = "numexpr-2.8.4-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:a75967d46b6bd56455dd32da6285e5ffabe155d0ee61eef685bbfb8dafb2e484"},
+    {file = "numexpr-2.8.4-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:db93cf1842f068247de631bfc8af20118bf1f9447cd929b531595a5e0efc9346"},
+    {file = "numexpr-2.8.4-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:7bca95f4473b444428061d4cda8e59ac564dc7dc6a1dea3015af9805c6bc2946"},
+    {file = "numexpr-2.8.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:9e34931089a6bafc77aaae21f37ad6594b98aa1085bb8b45d5b3cd038c3c17d9"},
+    {file = "numexpr-2.8.4-cp310-cp310-win32.whl", hash = "sha256:f3a920bfac2645017110b87ddbe364c9c7a742870a4d2f6120b8786c25dc6db3"},
+    {file = "numexpr-2.8.4-cp310-cp310-win_amd64.whl", hash = "sha256:6931b1e9d4f629f43c14b21d44f3f77997298bea43790cfcdb4dd98804f90783"},
+    {file = "numexpr-2.8.4-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:9400781553541f414f82eac056f2b4c965373650df9694286b9bd7e8d413f8d8"},
+    {file = "numexpr-2.8.4-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:6ee9db7598dd4001138b482342b96d78110dd77cefc051ec75af3295604dde6a"},
+    {file = "numexpr-2.8.4-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:ff5835e8af9a212e8480003d731aad1727aaea909926fd009e8ae6a1cba7f141"},
+    {file = "numexpr-2.8.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:655d84eb09adfee3c09ecf4a89a512225da153fdb7de13c447404b7d0523a9a7"},
+    {file = "numexpr-2.8.4-cp311-cp311-win32.whl", hash = "sha256:5538b30199bfc68886d2be18fcef3abd11d9271767a7a69ff3688defe782800a"},
+    {file = "numexpr-2.8.4-cp311-cp311-win_amd64.whl", hash = "sha256:3f039321d1c17962c33079987b675fb251b273dbec0f51aac0934e932446ccc3"},
+    {file = "numexpr-2.8.4-cp37-cp37m-macosx_10_9_x86_64.whl", hash = "sha256:c867cc36cf815a3ec9122029874e00d8fbcef65035c4a5901e9b120dd5d626a2"},
+    {file = "numexpr-2.8.4-cp37-cp37m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:059546e8f6283ccdb47c683101a890844f667fa6d56258d48ae2ecf1b3875957"},
+    {file = "numexpr-2.8.4-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:845a6aa0ed3e2a53239b89c1ebfa8cf052d3cc6e053c72805e8153300078c0b1"},
+    {file = "numexpr-2.8.4-cp37-cp37m-win32.whl", hash = "sha256:a38664e699526cb1687aefd9069e2b5b9387da7feac4545de446141f1ef86f46"},
+    {file = "numexpr-2.8.4-cp37-cp37m-win_amd64.whl", hash = "sha256:eaec59e9bf70ff05615c34a8b8d6c7bd042bd9f55465d7b495ea5436f45319d0"},
+    {file = "numexpr-2.8.4-cp38-cp38-macosx_10_9_x86_64.whl", hash = "sha256:b318541bf3d8326682ebada087ba0050549a16d8b3fa260dd2585d73a83d20a7"},
+    {file = "numexpr-2.8.4-cp38-cp38-macosx_11_0_arm64.whl", hash = "sha256:b076db98ca65eeaf9bd224576e3ac84c05e451c0bd85b13664b7e5f7b62e2c70"},
+    {file = "numexpr-2.8.4-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:90f12cc851240f7911a47c91aaf223dba753e98e46dff3017282e633602e76a7"},
+    {file = "numexpr-2.8.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:6c368aa35ae9b18840e78b05f929d3a7b3abccdba9630a878c7db74ca2368339"},
+    {file = "numexpr-2.8.4-cp38-cp38-win32.whl", hash = "sha256:b96334fc1748e9ec4f93d5fadb1044089d73fb08208fdb8382ed77c893f0be01"},
+    {file = "numexpr-2.8.4-cp38-cp38-win_amd64.whl", hash = "sha256:a6d2d7740ae83ba5f3531e83afc4b626daa71df1ef903970947903345c37bd03"},
+    {file = "numexpr-2.8.4-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:77898fdf3da6bb96aa8a4759a8231d763a75d848b2f2e5c5279dad0b243c8dfe"},
+    {file = "numexpr-2.8.4-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:df35324666b693f13a016bc7957de7cc4d8801b746b81060b671bf78a52b9037"},
+    {file = "numexpr-2.8.4-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:17ac9cfe6d0078c5fc06ba1c1bbd20b8783f28c6f475bbabd3cad53683075cab"},
+    {file = "numexpr-2.8.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:df3a1f6b24214a1ab826e9c1c99edf1686c8e307547a9aef33910d586f626d01"},
+    {file = "numexpr-2.8.4-cp39-cp39-win32.whl", hash = "sha256:7d71add384adc9119568d7e9ffa8a35b195decae81e0abf54a2b7779852f0637"},
+    {file = "numexpr-2.8.4-cp39-cp39-win_amd64.whl", hash = "sha256:9f096d707290a6a00b6ffdaf581ee37331109fb7b6c8744e9ded7c779a48e517"},
+    {file = "numexpr-2.8.4.tar.gz", hash = "sha256:d5432537418d18691b9115d615d6daa17ee8275baef3edf1afbbf8bc69806147"},
+]
+[package.dependencies]
+numpy = ">=1.13.3"
 [[package]]
 name = "numpy"
 version = "1.24.1"
     {file = "pytz-2022.7.tar.gz", hash = "sha256:7ccfae7b4b2c067464a6733c6261673fdb8fd1be905460396b97a073e9fa683a"},
 ]
+[[package]]
+name = "pyvirtualdisplay"
+version = "3.0"
+description = "python wrapper for Xvfb, Xephyr and Xvnc"
+category = "main"
+optional = false
+python-versions = "*"
+files = [
+    {file = "PyVirtualDisplay-3.0-py3-none-any.whl", hash = "sha256:40d4b8dfe4b8de8552e28eb367647f311f88a130bf837fe910e7f180d5477f0e"},
+    {file = "PyVirtualDisplay-3.0.tar.gz", hash = "sha256:09755bc3ceb6eb725fb07eca5425f43f2358d3bf08e00d2a9b792a1aedd16159"},
+]
 [[package]]
 name = "pywin32"
 version = "305"
 [metadata]
 lock-version = "2.0"
 python-versions = "~3.10"
+content-hash = "8301ee1f2321a6c23370a61466fd3b45096291c2cc63326bbe4701774edf1d94"

ppo/policy.py CHANGED

@@ -2,7 +2,7 @@ from stable_baselines3.common.vec_env.base_vec_env import VecEnv
 from typing import Optional, Sequence
 from gym.spaces import Box, Discrete
-from shared.policy.on_policy import ActorCritic
 class PPOActorCritic(ActorCritic):
@@ -13,21 +13,16 @@ class PPOActorCritic(ActorCritic):
         v_hidden_sizes: Optional[Sequence[int]] = None,
         **kwargs,
     ) -> None:
-        obs_space = env.observation_space
-        if isinstance(obs_space, Box):
-            if len(obs_space.shape) == 3:
-                pi_hidden_sizes = pi_hidden_sizes or []
-                v_hidden_sizes = v_hidden_sizes or []
-            elif len(obs_space.shape) == 1:
-                pi_hidden_sizes = pi_hidden_sizes or [64, 64]
-                v_hidden_sizes = v_hidden_sizes or [64, 64]
-            else:
-                raise ValueError(f"Unsupported observation space: {obs_space}")
-        elif isinstance(obs_space, Discrete):
-            pi_hidden_sizes = pi_hidden_sizes or [64]
-            v_hidden_sizes = v_hidden_sizes or [64]
-        else:
-            raise ValueError(f"Unsupported observation space: {obs_space}")
         super().__init__(
             env,
             pi_hidden_sizes,

 from typing import Optional, Sequence
 from gym.spaces import Box, Discrete
+from shared.policy.on_policy import ActorCritic, default_hidden_sizes
 class PPOActorCritic(ActorCritic):
         v_hidden_sizes: Optional[Sequence[int]] = None,
         **kwargs,
     ) -> None:
+        pi_hidden_sizes = (
+            pi_hidden_sizes
+            if pi_hidden_sizes is not None
+            else default_hidden_sizes(env.observation_space)
+        )
+        v_hidden_sizes = (
+            v_hidden_sizes
+            if v_hidden_sizes is not None
+            else default_hidden_sizes(env.observation_space)
+        )
         super().__init__(
             env,
             pi_hidden_sizes,
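
The hunk above swaps the `or`-based fallback for an explicit `is not None` check. A minimal sketch of why that distinction can matter, assuming a hypothetical stand-in `default_sizes()` in place of the repo's `default_hidden_sizes(env.observation_space)`: an explicitly passed empty list is falsy, so `or` would quietly replace it with the default.

```python
from typing import List, Optional, Sequence


def default_sizes() -> List[int]:
    # Hypothetical stand-in for default_hidden_sizes(env.observation_space);
    # the value mirrors the 1-D Box default shown in the diff.
    return [64, 64]


def resolve_with_or(sizes: Optional[Sequence[int]]) -> Sequence[int]:
    # Old style: any falsy value (None or an empty list) falls back to the default.
    return sizes or default_sizes()


def resolve_with_none_check(sizes: Optional[Sequence[int]]) -> Sequence[int]:
    # New style: only a missing value (None) falls back to the default.
    return sizes if sizes is not None else default_sizes()


print(resolve_with_or([]))            # the caller's empty list is discarded
print(resolve_with_none_check([]))    # the caller's empty list is kept
print(resolve_with_none_check(None))  # the default is used
```

With the `is not None` form, a caller can presumably request "no hidden layers" on a vector observation space by passing `[]`, which the `or` form would not allow.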

pyproject.toml CHANGED

@@ -23,6 +23,9 @@ torch-tb-profiler = "^0.4.1"
 jupyter = "^1.0.0"
 tabulate = "^0.9.0"
 huggingface-hub = "^0.12.0"
 [build-system]
 requires = ["poetry-core"]

 jupyter = "^1.0.0"
 tabulate = "^0.9.0"
 huggingface-hub = "^0.12.0"
+cryptography = "39.0.1"
+pyvirtualdisplay = "^3.0"
+numexpr = "^2.8.4"
 [build-system]
 requires = ["poetry-core"]
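
The three added dependencies use two constraint styles: `cryptography = "39.0.1"` is an exact pin, while `^3.0` and `^2.8.4` are caret constraints. As a rough sketch of how a caret constraint expands into a version range (the helper name `caret_range` is hypothetical and this is not Poetry's own code):

```python
from typing import List, Tuple


def caret_range(constraint: str) -> Tuple[str, str]:
    """Return (inclusive lower bound, exclusive upper bound) for '^X[.Y[.Z]]'."""
    parts: List[int] = [int(p) for p in constraint.lstrip("^").split(".")]
    # The upper bound bumps the left-most non-zero component; if every
    # given component is zero, the last given component is bumped instead.
    idx = next((i for i, p in enumerate(parts) if p != 0), len(parts) - 1)
    upper = parts[:idx] + [parts[idx] + 1]

    def pad(xs: List[int]) -> str:
        # Pad to three components so bounds read as full versions.
        return ".".join(str(x) for x in (xs + [0, 0, 0])[:3])

    return pad(parts), pad(upper)


for spec in ("^3.0", "^2.8.4", "^0.12.0"):
    low, high = caret_range(spec)
    print(f"{spec}: >={low},<{high}")
```

Under this reading, `pyvirtualdisplay` may float within the 3.x series and `numexpr` within 2.x from 2.8.4 upward, whereas `cryptography` stays fixed at 39.0.1.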

replay.meta.json CHANGED

@@ -1 +1 @@

- {"content_type": "video/mp4", "encoder_version": {"backend": "ffmpeg", "version": "b'ffmpeg version 5.1.2 Copyright (c) 2000-2022 the FFmpeg developers\\nbuilt with clang version 14.0.6\\nconfiguration: --prefix=/Users/runner/miniforge3/conda-bld/ffmpeg_1671040513231/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pl --cc=arm64-apple-darwin20.0.0-clang --cxx=arm64-apple-darwin20.0.0-clang++ --nm=arm64-apple-darwin20.0.0-nm --ar=arm64-apple-darwin20.0.0-ar --disable-doc --disable-openssl --enable-demuxer=dash --enable-hardcoded-tables --enable-libfreetype --enable-libfontconfig --enable-libopenh264 --enable-cross-compile --arch=arm64 --target-os=darwin --cross-prefix=arm64-apple-darwin20.0.0- --host-cc=/Users/runner/miniforge3/conda-bld/ffmpeg_1671040513231/_build_env/bin/x86_64-apple-darwin13.4.0-clang --enable-neon --enable-gnutls --enable-libmp3lame --enable-libvpx --enable-pthreads --enable-gpl --enable-libx264 --enable-libx265 --enable-libaom --enable-libsvtav1 --enable-libxml2 --enable-pic --enable-shared --disable-static --enable-version3 --enable-zlib --pkg-config=/Users/runner/miniforge3/conda-bld/ffmpeg_1671040513231/_build_env/bin/pkg-config\\nlibavutil 57. 28.100 / 57. 28.100\\nlibavcodec 59. 37.100 / 59. 37.100\\nlibavformat 59. 27.100 / 59. 27.100\\nlibavdevice 59. 7.100 / 59. 7.100\\nlibavfilter 8. 44.100 / 8. 44.100\\nlibswscale 6. 7.100 / 6. 7.100\\nlibswresample 4. 7.100 / 4. 7.100\\nlibpostproc 56. 6.100 / 56. 
6.100\\n'", "cmdline": ["ffmpeg", "-nostats", "-loglevel", "error", "-y", "-f", "rawvideo", "-s:v", "600x400", "-pix_fmt", "rgb24", "-framerate", "50", "-i", "-", "-vf", "scale=trunc(iw/2)*2:trunc(ih/2)*2", "-vcodec", "libx264", "-pix_fmt", "yuv420p", "-r", "50", "/var/folders/9g/my5557_91xddp6lx00nkzly80000gn/T/tmpdv8g18sc/ppo-CarRacing-v0/replay.mp4"]}, "episode": {"r": 442.7605895996094, "l": 1000, "t": 12.301982}}

+ {"content_type": "video/mp4", "encoder_version": {"backend": "ffmpeg", "version": "b'ffmpeg version 5.1.2 Copyright (c) 2000-2022 the FFmpeg developers\\nbuilt with clang version 14.0.6\\nconfiguration: --prefix=/Users/runner/miniforge3/conda-bld/ffmpeg_1671040513231/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pl --cc=arm64-apple-darwin20.0.0-clang --cxx=arm64-apple-darwin20.0.0-clang++ --nm=arm64-apple-darwin20.0.0-nm --ar=arm64-apple-darwin20.0.0-ar --disable-doc --disable-openssl --enable-demuxer=dash --enable-hardcoded-tables --enable-libfreetype --enable-libfontconfig --enable-libopenh264 --enable-cross-compile --arch=arm64 --target-os=darwin --cross-prefix=arm64-apple-darwin20.0.0- --host-cc=/Users/runner/miniforge3/conda-bld/ffmpeg_1671040513231/_build_env/bin/x86_64-apple-darwin13.4.0-clang --enable-neon --enable-gnutls --enable-libmp3lame --enable-libvpx --enable-pthreads --enable-gpl --enable-libx264 --enable-libx265 --enable-libaom --enable-libsvtav1 --enable-libxml2 --enable-pic --enable-shared --disable-static --enable-version3 --enable-zlib --pkg-config=/Users/runner/miniforge3/conda-bld/ffmpeg_1671040513231/_build_env/bin/pkg-config\\nlibavutil 57. 28.100 / 57. 28.100\\nlibavcodec 59. 37.100 / 59. 37.100\\nlibavformat 59. 27.100 / 59. 27.100\\nlibavdevice 59. 7.100 / 59. 7.100\\nlibavfilter 8. 44.100 / 8. 44.100\\nlibswscale 6. 7.100 / 6. 7.100\\nlibswresample 4. 7.100 / 4. 7.100\\nlibpostproc 56. 6.100 / 56. 
6.100\\n'", "cmdline": ["ffmpeg", "-nostats", "-loglevel", "error", "-y", "-f", "rawvideo", "-s:v", "600x400", "-pix_fmt", "rgb24", "-framerate", "50", "-i", "-", "-vf", "scale=trunc(iw/2)*2:trunc(ih/2)*2", "-vcodec", "libx264", "-pix_fmt", "yuv420p", "-r", "50", "/var/folders/9g/my5557_91xddp6lx00nkzly80000gn/T/tmpcfp_i4ww/ppo-CarRacing-v0/replay.mp4"]}, "episode": {"r": 910.102783203125, "l": 899, "t": 11.729042}}
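
For readers who only need the episode numbers from this file, a small sketch of pulling them out with the standard library. The JSON below is a trimmed copy of the new `replay.meta.json` (the long `encoder_version` and `cmdline` fields are omitted), and reading `r`, `l`, and `t` as episode reward, length in steps, and elapsed seconds is an assumption based on the usual gym episode-statistics convention.

```python
import json

# Trimmed copy of the new replay.meta.json from the diff above.
meta_text = (
    '{"content_type": "video/mp4", '
    '"episode": {"r": 910.102783203125, "l": 899, "t": 11.729042}}'
)

meta = json.loads(meta_text)
episode = meta["episode"]

# Assumed meaning: r = total reward, l = steps, t = elapsed seconds.
summary = (
    f"reward={episode['r']:.2f} "
    f"steps={episode['l']} "
    f"seconds={episode['t']:.1f}"
)
print(summary)
```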

replay.mp4 CHANGED

Binary files a/replay.mp4 and b/replay.mp4 differ

runner/running_utils.py CHANGED

@@ -119,6 +119,8 @@ def set_seeds(seed: Optional[int], use_deterministic_algorithms: bool) -> None:
     torch.backends.cudnn.benchmark = False
     torch.use_deterministic_algorithms(use_deterministic_algorithms)
     os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
 def make_policy(

     torch.backends.cudnn.benchmark = False
     torch.use_deterministic_algorithms(use_deterministic_algorithms)
     os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
+    # Silences a TensorFlow warning; oneDNN ops would also introduce stochasticity if TF were used
+    os.environ["TF_ENABLE_ONEDNN_OPTS"] = "0"
 def make_policy(

saved_models/ppo-CarRacing-v0-S1-best/model.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:e252b9b261d9afa502ed7c5079655ec0e34990e64aaa8ec133e91ce72aa5fb34
+size 2737400
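
The three added lines are a Git LFS pointer rather than the model weights themselves. A minimal sketch of reading such a pointer, assuming the simple `key value` line layout seen above (the function name `parse_lfs_pointer` is hypothetical):

```python
from typing import Dict, Union


def parse_lfs_pointer(text: str) -> Dict[str, Union[str, int]]:
    """Parse 'key value' lines from a Git LFS pointer file."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line is a key, a single space, then the value.
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid is written as '<hash algorithm>:<hex digest>'.
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:e252b9b261d9afa502ed7c5079655ec0e34990e64aaa8ec133e91ce72aa5fb34
size 2737400
"""
info = parse_lfs_pointer(pointer)
print(info["hash_algo"], info["size"])
```

The `size` field suggests the real `model.pth` is about 2.7 MB; the pointer only records where to find it and how to verify it.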

shared/policy/on_policy.py CHANGED

@@ -2,7 +2,7 @@ import gym
 import numpy as np
 import torch
-from gym.spaces import Box
 from pathlib import Path
 from stable_baselines3.common.vec_env.base_vec_env import VecEnv, VecEnvObs
 from typing import NamedTuple, Optional, Sequence, Tuple, TypeVar
@@ -47,6 +47,21 @@ def clamp_actions(
     return actions
 class ActorCritic(Policy):
     def __init__(
         self,

 import numpy as np
 import torch
+from gym.spaces import Box, Discrete, Space
 from pathlib import Path
 from stable_baselines3.common.vec_env.base_vec_env import VecEnv, VecEnvObs
 from typing import NamedTuple, Optional, Sequence, Tuple, TypeVar
     return actions
+def default_hidden_sizes(obs_space: Space) -> Sequence[int]:
+    if isinstance(obs_space, Box):
+        if len(obs_space.shape) == 3:
+            # By default, no hidden layers sit between the feature extractor and the output
+            return []
+        elif len(obs_space.shape) == 1:
+            return [64, 64]
+        else:
+            raise ValueError(f"Unsupported observation space: {obs_space}")
+    elif isinstance(obs_space, Discrete):
+        return [64]
+    else:
+        raise ValueError(f"Unsupported observation space: {obs_space}")
 class ActorCritic(Policy):
     def __init__(
         self,

shared/policy/policy.py CHANGED

@@ -51,7 +51,6 @@ class Policy(nn.Module, ABC):
             os.path.join(path, MODEL_FILENAME),
         )
-    @abstractmethod
     def load(self, path: str) -> None:
         # VecNormalize load occurs in env.py
         self.load_state_dict(

             os.path.join(path, MODEL_FILENAME),
         )
     def load(self, path: str) -> None:
         # VecNormalize load occurs in env.py
         self.load_state_dict(

train.py CHANGED

@@ -45,21 +45,20 @@ if __name__ == "__main__":
     parser.add_argument(
         "--pool-size", type=int, default=1, help="Simultaneous training jobs to run"
     )
-    parser.add_argument(
-        "--virtual-display",
-        action="store_true",
-        help="Whether to create a virtual display for video rendering",
     )
-    parser.set_defaults(algo="ppo", env="CartPole-v1", seed=1)
     args = parser.parse_args()
     print(args)
-    if args.virtual_display:
-        from pyvirtualdisplay import Display
-        virtual_display = Display(visible=0, size=(1400, 900))
         virtual_display.start()
-    delattr(args, "virtual_display")
     # pool_size isn't a TrainArg so must be removed from args
     pool_size = args.pool_size

     parser.add_argument(
         "--pool-size", type=int, default=1, help="Simultaneous training jobs to run"
     )
+    parser.set_defaults(
+        algo="ppo",
+        env="MountainCarContinuous-v0",
+        seed=[1, 2, 3],
+        pool_size=3,
     )
     args = parser.parse_args()
     print(args)
+    if args.pool_size == 1:
+        from pyvirtualdisplay.display import Display
+        virtual_display = Display(visible=False, size=(1400, 900))
         virtual_display.start()
     # pool_size isn't a TrainArg so must be removed from args
     pool_size = args.pool_size
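
The new `set_defaults` call changes `pool_size` to 3 even though `--pool-size` is declared with `default=1`, because parser-level defaults take precedence over argument-level ones in `argparse`. A small sketch of that behavior; the `--algo` and `--seed` declarations here are assumptions, since the diff does not show how `train.py` defines them:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--algo", default="dqn")        # hypothetical declaration
parser.add_argument("--seed", type=int, nargs="*")  # hypothetical declaration
parser.add_argument(
    "--pool-size", type=int, default=1, help="Simultaneous training jobs to run"
)
# Parser-level defaults override the per-argument defaults above.
parser.set_defaults(algo="ppo", seed=[1, 2, 3], pool_size=3)

defaults = parser.parse_args([])
explicit = parser.parse_args(["--pool-size", "1", "--seed", "7"])

print(defaults.pool_size, defaults.seed, defaults.algo)
print(explicit.pool_size, explicit.seed)
```

If `train.py` behaves like this sketch, the `args.pool_size == 1` branch that starts the virtual display would only run when `--pool-size 1` is passed explicitly.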

vpg/policy.py CHANGED

@@ -15,7 +15,7 @@ from shared.policy.actor import (
     actor_head,
 )
 from shared.policy.critic import CriticHead
-from shared.policy.on_policy import Step, clamp_actions
 from shared.policy.policy import ACTIVATION, Policy
 PI_FILE_NAME = "pi.pt"
@@ -37,7 +37,7 @@ class VPGActorCritic(Policy):
     def __init__(
         self,
         env: VecEnv,
-        hidden_sizes: Sequence[int],
         init_layers_orthogonal: bool = True,
         activation_fn: str = "tanh",
         log_std_init: float = -0.5,
@@ -53,6 +53,12 @@ class VPGActorCritic(Policy):
         self.use_sde = use_sde
         self.squash_output = squash_output
         pi_feature_extractor = FeatureExtractor(
             obs_space, activation, init_layers_orthogonal=init_layers_orthogonal
         )

     actor_head,
 )
 from shared.policy.critic import CriticHead
+from shared.policy.on_policy import Step, clamp_actions, default_hidden_sizes
 from shared.policy.policy import ACTIVATION, Policy
 PI_FILE_NAME = "pi.pt"
     def __init__(
         self,
         env: VecEnv,
+        hidden_sizes: Optional[Sequence[int]] = None,
         init_layers_orthogonal: bool = True,
         activation_fn: str = "tanh",
         log_std_init: float = -0.5,
         self.use_sde = use_sde
         self.squash_output = squash_output
+        hidden_sizes = (
+            hidden_sizes
+            if hidden_sizes is not None
+            else default_hidden_sizes(obs_space)
+        )
         pi_feature_extractor = FeatureExtractor(
             obs_space, activation, init_layers_orthogonal=init_layers_orthogonal
         )