00BER committed on
Commit
e085e3b
1 Parent(s): cc9a97a

Upload 36 files

Makefile ADDED
@@ -0,0 +1,30 @@
.PHONY: create-atari-env
create-atari-env: ## Creates the Atari/retro conda environment
	conda env create -f environment.atari.yml --force

.PHONY: create-procgen-env
create-procgen-env: ## Creates the procgen conda environment
	conda env create -f environment.procgen.yml --force

.PHONY: setup-env
setup-env: ## Installs dependencies into the active conda environment
	conda install pytorch torchvision numpy -c pytorch -y
	pip install gym-retro
	pip install "gym[atari]==0.21.0"
	pip install importlib-metadata==4.13.0

.PHONY: run-air-dqn
run-air-dqn: ## Runs the Airstriker Genesis DQN agent
	python ./src/airstriker-genesis/run-airstriker-dqn.py

.PHONY: run-air-ddqn
run-air-ddqn: ## Runs the Airstriker Genesis Double DQN agent
	python ./src/airstriker-genesis/run-airstriker-ddqn.py

.PHONY: run-starpilot-dqn
run-starpilot-dqn: ## Runs the Starpilot DQN agent
	python ./src/procgen/run-starpilot-dqn.py

.PHONY: run-starpilot-ddqn
run-starpilot-ddqn: ## Runs the Starpilot Double DQN agent
	python ./src/procgen/run-starpilot-ddqn.py
README.md CHANGED
@@ -0,0 +1,358 @@
# **Abstract**

In 2013, DeepMind published a paper called "Playing Atari with Deep
Reinforcement Learning", introducing an algorithm called the Deep
Q-Network (DQN) that revolutionized the field of reinforcement
learning. For the first time, Deep Learning and Q-learning were
brought together at this scale, and the resulting agents showed
impressive results on Atari games, performing at or above human-level
expertise on several of the games they were trained on.
A Deep Q-Network uses a deep neural network to estimate the q-value
of each action, allowing the policy to select the action with the
maximum q-value. Using a neural network to obtain q-values is
immensely more scalable than q-table look-ups, and it widened the
applicability of Q-learning to far more complex reinforcement
learning environments.
While revolutionary, the original version of DQN had a few problems,
especially its slow and sample-inefficient learning process. Over the
past 9 years, several improved versions of DQN have become popular.
This project is an attempt to study the effectiveness of a few of
these DQN flavors, understand what problems they solve, and compare
their performance in the same reinforcement learning environment.

# Deep Q-Networks and their flavors

- **Vanilla DQN**

  The vanilla (original) DQN uses two neural networks: the **online**
  network and the **target** network. The online network is the main
  network that the agent uses to select the best action for a given
  state. The target network is a periodically updated copy of the
  online network, and it is used to get the "target" q-values for
  each action in a particular state. That is, since we don't have
  actual ground truths for future q-values during the learning phase,
  the q-values from the target network are used as the labels against
  which the online network is optimized.

  The target network calculates the target q-values using the
  following Bellman equation: \[\begin{aligned}
  Q(s_t, a_t) =
  r_{t+1} + \gamma \max_{a_{t+1} \in A} Q(s_{t+1}, a_{t+1})
  \end{aligned}\] where,
  \(Q(s_t, a_t)\) = the target q-value (ground truth) for a past
  experience in the replay memory

  \(r_{t+1}\) = the reward that was obtained for taking the chosen
  action in that particular experience

  \(\gamma\) = the discount factor for future rewards

  \(Q(s_{t+1}, a_{t+1})\) = the q-value of the best action (based on
  the policy) in the next state of that particular experience

- **Double DQN**

  One of the problems with the vanilla DQN is the way it calculates
  its target values (ground truth). We can see from the Bellman
  equation above that the target network uses the **max** q-value
  directly in the equation. This almost always overestimates the
  q-value, because the **max** operator introduces maximization bias
  into our estimates: max returns the largest value even when that
  value is an outlier, which skews our estimates.
  The Double DQN addresses this problem by changing the original
  algorithm as follows:

  1. Instead of using the **max** function, first use the online
     network to estimate the best action for the next state

  2. Calculate the target q-values for the next state for each
     possible action using the target network

  3. From the q-values calculated by the target network, use the
     q-value of the action chosen in step 1.

  This can be represented by the following equation: \[\begin{aligned}
  Q(s_t, a_t) =
  r_{t+1} + \gamma Q_{target}(s_{t+1}, a'_{t+1})
  \end{aligned}\] where, \[\begin{aligned}
  a'_{t+1} = \arg\max_{a} Q_{online}(s_{t+1}, a)
  \end{aligned}\]

- **Dueling DQN**

  The Dueling DQN was an attempt to improve upon the original DQN by
  changing the architecture of the neural network used in deep
  Q-learning. The Dueling DQN splits the last layer of the network
  into two parts, a **value stream** and an **advantage stream**,
  whose outputs are combined in an aggregating layer that produces
  the final q-value. One of the main problems with the original DQN
  is that the q-values of the different actions are often very close,
  so selecting the action with the max q-value may not always be the
  best choice. The Dueling DQN attempts to mitigate this by using the
  advantage, a measure of how much better an action is compared to
  the other actions available in a given state. The value stream, on
  the other hand, learns how good or bad it is to be in a specific
  state, e.g. moving straight towards an obstacle in a racing game,
  or being in the path of a projectile in Space Invaders. Instead of
  learning to predict a single q-value, separating the value and
  advantage streams helps the network generalize better. (A short
  PyTorch sketch contrasting how the three variants compute their
  q-value targets follows this list.)

  ![image](./docs/dueling.png)
  Fig: The Dueling DQN architecture (image taken from the original
  paper by Wang et al.)


  The q-value in a Dueling DQN architecture is given by
  \[\begin{aligned}
  Q(s_t, a) = V(s_t) + A(s_t, a)
  \end{aligned}\] where,
  \(V(s_t)\) = the value of the current state (how advantageous it is
  to be in that state)

  \(A(s_t, a)\) = the advantage of taking action \(a\) in that state

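Below is a minimal, self-contained PyTorch sketch of how the three
variants differ. It is illustrative only and not the project's exact
code: the networks, batch tensors and the `DuelingHead` module are
hypothetical placeholders, and the dueling aggregation subtracts the
mean advantage as in Wang et al.

```python
import torch
import torch.nn as nn

# Assumed: online_net(s) and target_net(s) return q-values of shape (B, n_actions);
# rewards and dones are 1-D tensors of length B.

@torch.no_grad()
def vanilla_dqn_target(target_net, rewards, next_states, dones, gamma=0.99):
    # Vanilla DQN: bootstrap from the max q-value of the target network.
    next_q = target_net(next_states).max(dim=1).values
    return rewards + gamma * (1.0 - dones.float()) * next_q

@torch.no_grad()
def double_dqn_target(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    # Double DQN: the online network chooses the action,
    # the target network evaluates it.
    best_actions = online_net(next_states).argmax(dim=1, keepdim=True)
    next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
    return rewards + gamma * (1.0 - dones.float()) * next_q

class DuelingHead(nn.Module):
    """Dueling DQN head: separate value and advantage streams, then aggregate."""
    def __init__(self, in_features, n_actions):
        super().__init__()
        self.value = nn.Linear(in_features, 1)
        self.advantage = nn.Linear(in_features, n_actions)

    def forward(self, features):
        v = self.value(features)          # (B, 1)
        a = self.advantage(features)      # (B, n_actions)
        # Subtracting the mean advantage keeps V and A identifiable.
        return v + a - a.mean(dim=1, keepdim=True)
```
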
# About the project

My original goal for the project was to train an agent using DQN to
play **Airstriker Genesis**, a space-shooting game, and then evaluate
the same agent's performance on another, similar game called
**Starpilot**. Unfortunately, I was unable to train a good enough
agent on the first game, which made it meaningless to evaluate its
performance on yet another game.

Because I still want to do the original project some time in the
future, I thought it would be better to prepare by first learning
in depth how Deep Q-Networks work, what their shortcomings are and
how they can be improved. For this reason, and because of time
constraints, I changed my project for this class to a comparison of
various DQN versions.

# Dataset

I used the excellent [Gym](https://github.com/openai/gym) library to
run my environments. A total of 9 agents were trained: 1 on
Airstriker Genesis, 4 on Starpilot and 4 on Lunar Lander.

| **Game** | **Observation Space** | **Action Space** |
| :--- | :--- | :--- |
| Airstriker Genesis | RGB values of each pixel of the game screen (255, 255, 3) | Discrete(12), one entry per button on the old Genesis controller. Since only three of those buttons are used in the game, the action space was reduced to 3 during training (Left, Right, Fire). |
| Starpilot | RGB values of each pixel of the game screen (64, 64, 3) | Discrete(15), one entry per button combo (Left, Right, Up, Down, Up + Right, Up + Left, Down + Right, Down + Left, W, A, S, D, Q, E, Do nothing) |
| Lunar Lander | 8-dimensional vector: (X-coordinate, Y-coordinate, Linear velocity in X, Linear velocity in Y, Angle, Angular velocity, Boolean (Leg 1 in contact with ground), Boolean (Leg 2 in contact with ground)) | Discrete(4) (Do nothing, Fire left engine, Fire main engine, Fire right engine) |


**Environment/Libraries**:
Miniconda, Python 3.9, Gym, PyTorch, NumPy, TensorBoard on my
personal MacBook Pro (M1)

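For reference, the three environments can be created roughly as
follows. This is a minimal sketch assuming the gym 0.21-era API and
the gym-retro and procgen packages pinned in the environment files;
the exact wrapper setup used for training lives in the run scripts
under `src/`.

```python
import gym
import retro  # gym-retro, used for Airstriker Genesis

# Airstriker Genesis ships with gym-retro as its free test ROM.
air_env = retro.make(game="Airstriker-Genesis")

# Procgen games are exposed through gym's registry.
star_env = gym.make("procgen:procgen-starpilot-v0")

# Box2D game with a small 8-dimensional observation vector.
lunar_env = gym.make("LunarLander-v2")

for name, env in [("Airstriker", air_env), ("Starpilot", star_env),
                  ("LunarLander", lunar_env)]:
    print(name, env.observation_space, env.action_space)
```
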
# ML Methodology

Each agent was trained using DQN or one of its flavors. All agents
for a particular game were trained with the same hyperparameters;
only the underlying algorithm differed. The following metrics were
used to evaluate each agent (a short logging sketch follows this
list):

- **Epsilon value over each episode** Shows what the exploration
  rate was at the end of each episode.

- **Average Q-value for the last 100 episodes** A measure of the
  average q-value (for the chosen action) over the last 100
  episodes.

- **Average length for the last 100 episodes** A measure of the
  average number of steps taken in each episode.

- **Average loss for the last 100 episodes** A measure of the loss
  during learning over the last 100 episodes (a Huber loss was used).

- **Average reward for the last 100 episodes** A measure of the
  average reward the agent accumulated over the last 100 episodes.

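Condensed, the TensorBoard logging behind these metrics looks roughly
like the sketch below. The scalar tags mirror the ones written by the
`MetricLogger` class in `src/`; the log directory and the surrounding
bookkeeping are simplified placeholders.

```python
import numpy as np
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/starpilot-dqn")  # placeholder path
ep_rewards, ep_lengths, ep_losses, ep_qs = [], [], [], []  # appended once per episode

def record(episode, epsilon):
    # 100-episode moving averages, matching the metrics listed above.
    writer.add_scalar("Mean reward last 100 episodes", np.mean(ep_rewards[-100:]), episode)
    writer.add_scalar("Mean length last 100 episodes", np.mean(ep_lengths[-100:]), episode)
    writer.add_scalar("Mean loss last 100 episodes", np.mean(ep_losses[-100:]), episode)
    writer.add_scalar("Mean Q Value last 100 episodes", np.mean(ep_qs[-100:]), episode)
    writer.add_scalar("Epsilon value", epsilon, episode)
    writer.flush()
```
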
## Preprocessing

For the Airstriker and Starpilot games:

1. Changed each frame to grayscale
   Since color shouldn't matter to the agent, I decided to convert
   the RGB image to grayscale.

2. Changed the observation shape from (height, width, channels) to
   (channels, height, width) to make it compatible with PyTorch
   PyTorch expects channels-first tensors, which is different from
   the direct output of the gym environment. For this reason, I had
   to reshape each observation to match PyTorch's scheme (this took
   me a very long time to figure out, but I had an "Aha!" moment
   when I remembered you saying something similar in class).

3. Frame stacking
   Instead of processing 1 frame at a time, process 4 frames at a
   time, because a single frame does not carry enough information
   (e.g. no sense of velocity) for the agent to decide what action
   to take. (A sketch of these wrappers is shown after this
   section.)

For Lunar Lander, since the reward changes are very drastic (sudden
+100, -100, +200 rewards), I experimented with reward clipping
(clipping the rewards to the \[-1, 1\] range), but this didn't seem
to make much difference in my agent's performance.

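A minimal sketch of this preprocessing using gym 0.21-style wrappers.
It is illustrative only: the class names are mine, the grayscale
conversion is a simple channel average, and the project's actual
wrappers live in the scripts under `src/`.

```python
import gym
import numpy as np
from gym.wrappers import FrameStack

class GrayScaleChannelsFirst(gym.ObservationWrapper):
    """RGB (H, W, 3) uint8 frame -> grayscale, channels-first (1, H, W) float32."""
    def __init__(self, env):
        super().__init__(env)
        h, w, _ = env.observation_space.shape
        self.observation_space = gym.spaces.Box(
            low=0.0, high=1.0, shape=(1, h, w), dtype=np.float32)

    def observation(self, obs):
        gray = obs.astype(np.float32).mean(axis=2) / 255.0  # rough luminance
        return gray[None, :, :]                             # channel axis first

class ClipReward(gym.RewardWrapper):
    """Clip rewards to [-1, 1] (the reward clipping tried for Lunar Lander)."""
    def reward(self, reward):
        return float(np.clip(reward, -1.0, 1.0))

env = gym.make("procgen:procgen-starpilot-v0")
env = GrayScaleChannelsFirst(env)
env = FrameStack(env, num_stack=4)  # observations become (4, 1, H, W) LazyFrames
```
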
# Results

- **Airstriker Genesis**
  The loss went down until about 5200 episodes, but after that it
  stopped going down any further. Consequently, the average reward
  the agent accumulated over the last 100 episodes pretty much
  plateaued after about 5000 episodes. On analysis, I noticed that my
  exploration rate at the end of the 7000th episode was still about
  0.65, which means that the agent was taking random actions more
  than half of the time. In hindsight, I feel I should have trained
  for much longer, at least until the epsilon value (exploration
  rate) had completely decayed to 5% (see the decay calculation after
  this section).
  ![image](./docs/air1.png) ![image](./docs/air2.png) ![image](./docs/air3.png)


- **Starpilot**

  I trained DQN, Double DQN, Dueling DQN and Dueling Double DQN
  agents on this game to compare the different algorithms.
  From the graph of mean q-values, we can tell that the vanilla DQN
  versions indeed give higher q-values, while their Double DQN
  counterparts give lower values, which makes me think that my
  implementation of the Double DQN algorithm was OK. I had expected
  the Double and Dueling versions to start accumulating higher
  rewards much earlier, but since the average reward was almost the
  same for all the agents, I could not notice any stark differences
  between the performance of the agents.

  ![image](./docs/star1.png)

  ![image](./docs/star2.png)

  | | |
  | :------------------ | :------------------ |
  | ![image](./docs/star3.png) | ![image](./docs/star4.png) |


- **Lunar Lander**

  Since I did not gain much insight from the agents in the Starpilot
  game, I suspected that I was not training long enough. So I tried
  training the same agents on Lunar Lander, which is a comparatively
  simpler game with a much smaller observation space, and one that a
  DQN algorithm should be able to converge on fairly quickly (based
  on comments by other people in the RL community).
  ![image](./docs/lunar1.png)

  ![image](./docs/lunar2.png)

  | | |
  | :------------------- | :------------------- |
  | ![image](./docs/lunar3.png) | ![image](./docs/lunar4.png) |



  The results here were interesting. Although I did not find any vast
  difference between the different variations of the DQN algorithm, I
  found that the performance of my agents suddenly got worse at
  around 300 episodes. While researching why this may have happened,
  I learned that DQN agents can suffer from **catastrophic
  forgetting**, i.e. after training extensively, the network suddenly
  forgets what it has learned in the past and starts performing
  worse. Initially I thought this might have been the case here, but
  since I hadn't trained for very long, and because all the models
  started performing worse at almost exactly the same episode number,
  I think this is more likely a problem with my code or with some
  hyperparameter that I used.

  Upon checking what the agent was doing in the actual game, I found
  that it was playing it very safe, just constantly hovering in the
  air and never attempting to land the spaceship (the goal of the
  agent is to land within the yellow flags). I thought that
  penalizing the reward for taking too many steps in an episode might
  help, but that didn't work either (a sketch of the kind of step
  penalty I tried is shown after this section).

  ![image](./docs/check.png)

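Two small sketches related to the observations above. The decay
calculation uses the per-step epsilon decay value from the agents in
this repository (0.9999999, multiplicative per step); the step-penalty
wrapper is only an illustration of the kind of shaping I tried, and
the penalty value is a made-up placeholder.

```python
import math
import gym

def steps_to_reach(eps_target, eps_start=1.0, decay=0.9999999):
    """Number of steps for a multiplicative per-step decay to reach eps_target."""
    return math.log(eps_target / eps_start) / math.log(decay)

print(f"steps until epsilon = 0.65: {steps_to_reach(0.65):,.0f}")  # ~4.3 million
print(f"steps until epsilon = 0.05: {steps_to_reach(0.05):,.0f}")  # ~30 million

class StepPenalty(gym.RewardWrapper):
    """Subtract a small constant every step so that endless hovering is discouraged."""
    def __init__(self, env, penalty=0.01):  # penalty value is illustrative
        super().__init__(env)
        self.penalty = penalty

    def reward(self, reward):
        return reward - self.penalty
```
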
# Problems Faced

Here are a few of the problems that I faced while training my agents:

- Understanding the various hyperparameters of the algorithm. DQN has
  a lot of moving parts, so tuning each parameter was a difficult
  task. There were about 8 different hyperparameters (some of them
  correlated) that impacted the agent's training performance. I
  struggled to understand how each parameter affected the agent and
  to figure out good values for them, and I ended up tuning them by
  trial and error (the main knobs are listed in the sketch after this
  list).

- I got stuck for a long time figuring out why my convolutional layer
  was not working. I didn't realize that PyTorch puts the channels in
  the first dimension, and because of that I was passing huge numbers
  like 255 (the height of the image) into the input-channel dimension
  of a Conv2d layer.

- I struggled with knowing how long is long enough to conclude that a
  model is not working. I trained a model on Airstriker Genesis for
  14 hours only to realize later that I had set a parameter
  incorrectly and had to retrain from scratch.

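For reference, these are the main knobs exposed by the agents in this
repository, with the default values from `DQNAgent` in
`src/airstriker-genesis/agent.py`. The dictionary below is just an
illustrative way of grouping them, not code from the project.

```python
# Defaults taken from DQNAgent.__init__ in src/airstriker-genesis/agent.py.
hyperparameters = {
    "learning_rate": 0.00025,              # AdamW step size
    "gamma": 0.9,                          # discount factor for future rewards
    "batch_size": 32,                      # experiences sampled per update
    "max_memory_size": 100_000,            # replay buffer capacity
    "exploration_rate": 1.0,               # initial epsilon
    "exploration_rate_decay": 0.9999999,   # multiplicative per-step epsilon decay
    "exploration_rate_min": 0.1,           # floor for epsilon
    "learning_starts": 1000,               # steps collected before training begins
    "training_frequency": 1,               # env steps between gradient updates
    "target_network_sync_frequency": 500,  # steps between target-network syncs
}
```
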
# What Next?

Although I didn't get a final working agent for any of the games I
tried, I feel like I have learned a lot about reinforcement learning,
especially about deep Q-learning. I plan to improve upon this further
and hopefully get an agent to go far into at least one of the games.
Next time, I will start by debugging my current code to see if I have
any implementation mistakes. Then I will train the agents a lot
longer than I did this time and see if that works. While learning
about the different flavors of DQN, I also learned a little about
NoisyNet DQN, Rainbow DQN and Prioritized Experience Replay. I
couldn't implement these for this project, but I would like to try
them out some time soon.

# Lessons Learned

- Reinforcement learning is a very challenging problem. It takes a
  substantial amount of time to train, it is hard to debug, and it is
  very difficult to tune the hyperparameters just right. It also
  differs from supervised learning in that there are no actual
  labels, which makes optimization very difficult.

- I tried training agents on the retro Airstriker Genesis and the
  procgen Starpilot games using just the CPU, but this took a very
  long time. This is understandable because the inputs are images, so
  using a GPU would obviously have been better. Next time, I will
  definitely try using a GPU to make training faster.

- When faced with the problem of my agent not learning, I went into
  research mode and got to learn a lot about DQN and its improved
  versions. I am not a master of these algorithms yet (I have yet to
  get an agent to perform well in a game), but I feel like I
  understand how each version works.

- Rather than just following someone's tutorial, also reading the
  actual paper for a particular algorithm helped me understand the
  algorithm better and implement it.

- Doing this project reinforced for me that I love the concept of
  reinforcement learning. It has made me even more interested in
  exploring the field further and learning more.

# References / Resources

- [Reinforcement Learning (DQN) Tutorial, Adam
  Paszke](https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html)

- [Train a Mario-playing RL agent, Yuansong Feng, Suraj Subramanian,
  Howard Wang, Steven
  Guo](https://pytorch.org/tutorials/intermediate/mario_rl_tutorial.html)

- [About Double DQN and Dueling
  DQN](https://horomary.hatenablog.com/entry/2021/02/06/013412)

- [Dueling Network Architectures for Deep Reinforcement Learning
  (Wang et al., 2015)](https://arxiv.org/abs/1511.06581)


*(The final source code for the project can be found
[here](https://github.com/00ber/ml-reinforcement-learning).)*
environment.atari.yml ADDED
@@ -0,0 +1,153 @@
1
+ name: mlrl
2
+ channels:
3
+ - pytorch
4
+ - defaults
5
+ dependencies:
6
+ - absl-py=1.3.0=py37hecd8cb5_0
7
+ - aiohttp=3.8.3=py37h6c40b1e_0
8
+ - aiosignal=1.2.0=pyhd3eb1b0_0
9
+ - appnope=0.1.2=py37hecd8cb5_1001
10
+ - async-timeout=4.0.2=py37hecd8cb5_0
11
+ - asynctest=0.13.0=py_0
12
+ - attrs=22.1.0=py37hecd8cb5_0
13
+ - backcall=0.2.0=pyhd3eb1b0_0
14
+ - blas=1.0=mkl
15
+ - blinker=1.4=py37hecd8cb5_0
16
+ - brotli=1.0.9=hca72f7f_7
17
+ - brotli-bin=1.0.9=hca72f7f_7
18
+ - brotlipy=0.7.0=py37h9ed2024_1003
19
+ - bzip2=1.0.8=h1de35cc_0
20
+ - c-ares=1.18.1=hca72f7f_0
21
+ - ca-certificates=2022.10.11=hecd8cb5_0
22
+ - cachetools=4.2.2=pyhd3eb1b0_0
23
+ - cairo=1.14.12=hc4e6be7_4
24
+ - certifi=2022.9.24=py37hecd8cb5_0
25
+ - cffi=1.15.0=py37hca72f7f_0
26
+ - charset-normalizer=2.0.4=pyhd3eb1b0_0
27
+ - click=8.0.4=py37hecd8cb5_0
28
+ - cryptography=38.0.1=py37hf6deb26_0
29
+ - cycler=0.11.0=pyhd3eb1b0_0
30
+ - dataclasses=0.8=pyh6d0b6a4_7
31
+ - decorator=5.1.1=pyhd3eb1b0_0
32
+ - expat=2.4.9=he9d5cce_0
33
+ - ffmpeg=4.0=h01ea3c9_0
34
+ - flit-core=3.6.0=pyhd3eb1b0_0
35
+ - fontconfig=2.14.1=hedf32ac_1
36
+ - fonttools=4.25.0=pyhd3eb1b0_0
37
+ - freetype=2.12.1=hd8bbffd_0
38
+ - frozenlist=1.3.3=py37h6c40b1e_0
39
+ - gettext=0.21.0=h7535e17_0
40
+ - giflib=5.2.1=haf1e3a3_0
41
+ - glib=2.63.1=hd977a24_0
42
+ - google-auth=2.6.0=pyhd3eb1b0_0
43
+ - google-auth-oauthlib=0.4.4=pyhd3eb1b0_0
44
+ - graphite2=1.3.14=he9d5cce_1
45
+ - grpcio=1.42.0=py37ha29bfda_0
46
+ - harfbuzz=1.8.8=hb8d4a28_0
47
+ - hdf5=1.10.2=hfa1e0ec_1
48
+ - icu=58.2=h0a44026_3
49
+ - idna=3.4=py37hecd8cb5_0
50
+ - intel-openmp=2021.4.0=hecd8cb5_3538
51
+ - ipython=7.31.1=py37hecd8cb5_1
52
+ - jasper=2.0.14=h0129ec2_2
53
+ - jedi=0.18.1=py37hecd8cb5_1
54
+ - jpeg=9e=hca72f7f_0
55
+ - kiwisolver=1.4.2=py37he9d5cce_0
56
+ - lcms2=2.12=hf1fd2bf_0
57
+ - lerc=3.0=he9d5cce_0
58
+ - libbrotlicommon=1.0.9=hca72f7f_7
59
+ - libbrotlidec=1.0.9=hca72f7f_7
60
+ - libbrotlienc=1.0.9=hca72f7f_7
61
+ - libcxx=14.0.6=h9765a3e_0
62
+ - libdeflate=1.8=h9ed2024_5
63
+ - libedit=3.1.20221030=h6c40b1e_0
64
+ - libffi=3.2.1=h0a44026_1007
65
+ - libgfortran=3.0.1=h93005f0_2
66
+ - libiconv=1.16=hca72f7f_2
67
+ - libopencv=3.4.2=h7c891bd_1
68
+ - libopus=1.3.1=h1de35cc_0
69
+ - libpng=1.6.37=ha441bb4_0
70
+ - libprotobuf=3.20.1=h8346a28_0
71
+ - libtiff=4.4.0=h2cd0358_2
72
+ - libvpx=1.7.0=h378b8a2_0
73
+ - libwebp=1.2.4=h56c3ce4_0
74
+ - libwebp-base=1.2.4=hca72f7f_0
75
+ - libxml2=2.9.14=hbf8cd5e_0
76
+ - llvm-openmp=14.0.6=h0dcd299_0
77
+ - lz4-c=1.9.4=hcec6c5f_0
78
+ - markdown=3.3.4=py37hecd8cb5_0
79
+ - matplotlib=3.1.2=py37h9aa3819_0
80
+ - matplotlib-inline=0.1.6=py37hecd8cb5_0
81
+ - mkl=2021.4.0=hecd8cb5_637
82
+ - mkl-service=2.4.0=py37h9ed2024_0
83
+ - mkl_fft=1.3.1=py37h4ab4a9b_0
84
+ - mkl_random=1.2.2=py37hb2f4e1b_0
85
+ - multidict=6.0.2=py37hca72f7f_0
86
+ - munkres=1.1.4=py_0
87
+ - ncurses=6.3=hca72f7f_3
88
+ - numpy=1.21.5=py37h2e5f0a9_3
89
+ - numpy-base=1.21.5=py37h3b1a694_3
90
+ - oauthlib=3.2.1=py37hecd8cb5_0
91
+ - olefile=0.46=py37_0
92
+ - opencv=3.4.2=py37h6fd60c2_1
93
+ - openssl=1.1.1s=hca72f7f_0
94
+ - packaging=21.3=pyhd3eb1b0_0
95
+ - parso=0.8.3=pyhd3eb1b0_0
96
+ - pcre=8.45=h23ab428_0
97
+ - pexpect=4.8.0=pyhd3eb1b0_3
98
+ - pickleshare=0.7.5=pyhd3eb1b0_1003
99
+ - pillow=6.1.0=py37hb68e598_0
100
+ - pip=22.3.1=py37hecd8cb5_0
101
+ - pixman=0.40.0=h9ed2024_1
102
+ - prompt-toolkit=3.0.20=pyhd3eb1b0_0
103
+ - protobuf=3.20.1=py37he9d5cce_0
104
+ - ptyprocess=0.7.0=pyhd3eb1b0_2
105
+ - py-opencv=3.4.2=py37h7c891bd_1
106
+ - pyasn1=0.4.8=pyhd3eb1b0_0
107
+ - pyasn1-modules=0.2.8=py_0
108
+ - pycparser=2.21=pyhd3eb1b0_0
109
+ - pygments=2.11.2=pyhd3eb1b0_0
110
+ - pyjwt=2.4.0=py37hecd8cb5_0
111
+ - pyopenssl=22.0.0=pyhd3eb1b0_0
112
+ - pyparsing=3.0.9=py37hecd8cb5_0
113
+ - pysocks=1.7.1=py37hecd8cb5_0
114
+ - python=3.7.3=h359304d_0
115
+ - python-dateutil=2.8.2=pyhd3eb1b0_0
116
+ - pytorch=1.13.1=py3.7_0
117
+ - readline=7.0=h1de35cc_5
118
+ - requests=2.28.1=py37hecd8cb5_0
119
+ - requests-oauthlib=1.3.0=py_0
120
+ - rsa=4.7.2=pyhd3eb1b0_1
121
+ - setuptools=65.5.0=py37hecd8cb5_0
122
+ - six=1.16.0=pyhd3eb1b0_1
123
+ - sqlite=3.33.0=hffcf06c_0
124
+ - tensorboard=2.9.0=py37hecd8cb5_0
125
+ - tensorboard-data-server=0.6.1=py37h7242b5c_0
126
+ - tensorboard-plugin-wit=1.6.0=py_0
127
+ - tk=8.6.12=h5d9f67b_0
128
+ - torchvision=0.2.2=py_3
129
+ - tornado=6.2=py37hca72f7f_0
130
+ - tqdm=4.64.1=py37hecd8cb5_0
131
+ - traitlets=5.7.1=py37hecd8cb5_0
132
+ - typing-extensions=4.4.0=py37hecd8cb5_0
133
+ - typing_extensions=4.4.0=py37hecd8cb5_0
134
+ - urllib3=1.26.13=py37hecd8cb5_0
135
+ - wcwidth=0.2.5=pyhd3eb1b0_0
136
+ - werkzeug=2.0.3=pyhd3eb1b0_0
137
+ - wheel=0.37.1=pyhd3eb1b0_0
138
+ - xz=5.2.8=h6c40b1e_0
139
+ - yarl=1.8.1=py37hca72f7f_0
140
+ - zlib=1.2.13=h4dc903c_0
141
+ - zstd=1.5.2=hcb37349_0
142
+ - pip:
143
+ - ale-py==0.7.5
144
+ - cloudpickle==2.2.0
145
+ - gym==0.21.0
146
+ - gym-notices==0.0.8
147
+ - gym-retro==0.8.0
148
+ - importlib-metadata==4.13.0
149
+ - importlib-resources==5.10.1
150
+ - pygame==2.1.0
151
+ - pyglet==1.5.27
152
+ - zipp==3.11.0
153
+ prefix: /Users/karkisushant/miniconda3/envs/mlrl
environment.procgen-v2.yml ADDED
@@ -0,0 +1,135 @@
1
+ name: procgen
2
+ channels:
3
+ - pytorch
4
+ - defaults
5
+ dependencies:
6
+ - absl-py=1.3.0=py39hecd8cb5_0
7
+ - aiohttp=3.8.3=py39h6c40b1e_0
8
+ - aiosignal=1.2.0=pyhd3eb1b0_0
9
+ - async-timeout=4.0.2=py39hecd8cb5_0
10
+ - attrs=22.1.0=py39hecd8cb5_0
11
+ - blas=1.0=mkl
12
+ - blinker=1.4=py39hecd8cb5_0
13
+ - brotli=1.0.9=hca72f7f_7
14
+ - brotli-bin=1.0.9=hca72f7f_7
15
+ - brotlipy=0.7.0=py39h9ed2024_1003
16
+ - bzip2=1.0.8=h1de35cc_0
17
+ - c-ares=1.18.1=hca72f7f_0
18
+ - ca-certificates=2022.10.11=hecd8cb5_0
19
+ - cachetools=4.2.2=pyhd3eb1b0_0
20
+ - certifi=2022.9.24=py39hecd8cb5_0
21
+ - cffi=1.15.1=py39h6c40b1e_3
22
+ - charset-normalizer=2.0.4=pyhd3eb1b0_0
23
+ - click=8.0.4=py39hecd8cb5_0
24
+ - contourpy=1.0.5=py39haf03e11_0
25
+ - cryptography=38.0.1=py39hf6deb26_0
26
+ - cycler=0.11.0=pyhd3eb1b0_0
27
+ - ffmpeg=4.3=h0a44026_0
28
+ - flit-core=3.6.0=pyhd3eb1b0_0
29
+ - fonttools=4.25.0=pyhd3eb1b0_0
30
+ - freetype=2.12.1=hd8bbffd_0
31
+ - frozenlist=1.3.3=py39h6c40b1e_0
32
+ - gettext=0.21.0=h7535e17_0
33
+ - giflib=5.2.1=haf1e3a3_0
34
+ - gmp=6.2.1=he9d5cce_3
35
+ - gnutls=3.6.15=hed9c0bf_0
36
+ - google-auth=2.6.0=pyhd3eb1b0_0
37
+ - google-auth-oauthlib=0.4.4=pyhd3eb1b0_0
38
+ - grpcio=1.42.0=py39ha29bfda_0
39
+ - icu=58.2=h0a44026_3
40
+ - idna=3.4=py39hecd8cb5_0
41
+ - importlib-metadata=4.11.3=py39hecd8cb5_0
42
+ - intel-openmp=2021.4.0=hecd8cb5_3538
43
+ - jpeg=9e=hca72f7f_0
44
+ - kiwisolver=1.4.2=py39he9d5cce_0
45
+ - lame=3.100=h1de35cc_0
46
+ - lcms2=2.12=hf1fd2bf_0
47
+ - lerc=3.0=he9d5cce_0
48
+ - libbrotlicommon=1.0.9=hca72f7f_7
49
+ - libbrotlidec=1.0.9=hca72f7f_7
50
+ - libbrotlienc=1.0.9=hca72f7f_7
51
+ - libcxx=14.0.6=h9765a3e_0
52
+ - libdeflate=1.8=h9ed2024_5
53
+ - libffi=3.4.2=hecd8cb5_6
54
+ - libiconv=1.16=hca72f7f_2
55
+ - libidn2=2.3.2=h9ed2024_0
56
+ - libpng=1.6.37=ha441bb4_0
57
+ - libprotobuf=3.20.1=h8346a28_0
58
+ - libtasn1=4.16.0=h9ed2024_0
59
+ - libtiff=4.4.0=h2cd0358_2
60
+ - libunistring=0.9.10=h9ed2024_0
61
+ - libwebp=1.2.4=h56c3ce4_0
62
+ - libwebp-base=1.2.4=hca72f7f_0
63
+ - libxml2=2.9.14=hbf8cd5e_0
64
+ - llvm-openmp=14.0.6=h0dcd299_0
65
+ - lz4-c=1.9.4=hcec6c5f_0
66
+ - markdown=3.3.4=py39hecd8cb5_0
67
+ - markupsafe=2.1.1=py39hca72f7f_0
68
+ - matplotlib=3.6.2=py39hecd8cb5_0
69
+ - matplotlib-base=3.6.2=py39h220de94_0
70
+ - mkl=2021.4.0=hecd8cb5_637
71
+ - mkl-service=2.4.0=py39h9ed2024_0
72
+ - mkl_fft=1.3.1=py39h4ab4a9b_0
73
+ - mkl_random=1.2.2=py39hb2f4e1b_0
74
+ - multidict=6.0.2=py39hca72f7f_0
75
+ - munkres=1.1.4=py_0
76
+ - ncurses=6.3=hca72f7f_3
77
+ - nettle=3.7.3=h230ac6f_1
78
+ - numpy=1.23.4=py39he696674_0
79
+ - numpy-base=1.23.4=py39h9cd3388_0
80
+ - oauthlib=3.2.1=py39hecd8cb5_0
81
+ - openh264=2.1.1=h8346a28_0
82
+ - openssl=1.1.1s=hca72f7f_0
83
+ - packaging=21.3=pyhd3eb1b0_0
84
+ - pillow=9.2.0=py39hde71d04_1
85
+ - pip=22.3.1=py39hecd8cb5_0
86
+ - protobuf=3.20.1=py39he9d5cce_0
87
+ - pyasn1=0.4.8=pyhd3eb1b0_0
88
+ - pyasn1-modules=0.2.8=py_0
89
+ - pycparser=2.21=pyhd3eb1b0_0
90
+ - pyjwt=2.4.0=py39hecd8cb5_0
91
+ - pyopenssl=22.0.0=pyhd3eb1b0_0
92
+ - pyparsing=3.0.9=py39hecd8cb5_0
93
+ - pysocks=1.7.1=py39hecd8cb5_0
94
+ - python=3.9.15=h218abb5_2
95
+ - python-dateutil=2.8.2=pyhd3eb1b0_0
96
+ - pytorch=1.13.1=py3.9_0
97
+ - readline=8.2=hca72f7f_0
98
+ - requests=2.28.1=py39hecd8cb5_0
99
+ - requests-oauthlib=1.3.0=py_0
100
+ - rsa=4.7.2=pyhd3eb1b0_1
101
+ - setuptools=65.5.0=py39hecd8cb5_0
102
+ - six=1.16.0=pyhd3eb1b0_1
103
+ - sqlite=3.40.0=h880c91c_0
104
+ - tensorboard=2.9.0=py39hecd8cb5_0
105
+ - tensorboard-data-server=0.6.1=py39h7242b5c_0
106
+ - tensorboard-plugin-wit=1.6.0=py_0
107
+ - tk=8.6.12=h5d9f67b_0
108
+ - torchvision=0.14.1=py39_cpu
109
+ - tornado=6.2=py39hca72f7f_0
110
+ - tqdm=4.64.1=py39hecd8cb5_0
111
+ - typing_extensions=4.4.0=py39hecd8cb5_0
112
+ - tzdata=2022g=h04d1e81_0
113
+ - urllib3=1.26.13=py39hecd8cb5_0
114
+ - werkzeug=2.2.2=py39hecd8cb5_0
115
+ - wheel=0.37.1=pyhd3eb1b0_0
116
+ - xz=5.2.8=h6c40b1e_0
117
+ - yarl=1.8.1=py39hca72f7f_0
118
+ - zipp=3.8.0=py39hecd8cb5_0
119
+ - zlib=1.2.13=h4dc903c_0
120
+ - zstd=1.5.2=hcb37349_0
121
+ - pip:
122
+ - cloudpickle==2.2.0
123
+ - filelock==3.8.2
124
+ - glcontext==2.3.7
125
+ - glfw==1.12.0
126
+ - gym==0.21.0
127
+ - gym-notices==0.0.8
128
+ - gym3==0.3.3
129
+ - imageio==2.22.4
130
+ - imageio-ffmpeg==0.3.0
131
+ - moderngl==5.7.4
132
+ - opencv-python==4.6.0.66
133
+ - procgen==0.10.7
134
+ - pyglet==1.5.27
135
+ prefix: /Users/karkisushant/miniconda3/envs/v2
environment.procgen.yml ADDED
@@ -0,0 +1,135 @@
1
+ name: procgen
2
+ channels:
3
+ - pytorch
4
+ - defaults
5
+ dependencies:
6
+ - absl-py=1.3.0=py39hecd8cb5_0
7
+ - aiohttp=3.8.3=py39h6c40b1e_0
8
+ - aiosignal=1.2.0=pyhd3eb1b0_0
9
+ - async-timeout=4.0.2=py39hecd8cb5_0
10
+ - attrs=22.1.0=py39hecd8cb5_0
11
+ - blas=1.0=mkl
12
+ - blinker=1.4=py39hecd8cb5_0
13
+ - brotli=1.0.9=hca72f7f_7
14
+ - brotli-bin=1.0.9=hca72f7f_7
15
+ - brotlipy=0.7.0=py39h9ed2024_1003
16
+ - bzip2=1.0.8=h1de35cc_0
17
+ - c-ares=1.18.1=hca72f7f_0
18
+ - ca-certificates=2022.10.11=hecd8cb5_0
19
+ - cachetools=4.2.2=pyhd3eb1b0_0
20
+ - certifi=2022.9.24=py39hecd8cb5_0
21
+ - cffi=1.15.1=py39h6c40b1e_3
22
+ - charset-normalizer=2.0.4=pyhd3eb1b0_0
23
+ - click=8.0.4=py39hecd8cb5_0
24
+ - contourpy=1.0.5=py39haf03e11_0
25
+ - cryptography=38.0.1=py39hf6deb26_0
26
+ - cycler=0.11.0=pyhd3eb1b0_0
27
+ - ffmpeg=4.3=h0a44026_0
28
+ - flit-core=3.6.0=pyhd3eb1b0_0
29
+ - fonttools=4.25.0=pyhd3eb1b0_0
30
+ - freetype=2.12.1=hd8bbffd_0
31
+ - frozenlist=1.3.3=py39h6c40b1e_0
32
+ - gettext=0.21.0=h7535e17_0
33
+ - giflib=5.2.1=haf1e3a3_0
34
+ - gmp=6.2.1=he9d5cce_3
35
+ - gnutls=3.6.15=hed9c0bf_0
36
+ - google-auth=2.6.0=pyhd3eb1b0_0
37
+ - google-auth-oauthlib=0.4.4=pyhd3eb1b0_0
38
+ - grpcio=1.42.0=py39ha29bfda_0
39
+ - icu=58.2=h0a44026_3
40
+ - idna=3.4=py39hecd8cb5_0
41
+ - importlib-metadata=4.11.3=py39hecd8cb5_0
42
+ - intel-openmp=2021.4.0=hecd8cb5_3538
43
+ - jpeg=9e=hca72f7f_0
44
+ - kiwisolver=1.4.2=py39he9d5cce_0
45
+ - lame=3.100=h1de35cc_0
46
+ - lcms2=2.12=hf1fd2bf_0
47
+ - lerc=3.0=he9d5cce_0
48
+ - libbrotlicommon=1.0.9=hca72f7f_7
49
+ - libbrotlidec=1.0.9=hca72f7f_7
50
+ - libbrotlienc=1.0.9=hca72f7f_7
51
+ - libcxx=14.0.6=h9765a3e_0
52
+ - libdeflate=1.8=h9ed2024_5
53
+ - libffi=3.4.2=hecd8cb5_6
54
+ - libiconv=1.16=hca72f7f_2
55
+ - libidn2=2.3.2=h9ed2024_0
56
+ - libpng=1.6.37=ha441bb4_0
57
+ - libprotobuf=3.20.1=h8346a28_0
58
+ - libtasn1=4.16.0=h9ed2024_0
59
+ - libtiff=4.4.0=h2cd0358_2
60
+ - libunistring=0.9.10=h9ed2024_0
61
+ - libwebp=1.2.4=h56c3ce4_0
62
+ - libwebp-base=1.2.4=hca72f7f_0
63
+ - libxml2=2.9.14=hbf8cd5e_0
64
+ - llvm-openmp=14.0.6=h0dcd299_0
65
+ - lz4-c=1.9.4=hcec6c5f_0
66
+ - markdown=3.3.4=py39hecd8cb5_0
67
+ - markupsafe=2.1.1=py39hca72f7f_0
68
+ - matplotlib=3.6.2=py39hecd8cb5_0
69
+ - matplotlib-base=3.6.2=py39h220de94_0
70
+ - mkl=2021.4.0=hecd8cb5_637
71
+ - mkl-service=2.4.0=py39h9ed2024_0
72
+ - mkl_fft=1.3.1=py39h4ab4a9b_0
73
+ - mkl_random=1.2.2=py39hb2f4e1b_0
74
+ - multidict=6.0.2=py39hca72f7f_0
75
+ - munkres=1.1.4=py_0
76
+ - ncurses=6.3=hca72f7f_3
77
+ - nettle=3.7.3=h230ac6f_1
78
+ - numpy=1.23.4=py39he696674_0
79
+ - numpy-base=1.23.4=py39h9cd3388_0
80
+ - oauthlib=3.2.1=py39hecd8cb5_0
81
+ - openh264=2.1.1=h8346a28_0
82
+ - openssl=1.1.1s=hca72f7f_0
83
+ - packaging=21.3=pyhd3eb1b0_0
84
+ - pillow=9.2.0=py39hde71d04_1
85
+ - pip=22.3.1=py39hecd8cb5_0
86
+ - protobuf=3.20.1=py39he9d5cce_0
87
+ - pyasn1=0.4.8=pyhd3eb1b0_0
88
+ - pyasn1-modules=0.2.8=py_0
89
+ - pycparser=2.21=pyhd3eb1b0_0
90
+ - pyjwt=2.4.0=py39hecd8cb5_0
91
+ - pyopenssl=22.0.0=pyhd3eb1b0_0
92
+ - pyparsing=3.0.9=py39hecd8cb5_0
93
+ - pysocks=1.7.1=py39hecd8cb5_0
94
+ - python=3.9.15=h218abb5_2
95
+ - python-dateutil=2.8.2=pyhd3eb1b0_0
96
+ - pytorch=1.13.1=py3.9_0
97
+ - readline=8.2=hca72f7f_0
98
+ - requests=2.28.1=py39hecd8cb5_0
99
+ - requests-oauthlib=1.3.0=py_0
100
+ - rsa=4.7.2=pyhd3eb1b0_1
101
+ - setuptools=65.5.0=py39hecd8cb5_0
102
+ - six=1.16.0=pyhd3eb1b0_1
103
+ - sqlite=3.40.0=h880c91c_0
104
+ - tensorboard=2.9.0=py39hecd8cb5_0
105
+ - tensorboard-data-server=0.6.1=py39h7242b5c_0
106
+ - tensorboard-plugin-wit=1.6.0=py_0
107
+ - tk=8.6.12=h5d9f67b_0
108
+ - torchvision=0.14.1=py39_cpu
109
+ - tornado=6.2=py39hca72f7f_0
110
+ - tqdm=4.64.1=py39hecd8cb5_0
111
+ - typing_extensions=4.4.0=py39hecd8cb5_0
112
+ - tzdata=2022g=h04d1e81_0
113
+ - urllib3=1.26.13=py39hecd8cb5_0
114
+ - werkzeug=2.2.2=py39hecd8cb5_0
115
+ - wheel=0.37.1=pyhd3eb1b0_0
116
+ - xz=5.2.8=h6c40b1e_0
117
+ - yarl=1.8.1=py39hca72f7f_0
118
+ - zipp=3.8.0=py39hecd8cb5_0
119
+ - zlib=1.2.13=h4dc903c_0
120
+ - zstd=1.5.2=hcb37349_0
121
+ - pip:
122
+ - cloudpickle==2.2.0
123
+ - filelock==3.8.2
124
+ - glcontext==2.3.7
125
+ - glfw==1.12.0
126
+ - gym==0.21.0
127
+ - gym-notices==0.0.8
128
+ - gym3==0.3.3
129
+ - imageio==2.22.4
130
+ - imageio-ffmpeg==0.3.0
131
+ - moderngl==5.7.4
132
+ - opencv-python==4.6.0.66
133
+ - procgen==0.10.7
134
+ - pyglet==1.5.27
135
+ prefix: /Users/karkisushant/miniconda3/envs/procgen
requirements-v1.txt ADDED
@@ -0,0 +1,76 @@
1
+ absl-py==1.3.0
2
+ ale-py==0.7.5
3
+ astunparse==1.6.3
4
+ attrs==22.1.0
5
+ box2d-py==2.3.5
6
+ cachetools==5.2.0
7
+ certifi==2022.12.7
8
+ cffi==1.15.1
9
+ charset-normalizer==2.1.1
10
+ cloudpickle==2.2.0
11
+ cycler==0.11.0
12
+ Cython==0.29.32
13
+ fasteners==0.18
14
+ flatbuffers==22.12.6
15
+ fonttools==4.38.0
16
+ future==0.18.2
17
+ gast==0.4.0
18
+ glfw==2.5.5
19
+ google-auth==2.15.0
20
+ google-auth-oauthlib==0.4.6
21
+ google-pasta==0.2.0
22
+ grpcio==1.51.1
23
+ gym==0.21.0
24
+ gym-notices==0.0.8
25
+ gym-retro==0.8.0
26
+ h5py==3.7.0
27
+ idna==3.4
28
+ imageio==2.22.4
29
+ importlib-metadata==4.13.0
30
+ importlib-resources==5.10.1
31
+ iniconfig==1.1.1
32
+ keras==2.11.0
33
+ kiwisolver==1.4.4
34
+ libclang==14.0.6
35
+ lz4==4.0.2
36
+ Markdown==3.4.1
37
+ MarkupSafe==2.1.1
38
+ matplotlib==3.5.3
39
+ mujoco==2.2.0
40
+ mujoco-py==2.1.2.14
41
+ numpy==1.21.6
42
+ oauthlib==3.2.2
43
+ opencv-python==4.6.0.66
44
+ opt-einsum==3.3.0
45
+ packaging==22.0
46
+ Pillow==9.3.0
47
+ pluggy==1.0.0
48
+ protobuf==3.19.6
49
+ py==1.11.0
50
+ pyasn1==0.4.8
51
+ pyasn1-modules==0.2.8
52
+ pycparser==2.21
53
+ pygame==2.1.0
54
+ pyglet==1.5.11
55
+ PyOpenGL==3.1.6
56
+ pyparsing==3.0.9
57
+ pytest==7.0.1
58
+ python-dateutil==2.8.2
59
+ requests==2.28.1
60
+ requests-oauthlib==1.3.1
61
+ rsa==4.9
62
+ six==1.16.0
63
+ swig==4.1.1
64
+ tensorboard==2.11.0
65
+ tensorboard-data-server==0.6.1
66
+ tensorboard-plugin-wit==1.8.1
67
+ tensorflow==2.11.0
68
+ tensorflow-estimator==2.11.0
69
+ tensorflow-io-gcs-filesystem==0.28.0
70
+ termcolor==2.1.1
71
+ tomli==2.0.1
72
+ typing_extensions==4.4.0
73
+ urllib3==1.26.13
74
+ Werkzeug==2.2.2
75
+ wrapt==1.14.1
76
+ zipp==3.11.0
requirements.txt ADDED
@@ -0,0 +1,42 @@
1
+ absl-py==1.3.0
2
+ ale-py==0.7.5
3
+ attrs==22.1.0
4
+ box2d-py==2.3.5
5
+ cffi==1.15.1
6
+ cloudpickle==2.2.0
7
+ cycler==0.11.0
8
+ Cython==0.29.32
9
+ fasteners==0.18
10
+ fonttools==4.38.0
11
+ future==0.18.2
12
+ glfw==2.5.5
13
+ gym==0.21.0
14
+ gym-notices==0.0.8
15
+ gym-retro==0.8.0
16
+ imageio==2.22.4
17
+ importlib-metadata==4.13.0
18
+ importlib-resources==5.10.1
19
+ iniconfig==1.1.1
20
+ kiwisolver==1.4.4
21
+ lz4==4.0.2
22
+ matplotlib==3.5.3
23
+ mujoco==2.2.0
24
+ mujoco-py==2.1.2.14
25
+ numpy==1.18.0
26
+ opencv-python==4.6.0.66
27
+ packaging==22.0
28
+ Pillow==9.3.0
29
+ pluggy==1.0.0
30
+ py==1.11.0
31
+ pycparser==2.21
32
+ pygame==2.1.0
33
+ pyglet==1.5.11
34
+ PyOpenGL==3.1.6
35
+ pyparsing==3.0.9
36
+ pytest==7.0.1
37
+ python-dateutil==2.8.2
38
+ six==1.16.0
39
+ swig==4.1.1
40
+ tomli==2.0.1
41
+ typing_extensions==4.4.0
42
+ zipp==3.11.0
src/airstriker-genesis/__init__.py ADDED
File without changes
src/airstriker-genesis/agent.py ADDED
@@ -0,0 +1,400 @@
1
+ import torch
2
+ import numpy as np
3
+ import random
4
+ import torch.nn as nn
5
+ import copy
6
+ import time, datetime
7
+ import matplotlib.pyplot as plt
8
+ from collections import deque
9
+ from torch.utils.tensorboard import SummaryWriter
10
+ import pickle
11
+
12
+
13
+ class DQNet(nn.Module):
14
+ """mini cnn structure
15
+ input -> (conv2d + relu) x 3 -> flatten -> (dense + relu) x 2 -> output
16
+ """
17
+
18
+ def __init__(self, input_dim, output_dim):
19
+ super().__init__()
20
+ print("#################################")
21
+ print("#################################")
22
+ print(input_dim)
23
+ print(output_dim)
24
+ print("#################################")
25
+ print("#################################")
26
+ c, h, w = input_dim
27
+
28
+ # if h != 84:
29
+ # raise ValueError(f"Expecting input height: 84, got: {h}")
30
+ # if w != 84:
31
+ # raise ValueError(f"Expecting input width: 84, got: {w}")
32
+
33
+ self.online = nn.Sequential(
34
+ nn.Conv2d(in_channels=c, out_channels=32, kernel_size=8, stride=4),
35
+ nn.ReLU(),
36
+ nn.Conv2d(in_channels=32, out_channels=64, kernel_size=4, stride=2),
37
+ nn.ReLU(),
38
+ nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, stride=1),
39
+ nn.ReLU(),
40
+ nn.Flatten(),
41
+ nn.Linear(17024, 512),
42
+ nn.ReLU(),
43
+ nn.Linear(512, output_dim),
44
+ )
45
+
46
+
47
+ self.target = copy.deepcopy(self.online)
48
+
49
+ # Q_target parameters are frozen.
50
+ for p in self.target.parameters():
51
+ p.requires_grad = False
52
+
53
+ def forward(self, input, model):
54
+ if model == "online":
55
+ return self.online(input)
56
+ elif model == "target":
57
+ return self.target(input)
58
+
59
+
60
+
61
+ class MetricLogger:
62
+ def __init__(self, save_dir):
63
+ self.writer = SummaryWriter(log_dir=save_dir)
64
+ self.save_log = save_dir / "log"
65
+ with open(self.save_log, "w") as f:
66
+ f.write(
67
+ f"{'Episode':>8}{'Step':>8}{'Epsilon':>10}{'MeanReward':>15}"
68
+ f"{'MeanLength':>15}{'MeanLoss':>15}{'MeanQValue':>15}"
69
+ f"{'TimeDelta':>15}{'Time':>20}\n"
70
+ )
71
+ self.ep_rewards_plot = save_dir / "reward_plot.jpg"
72
+ self.ep_lengths_plot = save_dir / "length_plot.jpg"
73
+ self.ep_avg_losses_plot = save_dir / "loss_plot.jpg"
74
+ self.ep_avg_qs_plot = save_dir / "q_plot.jpg"
75
+
76
+ # History metrics
77
+ self.ep_rewards = []
78
+ self.ep_lengths = []
79
+ self.ep_avg_losses = []
80
+ self.ep_avg_qs = []
81
+
82
+ # Moving averages, added for every call to record()
83
+ self.moving_avg_ep_rewards = []
84
+ self.moving_avg_ep_lengths = []
85
+ self.moving_avg_ep_avg_losses = []
86
+ self.moving_avg_ep_avg_qs = []
87
+
88
+ # Current episode metric
89
+ self.init_episode()
90
+
91
+ # Timing
92
+ self.record_time = time.time()
93
+
94
+ def log_step(self, reward, loss, q):
95
+ self.curr_ep_reward += reward
96
+ self.curr_ep_length += 1
97
+ if loss:
98
+ self.curr_ep_loss += loss
99
+ self.curr_ep_q += q
100
+ self.curr_ep_loss_length += 1
101
+
102
+ def log_episode(self, episode_number):
103
+ "Mark end of episode"
104
+ self.ep_rewards.append(self.curr_ep_reward)
105
+ self.ep_lengths.append(self.curr_ep_length)
106
+ if self.curr_ep_loss_length == 0:
107
+ ep_avg_loss = 0
108
+ ep_avg_q = 0
109
+ else:
110
+ ep_avg_loss = np.round(self.curr_ep_loss / self.curr_ep_loss_length, 5)
111
+ ep_avg_q = np.round(self.curr_ep_q / self.curr_ep_loss_length, 5)
112
+ self.ep_avg_losses.append(ep_avg_loss)
113
+ self.ep_avg_qs.append(ep_avg_q)
114
+ self.writer.add_scalar("Avg Loss for episode", ep_avg_loss, episode_number)
115
+ self.writer.add_scalar("Avg Q value for episode", ep_avg_q, episode_number)
116
+ self.writer.flush()
117
+ self.init_episode()
118
+
119
+ def init_episode(self):
120
+ self.curr_ep_reward = 0.0
121
+ self.curr_ep_length = 0
122
+ self.curr_ep_loss = 0.0
123
+ self.curr_ep_q = 0.0
124
+ self.curr_ep_loss_length = 0
125
+
126
+ def record(self, episode, epsilon, step):
127
+ mean_ep_reward = np.round(np.mean(self.ep_rewards[-100:]), 3)
128
+ mean_ep_length = np.round(np.mean(self.ep_lengths[-100:]), 3)
129
+ mean_ep_loss = np.round(np.mean(self.ep_avg_losses[-100:]), 3)
130
+ mean_ep_q = np.round(np.mean(self.ep_avg_qs[-100:]), 3)
131
+ self.moving_avg_ep_rewards.append(mean_ep_reward)
132
+ self.moving_avg_ep_lengths.append(mean_ep_length)
133
+ self.moving_avg_ep_avg_losses.append(mean_ep_loss)
134
+ self.moving_avg_ep_avg_qs.append(mean_ep_q)
135
+
136
+ last_record_time = self.record_time
137
+ self.record_time = time.time()
138
+ time_since_last_record = np.round(self.record_time - last_record_time, 3)
139
+
140
+ print(
141
+ f"Episode {episode} - "
142
+ f"Step {step} - "
143
+ f"Epsilon {epsilon} - "
144
+ f"Mean Reward {mean_ep_reward} - "
145
+ f"Mean Length {mean_ep_length} - "
146
+ f"Mean Loss {mean_ep_loss} - "
147
+ f"Mean Q Value {mean_ep_q} - "
148
+ f"Time Delta {time_since_last_record} - "
149
+ f"Time {datetime.datetime.now().strftime('%Y-%m-%dT%H:%M:%S')}"
150
+ )
151
+ self.writer.add_scalar("Mean reward last 100 episodes", mean_ep_reward, episode)
152
+ self.writer.add_scalar("Mean length last 100 episodes", mean_ep_length, episode)
153
+ self.writer.add_scalar("Mean loss last 100 episodes", mean_ep_loss, episode)
154
+ self.writer.add_scalar("Mean reward last 100 episodes", mean_ep_reward, episode)
155
+ self.writer.add_scalar("Epsilon value", epsilon, episode)
156
+ self.writer.add_scalar("Mean Q Value last 100 episodes", mean_ep_q, episode)
157
+ self.writer.flush()
158
+ with open(self.save_log, "a") as f:
159
+ f.write(
160
+ f"{episode:8d}{step:8d}{epsilon:10.3f}"
161
+ f"{mean_ep_reward:15.3f}{mean_ep_length:15.3f}{mean_ep_loss:15.3f}{mean_ep_q:15.3f}"
162
+ f"{time_since_last_record:15.3f}"
163
+ f"{datetime.datetime.now().strftime('%Y-%m-%dT%H:%M:%S'):>20}\n"
164
+ )
165
+
166
+ for metric in ["ep_rewards", "ep_lengths", "ep_avg_losses", "ep_avg_qs"]:
167
+ plt.plot(getattr(self, f"moving_avg_{metric}"))
168
+ plt.savefig(getattr(self, f"{metric}_plot"))
169
+ plt.clf()
170
+
171
+
172
+ class DQNAgent:
173
+ def __init__(self,
174
+ state_dim,
175
+ action_dim,
176
+ save_dir,
177
+ checkpoint=None,
178
+ learning_rate=0.00025,
179
+ max_memory_size=100000,
180
+ batch_size=32,
181
+ exploration_rate=1,
182
+ exploration_rate_decay=0.9999999,
183
+ exploration_rate_min=0.1,
184
+ training_frequency=1,
185
+ learning_starts=1000,
186
+ target_network_sync_frequency=500,
187
+ reset_exploration_rate=False,
188
+ save_frequency=100000,
189
+ gamma=0.9,
190
+ load_replay_buffer=True):
191
+ self.state_dim = state_dim
192
+ self.action_dim = action_dim
193
+ self.max_memory_size = max_memory_size
194
+ self.memory = deque(maxlen=max_memory_size)
195
+ self.batch_size = batch_size
196
+
197
+ self.exploration_rate = exploration_rate
198
+ self.exploration_rate_decay = exploration_rate_decay
199
+ self.exploration_rate_min = exploration_rate_min
200
+ self.gamma = gamma
201
+
202
+ self.curr_step = 0
203
+ self.learning_starts = learning_starts # min. experiences before training
204
+
205
+ self.training_frequency = training_frequency # no. of experiences between updates to Q_online
206
+ self.target_network_sync_frequency = target_network_sync_frequency # no. of experiences between Q_target & Q_online sync
207
+
208
+ self.save_every = save_frequency # no. of experiences between saving Mario Net
209
+ self.save_dir = save_dir
210
+
211
+ self.use_cuda = torch.cuda.is_available()
212
+
213
+ # Mario's DNN to predict the most optimal action - we implement this in the Learn section
214
+ self.net = DQNet(self.state_dim, self.action_dim).float()
215
+ if self.use_cuda:
216
+ self.net = self.net.to(device='cuda')
217
+ if checkpoint:
218
+ self.load(checkpoint, reset_exploration_rate, load_replay_buffer)
219
+
220
+ self.optimizer = torch.optim.AdamW(self.net.parameters(), lr=learning_rate, amsgrad=True)
221
+ self.loss_fn = torch.nn.SmoothL1Loss()
222
+
223
+
224
+ def act(self, state):
225
+ """
226
+ Given a state, choose an epsilon-greedy action and update value of step.
227
+
228
+ Inputs:
229
+ state(LazyFrame): A single observation of the current state, dimension is (state_dim)
230
+ Outputs:
231
+ action_idx (int): An integer representing which action Mario will perform
232
+ """
233
+ # EXPLORE
234
+ if np.random.rand() < self.exploration_rate:
235
+ action_idx = np.random.randint(self.action_dim)
236
+
237
+ # EXPLOIT
238
+ else:
239
+ state = torch.FloatTensor(state).cuda() if self.use_cuda else torch.FloatTensor(state)
240
+ state = state.unsqueeze(0)
241
+ action_values = self.net(state, model='online')
242
+ action_idx = torch.argmax(action_values, axis=1).item()
243
+
244
+ # decrease exploration_rate
245
+ self.exploration_rate *= self.exploration_rate_decay
246
+ self.exploration_rate = max(self.exploration_rate_min, self.exploration_rate)
247
+
248
+ # increment step
249
+ self.curr_step += 1
250
+ return action_idx
251
+
252
+ def cache(self, state, next_state, action, reward, done):
253
+ """
254
+ Store the experience to self.memory (replay buffer)
255
+
256
+ Inputs:
257
+ state (LazyFrame),
258
+ next_state (LazyFrame),
259
+ action (int),
260
+ reward (float),
261
+ done(bool))
262
+ """
263
+ state = torch.FloatTensor(state).cuda() if self.use_cuda else torch.FloatTensor(state)
264
+ next_state = torch.FloatTensor(next_state).cuda() if self.use_cuda else torch.FloatTensor(next_state)
265
+ action = torch.LongTensor([action]).cuda() if self.use_cuda else torch.LongTensor([action])
266
+ reward = torch.DoubleTensor([reward]).cuda() if self.use_cuda else torch.DoubleTensor([reward])
267
+ done = torch.BoolTensor([done]).cuda() if self.use_cuda else torch.BoolTensor([done])
268
+
269
+ self.memory.append( (state, next_state, action, reward, done,) )
270
+
271
+
272
+ def recall(self):
273
+ """
274
+ Retrieve a batch of experiences from memory
275
+ """
276
+ batch = random.sample(self.memory, self.batch_size)
277
+ state, next_state, action, reward, done = map(torch.stack, zip(*batch))
278
+ return state, next_state, action.squeeze(), reward.squeeze(), done.squeeze()
279
+
280
+
281
+ # def td_estimate(self, state, action):
282
+ # current_Q = self.net(state, model='online')[np.arange(0, self.batch_size), action] # Q_online(s,a)
283
+ # return current_Q
284
+
285
+
286
+ # @torch.no_grad()
287
+ # def td_target(self, reward, next_state, done):
288
+ # next_state_Q = self.net(next_state, model='online')
289
+ # best_action = torch.argmax(next_state_Q, axis=1)
290
+ # next_Q = self.net(next_state, model='target')[np.arange(0, self.batch_size), best_action]
291
+ # return (reward + (1 - done.float()) * self.gamma * next_Q).float()
292
+
293
+ def td_estimate(self, states, actions):
294
+ actions = actions.reshape(-1, 1)
295
+ predicted_qs = self.net(states, model='online')# Q_online(s,a)
296
+ predicted_qs = predicted_qs.gather(1, actions)
297
+ return predicted_qs
298
+
299
+
300
+ @torch.no_grad()
301
+ def td_target(self, rewards, next_states, dones):
302
+ rewards = rewards.reshape(-1, 1)
303
+ dones = dones.reshape(-1, 1)
304
+ target_qs = self.net(next_states, model='target')
305
+ target_qs = torch.max(target_qs, dim=1).values
306
+ target_qs = target_qs.reshape(-1, 1)
307
+ target_qs[dones] = 0.0
308
+ return (rewards + (self.gamma * target_qs))
309
+
310
+ def update_Q_online(self, td_estimate, td_target) :
311
+ loss = self.loss_fn(td_estimate, td_target)
312
+ self.optimizer.zero_grad()
313
+ loss.backward()
314
+ self.optimizer.step()
315
+ return loss.item()
316
+
317
+
318
+ def sync_Q_target(self):
319
+ self.net.target.load_state_dict(self.net.online.state_dict())
320
+
321
+
322
+ def learn(self):
323
+ if self.curr_step % self.target_network_sync_frequency == 0:
324
+ self.sync_Q_target()
325
+
326
+ if self.curr_step % self.save_every == 0:
327
+ self.save()
328
+
329
+ if self.curr_step < self.learning_starts:
330
+ return None, None
331
+
332
+ if self.curr_step % self.training_frequency != 0:
333
+ return None, None
334
+
335
+ # Sample from memory
336
+ state, next_state, action, reward, done = self.recall()
337
+
338
+ # Get TD Estimate
339
+ td_est = self.td_estimate(state, action)
340
+
341
+ # Get TD Target
342
+ td_tgt = self.td_target(reward, next_state, done)
343
+
344
+ # Backpropagate loss through Q_online
345
+ loss = self.update_Q_online(td_est, td_tgt)
346
+
347
+ return (td_est.mean().item(), loss)
348
+
349
+
350
+ def save(self):
351
+ save_path = self.save_dir / f"airstriker_net_{int(self.curr_step // self.save_every)}.chkpt"
352
+ torch.save(
353
+ dict(
354
+ model=self.net.state_dict(),
355
+ exploration_rate=self.exploration_rate,
356
+ replay_memory=self.memory
357
+ ),
358
+ save_path
359
+ )
360
+
361
+ print(f"Airstriker model saved to {save_path} at step {self.curr_step}")
362
+
363
+
364
+ def load(self, load_path, reset_exploration_rate, load_replay_buffer):
365
+ if not load_path.exists():
366
+ raise ValueError(f"{load_path} does not exist")
367
+
368
+ ckp = torch.load(load_path, map_location=('cuda' if self.use_cuda else 'cpu'))
369
+ exploration_rate = ckp.get('exploration_rate')
370
+ state_dict = ckp.get('model')
371
+
372
+
373
+ print(f"Loading model at {load_path} with exploration rate {exploration_rate}")
374
+ self.net.load_state_dict(state_dict)
375
+
376
+ if load_replay_buffer:
377
+ replay_memory = ckp.get('replay_memory')
378
+ print(f"Loading replay memory. Len {len(replay_memory)}" if replay_memory else "Saved replay memory not found. Not restoring replay memory.")
379
+ self.memory = replay_memory if replay_memory else self.memory
380
+
381
+ if reset_exploration_rate:
382
+ print(f"Reset exploration rate option specified. Not restoring saved exploration rate {exploration_rate}. The current exploration rate is {self.exploration_rate}")
383
+ else:
384
+ print(f"Setting exploration rate to {exploration_rate} not loaded.")
385
+ self.exploration_rate = exploration_rate
386
+
387
+
388
+ class DDQNAgent(DQNAgent):
389
+ @torch.no_grad()
390
+ def td_target(self, rewards, next_states, dones):
391
+ print("Double dqn -----------------------")
392
+ rewards = rewards.reshape(-1, 1)
393
+ dones = dones.reshape(-1, 1)
394
+ q_vals = self.net(next_states, model='online')
395
+ target_actions = torch.argmax(q_vals, axis=1)
396
+ target_actions = target_actions.reshape(-1, 1)
397
+ target_qs = self.net(next_states, model='target').gather(1, target_actions)
398
+ target_qs = target_qs.reshape(-1, 1)
399
+ target_qs[dones] = 0.0
400
+ return (rewards + (self.gamma * target_qs))
src/airstriker-genesis/cartpole.py ADDED
@@ -0,0 +1,353 @@
1
+
2
+ import torch
3
+ import numpy as np
4
+ import random
5
+ import torch.nn as nn
6
+ import copy
7
+ import time, datetime
8
+ import matplotlib.pyplot as plt
9
+ from collections import deque
10
+ from torch.utils.tensorboard import SummaryWriter
11
+ import pickle
12
+
13
+
14
+ class MyDQN(nn.Module):
15
+ """mini cnn structure
16
+ input -> (conv2d + relu) x 3 -> flatten -> (dense + relu) x 2 -> output
17
+ """
18
+
19
+ def __init__(self, input_dim, output_dim):
20
+ super().__init__()
21
+
22
+ self.online = nn.Sequential(
23
+ nn.Linear(input_dim, 128),
24
+ nn.ReLU(),
25
+ nn.Linear(128, 128),
26
+ nn.ReLU(),
27
+ nn.Linear(128, output_dim)
28
+ )
29
+
30
+
31
+ self.target = copy.deepcopy(self.online)
32
+
33
+ # Q_target parameters are frozen.
34
+ for p in self.target.parameters():
35
+ p.requires_grad = False
36
+
37
+ def forward(self, input, model):
38
+ if model == "online":
39
+ return self.online(input)
40
+ elif model == "target":
41
+ return self.target(input)
42
+
43
+
44
+
45
+ class MetricLogger:
46
+ def __init__(self, save_dir):
47
+ self.writer = SummaryWriter(log_dir=save_dir)
48
+ self.save_log = save_dir / "log"
49
+ with open(self.save_log, "w") as f:
50
+ f.write(
51
+ f"{'Episode':>8}{'Step':>8}{'Epsilon':>10}{'MeanReward':>15}"
52
+ f"{'MeanLength':>15}{'MeanLoss':>15}{'MeanQValue':>15}"
53
+ f"{'TimeDelta':>15}{'Time':>20}\n"
54
+ )
55
+ self.ep_rewards_plot = save_dir / "reward_plot.jpg"
56
+ self.ep_lengths_plot = save_dir / "length_plot.jpg"
57
+ self.ep_avg_losses_plot = save_dir / "loss_plot.jpg"
58
+ self.ep_avg_qs_plot = save_dir / "q_plot.jpg"
59
+
60
+ # History metrics
61
+ self.ep_rewards = []
62
+ self.ep_lengths = []
63
+ self.ep_avg_losses = []
64
+ self.ep_avg_qs = []
65
+
66
+ # Moving averages, added for every call to record()
67
+ self.moving_avg_ep_rewards = []
68
+ self.moving_avg_ep_lengths = []
69
+ self.moving_avg_ep_avg_losses = []
70
+ self.moving_avg_ep_avg_qs = []
71
+
72
+ # Current episode metric
73
+ self.init_episode()
74
+
75
+ # Timing
76
+ self.record_time = time.time()
77
+
78
+ def log_step(self, reward, loss, q):
79
+ self.curr_ep_reward += reward
80
+ self.curr_ep_length += 1
81
+ if loss:
82
+ self.curr_ep_loss += loss
83
+ self.curr_ep_q += q
84
+ self.curr_ep_loss_length += 1
85
+
86
+ def log_episode(self, episode_number):
87
+ "Mark end of episode"
88
+ self.ep_rewards.append(self.curr_ep_reward)
89
+ self.ep_lengths.append(self.curr_ep_length)
90
+ if self.curr_ep_loss_length == 0:
91
+ ep_avg_loss = 0
92
+ ep_avg_q = 0
93
+ else:
94
+ ep_avg_loss = np.round(self.curr_ep_loss / self.curr_ep_loss_length, 5)
95
+ ep_avg_q = np.round(self.curr_ep_q / self.curr_ep_loss_length, 5)
96
+ self.ep_avg_losses.append(ep_avg_loss)
97
+ self.ep_avg_qs.append(ep_avg_q)
98
+ self.writer.add_scalar("Avg Loss for episode", ep_avg_loss, episode_number)
99
+ self.writer.add_scalar("Avg Q value for episode", ep_avg_q, episode_number)
100
+ self.writer.flush()
101
+ self.init_episode()
102
+
103
+ def init_episode(self):
104
+ self.curr_ep_reward = 0.0
105
+ self.curr_ep_length = 0
106
+ self.curr_ep_loss = 0.0
107
+ self.curr_ep_q = 0.0
108
+ self.curr_ep_loss_length = 0
109
+
110
+ def record(self, episode, epsilon, step):
111
+ mean_ep_reward = np.round(np.mean(self.ep_rewards[-100:]), 3)
112
+ mean_ep_length = np.round(np.mean(self.ep_lengths[-100:]), 3)
113
+ mean_ep_loss = np.round(np.mean(self.ep_avg_losses[-100:]), 3)
114
+ mean_ep_q = np.round(np.mean(self.ep_avg_qs[-100:]), 3)
115
+ self.moving_avg_ep_rewards.append(mean_ep_reward)
116
+ self.moving_avg_ep_lengths.append(mean_ep_length)
117
+ self.moving_avg_ep_avg_losses.append(mean_ep_loss)
118
+ self.moving_avg_ep_avg_qs.append(mean_ep_q)
119
+
120
+ last_record_time = self.record_time
121
+ self.record_time = time.time()
122
+ time_since_last_record = np.round(self.record_time - last_record_time, 3)
123
+
124
+ print(
125
+ f"Episode {episode} - "
126
+ f"Step {step} - "
127
+ f"Epsilon {epsilon} - "
128
+ f"Mean Reward {mean_ep_reward} - "
129
+ f"Mean Length {mean_ep_length} - "
130
+ f"Mean Loss {mean_ep_loss} - "
131
+ f"Mean Q Value {mean_ep_q} - "
132
+ f"Time Delta {time_since_last_record} - "
133
+ f"Time {datetime.datetime.now().strftime('%Y-%m-%dT%H:%M:%S')}"
134
+ )
135
+ self.writer.add_scalar("Mean reward last 100 episodes", mean_ep_reward, episode)
136
+ self.writer.add_scalar("Mean length last 100 episodes", mean_ep_length, episode)
137
+ self.writer.add_scalar("Mean loss last 100 episodes", mean_ep_loss, episode)
139
+ self.writer.add_scalar("Epsilon value", epsilon, episode)
140
+ self.writer.add_scalar("Mean Q Value last 100 episodes", mean_ep_q, episode)
141
+ self.writer.flush()
142
+ with open(self.save_log, "a") as f:
143
+ f.write(
144
+ f"{episode:8d}{step:8d}{epsilon:10.3f}"
145
+ f"{mean_ep_reward:15.3f}{mean_ep_length:15.3f}{mean_ep_loss:15.3f}{mean_ep_q:15.3f}"
146
+ f"{time_since_last_record:15.3f}"
147
+ f"{datetime.datetime.now().strftime('%Y-%m-%dT%H:%M:%S'):>20}\n"
148
+ )
149
+
150
+ for metric in ["ep_rewards", "ep_lengths", "ep_avg_losses", "ep_avg_qs"]:
151
+ plt.plot(getattr(self, f"moving_avg_{metric}"))
152
+ plt.savefig(getattr(self, f"{metric}_plot"))
153
+ plt.clf()
154
+
155
+
156
+ class MyAgent:
157
+ def __init__(self, state_dim, action_dim, save_dir, checkpoint=None, reset_exploration_rate=False, max_memory_size=100000):
158
+ self.state_dim = state_dim
159
+ self.action_dim = action_dim
160
+ self.max_memory_size = max_memory_size
161
+ self.memory = deque(maxlen=max_memory_size)
162
+ # self.batch_size = 32
163
+ self.batch_size = 512
164
+
165
+ self.exploration_rate = 1
166
+ # self.exploration_rate_decay = 0.99999975
167
+ self.exploration_rate_decay = 0.9999999
168
+ self.exploration_rate_min = 0.1
169
+ self.gamma = 0.9
170
+
171
+ self.curr_step = 0
172
+ self.learning_start_threshold = 10000 # min. experiences before training
173
+
174
+ self.learn_every = 5 # no. of experiences between updates to Q_online
175
+ self.sync_every = 200 # no. of experiences between Q_target & Q_online sync
176
+
177
+ self.save_every = 200000 # no. of experiences between saving the network
178
+ self.save_dir = save_dir
179
+
180
+ self.use_cuda = torch.cuda.is_available()
181
+
182
+ # The agent's DQN used to predict Q-values for action selection; it is updated in learn()
183
+ self.net = MyDQN(self.state_dim, self.action_dim).float()
184
+ if self.use_cuda:
185
+ self.net = self.net.to(device='cuda')
186
+ if checkpoint:
187
+ self.load(checkpoint, reset_exploration_rate)
188
+
189
+ # self.optimizer = torch.optim.Adam(self.net.parameters(), lr=0.00025)
190
+ self.optimizer = torch.optim.AdamW(self.net.parameters(), lr=0.00025, amsgrad=True)
191
+ self.loss_fn = torch.nn.SmoothL1Loss()
192
+
193
+
194
+ def act(self, state):
195
+ """
196
+ Given a state, choose an epsilon-greedy action and update value of step.
197
+
198
+ Inputs:
199
+ state(LazyFrame): A single observation of the current state, dimension is (state_dim)
200
+ Outputs:
201
+ action_idx (int): An integer representing which action the agent will perform
202
+ """
203
+ # EXPLORE
204
+ if np.random.rand() < self.exploration_rate:
205
+ action_idx = np.random.randint(self.action_dim)
206
+
207
+ # EXPLOIT
208
+ else:
209
+ state = torch.FloatTensor(state).cuda() if self.use_cuda else torch.FloatTensor(state)
210
+ state = state.unsqueeze(0)
211
+ action_values = self.net(state, model='online')
212
+ action_idx = torch.argmax(action_values, axis=1).item()
213
+
214
+ # decrease exploration_rate
215
+ self.exploration_rate *= self.exploration_rate_decay
216
+ self.exploration_rate = max(self.exploration_rate_min, self.exploration_rate)
217
+
218
+ # increment step
219
+ self.curr_step += 1
220
+ return action_idx
221
+
222
+ def cache(self, state, next_state, action, reward, done):
223
+ """
224
+ Store the experience to self.memory (replay buffer)
225
+
226
+ Inputs:
227
+ state (LazyFrame),
228
+ next_state (LazyFrame),
229
+ action (int),
230
+ reward (float),
231
+ done(bool))
232
+ """
233
+ state = torch.FloatTensor(state).cuda() if self.use_cuda else torch.FloatTensor(state)
234
+ next_state = torch.FloatTensor(next_state).cuda() if self.use_cuda else torch.FloatTensor(next_state)
235
+ action = torch.LongTensor([action]).cuda() if self.use_cuda else torch.LongTensor([action])
236
+ reward = torch.DoubleTensor([reward]).cuda() if self.use_cuda else torch.DoubleTensor([reward])
237
+ done = torch.BoolTensor([done]).cuda() if self.use_cuda else torch.BoolTensor([done])
238
+
239
+ self.memory.append( (state, next_state, action, reward, done,) )
240
+
241
+
242
+ def recall(self):
243
+ """
244
+ Retrieve a batch of experiences from memory
245
+ """
246
+ batch = random.sample(self.memory, self.batch_size)
247
+ state, next_state, action, reward, done = map(torch.stack, zip(*batch))
248
+ return state, next_state, action.squeeze(), reward.squeeze(), done.squeeze()
249
+
250
+
251
+ # def td_estimate(self, state, action):
252
+ # current_Q = self.net(state, model='online')[np.arange(0, self.batch_size), action] # Q_online(s,a)
253
+ # return current_Q
254
+
255
+
256
+ # @torch.no_grad()
257
+ # def td_target(self, reward, next_state, done):
258
+ # next_state_Q = self.net(next_state, model='online')
259
+ # best_action = torch.argmax(next_state_Q, axis=1)
260
+ # next_Q = self.net(next_state, model='target')[np.arange(0, self.batch_size), best_action]
261
+ # return (reward + (1 - done.float()) * self.gamma * next_Q).float()
262
+
263
+ def td_estimate(self, states, actions):
264
+ actions = actions.reshape(-1, 1)
265
+ predicted_qs = self.net(states, model='online')# Q_online(s,a)
266
+ predicted_qs = predicted_qs.gather(1, actions)
267
+ return predicted_qs
268
+
269
+
270
+ @torch.no_grad()
271
+ def td_target(self, rewards, next_states, dones):
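+ # Vanilla DQN target: r + gamma * max_a Q_target(s', a); terminal transitions keep only the reward (target_qs is zeroed where done is True).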
272
+ rewards = rewards.reshape(-1, 1)
273
+ dones = dones.reshape(-1, 1)
274
+ target_qs = self.net(next_states, model='target')
275
+ target_qs = torch.max(target_qs, dim=1).values
276
+ target_qs = target_qs.reshape(-1, 1)
277
+ target_qs[dones] = 0.0
278
+ return (rewards + (self.gamma * target_qs))
279
+
280
+ def update_Q_online(self, td_estimate, td_target):
281
+ loss = self.loss_fn(td_estimate, td_target)
282
+ self.optimizer.zero_grad()
283
+ loss.backward()
284
+ self.optimizer.step()
285
+ return loss.item()
286
+
287
+
288
+ def sync_Q_target(self):
289
+ self.net.target.load_state_dict(self.net.online.state_dict())
290
+
291
+
292
+ def learn(self):
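+ # Training schedule: sync the target network every sync_every steps, checkpoint every save_every steps, skip updates until learning_start_threshold experiences exist, then train every learn_every steps.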
293
+ if self.curr_step % self.sync_every == 0:
294
+ self.sync_Q_target()
295
+
296
+ if self.curr_step % self.save_every == 0:
297
+ self.save()
298
+
299
+ if self.curr_step < self.learning_start_threshold:
300
+ return None, None
301
+
302
+ if self.curr_step % self.learn_every != 0:
303
+ return None, None
304
+
305
+ # Sample from memory
306
+ state, next_state, action, reward, done = self.recall()
307
+
308
+ # Get TD Estimate
309
+ td_est = self.td_estimate(state, action)
310
+
311
+ # Get TD Target
312
+ td_tgt = self.td_target(reward, next_state, done)
313
+
314
+ # Backpropagate loss through Q_online
315
+ loss = self.update_Q_online(td_est, td_tgt)
316
+
317
+ return (td_est.mean().item(), loss)
318
+
319
+
320
+ def save(self):
321
+ save_path = self.save_dir / f"cartpole_net_{int(self.curr_step // self.save_every)}.chkpt"
322
+ torch.save(
323
+ dict(
324
+ model=self.net.state_dict(),
325
+ exploration_rate=self.exploration_rate,
326
+ replay_memory=self.memory
327
+ ),
328
+ save_path
329
+ )
330
+
331
+ print(f"Cartpole Net saved to {save_path} at step {self.curr_step}")
332
+
333
+
334
+ def load(self, load_path, reset_exploration_rate=False):
335
+ if not load_path.exists():
336
+ raise ValueError(f"{load_path} does not exist")
337
+
338
+ ckp = torch.load(load_path, map_location=('cuda' if self.use_cuda else 'cpu'))
339
+ exploration_rate = ckp.get('exploration_rate')
340
+ state_dict = ckp.get('model')
341
+ replay_memory = ckp.get('replay_memory')
342
+
343
+ print(f"Loading model at {load_path} with exploration rate {exploration_rate}")
344
+ self.net.load_state_dict(state_dict)
345
+
346
+ print(f"Loading replay memory. Len {len(replay_memory)}" if replay_memory else "Saved replay memory not found. Not restoring replay memory.")
347
+ self.memory = replay_memory if replay_memory else self.memory
348
+
349
+ if reset_exploration_rate:
350
+ print(f"Reset exploration rate option specified. Not restoring saved exploration rate {exploration_rate}. The current exploration rate is {self.exploration_rate}")
351
+ else:
352
+ print(f"Setting exploration rate to {exploration_rate} not loaded.")
353
+ self.exploration_rate = exploration_rate
src/airstriker-genesis/procgen_agent.py ADDED
@@ -0,0 +1,400 @@
1
+ import torch
2
+ import numpy as np
3
+ import random
4
+ import torch.nn as nn
5
+ import copy
6
+ import time, datetime
7
+ import matplotlib.pyplot as plt
8
+ from collections import deque
9
+ from torch.utils.tensorboard import SummaryWriter
10
+ import pickle
11
+
12
+
13
+ class DQNet(nn.Module):
14
+ """mini cnn structure
15
+ input -> (conv2d + relu) x 3 -> flatten -> (dense + relu) x 2 -> output
16
+ """
17
+
18
+ def __init__(self, input_dim, output_dim):
19
+ super().__init__()
20
+ print("#################################")
21
+ print("#################################")
22
+ print(input_dim)
23
+ print(output_dim)
24
+ print("#################################")
25
+ print("#################################")
26
+ c, h, w = input_dim
27
+
28
+ # if h != 84:
29
+ # raise ValueError(f"Expecting input height: 84, got: {h}")
30
+ # if w != 84:
31
+ # raise ValueError(f"Expecting input width: 84, got: {w}")
32
+
33
+ self.online = nn.Sequential(
34
+ nn.Conv2d(in_channels=c, out_channels=32, kernel_size=8, stride=4),
35
+ nn.ReLU(),
36
+ nn.Conv2d(in_channels=32, out_channels=64, kernel_size=4, stride=2),
37
+ nn.ReLU(),
38
+ nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, stride=1),
39
+ nn.ReLU(),
40
+ nn.Flatten(),
41
+ nn.Linear(7168, 512),
42
+ nn.ReLU(),
43
+ nn.Linear(512, output_dim),
44
+ )
45
+
46
+
47
+ self.target = copy.deepcopy(self.online)
48
+
49
+ # Q_target parameters are frozen.
50
+ for p in self.target.parameters():
51
+ p.requires_grad = False
52
+
53
+ def forward(self, input, model):
54
+ if model == "online":
55
+ return self.online(input)
56
+ elif model == "target":
57
+ return self.target(input)
58
+
59
+
60
+
61
+ class MetricLogger:
62
+ def __init__(self, save_dir):
63
+ self.writer = SummaryWriter(log_dir=save_dir)
64
+ self.save_log = save_dir / "log"
65
+ with open(self.save_log, "w") as f:
66
+ f.write(
67
+ f"{'Episode':>8}{'Step':>8}{'Epsilon':>10}{'MeanReward':>15}"
68
+ f"{'MeanLength':>15}{'MeanLoss':>15}{'MeanQValue':>15}"
69
+ f"{'TimeDelta':>15}{'Time':>20}\n"
70
+ )
71
+ self.ep_rewards_plot = save_dir / "reward_plot.jpg"
72
+ self.ep_lengths_plot = save_dir / "length_plot.jpg"
73
+ self.ep_avg_losses_plot = save_dir / "loss_plot.jpg"
74
+ self.ep_avg_qs_plot = save_dir / "q_plot.jpg"
75
+
76
+ # History metrics
77
+ self.ep_rewards = []
78
+ self.ep_lengths = []
79
+ self.ep_avg_losses = []
80
+ self.ep_avg_qs = []
81
+
82
+ # Moving averages, added for every call to record()
83
+ self.moving_avg_ep_rewards = []
84
+ self.moving_avg_ep_lengths = []
85
+ self.moving_avg_ep_avg_losses = []
86
+ self.moving_avg_ep_avg_qs = []
87
+
88
+ # Current episode metric
89
+ self.init_episode()
90
+
91
+ # Timing
92
+ self.record_time = time.time()
93
+
94
+ def log_step(self, reward, loss, q):
95
+ self.curr_ep_reward += reward
96
+ self.curr_ep_length += 1
97
+ if loss:
98
+ self.curr_ep_loss += loss
99
+ self.curr_ep_q += q
100
+ self.curr_ep_loss_length += 1
101
+
102
+ def log_episode(self, episode_number):
103
+ "Mark end of episode"
104
+ self.ep_rewards.append(self.curr_ep_reward)
105
+ self.ep_lengths.append(self.curr_ep_length)
106
+ if self.curr_ep_loss_length == 0:
107
+ ep_avg_loss = 0
108
+ ep_avg_q = 0
109
+ else:
110
+ ep_avg_loss = np.round(self.curr_ep_loss / self.curr_ep_loss_length, 5)
111
+ ep_avg_q = np.round(self.curr_ep_q / self.curr_ep_loss_length, 5)
112
+ self.ep_avg_losses.append(ep_avg_loss)
113
+ self.ep_avg_qs.append(ep_avg_q)
114
+ self.writer.add_scalar("Avg Loss for episode", ep_avg_loss, episode_number)
115
+ self.writer.add_scalar("Avg Q value for episode", ep_avg_q, episode_number)
116
+ self.writer.flush()
117
+ self.init_episode()
118
+
119
+ def init_episode(self):
120
+ self.curr_ep_reward = 0.0
121
+ self.curr_ep_length = 0
122
+ self.curr_ep_loss = 0.0
123
+ self.curr_ep_q = 0.0
124
+ self.curr_ep_loss_length = 0
125
+
126
+ def record(self, episode, epsilon, step):
127
+ mean_ep_reward = np.round(np.mean(self.ep_rewards[-100:]), 3)
128
+ mean_ep_length = np.round(np.mean(self.ep_lengths[-100:]), 3)
129
+ mean_ep_loss = np.round(np.mean(self.ep_avg_losses[-100:]), 3)
130
+ mean_ep_q = np.round(np.mean(self.ep_avg_qs[-100:]), 3)
131
+ self.moving_avg_ep_rewards.append(mean_ep_reward)
132
+ self.moving_avg_ep_lengths.append(mean_ep_length)
133
+ self.moving_avg_ep_avg_losses.append(mean_ep_loss)
134
+ self.moving_avg_ep_avg_qs.append(mean_ep_q)
135
+
136
+ last_record_time = self.record_time
137
+ self.record_time = time.time()
138
+ time_since_last_record = np.round(self.record_time - last_record_time, 3)
139
+
140
+ print(
141
+ f"Episode {episode} - "
142
+ f"Step {step} - "
143
+ f"Epsilon {epsilon} - "
144
+ f"Mean Reward {mean_ep_reward} - "
145
+ f"Mean Length {mean_ep_length} - "
146
+ f"Mean Loss {mean_ep_loss} - "
147
+ f"Mean Q Value {mean_ep_q} - "
148
+ f"Time Delta {time_since_last_record} - "
149
+ f"Time {datetime.datetime.now().strftime('%Y-%m-%dT%H:%M:%S')}"
150
+ )
151
+ self.writer.add_scalar("Mean reward last 100 episodes", mean_ep_reward, episode)
152
+ self.writer.add_scalar("Mean length last 100 episodes", mean_ep_length, episode)
153
+ self.writer.add_scalar("Mean loss last 100 episodes", mean_ep_loss, episode)
155
+ self.writer.add_scalar("Epsilon value", epsilon, episode)
156
+ self.writer.add_scalar("Mean Q Value last 100 episodes", mean_ep_q, episode)
157
+ self.writer.flush()
158
+ with open(self.save_log, "a") as f:
159
+ f.write(
160
+ f"{episode:8d}{step:8d}{epsilon:10.3f}"
161
+ f"{mean_ep_reward:15.3f}{mean_ep_length:15.3f}{mean_ep_loss:15.3f}{mean_ep_q:15.3f}"
162
+ f"{time_since_last_record:15.3f}"
163
+ f"{datetime.datetime.now().strftime('%Y-%m-%dT%H:%M:%S'):>20}\n"
164
+ )
165
+
166
+ for metric in ["ep_rewards", "ep_lengths", "ep_avg_losses", "ep_avg_qs"]:
167
+ plt.plot(getattr(self, f"moving_avg_{metric}"))
168
+ plt.savefig(getattr(self, f"{metric}_plot"))
169
+ plt.clf()
170
+
171
+
172
+ class DQNAgent:
173
+ def __init__(self,
174
+ state_dim,
175
+ action_dim,
176
+ save_dir,
177
+ checkpoint=None,
178
+ learning_rate=0.00025,
179
+ max_memory_size=100000,
180
+ batch_size=32,
181
+ exploration_rate=1,
182
+ exploration_rate_decay=0.9999999,
183
+ exploration_rate_min=0.1,
184
+ training_frequency=1,
185
+ learning_starts=1000,
186
+ target_network_sync_frequency=500,
187
+ reset_exploration_rate=False,
188
+ save_frequency=100000,
189
+ gamma=0.9,
190
+ load_replay_buffer=True):
191
+ self.state_dim = state_dim
192
+ self.action_dim = action_dim
193
+ self.max_memory_size = max_memory_size
194
+ self.memory = deque(maxlen=max_memory_size)
195
+ self.batch_size = batch_size
196
+
197
+ self.exploration_rate = exploration_rate
198
+ self.exploration_rate_decay = exploration_rate_decay
199
+ self.exploration_rate_min = exploration_rate_min
200
+ self.gamma = gamma
201
+
202
+ self.curr_step = 0
203
+ self.learning_starts = learning_starts # min. experiences before training
204
+
205
+ self.training_frequency = training_frequency # no. of experiences between updates to Q_online
206
+ self.target_network_sync_frequency = target_network_sync_frequency # no. of experiences between Q_target & Q_online sync
207
+
208
+ self.save_every = save_frequency # no. of experiences between saving the network
209
+ self.save_dir = save_dir
210
+
211
+ self.use_cuda = torch.cuda.is_available()
212
+
213
+ # The agent's DQN used to predict Q-values for action selection; it is updated in learn()
214
+ self.net = DQNet(self.state_dim, self.action_dim).float()
215
+ if self.use_cuda:
216
+ self.net = self.net.to(device='cuda')
217
+ if checkpoint:
218
+ self.load(checkpoint, reset_exploration_rate, load_replay_buffer)
219
+
220
+ self.optimizer = torch.optim.AdamW(self.net.parameters(), lr=learning_rate, amsgrad=True)
221
+ self.loss_fn = torch.nn.SmoothL1Loss()
222
+
223
+
224
+ def act(self, state):
225
+ """
226
+ Given a state, choose an epsilon-greedy action and update value of step.
227
+
228
+ Inputs:
229
+ state(LazyFrame): A single observation of the current state, dimension is (state_dim)
230
+ Outputs:
231
+ action_idx (int): An integer representing which action the agent will perform
232
+ """
233
+ # EXPLORE
234
+ if np.random.rand() < self.exploration_rate:
235
+ action_idx = np.random.randint(self.action_dim)
236
+
237
+ # EXPLOIT
238
+ else:
239
+ state = torch.FloatTensor(state).cuda() if self.use_cuda else torch.FloatTensor(state)
240
+ state = state.unsqueeze(0)
241
+ action_values = self.net(state, model='online')
242
+ action_idx = torch.argmax(action_values, axis=1).item()
243
+
244
+ # decrease exploration_rate
245
+ self.exploration_rate *= self.exploration_rate_decay
246
+ self.exploration_rate = max(self.exploration_rate_min, self.exploration_rate)
247
+
248
+ # increment step
249
+ self.curr_step += 1
250
+ return action_idx
251
+
252
+ def cache(self, state, next_state, action, reward, done):
253
+ """
254
+ Store the experience to self.memory (replay buffer)
255
+
256
+ Inputs:
257
+ state (LazyFrame),
258
+ next_state (LazyFrame),
259
+ action (int),
260
+ reward (float),
261
+ done(bool))
262
+ """
263
+ state = torch.FloatTensor(state).cuda() if self.use_cuda else torch.FloatTensor(state)
264
+ next_state = torch.FloatTensor(next_state).cuda() if self.use_cuda else torch.FloatTensor(next_state)
265
+ action = torch.LongTensor([action]).cuda() if self.use_cuda else torch.LongTensor([action])
266
+ reward = torch.DoubleTensor([reward]).cuda() if self.use_cuda else torch.DoubleTensor([reward])
267
+ done = torch.BoolTensor([done]).cuda() if self.use_cuda else torch.BoolTensor([done])
268
+
269
+ self.memory.append( (state, next_state, action, reward, done,) )
270
+
271
+
272
+ def recall(self):
273
+ """
274
+ Retrieve a batch of experiences from memory
275
+ """
276
+ batch = random.sample(self.memory, self.batch_size)
277
+ state, next_state, action, reward, done = map(torch.stack, zip(*batch))
278
+ return state, next_state, action.squeeze(), reward.squeeze(), done.squeeze()
279
+
280
+
281
+ # def td_estimate(self, state, action):
282
+ # current_Q = self.net(state, model='online')[np.arange(0, self.batch_size), action] # Q_online(s,a)
283
+ # return current_Q
284
+
285
+
286
+ # @torch.no_grad()
287
+ # def td_target(self, reward, next_state, done):
288
+ # next_state_Q = self.net(next_state, model='online')
289
+ # best_action = torch.argmax(next_state_Q, axis=1)
290
+ # next_Q = self.net(next_state, model='target')[np.arange(0, self.batch_size), best_action]
291
+ # return (reward + (1 - done.float()) * self.gamma * next_Q).float()
292
+
293
+ def td_estimate(self, states, actions):
294
+ actions = actions.reshape(-1, 1)
295
+ predicted_qs = self.net(states, model='online')# Q_online(s,a)
296
+ predicted_qs = predicted_qs.gather(1, actions)
297
+ return predicted_qs
298
+
299
+
300
+ @torch.no_grad()
301
+ def td_target(self, rewards, next_states, dones):
302
+ rewards = rewards.reshape(-1, 1)
303
+ dones = dones.reshape(-1, 1)
304
+ target_qs = self.net(next_states, model='target')
305
+ target_qs = torch.max(target_qs, dim=1).values
306
+ target_qs = target_qs.reshape(-1, 1)
307
+ target_qs[dones] = 0.0
308
+ return (rewards + (self.gamma * target_qs))
309
+
310
+ def update_Q_online(self, td_estimate, td_target):
311
+ loss = self.loss_fn(td_estimate, td_target)
312
+ self.optimizer.zero_grad()
313
+ loss.backward()
314
+ self.optimizer.step()
315
+ return loss.item()
316
+
317
+
318
+ def sync_Q_target(self):
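+ # Hard update: copy the online network's weights into the frozen target network.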
319
+ self.net.target.load_state_dict(self.net.online.state_dict())
320
+
321
+
322
+ def learn(self):
323
+ if self.curr_step % self.target_network_sync_frequency == 0:
324
+ self.sync_Q_target()
325
+
326
+ if self.curr_step % self.save_every == 0:
327
+ self.save()
328
+
329
+ if self.curr_step < self.learning_starts:
330
+ return None, None
331
+
332
+ if self.curr_step % self.training_frequency != 0:
333
+ return None, None
334
+
335
+ # Sample from memory
336
+ state, next_state, action, reward, done = self.recall()
337
+
338
+ # Get TD Estimate
339
+ td_est = self.td_estimate(state, action)
340
+
341
+ # Get TD Target
342
+ td_tgt = self.td_target(reward, next_state, done)
343
+
344
+ # Backpropagate loss through Q_online
345
+ loss = self.update_Q_online(td_est, td_tgt)
346
+
347
+ return (td_est.mean().item(), loss)
348
+
349
+
350
+ def save(self):
351
+ save_path = self.save_dir / f"airstriker_net_{int(self.curr_step // self.save_every)}.chkpt"
352
+ torch.save(
353
+ dict(
354
+ model=self.net.state_dict(),
355
+ exploration_rate=self.exploration_rate,
356
+ replay_memory=self.memory
357
+ ),
358
+ save_path
359
+ )
360
+
361
+ print(f"Airstriker model saved to {save_path} at step {self.curr_step}")
362
+
363
+
364
+ def load(self, load_path, reset_exploration_rate, load_replay_buffer):
365
+ if not load_path.exists():
366
+ raise ValueError(f"{load_path} does not exist")
367
+
368
+ ckp = torch.load(load_path, map_location=('cuda' if self.use_cuda else 'cpu'))
369
+ exploration_rate = ckp.get('exploration_rate')
370
+ state_dict = ckp.get('model')
371
+
372
+
373
+ print(f"Loading model at {load_path} with exploration rate {exploration_rate}")
374
+ self.net.load_state_dict(state_dict)
375
+
376
+ if load_replay_buffer:
377
+ replay_memory = ckp.get('replay_memory')
378
+ print(f"Loading replay memory. Len {len(replay_memory)}" if replay_memory else "Saved replay memory not found. Not restoring replay memory.")
379
+ self.memory = replay_memory if replay_memory else self.memory
380
+
381
+ if reset_exploration_rate:
382
+ print(f"Reset exploration rate option specified. Not restoring saved exploration rate {exploration_rate}. The current exploration rate is {self.exploration_rate}")
383
+ else:
384
+ print(f"Setting exploration rate to {exploration_rate} not loaded.")
385
+ self.exploration_rate = exploration_rate
386
+
387
+
388
+ class DDQNAgent(DQNAgent):
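+ # Same agent as DQNAgent except for the TD target: the online network picks argmax_a Q_online(s', a) and the target network evaluates that action (Double DQN).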
389
+ @torch.no_grad()
390
+ def td_target(self, rewards, next_states, dones):
391
+ print("Double dqn -----------------------")
392
+ rewards = rewards.reshape(-1, 1)
393
+ dones = dones.reshape(-1, 1)
394
+ q_vals = self.net(next_states, model='online')
395
+ target_actions = torch.argmax(q_vals, axis=1)
396
+ target_actions = target_actions.reshape(-1, 1)
397
+ target_qs = self.net(next_states, model='target').gather(1, target_actions)
398
+ target_qs = target_qs.reshape(-1, 1)
399
+ target_qs[dones] = 0.0
400
+ return (rewards + (self.gamma * target_qs))
src/airstriker-genesis/replay.py ADDED
@@ -0,0 +1,66 @@
1
+ import datetime
2
+ from pathlib import Path
3
+ from itertools import count
4
+ from agent import DQNAgent, MetricLogger
5
+ from wrappers import make_env, make_starpilot
6
+
7
+
8
+ env = make_starpilot()
9
+
10
+ env.reset()
11
+
12
+ save_dir = Path("checkpoints") / datetime.datetime.now().strftime("%Y-%m-%dT%H-%M-%S")
13
+ save_dir.mkdir(parents=True)
14
+
15
+ checkpoint = Path('checkpoints/procgen-starpilot-dqn/airstriker_net_3.chkpt')
16
+
17
+ agent = DQNAgent(
18
+ state_dim=(1, 64, 64),
19
+ action_dim=env.action_space.n,
20
+ save_dir=save_dir,
21
+ batch_size=256,
22
+ checkpoint=checkpoint,
23
+ reset_exploration_rate=True,
24
+ exploration_rate_decay=0.999999,
25
+ training_frequency=10,
26
+ target_network_sync_frequency=200,
27
+ max_memory_size=3000,
28
+ learning_rate=0.001,
29
+ save_frequency=2000
30
+
31
+ )
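+ # Evaluation run: pin epsilon to its minimum so the loaded policy acts (almost) greedily.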
32
+ agent.exploration_rate = agent.exploration_rate_min
33
+
34
+ # logger = MetricLogger(save_dir)
35
+
36
+ episodes = 100
37
+
38
+ for e in range(episodes):
39
+
40
+ state = env.reset()
41
+
42
+ while True:
43
+
44
+ env.render()
45
+
46
+ action = agent.act(state)
47
+
48
+ next_state, reward, done, info = env.step(action)
49
+
50
+ agent.cache(state, next_state, action, reward, done)
51
+
52
+ # logger.log_step(reward, None, None)
53
+
54
+ state = next_state
55
+
56
+ if done:
57
+ break
58
+
59
+ # logger.log_episode()
60
+
61
+ # if e % 20 == 0:
62
+ # logger.record(
63
+ # episode=e,
64
+ # epsilon=agent.exploration_rate,
65
+ # step=agent.curr_step
66
+ # )
src/airstriker-genesis/run-airstriker-ddqn.py ADDED
@@ -0,0 +1,120 @@
1
+ import os
2
+ import torch
3
+ import matplotlib
4
+ import matplotlib.pyplot as plt
5
+
6
+ from pathlib import Path
7
+ from tqdm import trange
8
+ from agent import DQNAgent, DDQNAgent, MetricLogger
9
+ from wrappers import make_env
10
+
11
+
12
+ # set up matplotlib
13
+ is_ipython = 'inline' in matplotlib.get_backend()
14
+ if is_ipython:
15
+ from IPython import display
16
+
17
+ plt.ion()
18
+
19
+
20
+ env = make_env()
21
+
22
+ use_cuda = torch.cuda.is_available()
23
+ print(f"Using CUDA: {use_cuda}\n")
24
+
25
+
26
+ checkpoint = None
27
+ # checkpoint = Path('checkpoints/latest/airstriker_net_3.chkpt')
28
+
29
+ path = "checkpoints/airstriker-ddqn"
30
+ save_dir = Path(path)
31
+
32
+ isExist = os.path.exists(path)
33
+ if not isExist:
34
+ os.makedirs(path)
35
+
36
+ # Vanilla DQN
37
+ print("Training Vanilla DQN Agent!")
38
+ # agent = DQNAgent(
39
+ # state_dim=(1, 84, 84),
40
+ # action_dim=env.action_space.n,
41
+ # save_dir=save_dir,
42
+ # batch_size=128,
43
+ # checkpoint=checkpoint,
44
+ # exploration_rate_decay=0.995,
45
+ # exploration_rate_min=0.05,
46
+ # training_frequency=1,
47
+ # target_network_sync_frequency=500,
48
+ # max_memory_size=50000,
49
+ # learning_rate=0.0005,
50
+
51
+ # )
52
+
53
+ # Double DQN
54
+ print("Training DDQN Agent!")
55
+ agent = DDQNAgent(
56
+ state_dim=(1, 84, 84),
57
+ action_dim=env.action_space.n,
58
+ save_dir=save_dir,
59
+ batch_size=128,
60
+ checkpoint=checkpoint,
61
+ exploration_rate_decay=0.995,
62
+ exploration_rate_min=0.05,
63
+ training_frequency=1,
64
+ target_network_sync_frequency=500,
65
+ max_memory_size=50000,
66
+ learning_rate=0.0005,
67
+ )
68
+
69
+ logger = MetricLogger(save_dir)
70
+
71
+ def fill_memory(agent: DQNAgent, num_episodes=1000):
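+ # Warm-up: pre-fill the replay buffer with epsilon-greedy experience before training starts.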
72
+ print("Filling up memory....")
73
+ for _ in trange(num_episodes):
74
+ state = env.reset()
75
+ done = False
76
+ while not done:
77
+ action = agent.act(state)
78
+ next_state, reward, done, _ = env.step(action)
79
+ agent.cache(state, next_state, action, reward, done)
80
+ state = next_state
81
+
82
+
83
+ def train(agent: DQNAgent):
84
+ episodes = 10000000
85
+ for e in range(episodes):
86
+
87
+ state = env.reset()
88
+ # Play the game!
89
+ while True:
90
+
91
+ # print(state.shape)
92
+ # Run agent on the state
93
+ action = agent.act(state)
94
+
95
+ # Agent performs action
96
+ next_state, reward, done, info = env.step(action)
97
+
98
+ # Remember
99
+ agent.cache(state, next_state, action, reward, done)
100
+
101
+ # Learn
102
+ q, loss = agent.learn()
103
+
104
+ # Logging
105
+ logger.log_step(reward, loss, q)
106
+
107
+ # Update state
108
+ state = next_state
109
+
110
+ # Check if end of game
111
+ if done or info["gameover"] == 1:
112
+ break
113
+
114
+ logger.log_episode(e)
115
+
116
+ if e % 20 == 0:
117
+ logger.record(episode=e, epsilon=agent.exploration_rate, step=agent.curr_step)
118
+
119
+ fill_memory(agent)
120
+ train(agent)
src/airstriker-genesis/run-airstriker-dqn.py ADDED
@@ -0,0 +1,115 @@
1
+ import os
2
+ import torch
3
+ import matplotlib
4
+ import matplotlib.pyplot as plt
5
+
6
+ from pathlib import Path
7
+ from tqdm import trange
8
+ from agent import DQNAgent, DDQNAgent, MetricLogger
9
+ from wrappers import make_env
10
+
11
+
12
+ # set up matplotlib
13
+ is_ipython = 'inline' in matplotlib.get_backend()
14
+ if is_ipython:
15
+ from IPython import display
16
+
17
+ plt.ion()
18
+
19
+
20
+ env = make_env()
21
+
22
+ use_cuda = torch.cuda.is_available()
23
+ print(f"Using CUDA: {use_cuda}\n")
24
+
25
+
26
+ checkpoint = None
27
+ # checkpoint = Path('checkpoints/latest/airstriker_net_3.chkpt')
28
+
29
+ path = "checkpoints/airstriker-dqn-new"
30
+ save_dir = Path(path)
31
+
32
+ isExist = os.path.exists(path)
33
+ if not isExist:
34
+ os.makedirs(path)
35
+
36
+ # Vanilla DQN
37
+ print("Training Vanilla DQN Agent!")
38
+ agent = DQNAgent(
39
+ state_dim=(1, 84, 84),
40
+ action_dim=env.action_space.n,
41
+ save_dir=save_dir,
42
+ batch_size=128,
43
+ checkpoint=checkpoint,
44
+ exploration_rate_decay=0.995,
45
+ exploration_rate_min=0.05,
46
+ training_frequency=1,
47
+ target_network_sync_frequency=500,
48
+ max_memory_size=50000,
49
+ learning_rate=0.0005,
50
+
51
+ )
52
+
53
+ # Double DQN
54
+ # print("Training DDQN Agent!")
55
+ # agent = DDQNAgent(
56
+ # state_dim=(1, 84, 84),
57
+ # action_dim=env.action_space.n,
58
+ # save_dir=save_dir,
59
+ # checkpoint=checkpoint,
60
+ # reset_exploration_rate=True,
61
+ # max_memory_size=max_memory_size
62
+ # )
63
+
64
+ logger = MetricLogger(save_dir)
65
+
66
+ def fill_memory(agent: DQNAgent, num_episodes=1000):
67
+ print("Filling up memory....")
68
+ for _ in trange(num_episodes):
69
+ state = env.reset()
70
+ done = False
71
+ while not done:
72
+ action = agent.act(state)
73
+ next_state, reward, done, _ = env.step(action)
74
+ agent.cache(state, next_state, action, reward, done)
75
+ state = next_state
76
+
77
+
78
+ def train(agent: DQNAgent):
79
+ episodes = 10000000
80
+ for e in range(episodes):
81
+
82
+ state = env.reset()
83
+ # Play the game!
84
+ while True:
85
+
86
+ # print(state.shape)
87
+ # Run agent on the state
88
+ action = agent.act(state)
89
+
90
+ # Agent performs action
91
+ next_state, reward, done, info = env.step(action)
92
+
93
+ # Remember
94
+ agent.cache(state, next_state, action, reward, done)
95
+
96
+ # Learn
97
+ q, loss = agent.learn()
98
+
99
+ # Logging
100
+ logger.log_step(reward, loss, q)
101
+
102
+ # Update state
103
+ state = next_state
104
+
105
+ # Check if end of game
106
+ if done or info["gameover"] == 1:
107
+ break
108
+
109
+ logger.log_episode(e)
110
+
111
+ if e % 20 == 0:
112
+ logger.record(episode=e, epsilon=agent.exploration_rate, step=agent.curr_step)
113
+
114
+ fill_memory(agent)
115
+ train(agent)
src/airstriker-genesis/run-cartpole.py ADDED
@@ -0,0 +1,120 @@
1
+ import os
2
+ import random, datetime
3
+ from pathlib import Path
4
+ import retro
5
+ from collections import namedtuple, deque
6
+ from itertools import count
7
+
8
+ import torch
9
+ import matplotlib
10
+ import matplotlib.pyplot as plt
11
+ # from agent import MyAgent, MyDQN, MetricLogger
12
+ from cartpole import MyAgent, MetricLogger
13
+ from wrappers import make_env
14
+ import pickle
15
+ import gym
16
+ from tqdm import trange
17
+
18
+ # set up matplotlib
19
+ is_ipython = 'inline' in matplotlib.get_backend()
20
+ if is_ipython:
21
+ from IPython import display
22
+
23
+ plt.ion()
24
+
25
+
26
+ # env = make_env()
27
+ env = gym.make('CartPole-v1')
28
+
29
+ use_cuda = torch.cuda.is_available()
30
+ print(f"Using CUDA: {use_cuda}")
31
+ print()
32
+
33
+ path = "checkpoints/cartpole/latest"
34
+ save_dir = Path(path)
35
+
36
+ isExist = os.path.exists(path)
37
+ if not isExist:
38
+ os.makedirs(path)
39
+
40
+ # save_dir.mkdir(parents=True)
41
+
42
+
43
+ checkpoint = None
44
+ # checkpoint = Path('checkpoints/latest/airstriker_net_3.chkpt')
45
+
46
+ # For cartpole
47
+ n_actions = env.action_space.n
48
+ state = env.reset()
49
+ n_observations = len(state)
50
+ max_memory_size=100000
51
+ agent = MyAgent(
52
+ state_dim=n_observations,
53
+ action_dim=n_actions,
54
+ save_dir=save_dir,
55
+ checkpoint=checkpoint,
56
+ reset_exploration_rate=True,
57
+ max_memory_size=max_memory_size
58
+ )
59
+
60
+ # For airstriker
61
+ # agent = MyAgent(state_dim=(1, 84, 84), action_dim=env.action_space.n, save_dir=save_dir, checkpoint=checkpoint, reset_exploration_rate=True)
62
+
63
+
64
+ logger = MetricLogger(save_dir)
65
+
66
+
67
+
68
+ def fill_memory(agent: MyAgent):
69
+ print("Filling up memory....")
70
+ for _ in trange(max_memory_size):
71
+ state = env.reset()
72
+ done = False
73
+ while not done:
74
+ action = agent.act(state)
75
+ next_state, reward, done, info = env.step(action)
76
+ agent.cache(state, next_state, action, reward, done)
77
+ state = next_state
78
+
79
+ def train(agent: MyAgent):
80
+ episodes = 10000000
81
+ for e in range(episodes):
82
+
83
+ state = env.reset()
84
+ # Play the game!
85
+ while True:
86
+
87
+ # print(state.shape)
88
+ # Run agent on the state
89
+ action = agent.act(state)
90
+
91
+ # Agent performs action
92
+ next_state, reward, done, info = env.step(action)
93
+
94
+ # Remember
95
+ agent.cache(state, next_state, action, reward, done)
96
+
97
+ # Learn
98
+ q, loss = agent.learn()
99
+
100
+ # Logging
101
+ logger.log_step(reward, loss, q)
102
+
103
+ # Update state
104
+ state = next_state
105
+
106
+ # # Check if end of game (for airstriker)
107
+ # if done or info["gameover"] == 1:
108
+ # break
109
+ # Check if end of game (for cartpole)
110
+ if done:
111
+ break
112
+
113
+ logger.log_episode(e)
114
+
115
+ if e % 20 == 0:
116
+ logger.record(episode=e, epsilon=agent.exploration_rate, step=agent.curr_step)
117
+
118
+
119
+ fill_memory(agent)
120
+ train(agent)
src/airstriker-genesis/test.py ADDED
@@ -0,0 +1,405 @@
1
+ import retro
2
+ import gym
3
+ import math
4
+ import random
5
+ import numpy as np
6
+ import matplotlib
7
+ import matplotlib.pyplot as plt
8
+ from collections import namedtuple, deque
9
+ from itertools import count
10
+ from gym import spaces
11
+
12
+ import torch
13
+ import torch.nn as nn
14
+ import torch.optim as optim
15
+ import torch.nn.functional as F
16
+ import cv2
18
+ from torch.utils.tensorboard import SummaryWriter
19
+
20
+
21
+ class MaxAndSkipEnv(gym.Wrapper):
22
+ def __init__(self, env, skip=4):
23
+ """Return only every `skip`-th frame"""
24
+ gym.Wrapper.__init__(self, env)
25
+ # most recent raw observations (for max pooling across time steps)
26
+ self._obs_buffer = np.zeros((2,)+env.observation_space.shape, dtype=np.uint8)
27
+ self._skip = skip
28
+
29
+ def step(self, action):
30
+ """Repeat action, sum reward, and max over last observations."""
31
+ total_reward = 0.0
32
+ done = None
33
+ for i in range(self._skip):
34
+ obs, reward, done, info = self.env.step(action)
35
+ if i == self._skip - 2: self._obs_buffer[0] = obs
36
+ if i == self._skip - 1: self._obs_buffer[1] = obs
37
+ total_reward += reward
38
+ if done:
39
+ break
40
+ # Note that the observation on the done=True frame
41
+ # doesn't matter
42
+ max_frame = self._obs_buffer.max(axis=0)
43
+
44
+ return max_frame, total_reward, done, info
45
+
46
+ def reset(self, **kwargs):
47
+ return self.env.reset(**kwargs)
48
+
49
+
50
+ class LazyFrames(object):
51
+ def __init__(self, frames):
52
+ """This object ensures that common frames between the observations are only stored once.
53
+ It exists purely to optimize memory usage which can be huge for DQN's 1M frames replay
54
+ buffers.
55
+ This object should only be converted to numpy array before being passed to the model.
56
+ You'd not believe how complex the previous solution was."""
57
+ self._frames = frames
58
+ self._out = None
59
+
60
+ def _force(self):
61
+ if self._out is None:
62
+ self._out = np.concatenate(self._frames, axis=2)
63
+ self._frames = None
64
+ return self._out
65
+
66
+ def __array__(self, dtype=None):
67
+ out = self._force()
68
+ if dtype is not None:
69
+ out = out.astype(dtype)
70
+ return out
71
+
72
+ def __len__(self):
73
+ return len(self._force())
74
+
75
+ def __getitem__(self, i):
76
+ return self._force()[i]
77
+
78
+
79
+ class FrameStack(gym.Wrapper):
80
+ def __init__(self, env, k):
81
+ """Stack k last frames.
82
+ Returns lazy array, which is much more memory efficient.
83
+ See Also
84
+ --------
85
+ baselines.common.atari_wrappers.LazyFrames
86
+ """
87
+ gym.Wrapper.__init__(self, env)
88
+ self.k = k
89
+ self.frames = deque([], maxlen=k)
90
+ shp = env.observation_space.shape
91
+ self.observation_space = spaces.Box(low=0, high=255, shape=(shp[0], shp[1], shp[2] * k), dtype=env.observation_space.dtype)
92
+
93
+ def reset(self):
94
+ ob = self.env.reset()
95
+ for _ in range(self.k):
96
+ self.frames.append(ob)
97
+ return self._get_ob()
98
+
99
+ def step(self, action):
100
+ ob, reward, done, info = self.env.step(action)
101
+ self.frames.append(ob)
102
+ return self._get_ob(), reward, done, info
103
+
104
+ def _get_ob(self):
105
+ assert len(self.frames) == self.k
106
+ return LazyFrames(list(self.frames))
107
+
108
+ class ClipRewardEnv(gym.RewardWrapper):
109
+ def __init__(self, env):
110
+ gym.RewardWrapper.__init__(self, env)
111
+
112
+ def reward(self, reward):
113
+ """Bin reward to {+1, 0, -1} by its sign."""
114
+ return np.sign(reward)
115
+
116
+
117
+ class ImageToPyTorch(gym.ObservationWrapper):
118
+ def __init__(self, env):
119
+ super(ImageToPyTorch, self).__init__(env)
120
+ old_shape = self.observation_space.shape
121
+ self.observation_space = gym.spaces.Box(low=0.0, high=1.0, shape=(old_shape[-1], old_shape[0], old_shape[1]), dtype=np.float32)
122
+
123
+ def observation(self, observation):
124
+ return np.moveaxis(observation, 2, 0)
125
+
126
+
127
+ class WarpFrame(gym.ObservationWrapper):
128
+ def __init__(self, env):
129
+ """Warp frames to 84x84 as done in the Nature paper and later work."""
130
+ gym.ObservationWrapper.__init__(self, env)
131
+ self.width = 84
132
+ self.height = 84
133
+ self.observation_space = spaces.Box(low=0, high=255,
134
+ shape=(self.height, self.width, 1), dtype=np.uint8)
135
+
136
+ def observation(self, frame):
137
+ frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
138
+ frame = cv2.resize(frame, (self.width, self.height), interpolation=cv2.INTER_AREA)
139
+ return frame[:, :, None]
140
+
141
+ class AirstrikerDiscretizer(gym.ActionWrapper):
142
+ # Initialization
143
+ def __init__(self, env):
144
+ super(AirstrikerDiscretizer, self).__init__(env)
145
+ buttons = ['B', 'A', 'MODE', 'START', 'UP', 'DOWN', 'LEFT', 'RIGHT', 'C', 'Y', 'X', 'Z']
146
+ actions = [['LEFT'], ['RIGHT'], ['B']]
147
+ self._actions = []
148
+ for action in actions:
149
+ arr = np.array([False] * 12)
150
+ for button in action:
151
+ arr[buttons.index(button)] = True
152
+ self._actions.append(arr)
153
+ self.action_space = gym.spaces.Discrete(len(self._actions))
154
+
155
+ # Get the action
156
+ def action(self, a):
157
+ return self._actions[a].copy()
158
+
159
+
160
+ env = retro.make(game='Airstriker-Genesis')
161
+ env = MaxAndSkipEnv(env) ## Return only every `skip`-th frame
162
+ env = WarpFrame(env) ## Reshape image
163
+ env = ImageToPyTorch(env) ## Invert shape
164
+ env = FrameStack(env, 4) ## Stack last 4 frames
165
+ # env = ScaledFloatFrame(env) ## Scale frames
166
+ env = AirstrikerDiscretizer(env)
167
+ env = ClipRewardEnv(env)
168
+
169
+ # set up matplotlib
170
+ is_ipython = 'inline' in matplotlib.get_backend()
171
+ if is_ipython:
172
+ from IPython import display
173
+
174
+ plt.ion()
175
+
176
+ # if gpu is to be used
177
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
178
+
179
+ Transition = namedtuple('Transition',
180
+ ('state', 'action', 'next_state', 'reward'))
181
+
182
+
183
+ class ReplayMemory(object):
184
+
185
+ def __init__(self, capacity):
186
+ self.memory = deque([],maxlen=capacity)
187
+
188
+ def push(self, *args):
189
+ """Save a transition"""
190
+ self.memory.append(Transition(*args))
191
+
192
+ def sample(self, batch_size):
193
+ return random.sample(self.memory, batch_size)
194
+
195
+ def __len__(self):
196
+ return len(self.memory)
197
+
198
+
199
+ class DQN(nn.Module):
200
+
201
+ def __init__(self, n_observations, n_actions):
202
+ super(DQN, self).__init__()
203
+ # self.layer1 = nn.Linear(n_observations, 128)
204
+ # self.layer2 = nn.Linear(128, 128)
205
+ # self.layer3 = nn.Linear(128, n_actions)
206
+
207
+ self.layer1 = nn.Conv2d(in_channels=n_observations, out_channels=32, kernel_size=8, stride=4)
208
+ self.layer2 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=4, stride=2)
209
+ self.layer3 = nn.Sequential(nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, stride=1), nn.ReLU(), nn.Flatten())
210
+ self.layer4 = nn.Linear(17024, 512)
211
+ self.layer5 = nn.Linear(512, n_actions)
212
+
213
+ # Called with either one element to determine next action, or a batch
214
+ # during optimization. Returns tensor([[left0exp,right0exp]...]).
215
+ def forward(self, x):
216
+ x = F.relu(self.layer1(x))
217
+ x = F.relu(self.layer2(x))
218
+ x = F.relu(self.layer3(x))
219
+ x = F.relu(self.layer4(x))
220
+ return self.layer5(x)
221
+
222
+
223
+ # BATCH_SIZE is the number of transitions sampled from the replay buffer
224
+ # GAMMA is the discount factor as mentioned in the previous section
225
+ # EPS_START is the starting value of epsilon
226
+ # EPS_END is the final value of epsilon
227
+ # EPS_DECAY controls the rate of exponential decay of epsilon, higher means a slower decay
228
+ # TAU is the update rate of the target network
229
+ # LR is the learning rate of the AdamW optimizer
230
+ BATCH_SIZE = 512
231
+ GAMMA = 0.99
232
+ EPS_START = 1
233
+ EPS_END = 0.01
234
+ EPS_DECAY = 10000
235
+ TAU = 0.005
236
+ # LR = 1e-4
237
+ LR = 0.00025
238
+
239
+ # Get number of actions from gym action space
240
+ n_actions = env.action_space.n
241
+ state = env.reset()
242
+ n_observations = len(state)
243
+
244
+ policy_net = DQN(n_observations, n_actions).to(device)
245
+ target_net = DQN(n_observations, n_actions).to(device)
246
+ target_net.load_state_dict(policy_net.state_dict())
247
+
248
+ optimizer = optim.AdamW(policy_net.parameters(), lr=LR, amsgrad=True)
249
+ memory = ReplayMemory(10000)
250
+
251
+
252
+ steps_done = 0
253
+
254
+
255
+ def select_action(state):
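+ # Epsilon-greedy with exponential decay: epsilon anneals from EPS_START towards EPS_END with a time constant of EPS_DECAY steps.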
256
+ global steps_done
257
+ sample = random.random()
258
+ eps_threshold = EPS_END + (EPS_START - EPS_END) * math.exp(-1. * steps_done / EPS_DECAY)
259
+ steps_done += 1
260
+ if sample > eps_threshold:
261
+ with torch.no_grad():
262
+ # t.max(1) will return largest column value of each row.
263
+ # second column on max result is index of where max element was
264
+ # found, so we pick action with the larger expected reward.
265
+ return policy_net(state).max(1)[1].view(1, 1), eps_threshold
266
+ else:
267
+ return torch.tensor([[env.action_space.sample()]], device=device, dtype=torch.long), eps_threshold
268
+
269
+
270
+ episode_durations = []
271
+
272
+
273
+ def plot_durations(show_result=False):
274
+ plt.figure(1)
275
+ durations_t = torch.tensor(episode_durations, dtype=torch.float)
276
+ if show_result:
277
+ plt.title('Result')
278
+ else:
279
+ plt.clf()
280
+ plt.title('Training...')
281
+ plt.xlabel('Episode')
282
+ plt.ylabel('Duration')
283
+ plt.plot(durations_t.numpy())
284
+ # Take 100 episode averages and plot them too
285
+ if len(durations_t) >= 100:
286
+ means = durations_t.unfold(0, 100, 1).mean(1).view(-1)
287
+ means = torch.cat((torch.zeros(99), means))
288
+ plt.plot(means.numpy())
289
+
290
+ plt.pause(0.001) # pause a bit so that plots are updated
291
+ if is_ipython:
292
+ if not show_result:
293
+ display.display(plt.gcf())
294
+ display.clear_output(wait=True)
295
+ else:
296
+ display.display(plt.gcf())
297
+
298
+
299
+
300
+ def optimize_model():
301
+ if len(memory) < BATCH_SIZE:
302
+ return
303
+ transitions = memory.sample(BATCH_SIZE)
304
+ # Transpose the batch (see https://stackoverflow.com/a/19343/3343043 for
305
+ # detailed explanation). This converts batch-array of Transitions
306
+ # to Transition of batch-arrays.
307
+ batch = Transition(*zip(*transitions))
308
+
309
+ # Compute a mask of non-final states and concatenate the batch elements
310
+ # (a final state would've been the one after which simulation ended)
311
+ non_final_mask = torch.tensor(tuple(map(lambda s: s is not None,
312
+ batch.next_state)), device=device, dtype=torch.bool)
313
+ non_final_next_states = torch.cat([s for s in batch.next_state
314
+ if s is not None])
315
+ state_batch = torch.cat(batch.state)
316
+ action_batch = torch.cat(batch.action)
317
+ reward_batch = torch.cat(batch.reward)
318
+
319
+ # Compute Q(s_t, a) - the model computes Q(s_t), then we select the
320
+ # columns of actions taken. These are the actions which would've been taken
321
+ # for each batch state according to policy_net
322
+ state_action_values = policy_net(state_batch).gather(1, action_batch)
323
+
324
+ # Compute V(s_{t+1}) for all next states.
325
+ # Expected values of actions for non_final_next_states are computed based
326
+ # on the "older" target_net; selecting their best reward with max(1)[0].
327
+ # This is merged based on the mask, such that we'll have either the expected
328
+ # state value or 0 in case the state was final.
329
+ next_state_values = torch.zeros(BATCH_SIZE, device=device)
330
+ with torch.no_grad():
331
+ next_state_values[non_final_mask] = target_net(non_final_next_states).max(1)[0]
332
+ # Compute the expected Q values
333
+ expected_state_action_values = (next_state_values * GAMMA) + reward_batch
334
+
335
+ # Compute Huber loss
336
+ criterion = nn.SmoothL1Loss()
337
+ loss = criterion(state_action_values, expected_state_action_values.unsqueeze(1))
338
+
339
+ # Optimize the model
340
+ optimizer.zero_grad()
341
+ loss.backward()
342
+ # In-place gradient clipping
343
+ torch.nn.utils.clip_grad_value_(policy_net.parameters(), 100)
344
+ optimizer.step()
345
+
346
+
347
+ with SummaryWriter() as writer:
348
+ if torch.cuda.is_available():
349
+ num_episodes = 600
350
+ else:
351
+ num_episodes = 50
352
+ epsilon = 1
353
+ episode_rewards = []
354
+ for i_episode in range(num_episodes):
355
+
356
+ # Initialize the environment and get it's state
357
+ state = env.reset()
358
+ state = torch.tensor(state, dtype=torch.float32, device=device).unsqueeze(0)
359
+ episode_reward = 0
360
+ for t in count():
361
+ action, epsilon = select_action(state)
362
+ observation, reward, done, info = env.step(action.item())
363
+ reward = torch.tensor([reward], device=device)
364
+
365
+ done = done or info["gameover"] == 1
366
+ if done:
367
+ episode_durations.append(t + 1)
368
+ print(f"Episode {i_episode} done")
369
+ # plot_durations()
370
+ break
371
+ # if done:
372
+ # next_state = None
373
+ # else:
374
+ # next_state = torch.tensor(observation, dtype=torch.float32, device=device).unsqueeze(0)
375
+
376
+ next_state = torch.tensor(observation, dtype=torch.float32, device=device).unsqueeze(0)
377
+
378
+ # Store the transition in memory
379
+ memory.push(state, action, next_state, reward)
380
+ episode_reward += reward
381
+ # Move to the next state
382
+ state = next_state
383
+
384
+ # Perform one step of the optimization (on the policy network)
385
+ optimize_model()
386
+
387
+ # Soft update of the target network's weights
388
+ # θ′ ← τ θ + (1 −τ )θ′
389
+ target_net_state_dict = target_net.state_dict()
390
+ policy_net_state_dict = policy_net.state_dict()
391
+ for key in policy_net_state_dict:
392
+ target_net_state_dict[key] = policy_net_state_dict[key]*TAU + target_net_state_dict[key]*(1-TAU)
393
+ target_net.load_state_dict(target_net_state_dict)
394
+ # if done:
395
+ # episode_durations.append(t + 1)
396
+ # # plot_durations()
397
+ # break
398
+ # episode_rewards.append(episode_reward)
399
+ writer.add_scalar("Rewards/Episode", episode_reward, i_episode)
400
+ writer.add_scalar("Epsilon", epsilon, i_episode)
401
+ writer.flush()
402
+ print('Complete')
403
+ plot_durations(show_result=True)
404
+ plt.ioff()
405
+ plt.show()
src/airstriker-genesis/utils.py ADDED
@@ -0,0 +1,22 @@
1
+ import gym
2
+ import numpy as np
3
+
4
+
5
+ # Airstriker wrapper
6
+ class AirstrikerDiscretizer(gym.ActionWrapper):
7
+ # Initialization
8
+ def __init__(self, env):
9
+ super(AirstrikerDiscretizer, self).__init__(env)
10
+ buttons = ['B', 'A', 'MODE', 'START', 'UP', 'DOWN', 'LEFT', 'RIGHT', 'C', 'Y', 'X', 'Z']
11
+ actions = [['LEFT'], ['RIGHT'], ['B']]
12
+ self._actions = []
13
+ for action in actions:
14
+ arr = np.array([False] * 12)
15
+ for button in action:
16
+ arr[buttons.index(button)] = True
17
+ self._actions.append(arr)
18
+ self.action_space = gym.spaces.Discrete(len(self._actions))
19
+
20
+ # Get the action
21
+ def action(self, a):
22
+ return self._actions[a].copy()
src/airstriker-genesis/wrappers.py ADDED
@@ -0,0 +1,213 @@
1
+ import numpy as np
2
+ import os
3
+ from collections import deque
4
+ import gym
5
+ from gym import spaces
6
+ import cv2
7
+ import retro
8
+ from utils import AirstrikerDiscretizer
9
+
10
+
11
+ '''
12
+ Atari Wrapper copied from https://github.com/openai/baselines/blob/master/baselines/common/atari_wrappers.py
13
+ '''
14
+
15
+
16
+ class LazyFrames(object):
17
+ def __init__(self, frames):
18
+ """This object ensures that common frames between the observations are only stored once.
19
+ It exists purely to optimize memory usage which can be huge for DQN's 1M frames replay
20
+ buffers.
21
+ This object should only be converted to numpy array before being passed to the model.
22
+ You'd not believe how complex the previous solution was."""
23
+ self._frames = frames
24
+ self._out = None
25
+
26
+ def _force(self):
27
+ if self._out is None:
28
+ self._out = np.concatenate(self._frames, axis=2)
29
+ self._frames = None
30
+ return self._out
31
+
32
+ def __array__(self, dtype=None):
33
+ out = self._force()
34
+ if dtype is not None:
35
+ out = out.astype(dtype)
36
+ return out
37
+
38
+ def __len__(self):
39
+ return len(self._force())
40
+
41
+ def __getitem__(self, i):
42
+ return self._force()[i]
43
+
44
+ class FireResetEnv(gym.Wrapper):
45
+ def __init__(self, env):
46
+ """Take action on reset for environments that are fixed until firing."""
47
+ gym.Wrapper.__init__(self, env)
48
+ assert env.unwrapped.get_action_meanings()[1] == 'FIRE'
49
+ assert len(env.unwrapped.get_action_meanings()) >= 3
50
+
51
+ def reset(self, **kwargs):
52
+ self.env.reset(**kwargs)
53
+ obs, _, done, _ = self.env.step(1)
54
+ if done:
55
+ self.env.reset(**kwargs)
56
+ obs, _, done, _ = self.env.step(2)
57
+ if done:
58
+ self.env.reset(**kwargs)
59
+ return obs
60
+
61
+ def step(self, ac):
62
+ return self.env.step(ac)
63
+
64
+
65
+ class MaxAndSkipEnv(gym.Wrapper):
66
+ def __init__(self, env, skip=4):
67
+ """Return only every `skip`-th frame"""
68
+ gym.Wrapper.__init__(self, env)
69
+ # most recent raw observations (for max pooling across time steps)
70
+ self._obs_buffer = np.zeros((2,)+env.observation_space.shape, dtype=np.uint8)
71
+ self._skip = skip
72
+
73
+ def step(self, action):
74
+ """Repeat action, sum reward, and max over last observations."""
75
+ total_reward = 0.0
76
+ done = None
77
+ for i in range(self._skip):
78
+ obs, reward, done, info = self.env.step(action)
79
+ if i == self._skip - 2: self._obs_buffer[0] = obs
80
+ if i == self._skip - 1: self._obs_buffer[1] = obs
81
+ total_reward += reward
82
+ if done:
83
+ break
84
+ # Note that the observation on the done=True frame
85
+ # doesn't matter
86
+ max_frame = self._obs_buffer.max(axis=0)
87
+
88
+ return max_frame, total_reward, done, info
89
+
90
+ def reset(self, **kwargs):
91
+ return self.env.reset(**kwargs)
92
+
93
+
94
+
95
+ class WarpFrame(gym.ObservationWrapper):
96
+ def __init__(self, env):
97
+ """Warp frames to 84x84 as done in the Nature paper and later work."""
98
+ gym.ObservationWrapper.__init__(self, env)
99
+ self.width = 84
100
+ self.height = 84
101
+ self.observation_space = spaces.Box(low=0, high=255,
102
+ shape=(self.height, self.width, 1), dtype=np.uint8)
103
+
104
+ def observation(self, frame):
105
+ frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
106
+ frame = cv2.resize(frame, (self.width, self.height), interpolation=cv2.INTER_AREA)
107
+ return frame[:, :, None]
108
+
109
+ class WarpFrameNoResize(gym.ObservationWrapper):
110
+ def __init__(self, env):
111
+ """Warp frames to 84x84 as done in the Nature paper and later work."""
112
+ gym.ObservationWrapper.__init__(self, env)
113
+
114
+ def observation(self, frame):
115
+ frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
116
+ # frame = cv2.resize(frame, (self.width, self.height), interpolation=cv2.INTER_AREA)
117
+ return frame[:, :, None]
118
+
119
+
120
+
121
+ class FrameStack(gym.Wrapper):
122
+ def __init__(self, env, k):
123
+ """Stack k last frames.
124
+ Returns lazy array, which is much more memory efficient.
125
+ See Also
126
+ --------
127
+ baselines.common.atari_wrappers.LazyFrames
128
+ """
129
+ gym.Wrapper.__init__(self, env)
130
+ self.k = k
131
+ self.frames = deque([], maxlen=k)
132
+ shp = env.observation_space.shape
133
+ self.observation_space = spaces.Box(low=0, high=255, shape=(shp[0], shp[1], shp[2] * k), dtype=env.observation_space.dtype)
134
+
135
+ def reset(self):
136
+ ob = self.env.reset()
137
+ for _ in range(self.k):
138
+ self.frames.append(ob)
139
+ return self._get_ob()
140
+
141
+ def step(self, action):
142
+ ob, reward, done, info = self.env.step(action)
143
+ self.frames.append(ob)
144
+ return self._get_ob(), reward, done, info
145
+
146
+ def _get_ob(self):
147
+ assert len(self.frames) == self.k
148
+ return LazyFrames(list(self.frames))
149
+
150
+
151
+ class ImageToPyTorch(gym.ObservationWrapper):
152
+ def __init__(self, env):
153
+ super(ImageToPyTorch, self).__init__(env)
154
+ old_shape = self.observation_space.shape
155
+ self.observation_space = gym.spaces.Box(low=0.0, high=1.0, shape=(old_shape[-1], old_shape[0], old_shape[1]), dtype=np.float32)
156
+
157
+ def observation(self, observation):
158
+ return np.moveaxis(observation, 2, 0)
159
+
160
+
161
+ # class ImageToPyTorch(gym.ObservationWrapper):
162
+ # def __init__(self, env):
163
+ # super(ImageToPyTorch, self).__init__(env)
164
+ # old_shape = self.observation_space.shape
165
+ # new_shape = (old_shape[-1], old_shape[0], old_shape[1])
166
+ # print("Old: ", old_shape)
167
+ # print("New: ", new_shape)
168
+ # self.observation_space = gym.spaces.Box(low=0.0, high=1.0, shape=new_shape, dtype=np.float32)
169
+
170
+ # def observation(self, observation):
171
+ # return np.moveaxis(observation, 2, 0)
172
+
173
+
174
+ class ScaledFloatFrame(gym.ObservationWrapper):
175
+ def __init__(self, env):
176
+ gym.ObservationWrapper.__init__(self, env)
177
+ self.observation_space = gym.spaces.Box(low=0, high=1, shape=env.observation_space.shape, dtype=np.float32)
178
+
179
+ def observation(self, observation):
180
+ # careful! This undoes the memory optimization, use
181
+ # with smaller replay buffers only.
182
+ return np.array(observation).astype(np.float32) / 255.0
183
+
184
+ class ClipRewardEnv(gym.RewardWrapper):
185
+ def __init__(self, env):
186
+ gym.RewardWrapper.__init__(self, env)
187
+
188
+ def reward(self, reward):
189
+ """Bin reward to {+1, 0, -1} by its sign."""
190
+ return np.sign(reward)
191
+
192
+
193
+ def make_env():
194
+
195
+ env = retro.make(game='Airstriker-Genesis')
196
+ env = MaxAndSkipEnv(env) ## Return only every `skip`-th frame
197
+ env = WarpFrame(env) ## Reshape image
198
+ env = ImageToPyTorch(env) ## Invert shape
199
+ env = FrameStack(env, 4) ## Stack last 4 frames
200
+ env = ScaledFloatFrame(env) ## Scale frames
201
+ env = AirstrikerDiscretizer(env)
202
+ env = ClipRewardEnv(env)
203
+ return env
204
+
205
+ def make_starpilot(render=False):
206
+ if render:
207
+ env = gym.make("procgen:procgen-starpilot-v0", distribution_mode="easy", render_mode="human")
208
+ else:
209
+ env = gym.make("procgen:procgen-starpilot-v0", distribution_mode="easy")
210
+ env = WarpFrameNoResize(env) ## Reshape image
211
+ env = ImageToPyTorch(env) ## Invert shape
212
+ env = FrameStack(env, 4) ## Stack last 4 frames
213
+ return env
src/lunar-lander/agent.py ADDED
@@ -0,0 +1,1104 @@
1
+ import torch
2
+ import numpy as np
3
+ import random
4
+ import torch.nn as nn
5
+ import copy
6
+ import time, datetime
7
+ import matplotlib.pyplot as plt
8
+ from collections import deque
9
+ from torch.utils.tensorboard import SummaryWriter
10
+
11
+
12
+ class DQNet(nn.Module):
13
+ """mini cnn structure"""
14
+
15
+ def __init__(self, input_dim, output_dim):
16
+ super().__init__()
17
+
18
+ self.online = nn.Sequential(
19
+ nn.Linear(input_dim, 150),
20
+ nn.ReLU(),
21
+ nn.Linear(150, 120),
22
+ nn.ReLU(),
23
+ nn.Linear(120, output_dim),
24
+ )
25
+
26
+
27
+ self.target = copy.deepcopy(self.online)
28
+
29
+ # Q_target parameters are frozen.
30
+ for p in self.target.parameters():
31
+ p.requires_grad = False
32
+
33
+ def forward(self, input, model):
34
+ if model == "online":
35
+ return self.online(input)
36
+ elif model == "target":
37
+ return self.target(input)
38
+
39
+
40
+
41
+ class MetricLogger:
42
+ def __init__(self, save_dir):
43
+ self.writer = SummaryWriter(log_dir=save_dir)
44
+ self.save_log = save_dir / "log"
45
+ with open(self.save_log, "w") as f:
46
+ f.write(
47
+ f"{'Episode':>8}{'Step':>8}{'Epsilon':>10}{'MeanReward':>15}"
48
+ f"{'MeanLength':>15}{'MeanLoss':>15}{'MeanQValue':>15}"
49
+ f"{'TimeDelta':>15}{'Time':>20}\n"
50
+ )
51
+ self.ep_rewards_plot = save_dir / "reward_plot.jpg"
52
+ self.ep_lengths_plot = save_dir / "length_plot.jpg"
53
+ self.ep_avg_losses_plot = save_dir / "loss_plot.jpg"
54
+ self.ep_avg_qs_plot = save_dir / "q_plot.jpg"
55
+
56
+ # History metrics
57
+ self.ep_rewards = []
58
+ self.ep_lengths = []
59
+ self.ep_avg_losses = []
60
+ self.ep_avg_qs = []
61
+
62
+ # Moving averages, added for every call to record()
63
+ self.moving_avg_ep_rewards = []
64
+ self.moving_avg_ep_lengths = []
65
+ self.moving_avg_ep_avg_losses = []
66
+ self.moving_avg_ep_avg_qs = []
67
+
68
+ # Current episode metric
69
+ self.init_episode()
70
+
71
+ # Timing
72
+ self.record_time = time.time()
73
+
74
+ def log_step(self, reward, loss, q):
75
+ self.curr_ep_reward += reward
76
+ self.curr_ep_length += 1
77
+ if loss:
78
+ self.curr_ep_loss += loss
79
+ self.curr_ep_q += q
80
+ self.curr_ep_loss_length += 1
81
+
82
+ def log_episode(self, episode_number):
83
+ "Mark end of episode"
84
+ self.ep_rewards.append(self.curr_ep_reward)
85
+ self.ep_lengths.append(self.curr_ep_length)
86
+ if self.curr_ep_loss_length == 0:
87
+ ep_avg_loss = 0
88
+ ep_avg_q = 0
89
+ else:
90
+ ep_avg_loss = np.round(self.curr_ep_loss / self.curr_ep_loss_length, 5)
91
+ ep_avg_q = np.round(self.curr_ep_q / self.curr_ep_loss_length, 5)
92
+ self.ep_avg_losses.append(ep_avg_loss)
93
+ self.ep_avg_qs.append(ep_avg_q)
94
+ self.writer.add_scalar("Avg Loss for episode", ep_avg_loss, episode_number)
95
+ self.writer.add_scalar("Avg Q value for episode", ep_avg_q, episode_number)
96
+ self.writer.flush()
97
+ self.init_episode()
98
+
99
+ def init_episode(self):
100
+ self.curr_ep_reward = 0.0
101
+ self.curr_ep_length = 0
102
+ self.curr_ep_loss = 0.0
103
+ self.curr_ep_q = 0.0
104
+ self.curr_ep_loss_length = 0
105
+
106
+ def record(self, episode, epsilon, step):
107
+ mean_ep_reward = np.round(np.mean(self.ep_rewards[-100:]), 3)
108
+ mean_ep_length = np.round(np.mean(self.ep_lengths[-100:]), 3)
109
+ mean_ep_loss = np.round(np.mean(self.ep_avg_losses[-100:]), 3)
110
+ mean_ep_q = np.round(np.mean(self.ep_avg_qs[-100:]), 3)
111
+ self.moving_avg_ep_rewards.append(mean_ep_reward)
112
+ self.moving_avg_ep_lengths.append(mean_ep_length)
113
+ self.moving_avg_ep_avg_losses.append(mean_ep_loss)
114
+ self.moving_avg_ep_avg_qs.append(mean_ep_q)
115
+
116
+ last_record_time = self.record_time
117
+ self.record_time = time.time()
118
+ time_since_last_record = np.round(self.record_time - last_record_time, 3)
119
+
120
+ print(
121
+ f"Episode {episode} - "
122
+ f"Step {step} - "
123
+ f"Epsilon {epsilon} - "
124
+ f"Mean Reward {mean_ep_reward} - "
125
+ f"Mean Length {mean_ep_length} - "
126
+ f"Mean Loss {mean_ep_loss} - "
127
+ f"Mean Q Value {mean_ep_q} - "
128
+ f"Time Delta {time_since_last_record} - "
129
+ f"Time {datetime.datetime.now().strftime('%Y-%m-%dT%H:%M:%S')}"
130
+ )
131
+ self.writer.add_scalar("Mean reward last 100 episodes", mean_ep_reward, episode)
132
+ self.writer.add_scalar("Mean length last 100 episodes", mean_ep_length, episode)
133
+ self.writer.add_scalar("Mean loss last 100 episodes", mean_ep_loss, episode)
135
+ self.writer.add_scalar("Epsilon value", epsilon, episode)
136
+ self.writer.add_scalar("Mean Q Value last 100 episodes", mean_ep_q, episode)
137
+ self.writer.flush()
138
+ with open(self.save_log, "a") as f:
139
+ f.write(
140
+ f"{episode:8d}{step:8d}{epsilon:10.3f}"
141
+ f"{mean_ep_reward:15.3f}{mean_ep_length:15.3f}{mean_ep_loss:15.3f}{mean_ep_q:15.3f}"
142
+ f"{time_since_last_record:15.3f}"
143
+ f"{datetime.datetime.now().strftime('%Y-%m-%dT%H:%M:%S'):>20}\n"
144
+ )
145
+
146
+ for metric in ["ep_rewards", "ep_lengths", "ep_avg_losses", "ep_avg_qs"]:
147
+ plt.plot(getattr(self, f"moving_avg_{metric}"))
148
+ plt.savefig(getattr(self, f"{metric}_plot"))
149
+ plt.clf()
150
+
151
+
152
+ class DQNAgent:
153
+ def __init__(self,
154
+ state_dim,
155
+ action_dim,
156
+ save_dir,
157
+ checkpoint=None,
158
+ learning_rate=0.00025,
159
+ max_memory_size=100000,
160
+ batch_size=32,
161
+ exploration_rate=1,
162
+ exploration_rate_decay=0.9999999,
163
+ exploration_rate_min=0.1,
164
+ training_frequency=1,
165
+ learning_starts=1000,
166
+ target_network_sync_frequency=500,
167
+ reset_exploration_rate=False,
168
+ save_frequency=100000,
169
+ gamma=0.9,
170
+ load_replay_buffer=True):
171
+ self.state_dim = state_dim
172
+ self.action_dim = action_dim
173
+ self.max_memory_size = max_memory_size
174
+ self.memory = deque(maxlen=max_memory_size)
175
+ self.batch_size = batch_size
176
+
177
+ self.exploration_rate = exploration_rate
178
+ self.exploration_rate_decay = exploration_rate_decay
179
+ self.exploration_rate_min = exploration_rate_min
180
+ self.gamma = gamma
181
+
182
+ self.curr_step = 0
183
+ self.learning_starts = learning_starts # min. experiences before training
184
+
185
+ self.training_frequency = training_frequency # no. of experiences between updates to Q_online
186
+ self.target_network_sync_frequency = target_network_sync_frequency # no. of experiences between Q_target & Q_online sync
187
+
188
+ self.save_every = save_frequency # no. of experiences between saving the network
189
+ self.save_dir = save_dir
190
+
191
+ self.use_cuda = torch.cuda.is_available()
192
+
193
+ self.net = DQNet(self.state_dim, self.action_dim).float()
194
+ if self.use_cuda:
195
+ self.net = self.net.to(device='cuda')
196
+ if checkpoint:
197
+ self.load(checkpoint, reset_exploration_rate, load_replay_buffer)
198
+
199
+ self.optimizer = torch.optim.AdamW(self.net.parameters(), lr=learning_rate, amsgrad=True)
200
+ self.loss_fn = torch.nn.SmoothL1Loss()
201
+ # self.optimizer = torch.optim.Adam(self.net.parameters(), lr=learning_rate)
202
+ # self.loss_fn = torch.nn.MSELoss()
203
+
204
+
205
+ def act(self, state):
206
+ """
207
+ Given a state, choose an epsilon-greedy action and update value of step.
208
+
209
+ Inputs:
210
+ state(LazyFrame): A single observation of the current state, dimension is (state_dim)
211
+ Outputs:
212
+ action_idx (int): An integer representing which action the agent will perform
213
+ """
214
+ # EXPLORE
215
+ if np.random.rand() < self.exploration_rate:
216
+ action_idx = np.random.randint(self.action_dim)
217
+
218
+ # EXPLOIT
219
+ else:
220
+ state = torch.FloatTensor(state).cuda() if self.use_cuda else torch.FloatTensor(state)
221
+ state = state.unsqueeze(0)
222
+ action_values = self.net(state, model='online')
223
+ action_idx = torch.argmax(action_values, axis=1).item()
224
+
225
+ # decrease exploration_rate
226
+
227
+ self.exploration_rate *= self.exploration_rate_decay
228
+ self.exploration_rate = max(self.exploration_rate_min, self.exploration_rate)
229
+
230
+ # increment step
231
+ self.curr_step += 1
232
+ return action_idx
233
+
234
+ def cache(self, state, next_state, action, reward, done):
235
+ """
236
+ Store the experience to self.memory (replay buffer)
237
+
238
+ Inputs:
239
+ state (LazyFrame),
240
+ next_state (LazyFrame),
241
+ action (int),
242
+ reward (float),
243
+ done(bool))
244
+ """
245
+ state = torch.FloatTensor(state).cuda() if self.use_cuda else torch.FloatTensor(state)
246
+ next_state = torch.FloatTensor(next_state).cuda() if self.use_cuda else torch.FloatTensor(next_state)
247
+ action = torch.LongTensor([action]).cuda() if self.use_cuda else torch.LongTensor([action])
248
+ reward = torch.DoubleTensor([reward]).cuda() if self.use_cuda else torch.DoubleTensor([reward])
249
+ done = torch.BoolTensor([done]).cuda() if self.use_cuda else torch.BoolTensor([done])
250
+
251
+ self.memory.append( (state, next_state, action, reward, done,) )
252
+
253
+
254
+ def recall(self):
255
+ """
256
+ Retrieve a batch of experiences from memory
257
+ """
258
+ batch = random.sample(self.memory, self.batch_size)
259
+ state, next_state, action, reward, done = map(torch.stack, zip(*batch))
260
+ return state, next_state, action.squeeze(), reward.squeeze(), done.squeeze()
261
+
262
+
263
+ def td_estimate(self, states, actions):
264
+ actions = actions.reshape(-1, 1)
265
+ predicted_qs = self.net(states, model='online')# Q_online(s,a)
266
+ predicted_qs = predicted_qs.gather(1, actions)
267
+ return predicted_qs
268
+
269
+
270
+ @torch.no_grad()
271
+ def td_target(self, rewards, next_states, dones):
272
+ rewards = rewards.reshape(-1, 1)
273
+ dones = dones.reshape(-1, 1)
274
+ target_qs = self.net(next_states, model='target')
275
+ target_qs = torch.max(target_qs, dim=1).values
276
+ target_qs = target_qs.reshape(-1, 1)
277
+ target_qs[dones] = 0.0
278
+ return (rewards + (self.gamma * target_qs))
279
+
280
+ def update_Q_online(self, td_estimate, td_target):
281
+ loss = self.loss_fn(td_estimate.float(), td_target.float())
282
+ self.optimizer.zero_grad()
283
+ loss.backward()
284
+ self.optimizer.step()
285
+ return loss.item()
286
+
287
+
288
+ def sync_Q_target(self):
289
+ self.net.target.load_state_dict(self.net.online.state_dict())
290
+
291
+
292
+ def learn(self):
293
+ if self.curr_step % self.target_network_sync_frequency == 0:
294
+ self.sync_Q_target()
295
+
296
+ if self.curr_step % self.save_every == 0:
297
+ self.save()
298
+
299
+ if self.curr_step < self.learning_starts:
300
+ return None, None
301
+
302
+ if self.curr_step % self.training_frequency != 0:
303
+ return None, None
304
+
305
+ # Sample from memory
306
+ state, next_state, action, reward, done = self.recall()
307
+
308
+ # Get TD Estimate
309
+ td_est = self.td_estimate(state, action)
310
+
311
+ # Get TD Target
312
+ td_tgt = self.td_target(reward, next_state, done)
313
+
314
+ # Backpropagate loss through Q_online
315
+
316
+ loss = self.update_Q_online(td_est, td_tgt)
317
+
318
+ return (td_est.mean().item(), loss)
319
+
320
+
321
+ def save(self):
322
+ save_path = self.save_dir / f"airstriker_net_{int(self.curr_step // self.save_every)}.chkpt"
323
+ torch.save(
324
+ dict(
325
+ model=self.net.state_dict(),
326
+ exploration_rate=self.exploration_rate,
327
+ replay_memory=self.memory
328
+ ),
329
+ save_path
330
+ )
331
+
332
+ print(f"Airstriker model saved to {save_path} at step {self.curr_step}")
333
+
334
+
335
+ def load(self, load_path, reset_exploration_rate, load_replay_buffer):
336
+ if not load_path.exists():
337
+ raise ValueError(f"{load_path} does not exist")
338
+
339
+ ckp = torch.load(load_path, map_location=('cuda' if self.use_cuda else 'cpu'))
340
+ exploration_rate = ckp.get('exploration_rate')
341
+ state_dict = ckp.get('model')
342
+
343
+
344
+ print(f"Loading model at {load_path} with exploration rate {exploration_rate}")
345
+ self.net.load_state_dict(state_dict)
346
+
347
+ if load_replay_buffer:
348
+ replay_memory = ckp.get('replay_memory')
349
+ print(f"Loading replay memory. Len {len(replay_memory)}" if replay_memory else "Saved replay memory not found. Not restoring replay memory.")
350
+ self.memory = replay_memory if replay_memory else self.memory
351
+
352
+ if reset_exploration_rate:
353
+ print(f"Reset exploration rate option specified. Not restoring saved exploration rate {exploration_rate}. The current exploration rate is {self.exploration_rate}")
354
+ else:
355
+ print(f"Setting exploration rate to {exploration_rate} not loaded.")
356
+ self.exploration_rate = exploration_rate
357
+
358
+
359
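+ # Double DQN: reuses DQNAgent, but in the TD target the online network picks the
+ # greedy next action and the target network evaluates it, which reduces the
+ # overestimation bias of vanilla DQN's max operator.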
+ class DDQNAgent(DQNAgent):
360
+ @torch.no_grad()
361
+ def td_target(self, rewards, next_states, dones):
362
+ rewards = rewards.reshape(-1, 1)
363
+ dones = dones.reshape(-1, 1)
364
+ q_vals = self.net(next_states, model='online')
365
+ target_actions = torch.argmax(q_vals, axis=1)
366
+ target_actions = target_actions.reshape(-1, 1)
367
+
368
+ target_qs = self.net(next_states, model='target')
369
+ target_qs = target_qs.gather(1, target_actions)
370
+ target_qs = target_qs.reshape(-1, 1)
371
+ target_qs[dones] = 0.0
372
+ return (rewards + (self.gamma * target_qs))
373
+
374
+
375
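+ # Dueling network: a shared feature trunk feeds two heads, a scalar state value
+ # V(s) and per-action advantages A(s, a), recombined in forward() as
+ # Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a)).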
+ class DuelingDQNet(nn.Module):
376
+ def __init__(self, input_dim, output_dim):
377
+ super().__init__()
378
+ self.feature_layer = nn.Sequential(
379
+ nn.Linear(input_dim, 150),
380
+ nn.ReLU(),
381
+ nn.Linear(150, 120),
382
+ nn.ReLU()
383
+ )
384
+
385
+ self.value_layer = nn.Sequential(
386
+ nn.Linear(120, 120),
387
+ nn.ReLU(),
388
+ nn.Linear(120, 1)
389
+ )
390
+
391
+ self.advantage_layer = nn.Sequential(
392
+ nn.Linear(120, 120),
393
+ nn.ReLU(),
394
+ nn.Linear(120, output_dim)
395
+ )
396
+
397
+ def forward(self, state):
398
+ feature_output = self.feature_layer(state)
399
+ # feature_output = feature_output.view(feature_output.size(0), -1)
400
+ value = self.value_layer(feature_output)
401
+ advantage = self.advantage_layer(feature_output)
402
+ q_value = value + (advantage - advantage.mean(dim=1, keepdim=True))
403
+
404
+ return q_value
405
+
406
+
407
+ class DuelingDQNAgent:
408
+ def __init__(self,
409
+ state_dim,
410
+ action_dim,
411
+ save_dir,
412
+ checkpoint=None,
413
+ learning_rate=0.00025,
414
+ max_memory_size=100000,
415
+ batch_size=32,
416
+ exploration_rate=1,
417
+ exploration_rate_decay=0.9999999,
418
+ exploration_rate_min=0.1,
419
+ training_frequency=1,
420
+ learning_starts=1000,
421
+ target_network_sync_frequency=500,
422
+ reset_exploration_rate=False,
423
+ save_frequency=100000,
424
+ gamma=0.9,
425
+ load_replay_buffer=True):
426
+ self.state_dim = state_dim
427
+ self.action_dim = action_dim
428
+ self.max_memory_size = max_memory_size
429
+ self.memory = deque(maxlen=max_memory_size)
430
+ self.batch_size = batch_size
431
+
432
+ self.exploration_rate = exploration_rate
433
+ self.exploration_rate_decay = exploration_rate_decay
434
+ self.exploration_rate_min = exploration_rate_min
435
+ self.gamma = gamma
436
+
437
+ self.curr_step = 0
438
+ self.learning_starts = learning_starts # min. experiences before training
439
+
440
+ self.training_frequency = training_frequency # no. of experiences between updates to Q_online
441
+ self.target_network_sync_frequency = target_network_sync_frequency # no. of experiences between Q_target & Q_online sync
442
+
443
+ self.save_every = save_frequency # no. of experiences between saving the network
444
+ self.save_dir = save_dir
445
+
446
+ self.use_cuda = torch.cuda.is_available()
447
+
448
+
449
+ self.online_net = DuelingDQNet(self.state_dim, self.action_dim).float()
450
+ self.target_net = copy.deepcopy(self.online_net)
451
+ # Q_target parameters are frozen.
452
+ for p in self.target_net.parameters():
453
+ p.requires_grad = False
454
+
455
+ if self.use_cuda:
456
+ self.online_net = self.online_net.to(device='cuda')
457
+ self.target_net = self.target_net.to(device='cuda')
458
+ if checkpoint:
459
+ self.load(checkpoint, reset_exploration_rate, load_replay_buffer)
460
+
461
+ self.optimizer = torch.optim.AdamW(self.online_net.parameters(), lr=learning_rate, amsgrad=True)
462
+ self.loss_fn = torch.nn.SmoothL1Loss()
463
+ # self.optimizer = torch.optim.Adam(self.online_net.parameters(), lr=learning_rate)
464
+ # self.loss_fn = torch.nn.MSELoss()
465
+
466
+
467
+ def act(self, state):
468
+ """
469
+ Given a state, choose an epsilon-greedy action and update value of step.
470
+
471
+ Inputs:
472
+ state(LazyFrame): A single observation of the current state, dimension is (state_dim)
473
+ Outputs:
474
+ action_idx (int): An integer representing which action the agent will perform
475
+ """
476
+ # EXPLORE
477
+ if np.random.rand() < self.exploration_rate:
478
+ action_idx = np.random.randint(self.action_dim)
479
+
480
+ # EXPLOIT
481
+ else:
482
+ state = torch.FloatTensor(state).cuda() if self.use_cuda else torch.FloatTensor(state)
483
+ state = state.unsqueeze(0)
484
+ action_values = self.online_net(state)
485
+ action_idx = torch.argmax(action_values, axis=1).item()
486
+
487
+ # decrease exploration_rate
488
+ self.exploration_rate *= self.exploration_rate_decay
489
+ self.exploration_rate = max(self.exploration_rate_min, self.exploration_rate)
490
+
491
+ # increment step
492
+ self.curr_step += 1
493
+ return action_idx
494
+
495
+ def cache(self, state, next_state, action, reward, done):
496
+ """
497
+ Store the experience to self.memory (replay buffer)
498
+
499
+ Inputs:
500
+ state (LazyFrame),
501
+ next_state (LazyFrame),
502
+ action (int),
503
+ reward (float),
504
+ done(bool))
505
+ """
506
+ state = torch.FloatTensor(state).cuda() if self.use_cuda else torch.FloatTensor(state)
507
+ next_state = torch.FloatTensor(next_state).cuda() if self.use_cuda else torch.FloatTensor(next_state)
508
+ action = torch.LongTensor([action]).cuda() if self.use_cuda else torch.LongTensor([action])
509
+ reward = torch.DoubleTensor([reward]).cuda() if self.use_cuda else torch.DoubleTensor([reward])
510
+ done = torch.BoolTensor([done]).cuda() if self.use_cuda else torch.BoolTensor([done])
511
+
512
+ self.memory.append( (state, next_state, action, reward, done,) )
513
+
514
+
515
+ def recall(self):
516
+ """
517
+ Retrieve a batch of experiences from memory
518
+ """
519
+ batch = random.sample(self.memory, self.batch_size)
520
+ state, next_state, action, reward, done = map(torch.stack, zip(*batch))
521
+ return state, next_state, action.squeeze(), reward.squeeze(), done.squeeze()
522
+
523
+
524
+ def td_estimate(self, states, actions):
525
+ actions = actions.reshape(-1, 1)
526
+ predicted_qs = self.online_net(states)# Q_online(s,a)
527
+ predicted_qs = predicted_qs.gather(1, actions)
528
+ return predicted_qs
529
+
530
+
531
+ @torch.no_grad()
532
+ def td_target(self, rewards, next_states, dones):
533
+ rewards = rewards.reshape(-1, 1)
534
+ dones = dones.reshape(-1, 1)
535
+ target_qs = self.target_net.forward(next_states)
536
+ target_qs = torch.max(target_qs, dim=1).values
537
+ target_qs = target_qs.reshape(-1, 1)
538
+ target_qs[dones] = 0.0
539
+ return (rewards + (self.gamma * target_qs))
540
+
541
+ def update_Q_online(self, td_estimate, td_target):
542
+ loss = self.loss_fn(td_estimate.float(), td_target.float())
543
+ self.optimizer.zero_grad()
544
+ loss.backward()
545
+ self.optimizer.step()
546
+ return loss.item()
547
+
548
+
549
+ def sync_Q_target(self):
550
+ self.target_net.load_state_dict(self.online_net.state_dict())
551
+
552
+
553
+ def learn(self):
554
+ if self.curr_step % self.target_network_sync_frequency == 0:
555
+ self.sync_Q_target()
556
+
557
+ if self.curr_step % self.save_every == 0:
558
+ self.save()
559
+
560
+ if self.curr_step < self.learning_starts:
561
+ return None, None
562
+
563
+ if self.curr_step % self.training_frequency != 0:
564
+ return None, None
565
+
566
+ # Sample from memory
567
+ state, next_state, action, reward, done = self.recall()
568
+
569
+ # Get TD Estimate
570
+ td_est = self.td_estimate(state, action)
571
+
572
+ # Get TD Target
573
+ td_tgt = self.td_target(reward, next_state, done)
574
+
575
+ # Backpropagate loss through Q_online
576
+ loss = self.update_Q_online(td_est, td_tgt)
577
+
578
+ return (td_est.mean().item(), loss)
579
+
580
+
581
+ def save(self):
582
+ save_path = self.save_dir / f"airstriker_net_{int(self.curr_step // self.save_every)}.chkpt"
583
+ torch.save(
584
+ dict(
585
+ model=self.online_net.state_dict(),
586
+ exploration_rate=self.exploration_rate,
587
+ replay_memory=self.memory
588
+ ),
589
+ save_path
590
+ )
591
+
592
+ print(f"Airstriker model saved to {save_path} at step {self.curr_step}")
593
+
594
+
595
+ def load(self, load_path, reset_exploration_rate, load_replay_buffer):
596
+ if not load_path.exists():
597
+ raise ValueError(f"{load_path} does not exist")
598
+
599
+ ckp = torch.load(load_path, map_location=('cuda' if self.use_cuda else 'cpu'))
600
+ exploration_rate = ckp.get('exploration_rate')
601
+ state_dict = ckp.get('model')
602
+
603
+
604
+ print(f"Loading model at {load_path} with exploration rate {exploration_rate}")
605
+ self.online_net.load_state_dict(state_dict)
606
+ self.target_net = copy.deepcopy(self.online_net)
607
+ self.sync_Q_target()
608
+
609
+ if load_replay_buffer:
610
+ replay_memory = ckp.get('replay_memory')
611
+ print(f"Loading replay memory. Len {len(replay_memory)}" if replay_memory else "Saved replay memory not found. Not restoring replay memory.")
612
+ self.memory = replay_memory if replay_memory else self.memory
613
+
614
+ if reset_exploration_rate:
615
+ print(f"Reset exploration rate option specified. Not restoring saved exploration rate {exploration_rate}. The current exploration rate is {self.exploration_rate}")
616
+ else:
617
+ print(f"Setting exploration rate to {exploration_rate} not loaded.")
618
+ self.exploration_rate = exploration_rate
619
+
620
+
621
+
622
+
623
+ class DuelingDDQNAgent(DuelingDQNAgent):
624
+ @torch.no_grad()
625
+ def td_target(self, rewards, next_states, dones):
626
+ rewards = rewards.reshape(-1, 1)
627
+ dones = dones.reshape(-1, 1)
628
+ q_vals = self.online_net.forward(next_states)
629
+ target_actions = torch.argmax(q_vals, axis=1)
630
+ target_actions = target_actions.reshape(-1, 1)
631
+
632
+ target_qs = self.target_net.forward(next_states)
633
+ target_qs = target_qs.gather(1, target_actions)
634
+ target_qs = target_qs.reshape(-1, 1)
635
+ target_qs[dones] = 0.0
636
+ return (rewards + (self.gamma * target_qs))
637
+
638
+
639
+
640
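+ # Step-decay variant: identical to DQNAgent except that each cached transition
+ # also stores its step index within the episode, and td_target() uses that index
+ # to shrink the bootstrapped term as the episode progresses (see td_target below).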
+ class DQNAgentWithStepDecay:
641
+ def __init__(self,
642
+ state_dim,
643
+ action_dim,
644
+ save_dir,
645
+ checkpoint=None,
646
+ learning_rate=0.00025,
647
+ max_memory_size=100000,
648
+ batch_size=32,
649
+ exploration_rate=1,
650
+ exploration_rate_decay=0.9999999,
651
+ exploration_rate_min=0.1,
652
+ training_frequency=1,
653
+ learning_starts=1000,
654
+ target_network_sync_frequency=500,
655
+ reset_exploration_rate=False,
656
+ save_frequency=100000,
657
+ gamma=0.9,
658
+ load_replay_buffer=True):
659
+ self.state_dim = state_dim
660
+ self.action_dim = action_dim
661
+ self.max_memory_size = max_memory_size
662
+ self.memory = deque(maxlen=max_memory_size)
663
+ self.batch_size = batch_size
664
+
665
+ self.exploration_rate = exploration_rate
666
+ self.exploration_rate_decay = exploration_rate_decay
667
+ self.exploration_rate_min = exploration_rate_min
668
+ self.gamma = gamma
669
+
670
+ self.curr_step = 0
671
+ self.learning_starts = learning_starts # min. experiences before training
672
+
673
+ self.training_frequency = training_frequency # no. of experiences between updates to Q_online
674
+ self.target_network_sync_frequency = target_network_sync_frequency # no. of experiences between Q_target & Q_online sync
675
+
676
+ self.save_every = save_frequency # no. of experiences between saving the network
677
+ self.save_dir = save_dir
678
+
679
+ self.use_cuda = torch.cuda.is_available()
680
+
681
+ self.net = DQNet(self.state_dim, self.action_dim).float()
682
+ if self.use_cuda:
683
+ self.net = self.net.to(device='cuda')
684
+ if checkpoint:
685
+ self.load(checkpoint, reset_exploration_rate, load_replay_buffer)
686
+
687
+ self.optimizer = torch.optim.AdamW(self.net.parameters(), lr=learning_rate, amsgrad=True)
688
+ self.loss_fn = torch.nn.SmoothL1Loss()
689
+ # self.optimizer = torch.optim.Adam(self.net.parameters(), lr=learning_rate)
690
+ # self.loss_fn = torch.nn.MSELoss()
691
+
692
+
693
+ def act(self, state):
694
+ """
695
+ Given a state, choose an epsilon-greedy action and update value of step.
696
+
697
+ Inputs:
698
+ state(LazyFrame): A single observation of the current state, dimension is (state_dim)
699
+ Outputs:
700
+ action_idx (int): An integer representing which action the agent will perform
701
+ """
702
+ # EXPLORE
703
+ if np.random.rand() < self.exploration_rate:
704
+ action_idx = np.random.randint(self.action_dim)
705
+
706
+ # EXPLOIT
707
+ else:
708
+ state = torch.FloatTensor(state).cuda() if self.use_cuda else torch.FloatTensor(state)
709
+ state = state.unsqueeze(0)
710
+ action_values = self.net(state, model='online')
711
+ action_idx = torch.argmax(action_values, axis=1).item()
712
+
713
+ # decrease exploration_rate
714
+
715
+ self.exploration_rate *= self.exploration_rate_decay
716
+ self.exploration_rate = max(self.exploration_rate_min, self.exploration_rate)
717
+
718
+ # increment step
719
+ self.curr_step += 1
720
+ return action_idx
721
+
722
+ def cache(self, state, next_state, action, reward, done, stepnumber):
723
+ """
724
+ Store the experience to self.memory (replay buffer)
725
+
726
+ Inputs:
727
+ state (LazyFrame),
728
+ next_state (LazyFrame),
729
+ action (int),
730
+ reward (float),
731
+ done(bool))
732
+ """
733
+ state = torch.FloatTensor(state).cuda() if self.use_cuda else torch.FloatTensor(state)
734
+ next_state = torch.FloatTensor(next_state).cuda() if self.use_cuda else torch.FloatTensor(next_state)
735
+ action = torch.LongTensor([action]).cuda() if self.use_cuda else torch.LongTensor([action])
736
+ reward = torch.DoubleTensor([reward]).cuda() if self.use_cuda else torch.DoubleTensor([reward])
737
+ done = torch.BoolTensor([done]).cuda() if self.use_cuda else torch.BoolTensor([done])
738
+ stepnumber = torch.LongTensor([stepnumber]).cuda() if self.use_cuda else torch.LongTensor([stepnumber])
739
+
740
+ self.memory.append( (state, next_state, action, reward, done, stepnumber) )
741
+
742
+
743
+ def recall(self):
744
+ """
745
+ Retrieve a batch of experiences from memory
746
+ """
747
+ batch = random.sample(self.memory, self.batch_size)
748
+ state, next_state, action, reward, done, stepnumber = map(torch.stack, zip(*batch))
749
+ return state, next_state, action.squeeze(), reward.squeeze(), done.squeeze(), stepnumber.squeeze()
750
+
751
+
752
+ def td_estimate(self, states, actions):
753
+ actions = actions.reshape(-1, 1)
754
+ predicted_qs = self.net(states, model='online')# Q_online(s,a)
755
+ predicted_qs = predicted_qs.gather(1, actions)
756
+ return predicted_qs
757
+
758
+
759
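+ # The bootstrapped term is capped at (200 - step) / 200, so transitions recorded
+ # late in an episode contribute a smaller future value (the code assumes a
+ # nominal 200-step episode horizon).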
+ @torch.no_grad()
760
+ def td_target(self, rewards, next_states, dones, stepnumbers):
761
+ rewards = rewards.reshape(-1, 1)
762
+ dones = dones.reshape(-1, 1)
763
+ stepnumbers = stepnumbers.reshape(-1, 1)
764
+ target_qs = self.net(next_states, model='target')
765
+ target_qs = torch.max(target_qs, dim=1).values
766
+ target_qs = target_qs.reshape(-1, 1)
767
+ target_qs[dones] = 0.0
768
+ discount = ((200 - stepnumbers)/200)
769
+ val = torch.minimum(discount, self.gamma * target_qs)
770
+ return (rewards + val)
771
+
772
+ def update_Q_online(self, td_estimate, td_target):
773
+ loss = self.loss_fn(td_estimate.float(), td_target.float())
774
+ self.optimizer.zero_grad()
775
+ loss.backward()
776
+ self.optimizer.step()
777
+ return loss.item()
778
+
779
+
780
+ def sync_Q_target(self):
781
+ self.net.target.load_state_dict(self.net.online.state_dict())
782
+
783
+
784
+ def learn(self):
785
+ if self.curr_step % self.target_network_sync_frequency == 0:
786
+ self.sync_Q_target()
787
+
788
+ if self.curr_step % self.save_every == 0:
789
+ self.save()
790
+
791
+ if self.curr_step < self.learning_starts:
792
+ return None, None
793
+
794
+ if self.curr_step % self.training_frequency != 0:
795
+ return None, None
796
+
797
+ # Sample from memory
798
+ state, next_state, action, reward, done, stepnumber = self.recall()
799
+
800
+ # Get TD Estimate
801
+ td_est = self.td_estimate(state, action)
802
+
803
+ # Get TD Target
804
+ td_tgt = self.td_target(reward, next_state, done, stepnumber)
805
+
806
+ # Backpropagate loss through Q_online
807
+
808
+ loss = self.update_Q_online(td_est, td_tgt)
809
+
810
+ return (td_est.mean().item(), loss)
811
+
812
+
813
+ def save(self):
814
+ save_path = self.save_dir / f"airstriker_net_{int(self.curr_step // self.save_every)}.chkpt"
815
+ torch.save(
816
+ dict(
817
+ model=self.net.state_dict(),
818
+ exploration_rate=self.exploration_rate,
819
+ replay_memory=self.memory
820
+ ),
821
+ save_path
822
+ )
823
+
824
+ print(f"Airstriker model saved to {save_path} at step {self.curr_step}")
825
+
826
+
827
+ def load(self, load_path, reset_exploration_rate, load_replay_buffer):
828
+ if not load_path.exists():
829
+ raise ValueError(f"{load_path} does not exist")
830
+
831
+ ckp = torch.load(load_path, map_location=('cuda' if self.use_cuda else 'cpu'))
832
+ exploration_rate = ckp.get('exploration_rate')
833
+ state_dict = ckp.get('model')
834
+
835
+
836
+ print(f"Loading model at {load_path} with exploration rate {exploration_rate}")
837
+ self.net.load_state_dict(state_dict)
838
+
839
+ if load_replay_buffer:
840
+ replay_memory = ckp.get('replay_memory')
841
+ print(f"Loading replay memory. Len {len(replay_memory)}" if replay_memory else "Saved replay memory not found. Not restoring replay memory.")
842
+ self.memory = replay_memory if replay_memory else self.memory
843
+
844
+ if reset_exploration_rate:
845
+ print(f"Reset exploration rate option specified. Not restoring saved exploration rate {exploration_rate}. The current exploration rate is {self.exploration_rate}")
846
+ else:
847
+ print(f"Setting exploration rate to {exploration_rate} not loaded.")
848
+ self.exploration_rate = exploration_rate
849
+
850
+
851
+ class DDQNAgentWithStepDecay(DQNAgentWithStepDecay):
852
+ @torch.no_grad()
853
+ def td_target(self, rewards, next_states, dones, stepnumbers):
854
+ rewards = rewards.reshape(-1, 1)
855
+ dones = dones.reshape(-1, 1)
856
+ stepnumbers = stepnumbers.reshape(-1, 1)
857
+ q_vals = self.net(next_states, model='online')
858
+ target_actions = torch.argmax(q_vals, axis=1)
859
+ target_actions = target_actions.reshape(-1, 1)
860
+
861
+ target_qs = self.net(next_states, model='target')
862
+ target_qs = target_qs.gather(1, target_actions)
863
+ target_qs = target_qs.reshape(-1, 1)
864
+ target_qs[dones] = 0.0
865
+ discount = ((200 - stepnumbers)/200)
866
+ val = torch.minimum(discount, self.gamma * target_qs)
867
+ return (rewards + val)
868
+
869
+
870
+ class DuelingDQNAgentWithStepDecay:
871
+ def __init__(self,
872
+ state_dim,
873
+ action_dim,
874
+ save_dir,
875
+ checkpoint=None,
876
+ learning_rate=0.00025,
877
+ max_memory_size=100000,
878
+ batch_size=32,
879
+ exploration_rate=1,
880
+ exploration_rate_decay=0.9999999,
881
+ exploration_rate_min=0.1,
882
+ training_frequency=1,
883
+ learning_starts=1000,
884
+ target_network_sync_frequency=500,
885
+ reset_exploration_rate=False,
886
+ save_frequency=100000,
887
+ gamma=0.9,
888
+ load_replay_buffer=True):
889
+ self.state_dim = state_dim
890
+ self.action_dim = action_dim
891
+ self.max_memory_size = max_memory_size
892
+ self.memory = deque(maxlen=max_memory_size)
893
+ self.batch_size = batch_size
894
+
895
+ self.exploration_rate = exploration_rate
896
+ self.exploration_rate_decay = exploration_rate_decay
897
+ self.exploration_rate_min = exploration_rate_min
898
+ self.gamma = gamma
899
+
900
+ self.curr_step = 0
901
+ self.learning_starts = learning_starts # min. experiences before training
902
+
903
+ self.training_frequency = training_frequency # no. of experiences between updates to Q_online
904
+ self.target_network_sync_frequency = target_network_sync_frequency # no. of experiences between Q_target & Q_online sync
905
+
906
+ self.save_every = save_frequency # no. of experiences between saving the network
907
+ self.save_dir = save_dir
908
+
909
+ self.use_cuda = torch.cuda.is_available()
910
+
911
+
912
+ self.online_net = DuelingDQNet(self.state_dim, self.action_dim).float()
913
+ self.target_net = copy.deepcopy(self.online_net)
914
+ # Q_target parameters are frozen.
915
+ for p in self.target_net.parameters():
916
+ p.requires_grad = False
917
+
918
+ if self.use_cuda:
919
+ self.online_net = self.online_net.to(device='cuda')
920
+ self.target_net = self.target_net.to(device='cuda')
921
+ if checkpoint:
922
+ self.load(checkpoint, reset_exploration_rate, load_replay_buffer)
923
+
924
+ self.optimizer = torch.optim.AdamW(self.online_net.parameters(), lr=learning_rate, amsgrad=True)
925
+ self.loss_fn = torch.nn.SmoothL1Loss()
926
+ # self.optimizer = torch.optim.Adam(self.online_net.parameters(), lr=learning_rate)
927
+ # self.loss_fn = torch.nn.MSELoss()
928
+
929
+
930
+ def act(self, state):
931
+ """
932
+ Given a state, choose an epsilon-greedy action and update value of step.
933
+
934
+ Inputs:
935
+ state(LazyFrame): A single observation of the current state, dimension is (state_dim)
936
+ Outputs:
937
+ action_idx (int): An integer representing which action the agent will perform
938
+ """
939
+ # EXPLORE
940
+ if np.random.rand() < self.exploration_rate:
941
+ action_idx = np.random.randint(self.action_dim)
942
+
943
+ # EXPLOIT
944
+ else:
945
+ state = torch.FloatTensor(state).cuda() if self.use_cuda else torch.FloatTensor(state)
946
+ state = state.unsqueeze(0)
947
+ action_values = self.online_net(state)
948
+ action_idx = torch.argmax(action_values, axis=1).item()
949
+
950
+ # decrease exploration_rate
951
+ self.exploration_rate *= self.exploration_rate_decay
952
+ self.exploration_rate = max(self.exploration_rate_min, self.exploration_rate)
953
+
954
+ # increment step
955
+ self.curr_step += 1
956
+ return action_idx
957
+
958
+ def cache(self, state, next_state, action, reward, done, stepnumber):
959
+ """
960
+ Store the experience to self.memory (replay buffer)
961
+
962
+ Inputs:
963
+ state (LazyFrame),
964
+ next_state (LazyFrame),
965
+ action (int),
966
+ reward (float),
967
+ done(bool))
968
+ """
969
+ state = torch.FloatTensor(state).cuda() if self.use_cuda else torch.FloatTensor(state)
970
+ next_state = torch.FloatTensor(next_state).cuda() if self.use_cuda else torch.FloatTensor(next_state)
971
+ action = torch.LongTensor([action]).cuda() if self.use_cuda else torch.LongTensor([action])
972
+ reward = torch.DoubleTensor([reward]).cuda() if self.use_cuda else torch.DoubleTensor([reward])
973
+ done = torch.BoolTensor([done]).cuda() if self.use_cuda else torch.BoolTensor([done])
974
+ stepnumber = torch.LongTensor([stepnumber]).cuda() if self.use_cuda else torch.LongTensor([stepnumber])
975
+
976
+ self.memory.append( (state, next_state, action, reward, done, stepnumber) )
977
+
978
+
979
+ def recall(self):
980
+ """
981
+ Retrieve a batch of experiences from memory
982
+ """
983
+ batch = random.sample(self.memory, self.batch_size)
984
+ state, next_state, action, reward, done, stepnumber = map(torch.stack, zip(*batch))
985
+ return state, next_state, action.squeeze(), reward.squeeze(), done.squeeze(), stepnumber.squeeze()
986
+
987
+
988
+ def td_estimate(self, states, actions):
989
+ actions = actions.reshape(-1, 1)
990
+ predicted_qs = self.online_net(states)# Q_online(s,a)
991
+ predicted_qs = predicted_qs.gather(1, actions)
992
+ return predicted_qs
993
+
994
+
995
+ @torch.no_grad()
996
+ def td_target(self, rewards, next_states, dones, stepnumbers):
997
+ rewards = rewards.reshape(-1, 1)
998
+ dones = dones.reshape(-1, 1)
999
+ stepnumbers = stepnumbers.reshape(-1, 1)
1000
+ target_qs = self.target_net.forward(next_states)
1001
+ target_qs = torch.max(target_qs, dim=1).values
1002
+ target_qs = target_qs.reshape(-1, 1)
1003
+ target_qs[dones] = 0.0
1004
+ discount = ((200 - stepnumbers)/200)
1005
+ val = torch.minimum(discount, self.gamma * target_qs)
1006
+ return (rewards + val)
1007
+
1008
+ def update_Q_online(self, td_estimate, td_target):
1009
+ loss = self.loss_fn(td_estimate.float(), td_target.float())
1010
+ self.optimizer.zero_grad()
1011
+ loss.backward()
1012
+ self.optimizer.step()
1013
+ return loss.item()
1014
+
1015
+
1016
+ def sync_Q_target(self):
1017
+ self.target_net.load_state_dict(self.online_net.state_dict())
1018
+
1019
+
1020
+ def learn(self):
1021
+ if self.curr_step % self.target_network_sync_frequency == 0:
1022
+ self.sync_Q_target()
1023
+
1024
+ if self.curr_step % self.save_every == 0:
1025
+ self.save()
1026
+
1027
+ if self.curr_step < self.learning_starts:
1028
+ return None, None
1029
+
1030
+ if self.curr_step % self.training_frequency != 0:
1031
+ return None, None
1032
+
1033
+ # Sample from memory
1034
+ state, next_state, action, reward, done, stepnumbers = self.recall()
1035
+
1036
+ # Get TD Estimate
1037
+ td_est = self.td_estimate(state, action)
1038
+
1039
+ # Get TD Target
1040
+ td_tgt = self.td_target(reward, next_state, done, stepnumbers)
1041
+
1042
+ # Backpropagate loss through Q_online
1043
+ loss = self.update_Q_online(td_est, td_tgt)
1044
+
1045
+ return (td_est.mean().item(), loss)
1046
+
1047
+
1048
+ def save(self):
1049
+ save_path = self.save_dir / f"airstriker_net_{int(self.curr_step // self.save_every)}.chkpt"
1050
+ torch.save(
1051
+ dict(
1052
+ model=self.online_net.state_dict(),
1053
+ exploration_rate=self.exploration_rate,
1054
+ replay_memory=self.memory
1055
+ ),
1056
+ save_path
1057
+ )
1058
+
1059
+ print(f"Airstriker model saved to {save_path} at step {self.curr_step}")
1060
+
1061
+
1062
+ def load(self, load_path, reset_exploration_rate, load_replay_buffer):
1063
+ if not load_path.exists():
1064
+ raise ValueError(f"{load_path} does not exist")
1065
+
1066
+ ckp = torch.load(load_path, map_location=('cuda' if self.use_cuda else 'cpu'))
1067
+ exploration_rate = ckp.get('exploration_rate')
1068
+ state_dict = ckp.get('model')
1069
+
1070
+
1071
+ print(f"Loading model at {load_path} with exploration rate {exploration_rate}")
1072
+ self.online_net.load_state_dict(state_dict)
1073
+ self.target_net = copy.deepcopy(self.online_net)
1074
+ self.sync_Q_target()
1075
+
1076
+ if load_replay_buffer:
1077
+ replay_memory = ckp.get('replay_memory')
1078
+ print(f"Loading replay memory. Len {len(replay_memory)}" if replay_memory else "Saved replay memory not found. Not restoring replay memory.")
1079
+ self.memory = replay_memory if replay_memory else self.memory
1080
+
1081
+ if reset_exploration_rate:
1082
+ print(f"Reset exploration rate option specified. Not restoring saved exploration rate {exploration_rate}. The current exploration rate is {self.exploration_rate}")
1083
+ else:
1084
+ print(f"Setting exploration rate to {exploration_rate} not loaded.")
1085
+ self.exploration_rate = exploration_rate
1086
+
1087
+
1088
+ class DuelingDDQNAgentWithStepDecay(DuelingDQNAgentWithStepDecay):
1089
+ @torch.no_grad()
1090
+ def td_target(self, rewards, next_states, dones, stepnumbers):
1091
+ rewards = rewards.reshape(-1, 1)
1092
+ dones = dones.reshape(-1, 1)
1093
+ stepnumbers = stepnumbers.reshape(-1, 1)
1094
+ q_vals = self.online_net.forward(next_states)
1095
+ target_actions = torch.argmax(q_vals, axis=1)
1096
+ target_actions = target_actions.reshape(-1, 1)
1097
+
1098
+ target_qs = self.target_net.forward(next_states)
1099
+ target_qs = target_qs.gather(1, target_actions)
1100
+ target_qs = target_qs.reshape(-1, 1)
1101
+ target_qs[dones] = 0.0
1102
+ discount = ((200 - stepnumbers)/200)
1103
+ val = torch.minimum(discount, self.gamma * target_qs)
1104
+ return (rewards + val)
src/lunar-lander/params.py ADDED
@@ -0,0 +1,12 @@
1
+ hyperparams = dict(
2
+ batch_size=128,
3
+ exploration_rate=1,
4
+ exploration_rate_decay=0.99999,
5
+ exploration_rate_min=0.01,
6
+ training_frequency=1,
7
+ target_network_sync_frequency=20,
8
+ max_memory_size=1000000,
9
+ learning_rate=0.001,
10
+ learning_starts=128,
11
+ save_frequency=100000
12
+ )
src/lunar-lander/replay.py ADDED
@@ -0,0 +1,67 @@
1
+ import datetime
2
+ from pathlib import Path
3
+ from agent import DQNAgent, DDQNAgent, MetricLogger
4
+ from wrappers import make_lunar
5
+
6
+
7
+ env = make_lunar()
8
+
9
+ env.reset()
10
+
11
+ save_dir = Path("checkpoints") / datetime.datetime.now().strftime("%Y-%m-%dT%H-%M-%S")
12
+ save_dir.mkdir(parents=True)
13
+
14
+ # checkpoint = Path('checkpoints/lunar-lander-dueling-ddqn/airstriker_net_2.chkpt')
15
+ checkpoint = Path('checkpoints/lunar-lander-dqn-rc/airstriker_net_1.chkpt')
16
+
17
+ logger = MetricLogger(save_dir)
18
+
19
+ print("Testing Double DQN Agent!")
20
+ agent = DDQNAgent(
21
+ state_dim=8,
22
+ action_dim=env.action_space.n,
23
+ save_dir=save_dir,
24
+ batch_size=512,
25
+ checkpoint=checkpoint,
26
+ exploration_rate_decay=0.999995,
27
+ exploration_rate_min=0.05,
28
+ training_frequency=1,
29
+ target_network_sync_frequency=200,
30
+ max_memory_size=50000,
31
+ learning_rate=0.0005,
32
+ load_replay_buffer=False
33
+
34
+ )
35
+ agent.exploration_rate = agent.exploration_rate_min
36
+
37
+ episodes = 100
38
+
39
+ for e in range(episodes):
40
+
41
+ state = env.reset()
42
+
43
+ while True:
44
+
45
+ env.render()
46
+
47
+ action = agent.act(state)
48
+
49
+ next_state, reward, done, info = env.step(action)
50
+
51
+ # agent.cache(state, next_state, action, reward, done)
52
+
53
+ # logger.log_step(reward, None, None)
54
+
55
+ state = next_state
56
+
57
+ if done:
58
+ break
59
+
60
+ # logger.log_episode()
61
+
62
+ # if e % 20 == 0:
63
+ # logger.record(
64
+ # episode=e,
65
+ # epsilon=agent.exploration_rate,
66
+ # step=agent.curr_step
67
+ # )
src/lunar-lander/run-lunar-ddqn.py ADDED
@@ -0,0 +1,45 @@
1
+ import os
2
+ import torch
3
+ from pathlib import Path
4
+
5
+ from agent import DDQNAgent, DDQNAgentWithStepDecay, MetricLogger
6
+ from wrappers import make_lunar
7
+ import os
8
+ from train import train, fill_memory
9
+ from params import hyperparams
10
+
11
+ env = make_lunar()
12
+
13
+ use_cuda = torch.cuda.is_available()
14
+ print(f"Using CUDA: {use_cuda}\n")
15
+
16
+ checkpoint = None
17
+ # checkpoint = Path('checkpoints/latest/airstriker_net_3.chkpt')
18
+
19
+ path = "checkpoints/lunar-lander-ddqn-rc"
20
+ save_dir = Path(path)
21
+
22
+ isExist = os.path.exists(path)
23
+ if not isExist:
24
+ os.makedirs(path)
25
+
26
+ logger = MetricLogger(save_dir)
27
+
28
+ print("Training DDQN Agent!")
29
+ agent = DDQNAgentWithStepDecay(
30
+ state_dim=8,
31
+ action_dim=env.action_space.n,
32
+ save_dir=save_dir,
33
+ checkpoint=checkpoint,
34
+ **hyperparams
35
+ )
36
+ # agent = DDQNAgent(
37
+ # state_dim=8,
38
+ # action_dim=env.action_space.n,
39
+ # save_dir=save_dir,
40
+ # checkpoint=checkpoint,
41
+ # **hyperparams
42
+ # )
43
+
44
+ # fill_memory(agent, env, 5000)
45
+ train(agent, env, logger)
src/lunar-lander/run-lunar-dqn.py ADDED
@@ -0,0 +1,46 @@
1
+ import os
2
+ import torch
3
+ from pathlib import Path
4
+
5
+ from agent import DQNAgent, DQNAgentWithStepDecay, MetricLogger
6
+ from wrappers import make_lunar
7
+ import os
8
+ from train import train, fill_memory
9
+ from params import hyperparams
10
+
11
+ env = make_lunar()
12
+
13
+ use_cuda = torch.cuda.is_available()
14
+ print(f"Using CUDA: {use_cuda}\n")
15
+
16
+ checkpoint = None
17
+ # checkpoint = Path('checkpoints/latest/airstriker_net_3.chkpt')
18
+
19
+ path = "checkpoints/lunar-lander-dqn-rc"
20
+ save_dir = Path(path)
21
+
22
+ isExist = os.path.exists(path)
23
+ if not isExist:
24
+ os.makedirs(path)
25
+
26
+ logger = MetricLogger(save_dir)
27
+
28
+ print("Training Vanilla DQN Agent with decay!")
29
+ agent = DQNAgentWithStepDecay(
30
+ state_dim=8,
31
+ action_dim=env.action_space.n,
32
+ save_dir=save_dir,
33
+ checkpoint=checkpoint,
34
+ **hyperparams
35
+ )
36
+ # print("Training Vanilla DQN Agent!")
37
+ # agent = DQNAgent(
38
+ # state_dim=8,
39
+ # action_dim=env.action_space.n,
40
+ # save_dir=save_dir,
41
+ # checkpoint=checkpoint,
42
+ # **hyperparams
43
+ # )
44
+
45
+ # fill_memory(agent, env, 5000)
46
+ train(agent, env, logger)
src/lunar-lander/run-lunar-dueling-ddqn.py ADDED
@@ -0,0 +1,47 @@
1
+ import os
2
+ import torch
3
+ from pathlib import Path
4
+
5
+ from agent import DuelingDDQNAgent, DuelingDDQNAgentWithStepDecay,MetricLogger
6
+ from wrappers import make_lunar
7
+ import os
8
+ from train import train, fill_memory
9
+ from params import hyperparams
10
+
11
+
12
+ env = make_lunar()
13
+
14
+ use_cuda = torch.cuda.is_available()
15
+ print(f"Using CUDA: {use_cuda}\n")
16
+
17
+ checkpoint = None
18
+ # checkpoint = Path('checkpoints/latest/airstriker_net_3.chkpt')
19
+
20
+ path = "checkpoints/lunar-lander-dueling-ddqn-rc"
21
+ save_dir = Path(path)
22
+
23
+ isExist = os.path.exists(path)
24
+ if not isExist:
25
+ os.makedirs(path)
26
+
27
+ logger = MetricLogger(save_dir)
28
+
29
+ print("Training Dueling DDQN Agent with step decay!")
30
+ agent = DuelingDDQNAgentWithStepDecay(
31
+ state_dim=8,
32
+ action_dim=env.action_space.n,
33
+ save_dir=save_dir,
34
+ checkpoint=checkpoint,
35
+ **hyperparams
36
+ )
37
+ # print("Training Dueling DDQN Agent!")
38
+ # agent = DuelingDDQNAgent(
39
+ # state_dim=8,
40
+ # action_dim=env.action_space.n,
41
+ # save_dir=save_dir,
42
+ # checkpoint=checkpoint,
43
+ # **hyperparams
44
+ # )
45
+
46
+ # fill_memory(agent, env, 5000)
47
+ train(agent, env, logger)
src/lunar-lander/run-lunar-dueling-dqn.py ADDED
@@ -0,0 +1,46 @@
1
+ import os
2
+ import torch
3
+ from pathlib import Path
4
+
5
+ from agent import DuelingDQNAgent, DuelingDQNAgentWithStepDecay, MetricLogger
6
+ from wrappers import make_lunar
7
+ import os
8
+ from train import train, fill_memory
9
+ from params import hyperparams
10
+
11
+ env = make_lunar()
12
+
13
+ use_cuda = torch.cuda.is_available()
14
+ print(f"Using CUDA: {use_cuda}\n")
15
+
16
+ checkpoint = None
17
+ # checkpoint = Path('checkpoints/latest/airstriker_net_3.chkpt')
18
+
19
+ path = "checkpoints/lunar-lander-dueling-dqn-rc"
20
+ save_dir = Path(path)
21
+
22
+ isExist = os.path.exists(path)
23
+ if not isExist:
24
+ os.makedirs(path)
25
+
26
+ logger = MetricLogger(save_dir)
27
+
28
+ print("Training Dueling DQN Agent with step decay!")
29
+ agent = DuelingDQNAgentWithStepDecay(
30
+ state_dim=8,
31
+ action_dim=env.action_space.n,
32
+ save_dir=save_dir,
33
+ checkpoint=checkpoint,
34
+ **hyperparams
35
+ )
36
+ # print("Training Dueling DQN Agent!")
37
+ # agent = DuelingDQNAgent(
38
+ # state_dim=8,
39
+ # action_dim=env.action_space.n,
40
+ # save_dir=save_dir,
41
+ # checkpoint=checkpoint,
42
+ # **hyperparams
43
+ # )
44
+
45
+ # fill_memory(agent, env, 5000)
46
+ train(agent, env, logger)
src/lunar-lander/train.py ADDED
@@ -0,0 +1,84 @@
1
+ from tqdm import trange
2
+
3
+ def fill_memory(agent, env, num_episodes=500):
4
+ print("Filling up memory....")
5
+ for _ in trange(num_episodes):
6
+ state = env.reset()
7
+ done = False
8
+ while not done:
9
+ action = agent.act(state)
10
+ next_state, reward, done, _ = env.step(action)
11
+ agent.cache(state, next_state, action, reward, done)
12
+ state = next_state
13
+
14
+
15
+ # def train(agent, env, logger):
16
+ # episodes = 5000
17
+ # for e in range(episodes):
18
+
19
+ # state = env.reset()
20
+ # # Play the game!
21
+ # while True:
22
+
23
+ # # Run agent on the state
24
+ # action = agent.act(state)
25
+
26
+ # # Agent performs action
27
+ # next_state, reward, done, info = env.step(action)
28
+
29
+ # # Remember
30
+ # agent.cache(state, next_state, action, reward, done)
31
+
32
+ # # Learn
33
+ # q, loss = agent.learn()
34
+
35
+ # # Logging
36
+ # logger.log_step(reward, loss, q)
37
+
38
+ # # Update state
39
+ # state = next_state
40
+
41
+ # # Check if end of game
42
+ # if done:
43
+ # break
44
+
45
+ # logger.log_episode(e)
46
+
47
+ # if e % 20 == 0:
48
+ # logger.record(episode=e, epsilon=agent.exploration_rate, step=agent.curr_step)
49
+
50
+
51
+ def train(agent, env, logger):
52
+ episodes = 5000
53
+ for e in range(episodes):
54
+
55
+ state = env.reset()
56
+ # Play the game!
57
+ for i in range(1000):
58
+
59
+ # Run agent on the state
60
+ action = agent.act(state)
61
+ env.render()
62
+ # Agent performs action
63
+ next_state, reward, done, info = env.step(action)
64
+
65
+ # Remember
66
+ agent.cache(state, next_state, action, reward, done, i)
67
+
68
+ # Learn
69
+ q, loss = agent.learn()
70
+
71
+ # Logging
72
+ logger.log_step(reward, loss, q)
73
+
74
+ # Update state
75
+ state = next_state
76
+
77
+ # Check if end of game
78
+ if done:
79
+ break
80
+
81
+ logger.log_episode(e)
82
+
83
+ if e % 20 == 0:
84
+ logger.record(episode=e, epsilon=agent.exploration_rate, step=agent.curr_step)
src/lunar-lander/wrappers.py ADDED
@@ -0,0 +1,193 @@
1
+ import numpy as np
2
+ import os
3
+ from collections import deque
4
+ import gym
5
+ from gym import spaces
6
+ import cv2
7
+ import math
8
+
9
+ '''
10
+ Atari Wrapper copied from https://github.com/openai/baselines/blob/master/baselines/common/atari_wrappers.py
11
+ '''
12
+
13
+
14
+ class LazyFrames(object):
15
+ def __init__(self, frames):
16
+ """This object ensures that common frames between the observations are only stored once.
17
+ It exists purely to optimize memory usage which can be huge for DQN's 1M frames replay
18
+ buffers.
19
+ This object should only be converted to numpy array before being passed to the model.
20
+ You'd not believe how complex the previous solution was."""
21
+ self._frames = frames
22
+ self._out = None
23
+
24
+ def _force(self):
25
+ if self._out is None:
26
+ self._out = np.concatenate(self._frames, axis=2)
27
+ self._frames = None
28
+ return self._out
29
+
30
+ def __array__(self, dtype=None):
31
+ out = self._force()
32
+ if dtype is not None:
33
+ out = out.astype(dtype)
34
+ return out
35
+
36
+ def __len__(self):
37
+ return len(self._force())
38
+
39
+ def __getitem__(self, i):
40
+ return self._force()[i]
41
+
42
+ class FireResetEnv(gym.Wrapper):
43
+ def __init__(self, env):
44
+ """Take action on reset for environments that are fixed until firing."""
45
+ gym.Wrapper.__init__(self, env)
46
+ assert env.unwrapped.get_action_meanings()[1] == 'FIRE'
47
+ assert len(env.unwrapped.get_action_meanings()) >= 3
48
+
49
+ def reset(self, **kwargs):
50
+ self.env.reset(**kwargs)
51
+ obs, _, done, _ = self.env.step(1)
52
+ if done:
53
+ self.env.reset(**kwargs)
54
+ obs, _, done, _ = self.env.step(2)
55
+ if done:
56
+ self.env.reset(**kwargs)
57
+ return obs
58
+
59
+ def step(self, ac):
60
+ return self.env.step(ac)
61
+
62
+
63
+ class MaxAndSkipEnv(gym.Wrapper):
64
+ def __init__(self, env, skip=4):
65
+ """Return only every `skip`-th frame"""
66
+ gym.Wrapper.__init__(self, env)
67
+ # most recent raw observations (for max pooling across time steps)
68
+ self._obs_buffer = np.zeros((2,)+env.observation_space.shape, dtype=np.uint8)
69
+ self._skip = skip
70
+
71
+ def step(self, action):
72
+ """Repeat action, sum reward, and max over last observations."""
73
+ total_reward = 0.0
74
+ done = None
75
+ for i in range(self._skip):
76
+ obs, reward, done, info = self.env.step(action)
77
+ if i == self._skip - 2: self._obs_buffer[0] = obs
78
+ if i == self._skip - 1: self._obs_buffer[1] = obs
79
+ total_reward += reward
80
+ if done:
81
+ break
82
+ # Note that the observation on the done=True frame
83
+ # doesn't matter
84
+ max_frame = self._obs_buffer.max(axis=0)
85
+
86
+ return max_frame, total_reward, done, info
87
+
88
+ def reset(self, **kwargs):
89
+ return self.env.reset(**kwargs)
90
+
91
+
92
+
93
+ class WarpFrame(gym.ObservationWrapper):
94
+ def __init__(self, env):
95
+ """Warp frames to 84x84 as done in the Nature paper and later work."""
96
+ gym.ObservationWrapper.__init__(self, env)
97
+ self.width = 84
98
+ self.height = 84
99
+ self.observation_space = spaces.Box(low=0, high=255,
100
+ shape=(self.height, self.width, 1), dtype=np.uint8)
101
+
102
+ def observation(self, frame):
103
+ frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
104
+ frame = cv2.resize(frame, (self.width, self.height), interpolation=cv2.INTER_AREA)
105
+ return frame[:, :, None]
106
+
107
+ class WarpFrameNoResize(gym.ObservationWrapper):
108
+ def __init__(self, env):
109
+ """Warp frames to 84x84 as done in the Nature paper and later work."""
110
+ gym.ObservationWrapper.__init__(self, env)
111
+
112
+ def observation(self, frame):
113
+ frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
114
+ # frame = cv2.resize(frame, (self.width, self.height), interpolation=cv2.INTER_AREA)
115
+ return frame[:, :, None]
116
+
117
+
118
+
119
+ class FrameStack(gym.Wrapper):
120
+ def __init__(self, env, k):
121
+ """Stack k last frames.
122
+ Returns lazy array, which is much more memory efficient.
123
+ See Also
124
+ --------
125
+ baselines.common.atari_wrappers.LazyFrames
126
+ """
127
+ gym.Wrapper.__init__(self, env)
128
+ self.k = k
129
+ self.frames = deque([], maxlen=k)
130
+ shp = env.observation_space.shape
131
+ self.observation_space = spaces.Box(low=0, high=255, shape=(shp[0], shp[1], shp[2] * k), dtype=env.observation_space.dtype)
132
+
133
+ def reset(self):
134
+ ob = self.env.reset()
135
+ for _ in range(self.k):
136
+ self.frames.append(ob)
137
+ return self._get_ob()
138
+
139
+ def step(self, action):
140
+ ob, reward, done, info = self.env.step(action)
141
+ self.frames.append(ob)
142
+ return self._get_ob(), reward, done, info
143
+
144
+ def _get_ob(self):
145
+ assert len(self.frames) == self.k
146
+ return LazyFrames(list(self.frames))
147
+
148
+
149
+ class ImageToPyTorch(gym.ObservationWrapper):
150
+ def __init__(self, env):
151
+ super(ImageToPyTorch, self).__init__(env)
152
+ old_shape = self.observation_space.shape
153
+ self.observation_space = gym.spaces.Box(low=0.0, high=1.0, shape=(old_shape[-1], old_shape[0], old_shape[1]), dtype=np.float32)
154
+
155
+ def observation(self, observation):
156
+ return np.moveaxis(observation, 2, 0)
157
+
158
+
159
+ class ScaledFloatFrame(gym.ObservationWrapper):
160
+ def __init__(self, env):
161
+ gym.ObservationWrapper.__init__(self, env)
162
+ self.observation_space = gym.spaces.Box(low=0, high=1, shape=env.observation_space.shape, dtype=np.float32)
163
+
164
+ def observation(self, observation):
165
+ # careful! This undoes the memory optimization, use
166
+ # with smaller replay buffers only.
167
+ return np.array(observation).astype(np.float32) / 255.0
168
+
169
+ class ClipRewardEnv(gym.RewardWrapper):
170
+ def __init__(self, env):
171
+ gym.RewardWrapper.__init__(self, env)
172
+
173
+ def reward(self, reward):
174
+ """Bin reward to {+1, 0, -1} by its sign."""
175
+ return np.sign(reward)
176
+
177
+ class TanRewardClipperEnv(gym.RewardWrapper):
178
+ def __init__(self, env):
179
+ gym.RewardWrapper.__init__(self, env)
180
+
181
+ def reward(self, reward):
182
+ """Bin reward to {+1, 0, -1} by its sign."""
183
+ return 10 * math.tanh(float(reward)/30.)
184
+
185
+
186
+ def make_lunar(render=False):
187
+ print("Environment: Lunar Lander")
188
+ env = gym.make("LunarLander-v2")
189
+ # env = TanRewardClipperEnv(env)
190
+ # env = WarpFrameNoResize(env) ## Reshape image
191
+ # env = ImageToPyTorch(env) ## Invert shape
192
+ # env = FrameStack(env, 4) ## Stack last 4 frames
193
+ return env
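+ 
+ # A minimal usage sketch (assumption: gym==0.21.0 with Box2D installed for LunarLander-v2):
+ #   env = make_lunar()
+ #   state = env.reset()
+ #   next_state, reward, done, info = env.step(env.action_space.sample())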
src/procgen/agent.py ADDED
@@ -0,0 +1,664 @@
1
+ import torch
2
+ import numpy as np
3
+ import random
4
+ import torch.nn as nn
5
+ import copy
6
+ import time, datetime
7
+ import matplotlib.pyplot as plt
8
+ from collections import deque
9
+ from torch.utils.tensorboard import SummaryWriter
10
+
11
+
12
+ class DQNet(nn.Module):
13
+ """mini cnn structure
14
+ input -> (conv2d + relu) x 3 -> flatten -> (dense + relu) x 2 -> output
15
+ """
16
+
17
+ def __init__(self, input_dim, output_dim):
18
+ super().__init__()
19
+ print("#################################")
20
+ print("#################################")
21
+ print(input_dim)
22
+ print(output_dim)
23
+ print("#################################")
24
+ print("#################################")
25
+ c, h, w = input_dim
26
+
27
+
28
+ self.online = nn.Sequential(
29
+ nn.Conv2d(in_channels=c, out_channels=32, kernel_size=8, stride=4),
30
+ nn.ReLU(),
31
+ nn.Conv2d(in_channels=32, out_channels=64, kernel_size=4, stride=2),
32
+ nn.ReLU(),
33
+ nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, stride=1),
34
+ nn.ReLU(),
35
+ nn.Flatten(),
36
+ nn.Linear(7168, 512),
37
+ nn.ReLU(),
38
+ nn.Linear(512, output_dim),
39
+ )
40
+
41
+
42
+ self.target = copy.deepcopy(self.online)
43
+
44
+ # Q_target parameters are frozen.
45
+ for p in self.target.parameters():
46
+ p.requires_grad = False
47
+
48
+ def forward(self, input, model):
49
+ if model == "online":
50
+ return self.online(input)
51
+ elif model == "target":
52
+ return self.target(input)
53
+
54
+
55
+
56
+ class MetricLogger:
57
+ def __init__(self, save_dir):
58
+ self.writer = SummaryWriter(log_dir=save_dir)
59
+ self.save_log = save_dir / "log"
60
+ with open(self.save_log, "w") as f:
61
+ f.write(
62
+ f"{'Episode':>8}{'Step':>8}{'Epsilon':>10}{'MeanReward':>15}"
63
+ f"{'MeanLength':>15}{'MeanLoss':>15}{'MeanQValue':>15}"
64
+ f"{'TimeDelta':>15}{'Time':>20}\n"
65
+ )
66
+ self.ep_rewards_plot = save_dir / "reward_plot.jpg"
67
+ self.ep_lengths_plot = save_dir / "length_plot.jpg"
68
+ self.ep_avg_losses_plot = save_dir / "loss_plot.jpg"
69
+ self.ep_avg_qs_plot = save_dir / "q_plot.jpg"
70
+
71
+ # History metrics
72
+ self.ep_rewards = []
73
+ self.ep_lengths = []
74
+ self.ep_avg_losses = []
75
+ self.ep_avg_qs = []
76
+
77
+ # Moving averages, added for every call to record()
78
+ self.moving_avg_ep_rewards = []
79
+ self.moving_avg_ep_lengths = []
80
+ self.moving_avg_ep_avg_losses = []
81
+ self.moving_avg_ep_avg_qs = []
82
+
83
+ # Current episode metric
84
+ self.init_episode()
85
+
86
+ # Timing
87
+ self.record_time = time.time()
88
+
89
+ def log_step(self, reward, loss, q):
90
+ self.curr_ep_reward += reward
91
+ self.curr_ep_length += 1
92
+ if loss:
93
+ self.curr_ep_loss += loss
94
+ self.curr_ep_q += q
95
+ self.curr_ep_loss_length += 1
96
+
97
+ def log_episode(self, episode_number):
98
+ "Mark end of episode"
99
+ self.ep_rewards.append(self.curr_ep_reward)
100
+ self.ep_lengths.append(self.curr_ep_length)
101
+ if self.curr_ep_loss_length == 0:
102
+ ep_avg_loss = 0
103
+ ep_avg_q = 0
104
+ else:
105
+ ep_avg_loss = np.round(self.curr_ep_loss / self.curr_ep_loss_length, 5)
106
+ ep_avg_q = np.round(self.curr_ep_q / self.curr_ep_loss_length, 5)
107
+ self.ep_avg_losses.append(ep_avg_loss)
108
+ self.ep_avg_qs.append(ep_avg_q)
109
+ self.writer.add_scalar("Avg Loss for episode", ep_avg_loss, episode_number)
110
+ self.writer.add_scalar("Avg Q value for episode", ep_avg_q, episode_number)
111
+ self.writer.flush()
112
+ self.init_episode()
113
+
114
+ def init_episode(self):
115
+ self.curr_ep_reward = 0.0
116
+ self.curr_ep_length = 0
117
+ self.curr_ep_loss = 0.0
118
+ self.curr_ep_q = 0.0
119
+ self.curr_ep_loss_length = 0
120
+
121
+ def record(self, episode, epsilon, step):
122
+ mean_ep_reward = np.round(np.mean(self.ep_rewards[-100:]), 3)
123
+ mean_ep_length = np.round(np.mean(self.ep_lengths[-100:]), 3)
124
+ mean_ep_loss = np.round(np.mean(self.ep_avg_losses[-100:]), 3)
125
+ mean_ep_q = np.round(np.mean(self.ep_avg_qs[-100:]), 3)
126
+ self.moving_avg_ep_rewards.append(mean_ep_reward)
127
+ self.moving_avg_ep_lengths.append(mean_ep_length)
128
+ self.moving_avg_ep_avg_losses.append(mean_ep_loss)
129
+ self.moving_avg_ep_avg_qs.append(mean_ep_q)
130
+
131
+ last_record_time = self.record_time
132
+ self.record_time = time.time()
133
+ time_since_last_record = np.round(self.record_time - last_record_time, 3)
134
+
135
+ print(
136
+ f"Episode {episode} - "
137
+ f"Step {step} - "
138
+ f"Epsilon {epsilon} - "
139
+ f"Mean Reward {mean_ep_reward} - "
140
+ f"Mean Length {mean_ep_length} - "
141
+ f"Mean Loss {mean_ep_loss} - "
142
+ f"Mean Q Value {mean_ep_q} - "
143
+ f"Time Delta {time_since_last_record} - "
144
+ f"Time {datetime.datetime.now().strftime('%Y-%m-%dT%H:%M:%S')}"
145
+ )
146
+ self.writer.add_scalar("Mean reward last 100 episodes", mean_ep_reward, episode)
147
+ self.writer.add_scalar("Mean length last 100 episodes", mean_ep_length, episode)
148
+ self.writer.add_scalar("Mean loss last 100 episodes", mean_ep_loss, episode)
149
+ self.writer.add_scalar("Mean reward last 100 episodes", mean_ep_reward, episode)
150
+ self.writer.add_scalar("Epsilon value", epsilon, episode)
151
+ self.writer.add_scalar("Mean Q Value last 100 episodes", mean_ep_q, episode)
152
+ self.writer.flush()
153
+ with open(self.save_log, "a") as f:
154
+ f.write(
155
+ f"{episode:8d}{step:8d}{epsilon:10.3f}"
156
+ f"{mean_ep_reward:15.3f}{mean_ep_length:15.3f}{mean_ep_loss:15.3f}{mean_ep_q:15.3f}"
157
+ f"{time_since_last_record:15.3f}"
158
+ f"{datetime.datetime.now().strftime('%Y-%m-%dT%H:%M:%S'):>20}\n"
159
+ )
160
+
161
+ for metric in ["ep_rewards", "ep_lengths", "ep_avg_losses", "ep_avg_qs"]:
162
+ plt.plot(getattr(self, f"moving_avg_{metric}"))
163
+ plt.savefig(getattr(self, f"{metric}_plot"))
164
+ plt.clf()
165
+
166
+
167
+ class DQNAgent:
168
+ def __init__(self,
169
+ state_dim,
170
+ action_dim,
171
+ save_dir,
172
+ checkpoint=None,
173
+ learning_rate=0.00025,
174
+ max_memory_size=100000,
175
+ batch_size=32,
176
+ exploration_rate=1,
177
+ exploration_rate_decay=0.9999999,
178
+ exploration_rate_min=0.1,
179
+ training_frequency=1,
180
+ learning_starts=1000,
181
+ target_network_sync_frequency=500,
182
+ reset_exploration_rate=False,
183
+ save_frequency=100000,
184
+ gamma=0.9,
185
+ load_replay_buffer=True):
186
+ self.state_dim = state_dim
187
+ self.action_dim = action_dim
188
+ self.max_memory_size = max_memory_size
189
+ self.memory = deque(maxlen=max_memory_size)
190
+ self.batch_size = batch_size
191
+
192
+ self.exploration_rate = exploration_rate
193
+ self.exploration_rate_decay = exploration_rate_decay
194
+ self.exploration_rate_min = exploration_rate_min
195
+ self.gamma = gamma
196
+
197
+ self.curr_step = 0
198
+ self.learning_starts = learning_starts # min. experiences before training
199
+
200
+ self.training_frequency = training_frequency # no. of experiences between updates to Q_online
201
+ self.target_network_sync_frequency = target_network_sync_frequency # no. of experiences between Q_target & Q_online sync
202
+
203
+ self.save_every = save_frequency # no. of experiences between saving the network
204
+ self.save_dir = save_dir
205
+
206
+ self.use_cuda = torch.cuda.is_available()
207
+
208
+ self.net = DQNet(self.state_dim, self.action_dim).float()
209
+ if self.use_cuda:
210
+ self.net = self.net.to(device='cuda')
211
+ if checkpoint:
212
+ self.load(checkpoint, reset_exploration_rate, load_replay_buffer)
213
+
214
+ self.optimizer = torch.optim.AdamW(self.net.parameters(), lr=learning_rate, amsgrad=True)
215
+ self.loss_fn = torch.nn.SmoothL1Loss()
216
+
217
+
218
+ def act(self, state):
219
+ """
220
+ Given a state, choose an epsilon-greedy action and update value of step.
221
+
222
+ Inputs:
223
+ state(LazyFrame): A single observation of the current state, dimension is (state_dim)
224
+ Outputs:
225
+ action_idx (int): An integer representing which action the agent will perform
226
+ """
227
+ # EXPLORE
228
+ if np.random.rand() < self.exploration_rate:
229
+ action_idx = np.random.randint(self.action_dim)
230
+
231
+ # EXPLOIT
232
+ else:
233
+ state = torch.FloatTensor(state).cuda() if self.use_cuda else torch.FloatTensor(state)
234
+ state = state.unsqueeze(0)
235
+ action_values = self.net(state, model='online')
236
+ action_idx = torch.argmax(action_values, axis=1).item()
237
+
238
+ # decrease exploration_rate
239
+ self.exploration_rate *= self.exploration_rate_decay
240
+ self.exploration_rate = max(self.exploration_rate_min, self.exploration_rate)
241
+
242
+ # increment step
243
+ self.curr_step += 1
244
+ return action_idx
245
+
246
+ def cache(self, state, next_state, action, reward, done):
247
+ """
248
+ Store the experience to self.memory (replay buffer)
249
+
250
+ Inputs:
251
+ state (LazyFrame),
252
+ next_state (LazyFrame),
253
+ action (int),
254
+ reward (float),
255
+ done(bool))
256
+ """
257
+ state = torch.FloatTensor(state).cuda() if self.use_cuda else torch.FloatTensor(state)
258
+ next_state = torch.FloatTensor(next_state).cuda() if self.use_cuda else torch.FloatTensor(next_state)
259
+ action = torch.LongTensor([action]).cuda() if self.use_cuda else torch.LongTensor([action])
260
+ reward = torch.DoubleTensor([reward]).cuda() if self.use_cuda else torch.DoubleTensor([reward])
261
+ done = torch.BoolTensor([done]).cuda() if self.use_cuda else torch.BoolTensor([done])
262
+
263
+ self.memory.append( (state, next_state, action, reward, done,) )
264
+
265
+
266
+ def recall(self):
267
+ """
268
+ Retrieve a batch of experiences from memory
269
+ """
270
+ batch = random.sample(self.memory, self.batch_size)
271
+ state, next_state, action, reward, done = map(torch.stack, zip(*batch))
272
+ return state, next_state, action.squeeze(), reward.squeeze(), done.squeeze()
273
+
274
+
275
+ def td_estimate(self, states, actions):
276
+ actions = actions.reshape(-1, 1)
277
+ predicted_qs = self.net(states, model='online')# Q_online(s,a)
278
+ predicted_qs = predicted_qs.gather(1, actions)
279
+ return predicted_qs
280
+
281
+
282
+ @torch.no_grad()
283
+ def td_target(self, rewards, next_states, dones):
284
+ rewards = rewards.reshape(-1, 1)
285
+ dones = dones.reshape(-1, 1)
286
+ target_qs = self.net(next_states, model='target')
287
+ target_qs = torch.max(target_qs, dim=1).values
288
+ target_qs = target_qs.reshape(-1, 1)
289
+ target_qs[dones] = 0.0
290
+ return (rewards + (self.gamma * target_qs))
291
+
292
+ def update_Q_online(self, td_estimate, td_target) :
293
+ loss = self.loss_fn(td_estimate, td_target)
294
+ self.optimizer.zero_grad()
295
+ loss.backward()
296
+ self.optimizer.step()
297
+ return loss.item()
298
+
299
+
300
+ def sync_Q_target(self):
301
+ self.net.target.load_state_dict(self.net.online.state_dict())
302
+
303
+
304
+ def learn(self):
305
+ if self.curr_step % self.target_network_sync_frequency == 0:
306
+ self.sync_Q_target()
307
+
308
+ if self.curr_step % self.save_every == 0:
309
+ self.save()
310
+
311
+ if self.curr_step < self.learning_starts:
312
+ return None, None
313
+
314
+ if self.curr_step % self.training_frequency != 0:
315
+ return None, None
316
+
317
+ # Sample from memory
318
+ state, next_state, action, reward, done = self.recall()
319
+
320
+ # Get TD Estimate
321
+ td_est = self.td_estimate(state, action)
322
+
323
+ # Get TD Target
324
+ td_tgt = self.td_target(reward, next_state, done)
325
+
326
+ # Backpropagate loss through Q_online
327
+ loss = self.update_Q_online(td_est, td_tgt)
328
+
329
+ return (td_est.mean().item(), loss)
330
+
331
+
332
+ def save(self):
333
+ save_path = self.save_dir / f"airstriker_net_{int(self.curr_step // self.save_every)}.chkpt"
334
+ torch.save(
335
+ dict(
336
+ model=self.net.state_dict(),
337
+ exploration_rate=self.exploration_rate,
338
+ replay_memory=self.memory
339
+ ),
340
+ save_path
341
+ )
342
+
343
+ print(f"Airstriker model saved to {save_path} at step {self.curr_step}")
344
+
345
+
346
+ def load(self, load_path, reset_exploration_rate, load_replay_buffer):
347
+ if not load_path.exists():
348
+ raise ValueError(f"{load_path} does not exist")
349
+
350
+ ckp = torch.load(load_path, map_location=('cuda' if self.use_cuda else 'cpu'))
351
+ exploration_rate = ckp.get('exploration_rate')
352
+ state_dict = ckp.get('model')
353
+
354
+
355
+ print(f"Loading model at {load_path} with exploration rate {exploration_rate}")
356
+ self.net.load_state_dict(state_dict)
357
+
358
+ if load_replay_buffer:
359
+ replay_memory = ckp.get('replay_memory')
360
+ print(f"Loading replay memory. Len {len(replay_memory)}" if replay_memory else "Saved replay memory not found. Not restoring replay memory.")
361
+ self.memory = replay_memory if replay_memory else self.memory
362
+
363
+ if reset_exploration_rate:
364
+ print(f"Reset exploration rate option specified. Not restoring saved exploration rate {exploration_rate}. The current exploration rate is {self.exploration_rate}")
365
+ else:
366
+ print(f"Setting exploration rate to {exploration_rate} not loaded.")
367
+ self.exploration_rate = exploration_rate
368
+
369
+
370
+ class DDQNAgent(DQNAgent):
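+ # Double DQN: the online network selects the greedy next action and the target network
+ # evaluates it, which reduces the Q-value overestimation of vanilla DQN's max operator.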
371
+ @torch.no_grad()
372
+ def td_target(self, rewards, next_states, dones):
373
+ rewards = rewards.reshape(-1, 1)
374
+ dones = dones.reshape(-1, 1)
375
+ q_vals = self.net(next_states, model='online')
376
+ target_actions = torch.argmax(q_vals, axis=1)
377
+ target_actions = target_actions.reshape(-1, 1)
378
+
379
+ target_qs = self.net(next_states, model='target')
380
+ target_qs = target_qs.gather(1, target_actions)
381
+ target_qs = target_qs.reshape(-1, 1)
382
+ target_qs[dones] = 0.0
383
+ return (rewards + (self.gamma * target_qs))
384
+
385
+
386
+ class DuelingDQNet(nn.Module):
387
+ def __init__(self, input_dim, output_dim):
388
+ super().__init__()
389
+ print("#################################")
390
+ print("#################################")
391
+ print(input_dim)
392
+ print(output_dim)
393
+ print("#################################")
394
+ print("#################################")
395
+ c, h, w = input_dim
396
+
397
+
398
+ self.conv_layer = nn.Sequential(
399
+ nn.Conv2d(in_channels=c, out_channels=32, kernel_size=8, stride=4),
400
+ nn.ReLU(),
401
+ nn.Conv2d(in_channels=32, out_channels=64, kernel_size=4, stride=2),
402
+ nn.ReLU(),
403
+ nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, stride=1),
404
+ nn.ReLU(),
405
+
406
+ )
407
+
408
+
409
+ self.value_layer = nn.Sequential(
410
+ nn.Linear(7168, 128),
411
+ nn.ReLU(),
412
+ nn.Linear(128, 1)
413
+ )
414
+
415
+ self.advantage_layer = nn.Sequential(
416
+ nn.Linear(7168, 128),
417
+ nn.ReLU(),
418
+ nn.Linear(128, output_dim)
419
+ )
420
+
421
+ def forward(self, state):
422
+ conv_output = self.conv_layer(state)
423
+ conv_output = conv_output.view(conv_output.size(0), -1)
424
+ value = self.value_layer(conv_output)
425
+ advantage = self.advantage_layer(conv_output)
426
+ # Dueling aggregation: Q(s,a) = V(s) + (A(s,a) - mean_a A(s,a)); average over the action dimension only.
+ q_value = value + (advantage - advantage.mean(dim=1, keepdim=True))
427
+
428
+ return q_value
429
+
430
+
431
+ class DuelingDQNAgent:
432
+ def __init__(self,
433
+ state_dim,
434
+ action_dim,
435
+ save_dir,
436
+ checkpoint=None,
437
+ learning_rate=0.00025,
438
+ max_memory_size=100000,
439
+ batch_size=32,
440
+ exploration_rate=1,
441
+ exploration_rate_decay=0.9999999,
442
+ exploration_rate_min=0.1,
443
+ training_frequency=1,
444
+ learning_starts=1000,
445
+ target_network_sync_frequency=500,
446
+ reset_exploration_rate=False,
447
+ save_frequency=100000,
448
+ gamma=0.9,
449
+ load_replay_buffer=True):
450
+ self.state_dim = state_dim
451
+ self.action_dim = action_dim
452
+ self.max_memory_size = max_memory_size
453
+ self.memory = deque(maxlen=max_memory_size)
454
+ self.batch_size = batch_size
455
+
456
+ self.exploration_rate = exploration_rate
457
+ self.exploration_rate_decay = exploration_rate_decay
458
+ self.exploration_rate_min = exploration_rate_min
459
+ self.gamma = gamma
460
+
461
+ self.curr_step = 0
462
+ self.learning_starts = learning_starts # min. experiences before training
463
+
464
+ self.training_frequency = training_frequency # no. of experiences between updates to Q_online
465
+ self.target_network_sync_frequency = target_network_sync_frequency # no. of experiences between Q_target & Q_online sync
466
+
467
+ self.save_every = save_frequency # no. of experiences between saving the network
468
+ self.save_dir = save_dir
469
+
470
+ self.use_cuda = torch.cuda.is_available()
471
+
472
+
473
+ self.online_net = DuelingDQNet(self.state_dim, self.action_dim).float()
474
+ self.target_net = copy.deepcopy(self.online_net)
475
+ # Q_target parameters are frozen.
476
+ for p in self.target_net.parameters():
477
+ p.requires_grad = False
478
+
479
+ if self.use_cuda:
480
+ self.online_net = self.online_net.to(device='cuda')
481
+ self.target_net = self.target_net.to(device='cuda')
482
+ if checkpoint:
483
+ self.load(checkpoint, reset_exploration_rate, load_replay_buffer)
484
+
485
+ self.optimizer = torch.optim.AdamW(self.online_net.parameters(), lr=learning_rate, amsgrad=True)
486
+ self.loss_fn = torch.nn.SmoothL1Loss()
487
+
488
+
489
+ def act(self, state):
490
+ """
491
+ Given a state, choose an epsilon-greedy action and update value of step.
492
+
493
+ Inputs:
494
+ state(LazyFrame): A single observation of the current state, dimension is (state_dim)
495
+ Outputs:
496
+ action_idx (int): An integer representing which action the agent will perform
497
+ """
498
+ # EXPLORE
499
+ if np.random.rand() < self.exploration_rate:
500
+ action_idx = np.random.randint(self.action_dim)
501
+
502
+ # EXPLOIT
503
+ else:
504
+ state = torch.FloatTensor(state).cuda() if self.use_cuda else torch.FloatTensor(state)
505
+ state = state.unsqueeze(0)
506
+ action_values = self.online_net(state)
507
+ action_idx = torch.argmax(action_values, axis=1).item()
508
+
509
+ # decrease exploration_rate
510
+ self.exploration_rate *= self.exploration_rate_decay
511
+ self.exploration_rate = max(self.exploration_rate_min, self.exploration_rate)
512
+
513
+ # increment step
514
+ self.curr_step += 1
515
+ return action_idx
516
+
517
+ def cache(self, state, next_state, action, reward, done):
518
+ """
519
+ Store the experience to self.memory (replay buffer)
520
+
521
+ Inputs:
522
+ state (LazyFrame),
523
+ next_state (LazyFrame),
524
+ action (int),
525
+ reward (float),
526
+ done(bool))
527
+ """
528
+ state = torch.FloatTensor(state).cuda() if self.use_cuda else torch.FloatTensor(state)
529
+ next_state = torch.FloatTensor(next_state).cuda() if self.use_cuda else torch.FloatTensor(next_state)
530
+ action = torch.LongTensor([action]).cuda() if self.use_cuda else torch.LongTensor([action])
531
+ reward = torch.DoubleTensor([reward]).cuda() if self.use_cuda else torch.DoubleTensor([reward])
532
+ done = torch.BoolTensor([done]).cuda() if self.use_cuda else torch.BoolTensor([done])
533
+
534
+ self.memory.append( (state, next_state, action, reward, done,) )
535
+
536
+
537
+ def recall(self):
538
+ """
539
+ Retrieve a batch of experiences from memory
540
+ """
541
+ batch = random.sample(self.memory, self.batch_size)
542
+ state, next_state, action, reward, done = map(torch.stack, zip(*batch))
543
+ return state, next_state, action.squeeze(), reward.squeeze(), done.squeeze()
544
+
545
+
546
+ def td_estimate(self, states, actions):
547
+ actions = actions.reshape(-1, 1)
548
+ predicted_qs = self.online_net(states)# Q_online(s,a)
549
+ predicted_qs = predicted_qs.gather(1, actions)
550
+ return predicted_qs
551
+
552
+
553
+ @torch.no_grad()
554
+ def td_target(self, rewards, next_states, dones):
555
+ rewards = rewards.reshape(-1, 1)
556
+ dones = dones.reshape(-1, 1)
557
+ target_qs = self.target_net.forward(next_states)
558
+ target_qs = torch.max(target_qs, dim=1).values
559
+ target_qs = target_qs.reshape(-1, 1)
560
+ target_qs[dones] = 0.0
561
+ return (rewards + (self.gamma * target_qs))
562
+
563
+ def update_Q_online(self, td_estimate, td_target) :
564
+ loss = self.loss_fn(td_estimate, td_target)
565
+ self.optimizer.zero_grad()
566
+ loss.backward()
567
+ self.optimizer.step()
568
+ return loss.item()
569
+
570
+
571
+ def sync_Q_target(self):
572
+ self.target_net.load_state_dict(self.online_net.state_dict())
573
+
574
+
575
+ def learn(self):
576
+ if self.curr_step % self.target_network_sync_frequency == 0:
577
+ self.sync_Q_target()
578
+
579
+ if self.curr_step % self.save_every == 0:
580
+ self.save()
581
+
582
+ if self.curr_step < self.learning_starts:
583
+ return None, None
584
+
585
+ if self.curr_step % self.training_frequency != 0:
586
+ return None, None
587
+
588
+ # Sample from memory
589
+ state, next_state, action, reward, done = self.recall()
590
+
591
+ # Get TD Estimate
592
+ td_est = self.td_estimate(state, action)
593
+
594
+ # Get TD Target
595
+ td_tgt = self.td_target(reward, next_state, done)
596
+
597
+ # Backpropagate loss through Q_online
598
+ loss = self.update_Q_online(td_est, td_tgt)
599
+
600
+ return (td_est.mean().item(), loss)
601
+
602
+
603
+ def save(self):
604
+ save_path = self.save_dir / f"airstriker_net_{int(self.curr_step // self.save_every)}.chkpt"
605
+ torch.save(
606
+ dict(
607
+ model=self.online_net.state_dict(),
608
+ exploration_rate=self.exploration_rate,
609
+ replay_memory=self.memory
610
+ ),
611
+ save_path
612
+ )
613
+
614
+ print(f"Airstriker model saved to {save_path} at step {self.curr_step}")
615
+
616
+
617
+ def load(self, load_path, reset_exploration_rate, load_replay_buffer):
618
+ if not load_path.exists():
619
+ raise ValueError(f"{load_path} does not exist")
620
+
621
+ ckp = torch.load(load_path, map_location=('cuda' if self.use_cuda else 'cpu'))
622
+ exploration_rate = ckp.get('exploration_rate')
623
+ state_dict = ckp.get('model')
624
+
625
+
626
+ print(f"Loading model at {load_path} with exploration rate {exploration_rate}")
627
+ self.online_net.load_state_dict(state_dict)
628
+ self.target_net = copy.deepcopy(self.online_net)
629
+ self.sync_Q_target()
630
+
631
+ if load_replay_buffer:
632
+ replay_memory = ckp.get('replay_memory')
633
+ print(f"Loading replay memory. Len {len(replay_memory)}" if replay_memory else "Saved replay memory not found. Not restoring replay memory.")
634
+ self.memory = replay_memory if replay_memory else self.memory
635
+
636
+ if reset_exploration_rate:
637
+ print(f"Reset exploration rate option specified. Not restoring saved exploration rate {exploration_rate}. The current exploration rate is {self.exploration_rate}")
638
+ else:
639
+ print(f"Setting exploration rate to {exploration_rate} not loaded.")
640
+ self.exploration_rate = exploration_rate
641
+
642
+
643
+
644
+
645
+ class DuelingDDQNAgent(DuelingDQNAgent):
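+ # Dueling + Double DQN: keeps the dueling value/advantage network but overrides td_target
+ # so the online network chooses the next action and the target network scores it.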
646
+ @torch.no_grad()
647
+ def td_target(self, rewards, next_states, dones):
648
+ rewards = rewards.reshape(-1, 1)
649
+ dones = dones.reshape(-1, 1)
650
+ q_vals = self.online_net.forward(next_states)
651
+ target_actions = torch.argmax(q_vals, axis=1)
652
+ target_actions = target_actions.reshape(-1, 1)
653
+
654
+ target_qs = self.target_net.forward(next_states)
655
+ target_qs = target_qs.gather(1, target_actions)
656
+ target_qs = target_qs.reshape(-1, 1)
657
+ target_qs[dones] = 0.0
658
+ return (rewards + (self.gamma * target_qs))
659
+
660
+
661
+
662
+
663
+
664
+
src/procgen/run-starpilot-ddqn.py ADDED
@@ -0,0 +1,45 @@
1
+ import os
2
+ import torch
3
+ from pathlib import Path
4
+
5
+ from agent import DDQNAgent, MetricLogger
6
+ from wrappers import make_starpilot
7
+ import os
8
+ from train import train, fill_memory
9
+
10
+
11
+ env = make_starpilot()
12
+
13
+ use_cuda = torch.cuda.is_available()
14
+ print(f"Using CUDA: {use_cuda}\n")
15
+
16
+ checkpoint = None
17
+ # checkpoint = Path('checkpoints/latest/airstriker_net_3.chkpt')
18
+
19
+ path = "checkpoints/procgen-starpilot-ddqn"
20
+ save_dir = Path(path)
21
+
22
+ isExist = os.path.exists(path)
23
+ if not isExist:
24
+ os.makedirs(path)
25
+
26
+ logger = MetricLogger(save_dir)
27
+
28
+ print("Training DDQN Agent!")
29
+ agent = DDQNAgent(
30
+ state_dim=(1, 64, 64),
31
+ action_dim=env.action_space.n,
32
+ save_dir=save_dir,
33
+ batch_size=256,
34
+ checkpoint=checkpoint,
35
+ exploration_rate_decay=0.999995,
36
+ exploration_rate_min=0.05,
37
+ training_frequency=1,
38
+ target_network_sync_frequency=200,
39
+ max_memory_size=50000,
40
+ learning_rate=0.0005,
41
+
42
+ )
43
+
44
+ fill_memory(agent, env, 300)
45
+ train(agent, env, logger)
src/procgen/run-starpilot-dqn.py ADDED
@@ -0,0 +1,45 @@
1
+ import os
2
+ import torch
3
+ from pathlib import Path
4
+
5
+ from agent import DQNAgent, MetricLogger
6
+ from wrappers import make_starpilot
7
+ import os
8
+ from train import train, fill_memory
9
+
10
+
11
+ env = make_starpilot()
12
+
13
+ use_cuda = torch.cuda.is_available()
14
+ print(f"Using CUDA: {use_cuda}\n")
15
+
16
+ checkpoint = None
17
+ # checkpoint = Path('checkpoints/latest/airstriker_net_3.chkpt')
18
+
19
+ path = "checkpoints/procgen-starpilot-dqn"
20
+ save_dir = Path(path)
21
+
22
+ isExist = os.path.exists(path)
23
+ if not isExist:
24
+ os.makedirs(path)
25
+
26
+ logger = MetricLogger(save_dir)
27
+
28
+ print("Training Vanilla DQN Agent!")
29
+ agent = DQNAgent(
30
+ state_dim=(1, 64, 64),
31
+ action_dim=env.action_space.n,
32
+ save_dir=save_dir,
33
+ batch_size=256,
34
+ checkpoint=checkpoint,
35
+ exploration_rate_decay=0.999995,
36
+ exploration_rate_min=0.05,
37
+ training_frequency=1,
38
+ target_network_sync_frequency=200,
39
+ max_memory_size=50000,
40
+ learning_rate=0.0005,
41
+
42
+ )
43
+
44
+ fill_memory(agent, env, 300)
45
+ train(agent, env, logger)
src/procgen/run-starpilot-dueling-ddqn.py ADDED
@@ -0,0 +1,45 @@
1
+ import os
2
+ import torch
3
+ from pathlib import Path
4
+
5
+ from agent import DuelingDDQNAgent, MetricLogger
6
+ from wrappers import make_starpilot
7
+ import os
8
+ from train import train, fill_memory
9
+
10
+
11
+ env = make_starpilot()
12
+
13
+ use_cuda = torch.cuda.is_available()
14
+ print(f"Using CUDA: {use_cuda}\n")
15
+
16
+ checkpoint = None
17
+ # checkpoint = Path('checkpoints/latest/airstriker_net_3.chkpt')
18
+
19
+ path = "checkpoints/procgen-starpilot-dueling-ddqn"
20
+ save_dir = Path(path)
21
+
22
+ isExist = os.path.exists(path)
23
+ if not isExist:
24
+ os.makedirs(path)
25
+
26
+ logger = MetricLogger(save_dir)
27
+
28
+ print("Training Dueling Double DQN Agent!")
29
+ agent = DuelingDDQNAgent(
30
+ state_dim=(1, 64, 64),
31
+ action_dim=env.action_space.n,
32
+ save_dir=save_dir,
33
+ batch_size=256,
34
+ checkpoint=checkpoint,
35
+ exploration_rate_decay=0.999995,
36
+ exploration_rate_min=0.05,
37
+ training_frequency=1,
38
+ target_network_sync_frequency=200,
39
+ max_memory_size=50000,
40
+ learning_rate=0.0005,
41
+
42
+ )
43
+
44
+ # fill_memory(agent, env, 300)
45
+ train(agent, env, logger)
src/procgen/run-starpilot-dueling-dqn.py ADDED
@@ -0,0 +1,45 @@
1
+ import os
2
+ import torch
3
+ from pathlib import Path
4
+
5
+ from agent import DuelingDQNAgent, MetricLogger
6
+ from wrappers import make_starpilot
7
+ import os
8
+ from train import train, fill_memory
9
+
10
+
11
+ env = make_starpilot()
12
+
13
+ use_cuda = torch.cuda.is_available()
14
+ print(f"Using CUDA: {use_cuda}\n")
15
+
16
+ checkpoint = None
17
+ # checkpoint = Path('checkpoints/latest/airstriker_net_3.chkpt')
18
+
19
+ path = "checkpoints/procgen-starpilot-dueling-dqn"
20
+ save_dir = Path(path)
21
+
22
+ isExist = os.path.exists(path)
23
+ if not isExist:
24
+ os.makedirs(path)
25
+
26
+ logger = MetricLogger(save_dir)
27
+
28
+ print("Training Dueling DQN Agent!")
29
+ agent = DuelingDQNAgent(
30
+ state_dim=(1, 64, 64),
31
+ action_dim=env.action_space.n,
32
+ save_dir=save_dir,
33
+ batch_size=256,
34
+ checkpoint=checkpoint,
35
+ exploration_rate_decay=0.999995,
36
+ exploration_rate_min=0.05,
37
+ training_frequency=1,
38
+ target_network_sync_frequency=200,
39
+ max_memory_size=50000,
40
+ learning_rate=0.0005,
41
+
42
+ )
43
+
44
+ # fill_memory(agent, env, 300)
45
+ train(agent, env, logger)
src/procgen/test-procgen.py ADDED
@@ -0,0 +1,12 @@
1
+ import gym
2
+ env = gym.make("procgen:procgen-starpilot-v0")
3
+
4
+ obs = env.reset()
5
+ step = 0
6
+ while True:
7
+ obs, rew, done, info = env.step(env.action_space.sample())
8
+ print(info)
9
+ print(f"step {step} reward {rew} done {done}")
10
+ step += 1
11
+ if done:
12
+ break
src/procgen/train.py ADDED
@@ -0,0 +1,48 @@
1
+ from tqdm import trange
2
+
3
+ def fill_memory(agent, env, num_episodes=500):
4
+ print("Filling up memory....")
5
+ for _ in trange(num_episodes):
6
+ state = env.reset()
7
+ done = False
8
+ while not done:
9
+ action = agent.act(state)
10
+ next_state, reward, done, _ = env.step(action)
11
+ agent.cache(state, next_state, action, reward, done)
12
+ state = next_state
13
+
14
+
15
+ def train(agent, env, logger):
16
+ episodes = 5000
17
+ for e in range(episodes):
18
+
19
+ state = env.reset()
20
+ # Play the game!
21
+ while True:
22
+
23
+ # Run agent on the state
24
+ action = agent.act(state)
25
+
26
+ # Agent performs action
27
+ next_state, reward, done, info = env.step(action)
28
+
29
+ # Remember
30
+ agent.cache(state, next_state, action, reward, done)
31
+
32
+ # Learn
33
+ q, loss = agent.learn()
34
+
35
+ # Logging
36
+ logger.log_step(reward, loss, q)
37
+
38
+ # Update state
39
+ state = next_state
40
+
41
+ # Check if end of game
42
+ if done:
43
+ break
44
+
45
+ logger.log_episode(e)
46
+
47
+ if e % 20 == 0:
48
+ logger.record(episode=e, epsilon=agent.exploration_rate, step=agent.curr_step)
src/procgen/wrappers.py ADDED
@@ -0,0 +1,187 @@
1
+ import numpy as np
2
+ import os
3
+ from collections import deque
4
+ import gym
5
+ from gym import spaces
6
+ import cv2
7
+
8
+
9
+ '''
10
+ Atari Wrapper copied from https://github.com/openai/baselines/blob/master/baselines/common/atari_wrappers.py
11
+ '''
12
+
13
+
14
+ class LazyFrames(object):
15
+ def __init__(self, frames):
16
+ """This object ensures that common frames between the observations are only stored once.
17
+ It exists purely to optimize memory usage which can be huge for DQN's 1M frames replay
18
+ buffers.
19
+ This object should only be converted to numpy array before being passed to the model.
20
+ You'd not believe how complex the previous solution was."""
21
+ self._frames = frames
22
+ self._out = None
23
+
24
+ def _force(self):
25
+ if self._out is None:
26
+ self._out = np.concatenate(self._frames, axis=2)
27
+ self._frames = None
28
+ return self._out
29
+
30
+ def __array__(self, dtype=None):
31
+ out = self._force()
32
+ if dtype is not None:
33
+ out = out.astype(dtype)
34
+ return out
35
+
36
+ def __len__(self):
37
+ return len(self._force())
38
+
39
+ def __getitem__(self, i):
40
+ return self._force()[i]
41
+
42
+ class FireResetEnv(gym.Wrapper):
43
+ def __init__(self, env):
44
+ """Take action on reset for environments that are fixed until firing."""
45
+ gym.Wrapper.__init__(self, env)
46
+ assert env.unwrapped.get_action_meanings()[1] == 'FIRE'
47
+ assert len(env.unwrapped.get_action_meanings()) >= 3
48
+
49
+ def reset(self, **kwargs):
50
+ self.env.reset(**kwargs)
51
+ obs, _, done, _ = self.env.step(1)
52
+ if done:
53
+ self.env.reset(**kwargs)
54
+ obs, _, done, _ = self.env.step(2)
55
+ if done:
56
+ self.env.reset(**kwargs)
57
+ return obs
58
+
59
+ def step(self, ac):
60
+ return self.env.step(ac)
61
+
62
+
63
+ class MaxAndSkipEnv(gym.Wrapper):
64
+ def __init__(self, env, skip=4):
65
+ """Return only every `skip`-th frame"""
66
+ gym.Wrapper.__init__(self, env)
67
+ # most recent raw observations (for max pooling across time steps)
68
+ self._obs_buffer = np.zeros((2,)+env.observation_space.shape, dtype=np.uint8)
69
+ self._skip = skip
70
+
71
+ def step(self, action):
72
+ """Repeat action, sum reward, and max over last observations."""
73
+ total_reward = 0.0
74
+ done = None
75
+ for i in range(self._skip):
76
+ obs, reward, done, info = self.env.step(action)
77
+ if i == self._skip - 2: self._obs_buffer[0] = obs
78
+ if i == self._skip - 1: self._obs_buffer[1] = obs
79
+ total_reward += reward
80
+ if done:
81
+ break
82
+ # Note that the observation on the done=True frame
83
+ # doesn't matter
84
+ max_frame = self._obs_buffer.max(axis=0)
85
+
86
+ return max_frame, total_reward, done, info
87
+
88
+ def reset(self, **kwargs):
89
+ return self.env.reset(**kwargs)
90
+
91
+
92
+
93
+ class WarpFrame(gym.ObservationWrapper):
94
+ def __init__(self, env):
95
+ """Warp frames to 84x84 as done in the Nature paper and later work."""
96
+ gym.ObservationWrapper.__init__(self, env)
97
+ self.width = 84
98
+ self.height = 84
99
+ self.observation_space = spaces.Box(low=0, high=255,
100
+ shape=(self.height, self.width, 1), dtype=np.uint8)
101
+
102
+ def observation(self, frame):
103
+ frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
104
+ frame = cv2.resize(frame, (self.width, self.height), interpolation=cv2.INTER_AREA)
105
+ return frame[:, :, None]
106
+
107
+ class WarpFrameNoResize(gym.ObservationWrapper):
108
+ def __init__(self, env):
109
+ """Warp frames to 84x84 as done in the Nature paper and later work."""
110
+ gym.ObservationWrapper.__init__(self, env)
111
+
112
+ def observation(self, frame):
113
+ frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
114
+ # frame = cv2.resize(frame, (self.width, self.height), interpolation=cv2.INTER_AREA)
115
+ return frame[:, :, None]
116
+
117
+
118
+
119
+ class FrameStack(gym.Wrapper):
120
+ def __init__(self, env, k):
121
+ """Stack k last frames.
122
+ Returns lazy array, which is much more memory efficient.
123
+ See Also
124
+ --------
125
+ baselines.common.atari_wrappers.LazyFrames
126
+ """
127
+ gym.Wrapper.__init__(self, env)
128
+ self.k = k
129
+ self.frames = deque([], maxlen=k)
130
+ shp = env.observation_space.shape
131
+ self.observation_space = spaces.Box(low=0, high=255, shape=(shp[0], shp[1], shp[2] * k), dtype=env.observation_space.dtype)
132
+
133
+ def reset(self):
134
+ ob = self.env.reset()
135
+ for _ in range(self.k):
136
+ self.frames.append(ob)
137
+ return self._get_ob()
138
+
139
+ def step(self, action):
140
+ ob, reward, done, info = self.env.step(action)
141
+ self.frames.append(ob)
142
+ return self._get_ob(), reward, done, info
143
+
144
+ def _get_ob(self):
145
+ assert len(self.frames) == self.k
146
+ return LazyFrames(list(self.frames))
147
+
148
+
149
+ class ImageToPyTorch(gym.ObservationWrapper):
150
+ def __init__(self, env):
151
+ super(ImageToPyTorch, self).__init__(env)
152
+ old_shape = self.observation_space.shape
153
+ self.observation_space = gym.spaces.Box(low=0.0, high=1.0, shape=(old_shape[-1], old_shape[0], old_shape[1]), dtype=np.float32)
154
+
155
+ def observation(self, observation):
156
+ return np.moveaxis(observation, 2, 0)
157
+
158
+
159
+ class ScaledFloatFrame(gym.ObservationWrapper):
160
+ def __init__(self, env):
161
+ gym.ObservationWrapper.__init__(self, env)
162
+ self.observation_space = gym.spaces.Box(low=0, high=1, shape=env.observation_space.shape, dtype=np.float32)
163
+
164
+ def observation(self, observation):
165
+ # careful! This undoes the memory optimization, use
166
+ # with smaller replay buffers only.
167
+ return np.array(observation).astype(np.float32) / 255.0
168
+
169
+ class ClipRewardEnv(gym.RewardWrapper):
170
+ def __init__(self, env):
171
+ gym.RewardWrapper.__init__(self, env)
172
+
173
+ def reward(self, reward):
174
+ """Bin reward to {+1, 0, -1} by its sign."""
175
+ return np.sign(reward)
176
+
177
+
178
+ def make_starpilot(render=False):
179
+ print("Environment: Starpilot")
180
+ if render:
181
+ env = gym.make("procgen:procgen-starpilot-v0", distribution_mode="easy", render_mode="human")
182
+ else:
183
+ env = gym.make("procgen:procgen-starpilot-v0", distribution_mode="easy")
184
+ env = WarpFrameNoResize(env) ## Reshape image
185
+ env = ImageToPyTorch(env) ## Invert shape
186
+ env = FrameStack(env, 4) ## Stack last 4 frames
187
+ return env
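+ 
+ # A minimal usage sketch (assumption: the procgen package is installed alongside gym==0.21.0):
+ #   env = make_starpilot()
+ #   state = env.reset()
+ #   next_state, reward, done, info = env.step(env.action_space.sample())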
troubleshooting.md ADDED
@@ -0,0 +1,37 @@
1
+ # ml-reinforcement-learning
2
+
3
+ Python version: 3.7.3
4
+
5
+
6
+ ## Troubleshooting
7
+
8
+
9
+ - RuntimeError: Polyfit sanity test emitted a warning, most likely due to using a buggy Accelerate backend. If you compiled yourself, more information is available at https://numpy.org/doc/stable/user/building.html#accelerated-blas-lapack-libraries Otherwise report this to the vendor that provided NumPy.
10
+ RankWarning: Polyfit may be poorly conditioned
11
+
12
+ ```
13
+ $ pip uninstall numpy
14
+ $ export OPENBLAS=$(brew --prefix openblas)
15
+ $ pip install --no-cache-dir numpy
16
+ ```
17
+
18
+
19
+ During grpcio installation 👇
20
+ distutils.errors.CompileError: command 'clang' failed with exit status 1
21
+ ```
22
+ CFLAGS="-I/Library/Developer/CommandLineTools/usr/include/c++/v1 -I/opt/homebrew/opt/openssl/include" LDFLAGS="-L/opt/homebrew/opt/openssl/lib" pip3 install grpcio
23
+ ```
24
+
25
+
26
+ ModuleNotFoundError: No module named 'gym.envs.classic_control.rendering'
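+ 
+ A likely cause is a gym release newer than 0.21, where this rendering module was removed (this is an assumption about the installed version). Reinstalling the version pinned in the setup below restores it:
+ 
+ ```
+ pip install "gym[atari]==0.21.0"
+ ```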
27
+
28
+
29
+ # Setup
30
+
31
+ ```
32
+ conda install pytorch torchvision -c pytorch
33
+ pip install gym-retro
34
+ conda install numpy
35
+ pip install "gym[atari]==0.21.0"
36
+ pip install importlib-metadata==4.13.0
37
+ ```