---
tags:
  - Pixelcopter-PLE-v0
  - reinforce
  - reinforcement-learning
  - custom-implementation
  - deep-rl-class
model-index:
  - name: policy_grad_2-Pixelcopter-PLE-v0
    results:
      - task:
          type: reinforcement-learning
          name: reinforcement-learning
        dataset:
          name: Pixelcopter-PLE-v0
          type: Pixelcopter-PLE-v0
        metrics:
          - type: mean_reward
            value: 70.30 +/- 33.94
            name: mean_reward
            verified: false
---

# Reinforce Agent playing Pixelcopter-PLE-v0

This is a trained model of a Reinforce agent playing Pixelcopter-PLE-v0. To learn how to use this model and train your own, check Unit 4 of the Deep Reinforcement Learning Course: https://huggingface.co/deep-rl-course/unit4/introduction

## Some math about Pixelcopter training

The game is to fly through a passage and avoid blocks. Suppose we have trained our agent so that the probability of crashing at any given block is $p$ (low enough, I hope). The probability that the copter crashes exactly at the $n$-th block is the product of the probabilities that it survives the previous $(n-1)$ blocks and the probability that it crashes at the current one:

$$P = p \cdot (1-p)^{n-1}$$

The expected value of the index of the block it crashes at is:

$$\langle n \rangle = \sum_{n=1}^\infty n \cdot p \cdot (1-p)^{n-1} = \frac{1}{p}$$

The standard deviation is:

$$\mathrm{std}(n) = \sqrt{\langle n^2 \rangle - \langle n \rangle^2}$$

$$\langle n^2 \rangle = \sum_{n=1}^\infty n^2 \cdot p \cdot (1-p)^{n-1} = \frac{2-p}{p^2}$$

$$\mathrm{std}(n) = \sqrt{\frac{2-p}{p^2} - \left(\frac{1}{p}\right)^2} = \frac{\sqrt{1-p}}{p}$$

So the difference is:

$$\langle n \rangle - \mathrm{std}(n) = \frac{1 - \sqrt{1-p}}{p}$$

As long as $0 \le p \le 1$, we have $\sqrt{1-p} \ge 1-p$, and therefore:

$$\langle n \rangle - \mathrm{std}(n) = \frac{1 - \sqrt{1-p}}{p} \le \frac{1 - (1-p)}{p} = 1$$

The score $s$ in Pixelcopter is the number of blocks passed, decreased by 5 for the crash. So the mean score is lower by 5 and the standard deviation is unchanged. No matter how small $p$ is, our "least score" (mean minus std) satisfies:

$$(\langle n \rangle - 5) - \mathrm{std}(n) = \langle n \rangle - \mathrm{std}(n) - 5 \le -4$$

But since only 10 episodes are used to compute the statistics, and episode length is capped, we can still achieve the goal: the better the agent, the better the chances. Understanding this is disappointing, though.
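To double-check the closed forms above, here is a minimal Python sketch (not part of the training code) that computes $\langle n \rangle$ and $\mathrm{std}(n)$ by direct summation of the series and confirms that $\langle n \rangle - \mathrm{std}(n) \le 1$. The values of $p$ are arbitrary:

```python
# Numeric sanity check of the geometric-distribution identities above.
import math

def geometric_stats(p, n_max=100_000):
    """Mean and std of the crash-block index n, by direct summation."""
    mean = sum(n * p * (1 - p) ** (n - 1) for n in range(1, n_max))
    second = sum(n * n * p * (1 - p) ** (n - 1) for n in range(1, n_max))
    return mean, math.sqrt(second - mean ** 2)

for p in (0.5, 0.1, 0.01):
    mean, std = geometric_stats(p)
    print(f"p={p}: <n>={mean:.2f} (1/p={1/p:.2f}), "
          f"std={std:.2f} (sqrt(1-p)/p={math.sqrt(1 - p)/p:.2f}), "
          f"<n> - std = {mean - std:.4f}")
```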
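And a quick Monte Carlo sketch of why a 10-episode evaluation can still beat this bound: with so few episodes the sample standard deviation fluctuates a lot, so mean minus std can land far above its population value of at most -4. The crash probability, episode cap, and scoring here are my assumptions for illustration, not values taken from the game:

```python
# Monte Carlo sketch of the 10-episode evaluation described above.
# Assumptions (not from the model card): per-block crash probability p,
# an episode cap of 500 blocks, and score = blocks passed - 5 on a crash.
import random
import statistics

def evaluate_once(p, n_episodes=10, cap=500):
    """One simulated evaluation: mean_reward - std over n_episodes."""
    scores = []
    for _ in range(n_episodes):
        n = 0  # blocks passed so far
        while n < cap and random.random() >= p:
            n += 1
        score = n - 5 if n < cap else n  # -5 penalty only if we crashed
        scores.append(score)
    return statistics.mean(scores) - statistics.stdev(scores)

random.seed(0)
p = 0.02
results = [evaluate_once(p) for _ in range(1000)]
print("asymptotic bound:   mean - std <= -4")
print(f"best of 1000 evals: {max(results):.1f}")
print(f"share above -4:     {sum(r > -4 for r in results) / len(results):.0%}")
```

This is consistent with the result reported in the metadata above: 70.30 +/- 33.94 gives mean minus std of about 36.4, far above the asymptotic bound, precisely because it is estimated from only 10 length-capped episodes.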