---
tags:
  - Pixelcopter-PLE-v0
  - reinforce
  - reinforcement-learning
  - custom-implementation
  - deep-rl-class
model-index:
  - name: policy_grad_2-Pixelcopter-PLE-v0
    results:
      - task:
          type: reinforcement-learning
          name: reinforcement-learning
        dataset:
          name: Pixelcopter-PLE-v0
          type: Pixelcopter-PLE-v0
        metrics:
          - type: mean_reward
            value: 70.30 +/- 33.94
            name: mean_reward
            verified: false
---

# Reinforce Agent playing Pixelcopter-PLE-v0

This is a trained model of a Reinforce agent playing Pixelcopter-PLE-v0. To learn how to use this model and train your own, check Unit 4 of the Deep Reinforcement Learning Course: https://huggingface.co/deep-rl-course/unit4/introduction

## Some math about Pixelcopter training

The game is to fly through a passage and avoid blocks. Suppose we have trained our agent so that the probability of crashing at any given block is $p$ (low enough, I hope). The probability that the copter crashes exactly at the $n$-th block is the product of the probabilities that it survives the previous $(n-1)$ blocks and the probability that it crashes at the current one:

$$P = p \cdot (1-p)^{n-1}$$

The expected value of the index of the block it crashes at is:

$$\langle n \rangle = \sum_{n=1}^\infty n \cdot p \cdot (1-p)^{n-1} = \frac{1}{p}$$

The standard deviation is:

$$\mathrm{std}(n) = \sqrt{\langle n^2 \rangle - \langle n \rangle^2}$$

$$\langle n^2 \rangle = \sum_{n=1}^\infty n^2 \cdot p \cdot (1-p)^{n-1} = \frac{2-p}{p^2}$$

$$\mathrm{std}(n) = \sqrt{\frac{2-p}{p^2} - \left(\frac{1}{p}\right)^2} = \frac{\sqrt{1-p}}{p}$$

So the difference is:

$$\langle n \rangle - \mathrm{std}(n) = \frac{1 - \sqrt{1-p}}{p}$$

As long as $0 \le p \le 1$, we have $\sqrt{1-p} \ge 1-p$, and therefore:

$$\langle n \rangle - \mathrm{std}(n) = \frac{1 - \sqrt{1-p}}{p} \le \frac{1 - (1-p)}{p} = 1$$

The score $s$ in Pixelcopter is the number of blocks passed, decreased by 5 for the crash. So the mean score is lower by 5 and the standard deviation is unchanged. No matter how small $p$ is, our "least score" (mean minus std) satisfies:

$$(\langle n \rangle - 5) - \mathrm{std}(n) = \langle n \rangle - \mathrm{std}(n) - 5 \le -4$$

But since only 10 episodes are used to compute the statistics, and episode length is capped, we can still achieve the goal: the better the agent, the better the chances. Understanding this is disappointing, though.
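To double-check the closed forms above, here is a minimal Python sketch (not part of the training code) that computes $\langle n \rangle$ and $\mathrm{std}(n)$ by direct summation of the series and confirms that $\langle n \rangle - \mathrm{std}(n) \le 1$. The values of $p$ are arbitrary:

```python
# Numeric sanity check of the geometric-distribution identities above.
import math

def geometric_stats(p, n_max=100_000):
    """Mean and std of the crash-block index n, by direct summation."""
    mean = sum(n * p * (1 - p) ** (n - 1) for n in range(1, n_max))
    second = sum(n * n * p * (1 - p) ** (n - 1) for n in range(1, n_max))
    return mean, math.sqrt(second - mean ** 2)

for p in (0.5, 0.1, 0.01):
    mean, std = geometric_stats(p)
    print(f"p={p}: <n>={mean:.2f} (1/p={1/p:.2f}), "
          f"std={std:.2f} (sqrt(1-p)/p={math.sqrt(1 - p)/p:.2f}), "
          f"<n> - std = {mean - std:.4f}")
```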
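And a quick Monte Carlo sketch of why a 10-episode evaluation can still beat this bound: with so few episodes the sample standard deviation fluctuates a lot, so mean minus std can land far above its population value of at most -4. The crash probability, episode cap, and scoring here are my assumptions for illustration, not values taken from the game:

```python
# Monte Carlo sketch of the 10-episode evaluation described above.
# Assumptions (not from the model card): per-block crash probability p,
# an episode cap of 500 blocks, and score = blocks passed - 5 on a crash.
import random
import statistics

def evaluate_once(p, n_episodes=10, cap=500):
    """One simulated evaluation: mean_reward - std over n_episodes."""
    scores = []
    for _ in range(n_episodes):
        n = 0  # blocks passed so far
        while n < cap and random.random() >= p:
            n += 1
        score = n - 5 if n < cap else n  # -5 penalty only if we crashed
        scores.append(score)
    return statistics.mean(scores) - statistics.stdev(scores)

random.seed(0)
p = 0.02
results = [evaluate_once(p) for _ in range(1000)]
print("asymptotic bound:   mean - std <= -4")
print(f"best of 1000 evals: {max(results):.1f}")
print(f"share above -4:     {sum(r > -4 for r in results) / len(results):.0%}")
```

This is consistent with the result reported in the metadata above: 70.30 +/- 33.94 gives mean minus std of about 36.4, far above the asymptotic bound, precisely because it is estimated from only 10 length-capped episodes.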