
Reinforce Agent playing Pixelcopter-PLE-v0

This is a trained model of a Reinforce agent playing Pixelcopter-PLE-v0. To learn how to use this model and train your own, check out Unit 4 of the Deep Reinforcement Learning Course: https://huggingface.co/deep-rl-course/unit4/introduction

Some math about 'Pixelcopter' training.

The game is to fly through a corridor and avoid blocks. Suppose we have trained our agent so that the probability of crashing at any given block is p (low enough, I hope). The probability that the copter crashes exactly at the n-th block is the product of the probabilities that it survives the previous (n - 1) blocks and the probability that it crashes at the current one:

$$P(n) = p \cdot (1-p)^{n-1}$$

The expected value of the block number at which it crashes is:

$$\langle n \rangle = \sum_{n=1}^\infty n \cdot p \cdot (1-p)^{n-1} = \frac{1}{p}$$

The standard deviation is:

$$\operatorname{std}(n) = \sqrt{\langle n^2 \rangle - \langle n \rangle^2}$$

$$\langle n^2 \rangle = \sum_{n=1}^\infty n^2 \cdot p \cdot (1-p)^{n-1} = \frac{2-p}{p^2}$$

$$\operatorname{std}(n) = \sqrt{\frac{2-p}{p^2} - \left(\frac{1}{p}\right)^2} = \frac{\sqrt{1-p}}{p}$$

So the difference between them is:

$$\langle n \rangle - \operatorname{std}(n) = \frac{1 - \sqrt{1-p}}{p}$$

As long as 0 \le p \le 1, we have

$$\sqrt{1-p} \ge 1-p,$$

and therefore:

$$\langle n \rangle - \operatorname{std}(n) = \frac{1 - \sqrt{1-p}}{p} \le \frac{1 - (1-p)}{p} = 1$$

The score s in 'Pixelcopter' is the number of blocks passed, decreased by 5 (for the crash). So the mean score is lower by 5 than the mean block number, while the standard deviation stays the same. No matter how small p is, the "mean minus std" score satisfies:

$$(\langle n \rangle - 5) - \operatorname{std}(n) = \langle n \rangle - \operatorname{std}(n) - 5 \le -4$$

But since we use only 10 episodes to compute the statistics and the episode duration is limited, the goal can still be reached: the better the agent, the more chances it has. Understanding this is disappointing, though.
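Below is a small Monte Carlo sketch (not part of the training code) that checks these formulas numerically. It assumes a per-block crash probability p = 0.05, the score model s = n - 5 used in the derivation above, and a leaderboard threshold of mean - std >= 5; episode truncation is not modelled.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.05            # assumed per-block crash probability
episodes = 100_000

# Crash block n is geometrically distributed: P(n) = p * (1 - p)^(n - 1)
n = rng.geometric(p, size=episodes)

print("mean(n):", n.mean(), " theory:", 1 / p)
print("std(n): ", n.std(),  " theory:", np.sqrt(1 - p) / p)

# Score per episode, as in the derivation above: s = n - 5
scores = n - 5
print("mean - std of score:", scores.mean() - scores.std())   # <= -4 in theory

# With only 10 evaluation episodes the estimate is noisy, so mean - std can
# still clear the (assumed) threshold of 5 by luck; a better agent (smaller p)
# simply gets more chances. Episode truncation, which also helps, is ignored.
samples = rng.geometric(p, size=(10_000, 10))
eval_stat = (samples - 5).mean(axis=1) - (samples - 5).std(axis=1)
print("fraction of 10-episode evaluations with mean - std >= 5:",
      (eval_stat >= 5).mean())
```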

