Lew committed on
Commit
8f77724
1 Parent(s): d5630b4

Add mean/std calculation

Files changed (1)
  1. README.md +20 -1
README.md CHANGED
@@ -25,5 +25,24 @@ model-index:
 This is a trained model of a **Reinforce** agent playing **Pixelcopter-PLE-v0** .
 To learn to use this model and train yours check Unit 4 of the Deep Reinforcement Learning Course: https://huggingface.co/deep-rl-course/unit4/introduction

-
+ Some math about 'Pixelcopter' training.
+ The game is to fly along a passage and avoid blocks. Suppose we have trained our agent so that the probability of crashing at any given block is _p_ (low enough, I hope).
+ The probability that the copter crashes exactly at the _n_-th block is the product of the probabilities that it does not crash at the previous _(n-1)_ blocks and the probability that it crashes at the current block:
+ $$P(n) = p \cdot (1-p)^{n-1}$$
+ The expected number of the block it crashes at is (this sum and the second-moment sum below are standard geometric-series identities; see the sketch after the diff):
+ $$\langle n \rangle = \sum_{n=1}^\infty{n \cdot p \cdot (1-p)^{n-1}} = \frac{1}{p}$$
+ The standard deviation is:
+ $$\mathrm{std}(n) = \sqrt{\langle n^2 \rangle - \langle n \rangle^2}$$
+ $$\langle n^2 \rangle = \sum_{n=1}^\infty{n^2 \cdot p \cdot (1-p)^{n-1}} = \frac{2-p}{p^2}$$
+ $$\mathrm{std}(n) = \sqrt{\frac{2-p}{p^2}-\left( \frac{1}{p} \right)^2} = \frac{\sqrt{1-p}}{p}$$
+ So the difference is:
+ $$\langle n \rangle - \mathrm{std}(n) = \frac{1 - \sqrt{1-p}}{p}$$
+ As long as
+ $$0 \le p \le 1,$$
+ the following holds:
+ $$\sqrt{1-p} \ge 1-p$$
+ $$\langle n \rangle - \mathrm{std}(n) = \frac{1 - \sqrt{1-p}}{p} \le \frac{1 - (1-p)}{p} = 1$$
+ The score _s_ in 'Pixelcopter' is the number of blocks passed, decreased by 5 (for the crash). So the mean score is lower by 5 and the std is the same. No matter how small _p_ is, our 'least score' (mean minus std) is:
+ $$(\langle n \rangle - 5) - \mathrm{std}(n) = \langle n \rangle - \mathrm{std}(n) - 5 \le -4$$
+ But since we use only 10 episodes to compute the statistics and the episode length is limited, we can still achieve the goal: the better the agent, the more chances. Still, understanding this is disappointing.
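
For reference, both sums used above follow from differentiating the geometric series; for the first one (with $x = 1-p$):

$$\sum_{n=1}^\infty n x^{n-1} = \frac{1}{(1-x)^2} \quad\Rightarrow\quad \langle n \rangle = p \sum_{n=1}^\infty n (1-p)^{n-1} = \frac{p}{\bigl(1-(1-p)\bigr)^2} = \frac{1}{p}$$

and the second-moment sum follows similarly (multiply by $x$ and differentiate once more).

A minimal Python sketch that checks these formulas numerically by simulating the crash-at-each-block model; the crash probability `p`, the number of simulated episodes, and the helper `blocks_until_crash` are illustrative choices, not values taken from the trained agent:

```python
import random
import statistics

p = 0.05            # assumed per-block crash probability (illustrative value)
episodes = 100_000  # number of simulated runs

def blocks_until_crash(p: float) -> int:
    """Fly block by block; return the index n of the block where the copter crashes."""
    n = 1
    while random.random() >= p:  # survive this block with probability 1 - p
        n += 1
    return n

samples = [blocks_until_crash(p) for _ in range(episodes)]
mean_n = statistics.fmean(samples)
std_n = statistics.pstdev(samples)

print(f"simulated: <n> = {mean_n:.2f}, std(n) = {std_n:.2f}")
print(f"predicted: <n> = {1 / p:.2f}, std(n) = {(1 - p) ** 0.5 / p:.2f}")
# Score convention from the note above: s = n - 5, so mean(s) - std(s) = (<n> - 5) - std(n).
print(f"mean score - std = {(mean_n - 5) - std_n:.2f}  (theory: at most -4)")
```

With `p = 0.05` this gives roughly `<n> = 20` and `std(n) = 19.5`, so the mean score minus its std sits around `-4.5`, in line with the bound above.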
48