tags:
- Pixelcopter-PLE-v0
- reinforce
- reinforcement-learning
- custom-implementation
- deep-rl-class
model-index:
- name: policy_grad_2-Pixelcopter-PLE-v0
results:
- task:
type: reinforcement-learning
name: reinforcement-learning
dataset:
name: Pixelcopter-PLE-v0
type: Pixelcopter-PLE-v0
metrics:
- type: mean_reward
value: 70.30 +/- 33.94
name: mean_reward
verified: false
# Reinforce Agent playing Pixelcopter-PLE-v0
This is a trained model of a Reinforce agent playing Pixelcopter-PLE-v0. To learn how to use this model and train your own, check Unit 4 of the Deep Reinforcement Learning Course: https://huggingface.co/deep-rl-course/unit4/introduction
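As a reminder of what "Reinforce" means here, below is a minimal toy sketch of the REINFORCE policy-gradient update, reduced to a two-armed bandit in pure Python. This is only an illustration and all names in it are hypothetical; the course's actual implementation uses PyTorch, the PLE environment, and full-episode discounted returns rather than single-step rewards.

```python
import math
import random

rng = random.Random(0)

# Toy stand-in for the environment: a two-armed bandit.
# Arm 1 pays ~1.0 on average, arm 0 pays ~0.2 (hypothetical numbers).
def reward(action):
    return rng.gauss(1.0 if action == 1 else 0.2, 0.1)

logits = [0.0, 0.0]   # softmax policy parameters theta
alpha = 0.1           # learning rate

def policy_probs(logits):
    # Numerically stable softmax over the two action logits.
    m = max(logits)
    e = [math.exp(l - m) for l in logits]
    s = sum(e)
    return [x / s for x in e]

for _ in range(2000):
    probs = policy_probs(logits)
    a = 0 if rng.random() < probs[0] else 1   # sample an action from pi(.|theta)
    r = reward(a)
    # REINFORCE update: theta += alpha * return * grad log pi(a|theta).
    # For a softmax policy, grad log pi(a) = one_hot(a) - probs.
    for i in range(2):
        logits[i] += alpha * r * ((1.0 if i == a else 0.0) - probs[i])

print(policy_probs(logits))  # the policy should now strongly prefer arm 1
```

The same update, applied per episode with the return of the whole episode, is what trains the Pixelcopter policy in the course.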
## Some math about 'Pixelcopter' training
The game is to fly through a passage and avoid blocks. Suppose we have trained our agent so that the probability of crashing at a block is p (low enough, I hope). The probability that the copter crashes exactly at the n-th block is the product of the probabilities that it doesn't crash at each of the previous (n-1) blocks and the probability that it crashes at the current one:

$$P(n) = (1-p)^{n-1}\,p$$

This is a geometric distribution, so the mathematical expectation of the number of the block it crashes at is:

$$\mathbb{E}[n] = \frac{1}{p}$$

The std is:

$$\sigma_n = \frac{\sqrt{1-p}}{p}$$

So the difference is:

$$\mathbb{E}[n] - \sigma_n = \frac{1-\sqrt{1-p}}{p} \le 1,$$

as long as the following is true:

$$\sqrt{1-p} \ge 1-p \quad \text{for } 0 \le p \le 1.$$

The score s in 'Pixelcopter' is the number of blocks passed decreased by 5 (for the crash). So the average score is lower by 5 and the std is the same. No matter how small p is, our 'least score' (mean minus std) is:

$$\mathbb{E}[s] - \sigma_s \le 1 - 5 = -4.$$

But since we use only 10 episodes to compute the statistics, and episode duration is limited, we can still achieve the goal: the better the agent, the more chances. Understanding this is disappointing, though.
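This argument can be checked with a quick Monte Carlo sketch, assuming a hypothetical per-block crash probability p and the simplified score model above (blocks reached minus 5). It also shows why a noisy 10-episode evaluation can still land well above the theoretical mean-minus-std:

```python
import math
import random

def crash_block(p, rng):
    # Inverse-transform sample of a geometric variable on {1, 2, ...}:
    # the number of the block the copter crashes at.
    u = 1.0 - rng.random()  # uniform in (0, 1]
    return int(math.log(u) / math.log(1.0 - p)) + 1

def mean_minus_std(samples):
    m = sum(samples) / len(samples)
    var = sum((x - m) ** 2 for x in samples) / len(samples)
    return m - math.sqrt(var)

rng = random.Random(0)
p = 0.02  # hypothetical crash probability per block

# Large-sample check: E[n] - sigma_n = (1 - sqrt(1-p)) / p, which stays below 1
# (and is close to 1/2 for small p).
blocks = [crash_block(p, rng) for _ in range(100_000)]
print(mean_minus_std(blocks))

# Scores: blocks minus 5 for the crash shifts the mean by -5, std unchanged.
# But a 10-episode evaluation is noisy: its sample mean - std can land far
# above the theoretical -4 by luck.
evals = [mean_minus_std([crash_block(p, rng) - 5 for _ in range(10)])
         for _ in range(1000)]
print(max(evals))
```

The first printed value hovers around 0.5, while the best of a thousand 10-episode evaluations comes out far higher, which is exactly the loophole the paragraph above relies on.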