Update README.md

README.md

      name: Pixelcopter-PLE-v0
      type: Pixelcopter-PLE-v0
    metrics:
    - type: mean_reward
      value: 58.13 +/- 55.17
      name: mean_reward
      verified: false
---

# 🚁 Reinforce Agent – Pixelcopter-PLE-v0

A policy gradient agent trained from scratch using the **REINFORCE** algorithm to play [Pixelcopter](https://pygame-learning-environment.readthedocs.io/en/latest/user/games/pixelcopter.html), a challenging side-scrolling game with continuous observations built on the PyGame Learning Environment (PLE).

---

## 📊 Performance

| Metric | Value |
|--------|-------|
| Mean Reward | 58.13 |
| Std of Reward | ±55.17 |
| Best Average Score | 80.65 (Episode 46000) |
| Evaluation Episodes | 10 |
| Training Episodes | 50,000 |

---

## 🧠 Algorithm – REINFORCE (Monte Carlo Policy Gradient)

REINFORCE is a classic **policy gradient** method that directly optimizes the policy by:

1. Rolling out full episodes using the current policy
2. Computing discounted returns **Gₜ = rₜ₊₁ + γrₜ₊₂ + γ²rₜ₊₃ + ...** for each timestep
3. Updating the policy by maximizing **E[ log π_θ(a|s) · Gₜ ]**

The policy network is a simple feedforward neural network, sketched after this list:

- **Input:** State observation vector
- **Hidden layer:** Fully connected + ReLU activation
- **Output:** Action probabilities via Softmax
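
A minimal PyTorch sketch of this architecture and update rule, using the hyperparameters from the table below (7-dimensional state, hidden size 64, γ = 0.99); the `Policy` and `reinforce_update` names are illustrative, not the repo's exact training code:

```python
# Illustrative sketch of REINFORCE, not the repo's exact training code.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical

class Policy(nn.Module):
    """Feedforward policy: state vector -> action probabilities."""
    def __init__(self, state_size=7, hidden_size=64, action_size=2):
        super().__init__()
        self.fc1 = nn.Linear(state_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, action_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return F.softmax(self.fc2(x), dim=-1)

    def act(self, state):
        """Sample an action and return it with its log-probability."""
        state = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
        dist = Categorical(self.forward(state))
        action = dist.sample()
        return action.item(), dist.log_prob(action).squeeze(0)

def reinforce_update(optimizer, log_probs, rewards, gamma=0.99):
    """One REINFORCE step over a finished episode."""
    # G_t = r_{t+1} + gamma * G_{t+1}, accumulated backwards over the episode
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    # Maximize E[log pi(a|s) * G_t] by minimizing its negative
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In the actual training run the returns are additionally standardized per episode (see Training Details below).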

---

## ⚙️ Hyperparameters

| Parameter | Value |
|-----------|-------|
| Hidden layer size | 64 |
| Training episodes | 50,000 |
| Max steps per episode | 10,000 |
| Discount factor (γ) | 0.99 |
| Learning rate | 1e-4 |
| Optimizer | Adam |

---

## 🎮 About the Environment

**Pixelcopter-PLE-v0** is a side-scrolling game where the agent controls a helicopter and must navigate through gaps in walls without crashing. A short sketch of the raw PLE interface follows the list below.

- **Observation space:** 7 continuous values (player velocity, player y-position, wall positions, etc.)
- **Action space:** 2 discrete actions – throttle up or do nothing
- **Reward:** +1 for each timestep survived
- **Episode ends:** On collision with a wall or the ground/ceiling
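
For context, this is roughly what the raw PLE interface beneath the wrapper exposes (standard PLE calls: `init`, `getActionSet`, `getGameState`, `act`, `game_over`, `reset_game`); rewards from the raw API follow PLE's own reward scheme, which may differ from the shaped +1-per-timestep reward described above:

```python
import random

from ple import PLE
from ple.games.pixelcopter import Pixelcopter

env = PLE(Pixelcopter(), display_screen=False)
env.init()

print(env.getActionSet())          # [up-key, None]: throttle up / do nothing
print(sorted(env.getGameState()))  # keys of the 7 state values listed above

# One random-action episode against the raw PLE API
env.reset_game()
total = 0.0
while not env.game_over():
    total += env.act(random.choice(env.getActionSet()))
print("episode reward (raw PLE reward scheme):", total)
```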

---

## 🚀 How to Use

```python
from ple.games.pixelcopter import Pixelcopter
from ple import PLE
import torch

# Load the trained policy (a pickled nn.Module; its class definition
# must be importable when unpickling)
model = torch.load("model.pt", map_location=torch.device("cpu"))
model.eval()

# `env` must be a Gymnasium-style wrapper around PLE's Pixelcopter.
# This repo trained with a custom wrapper (see "Training Details" below);
# substitute your own wrapper class here, e.g.:
# env = YourPixelcopterGymWrapper(PLE(Pixelcopter()))

# Run inference
state, _ = env.reset()
action, _ = model.act(state)
```
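
To reproduce an evaluation like the one in the Performance table, a sketch along these lines works, assuming `env` is the Gymnasium-style wrapper from the snippet above (so `env.step` returns the five-tuple) and `model.act` returns `(action, log_prob)`:

```python
import numpy as np

def evaluate(env, model, n_episodes=10, max_steps=10_000):
    """Mean/std of total episode reward, as reported in the Performance table."""
    scores = []
    for _ in range(n_episodes):
        state, _ = env.reset()
        total = 0.0
        for _ in range(max_steps):
            action, _ = model.act(state)
            # Assumes a Gymnasium-style wrapper returning the 5-tuple
            state, reward, terminated, truncated, _ = env.step(action)
            total += reward
            if terminated or truncated:
                break
        scores.append(total)
    return float(np.mean(scores)), float(np.std(scores))

mean_reward, std_reward = evaluate(env, model)
print(f"mean_reward = {mean_reward:.2f} +/- {std_reward:.2f}")
```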

---

## 📈 Training Details

- **Framework:** PyTorch
- **Returns:** Standardized per episode for training stability (see the sketch below)
- **Environment API:** PyGame Learning Environment (PLE) via custom Gymnasium wrapper
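
Here, "standardized per episode" means each episode's discounted returns are shifted and scaled to zero mean and unit variance before the policy-gradient step, which keeps update magnitudes comparable across episodes of very different lengths. A minimal sketch (the `eps` guard is a common convention, assumed here):

```python
import torch

def standardize_returns(returns: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Shift/scale one episode's discounted returns to zero mean, unit variance."""
    return (returns - returns.mean()) / (returns.std() + eps)

# Raw G_t values grow with episode length; standardization keeps the
# policy-gradient step on a comparable scale across episodes.
print(standardize_returns(torch.tensor([10.0, 8.5, 6.0, 3.0, 1.0])))
```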

---

## 🤗 Author

Trained by **nirmanpatel** as part of the [Hugging Face Deep Reinforcement Learning Course](https://huggingface.co/deep-rl-course/intro/README).