Commit 8e876fd by Saraaaaaaaaa
Parent: e286ab8

Update README.md

README.md CHANGED
@@ -25,12 +25,21 @@ model-index:
This is a trained model of a **Reinforce** agent playing **CartPole-v1**.
To learn how to use this model and train your own, check Unit 4 of the Deep Reinforcement Learning Course: https://huggingface.co/deep-rl-course/unit4/introduction

# ***Project Information***

**Policy-based learning** approximates the policy π directly, without first learning a value function. The objective is to maximize the performance of the parameterized policy using gradient ascent.
TL;DR: The cart learns to balance the pole by optimizing π for the best outcome: *the pole not falling over*.
Unlike Q-learning, this method does not go through a value function. Each gradient step adjusts the policy itself, so the next iteration can improve right away instead of first estimating values for every action and then deriving a new action from those estimates.

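As a sketch of what "gradient ascent on the policy" means here, the standard textbook form of the Reinforce update is written out below. The symbols $\theta$ (policy parameters), $\alpha$ (learning rate), $\gamma$ (discount factor), and $G_t$ (return from step $t$) are notation introduced for this illustration and do not come from the original README.

$$
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^{t} r_t\right],
\qquad
G_t = \sum_{k=t}^{T} \gamma^{\,k-t} r_k
$$

$$
\nabla_\theta J(\theta) \approx \sum_{t=0}^{T} G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t),
\qquad
\theta \leftarrow \theta + \alpha \, \nabla_\theta J(\theta)
$$

No value function appears anywhere in the update: the sampled return $G_t$ weights the log-probability gradient directly.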
This specific CartPole model was trained for only 500 timesteps, compared with the usual 1000, which is why the cart struggles so much to balance the pole in the video: it has not trained long enough.
A model trained for 1000 timesteps balances the pole successfully. In general, the more training steps a model gets, the better it performs, much like replaying a really hard level in a video game until it eventually gets easier.
However, the more timesteps a model has, the longer it takes to train and render: 1000 timesteps take 10-15 minutes to load, and the time keeps growing as more training timesteps are added.

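The effect of the training budget can be illustrated with a minimal sketch. This is not the CartPole model from this repository: it is a toy two-armed bandit, and the names `softmax`, `train_bandit`, `n_steps`, `lr`, and `seed` are all hypothetical. It only shows the same Reinforce-style update running for a short and a long budget.

```python
import math
import random


def softmax(logits):
    """Turn raw preferences into action probabilities."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit(n_steps, lr=0.1, seed=0):
    """Reinforce on a toy 2-armed bandit (assumption: arm 1 pays 1.0, arm 0 pays 0.0).

    Returns the final probability of choosing the paying arm.
    """
    rng = random.Random(seed)
    theta = [0.0, 0.0]  # policy parameters: one logit per arm
    for _ in range(n_steps):
        probs = softmax(theta)
        action = 0 if rng.random() < probs[0] else 1  # sample an action from the policy
        reward = 1.0 if action == 1 else 0.0
        # Gradient of log pi(action) w.r.t. logit i is 1[i == action] - probs[i]
        for i in range(2):
            grad_log_pi = (1.0 if i == action else 0.0) - probs[i]
            theta[i] += lr * reward * grad_log_pi  # gradient ascent, no value function
    return softmax(theta)[1]


if __name__ == "__main__":
    for budget in (50, 500):
        print(budget, "steps -> P(best arm) =", round(train_bandit(budget), 3))
```

With the same seed, the longer run never ends up with a lower probability of picking the paying arm than the shorter one, which mirrors the trade-off described above: more training steps help, at the cost of more compute time.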
Here -https...- is a video of it working with 1000 timesteps, and here -https...- is one with 2000 *(links will be inserted soon)*.