merve (HF staff) committed
Commit b0adc83
1 Parent(s): 382424a

Create README.md

Files changed (1):
  README.md (+40, -0) ADDED
---
tags:
- reinforcement learning
- deep deterministic policy gradient
license:
- cc0-1.0
---

## Keras Implementation of Deep Deterministic Policy Gradient ⏱🤖
This repo contains the model and the notebook [for this Keras example on DDPG for the Inverted Pendulum problem](https://keras.io/examples/rl/ddpg_pendulum/).
Full credits to: [Hemant Singh](https://github.com/amifunny)

## Background Information
Deep Deterministic Policy Gradient (DDPG) is a model-free, off-policy algorithm for learning continuous actions.

It combines ideas from DPG (Deterministic Policy Gradient) and DQN (Deep Q-Network). It uses Experience Replay and slow-learning target networks from DQN, and it is based on DPG, which can operate over continuous action spaces.
17
+
18
+ This tutorial closely follow this paper - Continuous control with deep reinforcement learning
19
+
20
+ We are trying to solve the classic Inverted Pendulum control problem. In this setting, we can take only two actions: swing left or swing right.
21
+
22
+ What make this problem challenging for Q-Learning Algorithms is that actions are continuous instead of being discrete. That is, instead of using two discrete actions like -1 or +1, we have to select from infinite actions ranging from -2 to +2.
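
The snippet below is a small illustration of that continuous action space. It assumes the Gym package and the `Pendulum-v1` environment id; the notebook itself may use a different Gym version or id.

```python
# Minimal sketch: inspect the Pendulum environment's continuous action space.
# Assumes the `gym` package and the "Pendulum-v1" id; not taken from the notebook itself.
import gym

env = gym.make("Pendulum-v1")
print(env.observation_space.shape)                  # (3,): cos(theta), sin(theta), angular velocity
print(env.action_space.shape)                       # (1,): a single continuous torque
print(env.action_space.low, env.action_space.high)  # [-2.] [2.]
```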

Just like the Actor-Critic method, we have two networks:

- Actor - It proposes an action given a state.
- Critic - It predicts if the action is good (positive value) or bad (negative value) given a state and an action.
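
As a rough illustration of these two networks, here is a minimal Keras sketch. The layer sizes and the `num_states`, `num_actions`, and `upper_bound` values are assumptions for the Pendulum setting, not necessarily the exact architecture used in the notebook.

```python
# Minimal sketch of the Actor and Critic networks in Keras (assumed sizes, not the notebook's exact model).
import tensorflow as tf
from tensorflow.keras import layers

num_states, num_actions, upper_bound = 3, 1, 2.0  # assumed Pendulum dimensions and torque limit

def get_actor():
    # Maps a state to a single continuous action, scaled to the torque limit.
    inputs = layers.Input(shape=(num_states,))
    x = layers.Dense(256, activation="relu")(inputs)
    x = layers.Dense(256, activation="relu")(x)
    outputs = layers.Dense(num_actions, activation="tanh")(x) * upper_bound
    return tf.keras.Model(inputs, outputs)

def get_critic():
    # Maps a (state, action) pair to a scalar Q-value estimate.
    state_input = layers.Input(shape=(num_states,))
    action_input = layers.Input(shape=(num_actions,))
    x = layers.Concatenate()([state_input, action_input])
    x = layers.Dense(256, activation="relu")(x)
    x = layers.Dense(256, activation="relu")(x)
    outputs = layers.Dense(1)(x)
    return tf.keras.Model([state_input, action_input], outputs)
```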

DDPG uses two more techniques not present in the original DQN:

First, it uses two Target networks.

Why? Because they add stability to training. In short, we are learning from estimated targets, and the target networks are updated slowly, which keeps our estimated targets stable.

Conceptually, this is like saying, "I have an idea of how to play this well, I'm going to try it out for a bit until I find something better", as opposed to saying "I'm going to re-learn how to play this entire game after every move". See this StackOverflow answer.
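
Here is a hedged sketch of such a slow ("soft") target update in TensorFlow. The rate `tau` is an assumed small constant, and the actor/critic networks and their target copies are assumed to already exist.

```python
# Soft target update: the target weights slowly track the learned weights.
# `tau` is an assumed small constant; this is a sketch, not the notebook's exact code.
import tensorflow as tf

def update_target(target_weights, weights, tau=0.005):
    # new_target = tau * learned + (1 - tau) * old_target
    for target_w, w in zip(target_weights, weights):
        target_w.assign(w * tau + target_w * (1 - tau))

# Example usage (assuming actor/critic models and their target copies exist):
# update_target(target_actor.variables, actor.variables, tau=0.005)
# update_target(target_critic.variables, critic.variables, tau=0.005)
```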

Second, it uses Experience Replay.

We store a list of tuples (state, action, reward, next_state), and instead of learning only from recent experience, we learn by sampling from all of the experience accumulated so far.
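
A minimal replay-buffer sketch along these lines, with an assumed capacity and batch size rather than the notebook's exact buffer class:

```python
# Minimal experience-replay sketch: store (state, action, reward, next_state)
# tuples and learn from random samples of accumulated experience.
import random
from collections import deque

import numpy as np


class ReplayBuffer:
    def __init__(self, capacity=50_000):
        # Oldest experience is dropped once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def record(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=64):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states = map(np.array, zip(*batch))
        return states, actions, rewards, next_states
```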

[pendulum_gif](https://imgur.com/eEH8Cz6)