Peter committed · Commit a40d998 · 1 Parent(s): 32b040c
:truck: rename examples
examples/HFblog-An Introduction to Q-Learning Part 1.txt
ADDED
@@ -0,0 +1,290 @@
---
created: 2022-05-23T01:23:41 (UTC +02:00)
tags: []
source: https://huggingface.co/blog/deep-rl-q-part1
author: Thomas Simonini
---

# An Introduction to Q-Learning Part 1

> ## Excerpt
> We’re on a journey to advance and democratize artificial intelligence through open source and open science.

---

## Unit 2, part 1 of the Deep Reinforcement Learning Class with Hugging Face 🤗

## _This article is part of the Deep Reinforcement Learning Class, a free course from beginner to expert. Check the syllabus here._

In the first chapter of this class, we learned about Reinforcement Learning (RL), the RL process, and the different methods to solve an RL problem. We also trained our first lander agent to **land correctly on the Moon 🌕 and uploaded it to the Hugging Face Hub.**

Today, we're going to **dive deeper into one of the Reinforcement Learning methods: value-based methods** and study our first RL algorithm: **Q-Learning.**

We'll also **implement our first RL agent from scratch**, a Q-Learning agent, and train it in two environments:

1. Frozen-Lake-v1 (non-slippery version): our agent will need to **go from the starting state (S) to the goal state (G)** by walking only on frozen tiles (F) and avoiding holes (H).
2. An autonomous taxi: it will need **to learn to navigate** a city to **transport its passengers from point A to point B.**

This unit is divided into 2 parts:

In the first part, we'll **learn about the value-based methods and the difference between Monte Carlo and Temporal Difference Learning.**

And in the second part, **we'll study our first RL algorithm: Q-Learning, and implement our first RL Agent.**

This unit is fundamental **if you want to be able to work on Deep Q-Learning** (unit 3): the first Deep RL algorithm that was able to play Atari games and **beat the human level on some of them** (Breakout, Space Invaders…).

So let's get started!

- What is RL? A short recap
- The two types of value-based methods
  - The State-Value function
  - The Action-Value function
- The Bellman Equation: simplify our value estimation
- Monte Carlo vs Temporal Difference Learning
  - Monte Carlo: learning at the end of the episode
  - Temporal Difference Learning: learning at each step

## **What is RL? A short recap**

In RL, we build an agent that can **make smart decisions**. For instance, an agent that **learns to play a video game**, or a trading agent that **learns to maximize its profits** by deciding **what stocks to buy and when to sell.**

To make intelligent decisions, our agent learns from the environment by **interacting with it through trial and error** and receiving rewards (positive or negative) **as its only feedback.**

Its goal **is to maximize its expected cumulative reward** (because of the reward hypothesis).

**The agent's decision-making process is called the policy π:** given a state, a policy outputs an action or a probability distribution over actions. That is, given an observation of the environment, a policy provides an action (or a probability for each action) that the agent should take.

**Our goal is to find an optimal policy π**\*, that is, a policy that leads to the best expected cumulative reward.

To find this optimal policy (hence solving the RL problem), there **are two main types of RL methods**:

- _Policy-based methods_: **Train the policy directly** to learn which action to take given a state.
- _Value-based methods_: **Train a value function** to learn **which state is more valuable** and use this value function **to take the action that leads to it.**

In this chapter, **we'll dive deeper into value-based methods.**

## **The two types of value-based methods**

In value-based methods, **we learn a value function** that **maps a state to the expected value of being in that state.**

The value of a state is the **expected discounted return** the agent can get if it **starts in that state and then acts according to our policy.**

If you forgot what discounting is, you can read this section.

> But what does it mean to act according to our policy? We don't have a policy in value-based methods, since we train a value function and not a policy.

Remember that the goal of an **RL agent is to have an optimal policy π\*.**

To find it, we learned that there are two different methods:

- _Policy-based methods:_ **Directly train the policy** to select what action to take given a state (or a probability distribution over actions at that state). In this case, we **don't have a value function.**

The policy takes a state as input and outputs what action to take at that state (deterministic policy).

Consequently, **we don't define the behavior of our policy by hand; training defines it.**

- _Value-based methods:_ **Indirectly, by training a value function** that outputs the value of a state or a state-action pair. Given this value function, our policy **will select an action.**

But because we didn't train our policy, **we need to specify its behavior.** For instance, if we want a policy that, given the value function, always takes the action with the highest value, **we'll create a Greedy Policy.**

Given a state, our action-value function (that we train) outputs the value of each action at that state; then our greedy policy (that we defined) selects the action with the biggest state-action pair value.

Consequently, whatever method you use to solve your problem, **you will have a policy**. In value-based methods you don't train it: your policy **is just a simple function that you specify** (for instance, a greedy policy), and this policy **uses the values given by the value function to select its actions.**

So the difference is:

- In policy-based methods, **the optimal policy is found by training the policy directly.**
- In value-based methods, **finding an optimal value function leads to having an optimal policy.**

In fact, most of the time, in value-based methods, you'll use **an Epsilon-Greedy Policy** that handles the exploration/exploitation trade-off; we'll talk about it when we talk about Q-Learning in the second part of this unit.
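
To make the idea of a hand-specified policy concrete, here is a minimal Python sketch. It assumes the action values for one state are available as a plain list; the function name `epsilon_greedy_action` and the sample numbers are made up for illustration and are not taken from the course code. With `epsilon=0.0` it behaves as the greedy policy described above.

```python
import random


def epsilon_greedy_action(q_values, epsilon=0.0, rng=random):
    """Pick an action index from the action values of a single state.

    q_values: hypothetical list where q_values[a] is the estimated value of action a.
    epsilon:  probability of exploring; 0.0 gives the plain greedy policy.
    """
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: pick the action with the highest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Made-up values for four actions in one state.
q_values_for_state = [0.1, 0.5, 0.2, 0.0]
greedy_action = epsilon_greedy_action(q_values_for_state)  # epsilon = 0: always greedy
```

Nothing in this function is trained: the learning happens in the values it reads, which is the point made above about value-based methods.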

So, we have two types of value functions:

### **The State-Value function**

We write the state-value function under a policy π like this:
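
In standard RL notation (an assumption here: the symbols $V_\pi$, $G_t$ and $S_t$ follow the usual textbook convention rather than anything spelled out in this text), the state-value function is commonly written as:

$$V_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right]$$

that is, the expected return $G_t$ when the agent starts in state $s$ and follows $\pi$ afterwards.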

For each state, the state-value function outputs the expected return if the agent **starts in that state** and then follows the policy forever after (for all future timesteps, if you prefer).

If we take the state with value -7: it's the expected return starting at that state and taking actions according to our policy (greedy policy), so right, right, right, down, down, right, right.

### **The Action-Value function**

For each state-action pair, the action-value function **outputs the expected return** if the agent starts in that state, takes that action, and then follows the policy forever after.

The value of taking action a in state s under a policy π is:
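
Using the same textbook convention (again an assumption, with $A_t$ denoting the action taken at timestep $t$), the action-value function is commonly written as:

$$Q_\pi(s, a) = \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right]$$

The only change from the state-value case is the extra condition on the first action $a$.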

We see that the difference is:

- In the state-value function, we calculate **the value of a state (St).**
- In the action-value function, we calculate **the value of the state-action pair (St, At), hence the value of taking that action at that state.**

Note: We didn't fill in all the state-action pairs for the example of the action-value function.

In either case, whatever value function we choose (state-value or action-value function), **the value is the expected return.**

However, the problem is that **to calculate EACH value of a state or a state-action pair, we need to sum all the rewards an agent can get if it starts at that state.**

This can be a tedious process, and that's **where the Bellman equation comes to help us.**

## **The Bellman Equation: simplify our value estimation**

The Bellman equation **simplifies our state value or state-action value calculation.**

With what we have learned so far, we know that to calculate V(St) (the value of a state), we need to calculate the return starting at that state and then following the policy forever after. **(The policy we define in the following example is a Greedy Policy, and for simplification, we don't discount the reward.)**

So to calculate V(St), we need to sum the expected rewards. Hence:

To calculate the value of State 1: the sum of rewards **if the agent started in that state** and then followed the **greedy policy (taking actions that lead to the best state values) for all the time steps.**

Then, to calculate V(St+1), we need to calculate the return starting at that state St+1.

To calculate the value of State 2: the sum of rewards **if the agent started in that state** and then followed the **policy for all the time steps.**

So you see, that's a pretty tedious process if you need to do it for each state value or state-action value.

Instead of calculating the expected return for each state or each state-action pair, **we can use the Bellman equation.**

The Bellman equation is a recursive equation that works like this: instead of starting from the beginning for each state and calculating the return, we can consider the value of any state as:

**The immediate reward (Rt+1) + the discounted value of the state that follows (gamma \* V(St+1)).**

For simplification, here we don’t discount, so gamma = 1.

If we go back to our example, the value of State 1 = the expected cumulative return if we start at that state.

To calculate the value of State 1: the sum of rewards **if the agent started in State 1** and then followed the **policy for all the time steps.**

This is equivalent to V(St) = Immediate reward (Rt+1) + Discounted value of the next state (gamma \* V(St+1)).

- The value of V(St+1) = Immediate reward (Rt+2) + Discounted value of St+2 (gamma \* V(St+2)).
- And so on.

To recap, the idea of the Bellman equation is that instead of calculating each value as the sum of the expected return, **which is a long process**, we calculate it as **the sum of the immediate reward + the discounted value of the state that follows.**
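
The equivalence can be checked on a toy case. The sketch below assumes a single deterministic path, so the expectation reduces to a plain sum; the reward list and both function names are invented for illustration. It computes every state value the long way and then with the recursive form, V(St) = Rt+1 + gamma \* V(St+1).

```python
def values_by_summation(rewards, gamma=1.0):
    """V(S_t) computed the long way: sum every (discounted) future reward."""
    return [
        sum(gamma ** k * r for k, r in enumerate(rewards[t:]))
        for t in range(len(rewards))
    ]


def values_by_bellman(rewards, gamma=1.0):
    """V(S_t) = R_{t+1} + gamma * V(S_{t+1}), filled in from the last state backwards."""
    values = [0.0] * (len(rewards) + 1)  # extra slot: the terminal state has value 0
    for t in reversed(range(len(rewards))):
        values[t] = rewards[t] + gamma * values[t + 1]
    return values[:-1]


# Hypothetical rewards along one fixed path; rewards[t] plays the role of R_{t+1}.
path_rewards = [-1, -1, -1, -1, -1, -1, -1]
long_way = values_by_summation(path_rewards)
recursive = values_by_bellman(path_rewards)
```

Both lists agree, but the recursive version reuses V(St+1) instead of re-summing the whole tail for every state.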

## **Monte Carlo vs Temporal Difference Learning**

The last thing we need to talk about before diving into Q-Learning is the two ways of learning.

Remember that an RL agent **learns by interacting with its environment.** The idea is that, **using the experience collected** and the rewards it gets, the agent will **update its value function or policy.**

Monte Carlo and Temporal Difference Learning are two different **strategies for training our value function or our policy function.** Both of them **use experience to solve the RL problem.**

On one hand, Monte Carlo uses **an entire episode of experience before learning.** On the other hand, Temporal Difference uses **only a step (St, At, Rt+1, St+1) to learn.**

We'll explain both of them **using a value-based method example.**

### **Monte Carlo: learning at the end of the episode**

Monte Carlo waits until the end of the episode, calculates Gt (the return), and uses it as **a target for updating V(St).**

So it requires a **complete episode of interaction before updating our value function.**

If we take an example:

- We always start the episode **at the same starting point.**
- **The agent takes actions using the policy**. For instance, using an Epsilon-Greedy Strategy, a policy that alternates between exploration (random actions) and exploitation.
- We get **the reward and the next state.**
- We terminate the episode if the cat eats the mouse or if the mouse moves > 10 steps.
- At the end of the episode, **we have a list of States, Actions, Rewards, and Next States.**
- **The agent will sum the total rewards Gt** (to see how well it did).
- It will then **update V(St) based on the formula** V(St) ← V(St) + lr \* \[Gt - V(St)\].
- Then **start a new game with this new knowledge.**

By running more and more episodes, **the agent will learn to play better and better.**

For instance, if we train a state-value function using Monte Carlo:

- We just started to train our value function, **so it returns 0 for each state.**
- Our learning rate (lr) is 0.1 and our discount rate is 1 (= no discount).
- Our mouse **explores the environment and takes random actions.**
- The mouse made more than 10 steps, so the episode ends.
- We have a list of states, actions, rewards, and next states; **we need to calculate the return Gt**:
  - $$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots$$ (for simplicity we don’t discount the rewards)
  - Gt = 1 + 0 + 0 + 0 + 0 + 0 + 1 + 1 + 0 + 0
  - Gt = 3
- We can now update V(S0):
  - New V(S0) = V(S0) + lr \* \[Gt - V(S0)\]
  - New V(S0) = 0 + 0.1 \* \[3 - 0\]
  - The new V(S0) = 0.3
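
The same arithmetic can be written as a small Python sketch. The function name `monte_carlo_update` and its signature are illustrative assumptions rather than course code; the rewards, learning rate and discount rate are the ones from the example above.

```python
def monte_carlo_update(value, episode_rewards, lr=0.1, gamma=1.0):
    """Return the updated V(S_t) after one complete episode.

    episode_rewards: rewards R_{t+1}, R_{t+2}, ... observed from S_t to the end.
    """
    # Return G_t: (discounted) sum of all rewards collected after leaving S_t.
    g_t = sum(gamma ** k * r for k, r in enumerate(episode_rewards))
    # Move the old estimate a fraction lr of the way toward the target G_t.
    return value + lr * (g_t - value)


rewards = [1, 0, 0, 0, 0, 0, 1, 1, 0, 0]  # the ten rewards from the example
new_v_s0 = monte_carlo_update(value=0.0, episode_rewards=rewards)
```

Up to floating-point rounding, `new_v_s0` is 0.3, matching the hand calculation. Note that the update can only run once the full list of rewards is known, which is why Monte Carlo has to wait for the end of the episode.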

### **Temporal Difference Learning: learning at each step**

**Temporal Difference, on the other hand, waits for only one interaction (one step) St+1** to form a TD target and update V(St) using Rt+1 and gamma \* V(St+1).

The idea with **TD is to update V(St) at each step.**

But because we didn't play an entire episode, we don't have Gt (the return). Instead, **we estimate Gt by adding Rt+1 and the discounted value of the next state.**

This is called **bootstrapping, because TD bases its update in part on an existing estimate V(St+1) and not on a complete sample Gt.**

This method is called TD(0) or **one-step TD (update the value function after any individual step).**

If we take the same example:

- We just started to train our value function, so it returns 0 for each state.
- Our learning rate (lr) is 0.1, and our discount rate is 1 (no discount).
- Our mouse explores the environment and takes a random action: **going to the left.**
- It gets a reward Rt+1 = 1 since **it eats a piece of cheese.**

We can now update V(S0):

New V(S0) = V(S0) + lr \* \[R1 + gamma \* V(S1) - V(S0)\]

New V(S0) = 0 + 0.1 \* \[1 + 1 \* 0 - 0\]

The new V(S0) = 0.1
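
As a minimal Python sketch of this one-step update (the function name `td0_update` is an illustrative assumption; the numbers are those of the example, with the stated discount rate of 1):

```python
def td0_update(value, reward, next_value, lr=0.1, gamma=1.0):
    """Return the updated V(S_t) after a single step (S_t, A_t, R_{t+1}, S_{t+1})."""
    # TD target: immediate reward plus the discounted current estimate of the next state.
    td_target = reward + gamma * next_value
    # Move the old estimate a fraction lr of the way toward the TD target.
    return value + lr * (td_target - value)


# V(S0) = 0, R1 = 1 (the cheese), V(S1) = 0 because nothing has been learned yet.
new_v_s0 = td0_update(value=0.0, reward=1.0, next_value=0.0)
```

Unlike the Monte Carlo version, this needs only one reward and the current estimate `next_value`, so it can run after every step.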

So we just updated our value function for State 0.

Now we **continue to interact with this environment with our updated value function.**

If we summarize:

- With Monte Carlo, we update the value function from a complete episode, and so we **use the actual discounted return of this episode.**
- With TD learning, we update the value function from a single step, so we replace Gt, which we don't have, with **an estimated return called the TD target.**

So now, before diving into Q-Learning, let's summarize what we just learned:

We have two types of value functions:

- State-Value function: outputs the expected return if **the agent starts in a given state and acts according to the policy forever after.**
- Action-Value function: outputs the expected return if **the agent starts in a given state, takes a given action in that state** and then acts according to the policy forever after.
- In value-based methods, **we define the policy by hand** because we don't train it; we train a value function. The idea is that if we have an optimal value function, we **will have an optimal policy.**

There are two types of methods to learn a policy or a value function:

- With _the Monte Carlo method_, we update the value function from a complete episode, and so we **use the actual discounted return of this episode.**
- With _the TD Learning method_, we update the value function from a single step, so we replace Gt, which we don't have, with **an estimated return called the TD target.**

---

So that’s all for today. Congrats on finishing this first part of the chapter! There was a lot of information.

**It’s normal if you still feel confused by all these elements**. It was the same for me and for everyone who has studied RL.

**Take time to really grasp the material before continuing**. In the second part (that we will publish this Friday 📆), we’ll study our first RL algorithm: Q-Learning, and implement our first RL Agent in two environments:

1. Frozen-Lake-v1 (non-slippery version): our agent will need to **go from the starting state (S) to the goal state (G)** by walking only on frozen tiles (F) and avoiding holes (H).
2. An autonomous taxi: it will need **to learn to navigate** a city to **transport its passengers from point A to point B.**

And don't forget to share with your friends who want to learn 🤗 !

### Keep learning, stay awesome,

examples/HFblog-Introducing Decision Transformers.txt
ADDED
@@ -0,0 +1,236 @@
---
created: 2022-05-23T01:23:57 (UTC +02:00)
tags: []
source: https://huggingface.co/blog/decision-transformers
author:
---

# Introducing Decision Transformers on Hugging Face 🤗

---

At Hugging Face, we are contributing to the ecosystem for Deep Reinforcement Learning researchers and enthusiasts. Recently, we have integrated Deep RL frameworks such as Stable-Baselines3.

And today we are happy to announce that we integrated the Decision Transformer, an Offline Reinforcement Learning method, into the 🤗 transformers library and the Hugging Face Hub. We have some exciting plans for improving accessibility in the field of Deep RL and we are looking forward to sharing them with you over the coming weeks and months.

- What is Offline Reinforcement Learning?
- Introducing Decision Transformers
- Using the Decision Transformer in 🤗 Transformers
- Conclusion
- What's next?
- References

## What is Offline Reinforcement Learning?

Deep Reinforcement Learning (RL) is a framework for building decision-making agents. These agents aim to learn optimal behavior (a policy) by interacting with the environment through trial and error and receiving rewards as their only feedback.

The agent’s goal is to maximize **its cumulative reward, called the return**, because RL is based on the reward hypothesis: **all goals can be described as the maximization of the expected cumulative reward.**

Deep Reinforcement Learning agents **learn with batches of experience.** The question is, how do they collect it?

_A comparison between Reinforcement Learning in an Online and Offline setting, figure taken from this post_

In online reinforcement learning, **the agent gathers data directly**: it collects a batch of experience by interacting with the environment. Then, it uses this experience immediately (or via some replay buffer) to learn from it (update its policy).

But this implies that you either train your agent directly in the real world or have a simulator. If you don’t have one, you need to build it, which can be very complex (how do you reflect the complex reality of the real world in an environment?), expensive, and unsafe, since if the simulator has flaws, the agent will exploit them if they provide a competitive advantage.

On the other hand, in offline reinforcement learning, the agent only uses data collected from other agents or human demonstrations. **It does not interact with the environment**.

The process is as follows:

1. Create a dataset using one or more policies and/or human interactions.
2. Run offline RL on this dataset to learn a policy.

This method has one drawback: the counterfactual queries problem. What do we do if our agent decides to do something for which we don’t have the data? For instance, turning right at an intersection when we don’t have this trajectory.

Some solutions to this problem already exist, but if you want to know more about offline reinforcement learning, you can watch this video.

## Introducing Decision Transformers

The Decision Transformer model was introduced in “Decision Transformer: Reinforcement Learning via Sequence Modeling” by Chen L. et al. It abstracts Reinforcement Learning as a **conditional-sequence modeling problem**.

The main idea is that instead of training a policy using RL methods, such as fitting a value function that tells us what action to take to maximize the return (cumulative reward), we use a sequence modeling algorithm (Transformer) that, given a desired return, past states, and actions, generates future actions to achieve this desired return. It’s an autoregressive model conditioned on the desired return, past states, and actions.

This is a complete shift in the Reinforcement Learning paradigm, since we use generative trajectory modeling (modeling the joint distribution of the sequence of states, actions, and rewards) to replace conventional RL algorithms. It means that in Decision Transformers, we don’t maximize the return but rather generate a series of future actions that achieve the desired return.

The process goes this way:

1. We feed the last K timesteps into the Decision Transformer with 3 inputs:
   - Return-to-go
   - State
   - Action
2. The tokens are embedded either with a linear layer if the state is a vector or with a CNN encoder if it’s frames.
3. The inputs are processed by a GPT-2 model, which predicts future actions via autoregressive modeling.
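
Of the three inputs listed above, the return-to-go is the least familiar. As a rough sketch, assuming the usual definition (the undiscounted sum of rewards from a timestep to the end of the trajectory), it can be computed from a recorded episode as below; the function name and the reward values are made up for illustration.

```python
def returns_to_go(rewards):
    """returns_to_go[t] = rewards[t] + rewards[t + 1] + ... + rewards[-1] (undiscounted)."""
    rtg = []
    running = 0.0
    # Walk the episode backwards so each entry reuses the running sum of later rewards.
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    rtg.reverse()
    return rtg


episode_rewards = [1.0, 0.0, 2.0, 1.0]  # made-up rewards from one offline trajectory
rtg = returns_to_go(episode_rewards)  # [4.0, 3.0, 3.0, 1.0]
```

At inference time there is no recorded future, so the first return-to-go is instead a chosen target return, reduced by each reward as it is received.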

_Decision Transformer architecture. States, actions, and returns are fed into modality-specific linear embeddings and a positional episodic timestep encoding is added. Tokens are fed into a GPT architecture which predicts actions autoregressively using a causal self-attention mask. Figure from \[1\]._

## Using the Decision Transformer in 🤗 Transformers

The Decision Transformer model is now available as part of the 🤗 transformers library. In addition, we share nine pre-trained model checkpoints for continuous control tasks in the Gym environment.

_An “expert” Decision Transformers model, learned using offline RL in the Gym Walker2d environment._

### Install the package

```
pip install git+https://github.com/huggingface/transformers
```

### Loading the model

Using the Decision Transformer is relatively easy, but as it is an autoregressive model, some care has to be taken to prepare the model’s inputs at each time-step. We have prepared both a Python script and a Colab notebook that demonstrate how to use this model.

Loading a pretrained Decision Transformer is simple in the 🤗 transformers library:

```
from transformers import DecisionTransformerModel

model_name = "edbeeching/decision-transformer-gym-hopper-expert"
model = DecisionTransformerModel.from_pretrained(model_name)
```

### Creating the environment

We provide pretrained checkpoints for the Gym Hopper, Walker2D and Halfcheetah environments. Checkpoints for Atari environments will soon be available.

```
import gym

env = gym.make("Hopper-v3")
state_dim = env.observation_space.shape[0]  # state size
act_dim = env.action_space.shape[0]  # action size
```

### Autoregressive prediction function

The model performs an autoregressive prediction; that is to say, predictions made at the current time-step **t** are sequentially conditioned on the outputs from previous time-steps. This function is quite meaty, so we will aim to explain it in the comments.

```
import torch


# Function that gets an action from the model using autoregressive prediction
# with a window of the previous 20 timesteps.
def get_action(model, states, actions, rewards, returns_to_go, timesteps):
    # This implementation does not condition on past rewards

    # Add a batch dimension of 1 in front of every sequence
    states = states.reshape(1, -1, model.config.state_dim)
    actions = actions.reshape(1, -1, model.config.act_dim)
    returns_to_go = returns_to_go.reshape(1, -1, 1)
    timesteps = timesteps.reshape(1, -1)

    # The prediction is conditioned on up to 20 previous time-steps
    states = states[:, -model.config.max_length :]
    actions = actions[:, -model.config.max_length :]
    returns_to_go = returns_to_go[:, -model.config.max_length :]
    timesteps = timesteps[:, -model.config.max_length :]

    # pad all tokens to sequence length, this is required if we process batches
    # (zeros are added on the left and masked out by attention_mask)
    padding = model.config.max_length - states.shape[1]
    attention_mask = torch.cat([torch.zeros(padding), torch.ones(states.shape[1])])
    attention_mask = attention_mask.to(dtype=torch.long).reshape(1, -1)
    states = torch.cat([torch.zeros((1, padding, model.config.state_dim)), states], dim=1).float()
    actions = torch.cat([torch.zeros((1, padding, model.config.act_dim)), actions], dim=1).float()
    returns_to_go = torch.cat([torch.zeros((1, padding, 1)), returns_to_go], dim=1).float()
    timesteps = torch.cat([torch.zeros((1, padding), dtype=torch.long), timesteps], dim=1)

    # perform the prediction
    state_preds, action_preds, return_preds = model(
        states=states,
        actions=actions,
        rewards=rewards,
        returns_to_go=returns_to_go,
        timesteps=timesteps,
        attention_mask=attention_mask,
        return_dict=False,
    )
    # keep only the action predicted for the most recent time-step
    return action_preds[0, -1]
```

149 |
+
### Evaluating the model
|
150 |
+
|
151 |
+
In order to evaluate the model, we need some additional information; the mean and standard deviation of the states that were used during training. Fortunately, these are available for each of the checkpoint’s model card on the Hugging Face Hub!
|
152 |
+
|
153 |
+
We also need a target return for the model. This is the power of Offline Reinforcement Learning: we can use the target return to control the performance of the policy. This could be really powerful in a multiplayer setting, where we would like to adjust the performance of an opponent bot to be at a suitable difficulty for the player. The authors show a great plot of this in their paper!
|
154 |
+
|
155 |
+
_Sampled (evaluation) returns accumulated by Decision Transformer when conditioned on the specified target (desired) returns. Top: Atari. Bottom: D4RL medium-replay datasets. Figure from \[1\]._
|
156 |
+
|
157 |
+
```
TARGET_RETURN = 3.6  # This was normalized during training
MAX_EPISODE_LENGTH = 1000
# Assumption: returns were divided by 1000 during training, so 3.6 stands for a raw return of 3600
scale = 1000.0

state_mean = np.array(
    [1.3490015, -0.11208222, -0.5506444, -0.13188992, -0.00378754, 2.6071432,
     0.02322114, -0.01626922, -0.06840388, -0.05183131, 0.04272673,])

state_std = np.array(
    [0.15980862, 0.0446214, 0.14307782, 0.17629202, 0.5912333, 0.5899924,
     1.5405099, 0.8152689, 2.0173461, 2.4107876, 5.8440027,])

state_mean = torch.from_numpy(state_mean)
state_std = torch.from_numpy(state_std)

state = env.reset()
target_return = torch.tensor(TARGET_RETURN).float().reshape(1, 1)
states = torch.from_numpy(state).reshape(1, state_dim).float()
actions = torch.zeros((0, act_dim)).float()
rewards = torch.zeros(0).float()
timesteps = torch.tensor(0).reshape(1, 1).long()

# take steps in the environment
for t in range(MAX_EPISODE_LENGTH):
    # add zeros for actions as input for the current time-step
    actions = torch.cat([actions, torch.zeros((1, act_dim))], dim=0)
    rewards = torch.cat([rewards, torch.zeros(1)])

    # predicting the action to take
    action = get_action(model,
                        (states - state_mean) / state_std,
                        actions,
                        rewards,
                        target_return,
                        timesteps)
    actions[-1] = action
    action = action.detach().numpy()

    # interact with the environment based on this action
    state, reward, done, _ = env.step(action)

    cur_state = torch.from_numpy(state).reshape(1, state_dim)
    states = torch.cat([states, cur_state], dim=0)
    rewards[-1] = reward

    # subtract the scaled reward from the remaining target return
    pred_return = target_return[0, -1] - (reward / scale)
    target_return = torch.cat([target_return, pred_return.reshape(1, 1)], dim=1)
    timesteps = torch.cat([timesteps, torch.ones((1, 1)).long() * (t + 1)], dim=1)

    if done:
        break
```
|
209 |
+
|
210 |
+
You will find a more detailed example, including the creation of videos of the agent, in our Colab notebook.
|
211 |
+
|
212 |
+
## Conclusion
|
213 |
+
|
214 |
+
In addition to Decision Transformers, we want to support more use cases and tools from the Deep Reinforcement Learning community. Therefore, it would be great to hear your feedback on the Decision Transformer model, and more generally anything we can build with you that would be useful for RL. Feel free to **reach out to us**.
|
215 |
+
|
216 |
+
## What’s next?
|
217 |
+
|
218 |
+
In the coming weeks and months, we plan on supporting other tools from the ecosystem:
|
219 |
+
|
220 |
+
- Integrating **RL-baselines3-zoo**
|
221 |
+
- Uploading **RL-trained-agents models** into the Hub: a big collection of pre-trained Reinforcement Learning agents using stable-baselines3
|
222 |
+
- Integrating other Deep Reinforcement Learning libraries
|
223 |
+
- Implementing Convolutional Decision Transformers For Atari
|
224 |
+
- And more to come 🥳
|
225 |
+
|
226 |
+
The best way to keep in touch is to **join our discord server** to exchange with us and with the community.
|
227 |
+
|
228 |
+
## References
|
229 |
+
|
230 |
+
\[1\] Chen, Lili, et al. "Decision transformer: Reinforcement learning via sequence modeling." _Advances in neural information processing systems_ 34 (2021).
|
231 |
+
|
232 |
+
\[2\] Agarwal, Rishabh, Dale Schuurmans, and Mohammad Norouzi. "An optimistic perspective on offline reinforcement learning." _International Conference on Machine Learning_. PMLR, 2020.
|
233 |
+
|
234 |
+
### Acknowledgements
|
235 |
+
|
236 |
+
We would like to thank the paper’s first authors, Kevin Lu and Lili Chen, for their constructive conversations.
|
examples/HFblog-Introducing Hugging Face for Education.txt
ADDED
@@ -0,0 +1,76 @@
1 |
+
---
|
2 |
+
created: 2022-05-23T01:23:51 (UTC +02:00)
|
3 |
+
tags: []
|
4 |
+
source: https://huggingface.co/blog/education
|
5 |
+
author: Violette
|
6 |
+
Violette Lepercq
|
7 |
+
---
|
8 |
+
|
9 |
+
# Introducing Hugging Face for Education 🤗
|
10 |
+
|
11 |
+
> ## Excerpt
|
12 |
+
> We’re on a journey to advance and democratize artificial intelligence through open source and open science.
|
13 |
+
|
14 |
+
---
|
15 |
+
|
16 |
+
|
17 |
+
Given that machine learning will make up the overwhelming majority of software development and that non-technical people will be exposed to AI systems more and more, one of the main challenges of AI is adapting and enhancing employee skills. It is also becoming necessary to support teaching staff in proactively taking AI's ethical and critical issues into account.
|
18 |
+
|
19 |
+
As an open-source company democratizing machine learning, Hugging Face believes it is essential to educate people from all backgrounds worldwide.
|
20 |
+
|
21 |
+
We launched the ML demo.cratization tour in March 2022, where experts from Hugging Face taught hands-on classes on Building Machine Learning Collaboratively to more than 1000 students from 16 countries. Our new goal: **to teach machine learning to 5 million people by the end of 2023**.
|
22 |
+
|
23 |
+
_This blog post provides a high-level description of how we will reach our goals around education._
|
24 |
+
|
25 |
+
## 🤗 **Education for All**
|
26 |
+
|
27 |
+
🗣️ Our goal is to make the potential and limitations of machine learning understandable to everyone. We believe that doing so will help evolve the field in a direction where the application of these technologies will lead to net benefits for society as a whole.
|
28 |
+
|
29 |
+
Some examples of our existing efforts:
|
30 |
+
|
31 |
+
- we describe in a very accessible way different uses of ML models (summarization, text generation, object detection…),
|
32 |
+
- we allow everyone to try out models directly in their browser through widgets in the model pages, hence lowering the need for technical skills to do so (example),
|
33 |
+
- we document and warn about harmful biases identified in systems (like GPT-2),
|
34 |
+
- we provide tools to create open-source ML apps that allow anyone to understand the potential of ML in one click.
|
35 |
+
|
36 |
+
## 🤗 **Education for Beginners**
|
37 |
+
|
38 |
+
🗣️ We want to lower the barrier to becoming a machine learning engineer by providing online courses, hands-on workshops, and other innovative techniques.
|
39 |
+
|
40 |
+
- We provide a free course about natural language processing (NLP) and more domains (soon) using free tools and libraries from the Hugging Face ecosystem. It’s completely free and without ads. The ultimate goal of this course is to learn how to apply Transformers to (almost) any machine learning problem!
|
41 |
+
- We provide a free course about Deep Reinforcement Learning. In this course, you can study Deep Reinforcement Learning in theory and practice, learn to use famous Deep RL libraries, train agents in unique environments, publish your trained agents in one line of code to the Hugging Face Hub, and more!
|
42 |
+
- We provide a free course on how to build interactive demos for your machine learning models. The ultimate goal of this course is to allow ML developers to easily present their work to a wide audience including non-technical teams or customers, researchers to more easily reproduce machine learning models and behavior, end users to more easily identify and debug failure points of models, and more!
|
43 |
+
- Experts at Hugging Face wrote a book on Transformers and their applications to a wide range of NLP tasks.
|
44 |
+
|
45 |
+
Apart from those efforts, many team members are involved in other educational efforts such as:
|
46 |
+
|
47 |
+
- Participating in meetups, conferences and workshops.
|
48 |
+
- Creating podcasts, YouTube videos, and blog posts.
|
49 |
+
- Organizing events in which free GPUs are provided for anyone to be able to train and share models and create demos for them.
|
50 |
+
|
51 |
+
## 🤗 **Education for Instructors**
|
52 |
+
|
53 |
+
🗣️ We want to empower educators with tools and offer collaborative spaces where students can build machine learning using open-source technologies and state-of-the-art machine learning models.
|
54 |
+
|
55 |
+
- We provide educators with free infrastructure and resources to quickly introduce real-world applications of ML to their students and make learning more fun and interesting. By creating a classroom for free from the hub, instructors can turn their classes into collaborative environments where students can learn and build ML-powered applications using free open-source technologies and state-of-the-art models.
|
56 |
+
|
57 |
+
- We’ve assembled a free toolkit, translated into 8 languages, that instructors of machine learning or data science can use to easily prepare labs, homework, or classes. The content is self-contained so that it can be easily incorporated into an existing curriculum. This content is free and uses well-known open-source technologies (🤗 transformers, gradio, etc.). Feel free to pick a tutorial and teach it!
|
58 |
+
|
59 |
+
1️⃣ A Tour through the Hugging Face Hub
|
60 |
+
|
61 |
+
2️⃣ Build and Host Machine Learning Demos with Gradio & Hugging Face
|
62 |
+
|
63 |
+
3️⃣ Getting Started with Transformers
|
64 |
+
|
65 |
+
- We're organizing a dedicated, free workshop (June 6) on how to teach our educational resources in your machine learning and data science classes. Do not hesitate to register.
|
66 |
+
|
67 |
+
- We are currently doing a worldwide tour in collaboration with university instructors to teach more than 10,000 students one of our core topics: how to build machine learning collaboratively. You can request someone on the Hugging Face team to run the session for your class via the ML demo.cratization tour initiative.
|
68 |
+
|
69 |
+
|
70 |
+
## 🤗 **Education Events & News**
|
71 |
+
|
72 |
+
- **05/13** \[NEWS\]: Are you studying machine learning? Do you want to be a part of our ML democratization efforts and show your campus community how to build ML models with Hugging Face? We want to support you in your journey! You have until June 13th to apply to the 🤗 Student Application Program.
|
73 |
+
- **06/06**\[EVENT\]: How to Teach Open-Source Machine Learning Tools. Register
|
74 |
+
- **09/08**\[EVENT\]: ML Demo.cratization tour in Argentina at 2pm (GMT-3). Link coming soon
|
75 |
+
|
76 |
+
🔥 We are currently working on more content in the course, and more! Stay tuned!
|