Deep RL Course documentation

# Glossary

This is a community-created glossary. Contributions are welcome!

### Strategies to find the optimal policy

• Policy-based methods. The policy is usually represented by a neural network that is trained to select which action to take given a state. In this case, the neural network itself outputs the action that the agent should take, instead of relying on a value function. Based on the experience collected from the environment, the neural network is adjusted so that it provides better actions.
• Value-based methods. In this case, a value function is trained to output the value of a state or a state-action pair, and the policy is derived from those values. The value itself doesn't define which action the agent should take, so we need to specify how the agent behaves given the output of the value function. For example, we could adopt a policy that always takes the action with the highest value (Greedy Policy). In summary, the policy is a Greedy Policy (or whatever rule the user chooses) that uses the output of the value function to decide which action to take.
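
To make the contrast concrete, here is a minimal sketch under assumptions that are not part of the glossary: a hypothetical toy problem with 2 states and 3 actions, a hand-written table `q_table` standing in for a trained value function, and a table `policy_logits` standing in for the output of a policy network. All numbers and names are made up for illustration.

```python
import math
import random

# Value-based: we store action values and add an explicit rule on top of them.
# (Hypothetical values for a toy problem with 2 states and 3 actions.)
q_table = {
    0: [0.1, 0.7, 0.2],
    1: [0.5, 0.4, 0.9],
}

def greedy_policy(q_table, state):
    """Pick the action with the highest estimated value in this state."""
    values = q_table[state]
    return max(range(len(values)), key=lambda a: values[a])

# Policy-based: the model outputs action preferences directly. These logits
# stand in for a neural network's output (an assumption of this sketch).
policy_logits = {
    0: [0.0, 2.0, 0.5],
    1: [1.0, 0.0, 3.0],
}

def softmax(logits):
    """Turn raw preferences into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(policy_logits, state, rng):
    """Sample an action from the policy's output distribution."""
    probs = softmax(policy_logits[state])
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(0)
value_based_action = greedy_policy(q_table, 0)                # uses values + a rule
policy_based_action = sample_action(policy_logits, 0, rng)    # uses the policy directly
```

In the value-based half, the behavior lives in `greedy_policy`, which could be swapped for another rule without retraining the values. In the policy-based half, there is no separate rule: the action comes straight from the policy's output.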

### Among the value-based methods, we can find two main strategies

• The state-value function. For each state, the state-value function gives the expected return if the agent starts in that state and then follows the policy until the end.
• The action-value function. In contrast to the state-value function, the action-value function gives, for each state-action pair, the expected return if the agent starts in that state, takes that action, and then follows the policy forever after.
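
The two definitions above can be written compactly as follows. The notation is the usual textbook one and is an assumption here, since the glossary itself does not fix symbols: $\pi$ is the policy, $G_t$ the return from time step $t$, $R_t$ the reward, and $\gamma \in [0, 1]$ a discount factor.

```latex
G_t = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}

V_\pi(s) = \mathbb{E}_\pi \left[ G_t \mid S_t = s \right]

Q_\pi(s, a) = \mathbb{E}_\pi \left[ G_t \mid S_t = s, A_t = a \right]

V_\pi(s) = \sum_{a} \pi(a \mid s) \, Q_\pi(s, a)
```

The last line links the two: the value of a state is the average of its action values, weighted by how likely the policy is to pick each action.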

### Epsilon-greedy strategy

• A common strategy in reinforcement learning for balancing exploration and exploitation.
• Chooses the action with the highest expected reward with a probability of 1-epsilon.
• Chooses a random action with a probability of epsilon.
• Epsilon is typically decreased over time to shift focus towards exploitation.
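
The bullets above can be sketched in a few lines of Python. This is an illustrative sketch, not a reference implementation: the function names, the exponential decay schedule, and its parameters (`eps_start`, `eps_min`, `decay_rate`) are assumptions chosen for the example.

```python
import math
import random

def epsilon_greedy_action(q_values, epsilon, rng):
    """With probability epsilon explore, otherwise exploit."""
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: pick the action with the highest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])

def decayed_epsilon(step, eps_start=1.0, eps_min=0.05, decay_rate=0.001):
    """Exponential decay from eps_start towards eps_min (an assumed schedule)."""
    return eps_min + (eps_start - eps_min) * math.exp(-decay_rate * step)

# Usage: epsilon shrinks as training progresses, so later steps exploit more.
rng = random.Random(0)
q_values = [0.1, 0.7, 0.2]  # hypothetical value estimates for 3 actions
actions = [
    epsilon_greedy_action(q_values, decayed_epsilon(step), rng)
    for step in range(5000)
]
```

Because the random draw can also land on the best action, that action is actually picked with probability 1 - epsilon + epsilon / n when there are n actions.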

### Greedy strategy

• Always chooses the action that is expected to lead to the highest reward, based on the current knowledge of the environment.
• Involves only exploitation and does not include any exploration.
• Can be disadvantageous in environments with uncertainty or unknown optimal actions, because better actions may never be tried.
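
The last point can be illustrated with a small sketch. The setup is hypothetical and deliberately simple: a two-armed bandit with fixed rewards, value estimates that start at zero, and ties broken in favor of the lowest action index.

```python
def run_greedy_bandit(rewards, steps):
    """Run a purely greedy agent on a deterministic multi-armed bandit.

    rewards[a] is the fixed reward of arm a (a hypothetical toy setup).
    Returns the value estimates and how often each arm was pulled.
    """
    estimates = [0.0] * len(rewards)  # initial value estimates
    pulls = [0] * len(rewards)
    for _ in range(steps):
        # Greedy choice; on ties, max() keeps the lowest index.
        action = max(range(len(rewards)), key=lambda a: estimates[a])
        reward = rewards[action]
        pulls[action] += 1
        # Incremental sample-average update of the chosen arm's estimate.
        estimates[action] += (reward - estimates[action]) / pulls[action]
    return estimates, pulls

# Arm 1 pays ten times more, but the greedy agent never finds out.
estimates, pulls = run_greedy_bandit(rewards=[0.1, 1.0], steps=100)
print(pulls)  # → [100, 0]
```

After the first pull, arm 0 has a positive estimate while arm 1 is still at zero, so the greedy agent keeps choosing arm 0 and never tries the better arm. Any amount of exploration, such as an epsilon-greedy rule, would eventually reveal it.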

If you want to improve the course, you can open a Pull Request.

This glossary was made possible thanks to: