Reinforcement learning is the computational approach of learning from action: an agent learns by interacting with an environment through trial and error, receiving rewards (positive or negative) as feedback.
For example, a driving agent maps states to actions:
- Red traffic light, pedestrians are about to pass → Stop the car.
- Yellow light, pedestrians have crossed.
About Reinforcement Learning
Reinforcement learning is known for its application to video games. Games provide a safe environment for training an agent: they are perfectly defined and controllable, which makes them ideal candidates for experimentation and for learning about the capabilities and limitations of various RL algorithms.
There are many videos on the Internet where a game-playing reinforcement learning agent starts with a terrible gaming strategy due to the random initialization of its parameters, but gets better and better with each episode of training. Research has also investigated the performance of RL in popular games such as Minecraft and Dota 2, where the agent's performance can exceed a human player's, although some challenges remain, mainly related to the efficiency of constructing the agent's gaming policy.
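The episode-over-episode improvement described above can be sketched with a tiny tabular Q-learning loop. Everything below — the 5-state corridor environment, the rewards, and the hyperparameters — is an invented illustration, not an algorithm from this page:

```python
import random

random.seed(0)

# Toy "corridor" environment (invented for illustration): states 0..4,
# the agent starts at state 0 and gets reward 1 for reaching state 4.
N_STATES, GOAL = 5, 4
ACTIONS = [-1, +1]  # move left / move right

def step(state, action):
    next_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

# Q-table: the agent's estimate of the return for each (state, action) pair.
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.1  # learning rate, discount, exploration

def choose_action(state):
    # Epsilon-greedy: mostly exploit the best known action, sometimes explore.
    if random.random() < epsilon or Q[(state, -1)] == Q[(state, 1)]:
        return random.choice(ACTIONS)  # unknown/tied values: explore randomly
    return max(ACTIONS, key=lambda a: Q[(state, a)])

for episode in range(200):
    state, done = 0, False
    while not done:
        action = choose_action(state)
        next_state, reward, done = step(state, action)
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        # Q-learning update: nudge the estimate toward reward + discounted best next value.
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

# After training, the learned greedy policy moves right in every state.
print([max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)])  # → [1, 1, 1, 1]
```

Early episodes wander at random; later episodes head straight for the goal — the same "terrible start, steady improvement" pattern seen in those videos, in miniature.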
You can contribute variants of this task here.
Agent: The learner and the decision maker.
Environment: The part of the world the agent interacts with, comprising everything outside the agent.
State: Information the agent receives from the environment. In the case of a video game it can be a frame (a screenshot), in the case of a chess playing agent it can be the board position, in the case of a trading agent it can be the price of a certain stock.
Action: The decision taken by the agent.
Reward: The numerical feedback signal that the agent receives from the environment based on the chosen action.
Return: Cumulative Reward. In the simplest case, the return is the sum of the rewards.
Episode: For some applications there is a natural notion of final time step. In this case, there is a starting point and an ending point (a terminal state). This creates an episode: a list of States, Actions, Rewards, and new States. For instance, think about Chess: an episode begins at the initial board position and ends when the game is over.
Policy: The policy is the brain of the agent: it is the function that tells the agent what action to take given the state, so it defines the agent's behavior at a given time. Reinforcement learning methods specify how the agent's policy is changed as a result of its experience.
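The terms above all meet in a single interaction loop: the agent's policy picks an action, the environment answers with a new state and a reward, and the rewards accumulate into the return over an episode. A minimal sketch (the two-state guessing environment and the random policy below are invented for illustration):

```python
import random

random.seed(42)

# Invented two-state guessing environment: the agent is rewarded when its
# action matches the current state; the next state is drawn at random.
def env_step(state, action):
    reward = 1.0 if action == state else 0.0
    return random.randint(0, 1), reward

# A (deliberately bad) random policy; RL methods improve the policy from experience.
def policy(state):
    return random.randint(0, 1)

state = random.randint(0, 1)  # initial state from the environment
episode_return = 0.0          # return = cumulative reward
for t in range(10):           # a fixed-length episode of 10 time steps
    action = policy(state)                   # agent chooses an action...
    state, reward = env_step(state, action)  # ...environment responds
    episode_return += reward                 # accumulate reward into the return
print(episode_return)
```

Replacing the random `policy` with one learned from experience is exactly what reinforcement learning methods do.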
Would you like to learn more about the topic? Awesome! Here you can find some curated resources that you may find helpful!
- HuggingFace Deep Reinforcement Learning Class
- Introduction to Deep Reinforcement Learning
- Stable Baselines Integration with HuggingFace
- Train a Deep Reinforcement Learning lander agent to land correctly on the Moon 🌕 using Stable-Baselines3
- Introduction to Unity MLAgents
- Training Decision Transformers with 🤗 transformers
No example widget is defined for this task.
Note Contribute by proposing a widget for this task!
Note A Reinforcement Learning model trained on expert data from the Gym Hopper environment
Note A PPO agent playing seals/CartPole-v0 using the stable-baselines3 library and the RL Zoo.
Note A curation of widely used datasets for Data Driven Deep Reinforcement Learning (D4RL)
- Discounted Total Reward
- Accumulated reward across all time steps, discounted by a factor between 0 and 1 that determines how much the agent optimizes for future rewards relative to immediate ones. Measures how good the policy ultimately found by a given algorithm is, considering uncertainty over the future.
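For instance, with rewards [1, 0, 2] and discount factor γ = 0.9 (made-up numbers), the discounted total reward is 1 + 0.9·0 + 0.9²·2 = 2.62; a minimal sketch:

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each discounted by gamma**t for its time step t."""
    return sum(r * gamma**t for t, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # ≈ 2.62
```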
- Mean Reward
- Average return obtained after running the policy for a certain number of evaluation episodes. As opposed to total reward, mean reward considers how much reward a given algorithm receives while learning.
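Mean reward is then just the average of the per-episode returns over the evaluation runs; sketched here with made-up episode returns:

```python
def mean_reward(episode_returns):
    """Average return over a set of evaluation episodes."""
    return sum(episode_returns) / len(episode_returns)

print(mean_reward([200.0, 180.0, 220.0]))  # → 200.0
```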
- Level of Performance After Some Time
- Measures how good a given algorithm is after a predefined time. Some algorithms may be guaranteed to converge to optimal behavior across many time steps. However, an agent that reaches an acceptable level of optimality after a given time horizon may be preferable to one that ultimately reaches optimality but takes a long time.