Reinforcement Learning

Reinforcement learning is the computational approach of learning from actions: an agent interacts with an environment through trial and error and receives rewards (positive or negative) as feedback.


State: Red traffic light, pedestrians are about to pass.

Reinforcement Learning Model

Action: Stop the car.

Next State: Yellow light, pedestrians have crossed.

About Reinforcement Learning

Use Cases


Reinforcement learning is well known for its application to video games. Games provide a safe training environment for the agent because they are perfectly defined and controllable. This makes them ideal candidates for experimentation and for learning about the capabilities and limitations of various RL algorithms.

There are many videos on the Internet in which a game-playing reinforcement learning agent starts with a terrible strategy because its parameters are randomly initialized, and then improves with each training episode. This paper mainly investigates the performance of RL in popular games such as Minecraft or Dota2. The agent's performance can exceed a human player's, although challenges remain, mainly related to how efficiently the agent's gaming policy can be constructed.



Agent: The learner and the decision maker.

Environment: The part of the world the agent interacts with, comprising everything outside the agent.

State: Information the agent receives from the environment. For a video game it can be a frame (a screenshot), for a chess-playing agent it can be the board position, and for a trading agent it can be the price of a certain stock.

Action: The decision taken by the agent.

Reward: The numerical feedback signal that the agent receives from the environment based on the chosen action.

Return: Cumulative Reward. In the simplest case, the return is the sum of the rewards.

Episode: For some applications there is a natural notion of final time step. In this case, there is a starting point and an ending point (a terminal state). This creates an episode: a list of States, Actions, Rewards, and new States. For instance, think about Chess: an episode begins at the initial board position and ends when the game is over.

Policy: The Policy is the brain of the Agent. It is the function that tells the agent what action to take given the state, so it defines the agent’s behavior at a given time. Reinforcement learning methods specify how the agent’s policy changes as a result of its experience.
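The terms above can be tied together in a short sketch of the agent-environment loop. Everything in it is a hypothetical illustration and not part of any library: `CorridorEnv` is a made-up toy environment, `always_right_policy` is a fixed hand-written policy rather than a learned one, and the reward values are arbitrary assumptions.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: the agent walks along cells 0..length-1."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        # Start a new episode and return the initial state.
        self.position = 0
        return self.position

    def step(self, action):
        # Action 1 moves right, any other action moves left (never below cell 0).
        move = 1 if action == 1 else -1
        self.position = max(0, self.position + move)
        done = self.position == self.length - 1  # terminal state: last cell
        reward = 1.0 if done else -0.1           # assumed reward scheme
        return self.position, reward, done


def random_policy(state, rng):
    # A policy maps a state to an action; this one ignores the state.
    return rng.choice([0, 1])


def always_right_policy(state, rng):
    return 1


def run_episode(env, policy, rng, max_steps=100):
    """Run one episode and collect (state, action, reward, next_state) tuples."""
    state = env.reset()
    trajectory = []
    episode_return = 0.0  # simplest case: the return is the sum of rewards
    for _ in range(max_steps):
        action = policy(state, rng)
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward, next_state))
        episode_return += reward
        state = next_state
        if done:
            break
    return trajectory, episode_return


rng = random.Random(0)
trajectory, episode_return = run_episode(CorridorEnv(), always_right_policy, rng)
print(len(trajectory), round(episode_return, 2))
```

Swapping in `random_policy` shows how a poor policy yields longer episodes and lower returns on the same environment. A reinforcement learning method would adjust the policy based on such trajectories, which this sketch does not attempt.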



Useful Resources

Would you like to learn more about the topic? Awesome! Here you can find some curated resources that you may find helpful!


This page was made possible thanks to the efforts of Ram Ananth, Emilio Lehoucq and Osman Alenbey.


Models for Reinforcement Learning

A Reinforcement Learning model trained on expert data from the Gym Hopper environment.

A PPO agent playing seals/CartPole-v0 using the stable-baselines3 library and the RL Zoo.

Datasets for Reinforcement Learning

A curation of widely used datasets for Data Driven Deep Reinforcement Learning (D4RL).

Metrics for Reinforcement Learning
Discounted Total Reward
Accumulated reward across all time steps, discounted by a factor between 0 and 1 that determines how much the agent optimizes for future rewards relative to immediate ones. It measures how good the policy ultimately found by a given algorithm is, taking uncertainty about the future into account.
Mean Reward
Average return obtained after running the policy for a certain number of evaluation episodes. As opposed to total reward, mean reward considers how much reward a given algorithm receives while learning.
Level of Performance After Some Time
Measures how good a given algorithm is after a predefined time. Some algorithms may be guaranteed to converge to optimal behavior across many time steps. However, an agent that reaches an acceptable level of optimality after a given time horizon may be preferable to one that ultimately reaches optimality but takes a long time.
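As a rough illustration of the first two metrics, the sketch below computes a discounted total reward for one episode and a mean reward over several evaluation episodes. The function names `discounted_return` and `mean_reward`, the symbol `gamma` for the discount factor, and the sample numbers are assumptions made for this example only.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward at step t is weighted by gamma ** t."""
    total = 0.0
    # Work backwards so each step adds its reward to the discounted future total.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def mean_reward(episode_returns):
    """Average return over a set of evaluation episodes."""
    return sum(episode_returns) / len(episode_returns)


rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
print(discounted_return(rewards, gamma=1.0))  # no discounting: 3.0
print(mean_reward([10.0, 20.0, 30.0]))        # 20.0
```

A discount factor close to 0 makes the agent favor immediate rewards, while a factor close to 1 weights future rewards almost as much as immediate ones.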