Update README.md

README.md (changed)
---
title: Reinforcement Learning Graphical Representations
date: 2026-04-08
category: Reinforcement Learning
description: A comprehensive gallery of 230 standard RL components and their graphical representations.
---

# Reinforcement Learning Graphical Representations

This repository contains a full set of 230 visualizations representing foundational concepts, algorithms, and advanced topics in Reinforcement Learning.

| Category | Component | Illustration | Details | Context |
|----------|-----------|--------------|---------|---------|
| **MDP & Environment** | **Agent-Environment Interaction Loop** |  | Core cycle: observation of state → selection of action → environment transition → receipt of reward + next state | All RL algorithms |
| **MDP & Environment** | **Markov Decision Process (MDP) Tuple** |  | (S, A, P, R, γ) with transition dynamics P(s′\|s,a) and reward function R(s,a,s′) |  |
| **MDP & Environment** | **State Transition Graph** |  | Full probabilistic transitions between discrete states | Gridworld, Taxi, Cliff Walking |
| **MDP & Environment** | **Trajectory / Episode Sequence** |  | Sequence of (s₀, a₀, r₁, s₁, …, s_T) | Monte Carlo, episodic tasks |
| **MDP & Environment** | **Continuous State/Action Space Visualization** |  | High-dimensional spaces (e.g., robot joints, pixel inputs) | Continuous-control tasks (MuJoCo, PyBullet) |
| **MDP & Environment** | **Reward Function / Landscape** |  | Scalar reward as function of state/action | All algorithms; especially reward shaping |
| **MDP & Environment** | **Discount Factor (γ) Effect** |  | How future rewards are weighted | All discounted MDPs |
| **Value & Policy** | **State-Value Function V(s)** |  | Expected return from state s under policy π | Value-based methods |
| **Value & Policy** | **Action-Value Function Q(s,a)** |  | Expected return from state-action pair | Q-learning family |
| **Value & Policy** | **Policy π(s) or π(a\|s)** |  | Arrow overlays on grid (optimal policy), probability bar charts, or softmax heatmaps |  |
| **Value & Policy** | **Advantage Function A(s,a)** |  | Q(s,a) – V(s) | A2C, PPO, SAC, TD3 |
| **Value & Policy** | **Optimal Value Function V* / Q*** |  | Solution to Bellman optimality | Value iteration, Q-learning |
| **Dynamic Programming** | **Policy Evaluation Backup** |  | Iterative update of V using Bellman expectation | Policy iteration |
| **Dynamic Programming** | **Policy Improvement** |  | Greedy policy update over Q | Policy iteration |
| **Dynamic Programming** | **Value Iteration Backup** |  | Update using Bellman optimality | Value iteration |
| **Dynamic Programming** | **Policy Iteration Full Cycle** |  | Evaluation → Improvement loop | Classic DP methods |
| **Monte Carlo** | **Monte Carlo Backup** |  | Update using full episode return G_t | First-visit / every-visit MC |
| **Monte Carlo** | **Monte Carlo Tree (MCTS)** |  | Search tree with selection, expansion, simulation, backprop | AlphaGo, AlphaZero |
| **Monte Carlo** | **Importance Sampling Ratio** |  | Off-policy correction ρ = π(a\|s) / b(a\|s) |  |
| **Temporal Difference** | **TD(0) Backup** |  | Bootstrapped update using R + γV(s′) | TD learning |
| **Temporal Difference** | **Bootstrapping (general)** |  | Using estimated future value instead of full return | All TD methods |
| **Temporal Difference** | **n-step TD Backup** |  | Multi-step return G_t^{(n)} | n-step TD, TD(λ) |
| **Temporal Difference** | **TD(λ) & Eligibility Traces** |  | Decaying trace z_t for credit assignment | TD(λ), SARSA(λ), Q(λ) |
| **Temporal Difference** | **SARSA Update** |  | On-policy TD control | SARSA |
| **Temporal Difference** | **Q-Learning Update** |  | Off-policy TD control; see the code sketch after the table | Q-learning, Deep Q-Network |
| **Temporal Difference** | **Expected SARSA** |  | Expectation over next action under policy | Expected SARSA |
| **Temporal Difference** | **Double Q-Learning / Double DQN** |  | Two separate Q estimators to reduce overestimation | Double DQN, TD3 |
| **Temporal Difference** | **Dueling DQN Architecture** |  | Separate streams for state value V(s) and advantage A(s,a) | Dueling DQN |
| **Temporal Difference** | **Prioritized Experience Replay** |  | Importance sampling of transitions by TD error | Prioritized DQN, Rainbow |
| **Temporal Difference** | **Rainbow DQN Components** |  | All extensions combined (Double, Dueling, PER, etc.) | Rainbow DQN |
| **Function Approximation** | **Linear Function Approximation** |  | Feature vector φ(s) → wᵀφ(s) | Tabular → linear FA |
| **Function Approximation** | **Neural Network Layers (MLP, CNN, RNN, Transformer)** |  | Full deep network for value/policy | DQN, A3C, PPO, Decision Transformer |
| **Function Approximation** | **Computation Graph / Backpropagation Flow** |  | Gradient flow through network | All deep RL |
| **Function Approximation** | **Target Network** |  | Frozen copy of Q-network for stability | DQN, DDQN, SAC, TD3 |
| **Policy Gradients** | **Policy Gradient Theorem** |  | ∇_θ J(θ) = E[∇_θ log π(a\|s) Q^π(s,a)]; flow diagram from reward → log-prob → gradient |  |
| **Policy Gradients** | **REINFORCE Update** |  | Monte-Carlo policy gradient | REINFORCE |
| **Policy Gradients** | **Baseline / Advantage Subtraction** |  | Subtract b(s) to reduce variance | All modern PG |
| **Policy Gradients** | **Trust Region (TRPO)** |  | KL-divergence constraint on policy update | TRPO |
| **Policy Gradients** | **Proximal Policy Optimization (PPO)** |  | Clipped surrogate objective; see the code sketch after the table | PPO, PPO-Clip |
| **Actor-Critic** | **Actor-Critic Architecture** |  | Separate or shared actor (policy) + critic (value) networks | A2C, A3C, SAC, TD3 |
| **Actor-Critic** | **Advantage Actor-Critic (A2C/A3C)** |  | Synchronous/asynchronous multi-worker | A2C/A3C |
| **Actor-Critic** | **Soft Actor-Critic (SAC)** |  | Entropy-regularized policy + twin critics | SAC |
| **Actor-Critic** | **Twin Delayed DDPG (TD3)** |  | Twin critics + delayed policy + target smoothing | TD3 |
| **Exploration** | **ε-Greedy Strategy** |  | Probability ε of random action | DQN family |
| **Exploration** | **Softmax / Boltzmann Exploration** |  | Temperature τ in softmax | Softmax policies |
| **Exploration** | **Upper Confidence Bound (UCB)** |  | Optimism in face of uncertainty | UCB1, bandits |
| **Exploration** | **Intrinsic Motivation / Curiosity** |  | Prediction error as intrinsic reward | ICM, RND, Curiosity-driven RL |
| **Exploration** | **Entropy Regularization** |  | Bonus term αH(π) | SAC, maximum-entropy RL |
| **Hierarchical RL** | **Options Framework** |  | High-level policy over options (temporally extended actions) | Option-Critic |
| **Hierarchical RL** | **Feudal Networks / Hierarchical Actor-Critic** |  | Manager-worker hierarchy | Feudal RL |
| **Hierarchical RL** | **Skill Discovery** |  | Unsupervised emergence of reusable skills | DIAYN, VALOR |
| **Model-Based RL** | **Learned Dynamics Model** |  | Learned transition model ˆP(s′\|s,a); separate model network diagram (often RNN or transformer) |  |
| **Model-Based RL** | **Model-Based Planning** |  | Rollouts inside learned model | MuZero, DreamerV3 |
| **Model-Based RL** | **Imagination-Augmented Agents (I2A)** |  | Imagination module + policy | I2A |
| **Offline RL** | **Offline Dataset** |  | Fixed batch of trajectories | BC, CQL, IQL |
| **Offline RL** | **Conservative Q-Learning (CQL)** |  | Penalty on out-of-distribution actions | CQL |
| **Multi-Agent RL** | **Multi-Agent Interaction Graph** |  | Agents communicating or competing | MARL, MADDPG |
| **Multi-Agent RL** | **Centralized Training Decentralized Execution (CTDE)** |  | Shared critic during training | QMIX, VDN, MADDPG |
| **Multi-Agent RL** | **Cooperative / Competitive Payoff Matrix** |  | Joint reward for multiple agents | Prisoner's Dilemma, multi-agent gridworlds |
| **Inverse RL / IRL** | **Reward Inference** |  | Infer reward from expert demonstrations | IRL, GAIL |
| **Inverse RL / IRL** | **Generative Adversarial Imitation Learning (GAIL)** |  | Discriminator vs. policy generator | GAIL, AIRL |
| **Meta-RL** | **Meta-RL Architecture** |  | Outer loop (meta-policy) + inner loop (task adaptation) | MAML for RL, RL² |
| **Meta-RL** | **Task Distribution Visualization** |  | Multiple MDPs sampled from meta-distribution | Meta-RL benchmarks |
| **Advanced / Misc** | **Experience Replay Buffer** |  | Stored (s,a,r,s′,done) tuples | DQN and all off-policy deep RL |
| **Advanced / Misc** | **State Visitation / Occupancy Measure** |  | Frequency of visiting each state | All algorithms (analysis) |
| **Advanced / Misc** | **Learning Curve** |  | Average episodic return vs. episodes / steps | Standard performance reporting |
| **Advanced / Misc** | **Regret / Cumulative Regret** |  | Sub-optimality accumulated | Bandits and online RL |
| **Advanced / Misc** | **Attention Mechanisms (Transformers in RL)** |  | Attention weights | Decision Transformer, Trajectory Transformer |
| **Advanced / Misc** | **Diffusion Policy** |  | Denoising diffusion process for action generation | Diffusion-RL policies |
| **Advanced / Misc** | **Graph Neural Networks for RL** |  | Node/edge message passing | Graph RL, relational RL |
| **Advanced / Misc** | **World Model / Latent Space** |  | Encoder-decoder dynamics in latent space | Dreamer, PlaNet |
| **Advanced / Misc** | **Convergence Analysis Plots** |  | Error / value change over iterations | DP, TD, value iteration |
| **Advanced / Misc** | **RL Algorithm Taxonomy** |  | Comprehensive classification of algorithms | All RL |
| **Advanced / Misc** | **Probabilistic Graphical Model (RL as Inference)** |  | Formalizing RL as probabilistic inference | Control as Inference, MaxEnt RL |
| **Value & Policy** | **Distributional RL (C51 / Categorical)** |  | Representing return as a probability distribution | C51, QR-DQN, IQN |
| **Exploration** | **Hindsight Experience Replay (HER)** |  | Learning from failures by relabeling goals | Sparse reward robotics, HER |
| **Model-Based RL** | **Dyna-Q Architecture** |  | Integration of real experience and model-based planning | Dyna-Q, Dyna-2 |
| **Function Approximation** | **Noisy Networks (Parameter Noise)** |  | Stochastic weights for exploration | Noisy DQN, Rainbow |
| **Exploration** | **Intrinsic Curiosity Module (ICM)** |  | Reward based on prediction error | Curiosity-driven exploration, ICM |
| **Temporal Difference** | **V-trace (IMPALA)** |  | Asynchronous off-policy importance sampling | IMPALA, V-trace |
| **Multi-Agent RL** | **QMIX Mixing Network** |  | Monotonic value function factorization | QMIX, VDN |
| **Advanced / Misc** | **Saliency Maps / Attention on State** |  | Visualizing what the agent "sees" or prioritizes | Interpretability, Atari RL |
| **Exploration** | **Action Selection Noise (OU vs Gaussian)** |  | Temporal correlation in exploration noise | DDPG, TD3 |
| **Advanced / Misc** | **t-SNE / UMAP State Embeddings** |  | Dimension reduction of high-dim neural states | Interpretability, SRL |
| **Advanced / Misc** | **Loss Landscape Visualization** |  | Optimization surface geometry | Training stability analysis |
| **Advanced / Misc** | **Success Rate vs Steps** |  | Percentage of successful episodes | Goal-conditioned RL, Robotics |
| **Advanced / Misc** | **Hyperparameter Sensitivity Heatmap** |  | Performance across parameter grids | Hyperparameter tuning |
| **Dynamics** | **Action Persistence (Frame Skipping)** |  | Temporal abstraction by repeating actions | Atari RL, Robotics |
| **Model-Based RL** | **MuZero Dynamics Search Tree** |  | Planning with learned transition and value functions | MuZero, Gumbel MuZero |
| **Deep RL** | **Policy Distillation** |  | Compressing knowledge from teacher to student | Kickstarting, multitask learning |
| **Transformers** | **Decision Transformer Token Sequence** |  | Sequential modeling of RL as a translation task | Decision Transformer, TT |
| **Advanced / Misc** | **Performance Profiles (rliable)** |  | Robust aggregate performance metrics | Reliable RL evaluation |
| **Safety RL** | **Safety Shielding / Barrier Functions** |  | Hard constraints on the action space | Constrained MDPs, Safe RL |
| **Training** | **Automated Curriculum Learning** |  | Progressively increasing task difficulty | Curriculum RL, ALP-GMM |
| **Sim-to-Real** | **Domain Randomization** |  | Generalizing across environment variations | Robotics, Sim-to-Real |
| **Alignment** | **RL with Human Feedback (RLHF)** |  | Aligning agents with human preferences | ChatGPT, InstructGPT |
| **Neuro-inspired RL** | **Successor Representation (SR)** |  | Predictive state representations | SR-Dyna, Neuro-RL |
| **Inverse RL / IRL** | **Maximum Entropy IRL** |  | Probability distribution over trajectories | MaxEnt IRL, Ziebart |
| **Theory** | **Information Bottleneck** |  | Mutual information $I(S;Z)$ and $I(Z;A)$ balance | VIB-RL, Information Theory |
| **Evolutionary RL** | **Evolutionary Strategies Population** |  | Population-based parameter search | OpenAI-ES, Salimans |
| **Safety RL** | **Control Barrier Functions (CBF)** |  | Set-theoretic safety guarantees | CBF-RL, Control Theory |
| **Exploration** | **Count-based Exploration Heatmap** |  | Visitation frequency and intrinsic bonus | MBIE-EB, RND |
| **Exploration** | **Thompson Sampling Posteriors** |  | Direct uncertainty-based action selection | Bandits, Bayesian RL |
| **Multi-Agent RL** | **Adversarial RL Interaction** |  | Competition between protagonist and antagonist | Robust RL, RARL |
| **Hierarchical RL** | **Hierarchical Subgoal Trajectory** |  | Decomposing long-horizon tasks | Subgoal RL, HIRO |
| **Offline RL** | **Offline Action Distribution Shift** |  | Mismatch between dataset and current policy | CQL, IQL, D4RL |
| **Exploration** | **Random Network Distillation (RND)** |  | Prediction error as intrinsic reward | RND, OpenAI |
| **Offline RL** | **Batch-Constrained Q-learning (BCQ)** |  | Constraining actions to behavior dataset | BCQ, Fujimoto |
| **Training** | **Population-Based Training (PBT)** |  | Evolutionary hyperparameter optimization | PBT, DeepMind |
| **Deep RL** | **Recurrent State Flow (DRQN/R2D2)** |  | Temporal dependency in state-action value | DRQN, R2D2 |
| **Theory** | **Belief State in POMDPs** |  | Probability distribution over hidden states | POMDPs, Belief Space |
| **Multi-Objective RL** | **Multi-Objective Pareto Front** |  | Balancing conflicting reward signals | MORL, Pareto Optimal |
| **Theory** | **Differential Value (Average Reward RL)** |  | Values relative to average gain | Average Reward RL, Mahadevan |
| **Infrastructure** | **Distributed RL Cluster (Ray/RLLib)** |  | Parallelizing experience collection | Ray, RLLib, Ape-X |
| **Evolutionary RL** | **Neuroevolution Topology Evolution** |  | Evolving neural network architectures | NEAT, HyperNEAT |
| **Continual RL** | **Elastic Weight Consolidation (EWC)** |  | Preventing catastrophic forgetting | EWC, Kirkpatrick et al. |
| **Theory** | **Successor Features (SF)** |  | Generalizing predictive representations | SF-Dyna, Barreto |
| **Safety** | **Adversarial State Noise (Perception)** |  | Attacks on agent observation space | Adversarial RL, Huang |
| **Imitation Learning** | **Behavioral Cloning (Imitation)** |  | Direct supervised learning from experts | BC, DAGGER |
| **Relational RL** | **Relational Graph State Representation** |  | Modeling objects and their relations | Relational MDPs, BoxWorld |
| **Quantum RL** | **Quantum RL Circuit (PQC)** |  | Gate-based quantum policy networks | Quantum RL, PQC |
| **Symbolic RL** | **Symbolic Policy Tree** |  | Policies as mathematical expressions | Symbolic RL, GP |
| **Control** | **Differentiable Physics Gradient Flow** |  | Gradient-based planning through simulators | Brax, Isaac Gym |
| **Multi-Agent RL** | **MARL Communication Channel** |  | Information exchange between agents | CommNet, DIAL |
| **Safety** | **Lagrangian Constraint Landscape** |  | Constrained optimization boundaries | Constrained RL, CPO |
| **Hierarchical RL** | **MAXQ Task Hierarchy** |  | Recursive task decomposition | MAXQ, Dietterich |
| **Agentic AI** | **ReAct Agentic Cycle** |  | Reasoning-Action loops for LLMs | ReAct, Agentic LLM |
| **Bio-inspired RL** | **Synaptic Plasticity RL** |  | Hebbian-style synaptic weight updates | Hebbian RL, STDP |
| **Control** | **Guided Policy Search (GPS)** |  | Distilling trajectories into a policy | GPS, Levine |
| **Robotics** | **Sim-to-Real Jitter & Latency** |  | Temporal robustness in transfer | Sim-to-Real, Robustness |
| **Policy Gradients** | **Deterministic Policy Gradient (DDPG) Flow** |  | Gradient flow for deterministic policies | DDPG |
| **Model-Based RL** | **Dreamer Latent Imagination** |  | Learning and planning in latent space | Dreamer (V1-V3) |
| **Deep RL** | **UNREAL Auxiliary Tasks** |  | Learning from non-reward signals | UNREAL, A3C extension |
| **Offline RL** | **Implicit Q-Learning (IQL) Expectile** |  | In-sample learning via expectile regression | IQL |
| **Model-Based RL** | **Prioritized Sweeping** |  | Planning prioritized by TD error | Sutton & Barto classic MBRL |
| **Imitation Learning** | **DAgger Expert Loop** |  | Training on expert labels in agent-visited states | DAgger |
| **Representation** | **Self-Predictive Representations (SPR)** |  | Consistency between predicted and target latents | SPR, sample-efficient RL |
| **Multi-Agent RL** | **Joint Action Space** |  | Cartesian product of individual actions | MARL theory, Game Theory |
| **Multi-Agent RL** | **Dec-POMDP Formal Model** |  | Decentralized partially observable MDP | Multi-agent coordination |
| **Theory** | **Bisimulation Metric** |  | State equivalence based on transitions/rewards | State abstraction, bisimulation theory |
| **Theory** | **Potential-Based Reward Shaping** |  | Reward transformation preserving optimal policy | Sutton & Barto, Ng et al. |
| **Training** | **Transfer RL: Source to Target** |  | Reusing knowledge across different MDPs | Transfer Learning, Distillation |
| **Deep RL** | **Multi-Task Backbone Arch** |  | Single agent learning multiple tasks | Multi-task RL, IMPALA |
| **Bandits** | **Contextual Bandit Pipeline** |  | Decision making given context but no transitions | Personalization, Ad-tech |
| **Theory** | **Theoretical Regret Bounds** |  | Analytical performance guarantees | Online Learning, Bandits |
| **Value-based** | **Soft Q Boltzmann Probabilities** |  | Probabilistic action selection from Q-values: $\pi(a\|s) \propto \exp(Q/\tau)$; see the code sketch after the table |  |
| **Robotics** | **Autonomous Driving RL Pipeline** |  | End-to-end or modular driving stack | Wayve, Tesla, Comma.ai |
| **Policy** | **Policy action gradient comparison** |  | Comparison of gradient derivation types | PG Theorem vs DPG Theorem |
| **Inverse RL / IRL** | **IRL: Feature Expectation Matching** |  | Comparing expert vs learner feature visitation frequency | $\mu(\pi^*) - \mu(\pi)$ |
| **Imitation Learning** | **Apprenticeship Learning Loop** |  | Training to match expert performance via reward inference | Apprenticeship Learning |
| **Theory** | **Active Inference Loop** |  | Agents minimizing surprise (free energy) | Free Energy Principle, Friston |
| **Theory** | **Bellman Residual Landscape** |  | Training surface of the Bellman error | TD learning, fitted Q-iteration |
| **Model-Based RL** | **Plan-to-Explore Uncertainty Map** |  | Systematic exploration in learned world models | Plan-to-Explore, Sekar et al. |
| **Safety RL** | **Robust RL Uncertainty Set** |  | Optimizing for the worst-case environment transition | Robust MDPs, minimax RL |
| **Training** | **HPO Bayesian Opt Cycle** |  | Automating hyperparameter selection with GP | Hyperparameter Optimization |
| **Applied RL** | **Slate RL Recommendation** |  | Optimizing list/slate of items for users | Recommender Systems, Ie et al. |
| **Multi-Agent RL** | **Fictitious Play Interaction** |  | Belief-based learning in games | Game Theory, Brown (1951) |
| **Conceptual** | **Universal RL Framework Diagram** |  | High-level summary of RL components | All RL |
| **Offline RL** | **Offline Density Ratio Estimator** |  | Estimating $w(s,a)$ for off-policy data | Importance Sampling, Offline RL |
| **Continual RL** | **Continual Task Interference Heatmap** |  | Measuring negative transfer between tasks | Lifelong Learning, EWC |
| **Safety RL** | **Lyapunov Stability Safe Set** |  | Invariant sets for safe control | Lyapunov RL, Chow et al. |
| **Applied RL** | **Molecular RL (Atom Coordinates)** |  | RL for molecular design/protein folding | Chemistry RL, AlphaFold-style |
| **Architecture** | **MoE Multi-task Architecture** |  | Scaling models with mixture of experts | MoE-RL, Sparsity |
| **Direct Policy Search** | **CMA-ES Policy Search** |  | Evolutionary strategy for policy weights | ES for RL, Salimans |
| **Alignment** | **Elo Rating Preference Plot** |  | Measuring agent strength over time | AlphaZero, League training |
| **Explainable RL** | **Explainable RL (SHAP Attribution)** |  | Local attribution of features to agent actions | Interpretability, SHAP/LIME |
| **Meta-RL** | **PEARL Context Encoder** |  | Learning latent task representations | PEARL, Rakelly et al. |
| **Applied RL** | **Medical RL Therapy Pipeline** |  | Personalized medicine and dosing | Healthcare RL, ICU Sepsis |
| **Applied RL** | **Supply Chain RL Pipeline** |  | Optimizing stock levels and orders | Logistics, Inventory Management |
| **Robotics** | **Sim-to-Real SysID Loop** |  | Closing the reality gap via parameter estimation | System Identification, Robotics |
| **Architecture** | **Transformer World Model** |  | Sequence-to-sequence dynamics modeling | DreamerV3, Transframer |
| **Applied RL** | **Network Traffic RL** |  | Optimizing data packet routing in graphs | Networking, Traffic Engineering |
| **Training** | **RLHF: PPO with Reference Policy** |  | Ensuring RL fine-tuning doesn't drift too far | InstructGPT, Llama 2/3 |
| **Multi-Agent RL** | **PSRO Meta-Game Update** |  | Reaching Nash equilibrium in large games | PSRO, Lanctot et al. |
| **Multi-Agent RL** | **DIAL: Differentiable Comm** |  | End-to-end learning of communication protocols | DIAL, Foerster et al. |
| **Batch RL** | **Fitted Q-Iteration Loop** |  | Data-driven iteration with a supervised regressor | Ernst et al. (2005) |
| **Safety RL** | **CMDP Feasible Region** |  | Constrained optimization within a safety budget | Constrained MDPs, Altman |
| **Control** | **MPC vs RL Planning** |  | Comparison of control paradigms | Control Theory vs RL |
| **AutoML** | **Learning to Optimize (L2O)** |  | Using RL to learn an optimization update rule | L2O, Li & Malik |
| **Applied RL** | **Smart Grid RL Management** |  | Optimizing energy supply and demand | Energy RL, Smart Grids |
| **Applied RL** | **Quantum State Tomography RL** |  | RL for quantum state estimation | Quantum RL, Neural Tomography |
| **Applied RL** | **RL for Chip Placement** |  | Placing components on silicon grids | Google Chip Placement |
| **Applied RL** | **RL Compiler Optimization (MLGO)** |  | Inlining and sizing in compilers | MLGO, LLVM |
| **Applied RL** | **RL for Theorem Proving** |  | Automated reasoning and proof search | LeanRL, AlphaProof |
| **Modern RL** | **Diffusion-QL Offline RL** |  | Policy as reverse diffusion process $\pi(a\|s,k)$ with noise injection |  |
| **Principles** | **Fairness-reward Pareto Frontier** |  | Balancing equity and returns | Fair RL, Jabbari et al. |
| **Principles** | **Differentially Private RL** |  | Privacy-preserving training | DP-RL, Agarwal et al. |
| **Applied RL** | **Smart Agriculture RL** |  | Optimizing crop yield and resources | Precision Agriculture |
| **Applied RL** | **Climate Mitigation RL (Grid)** |  | Environmental control policies | ClimateRL, Carbon Control |
| **Applied RL** | **AI Education (Knowledge Tracing)** |  | Personalized learning paths | ITS, Bayesian Knowledge Tracing |
| **Modern RL** | **Decision SDE Flow** |  | RL in continuous stochastic systems | Neural SDEs, Control |
| **Control** | **Differentiable physics (Brax)** |  | Gradients through simulators | Brax, PhysX, MuJoCo |
| **Applied RL** | **Wireless Beamforming RL** |  | Optimizing antenna signal directions | 5G/6G Networking |
| **Applied RL** | **Quantum Error Correction RL** |  | Correcting noise in quantum circuits | Quantum Computing RL |
| **Multi-Agent RL** | **Mean Field RL Interaction** |  | Large population agent dynamics | MF-RL, Yang et al. |
| **HRL** | **Goal-GAN Curriculum** |  | Automatic goal generation | Goal-GAN, Florensa et al. |
| **Modern RL** | **JEPA: Predictive Architecture** |  | LeCun's world model framework | JEPA, I-JEPA |
| **Offline RL** | **CQL Value Penalty Landscape** |  | Conservatism in value functions | CQL, Kumar et al. |
| **Applied RL** | **Causal RL** |  | Causal Inverse RL Graph | DAG with $S, A, R$ and latent $U$ |
| **Quantum RL** | **VQE-RL Optimization** |  | Quantum circuit param tuning | VQE, Quantum RL |
| **Applied RL** | **De-novo Drug Discovery RL** |  | Generating optimized lead molecules | Drug Discovery, Molecule RL |
| **Applied RL** | **Traffic Signal Coordination RL** |  | Multi-intersection coordination | IntelliLight, PressLight |
| **Applied RL** | **Mars Rover Pathfinding RL** |  | Navigation on rough terrain | Space RL, Mars Rover |
| **Applied RL** | **Sports Player Movement RL** |  | Predicting/Optimizing player actions | Sports Analytics, Ghosting |
| **Applied RL** | **Cryptography Attack RL** |  | Searching for keys/vulnerabilities | Crypto-RL, Learning to Attack |
| **Applied RL** | **Humanitarian Resource RL** |  | Disaster response allocation | AI for Good, Resource RL |
| **Applied RL** | **Video Compression RL (RD)** |  | Optimizing bit-rate vs distortion | Learned Video Compression |
| **Applied RL** | **Kubernetes Auto-scaling RL** |  | Cloud resource management | Cloud RL, K8s Scaling |
| **Applied RL** | **Fluid Dynamics Flow Control RL** |  | Airfoil/Turbulence control | Aero-RL, Flow Control |
| **Applied RL** | **Structural Optimization RL** |  | Topology/Material design | Structural RL, Topology Opt |
| **Applied RL** | **Human Decision Modeling** |  | Prospect Theory in RL | Behavioral RL, Prospect Theory |
| **Applied RL** | **Semantic Parsing RL** |  | Language to Logic transformation | Semantic Parsing, Seq2Seq-RL |
| **Applied RL** | **Music Melody RL** |  | Reward-based melody generation | Music-RL, Magenta |
| **Applied RL** | **Plasma Fusion Control RL** |  | Magnetic control of Tokamaks | DeepMind Fusion, Tokamak RL |
| **Applied RL** | **Carbon Capture RL cycle** |  | Adsorption/Desorption optimization | Carbon Capture, Green RL |
| **Applied RL** | **Swarm Robotics RL** |  | Decentralized swarm coordination | Swarm-RL, Multi-Robot |
| **Applied RL** | **Legal Compliance RL Game** |  | Regulatory games | Legal-RL, RegTech |
| **Physics RL** | **Physics-Informed RL (PINN)** |  | Constraint-based RL loss | PINN-RL, SciML |
| **Modern RL** | **Neuro-Symbolic RL** |  | Combining logic and neural nets | Neuro-Symbolic, Logic RL |
| **Applied RL** | **DeFi Liquidity Pool RL** |  | Yield farming/Liquidity balancing | DeFi-RL, AMM Optimization |
| **Neuro RL** | **Dopamine Reward Prediction Error** |  | Biological RL signal curves | Neuroscience-RL, Wolfram Schultz |
| **Robotics** | **Proprioceptive Sensory-Motor RL** |  | Low-level joint control | Proprioceptive RL, Unitree |
| **Applied RL** | **AR Object Placement RL** |  | AR visual overlay optimization | AR-RL, Visual Overlay |
| **Reco RL** | **Sequential Bundle RL** |  | Recommendation item grouping | Bundle-RL, E-commerce |
| **Theoretical** | **Online Gradient Descent vs RL** |  | Gradient-based learning comparison | Online Learning, Regret |
| **Modern RL** | **Active Learning: Query RL** |  | Query-based sample selection | Active-RL, Query Opt |
| **Modern RL** | **Federated RL global Aggregator** |  | Privacy-preserving distributed RL | Federated-RL, FedAvg-RL |
| **Conceptual** | **Ultimate Universal RL Mastery Diagram** |  | Final summary of 230 items | Absolute Mastery Milestone |
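
## Worked sketches

The entries above are illustration-only. The short Python sketches below are not assets from this gallery; they are minimal, self-contained examples added for orientation, and every environment, hyperparameter, and helper name in them is a placeholder chosen for demonstration.

The first sketch pairs with the **Agent-Environment Interaction Loop**, **ε-Greedy Strategy**, and **Q-Learning Update** rows: tabular Q-learning on a small hand-coded chain MDP.

```python
import random

# A tiny 5-state chain MDP (hypothetical, for illustration only):
# states 0..4, actions 0 = left, 1 = right. Reaching state 4 ends the
# episode with reward +1; every other transition gives reward 0.
N_STATES, N_ACTIONS, TERMINAL = 5, 2, 4

def step(state, action):
    """Environment half of the agent-environment loop."""
    next_state = min(state + 1, TERMINAL) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == TERMINAL else 0.0
    return next_state, reward, next_state == TERMINAL

Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]   # tabular Q(s, a)
alpha, gamma, epsilon = 0.1, 0.99, 0.1              # step size, discount, exploration rate

def greedy(state):
    """Greedy action with random tie-breaking."""
    best = max(Q[state])
    return random.choice([a for a in range(N_ACTIONS) if Q[state][a] == best])

for episode in range(500):
    state = 0
    for t in range(200):                            # cap episode length
        # ε-greedy action selection (cf. the "ε-Greedy Strategy" row).
        action = random.randrange(N_ACTIONS) if random.random() < epsilon else greedy(state)
        next_state, reward, done = step(state, action)
        # Q-learning update (cf. the "Q-Learning Update" row):
        #   Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
        target = reward + (0.0 if done else gamma * max(Q[next_state]))
        Q[state][action] += alpha * (target - Q[state][action])
        state = next_state
        if done:
            break

print("Greedy policy (1 = move right):", [greedy(s) for s in range(N_STATES)])
```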
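
The second sketch pairs with the **Proximal Policy Optimization (PPO)** row: the clipped surrogate objective computed from probability ratios and advantage estimates. The ratios and advantages below are made-up numbers; a real implementation would take them from a policy network and an advantage estimator such as GAE.

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, clip_eps=0.2):
    """PPO clipped surrogate: mean of min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t),
    where r_t = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return np.mean(np.minimum(unclipped, clipped))

# Toy batch: ratios drifting away from 1 and their advantage estimates.
ratio = np.array([0.8, 1.0, 1.3, 1.6])
advantage = np.array([1.0, -0.5, 2.0, -1.0])
print("L_CLIP =", ppo_clip_objective(ratio, advantage))
```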
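
The last sketch pairs with the **Softmax / Boltzmann Exploration** and **Soft Q Boltzmann Probabilities** rows: turning a vector of Q-values into action probabilities π(a|s) ∝ exp(Q(s,a)/τ). The Q-values and temperatures are placeholders.

```python
import numpy as np

def boltzmann_policy(q_values, tau=1.0):
    """Action probabilities pi(a|s) proportional to exp(Q(s,a)/tau); lower tau is greedier."""
    logits = np.asarray(q_values, dtype=float) / tau
    logits -= logits.max()            # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

q = [1.0, 2.0, 0.5]
for tau in (2.0, 1.0, 0.1):
    print(f"tau={tau}: {boltzmann_policy(q, tau)}")
```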