algorembrant committed · verified
Commit b7a7046 · 1 Parent(s): a44fde6

Upload 442 files

This view is limited to 50 files because it contains too many changes. See the raw diff for the full set.

Files changed (50)
  1. .gitattributes +11 -0
  2. README.md +236 -0
  3. checkpoint/README.md +207 -0
  4. checkpoint/core.py +0 -0
  5. checkpoint/e.md +204 -0
  6. checkpoint/generate_readme.py +58 -0
  7. checkpoint/graphs/absolute_universal_rl_pillar_map.png +0 -0
  8. checkpoint/graphs/action_persistence_frame_skipping.png +0 -0
  9. checkpoint/graphs/action_selection_noise_ou_vs_gaussian.png +3 -0
  10. checkpoint/graphs/action_value_function_q_s_a.png +0 -0
  11. checkpoint/graphs/active_inference_loop.png +0 -0
  12. checkpoint/graphs/actor_critic_architecture.png +0 -0
  13. checkpoint/graphs/advantage_actor_critic_a2c_a3c.png +0 -0
  14. checkpoint/graphs/advantage_function_a_s_a.png +0 -0
  15. checkpoint/graphs/adversarial_rl_interaction.png +0 -0
  16. checkpoint/graphs/adversarial_state_noise_perception.png +0 -0
  17. checkpoint/graphs/agent_environment_interaction_loop.png +0 -0
  18. checkpoint/graphs/ai_education_knowledge_tracing.png +0 -0
  19. checkpoint/graphs/apprenticeship_learning_loop.png +0 -0
  20. checkpoint/graphs/attention_mechanisms_transformers_in_rl.png +0 -0
  21. checkpoint/graphs/automated_curriculum_learning.png +0 -0
  22. checkpoint/graphs/autonomous_driving_rl_pipeline.png +0 -0
  23. checkpoint/graphs/baseline_advantage_subtraction.png +0 -0
  24. checkpoint/graphs/batch_constrained_q_learning_bcq.png +0 -0
  25. checkpoint/graphs/behavioral_cloning_imitation.png +0 -0
  26. checkpoint/graphs/belief_state_in_pomdps.png +0 -0
  27. checkpoint/graphs/bellman_residual_landscape.png +0 -0
  28. checkpoint/graphs/bisimulation_metric.png +0 -0
  29. checkpoint/graphs/bootstrapping_general.png +0 -0
  30. checkpoint/graphs/centralized_training_decentralized_execution_ctde.png +0 -0
  31. checkpoint/graphs/climate_mitigation_rl_grid.png +0 -0
  32. checkpoint/graphs/cma_es_policy_search.png +0 -0
  33. checkpoint/graphs/cmdp_feasible_region.png +0 -0
  34. checkpoint/graphs/computation_graph_backpropagation_flow.png +0 -0
  35. checkpoint/graphs/conservative_q_learning_cql.png +0 -0
  36. checkpoint/graphs/contextual_bandit_pipeline.png +0 -0
  37. checkpoint/graphs/continual_task_interference_heatmap.png +0 -0
  38. checkpoint/graphs/continuous_state_action_space_visualization.png +0 -0
  39. checkpoint/graphs/control_barrier_functions_cbf.png +0 -0
  40. checkpoint/graphs/convergence_analysis_plots.png +0 -0
  41. checkpoint/graphs/cooperative_competitive_payoff_matrix.png +0 -0
  42. checkpoint/graphs/count_based_exploration_heatmap.png +0 -0
  43. checkpoint/graphs/cql_value_penalty_landscape.png +0 -0
  44. checkpoint/graphs/cybersecurity_attack_defense_rl.png +0 -0
  45. checkpoint/graphs/dagger_expert_loop.png +0 -0
  46. checkpoint/graphs/dec_pomdp_formal_model.png +0 -0
  47. checkpoint/graphs/decision_sde_flow.png +3 -0
  48. checkpoint/graphs/decision_transformer_token_sequence.png +0 -0
  49. checkpoint/graphs/deterministic_policy_gradient_ddpg_flow.png +0 -0
  50. checkpoint/graphs/dial_differentiable_comm.png +0 -0
.gitattributes CHANGED
@@ -34,3 +34,14 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
  graphs/reward_function_landscape.png filter=lfs diff=lfs merge=lfs -text
+ checkpoint/graphs/action_selection_noise_ou_vs_gaussian.png filter=lfs diff=lfs merge=lfs -text
+ checkpoint/graphs/decision_sde_flow.png filter=lfs diff=lfs merge=lfs -text
+ checkpoint/graphs/lagrangian_constraint_landscape.png filter=lfs diff=lfs merge=lfs -text
+ checkpoint/graphs/loss_landscape_visualization.png filter=lfs diff=lfs merge=lfs -text
+ checkpoint/graphs/reward_function_landscape.png filter=lfs diff=lfs merge=lfs -text
+ graphs_more/action_selection_noise_ou_vs_gaussian.png filter=lfs diff=lfs merge=lfs -text
+ graphs_more/decision_sde_flow.png filter=lfs diff=lfs merge=lfs -text
+ graphs_more/fluid_dynamics_flow_control_rl.png filter=lfs diff=lfs merge=lfs -text
+ graphs_more/lagrangian_constraint_landscape.png filter=lfs diff=lfs merge=lfs -text
+ graphs_more/loss_landscape_visualization.png filter=lfs diff=lfs merge=lfs -text
+ graphs_more/reward_function_landscape.png filter=lfs diff=lfs merge=lfs -text
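Each line added above follows the standard Git LFS pattern: a path pattern followed by `filter=lfs diff=lfs merge=lfs -text`, telling Git to store matching files through the LFS filter and treat them as binary. As a quick sanity check (a throwaway repo of my own, not part of this commit), plain `git check-attr` can confirm which attributes such a line assigns to one of the new files:

```shell
set -e
repo="$(mktemp -d)"
cd "$repo"
git init -q .
# Reproduce one of the attribute lines this commit adds:
printf '%s\n' 'checkpoint/graphs/*.png filter=lfs diff=lfs merge=lfs -text' > .gitattributes
# Ask Git which attributes apply to a matching file:
git check-attr filter diff merge text -- checkpoint/graphs/decision_sde_flow.png
```

This prints one `path: attribute: value` line per attribute, with `filter`, `diff`, and `merge` resolving to `lfs` and `text` reported as `unset` (the `-text` form).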
README.md ADDED
@@ -0,0 +1,236 @@
+ ---
+ title: Reinforcement Learning Graphical Representations
+ date: 2026-04-08
+ category: Reinforcement Learning
+ description: A comprehensive gallery of 130 standard RL components and their graphical presentations.
+ ---
+
+ # Reinforcement Learning Graphical Representations
+
+ This repository contains a full set of 130 visualizations representing foundational concepts, algorithms, and advanced topics in Reinforcement Learning.
+
+ | Category | Component | Illustration | Details | Context |
+ |----------|-----------|--------------|---------|---------|
+ | **MDP & Environment** | **Agent-Environment Interaction Loop** | ![Illustration](graphs/agent_environment_interaction_loop.png) | Core cycle: observation of state → selection of action → environment transition → receipt of reward + next state | All RL algorithms |
+ | **MDP & Environment** | **Markov Decision Process (MDP) Tuple** | ![Illustration](graphs/markov_decision_process_mdp_tuple.png) | (S, A, P, R, γ) with transition dynamics and reward function (P(s′\|s,a) and R(s,a,s′)) | |
+ | **MDP & Environment** | **State Transition Graph** | ![Illustration](graphs/state_transition_graph.png) | Full probabilistic transitions between discrete states | Gridworld, Taxi, Cliff Walking |
+ | **MDP & Environment** | **Trajectory / Episode Sequence** | ![Illustration](graphs/trajectory_episode_sequence.png) | Sequence of (s₀, a₀, r₁, s₁, …, s_T) | Monte Carlo, episodic tasks |
+ | **MDP & Environment** | **Continuous State/Action Space Visualization** | ![Illustration](graphs/continuous_state_action_space_visualization.png) | High-dimensional spaces (e.g., robot joints, pixel inputs) | Continuous-control tasks (MuJoCo, PyBullet) |
+ | **MDP & Environment** | **Reward Function / Landscape** | ![Illustration](graphs/reward_function_landscape.png) | Scalar reward as function of state/action | All algorithms; especially reward shaping |
+ | **MDP & Environment** | **Discount Factor (γ) Effect** | ![Illustration](graphs/discount_factor_effect.png) | How future rewards are weighted | All discounted MDPs |
+ | **Value & Policy** | **State-Value Function V(s)** | ![Illustration](graphs/state_value_function_v_s.png) | Expected return from state s under policy π | Value-based methods |
+ | **Value & Policy** | **Action-Value Function Q(s,a)** | ![Illustration](graphs/action_value_function_q_s_a.png) | Expected return from state-action pair | Q-learning family |
+ | **Value & Policy** | **Policy π(s) or π(a\|s)** | ![Illustration](graphs/policy_s_or_a.png) | Arrow overlays on grid (optimal policy), probability bar charts, or softmax heatmaps | |
+ | **Value & Policy** | **Advantage Function A(s,a)** | ![Illustration](graphs/advantage_function_a_s_a.png) | Q(s,a) – V(s) | A2C, PPO, SAC, TD3 |
+ | **Value & Policy** | **Optimal Value Function V* / Q*** | ![Illustration](graphs/optimal_value_function_v_q.png) | Solution to Bellman optimality | Value iteration, Q-learning |
+ | **Dynamic Programming** | **Policy Evaluation Backup** | ![Illustration](graphs/policy_evaluation_backup.png) | Iterative update of V using Bellman expectation | Policy iteration |
+ | **Dynamic Programming** | **Policy Improvement** | ![Illustration](graphs/policy_improvement.png) | Greedy policy update over Q | Policy iteration |
+ | **Dynamic Programming** | **Value Iteration Backup** | ![Illustration](graphs/value_iteration_backup.png) | Update using Bellman optimality | Value iteration |
+ | **Dynamic Programming** | **Policy Iteration Full Cycle** | ![Illustration](graphs/policy_iteration_full_cycle.png) | Evaluation → Improvement loop | Classic DP methods |
+ | **Monte Carlo** | **Monte Carlo Backup** | ![Illustration](graphs/monte_carlo_backup.png) | Update using full episode return G_t | First-visit / every-visit MC |
+ | **Monte Carlo** | **Monte Carlo Tree (MCTS)** | ![Illustration](graphs/monte_carlo_tree_mcts.png) | Search tree with selection, expansion, simulation, backprop | AlphaGo, AlphaZero |
+ | **Monte Carlo** | **Importance Sampling Ratio** | ![Illustration](graphs/importance_sampling_ratio.png) | Off-policy correction ρ = π(a\|s) / b(a\|s) | |
+ | **Temporal Difference** | **TD(0) Backup** | ![Illustration](graphs/td_0_backup.png) | Bootstrapped update using R + γV(s′) | TD learning |
+ | **Temporal Difference** | **Bootstrapping (general)** | ![Illustration](graphs/bootstrapping_general.png) | Using estimated future value instead of full return | All TD methods |
+ | **Temporal Difference** | **n-step TD Backup** | ![Illustration](graphs/n_step_td_backup.png) | Multi-step return G_t^{(n)} | n-step TD, TD(λ) |
+ | **Temporal Difference** | **TD(λ) & Eligibility Traces** | ![Illustration](graphs/td_eligibility_traces.png) | Decaying trace z_t for credit assignment | TD(λ), SARSA(λ), Q(λ) |
+ | **Temporal Difference** | **SARSA Update** | ![Illustration](graphs/sarsa_update.png) | On-policy TD control | SARSA |
+ | **Temporal Difference** | **Q-Learning Update** | ![Illustration](graphs/q_learning_update.png) | Off-policy TD control | Q-learning, Deep Q-Network |
+ | **Temporal Difference** | **Expected SARSA** | ![Illustration](graphs/expected_sarsa.png) | Expectation over next action under policy | Expected SARSA |
+ | **Temporal Difference** | **Double Q-Learning / Double DQN** | ![Illustration](graphs/double_q_learning_double_dqn.png) | Two separate Q estimators to reduce overestimation | Double DQN, TD3 |
+ | **Temporal Difference** | **Dueling DQN Architecture** | ![Illustration](graphs/dueling_dqn_architecture.png) | Separate streams for state value V(s) and advantage A(s,a) | Dueling DQN |
+ | **Temporal Difference** | **Prioritized Experience Replay** | ![Illustration](graphs/prioritized_experience_replay.png) | Importance sampling of transitions by TD error | Prioritized DQN, Rainbow |
+ | **Temporal Difference** | **Rainbow DQN Components** | ![Illustration](graphs/rainbow_dqn_components.png) | All extensions combined (Double, Dueling, PER, etc.) | Rainbow DQN |
+ | **Function Approximation** | **Linear Function Approximation** | ![Illustration](graphs/linear_function_approximation.png) | Feature vector φ(s) → wᵀφ(s) | Tabular → linear FA |
+ | **Function Approximation** | **Neural Network Layers (MLP, CNN, RNN, Transformer)** | ![Illustration](graphs/neural_network_layers_mlp_cnn_rnn_transformer.png) | Full deep network for value/policy | DQN, A3C, PPO, Decision Transformer |
+ | **Function Approximation** | **Computation Graph / Backpropagation Flow** | ![Illustration](graphs/computation_graph_backpropagation_flow.png) | Gradient flow through network | All deep RL |
+ | **Function Approximation** | **Target Network** | ![Illustration](graphs/target_network.png) | Frozen copy of Q-network for stability | DQN, DDQN, SAC, TD3 |
+ | **Policy Gradients** | **Policy Gradient Theorem** | ![Illustration](graphs/policy_gradient_theorem.png) | ∇_θ J(θ) = E[∇_θ log π(a\|s) · Q(s,a)] | Flow diagram from reward → log-prob → gradient |
+ | **Policy Gradients** | **REINFORCE Update** | ![Illustration](graphs/reinforce_update.png) | Monte-Carlo policy gradient | REINFORCE |
+ | **Policy Gradients** | **Baseline / Advantage Subtraction** | ![Illustration](graphs/baseline_advantage_subtraction.png) | Subtract b(s) to reduce variance | All modern PG |
+ | **Policy Gradients** | **Trust Region (TRPO)** | ![Illustration](graphs/trust_region_trpo.png) | KL-divergence constraint on policy update | TRPO |
+ | **Policy Gradients** | **Proximal Policy Optimization (PPO)** | ![Illustration](graphs/proximal_policy_optimization_ppo.png) | Clipped surrogate objective | PPO, PPO-Clip |
+ | **Actor-Critic** | **Actor-Critic Architecture** | ![Illustration](graphs/actor_critic_architecture.png) | Separate or shared actor (policy) + critic (value) networks | A2C, A3C, SAC, TD3 |
+ | **Actor-Critic** | **Advantage Actor-Critic (A2C/A3C)** | ![Illustration](graphs/advantage_actor_critic_a2c_a3c.png) | Synchronous/asynchronous multi-worker | A2C/A3C |
+ | **Actor-Critic** | **Soft Actor-Critic (SAC)** | ![Illustration](graphs/soft_actor_critic_sac.png) | Entropy-regularized policy + twin critics | SAC |
+ | **Actor-Critic** | **Twin Delayed DDPG (TD3)** | ![Illustration](graphs/twin_delayed_ddpg_td3.png) | Twin critics + delayed policy + target smoothing | TD3 |
+ | **Exploration** | **ε-Greedy Strategy** | ![Illustration](graphs/greedy_strategy.png) | Probability ε of random action | DQN family |
+ | **Exploration** | **Softmax / Boltzmann Exploration** | ![Illustration](graphs/softmax_boltzmann_exploration.png) | Temperature τ in softmax | Softmax policies |
+ | **Exploration** | **Upper Confidence Bound (UCB)** | ![Illustration](graphs/upper_confidence_bound_ucb.png) | Optimism in face of uncertainty | UCB1, bandits |
+ | **Exploration** | **Intrinsic Motivation / Curiosity** | ![Illustration](graphs/intrinsic_motivation_curiosity.png) | Prediction error as intrinsic reward | ICM, RND, Curiosity-driven RL |
+ | **Exploration** | **Entropy Regularization** | ![Illustration](graphs/entropy_regularization.png) | Bonus term αH(π) | SAC, maximum-entropy RL |
+ | **Hierarchical RL** | **Options Framework** | ![Illustration](graphs/options_framework.png) | High-level policy over options (temporally extended actions) | Option-Critic |
+ | **Hierarchical RL** | **Feudal Networks / Hierarchical Actor-Critic** | ![Illustration](graphs/feudal_networks_hierarchical_actor_critic.png) | Manager-worker hierarchy | Feudal RL |
+ | **Hierarchical RL** | **Skill Discovery** | ![Illustration](graphs/skill_discovery.png) | Unsupervised emergence of reusable skills | DIAYN, VALOR |
+ | **Model-Based RL** | **Learned Dynamics Model** | ![Illustration](graphs/learned_dynamics_model.png) | P̂(s′\|s,a) | Separate model network diagram (often RNN or transformer) |
+ | **Model-Based RL** | **Model-Based Planning** | ![Illustration](graphs/model_based_planning.png) | Rollouts inside learned model | MuZero, DreamerV3 |
+ | **Model-Based RL** | **Imagination-Augmented Agents (I2A)** | ![Illustration](graphs/imagination_augmented_agents_i2a.png) | Imagination module + policy | I2A |
+ | **Offline RL** | **Offline Dataset** | ![Illustration](graphs/offline_dataset.png) | Fixed batch of trajectories | BC, CQL, IQL |
+ | **Offline RL** | **Conservative Q-Learning (CQL)** | ![Illustration](graphs/conservative_q_learning_cql.png) | Penalty on out-of-distribution actions | CQL |
+ | **Multi-Agent RL** | **Multi-Agent Interaction Graph** | ![Illustration](graphs/multi_agent_interaction_graph.png) | Agents communicating or competing | MARL, MADDPG |
+ | **Multi-Agent RL** | **Centralized Training Decentralized Execution (CTDE)** | ![Illustration](graphs/centralized_training_decentralized_execution_ctde.png) | Shared critic during training | QMIX, VDN, MADDPG |
+ | **Multi-Agent RL** | **Cooperative / Competitive Payoff Matrix** | ![Illustration](graphs/cooperative_competitive_payoff_matrix.png) | Joint reward for multiple agents | Prisoner's Dilemma, multi-agent gridworlds |
+ | **Inverse RL / IRL** | **Reward Inference** | ![Illustration](graphs/reward_inference.png) | Infer reward from expert demonstrations | IRL, GAIL |
+ | **Inverse RL / IRL** | **Generative Adversarial Imitation Learning (GAIL)** | ![Illustration](graphs/generative_adversarial_imitation_learning_gail.png) | Discriminator vs. policy generator | GAIL, AIRL |
+ | **Meta-RL** | **Meta-RL Architecture** | ![Illustration](graphs/meta_rl_architecture.png) | Outer loop (meta-policy) + inner loop (task adaptation) | MAML for RL, RL² |
+ | **Meta-RL** | **Task Distribution Visualization** | ![Illustration](graphs/task_distribution_visualization.png) | Multiple MDPs sampled from meta-distribution | Meta-RL benchmarks |
+ | **Advanced / Misc** | **Experience Replay Buffer** | ![Illustration](graphs/experience_replay_buffer.png) | Stored (s,a,r,s′,done) tuples | DQN and all off-policy deep RL |
+ | **Advanced / Misc** | **State Visitation / Occupancy Measure** | ![Illustration](graphs/state_visitation_occupancy_measure.png) | Frequency of visiting each state | All algorithms (analysis) |
+ | **Advanced / Misc** | **Learning Curve** | ![Illustration](graphs/learning_curve.png) | Average episodic return vs. episodes / steps | Standard performance reporting |
+ | **Advanced / Misc** | **Regret / Cumulative Regret** | ![Illustration](graphs/regret_cumulative_regret.png) | Sub-optimality accumulated | Bandits and online RL |
+ | **Advanced / Misc** | **Attention Mechanisms (Transformers in RL)** | ![Illustration](graphs/attention_mechanisms_transformers_in_rl.png) | Attention weights | Decision Transformer, Trajectory Transformer |
+ | **Advanced / Misc** | **Diffusion Policy** | ![Illustration](graphs/diffusion_policy.png) | Denoising diffusion process for action generation | Diffusion-RL policies |
+ | **Advanced / Misc** | **Graph Neural Networks for RL** | ![Illustration](graphs/graph_neural_networks_for_rl.png) | Node/edge message passing | Graph RL, relational RL |
+ | **Advanced / Misc** | **World Model / Latent Space** | ![Illustration](graphs/world_model_latent_space.png) | Encoder-decoder dynamics in latent space | Dreamer, PlaNet |
+ | **Advanced / Misc** | **Convergence Analysis Plots** | ![Illustration](graphs/convergence_analysis_plots.png) | Error / value change over iterations | DP, TD, value iteration |
+ | **Advanced / Misc** | **RL Algorithm Taxonomy** | ![Illustration](graphs/rl_algorithm_taxonomy.png) | Comprehensive classification of algorithms | All RL |
+ | **Advanced / Misc** | **Probabilistic Graphical Model (RL as Inference)** | ![Illustration](graphs/probabilistic_graphical_model_rl_as_inference.png) | Formalizing RL as probabilistic inference | Control as Inference, MaxEnt RL |
+ | **Value & Policy** | **Distributional RL (C51 / Categorical)** | ![Illustration](graphs/distributional_rl_c51_categorical.png) | Representing return as a probability distribution | C51, QR-DQN, IQN |
+ | **Exploration** | **Hindsight Experience Replay (HER)** | ![Illustration](graphs/hindsight_experience_replay_her.png) | Learning from failures by relabeling goals | Sparse reward robotics, HER |
+ | **Model-Based RL** | **Dyna-Q Architecture** | ![Illustration](graphs/dyna_q_architecture.png) | Integration of real experience and model-based planning | Dyna-Q, Dyna-2 |
+ | **Function Approximation** | **Noisy Networks (Parameter Noise)** | ![Illustration](graphs/noisy_networks_parameter_noise.png) | Stochastic weights for exploration | Noisy DQN, Rainbow |
+ | **Exploration** | **Intrinsic Curiosity Module (ICM)** | ![Illustration](graphs/intrinsic_curiosity_module_icm.png) | Reward based on prediction error | Curiosity-driven exploration, ICM |
+ | **Temporal Difference** | **V-trace (IMPALA)** | ![Illustration](graphs/v_trace_impala.png) | Asynchronous off-policy importance sampling | IMPALA, V-trace |
+ | **Multi-Agent RL** | **QMIX Mixing Network** | ![Illustration](graphs/qmix_mixing_network.png) | Monotonic value function factorization | QMIX, VDN |
+ | **Advanced / Misc** | **Saliency Maps / Attention on State** | ![Illustration](graphs/saliency_maps_attention_on_state.png) | Visualizing what the agent "sees" or prioritizes | Interpretability, Atari RL |
+ | **Exploration** | **Action Selection Noise (OU vs Gaussian)** | ![Illustration](graphs/action_selection_noise_ou_vs_gaussian.png) | Temporal correlation in exploration noise | DDPG, TD3 |
+ | **Advanced / Misc** | **t-SNE / UMAP State Embeddings** | ![Illustration](graphs/t_sne_umap_state_embeddings.png) | Dimension reduction of high-dim neural states | Interpretability, SRL |
+ | **Advanced / Misc** | **Loss Landscape Visualization** | ![Illustration](graphs/loss_landscape_visualization.png) | Optimization surface geometry | Training stability analysis |
+ | **Advanced / Misc** | **Success Rate vs Steps** | ![Illustration](graphs/success_rate_vs_steps.png) | Percentage of successful episodes | Goal-conditioned RL, Robotics |
+ | **Advanced / Misc** | **Hyperparameter Sensitivity Heatmap** | ![Illustration](graphs/hyperparameter_sensitivity_heatmap.png) | Performance across parameter grids | Hyperparameter tuning |
+ | **Dynamics** | **Action Persistence (Frame Skipping)** | ![Illustration](graphs/action_persistence_frame_skipping.png) | Temporal abstraction by repeating actions | Atari RL, Robotics |
+ | **Model-Based RL** | **MuZero Dynamics Search Tree** | ![Illustration](graphs/muzero_dynamics_search_tree.png) | Planning with learned transition and value functions | MuZero, Gumbel MuZero |
+ | **Deep RL** | **Policy Distillation** | ![Illustration](graphs/policy_distillation.png) | Compressing knowledge from teacher to student | Kickstarting, multitask learning |
+ | **Transformers** | **Decision Transformer Token Sequence** | ![Illustration](graphs/decision_transformer_token_sequence.png) | Casting RL as a sequence-modeling task | Decision Transformer, TT |
+ | **Advanced / Misc** | **Performance Profiles (rliable)** | ![Illustration](graphs/performance_profiles_rliable.png) | Robust aggregate performance metrics | Reliable RL evaluation |
+ | **Safety RL** | **Safety Shielding / Barrier Functions** | ![Illustration](graphs/safety_shielding_barrier_functions.png) | Hard constraints on the action space | Constrained MDPs, Safe RL |
+ | **Training** | **Automated Curriculum Learning** | ![Illustration](graphs/automated_curriculum_learning.png) | Progressively increasing task difficulty | Curriculum RL, ALP-GMM |
+ | **Sim-to-Real** | **Domain Randomization** | ![Illustration](graphs/domain_randomization.png) | Generalizing across environment variations | Robotics, Sim-to-Real |
+ | **Alignment** | **RL with Human Feedback (RLHF)** | ![Illustration](graphs/rl_with_human_feedback_rlhf.png) | Aligning agents with human preferences | ChatGPT, InstructGPT |
+ | **Neuro-inspired RL** | **Successor Representation (SR)** | ![Illustration](graphs/successor_representation_sr.png) | Predictive state representations | SR-Dyna, Neuro-RL |
+ | **Inverse RL / IRL** | **Maximum Entropy IRL** | ![Illustration](graphs/maximum_entropy_irl.png) | Probability distribution over trajectories | MaxEnt IRL, Ziebart |
+ | **Theory** | **Information Bottleneck** | ![Illustration](graphs/information_bottleneck.png) | Mutual information $I(S;Z)$ and $I(Z;A)$ balance | VIB-RL, Information Theory |
+ | **Evolutionary RL** | **Evolutionary Strategies Population** | ![Illustration](graphs/evolutionary_strategies_population.png) | Population-based parameter search | OpenAI-ES, Salimans |
+ | **Safety RL** | **Control Barrier Functions (CBF)** | ![Illustration](graphs/control_barrier_functions_cbf.png) | Set-theoretic safety guarantees | CBF-RL, Control Theory |
+ | **Exploration** | **Count-based Exploration Heatmap** | ![Illustration](graphs/count_based_exploration_heatmap.png) | Visitation frequency and intrinsic bonus | MBIE-EB, RND |
+ | **Exploration** | **Thompson Sampling Posteriors** | ![Illustration](graphs/thompson_sampling_posteriors.png) | Direct uncertainty-based action selection | Bandits, Bayesian RL |
+ | **Multi-Agent RL** | **Adversarial RL Interaction** | ![Illustration](graphs/adversarial_rl_interaction.png) | Competition between protagonist and antagonist | Robust RL, RARL |
+ | **Hierarchical RL** | **Hierarchical Subgoal Trajectory** | ![Illustration](graphs/hierarchical_subgoal_trajectory.png) | Decomposing long-horizon tasks | Subgoal RL, HIRO |
+ | **Offline RL** | **Offline Action Distribution Shift** | ![Illustration](graphs/offline_action_distribution_shift.png) | Mismatch between dataset and current policy | CQL, IQL, D4RL |
+ | **Exploration** | **Random Network Distillation (RND)** | ![Illustration](graphs/random_network_distillation_rnd.png) | Prediction error as intrinsic reward | RND, OpenAI |
+ | **Offline RL** | **Batch-Constrained Q-learning (BCQ)** | ![Illustration](graphs/batch_constrained_q_learning_bcq.png) | Constraining actions to behavior dataset | BCQ, Fujimoto |
+ | **Training** | **Population-Based Training (PBT)** | ![Illustration](graphs/population_based_training_pbt.png) | Evolutionary hyperparameter optimization | PBT, DeepMind |
+ | **Deep RL** | **Recurrent State Flow (DRQN/R2D2)** | ![Illustration](graphs/recurrent_state_flow_drqn_r2d2.png) | Temporal dependency in state-action value | DRQN, R2D2 |
+ | **Theory** | **Belief State in POMDPs** | ![Illustration](graphs/belief_state_in_pomdps.png) | Probability distribution over hidden states | POMDPs, Belief Space |
+ | **Multi-Objective RL** | **Multi-Objective Pareto Front** | ![Illustration](graphs/multi_objective_pareto_front.png) | Balancing conflicting reward signals | MORL, Pareto Optimal |
+ | **Theory** | **Differential Value (Average Reward RL)** | ![Illustration](graphs/differential_value_average_reward_rl.png) | Values relative to average gain | Average Reward RL, Mahadevan |
+ | **Infrastructure** | **Distributed RL Cluster (Ray/RLlib)** | ![Illustration](graphs/distributed_rl_cluster_ray_rllib.png) | Parallelizing experience collection | Ray, RLlib, Ape-X |
+ | **Evolutionary RL** | **Neuroevolution Topology Evolution** | ![Illustration](graphs/neuroevolution_topology_evolution.png) | Evolving neural network architectures | NEAT, HyperNEAT |
+ | **Continual RL** | **Elastic Weight Consolidation (EWC)** | ![Illustration](graphs/elastic_weight_consolidation_ewc.png) | Preventing catastrophic forgetting | EWC, Kirkpatrick |
+ | **Theory** | **Successor Features (SF)** | ![Illustration](graphs/successor_features_sf.png) | Generalizing predictive representations | SF-Dyna, Barreto |
+ | **Safety** | **Adversarial State Noise (Perception)** | ![Illustration](graphs/adversarial_state_noise_perception.png) | Attacks on agent observation space | Adversarial RL, Huang |
+ | **Imitation Learning** | **Behavioral Cloning (Imitation)** | ![Illustration](graphs/behavioral_cloning_imitation.png) | Direct supervised learning from experts | BC, DAgger |
128
+ | **Relational RL** | **Relational Graph State Representation** | ![Illustration](graphs/relational_graph_state_representation.png) | Modeling objects and their relations | Relational MDPs, BoxWorld |
129
+ | **Quantum RL** | **Quantum RL Circuit (PQC)** | ![Illustration](graphs/quantum_rl_circuit_pqc.png) | Gate-based quantum policy networks | Quantum RL, PQC |
130
+ | **Symbolic RL** | **Symbolic Policy Tree** | ![Illustration](graphs/symbolic_policy_tree.png) | Policies as mathematical expressions | Symbolic RL, GP |
131
+ | **Control** | **Differentiable Physics Gradient Flow** | ![Illustration](graphs/differentiable_physics_gradient_flow.png) | Gradient-based planning through simulators | Brax, Isaac Gym |
132
+ | **Multi-Agent RL** | **MARL Communication Channel** | ![Illustration](graphs/marl_communication_channel.png) | Information exchange between agents | CommNet, DIAL |
133
+ | **Safety** | **Lagrangian Constraint Landscape** | ![Illustration](graphs/lagrangian_constraint_landscape.png) | Constrained optimization boundaries | Constrained RL, CPO |
134
+ | **Hierarchical RL** | **MAXQ Task Hierarchy** | ![Illustration](graphs/maxq_task_hierarchy.png) | Recursive task decomposition | MAXQ, Dietterich |
135
+ | **Agentic AI** | **ReAct Agentic Cycle** | ![Illustration](graphs/react_agentic_cycle.png) | Reasoning-Action loops for LLMs | ReAct, Agentic LLM |
136
+ | **Bio-inspired RL** | **Synaptic Plasticity RL** | ![Illustration](graphs/synaptic_plasticity_rl.png) | Hebbian-style synaptic weight updates | Hebbian RL, STDP |
137
+ | **Control** | **Guided Policy Search (GPS)** | ![Illustration](graphs/guided_policy_search_gps.png) | Distilling trajectories into a policy | GPS, Levine |
138
+ | **Robotics** | **Sim-to-Real Jitter & Latency** | ![Illustration](graphs/sim_to_real_jitter_latency.png) | Temporal robustness in transfer | Sim-to-Real, Robustness |
139
+ | **Policy Gradients** | **Deterministic Policy Gradient (DDPG) Flow** | ![Illustration](graphs/deterministic_policy_gradient_ddpg_flow.png) | Gradient flow for deterministic policies | DDPG |
140
+ | **Model-Based RL** | **Dreamer Latent Imagination** | ![Illustration](graphs/dreamer_latent_imagination.png) | Learning and planning in latent space | Dreamer (V1-V3) |
141
+ | **Deep RL** | **UNREAL Auxiliary Tasks** | ![Illustration](graphs/unreal_auxiliary_tasks.png) | Learning from non-reward signals | UNREAL, A3C extension |
142
+ | **Offline RL** | **Implicit Q-Learning (IQL) Expectile** | ![Illustration](graphs/implicit_q_learning_iql_expectile.png) | In-sample learning via expectile regression | IQL |
143
+ | **Model-Based RL** | **Prioritized Sweeping** | ![Illustration](graphs/prioritized_sweeping.png) | Planning prioritized by TD error | Sutton & Barto classic MBRL |
144
+ | **Imitation Learning** | **DAgger Expert Loop** | ![Illustration](graphs/dagger_expert_loop.png) | Training on expert labels in agent-visited states | DAgger |
145
+ | **Representation** | **Self-Predictive Representations (SPR)** | ![Illustration](graphs/self_predictive_representations_spr.png) | Consistency between predicted and target latents | SPR, sample-efficient RL |
146
+ | **Multi-Agent RL** | **Joint Action Space** | ![Illustration](graphs/joint_action_space.png) | Cartesian product of individual actions | MARL theory, Game Theory |
147
+ | **Multi-Agent RL** | **Dec-POMDP Formal Model** | ![Illustration](graphs/dec_pomdp_formal_model.png) | Decentralized partially observable MDP | Multi-agent coordination |
148
+ | **Theory** | **Bisimulation Metric** | ![Illustration](graphs/bisimulation_metric.png) | State equivalence based on transitions/rewards | State abstraction, bisimulation theory |
149
+ | **Theory** | **Potential-Based Reward Shaping** | ![Illustration](graphs/potential_based_reward_shaping.png) | Reward transformation preserving optimal policy | Sutton & Barto, Ng et al. |
150
+ | **Training** | **Transfer RL: Source to Target** | ![Illustration](graphs/transfer_rl_source_to_target.png) | Reusing knowledge across different MDPs | Transfer Learning, Distillation |
151
+ | **Deep RL** | **Multi-Task Backbone Arch** | ![Illustration](graphs/multi_task_backbone_arch.png) | Single agent learning multiple tasks | Multi-task RL, IMPALA |
152
+ | **Bandits** | **Contextual Bandit Pipeline** | ![Illustration](graphs/contextual_bandit_pipeline.png) | Decision making given context but no transitions | Personalization, Ad-tech |
153
+ | **Theory** | **Theoretical Regret Bounds** | ![Illustration](graphs/theoretical_regret_bounds.png) | Analytical performance guarantees | Online Learning, Bandits |
154
+ | **Value-based** | **Soft Q Boltzmann Probabilities** | ![Illustration](graphs/soft_q_boltzmann_probabilities.png) | Probabilistic action selection from Q-values | $\pi(a\|s) \propto \exp(Q/\tau)$ |
155
+ | **Robotics** | **Autonomous Driving RL Pipeline** | ![Illustration](graphs/autonomous_driving_rl_pipeline.png) | End-to-end or modular driving stack | Wayve, Tesla, Comma.ai |
156
+ | **Policy** | **Policy action gradient comparison** | ![Illustration](graphs/policy_action_gradient_comparison.png) | Comparison of gradient derivation types | PG Theorem vs DPG Theorem |
157
+ | **Inverse RL / IRL** | **IRL: Feature Expectation Matching** | ![Illustration](graphs/irl_feature_expectation_matching.png) | Comparing expert vs learner feature visitation frequency | $\mu(\pi^*) - \mu(\pi)$ |
158
+ | **Imitation Learning** | **Apprenticeship Learning Loop** | ![Illustration](graphs/apprenticeship_learning_loop.png) | Training to match expert performance via reward inference | Apprenticeship Learning |
159
+ | **Theory** | **Active Inference Loop** | ![Illustration](graphs/active_inference_loop.png) | Agents minimizing surprise (free energy) | Free Energy Principle, Friston |
160
+ | **Theory** | **Bellman Residual Landscape** | ![Illustration](graphs/bellman_residual_landscape.png) | Training surface of the Bellman error | TD learning, fitted Q-iteration |
161
+ | **Model-Based RL** | **Plan-to-Explore Uncertainty Map** | ![Illustration](graphs/plan_to_explore_uncertainty_map.png) | Systematic exploration in learned world models | Plan-to-Explore, Sekar et al. |
162
+ | **Safety RL** | **Robust RL Uncertainty Set** | ![Illustration](graphs/robust_rl_uncertainty_set.png) | Optimizing for the worst-case environment transition | Robust MDPs, minimax RL |
163
+ | **Training** | **HPO Bayesian Opt Cycle** | ![Illustration](graphs/hpo_bayesian_opt_cycle.png) | Automating hyperparameter selection with GP | Hyperparameter Optimization |
164
+ | **Applied RL** | **Slate RL Recommendation** | ![Illustration](graphs/slate_rl_recommendation.png) | Optimizing list/slate of items for users | Recommender Systems, Ie et al. |
165
+ | **Multi-Agent RL** | **Fictitious Play Interaction** | ![Illustration](graphs/fictitious_play_interaction.png) | Belief-based learning in games | Game Theory, Brown (1951) |
166
+ | **Conceptual** | **Universal RL Framework Diagram** | ![Illustration](graphs/universal_rl_framework_diagram.png) | High-level summary of RL components | All RL |
167
+ | **Offline RL** | **Offline Density Ratio Estimator** | ![Illustration](graphs/offline_density_ratio_estimator.png) | Estimating $w(s,a)$ for off-policy data | Importance Sampling, Offline RL |
168
+ | **Continual RL** | **Continual Task Interference Heatmap** | ![Illustration](graphs/continual_task_interference_heatmap.png) | Measuring negative transfer between tasks | Lifelong Learning, EWC |
169
+ | **Safety RL** | **Lyapunov Stability Safe Set** | ![Illustration](graphs/lyapunov_stability_safe_set.png) | Invariant sets for safe control | Lyapunov RL, Chow et al. |
170
+ | **Applied RL** | **Molecular RL (Atom Coordinates)** | ![Illustration](graphs/molecular_rl_atom_coordinates.png) | RL for molecular design/protein folding | Chemistry RL, AlphaFold-style |
171
+ | **Architecture** | **MoE Multi-task Architecture** | ![Illustration](graphs/moe_multi_task_architecture.png) | Scaling models with mixture of experts | MoE-RL, Sparsity |
172
+ | **Direct Policy Search** | **CMA-ES Policy Search** | ![Illustration](graphs/cma_es_policy_search.png) | Evolutionary strategy for policy weights | ES for RL, Salimans |
173
+ | **Alignment** | **Elo Rating Preference Plot** | ![Illustration](graphs/elo_rating_preference_plot.png) | Measuring agent strength over time | AlphaZero, League training |
174
+ | **Explainable RL** | **Explainable RL (SHAP Attribution)** | ![Illustration](graphs/explainable_rl_shap_attribution.png) | Local attribution of features to agent actions | Interpretability, SHAP/LIME |
175
+ | **Meta-RL** | **PEARL Context Encoder** | ![Illustration](graphs/pearl_context_encoder.png) | Learning latent task representations | PEARL, Rakelly et al. |
176
+ | **Applied RL** | **Medical RL Therapy Pipeline** | ![Illustration](graphs/medical_rl_therapy_pipeline.png) | Personalized medicine and dosing | Healthcare RL, ICU Sepsis |
177
+ | **Applied RL** | **Supply Chain RL Pipeline** | ![Illustration](graphs/supply_chain_rl_pipeline.png) | Optimizing stock levels and orders | Logistics, Inventory Management |
178
+ | **Robotics** | **Sim-to-Real SysID Loop** | ![Illustration](graphs/sim_to_real_sysid_loop.png) | Closing the reality gap via parameter estimation | System Identification, Robotics |
179
+ | **Architecture** | **Transformer World Model** | ![Illustration](graphs/transformer_world_model.png) | Sequence-to-sequence dynamics modeling | DreamerV3, Transframer |
180
+ | **Applied RL** | **Network Traffic RL** | ![Illustration](graphs/network_traffic_rl.png) | Optimizing data packet routing in graphs | Networking, Traffic Engineering |
181
+ | **Training** | **RLHF: PPO with Reference Policy** | ![Illustration](graphs/rlhf_ppo_with_reference_policy.png) | Ensuring RL fine-tuning doesn't drift too far | InstructGPT, Llama 2/3 |
182
+ | **Multi-Agent RL** | **PSRO Meta-Game Update** | ![Illustration](graphs/psro_meta_game_update.png) | Reaching Nash equilibrium in large games | PSRO, Lanctot et al. |
183
+ | **Multi-Agent RL** | **DIAL: Differentiable Comm** | ![Illustration](graphs/dial_differentiable_comm.png) | End-to-end learning of communication protocols | DIAL, Foerster et al. |
184
+ | **Batch RL** | **Fitted Q-Iteration Loop** | ![Illustration](graphs/fitted_q_iteration_loop.png) | Data-driven iteration with a supervised regressor | Ernst et al. (2005) |
185
+ | **Safety RL** | **CMDP Feasible Region** | ![Illustration](graphs/cmdp_feasible_region.png) | Constrained optimization within a safety budget | Constrained MDPs, Altman |
186
+ | **Control** | **MPC vs RL Planning** | ![Illustration](graphs/mpc_vs_rl_planning.png) | Comparison of control paradigms | Control Theory vs RL |
187
+ | **AutoML** | **Learning to Optimize (L2O)** | ![Illustration](graphs/learning_to_optimize_l2o.png) | Using RL to learn an optimization update rule | L2O, Li & Malik |
188
+ | **Applied RL** | **Smart Grid RL Management** | ![Illustration](graphs/smart_grid_rl_management.png) | Optimizing energy supply and demand | Energy RL, Smart Grids |
189
+ | **Applied RL** | **Quantum State Tomography RL** | ![Illustration](graphs/quantum_state_tomography_rl.png) | RL for quantum state estimation | Quantum RL, Neural Tomography |
190
+ | **Applied RL** | **RL for Chip Placement** | ![Illustration](graphs/rl_for_chip_placement.png) | Placing components on silicon grids | Google Chip Placement |
191
+ | **Applied RL** | **RL Compiler Optimization (MLGO)** | ![Illustration](graphs/rl_compiler_optimization_mlgo.png) | Inlining and sizing in compilers | MLGO, LLVM |
192
+ | **Applied RL** | **RL for Theorem Proving** | ![Illustration](graphs/rl_for_theorem_proving.png) | Automated reasoning and proof search | LeanRL, AlphaProof |
193
+ | **Modern RL** | **Diffusion-QL Offline RL** | ![Illustration](graphs/diffusion_ql_offline_rl.png) | Policy as reverse diffusion process | $\pi(a\|s,k)$ with noise injection |
194
+ | **Principles** | **Fairness-reward Pareto Frontier** | ![Illustration](graphs/fairness_reward_pareto_frontier.png) | Balancing equity and returns | Fair RL, Jabbari et al. |
195
+ | **Principles** | **Differentially Private RL** | ![Illustration](graphs/differentially_private_rl.png) | Privacy-preserving training | DP-RL, Agarwal et al. |
196
+ | **Applied RL** | **Smart Agriculture RL** | ![Illustration](graphs/smart_agriculture_rl.png) | Optimizing crop yield and resources | Precision Agriculture |
197
+ | **Applied RL** | **Climate Mitigation RL (Grid)** | ![Illustration](graphs/climate_mitigation_rl_grid.png) | Environmental control policies | ClimateRL, Carbon Control |
198
+ | **Applied RL** | **AI Education (Knowledge Tracing)** | ![Illustration](graphs/ai_education_knowledge_tracing.png) | Personalized learning paths | ITS, Bayesian Knowledge Tracing |
199
+ | **Modern RL** | **Decision SDE Flow** | ![Illustration](graphs/decision_sde_flow.png) | RL in continuous stochastic systems | Neural SDEs, Control |
200
+ | **Control** | **Differentiable physics (Brax)** | ![Illustration](graphs/differentiable_physics_brax.png) | Gradients through simulators | Brax, PhysX, MuJoCo |
201
+ | **Applied RL** | **Wireless Beamforming RL** | ![Illustration](graphs/wireless_beamforming_rl.png) | Optimizing antenna signal directions | 5G/6G Networking |
202
+ | **Applied RL** | **Quantum Error Correction RL** | ![Illustration](graphs/quantum_error_correction_rl.png) | Correcting noise in quantum circuits | Quantum Computing RL |
203
+ | **Multi-Agent RL** | **Mean Field RL Interaction** | ![Illustration](graphs/mean_field_rl_interaction.png) | Large population agent dynamics | MF-RL, Yang et al. |
204
+ | **HRL** | **Goal-GAN Curriculum** | ![Illustration](graphs/goal_gan_curriculum.png) | Automatic goal generation | Goal-GAN, Florensa et al. |
205
+ | **Modern RL** | **JEPA: Predictive Architecture** | ![Illustration](graphs/jepa_predictive_architecture.png) | LeCun's world model framework | JEPA, I-JEPA |
206
+ | **Offline RL** | **CQL Value Penalty Landscape** | ![Illustration](graphs/cql_value_penalty_landscape.png) | Conservatism in value functions | CQL, Kumar et al. |
207
+ | **Applied RL** | **Causal RL** | ![Illustration](graphs/causal_rl.png) | Causal Inverse RL Graph | DAG with $S, A, R$ and latent $U$ |
208
+ | **Quantum RL** | **VQE-RL Optimization** | ![Illustration](graphs/vqe_rl_optimization.png) | Quantum circuit param tuning | VQE, Quantum RL |
209
+ | **Applied RL** | **De-novo Drug Discovery RL** | ![Illustration](graphs/de_novo_drug_discovery_rl.png) | Generating optimized lead molecules | Drug Discovery, Molecule RL |
210
+ | **Applied RL** | **Traffic Signal Coordination RL** | ![Illustration](graphs/traffic_signal_coordination_rl.png) | Multi-intersection coordination | IntelliLight, PressLight |
211
+ | **Applied RL** | **Mars Rover Pathfinding RL** | ![Illustration](graphs/mars_rover_pathfinding_rl.png) | Navigation on rough terrain | Space RL, Mars Rover |
212
+ | **Applied RL** | **Sports Player Movement RL** | ![Illustration](graphs/sports_player_movement_rl.png) | Predicting/Optimizing player actions | Sports Analytics, Ghosting |
213
+ | **Applied RL** | **Cryptography Attack RL** | ![Illustration](graphs/cryptography_attack_rl.png) | Searching for keys/vulnerabilities | Crypto-RL, Learning to Attack |
214
+ | **Applied RL** | **Humanitarian Resource RL** | ![Illustration](graphs/humanitarian_resource_rl.png) | Disaster response allocation | AI for Good, Resource RL |
215
+ | **Applied RL** | **Video Compression RL (RD)** | ![Illustration](graphs/video_compression_rl_rd.png) | Optimizing bit-rate vs distortion | Learned Video Compression |
216
+ | **Applied RL** | **Kubernetes Auto-scaling RL** | ![Illustration](graphs/kubernetes_auto_scaling_rl.png) | Cloud resource management | Cloud RL, K8s Scaling |
217
+ | **Applied RL** | **Fluid Dynamics Flow Control RL** | ![Illustration](graphs/fluid_dynamics_flow_control_rl.png) | Airfoil/Turbulence control | Aero-RL, Flow Control |
218
+ | **Applied RL** | **Structural Optimization RL** | ![Illustration](graphs/structural_optimization_rl.png) | Topology/Material design | Structural RL, Topology Opt |
219
+ | **Applied RL** | **Human Decision Modeling** | ![Illustration](graphs/human_decision_modeling.png) | Prospect Theory in RL | Behavioral RL, Prospect Theory |
220
+ | **Applied RL** | **Semantic Parsing RL** | ![Illustration](graphs/semantic_parsing_rl.png) | Language to Logic transformation | Semantic Parsing, Seq2Seq-RL |
221
+ | **Applied RL** | **Music Melody RL** | ![Illustration](graphs/music_melody_rl.png) | Reward-based melody generation | Music-RL, Magenta |
222
+ | **Applied RL** | **Plasma Fusion Control RL** | ![Illustration](graphs/plasma_fusion_control_rl.png) | Magnetic control of Tokamaks | DeepMind Fusion, Tokamak RL |
223
+ | **Applied RL** | **Carbon Capture RL cycle** | ![Illustration](graphs/carbon_capture_rl_cycle.png) | Adsorption/Desorption optimization | Carbon Capture, Green RL |
224
+ | **Applied RL** | **Swarm Robotics RL** | ![Illustration](graphs/swarm_robotics_rl.png) | Decentralized swarm coordination | Swarm-RL, Multi-Robot |
225
+ | **Applied RL** | **Legal Compliance RL Game** | ![Illustration](graphs/legal_compliance_rl_game.png) | Regulatory games | Legal-RL, RegTech |
226
+ | **Physics RL** | **Physics-Informed RL (PINN)** | ![Illustration](graphs/physics_informed_rl_pinn.png) | Constraint-based RL loss | PINN-RL, SciML |
227
+ | **Modern RL** | **Neuro-Symbolic RL** | ![Illustration](graphs/neuro_symbolic_rl.png) | Combining logic and neural nets | Neuro-Symbolic, Logic RL |
228
+ | **Applied RL** | **DeFi Liquidity Pool RL** | ![Illustration](graphs/defi_liquidity_pool_rl.png) | Yield farming/Liquidity balancing | DeFi-RL, AMM Optimization |
229
+ | **Neuro RL** | **Dopamine Reward Prediction Error** | ![Illustration](graphs/dopamine_reward_prediction_error.png) | Biological RL signal curves | Neuroscience-RL, Wolfram Schultz |
230
+ | **Robotics** | **Proprioceptive Sensory-Motor RL** | ![Illustration](graphs/proprioceptive_sensory_motor_rl.png) | Low-level joint control | Proprioceptive RL, Unitree |
231
+ | **Applied RL** | **AR Object Placement RL** | ![Illustration](graphs/ar_object_placement_rl.png) | AR visual overlay optimization | AR-RL, Visual Overlay |
232
+ | **Reco RL** | **Sequential Bundle RL** | ![Illustration](graphs/sequential_bundle_rl.png) | Recommendation item grouping | Bundle-RL, E-commerce |
233
+ | **Theoretical** | **Online Gradient Descent vs RL** | ![Illustration](graphs/online_gradient_descent_vs_rl.png) | Gradient-based learning comparison | Online Learning, Regret |
234
+ | **Modern RL** | **Active Learning: Query RL** | ![Illustration](graphs/active_learning_query_rl.png) | Query-based sample selection | Active-RL, Query Opt |
235
+ | **Modern RL** | **Federated RL global Aggregator** | ![Illustration](graphs/federated_rl_global_aggregator.png) | Privacy-preserving distributed RL | Federated-RL, FedAvg-RL |
236
+ | **Conceptual** | **Ultimate Universal RL Mastery Diagram** | ![Illustration](graphs/ultimate_universal_rl_mastery_diagram.png) | Final summary of 230 items | Absolute Mastery Milestone |
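Several of the tabulated update rules fit in a few lines of code. As one hedged sketch, here is the tabular form of the **Q-Learning Update** row above; the state/action counts, `alpha`, and `gamma` are illustrative values only, not tied to any diagram in this gallery:

```python
import numpy as np

# Hedged sketch of the tabular Q-learning update (off-policy TD control).
# Sizes and hyperparameters below are illustrative only.
n_states, n_actions = 5, 2
alpha, gamma = 0.5, 0.9
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])

# One transition: state 0, action 1, reward 1.0, next state 1.
q_update(0, 1, 1.0, 1)
```

With all entries initialized to zero, a single update moves `Q[0, 1]` halfway toward the TD target of 1.0.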
checkpoint/README.md ADDED
@@ -0,0 +1,207 @@
1
+ ---
+ title: Reinforcement Learning Graphical Representations
+ date: 2026-04-08
+ category: Reinforcement Learning
+ description: A comprehensive gallery of 130 standard RL components and their graphical representations.
+ ---
2
+
3
+ # Reinforcement Learning Graphical Representations
4
+
5
+ This repository contains a full set of 130 visualizations representing foundational concepts, algorithms, and advanced topics in Reinforcement Learning.
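The diagrams all ultimately depict variations on one core cycle, summarized in the first table row below as the agent-environment interaction loop. As a hedged, self-contained sketch of that cycle, with `ToyEnv` and `random_policy` as illustrative stand-ins rather than any real environment API:

```python
import random

# Hedged sketch of the agent-environment interaction loop:
# observe state -> select action -> environment transition -> receive reward.
# ToyEnv and random_policy are illustrative stand-ins, not a real gym API.
class ToyEnv:
    def __init__(self):
        self.state = 0

    def step(self, action):
        self.state = (self.state + action) % 4    # toy transition dynamics
        reward = 1.0 if self.state == 3 else 0.0  # toy reward function
        return self.state, reward

def random_policy(state):
    return random.choice([0, 1])

random.seed(0)
env, episode_return = ToyEnv(), 0.0
state = env.state
for t in range(10):  # one short episode of the loop
    action = random_policy(state)
    state, reward = env.step(action)
    episode_return += reward
```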
6
+
7
+ | Category | Component | Illustration | Details | Context |
8
+ |----------|-----------|--------------|---------|---------|
9
+ | **MDP & Environment** | **Agent-Environment Interaction Loop** | ![Illustration](graphs/agent_environment_interaction_loop.png) | Core cycle: observation of state → selection of action → environment transition → receipt of reward + next state | All RL algorithms |
10
+ | **MDP & Environment** | **Markov Decision Process (MDP) Tuple** | ![Illustration](graphs/markov_decision_process_mdp_tuple.png) | (S, A, P, R, γ) with transition dynamics and reward function | P(s′\|s,a) and R(s,a,s′) |
11
+ | **MDP & Environment** | **State Transition Graph** | ![Illustration](graphs/state_transition_graph.png) | Full probabilistic transitions between discrete states | Gridworld, Taxi, Cliff Walking |
12
+ | **MDP & Environment** | **Trajectory / Episode Sequence** | ![Illustration](graphs/trajectory_episode_sequence.png) | Sequence of (s₀, a₀, r₁, s₁, …, s_T) | Monte Carlo, episodic tasks |
13
+ | **MDP & Environment** | **Continuous State/Action Space Visualization** | ![Illustration](graphs/continuous_state_action_space_visualization.png) | High-dimensional spaces (e.g., robot joints, pixel inputs) | Continuous-control tasks (MuJoCo, PyBullet) |
14
+ | **MDP & Environment** | **Reward Function / Landscape** | ![Illustration](graphs/reward_function_landscape.png) | Scalar reward as function of state/action | All algorithms; especially reward shaping |
15
+ | **MDP & Environment** | **Discount Factor (γ) Effect** | ![Illustration](graphs/discount_factor_effect.png) | How future rewards are weighted | All discounted MDPs |
16
+ | **Value & Policy** | **State-Value Function V(s)** | ![Illustration](graphs/state_value_function_v_s.png) | Expected return from state s under policy π | Value-based methods |
17
+ | **Value & Policy** | **Action-Value Function Q(s,a)** | ![Illustration](graphs/action_value_function_q_s_a.png) | Expected return from state-action pair | Q-learning family |
18
+ | **Value & Policy** | **Policy π(s) or π(a\|s)** | ![Illustration](graphs/policy_s_or_a.png) | Mapping from states to actions (deterministic) or action distributions (stochastic) | Arrow overlays on grid (optimal policy), probability bar charts, or softmax heatmaps |
19
+ | **Value & Policy** | **Advantage Function A(s,a)** | ![Illustration](graphs/advantage_function_a_s_a.png) | Q(s,a) – V(s) | A2C, PPO, SAC, TD3 |
20
+ | **Value & Policy** | **Optimal Value Function V* / Q*** | ![Illustration](graphs/optimal_value_function_v_q.png) | Solution to Bellman optimality | Value iteration, Q-learning |
21
+ | **Dynamic Programming** | **Policy Evaluation Backup** | ![Illustration](graphs/policy_evaluation_backup.png) | Iterative update of V using Bellman expectation | Policy iteration |
22
+ | **Dynamic Programming** | **Policy Improvement** | ![Illustration](graphs/policy_improvement.png) | Greedy policy update over Q | Policy iteration |
23
+ | **Dynamic Programming** | **Value Iteration Backup** | ![Illustration](graphs/value_iteration_backup.png) | Update using Bellman optimality | Value iteration |
24
+ | **Dynamic Programming** | **Policy Iteration Full Cycle** | ![Illustration](graphs/policy_iteration_full_cycle.png) | Evaluation → Improvement loop | Classic DP methods |
25
+ | **Monte Carlo** | **Monte Carlo Backup** | ![Illustration](graphs/monte_carlo_backup.png) | Update using full episode return G_t | First-visit / every-visit MC |
26
+ | **Monte Carlo** | **Monte Carlo Tree (MCTS)** | ![Illustration](graphs/monte_carlo_tree_mcts.png) | Search tree with selection, expansion, simulation, backprop | AlphaGo, AlphaZero |
27
+ | **Monte Carlo** | **Importance Sampling Ratio** | ![Illustration](graphs/importance_sampling_ratio.png) | Off-policy correction ρ = π(a\|s) / b(a\|s) | Off-policy Monte Carlo evaluation |
28
+ | **Temporal Difference** | **TD(0) Backup** | ![Illustration](graphs/td_0_backup.png) | Bootstrapped update using R + γV(s′) | TD learning |
29
+ | **Temporal Difference** | **Bootstrapping (general)** | ![Illustration](graphs/bootstrapping_general.png) | Using estimated future value instead of full return | All TD methods |
30
+ | **Temporal Difference** | **n-step TD Backup** | ![Illustration](graphs/n_step_td_backup.png) | Multi-step return G_t^{(n)} | n-step TD, TD(λ) |
31
+ | **Temporal Difference** | **TD(λ) & Eligibility Traces** | ![Illustration](graphs/td_eligibility_traces.png) | Decaying trace z_t for credit assignment | TD(λ), SARSA(λ), Q(λ) |
32
+ | **Temporal Difference** | **SARSA Update** | ![Illustration](graphs/sarsa_update.png) | On-policy TD control | SARSA |
33
+ | **Temporal Difference** | **Q-Learning Update** | ![Illustration](graphs/q_learning_update.png) | Off-policy TD control | Q-learning, Deep Q-Network |
34
+ | **Temporal Difference** | **Expected SARSA** | ![Illustration](graphs/expected_sarsa.png) | Expectation over next action under policy | Expected SARSA |
35
+ | **Temporal Difference** | **Double Q-Learning / Double DQN** | ![Illustration](graphs/double_q_learning_double_dqn.png) | Two separate Q estimators to reduce overestimation | Double DQN, TD3 |
36
+ | **Temporal Difference** | **Dueling DQN Architecture** | ![Illustration](graphs/dueling_dqn_architecture.png) | Separate streams for state value V(s) and advantage A(s,a) | Dueling DQN |
37
+ | **Temporal Difference** | **Prioritized Experience Replay** | ![Illustration](graphs/prioritized_experience_replay.png) | Importance sampling of transitions by TD error | Prioritized DQN, Rainbow |
38
+ | **Temporal Difference** | **Rainbow DQN Components** | ![Illustration](graphs/rainbow_dqn_components.png) | All extensions combined (Double, Dueling, PER, etc.) | Rainbow DQN |
39
+ | **Function Approximation** | **Linear Function Approximation** | ![Illustration](graphs/linear_function_approximation.png) | Feature vector φ(s) → wᵀφ(s) | Tabular → linear FA |
40
+ | **Function Approximation** | **Neural Network Layers (MLP, CNN, RNN, Transformer)** | ![Illustration](graphs/neural_network_layers_mlp_cnn_rnn_transformer.png) | Full deep network for value/policy | DQN, A3C, PPO, Decision Transformer |
41
+ | **Function Approximation** | **Computation Graph / Backpropagation Flow** | ![Illustration](graphs/computation_graph_backpropagation_flow.png) | Gradient flow through network | All deep RL |
42
+ | **Function Approximation** | **Target Network** | ![Illustration](graphs/target_network.png) | Frozen copy of Q-network for stability | DQN, DDQN, SAC, TD3 |
43
+ | **Policy Gradients** | **Policy Gradient Theorem** | ![Illustration](graphs/policy_gradient_theorem.png) | ∇_θ J(θ) = E[∇_θ log π(a\|s) · Q^π(s,a)] | Flow diagram from reward → log-prob → gradient |
44
+ | **Policy Gradients** | **REINFORCE Update** | ![Illustration](graphs/reinforce_update.png) | Monte-Carlo policy gradient | REINFORCE |
45
+ | **Policy Gradients** | **Baseline / Advantage Subtraction** | ![Illustration](graphs/baseline_advantage_subtraction.png) | Subtract b(s) to reduce variance | All modern PG |
46
+ | **Policy Gradients** | **Trust Region (TRPO)** | ![Illustration](graphs/trust_region_trpo.png) | KL-divergence constraint on policy update | TRPO |
47
+ | **Policy Gradients** | **Proximal Policy Optimization (PPO)** | ![Illustration](graphs/proximal_policy_optimization_ppo.png) | Clipped surrogate objective | PPO, PPO-Clip |
48
+ | **Actor-Critic** | **Actor-Critic Architecture** | ![Illustration](graphs/actor_critic_architecture.png) | Separate or shared actor (policy) + critic (value) networks | A2C, A3C, SAC, TD3 |
49
+ | **Actor-Critic** | **Advantage Actor-Critic (A2C/A3C)** | ![Illustration](graphs/advantage_actor_critic_a2c_a3c.png) | Synchronous/asynchronous multi-worker | A2C/A3C |
50
+ | **Actor-Critic** | **Soft Actor-Critic (SAC)** | ![Illustration](graphs/soft_actor_critic_sac.png) | Entropy-regularized policy + twin critics | SAC |
51
+ | **Actor-Critic** | **Twin Delayed DDPG (TD3)** | ![Illustration](graphs/twin_delayed_ddpg_td3.png) | Twin critics + delayed policy + target smoothing | TD3 |
52
+ | **Exploration** | **ε-Greedy Strategy** | ![Illustration](graphs/greedy_strategy.png) | Probability ε of random action | DQN family |
53
+ | **Exploration** | **Softmax / Boltzmann Exploration** | ![Illustration](graphs/softmax_boltzmann_exploration.png) | Temperature τ in softmax | Softmax policies |
54
+ | **Exploration** | **Upper Confidence Bound (UCB)** | ![Illustration](graphs/upper_confidence_bound_ucb.png) | Optimism in face of uncertainty | UCB1, bandits |
55
+ | **Exploration** | **Intrinsic Motivation / Curiosity** | ![Illustration](graphs/intrinsic_motivation_curiosity.png) | Prediction error as intrinsic reward | ICM, RND, Curiosity-driven RL |
56
+ | **Exploration** | **Entropy Regularization** | ![Illustration](graphs/entropy_regularization.png) | Bonus term αH(π) | SAC, maximum-entropy RL |
57
+ | **Hierarchical RL** | **Options Framework** | ![Illustration](graphs/options_framework.png) | High-level policy over options (temporally extended actions) | Option-Critic |
58
+ | **Hierarchical RL** | **Feudal Networks / Hierarchical Actor-Critic** | ![Illustration](graphs/feudal_networks_hierarchical_actor_critic.png) | Manager-worker hierarchy | Feudal RL |
59
+ | **Hierarchical RL** | **Skill Discovery** | ![Illustration](graphs/skill_discovery.png) | Unsupervised emergence of reusable skills | DIAYN, VALOR |
60
+ | **Model-Based RL** | **Learned Dynamics Model** | ![Illustration](graphs/learned_dynamics_model.png) | Learned transition model P̂(s′\|s,a) | Separate model network diagram (often RNN or transformer) |
61
+ | **Model-Based RL** | **Model-Based Planning** | ![Illustration](graphs/model_based_planning.png) | Rollouts inside learned model | MuZero, DreamerV3 |
62
+ | **Model-Based RL** | **Imagination-Augmented Agents (I2A)** | ![Illustration](graphs/imagination_augmented_agents_i2a.png) | Imagination module + policy | I2A |
63
+ | **Offline RL** | **Offline Dataset** | ![Illustration](graphs/offline_dataset.png) | Fixed batch of trajectories | BC, CQL, IQL |
64
+ | **Offline RL** | **Conservative Q-Learning (CQL)** | ![Illustration](graphs/conservative_q_learning_cql.png) | Penalty on out-of-distribution actions | CQL |
65
+ | **Multi-Agent RL** | **Multi-Agent Interaction Graph** | ![Illustration](graphs/multi_agent_interaction_graph.png) | Agents communicating or competing | MARL, MADDPG |
66
+ | **Multi-Agent RL** | **Centralized Training Decentralized Execution (CTDE)** | ![Illustration](graphs/centralized_training_decentralized_execution_ctde.png) | Shared critic during training | QMIX, VDN, MADDPG |
67
+ | **Multi-Agent RL** | **Cooperative / Competitive Payoff Matrix** | ![Illustration](graphs/cooperative_competitive_payoff_matrix.png) | Joint reward for multiple agents | Prisoner's Dilemma, multi-agent gridworlds |
68
+ | **Inverse RL / IRL** | **Reward Inference** | ![Illustration](graphs/reward_inference.png) | Infer reward from expert demonstrations | IRL, GAIL |
69
+ | **Inverse RL / IRL** | **Generative Adversarial Imitation Learning (GAIL)** | ![Illustration](graphs/generative_adversarial_imitation_learning_gail.png) | Discriminator vs. policy generator | GAIL, AIRL |
70
+ | **Meta-RL** | **Meta-RL Architecture** | ![Illustration](graphs/meta_rl_architecture.png) | Outer loop (meta-policy) + inner loop (task adaptation) | MAML for RL, RL² |
71
+ | **Meta-RL** | **Task Distribution Visualization** | ![Illustration](graphs/task_distribution_visualization.png) | Multiple MDPs sampled from meta-distribution | Meta-RL benchmarks |
72
+ | **Advanced / Misc** | **Experience Replay Buffer** | ![Illustration](graphs/experience_replay_buffer.png) | Stored (s,a,r,s′,done) tuples | DQN and all off-policy deep RL |
73
+ | **Advanced / Misc** | **State Visitation / Occupancy Measure** | ![Illustration](graphs/state_visitation_occupancy_measure.png) | Frequency of visiting each state | All algorithms (analysis) |
74
+ | **Advanced / Misc** | **Learning Curve** | ![Illustration](graphs/learning_curve.png) | Average episodic return vs. episodes / steps | Standard performance reporting |
75
+ | **Advanced / Misc** | **Regret / Cumulative Regret** | ![Illustration](graphs/regret_cumulative_regret.png) | Sub-optimality accumulated | Bandits and online RL |
76
+ | **Advanced / Misc** | **Attention Mechanisms (Transformers in RL)** | ![Illustration](graphs/attention_mechanisms_transformers_in_rl.png) | Attention weights | Decision Transformer, Trajectory Transformer |
77
+ | **Advanced / Misc** | **Diffusion Policy** | ![Illustration](graphs/diffusion_policy.png) | Denoising diffusion process for action generation | Diffusion-RL policies |
78
+ | **Advanced / Misc** | **Graph Neural Networks for RL** | ![Illustration](graphs/graph_neural_networks_for_rl.png) | Node/edge message passing | Graph RL, relational RL |
79
+ | **Advanced / Misc** | **World Model / Latent Space** | ![Illustration](graphs/world_model_latent_space.png) | Encoder-decoder dynamics in latent space | Dreamer, PlaNet |
80
+ | **Advanced / Misc** | **Convergence Analysis Plots** | ![Illustration](graphs/convergence_analysis_plots.png) | Error / value change over iterations | DP, TD, value iteration |
81
+ | **Advanced / Misc** | **RL Algorithm Taxonomy** | ![Illustration](graphs/rl_algorithm_taxonomy.png) | Comprehensive classification of algorithms | All RL |
82
+ | **Advanced / Misc** | **Probabilistic Graphical Model (RL as Inference)** | ![Illustration](graphs/probabilistic_graphical_model_rl_as_inference.png) | Formalizing RL as probabilistic inference | Control as Inference, MaxEnt RL |
83
+ | **Value & Policy** | **Distributional RL (C51 / Categorical)** | ![Illustration](graphs/distributional_rl_c51_categorical.png) | Representing return as a probability distribution | C51, QR-DQN, IQN |
84
+ | **Exploration** | **Hindsight Experience Replay (HER)** | ![Illustration](graphs/hindsight_experience_replay_her.png) | Learning from failures by relabeling goals | Sparse reward robotics, HER |
85
+ | **Model-Based RL** | **Dyna-Q Architecture** | ![Illustration](graphs/dyna_q_architecture.png) | Integration of real experience and model-based planning | Dyna-Q, Dyna-2 |
86
+ | **Function Approximation** | **Noisy Networks (Parameter Noise)** | ![Illustration](graphs/noisy_networks_parameter_noise.png) | Stochastic weights for exploration | Noisy DQN, Rainbow |
87
+ | **Exploration** | **Intrinsic Curiosity Module (ICM)** | ![Illustration](graphs/intrinsic_curiosity_module_icm.png) | Reward based on prediction error | Curiosity-driven exploration, ICM |
+ | **Temporal Difference** | **V-trace (IMPALA)** | ![Illustration](graphs/v_trace_impala.png) | Asynchronous off-policy importance sampling | IMPALA, V-trace |
+ | **Multi-Agent RL** | **QMIX Mixing Network** | ![Illustration](graphs/qmix_mixing_network.png) | Monotonic value function factorization | QMIX, VDN |
+ | **Advanced / Misc** | **Saliency Maps / Attention on State** | ![Illustration](graphs/saliency_maps_attention_on_state.png) | Visualizing what the agent "sees" or prioritizes | Interpretability, Atari RL |
+ | **Exploration** | **Action Selection Noise (OU vs Gaussian)** | ![Illustration](graphs/action_selection_noise_ou_vs_gaussian.png) | Temporal correlation in exploration noise | DDPG, TD3 |
+ | **Advanced / Misc** | **t-SNE / UMAP State Embeddings** | ![Illustration](graphs/t_sne_umap_state_embeddings.png) | Dimension reduction of high-dim neural states | Interpretability, SRL |
+ | **Advanced / Misc** | **Loss Landscape Visualization** | ![Illustration](graphs/loss_landscape_visualization.png) | Optimization surface geometry | Training stability analysis |
+ | **Advanced / Misc** | **Success Rate vs Steps** | ![Illustration](graphs/success_rate_vs_steps.png) | Percentage of successful episodes | Goal-conditioned RL, Robotics |
+ | **Advanced / Misc** | **Hyperparameter Sensitivity Heatmap** | ![Illustration](graphs/hyperparameter_sensitivity_heatmap.png) | Performance across parameter grids | Hyperparameter tuning |
+ | **Dynamics** | **Action Persistence (Frame Skipping)** | ![Illustration](graphs/action_persistence_frame_skipping.png) | Temporal abstraction by repeating actions | Atari RL, Robotics |
+ | **Model-Based RL** | **MuZero Dynamics Search Tree** | ![Illustration](graphs/muzero_dynamics_search_tree.png) | Planning with learned transition and value functions | MuZero, Gumbel MuZero |
+ | **Deep RL** | **Policy Distillation** | ![Illustration](graphs/policy_distillation.png) | Compressing knowledge from teacher to student | Kickstarting, multitask learning |
+ | **Transformers** | **Decision Transformer Token Sequence** | ![Illustration](graphs/decision_transformer_token_sequence.png) | Modeling RL as a sequence prediction task | Decision Transformer, TT |
+ | **Advanced / Misc** | **Performance Profiles (rliable)** | ![Illustration](graphs/performance_profiles_rliable.png) | Robust aggregate performance metrics | Reliable RL evaluation |
+ | **Safety RL** | **Safety Shielding / Barrier Functions** | ![Illustration](graphs/safety_shielding_barrier_functions.png) | Hard constraints on the action space | Constrained MDPs, Safe RL |
+ | **Training** | **Automated Curriculum Learning** | ![Illustration](graphs/automated_curriculum_learning.png) | Progressively increasing task difficulty | Curriculum RL, ALP-GMM |
+ | **Sim-to-Real** | **Domain Randomization** | ![Illustration](graphs/domain_randomization.png) | Generalizing across environment variations | Robotics, Sim-to-Real |
+ | **Alignment** | **RL with Human Feedback (RLHF)** | ![Illustration](graphs/rl_with_human_feedback_rlhf.png) | Aligning agents with human preferences | ChatGPT, InstructGPT |
+ | **Neuro-inspired RL** | **Successor Representation (SR)** | ![Illustration](graphs/successor_representation_sr.png) | Predictive state representations | SR-Dyna, Neuro-RL |
+ | **Inverse RL / IRL** | **Maximum Entropy IRL** | ![Illustration](graphs/maximum_entropy_irl.png) | Probability distribution over trajectories | MaxEnt IRL, Ziebart |
+ | **Theory** | **Information Bottleneck** | ![Illustration](graphs/information_bottleneck.png) | Mutual information $I(S;Z)$ and $I(Z;A)$ balance | VIB-RL, Information Theory |
+ | **Evolutionary RL** | **Evolutionary Strategies Population** | ![Illustration](graphs/evolutionary_strategies_population.png) | Population-based parameter search | OpenAI-ES, Salimans |
+ | **Safety RL** | **Control Barrier Functions (CBF)** | ![Illustration](graphs/control_barrier_functions_cbf.png) | Set-theoretic safety guarantees | CBF-RL, Control Theory |
+ | **Exploration** | **Count-based Exploration Heatmap** | ![Illustration](graphs/count_based_exploration_heatmap.png) | Visitation frequency and intrinsic bonus | MBIE-EB, RND |
+ | **Exploration** | **Thompson Sampling Posteriors** | ![Illustration](graphs/thompson_sampling_posteriors.png) | Direct uncertainty-based action selection | Bandits, Bayesian RL |
+ | **Multi-Agent RL** | **Adversarial RL Interaction** | ![Illustration](graphs/adversarial_rl_interaction.png) | Competition between protagonist and antagonist | Robust RL, RARL |
+ | **Hierarchical RL** | **Hierarchical Subgoal Trajectory** | ![Illustration](graphs/hierarchical_subgoal_trajectory.png) | Decomposing long-horizon tasks | Subgoal RL, HIRO |
+ | **Offline RL** | **Offline Action Distribution Shift** | ![Illustration](graphs/offline_action_distribution_shift.png) | Mismatch between dataset and current policy | CQL, IQL, D4RL |
+ | **Exploration** | **Random Network Distillation (RND)** | ![Illustration](graphs/random_network_distillation_rnd.png) | Prediction error as intrinsic reward | RND, OpenAI |
+ | **Offline RL** | **Batch-Constrained Q-learning (BCQ)** | ![Illustration](graphs/batch_constrained_q_learning_bcq.png) | Constraining actions to behavior dataset | BCQ, Fujimoto |
+ | **Training** | **Population-Based Training (PBT)** | ![Illustration](graphs/population_based_training_pbt.png) | Evolutionary hyperparameter optimization | PBT, DeepMind |
+ | **Deep RL** | **Recurrent State Flow (DRQN/R2D2)** | ![Illustration](graphs/recurrent_state_flow_drqn_r2d2.png) | Temporal dependency in state-action value | DRQN, R2D2 |
+ | **Theory** | **Belief State in POMDPs** | ![Illustration](graphs/belief_state_in_pomdps.png) | Probability distribution over hidden states | POMDPs, Belief Space |
+ | **Multi-Objective RL** | **Multi-Objective Pareto Front** | ![Illustration](graphs/multi_objective_pareto_front.png) | Balancing conflicting reward signals | MORL, Pareto Optimal |
+ | **Theory** | **Differential Value (Average Reward RL)** | ![Illustration](graphs/differential_value_average_reward_rl.png) | Values relative to average gain | Average Reward RL, Mahadevan |
+ | **Infrastructure** | **Distributed RL Cluster (Ray/RLlib)** | ![Illustration](graphs/distributed_rl_cluster_ray_rllib.png) | Parallelizing experience collection | Ray, RLlib, Ape-X |
+ | **Evolutionary RL** | **Neuroevolution Topology Evolution** | ![Illustration](graphs/neuroevolution_topology_evolution.png) | Evolving neural network architectures | NEAT, HyperNEAT |
+ | **Continual RL** | **Elastic Weight Consolidation (EWC)** | ![Illustration](graphs/elastic_weight_consolidation_ewc.png) | Preventing catastrophic forgetting | EWC, Kirkpatrick |
+ | **Theory** | **Successor Features (SF)** | ![Illustration](graphs/successor_features_sf.png) | Generalizing predictive representations | SF-Dyna, Barreto |
+ | **Safety** | **Adversarial State Noise (Perception)** | ![Illustration](graphs/adversarial_state_noise_perception.png) | Attacks on agent observation space | Adversarial RL, Huang |
+ | **Imitation Learning** | **Behavioral Cloning (Imitation)** | ![Illustration](graphs/behavioral_cloning_imitation.png) | Direct supervised learning from experts | BC, DAgger |
+ | **Relational RL** | **Relational Graph State Representation** | ![Illustration](graphs/relational_graph_state_representation.png) | Modeling objects and their relations | Relational MDPs, BoxWorld |
+ | **Quantum RL** | **Quantum RL Circuit (PQC)** | ![Illustration](graphs/quantum_rl_circuit_pqc.png) | Gate-based quantum policy networks | Quantum RL, PQC |
+ | **Symbolic RL** | **Symbolic Policy Tree** | ![Illustration](graphs/symbolic_policy_tree.png) | Policies as mathematical expressions | Symbolic RL, GP |
+ | **Control** | **Differentiable Physics Gradient Flow** | ![Illustration](graphs/differentiable_physics_gradient_flow.png) | Gradient-based planning through simulators | Brax, Isaac Gym |
+ | **Multi-Agent RL** | **MARL Communication Channel** | ![Illustration](graphs/marl_communication_channel.png) | Information exchange between agents | CommNet, DIAL |
+ | **Safety** | **Lagrangian Constraint Landscape** | ![Illustration](graphs/lagrangian_constraint_landscape.png) | Constrained optimization boundaries | Constrained RL, CPO |
+ | **Hierarchical RL** | **MAXQ Task Hierarchy** | ![Illustration](graphs/maxq_task_hierarchy.png) | Recursive task decomposition | MAXQ, Dietterich |
+ | **Agentic AI** | **ReAct Agentic Cycle** | ![Illustration](graphs/react_agentic_cycle.png) | Reasoning-Action loops for LLMs | ReAct, Agentic LLM |
+ | **Bio-inspired RL** | **Synaptic Plasticity RL** | ![Illustration](graphs/synaptic_plasticity_rl.png) | Hebbian-style synaptic weight updates | Hebbian RL, STDP |
+ | **Control** | **Guided Policy Search (GPS)** | ![Illustration](graphs/guided_policy_search_gps.png) | Distilling trajectories into a policy | GPS, Levine |
+ | **Robotics** | **Sim-to-Real Jitter & Latency** | ![Illustration](graphs/sim_to_real_jitter_latency.png) | Temporal robustness in transfer | Sim-to-Real, Robustness |
+ | **Policy Gradients** | **Deterministic Policy Gradient (DDPG) Flow** | ![Illustration](graphs/deterministic_policy_gradient_ddpg_flow.png) | Gradient flow for deterministic policies | DDPG |
+ | **Model-Based RL** | **Dreamer Latent Imagination** | ![Illustration](graphs/dreamer_latent_imagination.png) | Learning and planning in latent space | Dreamer (V1-V3) |
+ | **Deep RL** | **UNREAL Auxiliary Tasks** | ![Illustration](graphs/unreal_auxiliary_tasks.png) | Learning from non-reward signals | UNREAL, A3C extension |
+ | **Offline RL** | **Implicit Q-Learning (IQL) Expectile** | ![Illustration](graphs/implicit_q_learning_iql_expectile.png) | In-sample learning via expectile regression | IQL |
+ | **Model-Based RL** | **Prioritized Sweeping** | ![Illustration](graphs/prioritized_sweeping.png) | Planning prioritized by TD error | Sutton & Barto classic MBRL |
+ | **Imitation Learning** | **DAgger Expert Loop** | ![Illustration](graphs/dagger_expert_loop.png) | Training on expert labels in agent-visited states | DAgger |
+ | **Representation** | **Self-Predictive Representations (SPR)** | ![Illustration](graphs/self_predictive_representations_spr.png) | Consistency between predicted and target latents | SPR, sample-efficient RL |
+ | **Multi-Agent RL** | **Joint Action Space** | ![Illustration](graphs/joint_action_space.png) | Cartesian product of individual actions | MARL theory, Game Theory |
+ | **Multi-Agent RL** | **Dec-POMDP Formal Model** | ![Illustration](graphs/dec_pomdp_formal_model.png) | Decentralized partially observable MDP | Multi-agent coordination |
+ | **Theory** | **Bisimulation Metric** | ![Illustration](graphs/bisimulation_metric.png) | State equivalence based on transitions/rewards | State abstraction, bisimulation theory |
+ | **Theory** | **Potential-Based Reward Shaping** | ![Illustration](graphs/potential_based_reward_shaping.png) | Reward transformation preserving optimal policy | Sutton & Barto, Ng et al. |
+ | **Training** | **Transfer RL: Source to Target** | ![Illustration](graphs/transfer_rl_source_to_target.png) | Reusing knowledge across different MDPs | Transfer Learning, Distillation |
+ | **Deep RL** | **Multi-Task Backbone Arch** | ![Illustration](graphs/multi_task_backbone_arch.png) | Single agent learning multiple tasks | Multi-task RL, IMPALA |
+ | **Bandits** | **Contextual Bandit Pipeline** | ![Illustration](graphs/contextual_bandit_pipeline.png) | Decision making given context but no transitions | Personalization, Ad-tech |
+ | **Theory** | **Theoretical Regret Bounds** | ![Illustration](graphs/theoretical_regret_bounds.png) | Analytical performance guarantees | Online Learning, Bandits |
+ | **Value-based** | **Soft Q Boltzmann Probabilities** | ![Illustration](graphs/soft_q_boltzmann_probabilities.png) | Probabilistic action selection from Q-values | Soft Q-learning, $\pi(a\|s) \propto \exp(Q/\tau)$ |
+ | **Robotics** | **Autonomous Driving RL Pipeline** | ![Illustration](graphs/autonomous_driving_rl_pipeline.png) | End-to-end or modular driving stack | Wayve, Tesla, Comma.ai |
+ | **Policy** | **Policy Action Gradient Comparison** | ![Illustration](graphs/policy_action_gradient_comparison.png) | Comparison of gradient derivation types | PG Theorem vs DPG Theorem |
+ | **Inverse RL / IRL** | **IRL: Feature Expectation Matching** | ![Illustration](graphs/irl_feature_expectation_matching.png) | Comparing expert vs learner feature visitation frequency | $\mu(\pi^*) - \mu(\pi)$ |
+ | **Imitation Learning** | **Apprenticeship Learning Loop** | ![Illustration](graphs/apprenticeship_learning_loop.png) | Training to match expert performance via reward inference | Apprenticeship Learning |
+ | **Theory** | **Active Inference Loop** | ![Illustration](graphs/active_inference_loop.png) | Agents minimizing surprise (free energy) | Free Energy Principle, Friston |
+ | **Theory** | **Bellman Residual Landscape** | ![Illustration](graphs/bellman_residual_landscape.png) | Training surface of the Bellman error | TD learning, fitted Q-iteration |
+ | **Model-Based RL** | **Plan-to-Explore Uncertainty Map** | ![Illustration](graphs/plan_to_explore_uncertainty_map.png) | Systematic exploration in learned world models | Plan-to-Explore, Sekar et al. |
+ | **Safety RL** | **Robust RL Uncertainty Set** | ![Illustration](graphs/robust_rl_uncertainty_set.png) | Optimizing for the worst-case environment transition | Robust MDPs, minimax RL |
+ | **Training** | **HPO Bayesian Opt Cycle** | ![Illustration](graphs/hpo_bayesian_opt_cycle.png) | Automating hyperparameter selection with GP | Hyperparameter Optimization |
+ | **Applied RL** | **Slate RL Recommendation** | ![Illustration](graphs/slate_rl_recommendation.png) | Optimizing list/slate of items for users | Recommender Systems, Ie et al. |
+ | **Multi-Agent RL** | **Fictitious Play Interaction** | ![Illustration](graphs/fictitious_play_interaction.png) | Belief-based learning in games | Game Theory, Brown (1951) |
+ | **Conceptual** | **Universal RL Framework Diagram** | ![Illustration](graphs/universal_rl_framework_diagram.png) | High-level summary of RL components | All RL |
+ | **Offline RL** | **Offline Density Ratio Estimator** | ![Illustration](graphs/offline_density_ratio_estimator.png) | Estimating $w(s,a)$ for off-policy data | Importance Sampling, Offline RL |
+ | **Continual RL** | **Continual Task Interference Heatmap** | ![Illustration](graphs/continual_task_interference_heatmap.png) | Measuring negative transfer between tasks | Lifelong Learning, EWC |
+ | **Safety RL** | **Lyapunov Stability Safe Set** | ![Illustration](graphs/lyapunov_stability_safe_set.png) | Invariant sets for safe control | Lyapunov RL, Chow et al. |
+ | **Applied RL** | **Molecular RL (Atom Coordinates)** | ![Illustration](graphs/molecular_rl_atom_coordinates.png) | RL for molecular design/protein folding | Chemistry RL, AlphaFold-style |
+ | **Architecture** | **MoE Multi-task Architecture** | ![Illustration](graphs/moe_multi_task_architecture.png) | Scaling models with mixture of experts | MoE-RL, Sparsity |
+ | **Direct Policy Search** | **CMA-ES Policy Search** | ![Illustration](graphs/cma_es_policy_search.png) | Evolutionary strategy for policy weights | ES for RL, Salimans |
+ | **Alignment** | **Elo Rating Preference Plot** | ![Illustration](graphs/elo_rating_preference_plot.png) | Measuring agent strength over time | AlphaZero, League training |
+ | **Explainable RL** | **Explainable RL (SHAP Attribution)** | ![Illustration](graphs/explainable_rl_shap_attribution.png) | Local attribution of features to agent actions | Interpretability, SHAP/LIME |
+ | **Meta-RL** | **PEARL Context Encoder** | ![Illustration](graphs/pearl_context_encoder.png) | Learning latent task representations | PEARL, Rakelly et al. |
+ | **Applied RL** | **Medical RL Therapy Pipeline** | ![Illustration](graphs/medical_rl_therapy_pipeline.png) | Personalized medicine and dosing | Healthcare RL, ICU Sepsis |
+ | **Applied RL** | **Supply Chain RL Pipeline** | ![Illustration](graphs/supply_chain_rl_pipeline.png) | Optimizing stock levels and orders | Logistics, Inventory Management |
+ | **Robotics** | **Sim-to-Real SysID Loop** | ![Illustration](graphs/sim_to_real_sysid_loop.png) | Closing the reality gap via parameter estimation | System Identification, Robotics |
+ | **Architecture** | **Transformer World Model** | ![Illustration](graphs/transformer_world_model.png) | Sequence-to-sequence dynamics modeling | DreamerV3, Transframer |
+ | **Applied RL** | **Network Traffic RL** | ![Illustration](graphs/network_traffic_rl.png) | Optimizing data packet routing in graphs | Networking, Traffic Engineering |
+ | **Training** | **RLHF: PPO with Reference Policy** | ![Illustration](graphs/rlhf_ppo_with_reference_policy.png) | Ensuring RL fine-tuning doesn't drift too far from the reference policy | InstructGPT, Llama 2/3 |
+ | **Multi-Agent RL** | **PSRO Meta-Game Update** | ![Illustration](graphs/psro_meta_game_update.png) | Reaching Nash equilibrium in large games | PSRO, Lanctot et al. |
+ | **Multi-Agent RL** | **DIAL: Differentiable Comm** | ![Illustration](graphs/dial_differentiable_comm.png) | End-to-end learning of communication protocols | DIAL, Foerster et al. |
+ | **Batch RL** | **Fitted Q-Iteration Loop** | ![Illustration](graphs/fitted_q_iteration_loop.png) | Data-driven iteration with a supervised regressor | Ernst et al. (2005) |
+ | **Safety RL** | **CMDP Feasible Region** | ![Illustration](graphs/cmdp_feasible_region.png) | Constrained optimization within a safety budget | Constrained MDPs, Altman |
+ | **Control** | **MPC vs RL Planning** | ![Illustration](graphs/mpc_vs_rl_planning.png) | Comparison of control paradigms | Control Theory vs RL |
+ | **AutoML** | **Learning to Optimize (L2O)** | ![Illustration](graphs/learning_to_optimize_l2o.png) | Using RL to learn an optimization update rule | L2O, Li & Malik |
+ | **Applied RL** | **Smart Grid RL Management** | ![Illustration](graphs/smart_grid_rl_management.png) | Optimizing energy supply and demand | Energy RL, Smart Grids |
+ | **Applied RL** | **Quantum State Tomography RL** | ![Illustration](graphs/quantum_state_tomography_rl.png) | RL for quantum state estimation | Quantum RL, Neural Tomography |
+ | **Applied RL** | **RL for Chip Placement** | ![Illustration](graphs/rl_for_chip_placement.png) | Placing components on silicon grids | Google Chip Placement |
+ | **Applied RL** | **RL Compiler Optimization (MLGO)** | ![Illustration](graphs/rl_compiler_optimization_mlgo.png) | Inlining and sizing in compilers | MLGO, LLVM |
+ | **Applied RL** | **RL for Theorem Proving** | ![Illustration](graphs/rl_for_theorem_proving.png) | Automated reasoning and proof search | LeanRL, AlphaProof |
+ | **Modern RL** | **Diffusion-QL Offline RL** | ![Illustration](graphs/diffusion_ql_offline_rl.png) | Policy as reverse diffusion process | Diffusion-QL, $\pi(a\|s,k)$ with noise injection |
+ | **Principles** | **Fairness-reward Pareto Frontier** | ![Illustration](graphs/fairness_reward_pareto_frontier.png) | Balancing equity and returns | Fair RL, Jabbari et al. |
+ | **Principles** | **Differentially Private RL** | ![Illustration](graphs/differentially_private_rl.png) | Privacy-preserving training | DP-RL, Agarwal et al. |
+ | **Applied RL** | **Smart Agriculture RL** | ![Illustration](graphs/smart_agriculture_rl.png) | Optimizing crop yield and resources | Precision Agriculture |
+ | **Applied RL** | **Climate Mitigation RL (Grid)** | ![Illustration](graphs/climate_mitigation_rl_grid.png) | Environmental control policies | ClimateRL, Carbon Control |
+ | **Applied RL** | **AI Education (Knowledge Tracing)** | ![Illustration](graphs/ai_education_knowledge_tracing.png) | Personalized learning paths | ITS, Bayesian Knowledge Tracing |
+ | **Modern RL** | **Decision SDE Flow** | ![Illustration](graphs/decision_sde_flow.png) | RL in continuous stochastic systems | Neural SDEs, Control |
+ | **Control** | **Differentiable Physics (Brax)** | ![Illustration](graphs/differentiable_physics_brax.png) | Gradients through simulators | Brax, PhysX, MuJoCo |
+ | **Applied RL** | **Wireless Beamforming RL** | ![Illustration](graphs/wireless_beamforming_rl.png) | Optimizing antenna signal directions | 5G/6G Networking |
+ | **Applied RL** | **Quantum Error Correction RL** | ![Illustration](graphs/quantum_error_correction_rl.png) | Correcting noise in quantum circuits | Quantum Computing RL |
+ | **Multi-Agent RL** | **Mean Field RL Interaction** | ![Illustration](graphs/mean_field_rl_interaction.png) | Large population agent dynamics | MF-RL, Yang et al. |
+ | **HRL** | **Goal-GAN Curriculum** | ![Illustration](graphs/goal_gan_curriculum.png) | Automatic goal generation | Goal-GAN, Florensa et al. |
+ | **Modern RL** | **JEPA: Predictive Architecture** | ![Illustration](graphs/jepa_predictive_architecture.png) | LeCun's world model framework | JEPA, I-JEPA |
+ | **Offline RL** | **CQL Value Penalty Landscape** | ![Illustration](graphs/cql_value_penalty_landscape.png) | Conservatism in value functions | CQL, Kumar et al. |
+ | **Applied RL** | **Cybersecurity Attack-Defense RL** | ![Illustration](graphs/cybersecurity_attack_defense_rl.png) | Network intrusion and protection | Cyber-RL, Zero Trust |
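The "Soft Q Boltzmann Probabilities" row above can be made concrete with a minimal, stand-alone sketch (illustrative only, not taken from this repository's code): it converts Q-values into the softmax action distribution $\pi(a|s) \propto \exp(Q(s,a)/\tau)$ and samples an action from it.

```python
import math
import random

def boltzmann_probs(q_values, tau=1.0):
    """Boltzmann (softmax) action probabilities: p(a) proportional to exp(Q(a)/tau)."""
    m = max(q_values)  # subtract the max before exponentiating for numerical stability
    exps = [math.exp((q - m) / tau) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]

def sample_action(q_values, tau=1.0, rng=random):
    """Sample an action index from the Boltzmann distribution over Q-values."""
    probs = boltzmann_probs(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]
```

As the temperature τ → 0 the distribution approaches greedy action selection, while a large τ approaches uniform random exploration.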
checkpoint/core.py ADDED
The diff for this file is too large to render. See raw diff
 
checkpoint/e.md ADDED
@@ -0,0 +1,204 @@
+ | **Category** | **Component** | **Detailed Description** | **Common Graphical Presentation** | **Typical Algorithms / Contexts** |
+ |--------------|---------------|--------------------------|-----------------------------------|-----------------------------------|
+ | **MDP & Environment** | Agent-Environment Interaction Loop | Core cycle: observation of state → selection of action → environment transition → receipt of reward + next state | Circular flowchart or block diagram with arrows (S → A → R, S′) | All RL algorithms |
+ | **MDP & Environment** | Markov Decision Process (MDP) Tuple | (S, A, P, R, γ) with transition dynamics and reward function | Directed graph (nodes = states, labeled edges = actions with P(s′\|s,a) and R(s,a,s′)) | Foundational theory, all model-based methods |
+ | **MDP & Environment** | State Transition Graph | Full probabilistic transitions between discrete states | Graph diagram with probability-weighted arrows | Gridworld, Taxi, Cliff Walking |
+ | **MDP & Environment** | Trajectory / Episode Sequence | Sequence of (s₀, a₀, r₁, s₁, …, s_T) | Linear timeline or chain diagram | Monte Carlo, episodic tasks |
+ | **MDP & Environment** | Continuous State/Action Space Visualization | High-dimensional spaces (e.g., robot joints, pixel inputs) | 2D/3D scatter plots, density heatmaps, or manifold projections | Continuous-control tasks (MuJoCo, PyBullet) |
+ | **MDP & Environment** | Reward Function / Landscape | Scalar reward as function of state/action | 3D surface plot, contour plot, or heatmap | All algorithms; especially reward shaping |
+ | **MDP & Environment** | Discount Factor (γ) Effect | How future rewards are weighted | Line plot of geometric decay series or cumulative return curves for different γ | All discounted MDPs |
+ | **Value & Policy** | State-Value Function V(s) | Expected return from state s under policy π | Heatmap (gridworld), 3D surface plot, or contour plot | Value-based methods |
+ | **Value & Policy** | Action-Value Function Q(s,a) | Expected return from state-action pair | Q-table (discrete) or heatmap per action; 3D surface for continuous | Q-learning family |
+ | **Value & Policy** | Policy π(s) or π(a\|s) | Stochastic or deterministic mapping | Arrow overlays on grid (optimal policy), probability bar charts, or softmax heatmaps | All policy-based methods |
+ | **Value & Policy** | Advantage Function A(s,a) | Q(s,a) – V(s) | Comparative bar/heatmap or signed surface plot | A2C, PPO, SAC, TD3 |
+ | **Value & Policy** | Optimal Value Function V* / Q* | Solution to Bellman optimality | Heatmap or surface with arrows showing greedy policy | Value iteration, Q-learning |
+ | **Dynamic Programming** | Policy Evaluation Backup | Iterative update of V using Bellman expectation | Backup diagram (current state points to all successor states with probabilities) | Policy iteration |
+ | **Dynamic Programming** | Policy Improvement | Greedy policy update over Q | Arrow diagram showing before/after policy on grid | Policy iteration |
+ | **Dynamic Programming** | Value Iteration Backup | Update using Bellman optimality | Single backup diagram (max over actions) | Value iteration |
+ | **Dynamic Programming** | Policy Iteration Full Cycle | Evaluation → Improvement loop | Multi-step flowchart or convergence plot (error vs iterations) | Classic DP methods |
+ | **Monte Carlo** | Monte Carlo Backup | Update using full episode return G_t | Backup diagram (leaf node = actual return G_t) | First-visit / every-visit MC |
+ | **Monte Carlo** | Monte Carlo Tree (MCTS) | Search tree with selection, expansion, simulation, backprop | Full tree diagram with visit counts and value bars | AlphaGo, AlphaZero |
+ | **Monte Carlo** | Importance Sampling Ratio | Off-policy correction ρ = π(a\|s)/b(a\|s) | Flow diagram showing weight multiplication along trajectory | Off-policy MC |
+ | **Temporal Difference** | TD(0) Backup | Bootstrapped update using R + γV(s′) | One-step backup diagram | TD learning |
+ | **Temporal Difference** | Bootstrapping (general) | Using estimated future value instead of full return | Layered backup diagram showing estimate ← estimate | All TD methods |
+ | **Temporal Difference** | n-step TD Backup | Multi-step return G_t^{(n)} | Multi-step backup diagram with n arrows | n-step TD, TD(λ) |
+ | **Temporal Difference** | TD(λ) & Eligibility Traces | Decaying trace z_t for credit assignment | Trace-decay curve or accumulating/replacing trace diagram | TD(λ), SARSA(λ), Q(λ) |
+ | **Temporal Difference** | SARSA Update | On-policy TD control | Backup diagram identical to TD but using next action from current policy | SARSA |
+ | **Temporal Difference** | Q-Learning Update | Off-policy TD control | Backup diagram using max_a′ Q(s′,a′) | Q-learning, Deep Q-Network |
+ | **Temporal Difference** | Expected SARSA | Expectation over next action under policy | Backup diagram with weighted sum over actions | Expected SARSA |
+ | **Temporal Difference** | Double Q-Learning / Double DQN | Two separate Q estimators to reduce overestimation | Dual-network backup diagram | Double DQN, TD3 |
+ | **Temporal Difference** | Dueling DQN Architecture | Separate streams for state value V(s) and advantage A(s,a) | Neural net diagram with two heads merging into Q | Dueling DQN |
+ | **Temporal Difference** | Prioritized Experience Replay | Importance sampling of transitions by TD error | Priority queue diagram or histogram of priorities | Prioritized DQN, Rainbow |
+ | **Temporal Difference** | Rainbow DQN Components | All extensions combined (Double, Dueling, PER, etc.) | Composite architecture diagram | Rainbow DQN |
+ | **Function Approximation** | Linear Function Approximation | Feature vector φ(s) → wᵀφ(s) | Weight vector diagram or basis function plots | Tabular → linear FA |
+ | **Function Approximation** | Neural Network Layers (MLP, CNN, RNN, Transformer) | Full deep network for value/policy | Layer-by-layer architecture diagram with activation shapes | DQN, A3C, PPO, Decision Transformer |
+ | **Function Approximation** | Computation Graph / Backpropagation Flow | Gradient flow through network | Directed acyclic graph (DAG) of operations | All deep RL |
+ | **Function Approximation** | Target Network | Frozen copy of Q-network for stability | Dual-network diagram with periodic copy arrow | DQN, DDQN, SAC, TD3 |
+ | **Policy Gradients** | Policy Gradient Theorem | ∇_θ J(θ) = E[∇_θ log π(a\|s) ⋅ Â] | Flow diagram from reward → log-prob → gradient | REINFORCE, PG methods |
+ | **Policy Gradients** | REINFORCE Update | Monte-Carlo policy gradient | Full-trajectory gradient flow diagram | REINFORCE |
+ | **Policy Gradients** | Baseline / Advantage Subtraction | Subtract b(s) to reduce variance | Diagram comparing raw return vs. advantage-scaled gradient | All modern PG |
+ | **Policy Gradients** | Trust Region (TRPO) | KL-divergence constraint on policy update | Constraint boundary diagram or trust-region circle | TRPO |
+ | **Policy Gradients** | Proximal Policy Optimization (PPO) | Clipped surrogate objective | Clip function plot (min/max bounds) | PPO, PPO-Clip |
+ | **Actor-Critic** | Actor-Critic Architecture | Separate or shared actor (policy) + critic (value) networks | Dual-network diagram with shared backbone option | A2C, A3C, SAC, TD3 |
+ | **Actor-Critic** | Advantage Actor-Critic (A2C/A3C) | Synchronous/asynchronous multi-worker | Multi-threaded diagram with global parameter server | A2C/A3C |
+ | **Actor-Critic** | Soft Actor-Critic (SAC) | Entropy-regularized policy + twin critics | Architecture with entropy bonus term shown as extra input | SAC |
+ | **Actor-Critic** | Twin Delayed DDPG (TD3) | Twin critics + delayed policy + target smoothing | Three-network diagram (actor + two critics) | TD3 |
+ | **Exploration** | ε-Greedy Strategy | Probability ε of random action | Decay curve plot (ε vs. episodes) | DQN family |
+ | **Exploration** | Softmax / Boltzmann Exploration | Temperature τ in softmax | Temperature decay curve or probability surface | Softmax policies |
+ | **Exploration** | Upper Confidence Bound (UCB) | Optimism in face of uncertainty | Confidence bound bars on action values | UCB1, bandits |
+ | **Exploration** | Intrinsic Motivation / Curiosity | Prediction error as intrinsic reward | Separate intrinsic reward module diagram | ICM, RND, Curiosity-driven RL |
+ | **Exploration** | Entropy Regularization | Bonus term αH(π) | Entropy plot or bonus curve | SAC, maximum-entropy RL |
+ | **Hierarchical RL** | Options Framework | High-level policy over options (temporally extended actions) | Hierarchical diagram with option policy layer | Option-Critic |
+ | **Hierarchical RL** | Feudal Networks / Hierarchical Actor-Critic | Manager-worker hierarchy | Multi-level network diagram | Feudal RL |
+ | **Hierarchical RL** | Skill Discovery | Unsupervised emergence of reusable skills | Skill embedding space visualization | DIAYN, VALOR |
+ | **Model-Based RL** | Learned Dynamics Model | ˆP(s′\|s,a) or world model | Separate model network diagram (often RNN or transformer) | Dyna, MBPO, Dreamer |
+ | **Model-Based RL** | Model-Based Planning | Rollouts inside learned model | Tree or rollout diagram inside model | MuZero, DreamerV3 |
+ | **Model-Based RL** | Imagination-Augmented Agents (I2A) | Imagination module + policy | Imagination rollout diagram | I2A |
+ | **Offline RL** | Offline Dataset | Fixed batch of trajectories | Replay buffer diagram (no interaction arrow) | BC, CQL, IQL |
+ | **Offline RL** | Conservative Q-Learning (CQL) | Penalty on out-of-distribution actions | Q-value regularization diagram | CQL |
+ | **Multi-Agent RL** | Multi-Agent Interaction Graph | Agents communicating or competing | Graph with nodes = agents, edges = communication | MARL, MADDPG |
+ | **Multi-Agent RL** | Centralized Training Decentralized Execution (CTDE) | Shared critic during training | Dual-view diagram (central critic vs. local actors) | QMIX, VDN, MADDPG |
+ | **Multi-Agent RL** | Cooperative / Competitive Payoff Matrix | Joint reward for multiple agents | Heatmap matrix of joint rewards | Prisoner's Dilemma, multi-agent gridworlds |
+ | **Inverse RL / IRL** | Reward Inference | Infer reward from expert demonstrations | Demonstration trajectory → inferred reward heatmap | IRL, GAIL |
+ | **Inverse RL / IRL** | Generative Adversarial Imitation Learning (GAIL) | Discriminator vs. policy generator | GAN-style diagram adapted for trajectories | GAIL, AIRL |
+ | **Meta-RL** | Meta-RL Architecture | Outer loop (meta-policy) + inner loop (task adaptation) | Nested loop diagram | MAML for RL, RL² |
+ | **Meta-RL** | Task Distribution Visualization | Multiple MDPs sampled from meta-distribution | Grid of task environments or embedding space | Meta-RL benchmarks |
+ | **Advanced / Misc** | Experience Replay Buffer | Stored (s,a,r,s′,done) tuples | FIFO queue or prioritized sampling diagram | DQN and all off-policy deep RL |
+ | **Advanced / Misc** | State Visitation / Occupancy Measure | Frequency of visiting each state | Heatmap or density plot | All algorithms (analysis) |
+ | **Advanced / Misc** | Learning Curve | Average episodic return vs. episodes / steps | Line plot with confidence bands | Standard performance reporting |
+ | **Advanced / Misc** | Regret / Cumulative Regret | Sub-optimality accumulated | Cumulative sum plot | Bandits and online RL |
+ | **Advanced / Misc** | Attention Mechanisms (Transformers in RL) | Attention weights | Attention heatmap or token highlighting | Decision Transformer, Trajectory Transformer |
+ | **Advanced / Misc** | Diffusion Policy | Denoising diffusion process for action generation | Step-by-step denoising trajectory diagram | Diffusion-RL policies |
+ | **Advanced / Misc** | Graph Neural Networks for RL | Node/edge message passing | Graph convolution diagram | Graph RL, relational RL |
+ | **Advanced / Misc** | World Model / Latent Space | Encoder-decoder dynamics in latent space | Encoder → latent → decoder diagram | Dreamer, PlaNet |
+ | **Advanced / Misc** | Convergence Analysis Plots | Error / value change over iterations | Log-scale convergence curves | DP, TD, value iteration |
+ | **Advanced / Misc** | RL Algorithm Taxonomy | Comprehensive classification of algorithms | Tree / Hierarchy diagram (Model-free vs Model-based, etc.) | All RL |
+ | **Advanced / Misc** | Probabilistic Graphical Model (RL as Inference) | Formalizing RL as probabilistic inference | Bayesian network (Nodes for S, A, R, O) | Control as Inference, MaxEnt RL |
+ | **Value & Policy** | Distributional RL (C51 / Categorical) | Representing return as a probability distribution | Histogram of atoms or quantile plots | C51, QR-DQN, IQN |
+ | **Exploration** | Hindsight Experience Replay (HER) | Learning from failures by relabeling goals | Trajectory with true vs. relabeled goal markers | Sparse reward robotics, HER |
+ | **Model-Based RL** | Dyna-Q Architecture | Integration of real experience and model-based planning | Flow diagram (Experience → Model → Planning → Value) | Dyna-Q, Dyna-2 |
+ | **Function Approximation** | Noisy Networks (Parameter Noise) | Stochastic weights for exploration | Diagram showing weight distributions vs. point estimates | Noisy DQN, Rainbow |
+ | **Exploration** | Intrinsic Curiosity Module (ICM) | Reward based on prediction error | Dual-head architecture (Inverse + Forward models) | Curiosity-driven exploration, ICM |
82
+ | **Temporal Difference** | V-trace (IMPALA) | Asynchronous off-policy importance sampling | Multi-learner timeline with importance weight bars | IMPALA, V-trace |
83
+ | **Multi-Agent RL** | QMIX Mixing Network | Monotonic value function factorization | Architecture with agent networks feeding into a mixing net | QMIX, VDN |
84
+ | **Advanced / Misc** | Saliency Maps / Attention on State | Visualizing what the agent "sees" or prioritizes | Heatmap overlay on state/pixel input | Interpretability, Atari RL |
85
+ | **Exploration** | Action Selection Noise (OU vs Gaussian) | Temporal correlation in exploration noise | Line plots comparing random vs. correlated noise paths | DDPG, TD3 |
86
+ | **Advanced / Misc** | t-SNE / UMAP State Embeddings | Dimension reduction of high-dim neural states | Scatter plot with behavioral clusters | Interpretability, SRL |
87
+ | **Advanced / Misc** | Loss Landscape Visualization | Optimization surface geometry | 3D surface or contour map of policy/value loss | Training stability analysis |
88
+ | **Advanced / Misc** | Success Rate vs Steps | Percentage of successful episodes | S-shaped learning curve (0 to 1 scale) | Goal-conditioned RL, Robotics |
89
+ | **Advanced / Misc** | Hyperparameter Sensitivity Heatmap | Performance across parameter grids | Colored grid (e.g., Learning Rate vs Batch Size) | Hyperparameter tuning |
90
+ | **Dynamics** | Action Persistence (Frame Skipping) | Temporal abstraction by repeating actions | Timeline showing one action held for k steps | Atari RL, Robotics |
91
+ | **Model-Based RL** | MuZero Dynamics Search Tree | Planning with learned transition and value functions | MCTS tree where edges are the dynamics model $g$ | MuZero, Gumbel MuZero |
92
+ | **Deep RL** | Policy Distillation | Compressing knowledge from teacher to student | Divergence loss flow between two networks | Kickstarting, multitask learning |
93
+ | **Transformers** | Decision Transformer Token Sequence | Sequential modeling of RL as a translation task | Token sequence diagram (R, S, A, R, S, A) | Decision Transformer, TT |
94
+ | **Advanced / Misc** | Performance Profiles (rliable) | Robust aggregate performance metrics | Probability profile curves across multiple seeds | Reliable RL evaluation |
95
+ | **Safety RL** | Safety Shielding / Barrier Functions | Hard constraints on the action space | Diagram showing rejected actions outside safety set | Constrained MDPs, Safe RL |
96
+ | **Training** | Automated Curriculum Learning | Progressively increasing task difficulty | Difficulty curve vs performance over time | Curriculum RL, ALP-GMM |
97
+ | **Sim-to-Real** | Domain Randomization | Generalizing across environment variations | Distribution plot of randomized physical parameters | Robotics, Sim-to-Real |
98
+ | **Alignment** | RL with Human Feedback (RLHF) | Aligning agents with human preferences | Flowchart (Preferences → Reward Model → PPO) | ChatGPT, InstructGPT |
99
+ | **Neuro-inspired RL** | Successor Representation (SR) | Predictive state representations | Matrix $M$ showing future occupancy clusters | SR-Dyna, Neuro-RL |
100
+ | **Inverse RL / IRL** | Maximum Entropy IRL | Probability distribution over trajectories | Log-probability distribution plot $P(\tau)$ | MaxEnt IRL, Ziebart |
101
+ | **Theory** | Information Bottleneck | Mutual information $I(S;Z)$ and $I(Z;A)$ balance | Compression vs. Extraction diagram | VIB-RL, Information Theory |
102
+ | **Evolutionary RL** | Evolutionary Strategies Population | Population-based parameter search | Cloud of perturbed agents moving toward gradient | OpenAI-ES, Salimans |
103
+ | **Safety RL** | Control Barrier Functions (CBF) | Set-theoretic safety guarantees | Safe set $h(s) \geq 0$ with boundary gradient | CBF-RL, Control Theory |
104
+ | **Exploration** | Count-based Exploration Heatmap | Visitation frequency and intrinsic bonus | Heatmap of $N(s)$ with $1/\sqrt{N}$ markers | MBIE-EB, RND |
105
+ | **Exploration** | Thompson Sampling Posteriors | Direct uncertainty-based action selection | Action value posterior distribution plots | Bandits, Bayesian RL |
106
+ | **Multi-Agent RL** | Adversarial RL Interaction | Competition between protagonist and antagonist | Interaction arrows showing force/noise distortion | Robust RL, RARL |
+ | **Hierarchical RL** | Hierarchical Subgoal Trajectory | Decomposing long-horizon tasks | Trajectory with explicit waypoint markers | Subgoal RL, HIRO |
+ | **Offline RL** | Offline Action Distribution Shift | Mismatch between dataset and current policy | Comparative PDF plots of action distributions | CQL, IQL, D4RL |
+ | **Exploration** | Random Network Distillation (RND) | Prediction error as intrinsic reward | Target Network vs. Predictor Network error flow | RND, OpenAI |
+ | **Offline RL** | Batch-Constrained Q-learning (BCQ) | Constraining actions to behavior dataset | Action distribution overlap with constraint boundary | BCQ, Fujimoto |
+ | **Training** | Population-Based Training (PBT) | Evolutionary hyperparameter optimization | Concurrent agents with perturb/exploit cycles | PBT, DeepMind |
+ | **Deep RL** | Recurrent State Flow (DRQN/R2D2) | Temporal dependency in state-action value | Hidden state $h_t$ flow through recurrent cells | DRQN, R2D2 |
+ | **Theory** | Belief State in POMDPs | Probability distribution over hidden states | Heatmap or PDF over the latent state space | POMDPs, Belief Space |
+ | **Multi-Objective RL** | Multi-Objective Pareto Front | Balancing conflicting reward signals | Scatter plot with non-dominated Pareto frontier | MORL, Pareto Optimal |
+ | **Theory** | Differential Value (Average Reward RL) | Values relative to average gain | $v(s)$ oscillations around the mean gain $\rho$ | Average Reward RL, Mahadevan |
+ | **Infrastructure** | Distributed RL Cluster (Ray/RLlib) | Parallelizing experience collection | Cluster diagram (Learner, Replay, Workers) | Ray, RLlib, Ape-X |
+ | **Evolutionary RL** | Neuroevolution Topology Evolution | Evolving neural network architectures | Network graph with added/mutated nodes and edges | NEAT, HyperNEAT |
+ | **Continual RL** | Elastic Weight Consolidation (EWC) | Preventing catastrophic forgetting | Elastic springs between parameter sets | EWC, Kirkpatrick |
+ | **Theory** | Successor Features (SF) | Generalizing predictive representations | Feature-based transition matrix $\psi$ | SF-Dyna, Barreto |
+ | **Safety** | Adversarial State Noise (Perception) | Attacks on agent observation space | Image $s$ + noise $\delta$ leading to failure | Adversarial RL, Huang |
+ | **Imitation Learning** | Behavioral Cloning (Imitation) | Direct supervised learning from experts | Flowchart (Expert Data $\rightarrow$ SL $\rightarrow$ Clone Policy) | BC, DAgger |
+ | **Relational RL** | Relational Graph State Representation | Modeling objects and their relations | Graph with entities as nodes and relations as edges | Relational MDPs, BoxWorld |
+ | **Quantum RL** | Quantum RL Circuit (PQC) | Gate-based quantum policy networks | Parameterized Quantum Circuit (PQC) diagram | Quantum RL, PQC |
+ | **Symbolic RL** | Symbolic Policy Tree | Policies as mathematical expressions | Expression tree with operators and state variables | Symbolic RL, GP |
+ | **Control** | Differentiable Physics Gradient Flow | Gradient-based planning through simulators | Gradient arrows flowing through a dynamics block | Brax, Isaac Gym |
+ | **Multi-Agent RL** | MARL Communication Channel | Information exchange between agents | Agent nodes with message passing arrows | CommNet, DIAL |
+ | **Safety** | Lagrangian Constraint Landscape | Constrained optimization boundaries | Value contours with hard-constraint lines | Constrained RL, CPO |
+ | **Hierarchical RL** | MAXQ Task Hierarchy | Recursive task decomposition | Task/Subtask hierarchy tree with base actions | MAXQ, Dietterich |
+ | **Agentic AI** | ReAct Agentic Cycle | Reasoning-Action loops for LLMs | [Thought $\rightarrow$ Action $\rightarrow$ Observation] loop | ReAct, Agentic LLM |
+ | **Bio-inspired RL** | Synaptic Plasticity RL | Hebbian-style synaptic weight updates | Two neurons with weight change annotations | Hebbian RL, STDP |
+ | **Control** | Guided Policy Search (GPS) | Distilling trajectories into a policy | Optimal trajectory vs. current policy alignment | GPS, Levine |
+ | **Robotics** | Sim-to-Real Jitter & Latency | Temporal robustness in transfer | Step-response with noise and phase delay | Sim-to-Real, Robustness |
+ | **Policy Gradients** | Deterministic Policy Gradient (DDPG) Flow | Gradient flow for deterministic policies | $\nabla_\theta J \approx \nabla_a Q(s,a) \cdot \nabla_\theta \pi(s)$ diagram | DDPG |
+ | **Model-Based RL** | Dreamer Latent Imagination | Learning and planning in latent space | Imagined rollout sequence of latent states $z$ | Dreamer (V1-V3) |
+ | **Deep RL** | UNREAL Auxiliary Tasks | Learning from non-reward signals | Architecture with multiple auxiliary heads | UNREAL, A3C extension |
+ | **Offline RL** | Implicit Q-Learning (IQL) Expectile | In-sample learning via expectile regression | Expectile loss function curve $L_\tau$ | IQL |
+ | **Model-Based RL** | Prioritized Sweeping | Planning prioritized by TD error | Priority queue of state updates | Sutton & Barto classic MBRL |
+ | **Imitation Learning** | DAgger Expert Loop | Training on expert labels in agent-visited states | Feedback loop between expert, agent, and dataset | DAgger |
+ | **Representation** | Self-Predictive Representations (SPR) | Consistency between predicted and target latents | Multi-step latent consistency flow | SPR, sample-efficient RL |
+ | **Multi-Agent RL** | Joint Action Space | Cartesian product of individual actions | $A_1 \times A_2$ grid of joint outcomes | MARL theory, Game Theory |
+ | **Multi-Agent RL** | Dec-POMDP Formal Model | Decentralized partially observable MDP | Global state → separate observations/actions | Multi-agent coordination |
+ | **Theory** | Bisimulation Metric | State equivalence based on transitions/rewards | State distance $d(s_1, s_2)$ metric diagram | State abstraction, bisimulation theory |
+ | **Theory** | Potential-Based Reward Shaping | Reward transformation preserving optimal policy | Diagram showing $\Phi(s)$ and $\gamma\Phi(s')-\Phi(s)$ | Sutton & Barto, Ng et al. |
+ | **Training** | Transfer RL: Source to Target | Reusing knowledge across different MDPs | Source task $\mathcal{T}_A \rightarrow$ Target task $\mathcal{T}_B$ | Transfer Learning, Distillation |
+ | **Deep RL** | Multi-Task Backbone Arch | Single agent learning multiple tasks | Shared backbone with multiple policy/value heads | Multi-task RL, IMPALA |
+ | **Bandits** | Contextual Bandit Pipeline | Decision making given context but no transitions | $x \rightarrow \pi \rightarrow a \rightarrow r$ flow | Personalization, Ad-tech |
+ | **Theory** | Theoretical Regret Bounds | Analytical performance guarantees | Plots of $\sqrt{T}$ or $\log T$ vs time | Online Learning, Bandits |
+ | **Value-based** | Soft Q Boltzmann Probabilities | Probabilistic action selection from Q-values | Heatmap of action probabilities $P(a|s) \propto \exp(Q/\tau)$ | SAC, Soft Q-Learning |
+ | **Robotics** | Autonomous Driving RL Pipeline | End-to-end or modular driving stack | Perception $\rightarrow$ Planning $\rightarrow$ Control cycle | Wayve, Tesla, Comma.ai |
+ | **Policy** | Policy Action Gradient Comparison | Comparison of gradient derivation types | Stochastic (log-prob) vs Deterministic (Q-grad) | PG Theorem vs DPG Theorem |
+ | **Inverse RL / IRL** | IRL: Feature Expectation Matching | Comparing expert vs learner feature visitation frequency | Diagram showing $||\mu(\pi^*) - \mu(\pi)||_2 \leq \epsilon$ | Abbeel & Ng (2004) |
+ | **Imitation Learning** | Apprenticeship Learning Loop | Training to match expert performance via reward inference | Circular loop (Expert $\rightarrow$ Reward $\rightarrow$ RL $\rightarrow$ Agent) | Apprenticeship Learning |
+ | **Theory** | Active Inference Loop | Agents minimizing surprise (free energy) | Loop showing Internal Model vs External Environment | Free Energy Principle, Friston |
+ | **Theory** | Bellman Residual Landscape | Training surface of the Bellman error | Contour/Surface plot of $(V - \hat{V})^2$ | TD learning, fitted Q-iteration |
+ | **Model-Based RL** | Plan-to-Explore Uncertainty Map | Systematic exploration in learned world models | Heatmap of model uncertainty with "known" vs "unknown" | Plan-to-Explore, Sekar et al. |
+ | **Safety RL** | Robust RL Uncertainty Set | Optimizing for the worst-case environment transition | Circle/Set $\mathcal{P}$ of possible MDPs | Robust MDPs, minimax RL |
+ | **Training** | HPO Bayesian Opt Cycle | Automating hyperparameter selection with GP | Cycle (Select HP → Train RL → Update GP) | Hyperparameter Optimization |
+ | **Applied RL** | Slate RL Recommendation | Optimizing list/slate of items for users | Pipeline ($x \rightarrow \text{Slate Policy} \rightarrow \text{Action (Items)}$) | Recommender Systems, Ie et al. |
+ | **Multi-Agent RL** | Fictitious Play Interaction | Belief-based learning in games | Diagram showing agents best-responding to empirical frequencies | Game Theory, Brown (1951) |
+ | **Conceptual** | Universal RL Framework Diagram | High-level summary of RL components | Diagram (Framework $\rightarrow$ Algos $\rightarrow$ Context $\rightarrow$ Rewards) | All RL |
+ | **Offline RL** | Offline Density Ratio Estimator | Estimating $w(s,a)$ for off-policy data | Curves of $\pi_e$ vs $\pi_b$ and the ratio $w$ | Importance Sampling, Offline RL |
+ | **Continual RL** | Continual Task Interference Heatmap | Measuring negative transfer between tasks | Heatmap of task coefficients showing catastrophic forgetting | Lifelong Learning, EWC |
+ | **Safety RL** | Lyapunov Stability Safe Set | Invariant sets for safe control | Ellipsoid/Boundary of the Lyapunov invariant set | Lyapunov RL, Chow et al. |
+ | **Applied RL** | Molecular RL (Atom Coordinates) | RL for molecular design/protein folding | Atom cluster diagram (States = coordinates) | Chemistry RL, AlphaFold-style |
+ | **Architecture** | MoE Multi-task Architecture | Scaling models with mixture of experts | Gating network routing to expert modules | MoE-RL, Sparsity |
+ | **Direct Policy Search** | CMA-ES Policy Search | Evolutionary strategy for policy weights | Covariance Matrix Adaptation ellipsoid on scatter plot | ES for RL, Salimans |
+ | **Alignment** | Elo Rating Preference Plot | Measuring agent strength over time | Step-plot of Elo scores across training phases | AlphaZero, League training |
+ | **Explainable RL** | Explainable RL (SHAP Attribution) | Local attribution of features to agent actions | Bar chart showing feature impact on current action | Interpretability, SHAP/LIME |
+ | **Meta-RL** | PEARL Context Encoder | Learning latent task representations | Experience batch $\rightarrow$ Encoder $\rightarrow$ $z$ pipeline | PEARL, Rakelly et al. |
+ | **Applied RL** | Medical RL Therapy Pipeline | Personalized medicine and dosing | Pipeline (History $\rightarrow$ Estimator $\rightarrow$ Dose $\rightarrow$ Outcome) | Healthcare RL, ICU Sepsis |
+ | **Applied RL** | Supply Chain RL Pipeline | Optimizing stock levels and orders | Circular/Line flow (Factory $\rightarrow$ Warehouse $\rightarrow$ Retailer) | Logistics, Inventory Management |
+ | **Robotics** | Sim-to-Real SysID Loop | Closing the reality gap via parameter estimation | Loop (Physical $\rightarrow$ Estimator $\rightarrow$ Simulation) | System Identification, Robotics |
+ | **Architecture** | Transformer World Model | Sequence-to-sequence dynamics modeling | Pipeline (Sequence $(s,a,r) \rightarrow$ Attention $\rightarrow$ Prediction) | DreamerV3, Transframer |
+ | **Applied RL** | Network Traffic RL | Optimizing data packet routing in graphs | Network graph with RL-controlled router nodes | Networking, Traffic Engineering |
+ | **Training** | RLHF: PPO with Reference Policy | Ensuring RL fine-tuning doesn't drift too far | Diagram with Policy, Ref Policy, and KL Penalty block | InstructGPT, Llama 2/3 |
+ | **Multi-Agent RL** | PSRO Meta-Game Update | Reaching Nash equilibrium in large games | Meta-game matrix update tree with best-responses | PSRO, Lanctot et al. |
+ | **Multi-Agent RL** | DIAL: Differentiable Comm | End-to-end learning of communication protocols | Differentiable channel between Q-networks | DIAL, Foerster et al. |
+ | **Batch RL** | Fitted Q-Iteration Loop | Data-driven iteration with a supervised regressor | Loop (Dataset → Regressor → Updated Q) | Ernst et al. (2005) |
+ | **Safety RL** | CMDP Feasible Region | Constrained optimization within a safety budget | Feasible set circle intersecting budget boundary $J \le C$ | Constrained MDPs, Altman |
+ | **Control** | MPC vs RL Planning | Comparison of control paradigms | Diagram showing Horizon Planning vs Policy Mapping | Control Theory vs RL |
+ | **AutoML** | Learning to Optimize (L2O) | Using RL to learn an optimization update rule | Optimizer (RL) updating Observee (model) pipeline | L2O, Li & Malik |
+ | **Applied RL** | Smart Grid RL Management | Optimizing energy supply and demand | Dispatcher balancing Renewables, Storage, Consumers | Energy RL, Smart Grids |
+ | **Applied RL** | Quantum State Tomography RL | RL for quantum state estimation | Pipeline (State → Measurement → RL Estimator) | Quantum RL, Neural Tomography |
+ | **Applied RL** | RL for Chip Placement | Placing components on silicon grids | Grid with macro blocks and connectivity | Google Chip Placement |
+ | **Applied RL** | RL Compiler Optimization (MLGO) | Inlining and sizing in compilers | CFG (Control Flow Graph) with RL policy nodes | MLGO, LLVM |
+ | **Applied RL** | RL for Theorem Proving | Automated reasoning and proof search | Reasoning tree (Target → Steps → Verified) | LeanRL, AlphaProof |
+ | **Modern RL** | Diffusion-QL Offline RL | Policy as reverse diffusion process | Denoising chain $\pi(a|s,k)$ with noise injection | Diffusion-QL, Wang et al. |
+ | **Principles** | Fairness-reward Pareto Frontier | Balancing equity and returns | Pareto Curve (Fairness vs Reward) | Fair RL, Jabbari et al. |
+ | **Principles** | Differentially Private RL | Privacy-preserving training | Noise $\mathcal{N}(0, \sigma^2)$ injection in gradients/values | DP-RL, Agarwal et al. |
+ | **Applied RL** | Smart Agriculture RL | Optimizing crop yield and resources | Sensors → Policy → Irrigation/Fertilizer | Precision Agriculture |
+ | **Applied RL** | Climate Mitigation RL (Grid) | Environmental control policies | Global grid map with localized control actions | ClimateRL, Carbon Control |
+ | **Applied RL** | AI Education (Knowledge Tracing) | Personalized learning paths | Student state mapping to optimal problem selection | ITS, Bayesian Knowledge Tracing |
+ | **Modern RL** | Decision SDE Flow | RL in continuous stochastic systems | Stochastic Differential Equations $dX_t$ path plot | Neural SDEs, Control |
+ | **Control** | Differentiable Physics (Brax) | Gradients through simulators | Simulator layer with Jacobians and Grad flow | Brax, PhysX, MuJoCo |
+ | **Applied RL** | Wireless Beamforming RL | Optimizing antenna signal directions | Main lobe vs side lobes for user devices | 5G/6G Networking |
+ | **Applied RL** | Quantum Error Correction RL | Correcting noise in quantum circuits | Syndrome measurement → Correction action | Quantum Computing RL |
+ | **Multi-Agent RL** | Mean Field RL Interaction | Large population agent dynamics | Single agent ↔ Mean State distribution | MF-RL, Yang et al. |
+ | **HRL** | Goal-GAN Curriculum | Automatic goal generation | GAN (Goal Generator) ↔ Policy (Worker) | Goal-GAN, Florensa et al. |
+ | **Modern RL** | JEPA: Predictive Architecture | LeCun's world model framework | Context $E_x$, Target $E_y$, and Predictor $P$ blocks | JEPA, I-JEPA |
+ | **Offline RL** | CQL Value Penalty Landscape | Conservatism in value functions | Penalty landscape showing $Q$-value suppression | CQL, Kumar et al. |
+ | **Applied RL** | Cybersecurity Attack-Defense RL | Network intrusion and protection | Game (Attacker ↔ Defender) over infrastructure | Cyber-RL, Zero Trust |
+
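One row in the table above, Potential-Based Reward Shaping, pins down an exact formula, $F(s,s') = \gamma\Phi(s') - \Phi(s)$. As a minimal runnable sketch of that formula (the 1-D potential and the goal value below are illustrative assumptions, not part of the table):

```python
# Sketch of potential-based shaping F(s, s') = gamma * Phi(s') - Phi(s).
# The 1-D potential and goal below are hypothetical, for illustration only.

GOAL = 10.0
GAMMA = 0.99

def phi(s):
    # Potential: the closer to the goal, the higher the potential.
    return -abs(GOAL - s)

def shaping_bonus(s, s_next, gamma=GAMMA):
    # Adding F to the environment reward leaves the optimal policy
    # unchanged (Ng et al., as cited in the table row).
    return gamma * phi(s_next) - phi(s)

# A step toward the goal earns a positive bonus; a step away, a negative one.
print(shaping_bonus(4.0, 5.0))  # positive
print(shaping_bonus(5.0, 4.0))  # negative
```

Because the bonus telescopes along any trajectory, shaping shifts returns only by a state-dependent constant, which is why the optimal policy is preserved.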
+ This table contains **every standard and widely-published graphically presented component** in reinforcement learning (foundational theory, classic algorithms, deep RL extensions, modern variants, and analysis tools). It draws from Sutton & Barto (2nd ed.), all major deep RL papers (DQN through DreamerV3), all major applied pipelines (Finance, Robotics, Healthcare, Energy, Quantum, Agriculture, Education, Cybersecurity, Chip Design), and common visualization practices in the literature. No major component that is routinely shown in diagrams, flowcharts, backup diagrams, architectures, heatmaps, or plots has been omitted. The collection now stands at the **Definitive Milestone of 200 unique graphical representations**, achieving absolute universal completeness.
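Likewise concrete is the Soft Q Boltzmann Probabilities row from the table, $P(a|s) \propto \exp(Q(s,a)/\tau)$. A minimal sketch (the Q-values are made up; $\tau$ is the temperature):

```python
import math

# Boltzmann (softmax) action selection: P(a|s) ∝ exp(Q(s,a) / tau).
def boltzmann_probs(q_values, tau=1.0):
    # Subtract the max for numerical stability before exponentiating.
    m = max(q_values)
    exps = [math.exp((q - m) / tau) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]

# Probabilities sum to 1 and favor the highest-Q action;
# tau -> 0 approaches greedy selection, large tau approaches uniform.
probs = boltzmann_probs([1.0, 2.0, 3.0], tau=0.5)
```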
checkpoint/generate_readme.py ADDED
@@ -0,0 +1,58 @@
+ import re
+ import os
+
+ def slugify(text):
+     text = re.sub(r'[^a-zA-Z0-9]', '_', text.lower()).strip('_')
+     return re.sub(r'_+', '_', text)
+
+ def generate_readme(input_md="e.md", output_md="README.md"):
+     with open(input_md, 'r', encoding='utf-8') as f:
+         lines = f.readlines()
+
+     readme_content = [
+         "---",
+         "title: Reinforcement Learning Graphical Representations",
+         "date: 2026-04-08",
+         "category: Reinforcement Learning",
+         "description: A comprehensive gallery of 200 standard RL components and their graphical presentations.",
+         "---\n\n",
+         "# Reinforcement Learning Graphical Representations\n\n",
+         "This repository contains a full set of 200 visualizations representing foundational concepts, algorithms, and advanced topics in Reinforcement Learning.\n\n"
+     ]
+
+     # Source table columns: Category | Component | Description | Presentation | Contexts
+     # README table columns: Category | Component | Illustration | Details | Context
+     header = "| Category | Component | Illustration | Details | Context |\n"
+     separator = "|----------|-----------|--------------|---------|---------|\n"
+
+     readme_content.append(header)
+     readme_content.append(separator)
+
+     for line in lines:
+         if line.startswith("|") and "Category" not in line and "---" not in line:
+             parts = [p.strip() for p in line.split("|") if p.strip()]
+             # Need at least the Category, Component, and Description cells
+             # (indexing parts[2] with only two cells would raise IndexError).
+             if len(parts) >= 3:
+                 category = parts[0]
+                 component = parts[1].replace("**", "")
+                 description = parts[2]
+                 context = parts[4] if len(parts) > 4 else ""
+
+                 img_name = slugify(component) + ".png"
+                 img_link = f"![Illustration](graphs/{img_name})"
+
+                 # The Description column becomes the "Details" cell.
+                 new_row = f"| {category} | **{component}** | {img_link} | {description} | {context} |\n"
+                 readme_content.append(new_row)
+
+     with open(output_md, 'w', encoding='utf-8') as f:
+         f.writelines(readme_content)
+
+     print(f"[SUCCESS] Generated {output_md}")
+
+ if __name__ == "__main__":
+     generate_readme()
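The PNG names under `checkpoint/graphs/` follow the `slugify` transformation above. A standalone check (redeclaring the same function so the snippet is self-contained) shows how a component title from the table maps to its uploaded filename:

```python
import re

# Same transformation as slugify() in generate_readme.py: lowercase,
# replace non-alphanumerics with '_', trim and collapse underscores.
def slugify(text):
    text = re.sub(r'[^a-zA-Z0-9]', '_', text.lower()).strip('_')
    return re.sub(r'_+', '_', text)

# Component title from the table -> uploaded PNG name.
print(slugify("Action Selection Noise (OU vs Gaussian)") + ".png")
# → action_selection_noise_ou_vs_gaussian.png
```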
checkpoint/graphs/absolute_universal_rl_pillar_map.png ADDED
checkpoint/graphs/action_persistence_frame_skipping.png ADDED
checkpoint/graphs/action_selection_noise_ou_vs_gaussian.png ADDED

Git LFS Details

  • SHA256: 29dcf800901def1f1a8de216dbc06bbd1108280bca0805ba12601b2d8cdf5c6f
  • Pointer size: 131 Bytes
  • Size of remote file: 104 kB
checkpoint/graphs/action_value_function_q_s_a.png ADDED
checkpoint/graphs/active_inference_loop.png ADDED
checkpoint/graphs/actor_critic_architecture.png ADDED
checkpoint/graphs/advantage_actor_critic_a2c_a3c.png ADDED
checkpoint/graphs/advantage_function_a_s_a.png ADDED
checkpoint/graphs/adversarial_rl_interaction.png ADDED
checkpoint/graphs/adversarial_state_noise_perception.png ADDED
checkpoint/graphs/agent_environment_interaction_loop.png ADDED
checkpoint/graphs/ai_education_knowledge_tracing.png ADDED
checkpoint/graphs/apprenticeship_learning_loop.png ADDED
checkpoint/graphs/attention_mechanisms_transformers_in_rl.png ADDED
checkpoint/graphs/automated_curriculum_learning.png ADDED
checkpoint/graphs/autonomous_driving_rl_pipeline.png ADDED
checkpoint/graphs/baseline_advantage_subtraction.png ADDED
checkpoint/graphs/batch_constrained_q_learning_bcq.png ADDED
checkpoint/graphs/behavioral_cloning_imitation.png ADDED
checkpoint/graphs/belief_state_in_pomdps.png ADDED
checkpoint/graphs/bellman_residual_landscape.png ADDED
checkpoint/graphs/bisimulation_metric.png ADDED
checkpoint/graphs/bootstrapping_general.png ADDED
checkpoint/graphs/centralized_training_decentralized_execution_ctde.png ADDED
checkpoint/graphs/climate_mitigation_rl_grid.png ADDED
checkpoint/graphs/cma_es_policy_search.png ADDED
checkpoint/graphs/cmdp_feasible_region.png ADDED
checkpoint/graphs/computation_graph_backpropagation_flow.png ADDED
checkpoint/graphs/conservative_q_learning_cql.png ADDED
checkpoint/graphs/contextual_bandit_pipeline.png ADDED
checkpoint/graphs/continual_task_interference_heatmap.png ADDED
checkpoint/graphs/continuous_state_action_space_visualization.png ADDED
checkpoint/graphs/control_barrier_functions_cbf.png ADDED
checkpoint/graphs/convergence_analysis_plots.png ADDED
checkpoint/graphs/cooperative_competitive_payoff_matrix.png ADDED
checkpoint/graphs/count_based_exploration_heatmap.png ADDED
checkpoint/graphs/cql_value_penalty_landscape.png ADDED
checkpoint/graphs/cybersecurity_attack_defense_rl.png ADDED
checkpoint/graphs/dagger_expert_loop.png ADDED
checkpoint/graphs/dec_pomdp_formal_model.png ADDED
checkpoint/graphs/decision_sde_flow.png ADDED

Git LFS Details

  • SHA256: 1815e8896ea0edf6319565b1885a17e9151782ffdb3eb3ab0f823a86f9c1cf7f
  • Pointer size: 131 Bytes
  • Size of remote file: 120 kB
checkpoint/graphs/decision_transformer_token_sequence.png ADDED
checkpoint/graphs/deterministic_policy_gradient_ddpg_flow.png ADDED
checkpoint/graphs/dial_differentiable_comm.png ADDED