kgdrathan committed on
Commit 1eaaf1d · verified · 1 Parent(s): 5869d56

Upload folder using huggingface_hub

Files changed (4)
  1. README.md +11 -0
  2. __marimo__/session/a.py.json +0 -0
  3. a.html +0 -0
  4. a.py +166 -0
README.md CHANGED
@@ -13,6 +13,17 @@ tags:
  - RL
  ---

+ <p align="center">
+ <a href="https://kgdrathan-explainer-env-dashboard.hf.space/">
+ <strong>Open the live dashboard</strong>
+ </a>
+ </p>
+
+ <p align="center">
+ <a href="https://kgdrathan-explainer-env-dashboard.hf.space/">
+ https://kgdrathan-explainer-env-dashboard.hf.space/
+ </a>
+ </p>

  <p align="center">
  <span style="font-size:2.2em; font-weight:bold;">See. Interact. Understand.</span>
__marimo__/session/a.py.json CHANGED
The diff for this file is too large to render. See raw diff
 
a.html ADDED
The diff for this file is too large to render. See raw diff
 
a.py CHANGED
@@ -0,0 +1,166 @@
+ import marimo as mo
+ import numpy as np
+ import pandas as pd
+ import matplotlib.pyplot as plt
+ from matplotlib.patches import Rectangle
+
+ # Shared variables
+ app = mo.App()
+
+
+ @app.cell
+ def _(mo):
+     mo.md("""
+     # Reinforcement Learning Basics
+
+     Reinforcement Learning (RL) is a type of machine learning where an **agent** learns to make decisions by interacting with an **environment**. The goal is to learn a **policy** that maximizes the cumulative reward over time.
+
+     ## Core Concepts
+
+     1. **Agent**: The learner/decision-maker
+     2. **Environment**: Everything outside the agent
+     3. **State (s)**: The current situation
+     4. **Action (a)**: What the agent can do
+     5. **Reward (r)**: Feedback from environment
+     6. **Policy (π)**: Strategy that agents use to decide actions
+     7. **Value Function (V)**: How good it is to be in a state
+     8. **Q-Function (Q)**: How good it is to take an action in a state
+     9. **Bellman Equation**: Relationship between value functions at different time steps
+     """)
+     return
+
+
+ @app.cell
+ def _(mo):
+     # Simple grid world example
+     grid_size = 4
+     start = (0, 0)
+     goal = (3, 3)
+     obstacles = [(1, 1), (1, 2)]
+
+     # Create a simple visualization
+     _fig, _ax = plt.subplots(figsize=(6, 6))
+     _ax.set_xlim(0, grid_size)
+     _ax.set_ylim(0, grid_size)
+     _ax.set_xticks(range(grid_size))
+     _ax.set_yticks(range(grid_size))
+     _ax.grid(True)
+
+     # Draw obstacles
+     for obs in obstacles:
+         rect = Rectangle((obs[0], obs[1]), 1, 1, facecolor="black", alpha=0.7)
+         _ax.add_patch(rect)
+
+     # Draw start and goal
+     start_rect = Rectangle(start, 1, 1, facecolor="green", alpha=0.7)
+     goal_rect = Rectangle(goal, 1, 1, facecolor="red", alpha=0.7)
+     _ax.add_patch(start_rect)
+     _ax.add_patch(goal_rect)
+
+     _ax.text(start[0] + 0.5, start[1] + 0.5, "Start", ha="center", va="center")
+     _ax.text(goal[0] + 0.5, goal[1] + 0.5, "Goal", ha="center", va="center")
+
+     _ax.set_title("Simple Grid World Example")
+     _ax.invert_yaxis() # To match standard grid coordinates
+
+     mo.ui.matplotlib(_fig)
+     plt.close(_fig)
+     return
+
+
+ @app.cell
+ def _(mo):
+     mo.md("""
+     ## How It Works
+
+     The agent interacts with the environment in episodes:
+
+     1. **Observe State (s)**: Agent senses its current situation
+     2. **Choose Action (a)**: Based on policy π(a|s)
+     3. **Environment Transitions**: Move to new state s'
+     4. **Receive Reward (r)**: Immediate feedback
+     5. **Update Knowledge**: Learn from experience
+
+     The goal is to maximize expected cumulative discounted reward:
+
+     $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$
+
+     Where γ ∈ [0,1] is the discount factor.
+     """)
+     return
+
+
+ @app.cell
+ def _(mo):
+     # Value function explanation
+     mo.md("""
+     ### Value Functions
+
+     The **Value Function V(s)** represents how good it is to be in a state:
+
+     $V(s) = \mathbb{E}[G_t | S_t = s]$
+
+     The **Q-Function Q(s,a)** represents how good it is to take an action in a state:
+
+     $Q(s,a) = \mathbb{E}[G_t | S_t = s, A_t = a]$
+
+     These functions help the agent evaluate the long-term reward of states and actions.
+     """)
+     return
+
+
+ @app.cell
+ def _(mo):
+     # Bellman Equation explanation
+     mo.md("""
+     ### Bellman Equation
+
+     The Bellman equation expresses the relationship between the value of a state and the values of subsequent states:
+
+     $V(s) = \max_a \sum_{s'} P(s'|s,a)[r(s,a,s') + \gamma V(s')]$
+
+     And for Q-values:
+
+     $Q(s,a) = \sum_{s'} P(s'|s,a)[r(s,a,s') + \gamma \max_{a'} Q(s',a')]$
+
+     These equations are fundamental to solving RL problems iteratively.
+     """)
+     return
+
+
+ @app.cell
+ def _(mo):
+     # Policy definition
+     mo.md("""
+     ### Policy π(a|s)
+
+     A policy defines the behavior of an agent. It's a mapping from states to probabilities of selecting each possible action.
+
+     For example, a stochastic policy could be:
+
+     $\pi(a|s) = \text{Probability of taking action } a \text{ in state } s$
+
+     The goal is to find an optimal policy π* that maximizes expected cumulative reward.
+     """)
+     return
+
+
+ @app.cell
+ def _(mo):
+     # Interactive elements
+     mo.md("""
+     ## Try It Yourself!
+
+     Below is an interactive grid world. You can visualize how an agent might navigate from start to goal while avoiding obstacles.
+
+     ### Next Steps
+
+     - Understand how rewards influence agent behavior
+     - Explore how policies change based on learning
+     - Study how value functions converge over time
+     """)
+     return
+
+
+ if __name__ == "__main__":
+     app.run()
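
Note on the a.py diff: the notebook states the Bellman optimality equation and draws a 4x4 grid world, but it does not solve that grid. The standalone sketch below is not part of this commit. It shows one way the equation could be applied to the same layout (`grid_size = 4`, `start`, `goal`, `obstacles` as in a.py) using value iteration. Everything else is an assumption made for illustration: deterministic moves, a step reward of -1, a goal reward of +10, a discount factor of 0.9, blocked moves leaving the agent in place, and all function names.

```python
# Grid layout copied from a.py; everything else below is an assumption.
GRID_SIZE = 4
START = (0, 0)
GOAL = (3, 3)
OBSTACLES = {(1, 1), (1, 2)}

# Assumed for illustration: deterministic moves, reward values, discount factor.
ACTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}
STEP_REWARD = -1.0
GOAL_REWARD = 10.0
GAMMA = 0.9


def step(state, action):
    """Return (next_state, reward); blocked moves leave the agent in place."""
    dx, dy = ACTIONS[action]
    nxt = (state[0] + dx, state[1] + dy)
    in_bounds = 0 <= nxt[0] < GRID_SIZE and 0 <= nxt[1] < GRID_SIZE
    if not in_bounds or nxt in OBSTACLES:
        nxt = state
    reward = GOAL_REWARD if nxt == GOAL else STEP_REWARD
    return nxt, reward


def value_iteration(tol=1e-8):
    """Repeat the Bellman optimality backup until the values stop changing."""
    states = [(x, y) for x in range(GRID_SIZE) for y in range(GRID_SIZE)
              if (x, y) not in OBSTACLES]
    values = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if s == GOAL:  # terminal state: its value stays 0
                continue
            # Deterministic case of V(s) = max_a [r(s,a,s') + gamma * V(s')]
            best = max(r + GAMMA * values[n]
                       for n, r in (step(s, a) for a in ACTIONS))
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            return values


def greedy_policy(values):
    """Choose the action with the best one-step lookahead in each state."""
    policy = {}
    for s in values:
        if s == GOAL:
            continue
        policy[s] = max(
            ACTIONS,
            key=lambda a: step(s, a)[1] + GAMMA * values[step(s, a)[0]],
        )
    return policy


def discounted_return(rewards, gamma=GAMMA):
    """G_t = sum_k gamma^k * r_{t+k+1} for a finite reward sequence."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))


def rollout(policy, max_steps=50):
    """Follow the policy from START and collect the rewards received."""
    state, rewards = START, []
    for _ in range(max_steps):
        if state == GOAL:
            break
        state, r = step(state, policy[state])
        rewards.append(r)
    return state, rewards


if __name__ == "__main__":
    values = value_iteration()
    final_state, rewards = rollout(greedy_policy(values))
    print("reached:", final_state, "steps:", len(rewards))
    print("return from start:", discounted_return(rewards))
```

Because the assumed transitions are deterministic, the sum over `s'` in the notebook's equation collapses to a single term, so the discounted return of the greedy rollout should match `values[START]`. With stochastic transitions the backup would need the full expectation over `P(s'|s,a)`.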