---
title: Explainer Env Environment Server
emoji: 💻
colorFrom: pink
colorTo: gray
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
---
# Research -> Interactive Explainer Environment
An OpenEnv RL environment that trains small language models to create interactive educational content. Given a STEM topic, the agent explores with explicit research tools, generates a Marimo reactive notebook or Manim math animation, and gets one repair attempt if lint/build validation fails.
## Episode Flow

```
reset() --> topic + tier assigned
      |
explore x 0..3 --> choose research tools + queries
      |
generate x 1 --> produce marimo/manim code
      |
repair x 0..1 --> fix lint/build errors if needed --> episode ends
```
## Actions

**Explore** -- search for information relevant to the assigned topic:

```python
ExplainerAction(
    action_type="explore",
    tool="search_arxiv",
    query="merge sort divide and conquer visual explanation",
    intent="find examples and visual intuition",
)
```
Available tools: `search_wikipedia`, `search_hf_papers`, `search_arxiv`, `search_scholar`, `fetch_docs`, and `search_hf_hub`.
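Since the explore budget allows up to three steps, a typical policy mixes tools across queries. A minimal sketch using the same client API as Quick Start below; the tool/query pairings are illustrative, not prescriptive:

```python
from client import ExplainerEnv
from models import ExplainerAction

with ExplainerEnv(base_url="http://localhost:8000").sync() as sc:
    result = sc.reset()
    # Illustrative tool/query choices within the 0..3 explore budget
    queries = [
        ("search_wikipedia", result.observation.topic, "overview"),
        ("search_arxiv", f"{result.observation.topic} visual explanation", "visual intuition"),
        ("fetch_docs", "marimo reactive notebooks", "output-format reference"),
    ]
    for tool, query, intent in queries:
        result = sc.step(ExplainerAction(
            action_type="explore", tool=tool, query=query, intent=intent,
        ))
        print(f"{tool}: reward={result.reward:.3f}")
```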
**Generate** -- produce educational code using accumulated research:

```python
ExplainerAction(
    action_type="generate",
    format="marimo",  # or "manim"
    code="import marimo...",
    narration="...",  # manim only
)
```
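For a concrete payload, here is a complete generate action whose notebook body mirrors the minimal marimo shape used in Quick Start below (the topic is illustrative):

```python
from models import ExplainerAction

# Notebook body follows the minimal marimo structure from Quick Start:
# module-level app, one cell, explicit return.
notebook = (
    "import marimo as mo\n"
    "app = mo.App()\n"
    "@app.cell\n"
    "def _():\n"
    "    mo.md('# Merge sort, step by step')\n"
    "    return\n"
)
action = ExplainerAction(action_type="generate", format="marimo", code=notebook)
```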
**Repair** -- revise generated code using lint/build feedback:

```python
ExplainerAction(
    action_type="repair",
    format="marimo",
    code="import marimo...",
    repair_notes="fixed the reported Marimo validation error",
)
```
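Putting the two together, a plausible generate-then-repair flow is sketched below. Note the assumptions: that a failed validation leaves the episode open (`done=False`) so the single repair step can fire, and that `draft`/`fixed_draft` are placeholder code strings, not real notebooks:

```python
from client import ExplainerEnv
from models import ExplainerAction

draft = "import marimo as mo\n..."        # placeholder first attempt
fixed_draft = "import marimo as mo\n..."  # placeholder revision after lint feedback

with ExplainerEnv(base_url="http://localhost:8000").sync() as sc:
    sc.reset()
    result = sc.step(ExplainerAction(
        action_type="generate", format="marimo", code=draft,
    ))
    # ASSUMPTION: a failed lint/build leaves the episode open (done=False)
    if not result.done:
        result = sc.step(ExplainerAction(
            action_type="repair",
            format="marimo",
            code=fixed_draft,
            repair_notes="fixed the reported Marimo validation error",
        ))
    print(f"final reward={result.reward:.3f}, done={result.done}")
```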
## Reward System

Multi-component rewards across exploration, generation, and repair. See `rewards/README.md` for the full breakdown.

- **Exploration (per-step):** tool choice, query quality, source quality, coverage delta, novelty, diversity, gated by information sufficiency. A step cost of -0.05 forces the agent to justify each search.
- **Generation/repair:** keyword coverage, format match, structural quality (via the `marimo check` CLI or manim scene analysis), narration (manim only), context usage, and repair success.
- **Key design:** the `marimo check` CLI catches 5 breaking rules (MB001-MB005) in ~100ms. Code that doesn't parse scores 0. Code that parses but doesn't execute gets quality * 0.4. A minimal sketch of these gates follows.
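A minimal sketch of the scoring gates, assuming a plain AST parse check stands in for whatever the sandbox actually runs (the real logic lives in `rewards/sandbox.py`):

```python
import ast

def gate_generation_score(code: str, quality: float, executed_ok: bool) -> float:
    """Sketch of the gates described above, not the actual implementation."""
    try:
        ast.parse(code)        # code that doesn't parse scores 0
    except SyntaxError:
        return 0.0
    if not executed_ok:        # parses but fails to execute: quality * 0.4
        return quality * 0.4
    return quality             # full structural-quality score
```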
## Quick Start

```bash
# Install & run locally
cd explainer_env && uv sync
uv run server  # http://localhost:8000

# Client usage
python -c "
from client import ExplainerEnv
from models import ExplainerAction

with ExplainerEnv(base_url='http://localhost:8000').sync() as sc:
    result = sc.reset()
    print(f'Topic: {result.observation.topic}, Tier: {result.observation.tier}')

    # Explore
    result = sc.step(ExplainerAction(
        action_type='explore',
        tool='search_wikipedia',
        query=result.observation.topic,
        intent='overview',
    ))
    print(f'Explore reward: {result.reward:.3f}')

    # Generate
    result = sc.step(ExplainerAction(
        action_type='generate',
        format='marimo',
        code='import marimo as mo\napp = mo.App()\n@app.cell\ndef _():\n    mo.md(\"# Hello\")\n    return\n',
    ))
    print(f'Generate reward: {result.reward:.3f}, done: {result.done}')
"
```
## Concurrent WebSocket Sessions

The server supports multiple concurrent WebSocket connections for parallel GRPO training rollouts:

```python
from concurrent.futures import ThreadPoolExecutor

from client import ExplainerEnv
from models import ExplainerAction

def run_episode(client_id: int):
    with ExplainerEnv(base_url="http://localhost:8000").sync() as sc:
        result = sc.reset()
        result = sc.step(ExplainerAction(
            action_type="explore",
            tool="search_wikipedia",
            query=result.observation.topic,
        ))
        result = sc.step(ExplainerAction(
            action_type="generate", format="marimo",
            code="import marimo as mo\napp = mo.App()\n@app.cell\ndef _():\n    return\n",
        ))
        return client_id, result.reward

with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(run_episode, range(4)))
```
## API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/reset` | POST | Start a new episode, get topic assignment |
| `/step` | POST | Submit an action, get observation + reward |
| `/state` | GET | Current episode state |
| `/schema` | GET | Action/Observation JSON schemas |
| `/ws` | WebSocket | Low-latency session for training |
| `/docs` | GET | Interactive API docs |
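For quick inspection without the WebSocket client, the HTTP endpoints can be hit directly. A sketch using `requests` (assumes the server from Quick Start is running; response shapes are whatever `/schema` reports, not assumed here):

```python
import requests

BASE = "http://localhost:8000"

# Inspect the action/observation schemas before hand-rolling a client
print(requests.get(f"{BASE}/schema").json())

# Start an episode over plain HTTP; /step accepts the same action
# fields shown under Actions above
print(requests.post(f"{BASE}/reset").json())
```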
## Task Bank

26 tasks across 4 categories (ML, Math, Algorithms, Statistics), 3 difficulty levels (easy, medium, hard), and 3 audience tiers (beginner, intermediate, advanced). Each task specifies keywords for reward scoring and an optional preferred output format.
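The field names below are hypothetical, inferred only from the description above; the actual schema lives in `task_bank.py`:

```python
# HYPOTHETICAL task shape inferred from the description above;
# see task_bank.py for the real field names.
task = {
    "topic": "merge sort",
    "category": "Algorithms",      # ML, Math, Algorithms, Statistics
    "difficulty": "easy",          # easy, medium, hard
    "tier": "beginner",            # beginner, intermediate, advanced
    "keywords": ["divide and conquer", "recursion", "merge step"],
    "preferred_format": "marimo",  # optional: "marimo" or "manim"
}
```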
## File Structure

```
explainer_env/
├── server/
│   ├── explainer_env_environment.py   # Environment logic (reset/step/state)
│   ├── app.py                         # FastAPI server (create_app)
│   └── Dockerfile                     # Multi-stage Docker build
├── rewards/
│   ├── exploration.py                 # Explore-phase reward components
│   ├── generation.py                  # Generate/repair reward components
│   ├── sources.py                     # Compatibility wrapper for research tools
│   ├── sandbox.py                     # Code validation (marimo check, AST, execution)
│   └── README.md                      # Reward system documentation
├── research/                          # Research tools, structured results, retrieval
├── models.py                          # ExplainerAction, ExplainerObservation
├── task_bank.py                       # 26 curated STEM tasks
├── client.py                          # ExplainerEnv WebSocket client
└── openenv.yaml                       # OpenEnv manifest
```