---
title: Explainer Env Environment Server
emoji: 💻
colorFrom: pink
colorTo: gray
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
---

# Research -> Interactive Explainer Environment

An OpenEnv RL environment that trains small language models to create interactive educational content. Given a STEM topic, the agent explores with explicit research tools, generates a Marimo reactive notebook or Manim math animation, and gets one repair attempt if lint/build validation fails.

## Episode Flow

```text
reset() --> topic + tier assigned
  |
explore x 0..3 --> choose research tools + queries
  |
generate x 1 --> produce marimo/manim code
  |
repair x 0..1 --> fix lint/build errors if needed --> episode ends
```
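The per-episode action budgets in the diagram (explore 0-3, generate exactly 1, repair at most 1) can be sketched as a simple guard. Names here are hypothetical; the server enforces the real limits.

```python
# Hypothetical budget check, mirroring the episode flow above:
# explore 0-3 times, generate once, repair at most once.
LIMITS = {"explore": 3, "generate": 1, "repair": 1}

def action_allowed(action_type: str, counts: dict) -> bool:
    """True if another action of this type fits the episode budget."""
    return counts.get(action_type, 0) < LIMITS[action_type]
```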

## Actions

**Explore** searches for information relevant to the assigned topic:

```python
ExplainerAction(
    action_type="explore",
    tool="search_arxiv",
    query="merge sort divide and conquer visual explanation",
    intent="find examples and visual intuition",
)
```

Available tools: `search_wikipedia`, `search_hf_papers`, `search_arxiv`, `search_scholar`, `fetch_docs`, and `search_hf_hub`.
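A client can sanity-check a tool name against this set before sending the action. This helper is hypothetical and not part of the env API; the server performs its own validation.

```python
# Hypothetical client-side check against the tools listed above.
EXPLORE_TOOLS = {
    "search_wikipedia", "search_hf_papers", "search_arxiv",
    "search_scholar", "fetch_docs", "search_hf_hub",
}

def is_valid_explore_tool(tool: str) -> bool:
    """True if the tool name is one the environment documents."""
    return tool in EXPLORE_TOOLS
```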

**Generate** produces educational code using accumulated research:

```python
ExplainerAction(
    action_type="generate",
    format="marimo",       # or "manim"
    code="import marimo...",
    narration="...",       # manim only
)
```

**Repair** revises generated code using lint/build feedback:

```python
ExplainerAction(
    action_type="repair",
    format="marimo",
    code="import marimo...",
    repair_notes="fixed the reported Marimo validation error",
)
```

## Reward System

Multi-component rewards across exploration, generation, and repair. See `rewards/README.md` for the full breakdown.

**Exploration (per-step):** tool choice, query quality, source quality, coverage delta, novelty, and diversity, gated by information sufficiency. A step cost of -0.05 forces the agent to justify each search.
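The shape of that per-step reward can be sketched as component scores gated by a sufficiency factor, minus the flat step cost. The component names and the linear combination here are assumptions; the actual weighting lives in `rewards/exploration.py`.

```python
# Hedged sketch of the explore-phase reward: gated component sum minus a
# flat step cost. Not the real rewards/exploration.py implementation.
STEP_COST = 0.05

def explore_reward(components: dict, sufficiency: float) -> float:
    """components: name -> score; sufficiency: gate in [0, 1]."""
    return sufficiency * sum(components.values()) - STEP_COST
```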

**Generation/repair:** keyword coverage, format match, structural quality (via the `marimo check` CLI or Manim scene analysis), narration (Manim only), context usage, and repair success.

**Key design:** the `marimo check` CLI catches 5 breaking rules (MB001-MB005) in ~100 ms. Code that doesn't parse scores 0; code that parses but doesn't execute gets quality * 0.4.
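That gating can be illustrated with a small stand-in, using Python's `ast` module in place of the real `marimo check` step. This is an assumed simplification, not the actual `rewards/sandbox.py` logic.

```python
import ast

# Stand-in for the validation gating described above: unparseable code
# scores 0; code that parses but fails to execute keeps 40% of its
# quality score. (Assumed behavior, not the real sandbox implementation.)
def gated_score(code: str, quality: float, executes: bool) -> float:
    try:
        ast.parse(code)       # proxy for the marimo check / parse stage
    except SyntaxError:
        return 0.0            # code that doesn't parse scores 0
    if not executes:
        return quality * 0.4  # parses but doesn't run: heavy discount
    return quality
```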

## Quick Start

```bash
# Install & run locally
cd explainer_env && uv sync
uv run server  # http://localhost:8000

# Client usage
python -c "
from client import ExplainerEnv
from models import ExplainerAction

with ExplainerEnv(base_url='http://localhost:8000').sync() as sc:
    result = sc.reset()
    print(f'Topic: {result.observation.topic}, Tier: {result.observation.tier}')

    # Explore
    result = sc.step(ExplainerAction(
        action_type='explore',
        tool='search_wikipedia',
        query=result.observation.topic,
        intent='overview',
    ))
    print(f'Explore reward: {result.reward:.3f}')

    # Generate
    result = sc.step(ExplainerAction(
        action_type='generate',
        format='marimo',
        code='import marimo as mo\napp = mo.App()\n@app.cell\ndef _():\n    mo.md(\"# Hello\")\n    return\n',
    ))
    print(f'Generate reward: {result.reward:.3f}, done: {result.done}')
"
```

## Concurrent WebSocket Sessions

The server supports multiple concurrent WebSocket connections for parallel GRPO training rollouts:

```python
from client import ExplainerEnv
from models import ExplainerAction
from concurrent.futures import ThreadPoolExecutor

def run_episode(client_id: int):
    with ExplainerEnv(base_url="http://localhost:8000").sync() as sc:
        result = sc.reset()
        result = sc.step(ExplainerAction(
            action_type="explore",
            tool="search_wikipedia",
            query=result.observation.topic,
        ))
        result = sc.step(ExplainerAction(
            action_type="generate", format="marimo",
            code="import marimo as mo\napp = mo.App()\n@app.cell\ndef _():\n    return\n",
        ))
        return client_id, result.reward

with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(run_episode, range(4)))
```

## API Endpoints

| Endpoint | Method | Description |
| --- | --- | --- |
| `/reset` | POST | Start a new episode, get a topic assignment |
| `/step` | POST | Submit an action, get observation + reward |
| `/state` | GET | Current episode state |
| `/schema` | GET | Action/Observation JSON schemas |
| `/ws` | WebSocket | Low-latency session for training |
| `/docs` | GET | Interactive API docs |
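Outside the bundled client, the REST endpoints can be hit directly. The sketch below builds (but does not send) a `POST /step` request; the JSON body shape is an assumption mirroring the `ExplainerAction` fields above, and a running server publishes the real schemas at `GET /schema`.

```python
import json
import urllib.request

# Build a POST /step request by hand (assumed body shape; consult
# GET /schema on a live server for the authoritative wire format).
BASE = "http://localhost:8000"
action = {
    "action_type": "explore",
    "tool": "search_wikipedia",
    "query": "gradient descent",
    "intent": "overview",
}
req = urllib.request.Request(
    f"{BASE}/step",
    data=json.dumps(action).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req) would send it against a running server.
```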

## Task Bank

26 tasks across 4 categories (ML, Math, Algorithms, Statistics), 3 difficulty levels (easy, medium, hard), and 3 audience tiers (beginner, intermediate, advanced). Each task specifies keywords for reward scoring and an optional preferred output format.
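One task entry might look like the following dataclass. The field names are illustrative guesses; see `task_bank.py` for the real structure.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical task-entry shape, mirroring the attributes described above.
@dataclass
class Task:
    topic: str
    category: str                      # ML | Math | Algorithms | Statistics
    difficulty: str                    # easy | medium | hard
    tier: str                          # beginner | intermediate | advanced
    keywords: List[str] = field(default_factory=list)  # used for reward scoring
    preferred_format: Optional[str] = None             # "marimo" or "manim"

task = Task("merge sort", "Algorithms", "easy", "beginner",
            keywords=["divide and conquer", "recursion"])
```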

## File Structure

```text
explainer_env/
├── server/
│   ├── explainer_env_environment.py  # Environment logic (reset/step/state)
│   ├── app.py                        # FastAPI server (create_app)
│   └── Dockerfile                    # Multi-stage Docker build
├── rewards/
│   ├── exploration.py                # Explore-phase reward components
│   ├── generation.py                 # Generate/repair reward components
│   ├── sources.py                    # Compatibility wrapper for research tools
│   ├── sandbox.py                    # Code validation (marimo check, AST, execution)
│   └── README.md                     # Reward system documentation
├── research/                         # Research tools, structured results, retrieval
├── models.py                         # ExplainerAction, ExplainerObservation
├── task_bank.py                      # 26 curated STEM tasks
├── client.py                         # ExplainerEnv WebSocket client
└── openenv.yaml                      # OpenEnv manifest
```