---
emoji: 💻
colorFrom: pink
colorTo: gray
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - OpenEnv
  - RL
---

The dashboard is served by this Space at /web/ in the custom tab.

# See. Interact. Understand.

**Teaching Small Models to Build Interactive Explainers**

What if a small language model could do more than answer a STEM question?

What if it could research the topic, decide what kind of visual explanation would help, build a working interactive notebook or animation, and then fix its own code when validation fails?

That is the idea behind this project: an OpenEnv reinforcement learning environment for training small language models to create visual, executable educational content.

Built for the OpenEnv Hackathon in India, April 25-26, 2026.

*Expected episode flow*

## The Problem

Most educational answers from language models are still text-first. That is fine for simple definitions, but it breaks down for topics that are easier to understand by seeing and trying:

- gradient descent is clearer when you move the learning rate and watch the loss curve change
- Fourier transforms are clearer when frequencies become visible
- sorting algorithms are clearer when every comparison and swap is animated
- probability and statistics are clearer when samples, distributions, and uncertainty move on screen

The goal here is not just to generate an explanation. The goal is to train a model to build an artifact that teaches.

The artifact can be:

- a Marimo reactive Python notebook for interactive explanations, sliders, charts, tables, and data exploration
- a Manim animation for step-by-step math and algorithm visuals
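
As a reference point, here is a rough sketch of what a minimal Marimo artifact might look like. The cell layout follows Marimo's generated-file format; the specific slider-and-loss-curve content is only illustrative, not something produced by this environment:

```python
# Illustrative sketch of a Marimo artifact: a learning-rate slider driving a tiny
# gradient-descent loss trace. The structure (app, @app.cell, return tuples) is
# Marimo's generated-file format; the content itself is a hypothetical example.
import marimo

app = marimo.App()


@app.cell
def _():
    import marimo as mo
    return (mo,)


@app.cell
def _(mo):
    lr = mo.ui.slider(0.01, 1.0, step=0.01, value=0.1, label="learning rate")
    lr  # the cell's last expression is rendered as its output
    return (lr,)


@app.cell
def _(lr):
    # Gradient descent on f(x) = x^2; this cell re-runs whenever the slider changes.
    x, losses = 5.0, []
    for _ in range(30):
        losses.append(round(x ** 2, 4))
        x -= lr.value * 2 * x
    losses
    return (losses, x)


if __name__ == "__main__":
    app.run()
```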

## Why RL?

RL matters here because "make a good explainer" is not a one-shot task.

The model has to make a sequence of decisions:

  1. understand the assigned topic
  2. decide what to research
  3. choose the right search or documentation tool
  4. stop exploring when it has enough context
  5. generate runnable Marimo or Manim code
  6. use validation feedback to repair failures

This is exactly the kind of multi-step workflow where RL is useful. The model is rewarded for the process, not just the final text.

## The Episode

Every episode starts with a STEM topic, an audience tier, keywords, and a target difficulty.
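
A task specification could look roughly like this (the field names are illustrative, not the environment's exact observation schema):

```python
# Hypothetical episode task spec; the exact field names in the real
# environment's observation may differ.
task = {
    "topic": "gradient descent",
    "audience": "undergraduate",   # audience tier
    "keywords": ["learning rate", "loss surface", "convergence"],
    "difficulty": "intermediate",
    "preferred_format": "marimo",  # or "manim"
}
```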

The agent then moves through three phases.

### 1. Explore

The agent can call explicit research tools:

- `search_wikipedia` for fundamentals
- `search_hf_papers` for ML and AI papers
- `search_arxiv` for scientific papers
- `search_hf_hub` for models, datasets, Spaces, and examples

It gets up to three exploration steps. This keeps the task long enough to learn research behavior, but short enough for practical GRPO training.
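
An exploration step is submitted as a JSON action. A hypothetical example, shown as a Python dict (the tool names match the list above; the surrounding field names are assumptions):

```python
# Hypothetical exploration action; only the tool name comes from the tool list
# above, the other keys are illustrative.
explore_action = {
    "action": "explore",
    "tool": "search_wikipedia",
    "query": "gradient descent learning rate convergence",
}
```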

### 2. Generate

The agent submits one JSON action with a complete Python artifact:

- `format="marimo"` for a reactive notebook
- `format="manim"` for an animation scene

The code is not judged only by how it looks. It is parsed, linted, checked, and run.
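
A generation action might look like this sketch (only the `format` values come from the description above; the other field names are assumptions):

```python
import json

# Hypothetical generate action: one JSON object carrying the complete artifact.
# Field names other than "format" are illustrative.
generate_action = {
    "action": "generate",
    "format": "marimo",
    "title": "Gradient Descent, Interactively",
    "code": "import marimo\n\napp = marimo.App()\n...",  # full notebook source
}
payload = json.dumps(generate_action)
```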

### 3. Repair

If the generated artifact fails validation, the environment returns the error message. The agent gets one repair attempt.

This is important because code generation usually fails in boring ways: invalid Marimo cell dependencies, duplicate notebook globals, missing Manim scene classes, syntax errors, or examples that look plausible but do not execute.

The repair step teaches the model to read the validation feedback and change the code, instead of blindly regenerating another answer.
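
Conceptually, the tail of an episode behaves like a tiny build-test-fix loop. A rough sketch, with a stand-in `validate` that only checks syntax rather than the environment's full Marimo/Manim checks:

```python
import ast


def validate(code: str) -> tuple[bool, str]:
    """Stand-in for the environment's checks: here, only a syntax parse."""
    try:
        ast.parse(code)
        return True, ""
    except SyntaxError as exc:
        return False, f"SyntaxError: {exc}"


def generation_phase(generate, repair, task):
    """generate(task) -> code; repair(task, code, error) -> code. Both are the agent's."""
    code = generate(task)
    ok, error = validate(code)
    if ok:
        return code, "passed"
    fixed = repair(task, code, error)  # one repair attempt, guided by the error message
    ok, _ = validate(fixed)
    return fixed, "passed" if ok else "failed"
```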

## The Reward Signal

The reward system is deliberately practical. It avoids using an LLM judge inside the training loop because that would be slower, noisier, and harder to reproduce.

Instead, the environment rewards things that can be checked quickly.

### Exploration Reward

The model gets rewarded when it:

- chooses a useful tool for the topic
- writes a relevant query
- retrieves useful sources
- increases keyword coverage
- adds new information instead of repeating the same search
- stops when the context is already good enough

There is also a small step cost. Exploring forever should not be the winning strategy.
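
A rough sketch of how shaping terms like these might combine (the weights and helper inputs are illustrative, not the environment's actual values):

```python
# Illustrative exploration reward shaping; all weights and inputs are assumptions.
def exploration_reward(new_keywords: int, total_keywords: int,
                       novel_snippets: int, step_cost: float = 0.05) -> float:
    coverage_gain = new_keywords / max(total_keywords, 1)  # keyword coverage increase
    novelty = min(novel_snippets, 3) / 3                    # new info vs. repeated searches
    return 0.6 * coverage_gain + 0.4 * novelty - step_cost  # small per-step cost
```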

### Generation Reward

The generated code is rewarded for:

- valid JSON action format
- matching the requested artifact type
- covering the key concepts
- passing Marimo or Manim validation
- actually running or rendering

Broken code cannot score well just because it mentions the right words. The validation checks act like gates.
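
The gating idea can be sketched like this (the individual checks and weights are hypothetical):

```python
# Illustrative gated generation reward: content terms only matter once the
# structural checks pass. The checks and weights are assumptions.
def generation_reward(valid_json: bool, format_matches: bool,
                      concept_coverage: float, validation_passed: bool,
                      executed: bool) -> float:
    if not (valid_json and format_matches):
        return 0.0                         # malformed actions earn nothing
    reward = 0.1 + 0.3 * concept_coverage  # base + key-concept coverage in [0, 1]
    if validation_passed:
        reward += 0.3                      # passed Marimo/Manim validation
    if executed:
        reward += 0.3                      # actually ran or rendered
    return reward
```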

### Repair Reward

The repair step rewards the model for:

- fixing the reported error
- passing validation after the fix
- avoiding repeated unchanged code

This makes the environment closer to a real development loop: build, test, read the error, fix.
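
One cheap check that supports the "no unchanged resubmission" rule is simply comparing the repaired code with the original. A sketch with assumed weights:

```python
# Illustrative repair reward; the weights are assumptions, the key idea is
# penalizing a repair that resubmits the same code.
def repair_reward(original_code: str, repaired_code: str,
                  error_fixed: bool, validation_passed: bool) -> float:
    if repaired_code.strip() == original_code.strip():
        return -0.2                # resubmitting unchanged code is penalized
    reward = 0.0
    if error_fixed:
        reward += 0.4              # the reported error no longer occurs
    if validation_passed:
        reward += 0.4              # full validation passes after the fix
    return reward
```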

## Why Retrieval Is Part of the Environment

Small models do not have unlimited context, and this task can easily become context-heavy. The agent may research equations, examples, library APIs, visualization patterns, and task-specific concepts before generating code.

So the environment filters research results before sending them back.

### RAG for long-horizon exploration tasks

The retrieval pipeline fetches candidate sources, chunks them, and ranks the chunks with a single small embedding model, bge-small-en-v1.5, returning the most useful snippets in the observation.

The goal is the same: provide the model with enough relevant context to build a better explainer, without overwhelming it with irrelevant text.
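
A minimal version of that ranking step, assuming the sentence-transformers package is used to load BAAI/bge-small-en-v1.5 (the actual pipeline may fetch and chunk sources differently):

```python
# Minimal sketch of embedding-based snippet ranking with bge-small-en-v1.5.
# Assumes the sentence-transformers package; the real pipeline may differ.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-small-en-v1.5")


def top_snippets(query: str, chunks: list[str], k: int = 5) -> list[str]:
    query_emb = model.encode(query, normalize_embeddings=True)
    chunk_embs = model.encode(chunks, normalize_embeddings=True)
    scores = util.cos_sim(query_emb, chunk_embs)[0]        # cosine similarity per chunk
    best = scores.argsort(descending=True)[:k]             # highest-scoring chunks first
    return [chunks[int(i)] for i in best]
```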

## What We Trained First

Before RL, the model needs to know the shape of the artifacts.

Even larger models often produce Marimo and Manim code that looks reasonable but fails under real validation. So the first step is supervised fine-tuning on examples built from:

- curated STEM tasks
- Marimo examples and documentation patterns
- Manim examples, guides, and reference snippets
- generate and repair action templates

The current target model is:

`unsloth/Ministral-3-3B-Instruct-2512-unsloth-bnb-4bit`

The SFT adapter is here:

`kgdrathan/ministral-3-3b-4bit-marimo-manim`
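
To experiment with the checkpoint, the adapter can be loaded on top of the base model roughly like this (a generic transformers + peft sketch; Unsloth's own loaders may be preferable for 4-bit inference):

```python
# Sketch: load the 4-bit base model and attach the SFT LoRA adapter with peft.
# The repo IDs come from this README; the loading recipe itself is generic.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "unsloth/Ministral-3-3B-Instruct-2512-unsloth-bnb-4bit"
adapter_id = "kgdrathan/ministral-3-3b-4bit-marimo-manim"

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
model = PeftModel.from_pretrained(model, adapter_id)
```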

*SFT training curves*

## Links