---
emoji: 💻
colorFrom: pink
colorTo: gray
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - OpenEnv
  - RL
---

The dashboard is served by this Space at /web/ in the custom tab.

# See. Interact. Understand.

**Teaching Small Models to Build Interactive Explainers**

What if a small language model could do more than answer a STEM question?

What if it could research the topic, decide what kind of visual explanation would help, build a working interactive notebook or animation, and then fix its own code when validation fails?

That is the idea behind this project: an OpenEnv reinforcement learning environment for training small language models to create visual, executable educational content.

Built for the OpenEnv Hackathon in India, April 25-26, 2026.

*Expected episode flow*

## The Problem

Most educational answers from language models are still text-first. That is fine for simple definitions, but it breaks down for topics that are easier to understand by seeing and trying:

- gradient descent is clearer when you move the learning rate and watch the loss curve change
- Fourier transforms are clearer when frequencies become visible
- sorting algorithms are clearer when every comparison and swap is animated
- probability and statistics are clearer when samples, distributions, and uncertainty move on screen

The goal here is not just to generate an explanation. The goal is to train a model to build an artifact that teaches.

The artifact can be:

- a Marimo reactive Python notebook for interactive explanations, sliders, charts, tables, and data exploration
- a Manim animation for step-by-step math and algorithm visuals
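
As a reference point, here is a rough sketch of what a minimal Marimo artifact might look like. The cell layout follows Marimo's generated-file format; the specific slider-and-loss-curve content is only illustrative, not something produced by this environment:

```python
# Illustrative sketch of a Marimo artifact: a learning-rate slider driving a tiny
# gradient-descent loss trace. The structure (app, @app.cell, return tuples) is
# Marimo's generated-file format; the content itself is a hypothetical example.
import marimo

app = marimo.App()


@app.cell
def _():
    import marimo as mo
    return (mo,)


@app.cell
def _(mo):
    lr = mo.ui.slider(0.01, 1.0, step=0.01, value=0.1, label="learning rate")
    lr  # the cell's last expression is rendered as its output
    return (lr,)


@app.cell
def _(lr):
    # Gradient descent on f(x) = x^2; this cell re-runs whenever the slider changes.
    x, losses = 5.0, []
    for _ in range(30):
        losses.append(round(x ** 2, 4))
        x -= lr.value * 2 * x
    losses
    return (losses, x)


if __name__ == "__main__":
    app.run()
```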

## Why RL?

RL matters here because "make a good explainer" is not a one-shot task.

The model has to make a sequence of decisions:

  1. understand the assigned topic
  2. decide what to research
  3. choose the right search or documentation tool
  4. stop exploring when it has enough context
  5. generate runnable Marimo or Manim code
  6. use validation feedback to repair failures

This is exactly the kind of multi-step workflow where RL is useful. The model is rewarded for the process, not just the final text.

## The Episode

Every episode starts with a STEM topic, an audience tier, keywords, and a target difficulty.
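
A task specification could look roughly like this (the field names are illustrative, not the environment's exact observation schema):

```python
# Hypothetical episode task spec; the exact field names in the real
# environment's observation may differ.
task = {
    "topic": "gradient descent",
    "audience": "undergraduate",   # audience tier
    "keywords": ["learning rate", "loss surface", "convergence"],
    "difficulty": "intermediate",
    "preferred_format": "marimo",  # or "manim"
}
```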

The agent then moves through three phases.

### 1. Explore

The agent can call explicit research tools:

- `search_wikipedia` for fundamentals
- `search_hf_papers` for ML and AI papers
- `search_arxiv` for scientific papers
- `search_hf_hub` for models, datasets, Spaces, and examples

It gets up to three exploration steps. This keeps the task long enough to learn research behavior, but short enough for practical GRPO training.
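
An exploration step is submitted as a JSON action. A hypothetical example, shown as a Python dict (the tool names match the list above; the surrounding field names are assumptions):

```python
# Hypothetical exploration action; only the tool name comes from the tool list
# above, the other keys are illustrative.
explore_action = {
    "action": "explore",
    "tool": "search_wikipedia",
    "query": "gradient descent learning rate convergence",
}
```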

### 2. Generate

The agent submits one JSON action with a complete Python artifact:

- `format="marimo"` for a reactive notebook
- `format="manim"` for an animation scene

The code is not judged only by how it looks. It is parsed, linted, checked, and run.
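
A generation action might look like this sketch (only the `format` values come from the description above; the other field names are assumptions):

```python
import json

# Hypothetical generate action: one JSON object carrying the complete artifact.
# Field names other than "format" are illustrative.
generate_action = {
    "action": "generate",
    "format": "marimo",
    "title": "Gradient Descent, Interactively",
    "code": "import marimo\n\napp = marimo.App()\n...",  # full notebook source
}
payload = json.dumps(generate_action)
```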

### 3. Repair

If the generated artifact fails validation, the environment returns the error message. The agent gets one repair attempt.

This is important because code generation usually fails in boring ways: invalid Marimo cell dependencies, duplicate notebook globals, missing Manim scene classes, syntax errors, or examples that look plausible but do not execute.

The repair step teaches the model to read the validation feedback and change the code, instead of blindly regenerating another answer.
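
Conceptually, the tail of an episode behaves like a tiny build-test-fix loop. A rough sketch, with a stand-in `validate` that only checks syntax rather than the environment's full Marimo/Manim checks:

```python
import ast


def validate(code: str) -> tuple[bool, str]:
    """Stand-in for the environment's checks: here, only a syntax parse."""
    try:
        ast.parse(code)
        return True, ""
    except SyntaxError as exc:
        return False, f"SyntaxError: {exc}"


def generation_phase(generate, repair, task):
    """generate(task) -> code; repair(task, code, error) -> code. Both are the agent's."""
    code = generate(task)
    ok, error = validate(code)
    if ok:
        return code, "passed"
    fixed = repair(task, code, error)  # one repair attempt, guided by the error message
    ok, _ = validate(fixed)
    return fixed, "passed" if ok else "failed"
```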

## The Reward Signal

The reward system is deliberately practical. It avoids using an LLM judge inside the training loop because that would be slower, noisier, and harder to reproduce.

Instead, the environment rewards things that can be checked quickly.

### Exploration Reward

The model gets rewarded when it:

- chooses a useful tool for the topic
- writes a relevant query
- retrieves useful sources
- increases keyword coverage
- adds new information instead of repeating the same search
- stops when the context is already good enough

There is also a small step cost. Exploring forever should not be the winning strategy.
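
A rough sketch of how shaping terms like these might combine (the weights and helper inputs are illustrative, not the environment's actual values):

```python
# Illustrative exploration reward shaping; all weights and inputs are assumptions.
def exploration_reward(new_keywords: int, total_keywords: int,
                       novel_snippets: int, step_cost: float = 0.05) -> float:
    coverage_gain = new_keywords / max(total_keywords, 1)  # keyword coverage increase
    novelty = min(novel_snippets, 3) / 3                    # new info vs. repeated searches
    return 0.6 * coverage_gain + 0.4 * novelty - step_cost  # small per-step cost
```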

### Generation Reward

The generated code is rewarded for:

- valid JSON action format
- matching the requested artifact type
- covering the key concepts
- passing Marimo or Manim validation
- actually running or rendering

Broken code cannot score well just because it mentions the right words. The validation checks act like gates.
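
The gating idea can be sketched like this (the individual checks and weights are hypothetical):

```python
# Illustrative gated generation reward: content terms only matter once the
# structural checks pass. The checks and weights are assumptions.
def generation_reward(valid_json: bool, format_matches: bool,
                      concept_coverage: float, validation_passed: bool,
                      executed: bool) -> float:
    if not (valid_json and format_matches):
        return 0.0                         # malformed actions earn nothing
    reward = 0.1 + 0.3 * concept_coverage  # base + key-concept coverage in [0, 1]
    if validation_passed:
        reward += 0.3                      # passed Marimo/Manim validation
    if executed:
        reward += 0.3                      # actually ran or rendered
    return reward
```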

### Repair Reward

The repair step rewards the model for:

- fixing the reported error
- passing validation after the fix
- avoiding repeated unchanged code

This makes the environment closer to a real development loop: build, test, read the error, fix.
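
One cheap check that supports the "no unchanged resubmission" rule is simply comparing the repaired code with the original. A sketch with assumed weights:

```python
# Illustrative repair reward; the weights are assumptions, the key idea is
# penalizing a repair that resubmits the same code.
def repair_reward(original_code: str, repaired_code: str,
                  error_fixed: bool, validation_passed: bool) -> float:
    if repaired_code.strip() == original_code.strip():
        return -0.2                # resubmitting unchanged code is penalized
    reward = 0.0
    if error_fixed:
        reward += 0.4              # the reported error no longer occurs
    if validation_passed:
        reward += 0.4              # full validation passes after the fix
    return reward
```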

## Why Retrieval Is Part of the Environment

Small models do not have unlimited context, and this task can easily become context-heavy. The agent may research equations, examples, library APIs, visualization patterns, and task-specific concepts before generating code.

So the environment filters research results before sending them back.

### RAG for long-horizon exploration tasks

The retrieval pipeline fetches candidate sources, chunks them, and ranks the chunks with a single small embedding model, bge-small-en-v1.5, returning the most useful snippets in the observation.

The goal is the same: provide the model with enough relevant context to build a better explainer, without overwhelming it with irrelevant text.
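
A minimal version of that ranking step, assuming the sentence-transformers package is used to load BAAI/bge-small-en-v1.5 (the actual pipeline may fetch and chunk sources differently):

```python
# Minimal sketch of embedding-based snippet ranking with bge-small-en-v1.5.
# Assumes the sentence-transformers package; the real pipeline may differ.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-small-en-v1.5")


def top_snippets(query: str, chunks: list[str], k: int = 5) -> list[str]:
    query_emb = model.encode(query, normalize_embeddings=True)
    chunk_embs = model.encode(chunks, normalize_embeddings=True)
    scores = util.cos_sim(query_emb, chunk_embs)[0]        # cosine similarity per chunk
    best = scores.argsort(descending=True)[:k]             # highest-scoring chunks first
    return [chunks[int(i)] for i in best]
```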

## What We Trained First

Before RL, the model needs to know the shape of the artifacts.

Even larger models often produce Marimo and Manim code that looks reasonable but fails under real validation. So the first step is supervised fine-tuning on examples built from:

- curated STEM tasks
- Marimo examples and documentation patterns
- Manim examples, guides, and reference snippets
- generate and repair action templates

The current target model is:

`unsloth/Ministral-3-3B-Instruct-2512-unsloth-bnb-4bit`

The SFT adapter is here:

`kgdrathan/ministral-3-3b-4bit-marimo-manim`
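
To experiment with the checkpoint, the adapter can be loaded on top of the base model roughly like this (a generic transformers + peft sketch; Unsloth's own loaders may be preferable for 4-bit inference):

```python
# Sketch: load the 4-bit base model and attach the SFT LoRA adapter with peft.
# The repo IDs come from this README; the loading recipe itself is generic.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "unsloth/Ministral-3-3B-Instruct-2512-unsloth-bnb-4bit"
adapter_id = "kgdrathan/ministral-3-3b-4bit-marimo-manim"

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
model = PeftModel.from_pretrained(model, adapter_id)
```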

*SFT training curves*

## Links