Spaces:
Sleeping
Sleeping
Upload folder using huggingface_hub
Browse files- Dockerfile +69 -0
- README.md +118 -5
- __init__.py +21 -0
- client.py +140 -0
- models.py +69 -0
- opencode_openenv.egg-info/PKG-INFO +19 -0
- opencode_openenv.egg-info/SOURCES.txt +21 -0
- opencode_openenv.egg-info/dependency_links.txt +1 -0
- opencode_openenv.egg-info/entry_points.txt +2 -0
- opencode_openenv.egg-info/requires.txt +14 -0
- opencode_openenv.egg-info/top_level.txt +1 -0
- openenv.yaml +6 -0
- pyproject.toml +51 -0
- server/__init__.py +0 -0
- server/app.py +92 -0
- server/gradio_ui.py +295 -0
- server/opencode_environment.py +352 -0
- server/requirements.txt +9 -0
- tests/__init__.py +0 -0
- tests/test_client.py +253 -0
- uv.lock +0 -0
Dockerfile
ADDED
|
@@ -0,0 +1,69 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# syntax=docker/dockerfile:1
|
| 2 |
+
# Multi-stage build for opencode-openenv
|
| 3 |
+
# Mirrors the pattern used by jupyter-agent-openenv.
|
| 4 |
+
#
|
| 5 |
+
# Build:
|
| 6 |
+
# docker build -t opencode-openenv .
|
| 7 |
+
#
|
| 8 |
+
# Run:
|
| 9 |
+
# docker run -p 8000:8000 \
|
| 10 |
+
# -e E2B_API_KEY=e2b_... \
|
| 11 |
+
# -e ENABLE_WEB_INTERFACE=true \
|
| 12 |
+
# opencode-openenv
|
| 13 |
+
|
| 14 |
+
ARG BASE_IMAGE=ghcr.io/meta-pytorch/openenv-base:latest
|
| 15 |
+
|
| 16 |
+
# ββ Stage 1: builder ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 17 |
+
FROM ${BASE_IMAGE} AS builder
|
| 18 |
+
|
| 19 |
+
WORKDIR /app
|
| 20 |
+
|
| 21 |
+
ARG BUILD_MODE=standalone
|
| 22 |
+
|
| 23 |
+
COPY . /app/env
|
| 24 |
+
WORKDIR /app/env
|
| 25 |
+
|
| 26 |
+
# Ensure uv is available
|
| 27 |
+
RUN if ! command -v uv >/dev/null 2>&1; then \
|
| 28 |
+
curl -LsSf https://astral.sh/uv/install.sh | sh && \
|
| 29 |
+
mv /root/.local/bin/uv /usr/local/bin/uv && \
|
| 30 |
+
mv /root/.local/bin/uvx /usr/local/bin/uvx; \
|
| 31 |
+
fi
|
| 32 |
+
|
| 33 |
+
RUN apt-get update && apt-get install -y --no-install-recommends \
|
| 34 |
+
git \
|
| 35 |
+
&& rm -rf /var/lib/apt/lists/*
|
| 36 |
+
|
| 37 |
+
# Install dependencies (cache-friendly two-pass)
|
| 38 |
+
RUN --mount=type=cache,target=/root/.cache/uv \
|
| 39 |
+
if [ -f uv.lock ]; then \
|
| 40 |
+
uv sync --frozen --no-install-project --no-editable; \
|
| 41 |
+
else \
|
| 42 |
+
uv sync --no-install-project --no-editable; \
|
| 43 |
+
fi
|
| 44 |
+
|
| 45 |
+
RUN --mount=type=cache,target=/root/.cache/uv \
|
| 46 |
+
if [ -f uv.lock ]; then \
|
| 47 |
+
uv sync --frozen --no-editable; \
|
| 48 |
+
else \
|
| 49 |
+
uv sync --no-editable; \
|
| 50 |
+
fi
|
| 51 |
+
|
| 52 |
+
# ββ Stage 2: runtime ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 53 |
+
FROM ${BASE_IMAGE}
|
| 54 |
+
|
| 55 |
+
WORKDIR /app
|
| 56 |
+
|
| 57 |
+
COPY --from=builder /app/env/.venv /app/.venv
|
| 58 |
+
COPY --from=builder /app/env /app/env
|
| 59 |
+
|
| 60 |
+
ENV PATH="/app/.venv/bin:$PATH"
|
| 61 |
+
ENV PYTHONPATH="/app/env:$PYTHONPATH"
|
| 62 |
+
ENV PYTHONDONTWRITEBYTECODE=1
|
| 63 |
+
ENV PYTHONUNBUFFERED=1
|
| 64 |
+
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
|
| 65 |
+
CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
|
| 66 |
+
|
| 67 |
+
EXPOSE 8000
|
| 68 |
+
|
| 69 |
+
CMD ["sh", "-c", "cd /app/env && uvicorn server.app:app --host 0.0.0.0 --port 8000"]
|
README.md
CHANGED
|
@@ -1,10 +1,123 @@
|
|
| 1 |
---
|
| 2 |
-
title:
|
| 3 |
-
emoji:
|
| 4 |
-
colorFrom:
|
| 5 |
-
colorTo:
|
| 6 |
sdk: docker
|
|
|
|
| 7 |
pinned: false
|
|
|
|
| 8 |
---
|
| 9 |
|
| 10 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
---
|
| 2 |
+
title: OpenCode Env
|
| 3 |
+
emoji: π₯οΈ
|
| 4 |
+
colorFrom: indigo
|
| 5 |
+
colorTo: pink
|
| 6 |
sdk: docker
|
| 7 |
+
app_port: 8000
|
| 8 |
pinned: false
|
| 9 |
+
base_path: /web
|
| 10 |
---
|
| 11 |
|
| 12 |
+
# opencode-openenv
|
| 13 |
+
|
| 14 |
+
An [OpenEnv](https://github.com/meta-pytorch/OpenEnv)-compatible environment
|
| 15 |
+
for the [OpenCode](https://opencode.ai/) CLI coding agent. Each call runs
|
| 16 |
+
the agent inside an isolated E2B sandbox against an OpenAI-compatible LLM
|
| 17 |
+
endpoint of your choice, executes a user-supplied bash verifier, and
|
| 18 |
+
returns the scalar reward plus artifacts.
|
| 19 |
+
|
| 20 |
+
Layout mirrors [`jupyter-agent-openenv`](https://huggingface.co/spaces/AdithyaSK/jupyter-agent-openenv).
|
| 21 |
+
|
| 22 |
+
## The one tool
|
| 23 |
+
|
| 24 |
+
| Property | Value |
|
| 25 |
+
|---|---|
|
| 26 |
+
| Framework | OpenEnv `MCPEnvironment` |
|
| 27 |
+
| Execution backend | E2B sandbox |
|
| 28 |
+
| Server | FastAPI + Gradio UI at `/` |
|
| 29 |
+
| Client | `OpenCodeEnv(MCPToolClient)` |
|
| 30 |
+
|
| 31 |
+
| Tool | Description |
|
| 32 |
+
|---|---|
|
| 33 |
+
| `run_rollout` | Spawn an E2B sandbox, run `opencode run` against your LLM endpoint, run your verifier, return reward + trace + workdir files |
|
| 34 |
+
|
| 35 |
+
## Environment variables
|
| 36 |
+
|
| 37 |
+
| Variable | Required | Default | Description |
|
| 38 |
+
|---|---|---|---|
|
| 39 |
+
| `E2B_API_KEY` | Yes | - | API key from [e2b.dev](https://e2b.dev/) |
|
| 40 |
+
| `ENABLE_WEB_INTERFACE` | No | `true` | Enable Gradio UI at `/` |
|
| 41 |
+
| `MAX_CONCURRENT_ENVS` | No | `4` | Max concurrent sandbox sessions |
|
| 42 |
+
|
| 43 |
+
## Run locally
|
| 44 |
+
|
| 45 |
+
**Prerequisites:** Python 3.10+, [uv](https://docs.astral.sh/uv/)
|
| 46 |
+
|
| 47 |
+
```bash
|
| 48 |
+
cd trl-internal/environments/opencode/openenv
|
| 49 |
+
uv sync
|
| 50 |
+
E2B_API_KEY=e2b_... uv run uvicorn server.app:app --host 0.0.0.0 --port 8000
|
| 51 |
+
```
|
| 52 |
+
|
| 53 |
+
The server starts at `http://localhost:8000`; the Gradio UI is mounted at
|
| 54 |
+
the root path.
|
| 55 |
+
|
| 56 |
+
**Verify it works:**
|
| 57 |
+
|
| 58 |
+
```bash
|
| 59 |
+
curl http://localhost:8000/health
|
| 60 |
+
# {"status":"healthy"}
|
| 61 |
+
|
| 62 |
+
curl -X POST http://localhost:8000/mcp \
|
| 63 |
+
-H "Content-Type: application/json" \
|
| 64 |
+
-d '{"jsonrpc":"2.0","method":"tools/list","id":1,"params":{}}'
|
| 65 |
+
```
|
| 66 |
+
|
| 67 |
+
## Run with Docker
|
| 68 |
+
|
| 69 |
+
```bash
|
| 70 |
+
docker build -t opencode-openenv .
|
| 71 |
+
docker run -p 8000:8000 -e E2B_API_KEY=e2b_... opencode-openenv
|
| 72 |
+
```
|
| 73 |
+
|
| 74 |
+
## Python client usage
|
| 75 |
+
|
| 76 |
+
```python
|
| 77 |
+
from opencode_env_server import OpenCodeEnv
|
| 78 |
+
|
| 79 |
+
with OpenCodeEnv(base_url="http://localhost:8000") as env:
|
| 80 |
+
env.reset()
|
| 81 |
+
result = env.run_rollout(
|
| 82 |
+
vllm_url="https://your-llm-host/v1",
|
| 83 |
+
model="Qwen/Qwen3.5-4B",
|
| 84 |
+
instruction="Write fizzbuzz.py in the current directory.",
|
| 85 |
+
test_script=open("my_tests/fizzbuzz.sh").read(),
|
| 86 |
+
task_id="fizzbuzz_001",
|
| 87 |
+
mode="transparent_proxy",
|
| 88 |
+
disable_thinking=True,
|
| 89 |
+
)
|
| 90 |
+
print(result.reward, len(result.proxy_turns))
|
| 91 |
+
```
|
| 92 |
+
|
| 93 |
+
## REST API
|
| 94 |
+
|
| 95 |
+
Standard OpenEnv endpoints are available:
|
| 96 |
+
|
| 97 |
+
```
|
| 98 |
+
GET /health # {"status": "healthy"}
|
| 99 |
+
GET /metadata # env name, version, description
|
| 100 |
+
GET /schema # action + observation JSON schemas
|
| 101 |
+
POST /reset # start new episode
|
| 102 |
+
POST /step # execute an action
|
| 103 |
+
POST /mcp # JSON-RPC 2.0 for MCP tool calls
|
| 104 |
+
GET / # Gradio UI
|
| 105 |
+
```
|
| 106 |
+
|
| 107 |
+
## Project structure
|
| 108 |
+
|
| 109 |
+
```
|
| 110 |
+
opencode-openenv/
|
| 111 |
+
βββ __init__.py # Package exports
|
| 112 |
+
βββ client.py # OpenCodeEnv(MCPToolClient)
|
| 113 |
+
βββ models.py # OpenCodeState, RolloutTurn, RolloutResult
|
| 114 |
+
βββ openenv.yaml # OpenEnv manifest
|
| 115 |
+
βββ pyproject.toml # Dependencies
|
| 116 |
+
βββ .env.example # Environment variable template
|
| 117 |
+
βββ Dockerfile # Multi-stage uv build on openenv-base
|
| 118 |
+
βββ server/
|
| 119 |
+
βββ app.py # FastAPI + Gradio mount
|
| 120 |
+
βββ opencode_environment.py # MCPEnvironment implementation
|
| 121 |
+
βββ gradio_ui.py # Interactive UI
|
| 122 |
+
βββ requirements.txt # Pip fallback deps
|
| 123 |
+
```
|
__init__.py
ADDED
|
@@ -0,0 +1,21 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""OpenEnv OpenCode environment.
|
| 2 |
+
|
| 3 |
+
Exposes a single MCP tool ``run_rollout`` that spawns an E2B sandbox, runs
|
| 4 |
+
the OpenCode CLI agent against a caller-supplied LLM endpoint, runs the
|
| 5 |
+
caller-supplied verifier script, and returns reward + proxy trace +
|
| 6 |
+
workdir contents as a JSON-serialized :class:`RolloutResult`.
|
| 7 |
+
|
| 8 |
+
Import either the :class:`OpenCodeEnv` HTTP client (for training scripts
|
| 9 |
+
talking to a deployed server) or the models (for type-safe parsing of
|
| 10 |
+
rollout results).
|
| 11 |
+
"""
|
| 12 |
+
|
| 13 |
+
from .client import OpenCodeEnv
|
| 14 |
+
from .models import OpenCodeState, RolloutResult, RolloutTurn
|
| 15 |
+
|
| 16 |
+
__all__ = [
|
| 17 |
+
"OpenCodeEnv",
|
| 18 |
+
"OpenCodeState",
|
| 19 |
+
"RolloutResult",
|
| 20 |
+
"RolloutTurn",
|
| 21 |
+
]
|
client.py
ADDED
|
@@ -0,0 +1,140 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""OpenCode Environment Client.
|
| 2 |
+
|
| 3 |
+
Thin MCP client over the deployed ``opencode-openenv`` server. The server
|
| 4 |
+
exposes a single tool ``run_rollout`` that takes a task + LLM endpoint,
|
| 5 |
+
runs one OpenCode agent rollout in a fresh E2B sandbox, and returns a
|
| 6 |
+
JSON-serialized :class:`RolloutResult`.
|
| 7 |
+
|
| 8 |
+
Example::
|
| 9 |
+
|
| 10 |
+
from opencode_env_server import OpenCodeEnv
|
| 11 |
+
|
| 12 |
+
with OpenCodeEnv(base_url="https://adithya-s-k-opencode-openenv.hf.space") as env:
|
| 13 |
+
env.reset()
|
| 14 |
+
result = env.run_rollout(
|
| 15 |
+
vllm_url="https://your-llm-host/v1",
|
| 16 |
+
model="Qwen/Qwen3.5-4B",
|
| 17 |
+
instruction="Write fizzbuzz.py in the current directory.",
|
| 18 |
+
test_script="#!/bin/bash\\n...",
|
| 19 |
+
task_id="fizzbuzz_001",
|
| 20 |
+
mode="transparent_proxy",
|
| 21 |
+
disable_thinking=True,
|
| 22 |
+
)
|
| 23 |
+
print(result.reward, len(result.proxy_turns))
|
| 24 |
+
|
| 25 |
+
Docker convenience::
|
| 26 |
+
|
| 27 |
+
env = OpenCodeEnv.from_docker_image("opencode-openenv:latest")
|
| 28 |
+
env.reset()
|
| 29 |
+
...
|
| 30 |
+
"""
|
| 31 |
+
|
| 32 |
+
from __future__ import annotations
|
| 33 |
+
|
| 34 |
+
import json
|
| 35 |
+
from typing import Any
|
| 36 |
+
|
| 37 |
+
from openenv.core.mcp_client import MCPToolClient
|
| 38 |
+
|
| 39 |
+
try:
|
| 40 |
+
from .models import RolloutResult
|
| 41 |
+
except ImportError:
|
| 42 |
+
from models import RolloutResult # type: ignore
|
| 43 |
+
|
| 44 |
+
|
| 45 |
+
class OpenCodeEnv(MCPToolClient):
|
| 46 |
+
"""Client for the OpenCode OpenEnv server.
|
| 47 |
+
|
| 48 |
+
Inherits MCP plumbing (``reset``, ``call_tool``, ``list_tools``,
|
| 49 |
+
``from_docker_image``, context-manager support) from
|
| 50 |
+
:class:`MCPToolClient`. Adds :meth:`run_rollout` as a typed helper that
|
| 51 |
+
deserializes the tool result into a :class:`RolloutResult`.
|
| 52 |
+
"""
|
| 53 |
+
|
| 54 |
+
def run_rollout(
|
| 55 |
+
self,
|
| 56 |
+
*,
|
| 57 |
+
vllm_url: str,
|
| 58 |
+
model: str,
|
| 59 |
+
instruction: str,
|
| 60 |
+
test_script: str,
|
| 61 |
+
task_id: str = "",
|
| 62 |
+
setup_shell: str = "",
|
| 63 |
+
upload_files: dict[str, str] | None = None,
|
| 64 |
+
provider: str = "openai_compatible",
|
| 65 |
+
api_key: str = "intercepted",
|
| 66 |
+
mode: str = "transparent_proxy",
|
| 67 |
+
disable_thinking: bool = False,
|
| 68 |
+
max_tokens_cap: int = 4096,
|
| 69 |
+
agent_timeout_s: float = 600.0,
|
| 70 |
+
) -> RolloutResult:
|
| 71 |
+
"""Typed helper around ``call_tool("run_rollout", ...)``."""
|
| 72 |
+
|
| 73 |
+
raw = self.call_tool(
|
| 74 |
+
"run_rollout",
|
| 75 |
+
vllm_url=vllm_url,
|
| 76 |
+
model=model,
|
| 77 |
+
instruction=instruction,
|
| 78 |
+
test_script=test_script,
|
| 79 |
+
task_id=task_id,
|
| 80 |
+
setup_shell=setup_shell,
|
| 81 |
+
upload_files=upload_files or {},
|
| 82 |
+
provider=provider,
|
| 83 |
+
api_key=api_key,
|
| 84 |
+
mode=mode,
|
| 85 |
+
disable_thinking=disable_thinking,
|
| 86 |
+
max_tokens_cap=max_tokens_cap,
|
| 87 |
+
agent_timeout_s=agent_timeout_s,
|
| 88 |
+
)
|
| 89 |
+
payload = _extract_text(raw)
|
| 90 |
+
return RolloutResult.model_validate_json(payload)
|
| 91 |
+
|
| 92 |
+
|
| 93 |
+
def _extract_text(result: Any) -> str:
|
| 94 |
+
"""Pull the text payload out of an MCP tool result shape.
|
| 95 |
+
|
| 96 |
+
Handles three shapes MCPToolClient / call_tool may return:
|
| 97 |
+
- the raw tool text (str)
|
| 98 |
+
- a CallToolObservation-like object with ``.result.content[0].text``
|
| 99 |
+
- a dict with ``content`` list containing ``{"text": ...}`` entries
|
| 100 |
+
"""
|
| 101 |
+
|
| 102 |
+
if isinstance(result, str):
|
| 103 |
+
return result
|
| 104 |
+
|
| 105 |
+
# Object with attribute chain: obs.result.content[0].text
|
| 106 |
+
inner = getattr(result, "result", None)
|
| 107 |
+
if inner is not None:
|
| 108 |
+
content = getattr(inner, "content", None)
|
| 109 |
+
if content:
|
| 110 |
+
first = content[0]
|
| 111 |
+
text = getattr(first, "text", None)
|
| 112 |
+
if isinstance(text, str):
|
| 113 |
+
return text
|
| 114 |
+
if isinstance(first, dict) and "text" in first:
|
| 115 |
+
return first["text"]
|
| 116 |
+
|
| 117 |
+
if isinstance(result, dict):
|
| 118 |
+
content = result.get("content")
|
| 119 |
+
if isinstance(content, list) and content:
|
| 120 |
+
first = content[0]
|
| 121 |
+
if isinstance(first, dict) and "text" in first:
|
| 122 |
+
return first["text"]
|
| 123 |
+
nested = result.get("result")
|
| 124 |
+
if isinstance(nested, dict):
|
| 125 |
+
content = nested.get("content")
|
| 126 |
+
if isinstance(content, list) and content:
|
| 127 |
+
first = content[0]
|
| 128 |
+
if isinstance(first, dict) and "text" in first:
|
| 129 |
+
return first["text"]
|
| 130 |
+
return json.dumps(result, default=str)
|
| 131 |
+
|
| 132 |
+
# Object with .content directly
|
| 133 |
+
content = getattr(result, "content", None)
|
| 134 |
+
if content:
|
| 135 |
+
first = content[0]
|
| 136 |
+
text = getattr(first, "text", None)
|
| 137 |
+
if isinstance(text, str):
|
| 138 |
+
return text
|
| 139 |
+
|
| 140 |
+
return str(result)
|
models.py
ADDED
|
@@ -0,0 +1,69 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Pydantic models for the OpenCode OpenEnv environment."""
|
| 2 |
+
|
| 3 |
+
from __future__ import annotations
|
| 4 |
+
|
| 5 |
+
from typing import Any
|
| 6 |
+
|
| 7 |
+
from openenv.core.env_server.types import State
|
| 8 |
+
from pydantic import BaseModel, Field
|
| 9 |
+
|
| 10 |
+
|
| 11 |
+
class RolloutTurn(BaseModel):
|
| 12 |
+
"""One intercepted LLM turn captured by the in-sandbox proxy (Mode B)."""
|
| 13 |
+
|
| 14 |
+
turn: int
|
| 15 |
+
request: dict[str, Any] = Field(default_factory=dict)
|
| 16 |
+
response: dict[str, Any] = Field(default_factory=dict)
|
| 17 |
+
completion_tokens: list[str] = Field(default_factory=list)
|
| 18 |
+
completion_token_ids: list[int] = Field(default_factory=list)
|
| 19 |
+
per_token_logps: list[float] = Field(default_factory=list)
|
| 20 |
+
finish_reason: str | None = None
|
| 21 |
+
latency_s: float = 0.0
|
| 22 |
+
timestamp: float = 0.0
|
| 23 |
+
|
| 24 |
+
|
| 25 |
+
class RolloutResult(BaseModel):
|
| 26 |
+
"""Outcome of one call to the ``run_rollout`` tool.
|
| 27 |
+
|
| 28 |
+
Serialized to JSON as the tool result. The training-side client
|
| 29 |
+
deserializes and feeds ``proxy_turns`` + ``reward`` into GRPO.
|
| 30 |
+
"""
|
| 31 |
+
|
| 32 |
+
# Identifiers
|
| 33 |
+
task_id: str = ""
|
| 34 |
+
sandbox_id: str = ""
|
| 35 |
+
|
| 36 |
+
# Scalars
|
| 37 |
+
reward: float | None = None
|
| 38 |
+
exit_code: int = 0
|
| 39 |
+
wall_s: float = 0.0
|
| 40 |
+
mode: str = "transparent_proxy"
|
| 41 |
+
|
| 42 |
+
# Per-turn trajectory (empty in black_box mode)
|
| 43 |
+
proxy_turns: list[RolloutTurn] = Field(default_factory=list)
|
| 44 |
+
|
| 45 |
+
# Agent artifacts
|
| 46 |
+
workdir_files: dict[str, str] = Field(default_factory=dict)
|
| 47 |
+
agent_log_tail: str = ""
|
| 48 |
+
|
| 49 |
+
# Verifier bookkeeping
|
| 50 |
+
verifier_stdout: str = ""
|
| 51 |
+
verifier_stderr: str = ""
|
| 52 |
+
test_exit_code: int | None = None
|
| 53 |
+
|
| 54 |
+
# Errors (if any) surfacing from sandbox/proxy/verifier path
|
| 55 |
+
error: str | None = None
|
| 56 |
+
|
| 57 |
+
|
| 58 |
+
class OpenCodeState(State):
|
| 59 |
+
"""Persistent env state across calls to the single environment instance.
|
| 60 |
+
|
| 61 |
+
Each HTTP session gets its own OpenCodeEnvironment (via
|
| 62 |
+
``SUPPORTS_CONCURRENT_SESSIONS = True`` on the server class), so this
|
| 63 |
+
state is per-session.
|
| 64 |
+
"""
|
| 65 |
+
|
| 66 |
+
rollouts_completed: int = 0
|
| 67 |
+
last_reward: float | None = None
|
| 68 |
+
last_task_id: str | None = None
|
| 69 |
+
last_sandbox_id: str | None = None
|
opencode_openenv.egg-info/PKG-INFO
ADDED
|
@@ -0,0 +1,19 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
Metadata-Version: 2.4
|
| 2 |
+
Name: opencode-openenv
|
| 3 |
+
Version: 0.1.0
|
| 4 |
+
Summary: OpenEnv OpenCode environment β spawns an E2B sandbox per rollout, runs OpenCode against a caller-supplied LLM endpoint, returns reward + proxy trace.
|
| 5 |
+
Author-email: adithya-s-k <adithyaskolavi@gmail.com>
|
| 6 |
+
Requires-Python: >=3.10
|
| 7 |
+
Requires-Dist: openenv-core[core] @ git+https://github.com/adithya-s-k/OpenEnv.git@opencode-harness
|
| 8 |
+
Requires-Dist: openenv-opencode_env @ git+https://github.com/adithya-s-k/OpenEnv.git@opencode-harness#subdirectory=envs/opencode_env
|
| 9 |
+
Requires-Dist: fastmcp>=3.0.0
|
| 10 |
+
Requires-Dist: fastapi>=0.115.0
|
| 11 |
+
Requires-Dist: uvicorn>=0.24.0
|
| 12 |
+
Requires-Dist: pydantic>=2.0.0
|
| 13 |
+
Requires-Dist: gradio>=4.0.0
|
| 14 |
+
Requires-Dist: python-dotenv>=1.0.0
|
| 15 |
+
Requires-Dist: requests>=2.31.0
|
| 16 |
+
Provides-Extra: dev
|
| 17 |
+
Requires-Dist: pytest>=8.0.0; extra == "dev"
|
| 18 |
+
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
|
| 19 |
+
Requires-Dist: httpx>=0.27.0; extra == "dev"
|
opencode_openenv.egg-info/SOURCES.txt
ADDED
|
@@ -0,0 +1,21 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
README.md
|
| 2 |
+
__init__.py
|
| 3 |
+
client.py
|
| 4 |
+
models.py
|
| 5 |
+
openenv.yaml
|
| 6 |
+
pyproject.toml
|
| 7 |
+
./__init__.py
|
| 8 |
+
./client.py
|
| 9 |
+
./models.py
|
| 10 |
+
./openenv.yaml
|
| 11 |
+
opencode_openenv.egg-info/PKG-INFO
|
| 12 |
+
opencode_openenv.egg-info/SOURCES.txt
|
| 13 |
+
opencode_openenv.egg-info/dependency_links.txt
|
| 14 |
+
opencode_openenv.egg-info/entry_points.txt
|
| 15 |
+
opencode_openenv.egg-info/requires.txt
|
| 16 |
+
opencode_openenv.egg-info/top_level.txt
|
| 17 |
+
server/__init__.py
|
| 18 |
+
server/app.py
|
| 19 |
+
server/gradio_ui.py
|
| 20 |
+
server/opencode_environment.py
|
| 21 |
+
server/requirements.txt
|
opencode_openenv.egg-info/dependency_links.txt
ADDED
|
@@ -0,0 +1 @@
|
|
|
|
|
|
|
| 1 |
+
|
opencode_openenv.egg-info/entry_points.txt
ADDED
|
@@ -0,0 +1,2 @@
|
|
|
|
|
|
|
|
|
|
| 1 |
+
[console_scripts]
|
| 2 |
+
server = opencode_env_server.server.app:main
|
opencode_openenv.egg-info/requires.txt
ADDED
|
@@ -0,0 +1,14 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
openenv-core[core] @ git+https://github.com/adithya-s-k/OpenEnv.git@opencode-harness
|
| 2 |
+
openenv-opencode_env @ git+https://github.com/adithya-s-k/OpenEnv.git@opencode-harness#subdirectory=envs/opencode_env
|
| 3 |
+
fastmcp>=3.0.0
|
| 4 |
+
fastapi>=0.115.0
|
| 5 |
+
uvicorn>=0.24.0
|
| 6 |
+
pydantic>=2.0.0
|
| 7 |
+
gradio>=4.0.0
|
| 8 |
+
python-dotenv>=1.0.0
|
| 9 |
+
requests>=2.31.0
|
| 10 |
+
|
| 11 |
+
[dev]
|
| 12 |
+
pytest>=8.0.0
|
| 13 |
+
pytest-cov>=4.0.0
|
| 14 |
+
httpx>=0.27.0
|
opencode_openenv.egg-info/top_level.txt
ADDED
|
@@ -0,0 +1 @@
|
|
|
|
|
|
|
| 1 |
+
opencode_env_server
|
openenv.yaml
ADDED
|
@@ -0,0 +1,6 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
spec_version: 1
|
| 2 |
+
name: opencode_env
|
| 3 |
+
type: space
|
| 4 |
+
runtime: fastapi
|
| 5 |
+
app: server.app:app
|
| 6 |
+
port: 8000
|
pyproject.toml
ADDED
|
@@ -0,0 +1,51 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
[build-system]
|
| 2 |
+
requires = ["setuptools>=45", "wheel"]
|
| 3 |
+
build-backend = "setuptools.build_meta"
|
| 4 |
+
|
| 5 |
+
[project]
|
| 6 |
+
name = "opencode-openenv"
|
| 7 |
+
version = "0.1.0"
|
| 8 |
+
description = "OpenEnv environment running the OpenCode coding agent inside an E2B sandbox, returning reward and per-turn trace."
|
| 9 |
+
authors = [
|
| 10 |
+
{ name = "adithya-s-k", email = "adithyaskolavi@gmail.com" }
|
| 11 |
+
]
|
| 12 |
+
requires-python = ">=3.10"
|
| 13 |
+
dependencies = [
|
| 14 |
+
# NOTE: openenv-core must come from the same branch as the primitive β
|
| 15 |
+
# the ``openenv.core.harness`` module doesn't exist on PyPI yet (it lives
|
| 16 |
+
# on PR #471 and our opencode-harness branch stacked on top of it).
|
| 17 |
+
"openenv-core[core] @ git+https://github.com/adithya-s-k/OpenEnv.git@opencode-harness",
|
| 18 |
+
"openenv-opencode_env @ git+https://github.com/adithya-s-k/OpenEnv.git@opencode-harness#subdirectory=envs/opencode_env",
|
| 19 |
+
"fastmcp>=3.0.0",
|
| 20 |
+
"fastapi>=0.115.0",
|
| 21 |
+
"uvicorn>=0.24.0",
|
| 22 |
+
"pydantic>=2.0.0",
|
| 23 |
+
"gradio>=4.0.0",
|
| 24 |
+
"python-dotenv>=1.0.0",
|
| 25 |
+
"requests>=2.31.0",
|
| 26 |
+
]
|
| 27 |
+
|
| 28 |
+
[project.optional-dependencies]
|
| 29 |
+
dev = [
|
| 30 |
+
"pytest>=8.0.0",
|
| 31 |
+
"pytest-cov>=4.0.0",
|
| 32 |
+
"httpx>=0.27.0",
|
| 33 |
+
]
|
| 34 |
+
|
| 35 |
+
[project.scripts]
|
| 36 |
+
server = "opencode_env_server.server.app:main"
|
| 37 |
+
|
| 38 |
+
[tool.setuptools]
|
| 39 |
+
include-package-data = true
|
| 40 |
+
packages = ["opencode_env_server", "opencode_env_server.server"]
|
| 41 |
+
package-dir = { "opencode_env_server" = ".", "opencode_env_server.server" = "server" }
|
| 42 |
+
|
| 43 |
+
[tool.setuptools.package-data]
|
| 44 |
+
"*" = ["*.txt", "*.yaml"]
|
| 45 |
+
|
| 46 |
+
[dependency-groups]
|
| 47 |
+
dev = [
|
| 48 |
+
"httpx>=0.27.0",
|
| 49 |
+
"pytest>=8.0.0",
|
| 50 |
+
"pytest-cov>=4.0.0",
|
| 51 |
+
]
|
server/__init__.py
ADDED
|
File without changes
|
server/app.py
ADDED
|
@@ -0,0 +1,92 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""FastAPI application for the OpenCode Environment.
|
| 2 |
+
|
| 3 |
+
The Gradio web UI is mounted at root (``/``). OpenEnv's standard API
|
| 4 |
+
endpoints (``/health``, ``/reset``, ``/step``, ``/mcp``) are registered
|
| 5 |
+
first and take precedence over Gradio routes.
|
| 6 |
+
|
| 7 |
+
Usage::
|
| 8 |
+
|
| 9 |
+
# Development:
|
| 10 |
+
E2B_API_KEY=... uv run uvicorn server.app:app --reload
|
| 11 |
+
|
| 12 |
+
# Via uv project script:
|
| 13 |
+
E2B_API_KEY=... uv run --project . server
|
| 14 |
+
|
| 15 |
+
# Docker:
|
| 16 |
+
docker run -p 8000:8000 -e E2B_API_KEY=... opencode-openenv
|
| 17 |
+
"""
|
| 18 |
+
|
| 19 |
+
from __future__ import annotations
|
| 20 |
+
|
| 21 |
+
import os
|
| 22 |
+
|
| 23 |
+
import gradio as gr
|
| 24 |
+
|
| 25 |
+
try:
|
| 26 |
+
from openenv.core.env_server.gradio_theme import (
|
| 27 |
+
OPENENV_GRADIO_CSS,
|
| 28 |
+
OPENENV_GRADIO_THEME,
|
| 29 |
+
)
|
| 30 |
+
from openenv.core.env_server.http_server import create_app
|
| 31 |
+
from openenv.core.env_server.mcp_types import (
|
| 32 |
+
CallToolAction,
|
| 33 |
+
CallToolObservation,
|
| 34 |
+
)
|
| 35 |
+
from openenv.core.env_server.web_interface import WebInterfaceManager
|
| 36 |
+
|
| 37 |
+
from .opencode_environment import OpenCodeEnvironment
|
| 38 |
+
from .gradio_ui import opencode_ui_builder
|
| 39 |
+
except ImportError:
|
| 40 |
+
from openenv.core.env_server.gradio_theme import (
|
| 41 |
+
OPENENV_GRADIO_CSS,
|
| 42 |
+
OPENENV_GRADIO_THEME,
|
| 43 |
+
)
|
| 44 |
+
from openenv.core.env_server.http_server import create_app
|
| 45 |
+
from openenv.core.env_server.mcp_types import (
|
| 46 |
+
CallToolAction,
|
| 47 |
+
CallToolObservation,
|
| 48 |
+
)
|
| 49 |
+
from openenv.core.env_server.web_interface import WebInterfaceManager
|
| 50 |
+
|
| 51 |
+
from server.opencode_environment import OpenCodeEnvironment
|
| 52 |
+
from server.gradio_ui import opencode_ui_builder
|
| 53 |
+
|
| 54 |
+
# Build the HTTP API server with MCP routing. We mount our own Gradio UI
|
| 55 |
+
# below, so disable the built-in web interface from create_app.
|
| 56 |
+
os.environ["ENABLE_WEB_INTERFACE"] = "false"
|
| 57 |
+
|
| 58 |
+
app = create_app(
|
| 59 |
+
OpenCodeEnvironment,
|
| 60 |
+
CallToolAction,
|
| 61 |
+
CallToolObservation,
|
| 62 |
+
env_name="opencode_env",
|
| 63 |
+
max_concurrent_envs=int(os.getenv("MAX_CONCURRENT_ENVS", "4")),
|
| 64 |
+
)
|
| 65 |
+
|
| 66 |
+
_web_manager = WebInterfaceManager(
|
| 67 |
+
OpenCodeEnvironment,
|
| 68 |
+
CallToolAction,
|
| 69 |
+
CallToolObservation,
|
| 70 |
+
)
|
| 71 |
+
|
| 72 |
+
_demo = opencode_ui_builder(
|
| 73 |
+
web_manager=_web_manager,
|
| 74 |
+
)
|
| 75 |
+
|
| 76 |
+
app = gr.mount_gradio_app(
|
| 77 |
+
app,
|
| 78 |
+
_demo,
|
| 79 |
+
path="/",
|
| 80 |
+
theme=OPENENV_GRADIO_THEME,
|
| 81 |
+
css=OPENENV_GRADIO_CSS,
|
| 82 |
+
)
|
| 83 |
+
|
| 84 |
+
|
| 85 |
+
def main(host: str = "0.0.0.0", port: int = 8000) -> None:
|
| 86 |
+
import uvicorn
|
| 87 |
+
|
| 88 |
+
uvicorn.run(app, host=host, port=port)
|
| 89 |
+
|
| 90 |
+
|
| 91 |
+
if __name__ == "__main__":
|
| 92 |
+
main()
|
server/gradio_ui.py
ADDED
|
@@ -0,0 +1,295 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Gradio UI for the OpenCode OpenEnv server.
|
| 2 |
+
|
| 3 |
+
One page. Top half: LLM config + task inputs. Bottom half: rollout
|
| 4 |
+
summary, proxy trace, workdir files, verifier output.
|
| 5 |
+
"""
|
| 6 |
+
|
| 7 |
+
from __future__ import annotations
|
| 8 |
+
|
| 9 |
+
import json
|
| 10 |
+
from typing import Any
|
| 11 |
+
|
| 12 |
+
import gradio as gr
|
| 13 |
+
|
| 14 |
+
|
| 15 |
+
# -- Defaults pre-filled in the form (can be overridden per run) -------------

# Demo instruction shown in the "Instruction" textbox on first load.
_DEFAULT_INSTRUCTION = (
    "Write a Python script `fizzbuzz.py` in the current directory that "
    "prints FizzBuzz for numbers 1..15, one per line. Print 'Fizz' for "
    "multiples of 3, 'Buzz' for multiples of 5, 'FizzBuzz' for both."
)

# Demo verifier: graded FizzBuzz checker. It always exits 0 and writes a
# fractional reward (correct lines / expected lines) to the reward file the
# environment reads back (/home/user/logs/verifier/reward.txt).
# NOTE(review): bash indentation inside the if/for bodies was flattened by the
# diff rendering; bash is indentation-insensitive so behavior is unchanged.
_DEFAULT_TEST_SCRIPT = r"""#!/usr/bin/env bash
set -u
mkdir -p /home/user/logs/verifier
REWARD_PATH=/home/user/logs/verifier/reward.txt

cd /home/user/workdir || { echo 0 > "$REWARD_PATH"; exit 0; }
if [ ! -f fizzbuzz.py ]; then
echo 0 > "$REWARD_PATH"
exit 0
fi

OUTPUT=$(python fizzbuzz.py 2>&1 | head -20 || true)
EXPECTED=(1 2 Fizz 4 Buzz Fizz 7 8 Fizz Buzz 11 Fizz 13 14 FizzBuzz)

HITS=0
for line in "${EXPECTED[@]}"; do
if echo "$OUTPUT" | grep -qxF "$line"; then HITS=$((HITS + 1)); fi
done

python -c "print(${HITS} / ${#EXPECTED[@]})" > "$REWARD_PATH"
echo "fizzbuzz: ${HITS}/${#EXPECTED[@]} lines correct"
"""

# Model ids offered as examples; the first entry seeds the "Model id" textbox.
_EXAMPLE_MODELS = [
    "Qwen/Qwen3.5-4B",
    "Qwen/Qwen3-Coder-Next",
    "openai/gpt-4o-mini",
    "openai/gpt-5.3-chat-latest",
]

# Example LLM base URLs; the first entry seeds the "vLLM / LLM base URL" box.
_EXAMPLE_VLLM_URLS = [
    "https://<your-public-llm-host>/v1",
    "https://api.openai.com/v1",
]
|
| 57 |
+
|
| 58 |
+
|
| 59 |
+
def opencode_ui_builder(
    *,
    web_manager: Any,
    title: str = "OpenCode Env",
    **_: Any,
) -> gr.Blocks:
    """Build the Gradio Blocks UI bound to ``web_manager``.

    The web manager is a thin wrapper around ``OpenCodeEnvironment``:
    calling ``call_tool("run_rollout", ...)`` on it drives one rollout.

    Args:
        web_manager: Object exposing ``get_environment()``; the returned
            environment must support ``reset()`` and
            ``call_tool("run_rollout", ...)``.
        title: Page and browser-tab title.
        **_: Extra keyword arguments are accepted and ignored so the host
            server can call every UI builder with a uniform signature.

    Returns:
        A ``gr.Blocks`` app: config + task inputs on top, rollout summary,
        proxy trace, workdir files and verifier output below.
    """

    with gr.Blocks(title=title, analytics_enabled=False) as demo:
        gr.Markdown(f"# {title}\nRun one OpenCode rollout against any OpenAI-compatible endpoint.")

        # -- LLM / provider configuration ----------------------------------
        with gr.Row():
            with gr.Column(scale=1):
                vllm_url = gr.Textbox(
                    label="vLLM / LLM base URL",
                    value=_EXAMPLE_VLLM_URLS[0],
                    placeholder="https://.../v1",
                )
                model = gr.Textbox(
                    label="Model id",
                    value=_EXAMPLE_MODELS[0],
                    placeholder="Qwen/Qwen3.5-4B",
                )
                provider = gr.Dropdown(
                    label="Provider",
                    choices=["openai_compatible", "openai", "anthropic"],
                    value="openai_compatible",
                )
                api_key = gr.Textbox(
                    label="API key (ignored by vLLM)",
                    value="intercepted",
                    type="password",
                )
            with gr.Column(scale=1):
                mode = gr.Dropdown(
                    label="Mode",
                    choices=["transparent_proxy", "black_box"],
                    value="transparent_proxy",
                )
                disable_thinking = gr.Checkbox(
                    label="Disable Qwen3 thinking mode",
                    value=True,
                )
                max_tokens_cap = gr.Slider(
                    label="max_tokens cap",
                    minimum=512, maximum=32768, value=4096, step=256,
                )
                agent_timeout_s = gr.Slider(
                    label="Agent timeout (s)",
                    minimum=60, maximum=1200, value=300, step=30,
                )

        # -- Task inputs ---------------------------------------------------
        with gr.Row():
            task_id = gr.Textbox(label="Task id (optional)", value="fizzbuzz_demo")
        instruction = gr.Textbox(
            label="Instruction",
            value=_DEFAULT_INSTRUCTION,
            lines=4,
        )
        test_script = gr.Code(
            label="test.sh (bash verifier β writes reward to /home/user/logs/verifier/reward.txt)",
            value=_DEFAULT_TEST_SCRIPT,
            language="shell",
        )
        setup_shell = gr.Textbox(
            label="Setup shell (optional, runs before opencode)",
            value="",
            placeholder="e.g. pip install polars",
        )

        run_btn = gr.Button("βΆ Run rollout", variant="primary")

        # -- Output panels -------------------------------------------------
        status = gr.Markdown()
        with gr.Row():
            reward = gr.Number(label="reward", value=None, interactive=False)
            wall_s = gr.Number(label="wall_s", value=None, interactive=False)
            exit_code = gr.Number(label="exit_code", value=None, interactive=False)
            n_turns = gr.Number(label="proxy_turns", value=None, interactive=False)

        with gr.Accordion("Workdir files", open=True):
            workdir_md = gr.Markdown()
        with gr.Accordion("Proxy trace (per turn)", open=False):
            proxy_trace_json = gr.JSON(label=None)
        with gr.Accordion("Verifier stdout / stderr", open=False):
            verifier_out = gr.Textbox(label="stdout", lines=8)
            verifier_err = gr.Textbox(label="stderr", lines=4)
        with gr.Accordion("Raw result JSON", open=False):
            raw_json = gr.JSON(label=None)

        # -- Run handler ----------------------------------------------------
        # Synchronous: one click = one full rollout; the 10-tuple returned
        # here maps 1:1 onto the ``outputs`` list of run_btn.click below.
        def _run(
            vllm_url_v: str,
            model_v: str,
            provider_v: str,
            api_key_v: str,
            mode_v: str,
            disable_thinking_v: bool,
            max_tokens_cap_v: int,
            agent_timeout_s_v: float,
            task_id_v: str,
            instruction_v: str,
            test_script_v: str,
            setup_shell_v: str,
        ):
            try:
                env = web_manager.get_environment()
                env.reset()
                result_raw = env.call_tool(
                    "run_rollout",
                    vllm_url=vllm_url_v,
                    model=model_v,
                    instruction=instruction_v,
                    test_script=test_script_v,
                    task_id=task_id_v,
                    setup_shell=setup_shell_v,
                    upload_files={},
                    provider=provider_v,
                    api_key=api_key_v,
                    mode=mode_v,
                    disable_thinking=bool(disable_thinking_v),
                    max_tokens_cap=int(max_tokens_cap_v),
                    agent_timeout_s=float(agent_timeout_s_v),
                )
                result = _parse_result(result_raw)
            except Exception as exc:
                # Surface the failure in the status panel and blank the rest.
                msg = f"**Error:** `{type(exc).__name__}: {exc}`"
                return (msg, None, None, None, None, "", [], "", "", {"error": str(exc)})

            status_md = _summarize_status(result)
            wd_md = _render_workdir(result.get("workdir_files") or {})
            turns = result.get("proxy_turns") or []
            # Truncate verifier output so the UI payload stays small.
            verifier_stdout = (result.get("verifier_stdout") or "")[:4000]
            verifier_stderr = (result.get("verifier_stderr") or "")[:2000]
            return (
                status_md,
                result.get("reward"),
                result.get("wall_s"),
                result.get("exit_code"),
                len(turns),
                wd_md,
                turns,
                verifier_stdout,
                verifier_stderr,
                result,
            )

        run_btn.click(
            _run,
            inputs=[
                vllm_url, model, provider, api_key, mode, disable_thinking,
                max_tokens_cap, agent_timeout_s,
                task_id, instruction, test_script, setup_shell,
            ],
            outputs=[
                status, reward, wall_s, exit_code, n_turns,
                workdir_md, proxy_trace_json,
                verifier_out, verifier_err, raw_json,
            ],
        )

    return demo
|
| 227 |
+
|
| 228 |
+
|
| 229 |
+
# ββ Helpers βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 230 |
+
|
| 231 |
+
|
| 232 |
+
def _parse_result(raw: Any) -> dict[str, Any]:
|
| 233 |
+
"""Unwrap the server's JSON tool result into a plain dict."""
|
| 234 |
+
# Object with attribute chain: obs.result.content[0].text
|
| 235 |
+
inner = getattr(raw, "result", None)
|
| 236 |
+
if inner is not None:
|
| 237 |
+
content = getattr(inner, "content", None)
|
| 238 |
+
if content:
|
| 239 |
+
first = content[0]
|
| 240 |
+
text = getattr(first, "text", None)
|
| 241 |
+
if isinstance(text, str):
|
| 242 |
+
try:
|
| 243 |
+
return json.loads(text)
|
| 244 |
+
except Exception:
|
| 245 |
+
return {"raw": text}
|
| 246 |
+
|
| 247 |
+
if isinstance(raw, dict):
|
| 248 |
+
content = raw.get("content")
|
| 249 |
+
if isinstance(content, list) and content:
|
| 250 |
+
first = content[0]
|
| 251 |
+
text = first.get("text") if isinstance(first, dict) else None
|
| 252 |
+
if isinstance(text, str):
|
| 253 |
+
try:
|
| 254 |
+
return json.loads(text)
|
| 255 |
+
except Exception:
|
| 256 |
+
return {"raw": text}
|
| 257 |
+
return raw
|
| 258 |
+
if isinstance(raw, str):
|
| 259 |
+
try:
|
| 260 |
+
return json.loads(raw)
|
| 261 |
+
except Exception:
|
| 262 |
+
return {"raw": raw}
|
| 263 |
+
return {"raw": str(raw)}
|
| 264 |
+
|
| 265 |
+
|
| 266 |
+
def _summarize_status(result: dict[str, Any]) -> str:
|
| 267 |
+
if result.get("error"):
|
| 268 |
+
return f"β **Error:** `{result['error']}`"
|
| 269 |
+
reward = result.get("reward")
|
| 270 |
+
turns = result.get("proxy_turns") or []
|
| 271 |
+
wall = result.get("wall_s", 0.0)
|
| 272 |
+
sb = result.get("sandbox_id", "")
|
| 273 |
+
exit_code = result.get("exit_code")
|
| 274 |
+
parts = [
|
| 275 |
+
f"**reward** = `{reward}`",
|
| 276 |
+
f"**wall** = `{wall}s`",
|
| 277 |
+
f"**turns** = `{len(turns)}`",
|
| 278 |
+
f"**exit** = `{exit_code}`",
|
| 279 |
+
]
|
| 280 |
+
if sb:
|
| 281 |
+
parts.append(f"**sandbox** = `{sb}`")
|
| 282 |
+
return " Β· ".join(parts)
|
| 283 |
+
|
| 284 |
+
|
| 285 |
+
def _render_workdir(files: dict[str, str]) -> str:
|
| 286 |
+
if not files:
|
| 287 |
+
return "_(no files produced)_"
|
| 288 |
+
lines = []
|
| 289 |
+
for path, contents in files.items():
|
| 290 |
+
lines.append(f"### `{path}`")
|
| 291 |
+
lines.append("")
|
| 292 |
+
lines.append("```")
|
| 293 |
+
lines.append((contents or "").rstrip()[:2000])
|
| 294 |
+
lines.append("```")
|
| 295 |
+
return "\n".join(lines)
|
server/opencode_environment.py
ADDED
|
@@ -0,0 +1,352 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""OpenCode MCP Environment.
|
| 2 |
+
|
| 3 |
+
Exposes a single tool ``run_rollout`` that runs one OpenCode agent rollout
|
| 4 |
+
end-to-end against a caller-supplied LLM endpoint:
|
| 5 |
+
|
| 6 |
+
1. Spawn a fresh E2B sandbox (via the primitive's ``E2BSandboxBackend``).
|
| 7 |
+
2. Install opencode in the sandbox, write its config pointing at an
|
| 8 |
+
in-sandbox proxy (Mode B) or the caller's LLM URL directly (Mode A).
|
| 9 |
+
3. Stage the caller-supplied task: instruction + test.sh + any extra files.
|
| 10 |
+
4. Run ``opencode run`` to completion.
|
| 11 |
+
5. Execute the verifier script; read the scalar reward from
|
| 12 |
+
``/home/user/logs/verifier/reward.txt``.
|
| 13 |
+
6. Collect proxy trace + workdir contents.
|
| 14 |
+
7. Return a JSON-serialized :class:`RolloutResult`.
|
| 15 |
+
|
| 16 |
+
The env is deliberately task-agnostic β the training script passes the
|
| 17 |
+
full task (instruction + verifier) through the tool arguments.
|
| 18 |
+
"""
|
| 19 |
+
|
| 20 |
+
from __future__ import annotations
|
| 21 |
+
|
| 22 |
+
import json
|
| 23 |
+
import os
|
| 24 |
+
import time
|
| 25 |
+
from typing import Any, Optional
|
| 26 |
+
from uuid import uuid4
|
| 27 |
+
|
| 28 |
+
from dotenv import load_dotenv
|
| 29 |
+
from fastmcp import FastMCP
|
| 30 |
+
from openenv.core.env_server.mcp_environment import MCPEnvironment
|
| 31 |
+
from openenv.core.env_server.types import Action, Observation
|
| 32 |
+
|
| 33 |
+
load_dotenv()
|
| 34 |
+
|
| 35 |
+
|
| 36 |
+
# Default test-script and reward paths inside the sandbox. The server writes
# the caller-supplied ``test_script`` text to this path; the verifier reads
# the reward file back out after it finishes.
REMOTE_TEST_PATH = "/home/user/tests/test.sh"          # where test_script is staged
REMOTE_REWARD_PATH = "/home/user/logs/verifier/reward.txt"  # verifier writes the scalar reward here
WORKDIR_PATH = "/home/user/workdir"                    # cwd for the agent and the verifier
VERIFIER_TIMEOUT_S = 120                               # hard cap on verifier runtime (seconds)
|
| 43 |
+
|
| 44 |
+
|
| 45 |
+
class OpenCodeEnvironment(MCPEnvironment):
    """One-tool MCP environment.

    The single tool ``run_rollout`` is synchronous and returns a JSON string:
    one call = one complete agent rollout. Each call creates and destroys
    its own E2B sandbox, so the environment is stateless across calls.
    """

    # Each rollout owns a fresh sandbox, so concurrent sessions don't share
    # mutable state.
    SUPPORTS_CONCURRENT_SESSIONS = True

    def __init__(self) -> None:
        # Import inside __init__ to keep module import cheap and to allow
        # patching for tests. Dual-import pattern: package when installed,
        # flat when run directly out of the repo via ``server.app:app``.
        try:
            from ..models import OpenCodeState, RolloutResult, RolloutTurn
        except ImportError:
            from models import OpenCodeState, RolloutResult, RolloutTurn  # type: ignore
        from opencode_env import (
            E2BSandboxBackend,
            OpenCodeConfig,
            OpenCodeSessionFactory,
            collect_rollout_summary,
        )
        from openenv.core.harness import VerifyResult

        # Stash lazily-imported classes/helpers on the instance so the
        # rollout implementation (and tests) can reach them without
        # re-importing.
        self._state_cls = OpenCodeState
        self._result_cls = RolloutResult
        self._turn_cls = RolloutTurn
        self._OpenCodeConfig = OpenCodeConfig
        self._OpenCodeSessionFactory = OpenCodeSessionFactory
        self._E2BSandboxBackend = E2BSandboxBackend
        self._collect_rollout_summary = collect_rollout_summary
        # NOTE(review): only stored here; not used elsewhere in this block.
        self._VerifyResult = VerifyResult

        # Require E2B credentials up front β fail loudly if unset.
        if not os.environ.get("E2B_API_KEY"):
            raise RuntimeError(
                "E2B_API_KEY environment variable is required for OpenCodeEnvironment"
            )

        self._state = self._state_cls(episode_id=str(uuid4()))

        mcp = FastMCP("opencode_env")

        # The tool is registered on the FastMCP app BEFORE calling
        # super().__init__(mcp) below, so the base class sees it.
        @mcp.tool
        def run_rollout(
            vllm_url: str,
            model: str,
            instruction: str,
            test_script: str,
            task_id: str = "",
            setup_shell: str = "",
            upload_files: Optional[dict[str, str]] = None,
            provider: str = "openai_compatible",
            api_key: str = "intercepted",
            mode: str = "transparent_proxy",
            disable_thinking: bool = False,
            max_tokens_cap: int = 4096,
            agent_timeout_s: float = 600.0,
        ) -> str:
            """Run one OpenCode rollout end-to-end.

            Args:
                vllm_url: LLM endpoint (``https://host/v1``).
                model: Model id the provider recognizes.
                instruction: Prompt passed to ``opencode run``.
                test_script: Bash verifier. Must write a float reward to
                    ``/home/user/logs/verifier/reward.txt``.
                task_id: Optional identifier echoed back for traceability.
                setup_shell: Optional shell run before opencode starts.
                upload_files: Optional {remote_path: content} staged into the
                    sandbox.
                provider: OpenCodeConfig provider id. For vLLM use
                    ``"openai_compatible"``; for real OpenAI ``"openai"``.
                api_key: Provider API key. vLLM ignores this.
                mode: ``"transparent_proxy"`` (captures per-turn logprobs) or
                    ``"black_box"`` (direct connection, no logprobs).
                disable_thinking: Qwen3/Qwen3.5 proxy-side thinking disable.
                max_tokens_cap: Clamp forwarded ``max_tokens``.
                agent_timeout_s: Max opencode runtime in seconds.

            Returns:
                JSON-serialized :class:`RolloutResult`.
            """
            return self._run_rollout_impl(
                vllm_url=vllm_url,
                model=model,
                instruction=instruction,
                test_script=test_script,
                task_id=task_id,
                setup_shell=setup_shell,
                upload_files=upload_files or {},
                provider=provider,
                api_key=api_key,
                mode=mode,
                disable_thinking=disable_thinking,
                max_tokens_cap=max_tokens_cap,
                agent_timeout_s=agent_timeout_s,
            )

        super().__init__(mcp)

    # -- OpenEnv lifecycle ---------------------------------------------------

    def reset(
        self,
        seed: Optional[int] = None,
        episode_id: Optional[str] = None,
        **_: Any,
    ) -> Observation:
        # Start a fresh episode; ``seed`` is accepted for API compatibility
        # but unused (each rollout is already isolated in its own sandbox).
        self._state = self._state_cls(episode_id=episode_id or str(uuid4()))
        return Observation(
            done=False,
            reward=None,
            metadata={
                "status": "ready",
                "message": "OpenCode env ready. Call run_rollout(...) with a task.",
            },
        )

    def _step_impl(
        self,
        action: Action,
        timeout_s: Optional[float] = None,
        **_: Any,
    ) -> Observation:
        # Non-tool actions are not supported; steer callers to the MCP tool.
        return Observation(
            done=False,
            reward=None,
            metadata={
                "error": (
                    f"Unknown action type: {type(action).__name__}. "
                    "Use CallToolAction(name='run_rollout', ...)."
                ),
            },
        )

    @property
    def state(self) -> Any:
        # Lightweight bookkeeping state (counters + last-rollout info).
        return self._state

    # -- Rollout implementation ----------------------------------------------

    def _run_rollout_impl(
        self,
        *,
        vllm_url: str,
        model: str,
        instruction: str,
        test_script: str,
        task_id: str,
        setup_shell: str,
        upload_files: dict[str, str],
        provider: str,
        api_key: str,
        mode: str,
        disable_thinking: bool,
        max_tokens_cap: int,
        agent_timeout_s: float,
    ) -> str:
        """Create a sandbox, run the agent, verify, collect, and tear down.

        Always returns a JSON-serialized RolloutResult; failures are captured
        in ``result.error`` rather than raised, so the MCP call never leaks
        an exception to the client.
        """
        from opencode_env import OpenCodeTask

        result = self._result_cls(task_id=task_id, mode=mode)
        t0 = time.time()

        provider_model = _qualify_model(provider, model)

        config = self._OpenCodeConfig(
            provider=provider,
            base_url=vllm_url.rstrip("/"),
            api_key=api_key,
            model=provider_model,
            agent_timeout_s=agent_timeout_s,
            proxy_disable_thinking=disable_thinking,
            # A non-positive cap means "no clamping".
            proxy_max_tokens_cap=max_tokens_cap if max_tokens_cap > 0 else None,
        )

        factory = self._OpenCodeSessionFactory(
            config=config,
            sandbox_backend=self._E2BSandboxBackend(),
            mode=mode,  # "transparent_proxy" or "black_box"
            verifier=None,  # we run the caller's test_script ourselves below
        )

        # Stage caller files plus the verifier script into the sandbox.
        merged_uploads = dict(upload_files)
        merged_uploads[REMOTE_TEST_PATH] = test_script
        task = OpenCodeTask(
            instruction=instruction,
            setup_shell=setup_shell or None,
            upload_files=merged_uploads,
            metadata={"task_id": task_id},
        )

        session = None
        try:
            session = factory.create(task=task)
            result.sandbox_id = session.sandbox.sandbox_id

            exit_code = session.wait_for_completion(timeout_s=agent_timeout_s)
            result.exit_code = int(exit_code)

            # Run the verifier. Exit code is ignored; the reward file is the
            # source of truth.
            session.sandbox.exec(
                f"mkdir -p /home/user/logs/verifier /home/user/tests && "
                f"chmod +x {REMOTE_TEST_PATH}",
                timeout=15,
            )
            verifier_run = session.sandbox.exec(
                f"bash {REMOTE_TEST_PATH}",
                cwd=WORKDIR_PATH,
                timeout=VERIFIER_TIMEOUT_S,
            )
            result.test_exit_code = int(verifier_run.exit_code)
            # Truncate captured output so the JSON payload stays bounded.
            result.verifier_stdout = (verifier_run.stdout or "")[:4000]
            result.verifier_stderr = (verifier_run.stderr or "")[:2000]
            result.reward = _read_reward(session.sandbox, REMOTE_REWARD_PATH)

            # Collect artifacts via the primitive's summary helper.
            summary = self._collect_rollout_summary(session)
            result.agent_log_tail = _tail(summary.opencode_events, 20)
            result.workdir_files = {
                path: (contents or "")[:8000]
                for path, contents in (summary.workdir_contents or {}).items()
            }
            for raw in summary.proxy_turns:
                result.proxy_turns.append(self._turn_cls(**_clamp_turn(raw)))
        except Exception as exc:
            # Never raise out of the tool: record the failure on the result.
            result.error = f"{type(exc).__name__}: {exc}"
        finally:
            if session is not None:
                try:
                    session.close()
                except Exception:
                    pass  # best-effort teardown; sandbox may already be gone

        result.wall_s = round(time.time() - t0, 3)

        # Persist lightweight state for bookkeeping.
        self._state.rollouts_completed += 1
        self._state.last_reward = result.reward
        self._state.last_task_id = task_id or None
        self._state.last_sandbox_id = result.sandbox_id or None

        return result.model_dump_json()
|
| 291 |
+
|
| 292 |
+
|
| 293 |
+
# ββ Helpers βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 294 |
+
|
| 295 |
+
|
| 296 |
+
def _qualify_model(provider: str, model: str) -> str:
|
| 297 |
+
"""Return a ``<provider>/<model>`` string acceptable to the primitive.
|
| 298 |
+
|
| 299 |
+
If the caller already prefixed the model, leave it alone; otherwise
|
| 300 |
+
prepend the provider so OpenCode's config file is well-formed.
|
| 301 |
+
"""
|
| 302 |
+
if "/" in model:
|
| 303 |
+
return model
|
| 304 |
+
return f"{provider}/{model}"
|
| 305 |
+
|
| 306 |
+
|
| 307 |
+
def _read_reward(sandbox: Any, reward_path: str) -> Optional[float]:
|
| 308 |
+
try:
|
| 309 |
+
raw = sandbox.read_text(reward_path).strip()
|
| 310 |
+
except Exception:
|
| 311 |
+
return None
|
| 312 |
+
if not raw:
|
| 313 |
+
return None
|
| 314 |
+
try:
|
| 315 |
+
return float(raw)
|
| 316 |
+
except ValueError:
|
| 317 |
+
return None
|
| 318 |
+
|
| 319 |
+
|
| 320 |
+
def _clamp_turn(turn: dict[str, Any]) -> dict[str, Any]:
|
| 321 |
+
"""Clamp per-turn payload sizes to keep responses under a reasonable cap."""
|
| 322 |
+
out = dict(turn)
|
| 323 |
+
# Compact ``response`` β we already captured tokens/logps explicitly.
|
| 324 |
+
out["response"] = {
|
| 325 |
+
"finish_reason": (out.get("response") or {}).get("choices", [{}])[0].get(
|
| 326 |
+
"finish_reason"
|
| 327 |
+
),
|
| 328 |
+
"usage": (out.get("response") or {}).get("usage"),
|
| 329 |
+
}
|
| 330 |
+
req = out.get("request") or {}
|
| 331 |
+
messages = req.get("messages") or []
|
| 332 |
+
# Keep request messages (trainer needs them) but drop very long tool schemas.
|
| 333 |
+
req = {
|
| 334 |
+
"model": req.get("model"),
|
| 335 |
+
"messages": messages,
|
| 336 |
+
"temperature": req.get("temperature"),
|
| 337 |
+
"top_p": req.get("top_p"),
|
| 338 |
+
"max_tokens": req.get("max_tokens"),
|
| 339 |
+
"max_completion_tokens": req.get("max_completion_tokens"),
|
| 340 |
+
"logprobs": req.get("logprobs"),
|
| 341 |
+
"top_logprobs": req.get("top_logprobs"),
|
| 342 |
+
"stream": req.get("stream"),
|
| 343 |
+
}
|
| 344 |
+
out["request"] = req
|
| 345 |
+
return out
|
| 346 |
+
|
| 347 |
+
|
| 348 |
+
def _tail(events: list[dict[str, Any]], n: int) -> str:
|
| 349 |
+
"""Return the last ``n`` opencode event lines as a newline-joined string."""
|
| 350 |
+
if not events:
|
| 351 |
+
return ""
|
| 352 |
+
return "\n".join(json.dumps(e) for e in events[-n:])
|
server/requirements.txt
ADDED
|
@@ -0,0 +1,9 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
openenv-core[core] @ git+https://github.com/adithya-s-k/OpenEnv.git@opencode-harness
|
| 2 |
+
openenv-opencode_env @ git+https://github.com/adithya-s-k/OpenEnv.git@opencode-harness#subdirectory=envs/opencode_env
|
| 3 |
+
fastmcp>=3.0.0
|
| 4 |
+
fastapi>=0.115.0
|
| 5 |
+
uvicorn>=0.24.0
|
| 6 |
+
pydantic>=2.0.0
|
| 7 |
+
gradio>=4.0.0
|
| 8 |
+
python-dotenv>=1.0.0
|
| 9 |
+
requests>=2.31.0
|
tests/__init__.py
ADDED
|
File without changes
|
tests/test_client.py
ADDED
|
@@ -0,0 +1,253 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
"""End-to-end HTTP tests for the deployed OpenCode OpenEnv server.

By default the tests hit the HF Space deployment. Override
``OPENCODE_ENV_URL`` to point at a local ``uvicorn server.app:app``
or a ``docker run``-backed container. Every test also needs a reachable
vLLM endpoint — set ``VLLM_BASE_URL`` to the public URL of a running
``vllm serve Qwen/Qwen3.5-4B`` (see the slurm scripts under dev/slurm/
for one way to stand one up).

Run::

    export VLLM_BASE_URL=https://your-llm-host/v1
    uv run pytest tests/ -v -s

    # against a local server:
    OPENCODE_ENV_URL=http://localhost:8000 uv run pytest tests/ -v -s
"""

from __future__ import annotations

import json
import os
from typing import Any

import pytest


# Target environment server; defaults to the public HF Space deployment.
ENV_URL = os.getenv(
    "OPENCODE_ENV_URL", "https://AdithyaSK-opencode-openenv.hf.space"
)
# vLLM endpoint that drives rollouts. The trailing slash is stripped so a
# "/v1" suffix can be appended uniformly later without double slashes.
VLLM_BASE_URL = os.getenv("VLLM_BASE_URL", "").rstrip("/")
VLLM_MODEL = os.getenv("VLLM_MODEL", "Qwen/Qwen3.5-4B")


# Skip the whole module when no live vLLM endpoint is configured — every
# test here ultimately needs a model to talk to.
pytestmark = pytest.mark.skipif(
    not VLLM_BASE_URL,
    reason=(
        "VLLM_BASE_URL not set; point it at a live public-endpointed "
        "vLLM endpoint (see dev/slurm/vllm_endpoint_qwen35_4b.slurm)."
    ),
)
| 44 |
+
# ββ Inline task bundles βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 45 |
+
# Tasks live in the training script, not the env β these are test fixtures
|
| 46 |
+
# mirroring what a trainer would send through ``run_rollout``.
|
| 47 |
+
|
| 48 |
+
|
| 49 |
+
_FIZZBUZZ_INSTRUCTION = (
|
| 50 |
+
"Write a Python script `fizzbuzz.py` in the current working directory "
|
| 51 |
+
"that prints FizzBuzz for numbers 1..15, one per line. Print 'Fizz' "
|
| 52 |
+
"for multiples of 3, 'Buzz' for multiples of 5, 'FizzBuzz' for both."
|
| 53 |
+
)
|
| 54 |
+
|
| 55 |
+
_FIZZBUZZ_TEST = r"""#!/usr/bin/env bash
|
| 56 |
+
set -u
|
| 57 |
+
mkdir -p /home/user/logs/verifier
|
| 58 |
+
REWARD=/home/user/logs/verifier/reward.txt
|
| 59 |
+
cd /home/user/workdir || { echo 0 > "$REWARD"; exit 0; }
|
| 60 |
+
[ -f fizzbuzz.py ] || { echo 0 > "$REWARD"; exit 0; }
|
| 61 |
+
OUT=$(python fizzbuzz.py 2>&1 | head -20 || true)
|
| 62 |
+
EXPECTED=(1 2 Fizz 4 Buzz Fizz 7 8 Fizz Buzz 11 Fizz 13 14 FizzBuzz)
|
| 63 |
+
HITS=0
|
| 64 |
+
for line in "${EXPECTED[@]}"; do
|
| 65 |
+
echo "$OUT" | grep -qxF "$line" && HITS=$((HITS + 1))
|
| 66 |
+
done
|
| 67 |
+
python -c "print(${HITS}/${#EXPECTED[@]})" > "$REWARD"
|
| 68 |
+
echo "fizzbuzz: ${HITS}/${#EXPECTED[@]}"
|
| 69 |
+
"""
|
| 70 |
+
|
| 71 |
+
|
| 72 |
+
_SORT_LIST_INSTRUCTION = (
|
| 73 |
+
"Write a Python script `sort_list.py` in the current working directory "
|
| 74 |
+
"that sorts [42, 7, 13, 1, 99, 5, 23, 8, 31, 11] ascending and prints "
|
| 75 |
+
"the result as one comma-separated line with no spaces. Expected "
|
| 76 |
+
"output (exactly): 1,5,7,8,11,13,23,31,42,99 β do not print anything else."
|
| 77 |
+
)
|
| 78 |
+
|
| 79 |
+
_SORT_LIST_TEST = r"""#!/usr/bin/env bash
|
| 80 |
+
set -u
|
| 81 |
+
mkdir -p /home/user/logs/verifier
|
| 82 |
+
REWARD=/home/user/logs/verifier/reward.txt
|
| 83 |
+
cd /home/user/workdir || { echo 0 > "$REWARD"; exit 0; }
|
| 84 |
+
[ -f sort_list.py ] || { echo 0 > "$REWARD"; exit 0; }
|
| 85 |
+
EXPECTED="1,5,7,8,11,13,23,31,42,99"
|
| 86 |
+
OUT=$(python sort_list.py 2>/dev/null | head -1 || true)
|
| 87 |
+
if [ "$OUT" = "$EXPECTED" ]; then
|
| 88 |
+
echo 1.0 > "$REWARD"
|
| 89 |
+
echo "sort_list: PASS"
|
| 90 |
+
else
|
| 91 |
+
echo 0.0 > "$REWARD"
|
| 92 |
+
echo "sort_list: FAIL got='${OUT}' want='${EXPECTED}'"
|
| 93 |
+
fi
|
| 94 |
+
"""
|
| 95 |
+
|
| 96 |
+
|
| 97 |
+
_SIMPLE_IO_INSTRUCTION = (
|
| 98 |
+
"Create a file `greeting.txt` in the current working directory "
|
| 99 |
+
"containing exactly the line `hello, world` (followed by a newline). "
|
| 100 |
+
"Then write a Python script `read_and_echo.py` that opens "
|
| 101 |
+
"`greeting.txt` and prints its contents to stdout. Run the script "
|
| 102 |
+
"to verify it prints `hello, world` before you stop."
|
| 103 |
+
)
|
| 104 |
+
|
| 105 |
+
_SIMPLE_IO_TEST = r"""#!/usr/bin/env bash
|
| 106 |
+
set -u
|
| 107 |
+
mkdir -p /home/user/logs/verifier
|
| 108 |
+
REWARD=/home/user/logs/verifier/reward.txt
|
| 109 |
+
cd /home/user/workdir || { echo 0 > "$REWARD"; exit 0; }
|
| 110 |
+
SCORE=0.0
|
| 111 |
+
if [ -f greeting.txt ]; then
|
| 112 |
+
if [ "$(cat greeting.txt)" = "hello, world" ]; then
|
| 113 |
+
SCORE=$(python -c "print(${SCORE} + 0.5)")
|
| 114 |
+
fi
|
| 115 |
+
fi
|
| 116 |
+
if [ -f read_and_echo.py ]; then
|
| 117 |
+
OUT=$(python read_and_echo.py 2>/dev/null | head -1 || true)
|
| 118 |
+
if [ "$OUT" = "hello, world" ]; then
|
| 119 |
+
SCORE=$(python -c "print(${SCORE} + 0.5)")
|
| 120 |
+
fi
|
| 121 |
+
fi
|
| 122 |
+
echo "$SCORE" > "$REWARD"
|
| 123 |
+
echo "simple_io: score=$SCORE"
|
| 124 |
+
"""
|
| 125 |
+
|
| 126 |
+
|
| 127 |
+
_TASKS = {
|
| 128 |
+
"fizzbuzz": (_FIZZBUZZ_INSTRUCTION, _FIZZBUZZ_TEST),
|
| 129 |
+
"sort_list": (_SORT_LIST_INSTRUCTION, _SORT_LIST_TEST),
|
| 130 |
+
"simple_io": (_SIMPLE_IO_INSTRUCTION, _SIMPLE_IO_TEST),
|
| 131 |
+
}
|
| 132 |
+
|
| 133 |
+
|
| 134 |
+
# ββ Fixtures ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 135 |
+
|
| 136 |
+
|
| 137 |
+
@pytest.fixture(scope="module")
def client():
    """Yield a sync MCP client connected to the env server.

    Module-scoped so all tests in this file share one connection.
    Falls back to importing ``client`` straight from the source tree
    when the package has not been pip-installed.
    """
    try:
        from opencode_env_server import OpenCodeEnv
    except ImportError:
        # Running from the source tree before the package is pip-installed.
        import sys
        from pathlib import Path

        sys.path.insert(0, str(Path(__file__).resolve().parents[1]))
        from client import OpenCodeEnv  # type: ignore

    # Use ``with`` rather than calling __enter__/__exit__ by hand: teardown
    # is guaranteed even if an exception propagates through the yield, and
    # __exit__ then receives the real exception info instead of the
    # unconditional (None, None, None) the manual form passed.
    with OpenCodeEnv(base_url=ENV_URL).sync() as env:
        yield env
| 154 |
+
|
| 155 |
+
|
| 156 |
+
# ββ Server-liveness tests βββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 157 |
+
|
| 158 |
+
|
| 159 |
+
class TestOpenEnvServer:
    """Basic OpenEnv MCP contract checks."""

    def test_reset(self, client):
        # A bare reset must succeed against a healthy deployment.
        client.reset()

    def test_list_tools(self, client):
        client.reset()
        # The environment advertises exactly one tool: run_rollout.
        names = sorted(tool.name for tool in client.list_tools())
        assert names == ["run_rollout"], f"unexpected tool set: {names}"
| 170 |
+
|
| 171 |
+
|
| 172 |
+
# ββ Rollout tests (require VLLM_BASE_URL) βββββββββββββββββββββββββββββββββ
|
| 173 |
+
|
| 174 |
+
|
| 175 |
+
class TestRunRollout:
    """Drive one rollout per bundled task via the server and verify the result."""

    @pytest.mark.parametrize("task_id", ["fizzbuzz", "sort_list", "simple_io"])
    def test_run_rollout(self, client, task_id: str):
        instruction, test_script = _TASKS[task_id]
        client.reset()

        # Normalise the endpoint so it always ends in "/v1".
        if VLLM_BASE_URL.endswith("/v1"):
            base_url = VLLM_BASE_URL
        else:
            base_url = f"{VLLM_BASE_URL}/v1"

        result = _parse_json(
            client.call_tool(
                "run_rollout",
                vllm_url=base_url,
                model=VLLM_MODEL,
                instruction=instruction,
                test_script=test_script,
                task_id=task_id,
                provider="openai_compatible",
                api_key="intercepted",
                mode="transparent_proxy",
                disable_thinking=True,
                max_tokens_cap=4096,
                agent_timeout_s=360.0,
            )
        )

        workdir_names = list((result["workdir_files"] or {}).keys())
        print(
            f"\n[{task_id}] reward={result['reward']} wall={result['wall_s']}s "
            f"turns={len(result['proxy_turns'])} files={workdir_names}"
        )

        # Contract assertions
        assert result["error"] is None, f"rollout errored: {result['error']}"
        assert result["exit_code"] == 0, "opencode did not exit cleanly"
        assert len(result["proxy_turns"]) >= 1, (
            "proxy captured zero turns — logprob path is broken"
        )

        # At least one turn must carry logprobs (Mode B contract).
        scored_turns = [t for t in result["proxy_turns"] if t["completion_tokens"]]
        assert len(scored_turns) >= 1, (
            "no productive turns — streaming / logprob capture is broken"
        )
        head = scored_turns[0]
        assert head["request"].get("logprobs") is True
        assert len(head["per_token_logps"]) == len(head["completion_tokens"])

        # Task quality
        assert result["reward"] is not None, "verifier did not write reward.txt"
        assert result["reward"] >= 0.5, (
            f"task={task_id} reward={result['reward']} too low; "
            f"workdir={workdir_names} "
            f"verifier_stdout={(result['verifier_stdout'] or '').strip()[:200]}"
        )
|
| 229 |
+
|
| 230 |
+
|
| 231 |
+
# ββ helpers ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 232 |
+
|
| 233 |
+
|
| 234 |
+
def _parse_json(raw: Any) -> dict[str, Any]:
|
| 235 |
+
"""Unwrap a CallTool result shape into a plain dict."""
|
| 236 |
+
if isinstance(raw, str):
|
| 237 |
+
return json.loads(raw)
|
| 238 |
+
if isinstance(raw, dict):
|
| 239 |
+
content = raw.get("content")
|
| 240 |
+
if isinstance(content, list) and content:
|
| 241 |
+
first = content[0]
|
| 242 |
+
if isinstance(first, dict) and isinstance(first.get("text"), str):
|
| 243 |
+
return json.loads(first["text"])
|
| 244 |
+
return raw
|
| 245 |
+
# Handle MCP object shapes (.result.content[0].text or .content[0].text)
|
| 246 |
+
inner = getattr(raw, "result", None) or raw
|
| 247 |
+
content = getattr(inner, "content", None)
|
| 248 |
+
if content:
|
| 249 |
+
first = content[0]
|
| 250 |
+
text = getattr(first, "text", None)
|
| 251 |
+
if isinstance(text, str):
|
| 252 |
+
return json.loads(text)
|
| 253 |
+
raise TypeError(f"Cannot parse tool result of type {type(raw).__name__}: {raw!r}")
|
uv.lock
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|