Meta OpenEnv Hackathon - Round 1

Overview

Build a complete, real-world OpenEnv environment that an AI agent can learn from through the standard step() / reset() / state() API.

Task Requirements

Must-Have Features

  1. Real-world Task Simulation

    • Must simulate tasks humans actually do
    • Not games or toys
    • Examples: email triage, code review, data cleaning, scheduling, customer support, content moderation
  2. OpenEnv Spec Compliance

    • Typed Observation, Action, and Reward Pydantic models
    • step(action) → returns observation, reward, done, info
    • reset() → returns initial observation
    • state() → returns current state
    • openenv.yaml with metadata
    • Must pass openenv validate (a minimal sketch of typed models and the step/reset/state methods follows this list)
  3. Minimum 3 Tasks with Agent Graders

    • Each task defines a concrete objective
    • Programmatic grader scoring (0.0–1.0); a sample grader sketch follows this list
    • Difficulty range: easy → medium → hard
    • Clear, deterministic success/failure criteria
  4. Meaningful Reward Function

    • Provides signal over the full trajectory (not just a binary end-of-episode reward)
    • Rewards partial progress toward completion
    • Penalizes undesirable behavior (infinite loops, destructive actions)
  5. Baseline Inference Script

    • Uses OpenAI API client
    • Reads credentials from OPENAI_API_KEY environment variable
    • Produces reproducible baseline scores on all 3 tasks
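
To make requirement 2 concrete, the snippet below is a minimal, illustrative sketch of typed Pydantic models plus the step()/reset()/state() methods, using a hypothetical scheduling environment. All class and field names are invented for illustration; a real submission should build on the openenv-core base types so that openenv validate passes, and would normally also define typed reward/state models.

# Illustrative sketch only - class and field names are hypothetical.
from typing import Optional

from pydantic import BaseModel


class SchedulingAction(BaseModel):
    """One agent action, e.g. assigning a meeting to a calendar slot."""
    command: str                       # "assign" or "submit"
    meeting_id: Optional[str] = None
    slot: Optional[str] = None         # e.g. "2025-06-03T10:00"


class SchedulingObservation(BaseModel):
    """What the agent sees after each step."""
    calendar: dict                     # slot -> meeting_id
    pending: list                      # meetings still to be scheduled
    last_action_error: Optional[str] = None


class SchedulingEnv:
    def reset(self) -> SchedulingObservation:
        """Start a new episode from a clean state."""
        self._calendar, self._pending, self._steps = {}, ["m1", "m2", "m3"], 0
        return SchedulingObservation(calendar={}, pending=list(self._pending))

    def step(self, action: SchedulingAction):
        """Apply one action and return (observation, reward, done, info)."""
        self._steps += 1
        reward, error = 0.0, None
        if action.command == "assign" and action.meeting_id in self._pending:
            self._calendar[action.slot] = action.meeting_id
            self._pending.remove(action.meeting_id)
            reward = 1.0 / 3           # shaped reward: partial credit per meeting placed
        elif action.command != "submit":
            error = f"invalid action: {action.command}"
        done = action.command == "submit" or not self._pending
        obs = SchedulingObservation(
            calendar=dict(self._calendar),
            pending=list(self._pending),
            last_action_error=error,
        )
        return obs, reward, done, {"steps": self._steps}

    def state(self) -> dict:
        """Expose the full current environment state."""
        return {"calendar": dict(self._calendar), "pending": list(self._pending), "steps": self._steps}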
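
For requirement 3 (and the partial-credit idea in requirement 4), a grader can be a small deterministic function that maps the final state to a score in [0.0, 1.0]. The sketch below is hypothetical: the inputs and the 0.1 conflict penalty are illustrative choices, not part of the spec.

# Hypothetical grader - signature and scoring weights are illustrative.
def grade_schedule(final_calendar: dict, required: list, conflicts: list) -> float:
    """Deterministically score a finished schedule in [0.0, 1.0].

    Gives partial credit for every required meeting that was placed and
    subtracts a penalty for each conflicting pair scheduled anyway.
    """
    placed = set(final_calendar.values())
    coverage = len(placed & set(required)) / len(required) if required else 1.0
    violations = sum(1 for a, b in conflicts if a in placed and b in placed)
    return max(0.0, min(1.0, coverage - 0.1 * violations))

A grader of this shape is deterministic and reproducible, produces graded rather than binary scores, and can be reused for per-step reward shaping by grading intermediate states.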

Non-Functional Requirements

Deployment

  • Hugging Face Space: The environment must run as a containerized HF Space tagged with openenv
  • Dockerfile: Working containerization with clean docker build + docker run

Documentation

README must include:

  • Environment description and motivation
  • Action and observation space definitions
  • Task descriptions with expected difficulty
  • Setup and usage instructions
  • Baseline scores

Evaluation Criteria & Scoring

Scoring Breakdown (100 points)

  • Real-world utility (30%): Does the environment model a genuine task? Would someone use this for training/evaluating agents?
  • Task & grader quality (25%): Well-defined tasks with clear objectives? Accurate graders? Meaningful difficulty progression?
  • Environment design (20%): Clean state management, sensible action/observation spaces, good reward shaping, proper episode boundaries
  • Code quality & spec compliance (15%): Follows OpenEnv spec, clean structure, typed models, documented, tested, working Dockerfile
  • Creativity & novelty (10%): Novel problem domain, interesting mechanics, clever reward design, original approach

Detailed Scoring Rubrics

Real-world Utility (30%)

  • 0–5: Toy/artificial problem with no practical application
  • 6–15: Valid domain but shallow modeling
  • 16–25: Good domain modeling, useful for agent evaluation
  • 26–30: Excellent; fills a real gap, immediate value for the RL/agent community

Task & Grader Quality (25%)

  • 3+ tasks with difficulty range?
  • Graders produce scores between 0.0–1.0?
  • Graders deterministic and reproducible?
  • Hard task genuinely challenges frontier models?

Environment Design (20%)

  • reset() produces clean state?
  • Action/observation types well-designed and documented?
  • Reward function provides useful varying signal (not sparse)?
  • Episode boundaries sensible?

Code Quality & Spec Compliance (15%)

  • openenv validate passes?
  • docker build && docker run works?
  • HF Space deploys and responds?
  • Baseline script runs and reproduces scores?

Creativity & Novelty (10%)

  • Domain not seen in OpenEnv before?
  • Reward design has interesting properties?
  • Clever mechanics that make environment engaging?

Judging Process

Phase 1: Automated Validation (Pass/Fail Gate)

  • HF Space deploys
  • OpenEnv spec compliance
  • Dockerfile builds
  • Baseline reproduces
  • 3+ tasks with graders

Phase 2: Agentic Evaluation (Scored)

  • Baseline agent re-run
  • Standard Open LLM agent (e.g., Nemotron 3 Super) run against all environments
  • Score variance check

Phase 3: Human Review

Top submissions reviewed by Meta and Hugging Face engineers for:

  • Real-world utility
  • Creativity
  • Exploit checks

Disqualification Criteria

  • Environment does not deploy or respond
  • Plagiarized or trivially modified existing environments
  • Graders that always return the same score
  • No baseline inference script

Pre-Submission Checklist

All must pass or you're disqualified:

  • HF Space deploys (200 response to reset())
  • OpenEnv spec compliance validated
  • Dockerfile builds successfully
  • Baseline script reproduces without error
  • 3+ tasks with graders (scores in 0.0–1.0 range)

Mandatory Requirements

Environment Variables

Must be defined in your environment configuration:

API_BASE_URL      # The API endpoint for the LLM
MODEL_NAME        # The model identifier to use for inference
HF_TOKEN          # Your Hugging Face token / API key
LOCAL_IMAGE_NAME  # (Optional) Name of the local image if using from_docker_image()

Script Requirements

  • Filename: inference.py (must be in root directory)
  • LLM Calls: Must use OpenAI Client with above variables
  • Logging Format: Must follow [START], [STEP], [END] format (see below)

Infrastructure Restrictions

  • Runtime: Inference script must complete in < 20 minutes
  • Resources: Must run on vcpu=2, memory=8GB

STDOUT Logging Format

Required Format

The script must emit exactly three line types to stdout, in this order:

[START] task=<task_name> env=<benchmark> model=<model_name>
[STEP]  step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
[END]   success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>

Format Rules

  • One [START] line at episode begin
  • One [STEP] line per step, immediately after env.step() returns
  • One [END] line after env.close(), always emitted (even on exception)
  • reward and rewards formatted to 2 decimal places
  • done and success are lowercase booleans: true or false
  • error is the raw last_action_error string, or null if none
  • All fields on a single line with no newlines within a line
  • Each task should return a score in [0, 1]

Example Output

[START] task=click-test env=miniwob model=Qwen3-VL-30B
[STEP] step=1 action=click('123') reward=0.00 done=false error=null
[STEP] step=2 action=fill('456','text') reward=0.00 done=false error=null
[STEP] step=3 action=click('789') reward=1.00 done=true error=null
[END] success=true steps=3 score=1.00 rewards=0.00,0.00,1.00
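
The formatting details above (two-decimal rewards, lowercase booleans, null for missing errors) are easy to get subtly wrong. The helpers below are one way to satisfy them; the function names are illustrative, not mandated.

# Illustrative logging helpers - only the emitted line format is mandated.
from typing import List, Optional


def fmt_bool(value: bool) -> str:
    return "true" if value else "false"


def log_start(task: str, env: str, model: str) -> None:
    print(f"[START] task={task} env={env} model={model}", flush=True)


def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
    print(
        f"[STEP] step={step} action={action} reward={reward:.2f} "
        f"done={fmt_bool(done)} error={error if error else 'null'}",
        flush=True,
    )


def log_end(success: bool, score: float, rewards: List[float]) -> None:
    formatted = ",".join(f"{r:.2f}" for r in rewards)
    print(
        f"[END] success={fmt_bool(success)} steps={len(rewards)} "
        f"score={score:.2f} rewards={formatted}",
        flush=True,
    )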

Sample Inference Script

"""
Inference Script Example
===================================
MANDATORY
- Before submitting, ensure the following variables are defined in your environment configuration:
    API_BASE_URL   The API endpoint for the LLM.
    MODEL_NAME     The model identifier to use for inference.
    HF_TOKEN       Your Hugging Face / API key.
    LOCAL_IMAGE_NAME  The name of the local image to use for the environment if you are using the from_docker_image() method.

- Defaults are set only for API_BASE_URL and MODEL_NAME 
    (and should reflect your active inference setup):
    API_BASE_URL = os.getenv("API_BASE_URL", "<your-active-endpoint>")
    MODEL_NAME = os.getenv("MODEL_NAME", "<your-active-model>")
    
- The inference script must be named `inference.py` and placed in the root directory of the project
- Participants must use OpenAI Client for all LLM calls using above variables

STDOUT FORMAT
- The script must emit exactly three line types to stdout, in this order:

    [START] task=<task_name> env=<benchmark> model=<model_name>
    [STEP]  step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
    [END]   success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>

  Rules:
    - One [START] line at episode begin.
    - One [STEP] line per step, immediately after env.step() returns.
    - One [END] line after env.close(), always emitted (even on exception).
    - reward and rewards are formatted to 2 decimal places.
    - done and success are lowercase booleans: true or false.
    - error is the raw last_action_error string, or null if none.
    - All fields on a single line with no newlines within a line.
    - Each task should return a score in [0, 1].

  Example:
    [START] task=click-test env=miniwob model=Qwen3-VL-30B
    [STEP] step=1 action=click('123') reward=0.00 done=false error=null
    [STEP] step=2 action=fill('456','text') reward=0.00 done=false error=null
    [STEP] step=3 action=click('789') reward=1.00 done=true error=null
    [END] success=true steps=3 score=1.00 rewards=0.00,0.00,1.00
"""

import asyncio
import os
import textwrap
from typing import List, Optional

from openai import OpenAI

from my_env_v4 import MyEnvV4Action, MyEnvV4Env

IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME")  # Local image name, if you are using from_docker_image()
API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
MODEL_NAME = os.getenv("MODEL_NAME") or "Qwen/Qwen2.5-72B-Instruct"
TASK_NAME = os.getenv("MY_ENV_V4_TASK", "echo")
BENCHMARK = os.getenv("MY_ENV_V4_BENCHMARK", "my_env_v4")
MAX_STEPS = 8
TEMPERATURE = 0.7

# TODO: Implement the rest of your inference script here
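
As an illustration of what could replace the TODO above, here is a minimal sketch of the episode loop. It assumes MyEnvV4Env follows the usual OpenEnv client conventions (from_docker_image(), reset()/step() results exposing observation, reward, and done, and a close() method) and that MyEnvV4Action takes a message field; adapt the prompt construction and action parsing to your own environment.

# Sketch only: the exact client API of your environment may differ.
def run_episode() -> None:
    client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
    env = MyEnvV4Env.from_docker_image(IMAGE_NAME) if IMAGE_NAME else MyEnvV4Env()

    print(f"[START] task={TASK_NAME} env={BENCHMARK} model={MODEL_NAME}", flush=True)
    rewards, success = [], False
    try:
        result = env.reset()
        for step in range(1, MAX_STEPS + 1):
            prompt = f"Observation: {result.observation}\nReply with the next action."
            completion = client.chat.completions.create(
                model=MODEL_NAME,
                messages=[{"role": "user", "content": prompt}],
                temperature=TEMPERATURE,
            )
            action_str = (completion.choices[0].message.content or "").strip()
            result = env.step(MyEnvV4Action(message=action_str))  # adapt to your Action model
            reward = float(result.reward or 0.0)
            rewards.append(reward)
            error = getattr(result.observation, "last_action_error", None)
            print(
                f"[STEP] step={step} action={action_str} reward={reward:.2f} "
                f"done={'true' if result.done else 'false'} error={error or 'null'}",
                flush=True,
            )
            if result.done:
                success = reward > 0
                break
    finally:
        env.close()
        score = rewards[-1] if rewards else 0.0
        print(
            f"[END] success={'true' if success else 'false'} steps={len(rewards)} "
            f"score={score:.2f} rewards={','.join(f'{r:.2f}' for r in rewards)}",
            flush=True,
        )


if __name__ == "__main__":
    run_episode()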

Pre-Validation Script

#!/usr/bin/env bash
#
# validate-submission.sh - OpenEnv Submission Validator
#
# Checks that your HF Space is live, Docker image builds, and openenv validate passes.
#
# Prerequisites:
#   - Docker:       https://docs.docker.com/get-docker/
#   - openenv-core: pip install openenv-core
#   - curl (usually pre-installed)
#
# Run:
#   curl -fsSL https://raw.githubusercontent.com/<owner>/<repo>/main/scripts/validate-submission.sh | bash -s -- <ping_url> [repo_dir]
#
#   Or download and run locally:
#     chmod +x validate-submission.sh
#     ./validate-submission.sh <ping_url> [repo_dir]
#
# Arguments:
#   ping_url   Your HuggingFace Space URL (e.g. https://your-space.hf.space)
#   repo_dir   Path to your repo (default: current directory)
#
# Examples:
#   ./validate-submission.sh https://my-team.hf.space
#   ./validate-submission.sh https://my-team.hf.space ./my-repo
#

set -uo pipefail

DOCKER_BUILD_TIMEOUT=600

if [ -t 1 ]; then
  RED='\033[0;31m'
  GREEN='\033[0;32m'
  YELLOW='\033[1;33m'
  BOLD='\033[1m'
  NC='\033[0m'
else
  RED=''
  GREEN=''
  YELLOW=''
  BOLD=''
  NC=''
fi

# TODO: Add the rest of the validation script

Tips for Success

  1. Choose a Real Problem: Pick a task that has genuine value for the AI/agent community
  2. Design Good Rewards: Provide meaningful signals throughout the episode, not just at the end
  3. Test Thoroughly: Ensure your environment works cleanly with docker build && docker run
  4. Document Well: Clear README helps reviewers understand your contribution
  5. Start Simple: Get the basic OpenEnv spec working first, then add complexity
  6. Run Validator: Use the pre-validation script before submitting

Resources

Submission Deadline

[To be announced]


Good luck with your submission! 🚀