---
title: CloudOps Optimizer
emoji: ☁️
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
---

# 🚀 Project Overview

CloudOps Optimizer is an OpenEnv simulation for Autonomous FinOps. It challenges AI agents to balance cloud infrastructure costs against performance SLAs, simulating real-world SRE tasks.

## The Problem It Simulates

Companies using AWS/Azure/GCP waste millions yearly on:

- **Oversized servers**: paying for capacity they don't need
- **Undersized servers**: causing performance issues and SLA violations
- **Poor resource allocation**: failing to balance cost against performance

## The Agent's Job

1. **See** the current infrastructure (CPU usage, costs, latency)
2. **Choose** actions like `change srv-1 to t3.small`
3. **Get** rewards/penalties based on cost savings + performance
4. **Learn** to optimize cost-vs-performance tradeoffs

# CloudOps Optimizer Environment

## Overview

CloudOps Optimizer is a real-world simulation of cloud infrastructure cost and performance optimization. The agent acts as a Cloud Site Reliability Engineer (SRE) optimizing a fleet of virtual cloud instances to meet Service Level Agreement (SLA) requirements while minimizing monthly costs.

## Why This Matters

- **Real-world utility**: every company using AWS/Azure/GCP struggles with "cloud waste". Training agents to right-size instances is a multi-million-dollar problem.
- **Not a toy**: unlike chatbots or simple games, this environment requires quantitative reasoning about cost-vs-performance tradeoffs.

## Environment Description

### Observation Space

The agent receives structured data including:

- **Inventory**: list of cloud resources (`id`, `type`, `cpu_usage`, `mem_usage`, `monthly_cost`)
- **Metrics**: real-time performance (`avg_latency_ms`, `error_rate`, `throughput_rps`)
- **SLA**: target constraints (`max_latency_ms`, `max_budget`, `min_uptime_pct`)
- **Task info**: `task_id`, `task_name`, `difficulty`, current step
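Putting the fields above together, an observation might look like the following sketch. The values and the exact nesting are illustrative assumptions; field names follow the lists above, and the real schema lives in `models.py`.

```python
# Illustrative observation payload; numeric values and the `task_id` string
# are made up for this sketch, not taken from the repo.
observation = {
    "inventory": [
        {"id": "srv-1", "type": "m5.large", "cpu_usage": 12.0,
         "mem_usage": 20.0, "monthly_cost": 70.00},
    ],
    "metrics": {"avg_latency_ms": 45.0, "error_rate": 0.001,
                "throughput_rps": 120.0},
    "sla": {"max_latency_ms": 200.0, "max_budget": 50.0,
            "min_uptime_pct": 99.9},
    "task": {"task_id": "right_sizing", "task_name": "Right-Sizing",
             "difficulty": "easy", "step": 0},
}
```

Here `srv-1` sits at 12% CPU on a $70/mo instance against a $50 budget, which is exactly the "oversized server" situation the easy task grades.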

### Action Space

The agent sends text commands in the format: `change [resource_id] to [instance_type]`

Available instance types:

- `t3.nano`: $3.60/mo, capacity 1.0
- `t3.small`: $11.50/mo, capacity 2.0
- `t3.medium`: $23.00/mo, capacity 4.0
- `m5.large`: $70.00/mo, capacity 8.0
- `m5.xlarge`: $140.00/mo, capacity 16.0
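The catalog above supports a simple right-sizing heuristic: pick the cheapest type whose capacity covers the observed demand. This helper is illustrative, not part of the repo:

```python
# Instance catalog transcribed from the table above.
INSTANCE_TYPES = {
    "t3.nano":   {"cost": 3.60,   "capacity": 1.0},
    "t3.small":  {"cost": 11.50,  "capacity": 2.0},
    "t3.medium": {"cost": 23.00,  "capacity": 4.0},
    "m5.large":  {"cost": 70.00,  "capacity": 8.0},
    "m5.xlarge": {"cost": 140.00, "capacity": 16.0},
}

def cheapest_fit(required_capacity):
    """Return the cheapest instance type that covers the required capacity,
    or None if even the largest type is too small."""
    candidates = [
        (spec["cost"], name)
        for name, spec in INSTANCE_TYPES.items()
        if spec["capacity"] >= required_capacity
    ]
    if not candidates:
        return None
    return min(candidates)[1]
```

For example, a workload needing capacity 1.5 maps to `t3.small`, the cheapest type at or above that capacity.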

## Tasks & Grading

| Task | Difficulty | Description | Grading |
|------|------------|-------------|---------|
| Right-Sizing | Easy | Reduce an overpriced server without breaking the SLA | Score = reward value (0-1) |
| Latency Fix | Medium | Resolve a performance bottleneck under budget | Score = reward value (0-1) |
| Balance Optimization | Hard | Optimize a multi-server cluster with tight constraints | Score = reward value (0-1) |

## Reward Function

The reward provides a continuous signal over the trajectory:

`R = cost_reward + performance_reward`

where:

- **Cost reward** (0-0.5): higher as cost approaches the budget
- **Performance reward** (0-0.5): higher as latency stays under the SLA

**Partial progress**: the agent receives incremental rewards for each improvement.

**Penalties**: a system crash (CPU > 110%) results in 0 reward and ends the episode.
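One plausible shaping consistent with the description above is a linear interpolation of each component, with the crash penalty applied first. The exact curve lives in `env/core.py`; the linear scaling here is an assumption for illustration:

```python
def compute_reward(monthly_cost, max_budget, latency_ms, max_latency_ms, cpu_usage):
    """Illustrative reward sketch: each component scales linearly in [0, 0.5],
    and a crash (CPU > 110%) zeroes the reward. Assumed shaping, not the
    repo's actual implementation."""
    # Crash penalty: CPU over 110% ends the episode with zero reward.
    if cpu_usage > 110.0:
        return 0.0
    # Cost reward (0-0.5): full credit well under budget, none at/over it.
    cost_reward = 0.5 * max(0.0, 1.0 - monthly_cost / max_budget)
    # Performance reward (0-0.5): full credit well under the SLA latency,
    # none at/over it.
    perf_reward = 0.5 * max(0.0, 1.0 - latency_ms / max_latency_ms)
    return cost_reward + perf_reward
```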

## Setup & Usage

### Prerequisites

- Python 3.10+
- Hugging Face token (`HF_TOKEN`)

### Local Installation

```bash
# Install dependencies
pip install -e .

# Run baseline inference
export HF_TOKEN=your_huggingface_token
python inference.py
```

### Docker Execution

```bash
docker build -t cloud-ops-env .
docker run -p 8000:8000 cloud-ops-env
```

## API Endpoints

- `POST /reset` - reset the environment with an optional `task_id`
- `POST /step` - execute an action
- `GET /state` - get the current state
- `GET /health` - health check
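A minimal client sketch against these endpoints, using only the standard library. The JSON request shapes (`{"task_id": ...}` for `/reset`, `{"action": ...}` for `/step`) are assumptions and should be checked against `server/app.py`:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # matches the port in `docker run` above

def format_action(resource_id, instance_type):
    """Build a command in the documented `change [resource_id] to [instance_type]` format."""
    return f"change {resource_id} to {instance_type}"

def post(path, payload):
    """POST a JSON payload to the environment server and decode the JSON reply."""
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def run_episode(task_id=None):
    """Reset the environment, then take a single illustrative step."""
    obs = post("/reset", {"task_id": task_id} if task_id else {})
    result = post("/step", {"action": format_action("srv-1", "t3.small")})
    return obs, result
```

`run_episode()` assumes the server from the Docker section is running locally; nothing is called at import time.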

## Baseline Results

Model: `Qwen/Qwen2.5-72B-Instruct`

| Task | Score | Steps |
|------|-------|-------|
| Right-Sizing (Easy) | 0.125 | 1 |
| Latency Fix (Medium) | 0.000 | 1 |
| Balance (Hard) | 0.000 | 1 |

**Average: 0.042**

Note: the low baseline scores indicate the model needs better prompting to handle the optimization tradeoffs. The environment correctly penalizes overshooting the budget (easy task) and undersizing instances, which causes crashes (medium/hard tasks).

## Files

- `openenv.yaml` - OpenEnv specification
- `models.py` - Pydantic models (Observation, Action, Reward)
- `env/core.py` - environment logic with state machine
- `server/app.py` - FastAPI server
- `inference.py` - baseline inference script
- `Dockerfile` - container build

## Spec Compliance

- Typed Pydantic models
- `reset()` returns `Observation`
- `step(action)` returns `(Observation, Reward, done, info)`
- `state()` returns the current state
- `openenv.yaml` with metadata
- `openenv validate` passes
- 3 tasks with deterministic graders (0.0-1.0)
- Partial reward signals
- Strict `[START]`/`[STEP]`/`[END]` log format in `inference.py`
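The `step()` contract in the checklist above can be sketched as follows. Plain dataclasses are used here for brevity (the actual repo uses Pydantic models in `models.py`), and the field sets are trimmed; this is an assumed shape, not the repo's code:

```python
from dataclasses import dataclass, field

@dataclass
class Observation:
    """Trimmed stand-in for the repo's Pydantic Observation model."""
    inventory: list = field(default_factory=list)
    metrics: dict = field(default_factory=dict)
    sla: dict = field(default_factory=dict)

@dataclass
class Reward:
    """Deterministic grade in [0.0, 1.0]."""
    value: float

def step(action: str):
    """Illustrative step() contract: returns (Observation, Reward, done, info)."""
    obs = Observation()          # next observation after applying `action`
    reward = Reward(value=0.0)   # graded reward for this transition
    done = False                 # whether the episode has ended
    info = {}                    # free-form diagnostics
    return obs, reward, done, info
```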