---
title: CloudOps Optimizer
emoji: ☁️
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
---

# 🚀 Project Overview

CloudOps Optimizer is an OpenEnv simulation for Autonomous FinOps. It challenges AI agents to balance cloud infrastructure costs against performance SLAs, simulating real-world SRE tasks.

## The Problem It Simulates

Companies using AWS/Azure/GCP waste millions yearly on:

- **Oversized servers**: paying for capacity they don't need
- **Undersized servers**: causing performance issues and SLA violations
- **Poor resource allocation**: failing to balance cost against performance

## The Agent's Job

1. **See** the current infrastructure (CPU usage, costs, latency)
2. **Choose** actions like `change srv-1 to t3.small`
3. **Get** rewards/penalties based on cost savings + performance
4. **Learn** to optimize cost-vs-performance tradeoffs

# CloudOps Optimizer Environment

## Overview

CloudOps Optimizer is a real-world simulation of cloud infrastructure cost and performance optimization. The agent acts as a Cloud Site Reliability Engineer (SRE) optimizing a fleet of virtual cloud instances to meet Service Level Agreement (SLA) requirements while minimizing monthly costs.

## Why This Matters

- **Real-world utility**: every company using AWS/Azure/GCP struggles with "cloud waste". Training agents to right-size instances is a multi-million-dollar problem.
- **Not a toy**: unlike chatbots or simple games, this environment requires quantitative reasoning about cost-vs-performance tradeoffs.

## Environment Description

### Observation Space

The agent receives structured data including:

- **Inventory**: list of cloud resources (`id`, `type`, `cpu_usage`, `mem_usage`, `monthly_cost`)
- **Metrics**: real-time performance (`avg_latency_ms`, `error_rate`, `throughput_rps`)
- **SLA**: target constraints (`max_latency_ms`, `max_budget`, `min_uptime_pct`)
- **Task info**: `task_id`, `task_name`, `difficulty`, current step
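Putting the fields above together, an observation might look like the following sketch. The values and the exact nesting are illustrative assumptions; field names follow the lists above, and the real schema lives in `models.py`.

```python
# Illustrative observation payload; numeric values and the `task_id` string
# are made up for this sketch, not taken from the repo.
observation = {
    "inventory": [
        {"id": "srv-1", "type": "m5.large", "cpu_usage": 12.0,
         "mem_usage": 20.0, "monthly_cost": 70.00},
    ],
    "metrics": {"avg_latency_ms": 45.0, "error_rate": 0.001,
                "throughput_rps": 120.0},
    "sla": {"max_latency_ms": 200.0, "max_budget": 50.0,
            "min_uptime_pct": 99.9},
    "task": {"task_id": "right_sizing", "task_name": "Right-Sizing",
             "difficulty": "easy", "step": 0},
}
```

Here `srv-1` sits at 12% CPU on a $70/mo instance against a $50 budget, which is exactly the "oversized server" situation the easy task grades.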

### Action Space

The agent sends text commands in the format: `change [resource_id] to [instance_type]`

Available instance types:

- `t3.nano`: $3.60/mo, capacity 1.0
- `t3.small`: $11.50/mo, capacity 2.0
- `t3.medium`: $23.00/mo, capacity 4.0
- `m5.large`: $70.00/mo, capacity 8.0
- `m5.xlarge`: $140.00/mo, capacity 16.0
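The catalog above supports a simple right-sizing heuristic: pick the cheapest type whose capacity covers the observed demand. This helper is illustrative, not part of the repo:

```python
# Instance catalog transcribed from the table above.
INSTANCE_TYPES = {
    "t3.nano":   {"cost": 3.60,   "capacity": 1.0},
    "t3.small":  {"cost": 11.50,  "capacity": 2.0},
    "t3.medium": {"cost": 23.00,  "capacity": 4.0},
    "m5.large":  {"cost": 70.00,  "capacity": 8.0},
    "m5.xlarge": {"cost": 140.00, "capacity": 16.0},
}

def cheapest_fit(required_capacity):
    """Return the cheapest instance type that covers the required capacity,
    or None if even the largest type is too small."""
    candidates = [
        (spec["cost"], name)
        for name, spec in INSTANCE_TYPES.items()
        if spec["capacity"] >= required_capacity
    ]
    if not candidates:
        return None
    return min(candidates)[1]
```

For example, a workload needing capacity 1.5 maps to `t3.small`, the cheapest type at or above that capacity.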

## Tasks & Grading

| Task | Difficulty | Description | Grading |
|------|------------|-------------|---------|
| Right-Sizing | Easy | Reduce an overpriced server without breaking the SLA | Score = reward value (0-1) |
| Latency Fix | Medium | Resolve a performance bottleneck under budget | Score = reward value (0-1) |
| Balance Optimization | Hard | Optimize a multi-server cluster with tight constraints | Score = reward value (0-1) |

## Reward Function

The reward provides a continuous signal over the trajectory:

`R = cost_reward + performance_reward`

where:

- **Cost reward** (0-0.5): higher as cost approaches the budget
- **Performance reward** (0-0.5): higher as latency stays under the SLA

**Partial progress**: the agent receives incremental rewards for each improvement.

**Penalties**: a system crash (CPU > 110%) results in 0 reward and ends the episode.
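One plausible shaping consistent with the description above is a linear interpolation of each component, with the crash penalty applied first. The exact curve lives in `env/core.py`; the linear scaling here is an assumption for illustration:

```python
def compute_reward(monthly_cost, max_budget, latency_ms, max_latency_ms, cpu_usage):
    """Illustrative reward sketch: each component scales linearly in [0, 0.5],
    and a crash (CPU > 110%) zeroes the reward. Assumed shaping, not the
    repo's actual implementation."""
    # Crash penalty: CPU over 110% ends the episode with zero reward.
    if cpu_usage > 110.0:
        return 0.0
    # Cost reward (0-0.5): full credit well under budget, none at/over it.
    cost_reward = 0.5 * max(0.0, 1.0 - monthly_cost / max_budget)
    # Performance reward (0-0.5): full credit well under the SLA latency,
    # none at/over it.
    perf_reward = 0.5 * max(0.0, 1.0 - latency_ms / max_latency_ms)
    return cost_reward + perf_reward
```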

## Setup & Usage

### Prerequisites

- Python 3.10+
- Hugging Face token (`HF_TOKEN`)

### Local Installation

```bash
# Install dependencies
pip install -e .

# Run baseline inference
export HF_TOKEN=your_huggingface_token
python inference.py
```

### Docker Execution

```bash
docker build -t cloud-ops-env .
docker run -p 8000:8000 cloud-ops-env
```

## API Endpoints

- `POST /reset` - reset the environment with an optional `task_id`
- `POST /step` - execute an action
- `GET /state` - get the current state
- `GET /health` - health check
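A minimal client sketch against these endpoints, using only the standard library. The JSON request shapes (`{"task_id": ...}` for `/reset`, `{"action": ...}` for `/step`) are assumptions and should be checked against `server/app.py`:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # matches the port in `docker run` above

def format_action(resource_id, instance_type):
    """Build a command in the documented `change [resource_id] to [instance_type]` format."""
    return f"change {resource_id} to {instance_type}"

def post(path, payload):
    """POST a JSON payload to the environment server and decode the JSON reply."""
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def run_episode(task_id=None):
    """Reset the environment, then take a single illustrative step."""
    obs = post("/reset", {"task_id": task_id} if task_id else {})
    result = post("/step", {"action": format_action("srv-1", "t3.small")})
    return obs, result
```

`run_episode()` assumes the server from the Docker section is running locally; nothing is called at import time.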

## Baseline Results

Model: `Qwen/Qwen2.5-72B-Instruct`

| Task | Score | Steps |
|------|-------|-------|
| Right-Sizing (Easy) | 0.125 | 1 |
| Latency Fix (Medium) | 0.000 | 1 |
| Balance (Hard) | 0.000 | 1 |

**Average: 0.042**

Note: the low baseline scores indicate the model needs better prompting to handle the optimization tradeoffs. The environment correctly penalizes overshooting the budget (easy task) and undersizing instances, which causes crashes (medium/hard tasks).

## Files

- `openenv.yaml` - OpenEnv specification
- `models.py` - Pydantic models (Observation, Action, Reward)
- `env/core.py` - environment logic with state machine
- `server/app.py` - FastAPI server
- `inference.py` - baseline inference script
- `Dockerfile` - container build

## Spec Compliance

- Typed Pydantic models
- `reset()` returns `Observation`
- `step(action)` returns `(Observation, Reward, done, info)`
- `state()` returns the current state
- `openenv.yaml` with metadata
- `openenv validate` passes
- 3 tasks with deterministic graders (0.0-1.0)
- Partial reward signals
- Strict `[START]`/`[STEP]`/`[END]` log format in `inference.py`
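The `step()` contract in the checklist above can be sketched as follows. Plain dataclasses are used here for brevity (the actual repo uses Pydantic models in `models.py`), and the field sets are trimmed; this is an assumed shape, not the repo's code:

```python
from dataclasses import dataclass, field

@dataclass
class Observation:
    """Trimmed stand-in for the repo's Pydantic Observation model."""
    inventory: list = field(default_factory=list)
    metrics: dict = field(default_factory=dict)
    sla: dict = field(default_factory=dict)

@dataclass
class Reward:
    """Deterministic grade in [0.0, 1.0]."""
    value: float

def step(action: str):
    """Illustrative step() contract: returns (Observation, Reward, done, info)."""
    obs = Observation()          # next observation after applying `action`
    reward = Reward(value=0.0)   # graded reward for this transition
    done = False                 # whether the episode has ended
    info = {}                    # free-form diagnostics
    return obs, reward, done, info
```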