---
title: ZeroGPU Router Backend
emoji: 🛰️
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.5.0
app_file: app.py
pinned: false
---

# ZeroGPU Router Backend Space

This directory contains a lightweight Hugging Face Space that serves a merged router checkpoint over a simple REST API. Deploy it to ZeroGPU, then point the main router UI (`Milestone-6/router-agent/hf_space/app.py`) at the `/v1/generate` endpoint via the `HF_ROUTER_API` environment variable.

## Contents

| File | Purpose |
| --- | --- |
| `app.py` | Loads the merged checkpoint on demand (tries `MODEL_REPO` first, then `MODEL_FALLBACKS` or the default Gemma → Llama → Qwen order), exposes the `/v1/generate` API, mounts the Gradio UI at `/gradio`, and keeps a lightweight HTML console at `/console`. |
| `requirements.txt` | Minimal dependency set (`transformers`, `bitsandbytes`, `torch`, `fastapi`, `accelerate`, `sentencepiece`, `spaces`, `uvicorn`). |
| `.huggingface/spaces.yml` | Configures the Space for ZeroGPU hardware and disables automatic sleep. |
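
For reference, a minimal sketch of the mounting pattern `app.py` uses (the `demo` interface and the placeholder `generate` body are illustrative, not the actual implementation):

```python
import gradio as gr
from fastapi import FastAPI
from fastapi.responses import RedirectResponse

app = FastAPI()

# Hypothetical stand-in for the real generation logic in app.py.
def generate(prompt: str) -> str:
    return f"router output for: {prompt}"

demo = gr.Interface(fn=generate, inputs="text", outputs="text")

@app.get("/")
def root():
    # The root path redirects to the mounted Gradio UI.
    return RedirectResponse(url="/gradio")

# Mount the Gradio app onto the FastAPI instance at /gradio;
# run with: uvicorn app:app
app = gr.mount_gradio_app(app, demo, path="/gradio")
```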

## Deployment Steps

1. Create the Space

   ```bash
   huggingface-cli repo create router-router-zero \
     --type space --sdk gradio --hardware zerogpu --yes
   ```
2. Publish the code

   ```bash
   cd Milestone-6/router-agent/zero-gpu-space
   huggingface-cli upload . Alovestocode/router-router-zero --repo-type space
   ```
3. Configure secrets & variables (a loader sketch follows this list)

   - `HF_TOKEN` – token with read access to the merged checkpoint(s)
   - `MODEL_REPO` – optional hard pin if you only want a single model considered
   - `MODEL_FALLBACKS` – comma-separated preference order (defaults to `router-gemma3-merged,router-llama31-merged,router-qwen3-32b-merged`)
   - `MODEL_LOAD_STRATEGY` – `8bit` (default), `4bit`, or `fp16`; backwards-compatible with `LOAD_IN_8BIT` / `LOAD_IN_4BIT`
   - `MODEL_LOAD_STRATEGIES` – optional ordered fallback list (e.g. `8bit,4bit,cpu`). The loader walks this list and finally falls back to `8bit → 4bit → bf16 → fp16 → cpu`.
   - `SKIP_WARM_START` – set to `1` if you prefer to load lazily on the first request
   - `ALLOW_WARM_START_FAILURE` – set to `1` to keep the container alive even if warm-up fails (the next request will retry)
4. Connect the main router UI

   ```bash
   export HF_ROUTER_API=https://Alovestocode-router-router-zero.hf.space/v1/generate
   ```
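
The strategy fallback referenced in step 3 can be pictured roughly like this. This is a sketch, not the actual `app.py` code; the quantization kwargs assume `transformers` with `bitsandbytes` installed:

```python
import os
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Ordered strategies: MODEL_LOAD_STRATEGIES wins, then the single
# MODEL_LOAD_STRATEGY, then the built-in 8bit -> 4bit -> bf16 -> fp16 -> cpu chain.
DEFAULT_CHAIN = ["8bit", "4bit", "bf16", "fp16", "cpu"]

def strategy_kwargs(strategy: str) -> dict:
    if strategy == "8bit":
        return {"quantization_config": BitsAndBytesConfig(load_in_8bit=True), "device_map": "auto"}
    if strategy == "4bit":
        return {"quantization_config": BitsAndBytesConfig(load_in_4bit=True), "device_map": "auto"}
    if strategy == "bf16":
        return {"torch_dtype": torch.bfloat16, "device_map": "auto"}
    if strategy == "fp16":
        return {"torch_dtype": torch.float16, "device_map": "auto"}
    return {"device_map": "cpu"}  # last-resort CPU fallback

def load_with_fallback(repo_id: str):
    configured = os.getenv("MODEL_LOAD_STRATEGIES", os.getenv("MODEL_LOAD_STRATEGY", ""))
    strategies = [s.strip() for s in configured.split(",") if s.strip()]
    for strategy in strategies + [s for s in DEFAULT_CHAIN if s not in strategies]:
        try:
            model = AutoModelForCausalLM.from_pretrained(repo_id, **strategy_kwargs(strategy))
            return model, strategy
        except Exception:
            continue  # degrade to the next strategy
    raise RuntimeError(f"all load strategies failed for {repo_id}")
```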

## API Contract

`POST /v1/generate`

```json
{
  "prompt": "<router prompt>",
  "max_new_tokens": 600,
  "temperature": 0.2,
  "top_p": 0.9
}
```

Response:

```json
{ "text": "<raw router output>" }
```

Use `HF_ROUTER_API` in the main application or the smoke-test script to validate that the deployed model returns the expected JSON plan. When running on ZeroGPU we recommend keeping `MODEL_LOAD_STRATEGY=8bit` (or `LOAD_IN_8BIT=1`) so the weights fit comfortably in the 70 GB slice; if that fails, the app automatically degrades through 4-bit, bf16/fp16, and finally CPU mode. You can inspect the active load mode via the `/health` endpoint (`strategy` field). The root path (`/`) redirects to the Gradio UI, while `/console` serves a minimal HTML form for quick manual testing.
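
To confirm which load mode is actually active, you can poll `/health`; the `strategy` field is documented above, but the rest of the response shape shown here is an assumption:

```python
import requests

# Hypothetical base URL; replace with your Space's hostname.
BASE = "https://Alovestocode-router-router-zero.hf.space"

health = requests.get(f"{BASE}/health", timeout=30).json()
print(health.get("strategy"))  # e.g. "8bit" when the default fits the slice
```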