---
title: ZeroGPU Router Backend
emoji: 🛰️
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.5.0
app_file: app.py
pinned: false
---
# ZeroGPU Router Backend Space

This directory contains a lightweight Hugging Face Space that serves a merged
router checkpoint over a simple REST API. Deploy it to ZeroGPU and then point
the main router UI (`Milestone-6/router-agent/hf_space/app.py`) at the `/v1/generate`
endpoint via the `HF_ROUTER_API` environment variable.
## Contents

| File | Purpose |
|---|---|
| `app.py` | Loads the merged checkpoint on demand (tries `MODEL_REPO` first, then `MODEL_FALLBACKS` or the default Gemma → Llama → Qwen order), exposes a `/v1/generate` API, mounts the Gradio UI at `/gradio`, and keeps a lightweight HTML console at `/console`. |
| `requirements.txt` | Minimal dependency set (`transformers`, `bitsandbytes`, `torch`, `fastapi`, `accelerate`, `sentencepiece`, `spaces`, `uvicorn`). |
| `.huggingface/spaces.yml` | Configures the Space for ZeroGPU hardware and disables automatic sleep. |
## Deployment Steps

1. **Create the Space**

   ```bash
   huggingface-cli repo create router-router-zero \
     --type space --sdk gradio --hardware zerogpu --yes
   ```

2. **Publish the code**

   ```bash
   cd Milestone-6/router-agent/zero-gpu-space
   huggingface-cli upload . Alovestocode/router-router-zero --repo-type space
   ```

3. **Configure secrets & variables**

   - `HF_TOKEN` – token with read access to the merged checkpoint(s)
   - `MODEL_REPO` – optional hard pin if you only want a single model considered
   - `MODEL_FALLBACKS` – comma-separated preference order (defaults to `router-gemma3-merged,router-llama31-merged,router-qwen3-32b-merged`)
   - `MODEL_LOAD_STRATEGY` – `8bit` (default), `4bit`, or `fp16`; backwards-compatible with `LOAD_IN_8BIT`/`LOAD_IN_4BIT`
   - `MODEL_LOAD_STRATEGIES` – optional ordered fallback list (e.g. `8bit,4bit,cpu`). The loader automatically walks this list and finally falls back to `8bit` → `4bit` → `bf16` → `fp16` → `cpu`.
   - `SKIP_WARM_START` – set to `1` if you prefer to load lazily on the first request
   - `ALLOW_WARM_START_FAILURE` – set to `1` to keep the container alive even if warm-up fails (the next request will retry)

4. **Connect the main router UI**

   ```bash
   export HF_ROUTER_API=https://Alovestocode-router-router-zero.hf.space/v1/generate
   ```
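The `MODEL_LOAD_STRATEGY`/`MODEL_LOAD_STRATEGIES` resolution described above can be sketched roughly as follows. This is an illustrative sketch, not the actual `app.py` implementation; `resolve_strategies` and `DEFAULT_CHAIN` are hypothetical names.

```python
import os

# Final fallback order per this README: 8bit -> 4bit -> bf16 -> fp16 -> cpu.
DEFAULT_CHAIN = ["8bit", "4bit", "bf16", "fp16", "cpu"]

def resolve_strategies(env=os.environ):
    """Return the ordered list of load strategies the loader will attempt.

    MODEL_LOAD_STRATEGIES (comma-separated) wins if set; otherwise the single
    MODEL_LOAD_STRATEGY (default "8bit") seeds the chain. The default chain is
    appended so loading always has a final fallback.
    """
    explicit = env.get("MODEL_LOAD_STRATEGIES", "")
    if explicit:
        chain = [s.strip() for s in explicit.split(",") if s.strip()]
    else:
        chain = [env.get("MODEL_LOAD_STRATEGY", "8bit")]
    for strategy in DEFAULT_CHAIN:
        if strategy not in chain:
            chain.append(strategy)
    return chain
```

With no variables set this yields the documented `8bit → 4bit → bf16 → fp16 → cpu` order; setting `MODEL_LOAD_STRATEGIES=4bit,cpu` tries those two first.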
## API Contract

`POST /v1/generate`

```json
{
  "prompt": "<router prompt>",
  "max_new_tokens": 600,
  "temperature": 0.2,
  "top_p": 0.9
}
```

Response:

```json
{ "text": "<raw router output>" }
```
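A minimal stdlib-only client for the contract above might look like this. The helper names (`build_payload`, `generate`) are illustrative; only the endpoint shape and the default sampling parameters come from this README.

```python
import json
import urllib.request

def build_payload(prompt, max_new_tokens=600, temperature=0.2, top_p=0.9):
    """Build a /v1/generate request body with the documented defaults."""
    return {
        "prompt": prompt,
        "max_new_tokens": max_new_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }

def generate(api_url, prompt, **overrides):
    """POST the payload to the /v1/generate endpoint and return the `text` field."""
    req = urllib.request.Request(
        api_url,
        data=json.dumps(build_payload(prompt, **overrides)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)["text"]
```

For example, `generate(os.environ["HF_ROUTER_API"], "<router prompt>")` would return the raw router output string.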
Use `HF_ROUTER_API` in the main application or the smoke-test script to validate
that the deployed model returns the expected JSON plan. When running on ZeroGPU
we recommend keeping `MODEL_LOAD_STRATEGY=8bit` (or `LOAD_IN_8BIT=1`) so the
weights fit comfortably in the 70 GB slice; if that fails, the app automatically
degrades through 4-bit, bf16/fp16, and finally CPU mode. You can inspect the
active load mode via the `/health` endpoint (`strategy` field). The root path
(`/`) redirects to the Gradio UI, while `/console` serves the minimal HTML
form for quick manual testing.
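Checking the active load mode can be scripted with a short stdlib helper. This sketch assumes only that `/health` returns JSON containing a `strategy` field, as described above; the rest of the response schema and the function names are assumptions.

```python
import json
import urllib.request

def active_strategy(health_body: str) -> str:
    """Extract the `strategy` field from a /health response body."""
    return json.loads(health_body).get("strategy", "unknown")

def check_health(base_url: str) -> str:
    """Fetch <base_url>/health and report the active load strategy."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=30) as resp:
        return active_strategy(resp.read().decode("utf-8"))
```

Running `check_health("https://Alovestocode-router-router-zero.hf.space")` after deployment confirms whether the Space settled on 8-bit or had to degrade.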