metadata
title: Transformers Inference Server (Qwen3‑VL)
emoji: 🐍
colorFrom: purple
colorTo: green
sdk: docker
app_port: 3000
pinned: false

Python FastAPI Inference Server (OpenAI-Compatible) for Qwen3-4B-Instruct

AI-Powered Marketplace Intelligence System

This repository provides an OpenAI-compatible inference server powered by Qwen3-4B-Instruct, designed to serve as the AI backend for a smart marketplace platform where:

  • Suppliers can register and list their products
  • Users can query product availability, get AI-powered recommendations, and find the nearest suppliers
  • AI Assistant helps users discover products based on their needs, location, and preferences

The system has been migrated from a Node.js/llama.cpp stack to a Python/Transformers stack for better model compatibility and performance.

Key files:

Model:

  • Default: unsloth/Qwen3-4B-Instruct-2507 (Transformers; text-only instruct model)
  • You can change the model via environment variable MODEL_REPO_ID.

Marketplace Features (Planned)

This inference server will power the following marketplace capabilities:

1. Supplier Management

  • Suppliers can register their business
  • List products with details (name, price, stock, location)
  • Update inventory in real-time

2. AI-Powered Product Discovery

Users with AI access can:

  • Query Stock Availability: "Apakah ada stok laptop gaming di Jakarta?"
  • Get Recommendations: "Saya butuh laptop untuk programming, budget 10 juta, yang bagus apa?"
  • Find Nearest Suppliers: Products are matched based on user's location to show the closest available suppliers
  • Compare Products: AI helps compare features, prices, and reviews

3. Location-Aware Intelligence

  • Automatic supplier-to-user distance calculation
  • Prioritize recommendations based on proximity
  • Suggest delivery or pickup options based on distance

4. Natural Language Interaction

  • Users can ask in natural Indonesian or English
  • AI understands context and user preferences over conversation
  • Personalized recommendations based on chat history

Current Status

FULLY IMPLEMENTED (backend)! All marketplace API features below are production-ready; only the frontend UI is still planned:

  • ✅ Inference server (OpenAI-compatible /v1/chat/completions)
  • ✅ Marketplace database (5 tables with relationships and indexes)
  • ✅ 10+ RESTful API endpoints (suppliers, products, users, AI search)
  • ✅ Location-aware search with Haversine distance calculation
  • ✅ AI-powered natural language product search
  • ✅ Sample data seeding script
  • ⏳ Frontend UI (planned)

Marketplace API Endpoints

Supplier Management

Register Supplier

POST /api/suppliers/register
Content-Type: application/json

{
  "name": "John Doe",
  "business_name": "Tech Store Jakarta",
  "email": "john@techstore.com",
  "phone": "+62812345678",
  "address": "Jl. Sudirman No. 123",
  "latitude": -6.2088,
  "longitude": 106.8456,
  "city": "Jakarta",
  "province": "DKI Jakarta"
}

List Suppliers

GET /api/suppliers?city=Jakarta&skip=0&limit=100

Get Supplier Details

GET /api/suppliers/{supplier_id}

Product Management

Create Product

POST /api/suppliers/{supplier_id}/products
Content-Type: application/json

{
  "name": "Laptop ASUS ROG Strix G15",
  "description": "Gaming laptop with RTX 4060",
  "price": 15999000,
  "stock_quantity": 5,
  "category": "laptop",
  "tags": "gaming,asus,rtx",
  "sku": "ASU-ROG-G15-001"
}

Update Product

PUT /api/products/{product_id}
Content-Type: application/json

{
  "price": 14999000,
  "stock_quantity": 3
}

List Products

GET /api/products?category=laptop&min_price=5000000&max_price=20000000&available_only=true

Search Products (Location-Aware)

GET /api/products/search?q=laptop+gaming&city=Jakarta&user_lat=-6.2088&user_lon=106.8456&max_price=15000000

Response includes distance_km field for each product (distance from user location).
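
The distance_km value is a great-circle (Haversine) distance between the user's and the supplier's coordinates. A minimal sketch of the calculation (the function name here is illustrative, not necessarily the repo's):

import math

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two WGS-84 points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# User in central Jakarta vs. a supplier in Bandung
print(round(haversine_km(-6.2088, 106.8456, -6.9175, 107.6191), 1))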

User Management

Register User

POST /api/users/register
Content-Type: application/json

{
  "name": "Jane Smith",
  "email": "jane@email.com",
  "phone": "+62856789012",
  "latitude": -6.2088,
  "longitude": 106.8456,
  "city": "Jakarta",
  "province": "DKI Jakarta",
  "ai_access_enabled": true
}

Get User Profile

GET /api/users/{user_id}

AI-Powered Search (Premium Feature)

The main feature! Natural language product search with AI recommendations.

POST /api/chat/search
Content-Type: application/json

{
  "user_id": 1,
  "query": "Saya butuh laptop gaming di Jakarta, budget 12 juta",
  "session_id": "optional-session-id"
}

Response:

{
  "session_id": "user_1_1731485760",
  "response": "Berdasarkan budget Anda 12 juta dan lokasi di Jakarta, saya merekomendasikan HP Pavilion Gaming (Rp 11.999.000) dari Tech Store Surabaya. Laptop ini memiliki RTX 3050 graphics yang bagus untuk gaming...",
  "products_found": 3,
  "conversation_id": 1
}

How it works:

  1. Parses natural language query (extracts category, budget, location)
  2. Searches database with filters
  3. Sorts results by distance from user
  4. Injects product catalog into AI system prompt
  5. AI generates personalized recommendation
  6. Saves conversation to database for history
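
A compact sketch of that pipeline using an in-memory catalog (the names, sample data, and parsing heuristics here are illustrative; the real implementation queries the SQLAlchemy models and calls the local model):

import math, re

# Illustrative catalog; prices/names echo the sample data, coordinates are approximate.
CATALOG = [
    {"name": "HP Pavilion Gaming", "price": 11_999_000, "lat": -7.2575, "lon": 112.7521},
    {"name": "ASUS ROG Strix G15", "price": 15_999_000, "lat": -6.2088, "lon": 106.8456},
]

def haversine_km(lat1, lon1, lat2, lon2):
    a = (math.sin(math.radians(lat2 - lat1) / 2) ** 2
         + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2))
         * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(a))

def build_system_prompt(query, user_lat, user_lon):
    m = re.search(r"(\d+)\s*juta", query)                                   # 1. parse budget ("12 juta")
    budget = int(m.group(1)) * 1_000_000 if m else None
    hits = [p for p in CATALOG if budget is None or p["price"] <= budget]   # 2. filter the catalog
    hits.sort(key=lambda p: haversine_km(user_lat, user_lon, p["lat"], p["lon"]))  # 3. nearest first
    listing = "\n".join(f"- {p['name']} (Rp {p['price']:,})" for p in hits)
    return "You are a marketplace assistant. Matching products:\n" + listing  # 4. catalog in prompt

# Steps 5-6: the prompt goes to the local model via /v1/chat/completions and the reply is saved.
print(build_system_prompt("Saya butuh laptop gaming di Jakarta, budget 12 juta", -6.2088, 106.8456))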

Requirements:

  • User must have ai_access_enabled: true
  • Location coordinates help with distance sorting

Database Setup

1. Install Dependencies

pip install sqlalchemy alembic

2. Configure Database

Create .env file (or use .env.example):

DATABASE_URL=sqlite:///./marketplace.db
# Or PostgreSQL: postgresql://user:password@localhost/marketplace
# Or MySQL: mysql+pymysql://user:password@localhost/marketplace

3. Seed Sample Data

python seed_data.py

This creates:

  • 5 suppliers (Jakarta, Bandung, Surabaya, Jakarta Selatan, Medan)
  • 15 products (laptops, smartphones, monitors, accessories)
  • 3 users (2 with AI access enabled)

4. Start Server

python main.py

Database tables are auto-created on first startup.
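
The auto-creation follows the usual SQLAlchemy pattern of calling Base.metadata.create_all at startup; a minimal sketch under that assumption (the model shown is illustrative, not the full schema):

import os
from fastapi import FastAPI
from sqlalchemy import create_engine, Column, Integer, String, Float
from sqlalchemy.orm import declarative_base

DATABASE_URL = os.getenv("DATABASE_URL", "sqlite:///./marketplace.db")
engine = create_engine(
    DATABASE_URL,
    connect_args={"check_same_thread": False} if DATABASE_URL.startswith("sqlite") else {},
)
Base = declarative_base()

class Supplier(Base):                      # illustrative model, not the repo's full schema
    __tablename__ = "suppliers"
    id = Column(Integer, primary_key=True)
    business_name = Column(String, index=True)
    latitude = Column(Float)
    longitude = Column(Float)

app = FastAPI()

@app.on_event("startup")
def create_tables() -> None:
    Base.metadata.create_all(bind=engine)  # no-op if the tables already exist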

Testing Marketplace Endpoints

Run comprehensive test suite:

pytest tests/test_marketplace.py -v

Tests cover:

  • Supplier registration and listing
  • Product creation, update, and search
  • User registration
  • Location-aware search
  • AI-powered search (mocked)
  • Utility functions (Haversine distance, location parsing)
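
A sketch of what such an endpoint test can look like with FastAPI's TestClient, assuming main.py exposes the app instance and the endpoint accepts the payload shown above (the exact assertions in tests/test_marketplace.py may differ):

from fastapi.testclient import TestClient
from main import app  # assumes the FastAPI instance in main.py is named `app`

client = TestClient(app)

def test_register_supplier():
    payload = {
        "name": "John Doe",
        "business_name": "Tech Store Jakarta",
        "email": "john@techstore.com",
        "phone": "+62812345678",
        "address": "Jl. Sudirman No. 123",
        "latitude": -6.2088,
        "longitude": 106.8456,
        "city": "Jakarta",
        "province": "DKI Jakarta",
    }
    response = client.post("/api/suppliers/register", json=payload)
    assert response.status_code in (200, 201)
    # assumed: the response echoes the created supplier
    assert response.json()["business_name"] == "Tech Store Jakarta"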

Deprecated Features

Note: Previous multimodal features (KTP OCR, image/video processing) are deprecated as of the migration to Qwen3-4B-Instruct (text-only model). The /ktp-ocr/ endpoint code remains but is non-functional with the current model.

Hugging Face Space

The project is hosted on Hugging Face Spaces for easy access: KillerKing93/Transformers-TextEngine-InferenceServer-OpenAPI-Compatible-V3

You can use the Space's API endpoints directly or access the web UI.

Quick Start

Option 1: Run with Docker (with-model images: CPU / NVIDIA / AMD)

Tags built by CI:

  • ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-cpu
  • ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-nvidia
  • ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-amd

Pull:

# CPU
docker pull ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-cpu

# NVIDIA (CUDA 12.4 wheel)
docker pull ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-nvidia

# AMD (ROCm 6.2 wheel)
docker pull ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-amd

Run:

# CPU
docker run -p 3000:3000 \
  -e HF_TOKEN=your_hf_token_here \
  ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-cpu

# NVIDIA GPU (requires NVIDIA drivers + nvidia-container-toolkit on the host)
docker run --gpus all -p 3000:3000 \
  -e HF_TOKEN=your_hf_token_here \
  ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-nvidia

# AMD GPU ROCm (requires ROCm 6.2+ drivers on the host; Linux only)
# Map ROCm devices and video group (may vary by distro)
docker run --device=/dev/kfd --device=/dev/dri --group-add video \
  -p 3000:3000 \
  -e HF_TOKEN=your_hf_token_here \
  ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-amd

Health check:

curl http://localhost:3000/health

Swagger UI: http://localhost:3000/docs

OpenAPI (YAML): http://localhost:3000/openapi.yaml

Notes:

  • These are with-model images; the first pull is large. In CI, after "Model downloaded." BuildKit may appear idle while tarring/committing the multi‑GB layer.
  • Host requirements:
    • NVIDIA: recent driver + nvidia-container-toolkit.
    • AMD: ROCm 6.2+ driver stack, supported GPU, and mapped /dev/kfd and /dev/dri devices.

Option 2: Run Locally

Requirements

  • Python 3.10+
  • pip
  • PyTorch (install a wheel matching your platform/CUDA)
  • Optionally a GPU with enough VRAM for the chosen model

Install

  1. Create and activate a virtual environment (Windows CMD):

     python -m venv .venv
     .venv\Scripts\activate

  2. Install dependencies: pip install -r requirements.txt

  3. Install PyTorch appropriate for your platform (examples):

     CPU-only: pip install torch --index-url https://download.pytorch.org/whl/cpu
     CUDA 12.4: pip install torch --index-url https://download.pytorch.org/whl/cu124

  4. Create a .env from the template and adjust if needed: copy .env.example .env

    • Set HF_TOKEN if the model is gated
    • Adjust MAX_TOKENS, TEMPERATURE, DEVICE_MAP, TORCH_DTYPE, MAX_VIDEO_FRAMES as desired

Configuration via .env

See .env.example. Important variables:

  • PORT=3000
  • MODEL_REPO_ID=unsloth/Qwen3-4B-Instruct-2507
  • HF_TOKEN= # optional if gated
  • MAX_TOKENS=4096
  • TEMPERATURE=0.7
  • MAX_VIDEO_FRAMES=16
  • DEVICE_MAP=auto
  • TORCH_DTYPE=auto
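
These settings are plain environment lookups; a minimal sketch of reading them (assumes python-dotenv for .env loading; defaults mirror the list above):

import os
from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # picks up .env from the working directory

PORT = int(os.getenv("PORT", "3000"))
MODEL_REPO_ID = os.getenv("MODEL_REPO_ID", "unsloth/Qwen3-4B-Instruct-2507")
HF_TOKEN = os.getenv("HF_TOKEN") or None          # only needed for gated models
MAX_TOKENS = int(os.getenv("MAX_TOKENS", "4096"))
TEMPERATURE = float(os.getenv("TEMPERATURE", "0.7"))
MAX_VIDEO_FRAMES = int(os.getenv("MAX_VIDEO_FRAMES", "16"))
DEVICE_MAP = os.getenv("DEVICE_MAP", "auto")
TORCH_DTYPE = os.getenv("TORCH_DTYPE", "auto")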

Additional streaming/persistence configuration

  • PERSIST_SESSIONS=1 # enable SQLite-backed resumable SSE
  • SESSIONS_DB_PATH=sessions.db # SQLite db path
  • SESSIONS_TTL_SECONDS=600 # TTL for finished sessions before GC
  • CANCEL_AFTER_DISCONNECT_SECONDS=3600 # auto-cancel generation if all clients disconnect for this many seconds (0=disable)

Cancel session API (custom extension)

  • Endpoint: POST /v1/cancel/{session_id}

  • Purpose: Manually cancel an in-flight streaming generation for the given session_id. Not part of OpenAI Chat Completions spec (the newer OpenAI Responses API has cancel), so this is provided as a practical extension.

  • Example (Windows CMD): curl -X POST http://localhost:3000/v1/cancel/mysession123

Run

  • Direct: python main.py

  • Using uvicorn: uvicorn main:app --host 0.0.0.0 --port 3000

Endpoints (OpenAI-compatible)

  • Swagger UI GET /docs

  • OpenAPI (YAML) GET /openapi.yaml

  • Health GET /health

    Example: curl http://localhost:3000/health

    Response:

    { "ok": true, "modelReady": true, "modelId": "unsloth/Qwen3-4B-Instruct-2507", "error": null }

  • Chat Completions (non-streaming) POST /v1/chat/completions

    Example (Windows CMD):

    curl -X POST http://localhost:3000/v1/chat/completions ^
      -H "Content-Type: application/json" ^
      -d "{\"model\":\"qwen-local\",\"messages\":[{\"role\":\"user\",\"content\":\"Describe this image briefly\"}],\"max_tokens\":4096}"

  • KTP OCR POST /ktp-ocr/ (deprecated; non-functional with the current text-only model)

    Example (Windows CMD):

    curl -X POST http://localhost:3000/ktp-ocr/ ^
      -F "image=@image.jpg"

    Example (PowerShell, chat completions):

    $body = @{
      model = "qwen-local"
      messages = @(@{ role = "user"; content = "Hello Qwen3!" })
      max_tokens = 4096
    } | ConvertTo-Json -Depth 5
    curl -Method POST http://localhost:3000/v1/chat/completions -ContentType "application/json" -Body $body

  • Chat Completions (streaming via Server-Sent Events) Set "stream": true to receive partial deltas as they are generated.

    Example (Windows CMD):

    curl -N -H "Content-Type: application/json" ^
      -d "{\"model\":\"qwen-local\",\"messages\":[{\"role\":\"user\",\"content\":\"Think step by step: what is 17 * 23?\"}],\"stream\":true}" ^
      http://localhost:3000/v1/chat/completions

    The stream format follows OpenAI-style SSE:

    data: { "id": "...", "object": "chat.completion.chunk", "choices": [{ "delta": { "role": "assistant" } }] }
    data: { "choices": [{ "delta": { "content": "To" } }] }
    data: { "choices": [{ "delta": { "content": " think..." } }] }
    ...
    data: { "choices": [{ "delta": {}, "finish_reason": "stop" }] }
    data: [DONE]
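
Because the endpoint is OpenAI-compatible, the official openai Python client can be pointed at it as well; a short sketch (the api_key value is a placeholder, since the curl examples above send no key):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:3000/v1", api_key="not-needed")

# Non-streaming
resp = client.chat.completions.create(
    model="qwen-local",
    messages=[{"role": "user", "content": "Summarize the Haversine formula in one sentence."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)

# Streaming (SSE under the hood)
stream = client.chat.completions.create(
    model="qwen-local",
    messages=[{"role": "user", "content": "Think step by step: what is 17 * 23?"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)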

Multimodal Usage (deprecated: the current default model is text-only; see Deprecated Features above)

  • Text only:

    { "role": "user", "content": "Summarize: The quick brown fox ..." }

  • Image by URL:

    {
      "role": "user",
      "content": [
        { "type": "text", "text": "What is in this image?" },
        { "type": "image_url", "image_url": { "url": "https://example.com/cat.jpg" } }
      ]
    }

  • Image by base64:

    {
      "role": "user",
      "content": [
        { "type": "text", "text": "OCR this." },
        { "type": "input_image", "b64_json": "<base64 image>" }
      ]
    }

  • Video by URL (frames are sampled up to MAX_VIDEO_FRAMES):

    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Describe this clip." },
        { "type": "video_url", "video_url": { "url": "https://example.com/clip.mp4" } }
      ]
    }

  • Video by base64:

    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Count the number of cars." },
        { "type": "input_video", "b64_json": "<base64 video>" }
      ]
    }

Implementation Notes

  • Server code: main.py
    • FastAPI with CORS enabled
    • Non-streaming and streaming endpoints
    • Uses AutoProcessor and AutoModelForCausalLM with trust_remote_code=True
    • Converts OpenAI-style messages into the Qwen multimodal format
    • Images loaded via PIL; videos loaded via imageio.v3 (preferred) or OpenCV as fallback; frames sampled
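
A sketch of the load-and-generate path described above, assuming the environment variable names from the configuration section (main.py may differ in details such as eager loading and streaming):

import os
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_REPO_ID = os.getenv("MODEL_REPO_ID", "unsloth/Qwen3-4B-Instruct-2507")
HF_TOKEN = os.getenv("HF_TOKEN") or None
TORCH_DTYPE = os.getenv("TORCH_DTYPE", "auto")
DEVICE_MAP = os.getenv("DEVICE_MAP", "auto")

dtype = TORCH_DTYPE if TORCH_DTYPE == "auto" else getattr(torch, TORCH_DTYPE)

processor = AutoProcessor.from_pretrained(
    MODEL_REPO_ID, trust_remote_code=True, token=HF_TOKEN)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_REPO_ID, trust_remote_code=True, token=HF_TOKEN,
    torch_dtype=dtype, device_map=DEVICE_MAP)

# OpenAI-style messages are converted, then rendered with the repo's chat template.
messages = [{"role": "user", "content": "Hello Qwen3!"}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))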

Performance Tips

  • On GPUs: set DEVICE_MAP=auto and TORCH_DTYPE=bfloat16 or float16 if supported
  • Reduce MAX_VIDEO_FRAMES to speed up video processing
  • Tune MAX_TOKENS and TEMPERATURE according to your needs

Troubleshooting

  • ImportError or no CUDA found:
    • Ensure PyTorch is installed with the correct wheel for your environment.
  • OOM / CUDA out of memory:
    • Use a smaller model, lower MAX_VIDEO_FRAMES, lower MAX_TOKENS, or run on CPU.
  • 503 Model not ready:
    • The first request triggers model load; check /health for errors and HF_TOKEN if gated.

License

  • See LICENSE for terms.

Changelog and Architecture

Streaming behavior, resume, and reconnections

The server streams responses using Server-Sent Events (SSE) from chat_completions() in main.py, driven by token iteration in infer_stream(). It now supports resumable streaming using an in-memory ring buffer and the SSE Last-Event-ID header, with optional SQLite persistence (enable PERSIST_SESSIONS=1).

What’s implemented

  • Per-session in-memory ring buffer keyed by session_id (no external storage).
  • Each SSE event carries an SSE id line in the format "session_id:index" so clients can resume with Last-Event-ID.
  • On reconnect:
    • Provide the same session_id in the request body, and
    • Provide "Last-Event-ID: session_id:index" header (or query ?last_event_id=session_id:index).
    • The server replays cached events after index and continues streaming new tokens.
  • Session TTL: ~10 minutes, buffer capacity: ~2048 events. Old or finished sessions are garbage-collected in-memory.
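
A compact sketch of the server-side idea: one ring buffer per session_id, with replay from a given index on reconnect (the classes in main.py may be organized differently):

import collections, time

class SessionBuffer:
    def __init__(self, capacity: int = 2048, ttl_seconds: int = 600):
        self.capacity = capacity
        self.ttl = ttl_seconds
        self.events = collections.deque()   # (index, sse_payload) pairs
        self.next_index = 0
        self.finished = False
        self.updated_at = time.time()

    def append(self, payload: str) -> int:
        idx = self.next_index
        self.events.append((idx, payload))
        if len(self.events) > self.capacity:   # the oldest chunks become unavailable
            self.events.popleft()
        self.next_index += 1
        self.updated_at = time.time()
        return idx

    def replay_after(self, last_index: int):
        """Yield (index, payload) for every cached event newer than last_index."""
        for idx, payload in self.events:
            if idx > last_index:
                yield idx, payload

    def expired(self) -> bool:
        return self.finished and time.time() - self.updated_at > self.ttl

# SSE framing: each event is written as "id: <session_id>:<index>" plus "data: <payload>" lines.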

How to start a streaming session

  • Minimal (the server generates a session_id internally for SSE id lines):

    Windows CMD:
    curl -N -H "Content-Type: application/json" ^
      -d "{\"messages\":[{\"role\":\"user\",\"content\":\"Think step by step: 17*23?\"}],\"stream\":true}" ^
      http://localhost:3000/v1/chat/completions

  • With explicit session_id (recommended if you want to resume):

    Windows CMD:
    curl -N -H "Content-Type: application/json" ^
      -d "{\"session_id\":\"mysession123\",\"messages\":[{\"role\":\"user\",\"content\":\"Think step by step: 17*23?\"}],\"stream\":true}" ^
      http://localhost:3000/v1/chat/completions

How to resume after disconnect
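
Reconnect with the same session_id in the request body and a Last-Event-ID header (or ?last_event_id=...). A minimal Python client sketch using requests, which tracks the SSE id lines so it can resume (event format as described below):

import json
import requests

def stream_chat(last_event_id: str | None = None) -> str | None:
    headers = {"Content-Type": "application/json"}
    if last_event_id:
        headers["Last-Event-ID"] = last_event_id       # replay starts after this index
    body = {
        "session_id": "mysession123",
        "messages": [{"role": "user", "content": "Think step by step: 17*23?"}],
        "stream": True,
    }
    last_id = last_event_id
    with requests.post("http://localhost:3000/v1/chat/completions",
                       json=body, headers=headers, stream=True, timeout=300) as r:
        for line in r.iter_lines(decode_unicode=True):
            if not line:
                continue                               # blank line separates SSE events
            if line.startswith("id: "):
                last_id = line[len("id: "):]           # e.g. "mysession123:5"
            elif line.startswith("data: "):
                payload = line[len("data: "):]
                if payload == "[DONE]":
                    return None                        # finished; nothing to resume
                delta = json.loads(payload)["choices"][0]["delta"].get("content")
                if delta:
                    print(delta, end="", flush=True)
    return last_id                                     # resume later with this value

resume_from = stream_chat()
if resume_from:                                        # connection dropped mid-stream
    stream_chat(last_event_id=resume_from)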

Event format

  • Chunks follow the OpenAI-style "chat.completion.chunk" shape in data payloads, plus an SSE id line:

    id: mysession123:5
    data: {"id":"mysession123","object":"chat.completion.chunk","created":..., "model":"...", "choices":[{"index":0,"delta":{"content":" token"},"finish_reason":null}]}

  • The stream ends with:

    data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
    data: [DONE]

Notes and limits

  • Without PERSIST_SESSIONS=1, session state is kept only in memory; restarts will drop buffers.
  • If the buffer overflows before you resume, the earliest chunks may be unavailable.
  • Cancellation on client disconnect is not immediate; by default generation keeps running in the background until CANCEL_AFTER_DISCONNECT_SECONDS elapses or the session is cancelled manually (see Cancellation and session persistence below).

Hugging Face repository files support

This server loads the Qwen3 model via Transformers with trust_remote_code=True, so the standard files from the repo are supported and consumed automatically. Summary for https://huggingface.co/unsloth/Qwen3-4B-Instruct-2507/tree/main:

  • Used for model weights and architecture

    • model.safetensors — main weights loaded by AutoModelForCausalLM
    • config.json — architecture/config
    • generation_config.json — default gen params (we may override via request or env)
  • Used by tokenizer

    • tokenizer.json — primary tokenizer specification
    • tokenizer_config.json — tokenizer settings
    • merges.txt and vocab.json — fallback/compat files; if tokenizer.json exists, HF generally prefers it
  • Used by processors (multimodal)

    • preprocessor_config.json — image/text processor config
    • video_preprocessor_config.json — video processor config (frame sampling, etc.)
    • chat_template.json — chat formatting used by infer() and infer_stream() via processor.apply_chat_template(...)
  • Not required for runtime

    • README.md, .gitattributes — ignored by runtime

Notes:

  • We rely on Transformers’ AutoModelForCausalLM and AutoProcessor to resolve and use the above files; no manual parsing is required in our code.
  • With trust_remote_code=True, model-specific code from the repo may load additional assets transparently.
  • If the repo updates configs (e.g., new chat template), the server will pick them up on next load.

Cancellation and session persistence

  • Auto-cancel on disconnect:

    • Generation is automatically cancelled if all clients disconnect for more than CANCEL_AFTER_DISCONNECT_SECONDS (default 3600 seconds = 1 hour). Configure in .env.example via CANCEL_AFTER_DISCONNECT_SECONDS.
    • Implemented by a timer in chat_completions() that triggers a cooperative stop through a stopping criteria in infer_stream().
  • Manual cancel API (custom extension):

    • Endpoint: POST /v1/cancel/{session_id}
    • Cancels an ongoing streaming session and marks it finished in the store. Example (Windows CMD): curl -X POST http://localhost:3000/v1/cancel/mysession123
    • This is not part of OpenAI’s legacy Chat Completions spec. OpenAI’s newer Responses API has a cancel endpoint, but Chat Completions does not. We provide this custom endpoint for operational control.
  • Persistence:

    • Optional SQLite-backed persistence for resumable SSE (enable PERSIST_SESSIONS=1 in .env.example).
    • Database path: SESSIONS_DB_PATH (default: sessions.db)
    • Session TTL for GC: SESSIONS_TTL_SECONDS (default: 600)
    • See the implementation in the _SQLiteStore class and its integration in chat_completions().
    • Redis is not implemented yet; the design isolates persistence so a Redis-backed store can be added as a drop-in.
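
The cooperative stop is the standard transformers StoppingCriteria pattern driven by a per-session flag; a hedged sketch of that idea (the actual timer and store wiring in main.py may differ):

import threading
import torch
from transformers import StoppingCriteria, StoppingCriteriaList

class CancelCriteria(StoppingCriteria):
    def __init__(self, cancel_event: threading.Event):
        self.cancel_event = cancel_event

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        return self.cancel_event.is_set()   # True stops generation after the current token

cancel_events: dict[str, threading.Event] = {}   # keyed by session_id

def cancel_session(session_id: str) -> bool:
    """Backs POST /v1/cancel/{session_id}: flip the event so generation stops."""
    event = cancel_events.get(session_id)
    if event is None:
        return False
    event.set()
    return True

# At generation time (illustrative):
#   ev = cancel_events.setdefault(session_id, threading.Event())
#   model.generate(**inputs, stopping_criteria=StoppingCriteriaList([CancelCriteria(ev)]))
# The disconnect timer simply calls ev.set() after CANCEL_AFTER_DISCONNECT_SECONDS.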

Deploy on Render

Render has two easy options. Since our image already bakes the model, the fastest path is to deploy the public Docker image (CPU). Render currently doesn’t provide NVIDIA/AMD GPUs for standard Web Services, so use the CPU image.

Option A — Deploy public Docker image (recommended)

  1. In Render Dashboard: New → Web Service
  2. Environment: Docker → Public Docker image
  3. Image
    • ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-cpu
  4. Instance and region
    • Region: closest to your users
    • Instance type: pick a plan with at least 16 GB RAM (more if you see OOM)
  5. Port/health
    • Health Check Path: /health (the container listens on port 3000)
  6. Start command
    • Leave empty; the image's default command starts the server
  7. Environment variables
    • EAGER_LOAD_MODEL=1
    • MAX_TOKENS=4096
    • HF_TOKEN=your_hf_token_here (only if the model is gated)
    • Optional persistence:
      • PERSIST_SESSIONS=1
      • SESSIONS_DB_PATH=/data/sessions.db (requires a disk)
  8. Persistent Disk (optional)
    • Add a Disk (e.g., 1–5 GB) and mount it at /data if you enable SQLite persistence
  9. Create Web Service and wait for it to start
  10. Verify: open /health on your service URL and confirm "modelReady": true

Option B — Build directly from this GitHub repo (Dockerfile)

  1. In Render Dashboard: New → Web Service → Build from a Git repo (connect this repo)
  2. Render will detect the Dockerfile automatically (no Build Command needed)
  3. Advanced → Docker Build Args
    • BACKEND=cpu (ensures CPU-only torch wheel)
  4. Health and env vars
    • Health Check Path: /health
    • Set EAGER_LOAD_MODEL, MAX_TOKENS, HF_TOKEN as needed (same as Option A)
  5. (Optional) Add a Disk and mount at /data, then set SESSIONS_DB_PATH=/data/sessions.db if you want resumable SSE across restarts
  6. Deploy (first build can take a while due to the multi-GB model layer)

Notes and limits on Render

  • GPU acceleration (NVIDIA/AMD) isn’t available for standard Web Services on Render; use the CPU image.
  • The image already contains the default Qwen3 model under /app/hf-cache, so there’s no model download at runtime.
  • SSE is supported; streaming is produced by chat_completions(). Keep the connection open to avoid idle timeouts.
  • If you enable SQLite persistence, remember to attach a Disk; otherwise, the DB is ephemeral.

Example render.yaml (optional IaC)

If you prefer infrastructure-as-code, you can use a render.yaml like:

services:
  - type: web
    name: qwen-vl-cpu
    env: docker
    image:
      url: ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-cpu
    plan: standard
    region: oregon
    healthCheckPath: /health
    autoDeploy: true
    envVars:
      - key: EAGER_LOAD_MODEL
        value: "1"
      - key: MAX_TOKENS
        value: "4096"
      - key: HF_TOKEN
        sync: false # set in dashboard or use Render secrets
      - key: PERSIST_SESSIONS
        value: "1"
      - key: SESSIONS_DB_PATH
        value: "/data/sessions.db"
    # disks: # uncomment if using persistence
    #   - name: data
    #     mountPath: /data
    #     sizeGB: 5

After deploy: open /health and /docs on your Render URL to confirm the service is up and the model is loaded.

Deploy on Hugging Face Spaces

Recommended: Docker Space (works with our FastAPI app and preserves multimodal behavior). You can run CPU or GPU hardware. To persist the HF cache across restarts, enable Persistent Storage and point HF cache to /data.

A) Create the Space (Docker)

  1. Install the CLI and log in:

     pip install -U "huggingface_hub[cli]"
     huggingface-cli login

  2. Create a Docker Space (public or private):

     huggingface-cli repo create my-qwen3-vl-server --type space --sdk docker

  3. Add the Space as a remote and push this repo:

     git remote add hf https://huggingface.co/spaces/YOUR_USERNAME/my-qwen3-vl-server
     git push hf main

This pushes Dockerfile, main.py, requirements.txt. The Space will auto-build your container.

B) Configure Space settings

  • Hardware:

    • CPU: works out-of-the-box (fast to build, slower inference).
    • GPU: choose a GPU tier (e.g., T4/A10G/L4) for faster inference.
  • Persistent Storage (recommended):

    • Enable Persistent storage (e.g., 10–30 GB).
    • This lets you cache models and sessions across restarts.
  • Variables and Secrets:

    • Variables:
      • EAGER_LOAD_MODEL=1
      • MAX_TOKENS=4096
      • HF_HOME=/data/hf-cache
      • TRANSFORMERS_CACHE=/data/hf-cache
    • Secrets:
      • HF_TOKEN=your_hf_token_if_model_is_gated

C) CPU vs GPU on Spaces

  • CPU: No change needed. Our Dockerfile defaults to CPU PyTorch and bakes the model during build. It will run on CPU Spaces.
  • GPU: Edit the Space’s Dockerfile to switch the backend before the next build:
    • In the file editor of the Space UI, change: ARG BACKEND=cpu to: ARG BACKEND=nvidia
    • Save/commit; the Space rebuilds with a CUDA-enabled torch. Choose a GPU hardware tier in the Space settings. Note: Building the GPU image pulls CUDA torch wheels and increases build time.
  • AMD ROCm is not available on Spaces; use NVIDIA GPUs on Spaces.

D) Speed up cold starts and caching

  • With Persistent Storage enabled and HF_HOME/TRANSFORMERS_CACHE pointed to /data/hf-cache, the model cache persists across restarts (subsequent spins are much faster).
  • Keep the Space “Always on” if available on your plan to avoid cold starts.

E) Space endpoints: the Space serves the same endpoints as a local run (/health, /docs, /openapi.yaml, /v1/chat/completions) on the Space’s public URL.

F) Quick test after the Space is “Running”: request /health on the Space URL and confirm "modelReady": true, then send a small /v1/chat/completions request.

Notes

  • The Space build step can appear “idle” after “Model downloaded.” while Docker commits a multi‑GB layer; this is expected.
  • If you hit OOM, increase the Space hardware memory or switch to a GPU tier. Reduce MAX_VIDEO_FRAMES and MAX_TOKENS if needed.