---
title: Transformers Inference Server (Qwen3‑VL)
emoji: 🐍
colorFrom: purple
colorTo: green
sdk: docker
app_port: 3000
pinned: false
---
Python FastAPI Inference Server (OpenAI-Compatible) for Qwen3-4B-Instruct
AI-Powered Marketplace Intelligence System
This repository provides an OpenAI-compatible inference server powered by Qwen3-4B-Instruct, designed to serve as the AI backend for a smart marketplace platform where:
- Suppliers can register and list their products
- Users can query product availability, get AI-powered recommendations, and find the nearest suppliers
- AI Assistant helps users discover products based on their needs, location, and preferences
The system has been migrated from a Node.js/llama.cpp stack to a Python/Transformers stack for better model compatibility and performance.
Key files:
- Server entry: main.py
- Environment template: .env.example
- Python dependencies: requirements.txt
- Architecture: ARCHITECTURE.md
Model:
- Default: unsloth/Qwen3-4B-Instruct-2507 (Transformers; text-only instruct model)
- You can change the model via environment variable MODEL_REPO_ID.
Marketplace Features (Planned)
This inference server will power the following marketplace capabilities:
1. Supplier Management
- Suppliers can register their business
- List products with details (name, price, stock, location)
- Update inventory in real-time
2. AI-Powered Product Discovery
Users with AI access can:
- Query Stock Availability: "Apakah ada stok laptop gaming di Jakarta?" ("Is there any gaming-laptop stock in Jakarta?")
- Get Recommendations: "Saya butuh laptop untuk programming, budget 10 juta, yang bagus apa?" ("I need a laptop for programming with a 10 million rupiah budget; which one is good?")
- Find Nearest Suppliers: Products are matched based on user's location to show the closest available suppliers
- Compare Products: AI helps compare features, prices, and reviews
3. Location-Aware Intelligence
- Automatic supplier-to-user distance calculation
- Prioritize recommendations based on proximity
- Suggest delivery or pickup options based on distance
4. Natural Language Interaction
- Users can ask in natural Indonesian or English
- AI understands context and user preferences over conversation
- Personalized recommendations based on chat history
Current Status
✅ Backend fully implemented. All core marketplace features below are in place:
- ✅ Inference server (OpenAI-compatible /v1/chat/completions)
- ✅ Marketplace database (5 tables with relationships and indexes)
- ✅ 10+ RESTful API endpoints (suppliers, products, users, AI search)
- ✅ Location-aware search with Haversine distance calculation
- ✅ AI-powered natural language product search
- ✅ Sample data seeding script
- ⏳ Frontend UI (planned)
Marketplace API Endpoints
Supplier Management
Register Supplier
POST /api/suppliers/register
Content-Type: application/json
{
"name": "John Doe",
"business_name": "Tech Store Jakarta",
"email": "john@techstore.com",
"phone": "+62812345678",
"address": "Jl. Sudirman No. 123",
"latitude": -6.2088,
"longitude": 106.8456,
"city": "Jakarta",
"province": "DKI Jakarta"
}
List Suppliers
GET /api/suppliers?city=Jakarta&skip=0&limit=100
Get Supplier Details
GET /api/suppliers/{supplier_id}
Product Management
Create Product
POST /api/suppliers/{supplier_id}/products
Content-Type: application/json
{
"name": "Laptop ASUS ROG Strix G15",
"description": "Gaming laptop with RTX 4060",
"price": 15999000,
"stock_quantity": 5,
"category": "laptop",
"tags": "gaming,asus,rtx",
"sku": "ASU-ROG-G15-001"
}
Update Product
PUT /api/products/{product_id}
Content-Type: application/json
{
"price": 14999000,
"stock_quantity": 3
}
List Products
GET /api/products?category=laptop&min_price=5000000&max_price=20000000&available_only=true
Search Products (Location-Aware)
GET /api/products/search?q=laptop+gaming&city=Jakarta&user_lat=-6.2088&user_lon=106.8456&max_price=15000000
Response includes distance_km field for each product (distance from user location).
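The distance_km values come from the Haversine great-circle formula between the user's coordinates and each supplier's coordinates. A minimal sketch of that calculation (the exact function name and rounding in main.py may differ):

import math

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two (lat, lon) points in kilometers."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Example: user in Jakarta (-6.2088, 106.8456) to a supplier in Bandung (-6.9175, 107.6191)
print(round(haversine_km(-6.2088, 106.8456, -6.9175, 107.6191), 1))  # roughly 116 km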
User Management
Register User
POST /api/users/register
Content-Type: application/json
{
"name": "Jane Smith",
"email": "jane@email.com",
"phone": "+62856789012",
"latitude": -6.2088,
"longitude": 106.8456,
"city": "Jakarta",
"province": "DKI Jakarta",
"ai_access_enabled": true
}
Get User Profile
GET /api/users/{user_id}
AI-Powered Search (Premium Feature)
The main feature! Natural language product search with AI recommendations.
POST /api/chat/search
Content-Type: application/json
{
"user_id": 1,
"query": "Saya butuh laptop gaming di Jakarta, budget 12 juta",
"session_id": "optional-session-id"
}
Response:
{
"session_id": "user_1_1731485760",
"response": "Berdasarkan budget Anda 12 juta dan lokasi di Jakarta, saya merekomendasikan HP Pavilion Gaming (Rp 11.999.000) dari Tech Store Surabaya. Laptop ini memiliki RTX 3050 graphics yang bagus untuk gaming...",
"products_found": 3,
"conversation_id": 1
}
How it works:
- Parses natural language query (extracts category, budget, location)
- Searches database with filters
- Sorts results by distance from user
- Injects product catalog into AI system prompt
- AI generates personalized recommendation
- Saves conversation to database for history
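A condensed sketch of that flow; every helper name below is illustrative rather than the actual code in main.py:

# Hypothetical outline of the /api/chat/search flow; helper names are illustrative.
def ai_product_search(user, query: str) -> str:
    filters = parse_query(query)                  # extract category, budget, location from the text
    products = search_products(**filters)         # filtered database search
    products.sort(key=lambda p: haversine_km(     # nearest suppliers first
        user.latitude, user.longitude, p.supplier.latitude, p.supplier.longitude))
    catalog = "\n".join(
        f"- {p.name} | Rp {p.price:,.0f} | stock {p.stock_quantity} | {p.supplier.city}"
        for p in products[:10])
    system_prompt = "You are a marketplace assistant. Recommend only from this catalog:\n" + catalog
    messages = [{"role": "system", "content": system_prompt},
                {"role": "user", "content": query}]
    reply = chat_completion(messages)             # call the local /v1/chat/completions
    save_conversation(user.id, query, reply)      # persist for chat history
    return reply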
Requirements:
- User must have ai_access_enabled: true
- Location coordinates help with distance sorting
Database Setup
1. Install Dependencies
pip install sqlalchemy alembic
2. Configure Database
Create .env file (or use .env.example):
DATABASE_URL=sqlite:///./marketplace.db
# Or PostgreSQL: postgresql://user:password@localhost/marketplace
# Or MySQL: mysql+pymysql://user:password@localhost/marketplace
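For reference, this is roughly how a FastAPI + SQLAlchemy app picks up DATABASE_URL; a sketch only, the actual session setup in the repo may differ:

import os
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# Read the connection string from the environment, defaulting to local SQLite.
DATABASE_URL = os.getenv("DATABASE_URL", "sqlite:///./marketplace.db")

# SQLite needs check_same_thread=False when accessed from FastAPI worker threads.
connect_args = {"check_same_thread": False} if DATABASE_URL.startswith("sqlite") else {}
engine = create_engine(DATABASE_URL, connect_args=connect_args)
SessionLocal = sessionmaker(bind=engine, autoflush=False, autocommit=False)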
3. Seed Sample Data
python seed_data.py
This creates:
- 5 suppliers (Jakarta, Bandung, Surabaya, Jakarta Selatan, Medan)
- 15 products (laptops, smartphones, monitors, accessories)
- 3 users (2 with AI access enabled)
4. Start Server
python main.py
Database tables are auto-created on first startup.
Testing Marketplace Endpoints
Run comprehensive test suite:
pytest tests/test_marketplace.py -v
Tests cover:
- Supplier registration and listing
- Product creation, update, and search
- User registration
- Location-aware search
- AI-powered search (mocked)
- Utility functions (Haversine distance, location parsing)
Deprecated Features
Note: Previous multimodal features (KTP OCR, image/video processing) are deprecated as of the migration to Qwen3-4B-Instruct (text-only model). The /ktp-ocr/ endpoint code remains but is non-functional with the current model.
Hugging Face Space
The project is hosted on Hugging Face Spaces for easy access: KillerKing93/Transformers-TextEngine-InferenceServer-OpenAPI-Compatible-V3
You can use the Space's API endpoints directly or access the web UI.
Quick Start
Option 1: Run with Docker (with-model images: CPU / NVIDIA / AMD)
Tags built by CI:
- ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-cpu
- ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-nvidia
- ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-amd
Pull:
# CPU
docker pull ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-cpu
# NVIDIA (CUDA 12.4 wheel)
docker pull ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-nvidia
# AMD (ROCm 6.2 wheel)
docker pull ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-amd
Run:
# CPU
docker run -p 3000:3000 \
-e HF_TOKEN=your_hf_token_here \
ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-cpu
# NVIDIA GPU (requires NVIDIA drivers + nvidia-container-toolkit on the host)
docker run --gpus all -p 3000:3000 \
-e HF_TOKEN=your_hf_token_here \
ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-nvidia
# AMD GPU ROCm (requires ROCm 6.2+ drivers on the host; Linux only)
# Map ROCm devices and video group (may vary by distro)
docker run --device=/dev/kfd --device=/dev/dri --group-add video \
-p 3000:3000 \
-e HF_TOKEN=your_hf_token_here \
ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-amd
Health check:
curl http://localhost:3000/health
Swagger UI: http://localhost:3000/docs
OpenAPI (YAML): http://localhost:3000/openapi.yaml
Notes:
- These are with-model images; the first pull is large. In CI, after "Model downloaded." BuildKit may appear idle while tarring/committing the multi‑GB layer.
- Host requirements:
- NVIDIA: recent driver + nvidia-container-toolkit.
- AMD: ROCm 6.2+ driver stack, supported GPU, and mapped /dev/kfd and /dev/dri devices.
Option 2: Run Locally
Requirements
- Python 3.10+
- pip
- PyTorch (install a wheel matching your platform/CUDA)
- Optionally a GPU with enough VRAM for the chosen model
Install
Create and activate a virtual environment (Windows CMD):
python -m venv .venv
.venv\Scripts\activate
Install dependencies:
pip install -r requirements.txt
Install PyTorch appropriate for your platform (examples):
# CPU-only
pip install torch --index-url https://download.pytorch.org/whl/cpu
# CUDA 12.4
pip install torch --index-url https://download.pytorch.org/whl/cu124
Create a .env from the template and adjust if needed:
copy .env.example .env
- Set HF_TOKEN if the model is gated
- Adjust MAX_TOKENS, TEMPERATURE, DEVICE_MAP, TORCH_DTYPE, MAX_VIDEO_FRAMES as desired
Configuration via .env
See .env.example. Important variables:
- PORT=3000
- MODEL_REPO_ID=unsloth/Qwen3-4B-Instruct-2507
- HF_TOKEN= # optional if gated
- MAX_TOKENS=4096
- TEMPERATURE=0.7
- MAX_VIDEO_FRAMES=16
- DEVICE_MAP=auto
- TORCH_DTYPE=auto
Additional streaming/persistence configuration
- PERSIST_SESSIONS=1 # enable SQLite-backed resumable SSE
- SESSIONS_DB_PATH=sessions.db # SQLite db path
- SESSIONS_TTL_SECONDS=600 # TTL for finished sessions before GC
- CANCEL_AFTER_DISCONNECT_SECONDS=3600 # auto-cancel generation if all clients disconnect for this many seconds (0=disable)
Cancel session API (custom extension)
Endpoint: POST /v1/cancel/{session_id}
Purpose: Manually cancel an in-flight streaming generation for the given session_id. Not part of OpenAI Chat Completions spec (the newer OpenAI Responses API has cancel), so this is provided as a practical extension.
Example (Windows CMD):
curl -X POST http://localhost:3000/v1/cancel/mysession123
Run
Direct: python main.py
Using uvicorn: uvicorn main:app --host 0.0.0.0 --port 3000
Endpoints (OpenAI-compatible)
Swagger UI GET /docs
OpenAPI (YAML) GET /openapi.yaml
Health
GET /health
Example:
curl http://localhost:3000/health
Response:
{ "ok": true, "modelReady": true, "modelId": "unsloth/Qwen3-4B-Instruct-2507", "error": null }
Chat Completions (non-streaming)
POST /v1/chat/completions
Example (Windows CMD):
curl -X POST http://localhost:3000/v1/chat/completions ^
  -H "Content-Type: application/json" ^
  -d "{\"model\":\"qwen-local\",\"messages\":[{\"role\":\"user\",\"content\":\"Describe this image briefly\"}],\"max_tokens\":4096}"
KTP OCR (deprecated; non-functional with the current text-only model)
POST /ktp-ocr/
Example (Windows CMD):
curl -X POST http://localhost:3000/ktp-ocr/ ^
  -F "image=@image.jpg"
Example (PowerShell):
$body = @{
  model = "qwen-local"
  messages = @(@{ role = "user"; content = "Hello Qwen3!" })
  max_tokens = 4096
} | ConvertTo-Json -Depth 5
curl -Method POST http://localhost:3000/v1/chat/completions -ContentType "application/json" -Body $body
Chat Completions (streaming via Server-Sent Events)
Set "stream": true to receive partial deltas as they are generated.
Example (Windows CMD):
curl -N -H "Content-Type: application/json" ^
  -d "{\"model\":\"qwen-local\",\"messages\":[{\"role\":\"user\",\"content\":\"Think step by step: what is 17 * 23?\"}],\"stream\":true}" ^
  http://localhost:3000/v1/chat/completions
The stream format follows OpenAI-style SSE:
data: { "id": "...", "object": "chat.completion.chunk", "choices":[{ "delta": {"role": "assistant"} }]}
data: { "choices":[{ "delta": {"content": "To"} }]}
data: { "choices":[{ "delta": {"content": " think..."} }]}
...
data: { "choices":[{ "delta": {}, "finish_reason": "stop"}]}
data: [DONE]
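Because the endpoint is OpenAI-compatible, the official openai Python client (v1.x) can also be pointed at the server via base_url. A sketch, assuming the default port and that your deployment does not enforce an API key:

from openai import OpenAI

# Any placeholder api_key; adjust if your deployment enforces one.
client = OpenAI(base_url="http://localhost:3000/v1", api_key="not-needed")

# Non-streaming request
resp = client.chat.completions.create(
    model="qwen-local",
    messages=[{"role": "user", "content": "Hello Qwen3!"}],
    max_tokens=256,
)
print(resp.choices[0].message.content)

# Streaming request (the client consumes the SSE chunks for you)
stream = client.chat.completions.create(
    model="qwen-local",
    messages=[{"role": "user", "content": "Think step by step: what is 17 * 23?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)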
Multimodal Usage (deprecated: the current default model is text-only; retained for reference)
Text only: { "role": "user", "content": "Summarize: The quick brown fox ..." }
Image by URL: { "role": "user", "content": [ { "type": "text", "text": "What is in this image?" }, { "type": "image_url", "image_url": { "url": "https://example.com/cat.jpg" } } ] }
Image by base64: { "role": "user", "content": [ { "type": "text", "text": "OCR this." }, { "type": "input_image", "b64_json": "" } ] }
Video by URL (frames are sampled up to MAX_VIDEO_FRAMES): { "role": "user", "content": [ { "type": "text", "text": "Describe this clip." }, { "type": "video_url", "video_url": { "url": "https://example.com/clip.mp4" } } ] }
Video by base64: { "role": "user", "content": [ { "type": "text", "text": "Count the number of cars." }, { "type": "input_video", "b64_json": "" } ] }
Implementation Notes
- Server code: main.py
- FastAPI with CORS enabled
- Non-streaming and streaming endpoints
- Uses AutoProcessor and AutoModelForCausalLM with trust_remote_code=True
- Converts OpenAI-style messages into the Qwen multimodal format
- Images loaded via PIL; videos loaded via imageio.v3 (preferred) or OpenCV as fallback; frames sampled
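For orientation, the text-only inference path reduces to applying the chat template and calling generate. The sketch below mirrors the defaults above (MODEL_REPO_ID, DEVICE_MAP=auto, TORCH_DTYPE=auto) but is not the exact infer()/infer_stream() code, which additionally handles streaming, multimodal content, and cancellation:

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "unsloth/Qwen3-4B-Instruct-2507"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, device_map="auto", torch_dtype="auto")

messages = [{"role": "user", "content": "Hello Qwen3!"}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256,
                                do_sample=True, temperature=0.7)

# Decode only the newly generated tokens (skip the prompt).
new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
print(processor.decode(new_tokens, skip_special_tokens=True))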
Performance Tips
- On GPUs: set DEVICE_MAP=auto and TORCH_DTYPE=bfloat16 or float16 if supported
- Reduce MAX_VIDEO_FRAMES to speed up video processing
- Tune MAX_TOKENS and TEMPERATURE according to your needs
Troubleshooting
- ImportError or no CUDA found:
- Ensure PyTorch is installed with the correct wheel for your environment.
- OOM / CUDA out of memory:
- Use a smaller model, lower MAX_VIDEO_FRAMES, lower MAX_TOKENS, or run on CPU.
- 503 Model not ready:
- The first request triggers model load; check /health for errors and HF_TOKEN if gated.
License
- See LICENSE for terms.
Changelog and Architecture
- See ARCHITECTURE.md for detailed technical documentation and CLAUDE.md for development progress logs.
Streaming behavior, resume, and reconnections
The server streams responses using Server‑Sent Events (SSE) from the chat_completions() handler, driven by token iteration in infer_stream(). It supports resumable streaming using an in‑memory ring buffer and the SSE Last-Event-ID mechanism, with optional SQLite persistence (enable PERSIST_SESSIONS=1).
What’s implemented
- Per-session in-memory ring buffer keyed by session_id (no external storage).
- Each SSE event carries an SSE id line in the format "session_id:index" so clients can resume with Last-Event-ID.
- On reconnect:
- Provide the same session_id in the request body, and
- Provide "Last-Event-ID: session_id:index" header (or query ?last_event_id=session_id:index).
- The server replays cached events after index and continues streaming new tokens.
- Session TTL: ~10 minutes, buffer capacity: ~2048 events. Old or finished sessions are garbage-collected in-memory.
How to start a streaming session
Minimal (the server generates a session_id internally for the SSE id lines):
Windows CMD:
curl -N -H "Content-Type: application/json" ^
  -d "{\"messages\":[{\"role\":\"user\",\"content\":\"Think step by step: 17*23?\"}],\"stream\":true}" ^
  http://localhost:3000/v1/chat/completions
With an explicit session_id (recommended if you want to resume):
Windows CMD:
curl -N -H "Content-Type: application/json" ^
  -d "{\"session_id\":\"mysession123\",\"messages\":[{\"role\":\"user\",\"content\":\"Think step by step: 17*23?\"}],\"stream\":true}" ^
  http://localhost:3000/v1/chat/completions
How to resume after disconnect
Use the same session_id and the SSE Last-Event-ID header (or ?last_event_id=...):
Windows CMD (resume from index 42):
curl -N -H "Content-Type: application/json" ^
  -H "Last-Event-ID: mysession123:42" ^
  -d "{\"session_id\":\"mysession123\",\"messages\":[{\"role\":\"user\",\"content\":\"Think step by step: 17*23?\"}],\"stream\":true}" ^
  http://localhost:3000/v1/chat/completions
Alternatively with query string: http://localhost:3000/v1/chat/completions?last_event_id=mysession123:42
Event format
Chunks follow the OpenAI-style "chat.completion.chunk" shape in the data payloads, plus an SSE id line:
id: mysession123:5
data: {"id":"mysession123","object":"chat.completion.chunk","created":..., "model":"...", "choices":[{"index":0,"delta":{"content":" token"},"finish_reason":null}]}
The stream ends with:
data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
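For reference, a minimal resuming client using the requests library. This is a sketch that assumes the id:/data: layout shown above and trims error handling:

import json
import requests

URL = "http://localhost:3000/v1/chat/completions"
payload = {
    "session_id": "mysession123",
    "messages": [{"role": "user", "content": "Think step by step: 17*23?"}],
    "stream": True,
}

def stream_once(last_event_id: str | None) -> str | None:
    """Stream SSE events, print deltas, and return the last seen event id (None when finished)."""
    headers = {"Last-Event-ID": last_event_id} if last_event_id else {}
    with requests.post(URL, json=payload, headers=headers, stream=True) as r:
        for raw in r.iter_lines(decode_unicode=True):
            if not raw:
                continue
            if raw.startswith("id: "):
                last_event_id = raw[4:]
            elif raw.startswith("data: "):
                data = raw[6:]
                if data == "[DONE]":
                    return None  # stream completed normally
                delta = json.loads(data)["choices"][0]["delta"].get("content", "")
                print(delta, end="", flush=True)
    return last_event_id  # connection dropped; caller can resume from here

last_id = stream_once(None)
if last_id is not None:
    stream_once(last_id)  # reconnect; the server replays events after last_id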
Notes and limits
- Unless SQLite persistence is enabled (PERSIST_SESSIONS=1), session state lives only in memory; restarts drop the buffers.
- If the buffer overflows before you resume, the earliest chunks may be unavailable.
- Cancellation on client disconnect is not immediate; generation keeps running in the background until CANCEL_AFTER_DISCONNECT_SECONDS elapses or the session is cancelled via POST /v1/cancel/{session_id} (see "Cancellation and session persistence" below).
Hugging Face repository files support
This server loads the Qwen3 model via Transformers with trust_remote_code=True, so the standard files from the repo are supported and consumed automatically. Summary for https://huggingface.co/unsloth/Qwen3-4B-Instruct-2507/tree/main:
Used by model weights and architecture
- model.safetensors — main weights loaded by AutoModelForCausalLM
- config.json — architecture/config
- generation_config.json — default gen params (we may override via request or env)
Used by tokenizer
- tokenizer.json — primary tokenizer specification
- tokenizer_config.json — tokenizer settings
- merges.txt and vocab.json — fallback/compat files; if tokenizer.json exists, HF generally prefers it
Used by processors (multimodal)
- preprocessor_config.json — image/text processor config
- video_preprocessor_config.json — video processor config (frame sampling, etc.)
- chat_template.json — chat formatting applied by infer() and infer_stream() via processor.apply_chat_template(...)
Not required for runtime
- README.md, .gitattributes — ignored by runtime
Notes:
- We rely on Transformers’ AutoModelForCausalLM and AutoProcessor to resolve and use the above files; no manual parsing is required in our code.
- With trust_remote_code=True, model-specific code from the repo may load additional assets transparently.
- If the repo updates configs (e.g., a new chat template), the server will pick them up on the next load.
Cancellation and session persistence
Auto-cancel on disconnect:
- Generation is automatically cancelled if all clients disconnect for more than CANCEL_AFTER_DISCONNECT_SECONDS (default 3600 seconds = 1 hour). Configure it via CANCEL_AFTER_DISCONNECT_SECONDS in .env (see .env.example).
- Implemented by a timer in chat_completions() that triggers a cooperative stop through a stopping criteria in infer_stream(); a minimal sketch of the idea follows below.
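A minimal sketch of such a cancellable stopping criteria (class and variable names here are illustrative, not the exact ones in main.py):

import threading
from transformers import StoppingCriteria, StoppingCriteriaList

class CancelFlagCriteria(StoppingCriteria):
    """Stops generation as soon as an external cancel event is set."""
    def __init__(self, cancel_event: threading.Event):
        self.cancel_event = cancel_event

    def __call__(self, input_ids, scores, **kwargs) -> bool:
        return self.cancel_event.is_set()

cancel_event = threading.Event()
stopping = StoppingCriteriaList([CancelFlagCriteria(cancel_event)])
# Pass stopping_criteria=stopping to model.generate(...); the disconnect timer
# or POST /v1/cancel/{session_id} then calls cancel_event.set() to stop generation.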
Manual cancel API (custom extension):
- Endpoint: POST /v1/cancel/{session_id}
- Cancels an ongoing streaming session and marks it finished in the store.
  Example (Windows CMD):
  curl -X POST http://localhost:3000/v1/cancel/mysession123
- This is not part of OpenAI’s legacy Chat Completions spec. OpenAI’s newer Responses API has a cancel endpoint, but Chat Completions does not. We provide this custom endpoint for operational control.
Persistence:
- Optional SQLite-backed persistence for resumable SSE (enable PERSIST_SESSIONS=1 in .env; see .env.example).
- Database path: SESSIONS_DB_PATH (default: sessions.db)
- Session TTL for GC: SESSIONS_TTL_SECONDS (default: 600)
- See the _SQLiteStore class and its integration in chat_completions(); a condensed sketch of the idea follows after this list.
- Redis is not implemented yet; the design isolates persistence so a Redis-backed store can be added as a drop-in.
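A condensed sketch of what such a SQLite-backed event store needs to do, assuming a simple (session_id, index, payload) schema; the actual _SQLiteStore in main.py is the authoritative implementation:

import sqlite3
import time

class SQLiteEventStore:
    """Append SSE events per session and replay them after a given index."""
    def __init__(self, path: str = "sessions.db"):
        self.db = sqlite3.connect(path, check_same_thread=False)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS events ("
            " session_id TEXT, idx INTEGER, payload TEXT, created REAL,"
            " PRIMARY KEY (session_id, idx))")

    def append(self, session_id: str, idx: int, payload: str) -> None:
        self.db.execute("INSERT OR REPLACE INTO events VALUES (?, ?, ?, ?)",
                        (session_id, idx, payload, time.time()))
        self.db.commit()

    def replay_after(self, session_id: str, idx: int) -> list[tuple[int, str]]:
        cur = self.db.execute(
            "SELECT idx, payload FROM events WHERE session_id = ? AND idx > ? ORDER BY idx",
            (session_id, idx))
        return cur.fetchall()

    def gc(self, ttl_seconds: int = 600) -> None:
        # Simplified TTL sweep; the real store tracks finished sessions explicitly.
        self.db.execute("DELETE FROM events WHERE created < ?", (time.time() - ttl_seconds,))
        self.db.commit()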
Deploy on Render
Render has two easy options. Since our image already bakes the model, the fastest path is to deploy the public Docker image (CPU). Render currently doesn’t provide NVIDIA/AMD GPUs for standard Web Services, so use the CPU image.
Option A — Deploy public Docker image (recommended)
- In Render Dashboard: New → Web Service
- Environment: Docker → Public Docker image
- Image
- ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-cpu
- Instance and region
- Region: closest to your users
- Instance type: pick a plan with at least 16 GB RAM (more if you see OOM)
- Port/health
- Render auto-injects PORT; the server binds to it via os.getenv()
- Health Check Path: /health (served by the health() handler)
- Start command
- Leave blank; the image uses CMD ["python","main.py"] as defined in the Dockerfile, and the app entry point is main.py.
- Environment variables
- EAGER_LOAD_MODEL=1
- MAX_TOKENS=4096
- HF_TOKEN=your_hf_token_here (only if the model is gated)
- Optional persistence:
- PERSIST_SESSIONS=1
- SESSIONS_DB_PATH=/data/sessions.db (requires a disk)
- Persistent Disk (optional)
- Add a Disk (e.g., 1–5 GB) and mount it at /data if you enable SQLite persistence
- Create Web Service and wait for it to start
- Verify
- curl https://YOUR-SERVICE.onrender.com/health
- OpenAPI YAML: https://YOUR-SERVICE.onrender.com/openapi.yaml (served by the openapi_yaml() handler)
- Chat endpoint: POST https://YOUR-SERVICE.onrender.com/v1/chat/completions (implemented in chat_completions())
Option B — Build directly from this GitHub repo (Dockerfile)
- In Render Dashboard: New → Web Service → Build from a Git repo (connect this repo)
- Render will detect the Dockerfile automatically (no Build Command needed)
- Advanced → Docker Build Args
- BACKEND=cpu (ensures CPU-only torch wheel)
- Health and env vars
- Health Check Path: /health
- Set EAGER_LOAD_MODEL, MAX_TOKENS, HF_TOKEN as needed (same as Option A)
- (Optional) Add a Disk and mount at /data, then set SESSIONS_DB_PATH=/data/sessions.db if you want resumable SSE across restarts
- Deploy (first build can take a while due to the multi-GB model layer)
Notes and limits on Render
- GPU acceleration (NVIDIA/AMD) isn’t available for standard Web Services on Render; use the CPU image.
- The image already bakes the model weights under /app/hf-cache, so there's no model download at runtime.
- SSE is supported; streaming is produced by chat_completions(). Keep the connection open to avoid idle timeouts.
- If you enable SQLite persistence, remember to attach a Disk; otherwise, the DB is ephemeral.
Example render.yaml (optional IaC)
If you prefer infrastructure-as-code, you can use a render.yaml like:
services:
  - type: web
    name: qwen-vl-cpu
    env: docker
    image:
      url: ghcr.io/killerking93/transformers-inferenceserver-openapi-compatible:latest-with-model-cpu
    plan: standard
    region: oregon
    healthCheckPath: /health
    autoDeploy: true
    envVars:
      - key: EAGER_LOAD_MODEL
        value: "1"
      - key: MAX_TOKENS
        value: "4096"
      - key: HF_TOKEN
        sync: false # set in dashboard or use Render secrets
      - key: PERSIST_SESSIONS
        value: "1"
      - key: SESSIONS_DB_PATH
        value: "/data/sessions.db"
    # Uncomment the disks block below if using persistence
    # disks:
    #   - name: data
    #     mountPath: /data
    #     sizeGB: 5
After deploy:
- Health: GET /health
- OpenAPI: GET /openapi.yaml
- Inference:
curl -X POST https://YOUR-SERVICE.onrender.com/v1/chat/completions ^
  -H "Content-Type: application/json" ^
  -d "{\"messages\":[{\"role\":\"user\",\"content\":\"Hello\"}],\"max_tokens\":4096}"
Deploy on Hugging Face Spaces
Recommended: Docker Space (works with our FastAPI app and preserves multimodal behavior). You can run CPU or GPU hardware. To persist the HF cache across restarts, enable Persistent Storage and point HF cache to /data.
A) Create the Space (Docker)
Install the CLI and log in:
pip install -U "huggingface_hub[cli]"
huggingface-cli login
Create a Docker Space (public or private):
huggingface-cli repo create my-qwen3-vl-server --type space --space_sdk docker
Add the Space as a remote and push this repo:
git remote add hf https://huggingface.co/spaces/YOUR_USERNAME/my-qwen3-vl-server
git push hf main
This pushes Dockerfile, main.py, requirements.txt. The Space will auto-build your container.
B) Configure Space settings
Hardware:
- CPU: works out-of-the-box (fast to build, slower inference).
- GPU: choose a GPU tier (e.g., T4/A10G/L4) for faster inference.
Persistent Storage (recommended):
- Enable Persistent storage (e.g., 10–30 GB).
- This lets you cache models and sessions across restarts.
Variables and Secrets:
- Variables:
- EAGER_LOAD_MODEL=1
- MAX_TOKENS=4096
- HF_HOME=/data/hf-cache
- TRANSFORMERS_CACHE=/data/hf-cache
- Secrets:
- HF_TOKEN=your_hf_token_if_model_is_gated
C) CPU vs GPU on Spaces
- CPU: No change needed. Our Dockerfile defaults to CPU PyTorch and bakes the model during build. It will run on CPU Spaces.
- GPU: Edit the Space’s Dockerfile to switch the backend before the next build:
- In the file editor of the Space UI, change: ARG BACKEND=cpu to: ARG BACKEND=nvidia
- Save/commit; the Space rebuilds with a CUDA-enabled torch. Choose a GPU hardware tier in the Space settings. Note: Building the GPU image pulls CUDA torch wheels and increases build time.
- AMD ROCm is not available on Spaces; use NVIDIA GPUs on Spaces.
D) Speed up cold starts and caching
- With Persistent Storage enabled and HF_HOME/TRANSFORMERS_CACHE pointed to /data/hf-cache, the model cache persists across restarts (subsequent spins are much faster).
- Keep the Space “Always on” if available on your plan to avoid cold starts.
E) Space endpoints
- Space page: https://huggingface.co/spaces/YOUR_USERNAME/my-qwen3-vl-server (Spaces proxies traffic to your container; the API itself is served from the Space's *.hf.space subdomain used in the examples below)
- Swagger UI: GET /docs (interactive API with examples)
- Health: GET /health (implemented by the health() handler)
- OpenAPI YAML: GET /openapi.yaml (implemented by openapi_yaml())
- Chat Completions: POST /v1/chat/completions (non-streaming + SSE, handled by chat_completions())
- Cancel: POST /v1/cancel/{session_id} (handled by cancel_session())
F) Quick test after the Space is “Running”
- Health: curl -s https://YOUR-SPACE-Subdomain.hf.space/health
- Non-stream:
curl -s -X POST https://YOUR-SPACE-Subdomain.hf.space/v1/chat/completions ^
  -H "Content-Type: application/json" ^
  -d "{\"messages\":[{\"role\":\"user\",\"content\":\"Hello from HF Spaces!\"}],\"max_tokens\":4096}"
- Streaming:
curl -N -H "Content-Type: application/json" ^
  -d "{\"messages\":[{\"role\":\"user\",\"content\":\"Think step by step: 17*23?\"}],\"stream\":true}" ^
  https://YOUR-SPACE-Subdomain.hf.space/v1/chat/completions
Notes
- The Space build step can appear “idle” after “Model downloaded.” while Docker commits a multi‑GB layer; this is expected.
- If you hit OOM, increase the Space hardware memory or switch to a GPU tier. Reduce MAX_VIDEO_FRAMES and MAX_TOKENS if needed.