JoyAI-VL-Interaction

The first open, vision-driven real-time interaction model — it watches a live video stream and decides on its own when to speak, stay silent, or delegate.

📄 Paper · 🌐 Project Page & Demos · 💻 GitHub · 🤗 Paper Page

Overview

Most large models today are turn-based: they answer only when you ask. But many moments in the real world don't wait for a question — a fire starts on a security feed, someone falls, a product flashes by in a livestream. Once missed, the moment is gone.

JoyAI-VL-Interaction is built for exactly these moments. It is an 8B-scale, vision-first interaction model that continuously watches a live video stream and, every second, decides on its own to take one of three actions:

Speak — respond when something is worth saying
Stay silent — keep watching when nothing warrants a response (a first-class, trained action)
Delegate — hand a hard subtask to a background model/agent, keep watching, and weave the result back in when it returns

The decision of when to act is learned inside the model (from second-by-second time-aligned data + RL), not bolted on by an external turn-detector or polling loop. Vision is the first-class driver; speech (ASR/TTS) is treated as pluggable I/O.

To our knowledge, this is the first open, vision-driven interaction model released together with its training recipe, data, and a complete deployable system.

vLLM Usage

vLLM-Omni provides day-0 support for JoyAI-VL-Interaction! The model is a standard Qwen3-VL VLM served by a plain vllm serve; vLLM-Omni adds the real-time interaction layer on top — the per-second speak / silence / delegate orchestration, 3-tier summary memory, and pluggable ASR / TTS / delegation. For installation and full setup, see the vLLM-Omni recipe.

Online Serving

# git clone https://github.com/vllm-project/vllm-omni.git

# 1. Serve the model (plain `vllm serve`, NOT --omni — it is vanilla Qwen3-VL)
vllm serve jdopensource/JoyAI-VL-Interaction-Preview \
  --served-model-name JoyAI-VL-Interaction-Preview --port 8061 \
  --max-model-len 131072 --enable-prefix-caching --limit-mm-per-prompt '{"image":256,"video":1}'

# 2. Start the interaction orchestrator (OpenAI-compatible, :8070)
python -m vllm_omni.experimental.fullduplex.joyvl.serving.server --port 8070 \
  --main-backend-url http://127.0.0.1:8061/v1 --main-model JoyAI-VL-Interaction-Preview

For the full browser demo — live webcam / RTSP input, voice (ASR/TTS), and the per-tick decision stream — run JD's official WebUI (services/webui) in front of the orchestrator; see the vLLM-Omni recipe for the steps.

Downloads last month: -

Safetensors

Model size

9B params

Tensor type

BF16

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Paper for jdopensource/JoyAI-VL-Interaction-Preview

JoyAI-VL-Interaction: Real-Time Vision-Language Interaction Intelligence

Paper • 2606.14777 • Published 11 days ago • 193