DualTurn: Learning Turn-Taking from Dual-Channel Generative Speech Pretraining
Paper: arXiv:2603.08216
Real-time speech endpoint detector for two-channel (user + agent) audio.
Built on top of DualTurn-Qwen2.5-Mimi-0.5B with a trained endpoint classifier that answers one question at every VAD offset:
Has the user finished speaking?
- `endpoint=True`: yes, agent should respond now
- `endpoint=False`: no, user just paused mid-sentence, keep waiting
| Output | Shape | Description |
|---|---|---|
| `vad_probs` | `[B, T, 2]` | P(speaking now); `[:,0]` = user, `[:,1]` = agent |
| `eot_probs` | `[B, T, 2]` | P(end of turn) per channel |
| `bot_probs` | `[B, T, 2]` | P(beginning of turn) per channel |
| `fvad_probs` | `[B, T, 4]` | Fast VAD (Silero): user_short, user_long, agent_short, agent_long |
| `endpoint_probs` | `[B, T, 1]` | P(user finished) at VAD-offset frames, 0 elsewhere |
| `endpoints` | `list[dict]` | Sparse decisions at VAD-offset anchors with `endpoint` bool and `p_endpoint` |
Frame rate: 12.5 Hz (80 ms per frame). Audio resampled to 24 kHz internally.
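The frame arithmetic above can be made concrete with a small sketch. The helper names below are hypothetical and not part of the released code; the sketch also assumes that frame `i` starts at `i * 80 ms`, which the model card does not state explicitly.

```python
SAMPLE_RATE_HZ = 24_000  # internal sample rate stated above
FRAME_RATE_HZ = 12.5     # model frame rate stated above


def samples_per_frame(sample_rate_hz=SAMPLE_RATE_HZ, frame_rate_hz=FRAME_RATE_HZ):
    """Number of audio samples covered by one model frame."""
    return int(round(sample_rate_hz / frame_rate_hz))


def frame_to_seconds(frame_index, frame_rate_hz=FRAME_RATE_HZ):
    """Start time of a frame in seconds (assumes frame i starts at i / frame_rate)."""
    return frame_index / frame_rate_hz


def seconds_to_frame(t_s, frame_rate_hz=FRAME_RATE_HZ):
    """Index of the frame that contains time t_s (same assumption as above)."""
    return int(t_s * frame_rate_hz)
```

With these values, one frame spans 1920 samples, which matches the `(2, 1920)` chunk shape used for streaming.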
pip install transformers torch torchaudio joblib scikit-learn silero-vad huggingface_hub
import torch, torchaudio
from transformers import AutoModel
model = AutoModel.from_pretrained(
"anyreach-ai/dualturn-endpointing",
trust_remote_code=True,
)
model.eval()
wav, sr = torchaudio.load("conversation.wav") # [2, T] CH0=user CH1=agent
with torch.no_grad():
    out = model(wav, sr=sr)
# Per-frame signals at 12.5 Hz
print(out.vad_probs.shape)   # [1, T, 2] P(speaking) for (user, agent)
print(out.eot_probs.shape)   # [1, T, 2] P(end-of-turn)
print(out.bot_probs.shape)   # [1, T, 2] P(begin-of-turn)
print(out.fvad_probs.shape)  # [1, T, 4] fast VAD
# Sparse decisions: one per VAD offset (user stops speaking)
for ep in out.endpoints:
    print(f"t={ep['t_s']:.2f}s endpoint={ep['endpoint']} P={ep['p_endpoint']:.3f}")
    # ep["endpoint"]: True = user finished, agent should respond
    #                 False = user paused mid-sentence, keep waiting
    # ep["p_endpoint"]: P(user is done); decision threshold = 0.30
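Because each entry of `out.endpoints` carries `p_endpoint` alongside the boolean decision, a caller could apply a stricter or looser cutoff than the default. The following is a minimal sketch of that idea; `confirmed_endpoints` is a hypothetical helper, and the example values are made up purely to show the shape of the data.

```python
def confirmed_endpoints(endpoints, threshold=0.30):
    """Return (time_s, probability) pairs for anchors at or above `threshold`."""
    return [
        (ep["t_s"], ep["p_endpoint"])
        for ep in endpoints
        if ep["p_endpoint"] >= threshold
    ]


# Hypothetical values, shaped like the entries of out.endpoints
example = [
    {"t_s": 1.84, "endpoint": False, "p_endpoint": 0.12},
    {"t_s": 4.32, "endpoint": True, "p_endpoint": 0.91},
    {"t_s": 7.60, "endpoint": True, "p_endpoint": 0.35},
]

default_cut = confirmed_endpoints(example)                 # uses 0.30
strict_cut = confirmed_endpoints(example, threshold=0.50)  # fewer, surer endpoints
```

Raising the cutoff trades recall for fewer premature agent responses; the best value depends on the application.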
Dual-channel audio (24 kHz stereo)
    │
    ▼  Mimi encoder + DualTurn backbone (every 80 ms)
Per-frame signals:
    vad_user, vad_agent, eot_user, eot_agent,
    bot_user, bot_agent, fvad_*_short, fvad_*_long
    │
    ▼  watch vad_user crossing 0.5
    │
VAD offset detected (user stopped) + agent silent?
    │
    ▼  endpoint_clf.predict_proba(10 signal values)
P(endpoint) >= 0.30 → endpoint=True   agent should respond now
P(endpoint) <  0.30 → endpoint=False  user paused mid-sentence, wait
The 0.30 threshold is tuned to maximise endpoint recall (99% recall on the held-out test set).
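The anchor-selection step in the pipeline can be sketched roughly as follows. This is an illustrative approximation, not the released implementation: the function names are hypothetical, and treating "agent silent" as `vad_agent < 0.5` is an assumption. The trained classifier that maps the 10 signal values to P(endpoint) is not reproduced here.

```python
def find_user_vad_offsets(vad_user, vad_agent, vad_threshold=0.5):
    """Frame indices where the user goes from speaking to silent
    while the agent is silent (assumed: agent VAD below the same threshold)."""
    offsets = []
    for t in range(1, len(vad_user)):
        user_stopped = vad_user[t - 1] >= vad_threshold and vad_user[t] < vad_threshold
        agent_silent = vad_agent[t] < vad_threshold
        if user_stopped and agent_silent:
            offsets.append(t)
    return offsets


def is_endpoint(p_endpoint, threshold=0.30):
    """Apply the decision rule from the diagram: P(endpoint) >= threshold."""
    return p_endpoint >= threshold


# Toy per-frame probabilities (hypothetical values)
vad_user = [0.9, 0.8, 0.2, 0.1, 0.7, 0.9, 0.3, 0.1]
vad_agent = [0.0, 0.1, 0.1, 0.0, 0.0, 0.1, 0.8, 0.9]
anchors = find_user_vad_offsets(vad_user, vad_agent)
```

In this toy input the user stops twice, but the second stop coincides with agent speech, so only the first becomes an anchor.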
| File | Description |
|---|---|
| `best.pt` | DualTurn backbone weights (two-stream transformer, 34M params) |
| `endpoint_clf.pkl` | Endpoint classifier: sklearn bundle with trained model + recommended threshold |
| `modeling_dualturn.py` | `DualTurnModel(PreTrainedModel)`: AutoModel-compatible wrapper |
| `configuration_dualturn.py` | `DualTurnConfig` |
| `config.json` | `auto_map` for AutoModel/AutoConfig |
| `endpointing.py` | Higher-level `DualTurnEndpointing` class with streaming support |
| `src/` | Bundled dualturn + evaluation source code |
from endpointing import DualTurnEndpointing
model = DualTurnEndpointing.from_pretrained("anyreach-ai/dualturn-endpointing")
stream = model.stream(user_channel=0, agent_channel=1)
# Feed 80 ms of 24 kHz stereo PCM float32 every tick
# chunk shape: (2, 1920)
for chunk in audio_source():
    result = stream.push(chunk)
    if result:
        if result["endpoint"]:
            agent.start_responding()
        # result["p_endpoint"]: P(user finished) ∈ [0, 1]
        # result["signals"]: all 10 signal values at the anchor
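The streaming loop leaves `audio_source()` undefined. One possible stand-in for offline testing is sketched below: it slices an already-loaded 24 kHz stereo array into 80 ms chunks of shape `(2, 1920)`. The signature is hypothetical, zero-padding the final partial chunk is an assumption, and whether `stream.push` expects NumPy arrays or torch tensors is not stated above, so a conversion may be needed.

```python
import numpy as np

CHUNK_SAMPLES = 1920  # 80 ms at 24 kHz


def audio_source(wav, chunk_samples=CHUNK_SAMPLES):
    """Yield consecutive (2, chunk_samples) float32 chunks from a (2, N) array.

    The last chunk is zero-padded to full length (an assumption, not
    documented behaviour of the streaming API).
    """
    wav = np.asarray(wav, dtype=np.float32)
    if wav.ndim != 2 or wav.shape[0] != 2:
        raise ValueError("expected a stereo array of shape (2, N)")
    for start in range(0, wav.shape[1], chunk_samples):
        chunk = wav[:, start:start + chunk_samples]
        if chunk.shape[1] < chunk_samples:
            pad = chunk_samples - chunk.shape[1]
            chunk = np.pad(chunk, ((0, 0), (0, pad)))
        yield chunk


# 5000 samples of silence -> two full chunks plus one padded chunk
chunks = list(audio_source(np.zeros((2, 5000), dtype=np.float32)))
```

In a live system the chunks would come from a microphone or telephony stream instead, delivered at real-time pace.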
Paper: DualTurn: Learning Turn-Taking from Dual-Channel Generative Speech Pretraining
@misc{rajaa2026dualturnlearningturntakingdualchannel,
title={DualTurn: Learning Turn-Taking from Dual-Channel Generative Speech Pretraining},
author={Shangeth Rajaa},
year={2026},
eprint={2603.08216},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2603.08216},
}
Base model
anyreach-ai/dualturn-qwen2.5-mimi-0.5B