anyreach-ai/dualturn-endpointing

Real-time speech endpoint detector for two-channel (user + agent) audio.

Built on top of DualTurn-Qwen2.5-Mimi-0.5B with a trained endpoint classifier that answers one question at every VAD offset:

Has the user finished speaking?
endpoint=True: yes, agent should respond now
endpoint=False: no, user just paused mid-sentence, keep waiting

| Output | Shape | Description |
|---|---|---|
| vad_probs | [B, T, 2] | P(speaking now): [:,0]=user, [:,1]=agent |
| eot_probs | [B, T, 2] | P(end of turn) per channel |
| bot_probs | [B, T, 2] | P(beginning of turn) per channel |
| fvad_probs | [B, T, 4] | Fast VAD (Silero): user_short, user_long, agent_short, agent_long |
| endpoint_probs | [B, T, 1] | P(user finished) at VAD-offset frames, 0 elsewhere |
| endpoints | list[dict] | Sparse decisions at VAD-offset anchors, each with an endpoint bool and p_endpoint |

Frame rate: 12.5 Hz (80 ms per frame). Audio resampled to 24 kHz internally.
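
The frame-rate figures above imply a fixed mapping between frame indices, seconds, and samples. A minimal sketch of that arithmetic follows; the helper names are illustrative and not part of the model's API, and it assumes a frame index refers to the start of its 80 ms window.

```python
FRAME_RATE_HZ = 12.5     # frames per second, from the model card
SAMPLE_RATE_HZ = 24_000  # internal sample rate, from the model card

FRAME_MS = 1000 / FRAME_RATE_HZ                          # duration of one frame in ms
SAMPLES_PER_FRAME = int(SAMPLE_RATE_HZ / FRAME_RATE_HZ)  # audio samples per frame


def frame_to_seconds(frame_index: int) -> float:
    """Start time of a frame in seconds (assumes frames are indexed from 0)."""
    return frame_index / FRAME_RATE_HZ


def seconds_to_frame(t_s: float) -> int:
    """Index of the frame containing time t_s (floor; an assumed convention)."""
    return int(t_s * FRAME_RATE_HZ)


print(FRAME_MS, SAMPLES_PER_FRAME)  # 80.0 1920
print(frame_to_seconds(25))         # 2.0
```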


Inference

pip install transformers torch torchaudio joblib scikit-learn silero-vad huggingface_hub

import torch, torchaudio
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "anyreach-ai/dualturn-endpointing",
    trust_remote_code=True,
)
model.eval()

wav, sr = torchaudio.load("conversation.wav")  # [2, T]  CH0=user  CH1=agent

with torch.no_grad():
    out = model(wav, sr=sr)

# Per-frame signals at 12.5 Hz
print(out.vad_probs.shape)    # [1, T, 2]   P(speaking), ordered (user, agent)
print(out.eot_probs.shape)    # [1, T, 2]   P(end-of-turn)
print(out.bot_probs.shape)    # [1, T, 2]   P(begin-of-turn)
print(out.fvad_probs.shape)   # [1, T, 4]   fast VAD

# Sparse decisions: one per VAD offset (user stops speaking)
for ep in out.endpoints:
    print(f"t={ep['t_s']:.2f}s  endpoint={ep['endpoint']}  P={ep['p_endpoint']:.3f}")
    # ep["endpoint"]   →  True  = user finished, agent should respond
    #                     False = user paused mid-sentence, keep waiting
    # ep["p_endpoint"] →  P(user is done)  threshold = 0.30
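
Because each sparse decision carries its raw p_endpoint, the boolean can be recomputed with a different cut-off if the default 0.30 responds too eagerly for a given application. The sketch below assumes only the t_s, endpoint, and p_endpoint keys shown above; the rethreshold helper and the sample values are made up for illustration and are not model output.

```python
def rethreshold(endpoints, threshold=0.30):
    """Return copies of the decisions with 'endpoint' recomputed from 'p_endpoint'."""
    return [{**ep, "endpoint": ep["p_endpoint"] >= threshold} for ep in endpoints]


# Made-up decisions in the same shape as out.endpoints
sample = [
    {"t_s": 1.20, "p_endpoint": 0.12, "endpoint": False},
    {"t_s": 3.84, "p_endpoint": 0.41, "endpoint": True},
    {"t_s": 6.00, "p_endpoint": 0.93, "endpoint": True},
]

# A stricter threshold trades recall for fewer premature responses
strict = rethreshold(sample, threshold=0.50)
for ep in strict:
    print(f"t={ep['t_s']:.2f}s  endpoint={ep['endpoint']}")
```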

Endpoint decision logic

Dual-channel audio (24 kHz stereo)
      │
      ▼  Mimi encoder + DualTurn backbone  (every 80 ms)
Per-frame signals:
  vad_user, vad_agent, eot_user, eot_agent,
  bot_user, bot_agent, fvad_*_short, fvad_*_long
      │
      ▼  watch vad_user crossing 0.5
      │
  VAD offset detected (user stopped) + agent silent?
      │
      ▼  endpoint_clf.predict_proba(10 signal values)
      P(endpoint) >= 0.30  →  endpoint=True   agent should respond now
      P(endpoint) <  0.30  →  endpoint=False  user paused mid-sentence, wait

The 0.30 threshold is tuned to maximise endpoint recall (99% recall on the held-out test set).
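
The anchor-detection step in the diagram can be sketched in plain Python. This is a simplified illustration rather than the bundled implementation: it assumes a VAD offset is a frame where vad_user drops from at or above 0.5 to below it, and that "agent silent" means vad_agent is below the same 0.5 level (the card does not state the agent-side threshold). The function names and the probability traces are hypothetical.

```python
def find_user_vad_offsets(vad_user, vad_agent, vad_threshold=0.5):
    """Frame indices where the user stops speaking while the agent is silent."""
    offsets = []
    for t in range(1, len(vad_user)):
        user_stopped = vad_user[t - 1] >= vad_threshold and vad_user[t] < vad_threshold
        agent_silent = vad_agent[t] < vad_threshold  # assumed definition of "silent"
        if user_stopped and agent_silent:
            offsets.append(t)
    return offsets


def decide(p_endpoint, threshold=0.30):
    """Apply the recommended threshold to a classifier probability."""
    return p_endpoint >= threshold


# Made-up per-frame probabilities (one value per 80 ms frame)
vad_user = [0.9, 0.8, 0.2, 0.1, 0.7, 0.9, 0.3, 0.1]
vad_agent = [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.8, 0.9]

offsets = find_user_vad_offsets(vad_user, vad_agent)
print(offsets)  # [2]  (the drop at frame 6 is skipped because the agent is speaking)
```

In the real pipeline the 10 signal values at each anchor frame would be passed to the trained classifier, and its probability would go through a check like decide.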


Files

| File | Description |
|---|---|
| best.pt | DualTurn backbone weights (two-stream transformer, 34M params) |
| endpoint_clf.pkl | Endpoint classifier: sklearn bundle with trained model + recommended threshold |
| modeling_dualturn.py | DualTurnModel(PreTrainedModel), an AutoModel-compatible wrapper |
| configuration_dualturn.py | DualTurnConfig |
| config.json | auto_map for AutoModel/AutoConfig |
| endpointing.py | Higher-level DualTurnEndpointing class with streaming support |
| src/ | Bundled dualturn + evaluation source code |

Streaming (real-time)

from endpointing import DualTurnEndpointing

model  = DualTurnEndpointing.from_pretrained("anyreach-ai/dualturn-endpointing")
stream = model.stream(user_channel=0, agent_channel=1)

# Feed 80 ms of 24 kHz stereo PCM float32 every tick
# chunk shape: (2, 1920)
for chunk in audio_source():
    result = stream.push(chunk)
    if result:
        if result["endpoint"]:
            agent.start_responding()
        # result["p_endpoint"]: P(user finished)  ∈ [0, 1]
        # result["signals"]:    all 10 signal values at the anchor
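
The loop above relies on an audio_source that yields 80 ms chunks. When testing against a pre-recorded file, one way to produce such chunks is to slice a [2, N] array at fixed sample boundaries. The generator below is a hypothetical helper, not part of endpointing.py; it assumes the audio is already at 24 kHz and drops any trailing partial chunk.

```python
SAMPLES_PER_CHUNK = 1920  # 80 ms at 24 kHz


def chunk_bounds(num_samples, chunk=SAMPLES_PER_CHUNK):
    """Yield (start, end) sample indices for consecutive full chunks."""
    for start in range(0, num_samples - chunk + 1, chunk):
        yield start, start + chunk


# Two seconds of 24 kHz audio split into 80 ms ticks
bounds = list(chunk_bounds(24_000 * 2))
print(len(bounds), bounds[0], bounds[-1])  # 25 (0, 1920) (46080, 48000)

# With a [2, N] tensor or array named wav, each tick would then be:
#   for start, end in chunk_bounds(wav.shape[1]):
#       result = stream.push(wav[:, start:end])
```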

Authors

  • Shangeth Rajaa, Senior ML Research Scientist, Anyreach AI

Citation

Paper: DualTurn: Learning Turn-Taking from Dual-Channel Generative Speech Pretraining

@misc{rajaa2026dualturnlearningturntakingdualchannel,
      title={DualTurn: Learning Turn-Taking from Dual-Channel Generative Speech Pretraining},
      author={Shangeth Rajaa},
      year={2026},
      eprint={2603.08216},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2603.08216},
}