EOU Detector — Russian Call-Center Dialog

End-of-Utterance (EOU) detector for Russian conversational speech, fine-tuned from Qwen2.5-0.5B-Instruct on 200k real call-center dialogs.

The model predicts P(<|im_end|>) at the last token position — the probability that the current speaker has finished their utterance. No classification head; the LM vocabulary does the detection.

Developed at Simplexphone — real-time voice AI for call centers.

Performance

Evaluated on 200 stratified samples (100 positive EOU + 100 negative) from held-out call-center data:

| Metric | Value |
|---|---|
| F1 | 0.851 |
| False Alarm (1 − Precision) | 22.3% |
| False Rejection (1 − Recall) | 6.0% |
| Optimal threshold | 0.077 |
| GPU latency (H100, batch=1) | ~10 ms |
| CPU latency (Xeon 28-core, batch=1) | ~55 ms |
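
The reported figures can be cross-checked with a few lines of arithmetic. The sketch below recomputes F1 from the false-alarm and false-rejection rates; the implied confusion counts assume the 100 positive / 100 negative split stated above, and the variable names are illustrative only.

```python
# Recompute F1 from the reported false-alarm and false-rejection rates.
false_alarm = 0.223      # 1 - precision
false_rejection = 0.060  # 1 - recall

precision = 1.0 - false_alarm
recall = 1.0 - false_rejection
f1 = 2 * precision * recall / (precision + recall)

# Implied counts, assuming 100 positive samples in the evaluation set.
n_pos = 100
true_pos = round(recall * n_pos)                     # turn ends that were detected
false_pos = round(true_pos / precision) - true_pos   # premature triggers

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"TP={true_pos} FP={false_pos}")
```

Under these assumptions roughly 27 of the 100 negative samples trigger at the chosen threshold, which is the cost of keeping missed turn ends down to 6.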

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

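tok   = AutoTokenizer.from_pretrained("feanet/eou-detector-russian")
tok.truncation_side = "left"   # on overflow keep the most recent tokens, so the last position is still the end of the current utterance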
model = AutoModelForCausalLM.from_pretrained(
    "feanet/eou-detector-russian", torch_dtype=torch.float32
)
model.eval()

EOU_ID    = tok.convert_tokens_to_ids("<|im_end|>")
THRESHOLD = 0.077

def eou_probability(history: list[dict], current_text: str) -> float:
    """
    history: list of {"role": "user"|"assistant", "content": "..."}
    current_text: the utterance to score (last client turn)
    Returns P(end-of-utterance) in [0, 1].
    """
    msgs   = history + [{"role": "user", "content": current_text}]
    prompt = tok.apply_chat_template(msgs, add_generation_prompt=False, tokenize=False)
    prompt = prompt[: prompt.rfind("<|im_end|>")]          # strip trailing EOU token
    enc    = tok(prompt, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**enc).logits
    return torch.softmax(logits[0, -1, :], dim=-1)[EOU_ID].item()


# Example
history = [{"role": "assistant", "content": "добрый день чем могу помочь"}]

print(eou_probability(history, "спасибо до свидания"))   # → ~0.8  (farewell, EOU)
print(eou_probability(history, "хотел уточнить по"))     # → ~0.02 (incomplete, not EOU)

ONNX / production deployment

For lower-latency production use, export to ONNX:

import torch, torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

class EOUModel(nn.Module):
    def __init__(self, model, eou_id):
        super().__init__()
        self.lm     = model
        self.eou_id = eou_id

    def forward(self, input_ids):
        logits = self.lm(input_ids).logits
        return torch.softmax(logits[:, -1, :], dim=-1)[:, self.eou_id]

tok   = AutoTokenizer.from_pretrained("feanet/eou-detector-russian")
model = AutoModelForCausalLM.from_pretrained("feanet/eou-detector-russian",
                                              torch_dtype=torch.float32).eval()
eou_model = EOUModel(model, tok.convert_tokens_to_ids("<|im_end|>")).eval()

dummy = tok(["хорошо спасибо"], return_tensors="pt")["input_ids"]
torch.onnx.export(eou_model, (dummy,), "model.onnx",
                  input_names=["input_ids"], output_names=["eou_prob"],
                  dynamic_axes={"input_ids": {0: "batch", 1: "seq_len"},
                                "eou_prob":  {0: "batch"}},
                  opset_version=18)

ONNX batch=1 GPU latency: ~6 ms (H100).
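
Once exported, the graph can be scored from Python with ONNX Runtime. The following is a minimal sketch, not the author's deployment code: `load_session` and `score_with_onnx` are hypothetical helper names, and the input/output names match those passed to `torch.onnx.export` above.

```python
import numpy as np

def load_session(path="model.onnx"):
    """Open the exported model on CPU (swap in CUDAExecutionProvider for GPU)."""
    import onnxruntime as ort  # imported lazily; only needed when a real session is opened
    return ort.InferenceSession(path, providers=["CPUExecutionProvider"])

def score_with_onnx(session, input_ids):
    """Return one EOU probability per row of input_ids."""
    ids = np.asarray(input_ids, dtype=np.int64)
    if ids.ndim == 1:            # a single sequence becomes a batch of one
        ids = ids[None, :]
    (probs,) = session.run(["eou_prob"], {"input_ids": ids})
    return probs.tolist()

# Usage, assuming `tok` and the prompt string are built as in the Usage section:
#   session = load_session()
#   ids = tok(prompt, return_tensors="np")["input_ids"]
#   prob = score_with_onnx(session, ids)[0]
```

Because the exported wrapper takes only `input_ids` and reads the last position, rows in a batch need to have the same length; padded batches would read a padding position instead of the end of the utterance.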

Training

Data: 200,667 Russian call-center dialog files in [HH:MM] A/B: text format. Speaker A = customer (user), Speaker B = operator (assistant).
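
As an illustration of the speaker-to-role mapping, the sketch below parses lines of the form `[HH:MM] A: text` into chat messages. The exact line layout (spacing, one utterance per line) is an assumption based on the description above, and `parse_dialog` is a hypothetical helper, not part of the released code.

```python
import re

# Assumed line layout: "[HH:MM] A: text" or "[HH:MM] B: text".
LINE_RE = re.compile(r"^\[(\d{2}:\d{2})\]\s*([AB]):\s*(.*)$")
ROLE = {"A": "user", "B": "assistant"}   # A = customer, B = operator

def parse_dialog(text):
    """Convert a transcript into a list of {"role", "content"} messages."""
    messages = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue                      # skip blank or malformed lines
        _, speaker, content = match.groups()
        content = content.strip()
        if content:
            messages.append({"role": ROLE[speaker], "content": content})
    return messages

sample = "[00:01] B: добрый день чем могу помочь\n[00:04] A: хотел уточнить по тарифу"
print(parse_dialog(sample))
```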

Method: Causal language modelling on the full Qwen2.5 chat template. The model learns to predict <|im_end|> at natural turn boundaries as part of standard next-token prediction — no artificial labels.

Full loss (not masked to EOU positions only) is essential: masking causes catastrophic overfitting where the model memorises positions rather than learning turn-end signals.
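
The difference between the two labelling schemes can be made concrete. In the sketch below, `-100` is the label value that PyTorch cross-entropy and the Hugging Face trainers ignore, and 151645 is the `<|im_end|>` id in the Qwen2.5 vocabulary; the function names are illustrative, not taken from the training code.

```python
IM_END_ID = 151645   # <|im_end|> in the Qwen2.5 vocabulary
IGNORE = -100        # label value skipped by the cross-entropy loss

def full_loss_labels(input_ids):
    """Full loss: every token is a prediction target (the setup described above)."""
    return list(input_ids)

def eou_only_labels(input_ids):
    """Masked variant, shown for contrast: only <|im_end|> targets contribute."""
    return [tok if tok == IM_END_ID else IGNORE for tok in input_ids]

# Toy sequence: <|im_start|> user \n  t1 t2  <|im_end|> \n
ids = [151644, 872, 198, 10, 11, 151645, 198]
print(full_loss_labels(ids))
print(eou_only_labels(ids))
```

With the masked variant only one position in this toy sequence produces a gradient, which gives some intuition for why so sparse a signal is easier to fit by position than by content.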

Key training details:

  • Sequences split into overlapping 512-token windows (stride 256) → 341k training chunks
  • Optimizer: AdamW, lr=1e-5, cosine schedule, 5% warmup
  • Precision: bf16 on 1× H100 80 GB
  • Early stopping on eval loss, patience=3
  • Best checkpoint: step 29,326 (~2 epochs)
  • Weight untying applied before training (safetensors requirement for Qwen)
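
The windowing step in the list above can be sketched as follows. How the tail of a dialog is handled (kept as a shorter final chunk here) is an assumption, and `sliding_windows` is a hypothetical name rather than the function used in training.

```python
def sliding_windows(ids, window=512, stride=256):
    """Split a token sequence into overlapping chunks of at most `window` tokens."""
    if len(ids) <= window:
        return [list(ids)]
    chunks = []
    for start in range(0, len(ids), stride):
        chunks.append(list(ids[start:start + window]))
        if start + window >= len(ids):   # this chunk already reaches the end
            break
    return chunks

tokens = list(range(1000))
chunks = sliding_windows(tokens)
print([len(c) for c in chunks])
```

With a stride of half the window, every token except those near the start and end of a dialog appears in two chunks, once with less and once with more left context.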

C++ / ONNX Runtime

Dependencies: onnxruntime, tokenizers-cpp (reads tokenizer.json directly), ICU for NFKC normalisation.

#include <array>
#include <cstdint>
#include <vector>

#include <onnxruntime_cxx_api.h>
#include <tokenizers_cpp.h>

// Build the Qwen chat-template prompt manually and strip the trailing <|im_end|>
// token — the model scores P(<|im_end|>) as the *next* token at that position.
//
// Template token IDs (Qwen2.5 vocab):
//   <|im_start|>=151644  <|im_end|>=151645  \n=198
//   system=8948  user=872  assistant=77091
//
// ONNX interface:
//   input  "input_ids"  INT64  [1, seq_len]
//   output "eou_prob"   FLOAT  [1]

static constexpr float THRESHOLD  = 0.0766f;
static constexpr int   MAX_TOKENS = 512;

Ort::Env     env(ORT_LOGGING_LEVEL_WARNING, "eou");
Ort::Session session(env, "model.onnx", Ort::SessionOptions{});

auto tokenizer = tokenizers::Tokenizer::FromBlobJSON(
    ReadFile("tokenizer.json"));   // your file-read helper

// Build input_ids: [system block] + turns + [user open, current text]
// then truncate to MAX_TOKENS from the right.
std::vector<int64_t> ids = BuildPromptIds(tokenizer, history, current_text);

Ort::MemoryInfo mem("Cpu", OrtDeviceAllocator, 0, OrtMemTypeDefault);
std::array<int64_t, 2> shape{1, (int64_t)ids.size()};
auto input_tensor = Ort::Value::CreateTensor<int64_t>(
    mem, ids.data(), ids.size(), shape.data(), shape.size());

const char* input_names[]  = {"input_ids"};
const char* output_names[] = {"eou_prob"};
auto output = session.Run(Ort::RunOptions{}, input_names, &input_tensor, 1,
                          output_names, 1);

float prob = output[0].GetTensorData<float>()[0];
bool  eou  = prob >= THRESHOLD;   // FA=22.3%  FR=6.0%  F1=0.851

Full header-only class with preprocessing, tokenisation, and GPU support: eou_detector.h
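
The prompt-building step that `BuildPromptIds` stands for can be sketched in Python for reference. This is an assumption-laden illustration, not the contents of eou_detector.h: `build_prompt_ids` and its `encode` callback are hypothetical, the optional system block text is left to the caller, and truncation is interpreted as keeping the rightmost `MAX_TOKENS` ids. Token ids are the ones listed in the comment above.

```python
IM_START, IM_END, NEWLINE = 151644, 151645, 198
ROLE_IDS = {"system": 8948, "user": 872, "assistant": 77091}
MAX_TOKENS = 512

def build_prompt_ids(encode, history, current_text, system_text=None):
    """Assemble chat-template ids; `encode` maps plain text to a list of token ids."""
    ids = []

    def block(role, text, close=True):
        ids.extend([IM_START, ROLE_IDS[role], NEWLINE])
        ids.extend(encode(text))
        if close:                          # finished turns end with <|im_end|>\n
            ids.extend([IM_END, NEWLINE])

    if system_text is not None:
        block("system", system_text)
    for msg in history:
        block(msg["role"], msg["content"])
    # The current utterance is left open: the model scores <|im_end|> as the next token.
    block("user", current_text, close=False)
    return ids[-MAX_TOKENS:]               # assumed: keep the most recent tokens

# Toy encoder for demonstration only; a real tokenizer would be used in practice.
toy_encode = lambda s: [ord(c) for c in s]
demo_ids = build_prompt_ids(toy_encode, [{"role": "assistant", "content": "ab"}], "cd")
print(demo_ids)
```

Encoding each piece separately may produce slightly different token boundaries than tokenising the whole rendered template, so it is worth comparing the result against `apply_chat_template` on a few dialogs before relying on it.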

Latency

| Runtime | Hardware | Batch | Latency |
|---|---|---|---|
| PyTorch FP32 | H100 80 GB | 1 | ~23 ms |
| ONNX Runtime FP32 | H100 80 GB | 1 | 6 ms |
| ONNX Runtime FP32 | Xeon 28-core | 1 | ~55 ms |
| ONNX Runtime FP32 | H100 80 GB | 128 | 14 ms (9 k items/s) |
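
The throughput figure in the batch-128 row follows directly from the batch size and latency, as this short calculation shows (variable names are illustrative):

```python
# Throughput implied by the batch-128 measurement.
batch_size = 128
latency_s = 0.014                      # 14 ms per batch

throughput = batch_size / latency_s    # items scored per second
per_item_ms = latency_s * 1000 / batch_size

print(f"{throughput:.0f} items/s, {per_item_ms:.3f} ms per item")
```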

Intended use

  • Voice assistant / IVR systems: detect when the caller has finished speaking before routing to ASR or NLU
  • Call-center analytics: segment transcripts by speaker turn
  • Real-time dialog systems needing a language-aware alternative to silence-based VAD
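
One common way to use such a score in a real-time system is to let it shorten or lengthen the silence timeout rather than replace it. The sketch below illustrates that idea only; the timeout values and function names are hypothetical and would need tuning on real traffic, and the threshold is the one reported in the Performance section.

```python
THRESHOLD = 0.077   # operating point reported above

def silence_timeout_ms(eou_prob, short_ms=200, long_ms=1200, threshold=THRESHOLD):
    """Pick how long to wait in silence before treating the turn as finished."""
    return short_ms if eou_prob >= threshold else long_ms

def should_respond(eou_prob, silence_ms, **timeouts):
    """True once the observed silence exceeds the probability-dependent timeout."""
    return silence_ms >= silence_timeout_ms(eou_prob, **timeouts)

print(should_respond(0.80, silence_ms=250))    # likely finished, short wait is enough
print(should_respond(0.02, silence_ms=250))    # likely mid-sentence, keep listening
print(should_respond(0.02, silence_ms=1500))   # long silence ends the turn regardless
```

Keeping a long fallback timeout means a low score can delay the response but never block it indefinitely.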

Limitations

  • Trained on Russian call-center speech transcripts; performance on other domains is good, but performance on other languages is not
  • Scores ASR transcript text, not audio — a separate VAD/ASR stage is needed upstream
  • Short utterances (< 3 tokens) may score unreliably