EOU Detector — Russian Call-Center Dialog
End-of-Utterance (EOU) detector for Russian conversational speech, fine-tuned from Qwen2.5-0.5B-Instruct on 200k real call-center dialogs.
The model predicts P(<|im_end|>) at the last token position — the probability that
the current speaker has finished their utterance. No classification head; the LM vocabulary
does the detection.
Developed at Simplexphone — real-time voice AI for call centers.
Performance
Evaluated on 200 stratified samples (100 positive EOU + 100 negative) from held-out call-center data:
| Metric | Value |
|---|---|
| F1 | 0.851 |
| False Alarm (1 − Precision) | 22.3% |
| False Rejection (1 − Recall) | 6.0% |
| Optimal threshold | 0.077 |
| GPU latency (H100, batch=1) | ~10 ms |
| CPU latency (Xeon 28-core, batch=1) | ~55 ms |
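The F1 in the table follows directly from the two error rates, and an operating point such as 0.077 is commonly found by sweeping candidate thresholds over held-out scores. The sketch below illustrates both steps. The names `f1_from_rates` and `best_threshold` are illustrative, and the source does not state how the reported threshold was actually chosen.

```python
def f1_from_rates(false_alarm: float, false_rejection: float) -> float:
    """F1 from False Alarm (1 - precision) and False Rejection (1 - recall)."""
    precision = 1.0 - false_alarm
    recall = 1.0 - false_rejection
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores: list[float], labels: list[int]) -> tuple[float, float]:
    """Return (threshold, f1) maximising F1 when predicting EOU for score >= threshold.

    Candidate thresholds are the observed scores themselves (an assumption of
    this sketch; a fixed grid would work as well).
    """
    best = (0.0, 0.0)
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        if tp == 0:
            continue
        f1 = 2 * tp / (2 * tp + fp + fn)
        if f1 > best[1]:
            best = (t, f1)
    return best


# Consistency check against the table: FA = 22.3%, FR = 6.0%.
print(round(f1_from_rates(0.223, 0.060), 3))  # 0.851
```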
Usage
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
tok = AutoTokenizer.from_pretrained("feanet/eou-detector-russian")
model = AutoModelForCausalLM.from_pretrained(
    "feanet/eou-detector-russian", torch_dtype=torch.float32
)
model.eval()
# Keep the most recent tokens when a dialog exceeds max_length: the score is
# read at the last position, so the current utterance must survive truncation.
tok.truncation_side = "left"
EOU_ID = tok.convert_tokens_to_ids("<|im_end|>")
THRESHOLD = 0.077

def eou_probability(history: list[dict], current_text: str) -> float:
    """
    history: list of {"role": "user"|"assistant", "content": "..."}
    current_text: the utterance to score (last client turn)
    Returns P(end-of-utterance) in [0, 1].
    """
    msgs = history + [{"role": "user", "content": current_text}]
    prompt = tok.apply_chat_template(msgs, add_generation_prompt=False, tokenize=False)
    prompt = prompt[: prompt.rfind("<|im_end|>")]  # strip trailing EOU token
    enc = tok(prompt, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**enc).logits
    # Probability of <|im_end|> as the next token after the last input token
    return torch.softmax(logits[0, -1, :], dim=-1)[EOU_ID].item()

# Example
history = [{"role": "assistant", "content": "добрый день чем могу помочь"}]
print(eou_probability(history, "спасибо до свидания")) # → ~0.8 (farewell, EOU)
print(eou_probability(history, "хотел уточнить по")) # → ~0.02 (incomplete, not EOU)
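In a live call the score is usually recomputed each time the ASR emits a longer partial transcript. The sketch below shows that loop with the scorer passed in as a callable, so it can wrap `eou_probability` from the snippet above. `first_eou_index` and the toy scorer are hypothetical names used only for illustration, and the toy scores are not model output.

```python
from typing import Callable, Optional, Sequence


def first_eou_index(
    partials: Sequence[str],
    score: Callable[[str], float],
    threshold: float = 0.077,
) -> Optional[int]:
    """Return the index of the first partial transcript scored as end-of-utterance.

    partials: growing ASR hypotheses for one turn, oldest first.
    score:    callable returning P(EOU) for a text, e.g.
              lambda text: eou_probability(history, text)
    """
    for i, text in enumerate(partials):
        if score(text) >= threshold:
            return i
    return None


# Toy scorer standing in for the model (NOT real model output).
toy_scores = {"хотел": 0.01, "хотел уточнить по": 0.02, "хотел уточнить по тарифу": 0.60}
idx = first_eou_index(list(toy_scores), toy_scores.__getitem__)
print(idx)  # 2
```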
ONNX / production deployment
For lower-latency production use, export to ONNX:
import torch, torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

class EOUModel(nn.Module):
    def __init__(self, model, eou_id):
        super().__init__()
        self.lm = model
        self.eou_id = eou_id

    def forward(self, input_ids):
        logits = self.lm(input_ids).logits
        # P(<|im_end|>) at the last position of every row in the batch
        return torch.softmax(logits[:, -1, :], dim=-1)[:, self.eou_id]

tok = AutoTokenizer.from_pretrained("feanet/eou-detector-russian")
model = AutoModelForCausalLM.from_pretrained("feanet/eou-detector-russian",
                                             torch_dtype=torch.float32).eval()
eou_model = EOUModel(model, tok.convert_tokens_to_ids("<|im_end|>")).eval()
dummy = tok(["хорошо спасибо"], return_tensors="pt")["input_ids"]
torch.onnx.export(eou_model, (dummy,), "model.onnx",
                  input_names=["input_ids"], output_names=["eou_prob"],
                  dynamic_axes={"input_ids": {0: "batch", 1: "seq_len"},
                                "eou_prob": {0: "batch"}},
                  opset_version=18)
ONNX batch=1 GPU latency: ~6 ms (H100).
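Once exported, the graph can be scored from Python with ONNX Runtime. This is a minimal sketch, assuming `model.onnx` was produced by the export above and that `onnxruntime` and `numpy` are installed. `load_session` and `score_ids` are illustrative helper names. Because the exported wrapper takes only `input_ids` (no attention mask), rows in a batch are assumed to have equal length.

```python
import numpy as np


def load_session(path: str = "model.onnx"):
    """Create an ONNX Runtime session (CPU provider; swap in CUDA if available)."""
    import onnxruntime as ort  # imported lazily so score_ids stays usable without it
    return ort.InferenceSession(path, providers=["CPUExecutionProvider"])


def score_ids(session, input_ids) -> list[float]:
    """Run the exported graph and return one EOU probability per batch row.

    input_ids: a list of token IDs, or a list of equal-length lists for a batch.
    """
    ids = np.asarray(input_ids, dtype=np.int64)
    if ids.ndim == 1:  # single sequence: add the batch axis
        ids = ids[None, :]
    (probs,) = session.run(["eou_prob"], {"input_ids": ids})
    return [float(p) for p in probs]
```

Typical use would be `score_ids(load_session(), tok(prompt)["input_ids"])`, with `prompt` built as in the Usage section.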
Training
Data: 200,667 Russian call-center dialog files in "[HH:MM] A/B: text" format.
Speaker A = customer (user), Speaker B = operator (assistant).
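The mapping from transcript lines to chat roles can be sketched as follows. This assumes one turn per line in exactly the `[HH:MM] A: text` shape described above. The regular expression, the `parse_dialog` helper, and the sample dialog are illustrative, not the original preprocessing code or data.

```python
import re

# Assumed line shape: "[HH:MM] A: text" or "[HH:MM] B: text".
LINE_RE = re.compile(r"^\[(\d{2}:\d{2})\]\s*([AB]):\s*(.*)$")
ROLE = {"A": "user", "B": "assistant"}  # A = customer, B = operator


def parse_dialog(raw: str) -> list[dict]:
    """Convert a transcript into chat-template messages, skipping malformed lines."""
    messages = []
    for line in raw.splitlines():
        m = LINE_RE.match(line.strip())
        if not m or not m.group(3):
            continue
        messages.append({"role": ROLE[m.group(2)], "content": m.group(3)})
    return messages


sample = "[00:01] B: добрый день чем могу помочь\n[00:04] A: хотел уточнить по тарифу"
print(parse_dialog(sample))
```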
Method: Causal language modelling on the full Qwen2.5 chat template.
The model learns to predict <|im_end|> at natural turn boundaries as part of
standard next-token prediction — no artificial labels.
Full loss (not masked to EOU positions only) is essential: masking causes catastrophic overfitting where the model memorises positions rather than learning turn-end signals.
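The difference between the two loss variants comes down to how the label vector is built. The sketch below assumes the Hugging Face convention that a label of `-100` is ignored by the cross-entropy loss. The function names are illustrative, and the content token IDs (1001, 1002) are toy values.

```python
IGNORE_INDEX = -100   # positions with this label are excluded from the loss
IM_END_ID = 151645    # <|im_end|> in the Qwen2.5 vocabulary


def full_loss_labels(input_ids: list[int]) -> list[int]:
    """Full causal-LM loss: every token is a prediction target (the variant used here)."""
    return list(input_ids)


def eou_only_labels(input_ids: list[int]) -> list[int]:
    """Masked variant the text warns against: only <|im_end|> targets contribute."""
    return [t if t == IM_END_ID else IGNORE_INDEX for t in input_ids]


ids = [872, 198, 1001, 1002, IM_END_ID, 198]
print(full_loss_labels(ids))  # [872, 198, 1001, 1002, 151645, 198]
print(eou_only_labels(ids))   # [-100, -100, -100, -100, 151645, -100]
```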
Key training details:
- Sequences split into sliding 512-token windows (stride 256) → 341k training chunks
- Optimizer: AdamW, lr=1e-5, cosine schedule, 5% warmup
- Precision: bf16 on 1× H100 80 GB
- Early stopping on eval loss, patience=3
- Best checkpoint: step 29,326 (~2 epochs)
- Weight untying applied before training (safetensors requirement for Qwen)
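The windowing step from the list above can be sketched as below. How the original pipeline treats the final partial window is not stated, so keeping a shorter tail chunk is an assumption of this example, as is the helper name `sliding_windows`.

```python
def sliding_windows(token_ids: list[int], size: int = 512, stride: int = 256) -> list[list[int]]:
    """Split a token sequence into overlapping windows of `size`, stepping by `stride`."""
    if len(token_ids) <= size:
        return [list(token_ids)]
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + size])
        if start + size >= len(token_ids):
            break  # this window already reaches the end of the sequence
        start += stride
    return chunks


chunks = sliding_windows(list(range(1200)))
print([(c[0], len(c)) for c in chunks])  # [(0, 512), (256, 512), (512, 512), (768, 432)]
```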
C++ / ONNX Runtime
Dependencies: onnxruntime, tokenizers-cpp (reads tokenizer.json directly), and ICU for NFKC normalisation.
#include <array>
#include <cstdint>
#include <vector>
#include <onnxruntime_cxx_api.h>
#include <tokenizers_cpp.h>
// Build the Qwen chat-template prompt manually and strip the trailing <|im_end|>
// token — the model scores P(<|im_end|>) as the *next* token at that position.
//
// Template token IDs (Qwen2.5 vocab):
// <|im_start|>=151644 <|im_end|>=151645 \n=198
// system=8948 user=872 assistant=77091
//
// ONNX interface:
// input "input_ids" INT64 [1, seq_len]
// output "eou_prob" FLOAT [1]
static constexpr float THRESHOLD = 0.0766f;
static constexpr int MAX_TOKENS = 512;
Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "eou");
Ort::Session session(env, "model.onnx", Ort::SessionOptions{});
auto tokenizer = tokenizers::Tokenizer::FromBlobJSON(
    ReadFile("tokenizer.json"));  // your file-read helper
// Build input_ids: [system block] + turns + [user open, current text]
// then truncate to MAX_TOKENS from the right.
std::vector<int64_t> ids = BuildPromptIds(tokenizer, history, current_text);
Ort::MemoryInfo mem("Cpu", OrtDeviceAllocator, 0, OrtMemTypeDefault);
std::array<int64_t, 2> shape{1, (int64_t)ids.size()};
auto input_tensor = Ort::Value::CreateTensor<int64_t>(
    mem, ids.data(), ids.size(), shape.data(), shape.size());
const char* input_names[] = {"input_ids"};
const char* output_names[] = {"eou_prob"};
auto output = session.Run(Ort::RunOptions{}, input_names, &input_tensor, 1,
                          output_names, 1);
float prob = output[0].GetTensorData<float>()[0];
bool eou = prob >= THRESHOLD; // FA=22.3% FR=6.0% F1=0.851
Full header-only class with preprocessing, tokenisation, and GPU support: eou_detector.h
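`BuildPromptIds` is left to the reader in the snippet above. Its expected output can be sketched in Python (the language of the other examples) using the template token IDs listed in the C++ comment. The `encode` callable, the `system_text` argument, and the choice to keep the most recent `max_tokens` IDs when truncating are assumptions of this sketch. Tokenising each piece separately may also differ slightly from tokenising the full template string at once.

```python
from typing import Callable

# Template token IDs from the Qwen2.5 vocabulary, as listed above.
IM_START, IM_END, NL = 151644, 151645, 198
ROLE_ID = {"system": 8948, "user": 872, "assistant": 77091}


def build_prompt_ids(
    encode: Callable[[str], list[int]],
    history: list[dict],
    current_text: str,
    system_text: str,
    max_tokens: int = 512,
) -> list[int]:
    """Lay out closed chat blocks for past turns and leave the current user turn open."""
    ids: list[int] = []
    turns = [{"role": "system", "content": system_text}] + list(history)
    for turn in turns:
        ids += [IM_START, ROLE_ID[turn["role"]], NL]
        ids += encode(turn["content"])
        ids += [IM_END, NL]
    # Current utterance: opened but NOT closed, so the model scores
    # P(<|im_end|>) as the next token after its last content token.
    ids += [IM_START, ROLE_ID["user"], NL] + encode(current_text)
    return ids[-max_tokens:]  # keep the most recent tokens (assumption)
```

In practice `encode` would wrap the tokenizer loaded from tokenizer.json, and `system_text` should match whatever system prompt the chat template produced during training.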
Latency
| Runtime | Hardware | Batch | Latency |
|---|---|---|---|
| PyTorch FP32 | H100 80 GB | 1 | ~23 ms |
| ONNX Runtime FP32 | H100 80 GB | 1 | 6 ms |
| ONNX Runtime FP32 | Xeon 28-core | 1 | ~55 ms |
| ONNX Runtime FP32 | H100 80 GB | 128 | 14 ms (9 k items/s) |
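Figures like these depend heavily on warm-up and on how timings are aggregated. The sketch below shows one common way to measure per-call latency (median over repeated runs after warm-up) and to convert a batch latency into throughput. It is not the benchmark script behind the table, and `measure_latency_ms` and `throughput_per_s` are illustrative names.

```python
import statistics
import time
from typing import Callable


def measure_latency_ms(fn: Callable[[], object], warmup: int = 10, runs: int = 100) -> float:
    """Median wall-clock latency of fn() in milliseconds, after warm-up calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1000.0)
    return statistics.median(samples)


def throughput_per_s(batch_size: int, latency_ms: float) -> float:
    """Items processed per second for a given batch size and batch latency."""
    return batch_size / (latency_ms / 1000.0)


# Table check: batch 128 at 14 ms per batch.
print(round(throughput_per_s(128, 14.0)))  # 9143
```

When timing PyTorch on a GPU, the callable should synchronise the device (for example with `torch.cuda.synchronize()`) so that asynchronous kernels are included in the measurement.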
Intended use
- Voice assistant / IVR systems: detect when the caller has finished speaking before routing to ASR or NLU
- Call-center analytics: segment transcripts by speaker turn
- Real-time dialog systems needing a language-aware alternative to silence-based VAD
Limitations
- Trained on Russian call-center speech transcripts; performance on other domains is good, but performance on other languages is not
- Scores ASR transcript text, not audio — a separate VAD/ASR stage is needed upstream
- Short utterances (< 3 tokens) may score unreliably