SmartTurn V3.1 FP32
SmartTurn V3.1 is a semantic turn-completion / endpoint model. It takes the latest 8 seconds of 16 kHz audio, extracts Whisper log-mel features, and returns a binary probability for whether the current audio segment should be considered complete.
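The window size above fixes the input shape. As a rough worked example, the sketch below computes the number of samples and log-mel frames in one window. The hop length of 160 samples and the 80 mel bins are assumptions taken from the usual Whisper feature extractor defaults, not from this repository; check preprocessor_config.json for the actual values.

```python
SAMPLE_RATE = 16_000      # Hz, from the model description
WINDOW_SECONDS = 8        # seconds, from the model description
HOP_LENGTH = 160          # assumption: common Whisper default, verify in preprocessor_config.json
N_MELS = 80               # assumption: common Whisper default, verify in preprocessor_config.json


def window_shape(
    seconds: int = WINDOW_SECONDS,
    sample_rate: int = SAMPLE_RATE,
    hop_length: int = HOP_LENGTH,
) -> tuple[int, int]:
    """Return (samples, frames) for one analysis window."""
    samples = seconds * sample_rate
    frames = samples // hop_length
    return samples, frames


samples, frames = window_shape()
print(f"samples={samples} frames={frames} expected feature shape=(1, {N_MELS}, {frames})")
```

Under these assumptions, one window holds 128000 samples and 800 frames.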
This repository contains the FP32 Hugging Face save_pretrained export recovered
from smart-turn-v3.1-gpu.onnx.
Files
- config.json: SmartTurnV3 config with model_type = "smart_turn_v3".
- model.safetensors: FP32 model weights.
- preprocessor_config.json: Whisper feature extractor config for 8-second audio.
Important
This model repo intentionally does not ship remote code. To load it, install
or otherwise make available the smart_turn Python package that defines and
registers SmartTurnV3Config and SmartTurnV3Model.
Installation
pip install torch==2.8.* transformers librosa numpy
# Install your smart_turn package, or run inside the smart-turn project with:
export PYTHONPATH=/path/to/smart-turn-onnx/src:${PYTHONPATH}
End-to-End Inference Demo
Save this as demo_smart_turn.py and run:
python demo_smart_turn.py /path/to/audio.wav
from __future__ import annotations

import sys
from pathlib import Path

import librosa
import numpy as np
import torch
from transformers import WhisperFeatureExtractor

from smart_turn.models import load_model

REPO_ID = "MigoXV/smart-turn-v3.1"
SAMPLE_RATE = 16000
WINDOW_SECONDS = 8
THRESHOLD = 0.5


def load_audio_window(path: str | Path) -> np.ndarray:
    """Load audio as mono float32 at 16 kHz and return the latest 8-second window."""
    audio, sr = librosa.load(path, sr=None, mono=True)
    if sr != SAMPLE_RATE:
        audio = librosa.resample(audio, orig_sr=sr, target_sr=SAMPLE_RATE)
    if audio.dtype != np.float32:
        audio = audio.astype(np.float32)

    # Rescale only if the signal exceeds the [-1, 1] range.
    max_abs = np.max(np.abs(audio)) if audio.size else 0.0
    if max_abs > 1.0:
        audio = audio / max_abs

    # Keep the most recent samples; left-pad shorter clips with zeros.
    max_samples = WINDOW_SECONDS * SAMPLE_RATE
    if audio.size >= max_samples:
        return audio[-max_samples:]
    padding = max_samples - audio.size
    return np.pad(audio, (padding, 0), mode="constant", constant_values=0)


def predict(audio_path: str | Path, device: str = "cpu") -> tuple[int, float]:
    """Return (prediction, probability) for the given audio file."""
    torch_device = torch.device(device)
    dtype = torch.float32

    model = load_model(REPO_ID).to(device=torch_device, dtype=dtype).eval()
    feature_extractor = WhisperFeatureExtractor.from_pretrained(REPO_ID)

    audio = load_audio_window(audio_path)
    inputs = feature_extractor(
        audio,
        sampling_rate=SAMPLE_RATE,
        return_tensors="np",
        padding="max_length",
        max_length=WINDOW_SECONDS * SAMPLE_RATE,
        truncation=True,
        do_normalize=True,
    )
    input_features = torch.from_numpy(
        inputs.input_features.astype(np.float32)
    ).to(device=torch_device, dtype=dtype)

    # The "logits" output already holds the sigmoid probability.
    with torch.no_grad():
        probability = model(input_features=input_features)["logits"].view(-1).item()

    prediction = 1 if probability > THRESHOLD else 0
    return prediction, probability


if __name__ == "__main__":
    if len(sys.argv) != 2:
        raise SystemExit("Usage: python demo_smart_turn.py /path/to/audio.wav")
    pred, prob = predict(sys.argv[1], device="cuda" if torch.cuda.is_available() else "cpu")
    print(f"prediction={pred} probability={prob:.8f}")
Input And Output
- Input audio is converted to mono float32.
- Audio is resampled to 16 kHz.
- Only the latest 8 seconds are used.
- Shorter audio is left-padded with zeros.
- The model returns a sigmoid probability in [0, 1].
- prediction = 1 means the segment is considered complete when probability > 0.5.
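The windowing and thresholding rules above can be checked on synthetic arrays without loading the model. This is a minimal sketch; latest_window and is_complete are illustrative helper names, not part of the smart_turn package.

```python
import numpy as np

SAMPLE_RATE = 16_000
WINDOW_SECONDS = 8
THRESHOLD = 0.5


def latest_window(audio: np.ndarray) -> np.ndarray:
    """Keep the latest 8 seconds; left-pad shorter audio with zeros."""
    max_samples = WINDOW_SECONDS * SAMPLE_RATE
    if audio.size >= max_samples:
        return audio[-max_samples:]
    return np.pad(audio, (max_samples - audio.size, 0), mode="constant")


def is_complete(probability: float, threshold: float = THRESHOLD) -> int:
    """Return 1 when the probability is strictly above the threshold, else 0."""
    return 1 if probability > threshold else 0


# A 3-second clip gains 5 seconds of leading zeros.
short_clip = np.ones(3 * SAMPLE_RATE, dtype=np.float32)
padded = latest_window(short_clip)

# A 10-second clip loses its first 2 seconds.
long_clip = np.arange(10 * SAMPLE_RATE, dtype=np.float32)
trimmed = latest_window(long_clip)

print(padded.shape, trimmed.shape, is_complete(0.73))
```

Left-padding keeps the most recent speech at the end of the window, where the model expects the end of the turn to be.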
Notes
- This is an FP32 safetensors export.
- The model architecture is WhisperEncoder + attention pooling + binary head.
- The repository does not include training code or remote modeling code.
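For readers unfamiliar with the "attention pooling + binary head" part, the NumPy sketch below shows one generic form of it: a learned vector scores each encoder frame, a softmax turns the scores into weights, and the weighted sum feeds a single sigmoid unit. The layer shapes, parameter names, and random weights are assumptions for illustration; the actual SmartTurnV3Model layers may differ.

```python
import numpy as np


def attention_pool(hidden: np.ndarray, score_vector: np.ndarray) -> np.ndarray:
    """Collapse (frames, dim) encoder states into one (dim,) vector."""
    scores = hidden @ score_vector           # one score per frame
    scores = scores - scores.max()           # stabilise the softmax
    weights = np.exp(scores)
    weights = weights / weights.sum()        # softmax over frames
    return weights @ hidden                  # weighted average of frames


def binary_head(pooled: np.ndarray, weight: np.ndarray, bias: float) -> float:
    """Single linear unit followed by a sigmoid."""
    logit = float(pooled @ weight + bias)
    return 1.0 / (1.0 + np.exp(-logit))


rng = np.random.default_rng(0)
frames, dim = 400, 16                        # illustrative sizes only
hidden_states = rng.standard_normal((frames, dim))

pooled = attention_pool(hidden_states, rng.standard_normal(dim))
probability = binary_head(pooled, rng.standard_normal(dim), bias=0.0)
print(pooled.shape, 0.0 < probability < 1.0)
```

With an all-zero scoring vector the weights become uniform and the pooling reduces to a plain mean over frames, which is a convenient sanity check.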