Uhm: on-device filler-word detection
A frame-precise classifier that finds "uh", "um", "hmm", and other fillers in audio with 20 ms timestamps. Trained on English; produces high-confidence detections on Spanish, French, German, and Dutch without retraining.
Try it
- iOS / macOS: uhm-swift, the Swift SDK with a built-in demo app.
Variants
Two tiers, both free under the license up to 100,000 MAU each.
| Tier | Backbone | Character | When to use |
|---|---|---|---|
| uhm-base | HuBERT-base, 8-bit Core ML, 90 MB | Higher recall; broadest device support | Default. Catches more fillers, accepts a few more false fires. |
| uhm-pro | DistilHuBERT, fp16 Core ML, 45 MB | Smaller, faster (~2.2× on-device), more precise | When a flagged filler gets auto-cut without review. |
Both variants preserve 100% argmax agreement with the fp32 PyTorch reference on test inputs.
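Argmax agreement can be read as the fraction of frames where two model outputs pick the same top class. A minimal sketch of that metric, assuming each output is a list of per-frame score lists; the name `argmax_agreement` is illustrative and not part of the Uhm tooling:

```python
def argmax_agreement(reference, candidate):
    """Fraction of frames where both outputs select the same top class."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same number of frames")
    if not reference:
        raise ValueError("outputs must contain at least one frame")

    def top(frame):
        # Index of the highest score; ties resolve to the lowest index.
        return max(range(len(frame)), key=frame.__getitem__)

    same = sum(top(r) == top(c) for r, c in zip(reference, candidate))
    return same / len(reference)
```

A result of 1.0 means the exported model and the reference choose the same class on every frame, even if the underlying scores differ slightly.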
Files
| Tier | File | Format | Size | Use |
|---|---|---|---|---|
| uhm-base | uhm-base.mlpackage.zip | Core ML 8-bit | ~88 MB | iOS / macOS on-device |
| uhm-base | uhm-base-web-fp16.onnx | ONNX fp16 | ~189 MB | Browser, server, Python (onnxruntime) |
| uhm-base | uhm-base.onnx | ONNX fp32 | ~378 MB | Quantization-free reference |
| uhm-pro | uhm-pro.mlpackage.zip | Core ML fp16 | ~45 MB | iOS / macOS on-device |
| uhm-pro | uhm-pro-web-fp16.onnx | ONNX fp16 | ~51 MB | Browser, server, Python (onnxruntime) |
| uhm-pro | uhm-pro.onnx | ONNX fp32 | ~98 MB | Quantization-free reference |
Source weights for fine-tuning live in safetensors-checkpoint/ (HuBERT-base fp32, alongside config.json, preprocessor_config.json, labels.json).
Use
Swift (iOS / macOS)
import Uhm
let uhm = try await Uhm()
let result = try await uhm.analyze(audioURL: url)
for f in result.fillers { print(f.start, f.end, f.type ?? .other) }
See uhm-swift for the full API.
Python (ONNX)
from huggingface_hub import hf_hub_download
import onnxruntime as ort
path = hf_hub_download("desert-ant-labs/uhm", "uhm-base-web-fp16.onnx")
session = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
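The snippet above stops once the session is created. What follows is a hedged sketch of pushing one audio window through it, not the documented Uhm API. It assumes the ONNX graph takes a single raw-waveform input shaped (1, n_samples) and returns the per-frame scores as its first output. The input tensor name and dtype are not documented here, so the code reads both from the session. Whether the waveform needs normalizing first is also not stated. `run_window` is a hypothetical helper name.

```python
import numpy as np


def run_window(session, samples):
    """Run one audio window through an onnxruntime session.

    Returns the first model output with the batch axis removed.
    Assumes a single waveform input of shape (1, n_samples).
    """
    meta = session.get_inputs()[0]
    # An fp16 export may declare a float16 input; match whatever the graph reports.
    dtype = np.float16 if meta.type == "tensor(float16)" else np.float32
    batch = np.asarray(samples, dtype=dtype)[np.newaxis, :]  # shape (1, n_samples)
    outputs = session.run(None, {meta.name: batch})
    return np.asarray(outputs[0])[0]
```

With the session from above, `scores = run_window(session, samples)` would give one row of class scores per frame, provided those assumptions hold for the exported file.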
Python (PyTorch, fine-tuning starting point)
from transformers import AutoModelForAudioFrameClassification, AutoFeatureExtractor
extractor = AutoFeatureExtractor.from_pretrained("desert-ant-labs/uhm")
model = AutoModelForAudioFrameClassification.from_pretrained("desert-ant-labs/uhm")
Inputs and outputs
- Input: 16 kHz mono audio, up to 30-second windows.
- Output: per-frame softmax over 6 classes, one prediction every 20 ms.
- Class indices: 0 = not_filler, 1 = uh, 2 = um, 3 = hmm, 4 = and, 5 = other.
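To turn the per-frame output into timestamped fillers, one option is to take the argmax of each frame and merge consecutive frames that share a filler class. The sketch below is illustrative and not the SDK's own post-processing. It assumes frame `i` covers the interval from `i * 20 ms` to `(i + 1) * 20 ms`, which is a simplification, and `frames_to_segments` is a hypothetical name.

```python
LABELS = ["not_filler", "uh", "um", "hmm", "and", "other"]
FRAME_SECONDS = 0.02  # one prediction every 20 ms


def frames_to_segments(frames):
    """Merge consecutive frames with the same filler class into (start, end, label)."""
    segments = []
    current = None  # [first_frame, last_frame, class_index] of the open run
    for i, scores in enumerate(frames):
        cls = max(range(len(scores)), key=scores.__getitem__)
        if current is not None and cls == current[2]:
            current[1] = i  # extend the open run
            continue
        if current is not None:
            segments.append(current)
        # Class 0 (not_filler) never opens a run.
        current = [i, i, cls] if cls != 0 else None
    if current is not None:
        segments.append(current)
    return [
        (round(first * FRAME_SECONDS, 2), round((last + 1) * FRAME_SECONDS, 2), LABELS[cls])
        for first, last, cls in segments
    ]
```

Adjacent runs of different filler types stay separate here. Since subtype labels are less reliable than the filler decision itself, merging them into a single span may be preferable in some pipelines.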
Core ML input shape: (1, 480000) float32, which is 30 s at 16 kHz; output shape: (1, 1499, 6). Requires iOS 17 / macOS 14 or newer.
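Because the Core ML input length is fixed at 480000 samples, longer recordings have to be split and shorter ones padded before inference. A minimal sketch follows, assuming non-overlapping windows and zero-padding of the final chunk. This document does not describe how the SDK handles either case, so treat both choices, and the name `split_into_windows`, as assumptions.

```python
SAMPLE_RATE = 16_000
WINDOW_SAMPLES = 30 * SAMPLE_RATE  # 480000, the documented Core ML input length


def split_into_windows(samples, window=WINDOW_SAMPLES):
    """Split mono samples into fixed-length windows, zero-padding the last one.

    Returns a list of (offset_seconds, window_samples, valid_length) tuples.
    """
    windows = []
    for start in range(0, len(samples), window):
        chunk = list(samples[start:start + window])
        valid = len(chunk)
        chunk.extend([0.0] * (window - valid))  # pad the final chunk with silence
        windows.append((start / SAMPLE_RATE, chunk, valid))
    return windows
```

The offset lets detections from each window be shifted back onto the full timeline, and `valid_length` marks where padding begins so that predictions over the padded tail can be discarded. A filler that straddles a window boundary would be split in two under this scheme; overlapping windows are one way around that.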
Limitations
- Trained on English; non-English performance relies on acoustic transfer and has not been measured against per-language ground truth.
- Best on podcast / meeting / talking-head audio. Heavy background music, laughter, or multi-speaker overlap degrades quality.
- Type labels (uh / um / hmm / and / other) are secondary; trust filler vs. not_filler more than the specific subtype.
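Since the filler vs. not_filler decision is the more reliable signal, one way to act on it is to collapse the six-class output into a single filler probability per frame and threshold that. The sketch below assumes each row is a softmax distribution with index 0 as not_filler. The 0.5 default is purely illustrative, not a recommended operating point, and both function names are hypothetical.

```python
def filler_probability(scores):
    """Probability that a frame is any filler: 1 minus P(not_filler)."""
    return 1.0 - scores[0]


def flag_frames(frames, threshold=0.5):
    """Return one boolean per frame; True where filler probability >= threshold."""
    return [filler_probability(s) >= threshold for s in frames]
```

Raising the threshold trades recall for precision, which may matter when flagged fillers are cut automatically without review.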
Built on
- Base architectures and pretrained weights: facebook/hubert-base-ls960 (Apache 2.0) and its distilled variant ntu-spml/distilhubert (Apache 2.0).
- Public fine-tuning audio: AMI Meeting Corpus (edinburghcstr/ami, IHM split), CC BY 4.0, Edinburgh CSTR.
- Internal video content created by the Desert Ant Labs team.
License
Released under the Desert Ant Labs Source-Available License v1.0 (see LICENSE.md).
- Free for commercial use up to 100,000 Monthly Active Users (MAU).
- Above 100,000 MAU a commercial license is required. Contact licensing@desertant.ai.
Citation
@software{uhm_2026,
title = {Uhm: on-device filler-word detection},
author = {Desert Ant Labs},
year = {2026},
url = {https://huggingface.co/desert-ant-labs/uhm},
}
© 2026 Desert Ant Labs · https://desertant.ai