Uhm: on-device filler-word detection

A frame-precise classifier that finds "uh", "um", "hmm", and other fillers in audio with 20 ms timestamps. Trained on English; produces high-confidence detections on Spanish, French, German, and Dutch without retraining.

Try it

  • iOS / macOS: uhm-swift, the Swift SDK with a built-in demo app.

Variants

Two tiers, both free under the license up to 100,000 MAU each.

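| Tier     | Backbone                          | Character                                       | When to use                                                    |
|----------|-----------------------------------|-------------------------------------------------|----------------------------------------------------------------|
| uhm-base | HuBERT-base, 8-bit Core ML, 90 MB | Higher recall; broadest device support          | Default. Catches more fillers, accepts a few more false fires. |
| uhm-pro  | DistilHuBERT, fp16 Core ML, 45 MB | Smaller, faster (~2.2× on-device), more precise | When a flagged filler gets auto-cut without review.            |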

Both variants preserve 100% argmax agreement with the fp32 PyTorch reference on test inputs.
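
The card does not say how agreement was measured. One plausible reading, sketched here as an assumption, is the fraction of frames whose top-scoring class is the same in both outputs. The function name `argmax_agreement` and the toy scores are illustrative only.

```python
def argmax_agreement(reference, candidate):
    """Fraction of frames whose top-scoring class matches between two outputs.

    Both arguments are sequences of per-frame score lists, e.g. 1499 x 6.
    """
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same number of frames")
    if not reference:
        raise ValueError("outputs must contain at least one frame")

    def top(frame):
        # Index of the largest score in one frame (first index wins ties).
        return max(range(len(frame)), key=frame.__getitem__)

    matches = sum(top(r) == top(c) for r, c in zip(reference, candidate))
    return matches / len(reference)


# Toy example: quantization shifts the scores slightly but not the winner.
fp32 = [[0.90, 0.05, 0.05], [0.10, 0.70, 0.20]]
quant = [[0.88, 0.07, 0.05], [0.12, 0.66, 0.22]]
print(argmax_agreement(fp32, quant))
```

Under this reading, 100% agreement means every frame keeps the same winning class, even though the underlying probabilities may differ slightly.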

Files

| Tier     | File                   | Format        | Size    | Use                                   |
|----------|------------------------|---------------|---------|---------------------------------------|
| uhm-base | uhm-base.mlpackage.zip | Core ML 8-bit | ~88 MB  | iOS / macOS on-device                 |
| uhm-base | uhm-base-web-fp16.onnx | ONNX fp16     | ~189 MB | Browser, server, Python (onnxruntime) |
| uhm-base | uhm-base.onnx          | ONNX fp32     | ~378 MB | Quantization-free reference           |
| uhm-pro  | uhm-pro.mlpackage.zip  | Core ML fp16  | ~45 MB  | iOS / macOS on-device                 |
| uhm-pro  | uhm-pro-web-fp16.onnx  | ONNX fp16     | ~51 MB  | Browser, server, Python (onnxruntime) |
| uhm-pro  | uhm-pro.onnx           | ONNX fp32     | ~98 MB  | Quantization-free reference           |

Source weights for fine-tuning live in safetensors-checkpoint/ (HuBERT-base fp32, alongside config.json, preprocessor_config.json, labels.json).

Use

Swift (iOS / macOS)

import Uhm

let uhm = try await Uhm()
let result = try await uhm.analyze(audioURL: url)
for f in result.fillers { print(f.start, f.end, f.type ?? .other) }

See uhm-swift for the full API.

Python (ONNX)

from huggingface_hub import hf_hub_download
import onnxruntime as ort

path    = hf_hub_download("desert-ant-labs/uhm", "uhm-base-web-fp16.onnx")
session = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
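
Running a window through the session might look like the sketch below. It is not the documented API of this model: `run_window` is a hypothetical helper, the zero-padding to a fixed 30-second window is an assumption borrowed from the Core ML shape, and any normalization specified in preprocessor_config.json is not applied. The input name and element type are read from the session rather than guessed.

```python
import numpy as np

WINDOW_SAMPLES = 480_000  # 30 s at 16 kHz, per "Inputs and outputs"


def run_window(session, samples):
    """Run one window through an onnxruntime.InferenceSession.

    `samples` is 1-D mono audio at 16 kHz with at most WINDOW_SAMPLES values.
    Returns the first model output with the batch axis removed.
    """
    audio = np.asarray(samples, dtype=np.float32)
    if audio.ndim != 1 or audio.size > WINDOW_SAMPLES:
        raise ValueError("expected 1-D audio of at most 30 s at 16 kHz")

    # Assumption: the export wants a fixed-length window, so zero-pad.
    padded = np.zeros(WINDOW_SAMPLES, dtype=np.float32)
    padded[: audio.size] = audio

    # Read the input name and element type from the model instead of guessing.
    meta = session.get_inputs()[0]
    dtype = np.float16 if "float16" in meta.type else np.float32
    feed = {meta.name: padded[np.newaxis, :].astype(dtype)}

    outputs = session.run(None, feed)  # None = return every output
    return np.asarray(outputs[0])[0]
```

With the session created above, `probs = run_window(session, samples)` would then be expected to hold one row of class scores per 20 ms frame, if the export matches the shapes listed under "Inputs and outputs".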

Python (PyTorch, fine-tuning starting point)

from transformers import AutoModelForAudioFrameClassification, AutoFeatureExtractor

extractor = AutoFeatureExtractor.from_pretrained("desert-ant-labs/uhm")
model     = AutoModelForAudioFrameClassification.from_pretrained("desert-ant-labs/uhm")
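
Fine-tuning a frame classifier needs one target class per 20 ms frame. The sketch below shows one way to build those targets from time-stamped annotations. The annotation format, the `frame_targets` helper, and the rule that a frame takes the label covering its centre are all assumptions for illustration; the class indices come from "Inputs and outputs".

```python
LABELS = {"not_filler": 0, "uh": 1, "um": 2, "hmm": 3, "and": 4, "other": 5}
FRAME_SECONDS = 0.02
FRAMES_PER_WINDOW = 1499  # frames in one 30 s window


def frame_targets(annotations, num_frames=FRAMES_PER_WINDOW):
    """Turn (start_s, end_s, label) annotations into one class index per frame.

    A frame takes a filler label when its centre falls inside the annotation.
    """
    targets = [LABELS["not_filler"]] * num_frames
    for start, end, label in annotations:
        class_id = LABELS[label]
        for i in range(num_frames):
            centre = (i + 0.5) * FRAME_SECONDS
            if start <= centre < end:
                targets[i] = class_id
    return targets


# Two hypothetical annotations inside one window.
targets = frame_targets([(0.10, 0.30, "um"), (1.00, 1.10, "uh")])
print(targets[4:16])
```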

Inputs and outputs

  • Input: 16 kHz mono audio, up to 30-second windows.
  • Output: per-frame softmax over 6 classes, one prediction every 20 ms.
  • Class indices: 0 = not_filler, 1 = uh, 2 = um, 3 = hmm, 4 = and, 5 = other.
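
Per-frame output usually needs to be turned into time spans before it is useful. A minimal sketch, assuming each frame `i` covers the interval from `i * 20` ms to `(i + 1) * 20` ms and that a span is a run of consecutive frames with the same winning class; `decode_segments` is a hypothetical helper, not part of any SDK.

```python
LABELS = ["not_filler", "uh", "um", "hmm", "and", "other"]
FRAME_MS = 20


def decode_segments(probs):
    """Group consecutive frames with the same winning filler class.

    `probs` is a sequence of per-frame probability lists (6 values each).
    Returns a list of (start_s, end_s, label) tuples.
    """
    segments = []
    run_label, run_start = None, 0
    # The trailing not_filler sentinel flushes a run that reaches the end.
    for i, frame in enumerate(list(probs) + [[1.0, 0.0, 0.0, 0.0, 0.0, 0.0]]):
        best = max(range(len(frame)), key=frame.__getitem__)
        label = LABELS[best] if best != 0 else None
        if label != run_label:
            if run_label is not None:
                segments.append(
                    (run_start * FRAME_MS / 1000, i * FRAME_MS / 1000, run_label)
                )
            run_label, run_start = label, i
    return segments


quiet = [0.90, 0.02, 0.02, 0.02, 0.02, 0.02]
um = [0.10, 0.05, 0.80, 0.02, 0.02, 0.01]
print(decode_segments([quiet] * 3 + [um] * 4 + [quiet] * 2))
# [(0.06, 0.14, 'um')]
```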

Core ML input shape (1, 480000) float32; output (1, 1499, 6). Requires iOS 17 / macOS 14 or newer.
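
The shapes above can be checked with a little arithmetic: 30 s × 16,000 samples/s = 480,000 samples, and the standard HuBERT / wav2vec 2.0 convolutional encoder (total stride 320 samples, i.e. 20 ms) maps that to 1499 frames. The sketch below assumes this model keeps that standard encoder configuration; `split_windows` is a hypothetical helper for audio longer than one window.

```python
SAMPLE_RATE = 16_000
WINDOW_SAMPLES = 30 * SAMPLE_RATE  # 480,000

# (kernel, stride) per conv layer in the standard HuBERT / wav2vec 2.0 encoder.
# Assumption: this model keeps that configuration.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]


def frame_count(num_samples):
    """Number of frames an unpadded conv stack yields for num_samples."""
    length = num_samples
    for kernel, stride in CONV_LAYERS:
        length = (length - kernel) // stride + 1
    return length


def split_windows(samples):
    """Cut audio into 30 s windows, zero-padding the last one to full length.

    Returns (offset_seconds, window) pairs; add the offset to any timestamp
    decoded from that window.
    """
    windows = []
    for start in range(0, len(samples), WINDOW_SAMPLES):
        chunk = list(samples[start : start + WINDOW_SAMPLES])
        chunk += [0.0] * (WINDOW_SAMPLES - len(chunk))
        windows.append((start / SAMPLE_RATE, chunk))
    return windows


print(frame_count(WINDOW_SAMPLES))  # 1499
```

Splitting at fixed 30-second boundaries can cut a filler in half; overlapping windows would avoid that, at the cost of merging duplicate detections.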

Limitations

  • Trained on English; non-English performance is by acoustic transfer and has not been measured against per-language ground truth.
  • Best on podcast / meeting / talking-head audio. Heavy background music, laughter, or multi-speaker overlap degrades quality.
  • Type labels (uh / um / hmm / and / other) are secondary; trust filler vs. not_filler more than the specific subtype.
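
One way to act on the last point is to collapse the six classes into a single filler score per frame, 1 − P(not_filler), and threshold that instead of taking the argmax. This is a sketch only: the helper names and the 0.5 default threshold are assumptions, and a higher threshold would trade recall for precision.

```python
NOT_FILLER = 0  # class index from "Inputs and outputs"


def filler_probability(frame):
    """Probability that a frame is any kind of filler."""
    return 1.0 - frame[NOT_FILLER]


def filler_mask(probs, threshold=0.5):
    """Per-frame True/False filler decision that ignores the subtype."""
    return [filler_probability(frame) >= threshold for frame in probs]


# The subtypes split the mass here, so not_filler wins the argmax
# even though the frame is more likely a filler than not.
ambiguous = [0.30, 0.25, 0.25, 0.10, 0.05, 0.05]
silence = [0.90, 0.02, 0.02, 0.02, 0.02, 0.02]
print(filler_mask([ambiguous, silence]))
# [True, False]
```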

Built on

License

Released under the Desert Ant Labs Source-Available License v1.0 (see LICENSE.md).

  • Free for commercial use up to 100,000 Monthly Active Users (MAU).
  • Above 100,000 MAU a commercial license is required. Contact licensing@desertant.ai.

Citation

@software{uhm_2026,
  title  = {Uhm: on-device filler-word detection},
  author = {Desert Ant Labs},
  year   = {2026},
  url    = {https://huggingface.co/desert-ant-labs/uhm},
}

© 2026 Desert Ant Labs · https://desertant.ai
