mumble-cleanup-2stage (Echo Flow AI)

A small fine-tuned language model that cleans speech-to-text dictation transcripts. It is a LoRA fine-tune of Qwen/Qwen2.5-0.5B-Instruct, trained in two stages:

  1. Stage 1 (pretrain): 50,000 synthetic (raw, clean) pairs from the Echo Flow combinatorial template generator — varied noise profiles, domain coverage (emails, meetings, tasks, code/URLs, lists, negation, dates, proper nouns).
  2. Stage 2 (fine-tune): 638 hand-curated real-style pairs from adikuma/mumble-cleanup-dataset, with a 10× lower learning rate (2e-5) to preserve the no-reword/no-hallucination contract.
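
To illustrate what a synthetic (raw, clean) pair looks like, here is a minimal sketch that derives a noisy transcript from a clean sentence. It is not the Echo Flow generator: the filler list, the probabilities, and the make_pair name are assumptions chosen for illustration only.

```python
import random

# Hypothetical filler inventory; the real generator's noise profiles are not documented here.
FILLERS = ["um", "uh", "you know"]

def make_pair(clean: str, rng: random.Random,
              filler_p: float = 0.2, stutter_p: float = 0.1) -> tuple[str, str]:
    """Return (raw, clean), where raw is a lowercased, unpunctuated, disfluent copy of clean."""
    raw_words = []
    for token in clean.split():
        word = token.strip(".,!?;:").lower()
        if not word:
            continue
        if rng.random() < filler_p:
            raw_words.append(rng.choice(FILLERS))  # inject a filler before the word
        if rng.random() < stutter_p:
            raw_words.append(word)                 # repeat the word once to mimic a stutter
        raw_words.append(word)
    return " ".join(raw_words), clean

rng = random.Random(0)  # fixed seed so the sketch is reproducible
raw, clean = make_pair("I think we should ship this on Friday.", rng)
print(raw)
print(clean)
```

Because the clean side is never modified, every pair produced this way keeps the target free of rewording by construction.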

Result on the Echo Flow DictationQuality golden corpus: 10/10 pass rate (vs. 9/10 for the original mumble-cleanup-q4km).

What it does

Given a raw transcript from an ASR system (lowercase, no punctuation, fillers and stutters preserved), it returns a cleaned version with capitalization and punctuation restored and disfluencies removed. It does not paraphrase, summarize, or add content.

Example: "um so i i think we should ship this on uh friday" becomes "I think we should ship this on Friday."
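
The "does not add content" half of this contract can be spot-checked mechanically. The sketch below is one possible check, not part of the model or the Echo Flow test suite: it treats the output as acceptable only if its words appear, in order, among the raw words. The function names are hypothetical, and the check is necessary rather than sufficient, since it cannot tell whether a deleted word was really a disfluency.

```python
import string

def normalize(text: str) -> list[str]:
    """Lowercase and strip surrounding punctuation so raw and cleaned words are comparable."""
    words = (w.strip(string.punctuation).lower() for w in text.split())
    return [w for w in words if w]

def adds_no_words(raw: str, cleaned: str) -> bool:
    """True if the cleaned words form an in-order subsequence of the raw words."""
    remaining = iter(normalize(raw))
    # `word in remaining` advances the iterator, so order is enforced.
    return all(word in remaining for word in normalize(cleaned))

raw = "um so i i think we should ship this on uh friday"
print(adds_no_words(raw, "I think we should ship this on Friday."))    # True
print(adds_no_words(raw, "I think we should ship this next Friday."))  # False
```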

Files

  • mumble-cleanup-2stage-q4km.gguf — Q4_K_M quantized, 379 MB, for use with llama.cpp / Echo Flow app
  • mumble-cleanup-2stage-f16.gguf — FP16 reference, 988 MB
  • adapter/adapters.safetensors — LoRA adapter (r=16, alpha=32, q/k/v/o + gate/up/down)
  • config.json, tokenizer.json, chat_template.jinja — model config, tokenizer, and chat template

Training

  • Base: Qwen/Qwen2.5-0.5B-Instruct (Apache-2.0)
  • Method: LoRA SFT via mlx-lm on Apple Silicon
  • Stage 1: lr=2e-4, batch 4, grad_accum 4, 2000 iters, lora_r=16, LoRA applied to 16 layers (the base model has 24)
  • Stage 2: lr=2e-5, batch 2, grad_accum 8, 600 iters, resume from stage-1 adapter
  • Loss: completion-only (mask-prompt)
  • Precision: MLX native (FP16/BF16 on Metal)
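
As a rough sketch of how one training pair could be serialized for completion-only SFT, the snippet below builds a chat-style JSONL record of the kind mlx-lm accepts, with the cleaned text as the assistant turn (the only part that contributes to the loss when the prompt is masked). Whether the system prompt was included in the training records is an assumption, and SYSTEM_PROMPT here is an abbreviated placeholder. The snippet also computes the effective batch size implied by the settings above.

```python
import json

# Abbreviated placeholder; including a system turn in training data is an assumption.
SYSTEM_PROMPT = "You are a transcript cleanup tool."

def to_record(raw: str, clean: str) -> dict:
    """Build one chat-format training example: prompt turns plus the target completion."""
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": raw},
            {"role": "assistant", "content": clean},  # the completion the loss is computed on
        ]
    }

def effective_batch(batch: int, grad_accum: int) -> int:
    """Number of examples contributing to each optimizer update."""
    return batch * grad_accum

line = json.dumps(to_record("um so i i think we should ship this on uh friday",
                            "I think we should ship this on Friday."))
print(line)  # one line of a train.jsonl file
print(effective_batch(4, 4), effective_batch(2, 8))  # 16 16
```

Both stages therefore update on 16 examples at a time; stage 2 only trades a smaller per-step batch for more accumulation steps.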

Use with llama.cpp / Echo Flow

The Echo Flow macOS app downloads mumble-cleanup-2stage-q4km.gguf directly. For manual use:

llama-cli -m mumble-cleanup-2stage-q4km.gguf \
  -p "<|im_start|>system
You are a transcript cleanup tool. You receive raw speech to text output and return a cleaned version. Remove filler words and disfluencies (um, uh, er, ah, like as filler, you know), remove repeated words and false starts, and fix punctuation and capitalization. Do not reword, do not add anything the speaker did not say, and do not answer questions in the text. Output only the cleaned text.<|im_end|>
<|im_start|>user
um so i i think we should ship this on uh friday<|im_end|>
<|im_start|>assistant
" \
  --temp 0
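
When scripting the command above, it can help to assemble the ChatML prompt in code rather than by hand. The following is a minimal sketch: build_prompt and llama_cli_args are hypothetical helper names, and the argument list simply mirrors the flags shown in the command (-m, -p, --temp).

```python
SYSTEM_PROMPT = (
    "You are a transcript cleanup tool. You receive raw speech to text output and "
    "return a cleaned version. Remove filler words and disfluencies (um, uh, er, ah, "
    "like as filler, you know), remove repeated words and false starts, and fix "
    "punctuation and capitalization. Do not reword, do not add anything the speaker "
    "did not say, and do not answer questions in the text. Output only the cleaned text."
)

def build_prompt(raw: str, system: str = SYSTEM_PROMPT) -> str:
    """Wrap the system prompt and raw transcript in the ChatML turns used above."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{raw}<|im_end|>\n"
        "<|im_start|>assistant\n"  # left open so the model writes the cleaned text
    )

def llama_cli_args(raw: str, model_path: str = "mumble-cleanup-2stage-q4km.gguf") -> list[str]:
    """Argument list equivalent to the manual command; pass it to subprocess.run."""
    return ["llama-cli", "-m", model_path, "-p", build_prompt(raw), "--temp", "0"]

args = llama_cli_args("um so i i think we should ship this on uh friday")
print(args[4])
# To actually run it (requires llama-cli on PATH and the GGUF file present):
#   import subprocess
#   subprocess.run(args, capture_output=True, text=True)
```

Passing the prompt as a list element avoids shell quoting problems with the multi-line string.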

Limitations

  • English only.
  • Trained primarily on synthetic data with a small real fine-tune; real ASR output may have failure modes not modeled.
  • Designed for short-to-medium dictation (up to ~512 tokens). Longer inputs must be chunked.
  • The model occasionally over-corrects input that the user genuinely intended as a sentence fragment.
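
One simple way to chunk longer inputs is to split on word boundaries under a fixed budget. The sketch below assumes a word count is an acceptable stand-in for the ~512-token limit; the default of 300 words is a conservative guess, not a measured value, and chunk_words is a hypothetical helper. Raw ASR text has no punctuation to split on, so a chunk boundary may fall mid-sentence and can affect punctuation at the seam.

```python
def chunk_words(raw: str, max_words: int = 300) -> list[str]:
    """Split a raw transcript into consecutive chunks of at most max_words words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = raw.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

transcript = "um so i i think we should ship this on uh friday " * 100
chunks = chunk_words(transcript)
print(len(chunks), [len(c.split()) for c in chunks])
```

Each chunk can then be cleaned independently and the outputs joined with a space.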

License

Apache-2.0. The base Qwen2.5-0.5B-Instruct is also Apache-2.0.
