
training-lab

Experiments in voice dictation to programming syntax. Teaching small models to understand spoken code.

Domain

Converting spoken dictation like "git space push space dash u space origin space main" into actual syntax: `git push -u origin main`.
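For the already-clean case, the conversion is essentially token substitution. A minimal sketch, assuming a hypothetical `SYMBOLS` map that is only a small subset of the real processor's vocabulary:

```python
# Minimal sketch of the clean-protocol case: word-by-word substitution.
# SYMBOLS is a hypothetical subset of the processor's symbol vocabulary.
SYMBOLS = {"dash": "-", "dot": ".", "slash": "/", "pipe": "|", "colon": ":"}

def render(transcript: str) -> str:
    out = []
    for word in transcript.split():
        if word == "space":
            out.append(" ")           # "space" becomes a literal space
        elif word in SYMBOLS:
            out.append(SYMBOLS[word]) # symbol words become symbols
        else:
            out.append(word)          # everything else passes through
    return "".join(out)

print(render("git space push space dash u space origin space main"))
# -> git push -u origin main
```

Note that joining without separators is what lets "dash u" fuse into `-u`: only an explicit "space" produces whitespace.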

The challenge: users don't always speak in perfect protocol format. They use synonyms ("minus" for "dash"), skip separator words, add conversational filler ("okay so the command is..."), and make mid-sentence corrections ("no wait, actually...").
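The synonym problem in particular is cheap to handle before any LLM gets involved. A sketch, assuming a hypothetical `SYNONYMS` table (the real pipeline's table may differ):

```python
# Sketch: canonicalize spoken synonyms to protocol words before processing.
# SYNONYMS is a hypothetical subset, not the repo's actual mapping.
SYNONYMS = {"minus": "dash", "hyphen": "dash", "period": "dot", "point": "dot"}

def canonicalize(transcript: str) -> str:
    return " ".join(SYNONYMS.get(w, w) for w in transcript.split())

print(canonicalize("git push minus u origin main"))
# -> git push dash u origin main
```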

Architecture

Raw speech transcript
  → Protocol detector (is it already clean?)
  → IF clean: bypass LLM → procedural processor
  → IF messy: LLM normalizer → procedural processor
  → Final syntax output
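The routing step above can be sketched in a few lines. The `is_clean` heuristic here (reject anything containing conversational filler or correction markers) is an assumption for illustration, not the repo's actual detector:

```python
# Sketch of the pipeline's routing logic.
# FILLER is a hypothetical marker set; the real detector may be stricter.
FILLER = {"okay", "so", "um", "uh", "wait", "actually", "no"}

def is_clean(transcript: str) -> bool:
    """Treat a transcript as clean protocol if it has no filler/corrections."""
    return not any(w in FILLER for w in transcript.lower().split())

def route(transcript: str, normalize, process):
    # Clean input bypasses the LLM entirely; messy input is normalized first.
    if is_clean(transcript):
        return process(transcript)
    return process(normalize(transcript))
```

The design point is that the LLM sits behind a gate: clean dictation never pays the LLM's latency or hallucination risk.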

Procedural processor — a deterministic token scanner with a symbol vocabulary, number words, and casing directives. 93% on clean input, zero hallucination, effectively instant.
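The number-word part of the scanner can be sketched as follows; the vocabulary and combination rule here are assumptions for illustration, not the repo's actual tables:

```python
# Sketch of number-word handling inside the deterministic processor.
# Assumed vocabulary: zero..nineteen, the tens, and "hundred".
UNITS = ["zero", "one", "two", "three", "four", "five", "six", "seven",
         "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}

def words_to_number(words: list[str]) -> int:
    value = 0
    for w in words:
        if w in TENS:
            value += TENS[w]
        elif w in UNITS:
            value += UNITS.index(w)
        elif w == "hundred":
            value = max(value, 1) * 100  # "hundred" scales what came before
    return value

print(words_to_number(["four", "hundred", "four"]))
# -> 404
```

Because every rule is a table lookup, this stage cannot hallucinate: unknown words simply fall through, which is where the 0% fuzzy score in the results table comes from.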

LLM normalizer — rewrites messy dictation into clean protocol format. Strips filler, resolves corrections, inserts spacing keywords. The LLM never outputs actual symbols — it only outputs protocol words.
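That contract (words in, words out, never symbols) lives in the prompt. A hypothetical sketch of how such a prompt could be assembled; the actual prompt is in `pipeline/normalizer.py` and may read differently:

```python
# Hypothetical system prompt illustrating the normalizer's contract:
# the model may only emit protocol WORDS, never literal symbols.
SYSTEM = (
    "Rewrite the user's dictation as clean protocol words. "
    "Strip conversational filler, resolve spoken corrections, and insert "
    "'space' between tokens. Use symbol words (dash, dot, slash, pipe, "
    "colon), never the symbols themselves."
)

def build_messages(transcript: str) -> list[dict]:
    return [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": transcript},
    ]

# e.g. "okay so um git push dash u" should normalize to
#      "git space push space dash u" (then the processor emits symbols)
```

Keeping symbol emission out of the LLM's output space is what keeps hallucination contained: even a wrong normalization can only produce protocol words the deterministic processor knows how to reject or render.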

Structure

processor/          Deterministic symbol/number/casing processor
pipeline/           LLM + processor pipeline (zero-training normalizer)
eval/               Evaluation datasets (fuzzy + independent)
training/
  data/             Training data (syntax-reconstruction, dictation-to-bash)
  converters/       Scripts to generate training data from NL2Bash
  adapters/         Fine-tuned model adapters (LoRA/DoRA)
scripts/            Evaluation and benchmarking scripts
blog/               Writeup drafts and notes

Quick start

# Run the procedural processor on clean protocol input
python3 processor/procedural.py eval/independent.json

# Run the normalizer pipeline (requires mlx-lm)
pip install mlx mlx-lm
python3 pipeline/normalizer.py eval/fuzzy.json --model mlx-community/Qwen2.5-1.5B-Instruct-4bit

Results (zero-training, prompted only)

| Model          | Clean | Fuzzy | Natural | Chaotic | Overall |
|----------------|-------|-------|---------|---------|---------|
| Processor only | 92%   | 0%    | 0%      | 2%      | 23.5%   |
| Qwen 2.5 1.5B  | 90%   | 20%   | 54%     | 24%     | 47%     |
| Qwen 2.5 0.5B  | 90%   | 12%   | 44%     | 20%     | 41.5%   |
| Llama 3.2 1B   | 92%   | 14%   | 34%     | 10%     | 37.5%   |

Protocol format

The "space-as-a-word" protocol eliminates spacing ambiguity:

  • "space" β†’ literal space between tokens
  • Symbol words: dash dot slash pipe colon quote etc.
  • Casing: camel case, snake case, pascal case, kebab case
  • Numbers: zero through nineteen, twenty...ninety, hundred, thousand
  • Capitalization: capital X, all caps WORD
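The casing directives can be sketched as a function from a style plus the words it governs to one rendered identifier. This simplified version assumes the directive applies to the whole word list; the real processor's scoping rules may differ:

```python
# Sketch of casing directives, a simplified stand-in for the processor's rules.
def apply_casing(style: str, words: list[str]) -> str:
    if style == "snake case":
        return "_".join(words)
    if style == "kebab case":
        return "-".join(words)
    if style == "camel case":
        return words[0] + "".join(w.capitalize() for w in words[1:])
    if style == "pascal case":
        return "".join(w.capitalize() for w in words)
    return " ".join(words)  # unknown style: leave words as spoken

print(apply_casing("camel case", ["user", "name"]))
# -> userName
```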