
training-lab

Experiments in voice dictation to programming syntax. Teaching small models to understand spoken code.

Domain

Converting spoken dictation like "git space push space dash u space origin space main" into actual syntax: `git push -u origin main`.
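For the already-clean case, the conversion is essentially token substitution. A minimal sketch, assuming a hypothetical `SYMBOLS` map that is only a small subset of the real processor's vocabulary:

```python
# Minimal sketch of the clean-protocol case: word-by-word substitution.
# SYMBOLS is a hypothetical subset of the processor's symbol vocabulary.
SYMBOLS = {"dash": "-", "dot": ".", "slash": "/", "pipe": "|", "colon": ":"}

def render(transcript: str) -> str:
    out = []
    for word in transcript.split():
        if word == "space":
            out.append(" ")           # "space" becomes a literal space
        elif word in SYMBOLS:
            out.append(SYMBOLS[word]) # symbol words become symbols
        else:
            out.append(word)          # everything else passes through
    return "".join(out)

print(render("git space push space dash u space origin space main"))
# -> git push -u origin main
```

Note that joining without separators is what lets "dash u" fuse into `-u`: only an explicit "space" produces whitespace.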

The challenge: users don't always speak in perfect protocol format. They use synonyms ("minus" for "dash"), skip separator words, add conversational filler ("okay so the command is..."), and make mid-sentence corrections ("no wait, actually...").
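The synonym problem in particular is cheap to handle before any LLM gets involved. A sketch, assuming a hypothetical `SYNONYMS` table (the real pipeline's table may differ):

```python
# Sketch: canonicalize spoken synonyms to protocol words before processing.
# SYNONYMS is a hypothetical subset, not the repo's actual mapping.
SYNONYMS = {"minus": "dash", "hyphen": "dash", "period": "dot", "point": "dot"}

def canonicalize(transcript: str) -> str:
    return " ".join(SYNONYMS.get(w, w) for w in transcript.split())

print(canonicalize("git push minus u origin main"))
# -> git push dash u origin main
```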

Architecture

Raw speech transcript
  → Protocol detector (is it already clean?)
  → IF clean: bypass LLM → procedural processor
  → IF messy: LLM normalizer → procedural processor
  → Final syntax output
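The routing step above can be sketched in a few lines. The `is_clean` heuristic here (reject anything containing conversational filler or correction markers) is an assumption for illustration, not the repo's actual detector:

```python
# Sketch of the pipeline's routing logic.
# FILLER is a hypothetical marker set; the real detector may be stricter.
FILLER = {"okay", "so", "um", "uh", "wait", "actually", "no"}

def is_clean(transcript: str) -> bool:
    """Treat a transcript as clean protocol if it has no filler/corrections."""
    return not any(w in FILLER for w in transcript.lower().split())

def route(transcript: str, normalize, process):
    # Clean input bypasses the LLM entirely; messy input is normalized first.
    if is_clean(transcript):
        return process(transcript)
    return process(normalize(transcript))
```

The design point is that the LLM sits behind a gate: clean dictation never pays the LLM's latency or hallucination risk.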

Procedural processor — a deterministic token scanner with a symbol vocabulary, number words, and casing directives. 93% on clean input, zero hallucination, effectively instant.
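The number-word part of the scanner can be sketched as follows; the vocabulary and combination rule here are assumptions for illustration, not the repo's actual tables:

```python
# Sketch of number-word handling inside the deterministic processor.
# Assumed vocabulary: zero..nineteen, the tens, and "hundred".
UNITS = ["zero", "one", "two", "three", "four", "five", "six", "seven",
         "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}

def words_to_number(words: list[str]) -> int:
    value = 0
    for w in words:
        if w in TENS:
            value += TENS[w]
        elif w in UNITS:
            value += UNITS.index(w)
        elif w == "hundred":
            value = max(value, 1) * 100  # "hundred" scales what came before
    return value

print(words_to_number(["four", "hundred", "four"]))
# -> 404
```

Because every rule is a table lookup, this stage cannot hallucinate: unknown words simply fall through, which is where the 0% fuzzy score in the results table comes from.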

LLM normalizer — rewrites messy dictation into clean protocol format. Strips filler, resolves corrections, inserts spacing keywords. The LLM never outputs actual symbols — it only outputs protocol words.
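That contract (words in, words out, never symbols) lives in the prompt. A hypothetical sketch of how such a prompt could be assembled; the actual prompt is in `pipeline/normalizer.py` and may read differently:

```python
# Hypothetical system prompt illustrating the normalizer's contract:
# the model may only emit protocol WORDS, never literal symbols.
SYSTEM = (
    "Rewrite the user's dictation as clean protocol words. "
    "Strip conversational filler, resolve spoken corrections, and insert "
    "'space' between tokens. Use symbol words (dash, dot, slash, pipe, "
    "colon), never the symbols themselves."
)

def build_messages(transcript: str) -> list[dict]:
    return [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": transcript},
    ]

# e.g. "okay so um git push dash u" should normalize to
#      "git space push space dash u" (then the processor emits symbols)
```

Keeping symbol emission out of the LLM's output space is what keeps hallucination contained: even a wrong normalization can only produce protocol words the deterministic processor knows how to reject or render.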

Structure

processor/          Deterministic symbol/number/casing processor
pipeline/           LLM + processor pipeline (zero-training normalizer)
eval/               Evaluation datasets (fuzzy + independent)
training/
  data/             Training data (syntax-reconstruction, dictation-to-bash)
  converters/       Scripts to generate training data from NL2Bash
  adapters/         Fine-tuned model adapters (LoRA/DoRA)
scripts/            Evaluation and benchmarking scripts
blog/               Writeup drafts and notes

Quick start

# Run the procedural processor on clean protocol input
python3 processor/procedural.py eval/independent.json

# Run the normalizer pipeline (requires mlx-lm)
pip install mlx mlx-lm
python3 pipeline/normalizer.py eval/fuzzy.json --model mlx-community/Qwen2.5-1.5B-Instruct-4bit

Results (zero-training, prompted only)

| Model          | Clean | Fuzzy | Natural | Chaotic | Overall |
|----------------|-------|-------|---------|---------|---------|
| Processor only | 92%   | 0%    | 0%      | 2%      | 23.5%   |
| Qwen 2.5 1.5B  | 90%   | 20%   | 54%     | 24%     | 47%     |
| Qwen 2.5 0.5B  | 90%   | 12%   | 44%     | 20%     | 41.5%   |
| Llama 3.2 1B   | 92%   | 14%   | 34%     | 10%     | 37.5%   |

Protocol format

The "space-as-a-word" protocol eliminates spacing ambiguity:

  • "space" β†’ literal space between tokens
  • Symbol words: dash dot slash pipe colon quote etc.
  • Casing: camel case, snake case, pascal case, kebab case
  • Numbers: zero through nineteen, twenty...ninety, hundred, thousand
  • Capitalization: capital X, all caps WORD
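The casing directives can be sketched as a function from a style plus the words it governs to one rendered identifier. This simplified version assumes the directive applies to the whole word list; the real processor's scoping rules may differ:

```python
# Sketch of casing directives, a simplified stand-in for the processor's rules.
def apply_casing(style: str, words: list[str]) -> str:
    if style == "snake case":
        return "_".join(words)
    if style == "kebab case":
        return "-".join(words)
    if style == "camel case":
        return words[0] + "".join(w.capitalize() for w in words[1:])
    if style == "pascal case":
        return "".join(w.capitalize() for w in words)
    return " ".join(words)  # unknown style: leave words as spoken

print(apply_casing("camel case", ["user", "name"]))
# -> userName
```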