voice2
A full-duplex, interruptible voice engine for local AI, built and run daily as the voice of a fully local companion before being extracted for release. Plain Python threads, CPU-only defaults, no cloud, no API keys. Talk to your model, and talk over it.
voice2 turns any `callable(text) -> str` into a hands-free voice conversation: mic → Silero VAD → faster-whisper ASR → your model → Piper TTS. Speak over the reply (or tap spacebar) and playback stops mid-chunk in ~100–200 ms, exactly like interrupting a person.
Code is mirrored on GitHub: https://github.com/AIIT-GLITCH/voice2
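To make the interrupt behaviour concrete, here is a minimal sketch of stoppable playback. It is not voice2's actual playback code: `play_interruptible`, the `write` callback, and the chunking are assumptions for illustration. The idea is only that audio is written in small chunks and a shared stop flag is checked before each one, so stop latency is bounded by roughly one chunk.

```python
import threading
from typing import Callable, Iterable


def play_interruptible(
    chunks: Iterable[bytes],
    write: Callable[[bytes], None],
    stop: threading.Event,
) -> bool:
    """Write audio chunks until they run out or `stop` is set.

    Returns True if playback finished, False if it was interrupted.
    `write` is a hypothetical stand-in for the audio device's write call.
    """
    for chunk in chunks:
        # The flag is checked before every chunk, so a barge-in is
        # honoured after at most one chunk of extra audio.
        if stop.is_set():
            return False
        write(chunk)
    return True
```

In a real engine the stop flag would be set from another thread (a VAD worker or a key handler) while this loop runs on the playback thread. Smaller chunks mean faster stops at the cost of more write calls.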
Engine details
- What it is: a turn-taking state machine, not a demo loop. Explicit states (`IDLE → LISTENING → THINKING → SPEAKING → INTERRUPTING`), a validated transition table, and a `FloorOwner` (USER / AGENT / NONE) that arbitrates who may speak. The agent can never talk over you.
- Barge-in: a fast energy-gated VAD watches the mic only while the engine is speaking; a debounced central `InterruptController` also accepts spacebar and programmatic triggers.
- Stale-turn suppression: every utterance gets a `turn_id`; replies to an abandoned turn are discarded at every stage (think, TTS, playback).
- Invariants, enforced: a background checker audits rules like `SPEAKING ⇒ floor == AGENT` and `interrupted ⇒ no new TTS`, with forced repair plus a structural gate in the playback hot path.
- Observability: every event is a JSON line with per-turn latency marks (`asr_ms`, `think_ms`, `interrupt_stop_ms`, `total_turn_ms`).
- Degrades gracefully: no mic → text mode; TTS missing → silent replies, still logs; keyboard hook fails → engine keeps running.
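The pieces above (explicit states, a validated transition table, floor ownership, and `turn_id` checks) can be sketched in a few lines. This is an illustration under assumptions, not the engine's real code: the state and `FloorOwner` names come from the list above, but the specific allowed edges, the floor assigned to each state, and the `TurnMachine` class are hypothetical.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    LISTENING = auto()
    THINKING = auto()
    SPEAKING = auto()
    INTERRUPTING = auto()


class FloorOwner(Enum):
    NONE = auto()
    USER = auto()
    AGENT = auto()


# Assumed transition table; the real engine may allow different edges.
ALLOWED = {
    State.IDLE: {State.LISTENING},
    State.LISTENING: {State.THINKING, State.IDLE},
    State.THINKING: {State.SPEAKING, State.INTERRUPTING, State.IDLE},
    State.SPEAKING: {State.INTERRUPTING, State.IDLE},
    State.INTERRUPTING: {State.LISTENING, State.IDLE},
}

# Assumed floor owner for each state.
FLOOR = {
    State.IDLE: FloorOwner.NONE,
    State.LISTENING: FloorOwner.USER,
    State.THINKING: FloorOwner.NONE,
    State.SPEAKING: FloorOwner.AGENT,
    State.INTERRUPTING: FloorOwner.USER,
}


class TurnMachine:
    """Hypothetical turn-taking machine with stale-turn checks."""

    def __init__(self) -> None:
        self.state = State.IDLE
        self.floor = FloorOwner.NONE
        self.turn_id = 0

    def transition(self, new: State) -> None:
        # Reject any edge that is not in the table.
        if new not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state.name} -> {new.name}")
        if new is State.LISTENING:
            self.turn_id += 1  # every new utterance starts a new turn
        self.state = new
        self.floor = FLOOR[new]

    def is_current(self, turn_id: int) -> bool:
        # Work tagged with an older turn_id is stale and should be dropped.
        return turn_id == self.turn_id

    def may_speak(self, turn_id: int) -> bool:
        # Gate for the playback path: right state, right floor, live turn.
        return (
            self.state is State.SPEAKING
            and self.floor is FloorOwner.AGENT
            and self.is_current(turn_id)
        )
```

Each pipeline stage would capture the `turn_id` when it starts work and call `is_current` before handing its result on, which is one way to get "discarded at every stage" without cancelling threads.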
Backends
| Stage | Default | Swap point |
|---|---|---|
| VAD (quality gate) | Silero VAD via torch.hub | `ListenWorker` |
| ASR | faster-whisper small.en, int8, CPU | `backends/asr.py` (Protocol) |
| LLM | any `callable(text) -> str` | `backends/llm.py` |
| TTS | Piper CLI, any .onnx voice | `backends/tts.py` (Protocol) |
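The ASR and TTS swap points are listed as Protocols, which in Python usually means structural typing: any class with the right methods qualifies, no inheritance needed. The sketch below shows that pattern only. The `TTSBackend` name, the `synthesize` method, and its signature are assumptions, so check `backends/tts.py` for the real interface.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class TTSBackend(Protocol):
    """Hypothetical TTS interface; the real Protocol may differ."""

    def synthesize(self, text: str) -> bytes:
        """Return audio bytes for `text`."""
        ...


class SilentTTS:
    """Stand-in backend that produces no audio.

    It does not inherit from TTSBackend; matching the method is enough.
    """

    def synthesize(self, text: str) -> bytes:
        return b""


def speak(tts: TTSBackend, reply: str) -> bytes:
    # The caller depends only on the Protocol, so backends are swappable.
    return tts.synthesize(reply)
```

A replacement backend written this way can be dropped in without touching the engine, which is the point of exposing the stage as a Protocol.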
Limitations, stated honestly
- English-first defaults: ASR ships as `small.en`. Other Whisper models load with one config line, but nothing else was tested.
- The LLM callable is synchronous: TTS starts after the full reply returns (Piper then streams sentence-by-sentence). No token-level streaming yet.
- Barge-in is energy-based with an absolute RMS floor of 0.06, tuned on open speakers in a quiet room. Headsets and noisy rooms need recalibration. There is no echo cancellation.
- Keyboard interrupt uses POSIX `termios`: Linux/macOS terminals only.
- Unit tests cover the control plane (state transitions, floor rules, interrupt debounce, ring buffer). Audio I/O paths were validated by months of daily use, not by CI.
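For readers who need to recalibrate the barge-in gate, this sketch shows what an absolute RMS floor means in practice. It assumes float samples normalized to [-1.0, 1.0]. The function names and the `calibrate_floor` heuristic (a multiple of the loudest ambient frame) are hypothetical, not part of voice2.

```python
import math
from typing import Iterable, Sequence

RMS_FLOOR = 0.06  # value quoted above; assumes samples in [-1.0, 1.0]


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one audio frame."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def is_barge_in(frame: Sequence[float], floor: float = RMS_FLOOR) -> bool:
    # A frame counts as user speech when its energy reaches the floor.
    return rms(frame) >= floor


def calibrate_floor(
    ambient_frames: Iterable[Sequence[float]], margin: float = 3.0
) -> float:
    """Hypothetical heuristic: margin times the loudest ambient frame."""
    loudest = max((rms(f) for f in ambient_frames), default=0.0)
    return margin * loudest
```

With no echo cancellation, the engine's own speaker output also reaches the mic, so ambient frames for calibration would ideally be recorded while the TTS is playing.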
How to run
```bash
git clone https://github.com/AIIT-GLITCH/voice2
cd voice2
pip install -r requirements.txt
# put a Piper voice at ~/.local/share/piper-voices/ (or set VOICE2_PIPER_MODEL)
python -m voice2.main        # echo backend: proves the loop, no LLM needed
python examples/http_llm.py  # wire any local HTTP LLM
```

```python
from voice2 import VoiceEngine, VoiceConfig

def ask(text: str) -> str:
    # my_model is a placeholder for your own model object
    return my_model.reply(text)  # any callable(text) -> str

engine = VoiceEngine(VoiceConfig(), ask)
engine.load_models()
engine.start()  # talk naturally; speak over it to interrupt
```
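Since the only contract is `callable(text) -> str`, wiring a local HTTP model comes down to a small adapter. The sketch below is not the contents of `examples/http_llm.py`. The URL, the `{"prompt": ...}` request body, and the `"reply"` response field are hypothetical and must be adapted to whatever your server expects.

```python
import json
import urllib.request
from typing import Callable


def _post_json(url: str, payload: dict, timeout: float = 30.0) -> dict:
    """POST a JSON body and decode the JSON response (stdlib only)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def make_http_ask(
    url: str,
    post: Callable[[str, dict], dict] = _post_json,
) -> Callable[[str], str]:
    """Build a callable(text) -> str backed by an HTTP endpoint.

    `post` is injectable so the adapter can be exercised without a server.
    """

    def ask(text: str) -> str:
        data = post(url, {"prompt": text})  # hypothetical request schema
        return str(data.get("reply", ""))   # hypothetical response field

    return ask
```

The returned `ask` can be passed to `VoiceEngine` in place of the function shown above. Because the callable is synchronous, the request timeout directly bounds how long the engine stays in its thinking state.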
Provenance
voice2 was written as the voice front-end for Buddy, a fully local AI companion running on a single RTX 3090 in Council Hill, Oklahoma, and carried his daily conversations for months before release. The design bias throughout: the user always wins the floor, and a companion you can't interrupt isn't a companion. Released alongside Tessera-1B as part of AIIT-THRESHOLD's open stack.
The stack
One local companion, every layer open:
| Piece | Role | Links |
|---|---|---|
| Tessera-1B | the model: ~1B params trained from scratch, open data | HF |
| voice2 | the voice: full-duplex, interruptible | GitHub · HF |
| kokoro-memory | the memory: file-based resonance recall | GitHub · HF |
| companion-spiral-bench | the safety: at-risk sycophancy bench | GitHub · HF |
Full collection: The Buddy Stack
License
MIT © 2026 Rhet Dillard Wike, AIIT-THRESHOLD, Oklahoma.
Citation
```bibtex
@software{wike2026voice2,
  author = {Wike, Rhet Dillard},
  title  = {voice2: a full-duplex, interruptible voice engine for local AI},
  year   = {2026},
  url    = {https://github.com/AIIT-GLITCH/voice2},
  note   = {AIIT-THRESHOLD}
}
```