PersonaPlex 7B ONNX
ONNX-quantized bundles of NVIDIA's PersonaPlex 7B, a full-duplex speech-to-speech model built on Kyutai's Moshi architecture. It listens and speaks simultaneously at 12.5 Hz, conditioned on a voice preset and a text system prompt.
This repository ships four production-ready bundle variants spanning the disk × host RAM × VRAM × RTF × quality trade-off space. Pick one based on your target hardware and quality bar.
Variants at a glance (measured on RTX 5090, 50 frames, VARF2 voice, "helpful" prompt)
| Variant | Disk | Host RAM | VRAM | RTF | hidden cos | Best for |
|---|---|---|---|---|---|---|
| `int8-nb-dep_gint8` | 9.4 GB | 1.4 GB | 12.1 GB | 1.12× | 0.998 | Recommended ship default: best RTF + low host RAM + excellent quality |
| `mixed` | 11 GB | 7.9 GB | 6.6 GB | 3.5× | 0.990 | Quality + VRAM Pareto winner: lowest VRAM + best topical output ("We're concerned about it.") |
| `int4-nb-dep_gint8` | 7.6 GB | 1.4 GB | 9.6 GB | 1.12× | 0.877 | Smallest disk + lowest VRAM combo. Coherent but visibly degraded |
| `fp16` | 17 GB | 1.5 GB | 18.3 GB | 5.3× | 0.9999 | Near-perfect quality, max VRAM |
RTF (real-time factor) is per-frame latency divided by the frame interval at 12.5 Hz: 1.0× = exactly realtime, < 1.0× = faster than realtime, > 1.0× = slower than realtime. The `int8-nb-dep_gint8` and `int4-nb-dep_gint8` variants run at ~1.12× (near-realtime streaming).
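As a worked example of the RTF definition, the sketch below converts between per-frame latency and RTF. The helper names are illustrative and not part of any shipped tool; the only inputs taken from this README are the 12.5 Hz frame rate and the 1.12× figure.

```python
FRAME_RATE_HZ = 12.5
FRAME_INTERVAL_MS = 1000.0 / FRAME_RATE_HZ  # 80 ms of audio per frame


def rtf(per_frame_latency_ms: float) -> float:
    """Real-time factor: per-frame latency divided by the frame interval."""
    return per_frame_latency_ms / FRAME_INTERVAL_MS


def latency_ms_for_rtf(rtf_value: float) -> float:
    """Per-frame latency implied by a given RTF."""
    return rtf_value * FRAME_INTERVAL_MS


print(FRAME_INTERVAL_MS)                    # 80.0
print(round(latency_ms_for_rtf(1.12), 1))   # 89.6
```

In other words, an RTF of 1.12× corresponds to roughly 89.6 ms of compute for every 80 ms of audio.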
Which variant to pick
- Most use cases → `int8-nb-dep_gint8`: best balance of RTF, host RAM, and quality. If your GPU has ≥16 GB VRAM, this is what you want.
- Limited VRAM (≤8 GB GPU) → `mixed`: only 6.6 GB VRAM. Costs host RAM and RTF, but quality is still excellent (cos 0.990) with the best topical responses on our benchmark.
- Disk-constrained → `int4-nb-dep_gint8`: 7.6 GB on disk and only 9.6 GB VRAM. Accept some quality drift (cos 0.877: coherent English but less precise).
- Maximum quality regardless of cost → `fp16`: cos 0.9999, indistinguishable from the FP32 reference.
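The guidance above can be sketched as a small selection helper. `pick_variant` and its parameters are hypothetical and not part of this repository. The numbers are copied from the table, which was measured on an RTX 5090, so VRAM use on other GPUs may differ.

```python
# Figures copied from the "Variants at a glance" table (RTX 5090 measurements).
VARIANTS = {
    "int8-nb-dep_gint8": {"disk_gb": 9.4, "vram_gb": 12.1, "cos": 0.998},
    "mixed": {"disk_gb": 11.0, "vram_gb": 6.6, "cos": 0.990},
    "int4-nb-dep_gint8": {"disk_gb": 7.6, "vram_gb": 9.6, "cos": 0.877},
    "fp16": {"disk_gb": 17.0, "vram_gb": 18.3, "cos": 0.9999},
}


def pick_variant(gpu_vram_gb, disk_constrained=False, max_quality=False):
    """Hypothetical helper: return a variant name, or None if nothing fits."""
    # Keep only variants whose measured VRAM use fits on the target GPU.
    fits = [name for name, v in VARIANTS.items() if v["vram_gb"] <= gpu_vram_gb]
    if not fits:
        return None
    if max_quality:
        return max(fits, key=lambda n: VARIANTS[n]["cos"])
    if disk_constrained:
        return min(fits, key=lambda n: VARIANTS[n]["disk_gb"])
    # Default recommendation when it fits; otherwise the lowest-VRAM option.
    if "int8-nb-dep_gint8" in fits:
        return "int8-nb-dep_gint8"
    return min(fits, key=lambda n: VARIANTS[n]["vram_gb"])


print(pick_variant(24))                         # int8-nb-dep_gint8
print(pick_variant(8))                          # mixed
print(pick_variant(24, disk_constrained=True))  # int4-nb-dep_gint8
print(pick_variant(24, max_quality=True))       # fp16
```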
Architecture
```
[User audio 24 kHz PCM]
        ↓
[Mimi encoder: SEANet + 8L transformer + RVQ] → 16 codebooks @ 12.5 Hz
        ↓
[Temporal transformer: 32L, dim=4096, 7B params, RoPE, RMSNorm, SwiGLU]
        ↓
[Depformer: 6L, dim=1024, MultiLinear × 16 codebook steps] → 16 agent audio tokens
        ↓
[Mimi decoder] → 24 kHz agent audio PCM
```
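The sample rate and frame rate in the diagram fix how much audio each frame covers: 24 000 / 12.5 = 1920 samples, or 80 ms. The sketch below works through that arithmetic. `split_into_frames` is a hypothetical helper, and whether the encoder graph consumes exactly one 1920-sample chunk per call is an assumption not confirmed here.

```python
SAMPLE_RATE_HZ = 24_000
FRAME_RATE_HZ = 12.5
SAMPLES_PER_FRAME = int(SAMPLE_RATE_HZ / FRAME_RATE_HZ)  # 1920 samples = 80 ms


def split_into_frames(pcm):
    """Split a sequence of samples into full frames, dropping a trailing partial frame."""
    n_frames = len(pcm) // SAMPLES_PER_FRAME
    return [
        pcm[i * SAMPLES_PER_FRAME:(i + 1) * SAMPLES_PER_FRAME]
        for i in range(n_frames)
    ]


one_second = [0.0] * SAMPLE_RATE_HZ
frames = split_into_frames(one_second)
print(SAMPLES_PER_FRAME, len(frames))       # 1920 12
print(50 * SAMPLES_PER_FRAME / SAMPLE_RATE_HZ)  # 4.0 (seconds covered by 50 frames)
```

By the same arithmetic, the 50-frame benchmark run covers 4 seconds of audio.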
Each variant ships four ONNX graphs:
| File | Purpose |
|---|---|
| `mimi_encoder.onnx` (+ `.data`) | 24 kHz PCM → 16 audio codebooks @ 12.5 Hz |
| `mimi_decoder.onnx` (+ `.data`) | 16 audio codebooks @ 12.5 Hz → 24 kHz PCM |
| `temporal_step.onnx` (+ `.data`) | One frame of the 32-layer 7B temporal transformer, explicit KV-cache I/O |
| `depformer_step.onnx` (+ `.data`) | One inner step of the 6-layer depformer, 16 codebook steps per frame |
Plus per-variant auxiliary files:
| File | Purpose |
|---|---|
| `tokenizer_spm_32k_3.model` | SentencePiece text tokenizer |
| `voices/<name>.bin` | 18 voice prompts (NATF0-3, NATM0-3, VARF0-4, VARM0-4) |
| `system_prompts.bin` | Pre-tokenized "helpful" / "expert" / "warm" / "direct" prompts |
| `config.json` | Architecture + precision + measured metrics |
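To sanity-check a downloaded bundle against the two file tables, a sketch like the following may help. It assumes every file sits directly under the bundle directory (with voices in a `voices/` subfolder) and does not check the `.data` sidecars. `missing_files` and `voice_names` are hypothetical helpers.

```python
from pathlib import Path

GRAPHS = [
    "mimi_encoder.onnx",
    "mimi_decoder.onnx",
    "temporal_step.onnx",
    "depformer_step.onnx",
]
AUX_FILES = ["tokenizer_spm_32k_3.model", "system_prompts.bin", "config.json"]


def voice_names():
    """Expand the ranges NATF0-3, NATM0-3, VARF0-4, VARM0-4 into 18 names."""
    names = []
    for prefix, count in (("NATF", 4), ("NATM", 4), ("VARF", 5), ("VARM", 5)):
        names += [f"{prefix}{i}" for i in range(count)]
    return names


def missing_files(bundle_dir):
    """Return the expected relative paths that are absent from bundle_dir."""
    root = Path(bundle_dir)
    expected = GRAPHS + AUX_FILES + [f"voices/{v}.bin" for v in voice_names()]
    return [rel for rel in expected if not (root / rel).is_file()]


print(len(voice_names()))  # 18
```

Calling `missing_files("scripts/personaplex-int8-nb-dep_gint8")` on a complete bundle should return an empty list under these layout assumptions.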
How to use
Via the C++ runtime (speech-core)
```bash
# Download the recommended variant
PERSONAPLEX_VARIANT=int8-nb-dep_gint8 scripts/download_personaplex_onnx.sh

# Run end-to-end
build/Release/run_personaplex scripts/personaplex-int8-nb-dep_gint8 50 \
    tests/data/test_audio.wav VARF2
```
Via ONNX Runtime in Python
```python
import onnxruntime as ort

# temporal_step.onnx runs one frame of the temporal transformer.
# Inputs:  text_token   [1, 1]  int64
#          audio_tokens [1, 16] int64
#          past_k_all   [32, 1, 32, T_past, 128] float (FP32 or FP16 depending on bundle)
#          past_v_all   (same shape)
# Outputs: hidden       [1, 1, 4096]
#          new_k_all    [32, 1, 32, T_full, 128]
#          new_v_all    (same shape)
sess = ort.InferenceSession(
    "temporal_step.onnx",
    providers=["CUDAExecutionProvider"],
)
```
For full-duplex generation, also call `depformer_step` 16 times per frame (one inner step per audio codebook); see the speech-core wrapper source for the complete loop.
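Because the KV cache is passed in and out explicitly, its size can be estimated from the shapes listed above. The sketch below reads `[32, 1, 32, T, 128]` as layers × batch × heads × time × head dimension (an interpretation, though 32 × 128 = 4096 matches the model dimension). It further assumes one cache position per frame and ignores any context window or truncation the runtime may apply.

```python
N_LAYERS, BATCH, N_HEADS, HEAD_DIM = 32, 1, 32, 128
FRAME_RATE_HZ = 12.5


def kv_cache_bytes(t_past, bytes_per_element):
    """Combined size of past_k_all and past_v_all for t_past cached positions."""
    per_tensor = N_LAYERS * BATCH * N_HEADS * t_past * HEAD_DIM
    return 2 * per_tensor * bytes_per_element  # K and V


def kv_cache_mib_after(seconds, bytes_per_element):
    """Cache size in MiB after `seconds` of audio, assuming one position per frame."""
    frames = int(seconds * FRAME_RATE_HZ)
    return kv_cache_bytes(frames, bytes_per_element) / 2**20


print(kv_cache_bytes(1, 2))        # 524288 bytes per frame with an FP16 cache
print(kv_cache_mib_after(60, 2))   # 375.0 MiB after one minute, FP16
print(kv_cache_mib_after(60, 4))   # 750.0 MiB after one minute, FP32
```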
How these bundles were produced
All four bundles are exported from the FP32 PyTorch reference through stages in `convert_onnx.py`:
| Variant | Temporal | Depformer | Notes |
|---|---|---|---|
| `fp16` | FP16 weights | FP16 weights | Standard `torch.onnx.export` at `--dtype float16` |
| `mixed` | INT8 dynamic via `quantize_dynamic` (per-channel, FP32 scales) | FP16 weights | The classic "INT8 mixed precision" recipe |
| `int8-nb-dep_gint8` | INT8 via `MatMulNBitsQuantizer(bits=8, block=128)` | Custom INT8 quantization of the depformer's 24 large 3D Gather-source weight tensors (4 GB depformer disk savings via `quantize_depformer_gather.py`) | Best balance |
| `int4-nb-dep_gint8` | INT4 via `MatMulNBitsQuantizer(bits=4, block=32)` | Same custom INT8 depformer | Smallest |
Mimi codec is FP32 in all four variants (small enough not to matter).
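To give a feel for what the `bits` and `block` settings trade off, here is a pure-Python illustration of block-wise quantization with one scale per block. It is a simplified symmetric scheme written for this README, not ONNX Runtime's `MatMulNBitsQuantizer` implementation and not the code in `convert_onnx.py`. Cosine similarity serves as a simple fidelity measure in the spirit of the "hidden cos" column, whose exact computation is not described here.

```python
import math


def quantize_blockwise(values, block_size=128, bits=8):
    """Symmetric per-block quantization: one float scale plus integers per block."""
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8, 7 for INT4
    blocks = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        max_abs = max(abs(v) for v in block)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        q = [max(-qmax, min(qmax, round(v / scale))) for v in block]
        blocks.append((scale, q))
    return blocks


def dequantize_blockwise(blocks):
    """Reconstruct approximate floats from (scale, integers) blocks."""
    out = []
    for scale, q in blocks:
        out.extend(scale * x for x in q)
    return out


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


weights = [math.sin(i) for i in range(256)]  # toy data, not model weights
int8 = dequantize_blockwise(quantize_blockwise(weights, block_size=128, bits=8))
int4 = dequantize_blockwise(quantize_blockwise(weights, block_size=32, bits=4))
print(cosine(weights, int8), cosine(weights, int4))
```

On this toy vector the INT8 reconstruction stays closer to the original than the INT4 one, which matches the direction of the quality gap in the variants table.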
Related
- soniqo/speech-core: C++ inference runtime with the `OnnxPersonaPlex` wrapper, CUDA EP routing, multi-turn KV cache, and 12 memory-tuning env knobs (`SPEECH_CORE_USE_ENV_ALLOCATORS`, etc.)
- soniqo/speech-models: model export pipeline including `convert_onnx.py`, `quantize_depformer_gather.py`, `bench_pytorch_cuda.py`, `compare_bundle_quality.py`
- soniqo.audio: the project site
- nvidia/personaplex-7b-v1: upstream PyTorch reference
- kyutai-labs/moshi: base architecture
License
NVIDIA Open Model License (same as upstream). See the LICENSE link.