PersonaPlex 7B ONNX

ONNX-quantized bundles of NVIDIA's PersonaPlex 7B, a full-duplex speech-to-speech model built on Kyutai's Moshi architecture. The model listens and speaks simultaneously at 12.5 Hz, conditioned on a voice preset and a text system prompt.

This repository ships four production-ready bundle variants spanning the disk × host RAM × VRAM × RTF × quality trade-off space. Pick one based on your target hardware and quality bar.

Variants at a glance (measured on RTX 5090, 50 frames, VARF2 voice, "helpful" prompt)

Variant | Disk | Host RAM | VRAM | RTF | Hidden cos | Best for
int8-nb-dep_gint8 ⭐ | 9.4 GB | 1.4 GB | 12.1 GB | 1.12× | 0.998 | Recommended ship default: best RTF, low host RAM, excellent quality
mixed | 11 GB | 7.9 GB | 6.6 GB | 3.5× | 0.990 | Quality + VRAM Pareto winner: lowest VRAM and best topical output ("We're concerned about it.")
int4-nb-dep_gint8 | 7.6 GB | 1.4 GB | 9.6 GB | 1.12× | 0.877 | Smallest disk with the second-lowest VRAM. Coherent but visibly degraded
fp16 | 17 GB | 1.5 GB | 18.3 GB | 5.3× | 0.9999 | Near-perfect quality, max VRAM

RTF (real-time factor) is per-frame latency divided by the frame interval (80 ms at 12.5 Hz): 1.0× is exactly realtime, and below 1.0× is faster than realtime. The int8-nb-dep_gint8 and int4-nb-dep_gint8 variants run at ~1.12×, slightly slower than realtime but close enough for near-realtime streaming.
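As a quick check of the definition above, the following sketch converts between per-frame latency and RTF. The helper names are illustrative and not part of the repository; only the 12.5 Hz frame rate comes from this README.

```python
FRAME_RATE_HZ = 12.5                     # frame rate stated in this README
FRAME_INTERVAL_S = 1.0 / FRAME_RATE_HZ   # 0.08 s, i.e. 80 ms per frame


def rtf(per_frame_latency_s: float) -> float:
    """Real-time factor: per-frame latency divided by the frame interval."""
    return per_frame_latency_s / FRAME_INTERVAL_S


def latency_for_rtf(target_rtf: float) -> float:
    """Per-frame latency in seconds that corresponds to a given RTF."""
    return target_rtf * FRAME_INTERVAL_S


if __name__ == "__main__":
    # An RTF of 1.12 means about 89.6 ms of compute for each 80 ms frame.
    print(f"{latency_for_rtf(1.12) * 1000:.1f} ms per frame")
```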

Which variant to pick

  • Most use cases → int8-nb-dep_gint8: best balance of RTF, host RAM, and quality. If your GPU has ≥16 GB VRAM, this is what you want.
  • Limited VRAM (≤8 GB GPU) → mixed: only 6.6 GB VRAM. Costs host RAM and RTF, but quality is still excellent (cos 0.990) with the best topical responses on our benchmark.
  • Disk-constrained → int4-nb-dep_gint8: 7.6 GB on disk and only 9.6 GB VRAM. Accept some quality drift (cos 0.877: coherent English but less precise).
  • Maximum quality regardless of cost → fp16: cos 0.9999, indistinguishable from the FP32 reference.
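
The guidance above can also be expressed as a small lookup. This is an illustrative sketch: `pick_variant` and its default quality threshold are assumptions layered on the measured figures from the table, not a tool shipped with this repository. The VRAM figures are the RTX 5090 measurements and include no headroom.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Variant:
    name: str
    disk_gb: float
    host_ram_gb: float
    vram_gb: float
    rtf: float
    hidden_cos: float


# Figures copied from the "Variants at a glance" table.
VARIANTS = [
    Variant("int8-nb-dep_gint8", 9.4, 1.4, 12.1, 1.12, 0.998),
    Variant("mixed", 11.0, 7.9, 6.6, 3.5, 0.990),
    Variant("int4-nb-dep_gint8", 7.6, 1.4, 9.6, 1.12, 0.877),
    Variant("fp16", 17.0, 1.5, 18.3, 5.3, 0.9999),
]


def pick_variant(vram_gb: float,
                 disk_gb: float = float("inf"),
                 min_cos: float = 0.99) -> Optional[Variant]:
    """Return the fastest variant that fits the budgets and meets min_cos.

    Ties on RTF are broken in favour of the higher hidden cos.
    Returns None when nothing fits.
    """
    fits = [v for v in VARIANTS
            if v.vram_gb <= vram_gb
            and v.disk_gb <= disk_gb
            and v.hidden_cos >= min_cos]
    if not fits:
        return None
    return min(fits, key=lambda v: (v.rtf, -v.hidden_cos))


if __name__ == "__main__":
    for budget in (8, 16, 24):
        choice = pick_variant(vram_gb=budget)
        print(budget, "GB VRAM ->", choice.name if choice else "no fit")
```

Ranking by RTF first mirrors the recommendation order in the list; a deployment that values quality over latency could sort on `hidden_cos` instead.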

Architecture

[User audio 24 kHz PCM]
        ↓
[Mimi encoder: SEANet + 8L transformer + RVQ] → 16 codebooks @ 12.5 Hz
        ↓
[Temporal transformer: 32L, dim=4096, 7B params, RoPE, RMSNorm, SwiGLU]
        ↓
[Depformer: 6L, dim=1024, MultiLinear × 16 codebook steps] → 16 agent audio tokens
        ↓
[Mimi decoder] → 24 kHz agent audio PCM
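
The rates in the diagram imply a few fixed per-frame sizes. The short calculation below derives them. The constant and function names are illustrative, and it assumes each 12.5 Hz frame covers a non-overlapping window of 24 kHz audio.

```python
SAMPLE_RATE_HZ = 24_000   # PCM sample rate from the diagram
FRAME_RATE_HZ = 12.5      # codec / transformer frame rate
NUM_CODEBOOKS = 16        # audio codebooks per frame

# 24000 / 12.5 = 1920 PCM samples per frame.
samples_per_frame = int(SAMPLE_RATE_HZ / FRAME_RATE_HZ)

# 1000 / 12.5 = 80 ms of audio per frame.
frame_ms = 1000 / FRAME_RATE_HZ

# 16 codebooks * 12.5 frames/s = 200 audio tokens per second per stream.
audio_tokens_per_second = int(NUM_CODEBOOKS * FRAME_RATE_HZ)


def frames_for_duration(seconds: float) -> int:
    """Number of whole frames covering the given duration."""
    return int(seconds * FRAME_RATE_HZ)


if __name__ == "__main__":
    print(samples_per_frame, frame_ms, audio_tokens_per_second)
    # The 50-frame benchmark run corresponds to 4 seconds of audio.
    print(frames_for_duration(4.0))
```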

Each variant ships four ONNX graphs:

File | Purpose
mimi_encoder.onnx(+.data) | 24 kHz PCM → 16 audio codebooks @ 12.5 Hz
mimi_decoder.onnx(+.data) | 16 audio codebooks @ 12.5 Hz → 24 kHz PCM
temporal_step.onnx(+.data) | One frame of the 32-layer 7B temporal transformer, explicit KV-cache I/O
depformer_step.onnx(+.data) | One inner step of the 6-layer depformer, 16 codebook steps per frame

Plus per-variant auxiliary files:

File | Purpose
tokenizer_spm_32k_3.model | SentencePiece text tokenizer
voices/<name>.bin | 18 voice prompts (NATF0-3, NATM0-3, VARF0-4, VARM0-4)
system_prompts.bin | Pre-tokenized "helpful" / "expert" / "warm" / "direct" prompts
config.json | Architecture + precision + measured metrics
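
Before loading a bundle, it can help to confirm that the files listed above are present. The sketch below is a hypothetical helper (`missing_bundle_files` is not part of the repository). It assumes all files sit directly in the bundle directory with voices under `voices/`, and it does not check the `.data` sidecar files.

```python
from pathlib import Path

ONNX_GRAPHS = [
    "mimi_encoder.onnx",
    "mimi_decoder.onnx",
    "temporal_step.onnx",
    "depformer_step.onnx",
]
AUX_FILES = ["tokenizer_spm_32k_3.model", "system_prompts.bin", "config.json"]


def voice_names():
    """Expand NATF0-3, NATM0-3, VARF0-4, VARM0-4 into the 18 voice names."""
    names = []
    for prefix, count in (("NATF", 4), ("NATM", 4), ("VARF", 5), ("VARM", 5)):
        names.extend(f"{prefix}{i}" for i in range(count))
    return names


def missing_bundle_files(bundle_dir):
    """Return the relative paths of expected files that are not present."""
    root = Path(bundle_dir)
    expected = ONNX_GRAPHS + AUX_FILES + [f"voices/{v}.bin" for v in voice_names()]
    return [rel for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    import sys
    missing = missing_bundle_files(sys.argv[1] if len(sys.argv) > 1 else ".")
    print("bundle complete" if not missing else f"missing: {missing}")
```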

How to use

Via the C++ runtime (speech-core)

# Download the recommended variant
PERSONAPLEX_VARIANT=int8-nb-dep_gint8 scripts/download_personaplex_onnx.sh

# Run end-to-end
build/Release/run_personaplex scripts/personaplex-int8-nb-dep_gint8 50 \
    tests/data/test_audio.wav VARF2

Via ONNX Runtime in Python

import onnxruntime as ort

# Inputs:  text_token   [1, 1]  int64
#          audio_tokens [1, 16] int64
#          past_k_all   [32, 1, 32, T_past, 128] float (FP32 or FP16 depending on bundle)
#          past_v_all   (same shape)
# Outputs: hidden       [1, 1, 4096]
#          new_k_all    [32, 1, 32, T_full, 128]
#          new_v_all    (same shape)
sess = ort.InferenceSession("temporal_step.onnx",
                             providers=["CUDAExecutionProvider"])

# Print the graph's actual input names, shapes, and dtypes to confirm the layout above.
for inp in sess.get_inputs():
    print(inp.name, inp.shape, inp.type)

For full-duplex generation, also call depformer_step 16 times per frame (one inner step per audio codebook); see the speech-core wrapper source for the complete loop.
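
The sketch below shows one way to drive temporal_step.onnx frame by frame while carrying the KV cache forward. It rests on several assumptions that should be checked against the actual graph (for example with `sess.get_inputs()`): that the input names match the labels in the comments above, that an empty cache is expressed as `T_past = 0`, and that outputs come back in the order hidden, new_k_all, new_v_all. The helper names are hypothetical, and the depformer loop is left out.

```python
import numpy as np

# Dimensions taken from the shapes listed in the comments above.
NUM_LAYERS, NUM_HEADS, HEAD_DIM = 32, 32, 128
NUM_CODEBOOKS = 16


def empty_kv_cache(dtype=np.float32):
    """Return (past_k, past_v) with zero past frames (assumes T_past = 0 is accepted)."""
    shape = (NUM_LAYERS, 1, NUM_HEADS, 0, HEAD_DIM)
    return np.zeros(shape, dtype=dtype), np.zeros(shape, dtype=dtype)


def kv_cache_bytes(num_frames, dtype=np.float32):
    """Size in bytes of the K and V caches together after num_frames frames."""
    per_frame = 2 * NUM_LAYERS * NUM_HEADS * HEAD_DIM
    return per_frame * num_frames * np.dtype(dtype).itemsize


def temporal_step(sess, text_token, audio_tokens, past_k, past_v):
    """Run one frame and return (hidden, new_k, new_v)."""
    feeds = {
        # Input names are assumed to match the comment labels.
        "text_token": np.asarray(text_token, dtype=np.int64).reshape(1, 1),
        "audio_tokens": np.asarray(audio_tokens, dtype=np.int64).reshape(1, NUM_CODEBOOKS),
        "past_k_all": past_k,
        "past_v_all": past_v,
    }
    # Passing None requests all outputs in graph order.
    hidden, new_k, new_v = sess.run(None, feeds)
    return hidden, new_k, new_v


def run_frames(sess, frames, dtype=np.float32):
    """Feed (text_token, audio_tokens) pairs in order, threading the cache through."""
    past_k, past_v = empty_kv_cache(dtype)
    hiddens = []
    for text_token, audio_tokens in frames:
        hidden, past_k, past_v = temporal_step(
            sess, text_token, audio_tokens, past_k, past_v)
        hiddens.append(hidden)
    return hiddens, past_k, past_v
```

With these shapes the cache grows by 512 KiB per frame in FP16 and 1 MiB per frame in FP32. FP16 bundles would presumably need `dtype=np.float16` for the cache tensors.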

How these bundles were produced

All four bundles are exported from the FP32 PyTorch reference via stages in convert_onnx.py:

Variant | Temporal | Depformer | Notes
fp16 | FP16 weights | FP16 weights | Standard torch.onnx.export at --dtype float16
mixed | INT8 dynamic via quantize_dynamic (per-channel, FP32 scales) | FP16 weights | The classic "INT8 mixed precision" recipe
int8-nb-dep_gint8 | INT8 via MatMulNBitsQuantizer(bits=8, block=128) | Custom INT8 quantization of the depformer's 24 large 3D Gather-source weight tensors (4 GB depformer disk savings via quantize_depformer_gather.py) | Best balance
int4-nb-dep_gint8 | INT4 via MatMulNBitsQuantizer(bits=4, block=32) | Same custom INT8 depformer | Smallest

The Mimi codec is FP32 in all four variants (it is small enough that quantizing it would not matter).
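
To make the `block` parameter in the table concrete, here is a minimal NumPy illustration of block-wise symmetric quantization, where each block of consecutive weights shares one scale. It is a generic sketch, not the algorithm implemented by `MatMulNBitsQuantizer` or quantize_depformer_gather.py, whose details (zero points, packing, rounding) may differ. The function names are made up for this example, and the cosine computed here compares weights, which is not the same measurement as the "hidden cos" column above.

```python
import numpy as np


def quantize_blockwise(weights, bits=8, block=128):
    """Symmetric per-block quantization of a 1-D float array.

    Returns (codes, scales): integer codes of shape (n_blocks, block)
    and one float32 scale per block.
    """
    w = np.asarray(weights, dtype=np.float32)
    if w.size % block != 0:
        raise ValueError("weight count must be a multiple of the block size")
    blocks = w.reshape(-1, block)
    qmax = 2 ** (bits - 1) - 1          # 127 for 8 bits, 7 for 4 bits
    max_abs = np.abs(blocks).max(axis=1, keepdims=True)
    # Use a scale of 1.0 for all-zero blocks to avoid dividing by zero.
    scales = np.where(max_abs > 0, max_abs / qmax, 1.0).astype(np.float32)
    codes = np.clip(np.round(blocks / scales), -qmax, qmax).astype(np.int8)
    return codes, scales


def dequantize_blockwise(codes, scales):
    """Reconstruct a flat float32 array from codes and per-block scales."""
    return (codes.astype(np.float32) * scales).reshape(-1)


def cosine(a, b):
    """Cosine similarity between two arrays, computed in float64."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal(4096).astype(np.float32)   # synthetic weights
    for bits, block in ((8, 128), (4, 32)):
        codes, scales = quantize_blockwise(w, bits=bits, block=block)
        restored = dequantize_blockwise(codes, scales)
        print(f"bits={bits} block={block} cosine={cosine(w, restored):.6f}")
```

Smaller blocks give each scale fewer outliers to cover, which is one common reason to pair a lower bit width with a smaller block size.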

Related

  • soniqo/speech-core: C++ inference runtime with the OnnxPersonaPlex wrapper, CUDA EP routing, multi-turn KV cache, and 12 memory-tuning env knobs (SPEECH_CORE_USE_ENV_ALLOCATORS, etc.)
  • soniqo/speech-models: model export pipeline including convert_onnx.py, quantize_depformer_gather.py, bench_pytorch_cuda.py, compare_bundle_quality.py
  • soniqo.audio: the project site
  • nvidia/personaplex-7b-v1: upstream PyTorch reference
  • kyutai-labs/moshi: base architecture

License

NVIDIA Open Model License (same as upstream). See the LICENSE link.
