PersonaPlex 7B ONNX

ONNX-quantized bundles of NVIDIA's PersonaPlex 7B, a full-duplex speech-to-speech model built on Kyutai's Moshi architecture. The model listens and speaks simultaneously at a 12.5 Hz frame rate, conditioned on a voice preset and a text system prompt.

This repository ships four production-ready bundle variants spanning the disk × host RAM × VRAM × RTF × quality trade-off space. Pick one based on your target hardware and quality bar.

Variants at a glance (measured on RTX 5090, 50 frames, VARF2 voice, "helpful" prompt)

Variant	Disk	Host RAM	VRAM	RTF	hidden cos	Best for
`int8-nb-dep_gint8` ⭐	9.4 GB	1.4 GB	12.1 GB	1.12×	0.998	Recommended ship default — best RTF + low host RAM + excellent quality
`mixed`	11 GB	7.9 GB	6.6 GB	3.5×	0.990	Quality + VRAM Pareto winner — lowest VRAM + best topical output ("We're concerned about it.")
`int4-nb-dep_gint8`	7.6 GB	1.4 GB	9.6 GB	1.12×	0.877	Smallest disk with the second-lowest VRAM. Coherent but visibly degraded
`fp16`	17 GB	1.5 GB	18.3 GB	5.3×	0.9999	Near-perfect quality, max VRAM
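
The hidden cos column is not defined in this card. The sketch below assumes it is the cosine similarity between a variant's temporal-transformer hidden output and the FP32 reference output for the same inputs. The function name and the toy vectors are illustrative, not taken from the benchmark scripts.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy stand-ins for a reference hidden state and a slightly perturbed one
# (real hidden states have 4096 elements per frame).
reference = [0.5, -1.0, 2.0, 0.25]
quantized = [0.52, -0.97, 1.95, 0.30]
similarity = cosine_similarity(reference, quantized)
```

A value near 1.0 means the quantized output points in almost the same direction as the reference; it says nothing about magnitude differences.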

RTF (real-time factor) is per-frame latency divided by the frame interval (80 ms at 12.5 Hz): 1.0× is exactly realtime, below 1.0× is faster than realtime, and above 1.0× is slower. The int8-nb-dep_gint8 and int4-nb-dep_gint8 variants run at ~1.12×, slightly slower than realtime (near-realtime streaming).
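
As a worked example of the RTF definition, the following sketch converts between per-frame latency and RTF. The helper names are illustrative; the latencies it produces are derived from the RTF values in the table, not separately measured.

```python
FRAME_RATE_HZ = 12.5                         # frame rate stated in this card
FRAME_INTERVAL_MS = 1000.0 / FRAME_RATE_HZ   # 80 ms of audio per frame

def real_time_factor(latency_ms, frame_interval_ms=FRAME_INTERVAL_MS):
    """RTF = per-frame latency / frame interval."""
    return latency_ms / frame_interval_ms

def implied_latency_ms(rtf, frame_interval_ms=FRAME_INTERVAL_MS):
    """Per-frame latency implied by a given RTF."""
    return rtf * frame_interval_ms

# An RTF of 1.12x implies roughly 89.6 ms of compute per 80 ms frame.
int8_latency_ms = implied_latency_ms(1.12)
```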

Which variant to pick

Most use cases → int8-nb-dep_gint8: best balance of RTF, host RAM, and quality. If your GPU has ≥16 GB VRAM, this is what you want.
Limited VRAM (≤8 GB GPU) → mixed: only 6.6 GB VRAM. Costs host RAM and RTF, but quality is still excellent (cos 0.990) with the best topical responses on our benchmark.
Disk-constrained → int4-nb-dep_gint8: 7.6 GB on disk and only 9.6 GB VRAM. Accept some quality drift (cos 0.877 — coherent English but less precise).
Maximum quality regardless of cost → fp16: cos 0.9999, indistinguishable from the FP32 reference.
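
The selection rules above can be sketched as a simple budget filter over the measured numbers from the table. This is an illustrative helper, not part of the repository; the measurements come from one RTX 5090 run of 50 frames, so treat the default of zero headroom as an assumption to adjust.

```python
# (disk GB, host RAM GB, VRAM GB) copied from the table, in table order.
VARIANTS = {
    "int8-nb-dep_gint8": {"disk": 9.4, "host_ram": 1.4, "vram": 12.1},
    "mixed":             {"disk": 11.0, "host_ram": 7.9, "vram": 6.6},
    "int4-nb-dep_gint8": {"disk": 7.6, "host_ram": 1.4, "vram": 9.6},
    "fp16":              {"disk": 17.0, "host_ram": 1.5, "vram": 18.3},
}

def variants_that_fit(vram_gb, host_ram_gb, disk_gb, headroom_gb=0.0):
    """Return the variants whose measured footprint fits every budget."""
    fitting = []
    for name, need in VARIANTS.items():
        if (need["vram"] + headroom_gb <= vram_gb
                and need["host_ram"] + headroom_gb <= host_ram_gb
                and need["disk"] <= disk_gb):
            fitting.append(name)
    return fitting

# Example: an 8 GB GPU with plenty of host RAM and disk.
small_gpu_choices = variants_that_fit(vram_gb=8, host_ram_gb=16, disk_gb=50)
```

When several variants fit, the guidance above still applies: prefer int8-nb-dep_gint8 unless quality or disk size is the deciding factor.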

Architecture

[User audio 24 kHz PCM]
        ↓
[Mimi encoder: SEANet + 8L transformer + RVQ] → 16 codebooks @ 12.5 Hz
        ↓
[Temporal transformer: 32L, dim=4096, 7B params, RoPE, RMSNorm, SwiGLU]
        ↓
[Depformer: 6L, dim=1024, MultiLinear × 16 codebook steps] → 16 agent audio tokens
        ↓
[Mimi decoder] → 24 kHz agent audio PCM
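
The rates in the diagram imply some fixed per-frame quantities. The arithmetic below is a sketch derived only from the 24 kHz sample rate, the 12.5 Hz frame rate, and the 16 codebooks shown above; the helper name is illustrative.

```python
import math

SAMPLE_RATE_HZ = 24_000
FRAME_RATE_HZ = 12.5
NUM_CODEBOOKS = 16

# PCM samples covered by one frame: 24000 / 12.5 = 1920.
samples_per_frame = int(SAMPLE_RATE_HZ / FRAME_RATE_HZ)

# Audio tokens per second for one 16-codebook stream: 16 * 12.5 = 200.
tokens_per_second = NUM_CODEBOOKS * FRAME_RATE_HZ

def frames_for_duration(seconds):
    """Number of frames needed to cover a duration, rounding up."""
    return math.ceil(seconds * FRAME_RATE_HZ)

# The 50-frame benchmark run corresponds to 4 seconds of audio.
benchmark_frames = frames_for_duration(4.0)
```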

Each variant ships four ONNX graphs:

File	Purpose
`mimi_encoder.onnx`(+`.data`)	24 kHz PCM → 16 audio codebooks @ 12.5 Hz
`mimi_decoder.onnx`(+`.data`)	16 audio codebooks @ 12.5 Hz → 24 kHz PCM
`temporal_step.onnx`(+`.data`)	One frame of the 32-layer 7B temporal transformer, explicit KV-cache I/O
`depformer_step.onnx`(+`.data`)	One inner step of the 6-layer depformer, 16 codebook steps per frame

Plus per-variant auxiliary files:

File	Purpose
`tokenizer_spm_32k_3.model`	SentencePiece text tokenizer
`voices/<name>.bin`	18 voice prompts (NATF0-3, NATM0-3, VARF0-4, VARM0-4)
`system_prompts.bin`	Pre-tokenized "helpful" / "expert" / "warm" / "direct" prompts
`config.json`	Architecture + precision + measured metrics
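
A quick completeness check for a downloaded bundle might look like the sketch below. It expands the voice name ranges listed above into the 18 individual names and reports any missing files. The function names are hypothetical, and the `.data` sidecars are deliberately not checked because their exact file names are not spelled out here.

```python
from pathlib import Path

ONNX_GRAPHS = [
    "mimi_encoder.onnx",
    "mimi_decoder.onnx",
    "temporal_step.onnx",
    "depformer_step.onnx",
]
AUX_FILES = ["tokenizer_spm_32k_3.model", "system_prompts.bin", "config.json"]

def voice_names():
    """Expand NATF0-3, NATM0-3, VARF0-4, VARM0-4 into individual names."""
    names = []
    for prefix, count in (("NATF", 4), ("NATM", 4), ("VARF", 5), ("VARM", 5)):
        names.extend(f"{prefix}{i}" for i in range(count))
    return names

def missing_files(bundle_dir):
    """Return expected bundle files (relative paths) that are not present."""
    root = Path(bundle_dir)
    expected = ONNX_GRAPHS + AUX_FILES + [f"voices/{n}.bin" for n in voice_names()]
    return [rel for rel in expected if not (root / rel).is_file()]
```

Calling `missing_files("scripts/personaplex-int8-nb-dep_gint8")` after a download should return an empty list if the bundle layout matches the tables above.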

How to use

Via the C++ runtime (speech-core)

# Download the recommended variant
PERSONAPLEX_VARIANT=int8-nb-dep_gint8 scripts/download_personaplex_onnx.sh

# Run end-to-end
build/Release/run_personaplex scripts/personaplex-int8-nb-dep_gint8 50 \
    tests/data/test_audio.wav VARF2

Via ONNX Runtime in Python

import onnxruntime as ort

# Inputs:  text_token [1,1] int64
#          audio_tokens [1,16] int64
#          past_k_all [32, 1, 32, T_past, 128] float (FP32 or FP16 depending on bundle)
#          past_v_all (same shape)
# Outputs: hidden [1, 1, 4096]
#          new_k_all [32, 1, 32, T_full, 128]
#          new_v_all (same shape)
sess = ort.InferenceSession("temporal_step.onnx",
                             providers=["CUDAExecutionProvider"])
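
The KV-cache shapes in the comments above determine how much memory the explicit cache tensors occupy. The sketch below does that arithmetic, assuming the cache grows by one position per frame and ignoring any positions taken by the voice and system prompts; the helper name is illustrative.

```python
NUM_LAYERS = 32      # first dimension of past_k_all / past_v_all
NUM_HEADS = 32       # third dimension
HEAD_DIM = 128       # last dimension (32 heads * 128 = 4096 model dim)
FRAME_RATE_HZ = 12.5

def kv_cache_bytes(t_past, bytes_per_element, batch=1):
    """Combined size of past_k_all and past_v_all for t_past cached positions."""
    elements_per_tensor = NUM_LAYERS * batch * NUM_HEADS * t_past * HEAD_DIM
    return 2 * elements_per_tensor * bytes_per_element  # keys + values

# Each cached position costs 512 KiB in FP16 and 1 MiB in FP32.
per_frame_fp16 = kv_cache_bytes(1, bytes_per_element=2)
per_frame_fp32 = kv_cache_bytes(1, bytes_per_element=4)

# One minute of conversation is 750 frames at 12.5 Hz.
one_minute_fp16 = kv_cache_bytes(int(60 * FRAME_RATE_HZ), bytes_per_element=2)
```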

For full-duplex generation, also call depformer_step 16 times per frame (one inner step per audio codebook) — see the speech-core wrapper source for the complete loop.
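
The per-frame control flow can be sketched with injected callables standing in for the two ONNX sessions. Everything about the callable signatures here is an assumption: the real `depformer_step` inputs and outputs are not listed in this card, and sampling, text-token handling, and Mimi decoding are omitted. Only the structure (one temporal step, then 16 depformer steps) comes from the text above.

```python
NUM_CODEBOOKS = 16

def generate_frame(temporal_step, depformer_step, text_token, audio_tokens, kv_cache):
    """Run one frame: one temporal step, then one depformer step per codebook."""
    # The temporal step consumes the past KV cache and returns the updated one.
    hidden, kv_cache = temporal_step(text_token, audio_tokens, kv_cache)
    agent_tokens = []
    for codebook in range(NUM_CODEBOOKS):
        # Hypothetical signature: hidden state, codebook index, tokens so far.
        token = depformer_step(hidden, codebook, agent_tokens)
        agent_tokens.append(token)
    return agent_tokens, kv_cache

# Toy stand-ins so the sketch runs without the model files.
def fake_temporal_step(text_token, audio_tokens, kv_cache):
    return [0.0] * 4, kv_cache + 1      # toy hidden state, toy cache counter

def fake_depformer_step(hidden, codebook, previous_tokens):
    return codebook                      # deterministic placeholder token

tokens, cache = generate_frame(fake_temporal_step, fake_depformer_step,
                               text_token=0, audio_tokens=[0] * 16, kv_cache=0)
```

In a real loop the two callables would wrap `InferenceSession.run` calls on `temporal_step.onnx` and `depformer_step.onnx`, with input names taken from the exported graphs.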

How these bundles were produced

All four bundles are exported from the FP32 PyTorch reference through stages in convert_onnx.py:

Variant	Temporal	Depformer	Notes
`fp16`	FP16 weights	FP16 weights	Standard `torch.onnx.export` at `--dtype float16`
`mixed`	INT8 dynamic via `quantize_dynamic` (per-channel, FP32 scales)	FP16 weights	The classic "INT8 mixed precision" recipe
`int8-nb-dep_gint8`	INT8 via `MatMulNBitsQuantizer(bits=8, block=128)`	Custom INT8 quantization of the depformer's 24 large 3D Gather-source weight tensors (4 GB depformer disk savings via `quantize_depformer_gather.py`)	Best balance
`int4-nb-dep_gint8`	INT4 via `MatMulNBitsQuantizer(bits=4, block=32)`	Same custom INT8 depformer	Smallest

Mimi codec is FP32 in all four variants (small enough not to matter).
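
To illustrate what the `bits` and `block` settings in the table control, here is a generic symmetric block-wise quantizer in plain Python. It is a sketch of the general technique only: the actual `MatMulNBitsQuantizer` packing, zero-point handling, and scale storage may differ, and the toy weights are made up.

```python
def quantize_blockwise(weights, bits, block_size):
    """Quantize a flat list of floats with one scale per block (symmetric)."""
    qmax = 2 ** (bits - 1) - 1           # 127 for INT8, 7 for INT4
    blocks = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        max_abs = max(abs(w) for w in block)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        quantized = [max(-qmax, min(qmax, round(w / scale))) for w in block]
        blocks.append((scale, quantized))
    return blocks

def dequantize_blockwise(blocks):
    """Reconstruct approximate floats from (scale, integers) blocks."""
    return [scale * q for scale, qs in blocks for q in qs]

# Toy weights in [-2.0, 2.0), quantized with the two settings from the table.
weights = [((i * 37) % 64 - 32) / 16.0 for i in range(256)]
int8_blocks = quantize_blockwise(weights, bits=8, block_size=128)
int4_blocks = quantize_blockwise(weights, bits=4, block_size=32)
```

Smaller blocks give each scale fewer weights to cover, which partly offsets the coarser 4-bit grid at the cost of storing more scales.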

Related projects

soniqo/speech-core: C++ inference runtime with the OnnxPersonaPlex wrapper, CUDA EP routing, multi-turn KV cache, and 12 memory-tuning env knobs (SPEECH_CORE_USE_ENV_ALLOCATORS, etc.)
soniqo/speech-models: model export pipeline including convert_onnx.py, quantize_depformer_gather.py, bench_pytorch_cuda.py, compare_bundle_quality.py
soniqo.audio: the project site
nvidia/personaplex-7b-v1: upstream PyTorch reference
kyutai-labs/moshi: base architecture

License

NVIDIA Open Model License (same as upstream). See the LICENSE link.
