jepa-q2d2
a 1.6 kbps neural speech codec. a jepa (joint-embedding predictive architecture) encoder feeds q2d2, a geometry-aware 2d rhombic-lattice quantizer, whose output drives a hifi-gan decoder. the codec is trained without any adversarial loss; a frozen wavlm perceptual loss is used when training the decoder.
- code: https://github.com/anant-004/jepa-q2d2
- paper: apsipa asc 2026
models
each folder has pytorch_model.pt, model.safetensors, and config.json.
| folder | operating point | quantizer | bitrate | quality |
|---|---|---|---|---|
| jepa-q2d2-cd64-12.5hz | 12.5 hz, code dim 64 (main) | q2d2 | 1.6 kbps (100 tok/s) | pesq 2.53, estoi 0.80 |
| jepa-q2d2-sigreg-cd32-25hz | 25 hz, code dim 32, sigreg | q2d2 | 1.6 kbps (100 tok/s) | estoi 0.79 |
| teacher-cd128-fsq-12.5hz | 12.5 hz, code dim 128 | fsq | ~2.85 kbps (237.5 tok/s) | pesq ~2.91 |
quality is measured on a fixed 50-utterance librilight set. estoi is extended stoi; the values are comparable only within this paper's pipeline. the main cd64 model is ahead of encodec at a comparable rate and ahead of mimi on pesq. the sigreg model is the co-design point: at the aggressive 25 hz / cd32 setting the codec collapses unless the encoder distribution is gaussianized with sigreg. the teacher is the higher-rate fsq codec used to distill the cd64 student.
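the bitrate column can be cross-checked with simple arithmetic: tokens per frame is the token rate divided by the frame rate, and bits per token is the bitrate divided by the token rate. the sketch below only re-derives those two numbers from the table, assuming 1 kbps = 1000 bits/s; the dictionary keys and the `derived_rates` helper are illustrative names, not part of the repo, and the teacher row is approximate because its bitrate is listed as ~2.85 kbps.

```python
# Values copied from the models table; nothing here touches the codec itself.
# Assumption: 1 kbps = 1000 bits per second.
OPERATING_POINTS = {
    "jepa-q2d2-cd64-12.5hz": {"frame_hz": 12.5, "tok_per_s": 100.0, "bps": 1600.0},
    "jepa-q2d2-sigreg-cd32-25hz": {"frame_hz": 25.0, "tok_per_s": 100.0, "bps": 1600.0},
    "teacher-cd128-fsq-12.5hz": {"frame_hz": 12.5, "tok_per_s": 237.5, "bps": 2850.0},
}


def derived_rates(frame_hz, tok_per_s, bps):
    """Return (tokens per frame, bits per token) implied by one table row."""
    return tok_per_s / frame_hz, bps / tok_per_s


for name, row in OPERATING_POINTS.items():
    tok_per_frame, bits_per_tok = derived_rates(**row)
    print(f"{name}: {tok_per_frame:g} tok/frame, {bits_per_tok:g} bits/tok")
```

both 1.6 kbps models spend the same 16 bits per token; they differ in how many tokens each frame carries (8 at 12.5 hz, 4 at 25 hz).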
emergent properties
the codec is trained on english read speech only and never sees language or emotion labels. the encoder features still pick up structure the encoder was never supervised on.
cross-lingual: a 5-nn probe on frozen pre-quantization features separates 6 fleurs languages at 0.85 (chance 0.167), ahead of mimi 0.63, encodec 0.61, dac 0.45, and close to multilingual ssl models. cluster nmi against language is higher too (0.26 vs encodec 0.10, dac 0.07). the utterance-disjoint number is the same (0.85).
style and emotion: utterance embeddings separate by speaking style on a hindi 5-class set (angry, excited, neutral, sad, whisper). whisper splits off cleanly and the rest form distinct regions for the jepa encoders, while encodec and dac mix them.
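for readers unfamiliar with the probing setup, a 5-nn probe with an utterance-disjoint constraint can be sketched as follows. this is a minimal illustration on synthetic gaussian clusters, not the paper's evaluation code: the toy features, the per-segment pooling, the euclidean distance, and the names `knn_probe_accuracy` and `make_toy_features` are all assumptions made for the example.

```python
import math
import random
from collections import Counter


def knn_probe_accuracy(feats, labels, groups, k=5):
    """Leave-one-out k-NN accuracy.

    Neighbours that share the query's group (utterance) are excluded,
    which also excludes the query itself.
    """
    correct = 0
    for i, query in enumerate(feats):
        dists = [
            (math.dist(query, other), labels[j])
            for j, other in enumerate(feats)
            if groups[j] != groups[i]
        ]
        dists.sort(key=lambda pair: pair[0])
        votes = Counter(label for _, label in dists[:k])
        predicted = votes.most_common(1)[0][0]
        correct += predicted == labels[i]
    return correct / len(feats)


def make_toy_features(n_langs=6, utts_per_lang=10, segs_per_utt=3, dim=8, seed=0):
    """Synthetic stand-in for frozen encoder features: one cluster per language."""
    rng = random.Random(seed)
    feats, labels, groups = [], [], []
    for lang in range(n_langs):
        mean = [10.0 if d == lang else 0.0 for d in range(dim)]
        for utt in range(utts_per_lang):
            for _ in range(segs_per_utt):
                feats.append([m + rng.gauss(0.0, 1.0) for m in mean])
                labels.append(lang)
                groups.append((lang, utt))  # utterance id used for the disjoint split
    return feats, labels, groups


feats, labels, groups = make_toy_features()
acc = knn_probe_accuracy(feats, labels, groups, k=5)
print(f"utterance-disjoint 5-nn accuracy: {acc:.2f} (chance {1 / 6:.3f})")
```

excluding same-utterance neighbours is what keeps the probe from scoring well simply by matching segments of one recording to each other, which is the concern the utterance-disjoint number addresses.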
usage
```python
import soundfile as sf
import torch

from koe.fast.hf_codec import load_codec_from_hf  # from the github repo

model, info = load_codec_from_hf("main", device="cpu")  # downloads weights from here
print(info)

wav, sr = sf.read("in.wav")  # mono, resampled to 24 khz internally
x = torch.from_numpy(wav).float().view(1, 1, -1)
z_q = model.encode(x)[0]  # discrete q2d2 tokens
recon = model.decode(z_q)  # 24 khz waveform
```
model code lives in the github repo. the released checkpoints already include the fine-tuned encoder, so inference needs only these files plus the repo code.
training data
librilight, english read speech, 24 khz. the cross-lingual structure above is evaluated zero-shot on fleurs.
citation
```bibtex
@inproceedings{shukla2026jepaq2d2,
  title     = {JEPA-Q2D2: A Low-Bitrate Speech Codec with Emergent Cross-Lingual Structure},
  author    = {Shukla, Anant and Anand, Aman and Shakya, Suryansh and Bharti, Vatsal},
  booktitle = {Proc. APSIPA ASC},
  year      = {2026},
}
```
license
cc-by-4.0, released for research use.

