jepa-q2d2

a 1.6 kbps neural speech codec. a jepa (joint-embedding predictive architecture) encoder feeds q2d2, a geometry-aware 2d rhombic-lattice quantizer, followed by a hifi-gan decoder. the codec is trained without any adversarial loss; a frozen wavlm perceptual loss is used while training the decoder.

code: https://github.com/anant-004/jepa-q2d2
paper: apsipa asc 2026

models

each folder has pytorch_model.pt, model.safetensors, and config.json.

folder	operating point	quantizer	bitrate	quality
`jepa-q2d2-cd64-12.5hz`	12.5 hz, code dim 64 (main)	q2d2	1.6 kbps (100 tok/s)	pesq 2.53, estoi 0.80
`jepa-q2d2-sigreg-cd32-25hz`	25 hz, code dim 32, sigreg	q2d2	1.6 kbps (100 tok/s)	estoi 0.79
`teacher-cd128-fsq-12.5hz`	12.5 hz, code dim 128	fsq	~2.85 kbps (237.5 tok/s)	pesq ~2.91

quality is measured on a fixed 50-utterance librilight set. estoi is extended stoi and is comparable only within this paper's pipeline. the main cd64 model is ahead of encodec at a comparable rate and ahead of mimi on pesq. the sigreg model is the co-design point: at the aggressive 25 hz / cd32 setting the codec collapses unless the encoder distribution is gaussianized with sigreg. the teacher is the higher-rate fsq codec used to distill the cd64 student.
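
the bitrate column can be cross-checked with simple arithmetic. the sketch below takes the bitrate, token rate, and frame rate from the table and derives the implied bits per token and tokens per frame. these derived values are not stated in this card; they only follow from the (rounded) table figures, and the names `OperatingPoint`, `bits_per_token`, and `tokens_per_frame` are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingPoint:
    name: str
    frame_rate_hz: float   # latent frames per second (table: operating point)
    tokens_per_s: float    # tokens per second (table: bitrate column)
    bitrate_bps: float     # bits per second (table; the teacher value is approximate)

    @property
    def bits_per_token(self) -> float:
        # derived quantity, not a number reported in the card
        return self.bitrate_bps / self.tokens_per_s

    @property
    def tokens_per_frame(self) -> float:
        # derived quantity, not a number reported in the card
        return self.tokens_per_s / self.frame_rate_hz


POINTS = [
    OperatingPoint("jepa-q2d2-cd64-12.5hz", 12.5, 100.0, 1600.0),
    OperatingPoint("jepa-q2d2-sigreg-cd32-25hz", 25.0, 100.0, 1600.0),
    OperatingPoint("teacher-cd128-fsq-12.5hz", 12.5, 237.5, 2850.0),
]

for p in POINTS:
    print(f"{p.name}: {p.bits_per_token:g} bits/token, "
          f"{p.tokens_per_frame:g} tokens/frame")
```

with the table's numbers, both q2d2 models come out at 16 bits per token (8 and 4 tokens per frame respectively), and the teacher at roughly 12 bits per token.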

emergent properties

the codec is trained on english read speech only and never sees language or emotion labels. the encoder features still pick up structure the model was never supervised on.

cross-lingual: a 5-nn probe on frozen pre-quantization features separates 6 fleurs languages at 0.85 accuracy (chance 0.167), ahead of mimi 0.63, encodec 0.61, dac 0.45, and close to multilingual ssl models. cluster nmi against language is higher too (0.26 vs encodec 0.10, dac 0.07). the probe number is unchanged (0.85) in the utterance-disjoint setting.
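
for readers unfamiliar with k-nn probing, here is a minimal sketch of the idea in plain python. it is not the evaluation code behind the numbers above: the card does not say whether the probe uses leave-one-out or a train/test split, or whether it runs on frames or pooled utterances, so leave-one-out scoring, the `groups` argument, and the synthetic 6-class features are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def knn_probe_accuracy(features, labels, k=5, groups=None):
    """Leave-one-out k-nn accuracy on fixed (frozen) feature vectors.

    If `groups` is given (e.g. utterance ids), neighbours that share the
    query's group are skipped, which mimics an utterance-disjoint probe.
    """
    correct = 0
    for i, query in enumerate(features):
        candidates = []
        for j, other in enumerate(features):
            if j == i:
                continue
            if groups is not None and groups[j] == groups[i]:
                continue
            candidates.append((math.dist(query, other), labels[j]))
        candidates.sort(key=lambda pair: pair[0])
        votes = Counter(label for _, label in candidates[:k])
        predicted = votes.most_common(1)[0][0]
        correct += predicted == labels[i]
    return correct / len(features)


# synthetic stand-in for encoder features: 6 well-separated classes
rng = random.Random(0)
n_classes, per_class, dim = 6, 20, 8
features, labels = [], []
for c in range(n_classes):
    for _ in range(per_class):
        # class centre sits at 5.0 on axis c; gaussian jitter on every axis
        features.append([(5.0 if d == c else 0.0) + rng.gauss(0.0, 0.5)
                         for d in range(dim)])
        labels.append(c)

accuracy = knn_probe_accuracy(features, labels, k=5)
chance = 1.0 / n_classes
print(f"5-nn accuracy {accuracy:.2f} vs chance {chance:.3f}")
```

the encoder is never updated by such a probe: only distances between fixed features are compared, so a high score reflects structure already present in the representation.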

style and emotion: utterance embeddings separate by speaking style on a hindi 5-class set (angry, excited, neutral, sad, whisper). whisper splits off cleanly and the rest form distinct regions for the jepa encoders, while encodec and dac mix them.

usage

import torch, soundfile as sf
from koe.fast.hf_codec import load_codec_from_hf   # from the github repo

model, info = load_codec_from_hf("main", device="cpu")   # downloads weights from here
print(info)

wav, sr = sf.read("in.wav")                              # mono, resampled to 24 khz internally
x = torch.from_numpy(wav).float().view(1, 1, -1)
with torch.no_grad():                                    # inference only, no gradients needed
    z_q = model.encode(x)[0]                             # discrete q2d2 tokens
    recon = model.decode(z_q)                            # 24 khz waveform

model code lives in the github repo. the released checkpoints already include the fine-tuned encoder, so inference needs only these files plus the repo code.
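
as a rough sanity check on what `model.encode` should return, the sketch below estimates how many latent frames and how many payload bytes a clip corresponds to, using only the 24 khz sample rate, the frame rates, and the 1.6 kbps bitrate given in this card. the helper names are hypothetical, and rounding the last partial frame up is an assumption; the actual padding behaviour and the layout of `z_q` are defined by the repo code.

```python
import math

SAMPLE_RATE_HZ = 24_000  # sample rate stated in this card


def samples_per_frame(frame_rate_hz: float) -> float:
    """Audio samples covered by one latent frame."""
    return SAMPLE_RATE_HZ / frame_rate_hz


def expected_frames(num_samples: int, frame_rate_hz: float) -> int:
    """Frames needed to cover a clip, ASSUMING the last partial frame is padded."""
    return math.ceil(num_samples / samples_per_frame(frame_rate_hz))


def payload_bytes(duration_s: float, bitrate_bps: float = 1600.0) -> float:
    """Size of the token stream at a constant bitrate, ignoring any container overhead."""
    return duration_s * bitrate_bps / 8.0


num_samples = 10 * SAMPLE_RATE_HZ  # a 10-second clip
print(samples_per_frame(12.5), expected_frames(num_samples, 12.5))
print(samples_per_frame(25.0), expected_frames(num_samples, 25.0))
print(payload_bytes(10.0), "bytes for 10 s at 1.6 kbps")
```

under these assumptions one frame spans 1920 samples at 12.5 hz and 960 samples at 25 hz, and ten seconds of speech fit in about 2000 bytes of tokens.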

training data

librilight, english read speech, 24 khz. the cross-lingual structure above is evaluated zero-shot on fleurs.

citation

@inproceedings{shukla2026jepaq2d2,
  title     = {JEPA-Q2D2: A Low-Bitrate Speech Codec with Emergent Cross-Lingual Structure},
  author    = {Shukla, Anant and Anand, Aman and Shakya, Suryansh and Bharti, Vatsal},
  booktitle = {Proc. APSIPA ASC},
  year      = {2026},
}

license

cc-by-4.0, released for research use.
