jepa-q2d2

a 1.6 kbps neural speech codec. a jepa (joint-embedding predictive architecture) encoder feeds q2d2, a geometry-aware 2d rhombic-lattice quantizer, and a hifi-gan decoder reconstructs the waveform from the quantized codes. the codec is trained without any adversarial loss; a frozen wavlm perceptual loss is used while training the decoder.
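
as a rough illustration of what quantizing onto a 2d rhombic lattice means, the sketch below snaps a 2d point to the nearest point of a lattice spanned by two equal-length basis vectors. this is not the q2d2 implementation: the 60-degree angle, the `scale` parameter, and the names `rhombic_basis`, `quantize_2d`, and `dequantize_2d` are all assumptions made for the example.

```python
import math
from itertools import product

def rhombic_basis(scale=1.0, angle_deg=60.0):
    """two equal-length basis vectors separated by angle_deg (a rhombus cell)."""
    a = math.radians(angle_deg)
    return (scale, 0.0), (scale * math.cos(a), scale * math.sin(a))

def dequantize_2d(i, j, basis):
    """map integer lattice coordinates (i, j) back to a 2d point."""
    (b1x, b1y), (b2x, b2y) = basis
    return i * b1x + j * b2x, i * b1y + j * b2y

def quantize_2d(x, y, basis):
    """return integer coordinates (i, j) of the lattice point nearest to (x, y)."""
    (b1x, b1y), (b2x, b2y) = basis
    det = b1x * b2y - b1y * b2x
    # solve (x, y) = u*b1 + v*b2 for real-valued lattice coordinates (u, v)
    u = (x * b2y - y * b2x) / det
    v = (b1x * y - b1y * x) / det
    i0, j0 = round(u), round(v)
    # rounding alone can miss the nearest point on a skewed lattice,
    # so compare the 3x3 neighbourhood around the rounded guess
    best = None
    for di, dj in product((-1, 0, 1), repeat=2):
        i, j = i0 + di, j0 + dj
        px, py = dequantize_2d(i, j, basis)
        d = (x - px) ** 2 + (y - py) ** 2
        if best is None or d < best[0]:
            best = (d, i, j)
    return best[1], best[2]

basis = rhombic_basis(scale=1.0, angle_deg=60.0)
i, j = quantize_2d(0.9, 0.8, basis)
print((i, j), dequantize_2d(i, j, basis))
```

in this toy setup the integer pair `(i, j)` plays the role of a discrete code for one 2d slice of a feature vector, and `dequantize_2d` recovers the point a decoder would see. how q2d2 actually groups dimensions, bounds the lattice, or indexes codes is not described here.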

models

each folder has pytorch_model.pt, model.safetensors, and config.json.

| folder | operating point | quantizer | bitrate | quality |
|---|---|---|---|---|
| jepa-q2d2-cd64-12.5hz | 12.5 hz, code dim 64 (main) | q2d2 | 1.6 kbps (100 tok/s) | pesq 2.53, estoi 0.80 |
| jepa-q2d2-sigreg-cd32-25hz | 25 hz, code dim 32, sigreg | q2d2 | 1.6 kbps (100 tok/s) | estoi 0.79 |
| teacher-cd128-fsq-12.5hz | 12.5 hz, code dim 128 | fsq | ~2.85 kbps (237.5 tok/s) | pesq ~2.91 |
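
the bitrate and token-rate figures listed for each model imply a fixed number of bits per token and of tokens per frame. the sketch below does that arithmetic from the listed values only; the helper names are illustrative, and the teacher numbers are approximate because its bitrate is given as ~2.85 kbps.

```python
def bits_per_token(bitrate_bps, tokens_per_second):
    """bits carried by each token at the stated bitrate."""
    return bitrate_bps / tokens_per_second

def tokens_per_frame(tokens_per_second, frame_rate_hz):
    """tokens emitted for every encoder frame."""
    return tokens_per_second / frame_rate_hz

# (bitrate in bits/s, tokens/s, frame rate in hz), copied from the model list
rows = {
    "jepa-q2d2-cd64-12.5hz": (1600.0, 100.0, 12.5),
    "jepa-q2d2-sigreg-cd32-25hz": (1600.0, 100.0, 25.0),
    "teacher-cd128-fsq-12.5hz": (2850.0, 237.5, 12.5),  # bitrate is approximate
}

for name, (bps, tps, hz) in rows.items():
    print(name, bits_per_token(bps, tps), tokens_per_frame(tps, hz))
```

both q2d2 models work out to 16 bits per token, at 8 tokens per frame for the 12.5 hz model and 4 for the 25 hz model; the teacher works out to roughly 12 bits per token at 19 tokens per frame. these are consequences of the listed rates, not a statement about how the indices are packed.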

quality is measured on a fixed 50-utterance librilight set. estoi is extended stoi and is comparable only within this paper's pipeline. the main cd64 model is ahead of encodec at a comparable rate and ahead of mimi on pesq. the sigreg model is the co-design point: at the aggressive 25 hz / cd32 setting the codec collapses unless the encoder distribution is gaussianized with sigreg. the teacher is the higher-rate fsq codec used to distill the cd64 student.

emergent properties

the codec is trained on english read speech only and never sees language or emotion labels. the encoder features still pick up structure the encoder was never supervised on.

cross-lingual: a 5-nn probe on frozen pre-quantization features separates 6 fleurs languages at 0.85 accuracy (chance 0.167), ahead of mimi (0.63), encodec (0.61), and dac (0.45), and close to multilingual ssl models. cluster nmi against language is also higher (0.26 vs 0.10 for encodec and 0.07 for dac). the number is unchanged under an utterance-disjoint split (0.85).
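
for readers unfamiliar with k-nn probing, here is a minimal sketch of the idea on synthetic data. it is not the evaluation code behind the numbers above: the features are random stand-ins for frozen encoder outputs, and squared euclidean distance, majority voting, and leave-one-out scoring are assumptions, since the card does not specify them.

```python
import random
from collections import Counter

def knn_probe_accuracy(features, labels, k=5):
    """leave-one-out k-nn accuracy: each item is labelled by its k nearest other items."""
    correct = 0
    for q, (fq, yq) in enumerate(zip(features, labels)):
        dists = []
        for r, (fr, yr) in enumerate(zip(features, labels)):
            if r == q:
                continue  # an item never votes for itself
            d = sum((a - b) ** 2 for a, b in zip(fq, fr))
            dists.append((d, yr))
        dists.sort(key=lambda t: t[0])
        votes = Counter(y for _, y in dists[:k])
        if votes.most_common(1)[0][0] == yq:
            correct += 1
    return correct / len(features)

# synthetic stand-in for frozen features: 6 "languages", 20 items each
rng = random.Random(0)
n_classes, per_class, dim = 6, 20, 8
features, labels = [], []
for c in range(n_classes):
    for _ in range(per_class):
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        vec[c] += 10.0  # shift each class along its own axis
        features.append(vec)
        labels.append(c)

acc = knn_probe_accuracy(features, labels, k=5)
print(f"5-nn accuracy: {acc:.3f} (chance {1 / n_classes:.3f})")
```

the encoder is never updated by a probe like this, so the score reflects only how the frozen features are arranged. an utterance-disjoint variant would additionally exclude neighbours drawn from the same utterance as the query, which this toy version does not model.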

figure: cross-lingual separability

style and emotion: utterance embeddings separate by speaking style on a hindi 5-class set (angry, excited, neutral, sad, whisper). whisper splits off cleanly and the rest form distinct regions for the jepa encoders, while encodec and dac mix them.

figure: emotion separation

usage

import soundfile as sf
import torch
from koe.fast.hf_codec import load_codec_from_hf   # from the github repo

model, info = load_codec_from_hf("main", device="cpu")   # downloads weights from here
print(info)

wav, sr = sf.read("in.wav")                              # mono, resampled to 24 khz internally
x = torch.from_numpy(wav).float().view(1, 1, -1)         # shape (batch, channels, samples)
with torch.no_grad():                                    # inference only, no gradients needed
    z_q = model.encode(x)[0]                             # discrete q2d2 tokens
    recon = model.decode(z_q)                            # 24 khz waveform
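
to get a feel for what the main model produces for a given clip, the sketch below estimates frame, token, and byte counts from the 24 khz sample rate and the 12.5 hz / 100 tok/s / 1.6 kbps operating point. `clip_budget` is a hypothetical helper, not part of the repo, and rounding up to whole frames is an assumption about padding.

```python
import math

SAMPLE_RATE = 24_000  # hz, the codec's working sample rate

def clip_budget(num_samples, frame_rate_hz=12.5, tokens_per_second=100.0, bitrate_bps=1600.0):
    """estimate the frame, token, and byte counts for a clip of num_samples samples."""
    seconds = num_samples / SAMPLE_RATE
    return {
        "samples_per_frame": SAMPLE_RATE / frame_rate_hz,
        "frames": math.ceil(seconds * frame_rate_hz),      # assumes padding to whole frames
        "tokens": math.ceil(seconds * tokens_per_second),
        "bytes": math.ceil(seconds * bitrate_bps / 8),
    }

print(clip_budget(10 * SAMPLE_RATE))  # a 10-second clip
```

for a 10-second clip this gives 1920 samples per frame, 125 frames, 1000 tokens, and 2000 bytes, which is a quick sanity check against the shape of `z_q`.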

model code lives in the github repo. the released checkpoints already include the fine-tuned encoder, so inference needs only these files plus the repo code.

training data

librilight, english read speech, 24 khz. the cross-lingual structure above is evaluated zero-shot on fleurs.

citation

@inproceedings{shukla2026jepaq2d2,
  title     = {JEPA-Q2D2: A Low-Bitrate Speech Codec with Emergent Cross-Lingual Structure},
  author    = {Shukla, Anant and Anand, Aman and Shakya, Suryansh and Bharti, Vatsal},
  booktitle = {Proc. APSIPA ASC},
  year      = {2026},
}

license

cc-by-4.0, released for research use.
