Accessing the files requires agreeing to the BabyHuBERT License. Please read the one-pager (https://osf.io/n75rs/files/anx9v) and the full license (https://osf.io/n75rs/files/b79v4) before proceeding. The license prohibits commercial use and surveillance, requires reporting of misuse, and ensures that any model built on BabyHuBERT inherits the same conditions.

BabyHuBERT

BabyHuBERT is a self-supervised speech representation model trained on 13,000+ hours of multilingual child-centered long-form audio recordings spanning 40+ languages, from widely studied languages such as English and French to underrepresented languages including Yeli Dnye, Tsimane, and Quechua. It was created by the ExELang team in 2025, built on data shared by research teams around the world. We created BabyHuBERT because existing speech models trained on clean adult speech fail on child-centered recordings, whose acoustic conditions are challenging: ~80% non-speech content, overlapping speakers, short vocalizations, and children's higher-pitched and more variable speech.

For a plain-language overview of what BabyHuBERT is and what you commit to by using it, see the one-pager (https://osf.io/n75rs/files/anx9v).

License

BabyHuBERT is released under a custom license informed by an independent ethics assessment covering participants' consent, indigenous data sovereignty, privacy, and possible misuse. The license:

Prohibits commercial use and surveillance of participants
Requires reporting of misuse
Ensures that any model released building on BabyHuBERT inherits the same conditions

See the full license (https://osf.io/n75rs/files/b79v4) for the complete terms. All documents related to the release of BabyHuBERT can be found in this OSF repository.

Downstream models

As of April 2026, three open-source task-specific models have been built on top of BabyHuBERT:

Model	Task
BabyHuBERT-VTC (Charlot et al., 2026)	Voice type classification (who speaks when?)
BabAR (Lavechin et al., 2026)	Phoneme recognition
Addressee classification (Charlot et al., 2026)	Child-directed speech vs adult-directed speech detection

Downloading the checkpoint

Fill in the access form on this page to get instant access, then authenticate and download:

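from huggingface_hub import login, hf_hub_download

# Authenticate with the account that was granted access
login()  # enter your HF token when prompted

# Download the checkpoint into the local Hugging Face cache and get its path
ckpt_path = hf_hub_download(repo_id="MarvinLvn/BabyHuBERT", filename="BabyHuBERT.ckpt")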

Extracting representations

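import torch
from torchaudio.models import hubert_pretrain_base

# Build the HuBERT base pretraining architecture with 500 target classes
model = hubert_pretrain_base(num_classes=500)

# Load the downloaded checkpoint (ckpt_path) and strip the "model." prefix from parameter names
state_dict = torch.load(ckpt_path, map_location="cpu")
state_dict = {k.replace("model.", ""): v for k, v in state_dict["state_dict"].items()}
model.load_state_dict(state_dict)

# Keep only the encoder (without the pretraining heads) and switch to evaluation mode
encoder = model.wav2vec2
encoder.eval()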

Citation

@misc{charlot2025babyhubertmultilingualselfsupervisedlearning,
    title={BabyHuBERT: Multilingual Self-Supervised Learning for Segmenting Speakers in Child-Centered Long-Form Recordings}, 
    author={Théo Charlot and Tarek Kunze and Maxime Poli and Alejandrina Cristia and Emmanuel Dupoux and Marvin Lavechin},
    year={2025},
    eprint={2509.15001},
    archivePrefix={arXiv},
    primaryClass={eess.AS},
    url={https://arxiv.org/abs/2509.15001}, 
}

Paper for MarvinLvn/BabyHuBERT

BabyHuBERT: Multilingual Self-Supervised Learning for Segmenting Speakers in Child-Centered Long-Form Recordings

Paper • 2509.15001 • Published Sep 18, 2025