babycry — infant/child valence classifier

Binary valence classifier for infant / young-child vocalizations: positive (content / happy — cooing, laughing, delighted) vs distress (crying, frustrated, dysregulated).

It is an ensemble of three frozen audio encoders whose pooled embeddings feed a small head. The default head is a simple, deterministic logistic regression on the CLAP embedding.

| encoder | hub id | resample |
| --- | --- | --- |
| AST | `MIT/ast-finetuned-audioset-10-10-0.4593` | 16 kHz |
| wav2vec2 | `facebook/wav2vec2-base` | 16 kHz |
| CLAP | `laion/clap-htsat-unfused` | 48 kHz |

Preprocessing is self-contained: you pass raw audio (a waveform array + its sample rate, or a file path). The model resamples per encoder, slices into 2 s windows with a 1 s hop, embeds each window with the frozen encoders, mean-pools windows to one vector per clip, and applies the head.
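
The windowing and pooling steps described above can be sketched as follows. This is a minimal illustration, not the model's actual code: `slice_windows` and `embed_clip` are hypothetical names, `encoder` stands in for any of the frozen encoders, and the handling of short clips (zero-padding) and of a trailing partial window (dropped) are assumptions the card does not specify.

```python
import numpy as np

def slice_windows(y, sr, win_s=2.0, hop_s=1.0):
    """Split a mono waveform into 2 s windows with a 1 s hop."""
    win = int(round(win_s * sr))
    hop = int(round(hop_s * sr))
    y = np.asarray(y, dtype=np.float32)
    if len(y) < win:
        # Assumption: clips shorter than one window are zero-padded.
        y = np.pad(y, (0, win - len(y)))
    # Assumption: a trailing partial window is dropped.
    n_windows = 1 + (len(y) - win) // hop
    return np.stack([y[i * hop : i * hop + win] for i in range(n_windows)])

def embed_clip(y, sr, encoder):
    """Embed each window with `encoder` (hypothetical callable), then mean-pool."""
    windows = slice_windows(y, sr)
    embeddings = np.stack([encoder(w) for w in windows])  # (n_windows, dim)
    return embeddings.mean(axis=0)                        # (dim,)
```

With these assumptions, a 5 s clip at 16 kHz yields four windows, and the clip-level vector is the average of the four window embeddings.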

⚠️ RESEARCH BASELINE ONLY: NOT A MEDICAL / SAFETY / CLINICAL DEVICE. This model is a weak research baseline trained on a small corpus that skews toward non-infant speakers. Do not use it for real infant-monitoring, medical, safety, or caregiving decisions. Its outputs are unreliable for any individual clip and must never gate care for a child.

Usage

from transformers import AutoModel

model = AutoModel.from_pretrained("owlgebra-ai/babycry", trust_remote_code=True)
model.eval()

# from an audio file path (resampling handled internally)
print(model.predict("clip.wav"))
# {'label': 'positive', 'prob_positive': 0.64, 'prob_distress': 0.36, 'head': 'single_logreg'}

# from a raw waveform array (mono, float in [-1, 1]) + its sample rate
import librosa
y, sr = librosa.load("clip.wav", sr=None, mono=True)   # sr=None keeps native rate
print(model.predict(y, sampling_rate=sr))

# a batch (list of paths and/or arrays)
print(model.predict(["a.wav", "b.wav"]))

prob_positive is P(positive); label thresholds it at 0.5.
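
If a different operating point is needed, the label can be re-derived from prob_positive. The sketch below is illustrative only: `label_from_prob` and `summarize` are hypothetical helpers, not part of the model's API, and sending ties at the threshold to "positive" is an assumption. `summarize` expects a list of dicts shaped like the single-clip output shown above.

```python
from statistics import mean

def label_from_prob(prob_positive, threshold=0.5):
    """Map P(positive) to a label; ties go to 'positive' (an assumption)."""
    if not 0.0 <= prob_positive <= 1.0:
        raise ValueError("prob_positive must lie in [0, 1]")
    return "positive" if prob_positive >= threshold else "distress"

def summarize(results, threshold=0.5):
    """Aggregate per-clip result dicts into clip-set summary statistics."""
    if not results:
        raise ValueError("need at least one result")
    probs = [r["prob_positive"] for r in results]
    labels = [label_from_prob(p, threshold) for p in probs]
    return {
        "n_clips": len(probs),
        "mean_prob_positive": mean(probs),
        "frac_positive": labels.count("positive") / len(labels),
    }
```

Summarizing over many clips rather than acting on one prediction fits the card's advice that single-clip outputs are noisy.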

Input requirements & gotchas

Sample rate matters: pass the file path, or audio at its original sample rate. The model resamples internally per encoder (AST and wav2vec2 at 16 kHz, CLAP at 48 kHz). If you downsample first, the model still upsamples for CLAP but cannot recover the lost high-frequency content, and predictions shift materially. Measured on one clip: the file path and the native-sample-rate array both gave prob_positive = 0.64, while the same clip pre-downsampled to 16 kHz or 22.05 kHz shifted prob_positive by ~0.25. Feed the original file (best) or the highest sample rate you have (ideally ≥ 48 kHz).
sampling_rate is required for raw-array input (omitting it raises ValueError). It is not needed for a file path, where the rate is read from the file.
Audio arrays must be float in [-1, 1]. Don't pass int16 PCM: the wrong amplitude scale shifts predictions. Mono is expected; multi-channel input is averaged to mono automatically.
Call model.to(device) before the first predict(). The three frozen encoders are lazily loaded onto the model's current device on the first call.
The first call is slow and needs network access: it downloads the three frozen encoders (hundreds of MB in total) from their own hub repositories.
Per-clip outputs are noisy — treat single-clip predictions with caution; only aggregate behavior is meaningful.
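
The amplitude and channel requirements above can be met with a small conversion step before calling predict(). This is a sketch under stated assumptions: `to_mono_float` is a hypothetical helper, it handles signed-integer PCM only, and for 2-D input it guesses that the shorter axis is the channel axis.

```python
import numpy as np

def to_mono_float(y):
    """Convert signed-integer PCM or float audio to mono float32 in [-1, 1]."""
    y = np.asarray(y)
    if np.issubdtype(y.dtype, np.signedinteger):
        # e.g. int16 is divided by 32768 so full scale maps to [-1, 1)
        scale = float(np.iinfo(y.dtype).max) + 1.0
        y = y.astype(np.float32) / scale
    elif np.issubdtype(y.dtype, np.floating):
        y = y.astype(np.float32)
    else:
        raise ValueError(f"unsupported dtype: {y.dtype}")
    if y.ndim == 2:
        # Assumption: the shorter axis holds the channels.
        channel_axis = 0 if y.shape[0] < y.shape[1] else 1
        y = y.mean(axis=channel_axis)
    return y
```

The result can then be passed as `model.predict(to_mono_float(y), sampling_rate=sr)`, keeping `sr` at the file's native rate.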

Performance (honest)

The numbers to trust are leave-one-speaker-out (LOBO) within ReCANVo — the only clean, leakage-free, same-corpus / same-sample-rate evaluation (pooled out-of-fold predictions, 0.5 threshold, n=3903). For the default CLAP + logistic-regression head:

| balanced-acc | ROC-AUC | positive-recall | positive-precision | macro-F1 |
| --- | --- | --- | --- | --- |
| 0.648 | 0.706 | 0.641 | 0.556 | 0.643 |

(positive is the ROC-AUC positive class; class_weight='balanced'. Source: within-ReCANVo LOBO leaderboard, results/valence_advanced_metrics.csv.)
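
For readers unfamiliar with the protocol, the following sketch shows how pooled out-of-fold predictions and balanced accuracy at a 0.5 threshold can be computed when each fold holds out one speaker. It is not the authors' evaluation script: all function names are hypothetical, and `fit_predict` stands in for whatever trains a head on the training indices and returns P(positive) for the held-out indices.

```python
def leave_one_speaker_out(speakers):
    """Yield (held_out_speaker, train_idx, test_idx), one fold per speaker."""
    for held in sorted(set(speakers)):
        test_idx = [i for i, s in enumerate(speakers) if s == held]
        train_idx = [i for i, s in enumerate(speakers) if s != held]
        yield held, train_idx, test_idx

def pooled_out_of_fold(speakers, fit_predict):
    """Collect held-out probabilities from every fold into one list."""
    pooled = [None] * len(speakers)
    for _, train_idx, test_idx in leave_one_speaker_out(speakers):
        probs = fit_predict(train_idx, test_idx)  # hypothetical callable
        for i, p in zip(test_idx, probs):
            pooled[i] = p
    return pooled

def balanced_accuracy(y_true, prob_positive, threshold=0.5):
    """Mean of per-class recall; both classes must be present in y_true."""
    y_pred = [1 if p >= threshold else 0 for p in prob_positive]
    recalls = []
    for cls in (0, 1):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / 2
```

Because every clip is scored by a model that never saw its speaker, the pooled predictions give one metric over all clips rather than an average of per-fold metrics.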

This is a hard, low-resource task, and the ceiling is set by the data, not the head. An advanced-head/fusion sweep found that late fusion and neural heads gave no robust gain over a plain CLAP probe (neural ROC-AUC varied ±0.03 across seeds; fusion sat inside LOBO fold-to-fold noise), so the shipped default is the simplest, fully deterministic probe. The honest data ceiling is **0.74 ROC-AUC / ~0.67 balanced-acc**.

ReCANVo speakers are children and young people aged ~6–23, not infants, and positive clips are concentrated in a few speakers. LOBO estimates are therefore noisy, and applying the model to actual infants is a transfer (out-of-distribution) setting in which performance will degrade further. Cross-dataset and pooled numbers are confounded by recording-domain / sample-rate cues and are not clean measures of generalization.

Shipped weights are trained on ALL valence-labelled clips (train-on-all, positive=1571 / distress=2829 across ReCANVo + DonateACry + ESC-50) to maximize usefulness. The numbers above are the LOBO eval figures (a separate, leakage-free protocol) — not the in-sample fit of the shipped weights.

Data & licensing

Trained on embeddings from the owlgebra-ai/babycry dataset, which aggregates several open corpora (ReCANVo, DonateACry, ESC-50). Licensing is per-source — each component dataset carries its own license and terms. See the dataset card for the per-source license details and obligations before any redistribution or downstream use.

Limitations

Weak baseline: unreliable on individual clips; only meaningful in aggregate.
Trained on non-infant (ages ~6–23) and mixed-domain data → real-infant use is OOD transfer.
Two-class only (positive vs distress); no "ambiguous"/neutral handling, no detection of non-vocal audio.
Frozen encoders — no fine-tuning on this task.
Not for medical, safety, clinical, or real-time infant-monitoring use.

Model size: 554k params (Safetensors, tensor type F32)
