babycry: infant/child valence classifier

Binary valence classifier for infant / young-child vocalizations: positive (content / happy: cooing, laughing, delighted) vs distress (crying, frustrated, dysregulated).

It is an ensemble of three frozen audio encoders whose pooled embeddings feed a simple, deterministic CLAP + logistic-regression head (the default).

encoder    hub id                                       resample
AST        MIT/ast-finetuned-audioset-10-10-0.4593      16 kHz
wav2vec2   facebook/wav2vec2-base                       16 kHz
CLAP       laion/clap-htsat-unfused                     48 kHz

Preprocessing is self-contained: you pass raw audio (a waveform array + its sample rate, or a file path). The model resamples per encoder, slices into 2 s windows with a 1 s hop, embeds each window with the frozen encoders, mean-pools windows to one vector per clip, and applies the head.
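The windowing and pooling steps described above can be sketched as follows. This is a minimal illustration, not the model's actual code: `slice_windows`, `embed_clip`, and `toy_encoder` are hypothetical names, and the handling of short clips (zero-padding) and trailing partial windows (dropped) is an assumption, since the card does not specify either.

```python
import numpy as np

def slice_windows(y, sr, win_s=2.0, hop_s=1.0):
    """Split a mono waveform into fixed-length windows (hypothetical helper)."""
    win = int(round(win_s * sr))
    hop = int(round(hop_s * sr))
    if len(y) < win:
        # Assumption: clips shorter than one window are zero-padded to one window.
        y = np.pad(y, (0, win - len(y)))
    # Assumption: a trailing partial window is dropped.
    starts = range(0, len(y) - win + 1, hop)
    return np.stack([y[s:s + win] for s in starts])

def embed_clip(y, sr, encoder):
    """Embed each window, then mean-pool to one vector per clip."""
    windows = slice_windows(y, sr)
    embeddings = np.stack([encoder(w) for w in windows])
    return embeddings.mean(axis=0)

def toy_encoder(w):
    """Stand-in for a frozen encoder: two summary statistics per window."""
    return np.array([w.mean(), np.abs(w).max()])

sr = 16_000
y = np.linspace(-1.0, 1.0, 5 * sr, dtype=np.float32)  # 5 s synthetic clip
windows = slice_windows(y, sr)
clip_vector = embed_clip(y, sr, toy_encoder)
print(windows.shape, clip_vector.shape)
```

With a 2 s window and a 1 s hop, consecutive windows overlap by half, so a 5 s clip yields four windows under these edge-handling assumptions.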

⚠️ RESEARCH BASELINE ONLY: NOT A MEDICAL / SAFETY / CLINICAL DEVICE. This model is a weak research baseline trained on a small corpus that skews toward non-infant speakers. Do not use it for real infant-monitoring, medical, safety, or caregiving decisions. Its outputs are unreliable for any individual clip and must never gate care for a child.

Usage

from transformers import AutoModel

model = AutoModel.from_pretrained("owlgebra-ai/babycry", trust_remote_code=True)
model.eval()

# from an audio file path (resampling handled internally)
print(model.predict("clip.wav"))
# {'label': 'positive', 'prob_positive': 0.64, 'prob_distress': 0.36, 'head': 'single_logreg'}

# from a raw waveform array (mono, float in [-1, 1]) + its sample rate
import librosa
y, sr = librosa.load("clip.wav", sr=None, mono=True)   # sr=None keeps native rate
print(model.predict(y, sampling_rate=sr))

# a batch (list of paths and/or arrays)
print(model.predict(["a.wav", "b.wav"]))

prob_positive is P(positive); label thresholds it at 0.5.
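If a different operating point is needed, the label can be re-derived from prob_positive. The sketch below uses hand-written dicts in the shape shown above rather than real model output; `relabel` is a hypothetical helper, and treating a probability exactly at the threshold as positive is an assumption.

```python
from statistics import mean

def relabel(result, threshold=0.5):
    """Re-derive the label from prob_positive at a chosen threshold."""
    # Assumption: a probability exactly at the threshold counts as positive.
    return "positive" if result["prob_positive"] >= threshold else "distress"

# Hand-written example dicts, not real predictions.
results = [
    {"label": "positive", "prob_positive": 0.64, "prob_distress": 0.36},
    {"label": "distress", "prob_positive": 0.30, "prob_distress": 0.70},
    {"label": "positive", "prob_positive": 0.55, "prob_distress": 0.45},
]

# A stricter threshold flips borderline clips to "distress".
strict_labels = [relabel(r, threshold=0.6) for r in results]

# Averaging over several clips gives a less noisy summary than any single clip.
mean_prob_positive = mean(r["prob_positive"] for r in results)
print(strict_labels, round(mean_prob_positive, 3))
```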

Input requirements & gotchas

  • Sample rate matters: pass the file path, or audio at its original sample rate. The model resamples internally per encoder (AST & wav2vec2 at 16 kHz, CLAP at 48 kHz). If you pre-downsample and then feed the result, the model upsamples for CLAP but cannot recover the lost high-frequency content, and predictions shift materially. Measured on one clip (prob_positive): file path / native-SR array = 0.64 (identical), but the same clip downsampled to 16 kHz / 22.05 kHz shifted by ~0.25. Feed the original file (best) or the highest sample rate you have (ideally ≥ 48 kHz).
  • sampling_rate is required for raw-array input (omitting it raises ValueError); not needed for a file path (read natively).
  • Audio must be float, mono, in [-1, 1]. Don't pass int16 PCM: the wrong amplitude scale shifts predictions. Multi-channel input is auto-averaged to mono.
  • Call model.to(device) before the first predict(). The three frozen encoders are lazily loaded onto the model's current device on the first call.
  • First call is slow / needs network: it downloads the three frozen encoders (~hundreds of MB total) from their own hubs.
  • Per-clip outputs are noisy: treat single-clip predictions with caution; only aggregate behavior is meaningful.
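For callers that hold integer PCM, the amplitude requirement above can be met with a small conversion step before calling predict(). This is a sketch under stated assumptions: `to_float_mono` is a hypothetical helper, and 2-D input is assumed to be channels-last, i.e. shaped (n_samples, n_channels).

```python
import numpy as np

def to_float_mono(y):
    """Return a float32 mono waveform in [-1, 1] (hypothetical helper).

    Assumes 2-D input is channels-last: (n_samples, n_channels).
    """
    y = np.asarray(y)
    if np.issubdtype(y.dtype, np.signedinteger):
        # Scale signed PCM by its full-scale magnitude (32768 for int16).
        full_scale = float(-np.iinfo(y.dtype).min)
        y = y.astype(np.float32) / full_scale
    elif np.issubdtype(y.dtype, np.floating):
        y = y.astype(np.float32)
    else:
        raise TypeError(f"unsupported dtype: {y.dtype}")
    if y.ndim == 2:
        # Average channels to mono.
        y = y.mean(axis=1)
    return np.clip(y, -1.0, 1.0)

# Synthetic stereo int16 PCM: three frames, two channels.
pcm = np.array([[0, 0], [16384, -16384], [32767, 32767]], dtype=np.int16)
wave = to_float_mono(pcm)
print(wave.dtype, wave.shape)

# The result would then be passed with its original sample rate, e.g.
# model.predict(wave, sampling_rate=sr)
```

The conversion only changes scale and channel count; it deliberately leaves the sample rate untouched so the model can resample from the native rate.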

Performance (honest)

The numbers to trust are leave-one-speaker-out (LOBO) within ReCANVo: the only clean, leakage-free, same-corpus / same-sample-rate evaluation (pooled out-of-fold predictions, 0.5 threshold, n=3903). For the default CLAP + logistic-regression head:

balanced-acc   ROC-AUC   positive-recall   positive-precision   macro-F1
0.648          0.706     0.641             0.556                0.643

(positive is the ROC-AUC positive class; class_weight='balanced'. Source: within-ReCANVo LOBO leaderboard, results/valence_advanced_metrics.csv.)
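For readers who want to reproduce these metric definitions on their own pooled predictions, the following is a dependency-free sketch. The function names and the toy labels and probabilities are illustrative assumptions, not the evaluation code or data behind the table above.

```python
def safe_div(num, den):
    """Divide, returning 0.0 when the denominator is zero."""
    return num / den if den else 0.0

def f1(precision, recall):
    return safe_div(2 * precision * recall, precision + recall)

def binary_metrics(y_true, prob_positive, threshold=0.5):
    """Threshold-based metrics with 'positive' (1) as the positive class."""
    y_pred = [int(p >= threshold) for p in prob_positive]
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    recall_pos, recall_neg = safe_div(tp, tp + fn), safe_div(tn, tn + fp)
    prec_pos, prec_neg = safe_div(tp, tp + fp), safe_div(tn, tn + fn)
    return {
        "balanced_acc": (recall_pos + recall_neg) / 2,
        "positive_recall": recall_pos,
        "positive_precision": prec_pos,
        "macro_f1": (f1(prec_pos, recall_pos) + f1(prec_neg, recall_neg)) / 2,
    }

def roc_auc(y_true, prob_positive):
    """Pairwise ROC-AUC: share of (positive, distress) pairs ranked correctly."""
    pos = [p for t, p in zip(y_true, prob_positive) if t == 1]
    neg = [p for t, p in zip(y_true, prob_positive) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return safe_div(wins, len(pos) * len(neg))

# Toy pooled predictions: 1 = positive, 0 = distress.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
prob_positive = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1, 0.45]
metrics = binary_metrics(y_true, prob_positive)
auc = roc_auc(y_true, prob_positive)
print(metrics, auc)
```

Balanced accuracy and macro-F1 weight both classes equally, which matters here because distress clips outnumber positive ones.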

This is a hard, low-resource task and the ceiling is set by the data, not the head: an advanced-head/fusion sweep found that late fusion and neural heads gave no robust gain over a plain CLAP probe (neural ROC-AUC varied ±0.03 across seeds; fusion sat inside LOBO fold-to-fold noise), so the shipped default is the simplest, fully deterministic probe. The honest data ceiling is **0.74 ROC-AUC / ~0.67 balanced-acc**.

ReCANVo speakers are children/young people aged ~6–23, not infants, and positive clips are concentrated in a few speakers. LOBO estimates are therefore noisy, and applying the model to actual infants is a transfer (out-of-distribution) setting in which performance will likely degrade further. Cross-dataset and pooled numbers are confounded by recording-domain / sample-rate cues and do not measure clean generalization.
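The leave-one-speaker-out protocol can be sketched as a fold generator that holds out every clip from one speaker at a time. The speaker IDs and labels below are made up for illustration, and `leave_one_speaker_out` is a hypothetical helper rather than the project's evaluation code.

```python
from collections import defaultdict

def leave_one_speaker_out(speaker_ids):
    """Yield (held_out_speaker, train_idx, test_idx), one fold per speaker."""
    by_speaker = defaultdict(list)
    for i, speaker in enumerate(speaker_ids):
        by_speaker[speaker].append(i)
    for speaker, test_idx in by_speaker.items():
        # Training indices exclude every clip from the held-out speaker.
        train_idx = [i for i, s in enumerate(speaker_ids) if s != speaker]
        yield speaker, train_idx, test_idx

# Made-up speakers and labels (1 = positive, 0 = distress).
speakers = ["P01", "P01", "P02", "P03", "P03", "P03"]
labels = [1, 0, 0, 1, 1, 0]

folds = list(leave_one_speaker_out(speakers))
for speaker, train_idx, test_idx in folds:
    n_positive = sum(labels[i] for i in test_idx)
    print(speaker, len(train_idx), len(test_idx), n_positive)
```

Counting positives per held-out speaker, as the loop does, makes the imbalance visible: a fold whose speaker has no positive clips contributes nothing to positive recall, which is one reason the pooled estimates are noisy.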

Shipped weights are trained on ALL valence-labelled clips (train-on-all, positive=1571 / distress=2829 across ReCANVo + DonateACry + ESC-50) to maximize usefulness. The numbers above are the LOBO eval figures (a separate, leakage-free protocol), not the in-sample fit of the shipped weights.

Data & licensing

Trained on embeddings from the owlgebra-ai/babycry dataset, which aggregates several open corpora (ReCANVo, DonateACry, ESC-50). Licensing is per-source: each component dataset carries its own license and terms. See the dataset card for the per-source license details and obligations before any redistribution or downstream use.

Limitations

  • Weak baseline: unreliable on individual clips; only meaningful in aggregate.
  • Trained on non-infant (ages ~6–23) and mixed-domain data → real-infant use is OOD transfer.
  • Two-class only (positive vs distress); no "ambiguous"/neutral handling, no detection of non-vocal audio.
  • Frozen encoders β€” no fine-tuning on this task.
  • Not for medical, safety, clinical, or real-time infant-monitoring use.