babycry: infant/child valence classifier
Binary valence classifier for infant / young-child vocalizations:
positive (content / happy: cooing, laughing, delighted) vs distress
(crying, frustrated, dysregulated).
It is an ensemble of three frozen audio encoders whose pooled embeddings feed a simple, deterministic CLAP + logistic-regression head (the default).
| encoder | hub id | resample |
|---|---|---|
| AST | MIT/ast-finetuned-audioset-10-10-0.4593 | 16 kHz |
| wav2vec2 | facebook/wav2vec2-base | 16 kHz |
| CLAP | laion/clap-htsat-unfused | 48 kHz |
Preprocessing is self-contained: you pass raw audio (a waveform array + its sample rate, or a file path). The model resamples per encoder, slices into 2 s windows with a 1 s hop, embeds each window with the frozen encoders, mean-pools windows to one vector per clip, and applies the head.
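The windowing and pooling steps described above can be sketched as follows. This is a minimal illustration, not the model's actual code: `window_starts`, `pooled_embedding`, and `embed_fn` are hypothetical names, and the handling of clips shorter than one window (zero-padding here) and of a trailing partial window (dropped here) is an assumption.

```python
import numpy as np

def window_starts(n_samples: int, sr: int, win_s: float = 2.0, hop_s: float = 1.0) -> list[int]:
    """Start indices of fixed-length windows (2 s window, 1 s hop by default)."""
    win = int(round(win_s * sr))
    hop = int(round(hop_s * sr))
    if n_samples <= win:
        # Assumption: a short clip is treated as a single window.
        return [0]
    # Assumption: a trailing partial window is dropped.
    return list(range(0, n_samples - win + 1, hop))

def pooled_embedding(y: np.ndarray, sr: int, embed_fn, win_s: float = 2.0, hop_s: float = 1.0) -> np.ndarray:
    """Embed each window with embed_fn, then mean-pool to one vector per clip."""
    win = int(round(win_s * sr))
    vectors = []
    for start in window_starts(len(y), sr, win_s, hop_s):
        chunk = y[start:start + win]
        if len(chunk) < win:
            # Assumption: zero-pad a short clip up to one full window.
            chunk = np.pad(chunk, (0, win - len(chunk)))
        vectors.append(np.asarray(embed_fn(chunk)))
    return np.mean(np.stack(vectors), axis=0)
```

Here `embed_fn` stands in for one frozen encoder applied to audio already resampled to that encoder's rate; the pooled vector is what a head would consume.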
⚠️ RESEARCH BASELINE ONLY. NOT A MEDICAL / SAFETY / CLINICAL DEVICE. This model is a weak research baseline trained on a small, non-infant-skewed corpus. Do not use it for real infant-monitoring, medical, safety, or caregiving decisions. Its outputs are unreliable for any individual clip and must never gate care for a child.
Usage
```python
from transformers import AutoModel

model = AutoModel.from_pretrained("owlgebra-ai/babycry", trust_remote_code=True)
model.eval()

# from an audio file path (resampling handled internally)
print(model.predict("clip.wav"))
# {'label': 'positive', 'prob_positive': 0.64, 'prob_distress': 0.36, 'head': 'single_logreg'}

# from a raw waveform array (mono, float in [-1, 1]) + its sample rate
import librosa
y, sr = librosa.load("clip.wav", sr=None, mono=True)  # sr=None keeps native rate
print(model.predict(y, sampling_rate=sr))

# a batch (list of paths and/or arrays)
print(model.predict(["a.wav", "b.wav"]))
```
`prob_positive` is P(positive); `label` is obtained by thresholding it at 0.5.
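Because single-clip outputs are noisy, it can help to summarize many predictions rather than read one label. The sketch below mirrors the thresholding rule and averages `prob_positive` over a set of clips. `label_from_prob` and `summarize` are hypothetical helpers; treating a probability of exactly 0.5 as positive, and a batch call returning a list of dictionaries shaped like the single-clip output, are both assumptions.

```python
from statistics import mean

def label_from_prob(prob_positive: float, threshold: float = 0.5) -> str:
    """Map P(positive) to a label. Assumption: a tie at the threshold counts as positive."""
    return "positive" if prob_positive >= threshold else "distress"

def summarize(preds: list[dict]) -> dict:
    """Aggregate per-clip outputs (dicts with a 'prob_positive' key) into clip-set statistics."""
    probs = [p["prob_positive"] for p in preds]
    labels = [label_from_prob(p) for p in probs]
    return {
        "n_clips": len(probs),
        "mean_prob_positive": mean(probs),
        "frac_positive": labels.count("positive") / len(labels),
    }
```

For example, `summarize(model.predict(["a.wav", "b.wav"]))` would report the mean probability and the share of clips labelled positive, assuming the batch output has that shape.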
Input requirements & gotchas
- Sample rate matters: pass the file path, or audio at its original sample rate. The model resamples internally per encoder (AST and wav2vec2 at 16 kHz, CLAP at 48 kHz). If you pre-downsample and then feed the result, the model upsamples for CLAP but cannot recover the lost high-frequency content, and predictions shift materially. Measured on one clip (`prob_positive`): file path / native-SR array = 0.64 (identical), but the same clip downsampled to 16 kHz / 22.05 kHz shifted by ~0.25. Feed the original file (best) or the highest sample rate you have (ideally ≥ 48 kHz).
- `sampling_rate` is required for raw-array input (omitting it raises `ValueError`); it is not needed for a file path (read natively).
- Audio must be float, mono, in `[-1, 1]`. Don't pass int16 PCM: the wrong amplitude scale shifts predictions. Multi-channel input is auto-averaged to mono.
- Call `model.to(device)` before the first `predict()`. The three frozen encoders are lazily loaded onto the model's current device on the first call.
- First call is slow / needs network: it downloads the three frozen encoders (hundreds of MB in total) from their own hubs.
- Per-clip outputs are noisy: treat single-clip predictions with caution; only aggregate behavior is meaningful.
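If your audio arrives as integer PCM (for example from a WAV reader that returns int16), one way to meet the float / `[-1, 1]` requirement is sketched below. `to_float_mono` is a hypothetical helper, not part of the model. It assumes signed integer PCM and a `(n_samples, n_channels)` layout for multi-channel arrays.

```python
import numpy as np

def to_float_mono(pcm: np.ndarray) -> np.ndarray:
    """Convert signed-integer PCM to float32 in [-1, 1] and average channels to mono."""
    x = np.asarray(pcm)
    if np.issubdtype(x.dtype, np.signedinteger):
        # Divide by the magnitude of the most negative value (32768 for int16).
        scale = float(np.iinfo(x.dtype).max) + 1.0
        x = x.astype(np.float32) / scale
    else:
        # Already floating point: only normalize the dtype.
        x = x.astype(np.float32)
    if x.ndim == 2:
        # Assumption: shape is (n_samples, n_channels).
        x = x.mean(axis=1)
    return x

# Usage sketch (not executed here): keep the native sample rate and pass it along.
# model.to(device)  # before the first predict(), as noted above
# model.predict(to_float_mono(pcm), sampling_rate=sr)
```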
Performance (honest)
The numbers to trust are leave-one-speaker-out (LOBO) within ReCANVo, the only clean, leakage-free, same-corpus / same-sample-rate evaluation (pooled out-of-fold predictions, 0.5 threshold, n=3903). For the default CLAP + logistic-regression head:
| balanced-acc | ROC-AUC | positive-recall | positive-precision | macro-F1 |
|---|---|---|---|---|
| 0.648 | 0.706 | 0.641 | 0.556 | 0.643 |
(positive is the ROC-AUC positive class; class_weight='balanced'. Source:
within-ReCANVo LOBO leaderboard, results/valence_advanced_metrics.csv.)
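For readers who want to see how the columns above relate to each other, here is a minimal pure-Python sketch that computes the same kinds of metrics from labels and pooled probabilities. It is not the evaluation script behind the table; `binary_metrics` is a hypothetical helper, and ROC-AUC is computed by simple pairwise comparison, which is only practical for small inputs.

```python
def _div(num: float, den: float) -> float:
    """Safe division that returns 0.0 when the denominator is zero."""
    return num / den if den else 0.0

def binary_metrics(y_true, prob_positive, threshold=0.5):
    """Metrics with 'positive' (1) as the positive class and 'distress' (0) as the negative class."""
    y_pred = [1 if p >= threshold else 0 for p in prob_positive]
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)

    rec_pos, rec_neg = _div(tp, tp + fn), _div(tn, tn + fp)
    prec_pos, prec_neg = _div(tp, tp + fp), _div(tn, tn + fn)
    f1_pos = _div(2 * prec_pos * rec_pos, prec_pos + rec_pos)
    f1_neg = _div(2 * prec_neg * rec_neg, prec_neg + rec_neg)

    # ROC-AUC: probability that a random positive clip scores above a random
    # distress clip, counting ties as one half.
    pos = [p for t, p in zip(y_true, prob_positive) if t == 1]
    neg = [p for t, p in zip(y_true, prob_positive) if t == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)

    return {
        "balanced_acc": (rec_pos + rec_neg) / 2,
        "roc_auc": _div(wins, len(pos) * len(neg)),
        "positive_recall": rec_pos,
        "positive_precision": prec_pos,
        "macro_f1": (f1_pos + f1_neg) / 2,
    }
```

Balanced accuracy averages the recall of both classes, which is why it can sit above the positive-class recall when distress clips are recognized more reliably.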
This is a hard, low-resource task and the ceiling is set by the data, not the
head: an advanced-head/fusion sweep found that late fusion and neural heads gave
no robust gain over a plain CLAP probe (neural ROC-AUC varied ±0.03 across
seeds; fusion sat inside LOBO fold-to-fold noise), so the shipped default is the
simplest, fully deterministic probe. The honest data ceiling is **0.74 ROC-AUC /
~0.67 balanced-acc**.
ReCANVo speakers are children/young people aged ~6–23, not infants, and positive clips are concentrated in a few speakers, so LOBO estimates are noisy, and applying the model to actual infants is a transfer (out-of-distribution) setting in which performance will degrade further. Cross-dataset and pooled numbers are confounded by recording-domain / sample-rate cues and are not clean measures of generalization.
Shipped weights are trained on ALL valence-labelled clips (train-on-all, positive=1571 / distress=2829 across ReCANVo + DonateACry + ESC-50) to maximize usefulness. The numbers above are the LOBO eval figures (a separate, leakage-free protocol), not the in-sample fit of the shipped weights.
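To illustrate what makes a leave-one-speaker-out protocol leakage-free, the sketch below generates folds in which every clip from the held-out speaker is excluded from training. It is an illustration of the general technique, not the project's evaluation code; `leave_one_speaker_out` and the speaker IDs are hypothetical.

```python
from collections import defaultdict

def leave_one_speaker_out(speaker_ids):
    """Yield (held_out_speaker, train_indices, test_indices), one fold per speaker."""
    by_speaker = defaultdict(list)
    for idx, speaker in enumerate(speaker_ids):
        by_speaker[speaker].append(idx)
    for held_out in sorted(by_speaker):
        test_idx = by_speaker[held_out]
        # No clip from the held-out speaker may appear in the training set.
        train_idx = [i for s, idxs in by_speaker.items() if s != held_out for i in idxs]
        yield held_out, train_idx, test_idx

# Pooled out-of-fold predictions: each clip is scored exactly once, by a head
# that never saw its speaker, and all scores are then evaluated together.
speakers = ["P01", "P01", "P02", "P03", "P03", "P03"]  # hypothetical IDs
folds = list(leave_one_speaker_out(speakers))
```

With few speakers holding most positive clips, some folds contain very few positives, which is one reason fold-to-fold results vary.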
Data & licensing
Trained on embeddings from the owlgebra-ai/babycry
dataset, which aggregates several open corpora (ReCANVo, DonateACry, ESC-50).
Licensing is per-source β each component dataset carries its own license and
terms. See the dataset card for the per-source license details and obligations
before any redistribution or downstream use.
Limitations
- Weak baseline: unreliable on individual clips; only meaningful in aggregate.
- Trained on non-infant (ages ~6–23) and mixed-domain data, so real-infant use is OOD transfer.
- Two-class only (`positive` vs `distress`); no "ambiguous"/neutral handling, no detection of non-vocal audio.
- Frozen encoders: no fine-tuning on this task.
- Not for medical, safety, clinical, or real-time infant-monitoring use.