Real-time Emergency Audio Detection

A low-latency, multi-component AI pipeline that continuously monitors audio streams from microphones in public facilities (schools, welfare centers) and detects emergency situations in real time. Audio is analyzed in 4-second sliding-window segments, so an alert can be triggered within about one segment length and handed directly to the facility's broadcast system.

Key Highlights

  • Real-time streaming: Continuous 4-second segment analysis from live microphone input
  • Multi-signal fusion: Combines a binary emergency classifier (main decision), PANNs sound-event detection, a PANNs acoustic event-type prior, Korean keyword spotting, and emotion cues
  • Text-based keyword spotting: Whisper ASR + Korean emergency keyword matching with false-positive filtering
  • 5-class event taxonomy: FIRE / DISASTER / MEDICAL_RESCUE / CRIME / NORMAL, with keyword-driven sub-types
  • Broadcast-ready: Emits a standardized JSON protocol (logs + PostgreSQL) for direct integration with public-address systems
  • 90.0% accuracy / 0.935 F1: Binary emergency classifier, 5-Fold CV on AIHub emergency audio

Architecture

Live Microphone (16kHz, 4s sliding-window segments)
|
+-- SSLAM Encoder (ViT-Base, 768-dim, frozen)
|   +-- BinaryClassifier (768->256->64->1, sigmoid) -- Emergency score 0~1  [MAIN decision]
|
+-- PANNs CNN14 (AudioSet-pretrained, via panns-inference)
|   +-- SED ------------------- 15 sound-event classes (scream / glass_break /
|   |                           explosion / alarm ... -> evidence & escalation)
|   +-- event_type head ------- panns_emg4_head.npz -> 4-class acoustic prior
|                               (FIRE / DISASTER / MEDICAL_RESCUE / CRIME)
|
+-- Silero VAD + RMS --------- silence gate: skip emotion & KWS when
|                              speech_prob < 0.5 AND rms < 0.01
|
+-- Emotion: 2-Track DeepFusion (emotion_fusion_kesd.pth)
|   +-- Track1 CURA(MobileNetV3-Small)+Mamba / Track2 librosa 216-d + 1D-CNN
|   +-- cross-attention -> Ekman 7 + Valence/Arousal -> 48-emotion mapping
|   +-- rule-based fallback when the checkpoint is absent
|
+-- KWS (2-stage, Korean)
|   +-- 1st Text KWS --------- Whisper ASR -> keyword-dict match + FP filter
|   +-- 2nd CNN KWS (fallback) mel-spectrogram CNN, used only when text KWS
|   |                          finds nothing
|   +-- backends: callvoice (default) | cnn | gemini ;  on detection: score +0.3
|
+-- Event-type fusion -> 5-class { FIRE, DISASTER, MEDICAL_RESCUE, CRIME, NORMAL }
|   priority: KWS subtype->parent  >  scream heuristic->MEDICAL_RESCUE  >  PANNs argmax
|   (sub-type is decided by KWS; the low-accuracy acoustic type head was removed)
|
+-- Score fusion
|   +-- escalate on: SSLAM >= 0.8 (solo) | SSLAM >= 0.7 (corroborated) |
|   |                high-emergency SED >= 0.5 | scream heuristic
|   +-- keyword detected -> +0.3 boost, final floored at 0.85
|   +-- no escalation     -> final capped at 0.49 (normal)
|
+-- Alert level: DANGER >= 0.85 | WARNING >= 0.70 | CAUTION >= 0.50 | NORMAL
|
+-- OutputFormatter -> JSON (logs/alert_YYYY-MM-DD.jsonl + PostgreSQL)
    A. metadata   B. location/zone   C. situation (event_type / subtype / risk)
    D. emotion (valence / arousal / 48-emotion)   F. broadcast (scenario / ment / action)
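
The score-fusion rules in the diagram can be read as the following minimal sketch. The function name, its parameters, the boolean `corroborated` flag, and the clipping to 1.0 are assumptions for illustration, not the project's actual code. The diagram does not say how an escalated segment's score is raised, so this sketch simply passes the SSLAM score through in that case.

```python
def fuse_score(sslam, corroborated=False, sed_high=0.0, scream=False, keyword=False):
    """Hypothetical sketch of the score-fusion rules; not the real implementation.

    sslam        -- BinaryClassifier score in [0, 1]
    corroborated -- assumed flag: other evidence supports the SSLAM score
    sed_high     -- highest score among high-emergency SED classes
    scream       -- result of the scream heuristic
    keyword      -- True when keyword spotting fired
    """
    if keyword:
        # Keyword hit: +0.3 boost, floored at 0.85 (clipping to 1.0 is assumed).
        return min(1.0, max(sslam + 0.3, 0.85))

    escalate = (
        sslam >= 0.8                          # SSLAM alone is confident
        or (sslam >= 0.7 and corroborated)    # SSLAM backed by other evidence
        or sed_high >= 0.5                    # high-emergency sound event
        or scream                             # scream heuristic
    )
    if not escalate:
        # Without escalation the final score stays in the NORMAL band.
        return min(sslam, 0.49)
    return sslam
```

Under these assumptions, an uncorroborated SSLAM score of 0.75 stays capped at 0.49, while the same score with corroboration is kept as 0.75.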

Components

Component | Architecture | Purpose
SSLAM Encoder | ViT-Base (frozen, 768-dim) | Shared audio embedding
BinaryClassifier | MLP 768->256->64->1 | Main emergency/normal decision (score 0~1)
PANNs SED | CNN14, AudioSet-pretrained | 15 sound-event classes (scream / glass_break / explosion evidence)
PANNs event_type head | CNN14 emb + linear (panns_emg4_head.npz) | 4-class acoustic prior (FIRE / DISASTER / MEDICAL_RESCUE / CRIME)
Event-type fusion | rule priority (KWS > scream > acoustic) | Final 5-class event_type + sub-type
Text KWS | Whisper ASR + keyword dict | Korean emergency keyword spotting (primary)
CNN KWS | mel-spectrogram CNN (~50K params) | Keyword spotting fallback
Emotion | 2-Track DeepFusion (CURA+Mamba / librosa+CNN, cross-attention) | Ekman 7 emotions + Valence/Arousal + 48-emotion mapping
Silero VAD | Pre-trained | Voice activity / silence gating
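
The silence gate from the architecture diagram (skip emotion and KWS when speech_prob < 0.5 AND rms < 0.01) could look roughly like the sketch below. The function name is hypothetical, and `speech_prob` is assumed to be supplied by Silero VAD elsewhere in the pipeline; only the two thresholds come from the diagram.

```python
import numpy as np


def is_silent(segment, speech_prob, prob_thresh=0.5, rms_thresh=0.01):
    """Return True when a segment should skip emotion analysis and KWS.

    segment     -- 1-D array of audio samples scaled to [-1, 1]
    speech_prob -- speech probability for the segment (assumed to come from Silero VAD)
    """
    samples = np.asarray(segment, dtype=np.float64)
    rms = float(np.sqrt(np.mean(samples ** 2)))
    # Both conditions must hold: low speech probability AND low energy.
    return speech_prob < prob_thresh and rms < rms_thresh
```

Because the two conditions are combined with AND, a loud non-speech sound (for example breaking glass) is not gated out even when the speech probability is low.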

Performance

All scores are 5-Fold CV unless noted. Trained on Korean emergency audio (AIHub); PANNs backbones are pretrained on AudioSet.

Component | Metric | Score | Notes
BinaryClassifier (main) | Accuracy / F1 | 90.0% / 0.935 | Precision 0.935, Recall 0.935 (AIHub)
PANNs event_type head | Accuracy / macro-F1 | 94.6% / 0.946 | 4-class acoustic prior, n=7,958
Emotion DeepFusion | Accuracy (7 emotions) | 69.3% | KESDy18; VA MAE 0.25 / 0.16 (real annotator ratings)

The final emergency decision relies primarily on the BinaryClassifier (90.0%). PANNs SED / event_type prior and KWS act as auxiliary evidence and score modifiers, so their standalone accuracies do not directly bound end-to-end performance. Keyword spotting is primarily text-based (Whisper ASR + Korean keyword dictionary); the CNN KWS is only a fallback. Sound-event detection uses AudioSet-pretrained PANNs (no in-house training); event sub-types are keyword-driven.
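
The event-type priority given in the architecture diagram (KWS sub-type parent, then the scream heuristic, then the PANNs argmax) can be sketched as follows. The function and parameter names are hypothetical, and returning NORMAL for non-escalated segments is an assumption, since the card does not state how NORMAL is assigned.

```python
PARENT_TYPES = ("FIRE", "DISASTER", "MEDICAL_RESCUE", "CRIME")


def fuse_event_type(escalated, panns_probs, kws_parent=None, scream=False):
    """Hypothetical sketch of the event-type priority rule.

    escalated   -- whether the segment was escalated as an emergency (assumed input)
    panns_probs -- dict mapping the 4 parent types to PANNs prior probabilities
    kws_parent  -- parent type of the keyword-derived sub-type, or None
    scream      -- result of the scream heuristic
    """
    if not escalated:
        return "NORMAL"              # assumption: NORMAL when nothing escalated
    if kws_parent in PARENT_TYPES:
        return kws_parent            # 1st priority: keyword evidence
    if scream:
        return "MEDICAL_RESCUE"      # 2nd priority: scream heuristic
    # 3rd priority: acoustic prior from the PANNs event_type head.
    return max(panns_probs, key=panns_probs.get)
```

The ordering means the acoustic prior only decides the type when neither a keyword nor the scream heuristic has fired.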

Latency

Audio-model inference (SSLAM + binary head + PANNs) runs in roughly tens of milliseconds per 4s segment on GPU. When speech is detected, text KWS adds Whisper ASR time on top. Minimum detection delay is bounded by the 4s segment length.
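
To illustrate why the segment length bounds the detection delay, here is a minimal sketch of cutting a 16 kHz stream into 4-second sliding-window segments. The 1-second hop is an assumption (the card does not state the hop size), and the function name is hypothetical.

```python
import numpy as np


def sliding_segments(audio, sample_rate=16000, segment_s=4.0, hop_s=1.0):
    """Yield (start_time_s, segment) pairs over a 1-D audio array.

    hop_s is an assumed value; the actual hop used by the pipeline is not documented.
    """
    seg_len = int(sample_rate * segment_s)
    hop_len = int(sample_rate * hop_s)
    for start in range(0, len(audio) - seg_len + 1, hop_len):
        yield start / sample_rate, audio[start:start + seg_len]


# Usage: 10 s of placeholder audio.
stream = np.zeros(16000 * 10, dtype=np.float32)
starts = [t for t, _ in sliding_segments(stream)]
```

The first segment is only complete once 4 seconds of audio have been captured, so no decision can be made earlier than that regardless of how fast inference runs.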

Alert Levels

Level | Score Range | Action
DANGER | score >= 0.85 | immediate_alert (TRIGGER_BROADCAST)
WARNING | 0.70 <= score < 0.85 | consecutive_alert
CAUTION | 0.50 <= score < 0.70 | log_only
NORMAL | score < 0.50 | none
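
As a minimal sketch, the thresholds and actions in this table map to a simple lookup. The function name and the tuple return value are illustrative choices, not the project's API.

```python
def alert_level(score):
    """Map a fused emergency score in [0, 1] to (level, action).

    Thresholds and action names are taken from the Alert Levels table.
    """
    if score >= 0.85:
        return "DANGER", "immediate_alert"
    if score >= 0.70:
        return "WARNING", "consecutive_alert"
    if score >= 0.50:
        return "CAUTION", "log_only"
    return "NORMAL", "none"
```

Each lower bound is inclusive, so a score of exactly 0.85 is DANGER and a score of exactly 0.70 is WARNING.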

Model Files

File | Size | Description
checkpoints/sslam_binary_head.pth | 0.8MB | Main binary emergency classifier (90.0% acc)
checkpoints/panns_emg4_head.npz | 33KB | PANNs event_type head (4-class acoustic prior, 94.6% acc)
checkpoints/kws_model_cnn.pth | 0.2MB | KWS CNN, 6-class (fallback)
checkpoints/emotion_fusion_kesd.pth | 26.7MB | 2-Track DeepFusion emotion model (KESDy18, 69.3% acc)

SED needs no checkpoint here: sound-event detection uses PANNs CNN14 (~/panns_data/Cnn14_mAP=0.431.pth, auto-downloaded by panns-inference).

Text KWS has no checkpoint here: it uses a Whisper ASR model (callvoice backend: INo0121/whisper-base-ko-callvoice, auto-downloaded on first run) plus the Korean keyword dictionary data/emergency_keywords.json in the GitHub repository.
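
The text KWS step (keyword-dictionary match plus false-positive filter on the ASR transcript) might be sketched as below. Everything here is hypothetical: the schema of data/emergency_keywords.json is not shown in this card, so the in-memory dictionary, the keyword-to-type mapping, and the exclusion-term filter are stand-ins chosen for illustration only.

```python
# Hypothetical keyword -> parent event type mapping (not the real dictionary).
KEYWORDS = {
    "살려주세요": "MEDICAL_RESCUE",  # "please help me"
    "불이야": "FIRE",               # "fire!"
}

# Hypothetical false-positive filter: transcripts containing these terms are ignored.
EXCLUDE_TERMS = ("훈련",)  # "drill / training"


def match_keywords(transcript):
    """Return a list of (keyword, parent_type) found in an ASR transcript."""
    text = transcript.replace(" ", "")  # ignore spacing differences in ASR output
    if any(term in text for term in EXCLUDE_TERMS):
        return []
    return [(kw, parent) for kw, parent in KEYWORDS.items() if kw in text]
```

In this sketch a transcript mentioning a drill is discarded entirely. The real filter may be more selective.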

Training Data

Component | Dataset | Samples
BinaryClassifier | AIHub Emergency Audio (13 emergency + 2 normal categories) | 5-Fold CV
PANNs event_type head | AIHub Emergency Audio (merged into 4 emergency types) | 7,958 (cap 2,000/class)
CNN KWS | 119 Emergency Dispatch Data | 4,649
Emotion DeepFusion | KESDy18 (ETRI acted speech, real V/A annotator ratings) | 2,879
PANNs SED | AudioSet (PANNs CNN14 pretrained) | n/a (no in-house training)

Usage

Download Checkpoints

from huggingface_hub import hf_hub_download

models = [
    "sslam_binary_head.pth",
    "panns_emg4_head.npz",
    "kws_model_cnn.pth",
    "emotion_fusion_kesd.pth",
]

for model in models:
    hf_hub_download(
        repo_id="Nakyung1007/emergency-audio-detection",
        filename=f"checkpoints/{model}",
        local_dir=".",
    )

Real-time Detection (Primary Use Case)

# Start real-time monitoring from microphone
python src/realtime_detection.py --device cuda

# Choose KWS backend / event-type mode explicitly (skip the menu)
python src/realtime_detection.py --kws callvoice --event-mode fusion

# List available microphones
python src/realtime_detection.py --list-devices

# Select a specific microphone and segment length
python src/realtime_detection.py --mic 2 --segment 3

Single File Inference

import torch
from src.detect_and_analyze import EmergencyDetector

detector = EmergencyDetector(device="cuda")

audio = torch.randn(1, 64000)  # placeholder input: 4 s of noise at 16 kHz, shape (1, 16000 * 4)
result = detector.analyze(audio)

print(result["emergency_score"])     # 0.0 ~ 1.0 (after KWS boost / damping)
print(result["alert_level"])         # NORMAL / CAUTION / WARNING / DANGER
print(result["emergency_type"])      # sub-type ID (e.g. "scream") or None
print(result["emergency_type_ko"])   # Korean label, or None
print(result["matched_keywords"])    # e.g. ["살려주세요"] when text KWS fires
print(result["emotion"])             # fear, anger, neutral, ...
print(result["valence"], result["arousal"])  # -1 ~ 1 each
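
The architecture section mentions daily JSONL logs named logs/alert_YYYY-MM-DD.jsonl. A minimal sketch of appending one result to such a file is shown below. The helper name and the flat record layout are assumptions: the real OutputFormatter groups fields into its own sections (metadata, location/zone, situation, emotion, broadcast), which this sketch does not reproduce.

```python
import json
from datetime import datetime
from pathlib import Path


def append_alert(result, log_dir="logs", now=None):
    """Append one result dict as a JSON line to logs/alert_YYYY-MM-DD.jsonl.

    Hypothetical helper; the record layout is a flat stand-in, not the real protocol.
    """
    now = now or datetime.now()
    path = Path(log_dir) / f"alert_{now:%Y-%m-%d}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    record = {"timestamp": now.isoformat(timespec="seconds"), **result}
    with path.open("a", encoding="utf-8") as f:
        # ensure_ascii=False keeps Korean keywords readable in the log.
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path
```

One JSON object per line keeps the log appendable and easy to tail while the detector is running.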

Requirements

  • Python 3.8+
  • PyTorch >= 2.1.0
  • torchaudio >= 2.1.0
  • panns-inference (PANNs CNN14 for SED + event-type prior)
  • transformers (Whisper for text KWS)
  • soundfile, librosa, numpy, tqdm, python-dotenv
  • sounddevice (for real-time microphone input)

Limitations

  • Keyword spotting only supports Korean emergency keywords
  • Optimized for indoor environments
  • Minimum detection latency is ~4 seconds (segment length)
  • SSLAM and CURA backbones are frozen
  • Event sub-types are determined by keyword spotting; an acoustic type classifier was evaluated but removed because its ~49.6% accuracy increased misclassification
  • Emotion model is trained on acted speech (KESDy18); minority classes (fear / surprise / disgust) are rarely predicted, so in practice it discriminates neutral / anger / sadness / happiness more reliably
  • Falls back to rule-based emotion when emotion_fusion_kesd.pth is absent

License

MIT
