Real-time Emergency Audio Detection

A low-latency, multi-component AI pipeline that continuously monitors microphone audio in public facilities (schools, welfare centers) and detects emergencies in real time. Audio is analyzed in 4-second sliding-window segments, so an alert can be raised and handed to a broadcast system within seconds of an event.

Key Highlights

Real-time streaming — Continuous 4-second segment analysis from live microphone input
Multi-signal fusion — Combines a binary emergency classifier (main decision), PANNs sound-event detection, a PANNs acoustic event-type prior, Korean keyword spotting, and emotion cues
Text-based keyword spotting — Whisper ASR + Korean emergency keyword matching with false-positive filtering
5-class event taxonomy — FIRE / DISASTER / MEDICAL_RESCUE / CRIME / NORMAL, with keyword-driven sub-types
Broadcast-ready — Emits a standardized JSON protocol (logs + PostgreSQL) for direct integration with public-address systems
90.0% accuracy / 0.935 F1 — Binary emergency classifier, 5-Fold CV on AIHub emergency audio
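
The segment-based streaming described above can be sketched as a generator that cuts a sample buffer into fixed-length windows. This is a minimal illustration, not the repository's implementation: the 16 kHz rate and 4 s length come from this README, while the 1 s hop and the function name `sliding_segments` are assumptions.

```python
import numpy as np

SAMPLE_RATE = 16_000      # from the architecture diagram
SEGMENT_SECONDS = 4.0     # from the architecture diagram
HOP_SECONDS = 1.0         # assumption: the hop size is not stated in this README


def sliding_segments(samples, sample_rate=SAMPLE_RATE,
                     segment_seconds=SEGMENT_SECONDS, hop_seconds=HOP_SECONDS):
    """Yield (start_sample, segment) pairs of fixed length over a 1-D signal."""
    seg_len = int(round(segment_seconds * sample_rate))
    hop_len = int(round(hop_seconds * sample_rate))
    if seg_len <= 0 or hop_len <= 0:
        raise ValueError("segment and hop must be positive")
    # Trailing samples that do not fill a whole segment are skipped.
    for start in range(0, len(samples) - seg_len + 1, hop_len):
        yield start, samples[start:start + seg_len]


# Ten seconds of silence stands in for a microphone buffer.
buffer = np.zeros(10 * SAMPLE_RATE, dtype=np.float32)
segments = list(sliding_segments(buffer))
starts = [start for start, _ in segments]
```

With a hop shorter than the segment, consecutive windows overlap, which trades extra compute for a shorter wait between successive decisions.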

Architecture

Live Microphone (16kHz, 4s sliding-window segments)
|
+-- SSLAM Encoder (ViT-Base, 768-dim, frozen)
|   +-- BinaryClassifier (768->256->64->1, sigmoid) -- Emergency score 0~1  [MAIN decision]
|
+-- PANNs CNN14 (AudioSet-pretrained, via panns-inference)
|   +-- SED ------------------- 15 sound-event classes (scream / glass_break /
|   |                           explosion / alarm ... -> evidence & escalation)
|   +-- event_type head ------- panns_emg4_head.npz -> 4-class acoustic prior
|                               (FIRE / DISASTER / MEDICAL_RESCUE / CRIME)
|
+-- Silero VAD + RMS --------- silence gate: skip emotion & KWS when
|                              speech_prob < 0.5 AND rms < 0.01
|
+-- Emotion: 2-Track DeepFusion (emotion_fusion_kesd.pth)
|   +-- Track1 CURA(MobileNetV3-Small)+Mamba / Track2 librosa 216-d + 1D-CNN
|   +-- cross-attention -> Ekman 7 + Valence/Arousal -> 48-emotion mapping
|   +-- rule-based fallback when the checkpoint is absent
|
+-- KWS (2-stage, Korean)
|   +-- 1st Text KWS --------- Whisper ASR -> keyword-dict match + FP filter
|   +-- 2nd CNN KWS (fallback) mel-spectrogram CNN, used only when text KWS
|   |                          finds nothing
|   +-- backends: callvoice (default) | cnn | gemini ;  on detection: score +0.3
|
+-- Event-type fusion -> 5-class { FIRE, DISASTER, MEDICAL_RESCUE, CRIME, NORMAL }
|   priority: KWS subtype->parent  >  scream heuristic->MEDICAL_RESCUE  >  PANNs argmax
|   (sub-type is decided by KWS; a separate low-accuracy acoustic sub-type classifier was removed)
|
+-- Score fusion
|   +-- escalate on: SSLAM >= 0.8 (solo) | SSLAM >= 0.7 (corroborated) |
|   |                high-emergency SED >= 0.5 | scream heuristic
|   +-- keyword detected -> +0.3 boost, final floored at 0.85
|   +-- no escalation     -> final capped at 0.49 (normal)
|
+-- Alert level: DANGER >= 0.85 | WARNING >= 0.70 | CAUTION >= 0.50 | NORMAL
|
+-- OutputFormatter -> JSON (logs/alert_YYYY-MM-DD.jsonl + PostgreSQL)
    A. metadata   B. location/zone   C. situation (event_type / subtype / risk)
    D. emotion (valence / arousal / 48-emotion)   F. broadcast (scenario / ment / action)
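
The score-fusion rules in the diagram can be read as a small pure function. The sketch below is one possible reading, under stated assumptions: `corroborated` is a flag supplied by the caller (the diagram does not say what corroborates the SSLAM score), an escalated segment keeps the SSLAM score as its final score, a keyword hit counts as escalation on its own, and the result is clipped to 1.0. The name `fuse_score` and its parameters are hypothetical.

```python
def fuse_score(sslam_score, sed_high_emergency=0.0, corroborated=False,
               scream_heuristic=False, keyword_detected=False):
    """Combine per-segment evidence into a final emergency score in [0, 1]."""
    escalate = (
        sslam_score >= 0.8                            # SSLAM alone
        or (sslam_score >= 0.7 and corroborated)      # SSLAM plus other evidence
        or sed_high_emergency >= 0.5                  # high-emergency SED class
        or scream_heuristic
    )
    if keyword_detected:
        # +0.3 boost, floored at 0.85; clipping to 1.0 is an assumption.
        return min(1.0, max(0.85, sslam_score + 0.3))
    if not escalate:
        # Without escalation the segment stays in the NORMAL band.
        return min(sslam_score, 0.49)
    # Assumption: the escalated score is the SSLAM score itself.
    return sslam_score
```

For example, an SSLAM score of 0.75 with no other evidence is capped at 0.49, while the same score with a corroborating signal passes through unchanged.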

Components

Component	Architecture	Purpose
SSLAM Encoder	ViT-Base (frozen, 768-dim)	Shared audio embedding
BinaryClassifier	MLP 768->256->64->1	Main emergency/normal decision (score 0~1)
PANNs SED	CNN14, AudioSet-pretrained	15 sound-event classes (scream / glass_break / explosion evidence)
PANNs event_type head	CNN14 emb + linear (`panns_emg4_head.npz`)	4-class acoustic prior (FIRE / DISASTER / MEDICAL_RESCUE / CRIME)
Event-type fusion	rule priority (KWS > scream > acoustic)	Final 5-class event_type + sub-type
Text KWS	Whisper ASR + keyword dict	Korean emergency keyword spotting (primary)
CNN KWS	mel-spectrogram CNN (~50K params)	Keyword spotting fallback
Emotion	2-Track DeepFusion (CURA+Mamba / librosa+CNN, cross-attention)	Ekman 7 emotions + Valence/Arousal + 48-emotion mapping
Silero VAD	Pre-trained	Voice activity / silence gating
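
For orientation, the BinaryClassifier row (768->256->64->1 with a sigmoid) corresponds to a small MLP like the sketch below. The layer widths and the sigmoid come from the table; the ReLU activations, the absence of dropout, and the class name `BinaryHeadSketch` are assumptions, so this module is not guaranteed to load `sslam_binary_head.pth`.

```python
import torch
from torch import nn


class BinaryHeadSketch(nn.Module):
    """768 -> 256 -> 64 -> 1 MLP with a sigmoid output."""

    def __init__(self, in_dim=768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256),
            nn.ReLU(),            # assumption: the activation is not stated
            nn.Linear(256, 64),
            nn.ReLU(),            # assumption
            nn.Linear(64, 1),
            nn.Sigmoid(),         # emergency score in [0, 1]
        )

    def forward(self, embedding):
        return self.net(embedding)


head = BinaryHeadSketch().eval()
with torch.no_grad():
    scores = head(torch.randn(2, 768))   # stand-in for two SSLAM embeddings
n_params = sum(p.numel() for p in head.parameters())
```

This layout has 213,377 parameters, which at float32 is in the same range as the 0.8MB checkpoint listed under Model Files.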

Performance

All scores are 5-Fold CV unless noted. Trained on Korean emergency audio (AIHub); PANNs backbones are pretrained on AudioSet.

Component	Metric	Score	Notes
BinaryClassifier (main)	Accuracy / F1	90.0% / 0.935	Precision 0.935, Recall 0.935 (AIHub)
PANNs event_type head	Accuracy / macro-F1	94.6% / 0.946	4-class acoustic prior, n=7,958
Emotion DeepFusion	Accuracy (7 emotions)	69.3%	KESDy18; VA MAE 0.25 / 0.16 (real annotator ratings)

The final emergency decision relies primarily on the BinaryClassifier (90.0%). PANNs SED / event_type prior and KWS act as auxiliary evidence and score modifiers, so their standalone accuracies do not directly bound end-to-end performance. Keyword spotting is primarily text-based (Whisper ASR + Korean keyword dictionary); the CNN KWS is only a fallback. Sound-event detection uses AudioSet-pretrained PANNs (no in-house training); event sub-types are keyword-driven.
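
The event-type priority (KWS sub-type, then scream heuristic, then PANNs argmax) can be sketched as a short rule chain. Everything named here is illustrative: the sub-type IDs in `SUBTYPE_PARENT`, the class order of the PANNs head, the "scream" sub-type label, and the choice to return NORMAL for non-emergency segments are assumptions rather than details confirmed by the repository.

```python
EVENT_TYPES = ("FIRE", "DISASTER", "MEDICAL_RESCUE", "CRIME")  # assumed head order

# Hypothetical sub-type -> parent mapping, for illustration only.
SUBTYPE_PARENT = {"fire_report": "FIRE", "assault": "CRIME"}


def fuse_event_type(is_emergency, panns_probs, kws_subtype=None, scream=False):
    """Return (event_type, sub_type) following KWS > scream > PANNs argmax."""
    if not is_emergency:
        return "NORMAL", None
    if kws_subtype in SUBTYPE_PARENT:
        # 1. A keyword-derived sub-type decides its parent class.
        return SUBTYPE_PARENT[kws_subtype], kws_subtype
    if scream:
        # 2. The scream heuristic maps to MEDICAL_RESCUE.
        return "MEDICAL_RESCUE", "scream"
    # 3. Otherwise fall back to the acoustic prior's most likely class.
    best = max(range(len(EVENT_TYPES)), key=lambda i: panns_probs[i])
    return EVENT_TYPES[best], None
```

Because the acoustic prior is consulted last, a confident PANNs prediction never overrides a keyword match in this reading.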

Latency

Audio-model inference (SSLAM + binary head + PANNs) runs in roughly tens of milliseconds per 4s segment on GPU. When speech is detected, text KWS adds Whisper ASR time on top. Minimum detection delay is bounded by the 4s segment length.
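
Whisper only adds latency on segments that pass the silence gate from the architecture diagram (skip emotion and KWS when speech_prob < 0.5 AND rms < 0.01). A minimal sketch of that check follows; the thresholds are from the diagram, while the function name `is_silent` is hypothetical and `speech_prob` is assumed to be supplied by Silero VAD elsewhere.

```python
import numpy as np


def is_silent(segment, speech_prob, speech_threshold=0.5, rms_threshold=0.01):
    """True when both the VAD probability and the RMS energy are below threshold."""
    samples = np.asarray(segment, dtype=np.float64)
    if samples.size == 0:
        return True  # assumption: an empty segment is treated as silence
    rms = float(np.sqrt(np.mean(samples ** 2)))
    # Both conditions must hold; loud non-speech audio still passes the gate.
    return speech_prob < speech_threshold and rms < rms_threshold


quiet = np.zeros(64_000, dtype=np.float32)
loud = np.full(64_000, 0.1, dtype=np.float32)
```

Requiring both conditions means a loud segment with no detected speech is still analyzed rather than skipped.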

Alert Levels

Level	Score Range	Action
DANGER	>= 0.85	immediate_alert (TRIGGER_BROADCAST)
WARNING	0.70 to < 0.85	consecutive_alert
CAUTION	0.50 to < 0.70	log_only
NORMAL	< 0.50	none
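
The table maps onto a simple threshold lookup. The sketch below uses the thresholds and action names listed above; the function and constant names are illustrative and not taken from the repository.

```python
# Action per level, as listed in the Alert Levels table.
ACTIONS = {
    "DANGER": "immediate_alert",
    "WARNING": "consecutive_alert",
    "CAUTION": "log_only",
    "NORMAL": "none",
}


def alert_level(score):
    """Map a final emergency score in [0, 1] to an alert level."""
    if score >= 0.85:
        return "DANGER"
    if score >= 0.70:
        return "WARNING"
    if score >= 0.50:
        return "CAUTION"
    return "NORMAL"
```

Lower bounds are inclusive, so a score of exactly 0.85 is DANGER and exactly 0.70 is WARNING.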

Model Files

File	Size	Description
`checkpoints/sslam_binary_head.pth`	0.8MB	Main binary emergency classifier (90.0% acc)
`checkpoints/panns_emg4_head.npz`	33KB	PANNs event_type head (4-class acoustic prior, 94.6% acc)
`checkpoints/kws_model_cnn.pth`	0.2MB	KWS CNN, 6-class (fallback)
`checkpoints/emotion_fusion_kesd.pth`	26.7MB	2-Track DeepFusion emotion model (KESDy18, 69.3% acc)

SED needs no checkpoint here — sound-event detection uses PANNs CNN14 (~/panns_data/Cnn14_mAP=0.431.pth, auto-downloaded by panns-inference).

Text KWS has no checkpoint here — it uses a Whisper ASR model (callvoice backend: INo0121/whisper-base-ko-callvoice, auto-downloaded on first run) plus the Korean keyword dictionary data/emergency_keywords.json in the GitHub repository.
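
Once Whisper has produced a transcript, the text KWS stage reduces to dictionary matching plus a false-positive filter. The sketch below illustrates that idea only: the keyword list, the benign-phrase filter, and the function name `match_keywords` are hypothetical, and the actual schema of data/emergency_keywords.json is not described in this README.

```python
import re

# Hypothetical entries for illustration; the real list lives in
# data/emergency_keywords.json.
KEYWORDS = [
    "살려주세요",   # "please save me"
    "불이야",       # "fire!"
    "도와주세요",   # "please help"
]

# Hypothetical benign phrases that contain a keyword but should not trigger.
BENIGN_PHRASES = [
    "숙제 도와주세요",   # "please help with my homework"
]


def match_keywords(transcript):
    """Return the emergency keywords found in an ASR transcript."""
    text = re.sub(r"\s+", " ", transcript.strip())
    # Blank out benign phrases first so their embedded keywords are ignored.
    for phrase in BENIGN_PHRASES:
        text = text.replace(phrase, " ")
    return [kw for kw in KEYWORDS if kw in text]
```

A substring match like this is cheap but sensitive to ASR spelling variation, which is one reason a filter step and a CNN fallback are useful alongside it.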

Training Data

Component	Dataset	Samples
BinaryClassifier	AIHub Emergency Audio (13 emergency + 2 normal categories)	5-Fold CV
PANNs event_type head	AIHub Emergency Audio (merged into 4 emergency types)	7,958 (cap 2,000/class)
CNN KWS	119 Emergency Dispatch Data	4,649
Emotion DeepFusion	KESDy18 (ETRI acted speech, real V/A annotator ratings)	2,879
PANNs SED	AudioSet (PANNs CNN14 pretrained)	— (no in-house training)

Usage

Download Checkpoints

from huggingface_hub import hf_hub_download

models = [
    "sslam_binary_head.pth",
    "panns_emg4_head.npz",
    "kws_model_cnn.pth",
    "emotion_fusion_kesd.pth",
]

for model in models:
    hf_hub_download(
        repo_id="Nakyung1007/emergency-audio-detection",
        filename=f"checkpoints/{model}",
        local_dir=".",
    )

Real-time Detection (Primary Use Case)

# Start real-time monitoring from microphone
python src/realtime_detection.py --device cuda

# Choose KWS backend / event-type mode explicitly (skip the menu)
python src/realtime_detection.py --kws callvoice --event-mode fusion

# List available microphones
python src/realtime_detection.py --list-devices

# Select a specific microphone and segment length
python src/realtime_detection.py --mic 2 --segment 3

Single File Inference

import torch
from src.detect_and_analyze import EmergencyDetector

detector = EmergencyDetector(device="cuda")

audio = torch.randn(1, 64000)  # placeholder input: 4 s of noise at 16 kHz, shape (1, 16000 * 4)
result = detector.analyze(audio)

print(result["emergency_score"])     # 0.0 ~ 1.0 (after KWS boost / damping)
print(result["alert_level"])         # NORMAL / CAUTION / WARNING / DANGER
print(result["emergency_type"])      # sub-type ID (e.g. "scream") or None
print(result["emergency_type_ko"])   # Korean label, or None
print(result["matched_keywords"])    # e.g. ["살려주세요"] when text KWS fires
print(result["emotion"])             # fear, anger, neutral, ...
print(result["valence"], result["arousal"])  # -1 ~ 1 each
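
To persist results in the date-stamped JSONL layout mentioned in the architecture (logs/alert_YYYY-MM-DD.jsonl), one line per segment can be appended as below. This is a sketch under assumptions, not the project's OutputFormatter: the `timestamp` field, the UTC clock, and the function name `append_alert` are hypothetical, the sectioned A to F protocol is not reproduced, and values are assumed to be plain Python types rather than tensors.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def append_alert(result, log_dir="logs", now=None):
    """Append one result dict as a JSON line to <log_dir>/alert_YYYY-MM-DD.jsonl."""
    now = now or datetime.now(timezone.utc)   # assumption: UTC timestamps
    path = Path(log_dir) / f"alert_{now:%Y-%m-%d}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    record = {"timestamp": now.isoformat(), **result}
    with path.open("a", encoding="utf-8") as fh:
        # ensure_ascii=False keeps Korean keywords readable in the log.
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


# Demonstration in a temporary directory with a fixed date.
with tempfile.TemporaryDirectory() as tmp:
    sample = {"emergency_score": 0.91, "alert_level": "DANGER",
              "matched_keywords": ["살려주세요"]}
    written = append_alert(sample, log_dir=tmp,
                           now=datetime(2024, 1, 2, tzinfo=timezone.utc))
    logged = json.loads(written.read_text(encoding="utf-8").splitlines()[0])
```

Appending one JSON object per line keeps the log easy to tail and to load incrementally.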

Requirements

Python 3.8+
PyTorch >= 2.1.0
torchaudio >= 2.1.0
panns-inference (PANNs CNN14 for SED + event-type prior)
transformers (Whisper for text KWS)
soundfile, librosa, numpy, tqdm, python-dotenv
sounddevice (for real-time microphone input)

Limitations

Keyword spotting only supports Korean emergency keywords
Optimized for indoor environments
Minimum detection latency is ~4 seconds (segment length)
SSLAM and CURA backbones are frozen
Event sub-types are determined by keyword spotting; an acoustic type classifier was evaluated but removed because its ~49.6% accuracy increased misclassification
Emotion model is trained on acted speech (KESDy18); minority classes (fear / surprise / disgust) are rarely predicted — in practice it discriminates neutral / anger / sadness / happiness more reliably
Falls back to rule-based emotion when emotion_fusion_kesd.pth is absent

License

MIT
