Real-time Emergency Audio Detection
A low-latency, multi-component AI pipeline that continuously monitors audio streams from microphones in public facilities (schools, welfare centers) and detects emergency situations in real time. Audio is analyzed in 4-second sliding-window segments, so the earliest an alert can fire is roughly one segment length (about 4 seconds) after an event begins, at which point it can be handed directly to a broadcast system.
Key Highlights
- Real-time streaming: Continuous 4-second segment analysis from live microphone input
- Multi-signal fusion: Combines a binary emergency classifier (main decision), PANNs sound-event detection, a PANNs acoustic event-type prior, Korean keyword spotting, and emotion cues
- Text-based keyword spotting: Whisper ASR + Korean emergency keyword matching with false-positive filtering
- 5-class event taxonomy: FIRE / DISASTER / MEDICAL_RESCUE / CRIME / NORMAL, with keyword-driven sub-types
- Broadcast-ready: Emits a standardized JSON protocol (logs + PostgreSQL) for direct integration with public-address systems
- 90.0% accuracy / 0.935 F1: Binary emergency classifier, 5-Fold CV on AIHub emergency audio
Architecture
Live Microphone (16kHz, 4s sliding-window segments)
  |
  +-- SSLAM Encoder (ViT-Base, 768-dim, frozen)
  |     +-- BinaryClassifier (768->256->64->1, sigmoid) -- Emergency score 0~1 [MAIN decision]
  |
  +-- PANNs CNN14 (AudioSet-pretrained, via panns-inference)
  |     +-- SED ------------------- 15 sound-event classes (scream / glass_break /
  |     |                           explosion / alarm ... -> evidence & escalation)
  |     +-- event_type head ------- panns_emg4_head.npz -> 4-class acoustic prior
  |                                 (FIRE / DISASTER / MEDICAL_RESCUE / CRIME)
  |
  +-- Silero VAD + RMS --------- silence gate: skip emotion & KWS when
  |                              speech_prob < 0.5 AND rms < 0.01
  |
  +-- Emotion: 2-Track DeepFusion (emotion_fusion_kesd.pth)
  |     +-- Track1 CURA(MobileNetV3-Small)+Mamba / Track2 librosa 216-d + 1D-CNN
  |     +-- cross-attention -> Ekman 7 + Valence/Arousal -> 48-emotion mapping
  |     +-- rule-based fallback when the checkpoint is absent
  |
  +-- KWS (2-stage, Korean)
  |     +-- 1st Text KWS --------- Whisper ASR -> keyword-dict match + FP filter
  |     +-- 2nd CNN KWS (fallback) mel-spectrogram CNN, used only when text KWS
  |     |                          finds nothing
  |     +-- backends: callvoice (default) | cnn | gemini ; on detection: score +0.3
  |
  +-- Event-type fusion -> 5-class { FIRE, DISASTER, MEDICAL_RESCUE, CRIME, NORMAL }
  |     priority: KWS subtype->parent > scream heuristic->MEDICAL_RESCUE > PANNs argmax
  |     (sub-type is decided by KWS; the low-accuracy acoustic sub-type classifier was removed)
  |
  +-- Score fusion
  |     +-- escalate on: SSLAM >= 0.8 (solo) | SSLAM >= 0.7 (corroborated) |
  |     |                high-emergency SED >= 0.5 | scream heuristic
  |     +-- keyword detected -> +0.3 boost, final floored at 0.85
  |     +-- no escalation -> final capped at 0.49 (normal)
  |
  +-- Alert level: DANGER >= 0.85 | WARNING >= 0.70 | CAUTION >= 0.50 | NORMAL
  |
  +-- OutputFormatter -> JSON (logs/alert_YYYY-MM-DD.jsonl + PostgreSQL)
        A. metadata  B. location/zone  C. situation (event_type / subtype / risk)
        D. emotion (valence / arousal / 48-emotion)  F. broadcast (scenario / ment / action)
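The event-type priority in the diagram (KWS sub-type, then scream heuristic, then PANNs argmax) can be sketched as a small rule function. This is an illustrative sketch, not the project's code: the function name, the `SUBTYPE_TO_PARENT` entries, and the rule that a non-emergency segment maps to NORMAL are all assumptions made for the example.

```python
from typing import Dict, Optional

# Hypothetical sub-type -> parent mapping; the real sub-type IDs are
# defined by the project's keyword dictionary and are not listed here.
SUBTYPE_TO_PARENT = {
    "fire_report": "FIRE",
    "earthquake": "DISASTER",
    "cardiac_arrest": "MEDICAL_RESCUE",
    "assault": "CRIME",
}


def fuse_event_type(
    is_emergency: bool,
    kws_subtype: Optional[str],
    scream_detected: bool,
    panns_probs: Dict[str, float],
) -> str:
    """Pick one of the 5 event types using the documented priority order."""
    # Assumption: a segment that was not escalated is reported as NORMAL.
    if not is_emergency:
        return "NORMAL"
    # 1) A keyword-derived sub-type wins and is mapped to its parent class.
    parent = SUBTYPE_TO_PARENT.get(kws_subtype) if kws_subtype else None
    if parent is not None:
        return parent
    # 2) The scream heuristic maps to MEDICAL_RESCUE.
    if scream_detected:
        return "MEDICAL_RESCUE"
    # 3) Otherwise fall back to the argmax of the 4-class acoustic prior.
    return max(panns_probs, key=panns_probs.get)
```

For example, a segment with a matched "assault" keyword resolves to CRIME even if a scream was also detected, because the keyword rule is checked first.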
Components
| Component | Architecture | Purpose |
|---|---|---|
| SSLAM Encoder | ViT-Base (frozen, 768-dim) | Shared audio embedding |
| BinaryClassifier | MLP 768->256->64->1 | Main emergency/normal decision (score 0~1) |
| PANNs SED | CNN14, AudioSet-pretrained | 15 sound-event classes (scream / glass_break / explosion evidence) |
| PANNs event_type head | CNN14 emb + linear (panns_emg4_head.npz) | 4-class acoustic prior (FIRE / DISASTER / MEDICAL_RESCUE / CRIME) |
| Event-type fusion | rule priority (KWS > scream > acoustic) | Final 5-class event_type + sub-type |
| Text KWS | Whisper ASR + keyword dict | Korean emergency keyword spotting (primary) |
| CNN KWS | mel-spectrogram CNN (~50K params) | Keyword spotting fallback |
| Emotion | 2-Track DeepFusion (CURA+Mamba / librosa+CNN, cross-attention) | Ekman 7 emotions + Valence/Arousal + 48-emotion mapping |
| Silero VAD | Pre-trained | Voice activity / silence gating |
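The silence gate listed in the architecture (skip emotion and KWS when `speech_prob < 0.5` AND `rms < 0.01`) can be sketched as follows. This is a minimal sketch under assumptions: the speech probability is taken as an input (in the pipeline it would come from Silero VAD), samples are assumed to be floats in [-1, 1], and the function names are illustrative.

```python
import math
from typing import Sequence

SPEECH_PROB_THRESHOLD = 0.5  # from the architecture diagram
RMS_THRESHOLD = 0.01         # from the architecture diagram


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square amplitude of a segment (0.0 for an empty segment)."""
    if len(samples) == 0:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_silent(samples: Sequence[float], speech_prob: float) -> bool:
    """True only when BOTH the VAD probability and the RMS level are low."""
    return speech_prob < SPEECH_PROB_THRESHOLD and rms(samples) < RMS_THRESHOLD
```

Because the two conditions are combined with AND, a loud non-speech sound (low speech probability, high RMS) is not gated out, so the later stages still run on it.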
Performance
All scores are 5-Fold CV unless noted. Trained on Korean emergency audio (AIHub); PANNs backbones are pretrained on AudioSet.
| Component | Metric | Score | Notes |
|---|---|---|---|
| BinaryClassifier (main) | Accuracy / F1 | 90.0% / 0.935 | Precision 0.935, Recall 0.935 (AIHub) |
| PANNs event_type head | Accuracy / macro-F1 | 94.6% / 0.946 | 4-class acoustic prior, n=7,958 |
| Emotion DeepFusion | Accuracy (7 emotions) | 69.3% | KESDy18; VA MAE 0.25 / 0.16 (real annotator ratings) |
The final emergency decision relies primarily on the BinaryClassifier (90.0%). PANNs SED / event_type prior and KWS act as auxiliary evidence and score modifiers, so their standalone accuracies do not directly bound end-to-end performance. Keyword spotting is primarily text-based (Whisper ASR + Korean keyword dictionary); the CNN KWS is only a fallback. Sound-event detection uses AudioSet-pretrained PANNs (no in-house training); event sub-types are keyword-driven.
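How the auxiliary signals modify the main score can be sketched from the score-fusion rules in the Architecture section. The thresholds (0.8, 0.7, 0.5, +0.3, 0.85, 0.49) come from this README; everything else is an assumption for illustration: the function name, the meaning of `corroborated` (some other signal agrees), treating a keyword hit as sufficient on its own, and passing the SSLAM score through unchanged when escalation happens without a keyword.

```python
def fuse_score(
    sslam: float,
    corroborated: bool,
    high_emergency_sed: float,
    scream: bool,
    keyword_detected: bool,
) -> float:
    """Combine the binary-classifier score with auxiliary evidence."""
    # Keyword detected: +0.3 boost, clipped to 1.0 and floored at 0.85.
    if keyword_detected:
        return max(min(sslam + 0.3, 1.0), 0.85)

    escalated = (
        sslam >= 0.8                       # strong score on its own
        or (sslam >= 0.7 and corroborated)  # weaker score plus agreement
        or high_emergency_sed >= 0.5       # high-emergency sound event
        or scream                          # scream heuristic
    )
    # No escalation: cap the final score just below the CAUTION threshold.
    if not escalated:
        return min(sslam, 0.49)
    # Assumption: an escalated score is passed through unchanged.
    return sslam
```

Under these assumptions, a classifier score of 0.75 with no supporting evidence is capped to 0.49, while the same score with corroboration stays at 0.75.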
Latency
Audio-model inference (SSLAM + binary head + PANNs) runs in roughly tens of milliseconds per 4s segment on GPU. When speech is detected, text KWS adds Whisper ASR time on top. Minimum detection delay is bounded by the 4s segment length.
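The segment-length bound on latency follows from how a stream is cut into fixed-length windows: nothing can be analyzed until a full segment has been buffered. A minimal buffering sketch is shown below. The class name and the `hop` parameter are assumptions; this README does not state the hop between consecutive windows, so the default here is non-overlapping segments.

```python
from typing import Iterable, List, Optional

SAMPLE_RATE = 16_000   # Hz, per the architecture diagram
SEGMENT_SECONDS = 4


class SegmentBuffer:
    """Accumulates streamed samples and emits fixed-length segments."""

    def __init__(
        self,
        segment_len: int = SAMPLE_RATE * SEGMENT_SECONDS,
        hop: Optional[int] = None,
    ) -> None:
        self.segment_len = segment_len
        # Assumption: hop defaults to the segment length (no overlap).
        self.hop = segment_len if hop is None else hop
        if self.hop <= 0:
            raise ValueError("hop must be positive")
        self._buf: List[float] = []

    def push(self, chunk: Iterable[float]) -> List[List[float]]:
        """Add a chunk of samples; return every segment that is now complete."""
        self._buf.extend(chunk)
        segments = []
        while len(self._buf) >= self.segment_len:
            segments.append(self._buf[: self.segment_len])
            del self._buf[: self.hop]  # slide the window forward
        return segments
```

With a hop smaller than the segment length, consecutive segments overlap, which shortens the gap between successive decisions but does not remove the initial one-segment wait.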
Alert Levels
| Level | Score Range | Action |
|---|---|---|
| DANGER | >= 0.85 | immediate_alert (TRIGGER_BROADCAST) |
| WARNING | 0.70 to < 0.85 | consecutive_alert |
| CAUTION | 0.50 to < 0.70 | log_only |
| NORMAL | < 0.50 | none |
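The table above maps directly onto a threshold lookup. The sketch below uses the thresholds and action names from the table; the function and dictionary names are illustrative assumptions.

```python
# Action names as listed in the Alert Levels table.
ACTIONS = {
    "DANGER": "immediate_alert",
    "WARNING": "consecutive_alert",
    "CAUTION": "log_only",
    "NORMAL": "none",
}


def alert_level(score: float) -> str:
    """Map a fused emergency score (0.0 to 1.0) to an alert level."""
    if score >= 0.85:
        return "DANGER"
    if score >= 0.70:
        return "WARNING"
    if score >= 0.50:
        return "CAUTION"
    return "NORMAL"
```

Checking the thresholds from highest to lowest keeps each lower bound inclusive, so a score of exactly 0.85 is DANGER and exactly 0.70 is WARNING.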
Model Files
| File | Size | Description |
|---|---|---|
| checkpoints/sslam_binary_head.pth | 0.8MB | Main binary emergency classifier (90.0% acc) |
| checkpoints/panns_emg4_head.npz | 33KB | PANNs event_type head (4-class acoustic prior, 94.6% acc) |
| checkpoints/kws_model_cnn.pth | 0.2MB | KWS CNN, 6-class (fallback) |
| checkpoints/emotion_fusion_kesd.pth | 26.7MB | 2-Track DeepFusion emotion model (KESDy18, 69.3% acc) |
SED needs no checkpoint here: sound-event detection uses PANNs CNN14 (~/panns_data/Cnn14_mAP=0.431.pth, auto-downloaded by panns-inference).
Text KWS has no checkpoint here: it uses a Whisper ASR model (callvoice backend: INo0121/whisper-base-ko-callvoice, auto-downloaded on first run) plus the Korean keyword dictionary data/emergency_keywords.json in the GitHub repository.
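The dictionary-matching step of text KWS can be illustrated with a small sketch that searches an ASR transcript for known phrases. Everything concrete here is an assumption: the schema of data/emergency_keywords.json is not described in this README, so `EXAMPLE_KEYWORDS` is a made-up stand-in, and the false-positive filter is not modeled.

```python
from typing import Dict, List

# Hypothetical keyword table (event type -> phrases); NOT the real
# contents or schema of data/emergency_keywords.json.
EXAMPLE_KEYWORDS: Dict[str, List[str]] = {
    "FIRE": ["불이야", "불났어요"],               # "fire!", "there is a fire"
    "MEDICAL_RESCUE": ["살려주세요", "도와주세요"],  # "save me", "help me"
}


def match_keywords(
    transcript: str, keywords: Dict[str, List[str]]
) -> Dict[str, List[str]]:
    """Return the keywords found in the transcript, grouped by event type."""
    # ASR output may space Korean words inconsistently, so compare with
    # all whitespace removed from both the transcript and the keyword.
    compact = "".join(transcript.split())
    hits: Dict[str, List[str]] = {}
    for event_type, words in keywords.items():
        found = [w for w in words if "".join(w.split()) in compact]
        if found:
            hits[event_type] = found
    return hits
```

Plain substring matching like this is what makes a false-positive filter necessary in practice, since an emergency phrase can also occur inside an unrelated sentence.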
Training Data
| Component | Dataset | Samples |
|---|---|---|
| BinaryClassifier | AIHub Emergency Audio (13 emergency + 2 normal categories) | 5-Fold CV |
| PANNs event_type head | AIHub Emergency Audio (merged into 4 emergency types) | 7,958 (cap 2,000/class) |
| CNN KWS | 119 Emergency Dispatch Data | 4,649 |
| Emotion DeepFusion | KESDy18 (ETRI acted speech, real V/A annotator ratings) | 2,879 |
| PANNs SED | AudioSet (PANNs CNN14 pretrained) | n/a (no in-house training) |
Usage
Download Checkpoints
from huggingface_hub import hf_hub_download

models = [
    "sslam_binary_head.pth",
    "panns_emg4_head.npz",
    "kws_model_cnn.pth",
    "emotion_fusion_kesd.pth",
]
for model in models:
    hf_hub_download(
        repo_id="Nakyung1007/emergency-audio-detection",
        filename=f"checkpoints/{model}",
        local_dir=".",
    )
Real-time Detection (Primary Use Case)
# Start real-time monitoring from microphone
python src/realtime_detection.py --device cuda
# Choose KWS backend / event-type mode explicitly (skip the menu)
python src/realtime_detection.py --kws callvoice --event-mode fusion
# List available microphones
python src/realtime_detection.py --list-devices
# Select a specific microphone and segment length
python src/realtime_detection.py --mic 2 --segment 3
Single File Inference
import torch
from src.detect_and_analyze import EmergencyDetector
detector = EmergencyDetector(device="cuda")
audio = torch.randn(1, 64000) # (1, 16000 * 4)
result = detector.analyze(audio)
print(result["emergency_score"]) # 0.0 ~ 1.0 (after KWS boost / damping)
print(result["alert_level"]) # NORMAL / CAUTION / WARNING / DANGER
print(result["emergency_type"]) # sub-type ID (e.g. "scream") or None
print(result["emergency_type_ko"]) # Korean label, or None
print(result["matched_keywords"]) # e.g. ["살려주세요"] ("please save me") when text KWS fires
print(result["emotion"]) # fear, anger, neutral, ...
print(result["valence"], result["arousal"]) # -1 ~ 1 each
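A result dictionary like the one above can be appended to a dated JSONL file, following the logs/alert_YYYY-MM-DD.jsonl naming shown in the Architecture section. This is a sketch only: the helper name is hypothetical, and it writes the flat result as-is rather than the project's structured A to F protocol or its PostgreSQL output.

```python
import json
from datetime import date
from pathlib import Path
from typing import Optional


def append_alert(
    result: dict, log_dir: str = "logs", day: Optional[date] = None
) -> Path:
    """Append one result as a JSON line to logs/alert_YYYY-MM-DD.jsonl."""
    day = day or date.today()
    path = Path(log_dir) / f"alert_{day.isoformat()}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        # ensure_ascii=False keeps Korean keywords and labels readable.
        f.write(json.dumps(result, ensure_ascii=False) + "\n")
    return path
```

The result must contain only JSON-serializable values, so tensors or NumPy scalars would need converting to plain Python types before calling this helper.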
Requirements
- Python 3.8+
- PyTorch >= 2.1.0
- torchaudio >= 2.1.0
- panns-inference (PANNs CNN14 for SED + event-type prior)
- transformers (Whisper for text KWS)
- soundfile, librosa, numpy, tqdm, python-dotenv
- sounddevice (for real-time microphone input)
Limitations
- Keyword spotting only supports Korean emergency keywords
- Optimized for indoor environments
- Minimum detection latency is ~4 seconds (segment length)
- SSLAM and CURA backbones are frozen
- Event sub-types are determined by keyword spotting; an acoustic sub-type classifier was evaluated but removed because its ~49.6% accuracy increased misclassification
- Emotion model is trained on acted speech (KESDy18); minority classes (fear / surprise / disgust) are rarely predicted, so in practice it discriminates neutral / anger / sadness / happiness more reliably
- Falls back to rule-based emotion when emotion_fusion_kesd.pth is absent
License
MIT