title: Who Spoke When
emoji: 🎙️
colorFrom: blue
colorTo: indigo
sdk: docker
app_file: app/main.py
pinned: false
Who Spoke When
Speaker diarization service and web app: upload audio and get back segments showing who spoke when.
The project now runs with a hybrid pipeline:
- Preferred: pyannote/speaker-diarization-3.1 (best quality)
- Fallback: VAD + ECAPA-TDNN embeddings + agglomerative clustering
What You Get
- FastAPI backend (/diarize, /diarize/url, /health)
- Web UI (/) for file upload and timeline view
- CLI demo (demo.py)
- Automatic fallback if pyannote models are unavailable
Project Structure
app/
main.py FastAPI app and endpoints
pipeline.py Hybrid diarization pipeline
models/
embedder.py ECAPA-TDNN embedding extractor
clusterer.py Speaker clustering logic
utils/
audio.py Audio and export helpers
static/
index.html Web UI
Dockerfile
requirements.txt
README.md
Quick Start (Local)
1. Create and activate a virtual environment
Windows PowerShell:
python -m venv .venv
.\.venv\Scripts\Activate.ps1
Linux/macOS:
python -m venv .venv
source .venv/bin/activate
2. Install dependencies
pip install -r requirements.txt
3. (Recommended) Set Hugging Face token
pyannote models are gated. Create a token at https://huggingface.co/settings/tokens.
Windows PowerShell:
$env:HF_TOKEN="your_token_here"
Linux/macOS:
export HF_TOKEN="your_token_here"
4. Run API server
uvicorn app.main:app --host 0.0.0.0 --port 8000
Open:
- UI: http://localhost:8000
- API docs: http://localhost:8000/docs
Web UI Notes
- The UI now defaults to the same-origin API (/diarize), so it works on Hugging Face Spaces.
- If you manually set a custom endpoint, ensure it allows CORS and is reachable from the browser.
Hugging Face Spaces Deployment
Requirements
- Space created (Docker SDK)
- Space secret HF_TOKEN configured
- Terms accepted on both gated pyannote model pages
Push code
Push main branch to your Space repo remote:
git push huggingface main
If the push fails with an unauthorized error:
- Use a token with Write role (not Read)
- Confirm token owner has access to the target namespace
API
GET /health
Returns service health and device.
POST /diarize
Upload an audio file.
Form fields:
- file: audio file
- num_speakers (optional): force a known number of speakers
Example:
curl -X POST http://localhost:8000/diarize \
-F "file=@meeting.mp3" \
-F "num_speakers=2"
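The same upload can be scripted from Python. The sketch below builds the multipart form body by hand using only the standard library, assuming the form field names `file` and `num_speakers` listed above. The helper names `build_multipart` and `diarize_file` are hypothetical, and the response is assumed to be JSON; its exact shape is not documented here.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path


def build_multipart(fields, file_field, filename, file_bytes, boundary=None):
    """Encode text fields plus one file as multipart/form-data.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = boundary or uuid.uuid4().hex
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    parts = []
    for name, value in fields.items():
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode()
        )
    parts.append(
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{file_field}"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n".encode()
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def diarize_file(path, num_speakers=None, base_url="http://localhost:8000"):
    """POST an audio file to /diarize and return the decoded response."""
    path = Path(path)
    # num_speakers is optional, so it is only sent when provided.
    fields = {} if num_speakers is None else {"num_speakers": num_speakers}
    body, content_type = build_multipart(fields, "file", path.name, path.read_bytes())
    request = urllib.request.Request(
        f"{base_url}/diarize",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)  # assumption: the endpoint replies with JSON
```

With the server running locally, `diarize_file("meeting.mp3", num_speakers=2)` would mirror the curl call above.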
POST /diarize/url
Diarize audio from a remote URL.
Example:
curl -X POST "http://localhost:8000/diarize/url?audio_url=https://example.com/sample.wav"
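Remote audio URLs often carry their own query strings or spaces, which can break the request if pasted into `audio_url` unescaped. A minimal sketch of percent-encoding the parameter from Python follows; `build_diarize_url` and `diarize_remote` are hypothetical helper names, and a JSON response is assumed.

```python
import json
import urllib.parse
import urllib.request


def build_diarize_url(audio_url, base_url="http://localhost:8000"):
    """Return the /diarize/url request URL with audio_url percent-encoded."""
    query = urllib.parse.urlencode({"audio_url": audio_url})
    return f"{base_url}/diarize/url?{query}"


def diarize_remote(audio_url, base_url="http://localhost:8000"):
    """POST to /diarize/url and return the decoded response."""
    request = urllib.request.Request(build_diarize_url(audio_url, base_url), method="POST")
    with urllib.request.urlopen(request) as response:
        return json.load(response)  # assumption: the endpoint replies with JSON
```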
CLI Usage
python demo.py --audio meeting.wav
python demo.py --audio meeting.wav --speakers 2
python demo.py --audio meeting.wav --output result.json --rttm result.rttm --srt result.srt
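For orientation, the sketch below shows what the two text export formats conventionally look like. It assumes segments are `(start, end, speaker)` tuples in seconds and uses hypothetical function names; the actual layout written by demo.py may differ in details such as the file id or the cue text.

```python
def to_rttm_line(file_id, start, end, speaker):
    """Format one segment as an RTTM SPEAKER record (onset and duration in seconds)."""
    duration = end - start
    return f"SPEAKER {file_id} 1 {start:.3f} {duration:.3f} <NA> <NA> {speaker} <NA> <NA>"


def srt_timestamp(seconds):
    """Convert seconds to the SRT timestamp form HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments):
    """Render (start, end, speaker) tuples as numbered SRT cues."""
    cues = []
    for index, (start, end, speaker) in enumerate(segments, start=1):
        # Assumption: the cue text is just the speaker label.
        cues.append(f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{speaker}\n")
    return "\n".join(cues)
```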
Configuration (Environment Variables)
| Variable | Default | Description |
|---|---|---|
| HF_TOKEN | unset | Hugging Face token for gated pyannote models |
| CACHE_DIR | temp model cache path | Model download/cache directory |
| USE_PYANNOTE_DIARIZATION | true | Enable full pyannote diarization first |
| PYANNOTE_DIARIZATION_MODEL | pyannote/speaker-diarization-3.1 | pyannote diarization model id |
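One way these variables might be read at startup is sketched below. This is an illustration, not the project's actual loader: `load_settings` is a hypothetical name, the set of strings treated as true is an assumption, and the `model_cache` folder under the system temp directory merely stands in for the unspecified default cache path.

```python
import os
import tempfile


def load_settings(env=None):
    """Read the documented environment variables, applying the table's defaults."""
    env = os.environ if env is None else env
    flag = env.get("USE_PYANNOTE_DIARIZATION", "true").strip().lower()
    return {
        "hf_token": env.get("HF_TOKEN") or None,
        # Assumption: placeholder default; the real cache location is not documented.
        "cache_dir": env.get("CACHE_DIR") or os.path.join(tempfile.gettempdir(), "model_cache"),
        # Assumption: which spellings count as "true" is a guess.
        "use_pyannote": flag in {"1", "true", "yes", "on"},
        "model_id": env.get("PYANNOTE_DIARIZATION_MODEL", "pyannote/speaker-diarization-3.1"),
    }
```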
How the Pipeline Works
- Load and normalize audio
- Try full pyannote diarization (best quality)
- If it is unavailable or fails, fall back to:
  - VAD (pyannote VAD or energy VAD)
  - Sliding windows
  - ECAPA embeddings
  - Agglomerative clustering
- Merge adjacent same-speaker segments
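The control flow and the two purely arithmetic steps above (sliding windows and segment merging) can be sketched in plain Python. Everything here is illustrative: the function names are hypothetical, the 1.5 s window and 0.75 s step are assumed values rather than the project's settings, and applying the merge after either path is an assumption. The VAD, embedding, and clustering steps are left out because they depend on the models.

```python
def sliding_windows(start, end, window=1.5, step=0.75):
    """Split one speech region into overlapping (start, end) windows."""
    windows = []
    i = 0
    while start + i * step + window <= end:
        w_start = start + i * step
        windows.append((w_start, w_start + window))
        i += 1
    if not windows or windows[-1][1] < end:
        # Cover the tail, or a region shorter than one window.
        windows.append((max(start, end - window), end))
    return windows


def merge_adjacent(segments, max_gap=0.0):
    """Merge consecutive same-speaker segments separated by at most max_gap seconds.

    segments: (start, end, speaker) tuples sorted by start time.
    """
    merged = []
    for start, end, speaker in segments:
        if merged and merged[-1][2] == speaker and start - merged[-1][1] <= max_gap:
            prev_start, prev_end, _ = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), speaker)
        else:
            merged.append((start, end, speaker))
    return merged


def diarize(audio, primary, fallback):
    """Try the primary diarizer; switch to the fallback if it raises."""
    try:
        segments = primary(audio)
    except Exception:
        segments = fallback(audio)
    return merge_adjacent(sorted(segments))
```

Here `primary` and `fallback` are any callables that take the audio and return a list of segments, which keeps the fallback decision independent of how each path is implemented.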
Troubleshooting
1) UI shows Error: Failed to fetch
The UI is most likely pointing at the wrong API endpoint. Use the same-origin /diarize endpoint in the deployed UI.
2) Logs show pyannote download/auth warnings
You need:
- a valid HF_TOKEN
- accepted model terms on both pyannote model pages
3) Poor speaker separation
- Provide num_speakers when known
- Ensure clean audio (minimal background noise)
- Prefer the pyannote path (set token + accept terms)
4) 500 during embedding load
This is usually a model download, cache, or authentication problem. Confirm that HF_TOKEN is set, the cache path is writable, and the machine has internet connectivity.
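The first two checks can be automated before starting the server. The sketch below is a hypothetical `preflight` helper, not part of the project: it only verifies that HF_TOKEN is non-empty and that the cache directory accepts a temporary file, and it falls back to the system temp directory when CACHE_DIR is unset (an assumption). It does not validate the token or test connectivity.

```python
import os
import tempfile


def preflight(env=None):
    """Return a list of problems found; an empty list means nothing obvious is wrong."""
    env = os.environ if env is None else env
    problems = []
    if not env.get("HF_TOKEN"):
        problems.append("HF_TOKEN is not set")
    # Assumption: without CACHE_DIR, models are cached somewhere under the temp dir.
    cache_dir = env.get("CACHE_DIR") or tempfile.gettempdir()
    try:
        os.makedirs(cache_dir, exist_ok=True)
        # Creating and discarding a temp file proves write access.
        with tempfile.NamedTemporaryFile(dir=cache_dir):
            pass
    except OSError as exc:
        problems.append(f"cache directory {cache_dir!r} is not writable: {exc}")
    return problems
```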
Limitations
- Overlapping speech may still be handled imperfectly in fallback mode
- Quality depends on audio clarity, language mix, and noise
- Very short utterances are harder to classify reliably
License
Add your preferred license file (LICENSE) if this project is public.