---
title: Who Spoke When
emoji: 🎙️
colorFrom: blue
colorTo: indigo
sdk: docker
app_file: app/main.py
pinned: false
---

Who Spoke When

Speaker diarization service and web app: upload an audio file and get back "who spoke when" segments.

The project now runs with a hybrid pipeline:

  • Preferred: pyannote/speaker-diarization-3.1 (best quality)
  • Fallback: VAD + ECAPA-TDNN embeddings + agglomerative clustering

What You Get

  • FastAPI backend (/diarize, /diarize/url, /health)
  • Web UI (/) for file upload and timeline view
  • CLI demo (demo.py)
  • Automatic fallback if pyannote models are unavailable

Project Structure

app/
  main.py         FastAPI app and endpoints
  pipeline.py     Hybrid diarization pipeline
models/
  embedder.py     ECAPA-TDNN embedding extractor
  clusterer.py    Speaker clustering logic
utils/
  audio.py        Audio and export helpers
static/
  index.html      Web UI
Dockerfile
requirements.txt
README.md

Quick Start (Local)

1. Create and activate a virtual environment

Windows PowerShell:

python -m venv .venv
.\.venv\Scripts\Activate.ps1

Linux/macOS:

python -m venv .venv
source .venv/bin/activate

2. Install dependencies

pip install -r requirements.txt

3. (Recommended) Set Hugging Face token

pyannote models are gated. Create a token at https://huggingface.co/settings/tokens.

Windows PowerShell:

$env:HF_TOKEN="your_token_here"

Linux/macOS:

export HF_TOKEN="your_token_here"

4. Run API server

uvicorn app.main:app --host 0.0.0.0 --port 8000

Open:

  • UI: http://localhost:8000
  • API docs: http://localhost:8000/docs

Web UI Notes

  • The UI defaults to the same-origin API (/diarize), so it works on Hugging Face Spaces.
  • If you manually set a custom endpoint, ensure it allows CORS and is reachable from the browser.

Hugging Face Spaces Deployment

Requirements

  1. Space created (Docker SDK)
  2. Space secret HF_TOKEN configured
  3. Terms accepted on both gated pyannote model pages: pyannote/speaker-diarization-3.1 and the pyannote/segmentation-3.0 model it depends on

Push code

Push the main branch to your Space's git remote:

git push huggingface main

If push fails with unauthorized:

  • Use a token with Write role (not Read)
  • Confirm token owner has access to the target namespace

API

GET /health

Returns service health status and the compute device in use.

POST /diarize

Upload an audio file.

Form fields:

  • file: audio file
  • num_speakers (optional): force a known number of speakers

Example:

curl -X POST http://localhost:8000/diarize \
  -F "file=@meeting.mp3" \
  -F "num_speakers=2"

POST /diarize/url

Diarize audio from a remote URL.

Example:

curl -X POST "http://localhost:8000/diarize/url?audio_url=https://example.com/sample.wav"

CLI Usage

python demo.py --audio meeting.wav
python demo.py --audio meeting.wav --speakers 2
python demo.py --audio meeting.wav --output result.json --rttm result.rttm --srt result.srt
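The --srt flag exports the segments as subtitles. A minimal sketch of how diarization segments map to SRT entries (hypothetical; not demo.py's actual implementation, and the segment dict keys are assumed):

```python
def to_srt(segments):
    """Render [{speaker, start, end}, ...] (times in seconds) as SRT text."""
    def ts(seconds):
        # SRT timestamps use HH:MM:SS,mmm
        ms = round(seconds * 1000)
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    lines = []
    for i, seg in enumerate(segments, 1):
        lines.append(f"{i}\n{ts(seg['start'])} --> {ts(seg['end'])}\n{seg['speaker']}\n")
    return "\n".join(lines)
```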

Configuration (Environment Variables)

Variable                     Default                            Description
--------                     -------                            -----------
HF_TOKEN                     unset                              Hugging Face token for gated pyannote models
CACHE_DIR                    temp model cache path              Model download/cache directory
USE_PYANNOTE_DIARIZATION     true                               Enable full pyannote diarization first
PYANNOTE_DIARIZATION_MODEL   pyannote/speaker-diarization-3.1   pyannote diarization model id
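A sketch of how the service might read these variables (hypothetical helper; the variable names match the table above, the defaults shown are illustrative):

```python
import os

def load_config(env=os.environ):
    """Read the documented environment variables with sensible fallbacks."""
    return {
        "hf_token": env.get("HF_TOKEN"),  # unset -> gated models unavailable
        "cache_dir": env.get("CACHE_DIR", "/tmp/model-cache"),
        "use_pyannote": env.get("USE_PYANNOTE_DIARIZATION", "true").lower() == "true",
        "model_id": env.get("PYANNOTE_DIARIZATION_MODEL", "pyannote/speaker-diarization-3.1"),
    }
```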

How the Pipeline Works

  1. Load and normalize audio
  2. Try full pyannote diarization (best quality)
  3. If pyannote is unavailable or fails, fall back to:
    • VAD (pyannote VAD or energy VAD)
    • Sliding windows
    • ECAPA embeddings
    • Agglomerative clustering
  4. Merge adjacent same-speaker segments
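Step 4 can be sketched as a small post-processing pass (a minimal illustration, assuming segments are dicts with speaker/start/end keys; not the pipeline's actual code):

```python
def merge_adjacent(segments, gap=0.5):
    """Merge consecutive segments from the same speaker when the pause
    between them is at most `gap` seconds."""
    merged = []
    for seg in sorted(segments, key=lambda s: s["start"]):
        last = merged[-1] if merged else None
        if last and seg["speaker"] == last["speaker"] and seg["start"] - last["end"] <= gap:
            # Extend the previous segment instead of emitting a new one
            last["end"] = max(last["end"], seg["end"])
        else:
            merged.append(dict(seg))
    return merged
```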

Troubleshooting

1) UI shows Error: Failed to fetch

This usually means the UI is pointed at the wrong API endpoint. Use the same-origin /diarize path in the deployed UI.

2) Logs show pyannote download/auth warnings

You need:

  • valid HF_TOKEN
  • accepted model terms on both pyannote model pages

3) Poor speaker separation

  • Provide num_speakers when known
  • Ensure clean audio (minimal background noise)
  • Prefer pyannote path (set token + accept terms)

4) 500 during embedding load

This is usually model download/cache/auth mismatch. Confirm HF_TOKEN, cache path write access, and internet connectivity.


Limitations

  • Overlapping speech handling may still be imperfect in fallback mode
  • Quality depends on audio clarity, language mix, and noise
  • Very short utterances are harder to classify reliably

License

Add your preferred license file (LICENSE) if this project is public.