---
title: Who Spoke When
emoji: 🎙️
colorFrom: blue
colorTo: indigo
sdk: docker
app_file: app/main.py
pinned: false
---

Who Spoke When

Speaker diarization service and web app: upload an audio file and get back "who spoke when" segments.

The project now runs with a hybrid pipeline:

  • Preferred: pyannote/speaker-diarization-3.1 (best quality)
  • Fallback: VAD + ECAPA-TDNN embeddings + agglomerative clustering

What You Get

  • FastAPI backend (/diarize, /diarize/url, /health)
  • Web UI (/) for file upload and timeline view
  • CLI demo (demo.py)
  • Automatic fallback if pyannote models are unavailable

Project Structure

app/
  main.py         FastAPI app and endpoints
  pipeline.py     Hybrid diarization pipeline
models/
  embedder.py     ECAPA-TDNN embedding extractor
  clusterer.py    Speaker clustering logic
utils/
  audio.py        Audio and export helpers
static/
  index.html      Web UI
Dockerfile
requirements.txt
README.md

Quick Start (Local)

1. Create and activate a virtual environment

Windows PowerShell:

python -m venv .venv
.\.venv\Scripts\Activate.ps1

Linux/macOS:

python -m venv .venv
source .venv/bin/activate

2. Install dependencies

pip install -r requirements.txt

3. (Recommended) Set Hugging Face token

pyannote models are gated. Create a token at https://huggingface.co/settings/tokens.

Windows PowerShell:

$env:HF_TOKEN="your_token_here"

Linux/macOS:

export HF_TOKEN="your_token_here"

4. Run API server

uvicorn app.main:app --host 0.0.0.0 --port 8000

Open:

  • UI: http://localhost:8000
  • API docs: http://localhost:8000/docs

Web UI Notes

  • The UI defaults to the same-origin API (/diarize), so it works on Hugging Face Spaces.
  • If you manually set a custom endpoint, ensure it allows CORS and is reachable from the browser.

Hugging Face Spaces Deployment

Requirements

  1. Space created (Docker SDK)
  2. Space secret HF_TOKEN configured
  3. Terms accepted on both gated pyannote model pages: pyannote/speaker-diarization-3.1 and the pyannote/segmentation-3.0 model it depends on

Push code

Push the main branch to your Space's git remote:

git push huggingface main

If push fails with unauthorized:

  • Use a token with Write role (not Read)
  • Confirm token owner has access to the target namespace

API

GET /health

Returns service health status and the compute device in use.

POST /diarize

Upload an audio file.

Form fields:

  • file: audio file
  • num_speakers (optional): force a known number of speakers

Example:

curl -X POST http://localhost:8000/diarize \
  -F "file=@meeting.mp3" \
  -F "num_speakers=2"

POST /diarize/url

Diarize audio from a remote URL.

Example:

curl -X POST "http://localhost:8000/diarize/url?audio_url=https://example.com/sample.wav"

CLI Usage

python demo.py --audio meeting.wav
python demo.py --audio meeting.wav --speakers 2
python demo.py --audio meeting.wav --output result.json --rttm result.rttm --srt result.srt
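The --srt flag exports the segments as subtitles. A minimal sketch of how diarization segments map to SRT entries (hypothetical; not demo.py's actual implementation, and the segment dict keys are assumed):

```python
def to_srt(segments):
    """Render [{speaker, start, end}, ...] (times in seconds) as SRT text."""
    def ts(seconds):
        # SRT timestamps use HH:MM:SS,mmm
        ms = round(seconds * 1000)
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    lines = []
    for i, seg in enumerate(segments, 1):
        lines.append(f"{i}\n{ts(seg['start'])} --> {ts(seg['end'])}\n{seg['speaker']}\n")
    return "\n".join(lines)
```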

Configuration (Environment Variables)

Variable                     Default                            Description
--------                     -------                            -----------
HF_TOKEN                     unset                              Hugging Face token for gated pyannote models
CACHE_DIR                    temp model cache path              Model download/cache directory
USE_PYANNOTE_DIARIZATION     true                               Enable full pyannote diarization first
PYANNOTE_DIARIZATION_MODEL   pyannote/speaker-diarization-3.1   pyannote diarization model id
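A sketch of how the service might read these variables (hypothetical helper; the variable names match the table above, the defaults shown are illustrative):

```python
import os

def load_config(env=os.environ):
    """Read the documented environment variables with sensible fallbacks."""
    return {
        "hf_token": env.get("HF_TOKEN"),  # unset -> gated models unavailable
        "cache_dir": env.get("CACHE_DIR", "/tmp/model-cache"),
        "use_pyannote": env.get("USE_PYANNOTE_DIARIZATION", "true").lower() == "true",
        "model_id": env.get("PYANNOTE_DIARIZATION_MODEL", "pyannote/speaker-diarization-3.1"),
    }
```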

How the Pipeline Works

  1. Load and normalize audio
  2. Try full pyannote diarization (best quality)
  3. If pyannote is unavailable or fails, fall back to:
    • VAD (pyannote VAD or energy VAD)
    • Sliding windows
    • ECAPA embeddings
    • Agglomerative clustering
  4. Merge adjacent same-speaker segments
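Step 4 can be sketched as a small post-processing pass (a minimal illustration, assuming segments are dicts with speaker/start/end keys; not the pipeline's actual code):

```python
def merge_adjacent(segments, gap=0.5):
    """Merge consecutive segments from the same speaker when the pause
    between them is at most `gap` seconds."""
    merged = []
    for seg in sorted(segments, key=lambda s: s["start"]):
        last = merged[-1] if merged else None
        if last and seg["speaker"] == last["speaker"] and seg["start"] - last["end"] <= gap:
            # Extend the previous segment instead of emitting a new one
            last["end"] = max(last["end"], seg["end"])
        else:
            merged.append(dict(seg))
    return merged
```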

Troubleshooting

1) UI shows Error: Failed to fetch

This usually means the UI is pointed at the wrong API endpoint. Use the same-origin /diarize path in the deployed UI.

2) Logs show pyannote download/auth warnings

You need:

  • valid HF_TOKEN
  • accepted model terms on both pyannote model pages

3) Poor speaker separation

  • Provide num_speakers when known
  • Ensure clean audio (minimal background noise)
  • Prefer pyannote path (set token + accept terms)

4) 500 during embedding load

This is usually model download/cache/auth mismatch. Confirm HF_TOKEN, cache path write access, and internet connectivity.


Limitations

  • Overlapping speech handling may still be imperfect in fallback mode
  • Quality depends on audio clarity, language mix, and noise
  • Very short utterances are harder to classify reliably

License

Add your preferred license file (LICENSE) if this project is public.