vijesh418 committed
Commit · 1f07aba · 0 parent(s)

Initial commit: MoodSyncAI multi-modal sentiment analyser
Files changed:
- .gitignore       +36 -0
- README.md        +64 -0
- app.py           +1048 -0
- requirements.txt +33 -0
.gitignore
ADDED
# Virtual environments
.venv/
venv/
env/

# Python
__pycache__/
*.py[cod]
*$py.class
*.so
*.egg-info/
.pytest_cache/

# Hugging Face / model caches
hf_cache/
.cache/

# Logs
*.log
install.log

# Dev / scratch scripts (ignored so they stay local-only)
_warmup.py
_smoke_features.py
_verify_requirements.py

# IDE
.vscode/
.idea/
*.swp
.DS_Store

# Build artifacts
build/
dist/
*.spec
README.md
ADDED
# 🎭 MoodSyncAI

**Multi-Modal Sentiment & Emotion Analyser** — combines facial emotion (Vision Transformer), text sentiment (Transformer), a fusion layer (with mismatch detection), and a generative model that summarises the emotional state in plain language. Includes a **webcam / short-video timeline** view.

All models are **100% free & open-source** (Hugging Face Hub).

## Components

| Stage | Model | Type | Requirement satisfied |
|---|---|---|---|
| Visual emotion | `trpakov/vit-face-expression` | **ViT** | CNN/ViT for facial emotion ✅ |
| Text sentiment | `j-hartmann/emotion-english-distilroberta-base` | **Transformer** | RNN/LSTM/Transformer ✅ |
| Speech-to-text | `openai/whisper-tiny` | **Whisper encoder-decoder** | Audio → text channel ✅ |
| Fusion | Valence-aligned multimodal fusion | rule-based + weighted | Fusion + mismatch ✅ |
| Generative | `google/flan-t5-base` | seq2seq Transformer | Generative summary ✅ |
| Webcam / video | OpenCV frame sampling + Plotly timeline | — | Real-time / video input ✅ |
| Attention viz | ViT attention rollout + last-layer text attention | interpretability | Attention visualisation ✅ |

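All of these models load through the standard `transformers` pipeline API, which is the same pattern `app.py` uses internally (lazily, on first request). A minimal sketch for the two classifier channels; the example sentence is purely illustrative:

```python
from transformers import pipeline

# Downloaded from the Hugging Face Hub on first use, then cached locally.
vision = pipeline("image-classification", model="trpakov/vit-face-expression", top_k=None)
text = pipeline("text-classification",
                model="j-hartmann/emotion-english-distilroberta-base", top_k=None)

# top_k=None returns the full emotion distribution as {label, score} dicts.
print(text("I'm absolutely thrilled about the results!")[0])
```
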
## Run

**Prerequisite:** Python **3.10 – 3.13** (CPU is enough — no GPU required, no system ffmpeg required).

```powershell
# 1. Clone / copy this folder onto the new machine, then:
cd "<path-to-folder>"

# 2. Create a virtual env
python -m venv .venv
.\.venv\Scripts\Activate.ps1   # Windows
# source .venv/bin/activate    # macOS / Linux

# 3. Install (use --only-binary to skip Rust/MSVC compilation on Py3.13)
python -m pip install --upgrade pip
pip install -r requirements.txt --only-binary=:all:

# 4. Launch
python app.py
```

The browser opens at `http://127.0.0.1:7860`.

**To stop the app:** press `Ctrl+C` in the terminal running `python app.py`.

**First launch only:** downloads ~1.2 GB of models from Hugging Face into `~/.cache/huggingface/` (cached for all future runs, fully offline afterwards).

That's it — no system packages, no ffmpeg, no GPU, no model files to download manually.

## Tabs

1. **🖼️ Image + Text** — upload a face photo + type the spoken sentence → visual emotion bars, text emotion bars, fusion badge, generative summary. *Optional* attention-rollout heatmap on the face + per-token attention HTML when the toggle is on.
2. **📹 Webcam / Video + Text** — record a 3–10 s clip in the browser → per-frame emotion **timeline chart**, aggregated bars, fusion, summary.
3. **🎙️ Audio + Image** — record/upload audio + face photo. Whisper transcribes the audio; the transcript drives the text channel; full fusion + summary (a minimal transcription sketch follows this list).
4. **🎬 Video with Audio** — record/upload a video *with sound*. Audio is extracted (imageio-ffmpeg), transcribed by Whisper, fed to the text classifier; frames produce the visual timeline; fused result + summary — no typing needed.
5. **ℹ️ About** — architecture & fusion logic.

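For reference, the audio channel in tabs 3 and 4 reduces to one ASR pipeline call, mirroring `transcribe_audio` in `app.py`. A minimal sketch, assuming a 16 kHz mono WAV at the placeholder path `speech.wav`:

```python
import soundfile as sf
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")

# Whisper expects 16 kHz mono float32 samples; "speech.wav" is a placeholder.
audio, sr = sf.read("speech.wav", dtype="float32")
result = asr({"array": audio, "sampling_rate": sr},
             generate_kwargs={"language": "en", "task": "transcribe"})
print(result["text"])
```

`app.py` additionally resamples non-16 kHz input with librosa before making this call.
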
## Fusion / mismatch rule

Each modality's emotion distribution is mapped to a **valence** in `[-1, +1]`.

- Opposite-sign valences → **MISMATCH DETECTED** (amber 🟠)
- Small delta → **ALIGNED** (green 🟢)
- Otherwise → **PARTIALLY ALIGNED** (yellow 🟡)

The generative model is prompted with the structured signals and writes a 2–3 sentence empathetic summary. The exact thresholds are sketched below.
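
Concretely, each modality's distribution is collapsed to a score-weighted valence, and the two valences are compared. A condensed sketch using the thresholds `app.py` applies in `fuse()`:

```python
def weighted_valence(preds, valence):
    # preds: list of {"label": str, "score": float}; valence: label -> [-1, +1]
    return sum(p["score"] * valence.get(p["label"], 0.0) for p in preds)

def align_status(v_val, t_val):
    delta = v_val - t_val
    # Opposite signs (or a very large gap) count as a mismatch.
    if v_val * t_val < -0.05 or abs(delta) > 0.9:
        return "MISMATCH DETECTED"   # 🟠
    if abs(delta) < 0.35:
        return "ALIGNED"             # 🟢
    return "PARTIALLY ALIGNED"       # 🟡

# Example: smiling face (+0.9) while the words read as sad (-0.7) -> mismatch.
print(align_status(0.9, -0.7))  # MISMATCH DETECTED
```
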
app.py
ADDED
"""
MoodSyncAI: Multi-Modal Sentiment & Emotion Analyser
====================================================
Components:
- Visual emotion: ViT (Vision Transformer) - trpakov/vit-face-expression
- Text emotion: DistilRoBERTa transformer - j-hartmann/emotion-english-distilroberta-base
- Fusion: Valence-aligned multimodal fusion + mismatch detection
- Generative: FLAN-T5 (with safe template fallback) for plain-language summary
- Webcam: Short video upload/recording, per-frame emotion timeline

All models are free/open-source from Hugging Face. Runs on CPU.
"""

import os
import io
import time
import warnings
from typing import List, Tuple, Dict

warnings.filterwarnings("ignore")
os.environ.setdefault("TRANSFORMERS_VERBOSITY", "error")
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")

import numpy as np
import pandas as pd
from PIL import Image
import cv2
import plotly.graph_objects as go
import plotly.express as px
import gradio as gr

import torch
from transformers import (
    pipeline,
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    AutoModelForImageClassification,
    AutoModelForSequenceClassification,
    AutoImageProcessor,
)

# -------------------------------------------------------------
# Model identifiers (all free / public on Hugging Face Hub)
# -------------------------------------------------------------
VISION_MODEL = "trpakov/vit-face-expression"                  # ViT for facial emotion
TEXT_MODEL = "j-hartmann/emotion-english-distilroberta-base"  # 7 emotions
GEN_MODEL = "google/flan-t5-base"                             # generative summariser
ASR_MODEL = "openai/whisper-tiny"                             # speech-to-text (Whisper)

DEVICE = 0 if torch.cuda.is_available() else -1
print(f"[MoodSyncAI] Torch device: {'cuda' if DEVICE == 0 else 'cpu'}")

# -------------------------------------------------------------
# Lazy-loaded model singletons
# -------------------------------------------------------------
_vision_pipe = None
_text_pipe = None
_gen_tokenizer = None
_gen_model = None
_face_cascade = None
_asr_pipe = None
_vit_attn_model = None
_vit_attn_processor = None
_text_attn_model = None
_text_attn_tokenizer = None

def get_vision_pipe():
    global _vision_pipe
    if _vision_pipe is None:
        print("[MoodSyncAI] Loading vision model:", VISION_MODEL)
        _vision_pipe = pipeline(
            "image-classification",
            model=VISION_MODEL,
            device=DEVICE,
            top_k=None,
        )
    return _vision_pipe


def get_text_pipe():
    global _text_pipe
    if _text_pipe is None:
        print("[MoodSyncAI] Loading text model:", TEXT_MODEL)
        _text_pipe = pipeline(
            "text-classification",
            model=TEXT_MODEL,
            device=DEVICE,
            top_k=None,
            truncation=True,
        )
    return _text_pipe


def get_generator():
    global _gen_tokenizer, _gen_model
    if _gen_model is None:
        try:
            print("[MoodSyncAI] Loading generator:", GEN_MODEL)
            _gen_tokenizer = AutoTokenizer.from_pretrained(GEN_MODEL)
            _gen_model = AutoModelForSeq2SeqLM.from_pretrained(GEN_MODEL)
            if DEVICE == 0:
                _gen_model = _gen_model.to("cuda")
        except Exception as e:
            print("[MoodSyncAI] Generator load failed, will use template fallback:", e)
            _gen_tokenizer, _gen_model = None, None
    return _gen_tokenizer, _gen_model


def get_face_cascade():
    global _face_cascade
    if _face_cascade is None:
        path = os.path.join(cv2.data.haarcascades, "haarcascade_frontalface_default.xml")
        _face_cascade = cv2.CascadeClassifier(path)
    return _face_cascade


# -------------------------------------------------------------
# Valence map: used to align textual and visual signals
# -------------------------------------------------------------
VALENCE = {
    # text emotions (from distilroberta)
    "joy": 1.0,
    "love": 1.0,
    "surprise": 0.3,
    "neutral": 0.0,
    "sadness": -1.0,
    "fear": -0.8,
    "anger": -0.9,
    "disgust": -0.8,
    # vision labels (ViT face expression labels)
    "happy": 1.0,
    "happiness": 1.0,
    "sad": -1.0,
    "angry": -0.9,
    "fearful": -0.8,
    # "fear" is shared with the text list above
    "disgusted": -0.8,
    "surprised": 0.3,
    "contempt": -0.6,
}


def valence_of(label: str) -> float:
    return VALENCE.get(label.lower().strip(), 0.0)


# -------------------------------------------------------------
# Face detection (crops to face for better accuracy; falls back to full image)
# -------------------------------------------------------------
def detect_and_crop_face(pil_img: Image.Image) -> Image.Image:
    try:
        cascade = get_face_cascade()
        rgb = np.array(pil_img.convert("RGB"))
        gray = cv2.cvtColor(rgb, cv2.COLOR_RGB2GRAY)
        faces = cascade.detectMultiScale(gray, scaleFactor=1.2, minNeighbors=5, minSize=(60, 60))
        if len(faces) == 0:
            return pil_img
        # Pick the largest face
        x, y, w, h = max(faces, key=lambda b: b[2] * b[3])
        pad = int(0.15 * max(w, h))
        x0 = max(0, x - pad); y0 = max(0, y - pad)
        x1 = min(rgb.shape[1], x + w + pad); y1 = min(rgb.shape[0], y + h + pad)
        return Image.fromarray(rgb[y0:y1, x0:x1])
    except Exception:
        return pil_img

# -------------------------------------------------------------
# Core analysis helpers
# -------------------------------------------------------------
def predict_visual(pil_img: Image.Image) -> List[Dict]:
    pipe = get_vision_pipe()
    face = detect_and_crop_face(pil_img)
    preds = pipe(face)
    # normalise into list of {label,score}
    return [{"label": p["label"], "score": float(p["score"])} for p in preds]


def predict_text(text: str) -> List[Dict]:
    if not text or not text.strip():
        return [{"label": "neutral", "score": 1.0}]
    pipe = get_text_pipe()
    preds = pipe(text)[0]  # top_k=None -> list of all
    return [{"label": p["label"], "score": float(p["score"])} for p in preds]


def top1(preds: List[Dict]) -> Tuple[str, float]:
    p = max(preds, key=lambda d: d["score"])
    return p["label"], p["score"]


def weighted_valence(preds: List[Dict]) -> float:
    return sum(p["score"] * valence_of(p["label"]) for p in preds)


def fuse(visual_preds: List[Dict], text_preds: List[Dict]) -> Dict:
    v_label, v_conf = top1(visual_preds)
    t_label, t_conf = top1(text_preds)
    v_val = weighted_valence(visual_preds)
    t_val = weighted_valence(text_preds)

    delta = v_val - t_val
    # mismatch: opposite sign with meaningful magnitude
    mismatch = (v_val * t_val < -0.05) or (abs(delta) > 0.9)

    if mismatch:
        status = "MISMATCH DETECTED"
        badge = "🟠"
    elif abs(delta) < 0.35:
        status = "ALIGNED"
        badge = "🟢"
    else:
        status = "PARTIALLY ALIGNED"
        badge = "🟡"

    # overall valence (weighted average favoring visual when mismatch)
    if mismatch:
        overall_val = 0.6 * v_val + 0.4 * t_val
    else:
        overall_val = 0.5 * (v_val + t_val)

    return {
        "visual_label": v_label,
        "visual_conf": v_conf,
        "text_label": t_label,
        "text_conf": t_conf,
        "visual_valence": v_val,
        "text_valence": t_val,
        "delta": delta,
        "status": status,
        "badge": badge,
        "overall_valence": overall_val,
    }

# -------------------------------------------------------------
# Generative summary
# -------------------------------------------------------------
def template_summary(fusion: Dict) -> str:
    v = fusion["visual_label"]; vc = fusion["visual_conf"]
    t = fusion["text_label"]; tc = fusion["text_conf"]
    if fusion["status"].startswith("MISMATCH"):
        return (
            f"Despite expressing **{t}** sentiment verbally ({tc*100:.0f}% confidence), "
            f"the speaker's facial cues indicate **{v}** ({vc*100:.0f}% confidence). "
            f"This incongruence between words and expression is worth noting in the "
            f"context of the conversation - the spoken message may not fully reflect "
            f"how the person actually feels."
        )
    if fusion["status"] == "ALIGNED":
        return (
            f"The speaker's words ({t}, {tc*100:.0f}%) and facial expression "
            f"({v}, {vc*100:.0f}%) are consistent. The overall emotional state "
            f"appears genuine and uncomplicated."
        )
    return (
        f"The speaker shows mild divergence between facial expression ({v}, "
        f"{vc*100:.0f}%) and spoken sentiment ({t}, {tc*100:.0f}%). The signals "
        f"are not contradictory but suggest some nuance in the emotional state."
    )


def generative_summary(fusion: Dict, text_input: str) -> str:
    tok, model = get_generator()
    fallback = template_summary(fusion)
    if model is None or tok is None:
        return fallback
    try:
        mismatch = fusion["status"].startswith("MISMATCH")
        instr = (
            "rewrite as one empathetic paragraph (2-3 sentences) that explicitly "
            "highlights the mismatch between facial expression and spoken words"
            if mismatch else
            "rewrite as one empathetic paragraph (2-3 sentences) noting the emotional state"
        )
        prompt = (
            f"You are an empathetic psychologist. Given the analysis below, {instr}. "
            f"Begin with the word 'The'.\n\n"
            f"Analysis:\n"
            f"- Spoken sentence: \"{text_input or '(none provided)'}\"\n"
            f"- Facial emotion detected: {fusion['visual_label']} "
            f"({fusion['visual_conf']*100:.0f}% confidence)\n"
            f"- Sentiment of the words: {fusion['text_label']} "
            f"({fusion['text_conf']*100:.0f}% confidence)\n"
            f"- Alignment: {fusion['status']}\n\n"
            f"Paragraph:"
        )
        inputs = tok(prompt, return_tensors="pt", truncation=True, max_length=512)
        if DEVICE == 0:
            inputs = {k: v.to("cuda") for k, v in inputs.items()}
        out = model.generate(
            **inputs,
            max_new_tokens=140,
            min_new_tokens=30,
            num_beams=4,
            do_sample=False,
            no_repeat_ngram_size=3,
            early_stopping=True,
        )
        text = tok.decode(out[0], skip_special_tokens=True).strip()
        # Reject obvious echoes / too-short / off-topic outputs
        bad = (len(text) < 50
               or text.lower().startswith(("tell ", "write ", "give "))
               or "story" in text.lower()[:40]
               or (fusion["visual_label"].lower() not in text.lower()
                   and fusion["text_label"].lower() not in text.lower()))
        if bad:
            return fallback
        return text
    except Exception as e:
        print("[MoodSyncAI] Generation error:", e)
        return fallback


# -------------------------------------------------------------
# Plotly charts
# -------------------------------------------------------------
def bar_chart(preds: List[Dict], title: str, color: str) -> go.Figure:
    df = pd.DataFrame(preds).sort_values("score", ascending=True)
    df["pct"] = (df["score"] * 100).round(1)
    fig = go.Figure(go.Bar(
        x=df["pct"], y=df["label"], orientation="h",
        marker=dict(color=color),
        text=df["pct"].astype(str) + "%",
        textposition="outside",
    ))
    fig.update_layout(
        title=title,
        xaxis_title="Confidence (%)",
        yaxis_title=None,
        xaxis=dict(range=[0, 110]),
        height=320, margin=dict(l=10, r=10, t=40, b=10),
        template="plotly_white",
    )
    return fig


def empty_fig(msg="No data") -> go.Figure:
    fig = go.Figure()
    fig.add_annotation(text=msg, xref="paper", yref="paper",
                       x=0.5, y=0.5, showarrow=False, font=dict(size=14))
    fig.update_layout(height=320, template="plotly_white",
                      margin=dict(l=10, r=10, t=20, b=10))
    return fig

# -------------------------------------------------------------
# Tab 1: Image + Text analysis
# -------------------------------------------------------------
def analyse_image_text(image: Image.Image, text: str):
    if image is None:
        return (empty_fig("Please upload an image"),
                empty_fig("Awaiting input"),
                "### ⚠️ Please upload an image of a face.", "")

    visual_preds = predict_visual(image)
    text_preds = predict_text(text or "")

    fusion = fuse(visual_preds, text_preds)
    summary = generative_summary(fusion, text)

    vfig = bar_chart(visual_preds, "👁️ Visual Emotion (ViT)", "#4C78A8")
    tfig = bar_chart(text_preds, "💬 Text Sentiment (Transformer)", "#54A24B")

    fusion_md = f"""
### {fusion['badge']} Fusion Result: **{fusion['status']}**

| Modality | Top Prediction | Confidence | Valence |
|---|---|---|---|
| 👁️ Visual | **{fusion['visual_label']}** | {fusion['visual_conf']*100:.1f}% | {fusion['visual_valence']:+.2f} |
| 💬 Text | **{fusion['text_label']}** | {fusion['text_conf']*100:.1f}% | {fusion['text_valence']:+.2f} |
| 🔗 Overall valence | — | — | **{fusion['overall_valence']:+.2f}** |
"""
    summary_md = f"### 🧠 Generative Summary\n\n> {summary}"
    return vfig, tfig, fusion_md, summary_md


# -------------------------------------------------------------
# Tab 2: Webcam / short video → emotion timeline
# -------------------------------------------------------------
def sample_frames(video_path: str, max_frames: int = 12) -> List[Tuple[float, Image.Image]]:
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        return []
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT) or 0)

    # If total frames is unknown, read sequentially to count.
    if total <= 0:
        total = 0
        while True:
            ok, _ = cap.read()
            if not ok:
                break
            total += 1
        cap.release()
        cap = cv2.VideoCapture(video_path)
        if total <= 0:
            return []

    duration = total / fps if fps > 0 else 1.0
    n = min(max_frames, max(3, int(duration * 2)))  # ~2 fps target
    target_idxs = set(np.linspace(0, total - 1, n).astype(int).tolist())

    out: List[Tuple[float, Image.Image]] = []
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx in target_idxs:
            ts = idx / fps if fps > 0 else float(idx)
            pil = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            out.append((float(ts), pil))
            if len(out) >= n:
                break
        idx += 1
    cap.release()
    return out


def analyse_video_text(video_path, text: str):
    if not video_path:
        return (empty_fig("Record or upload a short video"),
                empty_fig("Awaiting input"),
                empty_fig("Awaiting input"),
                "### ⚠️ Please provide a webcam video.", "")

    frames = sample_frames(video_path, max_frames=12)
    if not frames:
        return (empty_fig("Could not read video"),
                empty_fig(""), empty_fig(""),
                "### ⚠️ Could not decode the video file.", "")

    timeline = []  # list of dict: ts, label->score
    aggregated: Dict[str, float] = {}
    for ts, pil in frames:
        preds = predict_visual(pil)
        row = {"timestamp": ts}
        for p in preds:
            row[p["label"]] = p["score"]
            aggregated[p["label"]] = aggregated.get(p["label"], 0.0) + p["score"]
        timeline.append(row)

    # Average the aggregated visual prediction across frames
    n = len(frames)
    avg_visual = [{"label": k, "score": v / n} for k, v in aggregated.items()]

    text_preds = predict_text(text or "")
    fusion = fuse(avg_visual, text_preds)
    summary = generative_summary(fusion, text)

    # Timeline figure (line per emotion)
    df = pd.DataFrame(timeline).fillna(0.0)
    label_cols = [c for c in df.columns if c != "timestamp"]
    tl_fig = go.Figure()
    palette = px.colors.qualitative.Set2
    for i, lbl in enumerate(label_cols):
        tl_fig.add_trace(go.Scatter(
            x=df["timestamp"], y=df[lbl] * 100,
            mode="lines+markers", name=lbl,
            line=dict(color=palette[i % len(palette)], width=2),
        ))
    tl_fig.update_layout(
        title="📈 Emotion Timeline (per frame)",
        xaxis_title="Time (s)", yaxis_title="Confidence (%)",
        height=360, template="plotly_white",
        margin=dict(l=10, r=10, t=40, b=10),
        yaxis=dict(range=[0, 100]),
    )

    vfig = bar_chart(avg_visual, "👁️ Average Visual Emotion", "#4C78A8")
    tfig = bar_chart(text_preds, "💬 Text Sentiment", "#54A24B")

    fusion_md = f"""
### {fusion['badge']} Fusion Result: **{fusion['status']}**

| Modality | Top Prediction | Confidence | Valence |
|---|---|---|---|
| 👁️ Visual (avg) | **{fusion['visual_label']}** | {fusion['visual_conf']*100:.1f}% | {fusion['visual_valence']:+.2f} |
| 💬 Text | **{fusion['text_label']}** | {fusion['text_conf']*100:.1f}% | {fusion['text_valence']:+.2f} |
| 🔗 Overall valence | — | — | **{fusion['overall_valence']:+.2f}** |

*Analysed {n} frames from the video.*
"""
    summary_md = f"### 🧠 Generative Summary\n\n> {summary}"
    return tl_fig, vfig, tfig, fusion_md, summary_md

# =============================================================
# NEW FEATURE BLOCK (additive — does not touch Tab 1 / Tab 2)
# =============================================================
# 1) Whisper ASR (audio → text channel)
# 2) Video with audio (transcribe + frame timeline + fusion)
# 3) Attention visualisation (ViT rollout heatmap + text token attention)
# =============================================================

import tempfile
import subprocess
import html as _html


def get_asr_pipe():
    global _asr_pipe
    if _asr_pipe is None:
        print("[MoodSyncAI] Loading ASR model:", ASR_MODEL)
        _asr_pipe = pipeline(
            "automatic-speech-recognition",
            model=ASR_MODEL,
            device=DEVICE,
            chunk_length_s=30,
            return_timestamps=False,
        )
    return _asr_pipe


def transcribe_audio(audio_path: str) -> str:
    if not audio_path:
        return ""
    try:
        # Load audio ourselves (soundfile/librosa) so we don't depend on
        # whisper's internal ffmpeg-via-PATH lookup.
        import soundfile as sf
        try:
            audio, sr = sf.read(audio_path, dtype="float32", always_2d=False)
        except Exception:
            import librosa
            audio, sr = librosa.load(audio_path, sr=16000, mono=True)
        if audio.ndim > 1:
            audio = audio.mean(axis=1)
        if sr != 16000:
            import librosa
            audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)
            sr = 16000
        if audio.size == 0:
            return ""
        pipe = get_asr_pipe()
        out = pipe(
            {"array": audio, "sampling_rate": sr},
            generate_kwargs={"language": "en", "task": "transcribe"},
        )
        text = out.get("text", "") if isinstance(out, dict) else str(out)
        return (text or "").strip()
    except Exception as e:
        print("[MoodSyncAI] Transcription error:", e)
        return ""


def _ffmpeg_exe() -> str:
    try:
        import imageio_ffmpeg
        return imageio_ffmpeg.get_ffmpeg_exe()
    except Exception:
        return "ffmpeg"


def extract_audio_from_video(video_path: str) -> str:
    """Extract mono 16 kHz wav from video. Returns wav path or '' on failure."""
    if not video_path:
        return ""
    try:
        out_path = tempfile.NamedTemporaryFile(
            suffix=".wav", delete=False
        ).name
        cmd = [
            _ffmpeg_exe(), "-y", "-i", video_path,
            "-vn", "-ac", "1", "-ar", "16000",
            "-f", "wav", out_path,
        ]
        proc = subprocess.run(cmd, capture_output=True, timeout=120)
        if proc.returncode != 0 or not os.path.exists(out_path) or os.path.getsize(out_path) < 1024:
            return ""
        return out_path
    except Exception as e:
        print("[MoodSyncAI] Audio-extract error:", e)
        return ""


# -------------------------------------------------------------
# Attention visualisation
# -------------------------------------------------------------
def _get_vit_attn():
    global _vit_attn_model, _vit_attn_processor
    if _vit_attn_model is None:
        print("[MoodSyncAI] Loading ViT (eager attn) for attention rollout")
        _vit_attn_processor = AutoImageProcessor.from_pretrained(VISION_MODEL)
        _vit_attn_model = AutoModelForImageClassification.from_pretrained(
            VISION_MODEL, attn_implementation="eager"
        )
        _vit_attn_model.eval()
        if DEVICE == 0:
            _vit_attn_model = _vit_attn_model.to("cuda")
    return _vit_attn_model, _vit_attn_processor


def _get_text_attn():
    global _text_attn_model, _text_attn_tokenizer
    if _text_attn_model is None:
        print("[MoodSyncAI] Loading text classifier (eager attn) for token attention")
        _text_attn_tokenizer = AutoTokenizer.from_pretrained(TEXT_MODEL)
        _text_attn_model = AutoModelForSequenceClassification.from_pretrained(
            TEXT_MODEL, attn_implementation="eager"
        )
        _text_attn_model.eval()
        if DEVICE == 0:
            _text_attn_model = _text_attn_model.to("cuda")
    return _text_attn_model, _text_attn_tokenizer


def vit_attention_heatmap(pil_img: Image.Image) -> Image.Image:
    """Attention-rollout heatmap overlaid on the (face-cropped) image."""
    try:
        face = detect_and_crop_face(pil_img).convert("RGB")
        model, processor = _get_vit_attn()
        inputs = processor(images=face, return_tensors="pt")
        if DEVICE == 0:
            inputs = {k: v.to("cuda") for k, v in inputs.items()}
        with torch.no_grad():
            out = model(**inputs, output_attentions=True)
        attns = out.attentions  # tuple(L) of (1, H, S, S)
        if not attns:
            return face

        # Attention rollout: avg heads, add identity, normalise, multiply layers
        result = None
        for a in attns:
            a = a.mean(dim=1).squeeze(0)  # (S, S)
            a = a + torch.eye(a.size(0), device=a.device)
            a = a / a.sum(dim=-1, keepdim=True)
            result = a if result is None else a @ result

        # CLS-token row, drop CLS index → patch importances
        cls_attn = result[0, 1:].detach().cpu().numpy()
        side = int(np.sqrt(cls_attn.shape[0]))
        if side * side != cls_attn.shape[0]:
            return face
        grid = cls_attn.reshape(side, side)
        grid = (grid - grid.min()) / (grid.max() - grid.min() + 1e-8)

        # Resize heatmap to face image
        w, h = face.size
        heat = cv2.resize(grid, (w, h), interpolation=cv2.INTER_CUBIC)
        heat_u8 = (heat * 255).astype(np.uint8)
        color = cv2.applyColorMap(heat_u8, cv2.COLORMAP_JET)
        color = cv2.cvtColor(color, cv2.COLOR_BGR2RGB)
        base = np.array(face)
        overlay = (0.55 * base + 0.45 * color).clip(0, 255).astype(np.uint8)
        return Image.fromarray(overlay)
    except Exception as e:
        print("[MoodSyncAI] ViT attention error:", e)
        return pil_img


def text_token_attention_html(text: str) -> str:
    """Render input text with per-token attention intensity (last layer, [CLS] row)."""
    if not text or not text.strip():
        return "<em>(no text)</em>"
    try:
        model, tok = _get_text_attn()
        enc = tok(text, return_tensors="pt", truncation=True, max_length=256)
        if DEVICE == 0:
            enc = {k: v.to("cuda") for k, v in enc.items()}
        with torch.no_grad():
            out = model(**enc, output_attentions=True)
        attns = out.attentions  # tuple(L) of (1, H, S, S)
        if not attns:
            return _html.escape(text)
        last = attns[-1].mean(dim=1).squeeze(0)  # (S, S)
        cls_row = last[0].detach().cpu().numpy()  # importance of each token to CLS

        ids = enc["input_ids"][0].detach().cpu().tolist()
        tokens = tok.convert_ids_to_tokens(ids)
        # Skip special tokens for normalisation range
        specials = set(tok.all_special_tokens)
        keep_mask = np.array([t not in specials for t in tokens])
        if keep_mask.sum() == 0:
            return _html.escape(text)
        scores = cls_row.copy()
        scores_disp = scores[keep_mask]
        lo, hi = scores_disp.min(), scores_disp.max()
        norm = (scores - lo) / (hi - lo + 1e-8)
        norm = np.clip(norm, 0.0, 1.0)

        # Build HTML: merge subword tokens (RoBERTa uses 'Ġ' prefix for word start)
        spans = []
        for i, t in enumerate(tokens):
            if t in specials:
                continue
            display = t
            prefix_space = ""
            if display.startswith("Ġ"):
                display = display[1:]
                prefix_space = " "
            elif display.startswith("▁"):
                display = display[1:]
                prefix_space = " "
            intensity = float(norm[i])
            # red highlight, alpha from intensity
            bg = f"rgba(220,38,38,{intensity:.2f})"
            color = "#fff" if intensity > 0.55 else "#111"
            safe = _html.escape(display)
            spans.append(
                f"{prefix_space}<span style=\"background:{bg};color:{color};"
                f"padding:2px 4px;border-radius:4px;margin:1px;"
                f"font-family:monospace\" title=\"{intensity:.2f}\">{safe}</span>"
            )
        body = "".join(spans).strip()
        legend = (
            "<div style='margin-top:8px;font-size:12px;color:#555'>"
            "Darker red = higher attention weight from [CLS] to that token "
            "(last transformer layer, averaged over heads)."
            "</div>"
        )
        return f"<div style='line-height:2;font-size:15px'>{body}</div>{legend}"
    except Exception as e:
        print("[MoodSyncAI] Text attention error:", e)
        return _html.escape(text)

# -------------------------------------------------------------
# Tab 1 wrapper: existing outputs + (optional) attention viz
# -------------------------------------------------------------
def analyse_image_text_with_attention(image: Image.Image, text: str, show_attn: bool):
    vfig, tfig, fusion_md, summary_md = analyse_image_text(image, text)
    if not show_attn or image is None:
        return (vfig, tfig, fusion_md, summary_md,
                None, "<em>Toggle 'Show attention visualisation' to view.</em>")
    heat = vit_attention_heatmap(image)
    token_html = text_token_attention_html(text or "")
    return vfig, tfig, fusion_md, summary_md, heat, token_html


# -------------------------------------------------------------
# Tab 3: Audio + Image
# -------------------------------------------------------------
def analyse_audio_image(audio_path, image: Image.Image):
    if image is None and not audio_path:
        return ("",
                empty_fig("Provide an image"),
                empty_fig("Provide audio"),
                "### ⚠️ Please provide both an image and audio.", "")
    transcript = transcribe_audio(audio_path) if audio_path else ""
    if not transcript:
        transcript = "(no speech detected)"
    if image is None:
        return (transcript,
                empty_fig("No image provided"),
                empty_fig("(transcript only)"),
                "### ⚠️ Please also provide a face image.", "")

    visual_preds = predict_visual(image)
    spoken = "" if transcript.startswith("(") else transcript
    text_preds = predict_text(spoken)
    fusion = fuse(visual_preds, text_preds)
    summary = generative_summary(fusion, spoken)

    vfig = bar_chart(visual_preds, "👁️ Visual Emotion (ViT)", "#4C78A8")
    tfig = bar_chart(text_preds, "💬 Sentiment of Transcribed Speech", "#54A24B")
    fusion_md = f"""
### {fusion['badge']} Fusion Result: **{fusion['status']}**

| Modality | Top Prediction | Confidence | Valence |
|---|---|---|---|
| 👁️ Visual (image) | **{fusion['visual_label']}** | {fusion['visual_conf']*100:.1f}% | {fusion['visual_valence']:+.2f} |
| 🎙️ Audio → Text | **{fusion['text_label']}** | {fusion['text_conf']*100:.1f}% | {fusion['text_valence']:+.2f} |
| 🔗 Overall valence | — | — | **{fusion['overall_valence']:+.2f}** |
"""
    summary_md = f"### 🧠 Generative Summary\n\n> {summary}"
    return transcript, vfig, tfig, fusion_md, summary_md


# -------------------------------------------------------------
# Tab 4: Video WITH audio (frames timeline + audio transcript → text channel)
# -------------------------------------------------------------
def analyse_video_with_audio(video_path):
    if not video_path:
        return ("",
                empty_fig("Record or upload a video"),
                empty_fig(""), empty_fig(""),
                "### ⚠️ Please provide a video.", "")

    frames = sample_frames(video_path, max_frames=12)
    if not frames:
        return ("",
                empty_fig("Could not read video"),
                empty_fig(""), empty_fig(""),
                "### ⚠️ Could not decode the video file.", "")

    # 1) Audio → transcript
    wav = extract_audio_from_video(video_path)
    transcript = transcribe_audio(wav) if wav else ""
    if wav and os.path.exists(wav):
        try:
            os.remove(wav)
        except Exception:
            pass
    if not transcript:
        transcript = "(no speech detected in the audio track)"
    spoken = "" if transcript.startswith("(") else transcript

    # 2) Per-frame visual + aggregate
    timeline = []
    aggregated: Dict[str, float] = {}
    for ts, pil in frames:
        preds = predict_visual(pil)
        row = {"timestamp": ts}
        for p in preds:
            row[p["label"]] = p["score"]
            aggregated[p["label"]] = aggregated.get(p["label"], 0.0) + p["score"]
        timeline.append(row)
    n = len(frames)
    avg_visual = [{"label": k, "score": v / n} for k, v in aggregated.items()]

    # 3) Text channel from transcript
    text_preds = predict_text(spoken)
    fusion = fuse(avg_visual, text_preds)
    summary = generative_summary(fusion, spoken)

    # Timeline figure
    df = pd.DataFrame(timeline).fillna(0.0)
    label_cols = [c for c in df.columns if c != "timestamp"]
    tl_fig = go.Figure()
    palette = px.colors.qualitative.Set2
    for i, lbl in enumerate(label_cols):
        tl_fig.add_trace(go.Scatter(
            x=df["timestamp"], y=df[lbl] * 100,
            mode="lines+markers", name=lbl,
            line=dict(color=palette[i % len(palette)], width=2),
        ))
    tl_fig.update_layout(
        title="📈 Emotion Timeline (per frame) — audio transcript drives text channel",
        xaxis_title="Time (s)", yaxis_title="Confidence (%)",
        height=360, template="plotly_white",
        margin=dict(l=10, r=10, t=40, b=10),
        yaxis=dict(range=[0, 100]),
    )

    vfig = bar_chart(avg_visual, "👁️ Avg Visual Emotion (frames)", "#4C78A8")
    tfig = bar_chart(text_preds, "💬 Sentiment of Spoken Audio", "#54A24B")

    fusion_md = f"""
### {fusion['badge']} Fusion Result: **{fusion['status']}**

| Modality | Top Prediction | Confidence | Valence |
|---|---|---|---|
| 👁️ Visual (avg of {n} frames) | **{fusion['visual_label']}** | {fusion['visual_conf']*100:.1f}% | {fusion['visual_valence']:+.2f} |
| 🎙️ Audio transcript | **{fusion['text_label']}** | {fusion['text_conf']*100:.1f}% | {fusion['text_valence']:+.2f} |
| 🔗 Overall valence | — | — | **{fusion['overall_valence']:+.2f}** |

*Spoken words (auto-transcribed):* "{spoken or '—'}"
"""
    summary_md = f"### 🧠 Generative Summary\n\n> {summary}"
    return transcript, tl_fig, vfig, tfig, fusion_md, summary_md

# -------------------------------------------------------------
# Gradio UI
# -------------------------------------------------------------
CSS = """
.gradio-container {max-width: 1200px !important;}
#title {text-align:center;}
footer {display: none !important;}
.show-api, .built-with, .settings {display: none !important;}
"""

with gr.Blocks(title="MoodSyncAI", theme=gr.themes.Soft(), css=CSS) as demo:
    gr.Markdown("# 🎭 MoodSyncAI", elem_id="title")
    gr.Markdown(
        "**Multi-Modal Sentiment & Emotion Analyser** — combines a Vision "
        "Transformer (face), a Transformer text classifier (words), a fusion "
        "layer (mismatch detection), and a generative model (plain-language "
        "summary). 100% open-source."
    )

    with gr.Tabs():
        # ---------------- Tab 1 ----------------
        with gr.Tab("🖼️ Image + Text"):
            with gr.Row():
                with gr.Column(scale=1):
                    img_in = gr.Image(type="pil", label="Face photo", height=320)
                    txt_in = gr.Textbox(
                        label="What the person said",
                        placeholder="e.g., No, I think the project is going really well.",
                        lines=2,
                    )
                    btn1 = gr.Button("🔍 Analyse", variant="primary")
                    attn_toggle1 = gr.Checkbox(
                        label="🔬 Show attention visualisation (ViT rollout + text tokens)",
                        value=False,
                    )
                    gr.Examples(
                        examples=[
                            [None, "No, I think the project is going really well."],
                            [None, "I'm absolutely thrilled about the results!"],
                            [None, "I'm fine, really, don't worry about me."],
                        ],
                        inputs=[img_in, txt_in],
                    )
                with gr.Column(scale=2):
                    fusion_md1 = gr.Markdown()
                    summary_md1 = gr.Markdown()
                    with gr.Row():
                        vbar1 = gr.Plot(label="Visual emotion")
                        tbar1 = gr.Plot(label="Text sentiment")
                    with gr.Accordion("🔬 Attention visualisation", open=False):
                        attn_img1 = gr.Image(
                            label="ViT attention rollout (face)",
                            height=320, interactive=False,
                        )
                        attn_html1 = gr.HTML(label="Text token attention")
            btn1.click(analyse_image_text_with_attention,
                       inputs=[img_in, txt_in, attn_toggle1],
                       outputs=[vbar1, tbar1, fusion_md1, summary_md1,
                                attn_img1, attn_html1])

        # ---------------- Tab 2 ----------------
        with gr.Tab("📹 Webcam / Video + Text"):
            gr.Markdown(
                "Record a short clip from your webcam (3–10 s recommended) **or** "
                "upload a short video. The system samples frames and builds an "
                "emotion timeline."
            )
            with gr.Row():
                with gr.Column(scale=1):
                    vid_in = gr.Video(
                        label="Webcam / video",
                        sources=["webcam", "upload"],
                        height=300,
                    )
                    txt_in2 = gr.Textbox(
                        label="What the person said",
                        placeholder="Type the spoken sentence here…",
                        lines=2,
                    )
                    btn2 = gr.Button("🔍 Analyse video", variant="primary")
                with gr.Column(scale=2):
                    timeline_plot = gr.Plot(label="Emotion timeline")
                    fusion_md2 = gr.Markdown()
                    summary_md2 = gr.Markdown()
                    with gr.Row():
                        vbar2 = gr.Plot(label="Avg visual emotion")
                        tbar2 = gr.Plot(label="Text sentiment")
            btn2.click(analyse_video_text,
                       inputs=[vid_in, txt_in2],
                       outputs=[timeline_plot, vbar2, tbar2, fusion_md2, summary_md2])

        # ---------------- Tab 3 : Audio + Image ----------------
        with gr.Tab("🎙️ Audio + Image"):
            gr.Markdown(
                "Speak (or upload audio) **and** provide a face image. Whisper "
                "transcribes the audio; the words become the *text channel* fed "
                "into the multimodal fusion."
            )
            with gr.Row():
                with gr.Column(scale=1):
                    audio_in3 = gr.Audio(
                        label="🎙️ Audio (microphone or upload)",
                        sources=["microphone", "upload"],
                        type="filepath",
                    )
                    img_in3 = gr.Image(type="pil", label="Face photo", height=300)
                    btn3 = gr.Button("🔍 Transcribe & analyse", variant="primary")
                with gr.Column(scale=2):
                    transcript3 = gr.Textbox(
                        label="Auto-transcript (Whisper)",
                        interactive=False, lines=2,
                    )
                    fusion_md3 = gr.Markdown()
                    summary_md3 = gr.Markdown()
                    with gr.Row():
                        vbar3 = gr.Plot(label="Visual emotion")
                        tbar3 = gr.Plot(label="Audio→text sentiment")
            btn3.click(analyse_audio_image,
                       inputs=[audio_in3, img_in3],
                       outputs=[transcript3, vbar3, tbar3, fusion_md3, summary_md3])

        # ---------------- Tab 4 : Video WITH audio ----------------
        with gr.Tab("🎬 Video with Audio"):
            gr.Markdown(
                "Record or upload a short video **with sound**. The system extracts "
                "the audio track, transcribes it (Whisper), samples frames for an "
                "emotion timeline, then fuses the visual signal with the spoken-word "
                "sentiment — no manual typing needed."
            )
            with gr.Row():
                with gr.Column(scale=1):
                    vid_in4 = gr.Video(
                        label="Webcam / video (with audio)",
                        sources=["webcam", "upload"],
                        height=300,
                    )
                    btn4 = gr.Button("🔍 Transcribe & analyse video", variant="primary")
                with gr.Column(scale=2):
                    transcript4 = gr.Textbox(
                        label="Auto-transcript (Whisper)",
                        interactive=False, lines=2,
                    )
                    timeline_plot4 = gr.Plot(label="Emotion timeline")
                    fusion_md4 = gr.Markdown()
                    summary_md4 = gr.Markdown()
                    with gr.Row():
                        vbar4 = gr.Plot(label="Avg visual emotion")
                        tbar4 = gr.Plot(label="Audio→text sentiment")
            btn4.click(analyse_video_with_audio,
                       inputs=[vid_in4],
                       outputs=[transcript4, timeline_plot4, vbar4, tbar4,
                                fusion_md4, summary_md4])

        # ---------------- Tab 5 : About ----------------
        with gr.Tab("ℹ️ About"):
            gr.Markdown(f"""
### Architecture

| Stage | Model | Type |
|---|---|---|
| Visual emotion | `{VISION_MODEL}` | **Vision Transformer (ViT)** |
| Text sentiment | `{TEXT_MODEL}` | **Transformer (DistilRoBERTa)** |
| Speech-to-text | `{ASR_MODEL}` | **Encoder-Decoder Transformer (Whisper)** |
| Fusion | Valence-aligned multimodal fusion (custom) | rule + weighted |
| Generative summary | `{GEN_MODEL}` | **Encoder-Decoder Transformer (FLAN-T5)** |
| Attention viz | ViT attention rollout + last-layer text attention | interpretability |

### Fusion logic

1. Each modality produces a probability distribution over emotion labels.
2. Labels are mapped to a *valence* score in `[-1, +1]`.
3. We compute weighted valence per modality, then a delta.
4. Opposite signs → **MISMATCH** (amber). Small delta → **ALIGNED** (green).
5. Generative model receives the structured signals and writes plain-language output.

### Privacy

All processing runs locally on your machine; no data is sent to external services
after the first model download from the Hugging Face Hub.
""")

if __name__ == "__main__":
    # Warm up small models so first request is snappy
    try:
        get_text_pipe()
    except Exception as e:
        print("[MoodSyncAI] Warmup text failed:", e)
    demo.queue().launch(
        server_name="127.0.0.1",
        server_port=7860,
        inbrowser=True,
        show_error=True,
        show_api=False,
    )
requirements.txt
ADDED
# MoodSyncAI runtime dependencies
# Tested on Python 3.10–3.13 (Windows / Linux / macOS, CPU)
#
# Install:
#   pip install --upgrade pip
#   pip install -r requirements.txt --only-binary=:all:
#
# The --only-binary flag forces wheels, which avoids needing Rust/MSVC to
# compile tokenizers on Python 3.13.

# --- UI ---
gradio>=4.44,<6

# --- Deep-learning stack ---
torch>=2.2
torchvision>=0.17
transformers>=4.46
tokenizers>=0.20
sentencepiece>=0.2
accelerate>=0.30
safetensors>=0.4

# --- Vision / data ---
pillow>=10
numpy>=1.26
opencv-python>=4.9
plotly>=5.20
pandas>=2.0

# --- Audio (Whisper + video audio extraction) ---
imageio-ffmpeg>=0.5   # bundles ffmpeg binary; no system install required
soundfile>=0.12
librosa>=0.10
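
After installing, a quick import check catches missing-wheel problems early. A hypothetical sketch (this is not the repo's git-ignored `_verify_requirements.py`; it simply imports each runtime dependency):

```python
# Hypothetical post-install sanity check: confirms each runtime
# dependency listed in requirements.txt imports cleanly.
for mod in ("gradio", "torch", "transformers", "cv2", "plotly",
            "pandas", "soundfile", "librosa", "imageio_ffmpeg"):
    __import__(mod)
print("All MoodSyncAI dependencies import OK")
```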