vijesh418 committed on
Commit
1f07aba
0 Parent(s):

Initial commit: MoodSyncAI multi-modal sentiment analyser

Files changed (4)
  1. .gitignore +36 -0
  2. README.md +64 -0
  3. app.py +1048 -0
  4. requirements.txt +33 -0
.gitignore ADDED
@@ -0,0 +1,36 @@
1
+ # Virtual environments
2
+ .venv/
3
+ venv/
4
+ env/
5
+
6
+ # Python
7
+ __pycache__/
8
+ *.py[cod]
9
+ *$py.class
10
+ *.so
11
+ *.egg-info/
12
+ .pytest_cache/
13
+
14
+ # Hugging Face / model caches
15
+ hf_cache/
16
+ .cache/
17
+
18
+ # Logs
19
+ *.log
20
+ install.log
21
+
22
+ # Dev / scratch scripts (kept local-only; delete these lines to track them)
23
+ _warmup.py
24
+ _smoke_features.py
25
+ _verify_requirements.py
26
+
27
+ # IDE
28
+ .vscode/
29
+ .idea/
30
+ *.swp
31
+ .DS_Store
32
+
33
+ # Build artifacts
34
+ build/
35
+ dist/
36
+ *.spec
README.md ADDED
@@ -0,0 +1,64 @@
1
+ # 🎭 MoodSyncAI
2
+
3
+ **Multi-Modal Sentiment & Emotion Analyser** — combines facial emotion (Vision Transformer), text sentiment (Transformer), a fusion layer (with mismatch detection), and a generative model that summarises the emotional state in plain language. Includes a **webcam / short-video timeline** view.
4
+
5
+ All models are **100% free & open-source** (Hugging Face Hub).
6
+
7
+ ## Components
8
+
9
+ | Stage | Model | Type | Requirement satisfied |
10
+ |---|---|---|---|
11
+ | Visual emotion | `trpakov/vit-face-expression` | **ViT** | CNN/ViT for facial emotion ✅ |
12
+ | Text sentiment | `j-hartmann/emotion-english-distilroberta-base` | **Transformer** | RNN/LSTM/Transformer ✅ |
13
+ | Speech-to-text | `openai/whisper-tiny` | **Whisper encoder-decoder** | Audio → text channel ✅ |
14
+ | Fusion | Valence-aligned multimodal fusion | rule-based + weighted | Fusion + mismatch ✅ |
15
+ | Generative | `google/flan-t5-base` | seq2seq Transformer | Generative summary ✅ |
16
+ | Webcam / video | OpenCV frame sampling + Plotly timeline | — | Real-time / video input ✅ |
17
+ | Attention viz | ViT attention rollout + last-layer text attention | interpretability | Attention visualisation ✅ |
18
+
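All of these checkpoints load through the standard `transformers` APIs. A condensed sketch of how `app.py` wires them up (the real code lazy-loads each model on first use and picks GPU/CPU automatically):

```python
# Condensed sketch of the model wiring in app.py (eager, CPU-only for brevity).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

vision = pipeline("image-classification",
                  model="trpakov/vit-face-expression", top_k=None)
text = pipeline("text-classification",
                model="j-hartmann/emotion-english-distilroberta-base", top_k=None)
asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")
gen_tok = AutoTokenizer.from_pretrained("google/flan-t5-base")
gen_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
```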
19
+ ## Run
20
+
21
+ **Prerequisite:** Python **3.10 – 3.13** (CPU is enough — no GPU required, no system ffmpeg required).
22
+
23
+ ```powershell
24
+ # 1. Clone / copy this folder onto the new machine, then:
25
+ cd "<path-to-folder>"
26
+
27
+ # 2. Create a virtual env
28
+ python -m venv .venv
29
+ .\.venv\Scripts\Activate.ps1 # Windows
30
+ # source .venv/bin/activate # macOS / Linux
31
+
32
+ # 3. Install (use --only-binary to skip Rust/MSVC compilation on Py3.13)
33
+ python -m pip install --upgrade pip
34
+ pip install -r requirements.txt --only-binary=:all:
35
+
36
+ # 4. Launch
37
+ python app.py
38
+ ```
39
+
40
+ Your browser opens automatically at `http://127.0.0.1:7860`.
41
+
42
+ **To stop the app:** press `Ctrl+C` in the terminal running `python app.py`.
43
+
44
+ **First launch only:** the app downloads ~1.2 GB of models from Hugging Face into `~/.cache/huggingface/`; they are cached there for all future runs, so everything works fully offline afterwards.
45
+
46
+ That's it — no system packages, no system ffmpeg, no GPU, and no model files to download manually.
47
+
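If you want to pull the checkpoints ahead of time (e.g. before a demo), a minimal warm-up sketch is shown below; the repo's ignored `_warmup.py` presumably does something along these lines:

```python
# Hypothetical pre-download script: touching each lazy loader in app.py
# pulls the corresponding checkpoint into ~/.cache/huggingface/ once.
import app

app.get_text_pipe()    # DistilRoBERTa emotion classifier
app.get_vision_pipe()  # ViT face-expression classifier
app.get_asr_pipe()     # Whisper tiny (speech-to-text)
app.get_generator()    # FLAN-T5 summariser
print("All models cached.")
```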
48
+ ## Tabs
49
+
50
+ 1. **🖼️ Image + Text** — upload a face photo + type the spoken sentence → visual emotion bars, text emotion bars, fusion badge, generative summary. *Optional* attention-rollout heatmap on the face + per-token attention HTML when the toggle is on (a sketch of the rollout computation follows this list).
51
+ 2. **📹 Webcam / Video + Text** — record a 3–10 s clip in the browser → per-frame emotion **timeline chart**, aggregated bars, fusion, summary.
52
+ 3. **🎙️ Audio + Image** — record/upload audio + face photo. Whisper transcribes the audio; the transcript drives the text channel; full fusion + summary.
53
+ 4. **🎬 Video with Audio** — record/upload a video *with sound*. Audio is extracted (imageio-ffmpeg), transcribed by Whisper, fed to the text classifier; frames produce the visual timeline; fused result + summary — no typing needed.
54
+ 5. **ℹ️ About** — architecture & fusion logic.
55
+
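The attention-rollout heatmap mentioned in Tab 1 is computed as in `vit_attention_heatmap()` inside `app.py`; a condensed sketch of the core computation:

```python
import torch

def attention_rollout(attentions):
    """attentions: tuple of (1, heads, S, S) tensors from a ViT forward pass
    run with output_attentions=True. Returns per-patch importances for [CLS]."""
    result = None
    for a in attentions:
        a = a.mean(dim=1).squeeze(0)                   # average heads -> (S, S)
        a = a + torch.eye(a.size(0), device=a.device)  # add identity (residual path)
        a = a / a.sum(dim=-1, keepdim=True)            # re-normalise rows
        result = a if result is None else a @ result   # accumulate across layers
    return result[0, 1:]                               # CLS row, CLS column dropped
```

The returned vector is reshaped to the ViT patch grid, min-max normalised, and overlaid on the face crop as a JET heatmap.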
56
+ ## Fusion / mismatch rule
57
+
58
+ Each emotion label is mapped to a **valence** in `[-1, +1]`; a modality's overall valence is the probability-weighted sum over its predicted distribution.
59
+
60
+ - Opposite-sign valences → **MISMATCH DETECTED** (amber 🟠)
61
+ - Small delta → **ALIGNED** (green 🟢)
62
+ - Otherwise → **PARTIALLY ALIGNED** (yellow 🟡)
63
+
64
+ The generative model is prompted with the structured signals and writes a 2–3 sentence empathetic summary.
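For reference, a condensed version of the decision as implemented in `fuse()` inside `app.py`:

```python
def classify_alignment(v_val: float, t_val: float) -> str:
    """Condensed form of the mismatch/alignment rule in app.py's fuse()."""
    delta = v_val - t_val
    # opposite-sign valences, or a very large gap, count as a mismatch
    if (v_val * t_val < -0.05) or abs(delta) > 0.9:
        return "MISMATCH DETECTED"    # amber 🟠
    if abs(delta) < 0.35:
        return "ALIGNED"              # green 🟢
    return "PARTIALLY ALIGNED"        # yellow 🟡
```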
app.py ADDED
@@ -0,0 +1,1048 @@
1
+ """
2
+ MoodSyncAI: Multi-Modal Sentiment & Emotion Analyser
3
+ ====================================================
4
+ Components:
5
+ - Visual emotion: ViT (Vision Transformer) - trpakov/vit-face-expression
6
+ - Text emotion: DistilRoBERTa transformer - j-hartmann/emotion-english-distilroberta-base
7
+ - Fusion: Valence-aligned multimodal fusion + mismatch detection
8
+ - Generative: FLAN-T5 (with safe template fallback) for plain-language summary
9
+ - Webcam: Short video upload/recording, per-frame emotion timeline
10
+
11
+ All models are free/open-source from Hugging Face. Runs on CPU.
12
+ """
13
+
14
+ import os
15
+ import io
16
+ import time
17
+ import warnings
18
+ from typing import List, Tuple, Dict
19
+
20
+ warnings.filterwarnings("ignore")
21
+ os.environ.setdefault("TRANSFORMERS_VERBOSITY", "error")
22
+ os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
23
+
24
+ import numpy as np
25
+ import pandas as pd
26
+ from PIL import Image
27
+ import cv2
28
+ import plotly.graph_objects as go
29
+ import plotly.express as px
30
+ import gradio as gr
31
+
32
+ import torch
33
+ from transformers import (
34
+ pipeline,
35
+ AutoTokenizer,
36
+ AutoModelForSeq2SeqLM,
37
+ AutoModelForImageClassification,
38
+ AutoModelForSequenceClassification,
39
+ AutoImageProcessor,
40
+ )
41
+
42
+ # -------------------------------------------------------------
43
+ # Model identifiers (all free / public on Hugging Face Hub)
44
+ # -------------------------------------------------------------
45
+ VISION_MODEL = "trpakov/vit-face-expression" # ViT for facial emotion
46
+ TEXT_MODEL = "j-hartmann/emotion-english-distilroberta-base" # 7 emotions
47
+ GEN_MODEL = "google/flan-t5-base" # generative summariser
48
+ ASR_MODEL = "openai/whisper-tiny" # speech-to-text (Whisper)
49
+
50
+ DEVICE = 0 if torch.cuda.is_available() else -1
51
+ print(f"[MoodSyncAI] Torch device: {'cuda' if DEVICE == 0 else 'cpu'}")
52
+
53
+ # -------------------------------------------------------------
54
+ # Lazy-loaded model singletons
55
+ # -------------------------------------------------------------
56
+ _vision_pipe = None
57
+ _text_pipe = None
58
+ _gen_tokenizer = None
59
+ _gen_model = None
60
+ _face_cascade = None
61
+ _asr_pipe = None
62
+ _vit_attn_model = None
63
+ _vit_attn_processor = None
64
+ _text_attn_model = None
65
+ _text_attn_tokenizer = None
66
+
67
+
68
+ def get_vision_pipe():
69
+ global _vision_pipe
70
+ if _vision_pipe is None:
71
+ print("[MoodSyncAI] Loading vision model:", VISION_MODEL)
72
+ _vision_pipe = pipeline(
73
+ "image-classification",
74
+ model=VISION_MODEL,
75
+ device=DEVICE,
76
+ top_k=None,
77
+ )
78
+ return _vision_pipe
79
+
80
+
81
+ def get_text_pipe():
82
+ global _text_pipe
83
+ if _text_pipe is None:
84
+ print("[MoodSyncAI] Loading text model:", TEXT_MODEL)
85
+ _text_pipe = pipeline(
86
+ "text-classification",
87
+ model=TEXT_MODEL,
88
+ device=DEVICE,
89
+ top_k=None,
90
+ truncation=True,
91
+ )
92
+ return _text_pipe
93
+
94
+
95
+ def get_generator():
96
+ global _gen_tokenizer, _gen_model
97
+ if _gen_model is None:
98
+ try:
99
+ print("[MoodSyncAI] Loading generator:", GEN_MODEL)
100
+ _gen_tokenizer = AutoTokenizer.from_pretrained(GEN_MODEL)
101
+ _gen_model = AutoModelForSeq2SeqLM.from_pretrained(GEN_MODEL)
102
+ if DEVICE == 0:
103
+ _gen_model = _gen_model.to("cuda")
104
+ except Exception as e:
105
+ print("[MoodSyncAI] Generator load failed, will use template fallback:", e)
106
+ _gen_tokenizer, _gen_model = None, None
107
+ return _gen_tokenizer, _gen_model
108
+
109
+
110
+ def get_face_cascade():
111
+ global _face_cascade
112
+ if _face_cascade is None:
113
+ path = os.path.join(cv2.data.haarcascades, "haarcascade_frontalface_default.xml")
114
+ _face_cascade = cv2.CascadeClassifier(path)
115
+ return _face_cascade
116
+
117
+
118
+ # -------------------------------------------------------------
119
+ # Valence map: used to align textual and visual signals
120
+ # -------------------------------------------------------------
121
+ VALENCE = {
122
+ # text emotions (from distilroberta)
123
+ "joy": 1.0,
124
+ "love": 1.0,
125
+ "surprise": 0.3,
126
+ "neutral": 0.0,
127
+ "sadness": -1.0,
128
+ "fear": -0.8,
129
+ "anger": -0.9,
130
+ "disgust": -0.8,
131
+ # vision labels (ViT face expression labels)
132
+ "happy": 1.0,
133
+ "happiness": 1.0,
134
+ "sad": -1.0,
135
+ "angry": -0.9,
136
+ "fearful": -0.8,
137
+ "fear": -0.8,
138
+ "disgusted": -0.8,
139
+ "surprised": 0.3,
140
+ "contempt": -0.6,
141
+ }
142
+
143
+
144
+ def valence_of(label: str) -> float:
145
+ return VALENCE.get(label.lower().strip(), 0.0)
146
+
147
+
148
+ # -------------------------------------------------------------
149
+ # Face detection (crops to face for better accuracy; falls back to full image)
150
+ # -------------------------------------------------------------
151
+ def detect_and_crop_face(pil_img: Image.Image) -> Image.Image:
152
+ try:
153
+ cascade = get_face_cascade()
154
+ rgb = np.array(pil_img.convert("RGB"))
155
+ gray = cv2.cvtColor(rgb, cv2.COLOR_RGB2GRAY)
156
+ faces = cascade.detectMultiScale(gray, scaleFactor=1.2, minNeighbors=5, minSize=(60, 60))
157
+ if len(faces) == 0:
158
+ return pil_img
159
+ # Pick the largest face
160
+ x, y, w, h = max(faces, key=lambda b: b[2] * b[3])
161
+ pad = int(0.15 * max(w, h))
162
+ x0 = max(0, x - pad); y0 = max(0, y - pad)
163
+ x1 = min(rgb.shape[1], x + w + pad); y1 = min(rgb.shape[0], y + h + pad)
164
+ return Image.fromarray(rgb[y0:y1, x0:x1])
165
+ except Exception:
166
+ return pil_img
167
+
168
+
169
+ # -------------------------------------------------------------
170
+ # Core analysis helpers
171
+ # -------------------------------------------------------------
172
+ def predict_visual(pil_img: Image.Image) -> List[Dict]:
173
+ pipe = get_vision_pipe()
174
+ face = detect_and_crop_face(pil_img)
175
+ preds = pipe(face)
176
+ # normalise into list of {label,score}
177
+ return [{"label": p["label"], "score": float(p["score"])} for p in preds]
178
+
179
+
180
+ def predict_text(text: str) -> List[Dict]:
181
+ if not text or not text.strip():
182
+ return [{"label": "neutral", "score": 1.0}]
183
+ pipe = get_text_pipe()
184
+ preds = pipe(text)[0] # top_k=None -> list of all
185
+ return [{"label": p["label"], "score": float(p["score"])} for p in preds]
186
+
187
+
188
+ def top1(preds: List[Dict]) -> Tuple[str, float]:
189
+ p = max(preds, key=lambda d: d["score"])
190
+ return p["label"], p["score"]
191
+
192
+
193
+ def weighted_valence(preds: List[Dict]) -> float:
194
+ return sum(p["score"] * valence_of(p["label"]) for p in preds)
195
+
196
+
197
+ def fuse(visual_preds: List[Dict], text_preds: List[Dict]) -> Dict:
198
+ v_label, v_conf = top1(visual_preds)
199
+ t_label, t_conf = top1(text_preds)
200
+ v_val = weighted_valence(visual_preds)
201
+ t_val = weighted_valence(text_preds)
202
+
203
+ delta = v_val - t_val
204
+ # mismatch: opposite sign with meaningful magnitude
205
+ mismatch = (v_val * t_val < -0.05) or (abs(delta) > 0.9)
206
+
207
+ if mismatch:
208
+ status = "MISMATCH DETECTED"
209
+ badge = "🟠"
210
+ elif abs(delta) < 0.35:
211
+ status = "ALIGNED"
212
+ badge = "🟢"
213
+ else:
214
+ status = "PARTIALLY ALIGNED"
215
+ badge = "🟡"
216
+
217
+ # overall valence (weighted average favoring visual when mismatch)
218
+ if mismatch:
219
+ overall_val = 0.6 * v_val + 0.4 * t_val
220
+ else:
221
+ overall_val = 0.5 * (v_val + t_val)
222
+
223
+ return {
224
+ "visual_label": v_label,
225
+ "visual_conf": v_conf,
226
+ "text_label": t_label,
227
+ "text_conf": t_conf,
228
+ "visual_valence": v_val,
229
+ "text_valence": t_val,
230
+ "delta": delta,
231
+ "status": status,
232
+ "badge": badge,
233
+ "overall_valence": overall_val,
234
+ }
235
+
236
+
237
+ # -------------------------------------------------------------
238
+ # Generative summary
239
+ # -------------------------------------------------------------
240
+ def template_summary(fusion: Dict) -> str:
241
+ v = fusion["visual_label"]; vc = fusion["visual_conf"]
242
+ t = fusion["text_label"]; tc = fusion["text_conf"]
243
+ if fusion["status"].startswith("MISMATCH"):
244
+ return (
245
+ f"Despite expressing **{t}** sentiment verbally ({tc*100:.0f}% confidence), "
246
+ f"the speaker's facial cues indicate **{v}** ({vc*100:.0f}% confidence). "
247
+ f"This incongruence between words and expression is worth noting in the "
248
+ f"context of the conversation - the spoken message may not fully reflect "
249
+ f"how the person actually feels."
250
+ )
251
+ if fusion["status"] == "ALIGNED":
252
+ return (
253
+ f"The speaker's words ({t}, {tc*100:.0f}%) and facial expression "
254
+ f"({v}, {vc*100:.0f}%) are consistent. The overall emotional state "
255
+ f"appears genuine and uncomplicated."
256
+ )
257
+ return (
258
+ f"The speaker shows mild divergence between facial expression ({v}, "
259
+ f"{vc*100:.0f}%) and spoken sentiment ({t}, {tc*100:.0f}%). The signals "
260
+ f"are not contradictory but suggest some nuance in the emotional state."
261
+ )
262
+
263
+
264
+ def generative_summary(fusion: Dict, text_input: str) -> str:
265
+ tok, model = get_generator()
266
+ fallback = template_summary(fusion)
267
+ if model is None or tok is None:
268
+ return fallback
269
+ try:
270
+ mismatch = fusion["status"].startswith("MISMATCH")
271
+ instr = (
272
+ "rewrite as one empathetic paragraph (2-3 sentences) that explicitly "
273
+ "highlights the mismatch between facial expression and spoken words"
274
+ if mismatch else
275
+ "rewrite as one empathetic paragraph (2-3 sentences) noting the emotional state"
276
+ )
277
+ prompt = (
278
+ f"You are an empathetic psychologist. Given the analysis below, {instr}. "
279
+ f"Begin with the word 'The'.\n\n"
280
+ f"Analysis:\n"
281
+ f"- Spoken sentence: \"{text_input or '(none provided)'}\"\n"
282
+ f"- Facial emotion detected: {fusion['visual_label']} "
283
+ f"({fusion['visual_conf']*100:.0f}% confidence)\n"
284
+ f"- Sentiment of the words: {fusion['text_label']} "
285
+ f"({fusion['text_conf']*100:.0f}% confidence)\n"
286
+ f"- Alignment: {fusion['status']}\n\n"
287
+ f"Paragraph:"
288
+ )
289
+ inputs = tok(prompt, return_tensors="pt", truncation=True, max_length=512)
290
+ if DEVICE == 0:
291
+ inputs = {k: v.to("cuda") for k, v in inputs.items()}
292
+ out = model.generate(
293
+ **inputs,
294
+ max_new_tokens=140,
295
+ min_new_tokens=30,
296
+ num_beams=4,
297
+ do_sample=False,
298
+ no_repeat_ngram_size=3,
299
+ early_stopping=True,
300
+ )
301
+ text = tok.decode(out[0], skip_special_tokens=True).strip()
302
+ # Reject obvious echoes / too-short / off-topic outputs
303
+ bad = (len(text) < 50
304
+ or text.lower().startswith(("tell ", "write ", "give "))
305
+ or "story" in text.lower()[:40]
306
+ or (fusion["visual_label"].lower() not in text.lower()
307
+ and fusion["text_label"].lower() not in text.lower()))
308
+ if bad:
309
+ return fallback
310
+ return text
311
+ except Exception as e:
312
+ print("[MoodSyncAI] Generation error:", e)
313
+ return fallback
314
+
315
+
316
+ # -------------------------------------------------------------
317
+ # Plotly charts
318
+ # -------------------------------------------------------------
319
+ def bar_chart(preds: List[Dict], title: str, color: str) -> go.Figure:
320
+ df = pd.DataFrame(preds).sort_values("score", ascending=True)
321
+ df["pct"] = (df["score"] * 100).round(1)
322
+ fig = go.Figure(go.Bar(
323
+ x=df["pct"], y=df["label"], orientation="h",
324
+ marker=dict(color=color),
325
+ text=df["pct"].astype(str) + "%",
326
+ textposition="outside",
327
+ ))
328
+ fig.update_layout(
329
+ title=title,
330
+ xaxis_title="Confidence (%)",
331
+ yaxis_title=None,
332
+ xaxis=dict(range=[0, 110]),
333
+ height=320, margin=dict(l=10, r=10, t=40, b=10),
334
+ template="plotly_white",
335
+ )
336
+ return fig
337
+
338
+
339
+ def empty_fig(msg="No data") -> go.Figure:
340
+ fig = go.Figure()
341
+ fig.add_annotation(text=msg, xref="paper", yref="paper",
342
+ x=0.5, y=0.5, showarrow=False, font=dict(size=14))
343
+ fig.update_layout(height=320, template="plotly_white",
344
+ margin=dict(l=10, r=10, t=20, b=10))
345
+ return fig
346
+
347
+
348
+ # -------------------------------------------------------------
349
+ # Tab 1: Image + Text analysis
350
+ # -------------------------------------------------------------
351
+ def analyse_image_text(image: Image.Image, text: str):
352
+ if image is None:
353
+ return (empty_fig("Please upload an image"),
354
+ empty_fig("Awaiting input"),
355
+ "### ⚠️ Please upload an image of a face.", "")
356
+
357
+ visual_preds = predict_visual(image)
358
+ text_preds = predict_text(text or "")
359
+
360
+ fusion = fuse(visual_preds, text_preds)
361
+ summary = generative_summary(fusion, text)
362
+
363
+ vfig = bar_chart(visual_preds, "👁️ Visual Emotion (ViT)", "#4C78A8")
364
+ tfig = bar_chart(text_preds, "💬 Text Sentiment (Transformer)", "#54A24B")
365
+
366
+ fusion_md = f"""
367
+ ### {fusion['badge']} Fusion Result: **{fusion['status']}**
368
+
369
+ | Modality | Top Prediction | Confidence | Valence |
370
+ |---|---|---|---|
371
+ | 👁️ Visual | **{fusion['visual_label']}** | {fusion['visual_conf']*100:.1f}% | {fusion['visual_valence']:+.2f} |
372
+ | 💬 Text | **{fusion['text_label']}** | {fusion['text_conf']*100:.1f}% | {fusion['text_valence']:+.2f} |
373
+ | 🔗 Overall valence | — | — | **{fusion['overall_valence']:+.2f}** |
374
+ """
375
+ summary_md = f"### 🧠 Generative Summary\n\n> {summary}"
376
+ return vfig, tfig, fusion_md, summary_md
377
+
378
+
379
+ # -------------------------------------------------------------
380
+ # Tab 2: Webcam / short video → emotion timeline
381
+ # -------------------------------------------------------------
382
+ def sample_frames(video_path: str, max_frames: int = 12) -> List[Tuple[float, Image.Image]]:
383
+ cap = cv2.VideoCapture(video_path)
384
+ if not cap.isOpened():
385
+ return []
386
+ fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
387
+ total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT) or 0)
388
+
389
+ # If total frames is unknown, read sequentially to count.
390
+ if total <= 0:
391
+ total = 0
392
+ while True:
393
+ ok, _ = cap.read()
394
+ if not ok:
395
+ break
396
+ total += 1
397
+ cap.release()
398
+ cap = cv2.VideoCapture(video_path)
399
+ if total <= 0:
400
+ return []
401
+
402
+ duration = total / fps if fps > 0 else 1.0
403
+ n = min(max_frames, max(3, int(duration * 2))) # ~2 fps target
404
+ target_idxs = set(np.linspace(0, total - 1, n).astype(int).tolist())
405
+
406
+ out: List[Tuple[float, Image.Image]] = []
407
+ idx = 0
408
+ while True:
409
+ ok, frame = cap.read()
410
+ if not ok:
411
+ break
412
+ if idx in target_idxs:
413
+ ts = idx / fps if fps > 0 else float(idx)
414
+ pil = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
415
+ out.append((float(ts), pil))
416
+ if len(out) >= n:
417
+ break
418
+ idx += 1
419
+ cap.release()
420
+ return out
421
+
422
+
423
+ def analyse_video_text(video_path, text: str):
424
+ if not video_path:
425
+ return (empty_fig("Record or upload a short video"),
426
+ empty_fig("Awaiting input"),
427
+ empty_fig("Awaiting input"),
428
+ "### ⚠️ Please provide a webcam video.", "")
429
+
430
+ frames = sample_frames(video_path, max_frames=12)
431
+ if not frames:
432
+ return (empty_fig("Could not read video"),
433
+ empty_fig(""), empty_fig(""),
434
+ "### ⚠️ Could not decode the video file.", "")
435
+
436
+ timeline = [] # list of dict: ts, label->score
437
+ aggregated: Dict[str, float] = {}
438
+ for ts, pil in frames:
439
+ preds = predict_visual(pil)
440
+ row = {"timestamp": ts}
441
+ for p in preds:
442
+ row[p["label"]] = p["score"]
443
+ aggregated[p["label"]] = aggregated.get(p["label"], 0.0) + p["score"]
444
+ timeline.append(row)
445
+
446
+ # Average the aggregated visual prediction across frames
447
+ n = len(frames)
448
+ avg_visual = [{"label": k, "score": v / n} for k, v in aggregated.items()]
449
+
450
+ text_preds = predict_text(text or "")
451
+ fusion = fuse(avg_visual, text_preds)
452
+ summary = generative_summary(fusion, text)
453
+
454
+ # Timeline figure (line per emotion)
455
+ df = pd.DataFrame(timeline).fillna(0.0)
456
+ label_cols = [c for c in df.columns if c != "timestamp"]
457
+ tl_fig = go.Figure()
458
+ palette = px.colors.qualitative.Set2
459
+ for i, lbl in enumerate(label_cols):
460
+ tl_fig.add_trace(go.Scatter(
461
+ x=df["timestamp"], y=df[lbl] * 100,
462
+ mode="lines+markers", name=lbl,
463
+ line=dict(color=palette[i % len(palette)], width=2),
464
+ ))
465
+ tl_fig.update_layout(
466
+ title="📈 Emotion Timeline (per frame)",
467
+ xaxis_title="Time (s)", yaxis_title="Confidence (%)",
468
+ height=360, template="plotly_white",
469
+ margin=dict(l=10, r=10, t=40, b=10),
470
+ yaxis=dict(range=[0, 100]),
471
+ )
472
+
473
+ vfig = bar_chart(avg_visual, "👁️ Average Visual Emotion", "#4C78A8")
474
+ tfig = bar_chart(text_preds, "💬 Text Sentiment", "#54A24B")
475
+
476
+ fusion_md = f"""
477
+ ### {fusion['badge']} Fusion Result: **{fusion['status']}**
478
+
479
+ | Modality | Top Prediction | Confidence | Valence |
480
+ |---|---|---|---|
481
+ | 👁️ Visual (avg) | **{fusion['visual_label']}** | {fusion['visual_conf']*100:.1f}% | {fusion['visual_valence']:+.2f} |
482
+ | 💬 Text | **{fusion['text_label']}** | {fusion['text_conf']*100:.1f}% | {fusion['text_valence']:+.2f} |
483
+ | 🔗 Overall valence | — | — | **{fusion['overall_valence']:+.2f}** |
484
+
485
+ *Analysed {n} frames from the video.*
486
+ """
487
+ summary_md = f"### 🧠 Generative Summary\n\n> {summary}"
488
+ return tl_fig, vfig, tfig, fusion_md, summary_md
489
+
490
+
491
+ # =============================================================
492
+ # NEW FEATURE BLOCK (additive — does not touch Tab 1 / Tab 2)
493
+ # =============================================================
494
+ # 1) Whisper ASR (audio → text channel)
495
+ # 2) Video with audio (transcribe + frame timeline + fusion)
496
+ # 3) Attention visualisation (ViT rollout heatmap + text token attention)
497
+ # =============================================================
498
+
499
+ import tempfile
500
+ import subprocess
501
+ import html as _html
502
+
503
+
504
+ def get_asr_pipe():
505
+ global _asr_pipe
506
+ if _asr_pipe is None:
507
+ print("[MoodSyncAI] Loading ASR model:", ASR_MODEL)
508
+ _asr_pipe = pipeline(
509
+ "automatic-speech-recognition",
510
+ model=ASR_MODEL,
511
+ device=DEVICE,
512
+ chunk_length_s=30,
513
+ return_timestamps=False,
514
+ )
515
+ return _asr_pipe
516
+
517
+
518
+ def transcribe_audio(audio_path: str) -> str:
519
+ if not audio_path:
520
+ return ""
521
+ try:
522
+ # Load audio ourselves (soundfile/librosa) so we don't depend on
523
+ # whisper's internal ffmpeg-via-PATH lookup.
524
+ import soundfile as sf
525
+ try:
526
+ audio, sr = sf.read(audio_path, dtype="float32", always_2d=False)
527
+ except Exception:
528
+ import librosa
529
+ audio, sr = librosa.load(audio_path, sr=16000, mono=True)
530
+ if audio.ndim > 1:
531
+ audio = audio.mean(axis=1)
532
+ if sr != 16000:
533
+ import librosa
534
+ audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)
535
+ sr = 16000
536
+ if audio.size == 0:
537
+ return ""
538
+ pipe = get_asr_pipe()
539
+ out = pipe(
540
+ {"array": audio, "sampling_rate": sr},
541
+ generate_kwargs={"language": "en", "task": "transcribe"},
542
+ )
543
+ text = out.get("text", "") if isinstance(out, dict) else str(out)
544
+ return (text or "").strip()
545
+ except Exception as e:
546
+ print("[MoodSyncAI] Transcription error:", e)
547
+ return ""
548
+
549
+
550
+ def _ffmpeg_exe() -> str:
551
+ try:
552
+ import imageio_ffmpeg
553
+ return imageio_ffmpeg.get_ffmpeg_exe()
554
+ except Exception:
555
+ return "ffmpeg"
556
+
557
+
558
+ def extract_audio_from_video(video_path: str) -> str:
559
+ """Extract mono 16 kHz wav from video. Returns wav path or '' on failure."""
560
+ if not video_path:
561
+ return ""
562
+ try:
563
+ out_path = tempfile.NamedTemporaryFile(
564
+ suffix=".wav", delete=False
565
+ ).name
566
+ cmd = [
567
+ _ffmpeg_exe(), "-y", "-i", video_path,
568
+ "-vn", "-ac", "1", "-ar", "16000",
569
+ "-f", "wav", out_path,
570
+ ]
571
+ proc = subprocess.run(cmd, capture_output=True, timeout=120)
572
+ if proc.returncode != 0 or not os.path.exists(out_path) or os.path.getsize(out_path) < 1024:
573
+ return ""
574
+ return out_path
575
+ except Exception as e:
576
+ print("[MoodSyncAI] Audio-extract error:", e)
577
+ return ""
578
+
579
+
580
+ # -------------------------------------------------------------
581
+ # Attention visualisation
582
+ # -------------------------------------------------------------
583
+ def _get_vit_attn():
584
+ global _vit_attn_model, _vit_attn_processor
585
+ if _vit_attn_model is None:
586
+ print("[MoodSyncAI] Loading ViT (eager attn) for attention rollout")
587
+ _vit_attn_processor = AutoImageProcessor.from_pretrained(VISION_MODEL)
588
+ _vit_attn_model = AutoModelForImageClassification.from_pretrained(
589
+ VISION_MODEL, attn_implementation="eager"
590
+ )
591
+ _vit_attn_model.eval()
592
+ if DEVICE == 0:
593
+ _vit_attn_model = _vit_attn_model.to("cuda")
594
+ return _vit_attn_model, _vit_attn_processor
595
+
596
+
597
+ def _get_text_attn():
598
+ global _text_attn_model, _text_attn_tokenizer
599
+ if _text_attn_model is None:
600
+ print("[MoodSyncAI] Loading text classifier (eager attn) for token attention")
601
+ _text_attn_tokenizer = AutoTokenizer.from_pretrained(TEXT_MODEL)
602
+ _text_attn_model = AutoModelForSequenceClassification.from_pretrained(
603
+ TEXT_MODEL, attn_implementation="eager"
604
+ )
605
+ _text_attn_model.eval()
606
+ if DEVICE == 0:
607
+ _text_attn_model = _text_attn_model.to("cuda")
608
+ return _text_attn_model, _text_attn_tokenizer
609
+
610
+
611
+ def vit_attention_heatmap(pil_img: Image.Image) -> Image.Image:
612
+ """Attention-rollout heatmap overlaid on the (face-cropped) image."""
613
+ try:
614
+ face = detect_and_crop_face(pil_img).convert("RGB")
615
+ model, processor = _get_vit_attn()
616
+ inputs = processor(images=face, return_tensors="pt")
617
+ if DEVICE == 0:
618
+ inputs = {k: v.to("cuda") for k, v in inputs.items()}
619
+ with torch.no_grad():
620
+ out = model(**inputs, output_attentions=True)
621
+ attns = out.attentions # tuple(L) of (1, H, S, S)
622
+ if not attns:
623
+ return face
624
+
625
+ # Attention rollout: avg heads, add identity, normalise, multiply layers
626
+ result = None
627
+ for a in attns:
628
+ a = a.mean(dim=1).squeeze(0) # (S, S)
629
+ a = a + torch.eye(a.size(0), device=a.device)
630
+ a = a / a.sum(dim=-1, keepdim=True)
631
+ result = a if result is None else a @ result
632
+
633
+ # CLS-token row, drop CLS index → patch importances
634
+ cls_attn = result[0, 1:].detach().cpu().numpy()
635
+ side = int(np.sqrt(cls_attn.shape[0]))
636
+ if side * side != cls_attn.shape[0]:
637
+ return face
638
+ grid = cls_attn.reshape(side, side)
639
+ grid = (grid - grid.min()) / (grid.max() - grid.min() + 1e-8)
640
+
641
+ # Resize heatmap to face image
642
+ w, h = face.size
643
+ heat = cv2.resize(grid, (w, h), interpolation=cv2.INTER_CUBIC)
644
+ heat_u8 = (heat * 255).astype(np.uint8)
645
+ color = cv2.applyColorMap(heat_u8, cv2.COLORMAP_JET)
646
+ color = cv2.cvtColor(color, cv2.COLOR_BGR2RGB)
647
+ base = np.array(face)
648
+ overlay = (0.55 * base + 0.45 * color).clip(0, 255).astype(np.uint8)
649
+ return Image.fromarray(overlay)
650
+ except Exception as e:
651
+ print("[MoodSyncAI] ViT attention error:", e)
652
+ return pil_img
653
+
654
+
655
+ def text_token_attention_html(text: str) -> str:
656
+ """Render input text with per-token attention intensity (last layer, [CLS] row)."""
657
+ if not text or not text.strip():
658
+ return "<em>(no text)</em>"
659
+ try:
660
+ model, tok = _get_text_attn()
661
+ enc = tok(text, return_tensors="pt", truncation=True, max_length=256)
662
+ if DEVICE == 0:
663
+ enc = {k: v.to("cuda") for k, v in enc.items()}
664
+ with torch.no_grad():
665
+ out = model(**enc, output_attentions=True)
666
+ attns = out.attentions # tuple(L) of (1, H, S, S)
667
+ if not attns:
668
+ return _html.escape(text)
669
+ last = attns[-1].mean(dim=1).squeeze(0) # (S, S)
670
+ cls_row = last[0].detach().cpu().numpy() # importance of each token to CLS
671
+
672
+ ids = enc["input_ids"][0].detach().cpu().tolist()
673
+ tokens = tok.convert_ids_to_tokens(ids)
674
+ # Skip special tokens for normalisation range
675
+ specials = set(tok.all_special_tokens)
676
+ keep_mask = np.array([t not in specials for t in tokens])
677
+ if keep_mask.sum() == 0:
678
+ return _html.escape(text)
679
+ scores = cls_row.copy()
680
+ scores_disp = scores[keep_mask]
681
+ lo, hi = scores_disp.min(), scores_disp.max()
682
+ norm = (scores - lo) / (hi - lo + 1e-8)
683
+ norm = np.clip(norm, 0.0, 1.0)
684
+
685
+ # Build HTML: merge subword tokens (RoBERTa uses 'Ġ' prefix for word start)
686
+ spans = []
687
+ for i, t in enumerate(tokens):
688
+ if t in specials:
689
+ continue
690
+ display = t
691
+ prefix_space = ""
692
+ if display.startswith("Ġ"):
693
+ display = display[1:]
694
+ prefix_space = " "
695
+ elif display.startswith("▁"):
696
+ display = display[1:]
697
+ prefix_space = " "
698
+ intensity = float(norm[i])
699
+ # red highlight, alpha from intensity
700
+ bg = f"rgba(220,38,38,{intensity:.2f})"
701
+ color = "#fff" if intensity > 0.55 else "#111"
702
+ safe = _html.escape(display)
703
+ spans.append(
704
+ f"{prefix_space}<span style=\"background:{bg};color:{color};"
705
+ f"padding:2px 4px;border-radius:4px;margin:1px;"
706
+ f"font-family:monospace\" title=\"{intensity:.2f}\">{safe}</span>"
707
+ )
708
+ body = "".join(spans).strip()
709
+ legend = (
710
+ "<div style='margin-top:8px;font-size:12px;color:#555'>"
711
+ "Darker red = higher attention weight from [CLS] to that token "
712
+ "(last transformer layer, averaged over heads)."
713
+ "</div>"
714
+ )
715
+ return f"<div style='line-height:2;font-size:15px'>{body}</div>{legend}"
716
+ except Exception as e:
717
+ print("[MoodSyncAI] Text attention error:", e)
718
+ return _html.escape(text)
719
+
720
+
721
+ # -------------------------------------------------------------
722
+ # Tab 1 wrapper: existing outputs + (optional) attention viz
723
+ # -------------------------------------------------------------
724
+ def analyse_image_text_with_attention(image: Image.Image, text: str, show_attn: bool):
725
+ vfig, tfig, fusion_md, summary_md = analyse_image_text(image, text)
726
+ if not show_attn or image is None:
727
+ return (vfig, tfig, fusion_md, summary_md,
728
+ None, "<em>Toggle 'Show attention visualisation' to view.</em>")
729
+ heat = vit_attention_heatmap(image)
730
+ token_html = text_token_attention_html(text or "")
731
+ return vfig, tfig, fusion_md, summary_md, heat, token_html
732
+
733
+
734
+ # -------------------------------------------------------------
735
+ # Tab 3: Audio + Image
736
+ # -------------------------------------------------------------
737
+ def analyse_audio_image(audio_path, image: Image.Image):
738
+ if image is None and not audio_path:
739
+ return ("",
740
+ empty_fig("Provide an image"),
741
+ empty_fig("Provide audio"),
742
+ "### ⚠️ Please provide both an image and audio.", "")
743
+ transcript = transcribe_audio(audio_path) if audio_path else ""
744
+ if not transcript:
745
+ transcript = "(no speech detected)"
746
+ if image is None:
747
+ return (transcript,
748
+ empty_fig("No image provided"),
749
+ empty_fig("(transcript only)"),
750
+ "### ⚠️ Please also provide a face image.", "")
751
+
752
+ visual_preds = predict_visual(image)
753
+ spoken = "" if transcript.startswith("(") else transcript
754
+ text_preds = predict_text(spoken)
755
+ fusion = fuse(visual_preds, text_preds)
756
+ summary = generative_summary(fusion, spoken)
757
+
758
+ vfig = bar_chart(visual_preds, "👁️ Visual Emotion (ViT)", "#4C78A8")
759
+ tfig = bar_chart(text_preds, "💬 Sentiment of Transcribed Speech", "#54A24B")
760
+ fusion_md = f"""
761
+ ### {fusion['badge']} Fusion Result: **{fusion['status']}**
762
+
763
+ | Modality | Top Prediction | Confidence | Valence |
764
+ |---|---|---|---|
765
+ | 👁️ Visual (image) | **{fusion['visual_label']}** | {fusion['visual_conf']*100:.1f}% | {fusion['visual_valence']:+.2f} |
766
+ | 🎙️ Audio → Text | **{fusion['text_label']}** | {fusion['text_conf']*100:.1f}% | {fusion['text_valence']:+.2f} |
767
+ | 🔗 Overall valence | — | — | **{fusion['overall_valence']:+.2f}** |
768
+ """
769
+ summary_md = f"### 🧠 Generative Summary\n\n> {summary}"
770
+ return transcript, vfig, tfig, fusion_md, summary_md
771
+
772
+
773
+ # -------------------------------------------------------------
774
+ # Tab 4: Video WITH audio (frames timeline + audio transcript → text channel)
775
+ # -------------------------------------------------------------
776
+ def analyse_video_with_audio(video_path):
777
+ if not video_path:
778
+ return ("",
779
+ empty_fig("Record or upload a video"),
780
+ empty_fig(""), empty_fig(""),
781
+ "### ⚠️ Please provide a video.", "")
782
+
783
+ frames = sample_frames(video_path, max_frames=12)
784
+ if not frames:
785
+ return ("",
786
+ empty_fig("Could not read video"),
787
+ empty_fig(""), empty_fig(""),
788
+ "### ⚠️ Could not decode the video file.", "")
789
+
790
+ # 1) Audio → transcript
791
+ wav = extract_audio_from_video(video_path)
792
+ transcript = transcribe_audio(wav) if wav else ""
793
+ if wav and os.path.exists(wav):
794
+ try: os.remove(wav)
795
+ except Exception: pass
796
+ if not transcript:
797
+ transcript = "(no speech detected in the audio track)"
798
+ spoken = "" if transcript.startswith("(") else transcript
799
+
800
+ # 2) Per-frame visual + aggregate
801
+ timeline = []
802
+ aggregated: Dict[str, float] = {}
803
+ for ts, pil in frames:
804
+ preds = predict_visual(pil)
805
+ row = {"timestamp": ts}
806
+ for p in preds:
807
+ row[p["label"]] = p["score"]
808
+ aggregated[p["label"]] = aggregated.get(p["label"], 0.0) + p["score"]
809
+ timeline.append(row)
810
+ n = len(frames)
811
+ avg_visual = [{"label": k, "score": v / n} for k, v in aggregated.items()]
812
+
813
+ # 3) Text channel from transcript
814
+ text_preds = predict_text(spoken)
815
+ fusion = fuse(avg_visual, text_preds)
816
+ summary = generative_summary(fusion, spoken)
817
+
818
+ # Timeline figure
819
+ df = pd.DataFrame(timeline).fillna(0.0)
820
+ label_cols = [c for c in df.columns if c != "timestamp"]
821
+ tl_fig = go.Figure()
822
+ palette = px.colors.qualitative.Set2
823
+ for i, lbl in enumerate(label_cols):
824
+ tl_fig.add_trace(go.Scatter(
825
+ x=df["timestamp"], y=df[lbl] * 100,
826
+ mode="lines+markers", name=lbl,
827
+ line=dict(color=palette[i % len(palette)], width=2),
828
+ ))
829
+ tl_fig.update_layout(
830
+ title="📈 Emotion Timeline (per frame) — audio transcript drives text channel",
831
+ xaxis_title="Time (s)", yaxis_title="Confidence (%)",
832
+ height=360, template="plotly_white",
833
+ margin=dict(l=10, r=10, t=40, b=10),
834
+ yaxis=dict(range=[0, 100]),
835
+ )
836
+
837
+ vfig = bar_chart(avg_visual, "👁️ Avg Visual Emotion (frames)", "#4C78A8")
838
+ tfig = bar_chart(text_preds, "💬 Sentiment of Spoken Audio", "#54A24B")
839
+
840
+ fusion_md = f"""
841
+ ### {fusion['badge']} Fusion Result: **{fusion['status']}**
842
+
843
+ | Modality | Top Prediction | Confidence | Valence |
844
+ |---|---|---|---|
845
+ | 👁️ Visual (avg of {n} frames) | **{fusion['visual_label']}** | {fusion['visual_conf']*100:.1f}% | {fusion['visual_valence']:+.2f} |
846
+ | 🎙️ Audio transcript | **{fusion['text_label']}** | {fusion['text_conf']*100:.1f}% | {fusion['text_valence']:+.2f} |
847
+ | 🔗 Overall valence | — | — | **{fusion['overall_valence']:+.2f}** |
848
+
849
+ *Spoken words (auto-transcribed):* "{spoken or '—'}"
850
+ """
851
+ summary_md = f"### 🧠 Generative Summary\n\n> {summary}"
852
+ return transcript, tl_fig, vfig, tfig, fusion_md, summary_md
853
+
854
+
855
+ # -------------------------------------------------------------
856
+ # Gradio UI
857
+ # -------------------------------------------------------------
858
+ CSS = """
859
+ .gradio-container {max-width: 1200px !important;}
860
+ #title {text-align:center;}
861
+ footer {display: none !important;}
862
+ .show-api, .built-with, .settings {display: none !important;}
863
+ """
864
+
865
+ with gr.Blocks(title="MoodSyncAI", theme=gr.themes.Soft(), css=CSS) as demo:
866
+ gr.Markdown("# 🎭 MoodSyncAI", elem_id="title")
867
+ gr.Markdown(
868
+ "**Multi-Modal Sentiment & Emotion Analyser** — combines a Vision "
869
+ "Transformer (face), a Transformer text classifier (words), a fusion "
870
+ "layer (mismatch detection), and a generative model (plain-language "
871
+ "summary). 100% open-source."
872
+ )
873
+
874
+ with gr.Tabs():
875
+ # ---------------- Tab 1 ----------------
876
+ with gr.Tab("🖼️ Image + Text"):
877
+ with gr.Row():
878
+ with gr.Column(scale=1):
879
+ img_in = gr.Image(type="pil", label="Face photo", height=320)
880
+ txt_in = gr.Textbox(
881
+ label="What the person said",
882
+ placeholder="e.g., No, I think the project is going really well.",
883
+ lines=2,
884
+ )
885
+ btn1 = gr.Button("🔍 Analyse", variant="primary")
886
+ attn_toggle1 = gr.Checkbox(
887
+ label="🔬 Show attention visualisation (ViT rollout + text tokens)",
888
+ value=False,
889
+ )
890
+ gr.Examples(
891
+ examples=[
892
+ [None, "No, I think the project is going really well."],
893
+ [None, "I'm absolutely thrilled about the results!"],
894
+ [None, "I'm fine, really, don't worry about me."],
895
+ ],
896
+ inputs=[img_in, txt_in],
897
+ )
898
+ with gr.Column(scale=2):
899
+ fusion_md1 = gr.Markdown()
900
+ summary_md1 = gr.Markdown()
901
+ with gr.Row():
902
+ vbar1 = gr.Plot(label="Visual emotion")
903
+ tbar1 = gr.Plot(label="Text sentiment")
904
+ with gr.Accordion("🔬 Attention visualisation", open=False):
905
+ attn_img1 = gr.Image(
906
+ label="ViT attention rollout (face)",
907
+ height=320, interactive=False,
908
+ )
909
+ attn_html1 = gr.HTML(label="Text token attention")
910
+ btn1.click(analyse_image_text_with_attention,
911
+ inputs=[img_in, txt_in, attn_toggle1],
912
+ outputs=[vbar1, tbar1, fusion_md1, summary_md1,
913
+ attn_img1, attn_html1])
914
+
915
+ # ---------------- Tab 2 ----------------
916
+ with gr.Tab("📹 Webcam / Video + Text"):
917
+ gr.Markdown(
918
+ "Record a short clip from your webcam (3–10 s recommended) **or** "
919
+ "upload a short video. The system samples frames and builds an "
920
+ "emotion timeline."
921
+ )
922
+ with gr.Row():
923
+ with gr.Column(scale=1):
924
+ vid_in = gr.Video(
925
+ label="Webcam / video",
926
+ sources=["webcam", "upload"],
927
+ height=300,
928
+ )
929
+ txt_in2 = gr.Textbox(
930
+ label="What the person said",
931
+ placeholder="Type the spoken sentence here…",
932
+ lines=2,
933
+ )
934
+ btn2 = gr.Button("🔍 Analyse video", variant="primary")
935
+ with gr.Column(scale=2):
936
+ timeline_plot = gr.Plot(label="Emotion timeline")
937
+ fusion_md2 = gr.Markdown()
938
+ summary_md2 = gr.Markdown()
939
+ with gr.Row():
940
+ vbar2 = gr.Plot(label="Avg visual emotion")
941
+ tbar2 = gr.Plot(label="Text sentiment")
942
+ btn2.click(analyse_video_text,
943
+ inputs=[vid_in, txt_in2],
944
+ outputs=[timeline_plot, vbar2, tbar2, fusion_md2, summary_md2])
945
+
946
+ # ---------------- Tab 3 : Audio + Image ----------------
947
+ with gr.Tab("🎙️ Audio + Image"):
948
+ gr.Markdown(
949
+ "Speak (or upload audio) **and** provide a face image. Whisper "
950
+ "transcribes the audio; the words become the *text channel* fed "
951
+ "into the multimodal fusion."
952
+ )
953
+ with gr.Row():
954
+ with gr.Column(scale=1):
955
+ audio_in3 = gr.Audio(
956
+ label="🎙️ Audio (microphone or upload)",
957
+ sources=["microphone", "upload"],
958
+ type="filepath",
959
+ )
960
+ img_in3 = gr.Image(type="pil", label="Face photo", height=300)
961
+ btn3 = gr.Button("🔍 Transcribe & analyse", variant="primary")
962
+ with gr.Column(scale=2):
963
+ transcript3 = gr.Textbox(
964
+ label="Auto-transcript (Whisper)",
965
+ interactive=False, lines=2,
966
+ )
967
+ fusion_md3 = gr.Markdown()
968
+ summary_md3 = gr.Markdown()
969
+ with gr.Row():
970
+ vbar3 = gr.Plot(label="Visual emotion")
971
+ tbar3 = gr.Plot(label="Audio→text sentiment")
972
+ btn3.click(analyse_audio_image,
973
+ inputs=[audio_in3, img_in3],
974
+ outputs=[transcript3, vbar3, tbar3, fusion_md3, summary_md3])
975
+
976
+ # ---------------- Tab 4 : Video WITH audio ----------------
977
+ with gr.Tab("🎬 Video with Audio"):
978
+ gr.Markdown(
979
+ "Record or upload a short video **with sound**. The system extracts "
980
+ "the audio track, transcribes it (Whisper), samples frames for an "
981
+ "emotion timeline, then fuses the visual signal with the spoken-word "
982
+ "sentiment — no manual typing needed."
983
+ )
984
+ with gr.Row():
985
+ with gr.Column(scale=1):
986
+ vid_in4 = gr.Video(
987
+ label="Webcam / video (with audio)",
988
+ sources=["webcam", "upload"],
989
+ height=300,
990
+ )
991
+ btn4 = gr.Button("🔍 Transcribe & analyse video", variant="primary")
992
+ with gr.Column(scale=2):
993
+ transcript4 = gr.Textbox(
994
+ label="Auto-transcript (Whisper)",
995
+ interactive=False, lines=2,
996
+ )
997
+ timeline_plot4 = gr.Plot(label="Emotion timeline")
998
+ fusion_md4 = gr.Markdown()
999
+ summary_md4 = gr.Markdown()
1000
+ with gr.Row():
1001
+ vbar4 = gr.Plot(label="Avg visual emotion")
1002
+ tbar4 = gr.Plot(label="Audio→text sentiment")
1003
+ btn4.click(analyse_video_with_audio,
1004
+ inputs=[vid_in4],
1005
+ outputs=[transcript4, timeline_plot4, vbar4, tbar4,
1006
+ fusion_md4, summary_md4])
1007
+
1008
+ # ---------------- Tab 3 (about) ----------------
1009
+ with gr.Tab("ℹ️ About"):
1010
+ gr.Markdown(f"""
1011
+ ### Architecture
1012
+
1013
+ | Stage | Model | Type |
1014
+ |---|---|---|
1015
+ | Visual emotion | `{VISION_MODEL}` | **Vision Transformer (ViT)** |
1016
+ | Text sentiment | `{TEXT_MODEL}` | **Transformer (DistilRoBERTa)** |
1017
+ | Speech-to-text | `{ASR_MODEL}` | **Encoder-Decoder Transformer (Whisper)** |
1018
+ | Fusion | Valence-aligned multimodal fusion (custom) | rule + weighted |
1019
+ | Generative summary | `{GEN_MODEL}` | **Encoder-Decoder Transformer (FLAN-T5)** |
1020
+ | Attention viz | ViT attention rollout + last-layer text attention | interpretability |
1021
+
1022
+ ### Fusion logic
1023
+
1024
+ 1. Each modality produces a probability distribution over emotion labels.
1025
+ 2. Labels are mapped to a *valence* score in `[-1, +1]`.
1026
+ 3. We compute weighted valence per modality, then a delta.
1027
+ 4. Opposite signs → **MISMATCH** (amber). Small delta → **ALIGNED** (green).
1028
+ 5. Generative model receives the structured signals and writes plain-language output.
1029
+
1030
+ ### Privacy
1031
+
1032
+ All processing runs locally on your machine; no data is sent to external services
1033
+ after the first model download from the Hugging Face Hub.
1034
+ """)
1035
+
1036
+ if __name__ == "__main__":
1037
+ # Warm up small models so first request is snappy
1038
+ try:
1039
+ get_text_pipe()
1040
+ except Exception as e:
1041
+ print("[MoodSyncAI] Warmup text failed:", e)
1042
+ demo.queue().launch(
1043
+ server_name="127.0.0.1",
1044
+ server_port=7860,
1045
+ inbrowser=True,
1046
+ show_error=True,
1047
+ show_api=False,
1048
+ )
requirements.txt ADDED
@@ -0,0 +1,33 @@
1
+ # MoodSyncAI runtime dependencies
2
+ # Tested on Python 3.10–3.13 (Windows / Linux / macOS, CPU)
3
+ #
4
+ # Install:
5
+ # pip install --upgrade pip
6
+ # pip install -r requirements.txt --only-binary=:all:
7
+ #
8
+ # The --only-binary flag forces wheels, which avoids needing Rust/MSVC to
9
+ # compile tokenizers on Python 3.13.
10
+
11
+ # --- UI ---
12
+ gradio>=4.44,<6
13
+
14
+ # --- Deep-learning stack ---
15
+ torch>=2.2
16
+ torchvision>=0.17
17
+ transformers>=4.46
18
+ tokenizers>=0.20
19
+ sentencepiece>=0.2
20
+ accelerate>=0.30
21
+ safetensors>=0.4
22
+
23
+ # --- Vision / data ---
24
+ pillow>=10
25
+ numpy>=1.26
26
+ opencv-python>=4.9
27
+ plotly>=5.20
28
+ pandas>=2.0
29
+
30
+ # --- Audio (Whisper + video audio extraction) ---
31
+ imageio-ffmpeg>=0.5 # bundles ffmpeg binary; no system install required
32
+ soundfile>=0.12
33
+ librosa>=0.10
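A quick, hypothetical way to confirm that all wheels installed correctly (the ignored `_verify_requirements.py` mentioned in `.gitignore` presumably does something similar):

```python
# Hypothetical sanity check: import every runtime dependency and print its version.
import importlib

MODULES = ["gradio", "torch", "torchvision", "transformers", "tokenizers",
           "sentencepiece", "accelerate", "safetensors", "PIL", "numpy",
           "cv2", "plotly", "pandas", "imageio_ffmpeg", "soundfile", "librosa"]

for name in MODULES:
    mod = importlib.import_module(name)
    print(f"{name:>15}  {getattr(mod, '__version__', 'ok')}")
```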