jefffffff9 Claude Sonnet 4.6 committed on
Commit ad902c6 · 1 Parent(s): bfe5b59

Add real conversational memory + live learning to Conversation Mode


Problem: every turn was stateless — no history, no vocabulary context.
The LLM received only the current utterance with a generic system prompt,
so it had no idea what was said before and could not use taught words.
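
Before and after, as a minimal sketch (identifiers come from the diff below;
the prompt strings and sample turns are abbreviated/invented):

    _BAM_CONVO_SYSTEM = "You are a friendly Bambara voice assistant..."   # old generic prompt
    system_with_vocab = "You are a helpful voice assistant... <vocab>"    # new prompt, vocab injected
    normalised = "how do you say thank you?"                              # current utterance
    history = [("hello", "i ni ce!")]                                     # prior (user, assistant) turns

    # Before: one stateless turn. The LLM saw only the current utterance.
    old_messages = [
        {"role": "system", "content": _BAM_CONVO_SYSTEM},
        {"role": "user", "content": normalised},
    ]

    # After: vocab-aware system prompt + the last 20 turns + the new turn.
    messages = [{"role": "system", "content": system_with_vocab}]
    for user_turn, assistant_turn in history[-20:]:
        messages.append({"role": "user", "content": user_turn})
        messages.append({"role": "assistant", "content": assistant_turn})
    messages.append({"role": "user", "content": normalised})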

Changes to app.py:

Vocabulary context cache (_vocab_context_cache):
- _refresh_vocab_context(): loads vocabulary.jsonl from Hub at startup
and after every vocabulary write; formats top 200 entries as compact
"word = meaning [lang]" lines injected into every LLM system prompt
- Called on startup (background thread) and after phrase imports,
Wikipedia harvests, and mid-conversation LEARNED saves
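
A condensed, self-contained sketch of that formatting step (sample entries
invented; field names match the vocabulary.jsonl records in the diff):

    entries = [
        {"word": "tɔgɔ", "translation": "name", "language": "bam"},
        {"word": "aw ni sɔgɔma", "translation": "", "language": "bam"},
    ]
    # Keep the 200 most recent entries, newest first, one compact line each.
    recent = entries[-200:][::-1]
    vocab_lines = []
    for e in recent:
        word = e.get("word", "").strip()
        tr = e.get("translation", "").strip()
        lang = e.get("language", "")
        if word:
            vocab_lines.append(f"{word} = {tr} [{lang}]" if tr else f"{word} [{lang}]")
    vocab_context = "\n".join(vocab_lines)
    # -> "aw ni sɔgɔma [bam]" then "tɔgɔ = name [bam]"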

Conversation history (gr.State):
- conv_history = gr.State(value=[]) — per-session, not shared
- Passed as input + output on every ask_btn / stop_recording event
- Capped at 20 turns to stay within the LLM token budget (see the sketch after this list)
- Displayed in gr.Chatbot (visible only when Conversation Mode is ON)
- "Clear" button resets history and chatbot to empty

Smarter system prompt (_CONVO_SYSTEM_TEMPLATE):
- Injects full vocabulary context so LLM knows every word taught so far
- Full multi-turn message list passed to LLM (history + new turn)
- Instructs LLM to ask clarifying questions when uncertain
- Instructs LLM to refer back to earlier messages naturally
- Teaches LEARNED tag format: [LEARNED: word="X" meaning="Y"]

Auto-learning from conversation (_parse_and_strip_learned):
- Regex parses [LEARNED: ...] tags out of the LLM response (demonstrated after this list)
- Strips them from spoken text before TTS (user never hears the tag)
- Saves each learned pair to vocabulary.jsonl on Hub async
- Immediately refreshes vocab cache so next turn knows the new word
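
A self-contained round-trip of the tag handling (the regex is the one from the
diff; the sample response is invented):

    import re

    LEARNED_RE = re.compile(
        r'\[LEARNED:\s*word=["\'](.+?)["\']\s+meaning=["\'](.+?)["\']\s*\]',
        re.IGNORECASE,
    )

    raw = 'I ni ce! [LEARNED: word="sira" meaning="road"]'
    pairs = [(m.group(1), m.group(2)) for m in LEARNED_RE.finditer(raw)]
    spoken = LEARNED_RE.sub("", raw).strip()
    print(pairs)   # [('sira', 'road')]
    print(spoken)  # I ni ce!  (the tag never reaches TTS)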

_convo_pipeline():
- Now accepts history: list and returns new_history as 5th value
- _build_messages() constructs full system+history+user message list
- Graceful LLM fallback: speaks a short "could not reach the model" apology (Bambara or English; sketched below)
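
The fallback, factored into a hypothetical helper for illustration (client is
a huggingface_hub InferenceClient and LLM_MODEL_ID a module constant, both
assumed from app.py):

    import logging
    log = logging.getLogger(__name__)

    def _llm_reply(client, messages: list[dict], language_code: str) -> str:
        """Call the hosted LLM; degrade gracefully instead of crashing the turn."""
        try:
            completion = client.chat_completion(
                model=LLM_MODEL_ID, messages=messages,
                max_tokens=300, temperature=0.65,
            )
            return completion.choices[0].message.content.strip()
        except Exception as llm_err:
            log.warning("LLM failed: %s", llm_err)
            return ("Hakɛ to, n bɛ sɔrɔ cogo dɔ la."
                    if language_code == "bam"
                    else "Sorry, I could not reach the language model. Please try again.")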

handle_ask() always returns a 5-tuple (transcript, eng, response, audio, history)
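
So a caller can always unpack five values regardless of mode (file path and
dropdown label here are placeholders):

    transcript, eng, response, audio, history = handle_ask(
        "clip.wav", "Bambara", convo_mode=False, history=[]
    )
    # Sensor mode returns history unchanged; Conversation Mode appends one turn.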

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Files changed (1)
  app.py  +241 -57
app.py CHANGED
@@ -248,22 +248,148 @@ def _run_pipeline(audio_path: str, language_code: str):
 
 # ── Conversation-mode helpers ─────────────────────────────────────────────────
 
-# Bambara conversation system prompt instructs LLM to respond phonetically
-_BAM_CONVO_SYSTEM = """\
-You are a friendly Bambara voice assistant. Rules you must follow:
-1. Always reply in Bambara, matching the user's informal spoken style.
-2. Use phonetic spelling: write 'u' instead of 'ou', 'j' instead of 'dj', \
-'c' instead of 'ch' — spell words as they sound when spoken aloud.
-3. Keep responses short: 1–3 sentences max. This is a voice conversation.
-4. Never add translations or explanations unless explicitly asked.
-5. If the user speaks French or English, switch to that language naturally."""
+# Vocabulary context cache — loaded from Hub, refreshed after each LEARNED save
+_vocab_context_cache: str = ""
+_vocab_lock = threading.Lock()
 
 
-def _get_llm() -> GemmaClient:
-    global _llm_client
-    if _llm_client is None:
-        _llm_client = GemmaClient(model_id=LLM_MODEL_ID, hf_token=HF_TOKEN)
-    return _llm_client
+def _refresh_vocab_context() -> None:
+    """Load vocabulary.jsonl from Hub and rebuild the LLM context string."""
+    global _vocab_context_cache
+    if not HF_TOKEN or not FEEDBACK_REPO_ID:
+        return
+    try:
+        from huggingface_hub import hf_hub_download
+        local = hf_hub_download(
+            repo_id=FEEDBACK_REPO_ID, filename="vocabulary.jsonl",
+            repo_type="dataset", token=HF_TOKEN,
+        )
+        entries: list[dict] = []
+        with open(local, encoding="utf-8") as f:
+            for line in f:
+                line = line.strip()
+                if line:
+                    try:
+                        entries.append(json.loads(line))
+                    except Exception:
+                        pass
+        # Most recent first, cap at 200 entries to stay within token budget
+        entries = entries[-200:][::-1]
+        lines = []
+        for e in entries:
+            word = e.get("word", "").strip()
+            tr = e.get("translation", "").strip()
+            lang = e.get("language", "")
+            if word:
+                lines.append(f"{word} = {tr} [{lang}]" if tr else f"{word} [{lang}]")
+        with _vocab_lock:
+            _vocab_context_cache = "\n".join(lines)
+    except Exception:
+        pass  # Non-critical — LLM continues without vocab context
+
+
+def _get_vocab_context() -> str:
+    with _vocab_lock:
+        return _vocab_context_cache
+
+
+def _save_learned_async(word: str, meaning: str, lang: str) -> None:
+    """Persist a word/phrase learned mid-conversation to vocabulary.jsonl on Hub."""
+    def _run():
+        if not word.strip():
+            return
+        entry = {"word": word.strip(), "translation": meaning.strip(), "language": lang,
+                 "source": "conversation", "timestamp": datetime.now(timezone.utc).isoformat()}
+        _upload_jsonl_later("vocabulary.jsonl", [entry])
+        _refresh_vocab_context()  # update cache so next turn knows this word
+    threading.Thread(target=_run, daemon=True).start()
+
+
+def _upload_jsonl_later(repo_path: str, entries: list[dict]) -> None:
+    """Append entries to a Hub JSONL file — called from background threads."""
+    if not HF_TOKEN or not FEEDBACK_REPO_ID or _hf_api is None:
+        return
+    from huggingface_hub import hf_hub_download
+    for attempt in range(2):
+        try:
+            local = hf_hub_download(
+                repo_id=FEEDBACK_REPO_ID, filename=repo_path,
+                repo_type="dataset", token=HF_TOKEN,
+            )
+            with open(local, encoding="utf-8") as f:
+                existing = f.read()
+        except Exception:
+            existing = ""
+        updated = existing + "".join(json.dumps(e, ensure_ascii=False) + "\n" for e in entries)
+        try:
+            _hf_api.upload_file(
+                path_or_fileobj=io.BytesIO(updated.encode("utf-8")),
+                path_in_repo=repo_path,
+                repo_id=FEEDBACK_REPO_ID,
+                repo_type="dataset",
+            )
+            return
+        except Exception:
+            if attempt == 1:
+                pass
+
+
+import re as _re
+_LEARNED_RE = _re.compile(
+    r'\[LEARNED:\s*word=["\'](.+?)["\']\s+meaning=["\'](.+?)["\']\s*\]',
+    _re.IGNORECASE,
+)
+
+
+def _parse_and_strip_learned(text: str, lang: str) -> tuple[str, list[tuple[str, str]]]:
+    """
+    Extract [LEARNED: word="X" meaning="Y"] tags from LLM output.
+    Returns (cleaned_text, list of (word, meaning) pairs).
+    Saves each pair to Hub asynchronously.
+    """
+    learned = []
+    for m in _LEARNED_RE.finditer(text):
+        word, meaning = m.group(1).strip(), m.group(2).strip()
+        learned.append((word, meaning))
+        _save_learned_async(word, meaning, lang)
+    cleaned = _LEARNED_RE.sub("", text).strip()
+    return cleaned, learned
+
+
+# System prompt — includes vocabulary context + conversation rules
+_CONVO_SYSTEM_TEMPLATE = """\
+You are a helpful voice assistant for Bambara and Fula speakers. \
+You are talking, not writing — keep every response to 1–3 short sentences.
+
+YOUR KNOWLEDGE BASE (words and phrases you have learned from users):
+{vocab}
+
+RULES you must always follow:
+1. Reply in whatever language the user speaks (Bambara, Fula, French, or English).
+2. When speaking Bambara, use phonetic spelling: 'u' not 'ou', 'j' not 'dj', 'c' not 'ch'.
+3. Keep responses SHORT — this is voice, not text.
+4. If you do not understand something, ask ONE specific follow-up question \
+(e.g. "Mun ye o fileli ye?" = "What does that mean?").
+5. If the user teaches you a word or phrase (says "X means Y" or "X se dit Y in Bambara"), \
+confirm warmly then add exactly: [LEARNED: word="X" meaning="Y"]
+6. Remember the full conversation — refer to earlier messages naturally \
+(e.g. "As you said earlier…", "I ka kuma fɔlen don…").
+7. Never invent words you do not know. Honest uncertainty is always better than wrong answers."""
+
+
+def _build_messages(user_text: str, history: list, language_code: str) -> list[dict]:
+    """Build the full message list: system (with vocab) + history + new user turn."""
+    vocab = _get_vocab_context()
+    system_content = _CONVO_SYSTEM_TEMPLATE.format(
+        vocab=vocab if vocab else "(no vocabulary recorded yet — you can teach me words!)"
+    )
+    messages: list[dict] = [{"role": "system", "content": system_content}]
+    # Inject conversation history (last 20 turns max)
+    for u, a in history[-20:]:
+        messages.append({"role": "user", "content": u})
+        messages.append({"role": "assistant", "content": a})
+    messages.append({"role": "user", "content": user_text})
+    return messages
 
 
 def set_voice_reference(audio_file) -> str:
@@ -316,22 +442,25 @@ def set_voice_reference(audio_file) -> str:
 
 
 @_gpu
-def _convo_pipeline(audio_path: str, language_code: str):
+def _convo_pipeline(audio_path: str, language_code: str, history: list):
     """
-    Full S2S conversation pipeline:
-    1. ASR — fine-tuned Whisper → transcript
-    2. Norm — bam_normalize() on Bambara input
-    3. Brain — LLM (Qwen) with Bambara phonetic system prompt → response text
-    4. Mouth — F5-TTS with voice reference (or MMS-TTS fallback) → audio
-
-    Returns same 4-tuple as _run_pipeline.
+    Full S2S conversation pipeline with memory:
+    1. ASR — fine-tuned Whisper (or base) → transcript
+    2. Norm — bam_normalize() on Bambara text
+    3. Brain — LLM with full conversation history + vocabulary context
+    4. Learn — parse [LEARNED:] tags, persist to Hub async
+    5. Mouth — F5-TTS (voice ref) or MMS-TTS fallback → audio
+
+    Returns: (transcript, eng, response_text, audio_out, new_history)
     """
     import torch
+    import logging
+    log = logging.getLogger(__name__)
 
     device = "cuda" if torch.cuda.is_available() else "cpu"
 
     if _whisper_model is None:
-        return "⏳ Model still loading…", "", "", None
+        return "⏳ Model still loading…", "", "", None, history
 
     import librosa
     audio_np, _ = librosa.load(audio_path, sr=16000, mono=True)
@@ -366,29 +495,42 @@ def _convo_pipeline(audio_path: str, language_code: str):
     if device == "cuda":
        torch.cuda.empty_cache()
 
-    # Phonetic normalisation for Bambara (unifies ou→u etc.)
+    # Phonetic normalisation for Bambara
    normalised = bam_normalize(transcript) if language_code == "bam" else transcript
 
-    # ── LLM brain ─────────────────────────────────────────────────────────────
+    # ── LLM brain — full context: vocab + history + new turn ─────────────────
+    response_text = ""
     try:
         from huggingface_hub import InferenceClient
         client = InferenceClient(token=HF_TOKEN)
+        messages = _build_messages(normalised, history, language_code)
         completion = client.chat_completion(
             model=LLM_MODEL_ID,
-            messages=[
-                {"role": "system", "content": _BAM_CONVO_SYSTEM},
-                {"role": "user", "content": normalised},
-            ],
-            max_tokens=256,
-            temperature=0.6,
+            messages=messages,
+            max_tokens=300,
+            temperature=0.65,
         )
         response_text = completion.choices[0].message.content.strip()
     except Exception as llm_err:
-        response_text = normalised  # echo transcript if LLM fails
-        import logging
-        logging.getLogger(__name__).warning("LLM failed: %s", llm_err)
+        log.warning("LLM failed: %s", llm_err)
+        # Graceful degradation: tell user LLM is unavailable, ask them to try again
+        response_text = (
+            "Hakɛ to, n bɛ sɔrɔ cogo dɔ la."
+            if language_code == "bam"
+            else "Sorry, I could not reach the language model. Please try again."
+        )
+
+    # ── Parse and strip [LEARNED:] tags — save async to Hub ──────────────────
+    response_text, learned_pairs = _parse_and_strip_learned(response_text, language_code)
+    if learned_pairs:
+        log.info("Learned %d new item(s): %s", len(learned_pairs), learned_pairs)
+
+    # ── Update conversation history ───────────────────────────────────────────
+    new_history = list(history) + [(normalised, response_text)]
+    if len(new_history) > 20:
+        new_history = new_history[-20:]
 
-    # ── TTS mouth — F5-TTS preferred, MMS-TTS fallback ────────────────────────
+    # ── TTS mouth — F5-TTS (voice ref) or MMS-TTS fallback ───────────────────
     audio_out = None
     if _voice_ref_path and Path(_voice_ref_path).exists():
         try:
@@ -403,15 +545,13 @@ def _convo_pipeline(audio_path: str, language_code: str):
             wav_np, sr = result
             audio_out = (sr, wav_np)
         except Exception as tts_err:
-            import logging
-            logging.getLogger(__name__).warning("F5-TTS failed, falling back: %s", tts_err)
+            log.warning("F5-TTS failed, using MMS-TTS: %s", tts_err)
 
     if audio_out is None:
-        # MMS-TTS fallback
         wav_np, sr = _tts.synthesize(response_text, language_code, device=device)
         audio_out = (sr, wav_np)
 
-    return transcript, "", response_text, audio_out
+    return transcript, "", response_text, audio_out, new_history
 
 
 # ── HF Hub feedback persistence ───────────────────────────────────────────────
@@ -643,6 +783,7 @@ def _append_phrases_to_vocabulary_jsonl(lang: str, pairs_text: str) -> None:
                 repo_id=FEEDBACK_REPO_ID,
                 repo_type="dataset",
             )
+            threading.Thread(target=_refresh_vocab_context, daemon=True).start()
             break
         except Exception:
             if attempt == 1:
@@ -689,8 +830,9 @@ def _load_phrase_additions_from_hub() -> None:
     except Exception:
         pass  # No additions saved yet — fine
 
-# Load user phrase additions in background at module import time
+# Load phrase additions + vocabulary context in background at startup
 threading.Thread(target=_load_phrase_additions_from_hub, daemon=True).start()
+threading.Thread(target=_refresh_vocab_context, daemon=True).start()
 
 
 def _save_audio_for_training(lang_label: str, audio_path: str | None, transcript: str, source_note: str) -> str:
@@ -1033,6 +1175,7 @@ def _harvest_wikipedia(lang_label: str, max_articles: int = 100) -> str:
     total, err = _upload_jsonl("vocabulary.jsonl", entries)
     if err:
         return f"❌ Upload failed: {err}"
+    threading.Thread(target=_refresh_vocab_context, daemon=True).start()
     return (
         f"✅ Wikipedia harvest complete!\n"
         f"   Language : {lang_label}\n"
@@ -1114,24 +1257,35 @@ def _harvest_hf_dataset(lang_label: str, max_samples: int = 500) -> str:
 
 # ── Main ask handler ──────────────────────────────────────────────────────────
 
-def handle_ask(audio_path, language_label, convo_mode: bool = False):
+def handle_ask(audio_path, language_label, convo_mode: bool = False, history: list | None = None):
+    """
+    Main dispatcher. Always returns 5 values:
+        (transcript, eng_translation, response_text, audio_out, new_history)
+    new_history is the updated gr.State list of (user, asst) tuples.
+    In normal (sensor) mode, history is passed through unchanged.
+    """
+    history = history or []
+
     if audio_path is None:
-        return "⚠️ No audio — press Record or upload a file.", "", "", None
+        return "⚠️ No audio — press Record or upload a file.", "", "", None, history
 
     language_code = SUPPORTED_LANGUAGES.get(language_label, "bam")
     status = _ensure_whisper_loaded()
 
     if _whisper_model is None:
-        return f"⏳ Model loading ({status}). Wait a moment and try again.", "", "", None
+        return f"⏳ Model loading ({status}). Wait a moment and try again.", "", "", None, history
 
     try:
         if convo_mode:
-            transcript, eng, response_text, audio_out = _convo_pipeline(audio_path, language_code)
+            transcript, eng, response_text, audio_out, new_history = _convo_pipeline(
+                audio_path, language_code, history
+            )
         else:
             transcript, eng, response_text, audio_out = _run_pipeline(audio_path, language_code)
-        return transcript, eng, response_text, audio_out
+            new_history = history  # sensor mode doesn't modify history
+        return transcript, eng, response_text, audio_out, new_history
     except Exception as e:
-        return f"❌ {e}", "", "", None
+        return f"❌ {e}", "", "", None, history
 
 
 # ── Gradio UI ─────────────────────────────────────────────────────────────────
@@ -1190,6 +1344,9 @@ def build_ui() -> gr.Blocks:
 
             gr.Markdown("---")
 
+            # Per-session conversation history (not shared between users)
+            conv_history = gr.State(value=[])
+
             with gr.Row():
                 with gr.Column(scale=1):
                     language_dd = gr.Dropdown(
@@ -1202,19 +1359,21 @@
                         type="filepath",
                        label="Record or upload audio",
                    )
-                    ask_btn = gr.Button("▶ Ask / Ɲinɛ", variant="primary")
+                    with gr.Row():
+                        ask_btn = gr.Button("▶ Ask / Ɲinɛ", variant="primary")
+                        clear_btn = gr.Button("🗑 Clear", variant="secondary", size="sm")
 
                 with gr.Column(scale=1):
                     transcript_box = gr.Textbox(
-                        label="Whisper heard (transcription)",
+                        label="Whisper heard",
                         lines=2,
                         placeholder="Your words will appear here…",
                         interactive=False,
                     )
                     translation_box = gr.Textbox(
-                        label="English translation (hidden in Conversation Mode)",
+                        label="English translation",
                         lines=2,
-                        placeholder="English meaning will appear here…",
+                        placeholder="(shown in sensor mode only)",
                         interactive=False,
                     )
                     response_box = gr.Textbox(
@@ -1234,21 +1393,46 @@
                         size="sm",
                     )
 
-            _ask_inputs = [audio_input, language_dd, convo_mode_toggle]
-            _ask_outputs = [transcript_box, translation_box, response_box, audio_output]
+            # Conversation history display (Conversation Mode only)
+            chatbot = gr.Chatbot(
+                label="Conversation history",
+                height=300,
+                visible=False,
+                type="tuples",
+            )
+            convo_mode_toggle.change(
+                fn=lambda on: gr.update(visible=on),
+                inputs=[convo_mode_toggle],
+                outputs=[chatbot],
+            )
+
+            _ask_inputs = [audio_input, language_dd, convo_mode_toggle, conv_history]
+            _ask_outputs = [transcript_box, translation_box, response_box,
+                            audio_output, conv_history, chatbot]
+
+            def _ask_and_update(ap, ll, cm, hist):
+                t, e, r, a, new_hist = handle_ask(ap, ll, cm, hist)
+                # Convert history tuples to list-of-lists for gr.Chatbot
+                chat_msgs = [[u, v] for u, v in new_hist]
+                return t, e, r, a, new_hist, chat_msgs
 
-            # Manual button click
             ask_btn.click(
-                fn=handle_ask,
+                fn=_ask_and_update,
                 inputs=_ask_inputs,
                 outputs=_ask_outputs,
             )
-            # Auto-submit when mic recording stops (Conversation Mode only)
+            # Auto-submit when mic stops (Conversation Mode)
            audio_input.stop_recording(
-                fn=lambda ap, ll, cm: handle_ask(ap, ll, cm) if cm else (None, None, None, None),
+                fn=lambda ap, ll, cm, h: _ask_and_update(ap, ll, cm, h) if cm
+                else (None, None, None, None, h, [[u, v] for u, v in h]),
                inputs=_ask_inputs,
                outputs=_ask_outputs,
            )
+            # Clear conversation
+            clear_btn.click(
+                fn=lambda: ([], []),
+                outputs=[conv_history, chatbot],
+            )
 
        # ── Tab 2: Feedback & Correction ─────────────────────────────────
        with gr.TabItem("📝 Feedback & Correction", id="tab_feedback"):