Add real conversational memory + live learning to Conversation Mode
Problem: every turn was stateless — no history, no vocabulary context.
The LLM received only the current utterance with a generic system prompt,
so it had no idea what was said before and could not use taught words.
Changes to app.py:
Vocabulary context cache (_vocab_context_cache):
- _refresh_vocab_context(): loads vocabulary.jsonl from Hub at startup
and after every vocabulary write; formats top 200 entries as compact
"word = meaning [lang]" lines injected into every LLM system prompt
- Called on startup (background thread) and after phrase imports,
Wikipedia harvests, and mid-conversation LEARNED saves
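The "word = meaning [lang]" formatting can be sketched as a standalone function (hypothetical name `format_vocab_context`; it mirrors the newest-first, cap-at-200 logic of `_refresh_vocab_context` in the diff below, minus the Hub download):

```python
import json

def format_vocab_context(jsonl_text: str, cap: int = 200) -> str:
    """Newest-first, capped at `cap` entries, one compact line per word."""
    entries = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if line:
            try:
                entries.append(json.loads(line))
            except ValueError:
                pass  # skip malformed rows rather than fail the whole cache
    lines = []
    # Take the last `cap` rows, then reverse so the most recent come first
    for e in entries[-cap:][::-1]:
        word = e.get("word", "").strip()
        tr = e.get("translation", "").strip()
        lang = e.get("language", "")
        if word:
            lines.append(f"{word} = {tr} [{lang}]" if tr else f"{word} [{lang}]")
    return "\n".join(lines)

sample = (
    '{"word": "jɛgɛ", "translation": "fish", "language": "bam"}\n'
    '{"word": "ji", "translation": "water", "language": "bam"}\n'
)
print(format_vocab_context(sample))  # → "ji = water [bam]" then "jɛgɛ = fish [bam]"
```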
Conversation history (gr.State):
- conv_history = gr.State(value=[]) — per-session, not shared
- Passed as input + output on every ask_btn / stop_recording event
- Capped at 20 turns to stay within LLM token budget
- Displayed in gr.Chatbot (visible only when Conversation Mode is ON)
- "Clear" button resets history and chatbot to empty
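The shape of the prompt sent each turn — system message, then the capped history, then the new utterance — can be sketched like this (a simplified standalone version of `_build_messages` from the diff; the vocab injection is folded into `system_content`):

```python
def build_messages(system_content: str, history: list[tuple[str, str]],
                   user_text: str, max_turns: int = 20) -> list[dict]:
    """System prompt first, then the last `max_turns` (user, assistant) pairs, then the new turn."""
    messages = [{"role": "system", "content": system_content}]
    for u, a in history[-max_turns:]:  # cap keeps us inside the LLM token budget
        messages.append({"role": "user", "content": u})
        messages.append({"role": "assistant", "content": a})
    messages.append({"role": "user", "content": user_text})
    return messages

history = [(f"q{i}", f"a{i}") for i in range(25)]  # 25 turns; only the last 20 survive
msgs = build_messages("You are a voice assistant.", history, "i ni ce")
assert len(msgs) == 1 + 20 * 2 + 1
assert msgs[1]["content"] == "q5"        # oldest kept turn
assert msgs[-1]["content"] == "i ni ce"  # the new utterance is always last
```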
Smarter system prompt (_CONVO_SYSTEM_TEMPLATE):
- Injects full vocabulary context so LLM knows every word taught so far
- Full multi-turn message list passed to LLM (history + new turn)
- Instructs LLM to ask clarifying questions when uncertain
- Instructs LLM to refer back to earlier messages naturally
- Teaches LEARNED tag format: [LEARNED: word="X" meaning="Y"]
Auto-learning from conversation (_parse_and_strip_learned):
- Regex parses [LEARNED: ...] tags out of LLM response
- Strips them from spoken text before TTS (user never hears the tag)
- Saves each learned pair to vocabulary.jsonl on Hub async
- Immediately refreshes vocab cache so next turn knows the new word
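The tag grammar is the same regex as the diff's `_LEARNED_RE`; a minimal parse-and-strip sketch (without the async Hub save):

```python
import re

LEARNED_RE = re.compile(
    r'\[LEARNED:\s*word=["\'](.+?)["\']\s+meaning=["\'](.+?)["\']\s*\]',
    re.IGNORECASE,
)

def parse_and_strip_learned(text: str) -> tuple[str, list[tuple[str, str]]]:
    """Pull (word, meaning) pairs out of the LLM reply; return tag-free text for TTS."""
    pairs = [(m.group(1).strip(), m.group(2).strip()) for m in LEARNED_RE.finditer(text)]
    return LEARNED_RE.sub("", text).strip(), pairs

spoken, pairs = parse_and_strip_learned('A ɲɛna! [LEARNED: word="jɛgɛ" meaning="fish"]')
assert spoken == "A ɲɛna!"           # the user never hears the tag
assert pairs == [("jɛgɛ", "fish")]   # the pair is what gets persisted to the Hub
```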
_convo_pipeline():
- Now accepts history: list and returns new_history as 5th value
- _build_messages() constructs full system+history+user message list
- Graceful LLM fallback: speaks "I could not reach the model" in Bambara
handle_ask() always returns a 5-tuple: (transcript, eng, response, audio, history)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -248,22 +248,148 @@ def _run_pipeline(audio_path: str, language_code: str):
 
 # ── Conversation-mode helpers ─────────────────────────────────────────────────
 
-#
-
-
-1. Always reply in Bambara, matching the user's informal spoken style.
-2. Use phonetic spelling: write 'u' instead of 'ou', 'j' instead of 'dj', \
-'c' instead of 'ch' — spell words as they sound when spoken aloud.
-3. Keep responses short: 1–3 sentences max. This is a voice conversation.
-4. Never add translations or explanations unless explicitly asked.
-5. If the user speaks French or English, switch to that language naturally."""
 
 
-def
-
-
-
-
 
 
 def set_voice_reference(audio_file) -> str:

@@ -316,22 +442,25 @@ def set_voice_reference(audio_file) -> str:
 
 
 @_gpu
-def _convo_pipeline(audio_path: str, language_code: str):
     """
-    Full S2S conversation pipeline:
-    1. ASR — fine-tuned Whisper → transcript
-    2. Norm — bam_normalize() on Bambara
-    3. Brain — LLM
-    4.
-
-
     """
     import torch
 
     device = "cuda" if torch.cuda.is_available() else "cpu"
 
     if _whisper_model is None:
-        return "⏳ Model still loading…", "", "", None
 
     import librosa
     audio_np, _ = librosa.load(audio_path, sr=16000, mono=True)
@@ -366,29 +495,42 @@ def _convo_pipeline(audio_path: str, language_code: str):
     if device == "cuda":
         torch.cuda.empty_cache()
 
-    # Phonetic normalisation for Bambara
     normalised = bam_normalize(transcript) if language_code == "bam" else transcript
 
-    # ── LLM brain ─────────────────
     try:
         from huggingface_hub import InferenceClient
-        client
         completion = client.chat_completion(
             model=LLM_MODEL_ID,
-            messages=
-
-
-            ],
-            max_tokens=256,
-            temperature=0.6,
         )
         response_text = completion.choices[0].message.content.strip()
     except Exception as llm_err:
-
-
-
 
-    # ── TTS mouth — F5-TTS
     audio_out = None
     if _voice_ref_path and Path(_voice_ref_path).exists():
         try:

@@ -403,15 +545,13 @@ def _convo_pipeline(audio_path: str, language_code: str):
             wav_np, sr = result
             audio_out = (sr, wav_np)
         except Exception as tts_err:
-
-            logging.getLogger(__name__).warning("F5-TTS failed, falling back: %s", tts_err)
 
     if audio_out is None:
-        # MMS-TTS fallback
         wav_np, sr = _tts.synthesize(response_text, language_code, device=device)
         audio_out = (sr, wav_np)
 
-    return transcript, "", response_text, audio_out
 
 
 # ── HF Hub feedback persistence ───────────────────────────────────────────────
@@ -643,6 +783,7 @@ def _append_phrases_to_vocabulary_jsonl(lang: str, pairs_text: str) -> None:
                 repo_id=FEEDBACK_REPO_ID,
                 repo_type="dataset",
             )
             break
         except Exception:
             if attempt == 1:

@@ -689,8 +830,9 @@ def _load_phrase_additions_from_hub() -> None:
     except Exception:
         pass  # No additions saved yet — fine
 
-# Load
 threading.Thread(target=_load_phrase_additions_from_hub, daemon=True).start()
 
 
 def _save_audio_for_training(lang_label: str, audio_path: str | None, transcript: str, source_note: str) -> str:
@@ -1033,6 +1175,7 @@ def _harvest_wikipedia(lang_label: str, max_articles: int = 100) -> str:
     total, err = _upload_jsonl("vocabulary.jsonl", entries)
     if err:
         return f"❌ Upload failed: {err}"
     return (
         f"✅ Wikipedia harvest complete!\n"
         f"   Language : {lang_label}\n"

@@ -1114,24 +1257,35 @@ def _harvest_hf_dataset(lang_label: str, max_samples: int = 500) -> str:
 
 # ── Main ask handler ──────────────────────────────────────────────────────────
 
-def handle_ask(audio_path, language_label, convo_mode: bool = False):
     if audio_path is None:
-        return "⚠️ No audio — press Record or upload a file.", "", "", None
 
     language_code = SUPPORTED_LANGUAGES.get(language_label, "bam")
     status = _ensure_whisper_loaded()
 
     if _whisper_model is None:
-        return f"⏳ Model loading ({status}). Wait a moment and try again.", "", "", None
 
     try:
         if convo_mode:
-            transcript, eng, response_text, audio_out = _convo_pipeline(
         else:
             transcript, eng, response_text, audio_out = _run_pipeline(audio_path, language_code)
-
     except Exception as e:
-        return f"❌ {e}", "", "", None
 
 
 # ── Gradio UI ─────────────────────────────────────────────────────────────────
@@ -1190,6 +1344,9 @@ def build_ui() -> gr.Blocks:
 
         gr.Markdown("---")
 
         with gr.Row():
             with gr.Column(scale=1):
                 language_dd = gr.Dropdown(

@@ -1202,19 +1359,21 @@ def build_ui() -> gr.Blocks:
                     type="filepath",
                     label="Record or upload audio",
                 )
-
 
             with gr.Column(scale=1):
                 transcript_box = gr.Textbox(
-                    label="Whisper heard
                     lines=2,
                     placeholder="Your words will appear here…",
                     interactive=False,
                 )
                 translation_box = gr.Textbox(
-                    label="English translation
                     lines=2,
-                    placeholder="
                     interactive=False,
                 )
                 response_box = gr.Textbox(

@@ -1234,21 +1393,46 @@ def build_ui() -> gr.Blocks:
                     size="sm",
                 )
 
-
-
 
-        # Manual button click
         ask_btn.click(
-            fn=
             inputs=_ask_inputs,
             outputs=_ask_outputs,
         )
-        # Auto-submit when mic
         audio_input.stop_recording(
-            fn=lambda ap, ll, cm:
             inputs=_ask_inputs,
             outputs=_ask_outputs,
         )
 
         # ── Tab 2: Feedback & Correction ─────────────────
         with gr.TabItem("📝 Feedback & Correction", id="tab_feedback"):
 
 # ── Conversation-mode helpers ─────────────────────────────────────────────────
 
+# Vocabulary context cache — loaded from Hub, refreshed after each LEARNED save
+_vocab_context_cache: str = ""
+_vocab_lock = threading.Lock()
 
 
+def _refresh_vocab_context() -> None:
+    """Load vocabulary.jsonl from Hub and rebuild the LLM context string."""
+    global _vocab_context_cache
+    if not HF_TOKEN or not FEEDBACK_REPO_ID:
+        return
+    try:
+        from huggingface_hub import hf_hub_download
+        local = hf_hub_download(
+            repo_id=FEEDBACK_REPO_ID, filename="vocabulary.jsonl",
+            repo_type="dataset", token=HF_TOKEN,
+        )
+        entries: list[dict] = []
+        with open(local, encoding="utf-8") as f:
+            for line in f:
+                line = line.strip()
+                if line:
+                    try:
+                        entries.append(json.loads(line))
+                    except Exception:
+                        pass
+        # Most recent first, cap at 200 entries to stay within token budget
+        entries = entries[-200:][::-1]
+        lines = []
+        for e in entries:
+            word = e.get("word", "").strip()
+            tr = e.get("translation", "").strip()
+            lang = e.get("language", "")
+            if word:
+                lines.append(f"{word} = {tr} [{lang}]" if tr else f"{word} [{lang}]")
+        with _vocab_lock:
+            _vocab_context_cache = "\n".join(lines)
+    except Exception:
+        pass  # Non-critical — LLM continues without vocab context
+
+
+def _get_vocab_context() -> str:
+    with _vocab_lock:
+        return _vocab_context_cache
+
+
+def _save_learned_async(word: str, meaning: str, lang: str) -> None:
+    """Persist a word/phrase learned mid-conversation to vocabulary.jsonl on Hub."""
+    def _run():
+        if not word.strip():
+            return
+        entry = {"word": word.strip(), "translation": meaning.strip(), "language": lang,
+                 "source": "conversation", "timestamp": datetime.now(timezone.utc).isoformat()}
+        _upload_jsonl_later("vocabulary.jsonl", [entry])
+        _refresh_vocab_context()  # update cache so next turn knows this word
+    threading.Thread(target=_run, daemon=True).start()
+
+
+def _upload_jsonl_later(repo_path: str, entries: list[dict]) -> None:
+    """Append entries to a Hub JSONL file — called from background threads."""
+    if not HF_TOKEN or not FEEDBACK_REPO_ID or _hf_api is None:
+        return
+    from huggingface_hub import hf_hub_download
+    for attempt in range(2):
+        try:
+            local = hf_hub_download(
+                repo_id=FEEDBACK_REPO_ID, filename=repo_path,
+                repo_type="dataset", token=HF_TOKEN,
+            )
+            with open(local, encoding="utf-8") as f:
+                existing = f.read()
+        except Exception:
+            existing = ""
+        updated = existing + "".join(json.dumps(e, ensure_ascii=False) + "\n" for e in entries)
+        try:
+            _hf_api.upload_file(
+                path_or_fileobj=io.BytesIO(updated.encode("utf-8")),
+                path_in_repo=repo_path,
+                repo_id=FEEDBACK_REPO_ID,
+                repo_type="dataset",
+            )
+            return
+        except Exception:
+            if attempt == 1:
+                pass
+
+
+import re as _re
+_LEARNED_RE = _re.compile(
+    r'\[LEARNED:\s*word=["\'](.+?)["\']\s+meaning=["\'](.+?)["\']\s*\]',
+    _re.IGNORECASE,
+)
+
+
+def _parse_and_strip_learned(text: str, lang: str) -> tuple[str, list[tuple[str, str]]]:
+    """
+    Extract [LEARNED: word="X" meaning="Y"] tags from LLM output.
+    Returns (cleaned_text, list_of_(word, meaning) pairs).
+    Saves each pair to Hub asynchronously.
+    """
+    learned = []
+    for m in _LEARNED_RE.finditer(text):
+        word, meaning = m.group(1).strip(), m.group(2).strip()
+        learned.append((word, meaning))
+        _save_learned_async(word, meaning, lang)
+    cleaned = _LEARNED_RE.sub("", text).strip()
+    return cleaned, learned
+
+
+# System prompt — includes vocabulary context + conversation rules
+_CONVO_SYSTEM_TEMPLATE = """\
+You are a helpful voice assistant for Bambara and Fula speakers. \
+You are talking, not writing — keep every response to 1–3 short sentences.
+
+YOUR KNOWLEDGE BASE (words and phrases you have learned from users):
+{vocab}
+
+RULES you must always follow:
+1. Reply in whatever language the user speaks (Bambara, Fula, French, or English).
+2. When speaking Bambara, use phonetic spelling: 'u' not 'ou', 'j' not 'dj', 'c' not 'ch'.
+3. Keep responses SHORT — this is voice, not text.
+4. If you do not understand something, ask ONE specific follow-up question \
+   (e.g. "Mun ye o fileli ye?" = "What does that mean?").
+5. If the user teaches you a word or phrase (says "X means Y" or "X se dit Y in Bambara"), \
+   confirm warmly then add exactly: [LEARNED: word="X" meaning="Y"]
+6. Remember the full conversation — refer to earlier messages naturally \
+   (e.g. "As you said earlier…", "I ka kuma fɔlen don…").
+7. Never invent words you do not know. Honest uncertainty is always better than wrong answers."""
+
+
+def _build_messages(user_text: str, history: list, language_code: str) -> list[dict]:
+    """Build the full message list: system (with vocab) + history + new user turn."""
+    vocab = _get_vocab_context()
+    system_content = _CONVO_SYSTEM_TEMPLATE.format(
+        vocab=vocab if vocab else "(no vocabulary recorded yet — you can teach me words!)"
+    )
+    messages: list[dict] = [{"role": "system", "content": system_content}]
+    # Inject conversation history (last 20 turns max)
+    for u, a in history[-20:]:
+        messages.append({"role": "user", "content": u})
+        messages.append({"role": "assistant", "content": a})
+    messages.append({"role": "user", "content": user_text})
+    return messages
 
 
 def set_voice_reference(audio_file) -> str:
 
 
 @_gpu
+def _convo_pipeline(audio_path: str, language_code: str, history: list):
     """
+    Full S2S conversation pipeline with memory:
+    1. ASR   — fine-tuned Whisper (or base) → transcript
+    2. Norm  — bam_normalize() on Bambara text
+    3. Brain — LLM with full conversation history + vocabulary context
+    4. Learn — parse [LEARNED:] tags, persist to Hub async
+    5. Mouth — F5-TTS (voice ref) or MMS-TTS fallback → audio
+
+    Returns: (transcript, eng, response_text, audio_out, new_history)
     """
     import torch
+    import logging
+    log = logging.getLogger(__name__)
 
     device = "cuda" if torch.cuda.is_available() else "cpu"
 
     if _whisper_model is None:
+        return "⏳ Model still loading…", "", "", None, history
 
     import librosa
     audio_np, _ = librosa.load(audio_path, sr=16000, mono=True)
     if device == "cuda":
         torch.cuda.empty_cache()
 
+    # Phonetic normalisation for Bambara
     normalised = bam_normalize(transcript) if language_code == "bam" else transcript
 
+    # ── LLM brain — full context: vocab + history + new turn ─────────────────
+    response_text = ""
     try:
         from huggingface_hub import InferenceClient
+        client = InferenceClient(token=HF_TOKEN)
+        messages = _build_messages(normalised, history, language_code)
         completion = client.chat_completion(
             model=LLM_MODEL_ID,
+            messages=messages,
+            max_tokens=300,
+            temperature=0.65,
         )
         response_text = completion.choices[0].message.content.strip()
     except Exception as llm_err:
+        log.warning("LLM failed: %s", llm_err)
+        # Graceful degradation: tell user LLM is unavailable, ask them to try again
+        response_text = (
+            "Hakɛ to, n bɛ sɔrɔ cogo dɔ la."
+            if language_code == "bam"
+            else "Sorry, I could not reach the language model. Please try again."
+        )
+
+    # ── Parse and strip [LEARNED:] tags — save async to Hub ──────────────────
+    response_text, learned_pairs = _parse_and_strip_learned(response_text, language_code)
+    if learned_pairs:
+        log.info("Learned %d new item(s): %s", len(learned_pairs), learned_pairs)
+
+    # ── Update conversation history ───────────────────────────────────────────
+    new_history = list(history) + [(normalised, response_text)]
+    if len(new_history) > 20:
+        new_history = new_history[-20:]
 
+    # ── TTS mouth — F5-TTS (voice ref) or MMS-TTS fallback ───────────────────
     audio_out = None
     if _voice_ref_path and Path(_voice_ref_path).exists():
         try:
             wav_np, sr = result
             audio_out = (sr, wav_np)
         except Exception as tts_err:
+            log.warning("F5-TTS failed, using MMS-TTS: %s", tts_err)
 
     if audio_out is None:
         wav_np, sr = _tts.synthesize(response_text, language_code, device=device)
         audio_out = (sr, wav_np)
 
+    return transcript, "", response_text, audio_out, new_history
 
 
 # ── HF Hub feedback persistence ───────────────────────────────────────────────
                 repo_id=FEEDBACK_REPO_ID,
                 repo_type="dataset",
             )
+            threading.Thread(target=_refresh_vocab_context, daemon=True).start()
             break
         except Exception:
             if attempt == 1:
     except Exception:
         pass  # No additions saved yet — fine
 
+# Load phrase additions + vocabulary context in background at startup
 threading.Thread(target=_load_phrase_additions_from_hub, daemon=True).start()
+threading.Thread(target=_refresh_vocab_context, daemon=True).start()
 
 
 def _save_audio_for_training(lang_label: str, audio_path: str | None, transcript: str, source_note: str) -> str:
     total, err = _upload_jsonl("vocabulary.jsonl", entries)
     if err:
         return f"❌ Upload failed: {err}"
+    threading.Thread(target=_refresh_vocab_context, daemon=True).start()
     return (
         f"✅ Wikipedia harvest complete!\n"
         f"   Language : {lang_label}\n"
 
 # ── Main ask handler ──────────────────────────────────────────────────────────
 
+def handle_ask(audio_path, language_label, convo_mode: bool = False, history: list | None = None):
+    """
+    Main dispatcher. Always returns 5 values:
+        (transcript, eng_translation, response_text, audio_out, new_history)
+    new_history is the updated gr.State list of (user, asst) tuples.
+    In normal (sensor) mode, history is passed through unchanged.
+    """
+    history = history or []
+
     if audio_path is None:
+        return "⚠️ No audio — press Record or upload a file.", "", "", None, history
 
     language_code = SUPPORTED_LANGUAGES.get(language_label, "bam")
     status = _ensure_whisper_loaded()
 
     if _whisper_model is None:
+        return f"⏳ Model loading ({status}). Wait a moment and try again.", "", "", None, history
 
     try:
         if convo_mode:
+            transcript, eng, response_text, audio_out, new_history = _convo_pipeline(
+                audio_path, language_code, history
+            )
         else:
             transcript, eng, response_text, audio_out = _run_pipeline(audio_path, language_code)
+            new_history = history  # sensor mode doesn't modify history
+        return transcript, eng, response_text, audio_out, new_history
     except Exception as e:
+        return f"❌ {e}", "", "", None, history
 
 
 # ── Gradio UI ─────────────────────────────────────────────────────────────────
 
         gr.Markdown("---")
 
+        # Per-session conversation history (not shared between users)
+        conv_history = gr.State(value=[])
+
         with gr.Row():
             with gr.Column(scale=1):
                 language_dd = gr.Dropdown(
                     type="filepath",
                     label="Record or upload audio",
                 )
+                with gr.Row():
+                    ask_btn = gr.Button("▶ Ask / Ɲinɛ", variant="primary")
+                    clear_btn = gr.Button("🗑 Clear", variant="secondary", size="sm")
 
             with gr.Column(scale=1):
                 transcript_box = gr.Textbox(
+                    label="Whisper heard",
                     lines=2,
                     placeholder="Your words will appear here…",
                     interactive=False,
                 )
                 translation_box = gr.Textbox(
+                    label="English translation",
                     lines=2,
+                    placeholder="(shown in sensor mode only)",
                     interactive=False,
                 )
                 response_box = gr.Textbox(
                     size="sm",
                 )
 
+        # Conversation history display (Conversation Mode only)
+        chatbot = gr.Chatbot(
+            label="Conversation history",
+            height=300,
+            visible=False,
+            type="tuples",
+        )
+        convo_mode_toggle.change(
+            fn=lambda on: gr.update(visible=on),
+            inputs=[convo_mode_toggle],
+            outputs=[chatbot],
+        )
+
+        _ask_inputs = [audio_input, language_dd, convo_mode_toggle, conv_history]
+        _ask_outputs = [transcript_box, translation_box, response_box,
+                        audio_output, conv_history, chatbot]
+
+        def _ask_and_update(ap, ll, cm, hist):
+            t, e, r, a, new_hist = handle_ask(ap, ll, cm, hist)
+            # Convert history tuples to list-of-lists for gr.Chatbot
+            chat_msgs = [[u, v] for u, v in new_hist]
+            return t, e, r, a, new_hist, chat_msgs
 
         ask_btn.click(
+            fn=_ask_and_update,
             inputs=_ask_inputs,
             outputs=_ask_outputs,
         )
+        # Auto-submit when mic stops (Conversation Mode)
         audio_input.stop_recording(
+            fn=lambda ap, ll, cm, h: _ask_and_update(ap, ll, cm, h) if cm
+            else (None, None, None, None, h, [[u, v] for u, v in h]),
             inputs=_ask_inputs,
             outputs=_ask_outputs,
         )
+        # Clear conversation
+        clear_btn.click(
+            fn=lambda: ([], []),
+            outputs=[conv_history, chatbot],
+        )
 
         # ── Tab 2: Feedback & Correction ─────────────────
         with gr.TabItem("📝 Feedback & Correction", id="tab_feedback"):