Add real conversational memory + live learning to Conversation Mode
Problem: every turn was stateless — no history, no vocabulary context.
The LLM received only the current utterance with a generic system prompt,
so it had no idea what was said before and could not use taught words.
Changes to app.py:
Vocabulary context cache (_vocab_context_cache):
- _refresh_vocab_context(): loads vocabulary.jsonl from Hub at startup
and after every vocabulary write; formats top 200 entries as compact
"word = meaning [lang]" lines injected into every LLM system prompt
- Called on startup (background thread) and after phrase imports,
Wikipedia harvests, and mid-conversation LEARNED saves
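The "word = meaning [lang]" formatting can be sketched as a standalone function (hypothetical name `format_vocab_context`; it mirrors the newest-first, cap-at-200 logic of `_refresh_vocab_context` in the diff below, minus the Hub download):

```python
import json

def format_vocab_context(jsonl_text: str, cap: int = 200) -> str:
    """Newest-first, capped at `cap` entries, one compact line per word."""
    entries = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if line:
            try:
                entries.append(json.loads(line))
            except ValueError:
                pass  # skip malformed rows rather than fail the whole cache
    lines = []
    # Take the last `cap` rows, then reverse so the most recent come first
    for e in entries[-cap:][::-1]:
        word = e.get("word", "").strip()
        tr = e.get("translation", "").strip()
        lang = e.get("language", "")
        if word:
            lines.append(f"{word} = {tr} [{lang}]" if tr else f"{word} [{lang}]")
    return "\n".join(lines)

sample = (
    '{"word": "jɛgɛ", "translation": "fish", "language": "bam"}\n'
    '{"word": "ji", "translation": "water", "language": "bam"}\n'
)
print(format_vocab_context(sample))  # → "ji = water [bam]" then "jɛgɛ = fish [bam]"
```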
Conversation history (gr.State):
- conv_history = gr.State(value=[]) — per-session, not shared
- Passed as input + output on every ask_btn / stop_recording event
- Capped at 20 turns to stay within LLM token budget
- Displayed in gr.Chatbot (visible only when Conversation Mode is ON)
- "Clear" button resets history and chatbot to empty
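The shape of the prompt sent each turn — system message, then the capped history, then the new utterance — can be sketched like this (a simplified standalone version of `_build_messages` from the diff; the vocab injection is folded into `system_content`):

```python
def build_messages(system_content: str, history: list[tuple[str, str]],
                   user_text: str, max_turns: int = 20) -> list[dict]:
    """System prompt first, then the last `max_turns` (user, assistant) pairs, then the new turn."""
    messages = [{"role": "system", "content": system_content}]
    for u, a in history[-max_turns:]:  # cap keeps us inside the LLM token budget
        messages.append({"role": "user", "content": u})
        messages.append({"role": "assistant", "content": a})
    messages.append({"role": "user", "content": user_text})
    return messages

history = [(f"q{i}", f"a{i}") for i in range(25)]  # 25 turns; only the last 20 survive
msgs = build_messages("You are a voice assistant.", history, "i ni ce")
assert len(msgs) == 1 + 20 * 2 + 1
assert msgs[1]["content"] == "q5"        # oldest kept turn
assert msgs[-1]["content"] == "i ni ce"  # the new utterance is always last
```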
Smarter system prompt (_CONVO_SYSTEM_TEMPLATE):
- Injects full vocabulary context so LLM knows every word taught so far
- Full multi-turn message list passed to LLM (history + new turn)
- Instructs LLM to ask clarifying questions when uncertain
- Instructs LLM to refer back to earlier messages naturally
- Teaches LEARNED tag format: [LEARNED: word="X" meaning="Y"]
Auto-learning from conversation (_parse_and_strip_learned):
- Regex parses [LEARNED: ...] tags out of LLM response
- Strips them from spoken text before TTS (user never hears the tag)
- Saves each learned pair to vocabulary.jsonl on Hub async
- Immediately refreshes vocab cache so next turn knows the new word
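The tag grammar is the same regex as the diff's `_LEARNED_RE`; a minimal parse-and-strip sketch (without the async Hub save):

```python
import re

LEARNED_RE = re.compile(
    r'\[LEARNED:\s*word=["\'](.+?)["\']\s+meaning=["\'](.+?)["\']\s*\]',
    re.IGNORECASE,
)

def parse_and_strip_learned(text: str) -> tuple[str, list[tuple[str, str]]]:
    """Pull (word, meaning) pairs out of the LLM reply; return tag-free text for TTS."""
    pairs = [(m.group(1).strip(), m.group(2).strip()) for m in LEARNED_RE.finditer(text)]
    return LEARNED_RE.sub("", text).strip(), pairs

spoken, pairs = parse_and_strip_learned('A ɲɛna! [LEARNED: word="jɛgɛ" meaning="fish"]')
assert spoken == "A ɲɛna!"           # the user never hears the tag
assert pairs == [("jɛgɛ", "fish")]   # the pair is what gets persisted to the Hub
```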
_convo_pipeline():
- Now accepts history: list and returns new_history as 5th value
- _build_messages() constructs full system+history+user message list
- Graceful LLM fallback: speaks "I could not reach the model" in Bambara
handle_ask() always returns a 5-tuple: (transcript, eng, response, audio, history)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -248,22 +248,148 @@ def _run_pipeline(audio_path: str, language_code: str):
 
 # ── Conversation-mode helpers ─────────────────────────────────────────────────
 
-#
-
-
-1. Always reply in Bambara, matching the user's informal spoken style.
-2. Use phonetic spelling: write 'u' instead of 'ou', 'j' instead of 'dj', \
-'c' instead of 'ch' — spell words as they sound when spoken aloud.
-3. Keep responses short: 1–3 sentences max. This is a voice conversation.
-4. Never add translations or explanations unless explicitly asked.
-5. If the user speaks French or English, switch to that language naturally."""
 
 
-def
-
-
-
-
 
 
 def set_voice_reference(audio_file) -> str:

@@ -316,22 +442,25 @@ def set_voice_reference(audio_file) -> str:
 
 
 @_gpu
-def _convo_pipeline(audio_path: str, language_code: str):
     """
-    Full S2S conversation pipeline:
-    1. ASR — fine-tuned Whisper → transcript
-    2. Norm — bam_normalize() on Bambara
-    3. Brain — LLM
-    4.
-
-
     """
     import torch
 
     device = "cuda" if torch.cuda.is_available() else "cpu"
 
     if _whisper_model is None:
-        return "⏳ Model still loading…", "", "", None
 
     import librosa
     audio_np, _ = librosa.load(audio_path, sr=16000, mono=True)
@@ -366,29 +495,42 @@ def _convo_pipeline(audio_path: str, language_code: str):
     if device == "cuda":
         torch.cuda.empty_cache()
 
-    # Phonetic normalisation for Bambara
     normalised = bam_normalize(transcript) if language_code == "bam" else transcript
 
-    # ── LLM brain ─────────────────
     try:
         from huggingface_hub import InferenceClient
-        client
         completion = client.chat_completion(
             model=LLM_MODEL_ID,
-            messages=
-
-
-            ],
-            max_tokens=256,
-            temperature=0.6,
         )
         response_text = completion.choices[0].message.content.strip()
     except Exception as llm_err:
-
-
-
 
-    # ── TTS mouth — F5-TTS
     audio_out = None
     if _voice_ref_path and Path(_voice_ref_path).exists():
         try:

@@ -403,15 +545,13 @@ def _convo_pipeline(audio_path: str, language_code: str):
             wav_np, sr = result
             audio_out = (sr, wav_np)
         except Exception as tts_err:
-
-            logging.getLogger(__name__).warning("F5-TTS failed, falling back: %s", tts_err)
 
     if audio_out is None:
-        # MMS-TTS fallback
         wav_np, sr = _tts.synthesize(response_text, language_code, device=device)
         audio_out = (sr, wav_np)
 
-    return transcript, "", response_text, audio_out
 
 
 # ── HF Hub feedback persistence ───────────────────────────────────────────────
@@ -643,6 +783,7 @@ def _append_phrases_to_vocabulary_jsonl(lang: str, pairs_text: str) -> None:
                 repo_id=FEEDBACK_REPO_ID,
                 repo_type="dataset",
             )
             break
         except Exception:
             if attempt == 1:

@@ -689,8 +830,9 @@ def _load_phrase_additions_from_hub() -> None:
     except Exception:
         pass  # No additions saved yet — fine
 
-# Load
 threading.Thread(target=_load_phrase_additions_from_hub, daemon=True).start()
 
 
 def _save_audio_for_training(lang_label: str, audio_path: str | None, transcript: str, source_note: str) -> str:
@@ -1033,6 +1175,7 @@ def _harvest_wikipedia(lang_label: str, max_articles: int = 100) -> str:
     total, err = _upload_jsonl("vocabulary.jsonl", entries)
     if err:
         return f"❌ Upload failed: {err}"
     return (
         f"✅ Wikipedia harvest complete!\n"
         f"   Language : {lang_label}\n"

@@ -1114,24 +1257,35 @@ def _harvest_hf_dataset(lang_label: str, max_samples: int = 500) -> str:
 
 # ── Main ask handler ──────────────────────────────────────────────────────────
 
-def handle_ask(audio_path, language_label, convo_mode: bool = False):
     if audio_path is None:
-        return "⚠️ No audio — press Record or upload a file.", "", "", None
 
     language_code = SUPPORTED_LANGUAGES.get(language_label, "bam")
     status = _ensure_whisper_loaded()
 
     if _whisper_model is None:
-        return f"⏳ Model loading ({status}). Wait a moment and try again.", "", "", None
 
     try:
         if convo_mode:
-            transcript, eng, response_text, audio_out = _convo_pipeline(
         else:
             transcript, eng, response_text, audio_out = _run_pipeline(audio_path, language_code)
-
     except Exception as e:
-        return f"❌ {e}", "", "", None
 
 
 # ── Gradio UI ─────────────────────────────────────────────────────────────────
@@ -1190,6 +1344,9 @@ def build_ui() -> gr.Blocks:
 
         gr.Markdown("---")
 
         with gr.Row():
             with gr.Column(scale=1):
                 language_dd = gr.Dropdown(

@@ -1202,19 +1359,21 @@ def build_ui() -> gr.Blocks:
                     type="filepath",
                     label="Record or upload audio",
                 )
-
 
             with gr.Column(scale=1):
                 transcript_box = gr.Textbox(
-                    label="Whisper heard
                     lines=2,
                     placeholder="Your words will appear here…",
                     interactive=False,
                 )
                 translation_box = gr.Textbox(
-                    label="English translation
                     lines=2,
-                    placeholder="
                     interactive=False,
                 )
                 response_box = gr.Textbox(

@@ -1234,21 +1393,46 @@ def build_ui() -> gr.Blocks:
                     size="sm",
                 )
 
-
-
 
-        # Manual button click
         ask_btn.click(
-            fn=
             inputs=_ask_inputs,
             outputs=_ask_outputs,
         )
-        # Auto-submit when mic
         audio_input.stop_recording(
-            fn=lambda ap, ll, cm:
             inputs=_ask_inputs,
             outputs=_ask_outputs,
         )
 
         # ── Tab 2: Feedback & Correction ─────────────────
         with gr.TabItem("📝 Feedback & Correction", id="tab_feedback"):
 
 # ── Conversation-mode helpers ─────────────────────────────────────────────────
 
+# Vocabulary context cache — loaded from Hub, refreshed after each LEARNED save
+_vocab_context_cache: str = ""
+_vocab_lock = threading.Lock()
 
 
+def _refresh_vocab_context() -> None:
+    """Load vocabulary.jsonl from Hub and rebuild the LLM context string."""
+    global _vocab_context_cache
+    if not HF_TOKEN or not FEEDBACK_REPO_ID:
+        return
+    try:
+        from huggingface_hub import hf_hub_download
+        local = hf_hub_download(
+            repo_id=FEEDBACK_REPO_ID, filename="vocabulary.jsonl",
+            repo_type="dataset", token=HF_TOKEN,
+        )
+        entries: list[dict] = []
+        with open(local, encoding="utf-8") as f:
+            for line in f:
+                line = line.strip()
+                if line:
+                    try:
+                        entries.append(json.loads(line))
+                    except Exception:
+                        pass
+        # Most recent first, cap at 200 entries to stay within token budget
+        entries = entries[-200:][::-1]
+        lines = []
+        for e in entries:
+            word = e.get("word", "").strip()
+            tr = e.get("translation", "").strip()
+            lang = e.get("language", "")
+            if word:
+                lines.append(f"{word} = {tr} [{lang}]" if tr else f"{word} [{lang}]")
+        with _vocab_lock:
+            _vocab_context_cache = "\n".join(lines)
+    except Exception:
+        pass  # Non-critical — LLM continues without vocab context
+
+
+def _get_vocab_context() -> str:
+    with _vocab_lock:
+        return _vocab_context_cache
+
+
+def _save_learned_async(word: str, meaning: str, lang: str) -> None:
+    """Persist a word/phrase learned mid-conversation to vocabulary.jsonl on Hub."""
+    def _run():
+        if not word.strip():
+            return
+        entry = {"word": word.strip(), "translation": meaning.strip(), "language": lang,
+                 "source": "conversation", "timestamp": datetime.now(timezone.utc).isoformat()}
+        _upload_jsonl_later("vocabulary.jsonl", [entry])
+        _refresh_vocab_context()  # update cache so next turn knows this word
+    threading.Thread(target=_run, daemon=True).start()
+
+
+def _upload_jsonl_later(repo_path: str, entries: list[dict]) -> None:
+    """Append entries to a Hub JSONL file — called from background threads."""
+    if not HF_TOKEN or not FEEDBACK_REPO_ID or _hf_api is None:
+        return
+    from huggingface_hub import hf_hub_download
+    for attempt in range(2):
+        try:
+            local = hf_hub_download(
+                repo_id=FEEDBACK_REPO_ID, filename=repo_path,
+                repo_type="dataset", token=HF_TOKEN,
+            )
+            with open(local, encoding="utf-8") as f:
+                existing = f.read()
+        except Exception:
+            existing = ""
+        updated = existing + "".join(json.dumps(e, ensure_ascii=False) + "\n" for e in entries)
+        try:
+            _hf_api.upload_file(
+                path_or_fileobj=io.BytesIO(updated.encode("utf-8")),
+                path_in_repo=repo_path,
+                repo_id=FEEDBACK_REPO_ID,
+                repo_type="dataset",
+            )
+            return
+        except Exception:
+            if attempt == 1:
+                pass
+
+
+import re as _re
+_LEARNED_RE = _re.compile(
+    r'\[LEARNED:\s*word=["\'](.+?)["\']\s+meaning=["\'](.+?)["\']\s*\]',
+    _re.IGNORECASE,
+)
+
+
+def _parse_and_strip_learned(text: str, lang: str) -> tuple[str, list[tuple[str, str]]]:
+    """
+    Extract [LEARNED: word="X" meaning="Y"] tags from LLM output.
+    Returns (cleaned_text, list_of_(word, meaning) pairs).
+    Saves each pair to Hub asynchronously.
+    """
+    learned = []
+    for m in _LEARNED_RE.finditer(text):
+        word, meaning = m.group(1).strip(), m.group(2).strip()
+        learned.append((word, meaning))
+        _save_learned_async(word, meaning, lang)
+    cleaned = _LEARNED_RE.sub("", text).strip()
+    return cleaned, learned
+
+
+# System prompt — includes vocabulary context + conversation rules
+_CONVO_SYSTEM_TEMPLATE = """\
+You are a helpful voice assistant for Bambara and Fula speakers. \
+You are talking, not writing — keep every response to 1–3 short sentences.
+
+YOUR KNOWLEDGE BASE (words and phrases you have learned from users):
+{vocab}
+
+RULES you must always follow:
+1. Reply in whatever language the user speaks (Bambara, Fula, French, or English).
+2. When speaking Bambara, use phonetic spelling: 'u' not 'ou', 'j' not 'dj', 'c' not 'ch'.
+3. Keep responses SHORT — this is voice, not text.
+4. If you do not understand something, ask ONE specific follow-up question \
+   (e.g. "Mun ye o fileli ye?" = "What does that mean?").
+5. If the user teaches you a word or phrase (says "X means Y" or "X se dit Y in Bambara"), \
+   confirm warmly then add exactly: [LEARNED: word="X" meaning="Y"]
+6. Remember the full conversation — refer to earlier messages naturally \
+   (e.g. "As you said earlier…", "I ka kuma fɔlen don…").
+7. Never invent words you do not know. Honest uncertainty is always better than wrong answers."""
+
+
+def _build_messages(user_text: str, history: list, language_code: str) -> list[dict]:
+    """Build the full message list: system (with vocab) + history + new user turn."""
+    vocab = _get_vocab_context()
+    system_content = _CONVO_SYSTEM_TEMPLATE.format(
+        vocab=vocab if vocab else "(no vocabulary recorded yet — you can teach me words!)"
+    )
+    messages: list[dict] = [{"role": "system", "content": system_content}]
+    # Inject conversation history (last 20 turns max)
+    for u, a in history[-20:]:
+        messages.append({"role": "user", "content": u})
+        messages.append({"role": "assistant", "content": a})
+    messages.append({"role": "user", "content": user_text})
+    return messages
 
 
 def set_voice_reference(audio_file) -> str:
 
 
 @_gpu
+def _convo_pipeline(audio_path: str, language_code: str, history: list):
     """
+    Full S2S conversation pipeline with memory:
+    1. ASR   — fine-tuned Whisper (or base) → transcript
+    2. Norm  — bam_normalize() on Bambara text
+    3. Brain — LLM with full conversation history + vocabulary context
+    4. Learn — parse [LEARNED:] tags, persist to Hub async
+    5. Mouth — F5-TTS (voice ref) or MMS-TTS fallback → audio
+
+    Returns: (transcript, eng, response_text, audio_out, new_history)
     """
     import torch
+    import logging
+    log = logging.getLogger(__name__)
 
     device = "cuda" if torch.cuda.is_available() else "cpu"
 
     if _whisper_model is None:
+        return "⏳ Model still loading…", "", "", None, history
 
     import librosa
     audio_np, _ = librosa.load(audio_path, sr=16000, mono=True)
     if device == "cuda":
         torch.cuda.empty_cache()
 
+    # Phonetic normalisation for Bambara
     normalised = bam_normalize(transcript) if language_code == "bam" else transcript
 
+    # ── LLM brain — full context: vocab + history + new turn ─────────────────
+    response_text = ""
     try:
         from huggingface_hub import InferenceClient
+        client = InferenceClient(token=HF_TOKEN)
+        messages = _build_messages(normalised, history, language_code)
         completion = client.chat_completion(
             model=LLM_MODEL_ID,
+            messages=messages,
+            max_tokens=300,
+            temperature=0.65,
         )
         response_text = completion.choices[0].message.content.strip()
     except Exception as llm_err:
+        log.warning("LLM failed: %s", llm_err)
+        # Graceful degradation: tell user LLM is unavailable, ask them to try again
+        response_text = (
+            "Hakɛ to, n bɛ sɔrɔ cogo dɔ la."
+            if language_code == "bam"
+            else "Sorry, I could not reach the language model. Please try again."
+        )
+
+    # ── Parse and strip [LEARNED:] tags — save async to Hub ──────────────────
+    response_text, learned_pairs = _parse_and_strip_learned(response_text, language_code)
+    if learned_pairs:
+        log.info("Learned %d new item(s): %s", len(learned_pairs), learned_pairs)
+
+    # ── Update conversation history ───────────────────────────────────────────
+    new_history = list(history) + [(normalised, response_text)]
+    if len(new_history) > 20:
+        new_history = new_history[-20:]
 
+    # ── TTS mouth — F5-TTS (voice ref) or MMS-TTS fallback ───────────────────
     audio_out = None
     if _voice_ref_path and Path(_voice_ref_path).exists():
         try:
             wav_np, sr = result
             audio_out = (sr, wav_np)
         except Exception as tts_err:
+            log.warning("F5-TTS failed, using MMS-TTS: %s", tts_err)
 
     if audio_out is None:
         wav_np, sr = _tts.synthesize(response_text, language_code, device=device)
         audio_out = (sr, wav_np)
 
+    return transcript, "", response_text, audio_out, new_history
 
 
 # ── HF Hub feedback persistence ───────────────────────────────────────────────
                 repo_id=FEEDBACK_REPO_ID,
                 repo_type="dataset",
             )
+            threading.Thread(target=_refresh_vocab_context, daemon=True).start()
             break
         except Exception:
             if attempt == 1:
     except Exception:
         pass  # No additions saved yet — fine
 
+# Load phrase additions + vocabulary context in background at startup
 threading.Thread(target=_load_phrase_additions_from_hub, daemon=True).start()
+threading.Thread(target=_refresh_vocab_context, daemon=True).start()
 
 
 def _save_audio_for_training(lang_label: str, audio_path: str | None, transcript: str, source_note: str) -> str:
     total, err = _upload_jsonl("vocabulary.jsonl", entries)
     if err:
         return f"❌ Upload failed: {err}"
+    threading.Thread(target=_refresh_vocab_context, daemon=True).start()
     return (
         f"✅ Wikipedia harvest complete!\n"
         f"   Language : {lang_label}\n"
 
 # ── Main ask handler ──────────────────────────────────────────────────────────
 
+def handle_ask(audio_path, language_label, convo_mode: bool = False, history: list | None = None):
+    """
+    Main dispatcher. Always returns 5 values:
+        (transcript, eng_translation, response_text, audio_out, new_history)
+    new_history is the updated gr.State list of (user, asst) tuples.
+    In normal (sensor) mode, history is passed through unchanged.
+    """
+    history = history or []
+
     if audio_path is None:
+        return "⚠️ No audio — press Record or upload a file.", "", "", None, history
 
     language_code = SUPPORTED_LANGUAGES.get(language_label, "bam")
     status = _ensure_whisper_loaded()
 
     if _whisper_model is None:
+        return f"⏳ Model loading ({status}). Wait a moment and try again.", "", "", None, history
 
     try:
         if convo_mode:
+            transcript, eng, response_text, audio_out, new_history = _convo_pipeline(
+                audio_path, language_code, history
+            )
         else:
             transcript, eng, response_text, audio_out = _run_pipeline(audio_path, language_code)
+            new_history = history  # sensor mode doesn't modify history
+        return transcript, eng, response_text, audio_out, new_history
     except Exception as e:
+        return f"❌ {e}", "", "", None, history
 
 
 # ── Gradio UI ─────────────────────────────────────────────────────────────────
 
         gr.Markdown("---")
 
+        # Per-session conversation history (not shared between users)
+        conv_history = gr.State(value=[])
+
         with gr.Row():
             with gr.Column(scale=1):
                 language_dd = gr.Dropdown(
                     type="filepath",
                     label="Record or upload audio",
                 )
+                with gr.Row():
+                    ask_btn = gr.Button("▶ Ask / Ɲinɛ", variant="primary")
+                    clear_btn = gr.Button("🗑 Clear", variant="secondary", size="sm")
 
             with gr.Column(scale=1):
                 transcript_box = gr.Textbox(
+                    label="Whisper heard",
                     lines=2,
                     placeholder="Your words will appear here…",
                     interactive=False,
                 )
                 translation_box = gr.Textbox(
+                    label="English translation",
                     lines=2,
+                    placeholder="(shown in sensor mode only)",
                     interactive=False,
                 )
                 response_box = gr.Textbox(
                     size="sm",
                 )
 
+        # Conversation history display (Conversation Mode only)
+        chatbot = gr.Chatbot(
+            label="Conversation history",
+            height=300,
+            visible=False,
+            type="tuples",
+        )
+        convo_mode_toggle.change(
+            fn=lambda on: gr.update(visible=on),
+            inputs=[convo_mode_toggle],
+            outputs=[chatbot],
+        )
+
+        _ask_inputs = [audio_input, language_dd, convo_mode_toggle, conv_history]
+        _ask_outputs = [transcript_box, translation_box, response_box,
+                        audio_output, conv_history, chatbot]
+
+        def _ask_and_update(ap, ll, cm, hist):
+            t, e, r, a, new_hist = handle_ask(ap, ll, cm, hist)
+            # Convert history tuples to list-of-lists for gr.Chatbot
+            chat_msgs = [[u, v] for u, v in new_hist]
+            return t, e, r, a, new_hist, chat_msgs
 
         ask_btn.click(
+            fn=_ask_and_update,
             inputs=_ask_inputs,
             outputs=_ask_outputs,
         )
+        # Auto-submit when mic stops (Conversation Mode)
         audio_input.stop_recording(
+            fn=lambda ap, ll, cm, h: _ask_and_update(ap, ll, cm, h) if cm
+            else (None, None, None, None, h, [[u, v] for u, v in h]),
             inputs=_ask_inputs,
             outputs=_ask_outputs,
         )
+        # Clear conversation
+        clear_btn.click(
+            fn=lambda: ([], []),
+            outputs=[conv_history, chatbot],
+        )
 
         # ── Tab 2: Feedback & Correction ─────────────────
         with gr.TabItem("📝 Feedback & Correction", id="tab_feedback"):