Ground-zero Stages 1–3: dialect anchors + phrasebook short-circuit + Aya-Expanse
Stage 1 — dialect-pinned LLM client (src/llm/minimal_client.py)
Plain-text replacement for GemmaClient's JSON/teacher flow. System prompt
pins Bambara-Mali and Pular-Fuuta-Jallon explicitly, names forbidden
neighbouring languages (Wolof, Hausa, Pulaar-Senegal, Fulfulde-Nigeria,
Jula-CI), and injects a 30-pair bilingual gold list as few-shot anchoring
from configs/dialect_anchors/{bambara_mali,pular_guinea}.json.
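As a rough illustration of the Stage 1 idea, the sketch below assembles a dialect-pinned system prompt from an anchors dict shaped like the `configs/dialect_anchors/*.json` files in this commit (`dialect`, `pairs` with `source`/`target`). The function name `build_system_prompt`, the prompt wording, and the `max_pairs` parameter are assumptions for illustration; the real prompt lives in `src/llm/minimal_client.py`, which is not shown here.

```python
from typing import List

# Languages the commit message says the model must not drift into.
FORBIDDEN_LANGUAGES: List[str] = [
    "Wolof",
    "Hausa",
    "Pulaar of Senegal",
    "Fulfulde of Nigeria",
    "Jula of Côte d'Ivoire",
]


def build_system_prompt(anchors: dict, max_pairs: int = 30) -> str:
    """Hypothetical helper: pin the dialect, list forbidden languages,
    then append the gold pairs as few-shot anchors."""
    lines = [
        f"Reply only in {anchors['dialect']}.",
        "Never drift into: " + ", ".join(FORBIDDEN_LANGUAGES) + ".",
        "Reference translations (follow their spelling and vocabulary):",
    ]
    for pair in anchors["pairs"][:max_pairs]:
        lines.append(f"- {pair['source']} => {pair['target']}")
    return "\n".join(lines)


# Two pairs copied from bambara_mali.json; the real file carries 30.
sample = {
    "dialect": "Bambara as spoken in Bamako, Mali",
    "iso": "bam",
    "pairs": [
        {"source": "Good morning / Bonjour", "target": "I ni sɔgɔma"},
        {"source": "Goodbye / Au revoir", "target": "K'an bɛn"},
    ],
}
prompt = build_system_prompt(sample)
print(prompt)
```

Keeping the anchors in JSON rather than in the prompt string means the gold list can be corrected by a native speaker without touching client code.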
Stage 2 — curated phrasebook short-circuit (src/llm/phrasebook.py)
100 Bambara + 110 Pular English-keyed pairs across greetings, family,
food, farming, health, shopping, travel, clarity, time, parting. Fuzzy
matched (threshold 0.88) before every LLM call; a hit returns the gold
translation directly, with no risk of dialect drift and no LLM round trip.
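The short-circuit could look roughly like the sketch below. It is a minimal version that assumes `difflib.SequenceMatcher` as the similarity measure and treats `/` in a `source` as separating alternative keys; neither detail is confirmed by this commit, and the real `lookup` in `src/llm/phrasebook.py` takes `(text, output_lang)` rather than a list of pairs. The returned keys (`match`, `score`, `target`, `category`) mirror the ones `app_minimal.py` logs.

```python
import re
from difflib import SequenceMatcher
from typing import List, Optional

# Three entries copied from bambara_phrasebook.json for illustration.
PHRASEBOOK: List[dict] = [
    {"category": "Greetings", "source": "Hello / Thank you", "target": "I ni ce"},
    {"category": "Greetings", "source": "Good morning", "target": "I ni sɔgɔma"},
    {"category": "Food/Water", "source": "Give me water", "target": "Ji di n ma"},
]


def normalise(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s']", " ", text.lower())
    return " ".join(text.split())


def lookup(text: str, pairs: List[dict], threshold: float = 0.88) -> Optional[dict]:
    """Return the best-scoring entry if it clears the threshold, else None."""
    query = normalise(text)
    if not query:
        return None
    best: Optional[dict] = None
    for pair in pairs:
        # Assumption: "Hello / Thank you" holds two alternative English keys.
        for key in pair["source"].split("/"):
            candidate = normalise(key)
            score = SequenceMatcher(None, query, candidate).ratio()
            if best is None or score > best["score"]:
                best = {
                    "match": candidate,
                    "score": score,
                    "target": pair["target"],
                    "category": pair["category"],
                }
    if best is not None and best["score"] >= threshold:
        return best
    return None


hit = lookup("Good morning!", PHRASEBOOK)
miss = lookup("What is the soil moisture today?", PHRASEBOOK)
print(hit, miss)
```

A high threshold such as 0.88 tolerates small typos or transcription slips while letting longer, free-form questions fall through to the LLM.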
Stage 3 — default LLM swapped to CohereLabs/aya-expanse-32b
23-language multilingual base with stronger West African coverage than
Qwen 2.5-7B. Overridable via LLM_MODEL_ID.
Space wiring
- README frontmatter app_file: app.py → app_minimal.py (Space now serves
the minimal baseline; app.py untouched for the full production stack).
- .env auto-loaded via python-dotenv so HF_TOKEN is picked up on launch.
- README updated: minimal-baseline section, Stack + env-var tables,
Run-locally block.
- README.md +51 -6
- app_minimal.py +54 -35
- configs/dialect_anchors/bambara_mali.json +37 -0
- configs/dialect_anchors/bambara_phrasebook.json +107 -0
- configs/dialect_anchors/pular_guinea.json +37 -0
- configs/dialect_anchors/pular_phrasebook.json +117 -0
- src/llm/minimal_client.py +179 -0
- src/llm/phrasebook.py +123 -0
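Since the four JSON files above are hand-curated, a small sanity check before deploying may be useful. The sketch below is a hypothetical validator, not part of this commit: it assumes the top-level keys seen in the files (`dialect`, `iso`, `notes`, `pairs`) and flags empty fields, duplicate source keys, and an unexpected pair count.

```python
from collections import Counter
from typing import List, Optional

REQUIRED_TOP_LEVEL = ("dialect", "iso", "notes", "pairs")


def validate_anchor_file(data: dict, expected_pairs: Optional[int] = None) -> List[str]:
    """Hypothetical check: return a list of human-readable problems (empty if none)."""
    problems = [f"missing key: {key}" for key in REQUIRED_TOP_LEVEL if key not in data]
    pairs = data.get("pairs", [])
    for index, pair in enumerate(pairs):
        for field in ("source", "target"):
            if not str(pair.get(field, "")).strip():
                problems.append(f"pair {index}: empty or missing {field}")
    # Duplicate English keys would make the short-circuit ambiguous.
    counts = Counter(str(p.get("source", "")).strip().lower() for p in pairs)
    problems += [f"duplicate source: {src!r}" for src, n in counts.items() if src and n > 1]
    if expected_pairs is not None and len(pairs) != expected_pairs:
        problems.append(f"expected {expected_pairs} pairs, found {len(pairs)}")
    return problems


good = {
    "dialect": "Bambara as spoken in Bamako, Mali",
    "iso": "bam",
    "notes": "sample",
    "pairs": [
        {"source": "Yes", "target": "Awɔ"},
        {"source": "No", "target": "Ayi"},
    ],
}
bad = {
    "dialect": "Bambara as spoken in Bamako, Mali",
    "pairs": [
        {"source": "Yes", "target": ""},
        {"source": "yes", "target": "Awɔ"},
    ],
}
good_report = validate_anchor_file(good, expected_pairs=2)
bad_report = validate_anchor_file(bad, expected_pairs=3)
print(good_report, bad_report)
```

Duplicate targets are deliberately not flagged, because one phrase can legitimately translate several English keys (for example "Hello" and "Thank you" both map to "I ni ce").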
|
@@ -31,7 +31,46 @@ Two intertwined jobs:
|
|
| 31 |
1. **Memory loop** — users *teach* the assistant new words; it persists them to a HuggingFace dataset and uses them as the source of truth in future answers.
|
| 32 |
2. **Agricultural IoT voice interface** — Sahelian farmers query soil, weather, irrigation, and pest data in their own language, short answers, ≤ 6 words per sentence for clean TTS.
|
| 33 |
|
| 34 |
-
The core stack is explicitly **100% non-Meta** (Whisper /
|
| 35 |
|
| 36 |
---
|
| 37 |
|
|
@@ -54,7 +93,9 @@ See `docs/roadmap_2026-04.md` for the full plan and `docs/baseline_rebuild.md` f
|
|
| 54 |
| Layer | Tool |
|
| 55 |
|-------|------|
|
| 56 |
| STT | `openai/whisper-large-v3-turbo` + PEFT LoRA hot-swap (~50 MB adapter per language, ~50 ms switch) |
|
| 57 |
-
| LLM | `
|
| 58 |
| TTS (baseline) | `facebook/mms-tts-bam`, `facebook/mms-tts-ful` |
|
| 59 |
| TTS (Bambara) | `ynnov/ekodi-bambara-tts-female` (Waxal VITS) |
|
| 60 |
| TTS (Fula) | placeholder → `ous-sow/fula-tts` when published |
|
|
@@ -70,7 +111,8 @@ See `docs/roadmap_2026-04.md` for the full plan and `docs/baseline_rebuild.md` f
|
|
| 70 |
|
| 71 |
| File | Purpose | Lifecycle |
|
| 72 |
|------|---------|-----------|
|
| 73 |
-
| `
|
|
|
|
| 74 |
| `app_lab.py` | **Experimental Gradio UI** for prototyping (e.g. `CuriosityEngine`) before folding into `app.py`. | `python app_lab.py` |
|
| 75 |
| `src/api/app.py` | **FastAPI service** — loads Whisper once, registers `bam`/`ful` adapters via `AdapterManager`, preloads `bam`, attaches `Transcriber` + `SensorBridge` to `app.state`. | `python scripts/run_server.py` |
|
| 76 |
|
|
@@ -163,7 +205,7 @@ All variables have sensible defaults, so you can boot the Space without any of t
|
|
| 163 |
| `FEEDBACK_REPO_ID` | `ous-sow/sahel-agri-feedback` | Memory-loop target dataset. |
|
| 164 |
| `ADAPTER_REPO_ID` | `ous-sow/sahel-agri-adapters` | Published LoRA adapters. |
|
| 165 |
| `WHISPER_MODEL_ID` | `openai/whisper-large-v3-turbo` | STT base model. |
|
| 166 |
-
| `LLM_MODEL_ID` | `
|
| 167 |
| `LOG_LEVEL` | `INFO` | Standard Python logging level. |
|
| 168 |
| `DEVICE` | `cuda` (FastAPI) | Torch device for inference. |
|
| 169 |
|
|
@@ -193,8 +235,11 @@ All variables have sensible defaults, so you can boot the Space without any of t
|
|
| 193 |
## Run locally
|
| 194 |
|
| 195 |
```bash
|
| 196 |
-
#
|
| 197 |
pip install -r requirements.txt
|
| 198 |
python app.py
|
| 199 |
|
| 200 |
# FastAPI service
|
|
@@ -253,7 +298,7 @@ At minimum:
|
|
| 253 |
|-----|-------|
|
| 254 |
| `HF_TOKEN` | write-scope token |
|
| 255 |
| `FEEDBACK_REPO_ID` | `ous-sow/sahel-agri-feedback` |
|
| 256 |
-
| `LLM_MODEL_ID` | `
|
| 257 |
|
| 258 |
---
|
| 259 |
|
|
|
|
| 31 |
1. **Memory loop** — users *teach* the assistant new words; it persists them to a HuggingFace dataset and uses them as the source of truth in future answers.
|
| 32 |
2. **Agricultural IoT voice interface** — Sahelian farmers query soil, weather, irrigation, and pest data in their own language, short answers, ≤ 6 words per sentence for clean TTS.
|
| 33 |
|
| 34 |
+
The core stack is explicitly **100% non-Meta** (Whisper / Aya-Expanse / F5-TTS / VITS); MMS-TTS is only used as a baseline fallback.
|
| 35 |
+
|
| 36 |
+
---
|
| 37 |
+
|
| 38 |
+
## What this Space currently runs — the `ground-zero` minimal baseline
|
| 39 |
+
|
| 40 |
+
The deployed Space (`app_file: app_minimal.py`) is the **Month 1–3 rebuild**
|
| 41 |
+
baseline — a stripped-down Whisper → LLM → MMS-TTS pipeline used for field
|
| 42 |
+
testing and to build a real-user eval set. No LoRA adapters, no memory loop,
|
| 43 |
+
no speaker ID, no voice cloning, no IoT, no phrase matcher. Everything in
|
| 44 |
+
`app.py` still exists for the full production stack; it is just not what the
|
| 45 |
+
Space serves today.
|
| 46 |
+
|
| 47 |
+
Three stacked changes land dialect fidelity without any training:
|
| 48 |
+
|
| 49 |
+
1. **Stage 1 — dialect-pinned system prompt** (`src/llm/minimal_client.py`).
|
| 50 |
+
Replaces the `GemmaClient` JSON/teacher flow with a plain-text client whose
|
| 51 |
+
system prompt pins the target dialect explicitly — *Bambara as spoken in
|
| 52 |
+
Bamako, Mali* and *Pular of Fuuta Jallon, as spoken in Guinea* — names the
|
| 53 |
+
languages the model must **not** drift into (Wolof, Hausa, Pulaar of
|
| 54 |
+
Senegal, Fulfulde of Nigeria, Jula of Côte d'Ivoire), and injects a 30-pair
|
| 55 |
+
bilingual gold list as few-shot anchoring
|
| 56 |
+
(`configs/dialect_anchors/{bambara_mali,pular_guinea}.json`).
|
| 57 |
+
|
| 58 |
+
2. **Stage 2 — curated phrasebook short-circuit** (`src/llm/phrasebook.py`).
|
| 59 |
+
Before calling the LLM, the user's input is normalised and fuzzy-matched
|
| 60 |
+
(threshold 0.88) against a curated English-keyed phrasebook
|
| 61 |
+
(`configs/dialect_anchors/{bambara,pular}_phrasebook.json` — 100 Bambara /
|
| 62 |
+
110 Pular entries across greetings, family, food, farming, health,
|
| 63 |
+
shopping, travel, clarity, time, parting). A hit returns the gold
|
| 64 |
+
translation directly — zero LLM risk, zero latency.
|
| 65 |
+
|
| 66 |
+
3. **Stage 3 — better multilingual base LLM.**
|
| 67 |
+
Default `LLM_MODEL_ID` is now **`CohereLabs/aya-expanse-32b`**, a 23-language
|
| 68 |
+
multilingual model with much stronger West African coverage than Qwen
|
| 69 |
+
2.5-7B. Can be overridden via the `LLM_MODEL_ID` env var (e.g. to
|
| 70 |
+
`Qwen/Qwen2.5-72B-Instruct`) if Cohere's inference provider is not
|
| 71 |
+
available on your HF account.
|
| 72 |
+
|
| 73 |
+
See `docs/baseline_rebuild.md` for the broader minimal-track plan.
|
| 74 |
|
| 75 |
---
|
| 76 |
|
|
|
|
| 93 |
| Layer | Tool |
|
| 94 |
|-------|------|
|
| 95 |
| STT | `openai/whisper-large-v3-turbo` + PEFT LoRA hot-swap (~50 MB adapter per language, ~50 ms switch) |
|
| 96 |
+
| LLM | `CohereLabs/aya-expanse-32b` (minimal-baseline default, strong African-language coverage) via HF Serverless InferenceClient — overridable to `Qwen/Qwen2.5-72B-Instruct`, `Qwen2.5-7B-Instruct`, Mistral, Zephyr |
|
| 97 |
+
| Dialect anchoring (minimal) | `src/llm/minimal_client.py` — pinned Bambara-Mali / Pular-Guinea system prompt with 30-pair bilingual few-shot + forbidden-drift guardrails |
|
| 98 |
+
| Phrasebook short-circuit (minimal) | `src/llm/phrasebook.py` — 100 Bambara + 110 Pular curated gold pairs, fuzzy-matched (0.88 threshold) before any LLM call |
|
| 99 |
| TTS (baseline) | `facebook/mms-tts-bam`, `facebook/mms-tts-ful` |
|
| 100 |
| TTS (Bambara) | `ynnov/ekodi-bambara-tts-female` (Waxal VITS) |
|
| 101 |
| TTS (Fula) | placeholder → `ous-sow/fula-tts` when published |
|
|
|
|
| 111 |
|
| 112 |
| File | Purpose | Lifecycle |
|
| 113 |
|------|---------|-----------|
|
| 114 |
+
| `app_minimal.py` | **Minimal baseline Gradio UI** — what the HF Space currently serves. Whisper → LLM → MMS-TTS with dialect-pinned prompts + curated phrasebook short-circuit. Tabs: Voice / Text. | `python app_minimal.py` |
|
| 115 |
+
| `app.py` | **Full production Gradio UI** (not currently served on the Space). Single-file (~99 KB) by design. Tabs: Conversation / Teaching / Knowledge Base / Self-Teaching. | `python app.py` |
|
| 116 |
| `app_lab.py` | **Experimental Gradio UI** for prototyping (e.g. `CuriosityEngine`) before folding into `app.py`. | `python app_lab.py` |
|
| 117 |
| `src/api/app.py` | **FastAPI service** — loads Whisper once, registers `bam`/`ful` adapters via `AdapterManager`, preloads `bam`, attaches `Transcriber` + `SensorBridge` to `app.state`. | `python scripts/run_server.py` |
|
| 118 |
|
|
|
|
| 205 |
| `FEEDBACK_REPO_ID` | `ous-sow/sahel-agri-feedback` | Memory-loop target dataset. |
|
| 206 |
| `ADAPTER_REPO_ID` | `ous-sow/sahel-agri-adapters` | Published LoRA adapters. |
|
| 207 |
| `WHISPER_MODEL_ID` | `openai/whisper-large-v3-turbo` | STT base model. |
|
| 208 |
+
| `LLM_MODEL_ID` | `CohereLabs/aya-expanse-32b` | LLM via HF Serverless. Override to any HF Serverless-supported model. |
|
| 209 |
| `LOG_LEVEL` | `INFO` | Standard Python logging level. |
|
| 210 |
| `DEVICE` | `cuda` (FastAPI) | Torch device for inference. |
|
| 211 |
|
|
|
|
| 235 |
## Run locally
|
| 236 |
|
| 237 |
```bash
|
| 238 |
+
# Minimal baseline (what the Space runs)
|
| 239 |
pip install -r requirements.txt
|
| 240 |
+
python app_minimal.py
|
| 241 |
+
|
| 242 |
+
# Full production UI (not currently on the Space)
|
| 243 |
python app.py
|
| 244 |
|
| 245 |
# FastAPI service
|
|
|
|
| 298 |
|-----|-------|
|
| 299 |
| `HF_TOKEN` | write-scope token |
|
| 300 |
| `FEEDBACK_REPO_ID` | `ous-sow/sahel-agri-feedback` |
|
| 301 |
+
| `LLM_MODEL_ID` | `CohereLabs/aya-expanse-32b` (or any HF Serverless-supported model) |
|
| 302 |
|
| 303 |
---
|
| 304 |
|
|
@@ -12,7 +12,8 @@ Run locally:
|
|
| 12 |
Environment variables (all optional except HF_TOKEN, which is needed for the
|
| 13 |
Qwen HF Serverless call):
|
| 14 |
HF_TOKEN — HuggingFace token with read access
|
| 15 |
-
LLM_MODEL_ID — default "
|
|
|
|
| 16 |
DEVICE — "cuda" or "cpu" (auto if unset)
|
| 17 |
LOG_LEVEL — default "INFO"
|
| 18 |
"""
|
|
@@ -24,11 +25,20 @@ from typing import Optional, Tuple
|
|
| 24 |
|
| 25 |
import numpy as np
|
| 26 |
|
| 27 |
# Local imports — the four modules the baseline-rebuild plan authorizes.
|
| 28 |
# Everything else in src/ is intentionally unused here.
|
| 29 |
from src.data.bam_normalize import normalize as bam_normalize
|
| 30 |
from src.engine.whisper_base import WhisperBackbone
|
| 31 |
-
from src.llm.
|
|
|
|
| 32 |
from src.tts.mms_tts import MMSTTSEngine
|
| 33 |
|
| 34 |
logging.basicConfig(
|
|
@@ -40,7 +50,7 @@ logger = logging.getLogger(__name__)
|
|
| 40 |
|
| 41 |
# ── Environment ──────────────────────────────────────────────────────────────
|
| 42 |
HF_TOKEN = os.environ.get("HF_TOKEN")
|
| 43 |
-
LLM_MODEL_ID = os.environ.get("LLM_MODEL_ID", "
|
| 44 |
_REQUESTED_DEVICE = os.environ.get("DEVICE") # optional override
|
| 45 |
|
| 46 |
LANG_CHOICES = [("Bambara", "bam"), ("Fula", "ful"), ("French", "fr"), ("English", "en")]
|
|
@@ -56,20 +66,13 @@ LANG_TO_WHISPER_HINT = {
|
|
| 56 |
}
|
| 57 |
|
| 58 |
|
| 59 |
-
|
| 60 |
-
|
| 61 |
-
|
| 62 |
-
The LLM's system prompt (in GemmaClient) does not know which language we
|
| 63 |
-
want the reply in — it picks based on vibes, which can drift (e.g. to
|
| 64 |
-
Wolof). We keep GemmaClient untouched and steer from the user turn.
|
| 65 |
-
"""
|
| 66 |
-
name = LANG_NAMES.get(output_lang, "English")
|
| 67 |
-
return f"{user_text}\n\n(Please reply in {name} only.)"
|
| 68 |
|
| 69 |
|
| 70 |
# ── Service singletons (lazy-loaded) ────────────────────────────────────────
|
| 71 |
_backbone: Optional[WhisperBackbone] = None
|
| 72 |
-
_llm: Optional[
|
| 73 |
_tts: Optional[MMSTTSEngine] = None
|
| 74 |
|
| 75 |
|
|
@@ -92,11 +95,11 @@ def get_backbone() -> WhisperBackbone:
|
|
| 92 |
return _backbone
|
| 93 |
|
| 94 |
|
| 95 |
-
def get_llm() ->
|
| 96 |
global _llm
|
| 97 |
if _llm is None:
|
| 98 |
-
_llm =
|
| 99 |
-
logger.info("LLM client configured: %s", LLM_MODEL_ID)
|
| 100 |
return _llm
|
| 101 |
|
| 102 |
|
|
@@ -193,17 +196,25 @@ def run_pipeline(
|
|
| 193 |
if not transcript:
|
| 194 |
return "", "(no speech detected)", None
|
| 195 |
|
| 196 |
-
|
| 197 |
-
|
| 198 |
-
|
| 199 |
-
|
| 200 |
-
|
| 201 |
)
|
| 202 |
-
|
| 203 |
-
|
| 204 |
-
|
| 205 |
|
| 206 |
-
reply_text
|
| 207 |
|
| 208 |
try:
|
| 209 |
wav, sr = get_tts().synthesize(
|
|
@@ -234,16 +245,22 @@ def run_text_pipeline(
|
|
| 234 |
if not text:
|
| 235 |
return "(no text entered)", None
|
| 236 |
|
| 237 |
-
|
| 238 |
-
|
| 239 |
-
|
| 240 |
-
|
| 241 |
)
|
| 242 |
-
|
| 243 |
-
|
| 244 |
-
|
| 245 |
|
| 246 |
-
reply_text
|
| 247 |
|
| 248 |
try:
|
| 249 |
wav, sr = get_tts().synthesize(
|
|
@@ -264,8 +281,10 @@ def build_ui():
|
|
| 264 |
with gr.Blocks(title="Sahel-Voice — Minimal Baseline") as demo:
|
| 265 |
gr.Markdown(
|
| 266 |
"# 🌾 Sahel-Voice — Minimal Baseline\n"
|
| 267 |
-
"Zero-shot Whisper →
|
| 268 |
-
"
|
| 269 |
)
|
| 270 |
|
| 271 |
# Shared across tabs. Split into two so input and output language
|
|
|
|
| 12 |
Environment variables (all optional except HF_TOKEN, which is needed for the
|
| 13 |
Qwen HF Serverless call):
|
| 14 |
HF_TOKEN — HuggingFace token with read access
|
| 15 |
+
LLM_MODEL_ID — default "CohereLabs/aya-expanse-32b"
|
| 16 |
+
(23-language multilingual, strong African-language coverage)
|
| 17 |
DEVICE — "cuda" or "cpu" (auto if unset)
|
| 18 |
LOG_LEVEL — default "INFO"
|
| 19 |
"""
|
|
|
|
| 25 |
|
| 26 |
import numpy as np
|
| 27 |
|
| 28 |
+
# Load .env (HF_TOKEN etc.) before reading os.environ below. Silent no-op if
|
| 29 |
+
# python-dotenv is not installed or no .env is present.
|
| 30 |
+
try:
|
| 31 |
+
from dotenv import load_dotenv
|
| 32 |
+
load_dotenv()
|
| 33 |
+
except ImportError:
|
| 34 |
+
pass
|
| 35 |
+
|
| 36 |
# Local imports — the four modules the baseline-rebuild plan authorizes.
|
| 37 |
# Everything else in src/ is intentionally unused here.
|
| 38 |
from src.data.bam_normalize import normalize as bam_normalize
|
| 39 |
from src.engine.whisper_base import WhisperBackbone
|
| 40 |
+
from src.llm.minimal_client import MinimalClient
|
| 41 |
+
from src.llm.phrasebook import lookup as phrasebook_lookup
|
| 42 |
from src.tts.mms_tts import MMSTTSEngine
|
| 43 |
|
| 44 |
logging.basicConfig(
|
|
|
|
| 50 |
|
| 51 |
# ── Environment ──────────────────────────────────────────────────────────────
|
| 52 |
HF_TOKEN = os.environ.get("HF_TOKEN")
|
| 53 |
+
LLM_MODEL_ID = os.environ.get("LLM_MODEL_ID", "CohereLabs/aya-expanse-32b")
|
| 54 |
_REQUESTED_DEVICE = os.environ.get("DEVICE") # optional override
|
| 55 |
|
| 56 |
LANG_CHOICES = [("Bambara", "bam"), ("Fula", "ful"), ("French", "fr"), ("English", "en")]
|
|
|
|
| 66 |
}
|
| 67 |
|
| 68 |
|
| 69 |
+
# Reply-language steering is handled inside MinimalClient via a dialect-anchored
|
| 70 |
+
# system prompt (see src/llm/minimal_client.py). No per-turn directive needed.
|
| 71 |
|
| 72 |
|
| 73 |
# ── Service singletons (lazy-loaded) ────────────────────────────────────────
|
| 74 |
_backbone: Optional[WhisperBackbone] = None
|
| 75 |
+
_llm: Optional[MinimalClient] = None
|
| 76 |
_tts: Optional[MMSTTSEngine] = None
|
| 77 |
|
| 78 |
|
|
|
|
| 95 |
return _backbone
|
| 96 |
|
| 97 |
|
| 98 |
+
def get_llm() -> MinimalClient:
|
| 99 |
global _llm
|
| 100 |
if _llm is None:
|
| 101 |
+
_llm = MinimalClient(model_id=LLM_MODEL_ID, hf_token=HF_TOKEN)
|
| 102 |
+
logger.info("Minimal LLM client configured: %s", LLM_MODEL_ID)
|
| 103 |
return _llm
|
| 104 |
|
| 105 |
|
|
|
|
| 196 |
if not transcript:
|
| 197 |
return "", "(no speech detected)", None
|
| 198 |
|
| 199 |
+
# ── Phrasebook short-circuit ──────────────────────────────────────────
|
| 200 |
+
# Canonical greetings/courtesies hit the curated gold phrasebook directly,
|
| 201 |
+
# skipping the LLM entirely. Only fires for bam/ful targets.
|
| 202 |
+
hit = phrasebook_lookup(transcript, output_lang)
|
| 203 |
+
if hit:
|
| 204 |
+
logger.info(
|
| 205 |
+
"Phrasebook hit (%s, score=%.2f): %r → %r [cat=%s]",
|
| 206 |
+
hit["match"], hit["score"], transcript, hit["target"], hit["category"],
|
| 207 |
)
|
| 208 |
+
reply_text = hit["target"]
|
| 209 |
+
else:
|
| 210 |
+
try:
|
| 211 |
+
# Dialect-anchored plain-string reply (see MinimalClient).
|
| 212 |
+
reply_text = get_llm().chat(transcript, target_lang=output_lang)
|
| 213 |
+
except Exception as exc: # pragma: no cover
|
| 214 |
+
logger.exception("LLM call failed")
|
| 215 |
+
return transcript, f"(LLM error: {exc})", None
|
| 216 |
|
| 217 |
+
reply_text = reply_text or "(empty reply)"
|
| 218 |
|
| 219 |
try:
|
| 220 |
wav, sr = get_tts().synthesize(
|
|
|
|
| 245 |
if not text:
|
| 246 |
return "(no text entered)", None
|
| 247 |
|
| 248 |
+
# ── Phrasebook short-circuit (see voice path above) ──────────────────
|
| 249 |
+
hit = phrasebook_lookup(text, output_lang)
|
| 250 |
+
if hit:
|
| 251 |
+
logger.info(
|
| 252 |
+
"Phrasebook hit (%s, score=%.2f): %r → %r [cat=%s]",
|
| 253 |
+
hit["match"], hit["score"], text, hit["target"], hit["category"],
|
| 254 |
)
|
| 255 |
+
reply_text = hit["target"]
|
| 256 |
+
else:
|
| 257 |
+
try:
|
| 258 |
+
reply_text = get_llm().chat(text, target_lang=output_lang)
|
| 259 |
+
except Exception as exc: # pragma: no cover
|
| 260 |
+
logger.exception("LLM call failed")
|
| 261 |
+
return f"(LLM error: {exc})", None
|
| 262 |
|
| 263 |
+
reply_text = reply_text or "(empty reply)"
|
| 264 |
|
| 265 |
try:
|
| 266 |
wav, sr = get_tts().synthesize(
|
|
|
|
| 281 |
with gr.Blocks(title="Sahel-Voice — Minimal Baseline") as demo:
|
| 282 |
gr.Markdown(
|
| 283 |
"# 🌾 Sahel-Voice — Minimal Baseline\n"
|
| 284 |
+
f"Zero-shot Whisper → {LLM_MODEL_ID} → MMS-TTS, with a curated "
|
| 285 |
+
"Bambara/Pular phrasebook short-circuit in front of the LLM. "
|
| 286 |
+
"No adapters, no memory, no polish. This is the field-test "
|
| 287 |
+
"baseline — see `docs/baseline_rebuild.md`."
|
| 288 |
)
|
| 289 |
|
| 290 |
# Shared across tabs. Split into two so input and output language
|
|
@@ -0,0 +1,37 @@
|
|
| 1 |
+
{
|
| 2 |
+
"dialect": "Bambara as spoken in Bamako, Mali",
|
| 3 |
+
"iso": "bam",
|
| 4 |
+
"notes": "Curated 30-phrase gold list. Orthography uses ɛ, ɔ, ɲ. Elisions (t', b', k') are preserved as in standard written Mali Bambara. Do NOT substitute with Jula/Dyula (Côte d'Ivoire) forms.",
|
| 5 |
+
"pairs": [
|
| 6 |
+
{"source": "Good morning / Bonjour", "target": "I ni sɔgɔma"},
|
| 7 |
+
{"source": "Good afternoon / Bon après-midi", "target": "I ni tile"},
|
| 8 |
+
{"source": "Good evening / Bonsoir", "target": "I ni wula"},
|
| 9 |
+
{"source": "Hello (general) / Salut", "target": "I ni ce"},
|
| 10 |
+
{"source": "Thank you / Merci", "target": "I ni ce"},
|
| 11 |
+
{"source": "How are you? / Comment vas-tu ?", "target": "I ka kɛnɛ wa?"},
|
| 12 |
+
{"source": "I am fine. / Je vais bien.", "target": "Kɛnɛ, tɔɔrɔ tɛ."},
|
| 13 |
+
{"source": "How is the family? / Comment va la famille ?", "target": "Sɔmɔgɔw bɛ di?"},
|
| 14 |
+
{"source": "They are fine. / Ils vont bien.", "target": "Tɔɔrɔ t'u la."},
|
| 15 |
+
{"source": "What is your name? / Comment t'appelles-tu ?", "target": "I tɔgɔ bi di?"},
|
| 16 |
+
{"source": "My name is... / Je m'appelle...", "target": "Ne tɔgɔ ye..."},
|
| 17 |
+
{"source": "Where are you going? / Où vas-tu ?", "target": "I bɛ taa min?"},
|
| 18 |
+
{"source": "I am going to the market. / Je vais au marché.", "target": "N bɛ taa sugu la."},
|
| 19 |
+
{"source": "How much is this? / C'est combien ?", "target": "Nin ye joli ye?"},
|
| 20 |
+
{"source": "It is too expensive. / C'est trop cher.", "target": "A da ka gɛlɛn."},
|
| 21 |
+
{"source": "Please / S'il vous plaît", "target": "Hakɛ to"},
|
| 22 |
+
{"source": "I am sorry / Je suis désolé", "target": "Yafa n ma"},
|
| 23 |
+
{"source": "I don't understand / Je ne comprends pas", "target": "N m'a faamu"},
|
| 24 |
+
{"source": "Speak slowly / Parle doucement", "target": "Kuma dɔɔni dɔɔni"},
|
| 25 |
+
{"source": "I am hungry / J'ai faim", "target": "Kɔngɔ bɛ n na"},
|
| 26 |
+
{"source": "I want to eat / Je veux manger", "target": "N b'a fɛ ka dumu"},
|
| 27 |
+
{"source": "Give me water / Donne-moi de l'eau", "target": "Ji di n ma"},
|
| 28 |
+
{"source": "How is the work/field? / Comment va le travail/champ ?", "target": "Baara bɛ di? / Sɛnɛ bɛ di?"},
|
| 29 |
+
{"source": "The work is good. / Le travail va bien.", "target": "Baara bɛ kɛnɛ."},
|
| 30 |
+
{"source": "Where is the doctor? / Où est le docteur ?", "target": "Dɔkɔtɔrɔ bɛ min?"},
|
| 31 |
+
{"source": "I am tired / Je suis fatigué", "target": "N sɛgɛnna"},
|
| 32 |
+
{"source": "See you tomorrow / À demain", "target": "K'an bɛn sini"},
|
| 33 |
+
{"source": "Goodbye / Au revoir", "target": "K'an bɛn"},
|
| 34 |
+
{"source": "God bless you / Que Dieu te bénisse", "target": "Ala ka duga i ye"},
|
| 35 |
+
{"source": "Peace only / La paix seulement", "target": "Hɛɛrɛ dɔrɔn"}
|
| 36 |
+
]
|
| 37 |
+
}
|
|
@@ -0,0 +1,107 @@
|
|
| 1 |
+
{
|
| 2 |
+
"dialect": "Bambara as spoken in Bamako, Mali",
|
| 3 |
+
"iso": "bam",
|
| 4 |
+
"notes": "Curated 100-phrase field phrasebook, organized by conversational category. Used by the phrasebook short-circuit in src/llm/phrasebook.py — English-keyed, fuzzy-matched. Do NOT substitute with Jula/Dyula (Côte d'Ivoire) forms.",
|
| 5 |
+
"pairs": [
|
| 6 |
+
{"category": "Greetings", "source": "Hello / Thank you", "target": "I ni ce"},
|
| 7 |
+
{"category": "Greetings", "source": "Good morning", "target": "I ni sɔgɔma"},
|
| 8 |
+
{"category": "Greetings", "source": "Good afternoon", "target": "I ni tile"},
|
| 9 |
+
{"category": "Greetings", "source": "Good evening", "target": "I ni wula"},
|
| 10 |
+
{"category": "Greetings", "source": "Welcome", "target": "I ni dɔn"},
|
| 11 |
+
{"category": "Greetings", "source": "How are you?", "target": "I ka kɛnɛ wa?"},
|
| 12 |
+
{"category": "Greetings", "source": "Fine, no trouble", "target": "Kɛnɛ, tɔɔrɔ tɛ"},
|
| 13 |
+
{"category": "Greetings", "source": "How was the night?", "target": "Sini kɛnɛ?"},
|
| 14 |
+
{"category": "Greetings", "source": "How was the work?", "target": "Baara ni ce"},
|
| 15 |
+
{"category": "Greetings", "source": "Well done", "target": "I ni baara"},
|
| 16 |
+
{"category": "Identity", "source": "What is your name?", "target": "I tɔgɔ bi di?"},
|
| 17 |
+
{"category": "Identity", "source": "My name is...", "target": "Ne tɔgɔ ye..."},
|
| 18 |
+
{"category": "Identity", "source": "Where are you from?", "target": "I bɔra min?"},
|
| 19 |
+
{"category": "Identity", "source": "I am from...", "target": "N bɔra..."},
|
| 20 |
+
{"category": "Identity", "source": "What is your work?", "target": "I bɛ mun baara kɛ?"},
|
| 21 |
+
{"category": "Family", "source": "How is the family?", "target": "Sɔmɔgɔw bɛ di?"},
|
| 22 |
+
{"category": "Family", "source": "How is your wife?", "target": "I muso bɛ di?"},
|
| 23 |
+
{"category": "Family", "source": "How is your husband?", "target": "I tigi bɛ di?"},
|
| 24 |
+
{"category": "Family", "source": "How are the children?", "target": "Denmisɛnw bɛ di?"},
|
| 25 |
+
{"category": "Family", "source": "How is the baby?", "target": "Denu bɛ di?"},
|
| 26 |
+
{"category": "Family", "source": "They are fine", "target": "Tɔɔrɔ t'u la"},
|
| 27 |
+
{"category": "Family", "source": "My father is well", "target": "N fa bɛ kɛnɛ"},
|
| 28 |
+
{"category": "Family", "source": "My mother is well", "target": "N ba bɛ kɛnɛ"},
|
| 29 |
+
{"category": "Family", "source": "Are you married?", "target": "I furula wa?"},
|
| 30 |
+
{"category": "Food/Water", "source": "I am hungry", "target": "Kɔngɔ bɛ n na"},
|
| 31 |
+
{"category": "Food/Water", "source": "I am thirsty", "target": "Min nɔgɔ bɛ n na"},
|
| 32 |
+
{"category": "Food/Water", "source": "I want to eat", "target": "N b'a fɛ ka dumu"},
|
| 33 |
+
{"category": "Food/Water", "source": "Give me water", "target": "Ji di n ma"},
|
| 34 |
+
{"category": "Food/Water", "source": "The food is sweet", "target": "Dumuni ka di"},
|
| 35 |
+
{"category": "Food/Water", "source": "I am full", "target": "N fara"},
|
| 36 |
+
{"category": "Food/Water", "source": "Bread", "target": "Buruburu"},
|
| 37 |
+
{"category": "Food/Water", "source": "Rice", "target": "Malo"},
|
| 38 |
+
{"category": "Food/Water", "source": "Meat", "target": "Sogo"},
|
| 39 |
+
{"category": "Food/Water", "source": "Tea", "target": "Te"},
|
| 40 |
+
{"category": "Food/Water", "source": "Sugar", "target": "Sukaro"},
|
| 41 |
+
{"category": "Farming", "source": "How is the farming?", "target": "Sɛnɛ bɛ di?"},
|
| 42 |
+
{"category": "Farming", "source": "It rained today", "target": "Sanji nna bi"},
|
| 43 |
+
{"category": "Farming", "source": "The field", "target": "Sɛnɛfɛla"},
|
| 44 |
+
{"category": "Farming", "source": "Maize / Corn", "target": "Kaba"},
|
| 45 |
+
{"category": "Farming", "source": "Cow", "target": "Misi"},
|
| 46 |
+
{"category": "Farming", "source": "Sheep", "target": "Saga"},
|
| 47 |
+
{"category": "Farming", "source": "Goat", "target": "Ba"},
|
| 48 |
+
{"category": "Farming", "source": "Chicken", "target": "Shɛ"},
|
| 49 |
+
{"category": "Farming", "source": "Where is the hoe?", "target": "Daba bɛ min?"},
|
| 50 |
+
{"category": "Farming", "source": "We are working", "target": "An bɛ baara kɛ"},
|
| 51 |
+
{"category": "Health", "source": "I am sick", "target": "N bana"},
|
| 52 |
+
{"category": "Health", "source": "My head hurts", "target": "N kungolo bɛ n dimi"},
|
| 53 |
+
{"category": "Health", "source": "My stomach hurts", "target": "N kɔnɔ bɛ n dimi"},
|
| 54 |
+
{"category": "Health", "source": "I have fever", "target": "Sumaya bɛ n na"},
|
| 55 |
+
{"category": "Health", "source": "Where is the hospital?", "target": "Ɲɛnajɛso bɛ min?"},
|
| 56 |
+
{"category": "Health", "source": "Where is the doctor?", "target": "Dɔkɔtɔrɔ bɛ min?"},
|
| 57 |
+
{"category": "Health", "source": "Take the medicine", "target": "Fura min"},
|
| 58 |
+
{"category": "Health", "source": "Drink this", "target": "Nin min"},
|
| 59 |
+
{"category": "Health", "source": "Lie down", "target": "I la"},
|
| 60 |
+
{"category": "Health", "source": "Do you feel better?", "target": "A ka fisa wa?"},
|
| 61 |
+
{"category": "Shopping", "source": "How much?", "target": "Joli ye?"},
|
| 62 |
+
{"category": "Shopping", "source": "It is too much", "target": "A ka ca"},
|
| 63 |
+
{"category": "Shopping", "source": "Reduce it", "target": "Dɔɔni dɔɔni bɔ a la"},
|
| 64 |
+
{"category": "Shopping", "source": "I have no money", "target": "Wari tɛ n fɛ"},
|
| 65 |
+
{"category": "Shopping", "source": "Here is the money", "target": "Wari filɛ"},
|
| 66 |
+
{"category": "Shopping", "source": "Market", "target": "Sugu"},
|
| 67 |
+
{"category": "Shopping", "source": "Shop", "target": "Butiki"},
|
| 68 |
+
{"category": "Shopping", "source": "Soap", "target": "Safinɛ"},
|
| 69 |
+
{"category": "Shopping", "source": "Oil", "target": "Tulu"},
|
| 70 |
+
{"category": "Shopping", "source": "Salt", "target": "Kɔgɔ"},
|
| 71 |
+
{"category": "Travel", "source": "Where is the road?", "target": "Sira bɛ min?"},
|
| 72 |
+
{"category": "Travel", "source": "Is it far?", "target": "A ka jan wa?"},
|
| 73 |
+
{"category": "Travel", "source": "It is close", "target": "A surunya"},
|
| 74 |
+
{"category": "Travel", "source": "Turn right", "target": "Kini bolo fɛ"},
|
| 75 |
+
{"category": "Travel", "source": "Turn left", "target": "Numa bolo fɛ"},
|
| 76 |
+
{"category": "Travel", "source": "Stop here", "target": "I jɔ yan"},
|
| 77 |
+
{"category": "Travel", "source": "Let's go", "target": "An ka taa"},
|
| 78 |
+
{"category": "Travel", "source": "Car", "target": "Mobili"},
|
| 79 |
+
{"category": "Travel", "source": "Bus", "target": "Sɔta"},
|
| 80 |
+
{"category": "Travel", "source": "Motorbike", "target": "Nɛgɛso"},
|
| 81 |
+
{"category": "Clarity", "source": "I understand", "target": "N n'a faamu"},
|
| 82 |
+
{"category": "Clarity", "source": "I don't understand", "target": "N m'a faamu"},
|
| 83 |
+
{"category": "Clarity", "source": "Repeat it", "target": "Segi a kan"},
|
| 84 |
+
{"category": "Clarity", "source": "Speak slowly", "target": "Kuma dɔɔni dɔɔni"},
|
| 85 |
+
{"category": "Clarity", "source": "Do you speak Bambara?", "target": "I bɛ Bamanankan mɛn wa?"},
|
| 86 |
+
{"category": "Clarity", "source": "A little", "target": "Dɔɔni dɔɔni"},
|
| 87 |
+
{"category": "Clarity", "source": "I don't know", "target": "N m'a lɔn"},
|
| 88 |
+
{"category": "Clarity", "source": "Yes", "target": "Awɔ"},
|
| 89 |
+
{"category": "Clarity", "source": "No", "target": "Ayi"},
|
| 90 |
+
{"category": "Clarity", "source": "Wait", "target": "Kɔnɔ"},
|
| 91 |
+
{"category": "Time", "source": "Today", "target": "Bi"},
|
| 92 |
+
{"category": "Time", "source": "Tomorrow", "target": "Sini"},
|
| 93 |
+
{"category": "Time", "source": "Yesterday", "target": "Kunu"},
|
| 94 |
+
{"category": "Time", "source": "Now", "target": "Sisan"},
|
| 95 |
+
{"category": "Time", "source": "Later", "target": "Kɔfɛ"},
|
| 96 |
+
{"category": "Parting", "source": "Goodbye", "target": "K'an bɛn"},
|
| 97 |
+
{"category": "Parting", "source": "Until later", "target": "K'an bɛn kɔfɛ"},
|
| 98 |
+
{"category": "Parting", "source": "Until tomorrow", "target": "K'an bɛn sini"},
|
| 99 |
+
{"category": "Parting", "source": "Have a good day", "target": "Tile hɛɛrɛ"},
|
| 100 |
+
{"category": "Parting", "source": "Have a good night", "target": "Su hɛɛrɛ"},
|
| 101 |
+
{"category": "Parting", "source": "Go in peace", "target": "Taa hɛɛrɛ la"},
|
| 102 |
+
{"category": "Parting", "source": "God bless you", "target": "Ala ka duga i ye"},
|
| 103 |
+
{"category": "Parting", "source": "God willing", "target": "Ala sɔnna"},
|
| 104 |
+
{"category": "Parting", "source": "Thank God", "target": "Ala tando"},
|
| 105 |
+
{"category": "Parting", "source": "Peace only", "target": "Hɛɛrɛ dɔrɔn"}
|
| 106 |
+
]
|
| 107 |
+
}
|
|
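The Bambara targets above rely on non-ASCII letters such as ɛ and ɔ. A quick consistency check one might run on a phrasebook file is sketched below. It is not part of this commit: the helper name `non_nfc_targets` and the third sample entry are hypothetical, and it assumes the downstream matcher and TTS expect NFC-normalised strings.

```python
import unicodedata


def non_nfc_targets(pairs):
    """Return the targets that change under NFC normalisation."""
    return [
        p["target"]
        for p in pairs
        if unicodedata.normalize("NFC", p["target"]) != p["target"]
    ]


pairs = [
    {"category": "Time", "source": "Later", "target": "Kɔfɛ"},
    {"category": "Parting", "source": "Peace only", "target": "Hɛɛrɛ dɔrɔn"},
    # Hypothetical bad entry: "e" followed by a combining acute accent
    # instead of the precomposed letter.
    {"category": "Test", "source": "Coffee", "target": "Kafe\u0301"},
]

print(non_nfc_targets(pairs))
```

Only the hypothetical third entry is reported; the two entries copied from the phrasebook are already stable under NFC.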
@@ -0,0 +1,37 @@
|
| 1 |
+
{
|
| 2 |
+
"dialect": "Pular of Fuuta Jallon, as spoken in Guinea",
|
| 3 |
+
"iso": "ful",
|
| 4 |
+
"notes": "Curated 30-phrase gold list, cross-checked against the Peace Corps Guinea 2015 Pular manual. Orthography uses ɓ, ɗ, ñ, ŋ. Signature Fuuta Jallon markers: 'Miɗo yaha' (1sg progressive), 'No ... wa'i' (how is), 'Jam tun' (peace only response), 'A jaraama' (thank you / hello). Do NOT substitute with Pulaar (Senegal) or Fulfulde (Nigeria, Cameroon) forms.",
|
| 5 |
+
"pairs": [
|
| 6 |
+
{"source": "Hello / Thank you (General)", "target": "A jaraama"},
|
| 7 |
+
{"source": "Good morning (Did you sleep in peace?)", "target": "On walli e jam?"},
|
| 8 |
+
{"source": "Good afternoon (Have you spent the day in peace?)", "target": "On ñalli e jam?"},
|
| 9 |
+
{"source": "Good evening (Have you spent the evening in peace?)", "target": "On hiiri e jam?"},
|
| 10 |
+
{"source": "Peace only (Standard response)", "target": "Jam tun"},
|
| 11 |
+
{"source": "How are you? / How is it?", "target": "No wa'i?"},
|
| 12 |
+
{"source": "Is there any trouble? / Is it okay?", "target": "Tana alaa?"},
|
| 13 |
+
{"source": "No trouble / Fine", "target": "Tana alaa"},
|
| 14 |
+
{"source": "Thank you (Respectful/Plural)", "target": "On jaraama"},
|
| 15 |
+
{"source": "How is the family?", "target": "No ɓeyngure nden wa'i?"},
|
| 16 |
+
{"source": "How are the children?", "target": "No fayɓe ɓen wa'i?"},
|
| 17 |
+
{"source": "What is your name?", "target": "Innde maa ko woni?"},
|
| 18 |
+
{"source": "My name is...", "target": "Innde am ko..."},
|
| 19 |
+
{"source": "Where are you going?", "target": "Hoto yahataa?"},
|
| 20 |
+
{"source": "I am going to the market", "target": "Miɗo yaha ka sugu"},
|
| 21 |
+
{"source": "Please (I ask you)", "target": "Mi yidiima"},
|
| 22 |
+
{"source": "Excuse me / Sorry", "target": "Accu hakke"},
|
| 23 |
+
{"source": "I understand", "target": "Mi faamii"},
|
| 24 |
+
{"source": "I don't understand", "target": "Mi faamaali"},
|
| 25 |
+
{"source": "Do you speak Pular?", "target": "Aɗa waawi Pular?"},
|
| 26 |
+
{"source": "Just a little bit", "target": "Seeɗa tun"},
|
| 27 |
+
{"source": "I want water", "target": "Miɗo yiɗi ndiyam"},
|
| 28 |
+
{"source": "Give me...", "target": "Okku am..."},
|
| 29 |
+
{"source": "How much is it?", "target": "Ko jelu?"},
|
| 30 |
+
{"source": "It is expensive", "target": "No tiiɗi"},
|
| 31 |
+
{"source": "God bless you", "target": "Alla duga maa"},
|
| 32 |
+
{"source": "If God wills (God willing)", "target": "Si Alla jaɓii"},
|
| 33 |
+
{"source": "Goodbye (Formal)", "target": "Oo-o"},
|
| 34 |
+
{"source": "Until tomorrow (See you tomorrow)", "target": "En jango"},
|
| 35 |
+
{"source": "Go in peace", "target": "Yahu e jam"}
|
| 36 |
+
]
|
| 37 |
+
}
|
|
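Since the anchor file is injected verbatim into the system prompt, a small shape check before loading may be useful. The following is a minimal sketch, not code from this commit: `validate_anchor_file` is a hypothetical helper, and the second sample pair is deliberately broken for illustration.

```python
import json


def validate_anchor_file(data: dict) -> list[str]:
    """Return human-readable problems; an empty list means the file looks sane."""
    problems = []
    for key in ("dialect", "iso", "pairs"):
        if key not in data:
            problems.append(f"missing top-level key: {key}")
    for i, pair in enumerate(data.get("pairs", [])):
        src = (pair.get("source") or "").strip()
        tgt = (pair.get("target") or "").strip()
        if not src or not tgt:
            problems.append(f"pair {i}: empty source or target")
    return problems


sample = json.loads(
    '{"dialect": "Pular of Fuuta Jallon, as spoken in Guinea", "iso": "ful", '
    '"pairs": [{"source": "Peace only (Standard response)", "target": "Jam tun"}, '
    '{"source": "Go in peace", "target": ""}]}'
)

print(validate_anchor_file(sample))
```

The check only covers the three keys the loader and prompt builder appear to read (`dialect`, `iso`, `pairs`); it says nothing about linguistic correctness.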
@@ -0,0 +1,117 @@
|
| 1 |
+
{
|
| 2 |
+
"dialect": "Pular of Fuuta Jallon, as spoken in Guinea",
|
| 3 |
+
"iso": "ful",
|
| 4 |
+
"notes": "Curated 110-phrase field phrasebook, organized by conversational category. Used by the phrasebook short-circuit in src/llm/phrasebook.py — English-keyed, fuzzy-matched. Cross-checked against Peace Corps Guinea 2015 Pular manual. Do NOT substitute with Pulaar (Senegal) or Fulfulde (Nigeria/Cameroon) forms.",
|
| 5 |
+
"pairs": [
|
| 6 |
+
{"category": "Greetings", "source": "Hello / Thank you", "target": "A jaraama"},
|
| 7 |
+
{"category": "Greetings", "source": "Good morning", "target": "On walli e jam?"},
|
| 8 |
+
{"category": "Greetings", "source": "Good afternoon", "target": "On ñalli e jam?"},
|
| 9 |
+
{"category": "Greetings", "source": "Good evening", "target": "On hiiri e jam?"},
|
| 10 |
+
{"category": "Greetings", "source": "Peace only (Response)", "target": "Jam tun"},
|
| 11 |
+
{"category": "Greetings", "source": "How are you?", "target": "No wa'i?"},
|
| 12 |
+
{"category": "Greetings", "source": "Is there any trouble?", "target": "Tana alaa?"},
|
| 13 |
+
{"category": "Greetings", "source": "No trouble", "target": "Tana alaa"},
|
| 14 |
+
{"category": "Greetings", "source": "How is the heat/weather?", "target": "Ho no yasi ken waye?"},
|
| 15 |
+
{"category": "Greetings", "source": "Welcome", "target": "Tana alaa"},
|
| 16 |
+
{"category": "Identity", "source": "What is your name?", "target": "Ko ho no inne te dah?"},
|
| 17 |
+
{"category": "Identity", "source": "My name is...", "target": "Innde am ko..."},
|
| 18 |
+
{"category": "Identity", "source": "Where are you coming from?", "target": "Hoto iwruɗaa?"},
|
| 19 |
+
{"category": "Identity", "source": "Where are you from?", "target": "Eewdi maadiin koh hontoh?"},
|
| 20 |
+
{"category": "Identity", "source": "I am coming from...", "target": "Mi iwri ko..."},
|
| 21 |
+
{"category": "Identity", "source": "I am from", "target": "Eewdi an diin koh"},
|
| 22 |
+
{"category": "Identity", "source": "I am a farmer", "target": "Koh mee rehmohwoh"},
|
| 23 |
+
{"category": "Family", "source": "How is the family?", "target": "No ɓeyngure nden wa'i?"},
|
| 24 |
+
{"category": "Family", "source": "How is the woman?", "target": "No debbo on wa'i?"},
|
| 25 |
+
{"category": "Family", "source": "How is your wife?", "target": "No mbehgu ma wa'i?"},
|
| 26 |
+
{"category": "Family", "source": "How is your husband?", "target": "No mohdi ma wa'i?"},
|
| 27 |
+
{"category": "Family", "source": "How is the man?", "target": "No gorko on wa'i?"},
|
| 28 |
+
{"category": "Family", "source": "How are the children?", "target": "No fayɓe ɓen wa'i?"},
|
| 29 |
+
{"category": "Family", "source": "How is the baby?", "target": "No boobo on wa'i?"},
|
| 30 |
+
{"category": "Family", "source": "Everyone is fine", "target": "Hiɓe e jam"},
|
| 31 |
+
{"category": "Family", "source": "My father is well", "target": "Baba am no e jam"},
|
| 32 |
+
{"category": "Family", "source": "My mother is well", "target": "Neene am no e jam"},
|
| 33 |
+
{"category": "Family", "source": "How many children?", "target": "Fayɓe ɓen ko jelu?"},
|
| 34 |
+
{"category": "Food/Water", "source": "I am hungry", "target": "Mi weelaa maa"},
|
| 35 |
+
{"category": "Food/Water", "source": "I am thirsty", "target": "Miɗo ɗonɗa"},
|
| 36 |
+
{"category": "Food/Water", "source": "I want to eat", "target": "Miɗo faalaa ñaamude"},
|
| 37 |
+
{"category": "Food/Water", "source": "Give me water", "target": "Okku am ndiyam"},
|
| 38 |
+
{"category": "Food/Water", "source": "The food is good", "target": "Ñaameteeɗon no weli"},
|
| 39 |
+
{"category": "Food/Water", "source": "I am full", "target": "Mi haraama"},
|
| 40 |
+
{"category": "Food/Water", "source": "Bread", "target": "Biirehdi"},
|
| 41 |
+
{"category": "Food/Water", "source": "Rice", "target": "Maaro"},
|
| 42 |
+
{"category": "Food/Water", "source": "Milk", "target": "Mɓeerah"},
|
| 43 |
+
{"category": "Food/Water", "source": "Sour Cream", "target": "Kosam"},
|
| 44 |
+
{"category": "Food/Water", "source": "Hot water", "target": "Ndiyam wuuldham"},
|
| 45 |
+
{"category": "Food/Water", "source": "Cold water", "target": "Ndiyam ɓuuɓudham"},
|
| 46 |
+
{"category": "Food/Water", "source": "Coffee", "target": "Kafe"},
|
| 47 |
+
{"category": "Food/Water", "source": "Sugar", "target": "Sukkar"},
|
| 48 |
+
{"category": "Farming", "source": "How is the farming?", "target": "No ngesa kan wa'i?"},
|
| 49 |
+
{"category": "Farming", "source": "The rain is good", "target": "Ndiyam ndan no moƴƴi"},
|
| 50 |
+
{"category": "Farming", "source": "The field", "target": "Ngesa"},
|
| 51 |
+
{"category": "Farming", "source": "Garden", "target": "Suntuure"},
|
| 52 |
+
{"category": "Farming", "source": "Cattle / Cows", "target": "Nai"},
|
| 53 |
+
{"category": "Farming", "source": "Sheep", "target": "Baali"},
|
| 54 |
+
{"category": "Farming", "source": "Goat", "target": "Mbeewa"},
|
| 55 |
+
{"category": "Farming", "source": "Chicken", "target": "Gertogal"},
|
| 56 |
+
{"category": "Farming", "source": "Where is the thing?", "target": "Hoto huunde nden woni?"},
|
| 57 |
+
{"category": "Farming", "source": "To cultivate or to farm", "target": "Remugol"},
|
| 58 |
+
{"category": "Farming", "source": "To sow or plant seeds", "target": "Aawugol"},
|
| 59 |
+
{"category": "Farming", "source": "To harvest", "target": "Heptugol"},
|
| 60 |
+
{"category": "Farming", "source": "We are working (speaking to the person I'm working with)", "target": "Hiɗen e golle"},
|
| 61 |
+
{"category": "Farming", "source": "We are working (speaking to another person not working with us)", "target": "Meein gollu deh"},
|
| 62 |
+
{"category": "Health", "source": "I am sick", "target": "Miɗo nawni"},
|
| 63 |
+
{"category": "Health", "source": "My head hurts", "target": "Hoore am den no muusa"},
|
| 64 |
+
{"category": "Health", "source": "My stomach hurts", "target": "Reedu am doun no muusa"},
|
| 65 |
+
{"category": "Health", "source": "I have fever", "target": "Miɗo jogi yontere"},
|
| 66 |
+
{"category": "Health", "source": "Where is the clinic?", "target": "Hoto kilinik on woni?"},
|
| 67 |
+
{"category": "Health", "source": "Where is the doctor?", "target": "Hoto dɔkɔtɔrɔ on woni?"},
|
| 68 |
+
{"category": "Health", "source": "Take this medicine", "target": "Jehhtu leki kin"},
|
| 69 |
+
{"category": "Health", "source": "Drink this", "target": "Yaru ɗun"},
|
| 70 |
+
{"category": "Health", "source": "Rest now", "target": "Fow'w toh"},
|
| 71 |
+
{"category": "Health", "source": "Are you better?", "target": "Aɗa selli jooni?"},
|
| 72 |
+
{"category": "Shopping", "source": "How much is this?", "target": "Dounn ko jelu?"},
|
| 73 |
+
{"category": "Shopping", "source": "It is too expensive", "target": "No sahtee"},
|
| 74 |
+
{"category": "Shopping", "source": "Reduce the price", "target": "Dhuitah nam seeɗa"},
|
| 75 |
+
{"category": "Shopping", "source": "I have no money", "target": "Mi alaa buudi"},
|
| 76 |
+
{"category": "Shopping", "source": "Here is the money", "target": "Hinoh buudi dinn"},
|
| 77 |
+
{"category": "Shopping", "source": "Market", "target": "Luhmoh"},
|
| 78 |
+
{"category": "Shopping", "source": "Shop / Boutique", "target": "Bitiki"},
|
| 79 |
+
{"category": "Shopping", "source": "Soap", "target": "Sabunnde"},
|
| 80 |
+
{"category": "Shopping", "source": "Matches", "target": "Almet"},
|
| 81 |
+
{"category": "Shopping", "source": "Salt", "target": "Landan"},
|
| 82 |
+
{"category": "Travel", "source": "Where is the road to...?", "target": "Hoto ngol laawol yahata...?"},
|
| 83 |
+
{"category": "Travel", "source": "Is it far?", "target": "No woɗɗi?"},
|
| 84 |
+
{"category": "Travel", "source": "It is near", "target": "No ɓadii"},
|
| 85 |
+
{"category": "Travel", "source": "Turn right", "target": "Ýillu ka ñaamo"},
|
| 86 |
+
{"category": "Travel", "source": "Turn left", "target": "Ýillu ka nannoh"},
|
| 87 |
+
{"category": "Travel", "source": "Stop here", "target": "Daroh ɗoo"},
|
| 88 |
+
{"category": "Travel", "source": "Let's go", "target": "Mah een"},
|
| 89 |
+
{"category": "Travel", "source": "Car / Taxi", "target": "Oto"},
|
| 90 |
+
{"category": "Travel", "source": "Bicycle", "target": "Velo"},
|
| 91 |
+
{"category": "Travel", "source": "Motorcycle", "target": "Moto"},
|
| 92 |
+
{"category": "Clarity", "source": "I understand", "target": "Mi faamii"},
|
| 93 |
+
{"category": "Clarity", "source": "I don't understand", "target": "Mi faamaali"},
|
| 94 |
+
{"category": "Clarity", "source": "Please repeat", "target": "Fultu kadi"},
|
| 95 |
+
{"category": "Clarity", "source": "Speak slowly", "target": "Halu seeɗa seeɗa"},
|
| 96 |
+
{"category": "Clarity", "source": "Do you speak French?", "target": "Aɗa waawi Faransi?"},
|
| 97 |
+
{"category": "Clarity", "source": "I can just a little", "target": "Mi nan waawi seeɗa tun"},
|
| 98 |
+
{"category": "Clarity", "source": "I don't know", "target": "Mi andaa"},
|
| 99 |
+
{"category": "Clarity", "source": "Yes", "target": "Eyyo / Hii'hi"},
|
| 100 |
+
{"category": "Clarity", "source": "No", "target": "O'o"},
|
| 101 |
+
{"category": "Clarity", "source": "Wait", "target": "Sabboh"},
|
| 102 |
+
{"category": "Time", "source": "Today", "target": "Hannde"},
|
| 103 |
+
{"category": "Time", "source": "Tomorrow", "target": "Jango"},
|
| 104 |
+
{"category": "Time", "source": "Yesterday", "target": "Hanki"},
|
| 105 |
+
{"category": "Time", "source": "Now", "target": "Joni"},
|
| 106 |
+
{"category": "Time", "source": "Later", "target": "On tuma"},
|
| 107 |
+
{"category": "Parting", "source": "Goodbye", "target": "Oo-o"},
|
| 108 |
+
{"category": "Parting", "source": "See you later", "target": "En on tuma"},
|
| 109 |
+
{"category": "Parting", "source": "See you tomorrow", "target": "En jango"},
|
| 110 |
+
{"category": "Parting", "source": "Have a good day", "target": "Ñallu e jam"},
|
| 111 |
+
{"category": "Parting", "source": "Have a good night", "target": "Waalu e jam"},
|
| 112 |
+
{"category": "Parting", "source": "Go in peace", "target": "Yahu e jam"},
|
| 113 |
+
{"category": "Parting", "source": "God willing", "target": "Si Alla jaɓii"},
|
| 114 |
+
{"category": "Parting", "source": "Thank God", "target": "Ko ýettude Alla"},
|
| 115 |
+
{"category": "Parting", "source": "Peace only", "target": "Jam tun"}
|
| 116 |
+
]
|
| 117 |
+
}
|
|
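Because the phrasebook is keyed by English source, two different sources can map to the same target without anything flagging it. A reviewer might list such cases with something like the sketch below. `targets_with_multiple_sources` is a hypothetical helper, not part of this commit, and the sample rows are copied from the Greetings block above.

```python
from collections import defaultdict


def targets_with_multiple_sources(pairs):
    """Map each target that appears under more than one source to those sources."""
    by_target = defaultdict(list)
    for p in pairs:
        by_target[p["target"]].append(p["source"])
    return {t: srcs for t, srcs in by_target.items() if len(srcs) > 1}


pairs = [
    {"category": "Greetings", "source": "Is there any trouble?", "target": "Tana alaa?"},
    {"category": "Greetings", "source": "No trouble", "target": "Tana alaa"},
    {"category": "Greetings", "source": "Welcome", "target": "Tana alaa"},
    {"category": "Greetings", "source": "Good morning", "target": "On walli e jam?"},
]

print(targets_with_multiple_sources(pairs))
```

A shared target is not necessarily an error (a greeting can legitimately serve two purposes), so the output is a review list for a speaker to confirm, not a failure condition.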
@@ -0,0 +1,179 @@
|
| 1 |
+
"""MinimalClient — dialect-anchored plain-text LLM client for the Month 1–3 rebuild.
|
| 2 |
+
|
| 3 |
+
Why this exists (and not GemmaClient):
|
| 4 |
+
GemmaClient wraps every reply in a JSON object and runs a "teacher / child"
|
| 5 |
+
intent-classification flow. That's fine for the full app, but for the minimal
|
| 6 |
+
baseline it (a) spends model capacity on JSON compliance, (b) lets the model
|
| 7 |
+
drift into neighbouring languages (Wolof, Hausa, Pulaar of Senegal, Fulfulde
|
| 8 |
+
of Nigeria, Jula of Côte d'Ivoire), and (c) produces text that isn't clean
|
| 9 |
+
for TTS.
|
| 10 |
+
|
| 11 |
+
This client instead:
|
| 12 |
+
- pins the target dialect explicitly (Bambara / Bamako–Mali or Pular / Fuuta
|
| 13 |
+
Jallon–Guinea),
|
| 14 |
+
- injects the curated 30-phrase gold list for the target language as
|
| 15 |
+
few-shot anchoring in the system prompt,
|
| 16 |
+
- names forbidden neighbouring languages the model must not code-switch to,
|
| 17 |
+
- returns a plain string, ready for MMS-TTS.
|
| 18 |
+
|
| 19 |
+
GemmaClient and app.py are intentionally untouched.
|
| 20 |
+
"""
|
| 21 |
+
from __future__ import annotations
|
| 22 |
+
|
| 23 |
+
import json
|
| 24 |
+
import logging
|
| 25 |
+
from functools import lru_cache
|
| 26 |
+
from pathlib import Path
|
| 27 |
+
from typing import Optional
|
| 28 |
+
|
| 29 |
+
logger = logging.getLogger(__name__)
|
| 30 |
+
|
| 31 |
+
# configs/dialect_anchors/*.json live under <repo>/configs/dialect_anchors
|
| 32 |
+
_ANCHOR_DIR = (
|
| 33 |
+
Path(__file__).resolve().parent.parent.parent / "configs" / "dialect_anchors"
|
| 34 |
+
)
|
| 35 |
+
|
| 36 |
+
_ANCHOR_FILE = {
|
| 37 |
+
"bam": "bambara_mali.json",
|
| 38 |
+
"ful": "pular_guinea.json",
|
| 39 |
+
}
|
| 40 |
+
|
| 41 |
+
LANG_FULL_NAME = {
|
| 42 |
+
"bam": "Bambara as spoken in Bamako, Mali",
|
| 43 |
+
"ful": "Pular of Fuuta Jallon, as spoken in Guinea",
|
| 44 |
+
"fr": "French",
|
| 45 |
+
"en": "English",
|
| 46 |
+
}
|
| 47 |
+
|
| 48 |
+
# Neighbouring languages the model is most likely to drift into. Empty for
|
| 49 |
+
# fr/en — we don't need to fence those.
|
| 50 |
+
FORBIDDEN_DRIFT = {
|
| 51 |
+
"bam": (
|
| 52 |
+
"Jula / Dyula of Côte d'Ivoire, Wolof, Hausa, Swahili, Lingala, "
|
| 53 |
+
"or any other African language"
|
| 54 |
+
),
|
| 55 |
+
"ful": (
|
| 56 |
+
"Pulaar of Senegal, Fulfulde of Nigeria or Cameroon, Wolof, Hausa, "
|
| 57 |
+
"Swahili, or any other African language"
|
| 58 |
+
),
|
| 59 |
+
"fr": "",
|
| 60 |
+
"en": "",
|
| 61 |
+
}
|
| 62 |
+
|
| 63 |
+
|
| 64 |
+
@lru_cache(maxsize=4)
|
| 65 |
+
def _load_anchors(lang: str) -> list[dict]:
|
| 66 |
+
"""Load the curated gold-phrase list for `lang`. Cached per process."""
|
| 67 |
+
fname = _ANCHOR_FILE.get(lang)
|
| 68 |
+
if not fname:
|
| 69 |
+
return []
|
| 70 |
+
path = _ANCHOR_DIR / fname
|
| 71 |
+
if not path.exists():
|
| 72 |
+
logger.warning("Dialect anchor file missing: %s", path)
|
| 73 |
+
return []
|
| 74 |
+
with path.open("r", encoding="utf-8") as f:
|
| 75 |
+
data = json.load(f)
|
| 76 |
+
return data.get("pairs", [])
|
| 77 |
+
|
| 78 |
+
|
| 79 |
+
def _build_system_prompt(target_lang: str) -> str:
|
| 80 |
+
"""Assemble the per-call system prompt for a target output language."""
|
| 81 |
+
full = LANG_FULL_NAME.get(target_lang, "English")
|
| 82 |
+
forbidden = FORBIDDEN_DRIFT.get(target_lang, "")
|
| 83 |
+
anchors = _load_anchors(target_lang)
|
| 84 |
+
|
| 85 |
+
lines: list[str] = [
|
| 86 |
+
f"You are a warm, concise conversational assistant that replies ONLY in {full}.",
|
| 87 |
+
"",
|
| 88 |
+
"Output format: plain natural text only. No JSON, no code fences, no "
|
| 89 |
+
"markdown, no translations, no romanisation, no explanations. Reply in "
|
| 90 |
+
"1–3 short sentences suitable to be read aloud by a text-to-speech voice.",
|
| 91 |
+
]
|
| 92 |
+
|
| 93 |
+
if forbidden:
|
| 94 |
+
lines += [
|
| 95 |
+
"",
|
| 96 |
+
(
|
| 97 |
+
f"CRITICAL — dialect fidelity: do NOT use, mix, or substitute words "
|
| 98 |
+
f"from {forbidden}. If you are not confident a word belongs to "
|
| 99 |
+
f"{full}, rephrase using simpler vocabulary you are certain of, or "
|
| 100 |
+
f"apologise briefly in {full} (for example that you did not "
|
| 101 |
+
f"understand)."
|
| 102 |
+
),
|
| 103 |
+
]
|
| 104 |
+
|
| 105 |
+
if anchors:
|
| 106 |
+
lines += [
|
| 107 |
+
"",
|
| 108 |
+
f"Reference phrases in {full} — use this exact orthography, spelling, "
|
| 109 |
+
"and dialectal style as your model for every reply:",
|
| 110 |
+
]
|
| 111 |
+
for item in anchors:
|
| 112 |
+
src = item.get("source", "").strip()
|
| 113 |
+
tgt = item.get("target", "").strip()
|
| 114 |
+
if src and tgt:
|
| 115 |
+
lines.append(f"- {src} → {tgt}")
|
| 116 |
+
|
| 117 |
+
lines += [
|
| 118 |
+
"",
|
| 119 |
+
f"Always reply in {full}, even if the user writes to you in English, "
|
| 120 |
+
"French, or another language. Never translate your own reply.",
|
| 121 |
+
]
|
| 122 |
+
return "\n".join(lines)
|
| 123 |
+
|
| 124 |
+
|
| 125 |
+
class MinimalClient:
|
| 126 |
+
"""Dialect-anchored plain-text LLM client over HF Serverless Inference.
|
| 127 |
+
|
| 128 |
+
Usage:
|
| 129 |
+
client = MinimalClient(model_id="CohereLabs/aya-expanse-32b", hf_token=TOK)
|
| 130 |
+
reply = client.chat("Good morning", target_lang="bam")
|
| 131 |
+
# → "I ni sɔgɔma. I ka kɛnɛ wa?"
|
| 132 |
+
"""
|
| 133 |
+
|
| 134 |
+
def __init__(
|
| 135 |
+
self,
|
| 136 |
+
model_id: str = "CohereLabs/aya-expanse-32b",
|
| 137 |
+
hf_token: Optional[str] = None,
|
| 138 |
+
) -> None:
|
| 139 |
+
self.model_id = model_id
|
| 140 |
+
self.hf_token = hf_token
|
| 141 |
+
self._client = None # lazy init
|
| 142 |
+
|
| 143 |
+
def _get_client(self):
|
| 144 |
+
if self._client is None:
|
| 145 |
+
from huggingface_hub import InferenceClient
|
| 146 |
+
self._client = InferenceClient(token=self.hf_token)
|
| 147 |
+
return self._client
|
| 148 |
+
|
| 149 |
+
def chat(self, user_text: str, target_lang: str = "bam") -> str:
|
| 150 |
+
"""Return a plain-text reply in `target_lang`.
|
| 151 |
+
|
| 152 |
+
On any error returns a short parenthetical error string so the caller
|
| 153 |
+
can still feed something into TTS / display.
|
| 154 |
+
"""
|
| 155 |
+
system_prompt = _build_system_prompt(target_lang)
|
| 156 |
+
try:
|
| 157 |
+
client = self._get_client()
|
| 158 |
+
completion = client.chat_completion(
|
| 159 |
+
model=self.model_id,
|
| 160 |
+
messages=[
|
| 161 |
+
{"role": "system", "content": system_prompt},
|
| 162 |
+
{"role": "user", "content": user_text},
|
| 163 |
+
],
|
| 164 |
+
max_tokens=256,
|
| 165 |
+
temperature=0.3,
|
| 166 |
+
)
|
| 167 |
+
raw = (completion.choices[0].message.content or "").strip()
|
| 168 |
+
# Defensive: strip any stray code fences the model may emit anyway.
|
| 169 |
+
if raw.startswith("```"):
|
| 170 |
+
raw = raw.strip("`").strip()
|
| 171 |
+
# If a language tag slipped in on the first line, drop it.
|
| 172 |
+
if "\n" in raw:
|
| 173 |
+
first, rest = raw.split("\n", 1)
|
| 174 |
+
if len(first) < 20 and " " not in first:
|
| 175 |
+
raw = rest.strip()
|
| 176 |
+
return raw
|
| 177 |
+
except Exception as exc: # pragma: no cover — surfaced to UI
|
| 178 |
+
logger.error("MinimalClient error: %s", exc)
|
| 179 |
+
return f"(LLM unavailable: {exc})"
|
|
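The post-processing at the end of `chat` can be exercised without a network call by lifting it into a pure function. The sketch below does that; `clean_reply` is a hypothetical name and the sample strings are illustrative, but the body mirrors the fence-stripping and first-line heuristic shown in the diff.

```python
def clean_reply(raw: str) -> str:
    """Apply the same defensive clean-up that chat() performs on model output."""
    raw = (raw or "").strip()
    # Strip stray code fences the model may emit despite the prompt.
    if raw.startswith("```"):
        raw = raw.strip("`").strip()
    # Drop a short, space-free first line (typically a leftover language tag).
    if "\n" in raw:
        first, rest = raw.split("\n", 1)
        if len(first) < 20 and " " not in first:
            raw = rest.strip()
    return raw


print(clean_reply("```text\nI ni sɔgɔma.\n```"))
print(clean_reply("I ni sɔgɔma. I ka kɛnɛ wa?"))
print(clean_reply("Awɔ\nI ni ce"))
```

The third call shows a side effect worth keeping in mind: a legitimate one-word first line (here "Awɔ") is treated like a language tag and dropped. Applying the first-line rule only after a fence was actually stripped would be one way to narrow it, if that matters in practice.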
@@ -0,0 +1,123 @@
|
| 1 |
+
"""Phrasebook short-circuit — skip the LLM when the user hits a curated phrase.
|
| 2 |
+
|
| 3 |
+
Purpose
|
| 4 |
+
For the 80% of field-demo inputs that are canonical greetings, courtesies,
|
| 5 |
+
or basic questions, the LLM adds risk (dialect drift, hallucination,
|
| 6 |
+
latency) without adding value — we already have a gold translation. This
|
| 7 |
+
module does an English-keyed, fuzzy-normalised match against the curated
|
| 8 |
+
phrasebooks in configs/dialect_anchors/{bambara,pular}_phrasebook.json and
|
| 9 |
+
returns the target string directly when the match is strong.
|
| 10 |
+
|
| 11 |
+
Scope
|
| 12 |
+
- Only fires when target language is bam or ful. For en/fr output we let
|
| 13 |
+
the LLM (or a passthrough) handle it — nothing to short-circuit.
|
| 14 |
+
- Only English source keys (what the curated sheets contain). French or
|
| 15 |
+
in-language inputs will not match and will fall through to the LLM —
|
| 16 |
+
that's correct behaviour.
|
| 17 |
+
|
| 18 |
+
Matching
|
| 19 |
+
- Exact match on normalised string → score 1.0 ("exact").
|
| 20 |
+
- Otherwise SequenceMatcher ratio; threshold DEFAULT_THRESHOLD = 0.88.
|
| 21 |
+
- Normalisation: lowercase, strip punctuation (keeps ASCII apostrophes),
|
| 22 |
+
collapse whitespace.
|
| 23 |
+
|
| 24 |
+
API
|
| 25 |
+
lookup(user_text, target_lang) -> dict | None
|
| 26 |
+
dict has keys: source, target, category, score, match
|
| 27 |
+
"""
|
| 28 |
+
from __future__ import annotations
|
| 29 |
+
|
| 30 |
+
import json
|
| 31 |
+
import logging
|
| 32 |
+
import re
|
| 33 |
+
from difflib import SequenceMatcher
|
| 34 |
+
from functools import lru_cache
|
| 35 |
+
from pathlib import Path
|
| 36 |
+
from typing import Optional
|
| 37 |
+
|
| 38 |
+
logger = logging.getLogger(__name__)
|
| 39 |
+
|
| 40 |
+
_PHRASEBOOK_DIR = (
|
| 41 |
+
Path(__file__).resolve().parent.parent.parent / "configs" / "dialect_anchors"
|
| 42 |
+
)
|
| 43 |
+
|
| 44 |
+
_PHRASEBOOK_FILE = {
|
| 45 |
+
"bam": "bambara_phrasebook.json",
|
| 46 |
+
"ful": "pular_phrasebook.json",
|
| 47 |
+
}
|
| 48 |
+
|
| 49 |
+
DEFAULT_THRESHOLD = 0.88
|
| 50 |
+
|
| 51 |
+
|
| 52 |
+
def _normalize(text: str) -> str:
|
| 53 |
+
"""Lowercase, strip most punctuation, collapse whitespace."""
|
| 54 |
+
text = (text or "").lower().strip()
|
| 55 |
+
# Keep ASCII apostrophes (e.g. "don't", "b'a"); replace other punctuation with a space.
|
| 56 |
+
text = re.sub(r"[^\w\s']", " ", text, flags=re.UNICODE)
|
| 57 |
+
text = re.sub(r"\s+", " ", text)
|
| 58 |
+
return text.strip()
|
| 59 |
+
|
| 60 |
+
|
| 61 |
+
@lru_cache(maxsize=4)
|
| 62 |
+
def _load_phrasebook(lang: str) -> list[dict]:
|
| 63 |
+
fname = _PHRASEBOOK_FILE.get(lang)
|
| 64 |
+
if not fname:
|
| 65 |
+
return []
|
| 66 |
+
path = _PHRASEBOOK_DIR / fname
|
| 67 |
+
if not path.exists():
|
| 68 |
+
logger.warning("Phrasebook missing: %s", path)
|
| 69 |
+
return []
|
| 70 |
+
with path.open("r", encoding="utf-8") as f:
|
| 71 |
+
data = json.load(f)
|
| 72 |
+
pairs = data.get("pairs", [])
|
| 73 |
+
# Precompute normalised source for speed.
|
| 74 |
+
for p in pairs:
|
| 75 |
+
p["_norm"] = _normalize(p.get("source", ""))
|
| 76 |
+
return pairs
|
| 77 |
+
|
| 78 |
+
|
| 79 |
+
def lookup(
|
| 80 |
+
user_text: str,
|
| 81 |
+
target_lang: str,
|
| 82 |
+
threshold: float = DEFAULT_THRESHOLD,
|
| 83 |
+
) -> Optional[dict]:
|
| 84 |
+
"""Return best curated match for `user_text` in `target_lang`, or None.
|
| 85 |
+
|
| 86 |
+
Short-circuits only for curated dialects (bam, ful). For any other target
|
| 87 |
+
returns None so the caller falls through to the LLM.
|
| 88 |
+
"""
|
| 89 |
+
pairs = _load_phrasebook(target_lang)
|
| 90 |
+
if not pairs:
|
| 91 |
+
return None
|
| 92 |
+
q = _normalize(user_text)
|
| 93 |
+
if not q:
|
| 94 |
+
return None
|
| 95 |
+
|
| 96 |
+
best: Optional[dict] = None
|
| 97 |
+
best_score = 0.0
|
| 98 |
+
for p in pairs:
|
| 99 |
+
src = p.get("_norm", "")
|
| 100 |
+
if not src:
|
| 101 |
+
continue
|
| 102 |
+
if src == q:
|
| 103 |
+
return {
|
| 104 |
+
"source": p.get("source"),
|
| 105 |
+
"target": p.get("target"),
|
| 106 |
+
"category": p.get("category"),
|
| 107 |
+
"score": 1.0,
|
| 108 |
+
"match": "exact",
|
| 109 |
+
}
|
| 110 |
+
score = SequenceMatcher(None, q, src).ratio()
|
| 111 |
+
if score > best_score:
|
| 112 |
+
best_score = score
|
| 113 |
+
best = p
|
| 114 |
+
|
| 115 |
+
if best and best_score >= threshold:
|
| 116 |
+
return {
|
| 117 |
+
"source": best.get("source"),
|
| 118 |
+
"target": best.get("target"),
|
| 119 |
+
"category": best.get("category"),
|
| 120 |
+
"score": round(best_score, 3),
|
| 121 |
+
"match": "fuzzy",
|
| 122 |
+
}
|
| 123 |
+
return None
|
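To get a feel for what the 0.88 threshold accepts, the normalisation and scoring can be reproduced in isolation. This is a minimal sketch under the assumption that it matches `_normalize` and the `SequenceMatcher` call above; `normalize`, `score`, and `THRESHOLD` are local stand-ins, not the module's own names.

```python
import re
from difflib import SequenceMatcher

THRESHOLD = 0.88  # same value as DEFAULT_THRESHOLD in the module


def normalize(text: str) -> str:
    """Lowercase, replace punctuation except apostrophes, collapse whitespace."""
    text = (text or "").lower().strip()
    text = re.sub(r"[^\w\s']", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def score(user_text: str, source: str) -> float:
    """Similarity ratio between the normalised input and a normalised source key."""
    return SequenceMatcher(None, normalize(user_text), normalize(source)).ratio()


print(normalize("Hello / Thank you"))
print(score("Good mornin", "Good morning") >= THRESHOLD)
print(score("Hello", "Hello / Thank you") >= THRESHOLD)
```

A one-letter typo in "Good morning" still clears the threshold, while a bare "Hello" scores 0.5 against the combined key "Hello / Thank you" and falls through to the LLM. Splitting slash-separated sources into separate entries would be one way to catch such inputs, if that behaviour is wanted.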