jefffffff9 committed
Commit d0e28fa · 1 Parent(s): 26659d8

Ground-zero Stages 1–3: dialect anchors + phrasebook short-circuit + Aya-Expanse


Stage 1 — dialect-pinned LLM client (src/llm/minimal_client.py)
Plain-text replacement for GemmaClient's JSON/teacher flow. System prompt
pins Bambara-Mali and Pular-Fuuta-Jallon explicitly, names forbidden
neighbouring languages (Wolof, Hausa, Pulaar-Senegal, Fulfulde-Nigeria,
Jula-CI), and injects a 30-pair bilingual gold list as few-shot anchoring
from configs/dialect_anchors/{bambara_mali,pular_guinea}.json.
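The Stage 1 mechanism can be sketched roughly as below. `build_system_prompt` and the inline `anchor` dict are illustrative stand-ins, not the actual `MinimalClient` internals; only the pair schema (`{"source", "target"}` under `"pairs"`) mirrors the real `configs/dialect_anchors/*.json` files.

```python
# Hypothetical sketch of dialect-pinned prompt assembly; NOT the real
# MinimalClient code. Anchor schema mirrors configs/dialect_anchors/*.json.
FORBIDDEN = ["Wolof", "Hausa", "Pulaar (Senegal)",
             "Fulfulde (Nigeria)", "Jula (Côte d'Ivoire)"]

def build_system_prompt(anchor: dict) -> str:
    lines = [
        f"You reply in {anchor['dialect']} ONLY.",
        "Never drift into: " + ", ".join(FORBIDDEN) + ".",
        anchor.get("notes", ""),
        "Gold examples:",
    ]
    # Few-shot anchoring: every curated pair becomes one prompt line.
    lines += [f"  {p['source']} -> {p['target']}" for p in anchor["pairs"]]
    return "\n".join(line for line in lines if line)

# Tiny inline anchor for illustration (the real file has 30 pairs).
anchor = {
    "dialect": "Bambara as spoken in Bamako, Mali",
    "notes": "Orthography uses ɛ, ɔ, ɲ.",
    "pairs": [{"source": "Good morning", "target": "I ni sɔgɔma"}],
}
prompt = build_system_prompt(anchor)
```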

Stage 2 — curated phrasebook short-circuit (src/llm/phrasebook.py)
100 Bambara + 110 Pular English-keyed pairs across greetings, family,
food, farming, health, shopping, travel, clarity, time, parting. Fuzzy-matched
(threshold 0.88) before every LLM call; a hit returns the gold translation
directly — no drift risk, no LLM-call latency.
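A minimal sketch of such a fuzzy short-circuit using stdlib `difflib` (the real matching in `src/llm/phrasebook.py` may differ; `PHRASEBOOK` here is a three-entry stub, not the curated files):

```python
# Sketch of the phrasebook short-circuit: normalise, fuzzy-match against
# English keys, return the gold target on a >= 0.88 ratio hit.
from difflib import SequenceMatcher

# Stub; real entries live in configs/dialect_anchors/*_phrasebook.json.
PHRASEBOOK = {
    "good morning": "I ni sɔgɔma",
    "how are you?": "I ka kɛnɛ wa?",
    "thank you": "I ni ce",
}

def lookup(text: str, threshold: float = 0.88):
    query = " ".join(text.lower().split())  # cheap normalisation
    best_key, best_score = None, 0.0
    for key in PHRASEBOOK:
        score = SequenceMatcher(None, query, key).ratio()
        if score > best_score:
            best_key, best_score = key, score
    if best_score >= threshold:
        return {"match": best_key, "score": best_score,
                "target": PHRASEBOOK[best_key]}
    return None  # miss -> fall through to the LLM
```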

Stage 3 — default LLM swapped to CohereLabs/aya-expanse-32b
23-language multilingual base with stronger West African coverage than
Qwen 2.5-7B. Overridable via LLM_MODEL_ID.
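The swap itself is config-only. A sketch of the override path (`resolve_model_id` and `build_messages` are illustrative helpers, not repo code; the real call goes through `huggingface_hub.InferenceClient` chat completion):

```python
import os

DEFAULT_LLM = "CohereLabs/aya-expanse-32b"  # Stage 3 default

def resolve_model_id(env=None) -> str:
    # LLM_MODEL_ID wins when set, e.g. Qwen/Qwen2.5-72B-Instruct if the
    # Cohere inference provider is unavailable on the account.
    env = os.environ if env is None else env
    return env.get("LLM_MODEL_ID", DEFAULT_LLM)

def build_messages(system_prompt: str, user_text: str) -> list:
    # Standard chat-completion message shape consumed by the HF client.
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_text},
    ]
```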

Space wiring
- README frontmatter app_file: app.py → app_minimal.py (Space now serves
the minimal baseline; app.py untouched for the full production stack).
- .env auto-loaded via python-dotenv so HF_TOKEN is picked up on launch.
- README updated: minimal-baseline section, Stack + env-var tables,
Run-locally block.

README.md CHANGED
@@ -31,7 +31,46 @@ Two intertwined jobs:
31
  1. **Memory loop** — users *teach* the assistant new words; it persists them to a HuggingFace dataset and uses them as the source of truth in future answers.
32
  2. **Agricultural IoT voice interface** — Sahelian farmers query soil, weather, irrigation, and pest data in their own language, short answers, ≤ 6 words per sentence for clean TTS.
33
 
34
- The core stack is explicitly **100% non-Meta** (Whisper / Qwen / F5-TTS / VITS); MMS-TTS is only used as a baseline fallback.
35
 
36
  ---
37
 
@@ -54,7 +93,9 @@ See `docs/roadmap_2026-04.md` for the full plan and `docs/baseline_rebuild.md` f
54
  | Layer | Tool |
55
  |-------|------|
56
  | STT | `openai/whisper-large-v3-turbo` + PEFT LoRA hot-swap (~50 MB adapter per language, ~50 ms switch) |
57
- | LLM | `Qwen/Qwen2.5-7B-Instruct` (prod default) via HF Serverless InferenceClient — overridable to `Qwen2.5-72B-Instruct`, Mistral, Zephyr |
58
  | TTS (baseline) | `facebook/mms-tts-bam`, `facebook/mms-tts-ful` |
59
  | TTS (Bambara) | `ynnov/ekodi-bambara-tts-female` (Waxal VITS) |
60
  | TTS (Fula) | placeholder → `ous-sow/fula-tts` when published |
@@ -70,7 +111,8 @@ See `docs/roadmap_2026-04.md` for the full plan and `docs/baseline_rebuild.md` f
70
 
71
  | File | Purpose | Lifecycle |
72
  |------|---------|-----------|
73
- | `app.py` | **Production Gradio UI** on HF Spaces. Single-file (~99 KB) by design. Tabs: Conversation / Teaching / Knowledge Base / Self-Teaching. | `python app.py` |
74
  | `app_lab.py` | **Experimental Gradio UI** for prototyping (e.g. `CuriosityEngine`) before folding into `app.py`. | `python app_lab.py` |
75
  | `src/api/app.py` | **FastAPI service** — loads Whisper once, registers `bam`/`ful` adapters via `AdapterManager`, preloads `bam`, attaches `Transcriber` + `SensorBridge` to `app.state`. | `python scripts/run_server.py` |
76
 
@@ -163,7 +205,7 @@ All variables have sensible defaults, so you can boot the Space without any of t
163
  | `FEEDBACK_REPO_ID` | `ous-sow/sahel-agri-feedback` | Memory-loop target dataset. |
164
  | `ADAPTER_REPO_ID` | `ous-sow/sahel-agri-adapters` | Published LoRA adapters. |
165
  | `WHISPER_MODEL_ID` | `openai/whisper-large-v3-turbo` | STT base model. |
166
- | `LLM_MODEL_ID` | `Qwen/Qwen2.5-7B-Instruct` | LLM via HF Serverless. |
167
  | `LOG_LEVEL` | `INFO` | Standard Python logging level. |
168
  | `DEVICE` | `cuda` (FastAPI) | Torch device for inference. |
169
 
@@ -193,8 +235,11 @@ All variables have sensible defaults, so you can boot the Space without any of t
193
  ## Run locally
194
 
195
  ```bash
196
- # Gradio production UI
197
  pip install -r requirements.txt
198
  python app.py
199
 
200
  # FastAPI service
@@ -253,7 +298,7 @@ At minimum:
253
  |-----|-------|
254
  | `HF_TOKEN` | write-scope token |
255
  | `FEEDBACK_REPO_ID` | `ous-sow/sahel-agri-feedback` |
256
- | `LLM_MODEL_ID` | `Qwen/Qwen2.5-7B-Instruct` (or any HF Serverless-supported model) |
257
 
258
  ---
259
 
 
31
  1. **Memory loop** — users *teach* the assistant new words; it persists them to a HuggingFace dataset and uses them as the source of truth in future answers.
32
  2. **Agricultural IoT voice interface** — Sahelian farmers query soil, weather, irrigation, and pest data in their own language, short answers, ≤ 6 words per sentence for clean TTS.
33
 
34
+ The core stack is explicitly **100% non-Meta** (Whisper / Aya-Expanse / F5-TTS / VITS); MMS-TTS is only used as a baseline fallback.
35
+
36
+ ---
37
+
38
+ ## What this Space currently runs — the `ground-zero` minimal baseline
39
+
40
+ The deployed Space (`app_file: app_minimal.py`) is the **Month 1–3 rebuild**
41
+ baseline — a stripped-down Whisper → LLM → MMS-TTS pipeline used for field
42
+ testing and to build a real-user eval set. No LoRA adapters, no memory loop,
43
+ no speaker ID, no voice cloning, no IoT, no phrase matcher. Everything in
44
+ `app.py` still exists for the full production stack; it is just not what the
45
+ Space serves today.
46
+
47
+ Three stacked changes land dialect fidelity without any training:
48
+
49
+ 1. **Stage 1 — dialect-pinned system prompt** (`src/llm/minimal_client.py`).
50
+ Replaces the `GemmaClient` JSON/teacher flow with a plain-text client whose
51
+ system prompt pins the target dialect explicitly — *Bambara as spoken in
52
+ Bamako, Mali* and *Pular of Fuuta Jallon, as spoken in Guinea* — names the
53
+ languages the model must **not** drift into (Wolof, Hausa, Pulaar of
54
+ Senegal, Fulfulde of Nigeria, Jula of Côte d'Ivoire), and injects a 30-pair
55
+ bilingual gold list as few-shot anchoring
56
+ (`configs/dialect_anchors/{bambara_mali,pular_guinea}.json`).
57
+
58
+ 2. **Stage 2 — curated phrasebook short-circuit** (`src/llm/phrasebook.py`).
59
+ Before calling the LLM, the user's input is normalised and fuzzy-matched
60
+ (threshold 0.88) against a curated English-keyed phrasebook
61
+ (`configs/dialect_anchors/{bambara,pular}_phrasebook.json` — 100 Bambara /
62
+ 110 Pular entries across greetings, family, food, farming, health,
63
+ shopping, travel, clarity, time, parting). A hit returns the gold
64
+ translation directly — zero LLM risk, zero latency.
65
+
66
+ 3. **Stage 3 — better multilingual base LLM.**
67
+ Default `LLM_MODEL_ID` is now **`CohereLabs/aya-expanse-32b`**, a 23-language
68
+ multilingual model with much stronger West African coverage than Qwen
69
+ 2.5-7B. Can be overridden via the `LLM_MODEL_ID` env var (e.g. to
70
+ `Qwen/Qwen2.5-72B-Instruct`) if Cohere's inference provider is not
71
+ available on your HF account.
72
+
73
+ See `docs/baseline_rebuild.md` for the broader minimal-track plan.
74
 
75
  ---
76
 
 
93
  | Layer | Tool |
94
  |-------|------|
95
  | STT | `openai/whisper-large-v3-turbo` + PEFT LoRA hot-swap (~50 MB adapter per language, ~50 ms switch) |
96
+ | LLM | `CohereLabs/aya-expanse-32b` (minimal-baseline default, strong African-language coverage) via HF Serverless InferenceClient — overridable to `Qwen/Qwen2.5-72B-Instruct`, `Qwen2.5-7B-Instruct`, Mistral, Zephyr |
97
+ | Dialect anchoring (minimal) | `src/llm/minimal_client.py` — pinned Bambara-Mali / Pular-Guinea system prompt with 30-pair bilingual few-shot + forbidden-drift guardrails |
98
+ | Phrasebook short-circuit (minimal) | `src/llm/phrasebook.py` — 100 Bambara + 110 Pular curated gold pairs, fuzzy-matched (0.88 threshold) before any LLM call |
99
  | TTS (baseline) | `facebook/mms-tts-bam`, `facebook/mms-tts-ful` |
100
  | TTS (Bambara) | `ynnov/ekodi-bambara-tts-female` (Waxal VITS) |
101
  | TTS (Fula) | placeholder → `ous-sow/fula-tts` when published |
 
111
 
112
  | File | Purpose | Lifecycle |
113
  |------|---------|-----------|
114
+ | `app_minimal.py` | **Minimal baseline Gradio UI** — what the HF Space currently serves. Whisper → LLM → MMS-TTS with dialect-pinned prompts + curated phrasebook short-circuit. Tabs: Voice / Text. | `python app_minimal.py` |
115
+ | `app.py` | **Full production Gradio UI** (not currently served on the Space). Single-file (~99 KB) by design. Tabs: Conversation / Teaching / Knowledge Base / Self-Teaching. | `python app.py` |
116
  | `app_lab.py` | **Experimental Gradio UI** for prototyping (e.g. `CuriosityEngine`) before folding into `app.py`. | `python app_lab.py` |
117
  | `src/api/app.py` | **FastAPI service** — loads Whisper once, registers `bam`/`ful` adapters via `AdapterManager`, preloads `bam`, attaches `Transcriber` + `SensorBridge` to `app.state`. | `python scripts/run_server.py` |
118
 
 
205
  | `FEEDBACK_REPO_ID` | `ous-sow/sahel-agri-feedback` | Memory-loop target dataset. |
206
  | `ADAPTER_REPO_ID` | `ous-sow/sahel-agri-adapters` | Published LoRA adapters. |
207
  | `WHISPER_MODEL_ID` | `openai/whisper-large-v3-turbo` | STT base model. |
208
+ | `LLM_MODEL_ID` | `CohereLabs/aya-expanse-32b` | LLM via HF Serverless. Override to any HF Serverless-supported model. |
209
  | `LOG_LEVEL` | `INFO` | Standard Python logging level. |
210
  | `DEVICE` | `cuda` (FastAPI) | Torch device for inference. |
211
 
 
235
  ## Run locally
236
 
237
  ```bash
238
+ # Minimal baseline (what the Space runs)
239
  pip install -r requirements.txt
240
+ python app_minimal.py
241
+
242
+ # Full production UI (not currently on the Space)
243
  python app.py
244
 
245
  # FastAPI service
 
298
  |-----|-------|
299
  | `HF_TOKEN` | write-scope token |
300
  | `FEEDBACK_REPO_ID` | `ous-sow/sahel-agri-feedback` |
301
+ | `LLM_MODEL_ID` | `CohereLabs/aya-expanse-32b` (or any HF Serverless-supported model) |
302
 
303
  ---
304
 
app_minimal.py CHANGED
@@ -12,7 +12,8 @@ Run locally:
12
  Environment variables (all optional except HF_TOKEN, which is needed for the
13
  Qwen HF Serverless call):
14
  HF_TOKEN — HuggingFace token with read access
15
- LLM_MODEL_ID — default "Qwen/Qwen2.5-7B-Instruct"
16
  DEVICE — "cuda" or "cpu" (auto if unset)
17
  LOG_LEVEL — default "INFO"
18
  """
@@ -24,11 +25,20 @@ from typing import Optional, Tuple
24
 
25
  import numpy as np
26
 
27
  # Local imports — the four modules the baseline-rebuild plan authorizes.
28
  # Everything else in src/ is intentionally unused here.
29
  from src.data.bam_normalize import normalize as bam_normalize
30
  from src.engine.whisper_base import WhisperBackbone
31
- from src.llm.gemma_client import GemmaClient
32
  from src.tts.mms_tts import MMSTTSEngine
33
 
34
  logging.basicConfig(
@@ -40,7 +50,7 @@ logger = logging.getLogger(__name__)
40
 
41
  # ── Environment ──────────────────────────────────────────────────────────────
42
  HF_TOKEN = os.environ.get("HF_TOKEN")
43
- LLM_MODEL_ID = os.environ.get("LLM_MODEL_ID", "Qwen/Qwen2.5-7B-Instruct")
44
  _REQUESTED_DEVICE = os.environ.get("DEVICE") # optional override
45
 
46
  LANG_CHOICES = [("Bambara", "bam"), ("Fula", "ful"), ("French", "fr"), ("English", "en")]
@@ -56,20 +66,13 @@ LANG_TO_WHISPER_HINT = {
56
  }
57
 
58
 
59
- def _with_reply_language_directive(user_text: str, output_lang: str) -> str:
60
- """Append an explicit reply-language directive to the user message.
61
-
62
- The LLM's system prompt (in GemmaClient) does not know which language we
63
- want the reply in — it picks based on vibes, which can drift (e.g. to
64
- Wolof). We keep GemmaClient untouched and steer from the user turn.
65
- """
66
- name = LANG_NAMES.get(output_lang, "English")
67
- return f"{user_text}\n\n(Please reply in {name} only.)"
68
 
69
 
70
  # ── Service singletons (lazy-loaded) ────────────────────────────────────────
71
  _backbone: Optional[WhisperBackbone] = None
72
- _llm: Optional[GemmaClient] = None
73
  _tts: Optional[MMSTTSEngine] = None
74
 
75
 
@@ -92,11 +95,11 @@ def get_backbone() -> WhisperBackbone:
92
  return _backbone
93
 
94
 
95
- def get_llm() -> GemmaClient:
96
  global _llm
97
  if _llm is None:
98
- _llm = GemmaClient(model_id=LLM_MODEL_ID, hf_token=HF_TOKEN)
99
- logger.info("LLM client configured: %s", LLM_MODEL_ID)
100
  return _llm
101
 
102
 
@@ -193,17 +196,25 @@ def run_pipeline(
193
  if not transcript:
194
  return "", "(no speech detected)", None
195
 
196
- try:
197
- # No memory loop in minimal — always pass empty vocabulary context.
198
- reply = get_llm().chat(
199
- _with_reply_language_directive(transcript, output_lang),
200
- vocabulary_context="",
201
  )
202
- except Exception as exc: # pragma: no cover
203
- logger.exception("LLM call failed")
204
- return transcript, f"(LLM error: {exc})", None
205
 
206
- reply_text: str = reply.get("response", "") or "(empty reply)"
207
 
208
  try:
209
  wav, sr = get_tts().synthesize(
@@ -234,16 +245,22 @@ def run_text_pipeline(
234
  if not text:
235
  return "(no text entered)", None
236
 
237
- try:
238
- reply = get_llm().chat(
239
- _with_reply_language_directive(text, output_lang),
240
- vocabulary_context="",
 
  )
242
- except Exception as exc: # pragma: no cover
243
- logger.exception("LLM call failed")
244
- return f"(LLM error: {exc})", None
245
 
246
- reply_text: str = reply.get("response", "") or "(empty reply)"
247
 
248
  try:
249
  wav, sr = get_tts().synthesize(
@@ -264,8 +281,10 @@ def build_ui():
264
  with gr.Blocks(title="Sahel-Voice — Minimal Baseline") as demo:
265
  gr.Markdown(
266
  "# 🌾 Sahel-Voice — Minimal Baseline\n"
267
- "Zero-shot Whisper → Qwen → MMS-TTS. No adapters, no memory, no polish. "
268
- "This is the field-test baseline — see `docs/baseline_rebuild.md`."
269
  )
270
 
271
  # Shared across tabs. Split into two so input and output language
 
12
  Environment variables (all optional except HF_TOKEN, which is needed for the
13
  Qwen HF Serverless call):
14
  HF_TOKEN — HuggingFace token with read access
15
+ LLM_MODEL_ID — default "CohereLabs/aya-expanse-32b"
16
+ (23-language multilingual, strong African-language coverage)
17
  DEVICE — "cuda" or "cpu" (auto if unset)
18
  LOG_LEVEL — default "INFO"
19
  """
 
25
 
26
  import numpy as np
27
 
28
+ # Load .env (HF_TOKEN etc.) before reading os.environ below. Silent no-op if
29
+ # python-dotenv is not installed or no .env is present.
30
+ try:
31
+ from dotenv import load_dotenv
32
+ load_dotenv()
33
+ except ImportError:
34
+ pass
35
+
36
  # Local imports — the four modules the baseline-rebuild plan authorizes.
37
  # Everything else in src/ is intentionally unused here.
38
  from src.data.bam_normalize import normalize as bam_normalize
39
  from src.engine.whisper_base import WhisperBackbone
40
+ from src.llm.minimal_client import MinimalClient
41
+ from src.llm.phrasebook import lookup as phrasebook_lookup
42
  from src.tts.mms_tts import MMSTTSEngine
43
 
44
  logging.basicConfig(
 
50
 
51
  # ── Environment ──────────────────────────────────────────────────────────────
52
  HF_TOKEN = os.environ.get("HF_TOKEN")
53
+ LLM_MODEL_ID = os.environ.get("LLM_MODEL_ID", "CohereLabs/aya-expanse-32b")
54
  _REQUESTED_DEVICE = os.environ.get("DEVICE") # optional override
55
 
56
  LANG_CHOICES = [("Bambara", "bam"), ("Fula", "ful"), ("French", "fr"), ("English", "en")]
 
66
  }
67
 
68
 
69
+ # Reply-language steering is handled inside MinimalClient via a dialect-anchored
70
+ # system prompt (see src/llm/minimal_client.py). No per-turn directive needed.
71
 
72
 
73
  # ── Service singletons (lazy-loaded) ────────────────────────────────────────
74
  _backbone: Optional[WhisperBackbone] = None
75
+ _llm: Optional[MinimalClient] = None
76
  _tts: Optional[MMSTTSEngine] = None
77
 
78
 
 
95
  return _backbone
96
 
97
 
98
+ def get_llm() -> MinimalClient:
99
  global _llm
100
  if _llm is None:
101
+ _llm = MinimalClient(model_id=LLM_MODEL_ID, hf_token=HF_TOKEN)
102
+ logger.info("Minimal LLM client configured: %s", LLM_MODEL_ID)
103
  return _llm
104
 
105
 
 
196
  if not transcript:
197
  return "", "(no speech detected)", None
198
 
199
+ # ── Phrasebook short-circuit ──────────────────────────────────────────
200
+ # Canonical greetings/courtesies hit the curated gold phrasebook directly,
201
+ # skipping the LLM entirely. Only fires for bam/ful targets.
202
+ hit = phrasebook_lookup(transcript, output_lang)
203
+ if hit:
204
+ logger.info(
205
+ "Phrasebook hit (%s, score=%.2f): %r → %r [cat=%s]",
206
+ hit["match"], hit["score"], transcript, hit["target"], hit["category"],
207
  )
208
+ reply_text = hit["target"]
209
+ else:
210
+ try:
211
+ # Dialect-anchored plain-string reply (see MinimalClient).
212
+ reply_text = get_llm().chat(transcript, target_lang=output_lang)
213
+ except Exception as exc: # pragma: no cover
214
+ logger.exception("LLM call failed")
215
+ return transcript, f"(LLM error: {exc})", None
216
 
217
+ reply_text = reply_text or "(empty reply)"
218
 
219
  try:
220
  wav, sr = get_tts().synthesize(
 
245
  if not text:
246
  return "(no text entered)", None
247
 
248
+ # ── Phrasebook short-circuit (see voice path above) ──────────────────
249
+ hit = phrasebook_lookup(text, output_lang)
250
+ if hit:
251
+ logger.info(
252
+ "Phrasebook hit (%s, score=%.2f): %r → %r [cat=%s]",
253
+ hit["match"], hit["score"], text, hit["target"], hit["category"],
254
  )
255
+ reply_text = hit["target"]
256
+ else:
257
+ try:
258
+ reply_text = get_llm().chat(text, target_lang=output_lang)
259
+ except Exception as exc: # pragma: no cover
260
+ logger.exception("LLM call failed")
261
+ return f"(LLM error: {exc})", None
262
 
263
+ reply_text = reply_text or "(empty reply)"
264
 
265
  try:
266
  wav, sr = get_tts().synthesize(
 
281
  with gr.Blocks(title="Sahel-Voice — Minimal Baseline") as demo:
282
  gr.Markdown(
283
  "# 🌾 Sahel-Voice — Minimal Baseline\n"
284
+ f"Zero-shot Whisper → {LLM_MODEL_ID} → MMS-TTS, with a curated "
285
+ "Bambara/Pular phrasebook short-circuit in front of the LLM. "
286
+ "No adapters, no memory, no polish. This is the field-test "
287
+ "baseline — see `docs/baseline_rebuild.md`."
288
  )
289
 
290
  # Shared across tabs. Split into two so input and output language
configs/dialect_anchors/bambara_mali.json ADDED
@@ -0,0 +1,37 @@
1
+ {
2
+ "dialect": "Bambara as spoken in Bamako, Mali",
3
+ "iso": "bam",
4
+ "notes": "Curated 30-phrase gold list. Orthography uses ɛ, ɔ, ɲ. Elisions (t', b', k') are preserved as in standard written Mali Bambara. Do NOT substitute with Jula/Dyula (Côte d'Ivoire) forms.",
5
+ "pairs": [
6
+ {"source": "Good morning / Bonjour", "target": "I ni sɔgɔma"},
7
+ {"source": "Good afternoon / Bon après-midi", "target": "I ni tile"},
8
+ {"source": "Good evening / Bonsoir", "target": "I ni wula"},
9
+ {"source": "Hello (general) / Salut", "target": "I ni ce"},
10
+ {"source": "Thank you / Merci", "target": "I ni ce"},
11
+ {"source": "How are you? / Comment vas-tu ?", "target": "I ka kɛnɛ wa?"},
12
+ {"source": "I am fine. / Je vais bien.", "target": "Kɛnɛ, tɔɔrɔ tɛ."},
13
+ {"source": "How is the family? / Comment va la famille ?", "target": "Sɔmɔgɔw bɛ di?"},
14
+ {"source": "They are fine. / Ils vont bien.", "target": "Tɔɔrɔ t'u la."},
15
+ {"source": "What is your name? / Comment t'appelles-tu ?", "target": "I tɔgɔ bi di?"},
16
+ {"source": "My name is... / Je m'appelle...", "target": "Ne tɔgɔ ye..."},
17
+ {"source": "Where are you going? / Où vas-tu ?", "target": "I bɛ taa min?"},
18
+ {"source": "I am going to the market. / Je vais au marché.", "target": "N bɛ taa sugu la."},
19
+ {"source": "How much is this? / C'est combien ?", "target": "Nin ye joli ye?"},
20
+ {"source": "It is too expensive. / C'est trop cher.", "target": "A da ka gɛlɛn."},
21
+ {"source": "Please / S'il vous plaît", "target": "Hakɛ to"},
22
+ {"source": "I am sorry / Je suis désolé", "target": "Yafa n ma"},
23
+ {"source": "I don't understand / Je ne comprends pas", "target": "N m'a faamu"},
24
+ {"source": "Speak slowly / Parle doucement", "target": "Kuma dɔɔni dɔɔni"},
25
+ {"source": "I am hungry / J'ai faim", "target": "Kɔngɔ bɛ n na"},
26
+ {"source": "I want to eat / Je veux manger", "target": "N b'a fɛ ka dumu"},
27
+ {"source": "Give me water / Donne-moi de l'eau", "target": "Ji di n ma"},
28
+ {"source": "How is the work/field? / Comment va le travail/champ ?", "target": "Baara bɛ di? / Sɛnɛ bɛ di?"},
29
+ {"source": "The work is good. / Le travail va bien.", "target": "Baara bɛ kɛnɛ."},
30
+ {"source": "Where is the doctor? / Où est le docteur ?", "target": "Dɔkɔtɔrɔ bɛ min?"},
31
+ {"source": "I am tired / Je suis fatigué", "target": "N sɛgɛnna"},
32
+ {"source": "See you tomorrow / À demain", "target": "K'an bɛn sini"},
33
+ {"source": "Goodbye / Au revoir", "target": "K'an bɛn"},
34
+ {"source": "God bless you / Que Dieu te bénisse", "target": "Ala ka duga i ye"},
35
+ {"source": "Peace only / La paix seulement", "target": "Hɛɛrɛ dɔrɔn"}
36
+ ]
37
+ }
configs/dialect_anchors/bambara_phrasebook.json ADDED
@@ -0,0 +1,107 @@
1
+ {
2
+ "dialect": "Bambara as spoken in Bamako, Mali",
3
+ "iso": "bam",
4
+ "notes": "Curated 100-phrase field phrasebook, organized by conversational category. Used by the phrasebook short-circuit in src/llm/phrasebook.py — English-keyed, fuzzy-matched. Do NOT substitute with Jula/Dyula (Côte d'Ivoire) forms.",
5
+ "pairs": [
6
+ {"category": "Greetings", "source": "Hello / Thank you", "target": "I ni ce"},
7
+ {"category": "Greetings", "source": "Good morning", "target": "I ni sɔgɔma"},
8
+ {"category": "Greetings", "source": "Good afternoon", "target": "I ni tile"},
9
+ {"category": "Greetings", "source": "Good evening", "target": "I ni wula"},
10
+ {"category": "Greetings", "source": "Welcome", "target": "I ni dɔn"},
11
+ {"category": "Greetings", "source": "How are you?", "target": "I ka kɛnɛ wa?"},
12
+ {"category": "Greetings", "source": "Fine, no trouble", "target": "Kɛnɛ, tɔɔrɔ tɛ"},
13
+ {"category": "Greetings", "source": "How was the night?", "target": "Sini kɛnɛ?"},
14
+ {"category": "Greetings", "source": "How was the work?", "target": "Baara ni ce"},
15
+ {"category": "Greetings", "source": "Well done", "target": "I ni baara"},
16
+ {"category": "Identity", "source": "What is your name?", "target": "I tɔgɔ bi di?"},
17
+ {"category": "Identity", "source": "My name is...", "target": "Ne tɔgɔ ye..."},
18
+ {"category": "Identity", "source": "Where are you from?", "target": "I bɔra min?"},
19
+ {"category": "Identity", "source": "I am from...", "target": "N bɔra..."},
20
+ {"category": "Identity", "source": "What is your work?", "target": "I bɛ mun baara kɛ?"},
21
+ {"category": "Family", "source": "How is the family?", "target": "Sɔmɔgɔw bɛ di?"},
22
+ {"category": "Family", "source": "How is your wife?", "target": "I muso bɛ di?"},
23
+ {"category": "Family", "source": "How is your husband?", "target": "I tigi bɛ di?"},
24
+ {"category": "Family", "source": "How are the children?", "target": "Denmisɛnw bɛ di?"},
25
+ {"category": "Family", "source": "How is the baby?", "target": "Denu bɛ di?"},
26
+ {"category": "Family", "source": "They are fine", "target": "Tɔɔrɔ t'u la"},
27
+ {"category": "Family", "source": "My father is well", "target": "N fa bɛ kɛnɛ"},
28
+ {"category": "Family", "source": "My mother is well", "target": "N ba bɛ kɛnɛ"},
29
+ {"category": "Family", "source": "Are you married?", "target": "I furula wa?"},
30
+ {"category": "Food/Water", "source": "I am hungry", "target": "Kɔngɔ bɛ n na"},
31
+ {"category": "Food/Water", "source": "I am thirsty", "target": "Min nɔgɔ bɛ n na"},
32
+ {"category": "Food/Water", "source": "I want to eat", "target": "N b'a fɛ ka dumu"},
33
+ {"category": "Food/Water", "source": "Give me water", "target": "Ji di n ma"},
34
+ {"category": "Food/Water", "source": "The food is sweet", "target": "Dumuni ka di"},
35
+ {"category": "Food/Water", "source": "I am full", "target": "N fara"},
36
+ {"category": "Food/Water", "source": "Bread", "target": "Buruburu"},
37
+ {"category": "Food/Water", "source": "Rice", "target": "Malo"},
38
+ {"category": "Food/Water", "source": "Meat", "target": "Sogo"},
39
+ {"category": "Food/Water", "source": "Tea", "target": "Te"},
40
+ {"category": "Food/Water", "source": "Sugar", "target": "Sukaro"},
41
+ {"category": "Farming", "source": "How is the farming?", "target": "Sɛnɛ bɛ di?"},
42
+ {"category": "Farming", "source": "It rained today", "target": "Sanji nna bi"},
43
+ {"category": "Farming", "source": "The field", "target": "Sɛnɛfɛla"},
44
+ {"category": "Farming", "source": "Maize / Corn", "target": "Kaba"},
45
+ {"category": "Farming", "source": "Cow", "target": "Misi"},
46
+ {"category": "Farming", "source": "Sheep", "target": "Saga"},
47
+ {"category": "Farming", "source": "Goat", "target": "Ba"},
48
+ {"category": "Farming", "source": "Chicken", "target": "Shɛ"},
49
+ {"category": "Farming", "source": "Where is the hoe?", "target": "Daba bɛ min?"},
50
+ {"category": "Farming", "source": "We are working", "target": "An bɛ baara kɛ"},
51
+ {"category": "Health", "source": "I am sick", "target": "N bana"},
52
+ {"category": "Health", "source": "My head hurts", "target": "N kungolo bɛ n dimi"},
53
+ {"category": "Health", "source": "My stomach hurts", "target": "N kɔnɔ bɛ n dimi"},
54
+ {"category": "Health", "source": "I have fever", "target": "Sumaya bɛ n na"},
55
+ {"category": "Health", "source": "Where is the hospital?", "target": "Ɲɛnajɛso bɛ min?"},
56
+ {"category": "Health", "source": "Where is the doctor?", "target": "Dɔkɔtɔrɔ bɛ min?"},
57
+ {"category": "Health", "source": "Take the medicine", "target": "Fura min"},
58
+ {"category": "Health", "source": "Drink this", "target": "Nin min"},
59
+ {"category": "Health", "source": "Lie down", "target": "I la"},
60
+ {"category": "Health", "source": "Do you feel better?", "target": "A ka fisa wa?"},
61
+ {"category": "Shopping", "source": "How much?", "target": "Joli ye?"},
62
+ {"category": "Shopping", "source": "It is too much", "target": "A ka ca"},
63
+ {"category": "Shopping", "source": "Reduce it", "target": "Dɔɔni dɔɔni bɔ a la"},
64
+ {"category": "Shopping", "source": "I have no money", "target": "Wari tɛ n fɛ"},
65
+ {"category": "Shopping", "source": "Here is the money", "target": "Wari filɛ"},
66
+ {"category": "Shopping", "source": "Market", "target": "Sugu"},
67
+ {"category": "Shopping", "source": "Shop", "target": "Butiki"},
68
+ {"category": "Shopping", "source": "Soap", "target": "Safinɛ"},
69
+ {"category": "Shopping", "source": "Oil", "target": "Tulu"},
70
+ {"category": "Shopping", "source": "Salt", "target": "Kɔgɔ"},
71
+ {"category": "Travel", "source": "Where is the road?", "target": "Sira bɛ min?"},
72
+ {"category": "Travel", "source": "Is it far?", "target": "A ka jan wa?"},
73
+ {"category": "Travel", "source": "It is close", "target": "A surunya"},
74
+ {"category": "Travel", "source": "Turn right", "target": "Kini bolo fɛ"},
75
+ {"category": "Travel", "source": "Turn left", "target": "Numa bolo fɛ"},
76
+ {"category": "Travel", "source": "Stop here", "target": "I jɔ yan"},
77
+ {"category": "Travel", "source": "Let's go", "target": "An ka taa"},
78
+ {"category": "Travel", "source": "Car", "target": "Mobili"},
79
+ {"category": "Travel", "source": "Bus", "target": "Sɔta"},
80
+ {"category": "Travel", "source": "Motorbike", "target": "Nɛgɛso"},
81
+ {"category": "Clarity", "source": "I understand", "target": "N n'a faamu"},
82
+ {"category": "Clarity", "source": "I don't understand", "target": "N m'a faamu"},
83
+ {"category": "Clarity", "source": "Repeat it", "target": "Segi a kan"},
84
+ {"category": "Clarity", "source": "Speak slowly", "target": "Kuma dɔɔni dɔɔni"},
85
+ {"category": "Clarity", "source": "Do you speak Bambara?", "target": "I bɛ Bamanankan mɛn wa?"},
86
+ {"category": "Clarity", "source": "A little", "target": "Dɔɔni dɔɔni"},
87
+ {"category": "Clarity", "source": "I don't know", "target": "N m'a lɔn"},
88
+ {"category": "Clarity", "source": "Yes", "target": "Awɔ"},
89
+ {"category": "Clarity", "source": "No", "target": "Ayi"},
90
+ {"category": "Clarity", "source": "Wait", "target": "Kɔnɔ"},
91
+ {"category": "Time", "source": "Today", "target": "Bi"},
92
+ {"category": "Time", "source": "Tomorrow", "target": "Sini"},
93
+ {"category": "Time", "source": "Yesterday", "target": "Kunu"},
94
+ {"category": "Time", "source": "Now", "target": "Sisan"},
95
+ {"category": "Time", "source": "Later", "target": "Kɔfɛ"},
96
+ {"category": "Parting", "source": "Goodbye", "target": "K'an bɛn"},
97
+ {"category": "Parting", "source": "Until later", "target": "K'an bɛn kɔfɛ"},
98
+ {"category": "Parting", "source": "Until tomorrow", "target": "K'an bɛn sini"},
99
+ {"category": "Parting", "source": "Have a good day", "target": "Tile hɛɛrɛ"},
100
+ {"category": "Parting", "source": "Have a good night", "target": "Su hɛɛrɛ"},
101
+ {"category": "Parting", "source": "Go in peace", "target": "Taa hɛɛrɛ la"},
102
+ {"category": "Parting", "source": "God bless you", "target": "Ala ka duga i ye"},
103
+ {"category": "Parting", "source": "God willing", "target": "Ala sɔnna"},
104
+ {"category": "Parting", "source": "Thank God", "target": "Ala tando"},
105
+ {"category": "Parting", "source": "Peace only", "target": "Hɛɛrɛ dɔrɔn"}
106
+ ]
107
+ }
configs/dialect_anchors/pular_guinea.json ADDED
@@ -0,0 +1,37 @@
1
+ {
2
+ "dialect": "Pular of Fuuta Jallon, as spoken in Guinea",
+ "iso": "ful",
+ "notes": "Curated 30-phrase gold list, cross-checked against the Peace Corps Guinea 2015 Pular manual. Orthography uses ɓ, ɗ, ñ, ŋ. Signature Fuuta Jallon markers: 'Miɗo yaha' (1sg progressive), 'No ... wa'i' (how is), 'Jam tun' (peace only response), 'A jaraama' (thank you / hello). Do NOT substitute with Pulaar (Senegal) or Fulfulde (Nigeria, Cameroon) forms.",
+ "pairs": [
+ {"source": "Hello / Thank you (General)", "target": "A jaraama"},
+ {"source": "Good morning (Did you sleep in peace?)", "target": "On walli e jam?"},
+ {"source": "Good afternoon (Have you spent the day in peace?)", "target": "On ñalli e jam?"},
+ {"source": "Good evening (Have you spent the evening in peace?)", "target": "On hiiri e jam?"},
+ {"source": "Peace only (Standard response)", "target": "Jam tun"},
+ {"source": "How are you? / How is it?", "target": "No wa'i?"},
+ {"source": "Is there any trouble? / Is it okay?", "target": "Tana alaa?"},
+ {"source": "No trouble / Fine", "target": "Tana alaa"},
+ {"source": "Thank you (Respectful/Plural)", "target": "On jaraama"},
+ {"source": "How is the family?", "target": "No ɓeyngure nden wa'i?"},
+ {"source": "How are the children?", "target": "No fayɓe ɓen wa'i?"},
+ {"source": "What is your name?", "target": "Innde maa ko woni?"},
+ {"source": "My name is...", "target": "Innde am ko..."},
+ {"source": "Where are you going?", "target": "Hoto yahataa?"},
+ {"source": "I am going to the market", "target": "Miɗo yaha ka sugu"},
+ {"source": "Please (I ask you)", "target": "Mi yidiima"},
+ {"source": "Excuse me / Sorry", "target": "Accu hakke"},
+ {"source": "I understand", "target": "Mi faamii"},
+ {"source": "I don't understand", "target": "Mi faamaali"},
+ {"source": "Do you speak Pular?", "target": "Aɗa waawi Pular?"},
+ {"source": "Just a little bit", "target": "Seeɗa tun"},
+ {"source": "I want water", "target": "Miɗo yiɗi ndiyam"},
+ {"source": "Give me...", "target": "Okku am..."},
+ {"source": "How much is it?", "target": "Ko jelu?"},
+ {"source": "It is expensive", "target": "No tiiɗi"},
+ {"source": "God bless you", "target": "Alla duga maa"},
+ {"source": "If God wills (God willing)", "target": "Si Alla jaɓii"},
+ {"source": "Goodbye (Formal)", "target": "Oo-o"},
+ {"source": "Until tomorrow (See you tomorrow)", "target": "En jango"},
+ {"source": "Go in peace", "target": "Yahu e jam"}
+ ]
+ }
configs/dialect_anchors/pular_phrasebook.json ADDED
@@ -0,0 +1,117 @@
+ {
+ "dialect": "Pular of Fuuta Jallon, as spoken in Guinea",
+ "iso": "ful",
+ "notes": "Curated 110-phrase field phrasebook, organized by conversational category. Used by the phrasebook short-circuit in src/llm/phrasebook.py — English-keyed, fuzzy-matched. Cross-checked against Peace Corps Guinea 2015 Pular manual. Do NOT substitute with Pulaar (Senegal) or Fulfulde (Nigeria/Cameroon) forms.",
+ "pairs": [
+ {"category": "Greetings", "source": "Hello / Thank you", "target": "A jaraama"},
+ {"category": "Greetings", "source": "Good morning", "target": "On walli e jam?"},
+ {"category": "Greetings", "source": "Good afternoon", "target": "On ñalli e jam?"},
+ {"category": "Greetings", "source": "Good evening", "target": "On hiiri e jam?"},
+ {"category": "Greetings", "source": "Peace only (Response)", "target": "Jam tun"},
+ {"category": "Greetings", "source": "How are you?", "target": "No wa'i?"},
+ {"category": "Greetings", "source": "Is there any trouble?", "target": "Tana alaa?"},
+ {"category": "Greetings", "source": "No trouble", "target": "Tana alaa"},
+ {"category": "Greetings", "source": "How is the heat/weather?", "target": "Ho no yasi ken waye?"},
+ {"category": "Greetings", "source": "Welcome", "target": "Tana alaa"},
+ {"category": "Identity", "source": "What is your name?", "target": "Ko ho no inne te dah?"},
+ {"category": "Identity", "source": "My name is...", "target": "Innde am ko..."},
+ {"category": "Identity", "source": "Where are you coming from?", "target": "Hoto iwruɗaa?"},
+ {"category": "Identity", "source": "Where are you from?", "target": "Eewdi maadiin koh hontoh?"},
+ {"category": "Identity", "source": "I am coming from...", "target": "Mi iwri ko..."},
+ {"category": "Identity", "source": "I am from", "target": "Eewdi an diin koh"},
+ {"category": "Identity", "source": "I am a farmer", "target": "Koh mee rehmohwoh"},
+ {"category": "Family", "source": "How is the family?", "target": "No ɓeyngure nden wa'i?"},
+ {"category": "Family", "source": "How is the woman?", "target": "No debbo on wa'i?"},
+ {"category": "Family", "source": "How is your wife?", "target": "No mbehgu ma wa'i?"},
+ {"category": "Family", "source": "How is your husband?", "target": "No mohdi ma wa'i?"},
+ {"category": "Family", "source": "How is the man?", "target": "No gorko on wa'i?"},
+ {"category": "Family", "source": "How are the children?", "target": "No fayɓe ɓen wa'i?"},
+ {"category": "Family", "source": "How is the baby?", "target": "No boobo on wa'i?"},
+ {"category": "Family", "source": "Everyone is fine", "target": "Hiɓe e jam"},
+ {"category": "Family", "source": "My father is well", "target": "Baba am no e jam"},
+ {"category": "Family", "source": "My mother is well", "target": "Neene am no e jam"},
+ {"category": "Family", "source": "How many children?", "target": "Fayɓe ben ko jelu?"},
+ {"category": "Food/Water", "source": "I am hungry", "target": "Mi weelaa maa"},
+ {"category": "Food/Water", "source": "I am thirsty", "target": "Miɗo ɗonɗa"},
+ {"category": "Food/Water", "source": "I want to eat", "target": "Miɗo faalaa ñaamude"},
+ {"category": "Food/Water", "source": "Give me water", "target": "Okku am ndiyam"},
+ {"category": "Food/Water", "source": "The food is good", "target": "Ñaameteeɗon no weli"},
+ {"category": "Food/Water", "source": "I am full", "target": "Mi haraama"},
+ {"category": "Food/Water", "source": "Bread", "target": "Biirehdi"},
+ {"category": "Food/Water", "source": "Rice", "target": "Maaro"},
+ {"category": "Food/Water", "source": "Milk", "target": "Mɓeerah"},
+ {"category": "Food/Water", "source": "Sour Cream", "target": "Kosam"},
+ {"category": "Food/Water", "source": "Hot water", "target": "Ndiyam wuuldham"},
+ {"category": "Food/Water", "source": "Cold water", "target": "Ndiyam ɓuuɓudham"},
+ {"category": "Food/Water", "source": "Coffee", "target": "Kafe"},
+ {"category": "Food/Water", "source": "Sugar", "target": "Sukkar"},
+ {"category": "Farming", "source": "How is the farming?", "target": "No ngsa kan wa'i?"},
+ {"category": "Farming", "source": "The rain is good", "target": "Ndiyam ndan no moƴƴi"},
+ {"category": "Farming", "source": "The field", "target": "Ngesa"},
+ {"category": "Farming", "source": "Garden", "target": "Suntuure"},
+ {"category": "Farming", "source": "Cattle / Cows", "target": "Nai"},
+ {"category": "Farming", "source": "Sheep", "target": "Baali"},
+ {"category": "Farming", "source": "Goat", "target": "Mbeewa"},
+ {"category": "Farming", "source": "Chicken", "target": "Gertogal"},
+ {"category": "Farming", "source": "Where is the thing?", "target": "Hoto huunde nden woni?"},
+ {"category": "Farming", "source": "To cultivate or to farm", "target": "Remugol"},
+ {"category": "Farming", "source": "To sow or plant seeds", "target": "Aawugol"},
+ {"category": "Farming", "source": "To harvest", "target": "Heptugol"},
+ {"category": "Farming", "source": "We are working (speaking to the person I'm working with)", "target": "Hiɗen e golle"},
+ {"category": "Farming", "source": "We are working (speaking to another person not working with us)", "target": "Meein gollu deh"},
+ {"category": "Health", "source": "I am sick", "target": "Miɗo nawni"},
+ {"category": "Health", "source": "My head hurts", "target": "Hoore am den no muusa"},
+ {"category": "Health", "source": "My stomach hurts", "target": "Reedu am doun no muusa"},
+ {"category": "Health", "source": "I have fever", "target": "Miɗo jogi yontere"},
+ {"category": "Health", "source": "Where is the clinic?", "target": "Hoto kilinik on woni?"},
+ {"category": "Health", "source": "Where is the doctor?", "target": "Hoto dɔkɔtɔrɔ on woni?"},
+ {"category": "Health", "source": "Take this medicine", "target": "Jehhtu leki kin"},
+ {"category": "Health", "source": "Drink this", "target": "Yaru ɗun"},
+ {"category": "Health", "source": "Rest now", "target": "Fow'w toh"},
+ {"category": "Health", "source": "Are you better?", "target": "Aɗa selli jooni?"},
+ {"category": "Shopping", "source": "How much is this?", "target": "Dounn ko jelu?"},
+ {"category": "Shopping", "source": "It is too expensive", "target": "No sahtee"},
+ {"category": "Shopping", "source": "Reduce the price", "target": "Dhuitah nam seeɗa"},
+ {"category": "Shopping", "source": "I have no money", "target": "Mi alaa buudi"},
+ {"category": "Shopping", "source": "Here is the money", "target": "Hinoh buudi dinn"},
+ {"category": "Shopping", "source": "Market", "target": "Luhmoh"},
+ {"category": "Shopping", "source": "Shop / Boutique", "target": "Bitiki"},
+ {"category": "Shopping", "source": "Soap", "target": "Sabunnde"},
+ {"category": "Shopping", "source": "Matches", "target": "Almet"},
+ {"category": "Shopping", "source": "Salt", "target": "Landan"},
+ {"category": "Travel", "source": "Where is the road to...?", "target": "Hoto ngol laawol yahata...?"},
+ {"category": "Travel", "source": "Is it far?", "target": "No woɗɗi?"},
+ {"category": "Travel", "source": "It is near", "target": "No ɓadii"},
+ {"category": "Travel", "source": "Turn right", "target": "Ýillu ka ñaamo"},
+ {"category": "Travel", "source": "Turn left", "target": "Ýillu ka nannoh"},
+ {"category": "Travel", "source": "Stop here", "target": "Daroh ɗoo"},
+ {"category": "Travel", "source": "Let's go", "target": "Mah een"},
+ {"category": "Travel", "source": "Car / Taxi", "target": "Oto"},
+ {"category": "Travel", "source": "Bicycle", "target": "Velo"},
+ {"category": "Travel", "source": "Motorcycle", "target": "Moto"},
+ {"category": "Clarity", "source": "I understand", "target": "Mi faamii"},
+ {"category": "Clarity", "source": "I don't understand", "target": "Mi faamaali"},
+ {"category": "Clarity", "source": "Please repeat", "target": "Fultu kadi"},
+ {"category": "Clarity", "source": "Speak slowly", "target": "Halu seeɗa seeɗa"},
+ {"category": "Clarity", "source": "Do you speak French?", "target": "Aɗa waawi Faransi?"},
+ {"category": "Clarity", "source": "I can just a little", "target": "Mi nan waawi seeɗa tun"},
+ {"category": "Clarity", "source": "I don't know", "target": "Mi andaa"},
+ {"category": "Clarity", "source": "Yes", "target": "Eyyo / Hii'hi"},
+ {"category": "Clarity", "source": "No", "target": "O'o"},
+ {"category": "Clarity", "source": "Wait", "target": "Sabboh"},
+ {"category": "Time", "source": "Today", "target": "Hannde"},
+ {"category": "Time", "source": "Tomorrow", "target": "Jango"},
+ {"category": "Time", "source": "Yesterday", "target": "Hanki"},
+ {"category": "Time", "source": "Now", "target": "Joni"},
+ {"category": "Time", "source": "Later", "target": "On tuma"},
+ {"category": "Parting", "source": "Goodbye", "target": "Oo-o"},
+ {"category": "Parting", "source": "See you later", "target": "En on tuma"},
+ {"category": "Parting", "source": "See you tomorrow", "target": "En jango"},
+ {"category": "Parting", "source": "Have a good day", "target": "Ñallu e jam"},
+ {"category": "Parting", "source": "Have a good night", "target": "Waalu e jam"},
+ {"category": "Parting", "source": "Go in peace", "target": "Yahu e jam"},
+ {"category": "Parting", "source": "God willing", "target": "Si Alla jaɓii"},
+ {"category": "Parting", "source": "Thank God", "target": "Ko ýettude Alla"},
+ {"category": "Parting", "source": "Peace only", "target": "Jam tun"}
+ ]
+ }
src/llm/minimal_client.py ADDED
@@ -0,0 +1,179 @@
+ """MinimalClient — dialect-anchored plain-text LLM client for the Month 1–3 rebuild.
+
+ Why this exists (and not GemmaClient):
+ GemmaClient wraps every reply in a JSON object and runs a "teacher / child"
+ intent-classification flow. That's fine for the full app, but for the minimal
+ baseline it (a) spends model capacity on JSON compliance, (b) lets the model
+ drift into neighbouring languages (Wolof, Hausa, Pulaar of Senegal, Fulfulde
+ of Nigeria, Jula of Côte d'Ivoire), and (c) produces text that isn't clean
+ for TTS.
+
+ This client instead:
+ - pins the target dialect explicitly (Bambara / Bamako–Mali or Pular / Fuuta
+   Jallon–Guinea),
+ - injects the curated 30-phrase gold list for the target language as
+   few-shot anchoring in the system prompt,
+ - names forbidden neighbouring languages the model must not code-switch to,
+ - returns a plain string, ready for MMS-TTS.
+
+ GemmaClient and app.py are intentionally untouched.
+ """
+ from __future__ import annotations
+
+ import json
+ import logging
+ from functools import lru_cache
+ from pathlib import Path
+ from typing import Optional
+
+ logger = logging.getLogger(__name__)
+
+ # configs/dialect_anchors/*.json lives at <repo>/configs/dialect_anchors
+ _ANCHOR_DIR = (
+     Path(__file__).resolve().parent.parent.parent / "configs" / "dialect_anchors"
+ )
+
+ _ANCHOR_FILE = {
+     "bam": "bambara_mali.json",
+     "ful": "pular_guinea.json",
+ }
+
+ LANG_FULL_NAME = {
+     "bam": "Bambara as spoken in Bamako, Mali",
+     "ful": "Pular of Fuuta Jallon, as spoken in Guinea",
+     "fr": "French",
+     "en": "English",
+ }
+
+ # Neighbouring languages the model is most likely to drift into. Empty for
+ # fr/en — we don't need to fence those.
+ FORBIDDEN_DRIFT = {
+     "bam": (
+         "Jula / Dyula of Côte d'Ivoire, Wolof, Hausa, Swahili, Lingala, "
+         "or any other African language"
+     ),
+     "ful": (
+         "Pulaar of Senegal, Fulfulde of Nigeria or Cameroon, Wolof, Hausa, "
+         "Swahili, or any other African language"
+     ),
+     "fr": "",
+     "en": "",
+ }
+
+
+ @lru_cache(maxsize=4)
+ def _load_anchors(lang: str) -> list[dict]:
+     """Load the curated gold-phrase list for `lang`. Cached per process."""
+     fname = _ANCHOR_FILE.get(lang)
+     if not fname:
+         return []
+     path = _ANCHOR_DIR / fname
+     if not path.exists():
+         logger.warning("Dialect anchor file missing: %s", path)
+         return []
+     with path.open("r", encoding="utf-8") as f:
+         data = json.load(f)
+     return data.get("pairs", [])
+
+
+ def _build_system_prompt(target_lang: str) -> str:
+     """Assemble the per-call system prompt for a target output language."""
+     full = LANG_FULL_NAME.get(target_lang, "English")
+     forbidden = FORBIDDEN_DRIFT.get(target_lang, "")
+     anchors = _load_anchors(target_lang)
+
+     lines: list[str] = [
+         f"You are a warm, concise conversational assistant that replies ONLY in {full}.",
+         "",
+         "Output format: plain natural text only. No JSON, no code fences, no "
+         "markdown, no translations, no romanisation, no explanations. Reply in "
+         "1–3 short sentences suitable to be read aloud by a text-to-speech voice.",
+     ]
+
+     if forbidden:
+         lines += [
+             "",
+             (
+                 f"CRITICAL — dialect fidelity: do NOT use, mix, or substitute words "
+                 f"from {forbidden}. If you are not confident a word belongs to "
+                 f"{full}, rephrase using simpler vocabulary you are certain of, or "
+                 f"apologise briefly in {full} (for example that you did not "
+                 f"understand)."
+             ),
+         ]
+
+     if anchors:
+         lines += [
+             "",
+             f"Reference phrases in {full} — use this exact orthography, spelling, "
+             "and dialectal style as your model for every reply:",
+         ]
+         for item in anchors:
+             src = item.get("source", "").strip()
+             tgt = item.get("target", "").strip()
+             if src and tgt:
+                 lines.append(f"- {src} → {tgt}")
+
+     lines += [
+         "",
+         f"Always reply in {full}, even if the user writes to you in English, "
+         "French, or another language. Never translate your own reply.",
+     ]
+     return "\n".join(lines)
+
+
+ class MinimalClient:
+     """Dialect-anchored plain-text LLM client over HF Serverless Inference.
+
+     Usage:
+         client = MinimalClient(model_id="Qwen/Qwen2.5-7B-Instruct", hf_token=TOK)
+         reply = client.chat("Good morning", target_lang="bam")
+         # → "I ni sɔgɔma. I ka kɛnɛ wa?"
+     """
+
+     def __init__(
+         self,
+         model_id: str = "CohereLabs/aya-expanse-32b",
+         hf_token: Optional[str] = None,
+     ) -> None:
+         self.model_id = model_id
+         self.hf_token = hf_token
+         self._client = None  # lazy init
+
+     def _get_client(self):
+         if self._client is None:
+             from huggingface_hub import InferenceClient
+             self._client = InferenceClient(token=self.hf_token)
+         return self._client
+
+     def chat(self, user_text: str, target_lang: str = "bam") -> str:
+         """Return a plain-text reply in `target_lang`.
+
+         On any error returns a short parenthetical error string so the caller
+         can still feed something into TTS / display.
+         """
+         system_prompt = _build_system_prompt(target_lang)
+         try:
+             client = self._get_client()
+             completion = client.chat_completion(
+                 model=self.model_id,
+                 messages=[
+                     {"role": "system", "content": system_prompt},
+                     {"role": "user", "content": user_text},
+                 ],
+                 max_tokens=256,
+                 temperature=0.3,
+             )
+             raw = (completion.choices[0].message.content or "").strip()
+             # Defensive: strip any stray code fences the model may emit anyway.
+             if raw.startswith("```"):
+                 raw = raw.strip("`").strip()
+             # If a language tag slipped in on the first line, drop it.
+             if "\n" in raw:
+                 first, rest = raw.split("\n", 1)
+                 if len(first) < 20 and " " not in first:
+                     raw = rest.strip()
+             return raw
+         except Exception as exc:  # pragma: no cover — surfaced to UI
+             logger.error("MinimalClient error: %s", exc)
+             return f"(LLM unavailable: {exc})"
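The defensive cleanup at the end of `chat()` is the only non-trivial logic that runs on every reply, so it is worth seeing in isolation. A minimal sketch (`clean_reply` is an illustrative name, not part of the committed file; it mirrors the two post-processing steps above) that can be exercised without any network call:

```python
def clean_reply(raw: str) -> str:
    """Mirror of MinimalClient.chat's post-processing: strip stray code
    fences, then drop a short tag-like first line (e.g. a language code
    the model prepended despite the plain-text instruction)."""
    raw = (raw or "").strip()
    # Remove leading/trailing backtick fences if the model emitted them anyway.
    if raw.startswith("```"):
        raw = raw.strip("`").strip()
    # A short, space-free first line is likely a fence-info or language tag.
    if "\n" in raw:
        first, rest = raw.split("\n", 1)
        if len(first) < 20 and " " not in first:
            raw = rest.strip()
    return raw
```

Note the heuristic is deliberately conservative: a normal one-sentence reply contains spaces, so it is never truncated; only a bare token on its own first line is dropped.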
src/llm/phrasebook.py ADDED
@@ -0,0 +1,123 @@
+ """Phrasebook short-circuit — skip the LLM when the user hits a curated phrase.
+
+ Purpose
+     For the 80% of field-demo inputs that are canonical greetings, courtesies,
+     or basic questions, the LLM adds risk (dialect drift, hallucination,
+     latency) without adding value — we already have a gold translation. This
+     module does an English-keyed, fuzzy-normalised match against the curated
+     phrasebooks in configs/dialect_anchors/{bambara,pular}_phrasebook.json and
+     returns the target string directly when the match is strong.
+
+ Scope
+     - Only fires when target language is bam or ful. For en/fr output we let
+       the LLM (or a passthrough) handle it — nothing to short-circuit.
+     - Only English source keys (what the curated sheets contain). French or
+       in-language inputs will not match and will fall through to the LLM —
+       that's correct behaviour.
+
+ Matching
+     - Exact match on normalised string → score 1.0 ("exact").
+     - Otherwise SequenceMatcher ratio; threshold DEFAULT_THRESHOLD = 0.88.
+     - Normalisation: lowercase, strip punctuation (keeps internal apostrophes),
+       collapse whitespace.
+
+ API
+     lookup(user_text, target_lang) -> dict | None
+         dict has keys: source, target, category, score, match
+ """
+ from __future__ import annotations
+
+ import json
+ import logging
+ import re
+ from difflib import SequenceMatcher
+ from functools import lru_cache
+ from pathlib import Path
+ from typing import Optional
+
+ logger = logging.getLogger(__name__)
+
+ _PHRASEBOOK_DIR = (
+     Path(__file__).resolve().parent.parent.parent / "configs" / "dialect_anchors"
+ )
+
+ _PHRASEBOOK_FILE = {
+     "bam": "bambara_phrasebook.json",
+     "ful": "pular_phrasebook.json",
+ }
+
+ DEFAULT_THRESHOLD = 0.88
+
+
+ def _normalize(text: str) -> str:
+     """Lowercase, strip most punctuation, collapse whitespace."""
+     text = (text or "").lower().strip()
+     # Keep internal apostrophes (e.g. "don't", "b'a"), drop other punctuation.
+     text = re.sub(r"[^\w\s']", " ", text, flags=re.UNICODE)
+     text = re.sub(r"\s+", " ", text)
+     return text.strip()
+
+
+ @lru_cache(maxsize=4)
+ def _load_phrasebook(lang: str) -> list[dict]:
+     fname = _PHRASEBOOK_FILE.get(lang)
+     if not fname:
+         return []
+     path = _PHRASEBOOK_DIR / fname
+     if not path.exists():
+         logger.warning("Phrasebook missing: %s", path)
+         return []
+     with path.open("r", encoding="utf-8") as f:
+         data = json.load(f)
+     pairs = data.get("pairs", [])
+     # Precompute normalised source for speed.
+     for p in pairs:
+         p["_norm"] = _normalize(p.get("source", ""))
+     return pairs
+
+
+ def lookup(
+     user_text: str,
+     target_lang: str,
+     threshold: float = DEFAULT_THRESHOLD,
+ ) -> Optional[dict]:
+     """Return best curated match for `user_text` in `target_lang`, or None.
+
+     Short-circuits only for curated dialects (bam, ful). For any other target
+     returns None so the caller falls through to the LLM.
+     """
+     pairs = _load_phrasebook(target_lang)
+     if not pairs:
+         return None
+     q = _normalize(user_text)
+     if not q:
+         return None
+
+     best: Optional[dict] = None
+     best_score = 0.0
+     for p in pairs:
+         src = p.get("_norm", "")
+         if not src:
+             continue
+         if src == q:
+             return {
+                 "source": p.get("source"),
+                 "target": p.get("target"),
+                 "category": p.get("category"),
+                 "score": 1.0,
+                 "match": "exact",
+             }
+         score = SequenceMatcher(None, q, src).ratio()
+         if score > best_score:
+             best_score = score
+             best = p
+
+     if best and best_score >= threshold:
+         return {
+             "source": best.get("source"),
+             "target": best.get("target"),
+             "category": best.get("category"),
+             "score": round(best_score, 3),
+             "match": "fuzzy",
+         }
+     return None
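The normalise-then-SequenceMatcher path described in the docstring can be exercised in isolation. A self-contained sketch (the two pairs are copied from the curated Pular sheet; `PAIRS`, `normalize`, and `best_match` are illustrative names, not the module's API, and the exact-match fast path is folded into the ratio here since identical strings score 1.0):

```python
import re
from difflib import SequenceMatcher

# Two gold pairs from the curated Pular phrasebook, English-keyed.
PAIRS = [
    {"source": "How are you?", "target": "No wa'i?"},
    {"source": "I don't understand", "target": "Mi faamaali"},
]

def normalize(text: str) -> str:
    """Lowercase, drop punctuation except internal apostrophes, collapse spaces."""
    text = (text or "").lower().strip()
    text = re.sub(r"[^\w\s']", " ", text, flags=re.UNICODE)
    return re.sub(r"\s+", " ", text).strip()

def best_match(query: str, threshold: float = 0.88):
    """Return (pair, score) when the best fuzzy score clears the threshold,
    else (None, score) so a caller would fall through to the LLM."""
    q = normalize(query)
    best, best_score = None, 0.0
    for p in PAIRS:
        score = SequenceMatcher(None, q, normalize(p["source"])).ratio()
        if score > best_score:
            best, best_score = p, score
    if best is not None and best_score >= threshold:
        return best, round(best_score, 3)
    return None, round(best_score, 3)
```

With this setup, "How are you" (missing the question mark) normalises to the same string as the curated key and matches at 1.0, "i dont understand" (missing apostrophe) still clears 0.88 via the ratio, and an unrelated sentence falls through — exactly the behaviour the 0.88 threshold is meant to give: tolerate ASR-style punctuation and apostrophe noise while rejecting genuinely new inputs.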