Broulaye Doumbia committed
Commit cc8b90c · 1 Parent(s): bb78cbf

push docs and script
docs/baseline_rebuild.md ADDED
@@ -0,0 +1,166 @@
+ # Baseline Rebuild Plan — Recovering Months 1-3 Without Losing Existing Work
+
+ *Created: 2026-04-20*
+ *Maintainer: Broulaye*
+
+ ## The framing
+
+ You are not restarting. You are **backfilling** the measurement foundation that was skipped the first time through Stages C (ASR + eval), E (memory loop with real users), and G (field test). Every existing file in `src/` stays exactly where it is. The `app.py` Gradio Space on HuggingFace keeps running. Phase 3 voice-to-voice, Waxal VITS, Adlam/Pular, F5-TTS, ONNX exporters, the FastAPI service — all of it stays.
+
+ What you add is a **parallel minimal track**: a new, deliberately simple entry point that uses only the smallest slice of the existing codebase, runs a real field test against it, and collects the data that should have been collected in Months 1-3. Once the minimal track has produced field signal, you use that signal to guide which features in the main app are actually earning their keep.
+
+ Three principles govern this plan:
+
+ 1. **Never delete, never rewrite.** If something is wrong in an existing module, fix it in place. The minimal track imports from `src/`; it does not fork it.
+ 2. **The existing `app.py` keeps shipping.** Do not take down the production Space. The minimal version deploys as a *separate* Space.
+ 3. **The measurement artifacts (eval set, logs, field-test notes) merge back into main when done.** Code stays isolated on a branch; data and docs come back.
+
+ ## Step-by-step
+
+ ### Step 1 — Protect main with a branch and a tag
+
+ **Why.** Every experiment has to be safely discardable. Tagging the current commit lets you return to a known-good state at any point; branching means nothing the rebuild does can touch the main deploy.
+
+ **How.**
+
+ ```bash
+ cd /sessions/practical-intelligent-knuth/mnt/sahel-agri-voice
+ git status                          # confirm clean working tree
+ git tag v0.3-pre-rebuild -m "Last state before baseline rebuild"
+ git push origin v0.3-pre-rebuild    # if you want the tag on GitHub
+ git checkout -b experimental/baseline-rebuild
+ ```
+
+ From this point on, all rebuild work happens on `experimental/baseline-rebuild`. Main is frozen for the duration of the rebuild. Hotfixes to production still go through main as normal.
+
+ ### Step 2 — Create the minimal entry point
+
+ **Why.** You need to run Whisper + LLM + MMS-TTS in the simplest possible wiring, with nothing else in the critical path. This is what users will actually evaluate. Every extra component adds a failure mode you can't isolate. The minimal entry point becomes both a debugging tool and a field-test artifact.
+
+ **How.** Add a new file `app_minimal.py` at the repo root — a third entry point alongside `app.py` (full production) and `app_lab.py` (experimental). It should import only:
+
+ - `src.llm.gemma_client` — the Qwen LLM client, unchanged
+ - `src.engine.whisper_base` — Whisper backbone, used *zero-shot* (no adapter)
+ - `src.tts.mms_tts` — MMS-TTS Bambara fallback
+ - `src.data.bam_normalize` — the orthography normalizer
+
+ It should **not** touch:
+
+ - `src/engine/adapter_manager.py` (skip LoRA entirely — zero-shot only)
+ - `src/engine/transcriber.py` (the adapter-aware wrapper — use `whisper_base` directly)
+ - `src/memory/` (no memory loop in the minimal version yet)
+ - `src/voice/speaker_profiles.py` (no speaker ID)
+ - `src/iot/` (no sensors, no intent parsing — LLM handles it all)
+ - `src/tts/waxal_tts.py`, `src/tts/f5_tts.py`, `src/tts/voice_cloner.py` (no upgraded TTS)
+ - `src/conversation/phrase_matcher.py` (no fast-path shortcuts)
+
+ A single Gradio interface, one tab: microphone input, audio output, transcript visible for debugging. Roughly 150-200 lines total. Add a header comment explaining what it is:
+
+ ```python
+ """Minimal baseline Gradio entry point for the Month 1-3 rebuild.
+
+ Wires the simplest possible slice: Whisper (zero-shot) -> Qwen -> MMS-TTS.
+ No LoRA adapters, no memory loop, no speaker ID, no voice cloning.
+ Used for field testing and building a real-user eval set.
+ See docs/baseline_rebuild.md for the plan this fits into.
+ """
+ ```
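+
+ For orientation, one possible shape of the wiring — a hedged, self-contained sketch that inlines stock HF pipelines for illustration. The real file would import the four `src` wrappers listed above and use the adult-child system prompt, which is omitted here:
+
+ ```python
+ # Sketch only — the real app_minimal.py imports the src wrappers instead
+ # of rebuilding pipelines inline. Assumes HF_TOKEN is set for the LLM call.
+ import gradio as gr
+ import torch
+ from huggingface_hub import InferenceClient
+ from transformers import VitsModel, VitsTokenizer, pipeline
+
+ asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3-turbo")
+ llm = InferenceClient(model="Qwen/Qwen2.5-72B-Instruct")
+ tts_tok = VitsTokenizer.from_pretrained("facebook/mms-tts-bam")
+ tts = VitsModel.from_pretrained("facebook/mms-tts-bam")
+
+ def respond(audio_path):
+     text = asr(audio_path)["text"]          # Whisper zero-shot — no adapter
+     reply = llm.chat_completion(
+         [{"role": "user", "content": text}], max_tokens=200
+     ).choices[0].message.content
+     inputs = tts_tok(reply, return_tensors="pt")
+     with torch.no_grad():
+         wav = tts(**inputs).waveform[0].numpy()
+     return text, reply, (tts.config.sampling_rate, wav)
+
+ demo = gr.Interface(
+     fn=respond,
+     inputs=gr.Audio(sources=["microphone"], type="filepath"),
+     outputs=[gr.Textbox(label="Transcript"), gr.Textbox(label="Reply"),
+              gr.Audio(label="Spoken reply")],
+ )
+
+ if __name__ == "__main__":
+     demo.launch()
+ ```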
+
+ ### Step 3 — Add the evaluation infrastructure
+
+ **Why.** This is the single most load-bearing deliverable of the rebuild. Without a real-user eval set, every subsequent decision is speculation. The eval set is what turns "I think this change helps" into "I measured this change helped." It also makes the LoRA Kaggle training work (Stage C continuation) scientifically valid whenever you get back to it.
+
+ **How.** Create the folder structure:
+
+ ```
+ data/eval/
+   bambara_field.jsonl   # the eval manifest — starts empty
+   audio/                # the actual wav files (gitignore large files; keep manifest in git)
+   README.md             # recording protocol
+ scripts/
+   eval_baseline.py      # runs minimal stack against manifest, emits metrics
+ docs/
+   eval_protocol.md      # how to add a new recording, quality criteria
+   metrics.md            # where baseline numbers are recorded
+ ```
+
+ The JSONL manifest format:
+
+ ```json
+ {"audio_path": "audio/speaker01_001.wav", "transcript": "ji be min?", "speaker_id": "speaker01", "region": "Bamako", "noise": "quiet", "duration_s": 2.3}
+ ```
+
+ `scripts/eval_baseline.py` loads the manifest, runs each audio file through Whisper-large-v3-turbo (zero-shot, no adapter), compares each result to the ground-truth transcript, and prints WER and CER per speaker and overall. It also prints a few failure cases for inspection. This script becomes your standard measurement harness — every future change gets compared against the same manifest.
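+
+ A sketch of what that harness might look like, assuming `jiwer` for the metrics (it is already in the training stack) and the stock `transformers` ASR pipeline standing in for the project's own loader:
+
+ ```python
+ # Hedged sketch of scripts/eval_baseline.py — not the final harness.
+ import json
+ from collections import defaultdict
+ from jiwer import cer, wer
+ from transformers import pipeline
+
+ asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3-turbo")
+
+ def run_eval(manifest="data/eval/bambara_field.jsonl"):
+     refs, hyps = defaultdict(list), defaultdict(list)
+     with open(manifest, encoding="utf-8") as f:
+         for line in f:
+             row = json.loads(line)
+             # Consider normalizing both sides with bam_normalize before scoring.
+             hyp = asr("data/eval/" + row["audio_path"])["text"]
+             refs[row["speaker_id"]].append(row["transcript"])
+             hyps[row["speaker_id"]].append(hyp)
+     for spk in sorted(refs):   # per-speaker, then overall
+         print(f"{spk}: WER={wer(refs[spk], hyps[spk]):.3f}  CER={cer(refs[spk], hyps[spk]):.3f}")
+     flat_r = [r for v in refs.values() for r in v]
+     flat_h = [h for v in hyps.values() for h in v]
+     print(f"overall: WER={wer(flat_r, flat_h):.3f}  CER={cer(flat_r, flat_h):.3f}")
+
+ if __name__ == "__main__":
+     run_eval()
+ ```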
+
+ ### Step 4 — Collect real recordings (the only human-gated step)
+
+ **Why.** This is where the rebuild touches reality. Three to five native speakers, using their actual phones, in their actual environments. Fifteen to twenty utterances each, covering the agricultural domain you scoped for. The recording conditions have to be real, or the eval set will give you FLEURS-like numbers that lie to you.
+
+ **How.** Write a recording script with 50-100 prompts covering:
+
+ - Greetings and politeness formulas (baseline — should be easy)
+ - Agricultural queries the product actually needs to handle ("how wet is the soil," "when should I water the tomatoes," "is there a pest alert")
+ - Vocabulary you know is underrepresented in FLEURS (crop names, tool names, regional agricultural terms)
+ - A few natural code-switch utterances (Bambara with French loanwords)
+
+ Share the script via WhatsApp voice messages or have the speakers record in a free mobile app that returns wav or m4a. Transcribe by hand (or by LLM with manual correction). Commit the JSONL manifest to the repo; upload the audio to a private HF dataset to avoid bloating git history.
+
+ Set a target: at least 50 utterances across at least 3 speakers before running your first baseline eval. More is better, but 50 is the usable floor.
+
+ ### Step 5 — Deploy the minimal Space
+
+ **Why.** A second HF Space running `app_minimal.py` in parallel with the main Space gives testers a stripped-down version to react to. Comparing two Spaces teaches you which features in the main app are actually pulling weight — if minimal gets the same "I'd use this" reaction as the full version, most of the fancy work isn't load-bearing for first-use value (which doesn't mean it's wrong, just that adoption doesn't depend on it).
+
+ **How.** Create a new Space, e.g. `ous-sow/sahel-voice-minimal`. Set the Space entry point to `app_minimal.py`. Keep `packages.txt` unchanged (ffmpeg is still needed). In `requirements.txt`, consider a trimmed version that doesn't pull in voice cloning or training-only deps — this is a chance to get the minimal Space to cold-boot faster.
+
+ Add basic session logging: every interaction writes a row to a HF dataset `ous-sow/sahel-agri-field-logs` with fields `{timestamp, speaker_opt_in_id, audio_hash, transcript, llm_reply, tts_audio_hash, latency_ms}`. Include opt-in consent text in the UI and store no PII. This logging is what will feed your future training data and answer "are users actually coming back."
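+
+ One way the logging could look — a sketch assuming `huggingface_hub` for the push and one log shard per Space session; paths and names are illustrative:
+
+ ```python
+ # Illustrative logging sketch — field names from the plan above.
+ import json, time, uuid
+ from huggingface_hub import HfApi
+
+ LOG_PATH = f"/tmp/session_{uuid.uuid4().hex}.jsonl"   # one file per Space session
+
+ def log_interaction(opt_in_id, audio_hash, transcript, llm_reply,
+                     tts_audio_hash, latency_ms):
+     row = {"timestamp": time.time(), "speaker_opt_in_id": opt_in_id,
+            "audio_hash": audio_hash, "transcript": transcript,
+            "llm_reply": llm_reply, "tts_audio_hash": tts_audio_hash,
+            "latency_ms": latency_ms}
+     with open(LOG_PATH, "a", encoding="utf-8") as f:
+         f.write(json.dumps(row, ensure_ascii=False) + "\n")
+
+ def push_logs():
+     # Append-only on the Hub side: each session uploads its own shard.
+     HfApi().upload_file(path_or_fileobj=LOG_PATH,
+                         path_in_repo=f"logs/{uuid.uuid4().hex}.jsonl",
+                         repo_id="ous-sow/sahel-agri-field-logs",
+                         repo_type="dataset")
+ ```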
+
+ ### Step 6 — Run the field test
+
+ **Why.** The whole rebuild exists to get this step done. Everything before it is scaffolding; everything after it is informed by what happens in it. The success metric is not WER. It is: **do the testers ask a second question they came up with themselves?** That is the shortest signal that tells you whether this is a product or a demo.
+
+ **How.** Five testers, two weeks. WhatsApp intro: here is the link, please try to ask about soil or weather in Bambara, tell me anything weird. No coaching on phrasing. At the end of week 1 and week 2, ask each tester three questions: what worked, what failed, would you come back tomorrow. Record answers. No metrics from this stage go in a spreadsheet; they go in a short note under `docs/field_test_notes_YYYY-MM-DD.md` written in plain language.
+
+ In parallel, the session logs from Step 5 accumulate. At the end of two weeks, run a small analysis: median latency, distribution of utterance lengths, most common failure utterances, return rate per tester.
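+
+ A minimal pass over the merged logs might look like this (failure-utterance mining left out for brevity; the file name is a placeholder for the concatenated session shards):
+
+ ```python
+ # Quick end-of-test log analysis — field names match the Step 5 schema.
+ import json
+ import statistics
+ from collections import Counter, defaultdict
+
+ rows = [json.loads(l) for l in open("logs_merged.jsonl", encoding="utf-8")]
+
+ print("median latency (ms):", statistics.median(r["latency_ms"] for r in rows))
+ lengths = Counter(len(r["transcript"].split()) for r in rows)
+ print("utterance lengths (words -> count):", dict(sorted(lengths.items())))
+
+ # Return rate: how many distinct days did each opted-in tester show up?
+ days = defaultdict(set)
+ for r in rows:
+     days[r["speaker_opt_in_id"]].add(int(r["timestamp"] // 86400))
+ print("active days per tester:", {k: len(v) for k, v in days.items()})
+ ```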
+
+ ### Step 7 — Selective reintegration
+
+ **Why.** Now you have evidence. Some of the Stage H features the main app already has will turn out to be essential — users asked for speaker memory, or they wanted the IoT integration enough to keep trying. Other features will turn out to be polish no tester noticed. The rebuild ends not with a big merge but with a prioritized list: which features go back into the critical path immediately, which wait, which get deprecated.
+
+ **How.** Open a small PR from `experimental/baseline-rebuild` back into main that brings in *only the data and documentation*:
+
+ - `data/eval/bambara_field.jsonl` and the audio reference
+ - `scripts/eval_baseline.py`
+ - `docs/eval_protocol.md`
+ - `docs/metrics.md` with baseline numbers recorded
+ - `docs/field_test_notes_*.md`
+ - The session-logging infrastructure (if you want it in the production Space too — usually yes)
+
+ Leave `app_minimal.py` on the branch as a long-lived tool — it's now your smoke-test harness. Don't merge it into main unless it's actively useful there.
+
+ From the field test notes, write a short follow-up roadmap document (`docs/roadmap_post_field_test.md`) that reorders the Month 7+ work based on what you actually learned. The features the testers needed get priority. The features that weren't missed drop in rank.
+
+ ## What NOT to touch during the rebuild
+
+ - **Production `app.py`** — stays as-is on main. Users continue to see it on the main HF Space.
+ - **The HF dataset `ous-sow/sahel-agri-feedback`** — keep accepting writes from the main app; the minimal Space can also write to it or to a separate one, your call.
+ - **LoRA training infrastructure** — fixing the Kaggle crash is important Stage C work but it is *not* part of this rebuild. Track it as a separate issue. The rebuild uses Whisper zero-shot deliberately, to decouple field testing from training progress.
+ - **All `src/` modules** — use them, import them, fix bugs in place if found, but do not rewrite.
+ - **The FastAPI service** — leave dormant for the duration. It comes back into focus post-rebuild.
+
+ ## Rough timeline
+
+ | Week | Work |
+ |------|------|
+ | 1 | Steps 1-2: branch, tag, `app_minimal.py` wired and locally runnable |
+ | 2 | Step 3: eval infrastructure + script scaffolded; Step 4 started (recording script sent to speakers) |
+ | 3 | Step 4 continues: first 50 utterances collected, transcribed, committed to eval manifest |
+ | 4 | Step 3 closed: first baseline eval run, numbers recorded in `docs/metrics.md`; Step 5: minimal Space deployed |
+ | 5-6 | Step 6: field test runs, logs accumulate, interviews at end of week 5 and week 6 |
+ | 7 | Step 7: reintegration PR, follow-up roadmap written |
+
+ Seven weeks to close the measurement gap, with production untouched the whole time.
+
+ ## One-line summary
+
+ The rebuild is a parallel minimal track that collects the real-user signal the project was built without — nothing gets deleted, production keeps shipping, and the reintegration at the end is a PR of data and docs, not code.
docs/notebook_collaboration.md ADDED
@@ -0,0 +1,159 @@
+ # Notebook Collaboration — How We Work on Kaggle Notebooks Together
+
+ *Audience: anyone collaborating with Broulaye on Sahel-Voice-Lab*
+ *Last updated: 2026-04-20*
+
+ ## Why we're changing how we work
+
+ Up to now, we've mostly been editing notebooks inside the Kaggle web UI, downloading them occasionally, and pushing to git. That's painful because:
+
+ - The Kaggle copy and the git copy drift apart — it's never clear which one is "right."
+ - Cell outputs and execution counts change every run, so git diffs are huge and unreadable.
+ - If both of us edit the same notebook at the same time, one of us accidentally overwrites the other.
+
+ The new workflow fixes this by making **git the single source of truth** and using Kaggle purely as the place where notebooks *run*. We edit locally, commit to git, and push the notebook up to Kaggle with one command. The Kaggle web UI becomes read-only for our shared notebooks — we still go there to watch runs and read logs, but we don't type code into it anymore.
+
+ ## What you need to install (one time, on your own machine)
+
+ ```bash
+ pip install kaggle nbstripout
+ ```
+
+ - `kaggle` — the official Kaggle command-line tool. Lets you push, pull, run, and monitor Kaggle notebooks from your terminal.
+ - `nbstripout` — strips cell outputs and execution counts from notebooks before they hit git, so diffs stay about *code*, not noise.
+
+ ## Set up your Kaggle API credentials (one time)
+
+ 1. Go to [kaggle.com](https://www.kaggle.com), click your avatar → **Settings** → **API** → **Create New API Token**. A file called `kaggle.json` downloads.
+ 2. Move it to the right place and lock down permissions:
+
+ ```bash
+ mkdir -p ~/.kaggle
+ mv ~/Downloads/kaggle.json ~/.kaggle/kaggle.json
+ chmod 600 ~/.kaggle/kaggle.json
+ ```
+
+ 3. Confirm it works:
+
+ ```bash
+ kaggle kernels list --mine
+ ```
+
+ You should see your existing kernels. If you get an auth error, check the file location and permissions.
+
+ **Never commit `kaggle.json` to git.** It's already in `.gitignore` in this repo, but if you work in another repo, add it yourself.
+
+ ## Repository layout for notebooks
+
+ Each Kaggle notebook ("kernel" in Kaggle's API language) needs its own folder with a `kernel-metadata.json` file next to the `.ipynb`. Our structure:
+
+ ```
+ notebooks/
+   kaggle_master_trainer/
+     kernel-metadata.json
+     kaggle_master_trainer.ipynb
+   train_fula_tts/
+     kernel-metadata.json
+     train_fula_tts.ipynb
+   bootstrap_repos.ipynb   # local helper, not a Kaggle kernel
+   train_colab.ipynb       # runs on Google Colab, different flow
+ ```
+
+ A `kernel-metadata.json` looks roughly like this (example for the master trainer):
+
+ ```json
+ {
+   "id": "ous-sow/sahel-kaggle-master-trainer",
+   "title": "Sahel-Voice-Lab Master Trainer",
+   "code_file": "kaggle_master_trainer.ipynb",
+   "language": "python",
+   "kernel_type": "notebook",
+   "is_private": true,
+   "enable_gpu": true,
+   "enable_internet": true,
+   "dataset_sources": ["google/fleurs", "robotsmali/jeli-asr"],
+   "kernel_sources": [],
+   "competition_sources": []
+ }
+ ```
+
+ The `id` field (`owner/slug`) is **permanent**. Once we've agreed on a slug for a shared kernel, never change it — that's our shared pointer to the kernel living on Kaggle.
+
+ ## Enable the nbstripout filter in the repo (one time per clone)
+
+ From the repo root, the first time you clone:
+
+ ```bash
+ nbstripout --install --attributes .gitattributes
+ ```
+
+ This adds a git filter that runs on every `.ipynb` before it gets committed, stripping outputs and execution counts. Commit the `.gitattributes` file so everyone else picks it up automatically.
+
+ **First-time caveat:** if the repo previously had notebooks-with-outputs committed, your first diff after enabling this will look like everything is being "deleted." That's correct and one-time — it's just stripping the old outputs.
+
+ ## The daily workflow
+
+ ```bash
+ # 1. Pull the latest version from git
+ git pull
+
+ # 2. Edit the notebook locally (VS Code, JupyterLab, whatever you prefer)
+ #    — running cells is fine; nbstripout handles the cleanup on commit.
+
+ # 3. Commit your changes
+ git add notebooks/kaggle_master_trainer/kaggle_master_trainer.ipynb
+ git commit -m "experiment: lower LR for ASR adapter"
+ git push
+
+ # 4. Push the notebook up to Kaggle to actually run it
+ cd notebooks/kaggle_master_trainer
+ kaggle kernels push
+
+ # 5. Watch the run
+ kaggle kernels status ous-sow/sahel-kaggle-master-trainer
+
+ # 6. When it's done, pull outputs if you need them
+ kaggle kernels output ous-sow/sahel-kaggle-master-trainer -p ./runs/$(date +%F)/
+ ```
+
+ Results go into `runs/` (which is gitignored). **They do not go back into the `.ipynb` in git** — that's what nbstripout is protecting us from.
+
+ ## Team rules (please read these — they matter)
+
+ 1. **Never edit shared notebooks in the Kaggle web UI.** Use the web UI to watch runs, read logs, download output files. If you want to experiment, do it locally. If you absolutely must try something quick in the web UI, treat it as a scratch copy — do not manually merge it back.
+
+ 2. **One runner at a time per kernel.** `kaggle kernels push` *replaces* the notebook on Kaggle's side. If you push while the other person's run is queued or mid-execution, you'll queue behind them or disrupt them. Coordinate over chat, or — better — give yourself a personal kernel slug (e.g. `ous-sow/sahel-trainer-dev-<yourname>`) for experimentation, and only push to the shared kernel (`ous-sow/sahel-kaggle-master-trainer`) when a change is ready to run cleanly.
+
+ 3. **Git is the source of truth, always.** Every Kaggle run begins with a `kaggle kernels push` from the current git state. Nothing on Kaggle is authoritative. If something on Kaggle looks different from git, git wins — pull from git, re-push, run again.
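+
+ To make rule 2 cheap to follow, a small helper can stamp out a personal dev copy of a shared kernel — a sketch; the folder layout is from above, the slug pattern from rule 2, and the function itself is hypothetical:
+
+ ```python
+ # Hypothetical helper: clone a shared kernel folder into a personal dev
+ # copy with its own slug. The shared kernel-metadata.json stays untouched.
+ import json
+ import shutil
+ from pathlib import Path
+
+ def make_dev_copy(src="notebooks/kaggle_master_trainer", user="yourname"):
+     dst = Path(f"{src}_dev_{user}")
+     shutil.copytree(src, dst)
+     meta_path = dst / "kernel-metadata.json"
+     meta = json.loads(meta_path.read_text())
+     owner = meta["id"].split("/")[0]
+     meta["id"] = f"{owner}/sahel-trainer-dev-{user}"   # personal slug
+     meta["title"] = f"{meta['title']} (dev: {user})"
+     meta_path.write_text(json.dumps(meta, indent=2))
+     return dst   # then: cd into it and `kaggle kernels push`
+ ```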
+
+ ## Troubleshooting
+
+ **`kaggle kernels push` says "message: Kernel already exists."**
+ Expected — it's just telling you the kernel already exists on Kaggle and will be updated. Not an error.
+
+ **Huge diff with no real code changes.**
+ `nbstripout` isn't active in your clone. Run `nbstripout --install --attributes .gitattributes` from the repo root and re-stage the file.
+
+ **Auth errors from the `kaggle` CLI.**
+ Check that `~/.kaggle/kaggle.json` exists, is yours (not someone else's), and has mode 600.
+
+ **Merge conflict on a `kernel-metadata.json`.**
+ Rare but possible if two people edit metadata simultaneously. The file is small JSON — resolve by hand, keeping the shared `id` untouched.
+
+ **The notebook ran fine on Kaggle but saved outputs landed in git anyway.**
+ You committed before `nbstripout` stripped the outputs. Either re-stage the file (`git add` triggers the filter) and commit again, or run `nbstripout <file.ipynb>` manually before `git add`.
+
+ **You accidentally edited in the Kaggle web UI.**
+ Go to Kaggle → your kernel → "..." → Download notebook. Overwrite the local `.ipynb` with the downloaded file. Commit. Re-push. Don't panic — just restore git as the source of truth.
+
+ ## What this workflow does not solve
+
+ - **Two people editing the same cell at the same time.** Normal git merge conflicts will still happen if both of us touch the same notebook cell simultaneously. Mitigation: work on different notebooks when possible, or pair-edit voice-on-voice. If this becomes frequent, we can add `jupytext` later, which pairs each `.ipynb` with a `.py` mirror that merges like regular Python.
+ - **Debugging a crashing Kaggle run.** The CLI pushes and watches runs, but fixing the crash is still back-and-forth between your local editor and the Kaggle logs. The workflow just removes the "which version is right" confusion from that loop.
+ - **Kaggle's GPU quota.** You still get 30 free GPU hours per week. Plan accordingly.
+
+ ## TL;DR
+
+ Edit locally, commit to git, `kaggle kernels push` to run, `kaggle kernels output` to retrieve. Never edit shared kernels in the Kaggle web UI. Git is the source of truth. `nbstripout` keeps diffs clean.
+
+ If anything here doesn't make sense, ping Broulaye before improvising.
docs/roadmap_2026-04.md ADDED
@@ -0,0 +1,292 @@
+ # Sahel-Voice-Lab — Roadmap & Starting-from-Scratch Plan
+
+ *Last updated: 2026-04-19*
+ *Maintainer: Broulaye*
+
+ This document has two parts:
+
+ 1. **Where the project stands today** — what's built, what's missing, and what to do next.
+ 2. **If I were starting from zero today** — a realistic, solo-maintainer, free-compute path from nothing to a usable Bambara voice assistant.
+
+ ---
+
+ ## Part 1 — The four layers, in plain language
+
+ A voice assistant is four layers stacked on top of each other. Most of this project is about the fourth layer; the first three are mostly rented or borrowed.
+
+ ### Layer 1 — the ear: Speech-to-Text (STT / ASR)
+
+ Abbreviations:
+ - **STT** = Speech-to-Text
+ - **ASR** = Automatic Speech Recognition (same thing)
+ - **LoRA** = Low-Rank Adaptation — a technique to "patch" a large model with a tiny file (~50 MB) instead of retraining all of it
+ - **PEFT** = Parameter-Efficient Fine-Tuning — the HuggingFace library that implements LoRA
+
+ The model used is **Whisper** (OpenAI's open-source multilingual speech model). Out of the box, Whisper's Bambara is poor because it barely saw Bambara during training. The fix: train LoRA adapters per language. One ~1.5 GB Whisper backbone stays in memory; small Bambara and Fula patches swap in and out in ~50 ms.
+
+ Where it lives in the repo:
+ - `src/engine/whisper_base.py` — loads the backbone
+ - `src/engine/adapter_manager.py` — the hot-swap
+ - `src/engine/transcriber.py` — what the app calls
+ - `src/training/trainer.py` + `notebooks/kaggle_master_trainer.ipynb` — training
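+
+ The hot-swap idea in miniature — a hedged sketch of the PEFT multi-adapter pattern, with placeholder adapter paths; `adapter_manager.py` presumably wraps something similar:
+
+ ```python
+ # Sketch of LoRA hot-swapping with PEFT's multi-adapter API.
+ # Adapter paths are placeholders, not the project's real artifacts.
+ import torch
+ from peft import PeftModel
+ from transformers import WhisperForConditionalGeneration
+
+ base = WhisperForConditionalGeneration.from_pretrained(
+     "openai/whisper-large-v3-turbo",
+     torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
+ )
+ # First load wraps the backbone in a PeftModel (~2 s in the project's numbers);
+ # further adapters attach by name without reloading the backbone.
+ model = PeftModel.from_pretrained(base, "adapters/bambara", adapter_name="bam")
+ model.load_adapter("adapters/fula", adapter_name="ful")
+ model.set_adapter("bam")   # the ~50 ms swap
+ ```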
+
+ ### Layer 2 — the brain: Large Language Model (LLM)
+
+ Abbreviations:
+ - **LLM** = Large Language Model
+ - **JSON** = JavaScript Object Notation, a structured format
+
+ No one trains this from scratch. You rent one. The project calls **Qwen** (Alibaba's multilingual model) through HuggingFace's hosted inference service, with a custom "adult-child" prompt that forces structured JSON output (fields like intent, reply, translation).
+
+ Where it lives:
+ - `src/llm/gemma_client.py` — named "gemma" for legacy reasons; now talks to Qwen.
+
+ ### Layer 3 — the mouth: Text-to-Speech (TTS)
+
+ Abbreviations:
+ - **TTS** = Text-to-Speech
+ - **MMS** = Massively Multilingual Speech (Meta's 1000+ language model, lower quality, used as fallback)
+ - **VITS** = Variational Inference Text-to-Speech (a specific architecture — higher quality, one speaker per trained model)
+ - **F5-TTS** = a recent zero-shot voice-cloning TTS system
+
+ The hardest layer for low-resource languages. Needs hours of clean studio audio from a native speaker. Used in tiers:
+ - MMS-TTS as fallback baseline
+ - Waxal-VITS for trained Bambara quality
+ - F5-TTS for voice cloning in Phase 3
+
+ Where it lives:
+ - `src/tts/mms_tts.py`, `src/tts/waxal_tts.py`, `src/tts/f5_tts.py`, `src/tts/voice_cloner.py`
+
+ ### Layer 4 — the glue
+
+ The real differentiator of the project — everything that makes the rented models into a product.
+
+ Abbreviations:
+ - **IoT** = Internet of Things (networked sensors)
+ - **ECAPA-TDNN** = Emphasized Channel Attention, Propagation and Aggregation — Time-Delay Neural Network; a speaker-fingerprint model
+
+ Components:
+ - Memory loop — `src/memory/memory_manager.py`
+ - Normalization — `src/data/bam_normalize.py`, `src/data/adlam.py`
+ - Fast-path phrases — `src/conversation/phrase_matcher.py`
+ - Intent detection — `src/iot/intent_parser.py`
+ - Voice responder (≤ 6-word replies) — `src/iot/voice_responder.py`
+ - Sensor bridge — `src/iot/sensor_bridge.py`
+ - Speaker ID — `src/voice/speaker_profiles.py`
+
+ ---
+
+ ## Part 2 — What's present vs missing
+
+ ### Present
+ - All four layers scaffolded; every module named in the project description exists in `src/`
+ - Two entry points: `app.py` (Gradio, HF Space) and `src/api/app.py` (FastAPI)
+ - Training infrastructure and Kaggle notebooks
+ - Mobile export pipeline (ONNX, TFLite) in `src/optimization/`
+ - Bambara Waxal-VITS TTS working
+ - Memory loop wired into UI
+ - Agricultural domain vocabulary and intent model
+
+ ### Missing or weak
+ 1. `data/vocabulary.jsonl` is empty — no local snapshot of user-taught words
+ 2. LoRA fine-tuning still crashes on Kaggle T4 (active blocker per project notes)
+ 3. Fula TTS is a placeholder — no trained `ous-sow/fula-tts` yet
+ 4. No real-user evaluation set (no `data/eval/` folder with farmer recordings); all quality numbers currently come from FLEURS, which does not reflect real conditions
+ 5. No documented tone-handling policy for TTS (Bambara tone is unmarked in writing but matters for pronunciation)
+
+ ---
+
+ ## Part 3 — Actionable next steps (ordered by leverage)
+
+ ### Step 1 — Fix the LoRA training crash on Kaggle
+ Highest leverage. Unblocks every ASR quality gain downstream.
+ - Reproduce the exact error on a Kaggle T4 runtime
+ - Pin `datasets` to a known-good version (either pre-4.x, or the correct torchcodec pin for 4.x)
+ - If the AMP (Automatic Mixed Precision) scaler is the issue, disable AMP — note that the T4 is a Turing-generation GPU without hardware bf16, so switching to bf16 there is unlikely to be the clean fix
+ - Validate with a tiny 100-sample training job before a full run
+ - Commit a working Bambara adapter before moving on
+
+ ### Step 2 — Build a real-user evaluation set
+ Do this in parallel with Step 1.
+ - Record 50-100 Bambara utterances from at least 3 native speakers
+ - Include noisy conditions (wind, motorcycle, livestock — `noise_samples/` already anticipates this)
+ - Transcribe by hand; store under `data/eval/bambara_field.jsonl`
+ - Run the current stack and record baseline WER (Word Error Rate) and CER (Character Error Rate)
+ - From here on, all changes are measured against this set, not FLEURS
+
+ ### Step 3 — Exercise the memory loop end-to-end
+ - Run 10 live teaching sessions
+ - Confirm the local JSONL grows; confirm the HuggingFace Hub push
+ - Add a test under `tests/` that mocks the Hub and validates the write path
+
+ ### Step 4 — Train `ous-sow/fula-tts`
+ - Can run in parallel on RunPod
+ - Needs 1-3 hours of clean studio audio from a single Fula speaker
+ - Same VITS recipe as Waxal Bambara
+
+ ### Step 5 — Close Phase 3 voice-to-voice parity
+ - Once Fula TTS exists, test the full voice-in → voice-out pipeline for both languages
+ - Measure round-trip CER: spoken sentence → transcript → response → synthesized speech → re-transcribe → compare (see the sketch after this list)
+ - Catches compounding errors across layers
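+
+ A hedged sketch of the round-trip measurement, with stock pipelines standing in for the project's wrappers (any reply text can be scored this way, so the response step is left to the caller):
+
+ ```python
+ # Round-trip CER sketch: does text survive TTS -> ASR intact?
+ import torch
+ from jiwer import cer
+ from transformers import VitsModel, VitsTokenizer, pipeline
+
+ asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3-turbo")
+ tok = VitsTokenizer.from_pretrained("facebook/mms-tts-bam")
+ tts = VitsModel.from_pretrained("facebook/mms-tts-bam")
+
+ def round_trip_cer(reply_text: str) -> float:
+     inputs = tok(reply_text, return_tensors="pt")
+     with torch.no_grad():
+         wav = tts(**inputs).waveform[0].numpy()
+     back = asr({"raw": wav, "sampling_rate": tts.config.sampling_rate})["text"]
+     return cer(reply_text, back)   # 0.0 = perfect round trip
+ ```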
+
+ ### Step 6 — Small field test
+ - Five Malian farmers. Cheap version: WhatsApp voice messages or a phone call with screen-shared Gradio
+ - Log what they try to ask, whether the response is intelligible, whether they'd use it again
+ - Success metric: do they ask a second question without being prompted?
+
+ ### Step 7 — Write a tone-handling policy
+ - Pick a position: "accept tonally-wrong TTS on homographs as a known limitation" vs "invest in tone annotation for the TTS training corpus in cycle N+1"
+ - Either is defensible. The bad option is leaving it unspoken.
+
+ ---
+
+ ## Part 4 — If I were starting from zero today
+
+ Realistic assumptions: solo maintainer, nights-and-weekends pace, free or cheap compute (Kaggle free T4, HuggingFace Spaces cpu-basic, occasional RunPod for bigger runs), access to native speakers (this one is non-negotiable — if you don't have them, stop and find them first).
+
+ The single most important lesson from Sahel-Voice-Lab as it exists today: **it does four products' worth of things at once** (agricultural IoT, self-teaching, multi-language, voice cloning). If starting over, I'd ship one at a time.
+
+ ### Month 0 — Before writing any code
+ 1. Pick a narrow use case. Not "general Bambara assistant." Something like "voice queries for soil moisture" or "learn 100 agricultural words." One domain, one job.
+ 2. Identify 3-5 native speakers willing to test throughout. Get their phone numbers. Ask now, not later.
+ 3. Map the data landscape. Write a one-page doc listing every Bambara dataset you find: FLEURS (bam_ML), RobotsMali Jeli-ASR, OpenSLR, Masakhane resources, Common Voice. Note size, license, quality.
+ 4. Decide: Bambara only for the first version. Fula comes later. Do not start bilingual.
+
+ ### Month 1 — Text-first prototype (no audio yet)
+ 5. Wire the LLM (Qwen via HuggingFace Inference) with a carefully written system prompt. Start in French or English; have it answer in Bambara.
+ 6. Build a Gradio text-in / text-out demo. Deploy to a HuggingFace Space on cpu-basic.
+ 7. Write the normalizer (the `bam_normalize.py` equivalent) with real tests. Spend real time on this; the audit you already did on the alphabet is the specification.
+ 8. Show it to your native speakers. Is the Bambara intelligible? Are the answers right?
+ 9. **Do not add STT or TTS yet.** This stage's only job is to learn what the LLM knows about Bambara and what it doesn't.
+
+ ### Month 2 — Add the ear, and the eval set
+ 10. **Build the evaluation set before training anything.** 50 utterances, 3 speakers, hand-transcribed. This is the most "wish I'd done this earlier" advice in low-resource ASR.
+ 11. Try Whisper-large-v3-turbo zero-shot on your eval set. Record the baseline WER. It will probably be 60-80%.
+ 12. Only then start LoRA fine-tuning with FLEURS + Jeli-ASR on Kaggle T4. Target: WER from ~70% to ~30% within four weeks.
+ 13. Wire the trained adapter into the Gradio app.
+
+ ### Month 3 — Add the mouth (baseline quality)
+ 14. Use MMS-TTS Bambara. One API call. It sounds robotic but it speaks.
+ 15. Ship this as the "Phase 1 complete" milestone on HuggingFace Spaces. This is a real product now: voice in, voice out.
+ 16. Collect 50-100 field interactions. Log everything.
+
+ ### Month 4 — Memory loop
+ 17. Build the teach-new-word flow.
+ 18. JSONL on disk + HuggingFace dataset push.
+ 19. Add the "curiosity" feature (system occasionally asks the user to teach it a word).
+ 20. Exercise it with real users before declaring it done. An empty `vocabulary.jsonl` is a sign the loop was never really tested.
+
+ ### Month 5 — Upgrade TTS
+ 21. Record 1-3 hours of studio audio with a single native speaker reading from a curated script that covers your domain vocabulary. This is the single biggest quality jump in the whole project.
+ 22. Train a VITS model (Waxal-style). Swap MMS out for it.
+ 23. Compare side-by-side with native listeners. Keep MMS as fallback.
+
+ ### Month 6 — Field test and iterate
+ 24. Five farmers. Phone calls or WhatsApp. Real conditions.
+ 25. Success metric: do they ask a second question unprompted? Do they come back tomorrow?
+ 26. Expect this stage to reshape priorities. Follow the feedback; do not defend the roadmap.
+
+ ### Month 7+ — Everything else
+ 27. Second language (Fula / Adlam): only after Bambara is stable
+ 28. Voice cloning (F5-TTS)
+ 29. Mobile / offline export (ONNX, TFLite)
+ 30. IoT sensor integration
+ 31. FastAPI service alongside the Gradio app
+
+ ### Things I would deliberately do differently
+ - **Ship the ugliest possible version at Month 3, not the polished pipeline at Month 9.** Five farmers with a robotic voice tell you more than 500 hours of benchmark tuning.
+ - **Build the evaluation set in Month 2, not later.** Every decision compounds; without an eval, you cannot tell which decisions to keep.
+ - **One language, one entry point, one framework at a time.** The current project has FastAPI + Gradio + Kaggle + ONNX + TFLite + bitsandbytes + speaker ID + voice cloning. Each is a maintenance commitment. Add them only when the product's existence justifies them.
+ - **Don't train your own ASR adapter until the LLM/TTS product has been tested.** Whisper zero-shot is good enough to validate the product idea. Training is expensive; you might end up optimizing a layer users don't care about.
+ - **Native speakers as collaborators, not testers at the end.** Monthly review calls from Month 1, not Month 6.
+
+ ### One-sentence summary
+ If I were starting from zero today, I would ship a narrow, ugly, one-language, text-first version to five real native-speaker users in the first three months, and build everything else on top of the feedback from those five people.
+
+ ---
+
+ ## Part 5 — Expanded walkthrough: why, how, and where Sahel-Voice-Lab fits
+
+ Each stage below has three sections: **Why** (the purpose — why this stage exists and what breaks if you skip it), **How** (concrete mechanics — files, commands, tools, decisions), and **Current project status** (what you have, what's missing, relative to this stage).
+
+ ### Stage A — Scoping and data audit (Month 0)
+
+ **Why.** The single biggest failure mode in low-resource voice AI is attempting a "general Bambara assistant." You cannot measure general; you cannot ship general; you cannot collect targeted data for general. You need one narrow domain so vocabulary is bounded, users can be found, failures are diagnosable, and every subsequent decision has a clear yes/no test: "does this help a farmer query soil moisture?" A bad scope locks in months of wasted work.
+
+ **How.** Write a one-page scoping document that answers: (1) what is the single first use case — one sentence, measurable; (2) who is the first user — names, phone numbers, what language variety they speak; (3) what does success look like in three months — one metric, not five. Then write a data audit: every public Bambara dataset with size, license, quality, and known issues. FLEURS (`bam_ML`), RobotsMali Jeli-ASR, OpenSLR, Masakhane, Common Voice. Note what's missing — domain vocabulary usually is.
+
+ **Current project status.** Stage A is implicitly done. The domain is "agricultural voice interface for Sahelian farmers." The data sources are identified and wired (`src/data/waxal_loader.py`, `src/data/web_harvester.py`, FLEURS referenced in training configs). The one thing weakly documented is the *target user profile* — which region, which dialect, what level of literacy, what phones they use. Writing this down explicitly (even as a one-paragraph persona in the README) tightens every downstream decision.
+
+ ### Stage B — Text-first prototype (Month 1)
+
+ **Why.** Before introducing audio, you need to know what the LLM actually knows about Bambara and what it doesn't. If the text-in/text-out experience is bad, adding voice will not save it; voice only adds more failure modes. Text prototyping is cheap — one deployment, no GPU, a few prompts — and teaches you the vocabulary gap you will spend the rest of the project closing.
+
+ **How.** Call a hosted multilingual LLM (Qwen, Mistral, Gemma) via HuggingFace Inference with `huggingface-hub`'s `InferenceClient`. Write a careful system prompt — the "adult-child" contract: the LLM acts like a patient teacher and returns structured JSON with fields `{intent, reply_bm, reply_fr, confidence}`. Deploy a Gradio text-in/text-out interface to a HuggingFace Space on `cpu-basic`. Show it to two native speakers; ask what sounds wrong. Spend real time on the normalizer at this stage — the orthography audit (`ɛ ↔ e`, `ɔ ↔ o`, `ɲ ↔ ny/gn`, `ŋ ↔ ng`, 1967 vs older forms, and the `ny` ambiguity between palatal nasal and nasal + palatal glide) is the specification.
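+
+ A sketch of that contract in code — the system prompt wording and field values here are illustrative; the real prompt lives in `src/llm/gemma_client.py`:
+
+ ```python
+ # Illustrative version of the "adult-child" JSON contract.
+ from huggingface_hub import InferenceClient
+
+ SYSTEM_PROMPT = (
+     "You are a patient teacher learning Bambara alongside the user. "
+     'Reply ONLY with JSON: {"intent": "...", "reply_bm": "...", '
+     '"reply_fr": "...", "confidence": 0.0}'
+ )
+
+ client = InferenceClient(model="Qwen/Qwen2.5-72B-Instruct")
+ out = client.chat_completion(
+     [{"role": "system", "content": SYSTEM_PROMPT},
+      {"role": "user", "content": "How do I say 'the soil is dry' in Bambara?"}],
+     max_tokens=300,
+ )
+ print(out.choices[0].message.content)   # parse as JSON (strip markdown fences first)
+ ```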
+
+ **Current project status.** Stage B is done. `src/llm/gemma_client.py` implements the adult-child JSON contract against Qwen. `src/data/bam_normalize.py` handles the orthographic cleanups. The Gradio app has been deployed. This stage is behind you.
+
+ ### Stage C — The ear: STT plus the evaluation set (Month 2)
+
+ **Why.** This is the stage with the highest "wish I'd done it earlier" rate in low-resource ASR. You need a real-user evaluation set *before* you train anything, because training without an eval is hill-climbing in the dark. FLEURS numbers do not predict field performance; field recordings do. Only after an eval exists is it worth investing Kaggle hours in fine-tuning.
+
+ **How.** First, the eval set. Ask three native speakers to each record 15-20 utterances covering your domain vocabulary. Use their actual phones, in their actual environments (not a quiet office). Transcribe by hand. Store under `data/eval/bambara_field.jsonl` as `{audio_path, transcript, speaker_id, noise_conditions}`. Run Whisper-large-v3-turbo zero-shot against it. Record the baseline WER (Word Error Rate) and CER (Character Error Rate) numbers in the repo somewhere durable (`docs/metrics.md`). Only then: start LoRA fine-tuning with FLEURS + Jeli-ASR on Kaggle T4. Each training run is measured against your eval set, not against FLEURS.
+
+ **Current project status. You are mostly here — with two important gaps.** The Whisper + LoRA + adapter-swap pipeline is built (`src/engine/whisper_base.py`, `src/engine/adapter_manager.py`, `src/engine/transcriber.py`). Training infrastructure exists (`src/training/trainer.py`, `notebooks/kaggle_master_trainer.ipynb`). However: (1) there is no `data/eval/` folder with real farmer recordings, and (2) the LoRA fine-tuning pipeline still crashes on Kaggle T4 per your project notes. These are your two most important current blockers. Until they resolve, every other ASR improvement is speculative.
+
+ ### Stage D — The mouth: baseline TTS and first ship (Month 3)
+
+ **Why.** Shipping an ugly working product beats polishing a pretty broken one. The first voice-in/voice-out deployment reveals failure modes no amount of offline testing catches — wake-word confusion, ambient noise you didn't model, users speaking too fast or too softly, compounding latency that makes the system feel dead. You cannot learn these from benchmarks; you learn them from users. Ship at the robotic-voice MMS-TTS baseline, then improve.
+
+ **How.** Wire MMS-TTS Bambara (`facebook/mms-tts-bam`) into the Gradio app — it's one `from transformers import VitsModel` call plus audio post-processing. Return audio as a Gradio `gr.Audio` output. Deploy. Write a very short intro text explaining this is a prototype. Share the Space URL with two native-speaker testers, tell them nothing about how it works, ask them to try three things.
+
+ **Current project status.** Stage D is done. MMS-TTS is wired (`src/tts/mms_tts.py`), the Gradio Space is deployed, Phase 1 has shipped per your notes. Two things that might be worth auditing: whether the deployed Space is still on the MMS fallback or already on Waxal-VITS, and whether there is *any* logging/telemetry on usage to tell you whether real people are actually touching the deployed Space.
+
+ ### Stage E — The memory loop (Month 4)
+
+ **Why.** The model does not know most Bambara vocabulary; users do. Without a mechanism to capture and persist what they teach, every conversation's knowledge dies with the session. The memory loop is the product's data-collection engine — the thing that lets it get better over time without you personally labeling data. This is also the core differentiation of Sahel-Voice-Lab versus a generic Bambara ASR+TTS demo.
+
+ **How.** Three components. (1) A teach-new-word flow in the UI: the user says "this is how you say X," the system confirms, stores to `data/vocabulary.jsonl` as `{word, translation, speaker_id, timestamp, audio_ref}`. (2) An async push to a versioned HuggingFace dataset (`ous-sow/sahel-agri-feedback`). (3) A "curiosity" mechanism where every N turns the LLM is prompted to identify a vocabulary gap and ask the user — this inverts the teaching initiative and collects more data per session.
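+
+ Component (3) in miniature — a hedged sketch of the curiosity prompt; the real logic lives in `src/engine/curiosity.py`, and the interval, wording, and function here are all illustrative:
+
+ ```python
+ # Illustrative curiosity mechanic: every N turns, ask the LLM for a gap.
+ from huggingface_hub import InferenceClient
+
+ N = 5
+ client = InferenceClient(model="Qwen/Qwen2.5-72B-Instruct")
+
+ def maybe_ask_gap(turn_count: int, known_words: list[str]) -> str | None:
+     if turn_count % N != 0:
+         return None
+     prompt = (
+         "You are learning Bambara farming vocabulary. You already know: "
+         + ", ".join(known_words)
+         + ". Name ONE agricultural term you are missing and ask the user, "
+           "in one short sentence, how to say it in Bambara."
+     )
+     return client.text_generation(prompt, max_new_tokens=60)
+ ```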
+
+ **Current project status.** Stage E is structurally done but likely not exercised. `src/memory/memory_manager.py` implements the thread-safe JSONL + Hub push. `src/engine/curiosity.py` implements the CuriosityEngine. The Gradio app has a Teaching tab. However, your local `data/vocabulary.jsonl` is empty (0 lines). This means one of three things: (a) no one has used the teach flow yet, (b) the write path is broken and you haven't noticed because no one has used it, or (c) data goes only to the Hub and you've never pulled a snapshot locally. Worth a 20-minute investigation to confirm which. A test in `tests/` that mocks the Hub and asserts the local JSONL write is cheap insurance.
+
+ ### Stage F — Upgraded TTS (Month 5)
+
+ **Why.** MMS-TTS works but sounds robotic, and users notice immediately. Moving to a single-speaker VITS model trained on 1-3 hours of clean studio audio is the single biggest perceived-quality jump in the entire pipeline. It also gives you something MMS cannot: a consistent, identifiable voice that users remember. For long-term adoption, voice identity matters as much as intelligibility.
+
+ **How.** Record 1-3 hours of studio audio with one native speaker reading a curated script that covers your domain vocabulary plus conversational filler. Target: quiet room, decent USB mic, 22050 or 44100 Hz, single take per sentence. Align transcripts, clean silence, normalize loudness. Train a VITS model on your RunPod GPU (a Kaggle T4 usually doesn't have enough memory for full VITS). Publish to HuggingFace as a private or public model repo. Swap out MMS in the TTS dispatcher, keep MMS as fallback.
+
+ **Current project status.** Stage F is done for Bambara, not for Fula. The Waxal VITS integration lives in `src/tts/waxal_tts.py` and per your notes is partially shipped for Bambara (`ynnov/ekodi-bambara-tts-female`). Fula TTS is a placeholder — `ous-sow/fula-tts` does not exist yet. Closing this is one of your active goals. The recording session is usually the bottleneck, not the training.
+
+ ### Stage G — Field test (Month 6)
+
+ **Why.** Everything before this stage is technical. This stage is where you find out whether the technical work produced something humans actually use. It's also where you discover that three of your prior assumptions were wrong — assumptions you could not have tested any other way. Every low-resource voice project that skips this stage ends up polished and unused.
+
+ **How.** Five native-speaker users. Cheapest version: WhatsApp voice messages or a phone call with screen-shared Gradio. Give them a small task ("ask about your soil moisture"), observe without coaching. Record what they try to ask, whether the transcript is right, whether the answer is intelligible to them, whether they would use it unprompted again. The success metric is not WER. It is: *does the user ask a second question they came up with themselves?*
+
+ **Current project status.** Stage G is **not done**. There is no field-test evidence in the repo, no usage logs, no session transcripts from actual farmers. This is, honestly, the single largest gap between where the project is and where it needs to be — more important than the Kaggle crash or the missing Fula TTS. You can ship a field test with what you have today and the feedback will reshape everything downstream.
+
+ ### Stage H — Expansion (Month 7+)
+
+ **Why.** Only once a single-language, single-domain product has real users do you earn the right to expand. Each added dimension (second language, voice cloning, mobile export, IoT integration) doubles the surface area for bugs and maintenance. Adding them in parallel to the core product means you will ship nothing well; adding them after the core is stable means each addition builds on a known-good base.
+
+ **How.** Second language (Fula/Adlam): repeat stages B through G with the new language, reusing infrastructure but refitting normalization and TTS training. Voice cloning: F5-TTS or OpenVoice, keyed to a speaker embedding from the speaker-ID layer. Mobile export: ONNX per language, then TFLite via onnx-tf, then bundle into a thin Android app. IoT integration: FastAPI service in front of the sensor bridge, authenticated, cached.
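+
+ For the mobile-export leg, the first hop might look like this — a sketch with `optimum[onnxruntime]`; merging the LoRA adapter into the backbone before export is an assumption about how the per-language part would work, and the adapter path is a placeholder:
+
+ ```python
+ # Sketch: per-language ONNX export (merge adapter, then convert).
+ from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
+ from peft import PeftModel
+ from transformers import WhisperForConditionalGeneration
+
+ base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3-turbo")
+ merged = PeftModel.from_pretrained(base, "adapters/bambara").merge_and_unload()
+ merged.save_pretrained("export/whisper-bam-merged")
+
+ # Convert the merged checkpoint to ONNX on load, then save the ONNX copy.
+ ort_model = ORTModelForSpeechSeq2Seq.from_pretrained(
+     "export/whisper-bam-merged", export=True)
+ ort_model.save_pretrained("export/whisper-bam-onnx")
+ ```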
+
+ **Current project status. You are ahead of schedule here, which is the diagnostic.** Phase 3 voice-to-voice is merged and stabilizing. F5-TTS is scaffolded (`src/tts/f5_tts.py`). OpenVoice-based voice cloning is scaffolded (`src/tts/voice_cloner.py`). Speaker ID with ECAPA-TDNN is in place (`src/voice/speaker_profiles.py`). Adlam/Pular integration has landed. ONNX and TFLite exporters exist (`src/optimization/`). A FastAPI service is scaffolded (`src/api/`). This is Month 7+ work already in the codebase. The issue is not that this work is wrong — it is that it was built before Stages C (eval set), E (loop exercised with real data), and G (field test) were actually completed. The risk is building a polished Stage H surface on an unmeasured Stage C-E foundation.
+
+ ---
+
+ ## Where you actually are right now
+
+ The honest diagnosis of Sahel-Voice-Lab as of 2026-04-19, mapped onto the staged plan:
+
+ Done: Stages A, B, D. The Bambara text and audio pipeline ships to users via Gradio on HF Spaces. The LLM contract is stable. Normalization is implemented.
+
+ Partially done: Stage C (ASR pipeline built but no field eval set, training still crashes on Kaggle), Stage E (memory loop built but `vocabulary.jsonl` empty — not yet exercised with real users), Stage F (Bambara TTS upgraded, Fula TTS still placeholder).
+
+ Not done: Stage G (no field test with real farmers).
+
+ Ahead of schedule: Stage H (Phase 3 voice-to-voice, voice cloning, Adlam/Pular, ONNX/TFLite, FastAPI — all built in parallel with, or before, completing C/E/G).
+
+ The path forward, ordered by leverage: (1) fix the Kaggle LoRA crash so Stage C can continue; (2) build the real-user eval set so Stage C has a measurement foundation; (3) exercise the memory loop with three real users so Stage E is confirmed; (4) run a small field test so Stage G is unblocked; (5) train `ous-sow/fula-tts` so Stage F closes for Fula; (6) return to Stage H work with actual user signal guiding priorities.
+
+ Everything the project is missing is measurement. Everything the project has is implementation. That is a recoverable position, but only if the measurement work now gets the same weight the implementation work has had.
project-context.txt ADDED
@@ -0,0 +1,295 @@
+ ================================================================================
+ PROJECT CONTEXT — sahel-agri-voice
+ Generated: 2026-04-17
+ ================================================================================
+
+ PROJECT NAME
+ ------------
+ Sahel-Voice-Lab / Sahel-Agri Voice AI
+ (HuggingFace Space title: "Sahel-Voice-Lab", Phase 1: "The Memory Loop")
+
+ PURPOSE
+ -------
+ A voice-first, self-learning AI assistant for two West African languages —
+ Bambara (bam, spoken in Mali) and Fula/Pular (ful, spoken in Guinea and
+ Senegal) — targeted at farmers in the Sahel region.
+
+ The system has two complementary capabilities:
+
+ 1. LANGUAGE-LEARNING MEMORY LOOP (Phase 1)
+    The assistant behaves like an "eager child learner." Users teach it
+    Bambara/Fula words ("I ni ce means hello") via voice or text; an LLM
+    detects the teaching intent and the word pair is persisted to a
+    HuggingFace Hub dataset (ous-sow/sahel-agri-feedback → vocabulary.jsonl)
+    so knowledge accumulates across sessions and users. The vocabulary is
+    then injected into the LLM's system prompt as its source of truth for
+    answering questions.
+
+ 2. AGRICULTURAL IoT VOICE INTERFACE
+    Farmers speak questions in their own language ("how is the soil?",
+    "is it going to rain?"). Whisper transcribes, an intent parser keyword-
+    matches Bambara/Fula agricultural terms (soil, rain, irrigation, pest),
+    a sensor bridge fetches data from an IoT backend (or mock data), and
+    VoiceResponder + a TTS engine reply in short Bambara/Fula sentences
+    with alert thresholds (e.g. "Bunding ji dɔgɔ. I ka foro ji." =
+    "Soil moisture is low. Irrigate your field.").
+
+ The project is deployed as a HuggingFace Space (Gradio frontend) with an
+ optional FastAPI service. The system is explicitly "100% non-Meta" for its
+ core stack (Whisper / Qwen / F5-TTS / VITS), avoiding Meta models for the
+ main loop.
+
+ FULL TECH STACK
+ ---------------
+ Deployment / hosting
+ - HuggingFace Spaces (Gradio SDK 5.25.0, hardware: cpu-basic)
+ - Kaggle notebooks (T4 GPU) for training runs
+ - RunPod alternative training environment
+ - HF Hub datasets as persistent vocabulary + feedback store
+
+ Frontend
+ - Gradio 5.25.0 (app.py — main UI; app_lab.py — experimental lab UI)
+
+ Backend API
+ - FastAPI (src/api/app.py via create_app() + lifespan)
+ - Pydantic v2 (schemas)
+ - httpx (async calls to IoT sensor backend)
+
+ Speech-to-text (STT)
+ - openai/whisper-large-v3-turbo (default backbone)
+ - transformers 5.5.0 (WhisperForConditionalGeneration, WhisperProcessor)
+ - PEFT (LoRA adapters, hot-swappable per language)
+ - accelerate 1.13.0
+ - librosa 0.10.2, soundfile 0.12.1, torchaudio
+
+ LLM (reasoning / teaching-intent detection)
+ - Qwen/Qwen2.5-72B-Instruct (default, via HF Serverless Inference)
+ - Qwen/Qwen2.5-7B-Instruct, Mistral-7B-Instruct-v0.3, Zephyr-7b-beta
+   as faster alternatives
+ - huggingface-hub 1.9.0 InferenceClient
+
+ Text-to-speech (TTS)
+ - Phase 1: facebook/mms-tts-bam, mms-tts-ful, mms-tts-fra, mms-tts-eng
+ - Phase 2: ynnov/ekodi-bambara-tts-female (VITS)
+   + placeholder ous-sow/fula-tts
+ - F5-TTS (SWivid/F5-TTS) for GPU voice cloning (optional, ~2GB)
+ - OpenVoice V2 (myshell-ai/openvoice-v2) for tone-color conversion
+ - SpeechBrain ECAPA-TDNN for speaker identification (per-user profiles)
+
+ Data / datasets
+ - google/fleurs (bam_ML, ff_SN) as STT training corpus
+ - RobotsMali/jeli-asr, google/fleurs Fula, Wikipedia (bm, ff) harvested
+   text via src/data/web_harvester.py
+ - datasets 4.8.4 (+ torchcodec for 4.x audio decoding)
+ - Adlam ↔ Latin transliteration for Guinea Pular
+
+ Training / fine-tuning
+ - PEFT LoRA + Seq2SeqTrainer
+ - jiwer 3.0.4 (WER / CER metrics)
+ - Custom callbacks: EarlyStoppingOnWER, AdapterCheckpointCallback
+ - FieldNoiseAugmenter (tractor / wind / livestock noise mixing)
+
+ Optimization / edge deploy
+ - optimum[onnxruntime] → per-language ONNX export
+ - onnx-tf / TensorFlow → TFLite for Android
+ - bitsandbytes NF4 / 8-bit quantization (training environments)
+
+ Utilities / runtime
+ - PyYAML 6.0.2, python-dotenv 1.1.0
+ - NumPy 2.2.4, SciPy 1.15.2
+ - rapidfuzz 3.13.0 (fuzzy phrase matching)
+ - pypdf, python-docx (Knowledge Base upload → vocabulary.jsonl)
+ - Kaggle API (Self-Teaching tab triggers training runs)
+ - ffmpeg (packages.txt — sole system-level dep)
+
+ Environment variables
+ HF_TOKEN, FEEDBACK_REPO_ID (ous-sow/sahel-agri-feedback),
+ LLM_MODEL_ID, BAMBARA_ADAPTER_PATH, FULA_ADAPTER_PATH,
+ SENSOR_API_URL, BAMBARA_TTS_REPO, FULA_TTS_REPO, DEVICE, LOG_LEVEL
+
110
+ KEY SOURCE FILES AND WHAT THEY DO
111
+ ---------------------------------
112
+ Top-level entry points
113
+ app.py
114
+ Gradio UI (~99 KB). Main user-facing application running on the HF Space.
115
+ Wires STT → LLM → memory → TTS, exposes the Conversation / Teaching /
116
+ Knowledge Base / Self-Teaching tabs.
117
+ app_lab.py
118
+ Experimental/lab Gradio UI used to prototype new features
119
+ (e.g. CuriosityEngine integration) before folding into app.py.
120
+ setup.sh
121
+ Shell bootstrap for local + RunPod environments.
122
+
123
+ src/api/ — FastAPI service (alternative to Gradio-only deploy)
124
+ app.py FastAPI factory with async lifespan: loads Whisper backbone
125
+ once, registers bam/ful adapters, pre-loads 'bam', attaches
126
+ Transcriber + SensorBridge to app.state.
127
+ dependencies.py FastAPI DI helpers to pull shared objects off app.state.
128
+ middleware.py CORS / logging middleware registration.
129
+ schemas.py Pydantic v2 request/response models.
130
+ routes/health.py GET /health — model status + loaded adapters.
131
+ routes/transcribe.py POST /transcribe — audio → text, 10 MB cap,
132
+ wav/mp3/ogg/m4a/flac/webm.
133
+ routes/iot.py POST /query — full pipeline: audio → transcribe → intent
134
+ → sensor → voice response (IoTQueryResponse).
135
+
+ src/engine/ — STT core
+   whisper_base.py       Singleton loader for WhisperForConditionalGeneration
+                         + WhisperProcessor. FP16 on CUDA, FP32 on CPU;
+                         free() releases VRAM.
+   adapter_manager.py    Hot-swaps LoRA adapters via PEFT's multi-adapter
+                         API: first load ~2s, subsequent set_adapter ~50ms.
+                         Keeps one backbone in VRAM and swaps ~50MB adapters
+                         (see the sketch after this section).
+   transcriber.py        Public inference API. Handles ≤30s chunks directly,
+                         >30s audio by slicing into 30s windows. Returns
+                         TranscriptionResult (text, language, duration_s,
+                         processing_time_ms, confidence).
+   stt_processor.py      avg_logprob confidence extractor; below the -1.0
+                         threshold the result is flagged "confused" and the
+                         caller should ask the user to repeat.
+   curiosity.py          CuriosityEngine — every N interactions, prompts the
+                         LLM to spot a vocabulary gap and ask the user how
+                         to say a missing agricultural term.
+
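+ The adapter hot-swap in adapter_manager.py rests on PEFT's multi-adapter
+ API; its shape is roughly this sketch (adapter paths are hypothetical):
+
+     from peft import PeftModel
+     from transformers import WhisperForConditionalGeneration
+
+     backbone = WhisperForConditionalGeneration.from_pretrained(
+         "openai/whisper-large-v3-turbo")
+     model = PeftModel.from_pretrained(backbone, "adapters/bam",
+                                       adapter_name="bam")  # first load, ~2 s
+     model.load_adapter("adapters/ful", adapter_name="ful")
+     model.set_adapter("ful")                               # swap, ~50 ms
+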
+ src/llm/
+   gemma_client.py       Wraps HF Serverless InferenceClient. Implements the
+                         "adult-child" system prompt that returns structured
+                         JSON with intent ∈ {teaching, question,
+                         conversation, error}. Parses JSON out of optional
+                         markdown fences (see the sketch below).
+
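+ Pulling JSON out of optional markdown fences usually reduces to a small
+ helper like this sketch (not the exact gemma_client.py code):
+
+     import json
+     import re
+
+     def parse_llm_json(raw: str) -> dict:
+         # Strip a ```json ... ``` fence if the model added one.
+         match = re.search(r"```(?:json)?\s*(.*?)```", raw, re.DOTALL)
+         payload = match.group(1) if match else raw
+         return json.loads(payload.strip())
+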
+ src/memory/
+   memory_manager.py     Thread-safe vocabulary store. Persists to
+                         data/vocabulary.jsonl locally and pushes
+                         asynchronously to HF Hub dataset. Provides
+                         get_recent() and a formatted
+                         get_vocabulary_context() for the LLM prompt.
+
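+ The thread-safety is presumably a lock around the JSONL append; a minimal
+ sketch of that pattern (field names are hypothetical):
+
+     import json
+     import threading
+
+     _LOCK = threading.Lock()
+
+     def add_entry(term: str, translation: str,
+                   path: str = "data/vocabulary.jsonl") -> None:
+         record = {"term": term, "translation": translation}
+         with _LOCK:
+             with open(path, "a", encoding="utf-8") as f:
+                 f.write(json.dumps(record, ensure_ascii=False) + "\n")
+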
+ src/conversation/
+   phrase_matcher.py     RapidFuzz-based matcher over curated JSON phrase
+                         libraries (data/phrases/{lang}.json +
+                         _additions.json). Handles greetings / thanks /
+                         farewells without hitting the LLM.
+
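+ A sketch of the RapidFuzz lookup (the scorer and cutoff are illustrative
+ choices, not necessarily what phrase_matcher.py uses):
+
+     from rapidfuzz import fuzz, process
+
+     phrases = {"i ni ce": "greeting", "i ni sogoma": "greeting"}
+     match = process.extractOne("ini ce", list(phrases),
+                                scorer=fuzz.token_set_ratio, score_cutoff=80)
+     if match is not None:            # (phrase, score, index) or None
+         intent = phrases[match[0]]   # answered locally, no LLM call
+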
+ src/iot/
+   intent_parser.py      Keyword-based Intent classifier
+                         (greeting/thanks/farewell/check_soil/check_weather/
+                         irrigation_status/pest_alert) for bam, ful, fr, en.
+                         Confidence = matched_keywords / total_keywords
+                         (see the sketch after this section).
+   sensor_bridge.py      Async bridge to an IoT backend (SENSOR_API_URL) for
+                         soil / weather / irrigation / pest readings.
+                         Falls back to mock random data.
+   voice_responder.py    Maps (Intent, SensorData) → short Bambara/Fula
+                         reply string (≤6 words per sentence for clean
+                         MMS-TTS) plus English translation. Alert thresholds
+                         encoded here (SOIL_MOISTURE_LOW=30, pH bounds,
+                         TEMP_HIGH=38, etc.). Also has a verbose
+                         French-language path.
+
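+ The confidence rule in intent_parser.py is easy to picture with a sketch
+ (the keyword table below is invented for illustration):
+
+     KEYWORDS = {"check_soil": ["dugukolo", "soil", "moisture"],
+                 "check_weather": ["weather", "sanji"]}
+
+     def classify(text: str) -> tuple[str, float]:
+         text = text.lower()
+         best, best_conf = "unknown", 0.0
+         for intent, words in KEYWORDS.items():
+             hits = sum(1 for w in words if w in text)
+             conf = hits / len(words)   # matched_keywords / total_keywords
+             if conf > best_conf:
+                 best, best_conf = intent, conf
+         return best, best_conf
+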
+ src/data/
+   agri_dictionary.py    Bambara + Fula domain vocab used to bias the
+                         Whisper decoder prompt toward agricultural terms.
+   waxal_loader.py       Streams google/fleurs (bam_ML, ff_SN) — the
+                         replacement for the retired google/waxal dataset.
+   feature_extractor.py  Log-mel spectrogram extraction and batched padding
+                         collator for Whisper Seq2SeqTrainer.
+   augmentation.py       FieldNoiseAugmenter — mixes clean speech with
+                         tractor/wind/livestock samples; falls back to
+                         Gaussian noise.
+   bam_normalize.py      Bambara phonetic normalizer (ou→u, gn/ny→ɲ,
+                         N'Ko-derived standard; see the sketch after this
+                         section).
+   adlam.py              Adlam (𞤀𞤣𞤤𞤢𞤥) ↔ Latin transliteration for Pular;
+                         normalize_pular() for ASR preprocessing.
+   web_harvester.py      Harvests RobotsMali/jeli-asr, google/fleurs ff_SN,
+                         and bm/ff Wikipedia into the feedback Hub dataset.
+
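+ The two documented bam_normalize.py rules suggest a simple substitution
+ table; a sketch (the real module very likely carries more rules):
+
+     import re
+
+     _RULES = [("ou", "u"), ("gn|ny", "ɲ")]
+
+     def normalize_bam(text: str) -> str:
+         text = text.lower()
+         for pattern, replacement in _RULES:
+             text = re.sub(pattern, replacement, text)
+         return text
+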
+ src/training/
+   trainer.py            WhisperLoRATrainer — full fine-tune orchestration
+                         (backbone + LoraConfig + WaxalDataLoader +
+                         Seq2SeqTrainer).
+   metrics.py            WER/CER for Seq2SeqTrainer eval loop (via jiwer).
+   callbacks.py          EarlyStoppingOnWER, AdapterCheckpointCallback
+                         (saves adapter-only, not full model).
+
+ src/tts/
+   waxal_tts.py          VITS engine wrapping ynnov/ekodi-bambara-tts-female
+                         for Bambara; Fula is a placeholder until
+                         ous-sow/fula-tts is trained.
+   mms_tts.py            Facebook MMS-TTS (bam/ful/fra/eng).
+   f5_tts.py             F5-TTS voice cloning (optional, GPU-only, ~750MB);
+                         gracefully falls back to MMS when missing.
+   voice_cloner.py       OpenVoice V2 tone-color converter — reshapes VITS
+                         audio to a target speaker's voice.
+
+ src/voice/
+   speaker_profiles.py   SpeakerProfileManager with SpeechBrain ECAPA-TDNN
+                         (192-d embeddings). Per-user running-average
+                         embeddings for identification + OpenVoice SE for
+                         cloning; cosine similarity ≥ 0.75 attributes to an
+                         existing user.
+
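+ The attribution rule (cosine similarity ≥ 0.75 on 192-d embeddings) is a
+ one-liner with NumPy; a sketch that ignores the running-average update:
+
+     import numpy as np
+
+     def is_same_speaker(emb: np.ndarray, profile: np.ndarray,
+                         threshold: float = 0.75) -> bool:
+         cos = float(np.dot(emb, profile) /
+                     (np.linalg.norm(emb) * np.linalg.norm(profile)))
+         return cos >= threshold
+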
+ src/optimization/
+   onnx_exporter.py      Merges LoRA into backbone and exports per-language
+                         ONNX (ONNX can't hot-swap adapters at runtime);
+                         see the sketch after this section.
+   quantizer.py          BitsAndBytes NF4 / 8-bit quantization for
+                         GPU-constrained deploys (turbo ~3GB → ~1GB VRAM).
+   tflite_converter.py   ONNX → TFLite for offline Android; exports encoder
+                         and decoder separately.
+
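+ The merge step in onnx_exporter.py is PEFT's merge_and_unload; the export
+ via optimum shown here is one plausible wiring, not necessarily the
+ module's exact code (peft_model is the PeftModel from the earlier sketch;
+ paths are hypothetical):
+
+     from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
+
+     merged = peft_model.merge_and_unload()  # bake the LoRA weights in
+     merged.save_pretrained("merged-bam")
+     onnx_model = ORTModelForSpeechSeq2Seq.from_pretrained(
+         "merged-bam", export=True)
+     onnx_model.save_pretrained("onnx-bam")
+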
+ Config / data folders
+   configs/              base_config.yaml + per-language LoRA configs.
+   data/                 vocabulary.jsonl, phrases/*.json, profiles/, etc.
+   notebooks/            Kaggle / RunPod fine-tune + TTS training notebooks.
+   noise_samples/        .wav clips for field-noise augmentation.
+   scripts/              utility scripts (bootstrap, harvest, eval).
+   tests/                pytest suite (not installed in HF Spaces runtime).
+
+ RECENT GIT COMMITS SUMMARY (last 20)
+ ------------------------------------
+ The recent history is focused on three concurrent tracks:
+
+ 1. STT / training stability
+    - bb78cbf Add torchcodec install for datasets 4.x audio decoding
+    - 9049ef3 Prepare training stack for RunPod: env-aware notebook +
+              bootstrap script
+    - cc50efb Align Whisper default to turbo-v3 + add document upload to
+              Knowledge Base tab
+    - c33a061 Fix WhisperProcessor import in reload + upgrade base to
+              large-v3-turbo
+    - 7fae91b Fix mel-bin mismatch: load per-language processor from
+              fine-tuned checkpoint
+    - 6682858 Fix jiwer crash on post-normalisation empty refs;
+              register SLR106/105 datasets
+    - 58f431a Fix SyntaxError in Cell 17: unterminated f-string literal
+    - 3632a23 Fix compute_metrics crash on empty eval references
+              in Fula training
+    - 71bb3bc Fix: add trust_remote_code=True for datasets 3.x compatibility
+    - cd017e2 Fix Cell 16 ValueError: load model fp32 so AMP gradient scaler
+              works
+
+ 2. Language support / Adlam / Pular expansion
+    - ced078c Add Adlam/Pular Fula integration: transliterator +
+              3 new datasets + normalisation pipeline
+    - 40cf84d Fix language mixing: per-language prompts +
+              Mali Bambara / Guinea Pular context
+    - 33c3a5a Fix Self-Teaching language detection: parse code from
+              dropdown label
+    - 24b1617 Fix Self-Teaching tab: float sliders, deduplication,
+              Kaggle API fallback
+
+ 3. Conversation / voice pipeline
+    - 8952fff Phase 3: Voice-to-Voice S2S pipeline —
+              F5-TTS, LLM brain, CER metric
+    - ad902c6 Add real conversational memory + live learning to
+              Conversation Mode
+    - 8d7d9d8 Fix conversation mode timeout: two-stage pipeline + faster LLM
+    - 1958814 Fix "Model loading" stuck state: block in _do_asr until
+              Whisper is ready
+    - 618eab5 Fix model loading stuck forever + unhandled TTS crash in
+              conversation mode
+    - bfe5b59 Fix slow build: strip runtime-irrelevant heavy packages from
+              requirements.txt
+
+ Overall trajectory: the project has moved past the initial Phase 1
+ scaffolding and is iterating hard on (a) stabilising fine-tuning on
+ Kaggle/RunPod with large-v3-turbo, (b) expanding to Guinea Pular with the
+ native Adlam script, and (c) finishing the Phase 3 voice-to-voice pipeline
+ (F5-TTS + LLM brain). Most recent commits are bug fixes rather than
+ net-new features, suggesting the current codebase is approaching a stable
+ milestone.
+
+ ================================================================================
scripts/push_to_hf.sh ADDED
@@ -0,0 +1,38 @@
+ #!/usr/bin/env bash
+ # push_to_hf.sh — Push current branch to Hugging Face Space main using HF_TOKEN.
+ #
+ # Usage:
+ #   bash scripts/push_to_hf.sh
+ #   HF_SPACE_REPO="spaces/ous-sow/sahel-agri-voice" bash scripts/push_to_hf.sh
+
+ set -euo pipefail
+
+ REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
+ cd "$REPO_ROOT"
+
+ HF_SPACE_REPO="${HF_SPACE_REPO:-spaces/ous-sow/sahel-agri-voice}"
+ TARGET_BRANCH="${TARGET_BRANCH:-main}"
+
+ # Load .env if present and HF_TOKEN is not already exported.
+ if [[ -z "${HF_TOKEN:-}" && -f "$REPO_ROOT/.env" ]]; then
+   set -a
+   # shellcheck disable=SC1091
+   source "$REPO_ROOT/.env"
+   set +a
+ fi
+
+ if [[ -z "${HF_TOKEN:-}" ]]; then
+   echo "HF_TOKEN is not set."
+   echo "Set HF_TOKEN in your shell or in .env, then rerun."
+   exit 1
+ fi
+
+ CURRENT_BRANCH="$(git branch --show-current)"
+ if [[ -z "$CURRENT_BRANCH" ]]; then
+   echo "Could not detect current git branch."
+   exit 1
+ fi
+
+ echo "Pushing '$CURRENT_BRANCH' to '$HF_SPACE_REPO' (remote branch: '$TARGET_BRANCH')..."
+ git push "https://__token__:${HF_TOKEN}@huggingface.co/${HF_SPACE_REPO}" "${CURRENT_BRANCH}:${TARGET_BRANCH}"
+ echo "Done."