Commit cc8b90c — Broulaye Doumbia committed: push docs and script
Parent(s): bb78cbf

- docs/baseline_rebuild.md +166 -0
- docs/notebook_collaboration.md +159 -0
- docs/roadmap_2026-04.md +292 -0
- project-context.txt +295 -0
- scripts/push_to_hf.sh +38 -0
docs/baseline_rebuild.md
ADDED
@@ -0,0 +1,166 @@
# Baseline Rebuild Plan — Recovering Months 1-3 Without Losing Existing Work

*Created: 2026-04-20*
*Maintainer: Broulaye*

## The framing

You are not restarting. You are **backfilling** the measurement foundation that was skipped the first time through Stages C (ASR + eval), E (memory loop with real users), and G (field test). Every existing file in `src/` stays exactly where it is. The `app.py` Gradio Space on HuggingFace keeps running. Phase 3 voice-to-voice, Waxal VITS, Adlam/Pular, F5-TTS, ONNX exporters, the FastAPI service — all of it stays.

What you add is a **parallel minimal track**: a new, deliberately simple entry point that uses only the smallest slice of the existing codebase, runs a real field test against it, and collects the data that should have been collected in Months 1-3. Once the minimal track has produced field signal, you use that signal to guide which features in the main app are actually earning their keep.

Three principles govern this plan:

1. **Never delete, never rewrite.** If something is wrong in an existing module, fix it in place. The minimal track imports from `src/`; it does not fork it.
2. **The existing `app.py` keeps shipping.** Do not take down the production Space. The minimal version deploys as a *separate* Space.
3. **The measurement artifacts (eval set, logs, field-test notes) merge back into main when done.** Code stays isolated on a branch; data and docs come back.

## Step-by-step

### Step 1 — Protect main with a branch and a tag

**Why.** Every experiment has to be safely discardable. Tagging the current commit lets you return to known-good state at any point; branching means nothing the rebuild does can touch the main deploy.

**How.**

```bash
cd /sessions/practical-intelligent-knuth/mnt/sahel-agri-voice
git status   # confirm clean working tree
git tag v0.3-pre-rebuild -m "Last state before baseline rebuild"
git push origin v0.3-pre-rebuild   # if you want the tag on GitHub
git checkout -b experimental/baseline-rebuild
```

From this point on, all rebuild work happens on `experimental/baseline-rebuild`. Main is frozen for the duration of the rebuild. Hotfixes to production still go through main as normal.

### Step 2 — Create the minimal entry point

**Why.** You need to run Whisper + LLM + MMS-TTS in the simplest possible wiring, with nothing else in the critical path. This is what users will actually evaluate. Every extra component adds a failure mode you can't isolate. The minimal entry point becomes a joint debugging tool and a field-test artifact.

**How.** Add a new file `app_minimal.py` at the repo root — a third entry point alongside `app.py` (full production) and `app_lab.py` (experimental). It should import only:

- `src.llm.gemma_client` — the Qwen LLM client, unchanged
- `src.engine.whisper_base` — Whisper backbone, used *zero-shot* (no adapter)
- `src.tts.mms_tts` — MMS-TTS Bambara fallback
- `src.data.bam_normalize` — the orthography normalizer

It should **not** touch:

- `src/engine/adapter_manager.py` (skip LoRA entirely — zero-shot only)
- `src/engine/transcriber.py` (the adapter-aware wrapper — use `whisper_base` directly)
- `src/memory/` (no memory loop in the minimal version yet)
- `src/voice/speaker_profiles.py` (no speaker ID)
- `src/iot/` (no sensors, no intent parsing — LLM handles it all)
- `src/tts/waxal_tts.py`, `src/tts/f5_tts.py`, `src/tts/voice_cloner.py` (no upgraded TTS)
- `src/conversation/phrase_matcher.py` (no fast-path shortcuts)

Single Gradio interface, one tab: microphone input, audio output, transcript visible for debugging. Roughly 150-200 lines total. Add a header comment explaining what it is:

```python
"""Minimal baseline Gradio entry point for the Month 1-3 rebuild.

Wires the simplest possible slice: Whisper (zero-shot) -> Qwen -> MMS-TTS.
No LoRA adapters, no memory loop, no speaker ID, no voice cloning.
Used for field testing and building a real-user eval set.
See docs/baseline_rebuild.md for the plan this fits into.
"""
```
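The wiring itself can be sketched as a composition of three callables. This is a hypothetical shape, not the real module APIs (which live in `src/` and are not shown here); `app_minimal.py` would adapt the signatures of `src.engine.whisper_base`, `src.llm.gemma_client`, and `src.tts.mms_tts` to fit:

```python
def make_pipeline(transcribe, ask_llm, synthesize, normalize=lambda s: s):
    """Compose the three-stage slice: Whisper (zero-shot) -> LLM -> MMS-TTS.

    Each stage is injected as a callable so the pipeline stays trivial to
    smoke-test; app_minimal.py would pass the real functions from src/.
    """
    def run(audio):
        transcript = normalize(transcribe(audio))  # zero-shot ASR, no adapter
        reply = ask_llm(transcript)                # LLM reply text
        speech = synthesize(reply)                 # fallback Bambara TTS
        return transcript, reply, speech           # transcript shown for debugging
    return run

# In app_minimal.py this `run` would back the single-tab Gradio Interface,
# e.g. gr.Interface(fn=run, inputs=gr.Audio(sources=["microphone"]), ...)
```

The dependency-injection shape is deliberate: it keeps the critical path at three function calls and makes each stage swappable for a stub in tests.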

### Step 3 — Add the evaluation infrastructure

**Why.** This is the single most load-bearing deliverable of the rebuild. Without a real-user eval set, every subsequent decision is speculation. The eval set is what turns "I think this change helps" into "I measured this change helped." It also makes the LoRA Kaggle training work (Stage C continuation) scientifically valid whenever you get back to it.

**How.** Create the folder structure:

```
data/eval/
  bambara_field.jsonl   # the eval manifest — starts empty
  audio/                # the actual wav files (gitignore large files; keep manifest in git)
  README.md             # recording protocol
scripts/
  eval_baseline.py      # runs minimal stack against manifest, emits metrics
docs/
  eval_protocol.md      # how to add a new recording, quality criteria
  metrics.md            # where baseline numbers are recorded
```

The JSONL manifest format:

```json
{"audio_path": "audio/speaker01_001.wav", "transcript": "ji be min?", "speaker_id": "speaker01", "region": "Bamako", "noise": "quiet", "duration_s": 2.3}
```

`scripts/eval_baseline.py` loads the manifest, runs each audio through Whisper-large-v3-turbo (zero-shot, no adapter), compares to the ground-truth transcript, and prints WER and CER per-speaker and overall. Also prints a few failure cases for inspection. This script becomes your standard measurement harness — every future change gets compared against the same manifest.
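A minimal sketch of the measurement core of that script — the WER/CER math only, using a plain-Python Levenshtein distance. The manifest loading and the Whisper call are project-specific and stubbed out here via a `transcribe` callable; a library such as `jiwer` provides the same metrics off the shelf:

```python
import json

def edit_distance(ref, hyp):
    """Levenshtein distance over any sequence (words for WER, chars for CER)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]

def wer(ref, hyp):
    """Word Error Rate: edit distance over word tokens / reference length."""
    return edit_distance(ref.split(), hyp.split()) / max(len(ref.split()), 1)

def cer(ref, hyp):
    """Character Error Rate: edit distance over characters / reference length."""
    return edit_distance(list(ref), list(hyp)) / max(len(ref), 1)

def score_manifest(path, transcribe):
    """Run `transcribe(audio_path)` per manifest row; return (mean WER, mean CER)."""
    with open(path, encoding="utf-8") as f:
        rows = [json.loads(line) for line in f if line.strip()]
    pairs = [(r["transcript"], transcribe(r["audio_path"])) for r in rows]
    return (sum(wer(a, b) for a, b in pairs) / len(pairs),
            sum(cer(a, b) for a, b in pairs) / len(pairs))
```

Per-speaker breakdowns are the same loop grouped on the manifest's `speaker_id` field.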

### Step 4 — Collect real recordings (the only human-gated step)

**Why.** This is where the rebuild touches reality. Three to five native speakers, using their actual phones, in their actual environments. Fifteen to twenty utterances each covering the agricultural domain you scoped for. The recording conditions have to be real, or the eval set will give you FLEURS-like numbers that lie to you.

**How.** Write a recording script with 50-100 prompts covering:

- Greetings and politeness formulas (baseline — should be easy)
- Agricultural queries the product actually needs to handle ("how wet is the soil," "when should I water the tomatoes," "is there a pest alert")
- Vocabulary you know is underrepresented in FLEURS (crop names, tool names, regional agricultural terms)
- A few natural code-switch utterances (Bambara with French loanwords)

Share the script via WhatsApp voice messages or have them record in a free mobile app that returns wav or m4a. Transcribe by hand (or by LLM with manual correction). Commit the JSONL manifest to the repo; upload the audio to a private HF dataset to avoid bloating git history.

Set a target: at least 50 utterances across at least 3 speakers before running your first baseline eval. More is better, but 50 is the usable floor.
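One way to keep hand transcription low-friction is to collect it in a spreadsheet/CSV and convert mechanically into the manifest. The CSV intermediate is an assumption — use whatever format the transcribers find easiest — but the conversion itself is trivial:

```python
import csv
import json

def csv_to_manifest(csv_path, jsonl_path):
    """Convert a transcription CSV (columns mirroring the manifest fields:
    audio_path, transcript, speaker_id, region, noise, duration_s) into the
    JSONL eval manifest. Returns the number of rows written."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    with open(jsonl_path, "w", encoding="utf-8") as out:
        for row in rows:
            row["duration_s"] = float(row["duration_s"])  # numeric, not string
            out.write(json.dumps(row, ensure_ascii=False) + "\n")
    return len(rows)
```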

### Step 5 — Deploy the minimal Space

**Why.** A second HF Space running `app_minimal.py` in parallel with the main Space gives testers a stripped-down version to react to. Comparing two Spaces teaches you which features in the main app are actually pulling weight — if minimal gets the same "I'd use this" reaction as the full version, most of the fancy work isn't load-bearing for first-use value (which doesn't mean it's wrong, just that adoption doesn't depend on it).

**How.** Create a new Space, e.g. `ous-sow/sahel-voice-minimal`. Set the Space entry point to `app_minimal.py`. Keep `packages.txt` unchanged (ffmpeg is still needed). In `requirements.txt`, consider a trimmed version that doesn't pull in voice cloning or training-only deps — this is a chance to get the minimal Space to cold-boot faster.

Add basic session logging: every interaction writes a row to a HF dataset `ous-sow/sahel-agri-field-logs` with fields `{timestamp, speaker_opt_in_id, audio_hash, transcript, llm_reply, tts_audio_hash, latency_ms}`. Include opt-in consent text in the UI. No PII. This logging is what will feed your future training data and answer "are users actually coming back."
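A sketch of what writing one such row could look like, using only the fields listed above. The helper name and the local-JSONL destination are assumptions — in the Space the row would be appended to the HF dataset rather than a local file — but note the privacy shape: audio is hashed, never stored in the log:

```python
import hashlib
import json
import time

def log_row(path, speaker_opt_in_id, audio, transcript, llm_reply,
            tts_audio, latency_ms):
    """Append one interaction to a JSONL session log and return the row."""
    row = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "speaker_opt_in_id": speaker_opt_in_id,           # opt-in pseudonym, no PII
        "audio_hash": hashlib.sha256(audio).hexdigest(),  # hash, not the audio itself
        "transcript": transcript,
        "llm_reply": llm_reply,
        "tts_audio_hash": hashlib.sha256(tts_audio).hexdigest(),
        "latency_ms": latency_ms,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return row
```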

### Step 6 — Run the field test

**Why.** The whole rebuild exists to get this step done. Everything before it is scaffolding; everything after it is informed by what happens in it. The success metric is not WER. It is: **do the testers ask a second question they came up with themselves?** That is the shortest signal that tells you whether this is a product or a demo.

**How.** Five testers, two weeks. WhatsApp intro: here is the link, please try to ask about soil or weather in Bambara, tell me anything weird. No coaching on phrasing. At the end of week 1 and week 2, ask each tester three questions: what worked, what failed, would you come back tomorrow. Record answers. No metrics from this stage go in a spreadsheet; they go in a short note under `docs/field_test_notes_YYYY-MM-DD.md` written in plain language.

In parallel, the session logs from Step 5 accumulate. At the end of two weeks, run a small analysis: median latency, distribution of utterance lengths, most common failure utterances, return rate per tester.
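That end-of-test analysis can be sketched against the Step 5 log schema. The function name is hypothetical, and "returning" is defined crudely here as more than one interaction; refine the definition once the real logs show what a session looks like:

```python
import json
import statistics
from collections import Counter

def summarize_logs(jsonl_path):
    """Summarize two weeks of session logs into the headline numbers."""
    with open(jsonl_path, encoding="utf-8") as f:
        rows = [json.loads(line) for line in f if line.strip()]
    latencies = [r["latency_ms"] for r in rows]
    per_tester = Counter(r["speaker_opt_in_id"] for r in rows)
    return {
        "interactions": len(rows),
        "median_latency_ms": statistics.median(latencies),
        "testers": len(per_tester),
        "returning_testers": sum(1 for c in per_tester.values() if c > 1),
    }
```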

### Step 7 — Selective reintegration

**Why.** Now you have evidence. Some of the Stage H features the main app already has will turn out to be essential — users asked for speaker memory, or they wanted the IoT integration enough to keep trying. Other features will turn out to be polish no tester noticed. The rebuild ends not with a big merge but with a prioritized list: which features go back into the critical path immediately, which wait, which get deprecated.

**How.** Open a small PR from `experimental/baseline-rebuild` back into main that brings in *only the data and documentation*:

- `data/eval/bambara_field.jsonl` and the audio reference
- `scripts/eval_baseline.py`
- `docs/eval_protocol.md`
- `docs/metrics.md` with baseline numbers recorded
- `docs/field_test_notes_*.md`
- The session-logging infrastructure (if you want it in the production Space too — usually yes)

Leave `app_minimal.py` on the branch as a long-lived tool — it's now your smoke-test harness. Don't merge it into main unless it's actively useful there.

From the field test notes, write a short follow-up roadmap document (`docs/roadmap_post_field_test.md`) that reorders the Month 7+ work based on what you actually learned. The features the testers needed get priority. The features that weren't missed drop in rank.

## What NOT to touch during the rebuild

- **Production `app.py`** — stays as-is on main. Users continue to see it on the main HF Space.
- **The HF dataset `ous-sow/sahel-agri-feedback`** — keep accepting writes from the main app; the minimal Space can also write to it or to a separate one, your call.
- **LoRA training infrastructure** — fixing the Kaggle crash is important Stage C work but it is *not* part of this rebuild. Track it as a separate issue. The rebuild uses Whisper zero-shot deliberately, to decouple field testing from training progress.
- **All `src/` modules** — use them, import them, fix bugs in-place if found, but do not rewrite.
- **The FastAPI service** — leave dormant for the duration. It comes back into focus post-rebuild.

## Rough timeline

| Week | Work |
|------|------|
| 1 | Steps 1-2: branch, tag, `app_minimal.py` wired and locally runnable |
| 2 | Step 3: eval infrastructure + script scaffolded; Step 4 started (recording script sent to speakers) |
| 3 | Step 4 continues: first 50 utterances collected, transcribed, committed to eval manifest |
| 4 | Step 3 closed: first baseline eval run, numbers recorded in `docs/metrics.md`; Step 5: minimal Space deployed |
| 5-6 | Step 6: field test runs, logs accumulate, interviews at end of week 5 and week 6 |
| 7 | Step 7: reintegration PR, follow-up roadmap written |

Seven weeks to close the measurement gap, with production untouched the whole time.

## One-line summary

The rebuild is a parallel minimal track that collects the real-user signal the project was built without — nothing gets deleted, production keeps shipping, and the reintegration at the end is a PR of data and docs, not code.
docs/notebook_collaboration.md
ADDED
@@ -0,0 +1,159 @@
# Notebook Collaboration — How We Work on Kaggle Notebooks Together

*Audience: anyone collaborating with Broulaye on Sahel-Voice-Lab*
*Last updated: 2026-04-20*

## Why we're changing how we work

Up to now, we've mostly been editing notebooks inside the Kaggle web UI, downloading them occasionally, and pushing to git. That's painful because:

- The Kaggle copy and the git copy drift apart — it's never clear which one is "right."
- Cell outputs and execution counts change every run, so git diffs are huge and unreadable.
- If both of us edit the same notebook at the same time, one of us accidentally overwrites the other.

The new workflow fixes this by making **git the single source of truth** and using Kaggle purely as the place where notebooks *run*. We edit locally, commit to git, and push the notebook up to Kaggle with one command. The Kaggle web UI becomes read-only for our shared notebooks — we still go there to watch runs and read logs, but we don't type code into it anymore.

## What you need to install (one time, on your own machine)

```bash
pip install kaggle nbstripout
```

- `kaggle` — the official Kaggle command-line tool. Lets you push, pull, run, and monitor Kaggle notebooks from your terminal.
- `nbstripout` — strips cell outputs and execution counts from notebooks before they hit git, so diffs stay about *code*, not noise.

## Set up your Kaggle API credentials (one time)

1. Go to [kaggle.com](https://www.kaggle.com), click your avatar → **Settings** → **API** → **Create New API Token**. A file called `kaggle.json` downloads.
2. Move it to the right place and lock down permissions:

```bash
mkdir -p ~/.kaggle
mv ~/Downloads/kaggle.json ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json
```

3. Confirm it works:

```bash
kaggle kernels list --mine
```

You should see your existing kernels. If you get an auth error, check the file location and permissions.

**Never commit `kaggle.json` to git.** It's already in `.gitignore` in this repo, but if you work in another repo, add it yourself.

## Repository layout for notebooks

Each Kaggle notebook ("kernel" in Kaggle's API language) needs its own folder with a `kernel-metadata.json` file next to the `.ipynb`. Our structure:

```
notebooks/
  kaggle_master_trainer/
    kernel-metadata.json
    kaggle_master_trainer.ipynb
  train_fula_tts/
    kernel-metadata.json
    train_fula_tts.ipynb
  bootstrap_repos.ipynb    # local helper, not a Kaggle kernel
  train_colab.ipynb        # runs on Google Colab, different flow
```

A `kernel-metadata.json` looks roughly like this (example for the master trainer):

```json
{
  "id": "ous-sow/sahel-kaggle-master-trainer",
  "title": "Sahel-Voice-Lab Master Trainer",
  "code_file": "kaggle_master_trainer.ipynb",
  "language": "python",
  "kernel_type": "notebook",
  "is_private": true,
  "enable_gpu": true,
  "enable_internet": true,
  "dataset_sources": ["google/fleurs", "robotsmali/jeli-asr"],
  "kernel_sources": [],
  "competition_sources": []
}
```

The `id` field (`owner/slug`) is **permanent**. Once we've agreed on a slug for a shared kernel, never change it — that's our shared pointer to the kernel living on Kaggle.
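Since the `id` is the one field that must never drift, a small pre-push guard is cheap insurance. The script below and its `EXPECTED_IDS` map are assumptions — nothing like this exists in the repo yet — so keep the map in sync with whichever slugs we've actually agreed on:

```python
import json
import pathlib

# Folder name -> agreed kernel slug. This map is the source of truth for
# the guard; extend it as new shared kernels are added.
EXPECTED_IDS = {
    "kaggle_master_trainer": "ous-sow/sahel-kaggle-master-trainer",
}

def check_ids(notebooks_dir="notebooks"):
    """Return (metadata_path, found_id, expected_id) for every mismatch."""
    bad = []
    for meta in sorted(pathlib.Path(notebooks_dir).glob("*/kernel-metadata.json")):
        found = json.loads(meta.read_text(encoding="utf-8"))["id"]
        expected = EXPECTED_IDS.get(meta.parent.name)
        if expected is not None and found != expected:
            bad.append((str(meta), found, expected))
    return bad
```

Run it before `kaggle kernels push`; an empty list means no slug was accidentally edited.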

## Enable the nbstripout filter in the repo (one time per clone)

From the repo root, the first time you clone:

```bash
nbstripout --install --attributes .gitattributes
```

This adds a git filter that runs on every `.ipynb` before it gets committed, stripping outputs and execution counts. Commit the `.gitattributes` file so everyone else picks it up automatically.

**First-time caveat:** if the repo previously had notebooks-with-outputs committed, your first diff after enabling this will look like everything is being "deleted." That's correct and one-time — it's just stripping the old outputs.

## The daily workflow

```bash
# 1. Pull the latest version from git
git pull

# 2. Edit the notebook locally (VS Code, JupyterLab, whatever you prefer)
#    — running cells is fine; nbstripout handles the cleanup on commit.

# 3. Commit your changes
git add notebooks/kaggle_master_trainer/kaggle_master_trainer.ipynb
git commit -m "experiment: lower LR for ASR adapter"
git push

# 4. Push the notebook up to Kaggle to actually run it
cd notebooks/kaggle_master_trainer
kaggle kernels push

# 5. Watch the run
kaggle kernels status ous-sow/sahel-kaggle-master-trainer

# 6. When it's done, pull outputs if you need them
kaggle kernels output ous-sow/sahel-kaggle-master-trainer -p ./runs/$(date +%F)/
```

Results go into `runs/` (which is gitignored). **They do not go back into the `.ipynb` in git** — that's what nbstripout is protecting us from.

## Team rules (please read these — they matter)

1. **Never edit shared notebooks in the Kaggle web UI.** Use the web UI to watch runs, read logs, download output files. If you want to experiment, do it locally. If you absolutely must try something quick in the web UI, treat it as a scratch copy — do not manually merge it back.

2. **One runner at a time per kernel.** `kaggle kernels push` *replaces* the notebook on Kaggle's side. If you push while the other person's run is queued or mid-execution, you'll queue behind them or disrupt them. Coordinate over chat, or — better — give yourself a personal kernel slug (e.g. `ous-sow/sahel-trainer-dev-<yourname>`) for experimentation, and only push to the shared kernel (`ous-sow/sahel-kaggle-master-trainer`) when a change is ready to run cleanly.

3. **Git is the source of truth, always.** Every Kaggle run begins with a `kaggle kernels push` from the current git state. Nothing on Kaggle is authoritative. If something on Kaggle looks different from git, git wins — pull from git, re-push, run again.

## Troubleshooting

**`kaggle kernels push` says "message: Kernel already exists."**
Expected — it's just telling you the kernel already exists on Kaggle and will be updated. Not an error.

**Huge diff with no real code changes.**
`nbstripout` isn't active in your clone. Run `nbstripout --install --attributes .gitattributes` from the repo root and re-stage the file.

**Auth errors from the `kaggle` CLI.**
Check that `~/.kaggle/kaggle.json` exists, is yours (not someone else's), and has mode 600.

**Merge conflict on a `kernel-metadata.json`.**
Rare but possible if two people edit metadata simultaneously. The file is small JSON — resolve by hand, keeping the shared `id` untouched.

**The notebook ran fine on Kaggle but saved outputs landed in git anyway.**
You committed before `nbstripout` stripped the outputs. Either re-stage (`git add`), which triggers the filter, or run `nbstripout <file.ipynb>` manually before `git add`.

**You accidentally edited on the Kaggle web UI.**
Go to Kaggle → your kernel → "..." → Download notebook. Overwrite the local `.ipynb` with the downloaded file. Commit. Re-push. Don't panic — just restore git as the source of truth.

## What this workflow does not solve

- **Two people editing the same cell at the same time.** Normal git merge conflicts will still happen if both of us touch the same notebook cell simultaneously. Mitigation: work on different notebooks when possible, or pair-edit voice-on-voice. If this becomes frequent, we can add `jupytext` later, which pairs each `.ipynb` with a `.py` mirror that merges like regular Python.
- **Debugging a crashing Kaggle run.** The CLI pushes and watches, but fixing the crash is still back-and-forth between your local editor and the Kaggle logs. The workflow just removes the "which version is right" confusion from that loop.
- **Kaggle's GPU quota.** You still get 30 free GPU hours per week. Plan accordingly.

## TL;DR

Edit locally, commit to git, `kaggle kernels push` to run, `kaggle kernels output` to retrieve. Never edit on the Kaggle web UI for shared kernels. Git is the source of truth. `nbstripout` keeps diffs clean.

If anything here doesn't make sense, ping Broulaye before improvising.
docs/roadmap_2026-04.md
ADDED
@@ -0,0 +1,292 @@
# Sahel-Voice-Lab — Roadmap & Starting-from-Scratch Plan
|
| 2 |
+
|
| 3 |
+
*Last updated: 2026-04-19*
|
| 4 |
+
*Maintainer: Broulaye*
|
| 5 |
+
|
| 6 |
+
This document has two parts:
|
| 7 |
+
|
| 8 |
+
1. **Where the project stands today** — what's built, what's missing, and what to do next.
|
| 9 |
+
2. **If I were starting from zero today** — a realistic, solo-maintainer, free-compute path from nothing to a usable Bambara voice assistant.
|
| 10 |
+
|
| 11 |
+
---
|
| 12 |
+
|
| 13 |
+
## Part 1 — The four layers, in plain language
|
| 14 |
+
|
| 15 |
+
A voice assistant is four layers stacked on top of each other. Most of this project is about the fourth layer; the first three are mostly rented or borrowed.
|
| 16 |
+
|
| 17 |
+
### Layer 1 — the ear: Speech-to-Text (STT / ASR)
|
| 18 |
+
|
| 19 |
+
Abbreviations:
|
| 20 |
+
- **STT** = Speech-to-Text
|
| 21 |
+
- **ASR** = Automatic Speech Recognition (same thing)
|
| 22 |
+
- **LoRA** = Low-Rank Adaptation — a technique to "patch" a large model with a tiny file (~50 MB) instead of retraining all of it
|
| 23 |
+
- **PEFT** = Parameter-Efficient Fine-Tuning — the HuggingFace library that implements LoRA
|
| 24 |
+
|
| 25 |
+
The model used is **Whisper** (OpenAI's open-source multilingual speech model). Out of the box, Whisper's Bambara is poor because it barely saw Bambara during training. The fix: train LoRA adapters per language. One ~1.5 GB Whisper backbone stays in memory; small Bambara and Fula patches swap in and out in ~50 ms.
|
| 26 |
+
|
| 27 |
+
Where it lives in the repo:
|
| 28 |
+
- `src/engine/whisper_base.py` — loads the backbone
|
| 29 |
+
- `src/engine/adapter_manager.py` — the hot-swap
|
| 30 |
+
- `src/engine/transcriber.py` — what the app calls
|
| 31 |
+
- `src/training/trainer.py` + `notebooks/kaggle_master_trainer.ipynb` — training
|
| 32 |
+
|
| 33 |
+
### Layer 2 — the brain: Large Language Model (LLM)
|
| 34 |
+
|
| 35 |
+
Abbreviations:
|
| 36 |
+
- **LLM** = Large Language Model
|
| 37 |
+
- **JSON** = JavaScript Object Notation, a structured format
|
| 38 |
+
|
| 39 |
+
No one trains this from scratch. You rent one. The project calls **Qwen** (Alibaba's multilingual model) through HuggingFace's hosted inference service, with a custom "adult-child" prompt that forces structured JSON output (fields like intent, reply, translation).
|
| 40 |
+
|
| 41 |
+
Where it lives:
|
| 42 |
+
- `src/llm/gemma_client.py` — named "gemma" for legacy reasons; now talks to Qwen.
|
| 43 |
+
|
| 44 |
+
### Layer 3 — the mouth: Text-to-Speech (TTS)
|
| 45 |
+
|
| 46 |
+
Abbreviations:
|
| 47 |
+
- **TTS** = Text-to-Speech
|
| 48 |
+
- **MMS** = Massively Multilingual Speech (Meta's 1000+ language model, lower quality, used as fallback)
|
| 49 |
+
- **VITS** = Variational Inference Text-to-Speech (a specific architecture — higher quality, one speaker per trained model)
|
| 50 |
+
- **F5-TTS** = a recent zero-shot voice-cloning TTS system
|
| 51 |
+
|
| 52 |
+
The hardest layer for low-resource languages. Needs hours of clean studio audio from a native speaker. Used in tiers:
|
| 53 |
+
- MMS-TTS as fallback baseline
|
| 54 |
+
- Waxal-VITS for trained Bambara quality
|
| 55 |
+
- F5-TTS for voice cloning in Phase 3
|
| 56 |
+
|
| 57 |
+
Where it lives:
|
| 58 |
+
- `src/tts/mms_tts.py`, `src/tts/waxal_tts.py`, `src/tts/f5_tts.py`, `src/tts/voice_cloner.py`
|
| 59 |
+
|
| 60 |
+
### Layer 4 — the glue
|
| 61 |
+
|
| 62 |
+
The real differentiator of the project — everything that makes the rented models into a product.
|
| 63 |
+
|
| 64 |
+
Abbreviations:
|
| 65 |
+
- **IoT** = Internet of Things (networked sensors)
|
| 66 |
+
- **ECAPA-TDNN** = Emphasized Channel Attention Propagation — Time-Delay Neural Network; a speaker-fingerprint model
|
| 67 |
+
|
| 68 |
+
Components:
|
| 69 |
+
- Memory loop — `src/memory/memory_manager.py`
|
| 70 |
+
- Normalization — `src/data/bam_normalize.py`, `src/data/adlam.py`
|
| 71 |
+
- Fast-path phrases — `src/conversation/phrase_matcher.py`
|
| 72 |
+
- Intent detection — `src/iot/intent_parser.py`
|
| 73 |
+
- Voice responder (≤ 6-word replies) — `src/iot/voice_responder.py`
|
| 74 |
+
- Sensor bridge — `src/iot/sensor_bridge.py`
|
| 75 |
+
- Speaker ID — `src/voice/speaker_profiles.py`
|
| 76 |
+
|
| 77 |
+
---

## Part 2 — What's present vs missing

### Present

- All four layers scaffolded; every module named in the project description exists in `src/`
- Two entry points: `app.py` (Gradio, HF Space) and `src/api/app.py` (FastAPI)
- Training infrastructure and Kaggle notebooks
- Mobile export pipeline (ONNX, TFLite) in `src/optimization/`
- Bambara Waxal-VITS TTS working
- Memory loop wired into the UI
- Agricultural domain vocabulary and intent model

### Missing or weak

1. `data/vocabulary.jsonl` is empty — no local snapshot of user-taught words
2. LoRA fine-tuning still crashes on Kaggle T4 (an active blocker per the project notes)
3. Fula TTS is a placeholder — no trained `ous-sow/fula-tts` yet
4. No real-user evaluation set (no `data/eval/` folder with farmer recordings); all quality numbers currently come from FLEURS, which does not reflect real conditions
5. No documented tone-handling policy for TTS (Bambara tone is unmarked in writing but matters for pronunciation)

---

## Part 3 — Actionable next steps (ordered by leverage)

### Step 1 — Fix the LoRA training crash on Kaggle
Highest leverage: it unblocks every ASR quality gain downstream.
- Reproduce the exact error on a Kaggle T4 runtime
- Pin `datasets` to a known-good version (either pre-4.x, or the correct torchcodec pin for 4.x)
- If the AMP (Automatic Mixed Precision) scaler is the issue, drop to fp32 for a debug run — note that the T4 (compute capability 7.5) has no native bf16 support, so bf16 is not the escape hatch here
- Validate with a tiny 100-sample training job before a full run
- Commit a working Bambara adapter before moving on
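
The precision decision can be pinned down as a small helper — a hedged sketch, not the trainer's actual code; the dict keys match `Seq2SeqTrainingArguments` flag names, and the fallback order is an assumption:

```python
def pick_precision_flags(cuda_available: bool, bf16_supported: bool) -> dict:
    """Choose mixed-precision flags for Seq2SeqTrainingArguments.

    A Kaggle T4 (compute capability 7.5) has no native bf16, so the
    realistic choices there are fp16 AMP or plain fp32. If the AMP grad
    scaler turns out to be the crash source, force fp32 for the
    100-sample smoke run and re-enable fp16 once training completes.
    """
    if cuda_available and bf16_supported:
        return {"bf16": True, "fp16": False}
    if cuda_available:
        return {"bf16": False, "fp16": True}   # T4 path
    return {"bf16": False, "fp16": False}      # CPU / crash-isolation path

# At runtime the two booleans would come from
# torch.cuda.is_available() and torch.cuda.is_bf16_supported().
```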

### Step 2 — Build a real-user evaluation set
Do this in parallel with Step 1.
- Record 50-100 Bambara utterances from at least 3 native speakers
- Include noisy conditions (wind, motorcycle, livestock — `noise_samples/` already anticipates this)
- Transcribe by hand; store under `data/eval/bambara_field.jsonl`
- Run the current stack and record baseline WER (Word Error Rate) and CER (Character Error Rate)
- From here on, measure all changes against this set, not FLEURS
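
The repo already carries jiwer for the scoring; for reference, the metric itself is just word-level edit distance over the hand transcripts — a minimal self-contained version:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference word count."""
    r, h = reference.split(), hypothesis.split()
    # Standard Levenshtein DP table over words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)
```

CER is the same computation over characters (`list(reference)` instead of `.split()`). Record both numbers somewhere durable the day the eval set lands.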

### Step 3 — Exercise the memory loop end-to-end
- Run 10 live teaching sessions
- Confirm the local JSONL grows; confirm the HuggingFace Hub push
- Add a test under `tests/` that mocks the Hub and validates the write path
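
A sketch of that test — `save_word` here is a stand-in for the real manager's teach flow, whose actual signature lives in `src/memory/memory_manager.py`:

```python
import json
import pathlib
import tempfile
from unittest import mock

def save_word(vocab_path, word, translation, push_fn):
    """Stand-in for the teach flow: append locally, then push to the Hub."""
    rec = {"word": word, "translation": translation}
    with open(vocab_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    push_fn(vocab_path)  # real code: HfApi().upload_file(..., repo_type="dataset")

def test_local_write_survives_a_mocked_hub():
    with tempfile.TemporaryDirectory() as d:
        path = pathlib.Path(d) / "vocabulary.jsonl"
        hub_push = mock.Mock()
        save_word(path, "i ni ce", "hello", hub_push)
        lines = path.read_text(encoding="utf-8").splitlines()
        assert len(lines) == 1
        assert json.loads(lines[0])["word"] == "i ni ce"
        hub_push.assert_called_once_with(path)
```

The point of the mock is that the local write must succeed even when the Hub is unreachable — the two paths fail independently.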

### Step 4 — Train `ous-sow/fula-tts`
- Can run in parallel on RunPod
- Needs 1-3 hours of clean studio audio from a single Fula speaker
- Same VITS recipe as Waxal Bambara

### Step 5 — Close Phase 3 voice-to-voice parity
- Once Fula TTS exists, test the full voice-in → voice-out pipeline for both languages
- Measure round-trip CER: spoken sentence → transcript → response → synthesized speech → re-transcribe → compare
- This catches compounding errors across layers
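
The round-trip measurement is easy to get subtly wrong (comparing against the wrong reference), so pin it down as a tiny harness — the component names are stand-ins for the project's real ASR/LLM/TTS callables:

```python
def round_trip_cer(question_audio, asr, llm, tts, cer):
    """Voice-in → voice-out round trip reduced to one number.

    The reference for the final comparison is the *response text* the
    system produced, not the user's question: we re-transcribe the
    synthesized reply and measure how much the TTS + ASR pair mangled it.
    """
    question_text = asr(question_audio)
    response_text = llm(question_text)
    response_audio = tts(response_text)
    heard_back = asr(response_audio)
    return cer(response_text, heard_back)

# With perfect components the score is zero:
score = round_trip_cer(
    "fake.wav",
    asr=lambda audio: str(audio),      # identity stand-ins
    llm=lambda text: text,
    tts=lambda text: text,
    cer=lambda ref, hyp: 0.0 if ref == hyp else 1.0,
)
```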

### Step 6 — Small field test
- Five Malian farmers. Cheap version: WhatsApp voice messages or a phone call with a screen-shared Gradio session
- Log what they try to ask, whether the response is intelligible, and whether they'd use it again
- Success metric: do they ask a second question without being prompted?

### Step 7 — Write a tone-handling policy
- Pick a position: "accept tonally-wrong TTS on homographs as a known limitation" vs "invest in tone annotation for the TTS training corpus in cycle N+1"
- Either is defensible. The bad option is leaving it unspoken.

---

## Part 4 — If I were starting from zero today

Realistic assumptions: solo maintainer, nights-and-weekends pace, free or cheap compute (Kaggle free T4, HuggingFace Spaces cpu-basic, occasional RunPod for bigger runs), access to native speakers (this one is non-negotiable — if you don't have them, stop and find them first).

The single most important lesson from Sahel-Voice-Lab as it exists today: **it does four products' worth of things at once** (agricultural IoT, self-teaching, multi-language, voice cloning). If starting over, I'd ship one at a time.

### Month 0 — Before writing any code
1. Pick a narrow use case. Not "general Bambara assistant." Something like "voice queries for soil moisture" or "learn 100 agricultural words." One domain, one job.
2. Identify 3-5 native speakers willing to test throughout. Get their phone numbers. Ask now, not later.
3. Map the data landscape. Write a one-page doc listing every Bambara dataset you find: FLEURS (bam_ML), RobotsMali Jeli-ASR, OpenSLR, Masakhane resources, Common Voice. Note size, license, quality.
4. Decide: Bambara only for the first version. Fula comes later. Do not start bilingual.

### Month 1 — Text-first prototype (no audio yet)
5. Wire the LLM (Qwen via HuggingFace Inference) with a carefully written system prompt. Start in French or English; have it answer in Bambara.
6. Build a Gradio text-in / text-out demo. Deploy to a HuggingFace Space on cpu-basic.
7. Write the normalizer (the `bam_normalize.py` equivalent) with real tests. Spend real time on this; the audit you already did on the alphabet is the specification.
8. Show it to your native speakers. Is the Bambara intelligible? Are the answers right?
9. **Do not add STT or TTS yet.** This stage's only job is to learn what the LLM knows about Bambara and what it doesn't.
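
A sketch of what item 7's tests would pin down — the mapping pairs below are illustrative; the audit's full table in `bam_normalize.py` is the authority:

```python
import re
import unicodedata

# Illustrative old-orthography → 1967-orthography pairs. The real table
# is larger and must also handle the `ny` ambiguity (palatal nasal vs.
# n + palatal glide across a morpheme seam), which plain replacement
# cannot resolve.
OLD_TO_NEW = [
    ("gn", "ɲ"),   # French-style spelling of the palatal nasal
    ("è", "ɛ"),
    ("ò", "ɔ"),
]

def normalize_bm(text: str) -> str:
    """Lowercase, NFC-normalize, apply spelling rewrites, squeeze spaces.

    NFC first, so a decomposed `e` + combining grave matches the
    composed `è` in the table.
    """
    text = unicodedata.normalize("NFC", text.lower().strip())
    for old, new in OLD_TO_NEW:
        text = text.replace(old, new)
    return re.sub(r"\s+", " ", text)
```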

### Month 2 — Add the ear, and the eval set
10. **Build the evaluation set before training anything.** 50 utterances, 3 speakers, hand-transcribed. This is the most "wish I'd done this earlier" advice in low-resource ASR.
11. Try Whisper-large-v3-turbo zero-shot on your eval set. Record the baseline WER. It will probably be 60-80%.
12. Only then start LoRA fine-tuning with FLEURS + Jeli-ASR on Kaggle T4. Target: WER from ~70% to ~30% within four weeks.
13. Wire the trained adapter into the Gradio app.

### Month 3 — Add the mouth (baseline quality)
14. Use MMS-TTS Bambara. One API call. It sounds robotic, but it speaks.
15. Ship this as the "Phase 1 complete" milestone on HuggingFace Spaces. This is a real product now: voice in, voice out.
16. Collect 50-100 field interactions. Log everything.

### Month 4 — Memory loop
17. Build the teach-new-word flow.
18. JSONL on disk + HuggingFace dataset push.
19. Add the "curiosity" feature (the system occasionally asks the user to teach it a word).
20. Exercise it with real users before declaring it done. An empty `vocabulary.jsonl` is a sign the loop was never really tested.

### Month 5 — Upgrade TTS
21. Record 1-3 hours of studio audio with a single native speaker reading from a curated script that covers your domain vocabulary. This is the single biggest quality jump in the whole project.
22. Train a VITS model (Waxal-style). Swap MMS out for it.
23. Compare side-by-side with native listeners. Keep MMS as fallback.

### Month 6 — Field test and iterate
24. Five farmers. Phone calls or WhatsApp. Real conditions.
25. Success metric: do they ask a second question unprompted? Do they come back tomorrow?
26. Expect this stage to reshape priorities. Follow the feedback; do not defend the roadmap.

### Month 7+ — Everything else
27. Second language (Fula / Adlam): only after Bambara is stable
28. Voice cloning (F5-TTS)
29. Mobile / offline export (ONNX, TFLite)
30. IoT sensor integration
31. FastAPI service alongside the Gradio app

### Things I would deliberately do differently
- **Ship the ugliest possible version at Month 3, not the polished pipeline at Month 9.** Five farmers with a robotic voice tell you more than 500 hours of benchmark tuning.
- **Build the evaluation set in Month 2, not later.** Every decision compounds; without an eval, you cannot tell which decisions to keep.
- **One language, one entry point, one framework at a time.** The current project has FastAPI + Gradio + Kaggle + ONNX + TFLite + bitsandbytes + speaker ID + voice cloning. Each is a maintenance commitment. Add them only when the product's existence justifies them.
- **Don't train your own ASR adapter until the LLM/TTS product has been tested.** Whisper zero-shot is good enough to validate the product idea. Training is expensive; you might end up optimizing a layer users don't care about.
- **Native speakers as collaborators, not testers at the end.** Monthly review calls from Month 1, not Month 6.

### One-sentence summary
If I were starting from zero today, I would ship a narrow, ugly, one-language, text-first version to five real native-speaker users in the first three months, and build everything else on top of the feedback from those five people.

---

## Part 5 — Expanded walkthrough: why, how, and where Sahel-Voice-Lab fits

Each stage below has three sections: **Why** (the purpose — why this stage exists and what breaks if you skip it), **How** (concrete mechanics — files, commands, tools, decisions), and **Current project status** (what you have, what's missing, relative to this stage).

### Stage A — Scoping and data audit (Month 0)

**Why.** The single biggest failure mode in low-resource voice AI is attempting a "general Bambara assistant." You cannot measure general; you cannot ship general; you cannot collect targeted data for general. You need one narrow domain so vocabulary is bounded, users can be found, failures are diagnosable, and every subsequent decision has a clear yes/no test: "does this help a farmer query soil moisture?" A bad scope locks in months of wasted work.

**How.** Write a one-page scoping document that answers: (1) what is the single first use case — one sentence, measurable; (2) who is the first user — names, phone numbers, what language variety they speak; (3) what does success look like in three months — one metric, not five. Then write a data audit: every public Bambara dataset with size, license, quality, and known issues. FLEURS (`bam_ML`), RobotsMali Jeli-ASR, OpenSLR, Masakhane, Common Voice. Note what's missing — domain vocabulary usually is.

**Current project status.** Stage A is implicitly done. The domain is "agricultural voice interface for Sahelian farmers." The data sources are identified and wired (`src/data/waxal_loader.py`, `src/data/web_harvester.py`, FLEURS referenced in training configs). The one thing weakly documented is the *target user profile* — which region, which dialect, what level of literacy, what phones they use. Writing this down explicitly (even as a one-paragraph persona in the README) tightens every downstream decision.

### Stage B — Text-first prototype (Month 1)

**Why.** Before introducing audio, you need to know what the LLM actually knows about Bambara and what it doesn't. If the text-in/text-out experience is bad, adding voice will not save it; voice only adds more failure modes. Text prototyping is cheap — one deployment, no GPU, a few prompts — and teaches you the vocabulary gap you will spend the rest of the project closing.

**How.** Call a hosted multilingual LLM (Qwen, Mistral, Gemma) via HuggingFace Inference with `huggingface-hub`'s `InferenceClient`. Write a careful system prompt — the "adult-child" contract: the LLM acts like a patient teacher and returns structured JSON with fields `{intent, reply_bm, reply_fr, confidence}`. Deploy a Gradio text-in/text-out interface to a HuggingFace Space on `cpu-basic`. Show it to two native speakers; ask what sounds wrong. Spend real time on the normalizer at this stage — the orthography audit (`ɛ ↔ e`, `ɔ ↔ o`, `ɲ ↔ ny/gn`, `ŋ ↔ ng`, 1967 vs older forms, and the `ny` ambiguity between palatal nasal and nasal + palatal glide) is the specification.
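
The contract can be sketched like this — the prompt wording is illustrative, the field set is the one named above, and the default model id and `max_tokens` value are assumptions:

```python
import json

SYSTEM_PROMPT = (
    "You are a patient teacher and an eager learner of Bambara. Reply ONLY "
    'with a JSON object of the form {"intent": "...", "reply_bm": "...", '
    '"reply_fr": "...", "confidence": 0.0}.'
)

def build_messages(user_text: str) -> list:
    """Assemble the chat-completion message list for one turn."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_text},
    ]

def ask(user_text: str, model_id: str = "Qwen/Qwen2.5-7B-Instruct") -> dict:
    # Deferred import: needs huggingface-hub installed and HF_TOKEN set.
    from huggingface_hub import InferenceClient
    client = InferenceClient(model_id)
    out = client.chat_completion(messages=build_messages(user_text),
                                 max_tokens=256)
    return json.loads(out.choices[0].message.content)
```

`json.loads` raises the moment the model drifts off-contract, so production code needs a retry or a fallback reply there.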

**Current project status.** Stage B is done. `src/llm/gemma_client.py` implements the adult-child JSON contract against Qwen. `src/data/bam_normalize.py` handles the orthographic cleanups. The Gradio app has been deployed. This stage is behind you.

### Stage C — The ear: STT plus the evaluation set (Month 2)

**Why.** This is the stage with the highest "wish I'd done it earlier" rate in low-resource ASR. You need a real-user evaluation set *before* you train anything, because training without an eval is hill-climbing in the dark. FLEURS numbers do not predict field performance; field recordings do. Only after an eval exists is it worth investing Kaggle hours in fine-tuning.

**How.** First, the eval set. Ask three native speakers to each record 15-20 utterances covering your domain vocabulary. Use their actual phones, in their actual environments (not a quiet office). Transcribe by hand. Store under `data/eval/bambara_field.jsonl` as `{audio_path, transcript, speaker_id, noise_conditions}`. Run Whisper-large-v3-turbo zero-shot against it. Record the baseline WER (Word Error Rate) and CER (Character Error Rate) numbers in the repo somewhere durable (`docs/metrics.md`). Only then: start LoRA fine-tuning with FLEURS + Jeli-ASR on Kaggle T4. Each training run is measured against your eval set, not against FLEURS.
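
A sketch of the baseline run — the JSONL field names are the ones proposed above; the `pipeline` call is standard transformers usage, deferred so the loader stays testable without a GPU:

```python
import json

def parse_eval_lines(lines) -> list:
    """Parse JSONL lines into eval records, skipping blanks."""
    return [json.loads(line) for line in lines if line.strip()]

def load_eval_set(path: str) -> list:
    """Read data/eval/bambara_field.jsonl into dicts with the fields
    {audio_path, transcript, speaker_id, noise_conditions}."""
    with open(path, encoding="utf-8") as f:
        return parse_eval_lines(f)

def zero_shot_pairs(records: list) -> list:
    """(reference, hypothesis) pairs for WER/CER scoring, e.g. with jiwer."""
    from transformers import pipeline  # deferred: large model download
    asr = pipeline("automatic-speech-recognition",
                   model="openai/whisper-large-v3-turbo")
    return [(r["transcript"], asr(r["audio_path"])["text"]) for r in records]
```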

**Current project status. You are mostly here — with two important gaps.** The Whisper + LoRA + adapter-swap pipeline is built (`src/engine/whisper_base.py`, `src/engine/adapter_manager.py`, `src/engine/transcriber.py`). Training infrastructure exists (`src/training/trainer.py`, `notebooks/kaggle_master_trainer.ipynb`). However: (1) there is no `data/eval/` folder with real farmer recordings, and (2) the LoRA fine-tuning pipeline still crashes on Kaggle T4 per your project notes. These are your two most important current blockers. Until they resolve, every other ASR improvement is speculative.

### Stage D — The mouth: baseline TTS and first ship (Month 3)

**Why.** Shipping an ugly working product beats polishing a pretty broken one. The first voice-in/voice-out deployment reveals failure modes no amount of offline testing catches — wake-word confusion, ambient noise you didn't model, users speaking too fast or too softly, compounding latency that makes the system feel dead. You cannot learn these from benchmarks; you learn them from users. Ship at the robotic-voice MMS-TTS baseline, then improve.

**How.** Wire MMS-TTS Bambara (`facebook/mms-tts-bam`) into the Gradio app — it's one `from transformers import VitsModel` call plus audio post-processing. Return audio as a Gradio `gr.Audio` output. Deploy. Write a very short intro text explaining this is a prototype. Share the Space URL with two native-speaker testers, tell them nothing about how it works, and ask them to try three things.
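
In sketch form — the `VitsModel` call follows the documented transformers usage for the MMS checkpoints; the int16 helper is an assumption about how the waveform gets handed to `gr.Audio`:

```python
def synthesize_bm(text: str, repo: str = "facebook/mms-tts-bam"):
    """Return (sampling_rate, waveform) suitable for a gr.Audio output."""
    import torch  # deferred: heavy deps, GPU-optional
    from transformers import AutoTokenizer, VitsModel
    model = VitsModel.from_pretrained(repo)
    tokenizer = AutoTokenizer.from_pretrained(repo)
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        waveform = model(**inputs).waveform[0].cpu().numpy()
    return model.config.sampling_rate, waveform

def float_to_pcm16(samples):
    """Clamp [-1, 1] float samples to 16-bit ints; clamping avoids
    wrap-around clicks on hot samples."""
    return [max(-32768, min(32767, int(s * 32767))) for s in samples]
```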

**Current project status.** Stage D is done. MMS-TTS is wired (`src/tts/mms_tts.py`), the Gradio Space is deployed, and Phase 1 has shipped per your notes. Two things might be worth auditing: whether the deployed Space is still on the MMS fallback or already on Waxal-VITS, and whether there is *any* logging/telemetry on usage to tell you whether real people are actually touching the deployed Space.

### Stage E — The memory loop (Month 4)

**Why.** The model does not know most Bambara vocabulary; users do. Without a mechanism to capture and persist what they teach, every conversation's knowledge dies with the session. The memory loop is the product's data-collection engine — the thing that lets it get better over time without you personally labeling data. This is also the core differentiation of Sahel-Voice-Lab versus a generic Bambara ASR+TTS demo.

**How.** Three components. (1) A teach-new-word flow in the UI: the user says "this is how you say X," the system confirms, stores to `data/vocabulary.jsonl` as `{word, translation, speaker_id, timestamp, audio_ref}`. (2) An async push to a versioned HuggingFace dataset (`ous-sow/sahel-agri-feedback`). (3) A "curiosity" mechanism where every N turns the LLM is prompted to identify a vocabulary gap and ask the user — this inverts the teaching initiative and collects more data per session.
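
Component (3)'s trigger logic is simple enough to pin down in a few lines — names illustrative; the real version is `src/engine/curiosity.py`:

```python
class CuriosityTicker:
    """Fires every N user turns; on a hit, the caller prompts the LLM to
    pick a missing agricultural term and ask the user to teach it."""

    def __init__(self, every_n: int = 5):
        self.every_n = every_n
        self.turns = 0

    def should_ask(self) -> bool:
        """Count one turn; True exactly every `every_n`-th turn."""
        self.turns += 1
        return self.turns % self.every_n == 0
```

Keeping the counter per-session (rather than global) matters: it guarantees a new user meets the "teach me a word" prompt within their first few turns.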

**Current project status.** Stage E is structurally done but likely not exercised. `src/memory/memory_manager.py` implements the thread-safe JSONL + Hub push. `src/engine/curiosity.py` implements the CuriosityEngine. The Gradio app has a Teaching tab. However, your local `data/vocabulary.jsonl` is empty (0 lines). This means one of three things: (a) no one has used the teach flow yet, (b) the write path is broken and you haven't noticed because no one has used it, or (c) data goes only to the Hub and you've never pulled a snapshot locally. Worth a 20-minute investigation to confirm which. A test in `tests/` that mocks the Hub and asserts the local JSONL write is cheap insurance.

### Stage F — Upgraded TTS (Month 5)

**Why.** MMS-TTS works but sounds robotic, and users notice immediately. Moving to a single-speaker VITS model trained on 1-3 hours of clean studio audio is the single biggest perceived-quality jump in the entire pipeline. It also gives you something MMS cannot: a consistent, identifiable voice that users remember. For long-term adoption, voice identity matters as much as intelligibility.

**How.** Record 1-3 hours of studio audio with one native speaker reading a curated script that covers your domain vocabulary plus conversational filler. Target: quiet room, decent USB mic, 22050 or 44100 Hz, single take per sentence. Align transcripts, clean silence, normalize loudness. Train a VITS model on your RunPod GPU (a Kaggle T4 usually doesn't have enough memory for full VITS). Publish to HuggingFace as a private or public model repo. Swap out MMS in the TTS dispatcher; keep MMS as fallback.

**Current project status.** Stage F is done for Bambara, not for Fula. The Waxal VITS integration lives in `src/tts/waxal_tts.py` and per your notes is partially shipped for Bambara (`ynnov/ekodi-bambara-tts-female`). Fula TTS is a placeholder — `ous-sow/fula-tts` does not exist yet. Closing this is one of your active goals. The recording session is usually the bottleneck, not the training.

### Stage G — Field test (Month 6)

**Why.** Everything before this stage is technical. This stage is where you find out whether the technical work produced something humans actually use. It's also where you discover that three of your prior assumptions were wrong — assumptions you could not have tested any other way. Every low-resource voice project that skips this stage ends up polished and unused.

**How.** Five native-speaker users. Cheapest version: WhatsApp voice messages or a phone call with screen-shared Gradio. Give them a small task ("ask about your soil moisture"), observe without coaching. Record what they try to ask, whether the transcript is right, whether the answer is intelligible to them, whether they would use it unprompted again. The success metric is not WER. It is: *does the user ask a second question they came up with themselves?*

**Current project status.** Stage G is **not done**. There is no field-test evidence in the repo, no usage logs, no session transcripts from actual farmers. This is, honestly, the single largest gap between where the project is and where it needs to be — more important than the Kaggle crash or the missing Fula TTS. You can ship a field test with what you have today, and the feedback will reshape everything downstream.

### Stage H — Expansion (Month 7+)

**Why.** Only once a single-language, single-domain product has real users do you earn the right to expand. Each added dimension (second language, voice cloning, mobile export, IoT integration) doubles the surface area for bugs and maintenance. Adding them in parallel to the core product means you will ship nothing well; adding them after the core is stable means each addition builds on a known-good base.

**How.** Second language (Fula/Adlam): repeat stages B through G with the new language, reusing infrastructure but refitting normalization and TTS training. Voice cloning: F5-TTS or OpenVoice, keyed to a speaker embedding from the speaker-ID layer. Mobile export: ONNX per language, then TFLite via onnx-tf, then bundle into a thin Android app. IoT integration: FastAPI service in front of the sensor bridge, authenticated, cached.
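
The ONNX leg can be sketched with optimum's documented export path — the merge step is an assumption about how the LoRA weights reach the exporter:

```python
def export_dir(base: str, lang: str) -> str:
    """One ONNX folder per language, e.g. exports/whisper-onnx-bam."""
    return f"{base}/whisper-onnx-{lang}"

def export_whisper_onnx(merged_model_dir: str, out_dir: str) -> str:
    """Export a merged Whisper checkpoint to ONNX via optimum.

    ONNX graphs cannot hot-swap LoRA adapters, so the adapter must be
    merged into the backbone (peft's merge_and_unload) and saved to
    merged_model_dir before this runs. Needs optimum[onnxruntime].
    """
    from optimum.onnxruntime import ORTModelForSpeechSeq2Seq  # deferred
    model = ORTModelForSpeechSeq2Seq.from_pretrained(merged_model_dir,
                                                     export=True)
    model.save_pretrained(out_dir)
    return out_dir
```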

**Current project status. You are ahead of schedule here, which is itself the diagnosis.** Phase 3 voice-to-voice is merged and stabilizing. F5-TTS is scaffolded (`src/tts/f5_tts.py`). OpenVoice-based voice cloning is scaffolded (`src/tts/voice_cloner.py`). Speaker ID with ECAPA-TDNN is in place (`src/voice/speaker_profiles.py`). Adlam/Pular integration has landed. ONNX and TFLite exporters exist (`src/optimization/`). A FastAPI service is scaffolded (`src/api/`). This is Month 7+ work already in the codebase. The issue is not that this work is wrong — it is that it was built before Stages C (eval set), E (loop exercised with real data), and G (field test) were actually completed. The risk is building a polished Stage H surface on an unmeasured Stage C-E foundation.

---

## Where you actually are right now

The honest diagnosis of Sahel-Voice-Lab as of 2026-04-19, mapped onto the staged plan:

**Done:** Stages A, B, D. The Bambara text and audio pipeline ships to users via Gradio on HF Spaces. The LLM contract is stable. Normalization is implemented.

**Partially done:** Stage C (ASR pipeline built but no field eval set, training still crashes on Kaggle), Stage E (memory loop built but `vocabulary.jsonl` empty — not yet exercised with real users), Stage F (Bambara TTS upgraded, Fula TTS still a placeholder).

**Not done:** Stage G (no field test with real farmers).

**Ahead of schedule:** Stage H (Phase 3 voice-to-voice, voice cloning, Adlam/Pular, ONNX/TFLite, FastAPI — all built in parallel with, or before, completing C/E/G).

The path forward, ordered by leverage: (1) fix the Kaggle LoRA crash so Stage C can continue; (2) build the real-user eval set so Stage C has a measurement foundation; (3) exercise the memory loop with three real users so Stage E is confirmed; (4) run a small field test so Stage G is unblocked; (5) train `ous-sow/fula-tts` so Stage F closes for Fula; (6) return to Stage H work with actual user signal guiding priorities.

Everything the project is missing is measurement. Everything the project has is implementation. That is a recoverable position, but only if the measurement work now gets the same weight the implementation work has had.

project-context.txt
ADDED
@@ -0,0 +1,295 @@
================================================================================
PROJECT CONTEXT — sahel-agri-voice
Generated: 2026-04-17
================================================================================

PROJECT NAME
------------
Sahel-Voice-Lab / Sahel-Agri Voice AI
(HuggingFace Space title: "Sahel-Voice-Lab", Phase 1: "The Memory Loop")

PURPOSE
-------
A voice-first, self-learning AI assistant for two West African languages —
Bambara (bam, spoken in Mali) and Fula/Pular (ful, spoken in Guinea and
Senegal) — targeted at farmers in the Sahel region.

The system has two complementary capabilities:

1. LANGUAGE-LEARNING MEMORY LOOP (Phase 1)
   The assistant behaves like an "eager child learner." Users teach it
   Bambara/Fula words ("I ni ce means hello") via voice or text; an LLM
   detects the teaching intent and the word pair is persisted to a
   HuggingFace Hub dataset (ous-sow/sahel-agri-feedback → vocabulary.jsonl)
   so knowledge accumulates across sessions and users. The vocabulary is
   then injected into the LLM's system prompt as its source of truth for
   answering questions.

2. AGRICULTURAL IoT VOICE INTERFACE
   Farmers speak questions in their own language ("how is the soil?",
   "is it going to rain?"). Whisper transcribes, an intent parser keyword-
   matches Bambara/Fula agricultural terms (soil, rain, irrigation, pest),
   a sensor bridge fetches data from an IoT backend (or mock data), and
   VoiceResponder + a TTS engine reply in short Bambara/Fula sentences
   with alert thresholds (e.g. "Bunding ji dɔgɔ. I ka foro ji." =
   "Soil moisture is low. Irrigate your field.").

The project is deployed as a HuggingFace Space (Gradio frontend) with an
optional FastAPI service. The system is explicitly "100% non-Meta" for its
core stack (Whisper / Qwen / F5-TTS / VITS), avoiding Meta models for the
main loop.

FULL TECH STACK
---------------
Deployment / hosting
- HuggingFace Spaces (Gradio SDK 5.25.0, hardware: cpu-basic)
- Kaggle notebooks (T4 GPU) for training runs
- RunPod alternative training environment
- HF Hub datasets as persistent vocabulary + feedback store

Frontend
- Gradio 5.25.0 (app.py — main UI; app_lab.py — experimental lab UI)

Backend API
- FastAPI (src/api/app.py via create_app() + lifespan)
- Pydantic v2 (schemas)
- httpx (async calls to IoT sensor backend)

Speech-to-text (STT)
- openai/whisper-large-v3-turbo (default backbone)
- transformers 5.5.0 (WhisperForConditionalGeneration, WhisperProcessor)
- PEFT (LoRA adapters, hot-swappable per language)
- accelerate 1.13.0
- librosa 0.10.2, soundfile 0.12.1, torchaudio

LLM (reasoning / teaching-intent detection)
- Qwen/Qwen2.5-72B-Instruct (default, via HF Serverless Inference)
- Qwen/Qwen2.5-7B-Instruct, Mistral-7B-Instruct-v0.3, Zephyr-7b-beta
  as faster alternatives
- huggingface-hub 1.9.0 InferenceClient

Text-to-speech (TTS)
- Phase 1: facebook/mms-tts-bam, mms-tts-ful, mms-tts-fra, mms-tts-eng
- Phase 2: ynnov/ekodi-bambara-tts-female (VITS)
  + placeholder ous-sow/fula-tts
- F5-TTS (SWivid/F5-TTS) for GPU voice cloning (optional, ~2GB)
- OpenVoice V2 (myshell-ai/openvoice-v2) for tone-color conversion
- SpeechBrain ECAPA-TDNN for speaker identification (per-user profiles)

Data / datasets
- google/fleurs (bam_ML, ff_SN) as STT training corpus
- RobotsMali/jeli-asr, google/fleurs Fula, Wikipedia (bm, ff) harvested
  text via src/data/web_harvester.py
- datasets 4.8.4 (+ torchcodec for 4.x audio decoding)
- Adlam ↔ Latin transliteration for Guinea Pular

Training / fine-tuning
- PEFT LoRA + Seq2SeqTrainer
- jiwer 3.0.4 (WER / CER metrics)
- Custom callbacks: EarlyStoppingOnWER, AdapterCheckpointCallback
- FieldNoiseAugmenter (tractor / wind / livestock noise mixing)

Optimization / edge deploy
- optimum[onnxruntime] → per-language ONNX export
- onnx-tf / TensorFlow → TFLite for Android
- bitsandbytes NF4 / 8-bit quantization (training environments)

Utilities / runtime
- PyYAML 6.0.2, python-dotenv 1.1.0
- NumPy 2.2.4, SciPy 1.15.2
- rapidfuzz 3.13.0 (fuzzy phrase matching)
- pypdf, python-docx (Knowledge Base upload → vocabulary.jsonl)
- Kaggle API (Self-Teaching tab triggers training runs)
- ffmpeg (packages.txt — sole system-level dep)

Environment variables
HF_TOKEN, FEEDBACK_REPO_ID (ous-sow/sahel-agri-feedback),
LLM_MODEL_ID, BAMBARA_ADAPTER_PATH, FULA_ADAPTER_PATH,
SENSOR_API_URL, BAMBARA_TTS_REPO, FULA_TTS_REPO, DEVICE, LOG_LEVEL

KEY SOURCE FILES AND WHAT THEY DO
---------------------------------
Top-level entry points
  app.py
    Gradio UI (~99 KB). Main user-facing application running on the HF Space.
    Wires STT → LLM → memory → TTS, exposes the Conversation / Teaching /
    Knowledge Base / Self-Teaching tabs.
  app_lab.py
    Experimental/lab Gradio UI used to prototype new features
    (e.g. CuriosityEngine integration) before folding into app.py.
  setup.sh
    Shell bootstrap for local + RunPod environments.

src/api/ — FastAPI service (alternative to Gradio-only deploy)
  app.py               FastAPI factory with async lifespan: loads Whisper backbone
                       once, registers bam/ful adapters, pre-loads 'bam', attaches
                       Transcriber + SensorBridge to app.state.
  dependencies.py      FastAPI DI helpers to pull shared objects off app.state.
  middleware.py        CORS / logging middleware registration.
  schemas.py           Pydantic v2 request/response models.
  routes/health.py     GET /health — model status + loaded adapters.
  routes/transcribe.py POST /transcribe — audio → text, 10 MB cap,
                       wav/mp3/ogg/m4a/flac/webm.
  routes/iot.py        POST /query — full pipeline: audio → transcribe → intent
                       → sensor → voice response (IoTQueryResponse).

src/engine/ — STT core
  whisper_base.py      Singleton loader for WhisperForConditionalGeneration +
                       WhisperProcessor. FP16 on CUDA, FP32 on CPU. free()
                       releases VRAM.
  adapter_manager.py   Hot-swap LoRA adapters via PEFT's multi-adapter API:
                       first load ~2s, subsequent set_adapter ~50ms.
                       Keeps one backbone in VRAM and swaps ~50MB adapters.
  transcriber.py       Public inference API. Handles ≤30s chunks directly,
                       >30s by slicing into 30s windows. Returns
                       TranscriptionResult (text, language, duration_s,
                       processing_time_ms, confidence).
  stt_processor.py     avg_logprob confidence extractor; threshold -1.0 =
                       "confused", caller should ask user to repeat.
  curiosity.py         CuriosityEngine — every N interactions, prompts the
                       LLM to spot a vocabulary gap and ask the user how to
                       say a missing agricultural term.

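The long-audio path in transcriber.py amounts to fixed 30-second windowing over the raw sample buffer. A simplified sketch (function name and signature are illustrative; the real class also tracks timing and confidence):

```python
def chunk_samples(samples: list, sample_rate: int = 16000, window_s: int = 30) -> list:
    """Split a 1-D sample buffer into consecutive windows of at most window_s seconds."""
    window = window_s * sample_rate
    # Whisper accepts up to 30s per forward pass, so longer clips are sliced.
    return [samples[i:i + window] for i in range(0, len(samples), window)]
```

A 75-second clip at 16 kHz becomes three windows of 30s, 30s, and 15s, each transcribed independently and concatenated.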
| 153 |
+
src/llm/
  gemma_client.py      Wraps HF Serverless InferenceClient. Implements the
                       "adult-child" system prompt that returns structured
                       JSON with intent ∈ {teaching, question, conversation,
                       error}. Parses JSON out of optional markdown fences.

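Extracting JSON from a reply that may or may not be wrapped in markdown fences is a common LLM-client chore; a sketch of the idea (the helper name is hypothetical, not the actual gemma_client.py API):

```python
import json
import re

def extract_json(raw: str) -> dict:
    """Pull a JSON object out of an LLM reply that may wrap it in markdown fences."""
    # `{3} matches a triple-backtick fence; the language tag ("json") is optional.
    match = re.search(r"`{3}(?:json)?\s*(\{.*?\})\s*`{3}", raw, re.DOTALL)
    payload = match.group(1) if match else raw.strip()
    return json.loads(payload)
```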
| 159 |
+
src/memory/
  memory_manager.py    Thread-safe vocabulary store. Persists to
                       data/vocabulary.jsonl locally and pushes asynchronously
                       to HF Hub dataset. Provides get_recent() and a
                       formatted get_vocabulary_context() for the LLM prompt.

| 165 |
+
src/conversation/
  phrase_matcher.py    RapidFuzz-based matcher over curated JSON phrase
                       libraries (data/phrases/{lang}.json + _additions.json).
                       Handles greetings / thanks / farewells without hitting
                       the LLM.

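The matching logic is a score-and-threshold loop. A sketch using stdlib difflib as a stand-in for the RapidFuzz scorer (phrase data and threshold here are illustrative, not the real phrase libraries):

```python
from difflib import SequenceMatcher

# Illustrative stand-in for data/phrases/{lang}.json content.
PHRASES = {"i ni ce": "greeting", "i ni sogoma": "greeting", "k'an ben": "farewell"}

def match_phrase(text: str, threshold: float = 0.8):
    """Return (canonical phrase, label) for the closest phrase above threshold, else None."""
    best, best_score = None, 0.0
    for phrase, label in PHRASES.items():
        score = SequenceMatcher(None, text.lower().strip(), phrase).ratio()
        if score > best_score:
            best, best_score = (phrase, label), score
    return best if best_score >= threshold else None
```

Hits are answered from the curated library directly, so common greetings never cost an LLM round-trip.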
| 171 |
+
src/iot/
  intent_parser.py     Keyword-based Intent classifier
                       (greeting/thanks/farewell/check_soil/check_weather/
                       irrigation_status/pest_alert) for bam, ful, fr, en.
                       Confidence = matched_keywords / total_keywords.
  sensor_bridge.py     Async bridge to an IoT backend (SENSOR_API_URL) for
                       soil / weather / irrigation / pest readings.
                       Falls back to mock random data.
  voice_responder.py   Maps (Intent, SensorData) → short Bambara/Fula reply
                       string (≤6 words per sentence for clean MMS-TTS) plus
                       English translation. Alert thresholds encoded here
                       (SOIL_MOISTURE_LOW=30, PH bounds, TEMP_HIGH=38, etc.).
                       Also has a verbose French-language path.

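The confidence formula above (matched_keywords / total_keywords) can be sketched as follows; the keyword tables are illustrative English examples, not the real bam/ful/fr/en data:

```python
# Illustrative keyword tables; intent_parser.py holds per-language versions.
INTENT_KEYWORDS = {
    "check_soil": ["soil", "moisture", "dry", "wet"],
    "check_weather": ["weather", "rain", "sun", "wind"],
}

def classify(text: str):
    """Pick the intent with the most keyword hits; confidence = matched / total keywords."""
    words = set(text.lower().split())
    best_intent, best_conf = "unknown", 0.0
    for intent, keywords in INTENT_KEYWORDS.items():
        matched = sum(1 for k in keywords if k in words)
        conf = matched / len(keywords)
        if conf > best_conf:
            best_intent, best_conf = intent, conf
    return best_intent, best_conf
```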
| 185 |
+
src/data/
  agri_dictionary.py   Bambara + Fula domain vocab used to bias the Whisper
                       decoder prompt toward agricultural terms.
  waxal_loader.py      Streams google/fleurs (bam_ML, ff_SN) — the
                       replacement for the retired google/waxal dataset.
  feature_extractor.py Log-mel spectrogram extraction and batched padding
                       collator for Whisper Seq2SeqTrainer.
  augmentation.py      FieldNoiseAugmenter — mixes clean speech with
                       tractor/wind/livestock samples; falls back to
                       Gaussian noise.
  bam_normalize.py     Bambara phonetic normalizer (ou→u, gn/ny→ɲ,
                       N'Ko-derived standard).
  adlam.py             Adlam (𞤀𞤣𞤤𞤢𞤥) ↔ Latin transliteration for Pular;
                       normalize_pular() for ASR preprocessing.
  web_harvester.py     Harvests RobotsMali/jeli-asr, google/fleurs ff_SN,
                       and bm/ff Wikipedia into the feedback Hub dataset.

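The two normalization rules named for bam_normalize.py can be sketched as ordered substitutions; the real module applies more rules than shown, and the function name here is hypothetical:

```python
import re

def normalize_bambara(text: str) -> str:
    """Apply two of the phonetic rules described above (the real module has more)."""
    text = text.lower()
    text = re.sub(r"ou", "u", text)     # French-influenced 'ou' spelling → 'u'
    text = re.sub(r"gn|ny", "ɲ", text)  # palatal nasal digraphs → ɲ
    return text
```

Normalizing both references and hypotheses this way keeps WER from penalizing spelling variants that sound identical.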
| 202 |
+
src/training/
  trainer.py           WhisperLoRATrainer — full fine-tune orchestration
                       (backbone + LoraConfig + WaxalDataLoader +
                       Seq2SeqTrainer).
  metrics.py           WER/CER for Seq2SeqTrainer eval loop (via jiwer).
  callbacks.py         EarlyStoppingOnWER, AdapterCheckpointCallback
                       (saves adapter-only, not full model).

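metrics.py delegates to jiwer, but the underlying word error rate, and the empty-reference guard that the recent "jiwer crash on post-normalisation empty refs" fix adds, can be sketched dependency-free (this implementation is illustrative, not the project's code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level edit distance; jiwer does this in metrics.py."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        # Guard: jiwer raises when a reference normalizes to empty.
        return 0.0 if not hyp else 1.0
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / len(ref)
```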
| 210 |
+
src/tts/
  waxal_tts.py         VITS engine wrapping ynnov/ekodi-bambara-tts-female
                       for Bambara; Fula is a placeholder until
                       ous-sow/fula-tts is trained.
  mms_tts.py           Facebook MMS-TTS (bam/ful/fra/eng).
  f5_tts.py            F5-TTS voice cloning (optional, GPU-only, ~750MB);
                       gracefully falls back to MMS when missing.
  voice_cloner.py      OpenVoice V2 tone-color converter — reshapes VITS
                       audio to a target speaker's voice.

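The graceful F5-TTS → MMS fallback is a simple engine chain; a sketch with stand-in callables (the function and the dummy engines are illustrative, not the project's API):

```python
def synthesize(text: str, engines: list):
    """Try each (name, engine) pair in order; return the first successful result."""
    last_error = None
    for name, engine in engines:
        try:
            return name, engine(text)
        except Exception as err:  # e.g. F5-TTS not installed, or no GPU
            last_error = err
    raise RuntimeError(f"all TTS engines failed: {last_error}")
```

Ordering the list as [F5, VITS, MMS] gives voice cloning when the hardware allows it and plain MMS audio otherwise, without the caller needing to know which engine ran.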
| 220 |
+
src/voice/
  speaker_profiles.py  SpeakerProfileManager with SpeechBrain ECAPA-TDNN
                       (192-d embeddings). Per-user running-average embeddings
                       for identification + OpenVoice SE for cloning; cosine
                       similarity ≥ 0.75 attributes to an existing user.

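The identification step reduces to a cosine-similarity argmax with a 0.75 floor. A sketch using toy 2-d vectors in place of the 192-d ECAPA embeddings (function names are hypothetical):

```python
import math

def cosine(a: list, b: list) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def identify(embedding: list, profiles: dict, threshold: float = 0.75):
    """Return the best-matching profile name, or None if nothing clears the
    cosine threshold (i.e. treat the speaker as new)."""
    best_name, best_sim = None, threshold
    for name, profile_emb in profiles.items():
        sim = cosine(embedding, profile_emb)
        if sim >= best_sim:
            best_name, best_sim = name, sim
    return best_name
```

On a match, the stored profile is updated as a running average of its past embeddings and the new one, so a user's profile drifts with recording conditions.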
| 226 |
+
src/optimization/
  onnx_exporter.py     Merges LoRA into backbone and exports per-language
                       ONNX (ONNX can't hot-swap adapters at runtime).
  quantizer.py         BitsAndBytes NF4 / 8-bit quantization for GPU-
                       constrained deploys (turbo ~3GB → ~1GB VRAM).
  tflite_converter.py  ONNX → TFLite for offline Android; exports encoder
                       and decoder separately.

| 234 |
+
Config / data folders
  configs/        base_config.yaml + per-language LoRA configs.
  data/           vocabulary.jsonl, phrases/*.json, profiles/, etc.
  notebooks/      Kaggle / RunPod fine-tune + TTS training notebooks.
  noise_samples/  .wav clips for field-noise augmentation.
  scripts/        utility scripts (bootstrap, harvest, eval).
  tests/          pytest suite (not installed in HF Spaces runtime).

RECENT GIT COMMITS SUMMARY (last 20)
------------------------------------
The recent history is focused on three concurrent tracks:

1. STT / training stability
   - bb78cbf Add torchcodec install for datasets 4.x audio decoding
   - 9049ef3 Prepare training stack for RunPod: env-aware notebook +
             bootstrap script
   - cc50efb Align Whisper default to turbo-v3 + add document upload to
             Knowledge Base tab
   - c33a061 Fix WhisperProcessor import in reload + upgrade base to
             large-v3-turbo
   - 7fae91b Fix mel-bin mismatch: load per-language processor from
             fine-tuned checkpoint
   - 6682858 Fix jiwer crash on post-normalisation empty refs;
             register SLR106/105 datasets
   - 58f431a Fix SyntaxError in Cell 17: unterminated f-string literal
   - 3632a23 Fix compute_metrics crash on empty eval references
             in Fula training
   - 71bb3bc Fix: add trust_remote_code=True for datasets 3.x compatibility
   - cd017e2 Fix Cell 16 ValueError: load model fp32 so AMP gradient scaler
             works

2. Language support / Adlam / Pular expansion
   - ced078c Add Adlam/Pular Fula integration: transliterator +
             3 new datasets + normalisation pipeline
   - 40cf84d Fix language mixing: per-language prompts +
             Mali Bambara / Guinea Pular context
   - 33c3a5a Fix Self-Teaching language detection: parse code from
             dropdown label
   - 24b1617 Fix Self-Teaching tab: float sliders, deduplication,
             Kaggle API fallback

3. Conversation / voice pipeline
   - 8952fff Phase 3: Voice-to-Voice S2S pipeline —
             F5-TTS, LLM brain, CER metric
   - ad902c6 Add real conversational memory + live learning to
             Conversation Mode
   - 8d7d9d8 Fix conversation mode timeout: two-stage pipeline + faster LLM
   - 1958814 Fix "Model loading" stuck state: block in _do_asr until
             Whisper is ready
   - 618eab5 Fix model loading stuck forever + unhandled TTS crash in
             conversation mode
   - bfe5b59 Fix slow build: strip runtime-irrelevant heavy packages from
             requirements.txt

Overall trajectory: the project has moved past initial Phase 1 scaffolding
and is iterating hard on (a) stabilising fine-tuning on Kaggle/RunPod with
large-v3-turbo, (b) expanding to Guinea Pular with the native Adlam script,
and (c) finishing the Phase 3 voice-to-voice pipeline (F5-TTS + LLM brain).
Most recent commits are bug-fixes rather than net-new features, suggesting
the current codebase is approaching a stable milestone.

================================================================================
scripts/push_to_hf.sh ADDED
@@ -0,0 +1,38 @@
#!/usr/bin/env bash
# push_to_hf.sh — Push current branch to Hugging Face Space main using HF_TOKEN.
#
# Usage:
#   bash scripts/push_to_hf.sh
#   HF_SPACE_REPO="spaces/ous-sow/sahel-agri-voice" bash scripts/push_to_hf.sh

set -euo pipefail

REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
cd "$REPO_ROOT"

HF_SPACE_REPO="${HF_SPACE_REPO:-spaces/ous-sow/sahel-agri-voice}"
TARGET_BRANCH="${TARGET_BRANCH:-main}"

# Load .env if present and HF_TOKEN is not already exported.
if [[ -z "${HF_TOKEN:-}" && -f "$REPO_ROOT/.env" ]]; then
  set -a
  # shellcheck disable=SC1091
  source "$REPO_ROOT/.env"
  set +a
fi

if [[ -z "${HF_TOKEN:-}" ]]; then
  echo "HF_TOKEN is not set."
  echo "Set HF_TOKEN in your shell or in .env, then rerun."
  exit 1
fi

CURRENT_BRANCH="$(git branch --show-current)"
if [[ -z "$CURRENT_BRANCH" ]]; then
  echo "Could not detect current git branch."
  exit 1
fi

echo "Pushing '$CURRENT_BRANCH' to '$HF_SPACE_REPO' (remote branch: '$TARGET_BRANCH')..."
git push "https://__token__:${HF_TOKEN}@huggingface.co/${HF_SPACE_REPO}" "${CURRENT_BRANCH}:${TARGET_BRANCH}"
echo "Done."