Update src/streamlit_app.py
src/streamlit_app.py  CHANGED  +80 -3
@@ -15,13 +15,14 @@ st.set_page_config(
 st.title("ASR for African Languages Model Hub")
 
 # Create tabs
-tab1, tab2, tab3, tab4, tab5, tab6 = st.tabs([
+tab1, tab2, tab3, tab4, tab5, tab6, tab7 = st.tabs([
     "About",
     "Benchmark Dataset",
     "Model Collections",
     "Evaluation Scenarios",
     "ASR models demo",
-    "Results"
+    "Results",
+    "Human Evaluation of ASR Models"
 ])
 
 with tab5:
@@ -299,4 +300,80 @@ with tab6:
 - **Language models (LMs) provide the greatest benefit in low-data regimes (<50 hours)** by supplying additional contextual information.
 - As supervised training data increases, **the added value of LMs decreases**, though their effectiveness varies somewhat across languages.
 - **Model choice matters**: XLS-R benefits most from scaling data, while W2v-BERT shines in extremely low-resource scenarios.
-""")
+""")
+
+with tab7:
+    st.header("Human Evaluation of ASR Models")
+
+    # --- Introduction ---
+    st.subheader("Introduction")
+    st.write("""
+ASR systems are typically assessed using automatic metrics such as Word Error Rate (WER) or Character Error Rate (CER).
+While these provide valuable quantitative insights, they do not fully capture how well transcriptions preserve meaning, respect language orthography, or handle specific features such as tone, diacritics, or named entities.
+To address these gaps, we conducted a human evaluation of ASR systems across African languages to assess the qualitative performance of the best-performing models.
+""")
+
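For readers unfamiliar with the metrics named in this introduction, a minimal sketch of how WER and CER are typically computed is shown below. It is an illustrative aside, not part of this diff; the jiwer library and the example sentences are assumptions, not anything taken from this repository.

# Illustrative sketch (not part of this diff): computing WER and CER with jiwer.
# The reference/hypothesis strings are made up for illustration.
import jiwer

reference = "ninafurahi kukutana nawe leo"
hypothesis = "nina furahi kukutana nawe"

wer = jiwer.wer(reference, hypothesis)   # word-level error rate
cer = jiwer.cer(reference, hypothesis)   # character-level error rate
print(f"WER: {wer:.2f}, CER: {cer:.2f}")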
+    # --- Guidelines ---
+    st.subheader("Evaluation Guidelines")
+    st.write("""
+Evaluators were provided with structured instructions to ensure consistency in their assessments. The main criteria included:
+
+- **Accuracy (1–5 scale):** How correctly the model transcribed the audio.
+- **Meaning Preservation (1–5 scale):** Whether the transcription retained the original meaning.
+- **Orthography:** Whether the transcription followed standard writing conventions, including accents, diacritics, and special characters.
+- **Error Types:** Evaluators identified common error categories, such as:
+  - Substitutions (wrong words used)
+  - Omissions (missing words)
+  - Insertions (extra words added)
+  - Pronunciation-related errors
+  - Diacritic/Tone/Special character errors
+  - Named Entity errors (people, places, currencies)
+  - Punctuation errors
+- **Performance Description:** Free text where evaluators described strengths and weaknesses of the models.
+""")
+
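The guidelines above amount to a per-clip rating schema. A minimal sketch of how one evaluation record could be represented follows; it is illustrative only, and every field name is an assumption rather than the project's actual data model.

# Illustrative sketch (not part of this diff): one possible shape for a single
# evaluation record matching the criteria above. All field names are assumptions.
from dataclasses import dataclass, field
from typing import List

@dataclass
class EvaluationRecord:
    language: str
    audio_id: str
    accuracy: int                      # 1-5 rating
    meaning_preservation: int          # 1-5 rating
    orthography_ok: bool               # follows standard writing conventions
    error_types: List[str] = field(default_factory=list)  # e.g. ["omission", "diacritic"]
    performance_description: str = ""  # free-text comments

record = EvaluationRecord(
    language="Swahili",
    audio_id="clip_001",
    accuracy=5,
    meaning_preservation=5,
    orthography_ok=True,
    error_types=["punctuation"],
    performance_description="Accurate transcription; minor punctuation issues.",
)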
+    # --- Setup ---
+    st.subheader("Evaluation Setup")
+    st.write("""
+- **Languages Evaluated:** Afrikaans, Amharic, Bemba, Hausa, Igbo, Kinyarwanda, Lingala, Luganda, Oromo, Swahili, Wolof, Xhosa, and Yoruba (13 languages).
+- **Participants:** 20 evaluators (native speakers or fluent linguists), aged 18–50, the majority with postgraduate education.
+- **Platform:** A Gradio-based interface allowed evaluators to upload/record audio, view transcriptions, and complete the feedback form directly online.
+""")
+
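The Platform bullet describes a Gradio feedback form. A rough sketch of that kind of interface is given below, assuming a recent Gradio release (the sources parameter of gr.Audio, for instance, is Gradio 4.x); the component choices, labels, and the save_feedback function are illustrative, not this project's actual evaluation app.

# Illustrative sketch (not part of this diff): a minimal Gradio feedback form
# of the kind described above. Components and labels are assumptions.
import gradio as gr

def save_feedback(audio, transcription, accuracy, meaning, error_types, comments):
    # A real app would persist the record; here we only echo a confirmation.
    return f"Recorded: accuracy={accuracy}, meaning={meaning}, errors={error_types}"

with gr.Blocks() as demo:
    audio = gr.Audio(sources=["upload", "microphone"], type="filepath", label="Audio clip")
    transcription = gr.Textbox(label="Model transcription")
    accuracy = gr.Slider(1, 5, step=1, label="Accuracy (1-5)")
    meaning = gr.Slider(1, 5, step=1, label="Meaning preservation (1-5)")
    error_types = gr.CheckboxGroup(
        ["Substitution", "Omission", "Insertion", "Pronunciation",
         "Diacritic/Tone", "Named entity", "Punctuation"],
        label="Error types observed",
    )
    comments = gr.Textbox(label="Performance description", lines=3)
    status = gr.Textbox(label="Status", interactive=False)
    gr.Button("Submit").click(
        save_feedback,
        inputs=[audio, transcription, accuracy, meaning, error_types, comments],
        outputs=status,
    )

demo.launch()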
+    # --- Findings ---
+    st.subheader("Findings")
+    st.write("""
+- **High-Performing Languages:**
+  - Swahili (Accuracy 4.96, Meaning 4.97)
+  - Luganda (Accuracy 4.70, Meaning 4.78)
+  - Amharic (Accuracy 4.65, Meaning 4.82)
+  These models produced highly accurate transcriptions with minimal meaning loss.
+
+- **Moderate Performance:**
+  Hausa, Oromo, Bemba, Yoruba, and Wolof – generally understandable, but often with orthography and punctuation issues.
+
+- **Low-Performing Languages:**
+  - Igbo (Accuracy 2.25, Meaning 2.15)
+  - Afrikaans (Accuracy 3.59, Meaning 4.10)
+  - Xhosa (Accuracy 3.62, Meaning 3.38)
+  These suffered from limited training data, frequent substitution/omission errors, and poor handling of named entities.
+""")
+
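For context, the mean ratings quoted in these findings could be surfaced directly in the Streamlit app as a table and chart. The sketch below hard-codes the numbers from the text purely for illustration; the hard-coded dict is an assumption about how the data would be loaded, not how the app actually stores its results.

# Illustrative sketch (not part of this diff): displaying the quoted mean ratings.
import pandas as pd
import streamlit as st

ratings = pd.DataFrame(
    {
        "Language": ["Swahili", "Luganda", "Amharic", "Xhosa", "Afrikaans", "Igbo"],
        "Accuracy": [4.96, 4.70, 4.65, 3.62, 3.59, 2.25],
        "Meaning": [4.97, 4.78, 4.82, 3.38, 4.10, 2.15],
    }
).set_index("Language")

st.dataframe(ratings)  # sortable table of mean human ratings
st.bar_chart(ratings)  # quick visual comparison across languages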
+    # --- Error Patterns ---
+    st.subheader("Common Error Patterns")
+    st.write("""
+1. Punctuation and formatting inconsistencies.
+2. Word merging or spacing errors, especially in morphologically rich languages.
+3. Named entity recognition failures (numbers, currencies, names).
+4. Spelling and orthography deviations, especially in languages with tones/diacritics.
+""")
+
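Several of these error patterns (punctuation, spacing, casing) also inflate automatic WER unless transcripts are normalized before scoring. The sketch below illustrates the effect with jiwer; the normalize helper and the example sentences are assumptions for illustration only, not this project's scoring pipeline.

# Illustrative sketch (not part of this diff): formatting differences inflate WER
# unless reference and hypothesis are normalized consistently.
import string
import jiwer

def normalize(text: str) -> str:
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())  # collapse extra whitespace

reference = "Alice sent 1,000 naira to Lagos."
hypothesis = "alice sent 1000 naira to lagos"

print(jiwer.wer(reference, hypothesis))                        # penalizes case/punctuation differences
print(jiwer.wer(normalize(reference), normalize(hypothesis)))  # scores content words only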
+    # --- Takeaways ---
+    st.subheader("Takeaways")
+    st.write("""
+- Human ratings generally aligned with automatic metrics: languages with larger datasets (Swahili, Luganda, Amharic) scored highest.
+- Language models (LMs) were most effective in **low-data regimes (<50 hours)**, improving readability and accuracy.
+- WER alone misses issues such as meaning drift, orthography violations, and named entity errors.
+- More curated, domain-diverse training data is needed for low-performing languages such as Igbo and Afrikaans.
+- Human evaluation remains essential for **user-facing ASR systems**, where usability depends on meaning preservation and fluency, not just raw error rates.
+""")