Beijuka committed on
Commit 03449f2 · verified · 1 Parent(s): eb23303

Update src/streamlit_app.py

Files changed (1):
  1. src/streamlit_app.py +80 -3

src/streamlit_app.py CHANGED
@@ -15,13 +15,14 @@ st.set_page_config(
 st.title("ASR for African Languages Model Hub")
 
 # Create tabs
-tab1, tab2, tab3, tab4, tab5, tab6 = st.tabs([
+tab1, tab2, tab3, tab4, tab5, tab6, tab7 = st.tabs([
     "About",
     "Benchmark Dataset",
     "Model Collections",
     "Evaluation Scenarios",
     "ASR models demo",
-    "Results"
+    "Results",
+    "Human Evaluation of ASR Models"
 ])
 
 with tab5:
@@ -299,4 +300,80 @@ with tab6:
     - **Language models (LMs) provide the greatest benefit in low-data regimes (<50 hours)** by supplying additional contextual information.
     - As supervised training data increases, **the added value of LMs decreases**, though their effectiveness varies somewhat across languages.
     - **Model choice matters**: XLS-R benefits most from scaling data, while W2v-BERT shines in extremely low-resource scenarios.
-    """)
+    """)
+
+with tab7:
+    st.header("Human Evaluation of ASR Models")
+
+    # --- Introduction ---
+    st.subheader("Introduction")
+    st.write("""
+    ASR systems are typically assessed using automatic metrics such as Word Error Rate (WER) or Character Error Rate (CER).
+    While these provide valuable quantitative insights, they do not fully capture how well transcriptions preserve meaning, respect language orthography, or handle specific features such as tone, diacritics, or named entities.
+    To address these gaps, we conducted a human evaluation of ASR systems across African languages to assess the qualitative performance of the best-performing models.
+    """)
+
+    # --- Guidelines ---
+    st.subheader("Evaluation Guidelines")
+    st.write("""
+    Evaluators were provided with structured instructions to ensure consistency in their assessments. The main criteria included:
+
+    - **Accuracy (1–5 scale):** How correctly the model transcribed the audio.
+    - **Meaning Preservation (1–5 scale):** Whether the transcription retained the original meaning.
+    - **Orthography:** Whether the transcription followed standard writing conventions, including accents, diacritics, and special characters.
+    - **Error Types:** Evaluators identified common error categories, such as:
+        - Substitutions (wrong words used)
+        - Omissions (missing words)
+        - Insertions (extra words added)
+        - Pronunciation-related errors
+        - Diacritic/tone/special-character errors
+        - Named-entity errors (people, places, currencies)
+        - Punctuation errors
+    - **Performance Description:** Free text where evaluators described the strengths and weaknesses of the models.
+    """)
+
+    # --- Setup ---
+    st.subheader("Evaluation Setup")
+    st.write("""
+    - **Languages Evaluated:** 13 languages: Afrikaans, Amharic, Bemba, Hausa, Igbo, Kinyarwanda, Lingala, Luganda, Oromo, Swahili, Wolof, Xhosa, and Yoruba.
+    - **Participants:** 20 evaluators (native speakers or fluent linguists), aged 18–50, the majority with postgraduate education.
+    - **Platform:** A Gradio-based interface allowed evaluators to upload or record audio, view transcriptions, and complete the feedback form directly online.
+    """)
+
+    # --- Findings ---
+    st.subheader("Findings")
+    st.write("""
+    - **High-Performing Languages:**
+        - Swahili (Accuracy 4.96, Meaning 4.97)
+        - Luganda (Accuracy 4.70, Meaning 4.78)
+        - Amharic (Accuracy 4.65, Meaning 4.82)
+
+      These models produced highly accurate transcriptions with minimal meaning loss.
+
+    - **Moderate Performance:**
+      Hausa, Oromo, Bemba, Yoruba, and Wolof: generally understandable, but often with orthography and punctuation issues.
+
+    - **Low-Performing Languages:**
+        - Igbo (Accuracy 2.25, Meaning 2.15)
+        - Afrikaans (Accuracy 3.59, Meaning 4.10)
+        - Xhosa (Accuracy 3.62, Meaning 3.38)
+
+      These suffered from limited training data, frequent substitution/omission errors, and poor handling of named entities.
+    """)
+
+    # --- Error Patterns ---
+    st.subheader("Common Error Patterns")
+    st.write("""
+    1. Punctuation and formatting inconsistencies.
+    2. Word merging or spacing errors, especially in morphologically rich languages.
+    3. Named-entity recognition failures (numbers, currencies, names).
+    4. Spelling and orthography deviations, especially in languages with tones/diacritics.
+    """)
+
+    # --- Takeaways ---
+    st.subheader("Takeaways")
+    st.write("""
+    - Human ratings generally aligned with automatic metrics: languages with larger datasets (Swahili, Luganda, Amharic) scored highest.
+    - Language models (LMs) were most effective in **low-data regimes (<50 hours)**, improving readability and accuracy.
+    - WER alone misses issues such as meaning drift, orthography violations, and named-entity errors.
+    - More curated, domain-diverse training data is needed for low-performing languages such as Igbo and Afrikaans.
+    - Human evaluation remains essential for **user-facing ASR systems**, where usability depends on meaning preservation and fluency, not just raw error rates.
+    """)