Beijuka committed on
Commit bd0de02 · verified · 1 Parent(s): 03449f2

Update src/streamlit_app.py

Files changed (1)
  1. src/streamlit_app.py +110 -28
src/streamlit_app.py CHANGED
@@ -321,17 +321,21 @@ with tab7:
     - **Accuracy (1–5 scale):** How correctly the model transcribed the audio.
     - **Meaning Preservation (1–5 scale):** Whether the transcription retained the original meaning.
     - **Orthography:** Whether the transcription followed standard writing conventions, including accents, diacritics, and special characters.
-    - **Error Types:** Evaluators identified common error categories, such as:
-      - Substitutions (wrong words used)
-      - Omissions (missing words)
-      - Insertions (extra words added)
-      - Pronunciation-related errors
-      - Diacritic/Tone/Special character errors
-      - Named Entity errors (people, places, currencies)
-      - Punctuation errors
-    - **Performance Description:** Free text where evaluators described strengths and weaknesses of the models.
+    - **Recording Environment:** Evaluators noted the type of environment (quiet, professional studio, or noisy background) since background noise impacts ASR performance.
+    - **Device Used:** Information on whether the recording was made with a mobile phone, laptop microphone, or dedicated mic, as device quality affects clarity.
+    - **Domain/Topic of Speech:** Evaluators indicated if the speech belonged to a specific topic such as education, health, law, or everyday conversation, to assess domain adaptability.
+    - **Error Types:** Evaluators identified common error categories, such as:
+      - Substitutions (wrong words used)
+      - Omissions (missing words)
+      - Insertions (extra words added)
+      - Pronunciation-related errors
+      - Diacritic/Tone/Special character errors
+      - Named Entity errors (people, places, currencies)
+      - Punctuation errors
+    - **Performance Description:** Free text where evaluators described strengths and weaknesses of the models in their own words.
     """)
 
+
     # --- Setup ---
     st.subheader("Evaluation Setup")
     st.write("""
@@ -339,41 +343,119 @@ with tab7:
     - **Participants:** 20 evaluators (native speakers or fluent linguists), aged 18–50, majority with postgraduate education.
     - **Platform:** A Gradio-based interface allowed evaluators to upload/record audio, view transcriptions, and complete the feedback form directly online.
     """)
-
+    st.subheader("Evaluator Contributions")
+    data = [
+        {"Evaluator ID": "eval_001", "Contributions": 65, "Languages": "Afrikaans"},
+        {"Evaluator ID": "eval_002", "Contributions": 50, "Languages": "Afrikaans"},
+        {"Evaluator ID": "eval_005", "Contributions": 63, "Languages": "Amharic"},
+        {"Evaluator ID": "eval_006", "Contributions": 69, "Languages": "Amharic"},
+        {"Evaluator ID": "eval_007", "Contributions": 50, "Languages": "Bemba"},
+        {"Evaluator ID": "eval_008", "Contributions": 53, "Languages": "Bemba"},
+        {"Evaluator ID": "eval_009", "Contributions": 60, "Languages": "Hausa"},
+        {"Evaluator ID": "eval_010", "Contributions": 53, "Languages": "Igbo"},
+        {"Evaluator ID": "eval_011", "Contributions": 12, "Languages": "Lingala"},
+        {"Evaluator ID": "eval_012", "Contributions": 115, "Languages": "Oromo"},
+        {"Evaluator ID": "eval_014", "Contributions": 52, "Languages": "Wolof"},
+        {"Evaluator ID": "eval_015", "Contributions": 8, "Languages": "Xhosa"},
+        {"Evaluator ID": "eval_017", "Contributions": 59, "Languages": "Yoruba"},
+        {"Evaluator ID": "eval_018", "Contributions": 58, "Languages": "Yoruba"},
+        {"Evaluator ID": "eval_019", "Contributions": 52, "Languages": "Luganda"},
+        {"Evaluator ID": "eval_020", "Contributions": 55, "Languages": "Luganda"},
+        {"Evaluator ID": "eval_021", "Contributions": 66, "Languages": "Swahili"},
+        {"Evaluator ID": "eval_022", "Contributions": 64, "Languages": "Swahili"},
+        {"Evaluator ID": "eval_023", "Contributions": 50, "Languages": "Kinyarwanda"},
+        {"Evaluator ID": "eval_024", "Contributions": 53, "Languages": "Kinyarwanda"},
+    ]
+
+    df_evaluators = pd.DataFrame(data)
+
+    st.dataframe(df_evaluators, width="stretch")
+
+    # Optional: also show totals
+    st.write("### Summary")
+    st.write(f"- **Total Evaluators:** {df_evaluators['Evaluator ID'].nunique()}")
+    st.write(f"- **Total Contributions:** {df_evaluators['Contributions'].sum()}")
+
     # --- Findings ---
     st.subheader("Findings")
+
     st.write("""
+    ASR performance varied significantly across languages, reflecting differences in data availability,
+    orthography complexity, and domain coverage. Below we summarize the average **Accuracy** and
+    **Meaning Preservation** scores (1–5 scale) by language.
+    """)
+
+    # Data table of results
+    results_data = [
+        {"Language": "Swahili", "Audios Evaluated": 132, "Accuracy": 4.96, "Meaning": 4.97},
+        {"Language": "Luganda", "Audios Evaluated": 110, "Accuracy": 4.70, "Meaning": 4.78},
+        {"Language": "Amharic", "Audios Evaluated": 132, "Accuracy": 4.65, "Meaning": 4.82},
+        {"Language": "Lingala", "Audios Evaluated": 30, "Accuracy": 4.63, "Meaning": 4.70},
+        {"Language": "Hausa", "Audios Evaluated": 60, "Accuracy": 4.58, "Meaning": 4.97},
+        {"Language": "Oromo", "Audios Evaluated": 115, "Accuracy": 4.54, "Meaning": 4.52},
+        {"Language": "Bemba", "Audios Evaluated": 116, "Accuracy": 4.39, "Meaning": 4.86},
+        {"Language": "Yoruba", "Audios Evaluated": 122, "Accuracy": 4.22, "Meaning": 4.48},
+        {"Language": "Wolof", "Audios Evaluated": 53, "Accuracy": 3.98, "Meaning": 4.13},
+        {"Language": "Kinyarwanda", "Audios Evaluated": 103, "Accuracy": 3.75, "Meaning": 4.81},
+        {"Language": "Xhosa", "Audios Evaluated": 8, "Accuracy": 3.62, "Meaning": 3.38},
+        {"Language": "Afrikaans", "Audios Evaluated": 116, "Accuracy": 3.59, "Meaning": 4.10},
+        {"Language": "Igbo", "Audios Evaluated": 55, "Accuracy": 2.25, "Meaning": 2.15},
+    ]
+
+    df_results = pd.DataFrame(results_data)
+    st.dataframe(df_results, width="stretch")
+
+    # Narrative summary
+    st.markdown("""
+    ### Key Takeaways
     - **High-Performing Languages:**
-      - Swahili (Accuracy 4.96, Meaning 4.97)
-      - Luganda (Accuracy 4.70, Meaning 4.78)
-      - Amharic (Accuracy 4.65, Meaning 4.82)
-      These models produced highly accurate transcriptions with minimal meaning loss.
-
+      - Swahili (Accuracy 4.96, Meaning 4.97)
+      - Luganda (Accuracy 4.70, Meaning 4.78)
+      - Amharic (Accuracy 4.65, Meaning 4.82)
+      These models produced highly accurate transcriptions with minimal meaning loss.
+
     - **Moderate Performance:**
-      Hausa, Oromo, Bemba, Yoruba, and Wolof — generally understandable, but often with orthography and punctuation issues.
+      Hausa, Oromo, Bemba, Yoruba, Wolof, and Kinyarwanda — generally understandable, but often with orthography and punctuation issues.
 
-    - **Low-Performing Languages:**
-      - Igbo (Accuracy 2.25, Meaning 2.15)
-      - Afrikaans (Accuracy 3.59, Meaning 4.10)
-      - Xhosa (Accuracy 3.62, Meaning 3.38)
-      These suffered from limited training data, frequent substitution/omission errors, and poor handling of named entities.
+    - **Low-Performing Languages from evaluation:**
+      - Igbo (Accuracy 2.25, Meaning 2.15)
+      - Afrikaans (Accuracy 3.59, Meaning 4.10)
+      - Xhosa (Accuracy 3.62, Meaning 3.38)
     """)
 
     # --- Error Patterns ---
     st.subheader("Common Error Patterns")
+
     st.write("""
-    1. Punctuation and formatting inconsistencies.
-    2. Word merging or spacing errors, especially in morphologically rich languages.
-    3. Named entity recognition failures (numbers, currencies, names).
-    4. Spelling and orthography deviations, especially in languages with tones/diacritics.
+    Evaluators highlighted several recurring challenges and areas for improvement across
+    different languages. These reflect both linguistic complexities and system limitations.
+    """)
+
+    error_data = [
+        {"Issue": "Punctuation and Formatting",
+         "Comments": "Absence of punctuation, lack of capitalisation"},
+        {"Issue": "Spelling and Grammar",
+         "Comments": "Word merging, frequent spelling mistakes in individual words"},
+        {"Issue": "Named Entity Recognition",
+         "Comments": "Inaccurate handling of numbers, currencies, and names"},
+        {"Issue": "Device Compatibility & Performance",
+         "Comments": "Better performance on laptops than on mobile phones"},
+    ]
+
+    df_errors = pd.DataFrame(error_data)
+    st.dataframe(df_errors, width="stretch")
+
+    st.markdown("""
+    ### Summary
+    1. **Punctuation and formatting inconsistencies** make transcriptions harder to read.
+    2. **Word merging and spelling errors** were frequent, particularly in morphologically rich languages.
+    3. **Named entity recognition** (e.g., names, currencies, numbers) was a common source of error.
+    4. **Platform performance** was reported as better on laptops than mobile devices.
     """)
 
     # --- Takeaways ---
     st.subheader("Takeaways")
     st.write("""
     - Human ratings generally aligned with automatic metrics: languages with larger datasets (Swahili, Luganda, Amharic) scored highest.
-    - Language models (LMs) were most effective in **low-data regimes (<50 hours)**, improving readability and accuracy.
     - WER alone misses issues such as meaning drift, orthography violations, and named entity errors.
-    - More curated, domain-diverse training data is needed for low-performing languages such as Igbo and Afrikaans.
-    - Human evaluation remains essential for **user-facing ASR systems**, where usability depends on meaning preservation and fluency, not just raw error rates.
     """)