Update src/streamlit_app.py
src/streamlit_app.py  CHANGED  (+110 -28)
@@ -321,17 +321,21 @@ with tab7:
  - **Accuracy (1–5 scale):** How correctly the model transcribed the audio.
  - **Meaning Preservation (1–5 scale):** Whether the transcription retained the original meaning.
  - **Orthography:** Whether the transcription followed standard writing conventions, including accents, diacritics, and special characters.
- - **
-
-
-
-
-
-
-
-
  """)

  # --- Setup ---
  st.subheader("Evaluation Setup")
  st.write("""
@@ -339,41 +343,119 @@ with tab7:
  - **Participants:** 20 evaluators (native speakers or fluent linguists), aged 18–50, majority with postgraduate education.
  - **Platform:** A Gradio-based interface allowed evaluators to upload/record audio, view transcriptions, and complete the feedback form directly online.
  """)
-
  # --- Findings ---
  st.subheader("Findings")
  st.write("""
  - **High-Performing Languages:**
-
-
-
-
-
  - **Moderate Performance:**
-   Hausa, Oromo, Bemba, Yoruba, and

- - **Low-Performing Languages:**
-
-
-
-   These suffered from limited training data, frequent substitution/omission errors, and poor handling of named entities.
  """)

  # --- Error Patterns ---
  st.subheader("Common Error Patterns")
  st.write("""
-
-
-
-
  """)

  # --- Takeaways ---
  st.subheader("Takeaways")
  st.write("""
  - Human ratings generally aligned with automatic metrics: languages with larger datasets (Swahili, Luganda, Amharic) scored highest.
- - Language models (LMs) were most effective in **low-data regimes (<50 hours)**, improving readability and accuracy.
  - WER alone misses issues such as meaning drift, orthography violations, and named entity errors.
- - More curated, domain-diverse training data is needed for low-performing languages such as Igbo and Afrikaans.
- - Human evaluation remains essential for **user-facing ASR systems**, where usability depends on meaning preservation and fluency, not just raw error rates.
  """)
  - **Accuracy (1–5 scale):** How correctly the model transcribed the audio.
  - **Meaning Preservation (1–5 scale):** Whether the transcription retained the original meaning.
  - **Orthography:** Whether the transcription followed standard writing conventions, including accents, diacritics, and special characters.
+ - **Recording Environment:** Evaluators noted the type of environment (quiet, professional studio, or noisy background), since background noise impacts ASR performance.
+ - **Device Used:** Information on whether the recording was made with a mobile phone, laptop microphone, or dedicated mic, as device quality affects clarity.
+ - **Domain/Topic of Speech:** Evaluators indicated whether the speech belonged to a specific topic such as education, health, law, or everyday conversation, to assess domain adaptability.
+ - **Error Types:** Evaluators identified common error categories, such as:
+   - Substitutions (wrong words used)
+   - Omissions (missing words)
+   - Insertions (extra words added)
+   - Pronunciation-related errors
+   - Diacritic/Tone/Special character errors
+   - Named Entity errors (people, places, currencies)
+   - Punctuation errors
+ - **Performance Description:** Free text in which evaluators described strengths and weaknesses of the models in their own words.
  """)
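# Illustrative sketch (hypothetical, not part of this commit): one way a single
# evaluation record covering the criteria above might look; the field names here
# are assumptions, not taken from the app.
example_record = {
    "accuracy": 4,                          # 1–5 scale
    "meaning_preservation": 5,              # 1–5 scale
    "orthography_ok": False,                # diacritics / special characters respected?
    "recording_environment": "noisy background",
    "device": "mobile phone",
    "domain": "health",
    "error_types": ["Omissions", "Diacritic/Tone/Special character errors"],
    "performance_description": "Drops tone marks and merges short words.",
}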

+
  # --- Setup ---
  st.subheader("Evaluation Setup")
  st.write("""
  - **Participants:** 20 evaluators (native speakers or fluent linguists), aged 18–50, majority with postgraduate education.
  - **Platform:** A Gradio-based interface allowed evaluators to upload/record audio, view transcriptions, and complete the feedback form directly online.
  """)
+ st.subheader("Evaluator Contributions")
+ data = [
+     {"Evaluator ID": "eval_001", "Contributions": 65, "Languages": "Afrikaans"},
+     {"Evaluator ID": "eval_002", "Contributions": 50, "Languages": "Afrikaans"},
+     {"Evaluator ID": "eval_005", "Contributions": 63, "Languages": "Amharic"},
+     {"Evaluator ID": "eval_006", "Contributions": 69, "Languages": "Amharic"},
+     {"Evaluator ID": "eval_007", "Contributions": 50, "Languages": "Bemba"},
+     {"Evaluator ID": "eval_008", "Contributions": 53, "Languages": "Bemba"},
+     {"Evaluator ID": "eval_009", "Contributions": 60, "Languages": "Hausa"},
+     {"Evaluator ID": "eval_010", "Contributions": 53, "Languages": "Igbo"},
+     {"Evaluator ID": "eval_011", "Contributions": 12, "Languages": "Lingala"},
+     {"Evaluator ID": "eval_012", "Contributions": 115, "Languages": "Oromo"},
+     {"Evaluator ID": "eval_014", "Contributions": 52, "Languages": "Wolof"},
+     {"Evaluator ID": "eval_015", "Contributions": 8, "Languages": "Xhosa"},
+     {"Evaluator ID": "eval_017", "Contributions": 59, "Languages": "Yoruba"},
+     {"Evaluator ID": "eval_018", "Contributions": 58, "Languages": "Yoruba"},
+     {"Evaluator ID": "eval_019", "Contributions": 52, "Languages": "Luganda"},
+     {"Evaluator ID": "eval_020", "Contributions": 55, "Languages": "Luganda"},
+     {"Evaluator ID": "eval_021", "Contributions": 66, "Languages": "Swahili"},
+     {"Evaluator ID": "eval_022", "Contributions": 64, "Languages": "Swahili"},
+     {"Evaluator ID": "eval_023", "Contributions": 50, "Languages": "Kinyarwanda"},
+     {"Evaluator ID": "eval_024", "Contributions": 53, "Languages": "Kinyarwanda"},
+ ]
+
+ df_evaluators = pd.DataFrame(data)
+
+ st.dataframe(df_evaluators, width="stretch")
+
+ # Optional: also show totals
+ st.write("### Summary")
+ st.write(f"- **Total Evaluators:** {df_evaluators['Evaluator ID'].nunique()}")
+ st.write(f"- **Total Contributions:** {df_evaluators['Contributions'].sum()}")
+
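# Illustrative extension (hypothetical, not part of this commit): the same dataframe
# could also be aggregated per language and charted; assumes df_evaluators exists with
# the "Languages" and "Contributions" columns shown above.
per_language = (
    df_evaluators.groupby("Languages", as_index=False)["Contributions"]
    .sum()
    .sort_values("Contributions", ascending=False)
)
st.bar_chart(per_language, x="Languages", y="Contributions")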
  # --- Findings ---
  st.subheader("Findings")
+
  st.write("""
+ ASR performance varied significantly across languages, reflecting differences in data availability,
+ orthography complexity, and domain coverage. Below we summarize the average **Accuracy** and
+ **Meaning Preservation** scores (1–5 scale) by language.
+ """)
+
+ # Data table of results
+ results_data = [
+     {"Language": "Swahili", "Audios Evaluated": 132, "Accuracy": 4.96, "Meaning": 4.97},
+     {"Language": "Luganda", "Audios Evaluated": 110, "Accuracy": 4.70, "Meaning": 4.78},
+     {"Language": "Amharic", "Audios Evaluated": 132, "Accuracy": 4.65, "Meaning": 4.82},
+     {"Language": "Lingala", "Audios Evaluated": 30, "Accuracy": 4.63, "Meaning": 4.70},
+     {"Language": "Hausa", "Audios Evaluated": 60, "Accuracy": 4.58, "Meaning": 4.97},
+     {"Language": "Oromo", "Audios Evaluated": 115, "Accuracy": 4.54, "Meaning": 4.52},
+     {"Language": "Bemba", "Audios Evaluated": 116, "Accuracy": 4.39, "Meaning": 4.86},
+     {"Language": "Yoruba", "Audios Evaluated": 122, "Accuracy": 4.22, "Meaning": 4.48},
+     {"Language": "Wolof", "Audios Evaluated": 53, "Accuracy": 3.98, "Meaning": 4.13},
+     {"Language": "Kinyarwanda", "Audios Evaluated": 103, "Accuracy": 3.75, "Meaning": 4.81},
+     {"Language": "Xhosa", "Audios Evaluated": 8, "Accuracy": 3.62, "Meaning": 3.38},
+     {"Language": "Afrikaans", "Audios Evaluated": 116, "Accuracy": 3.59, "Meaning": 4.10},
+     {"Language": "Igbo", "Audios Evaluated": 55, "Accuracy": 2.25, "Meaning": 2.15},
+ ]
+
+ df_results = pd.DataFrame(results_data)
+ st.dataframe(df_results, width="stretch")
+
+ # Narrative summary
+ st.markdown("""
+ ### Key Takeaways
  - **High-Performing Languages:**
+   - Swahili (Accuracy 4.96, Meaning 4.97)
+   - Luganda (Accuracy 4.70, Meaning 4.78)
+   - Amharic (Accuracy 4.65, Meaning 4.82)
+   These models produced highly accurate transcriptions with minimal meaning loss.
+
  - **Moderate Performance:**
+   Hausa, Oromo, Bemba, Yoruba, Wolof, and Kinyarwanda – generally understandable, but often with orthography and punctuation issues.

+ - **Low-Performing Languages from evaluation:**
+   - Igbo (Accuracy 2.25, Meaning 2.15)
+   - Afrikaans (Accuracy 3.59, Meaning 4.10)
+   - Xhosa (Accuracy 3.62, Meaning 3.38)
  """)

  # --- Error Patterns ---
  st.subheader("Common Error Patterns")
+
  st.write("""
+ Evaluators highlighted several recurring challenges and areas for improvement across
+ different languages. These reflect both linguistic complexities and system limitations.
+ """)
+
+ error_data = [
+     {"Issue": "Punctuation and Formatting",
+      "Comments": "Absence of punctuation, lack of capitalisation"},
+     {"Issue": "Spelling and Grammar",
+      "Comments": "Word merging, frequent spelling mistakes in individual words"},
+     {"Issue": "Named Entity Recognition",
+      "Comments": "Inaccurate handling of numbers, currencies, and names"},
+     {"Issue": "Device Compatibility & Performance",
+      "Comments": "Better performance on laptops than on mobile phones"},
+ ]
+
+ df_errors = pd.DataFrame(error_data)
+ st.dataframe(df_errors, width="stretch")
+
+ st.markdown("""
+ ### Summary
+ 1. **Punctuation and formatting inconsistencies** make transcriptions harder to read.
+ 2. **Word merging and spelling errors** were frequent, particularly in morphologically rich languages.
+ 3. **Named entity recognition** (e.g., names, currencies, numbers) was a common source of error.
+ 4. **Platform performance** was reported as better on laptops than on mobile devices.
  """)

  # --- Takeaways ---
  st.subheader("Takeaways")
  st.write("""
  - Human ratings generally aligned with automatic metrics: languages with larger datasets (Swahili, Luganda, Amharic) scored highest.
  - WER alone misses issues such as meaning drift, orthography violations, and named entity errors.
  """)
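# Illustrative aside (hypothetical, not part of this commit): the last takeaway
# contrasts WER with human judgment. A single substitution can change the meaning
# entirely while barely moving WER; the sketch below uses the jiwer package,
# which is not used in this file.
import jiwer

reference  = "the clinic opens at nine in the morning"
hypothesis = "the clinic opens at night in the morning"
print(jiwer.wer(reference, hypothesis))  # 0.125: one word in eight, yet the meaning drifted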