Beijuka committed on
Commit 03449f2 · verified · 1 Parent(s): eb23303

Update src/streamlit_app.py

Files changed (1):
  1. src/streamlit_app.py +80 -3

src/streamlit_app.py CHANGED
@@ -15,13 +15,14 @@ st.set_page_config(
 st.title("ASR for African Languages Model Hub")
 
 # Create tabs
-tab1, tab2, tab3, tab4, tab5, tab6 = st.tabs([
+tab1, tab2, tab3, tab4, tab5, tab6, tab7 = st.tabs([
     "About",
     "Benchmark Dataset",
     "Model Collections",
     "Evaluation Scenarios",
     "ASR models demo",
-    "Results"
+    "Results",
+    "Human Evaluation of ASR Models"
 ])
 
 with tab5:
@@ -299,4 +300,80 @@ with tab6:
     - **Language models (LMs) provide the greatest benefit in low-data regimes (<50 hours)** by supplying additional contextual information.
     - As supervised training data increases, **the added value of LMs decreases**, though their effectiveness varies somewhat across languages.
     - **Model choice matters**: XLS-R benefits most from scaling data, while W2v-BERT shines in extremely low-resource scenarios.
-    """)
+    """)
+
+with tab7:
+    st.header("Human Evaluation of ASR Models")
+
+    # --- Introduction ---
+    st.subheader("Introduction")
+    st.write("""
+    ASR systems are typically assessed using automatic metrics such as Word Error Rate (WER) or Character Error Rate (CER).
+    While these provide valuable quantitative insights, they do not fully capture how well transcriptions preserve meaning, respect language orthography, or handle specific features such as tone, diacritics, or named entities.
+    To address these gaps, we conducted a human evaluation of ASR systems across African languages to assess the qualitative performance of the best-performing models.
+    """)
+
+    # --- Guidelines ---
+    st.subheader("Evaluation Guidelines")
+    st.write("""
+    Evaluators were provided with structured instructions to ensure consistency in their assessments. The main criteria included:
+
+    - **Accuracy (1–5 scale):** How correctly the model transcribed the audio.
+    - **Meaning Preservation (1–5 scale):** Whether the transcription retained the original meaning.
+    - **Orthography:** Whether the transcription followed standard writing conventions, including accents, diacritics, and special characters.
+    - **Error Types:** Evaluators identified common error categories, such as:
+        - Substitutions (wrong words used)
+        - Omissions (missing words)
+        - Insertions (extra words added)
+        - Pronunciation-related errors
+        - Diacritic/tone/special-character errors
+        - Named-entity errors (people, places, currencies)
+        - Punctuation errors
+    - **Performance Description:** Free text where evaluators described the strengths and weaknesses of the models.
+    """)
+
+    # --- Setup ---
+    st.subheader("Evaluation Setup")
+    st.write("""
+    - **Languages Evaluated:** 13 languages: Afrikaans, Amharic, Bemba, Hausa, Igbo, Kinyarwanda, Lingala, Luganda, Oromo, Swahili, Wolof, Xhosa, and Yoruba.
+    - **Participants:** 20 evaluators (native speakers or fluent linguists), aged 18–50, the majority with postgraduate education.
+    - **Platform:** A Gradio-based interface allowed evaluators to upload or record audio, view transcriptions, and complete the feedback form directly online.
+    """)
+
+    # --- Findings ---
+    st.subheader("Findings")
+    st.write("""
+    - **High-Performing Languages:**
+        - Swahili (Accuracy 4.96, Meaning 4.97)
+        - Luganda (Accuracy 4.70, Meaning 4.78)
+        - Amharic (Accuracy 4.65, Meaning 4.82)
+
+      These models produced highly accurate transcriptions with minimal meaning loss.
+
+    - **Moderate Performance:**
+      Hausa, Oromo, Bemba, Yoruba, and Wolof: generally understandable, but often with orthography and punctuation issues.
+
+    - **Low-Performing Languages:**
+        - Igbo (Accuracy 2.25, Meaning 2.15)
+        - Afrikaans (Accuracy 3.59, Meaning 4.10)
+        - Xhosa (Accuracy 3.62, Meaning 3.38)
+
+      These suffered from limited training data, frequent substitution/omission errors, and poor handling of named entities.
+    """)
+
+    # --- Error Patterns ---
+    st.subheader("Common Error Patterns")
+    st.write("""
+    1. Punctuation and formatting inconsistencies.
+    2. Word merging or spacing errors, especially in morphologically rich languages.
+    3. Named-entity recognition failures (numbers, currencies, names).
+    4. Spelling and orthography deviations, especially in languages with tones/diacritics.
+    """)
+
+    # --- Takeaways ---
+    st.subheader("Takeaways")
+    st.write("""
+    - Human ratings generally aligned with automatic metrics: languages with larger datasets (Swahili, Luganda, Amharic) scored highest.
+    - Language models (LMs) were most effective in **low-data regimes (<50 hours)**, improving readability and accuracy.
+    - WER alone misses issues such as meaning drift, orthography violations, and named-entity errors.
+    - More curated, domain-diverse training data is needed for low-performing languages such as Igbo and Afrikaans.
+    - Human evaluation remains essential for **user-facing ASR systems**, where usability depends on meaning preservation and fluency, not just raw error rates.
+    """)