Model Card for chall_wav2vec2_xlsr_300m

This model, named ChaLL-300M, is an Automatic Speech Recognition (ASR) system specifically designed and fine-tuned to transcribe the spontaneous speech of young English learners while preserving their errors. It was developed as part of a project to improve speaking practice and provide corrective feedback in language-learning environments.

Model Details

Model Description

ChaLL-300M is an ASR model fine-tuned from the Wav2Vec-XLSR-300M base model. It is the best-performing fold (Fold 5) from the k-fold cross-validation training process described in the paper. The model addresses the unique challenges of transcribing the spontaneous speech of young English learners by preserving their grammatical, lexical, and pronunciation errors.

  • Developed by: Zurich University of Applied Sciences, in collaboration with Swiss public schools
  • Funded by: Swiss Innovation Agency (Innosuisse)
  • Shared by: Zurich University of Applied Sciences
  • Model type: Automatic Speech Recognition (ASR)
  • Language(s) (NLP): English
  • License: [More Information Needed]
  • Finetuned from model: Wav2Vec-XLSR-300M

Model Sources

  • Repository: https://huggingface.co/mict-zhaw/chall_wav2vec2_xlsr_300m
  • Paper: https://openreview.net/forum?id=XPIwvlqIfI

Uses

Direct Use

ChaLL-300M can be directly used for transcribing the English speech of young learners, particularly in educational software designed to provide real-time feedback on language learning. Its ability to accurately preserve and transcribe grammatical, lexical, and pronunciation errors makes it especially useful for applications aimed at improving language acquisition.

Downstream Use

ChaLL-300M can be integrated into larger systems for more complex applications. Examples include:

  • Voice-based Chatbot for Language-Learners (ChaLL): ChaLL is a chatbot-based solution designed to enhance language learners' speaking skills through personalized, interactive practice. It stands out by focusing on real-world communication, unlike traditional language learning methods.
    • Interactive Speaking: Engages learners in conversation, providing real-time feedback.
    • Personalized Feedback: Uses the ChaLL-300M model to preserve and correct speech errors.
    • Supportive Environment: Mimics real-world scenarios for practical learning.
    • Enhanced Competence: Builds fluency and confidence for effective communication.

Out-of-Scope Use

The ChaLL-300M model is not suitable for certain applications:

  • Transcription of Error-Free Speech: The model is optimized for preserving errors in speech; therefore, it is not ideal for contexts where error-free transcription of adult native speakers' English speech is required.
  • High-Stakes Professional Transcriptions: The model should not be used in professional or formal contexts where the accuracy of fluent or highly technical speech is crucial.
  • Bias-Sensitive Applications: Avoid using the model in applications where the bias towards young learners’ speech patterns could negatively impact the results, such as in automated systems assessing adult speech or in regions with significantly different accents or dialects than those present in the training data.

Bias, Risks, and Limitations

While ChaLL-300M is tailored to preserve the speech errors of young English learners, it may not perform well with adult native speaker data or read-aloud tasks. Additionally, the model could potentially amplify biases present in the training data, such as regional accents or typical errors made by specific age groups. There is also a risk that the model might not handle code-switching or non-English phrases accurately.

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. It's recommended to carefully evaluate the context and specific requirements before deploying it in an application.

How to Get Started with the Model

Use the code below to get started with the ChaLL-300M model:

from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torchaudio
import torch

processor = Wav2Vec2Processor.from_pretrained("mict-zhaw/chall_wav2vec2_xlsr_300m")
model = Wav2Vec2ForCTC.from_pretrained("mict-zhaw/chall_wav2vec2_xlsr_300m")

# Load your own audio file and convert it to the 16 kHz mono format the model expects
audio_input, sample_rate = torchaudio.load("path_to_your_audio_file.wav")
audio_input = audio_input.mean(dim=0)  # average channels to mono
if sample_rate != 16000:
    audio_input = torchaudio.functional.resample(audio_input, sample_rate, 16000)

# The processor returns a batched tensor of shape (1, num_samples)
input_values = processor(audio_input, sampling_rate=16000, return_tensors="pt").input_values

# Perform transcription (greedy CTC decoding)
with torch.no_grad():
    logits = model(input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)

print(transcription)
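
Alternatively, the model can be wrapped in the transformers ASR pipeline, which handles audio loading, resampling, and decoding. A minimal sketch:

from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="mict-zhaw/chall_wav2vec2_xlsr_300m")
print(asr("path_to_your_audio_file.wav"))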

Training Details

Training Data

The training data consists of 85 hours of spontaneous English speech recorded from young learners in Swiss public schools, grades 4 to 6. The dataset contains 45,004 utterances from 327 distinct speakers. The complete dataset cannot be freely accessed due to privacy concerns, but it can be obtained through a collaboration agreement. More information is available in the mict-zhaw/chall dataset.

Training Procedure

Preprocessing

  • Removal of error annotations and transcript conventions
  • Conversion to lowercase
  • Standardization of text (e.g., removing unnecessary spaces, normalizing special characters)
  • Transformation of digits to words using num2words
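
For illustration, the text normalization steps above could look roughly like the following sketch; the exact cleaning rules and transcript conventions used in the project are not part of this card, so the regular expressions here are assumptions:

import re
from num2words import num2words

def normalize_transcript(text: str) -> str:
    # Lowercase the transcript
    text = text.lower()
    # Convert digits to words, e.g. "3" -> "three" (assumed convention)
    text = re.sub(r"\d+", lambda m: num2words(int(m.group()), lang="en"), text)
    # Collapse repeated whitespace left over after removing annotations
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(normalize_transcript("She have  3 Cats"))  # -> "she have three cats"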

Training Hyperparameters

  • Training regime: fp32
  • Learning rate: 3e-5
  • Batch size per device: 14
  • Gradient accumulation steps: 15 (total batch size corresponds to approximately 2 hours of audio)
  • Optimizer: 8-bit AdamW
  • Training steps: 4000
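
These hyperparameters map onto transformers TrainingArguments roughly as shown in this sketch; the output directory and the fp16 flag are assumptions, and the dataset, data collator, and remaining arguments are omitted:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="chall_wav2vec2_xlsr_300m",   # placeholder
    learning_rate=3e-5,
    per_device_train_batch_size=14,
    gradient_accumulation_steps=15,
    max_steps=4000,
    optim="adamw_bnb_8bit",  # 8-bit AdamW (requires bitsandbytes)
    fp16=False,              # fp32 training regime
)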

Evaluation

Testing Data, Factors & Metrics

Testing Data

The testing data comes from the same dataset used for training, partitioned into five distinct folds for cross-validation. Each held-out fold includes a variety of grades (4 to 6) to simulate a real-world scenario in which the ASR system is exposed to new, unseen speakers.

Factors

  • Grade levels (4th to 6th grade)
  • School area codes

Metrics

To measure error preservation, we use the error annotations that were manually added to each utterance and a custom phonetic word-level alignment algorithm. This algorithm aligns two or more sequences (e.g., a reference and one or multiple hypotheses), identifying matches, substitutions (S), insertions (I), and deletions (D) at the word level. Our metric, WEPR (Word-Based Error Preservation Rate), considers only those word pairs where the reference word contains an error annotation. WEPR is calculated according to the formula:

Word-Based Error Preservation Rate (WEPR) Formula
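
The exact formula is given in the paper; as a rough illustration of the definition above, a WEPR computation over word-level alignment pairs might look like the following sketch (the phonetic alignment algorithm itself is not reproduced here, and the pair format is an assumption):

def wepr(aligned_pairs):
    """aligned_pairs: (ref_word, hyp_word, has_error_annotation) tuples from the alignment,
    with hyp_word set to None for deletions. Only error-annotated reference words count."""
    error_pairs = [(ref, hyp) for ref, hyp, is_error in aligned_pairs if is_error]
    if not error_pairs:
        return 0.0
    # An error counts as preserved when the hypothesis reproduces the erroneous reference word
    not_preserved = sum(1 for ref, hyp in error_pairs if hyp != ref)
    return not_preserved / len(error_pairs)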

In addition to WEPR, we also compute the following general ASR metrics using all words in the utterance: Word Error Rate (WER), Character Error Rate (CER), and character n-gram F-Score (chrF).
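
These general metrics can be computed with standard tooling; a minimal sketch using the Hugging Face evaluate library (the project's actual evaluation scripts may differ, and note that evaluate reports chrF on a 0-100 scale):

import evaluate

wer = evaluate.load("wer")
cer = evaluate.load("cer")
chrf = evaluate.load("chrf")

references = ["she have three cats"]
predictions = ["she have tree cats"]

print("WER:", wer.compute(references=references, predictions=predictions))
print("CER:", cer.compute(references=references, predictions=predictions))
print("chrF:", chrf.compute(references=[[r] for r in references], predictions=predictions)["score"])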

Results

Summary

The ChaLL-300M model achieves strong performance in preserving the errors made by young English learners, significantly outperforming baseline models in terms of WEPR. The model also achieves competitive WER, CER, and chrF scores, indicating its effectiveness for the target demographic:

  • WER: 0.30 ± 0.01
  • CER: 0.16 ± 0.01
  • chrF: 0.68 ± 0.01
  • WEPR: 0.38 ± 0.03

Model Examination

Further qualitative analysis on preserved errors indicates that the model effectively retains grammatical, lexical, and pronunciation errors made by young learners. This aspect makes it particularly suitable for educational applications focused on language learning and correction.

Citation

BibTeX:

@inproceedings{anonymous2024errorpreserving,
  title={Error-preserving Automatic Speech Recognition of Young English Learners' Language},
  author={Janick Michot and Manuela Hürlimann and Jan Deriu and Luzia Sauer and Katsiaryna Mlynchyk and Mark Cieliebak},
  booktitle={The 62nd Annual Meeting of the Association for Computational Linguistics},
  year={2024},
  url={https://openreview.net/forum?id=XPIwvlqIfI}
}

Model Card Contact

@mict-zhaw
