Phoneme-based Pronunciation Trainer

Interface

[Image: Phoneme-based pronunciation interface]

Feedback Example

[Image: Phoneme-based pronunciation feedback example]

Approach

Create teacher/ground truth audio files

We start by creating teacher audio files using OpenAI's Text-to-Speech API:

import logging
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
log = logging.getLogger(__name__)


def produce_teacher_audio(audio_name, text):
    """
    Produce a teacher audio file for the given text using OpenAI's Text-to-Speech API.
    See: https://platform.openai.com/docs/guides/text-to-speech
    """

    speech_file_path = Path(f"audios/teacher/{audio_name}")
    response = client.audio.speech.create(
        model="tts-1-hd",
        voice="alloy",
        input=text,
    )

    response.stream_to_file(speech_file_path)
    log.info(
        f"Successfully produced teacher audio for {text=} at {speech_file_path.name=} 🎉"
    )


if __name__ == "__main__":
    # Produce teacher/ground-truth audio files for the given examples
    data = load_data()
    for datum in data:
        produce_teacher_audio(datum["learner_recording"], datum["text_to_record"])

    # Produce an additional example
    produce_teacher_audio("book.wav", "The book is on the table")

Audio-based and Personalized Input

Given the newly created teacher audios, we can now use both learner and teacher audios as input to our system and thereby avoid the transcription-based limitations that we described in the grapheme-based solution.

As you can see in the following image, we also supply the native language of the learner and the language they want to acquire.

[Image: Phoneme-based pronunciation input example]
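
Purely for illustration, the input side of such an interface could be wired up as in the following sketch; the function name give_feedback, the language choices, and the component layout are assumptions, not the project's actual code.

import gradio as gr


def give_feedback(l1, l2, learner_audio, teacher_audio):
    """Placeholder for the app's evaluation pipeline (transcription + feedback)."""
    return f"Feedback for a {l1} speaker learning {l2} would appear here."


demo = gr.Interface(
    fn=give_feedback,
    inputs=[
        gr.Dropdown(["German", "French", "Spanish"], label="Native language (L1)"),
        gr.Dropdown(["English"], label="Target language (L2)"),
        gr.Audio(label="Learner recording"),
        gr.Audio(label="Teacher recording"),
    ],
    outputs=gr.Markdown(),
)

if __name__ == "__main__":
    demo.launch()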

Phoneme-Based ASR

We use a Wav2Vec2Phoneme model as proposed in Simple and Effective Zero-shot Cross-lingual Phoneme Recognition:

Recent progress in self-training, self-supervised pretraining and unsupervised learning enabled well performing speech recognition systems without any labeled data. However, in many cases there is labeled data available for related languages which is not utilized by these methods. This paper extends previous work on zero-shot cross-lingual transfer learning by fine-tuning a multilingually pretrained wav2vec 2.0 model to transcribe unseen languages. This is done by mapping phonemes of the training languages to the target language using articulatory features. Experiments show that this simple method significantly outperforms prior work which introduced task-specific architectures and used only part of a monolingually pretrained model.

In particular, we use the following checkpoint: facebook/wav2vec2-lv-60-espeak-cv-ft · Hugging Face.

This checkpoint leverages the pretrained checkpoint wav2vec2-large-lv60 and is fine-tuned on CommonVoice to recognize phonetic labels in multiple languages.

For convenience, we create a partial transcribe_to_phonemes function as an interface to this checkpoint:

from enum import StrEnum
from functools import partial

import numpy as np
from transformers import pipeline


class TranscriberChoice(StrEnum):
    grapheme = "openai/whisper-base.en"
    phoneme = "facebook/wav2vec2-lv-60-espeak-cv-ft"


def transcribe(
    audio, transcriber_choice: TranscriberChoice = TranscriberChoice.grapheme
):
    """
    The transcribe function takes a single parameter, audio, which is a numpy array of the audio the user recorded.
    The pipeline object expects this in float32 format, so we convert it first to float32 and then extract the transcribed text.
    """
    transcriber = pipeline("automatic-speech-recognition", model=transcriber_choice)
    try:
        sr, y = audio
        print(f"Sampling rate is {sr}")
    except TypeError:
        return None
    y = y.astype(np.float32)
    y /= np.max(np.abs(y))
    transcription = transcriber({"sampling_rate": sr, "raw": y})["text"]
    return transcription


transcribe_to_phonemes = partial(
    transcribe, transcriber_choice=TranscriberChoice.phoneme
)
transcribe_to_graphemes = partial(
    transcribe, transcriber_choice=TranscriberChoice.grapheme
)
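
As a quick sanity check, the partials can be called on any audio loaded as a (sampling_rate, samples) tuple; the librosa call and the file path below are just for illustration.

import librosa

# Load an example recording as a mono float array, resampled to 16 kHz
y, sr = librosa.load("audios/teacher/book.wav", sr=16000)

print(transcribe_to_phonemes((sr, y)))   # e.g. a string of espeak-style phoneme symbols
print(transcribe_to_graphemes((sr, y)))  # ordinary English text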

Simple Evaluation based on SequenceMatcher

We can again apply the SequenceMatcher, but this time compare the phoneme transcriptions of teacher and learner audio.
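
A minimal sketch of such a comparison, assuming teacher_audio and learner_audio are (sampling_rate, samples) tuples as above (the details may differ from the app's actual scoring code):

from difflib import SequenceMatcher

teacher_phonemes = transcribe_to_phonemes(teacher_audio)
learner_phonemes = transcribe_to_phonemes(learner_audio)

# ratio() is in [0, 1]; higher means the two phoneme sequences are more alike
similarity = SequenceMatcher(None, teacher_phonemes, learner_phonemes).ratio()
print(f"Phoneme similarity: {similarity:.2%}")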

As illustrated here for the date example, we get a much better similarity score:

Grapheme-based

[Image: date_example_grapheme_based.png]

Phoneme-based

[Image: date_example_phoneme_based.png]

But the feedback message is still not very helpful.

Advanced Evaluation based on LLM

Much more powerful, however, is an evaluation that leverages an LLM (GPT-4 Turbo).

We create a simple LLM chain

from pathlib import Path

import gradio as gr
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template(Path("prompt.md").read_text())
output_parser = StrOutputParser()


def create_llm(openai_api_key=openai_api_key):
    if openai_api_key in [None, ""]:
        raise gr.Error(
            "No API key provided! You can find your API key at https://platform.openai.com/account/api-keys."
        )
    return ChatOpenAI(model="gpt-4-turbo", openai_api_key=openai_api_key)


def create_llm_chain(
    prompt=prompt, output_parser=output_parser, openai_api_key=openai_api_key
):
    # Reuse create_llm so the API-key check lives in one place
    llm = create_llm(openai_api_key)
    return prompt | llm | output_parser

and ingest the following inputs:

def advanced_evaluation(
    learner_l1,
    learner_l2,
    learner_phoneme_transcription,
    teacher_phoneme_transcription,
) -> str:
    """Provide LLM-based feedback"""
    return create_llm_chain().invoke(
        {
            "learner_l1": learner_l1,
            "learner_l2": learner_l2,
            "learner_phoneme_transcription": learner_phoneme_transcription,
            "teacher_phoneme_transcription": teacher_phoneme_transcription,
        }
    )

into an LLM prompt template that can be found here:
Enhanced Prompt Template for Language Model Expert with Motivational Elements and Language-Specific Feedback.
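
For example, with illustrative values (the language names and phoneme strings below are placeholders, not actual model output):

feedback = advanced_evaluation(
    learner_l1="German",
    learner_l2="English",
    learner_phoneme_transcription="ð ə d eɪ t ...",
    teacher_phoneme_transcription="ð ə d eɪ t ...",
)
print(feedback)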

Limitations & Outlook

Improve ASR Model

Due to time constraints, I selected the first phoneme recognition model that I found on Hugging Face. With more time, one could

  • Experiment with different checkpoints at Phoneme Recognition Models - Hugging Face
  • Adapt OpenAI's Whisper model to phoneme recognition/transcription by simply changing the tokenizer to handle the new vocabulary (the set of phonemes) and fine-tuning the model on an (audio, phoneme) dataset with an appropriate metric. See openai/whisper · Phoneme recognition for a short discussion about it.
  • Employ a model like m-bain/whisperX: WhisperX and possibly fine-tune it, to achieve word-level timestamps & diarization.
  • Use a probabilistic approach to report transcription confidence and adjust or omit feedback accordingly

Further, the output of the ASR model could be enhanced by grouping phonemes (to allow for better word-level feedback and alignment) and by adding better prosodic/suprasegmental support.

Improve LLM prompt

Again due to time constraints, I created a single prompt template. Further prompt engineering and metaprompting could

  • Reduce hallucinations
  • Create more didactically sound feedback, e.g. divided into different feedback sections like
    • Place. The place of articulation is where a sound is made.
    • Manner. The manner of articulation is how a sound is made.
    • Voicing. Voice or voicing refers to the vibration of the vocal folds.
  • Recommend fitting exercises and content from babbel.com

Improve UI/feedback time

The LLM response currently takes some time. Among many ways to tackle this problem, one could:

  • Stream the response for immediate feedback and better UX (see the sketch after this list)
  • Use clever caching for immediate responses
  • Collect several attempts, and only provide the LLM feedback on an aggregate of attempts (for example in a dedicated pronunciation trainer section)
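
For the first point, a minimal streaming variant of advanced_evaluation could look like the following sketch; it relies on the chain's stream method and on Gradio's support for generator functions, and is an assumption rather than the project's actual code.

def advanced_evaluation_streaming(
    learner_l1,
    learner_l2,
    learner_phoneme_transcription,
    teacher_phoneme_transcription,
):
    """Yield partial feedback as the LLM produces it instead of waiting for the full response."""
    partial_feedback = ""
    for chunk in create_llm_chain().stream(
        {
            "learner_l1": learner_l1,
            "learner_l2": learner_l2,
            "learner_phoneme_transcription": learner_phoneme_transcription,
            "teacher_phoneme_transcription": teacher_phoneme_transcription,
        }
    ):
        partial_feedback += chunk
        yield partial_feedback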

Personalization

The personalization is very limited, as it only looks at the learner's L1 and L2. We could further:

  • Compare the current attempt with previous attempts in order to show progress/regress. This could be especially motivating if the learner is still far from a perfect pronunciation but is steadily improving.
  • Include additional learner information, like preferences and proficiency, etc.

Alternative phoneme-based feedback

  • Instead of, or complementary to, employing an LLM for advanced and personalized feedback, we could provide scores and feedback based on a distance measure between phonemes.
    • Among a variety of possible distances, a simple starting point could be a 3-D distance over the place of articulation (where the sound is made); see the sketch below.
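
A purely illustrative sketch of such a distance; the coordinates below are invented placeholders, not a validated articulatory model:

import numpy as np

# Hypothetical 3-D coordinates per phoneme (place, manner, voicing), invented for illustration.
# Real values would come from an articulatory/phonological feature system.
ARTICULATORY_COORDS = {
    "t": np.array([0.2, 0.8, 0.0]),  # voiceless alveolar stop
    "d": np.array([0.2, 0.8, 1.0]),  # voiced alveolar stop
    "k": np.array([0.8, 0.8, 0.0]),  # voiceless velar stop
}


def phoneme_distance(p1: str, p2: str) -> float:
    """Euclidean distance between two phonemes in the toy articulatory space."""
    return float(np.linalg.norm(ARTICULATORY_COORDS[p1] - ARTICULATORY_COORDS[p2]))


print(phoneme_distance("t", "d"))  # small: only voicing differs
print(phoneme_distance("t", "k"))  # larger: the place of articulation differs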