Model

This model is Wav2Vec2-Large-XLSR-53 fine-tuned on the manually annotated subset of CMU's L2-Arctic dataset to produce automatic phonetic transcriptions in IPA. The fine-tuning procedure is similar to the one vitouphy described for the TIMIT dataset.

Usage

To use the model, create a pipeline and call it with the path to a WAV file, which must be sampled at 16 kHz.

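from transformers import pipeline

# Load the fine-tuned checkpoint from the Hugging Face Hub.
pipe = pipeline(model="mrrubino/wav2vec2-large-xlsr-53-l2-arctic-phoneme")
# The pipeline returns a dict; the IPA transcription is stored under "text".
transcription = pipe("file.wav")["text"]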
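Because the model expects 16 kHz input, it can help to check a file's sampling rate before transcribing it. The following is a minimal sketch using only Python's standard `wave` module; the helper names `wav_sample_rate` and `is_model_ready` are hypothetical and not part of this model or of `transformers`, and the sketch assumes an uncompressed PCM WAV file.

```python
import wave

# Sampling rate the model card asks for (16 kHz).
TARGET_RATE = 16_000


def wav_sample_rate(path):
    """Return the sampling rate in Hz stored in a PCM WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def is_model_ready(path, target_rate=TARGET_RATE):
    """Return True if the file is already sampled at the expected rate."""
    return wav_sample_rate(path) == target_rate
```

If `is_model_ready("file.wav")` returns `False`, the audio should be resampled to 16 kHz with an audio tool of your choice before it is passed to the pipeline.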

Results

The manually annotated subset of L2-Arctic was divided into training and testing datasets with a 90/10 split. The performance metrics for the testing dataset are included below.

WER (word error rate) - 0.425

CER (character error rate) - 0.128
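For readers unfamiliar with these metrics, the sketch below shows the standard edit-distance definitions of WER and CER. It is an illustration only: the model card does not state which tool or tokenization produced the figures above, so the whitespace tokenization used for WER here is an assumption, and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length.

    Assumes a non-empty reference string.
    """
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate over whitespace-separated tokens (an assumption).

    Assumes the reference contains at least one token.
    """
    ref_tokens = reference.split()
    return edit_distance(ref_tokens, hypothesis.split()) / len(ref_tokens)
```

Both functions return a fraction of the reference length, so a CER of 0.128 corresponds to roughly 13 character-level edits per 100 reference characters.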

Citation

If you find our model helpful, please feel free to cite us.

@article{Bo_Rubino_Xu_2024,
  title={A Mispronunciation-Based Voice-Omics Representation Framework for Screening Specific Language Impairments in Children},
  DOI={10.1109/ichi61247.2024.00045},
  journal={2024 IEEE 12th International Conference on Healthcare Informatics (ICHI)},
  author={Bo, Wei and Rubino, Matthew and Xu, Wenyao},
  year={2024},
  month={Jun},
  pages={294–304}
} 