Model Card for herrado99/wav2vec2-phoneme-vowel-ft

This model is a fine-tuned version of facebook/wav2vec2-lv-60-espeak-cv-ft.

It has been adapted for English phoneme recognition with improved vowel discrimination, using a targeted synthetic dataset designed to address known weaknesses in vowel modelling and accent robustness.

Objective

The fine-tuning process focused on:

  • Reducing vowel phoneme error rate (PER)
  • Improving performance across diverse English accents
  • Addressing systematic confusions such as:
      ◦ /ɪ/ → /iː/
      ◦ /ə/ → /æ/, /ʌ/
      ◦ /ɒ/ → /ɔ/, /ɑː/

Training Data

  • ~5,000 synthetic audio samples generated with ElevenLabs
  • Word-level recordings generated using multiple TTS voices
  • Balanced across:
      ◦ vowel contrasts (short vs long, front/back)
      ◦ diphthongs
      ◦ schwa and unstressed vowels
      ◦ accent-sensitive words

Key design principle:

The dataset was explicitly constructed to target observed phoneme-level confusions.
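
As a rough illustration of how such phoneme-level confusions can be tallied, the sketch below counts mismatched (reference, predicted) pairs. The aligned pairs are hypothetical illustration data, not results from this model, and the arrow direction (reference → predicted) is an assumption about how the confusions above are written.

```python
from collections import Counter


def substitution_counts(pairs):
    """Count (reference, predicted) phoneme pairs that disagree."""
    return Counter((ref, hyp) for ref, hyp in pairs if ref != hyp)


# Hypothetical aligned (reference, predicted) phoneme pairs, for illustration only.
aligned = [
    ("ɪ", "iː"), ("ɪ", "ɪ"), ("ə", "æ"), ("ə", "ʌ"),
    ("ɒ", "ɔ"), ("ɪ", "iː"), ("ɒ", "ɑː"), ("ə", "ə"),
]

confusions = substitution_counts(aligned)

# List the confusions from most to least frequent.
for (ref, hyp), count in confusions.most_common():
    print(f"/{ref}/ → /{hyp}/: {count}")
```

Ranking confusions this way is one possible basis for deciding which vowel contrasts a targeted dataset should cover most heavily.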

Training Setup

  • Task: CTC phoneme prediction
  • Target representation: IPA-normalised phoneme sequences
  • Feature encoder: frozen
  • Fine-tuning scope: Transformer + CTC head
  • Hyperparameters:
      ◦ Learning rate: 1e-5
      ◦ Batch size: 32 (A100)
      ◦ Epochs: ~8
      ◦ Precision: bf16

Results

Vowel PER (Primary Metric)

Fine-tuning resulted in:

  • Absolute reduction of ~0.10–0.20 in vowel PER
  • Relative improvement of up to ~45% in worst-performing accents

Example improvements:

Accent             Before   After   Improvement
Chinese            ~0.61    ~0.37   ↓ 0.24
Filipino           ~0.62    ~0.34   ↓ 0.28
African-American   ~0.53    ~0.29   ↓ 0.24
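
The relationship between the absolute and relative figures can be checked with a short calculation. The sketch below uses the approximate before/after values from the example improvements above; because those values are rounded, the percentages are indicative only.

```python
# Approximate before/after vowel PER values from the example improvements above.
results = {
    "Chinese": (0.61, 0.37),
    "Filipino": (0.62, 0.34),
    "African-American": (0.53, 0.29),
}


def improvement(before, after):
    """Return (absolute reduction, relative reduction) in vowel PER."""
    absolute = before - after
    return absolute, absolute / before


summary = {accent: improvement(b, a) for accent, (b, a) in results.items()}

for accent, (absolute, relative) in summary.items():
    print(f"{accent}: {absolute:.2f} absolute, {relative:.1%} relative")
```

For these three accents the relative reduction works out to roughly 39% to 45%, in line with the "up to ~45%" figure.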

Key Findings

  • Vowel errors were the dominant source of phoneme error
  • Fine-tuning significantly improved:
      ◦ vowel length distinction
      ◦ central vowel stability
      ◦ back vowel consistency
  • Accent-related performance gaps were reduced but not eliminated

Limitations

  • Model still exhibits:
      ◦ vowel space compression (e.g. /æ/, /ɛ/, /ʌ/)
      ◦ schwa instability
      ◦ accent-dependent variation
  • Trained on isolated words, not continuous speech
  • Synthetic data may not fully capture real-world acoustic variability

Evaluation

Evaluated on a held-out dataset of:

  • 1,350 synthetic speech samples from 17 accented English speakers
  • Multiple English accents

Metrics:

  • Phoneme Error Rate (PER)
  • Vowel-specific PER (primary metric)
  • Token overlap
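
For readers unfamiliar with these metrics, the sketch below shows one common way to compute PER from a Levenshtein alignment: substitutions, deletions and insertions divided by the reference length. The card does not define vowel-specific PER precisely, so the `vowel_per` function is an assumption: it counts substitutions and deletions at reference vowel positions only. The `VOWELS` set and the example transcriptions are illustrative, not the evaluation data.

```python
def align(ref, hyp):
    """Levenshtein-align two token lists.

    Returns (ref_token, hyp_token) pairs; None on the hyp side marks a
    deletion and None on the ref side marks an insertion.
    """
    n, m = len(ref), len(hyp)
    # dist[i][j] is the edit distance between ref[:i] and hyp[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or substitution
            )
    # Backtrace from the bottom-right corner to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                pairs.append((ref[i - 1], hyp[j - 1]))
                i -= 1
                j -= 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            pairs.append((ref[i - 1], None))
            i -= 1
        else:
            pairs.append((None, hyp[j - 1]))
            j -= 1
    return pairs[::-1]


def per(ref, hyp):
    """Phoneme error rate: edit operations divided by reference length."""
    errors = sum(1 for r, h in align(ref, hyp) if r != h)
    return errors / len(ref)


# Illustrative subset of English vowels (an assumption, not the card's inventory).
VOWELS = {"ɪ", "iː", "ə", "æ", "ʌ", "ɒ", "ɔ", "ɑː", "ɛ"}


def vowel_per(ref, hyp, vowels=VOWELS):
    """Assumed definition: error rate over reference vowel positions only."""
    vowel_pairs = [(r, h) for r, h in align(ref, hyp) if r in vowels]
    if not vowel_pairs:
        return 0.0
    return sum(1 for r, h in vowel_pairs if r != h) / len(vowel_pairs)


# "ship" recognised as "sheep": one vowel substitution.
reference = ["ʃ", "ɪ", "p"]
hypothesis = ["ʃ", "iː", "p"]
print(per(reference, hypothesis), vowel_per(reference, hypothesis))
```

A single vowel substitution in a three-phoneme word gives a modest overall PER but a vowel PER of 1.0, which illustrates why a vowel-specific metric is more sensitive to the errors this model targets.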

Model Details

Model Description

  • Developed by: Gerard McBreen, Komodo Learning
  • Funded by [optional]: [More Information Needed]
  • Shared by [optional]: [More Information Needed]
  • Model type: Wav2Vec2 CTC phoneme recognition model
  • Language(s) (NLP): English
  • License: Apache License 2.0
  • Finetuned from model [optional]: facebook/wav2vec2-lv-60-espeak-cv-ft

Model Sources [optional]

  • Repository: [More Information Needed]
  • Paper [optional]: [More Information Needed]
  • Demo [optional]: [More Information Needed]

Uses

Direct Use

from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import torch
import librosa

# Replace the placeholder repository ID with the actual model ID
processor = Wav2Vec2Processor.from_pretrained("your-username/model-name")
model = Wav2Vec2ForCTC.from_pretrained("your-username/model-name")

# Load the audio as mono and resample to 16 kHz
speech, sr = librosa.load("audio.wav", sr=16000)

inputs = processor(speech, sampling_rate=16000, return_tensors="pt", padding=True)

# Inference only, so gradients are not needed
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding: take the most likely token for each frame
pred_ids = torch.argmax(logits, dim=-1)
predicted_phonemes = processor.batch_decode(pred_ids)[0]

print(predicted_phonemes)
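
The argmax-then-decode step above is greedy CTC decoding. The sketch below illustrates, in plain Python, the collapsing rule that such decoding applies to per-frame predictions: merge consecutive repeats, then drop the blank token. The vocabulary, frame IDs and `blank_id=0` are hypothetical values chosen for illustration, not this model's actual vocabulary.

```python
def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse consecutive repeated IDs, then drop blank tokens."""
    collapsed = []
    previous = None
    for token_id in frame_ids:
        # Keep a token only when it starts a new run and is not the blank.
        if token_id != previous and token_id != blank_id:
            collapsed.append(token_id)
        previous = token_id
    return collapsed


# Hypothetical vocabulary and per-frame argmax IDs, for illustration only.
vocab = {0: "<pad>", 1: "ʃ", 2: "ɪ", 3: "p"}
frames = [0, 1, 1, 0, 2, 2, 2, 0, 0, 3, 3]

decoded = [vocab[i] for i in ctc_greedy_collapse(frames)]
print(" ".join(decoded))
```

Because a blank between two identical IDs starts a new run, genuinely repeated phonemes survive the collapse while frame-level repeats do not.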

Downstream Use [optional]

[More Information Needed]

Out-of-Scope Use

[More Information Needed]

Bias, Risks, and Limitations

[More Information Needed]

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

How to Get Started with the Model

Use the code below to get started with the model.

[More Information Needed]

Training Details

Training Data

[More Information Needed]

Training Procedure

Preprocessing [optional]

[More Information Needed]

Training Hyperparameters

  • Training regime: bf16 precision; learning rate 1e-5, batch size 32, ~8 epochs
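
The training setup reported in this card (learning rate 1e-5, batch size 32, ~8 epochs, bf16, frozen feature encoder) can be captured in a small configuration sketch. The class and function names below are hypothetical. The step counts assume all ~5,000 samples are used for training with no gradient accumulation, which the card does not state. The frozen prefix assumes the usual parameter naming of `Wav2Vec2ForCTC` in transformers, where calling `model.freeze_feature_encoder()` would typically achieve the same effect.

```python
import math
from dataclasses import dataclass


@dataclass
class FineTuneConfig:
    learning_rate: float = 1e-5
    batch_size: int = 32
    epochs: int = 8          # the card reports "~8"
    precision: str = "bf16"
    num_samples: int = 5000  # the card reports "~5,000"

    def steps_per_epoch(self) -> int:
        # Assumes one optimizer step per batch (no gradient accumulation).
        return math.ceil(self.num_samples / self.batch_size)

    def total_steps(self) -> int:
        return self.steps_per_epoch() * self.epochs


# Assumed parameter-name prefix of the convolutional feature encoder.
FROZEN_PREFIXES = ("wav2vec2.feature_extractor.",)


def is_trainable(param_name: str) -> bool:
    """True for parameters outside the frozen feature encoder."""
    return not param_name.startswith(FROZEN_PREFIXES)


config = FineTuneConfig()
print(config.steps_per_epoch(), config.total_steps())
```

Under these assumptions the run amounts to on the order of a thousand optimizer steps, a small budget that fits the narrow, targeted scope of the fine-tuning.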

Speeds, Sizes, Times [optional]

[More Information Needed]

Evaluation

Testing Data, Factors & Metrics

Testing Data

[More Information Needed]

Factors

[More Information Needed]

Metrics

[More Information Needed]

Results

Summary

This model demonstrates that targeted fine-tuning can significantly improve vowel recognition and reduce accent-related phoneme errors, while highlighting the remaining challenges in robust phoneme modelling across diverse speech patterns.

Model Examination [optional]

[More Information Needed]

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: A100 GPU
  • Hours used: [More Information Needed]
  • Cloud Provider: [More Information Needed]
  • Compute Region: [More Information Needed]
  • Carbon Emitted: [More Information Needed]

Technical Specifications [optional]

Model Architecture and Objective

[More Information Needed]

Compute Infrastructure

[More Information Needed]

Hardware

[More Information Needed]

Software

[More Information Needed]

Citation [optional]

BibTeX:

[More Information Needed]

APA:

[More Information Needed]

Glossary [optional]

[More Information Needed]

More Information [optional]

[More Information Needed]

Model Card Authors [optional]

[More Information Needed]

Model Card Contact

[More Information Needed]
