Model Card for herrado99/wav2vec2-phoneme-vowel-ft
This model is a fine-tuned version of facebook/wav2vec2-lv-60-espeak-cv-ft.
It has been adapted for English phoneme recognition with improved vowel discrimination, using a targeted synthetic dataset designed to address known weaknesses in vowel modelling and accent robustness.
Objective
The fine-tuning process focused on:
- Reducing vowel phoneme error rate (PER)
- Improving performance across diverse English accents
- Addressing systematic confusions such as:
  - /ɪ/ → /iː/
  - /ə/ → /æ/, /ʌ/
  - /ɒ/ → /ɔ/, /ɑː/
Training Data
- ~5,000 synthetic audio samples generated with ElevenLabs
- Word-level recordings generated using multiple TTS voices
- Balanced across:
  - vowel contrasts (short vs long, front/back)
  - diphthongs
  - schwa and unstressed vowels
  - accent-sensitive words
Key design principle: the dataset was explicitly constructed to target observed phoneme-level confusions.
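One way to surface the kind of confusions listed above is to align reference and predicted phoneme sequences and count one-to-one substitutions. The sketch below is illustrative only: the card does not describe the tooling actually used, the function name and example words are hypothetical, and `difflib.SequenceMatcher` gives an approximate alignment rather than a true minimum-edit alignment.

```python
from collections import Counter
from difflib import SequenceMatcher

def substitution_confusions(pairs):
    """Count (reference, predicted) phoneme substitutions over sequence pairs."""
    counts = Counter()
    for ref, hyp in pairs:
        matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            # Only equal-length 'replace' spans map phonemes one-to-one.
            if tag == "replace" and (i2 - i1) == (j2 - j1):
                counts.update(zip(ref[i1:i2], hyp[j1:j2]))
    return counts

# Hypothetical (reference, prediction) pairs for illustration.
pairs = [
    (["s", "ɪ", "t"], ["s", "iː", "t"]),
    (["b", "ɪ", "t"], ["b", "iː", "t"]),
    (["k", "ɒ", "t"], ["k", "ɔ", "t"]),
]
confusions = substitution_confusions(pairs)
for (ref_phone, hyp_phone), n in confusions.most_common():
    print(f"/{ref_phone}/ -> /{hyp_phone}/: {n}")
```

Sorting the counts this way makes the most frequent confusions stand out, which is the sort of signal a targeted dataset can then be built around.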
Training Setup
- Task: CTC phoneme prediction
- Target representation: IPA-normalised phoneme sequences
- Feature encoder: frozen
- Fine-tuning scope: Transformer + CTC head
- Hyperparameters:
  - Learning rate: 1e-5
  - Batch size: 32 (A100)
  - Epochs: ~8
  - Precision: bf16
Results
Vowel PER (Primary Metric)
Fine-tuning resulted in:
- Absolute reduction of ~0.10–0.20 in vowel PER
- Relative improvement of up to ~45% in worst-performing accents
Example improvements:
| Accent | Vowel PER before | Vowel PER after | Absolute improvement |
| --- | --- | --- | --- |
| Chinese | ~0.61 | ~0.37 | ↓ 0.24 |
| Filipino | ~0.62 | ~0.34 | ↓ 0.28 |
| African-American | ~0.53 | ~0.29 | ↓ 0.24 |
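The relative figure can be checked against the example values: relative improvement is the absolute drop divided by the starting PER. A minimal sketch of that arithmetic, using the approximate values reported above (the function and variable names are illustrative and not taken from the original evaluation code):

```python
# Approximate vowel PER values from the example improvements: (before, after).
vowel_per_by_accent = {
    "Chinese": (0.61, 0.37),
    "Filipino": (0.62, 0.34),
    "African-American": (0.53, 0.29),
}

def improvement(before: float, after: float) -> tuple[float, float]:
    """Return (absolute reduction, relative reduction) in PER."""
    absolute = before - after
    return absolute, absolute / before

for accent, (before, after) in vowel_per_by_accent.items():
    absolute, relative = improvement(before, after)
    print(f"{accent}: absolute {absolute:.2f}, relative {relative:.0%}")
```

With these rounded inputs the relative reductions come out at roughly 39%, 45%, and 45%, which is consistent with the "up to ~45%" figure.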
Key Findings
- Vowel errors were the dominant source of phoneme error
- Fine-tuning significantly improved:
  - vowel length distinction
  - central vowel stability
  - back vowel consistency
- Accent-related performance gaps were reduced but not eliminated
Limitations
- Model still exhibits:
  - vowel space compression (e.g. /æ/, /ɛ/, /ʌ/)
  - schwa instability
  - accent-dependent variation
- Trained on isolated words, not continuous speech
- Synthetic data may not fully capture real-world acoustic variability
Evaluation
Evaluated on a held-out dataset of:
- 1,350 synthetic speech samples from 17 accented English speakers
- Multiple English accents
Metrics:
- Phoneme Error Rate (PER)
- Vowel-specific PER (primary metric)
- Token overlap
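The card does not spell out how PER or vowel-specific PER was computed. A common definition of PER is the Levenshtein distance between the predicted and reference phoneme sequences divided by the reference length. The sketch below follows that definition and, as an assumption, computes vowel PER by restricting both sequences to vowel symbols first. The `VOWELS` set is a small illustrative subset, not the inventory used in the evaluation.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference phoneme
                curr[j - 1] + 1,     # insertion of a predicted phoneme
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]

def per(ref: list[str], hyp: list[str]) -> float:
    """Phoneme error rate: edit distance normalised by reference length."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)

# Illustrative subset of vowel symbols (assumption, not the card's inventory).
VOWELS = {"ɪ", "iː", "ə", "æ", "ʌ", "ɒ", "ɔ", "ɑː", "ɛ"}

def vowel_per(ref: list[str], hyp: list[str]) -> float:
    """PER computed over vowel symbols only (one possible definition)."""
    ref_vowels = [p for p in ref if p in VOWELS]
    hyp_vowels = [p for p in hyp if p in VOWELS]
    return per(ref_vowels, hyp_vowels)

ref = ["s", "ɪ", "t"]   # "sit"
hyp = ["s", "iː", "t"]  # recognised as "seat"
print(per(ref, hyp), vowel_per(ref, hyp))
```

In this toy case a single vowel substitution gives an overall PER of one third but a vowel PER of 1.0, which illustrates why a vowel-specific metric is more sensitive to the errors this model targets.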
Model Details
Model Description
- Developed by: Gerard McBreen, Komodo Learning
- Funded by [optional]: [More Information Needed]
- Shared by [optional]: [More Information Needed]
- Model type: wav2vec 2.0 with a CTC head for phoneme recognition
- Language(s) (NLP): English
- License: Apache License 2.0
- Finetuned from model [optional]: facebook/wav2vec2-lv-60-espeak-cv-ft
Model Sources [optional]
- Repository: [More Information Needed]
- Paper [optional]: [More Information Needed]
- Demo [optional]: [More Information Needed]
Uses
Direct Use
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import torch
import librosa
# Load the processor (feature extractor + phoneme tokenizer) and the fine-tuned model
processor = Wav2Vec2Processor.from_pretrained("herrado99/wav2vec2-phoneme-vowel-ft")
model = Wav2Vec2ForCTC.from_pretrained("herrado99/wav2vec2-phoneme-vowel-ft")
# Load the audio file, resampled to 16 kHz
speech, sr = librosa.load("audio.wav", sr=16000)
inputs = processor(speech, sampling_rate=16000, return_tensors="pt", padding=True)
# Run inference without tracking gradients
with torch.no_grad():
    logits = model(inputs.input_values).logits
# Pick the most likely token per frame and decode to a phoneme string
pred_ids = torch.argmax(logits, dim=-1)
predicted_phonemes = processor.batch_decode(pred_ids)[0]
print(predicted_phonemes)
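The argmax-then-decode step in the example above is greedy CTC decoding: take the most likely token per frame, collapse consecutive repeats, then drop the blank token. The sketch below shows that logic in plain Python. The toy vocabulary and `blank_id=0` are assumptions for illustration; the real mapping lives in the model's tokenizer, where the padding token typically plays the role of the CTC blank.

```python
from itertools import groupby

def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0):
    """Collapse repeated frame predictions, then remove CTC blanks."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    return [id_to_token[i] for i in collapsed if i != blank_id]

# Toy vocabulary and per-frame argmax ids (hypothetical values).
toy_vocab = {0: "<pad>", 1: "s", 2: "ɪ", 3: "t"}
frames = [0, 1, 1, 0, 2, 2, 2, 0, 0, 3, 3]
decoded = ctc_greedy_decode(frames, toy_vocab)
print(decoded)  # ['s', 'ɪ', 't']
```

A blank between two identical tokens keeps them distinct, so frames such as `[1, 0, 1]` decode to two separate `s` tokens rather than one.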
Downstream Use [optional]
[More Information Needed]
Out-of-Scope Use
[More Information Needed]
Bias, Risks, and Limitations
[More Information Needed]
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
How to Get Started with the Model
The example under Direct Use loads the model and decodes phonemes from an audio file.
Training Details
Training Data
[More Information Needed]
Training Procedure
Preprocessing [optional]
[More Information Needed]
Training Hyperparameters
- Training regime: bf16
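The setup reported earlier in this card (frozen feature encoder, learning rate 1e-5, batch size 32, about 8 epochs, bf16) could be expressed with Transformers roughly as follows. This is a sketch under stated assumptions, not the original training script: the helper names are hypothetical, "~8" epochs is rounded to 8, and the batch size is treated as a per-device value on a single GPU. The dataset, data collator, and `Trainer` wiring are omitted because the card does not describe them.

```python
# Hyperparameters as reported in this card; "~8" epochs is rounded to 8 here.
HYPERPARAMS = {
    "learning_rate": 1e-5,
    "per_device_train_batch_size": 32,  # assumes a single A100
    "num_train_epochs": 8,
    "bf16": True,
}

def prepare_model(model):
    """Freeze the convolutional feature encoder of a Wav2Vec2ForCTC model.

    The Transformer layers and the CTC head remain trainable.
    """
    model.freeze_feature_encoder()  # method provided by Wav2Vec2ForCTC
    return model

def build_training_args(output_dir: str):
    """Build TrainingArguments from the hyperparameters above."""
    # Imported lazily so HYPERPARAMS can be inspected without Transformers installed.
    from transformers import TrainingArguments
    return TrainingArguments(output_dir=output_dir, **HYPERPARAMS)
```

A typical call sequence would load the base checkpoint with `Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-lv-60-espeak-cv-ft")`, pass it through `prepare_model`, and hand the result of `build_training_args` to a `Trainer` along with the training data.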
Speeds, Sizes, Times [optional]
[More Information Needed]
Evaluation
Testing Data, Factors & Metrics
Testing Data
[More Information Needed]
Factors
[More Information Needed]
Metrics
[More Information Needed]
Results
Summary
This model demonstrates that targeted fine-tuning can significantly improve vowel recognition and reduce accent-related phoneme errors, while highlighting the remaining challenges in robust phoneme modelling across diverse speech patterns.
Model Examination [optional]
[More Information Needed]
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: A100 GPU
- Hours used: [More Information Needed]
- Cloud Provider: [More Information Needed]
- Compute Region: [More Information Needed]
- Carbon Emitted: [More Information Needed]
Technical Specifications [optional]
Model Architecture and Objective
[More Information Needed]
Compute Infrastructure
[More Information Needed]
Hardware
[More Information Needed]
Software
[More Information Needed]
Citation [optional]
BibTeX:
[More Information Needed]
APA:
[More Information Needed]
Glossary [optional]
[More Information Needed]
More Information [optional]
[More Information Needed]
Model Card Authors [optional]
[More Information Needed]
Model Card Contact
[More Information Needed]