OpenAI Whisper-Base Fine-Tuned Model for Speech-to-Text

This repository hosts a version of the OpenAI Whisper-Base model fine-tuned for speech-to-text on the Mozilla Common Voice 13.0 dataset. The model is intended to transcribe speech efficiently while maintaining high accuracy.

Model Details

Model Architecture: OpenAI Whisper-Base
Task: Speech-to-Text
Dataset: Mozilla Common Voice 13.0
Quantization: FP16
Fine-tuning Framework: Hugging Face Transformers

🚀 Usage

Installation

pip install transformers torch torchaudio

Loading the Model

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "AventIQ-AI/whisper-speech-text"
model = WhisperForConditionalGeneration.from_pretrained(model_name).to(device)
processor = WhisperProcessor.from_pretrained(model_name)

Speech-to-Text Inference

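import torchaudio

# Load an audio file and return its transcription
def transcribe(audio_path):
    waveform, sample_rate = torchaudio.load(audio_path)  # shape: (channels, samples)

    # Whisper expects 16 kHz mono audio: downmix to one channel, then resample if needed
    waveform = waveform.mean(dim=0)
    if sample_rate != 16000:
        waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

    inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt").input_features.to(device)

    # Generate transcription
    with torch.no_grad():
        predicted_ids = model.generate(inputs)
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
    return transcription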

# Example usage
audio_file = "sample_audio.wav"
print(transcribe(audio_file))

📊 Evaluation Results

After fine-tuning the Whisper-Base model for speech-to-text, we evaluated the model's performance on the validation set from the Common Voice 13.0 dataset. The following results were obtained:

Metric	Score	Meaning
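WER	8.2%	Word Error Rate: word-level substitutions, insertions, and deletions divided by the number of reference words (lower is better)
CER	4.5%	Character Error Rate: the same ratio computed over characters (lower is better)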
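
The evaluation script is not included in this README, so the following is only a minimal sketch of the standard WER and CER definitions, not necessarily the exact code or text normalization used to produce the scores above. The helper names (`edit_distance`, `wer`, `cer`) and the sample sentences are illustrative assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edits divided by reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Hypothetical reference/prediction pair, for illustration only
reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
print(f"WER: {wer(reference, hypothesis):.3f}")
print(f"CER: {cer(reference, hypothesis):.3f}")
```

In practice, both strings are usually lowercased and stripped of punctuation before scoring; whether that was done for the results above is not stated here.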

Fine-Tuning Details

Dataset

The Mozilla Common Voice 13.0 dataset, containing diverse multilingual speech samples, was used for fine-tuning the model.

Training

Number of epochs: 3
Batch size: 8
Evaluation strategy: epoch (evaluated at the end of each epoch)
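
As a rough sketch, the settings above could be collected into keyword arguments for the Hugging Face `Seq2SeqTrainingArguments` class. Only the three listed values come from this README. The output path, the reading of the batch size as per device, and the dataset size used in the step calculation are assumptions made for illustration.

```python
import math

# Values stated in this README, plus clearly marked assumptions
training_kwargs = {
    "output_dir": "./whisper-base-cv13",   # hypothetical path
    "num_train_epochs": 3,
    "per_device_train_batch_size": 8,      # assumes "batch size" means per device
    "eval_strategy": "epoch",              # `evaluation_strategy` in older transformers releases
}


def optimizer_steps(num_examples, batch_size, epochs):
    """Total optimizer steps on one device, without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


# Hypothetical training-set size, for illustration only
total = optimizer_steps(
    10_000,
    training_kwargs["per_device_train_batch_size"],
    training_kwargs["num_train_epochs"],
)
print(f"Total optimizer steps: {total}")
```

With transformers installed, the dictionary could be unpacked as `Seq2SeqTrainingArguments(**training_kwargs)`. Settings not listed here, such as the learning rate, would fall back to the library defaults.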

Quantization

Post-training quantization was applied using PyTorch's built-in quantization framework to reduce the model size and improve inference efficiency.
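
The conversion code itself is not shown in this README. To illustrate why reduced precision shrinks the model and can cost a little accuracy, the sketch below round-trips a few made-up weight values through IEEE 754 half precision (the FP16 format listed under Model Details) using only the Python standard library. It is an illustration of the numeric effect, not the PyTorch procedure used for this model.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


# Hypothetical weight values, for illustration only
weights = [0.1, -0.3337, 1.0, 0.000123]

for w in weights:
    w16 = to_fp16(w)
    print(f"{w:>10.6f} -> {w16:.10f} (abs error {abs(w - w16):.2e})")

# Storage per value: 4 bytes in FP32 versus 2 bytes in FP16
fp32_bytes = struct.calcsize("<f") * len(weights)
fp16_bytes = struct.calcsize("<e") * len(weights)
print(f"FP32: {fp32_bytes} bytes, FP16: {fp16_bytes} bytes")
```

Values that are exactly representable in half precision, such as 1.0, survive unchanged. The others pick up a small rounding error, which is the source of the slight accuracy degradation that reduced-precision weights can introduce.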

📂 Repository Structure

.
├── model/               # Contains the quantized model files
├── tokenizer_config/    # Tokenizer configuration and vocabulary files
├── model.safetensors    # Quantized model weights
└── README.md            # Model documentation

⚠️ Limitations

The model may struggle with highly noisy or overlapping speech.
Quantization may lead to slight degradation in accuracy compared to full-precision models.
Performance may vary across different accents and dialects.

🤝 Contributing

Contributions are welcome! Feel free to open an issue or submit a pull request if you have suggestions or improvements.
