
OpenAI Whisper-Base Fine-Tuned Model for Speech-to-Text

This repository hosts a version of the OpenAI Whisper-Base model fine-tuned for speech-to-text on the Mozilla Common Voice 13.0 dataset. The weights are stored in FP16 to keep the model small, and the model reaches a word error rate of 8.2% on the Common Voice 13.0 validation set.

Model Details

  • Model Architecture: OpenAI Whisper-Base
  • Task: Speech-to-Text
  • Dataset: Mozilla Common Voice 13.0
  • Quantization: FP16
  • Fine-tuning Framework: Hugging Face Transformers

🚀 Usage

Installation

pip install transformers torch torchaudio

Loading the Model

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "AventIQ-AI/whisper-speech-text"
model = WhisperForConditionalGeneration.from_pretrained(model_name).to(device)
processor = WhisperProcessor.from_pretrained(model_name)

Speech-to-Text Inference

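import torchaudio

TARGET_SAMPLE_RATE = 16000  # Whisper's feature extractor expects 16 kHz audio

# Load an audio file, convert it to 16 kHz mono, and transcribe it
def transcribe(audio_path):
    waveform, sample_rate = torchaudio.load(audio_path)  # shape: (channels, samples)
    waveform = waveform.mean(dim=0)  # downmix to mono
    if sample_rate != TARGET_SAMPLE_RATE:
        waveform = torchaudio.functional.resample(waveform, sample_rate, TARGET_SAMPLE_RATE)
    inputs = processor(waveform.numpy(), sampling_rate=TARGET_SAMPLE_RATE, return_tensors="pt")
    # Match the model's device and dtype (the checkpoint is stored in FP16)
    input_features = inputs.input_features.to(device=device, dtype=model.dtype)

    # Generate transcription
    with torch.no_grad():
        predicted_ids = model.generate(input_features)
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
    return transcription

# Example usage
audio_file = "sample_audio.wav"
print(transcribe(audio_file))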
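
Whisper's feature extractor pads or truncates every input to a 30-second window, so a single call like the one above only covers the first 30 seconds of a longer recording. One possible workaround, sketched here, is to split the waveform into fixed-length chunks and transcribe each in turn. The helper name `split_into_chunks` and the simple non-overlapping split are assumptions for illustration and are not part of this repository; cutting at fixed offsets can split a word across two chunks.

```python
def split_into_chunks(samples, sample_rate=16000, chunk_seconds=30.0):
    """Split a 1-D sequence of audio samples into consecutive chunks.

    Hypothetical helper: works on anything that supports len() and slicing
    (list, NumPy array, 1-D torch tensor). The last chunk may be shorter.
    """
    if sample_rate <= 0 or chunk_seconds <= 0:
        raise ValueError("sample_rate and chunk_seconds must be positive")
    chunk_len = max(1, int(sample_rate * chunk_seconds))
    return [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]


# 70 seconds of silence at 16 kHz is split into three chunks
chunks = split_into_chunks([0.0] * (70 * 16000))
chunk_durations = [len(c) / 16000 for c in chunks]
```

Each chunk can then go through the processor and `model.generate` in the same way as a full waveform, with the decoded strings joined by spaces.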

📊 Evaluation Results

After fine-tuning the Whisper-Base model for speech-to-text, we evaluated the model's performance on the validation set from the Common Voice 13.0 dataset. The following results were obtained:

Metric   Score   Meaning
WER      8.2%    Word Error Rate: Measures transcription accuracy
CER      4.5%    Character Error Rate: Measures character-level accuracy
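
Both metrics are edit distances normalized by the length of the reference: WER counts word-level substitutions, insertions, and deletions, and CER does the same at the character level. The following is a minimal pure-Python sketch of how such scores can be computed. The function names are illustrative, and this card does not state which tool or text normalization produced the reported numbers, so results from this sketch may not match them exactly.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


reference = "the cat sat on the mat"
hypothesis = "the cat sat on mat"  # one word dropped
example_wer = wer(reference, hypothesis)  # 1 error over 6 reference words
example_cer = cer(reference, hypothesis)  # 4 errors over 22 reference characters
```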

Fine-Tuning Details

Dataset

The Mozilla Common Voice 13.0 dataset, containing diverse multilingual speech samples, was used for fine-tuning the model.

Training

  • Number of epochs: 3
  • Batch size: 8
  • Evaluation strategy: epoch (evaluation runs at the end of each epoch)
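
As a rough sketch, the settings listed above could be expressed with Transformers' `Seq2SeqTrainingArguments`. The helper name `build_training_args` and the `output_dir` value are placeholders, and treating the batch size as a per-device value is an assumption. This card does not state the learning rate or other hyperparameters, so they are left at the library defaults here. The evaluation-schedule argument is named `eval_strategy` in recent Transformers releases and `evaluation_strategy` in older ones, so the helper checks which one is accepted.

```python
import inspect

# Values taken from the list above; everything else stays at library defaults.
TRAINING_CONFIG = {
    "num_train_epochs": 3,
    "per_device_train_batch_size": 8,  # assumption: the batch size is per device
    "eval_strategy": "epoch",
}


def build_training_args(output_dir="whisper-base-finetuned"):
    """Hypothetical helper; output_dir is a placeholder, not the authors' path."""
    from transformers import Seq2SeqTrainingArguments  # imported lazily

    kwargs = dict(TRAINING_CONFIG)
    accepted = inspect.signature(Seq2SeqTrainingArguments.__init__).parameters
    if "eval_strategy" not in accepted:
        # Older Transformers releases use the longer argument name.
        kwargs["evaluation_strategy"] = kwargs.pop("eval_strategy")
    return Seq2SeqTrainingArguments(output_dir=output_dir, **kwargs)
```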

Quantization

Post-training quantization was applied using PyTorch's built-in quantization framework to reduce the model size and improve inference efficiency.
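
To give a sense of the size reduction, here is a back-of-the-envelope estimate of the raw weight size in FP32 versus FP16. It assumes roughly 72.6M parameters, the count listed on the model's hub page, and ignores file-format overhead, so the real file sizes will differ slightly.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weight_size_mb(num_params, dtype):
    """Approximate size of the raw weights in megabytes (10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


NUM_PARAMS = 72_600_000  # assumption: parameter count as listed on the model page

fp32_mb = weight_size_mb(NUM_PARAMS, "fp32")  # about 290 MB
fp16_mb = weight_size_mb(NUM_PARAMS, "fp16")  # about 145 MB, half of FP32
```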

📂 Repository Structure

.
├── model/               # Contains the quantized model files
├── tokenizer_config/    # Tokenizer configuration and vocabulary files
├── model.safetensors/   # Quantized Model
└── README.md            # Model documentation

⚠️ Limitations

  • The model may struggle with highly noisy or overlapping speech.
  • Quantization may lead to slight degradation in accuracy compared to full-precision models.
  • Performance may vary across different accents and dialects.

🤝 Contributing

Contributions are welcome! Feel free to open an issue or submit a pull request if you have suggestions or improvements.
