
Whisper Medium Bambara

This model is a fine-tuned version of oza75/whisper-bambara-asr-001 on the Bambara voices dataset. It achieves the following results on the evaluation set:

  • Loss: 0.0646
  • WER: 5.4002 (word error rate, in percent; a computation sketch follows this list)
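
To reproduce this kind of score on your own evaluation set, here is a minimal sketch using the evaluate library (an assumption on our part; the author's exact evaluation code is not shown, and any WER implementation such as jiwer works equally well):

import evaluate

# Load the WER metric (requires the `evaluate` and `jiwer` packages).
wer_metric = evaluate.load("wer")

# Hypothetical reference/prediction pairs, for illustration only.
references = ["i ni ce", "an bɛ taa sugu la"]
predictions = ["i ni ce", "an bɛ taa sugu"]

# `evaluate` returns WER as a fraction; multiply by 100 to match the values above.
wer = 100 * wer_metric.compute(references=references, predictions=predictions)
print(f"WER: {wer:.4f}")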

Usage

To use this model, we first need to define a custom tokenizer class, because the default Whisper tokenizer does not support Bambara.

IMPORTANT: The following code also overrides the Whisper tokenizer's module-level TO_LANGUAGE_CODE constant. This is not the ideal approach, but it is effective; if you skip this modification, generation will fail.

from typing import List

from tokenizers import AddedToken
from transformers import WhisperTokenizer, WhisperProcessor
import transformers.models.whisper.tokenization_whisper as whisper_tokenization
from transformers.models.whisper.tokenization_whisper import TO_LANGUAGE_CODE, TASK_IDS

CUSTOM_TO_LANGUAGE_CODE = {**TO_LANGUAGE_CODE, "bambara": "bm"}

# IMPORTANT: update the Whisper tokenizer's module-level constant to add the Bambara language. Not ideal, but it works.
whisper_tokenization.TO_LANGUAGE_CODE.update(CUSTOM_TO_LANGUAGE_CODE)


class BambaraWhisperTokenizer(WhisperTokenizer):
    """Whisper tokenizer extended with a <|bm|> language token for Bambara."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Register <|bm|> as a special token so it is never split during tokenization.
        self.add_tokens(AddedToken(content="<|bm|>", lstrip=False, rstrip=False, normalized=False, special=True))

    @property
    def prefix_tokens(self) -> List[int]:
        # Rebuild Whisper's decoder prompt (<|startoftranscript|><|lang|><|task|>[<|notimestamps|>]),
        # resolving the language through the extended map so "bambara" yields <|bm|>.
        bos_token_id = self.convert_tokens_to_ids("<|startoftranscript|>")
        translate_token_id = self.convert_tokens_to_ids("<|translate|>")
        transcribe_token_id = self.convert_tokens_to_ids("<|transcribe|>")
        notimestamps_token_id = self.convert_tokens_to_ids("<|notimestamps|>")

        if self.language is not None:
            self.language = self.language.lower()
            if self.language in CUSTOM_TO_LANGUAGE_CODE:
                language_id = CUSTOM_TO_LANGUAGE_CODE[self.language]
            elif self.language in CUSTOM_TO_LANGUAGE_CODE.values():
                language_id = self.language
            else:
                is_language_code = len(self.language) == 2
                raise ValueError(
                    f"Unsupported language: {self.language}. Language should be one of:"
                    f" {list(CUSTOM_TO_LANGUAGE_CODE.values()) if is_language_code else list(CUSTOM_TO_LANGUAGE_CODE.keys())}."
                )

        if self.task is not None:
            if self.task not in TASK_IDS:
                raise ValueError(f"Unsupported task: {self.task}. Task should be in: {TASK_IDS}")

        bos_sequence = [bos_token_id]
        if self.language is not None:
            bos_sequence.append(self.convert_tokens_to_ids(f"<|{language_id}|>"))
        if self.task is not None:
            bos_sequence.append(transcribe_token_id if self.task == "transcribe" else translate_token_id)
        if not self.predict_timestamps:
            bos_sequence.append(notimestamps_token_id)
        return bos_sequence
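
As a quick sanity check, you can verify that the custom tokenizer builds the expected decoder prompt for Bambara. This is a small sketch added for illustration; the exact token IDs depend on the checkpoint's vocabulary:

# Instantiate the custom tokenizer and inspect its prefix sequence.
tokenizer = BambaraWhisperTokenizer.from_pretrained(
    "oza75/whisper-bambara-asr-001", language="bambara", task="transcribe"
)
print(tokenizer.convert_ids_to_tokens(tokenizer.prefix_tokens))
# Expected: ['<|startoftranscript|>', '<|bm|>', '<|transcribe|>', '<|notimestamps|>']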

Then, we can define the pipeline:

import torch
from transformers import pipeline

# Determine the appropriate device (GPU or CPU)
device = "cuda" if torch.cuda.is_available() else "cpu"

# Define the model checkpoint and language
model_checkpoint = "oza75/whisper-bambara-asr-001"
language = "bambara"

# Load the custom tokenizer designed for Bambara and build the ASR pipeline
tokenizer = BambaraWhisperTokenizer.from_pretrained(model_checkpoint, language=language)
pipe = pipeline("automatic-speech-recognition", model=model_checkpoint, tokenizer=tokenizer, device=device)

def transcribe(audio):
    """
    Transcribes the provided audio file into text using the configured ASR pipeline.

    Args:
        audio: The path to the audio file to transcribe.

    Returns:
        A string representing the transcribed text.
    """
    # Use the pipeline to perform transcription
    text = pipe(audio)["text"]
    return text


transcribe("path/to/the/audio.wav")  # replace with the path to your audio file
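
For recordings longer than Whisper's 30-second input window, the transformers ASR pipeline can chunk the audio for you. A hedged sketch (chunk_length_s and return_timestamps are standard pipeline arguments; the file path is a placeholder):

# Long-form transcription: split the audio into 30-second chunks and
# return segment-level timestamps. The path below is a placeholder.
result = pipe("path/to/long_audio.wav", chunk_length_s=30, return_timestamps=True)
print(result["text"])
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["text"])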

Intended uses & limitations

This checkpoint is intended to be used for research purposes ONLY!

Training procedure

Training hyperparameters

The following hyperparameters were used during training (a sketch of the equivalent training arguments follows the list):

  • learning_rate: 8e-06
  • train_batch_size: 64
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 500
  • num_epochs: 10
  • mixed_precision_training: Native AMP
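
For readers who want to replicate this setup, the settings above map roughly onto the following Seq2SeqTrainingArguments. This is a sketch under stated assumptions, not the author's actual training script; output_dir and any argument not listed above are placeholders:

from transformers import Seq2SeqTrainingArguments

# Sketch of training arguments matching the hyperparameters listed above.
# Adam with betas=(0.9, 0.999) and epsilon=1e-08 is the Trainer default optimizer.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-bambara-asr-001",  # placeholder, not the author's path
    learning_rate=8e-6,
    per_device_train_batch_size=64,  # assumption: 64 was the per-device size
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="linear",
    warmup_steps=500,
    num_train_epochs=10,
    fp16=True,  # "Native AMP" mixed-precision training
)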

Training results

| Training Loss | Epoch  | Step | Validation Loss | WER     |
|---------------|--------|------|-----------------|---------|
| 0.0733        | 0.4032 | 25   | 0.0621          | 6.4145  |
| 0.0625        | 0.8065 | 50   | 0.0576          | 7.0724  |
| 0.0631        | 1.2097 | 75   | 0.0554          | 7.2094  |
| 0.0371        | 1.6129 | 100  | 0.0549          | 7.3739  |
| 0.0453        | 2.0161 | 125  | 0.0533          | 10.1425 |
| 0.0244        | 2.4194 | 150  | 0.0548          | 7.5658  |
| 0.0231        | 2.8226 | 175  | 0.0582          | 7.6206  |
| 0.0159        | 3.2258 | 200  | 0.0577          | 6.2226  |
| 0.0097        | 3.6290 | 225  | 0.0581          | 7.5932  |
| 0.0071        | 4.0323 | 250  | 0.0590          | 7.3739  |
| 0.0042        | 4.4355 | 275  | 0.0609          | 6.0033  |
| 0.0066        | 4.8387 | 300  | 0.0610          | 5.1809  |
| 0.0042        | 5.2419 | 325  | 0.0600          | 7.2368  |
| 0.0036        | 5.6452 | 350  | 0.0622          | 8.6623  |
| 0.0084        | 6.0484 | 375  | 0.0738          | 6.6886  |
| 0.0087        | 6.4516 | 400  | 0.0677          | 7.2643  |
| 0.0077        | 6.8548 | 425  | 0.0748          | 7.4013  |
| 0.0082        | 7.2581 | 450  | 0.0751          | 8.0318  |
| 0.0097        | 7.6613 | 475  | 0.0719          | 8.1963  |
| 0.0114        | 8.0645 | 500  | 0.0746          | 8.3607  |
| 0.0071        | 8.4677 | 525  | 0.0691          | 6.8805  |
| 0.0075        | 8.8710 | 550  | 0.0659          | 6.0581  |
| 0.0034        | 9.2742 | 575  | 0.0647          | 5.4002  |
| 0.0032        | 9.6774 | 600  | 0.0646          | 5.4002  |

Framework versions

  • Transformers 4.40.1
  • Pytorch 2.2.0+cu121
  • Datasets 2.19.0
  • Tokenizers 0.19.1