metadata

language: ca
datasets:
  - projecte-aina/3catparla_asr
tags:
  - audio
  - automatic-speech-recognition
  - catalan
  - whisper-large-v3
  - projecte-aina
  - barcelona-supercomputing-center
  - bsc
license: apache-2.0
model-index:
  - name: whisper-large-v3-ca-3catparla
    results:
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: 3CatParla (Test)
          type: projecte-aina/3catparla_asr
          split: test
          args:
            language: ca
        metrics:
          - name: WER
            type: wer
            value: 0.96
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: 3CatParla (Dev)
          type: projecte-aina/3catparla_asr
          split: dev
          args:
            language: ca
        metrics:
          - name: WER
            type: wer
            value: 0.92
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Mozilla Common Voice 17.0 (Test)
          type: mozilla-foundation/common_voice_17_0
          split: test
          args:
            language: ca
        metrics:
          - name: WER
            type: wer
            value: 10.32
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Mozilla Common Voice 17.0 (Dev)
          type: mozilla-foundation/common_voice_17_0
          split: validation
          args:
            language: ca
        metrics:
          - name: WER
            type: wer
            value: 9.26
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: CV Benchmark Catalan Accents (Balearic fem)
          type: projecte-aina/commonvoice_benchmark_catalan_accents
          split: Balearic female
          args:
            language: ca
        metrics:
          - name: WER
            type: wer
            value: 12.25
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: CV Benchmark Catalan Accents (Balearic male)
          type: projecte-aina/commonvoice_benchmark_catalan_accents
          split: Balearic male
          args:
            language: ca
        metrics:
          - name: WER
            type: wer
            value: 12.18
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: CV Benchmark Catalan Accents (Central fem)
          type: projecte-aina/commonvoice_benchmark_catalan_accents
          split: Central female
          args:
            language: ca
        metrics:
          - name: WER
            type: wer
            value: 8.51
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: CV Benchmark Catalan Accents (Central male)
          type: projecte-aina/commonvoice_benchmark_catalan_accents
          split: Central male
          args:
            language: ca
        metrics:
          - name: WER
            type: wer
            value: 8.73
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: CV Benchmark Catalan Accents (Northern fem)
          type: projecte-aina/commonvoice_benchmark_catalan_accents
          split: Northern female
          args:
            language: ca
        metrics:
          - name: WER
            type: wer
            value: 8.09
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: CV Benchmark Catalan Accents (Northern male)
          type: projecte-aina/commonvoice_benchmark_catalan_accents
          split: Northern male
          args:
            language: ca
        metrics:
          - name: WER
            type: wer
            value: 8.28
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: CV Benchmark Catalan Accents (Northwestern fem)
          type: projecte-aina/commonvoice_benchmark_catalan_accents
          split: Northwestern female
          args:
            language: ca
        metrics:
          - name: WER
            type: wer
            value: 7.88
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: CV Benchmark Catalan Accents (Northwestern male)
          type: projecte-aina/commonvoice_benchmark_catalan_accents
          split: Northwestern male
          args:
            language: ca
        metrics:
          - name: WER
            type: wer
            value: 8.44
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: CV Benchmark Catalan Accents (Valencian fem)
          type: projecte-aina/commonvoice_benchmark_catalan_accents
          split: Valencian female
          args:
            language: ca
        metrics:
          - name: WER
            type: wer
            value: 9.58
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: CV Benchmark Catalan Accents (Valencian male)
          type: projecte-aina/commonvoice_benchmark_catalan_accents
          split: Valencian male
          args:
            language: ca
        metrics:
          - name: WER
            type: wer
            value: 9.1

whisper-large-v3-ca-3catparla

Paper: 3CatParla: A New Open-Source Corpus of Broadcast TV in Catalan for Automatic Speech Recognition

"whisper-large-v3-ca-3catparla" is an acoustic model suitable for Automatic Speech Recognition in Catalan. It is the result of fine-tuning the model "openai/whisper-large-v3" on 710 hours of Catalan data released by Projecte AINA (Barcelona, Spain).

The specific dataset used to create the model is called "3CatParla".

The fine-tuning process was performed in July 2024 on the servers of the Barcelona Supercomputing Center by Carlos Daniel Hernández Mena.

Evaluation

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load the processor and model.
MODEL_NAME = "projecte-aina/whisper-large-v3-ca-3catparla"
processor = WhisperProcessor.from_pretrained(MODEL_NAME)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_NAME).to("cuda")

# Load the test split of the 3CatParla dataset.
from datasets import load_dataset, Audio
ds = load_dataset("projecte-aina/3catparla_asr", split="test")

# Resample the audio to 16 kHz, the rate Whisper expects.
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

# Transcribe one example and normalize both reference and prediction.
def map_to_pred(batch):
    audio = batch["audio"]
    input_features = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features
    batch["reference"] = processor.tokenizer._normalize(batch["normalized_text"])

    with torch.no_grad():
        predicted_ids = model.generate(input_features.to("cuda"))[0]

    transcription = processor.decode(predicted_ids)
    batch["prediction"] = processor.tokenizer._normalize(transcription)

    return batch

# Run the model over the whole split.
result = ds.map(map_to_pred)

# Compute the overall WER as a percentage.
from evaluate import load

wer = load("wer")
WER = 100 * wer.compute(references=result["reference"], predictions=result["prediction"])
print(WER)

Test Result (WER, %): 0.96

BibTeX entry and citation info

When publishing results based on this model, please refer to:

@misc{mena2024whisperlarge3catparla,
      title={Acoustic Model in Catalan: whisper-large-v3-ca-3catparla.}, 
      author={Hernandez Mena, Carlos Daniel},
      organization={Barcelona Supercomputing Center},
      url={https://huggingface.co/projecte-aina/whisper-large-v3-ca-3catparla},
      year={2024}
}

Acknowledgements

This model has been promoted and financed by the Government of Catalonia through the Aina project.