---
language:
- ro
license: apache-2.0
tags:
- whisper-event
pinned: true
datasets:
- mozilla-foundation/common_voice_11_0
- gigant/romanian_speech_synthesis_0_8_1
model-index:
- name: Whisper Medium Romanian
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: mozilla-foundation/common_voice_11_0 ro
      type: mozilla-foundation/common_voice_11_0
      config: ro
      split: test
      args: ro
    metrics:
    - name: Wer
      type: wer
      value: 4.73
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: google/fleurs ro
      type: google/fleurs
      config: ro
      split: test
      args: ro
    metrics:
    - name: Wer
      type: wer
      value: 19.64
metrics:
- wer
---

# Whisper Medium Romanian

This model is a fine-tuned version of [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) on the Common Voice 11.0 dataset and the Romanian speech synthesis corpus.
It achieves the following results on the evaluation set:
- eval_loss: 0.06453
- eval_wer: 4.717
- epoch: 7.03
- step: 3500
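
For readers unfamiliar with the metric, here is a minimal sketch of how a word error rate can be computed. The card does not state how `eval_wer` was calculated or which text normalization was applied, so the function below (the hypothetical `word_error_rate`) only illustrates the standard definition: word-level edit distance divided by the number of reference words, expressed as a percentage.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the word error rate between two strings, as a percentage.

    Illustrative only: splits on whitespace and applies no text
    normalization, which may differ from the evaluation behind this card.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)


# One substituted word out of four reference words
print(word_error_rate("ana are mere roșii", "ana are pere roșii"))
```

Under this definition, a score of about 4.7 would correspond to roughly one word error per 21 reference words.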

## Model description

The architecture is the same as [openai/whisper-medium](https://huggingface.co/openai/whisper-medium).

## Training and evaluation data

The model was trained on the Common Voice 11.0 dataset (`train+validation+other` splits) and the Romanian speech synthesis corpus, and was tested on the `test` split of the Common Voice 11.0 dataset.

## Usage
Inference with 🤗 Transformers:
```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import Audio, load_dataset

# load the fine-tuned model and its processor
processor = WhisperProcessor.from_pretrained("gigant/whisper-medium-romanian")
model = WhisperForConditionalGeneration.from_pretrained("gigant/whisper-medium-romanian")

# stream the Romanian test split and resample the audio to 16 kHz, the rate Whisper expects
ds = load_dataset("common_voice", "ro", split="test", streaming=True)
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
input_speech = next(iter(ds))["audio"]["array"]

# force Romanian transcription rather than language detection or translation
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="ro", task="transcribe")

# convert the raw waveform into the input features the model consumes
input_features = processor(input_speech, return_tensors="pt", sampling_rate=16_000).input_features

# generate token ids, then decode them to text without the special tokens
predicted_ids = model.generate(input_features, max_length=448)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
```

The code was adapted from [openai/whisper-medium](https://huggingface.co/openai/whisper-medium).

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- training_steps: 5000
- mixed_precision_training: Native AMP
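
To make the learning-rate settings above concrete, the sketch below computes the learning rate at a given step. It assumes that `linear` means the usual shape produced by `get_linear_schedule_with_warmup` in 🤗 Transformers: a linear ramp from 0 to the peak rate over the warmup steps, followed by a linear decay to 0 at the final training step. The function name `linear_schedule_lr` is hypothetical, and the defaults are taken from the list above.

```python
def linear_schedule_lr(
    step: int,
    base_lr: float = 1e-5,
    warmup_steps: int = 500,
    total_steps: int = 5000,
) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay.

    Assumption: mirrors the common warmup-then-decay shape; the exact
    scheduler used for this model is not shown in the card.
    """
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: fall linearly from base_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for s in (0, 250, 500, 3500, 5000):
    print(s, linear_schedule_lr(s))
```

Under these assumptions, the rate peaks at step 500 and has already decayed to about a third of its peak by step 3500, the step at which the evaluation results above were reported.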

### Framework versions

- Transformers 4.26.0.dev0
- PyTorch 1.13.0+cu117
- Datasets 2.7.1.dev0
- Tokenizers 0.13.2