---
language:
- ro
license: apache-2.0
tags:
- whisper-event
pinned: true
datasets:
- mozilla-foundation/common_voice_11_0
- gigant/romanian_speech_synthesis_0_8_1
model-index:
- name: Whisper Medium Romanian
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: mozilla-foundation/common_voice_11_0 ro
      type: mozilla-foundation/common_voice_11_0
      config: ro
      split: test
      args: ro
    metrics:
    - name: Wer
      type: wer
      value: 4.73
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: google/fleurs ro
      type: google/fleurs
      config: ro
      split: test
      args: ro
    metrics:
    - name: Wer
      type: wer
      value: 19.64
metrics:
- wer
---
# Whisper Medium Romanian
This model is a fine-tuned version of [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) on the Romanian subset of the Common Voice 11.0 dataset and the Romanian speech synthesis corpus.
It achieves the following results on the evaluation set (the Common Voice 11.0 `test` split):
- eval_loss: 0.06453
- eval_wer: 4.717
- epoch: 7.03
- step: 3500
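For readers unfamiliar with the metric, the sketch below shows how a word error rate can be computed: the word-level edit distance between reference and hypothesis, divided by the number of reference words. It assumes the reported `eval_wer` is a percentage and that words are split on whitespace; the function name `word_error_rate` is hypothetical, and the text normalization used for the figures above is not stated in this card, so this is an illustration rather than the evaluation script that was used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the word error rate in percent (hypothetical helper).

    WER = (substitutions + deletions + insertions) / number of reference words,
    computed here with a word-level Levenshtein distance.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return 100.0 * prev[-1] / len(ref)


# One substituted word out of four reference words
print(word_error_rate("unu doi trei patru", "unu doi trei patra"))  # 25.0
```

A corpus-level score is usually obtained by summing the edit counts over all utterances before dividing, not by averaging per-utterance rates.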
## Model description
The architecture is the same as [openai/whisper-medium](https://huggingface.co/openai/whisper-medium).
## Training and evaluation data
The model was trained on the Common Voice 11.0 dataset (`train+validation+other` splits) and the Romanian speech synthesis corpus, and was tested on the `test` split of the Common Voice 11.0 dataset.
## Usage
Inference with 🤗 Transformers:
```python
import torch
from datasets import Audio, load_dataset
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load the fine-tuned model and its processor
processor = WhisperProcessor.from_pretrained("gigant/whisper-medium-romanian")
model = WhisperForConditionalGeneration.from_pretrained("gigant/whisper-medium-romanian")

# Stream the Romanian test split and resample the audio to 16 kHz
ds = load_dataset("common_voice", "ro", split="test", streaming=True)
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
input_speech = next(iter(ds))["audio"]["array"]

# Force Romanian transcription instead of language detection or translation
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="ro", task="transcribe")

# Extract the input features and generate token ids
input_features = processor(input_speech, return_tensors="pt", sampling_rate=16_000).input_features
with torch.no_grad():
    predicted_ids = model.generate(input_features, max_length=448)

# Decode to text; omit skip_special_tokens to keep the special tokens in the output
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
```
The code was adapted from [openai/whisper-medium](https://huggingface.co/openai/whisper-medium).
## Training procedure
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- training_steps: 5000
- mixed_precision_training: Native AMP
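To make the schedule implied by these values concrete, the sketch below computes the learning rate at a given step. It assumes the `linear` scheduler ramps up linearly from zero over the warmup steps and then decays linearly to zero at the final training step, which is how the linear schedule with warmup in Transformers is commonly described; the helper name `linear_lr` is hypothetical and this is not the training script itself.

```python
def linear_lr(step: int,
              base_lr: float = 1e-5,
              warmup_steps: int = 500,
              total_steps: int = 5000) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay.

    Defaults mirror the hyperparameters listed above; the schedule shape
    itself is an assumption about how `lr_scheduler_type: linear` behaves.
    """
    if step < warmup_steps:
        # Warmup: grow linearly from 0 to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay: shrink linearly from base_lr to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 250, 500, 3500, 5000):
    print(step, linear_lr(step))
```

Under these assumptions the peak rate of 1e-05 is reached at step 500, and by step 3500 (the checkpoint reported above) the rate has decayed to one third of the peak.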
### Framework versions
- Transformers 4.26.0.dev0
- Pytorch 1.13.0+cu117
- Datasets 2.7.1.dev0
- Tokenizers 0.13.2