metadata

language:
  - ru
license: apache-2.0
base_model: openai/whisper-small
tags:
  - generated_from_trainer
datasets:
  - bond005/sberdevices_golos_10h_crowd
model-index:
  - name: ru_whisper_small - Val123val
    results: []

ru_whisper_small - Val123val

This model is a fine-tuned version of openai/whisper-small on the Sberdevices_golos_10h_crowd dataset.

Model description

Whisper is a Transformer based encoder-decoder model, also referred to as a sequence-to-sequence model. It was trained on 680k hours of labelled speech data annotated using large-scale weak supervision. Russian language is only 5k hours within all. ru_whisper_small is a fine-tuned version of openai/whisper-small on the Sberdevices_golos_10h_crowd dataset. ru-whisper is also potentially quite useful as an ASR solution for developers, especially for Russian speech recognition. They may exhibit additional capabilities, particularly if fine-tuned on business certain tasks.

Intended uses & limitations

from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import load_dataset

# load model and processor
processor = WhisperProcessor.from_pretrained("Val123val/ru_whisper_small")
model = WhisperForConditionalGeneration.from_pretrained("Val123val/ru_whisper_small")
model.config.forced_decoder_ids = None

# load dataset and read audio files
ds = load_dataset("bond005/sberdevices_golos_10h_crowd", split="validation", token=True)
sample = ds[0]["audio"]
input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features 

# generate token ids
predicted_ids = model.generate(input_features)
# decode token ids to text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

Long-Form Transcription

The Whisper model is intrinsically designed to work on audio samples of up to 30s in duration. However, by using a chunking algorithm, it can be used to transcribe audio samples of up to arbitrary length. This is possible through Transformers pipeline method. Chunking is enabled by setting chunk_length_s=30 when instantiating the pipeline. With chunking enabled, the pipeline can be run with batched inference. It can also be extended to predict sequence level timestamps by passing return_timestamps=True:

import torch
from transformers import pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
  "automatic-speech-recognition",
  model="Val123val/ru_whisper_small",
  chunk_length_s=30,
  device=device,
)

ds = load_dataset("bond005/sberdevices_golos_10h_crowd", split="validation", token=True)
sample = ds[0]["audio"]

prediction = pipe(sample.copy(), batch_size=8)["text"]

# we can also return timestamps for the predictions
prediction = pipe(sample.copy(), batch_size=8, return_timestamps=True)["chunks"]

Faster using with Speculative Decoding

Speculative Decoding was proposed in Fast Inference from Transformers via Speculative Decoding by Yaniv Leviathan et. al. from Google. It works on the premise that a faster, assistant model very often generates the same tokens as a larger main model.

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# load dataset
dataset = load_dataset("bond005/sberdevices_golos_10h_crowd", split="validation", token=True)

# load model
model_id = "Val123val/ru_whisper_small"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    attn_implementation="sdpa",
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

# load assistant model
assistant_model_id = "openai/whisper-tiny"

assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    assistant_model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    attn_implementation="sdpa",
)

assistant_model.to(device);

# make pipe
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=15,
    batch_size=4,
    generate_kwargs={"assistant_model": assistant_model},
    torch_dtype=torch_dtype,
    device=device,
)

sample = dataset[0]["audio"]
result = pipe(sample)
print(result["text"])

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0001
train_batch_size: 32
eval_batch_size: 16
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 500
training_steps: 5000

Framework versions

Transformers 4.36.2
Pytorch 2.1.0+cu121
Datasets 2.16.0
Tokenizers 0.15.0