---
language:
- ru
license: apache-2.0
base_model: openai/whisper-small
tags:
- generated_from_trainer
datasets:
- bond005/sberdevices_golos_10h_crowd
model-index:
- name: ru_whisper_small - Val123val
  results: []
---

# ru_whisper_small - Val123val

This model is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small) on the Sberdevices_golos_10h_crowd dataset.

## Model description

Whisper is a Transformer-based encoder-decoder model, also referred to as a sequence-to-sequence model. It was trained on 680k hours of labelled speech data annotated using large-scale weak supervision, of which only about 5k hours are Russian speech. ru_whisper_small is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small) on the Sberdevices_golos_10h_crowd dataset. ru_whisper_small is also potentially useful as an ASR solution for developers, especially for Russian speech recognition, and it may exhibit additional capabilities if fine-tuned further on specific business tasks.

## Intended uses & limitations

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import load_dataset

# load model and processor
processor = WhisperProcessor.from_pretrained("Val123val/ru_whisper_small")
model = WhisperForConditionalGeneration.from_pretrained("Val123val/ru_whisper_small")
model.config.forced_decoder_ids = None

# load dataset and read audio files
ds = load_dataset("bond005/sberdevices_golos_10h_crowd", split="validation", token=True)
sample = ds[0]["audio"]
input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features

# generate token ids
predicted_ids = model.generate(input_features)

# decode token ids to text, keeping special tokens
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)
# decode token ids to text, skipping special tokens
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
```

## Long-Form Transcription

The Whisper model is intrinsically designed to work on audio samples of up to 30s in duration. However, by using a chunking algorithm, it can be used to transcribe audio samples of arbitrary length. This is possible through the Transformers `pipeline` method. Chunking is enabled by setting `chunk_length_s=30` when instantiating the pipeline. With chunking enabled, the pipeline can be run with batched inference. It can also be extended to predict sequence-level timestamps by passing `return_timestamps=True`:

```python
import torch
from transformers import pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    "automatic-speech-recognition",
    model="Val123val/ru_whisper_small",
    chunk_length_s=30,
    device=device,
)

ds = load_dataset("bond005/sberdevices_golos_10h_crowd", split="validation", token=True)
sample = ds[0]["audio"]

prediction = pipe(sample.copy(), batch_size=8)["text"]

# we can also return timestamps for the predictions
prediction = pipe(sample.copy(), batch_size=8, return_timestamps=True)["chunks"]
```

## Faster inference with Speculative Decoding

Speculative Decoding was proposed in *Fast Inference from Transformers via Speculative Decoding* by Yaniv Leviathan et al. from Google. It works on the premise that a smaller, faster assistant model very often generates the same tokens as the larger main model.
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from transformers import pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# load dataset
dataset = load_dataset("bond005/sberdevices_golos_10h_crowd", split="validation", token=True)

# load main model
model_id = "Val123val/ru_whisper_small"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    attn_implementation="sdpa",
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

# load assistant model
assistant_model_id = "openai/whisper-tiny"

assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    assistant_model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    attn_implementation="sdpa",
)
assistant_model.to(device)

# build the pipeline, passing the assistant model via generate_kwargs
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=15,
    batch_size=4,
    generate_kwargs={"assistant_model": assistant_model},
    torch_dtype=torch_dtype,
    device=device,
)

sample = dataset[0]["audio"]
result = pipe(sample)
print(result["text"])
```

### Training hyperparameters

The following hyperparameters were used during training (a sketch of a matching `Seq2SeqTrainingArguments` configuration is given below):
- learning_rate: 0.0001
- train_batch_size: 32
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- training_steps: 5000

### Framework versions

- Transformers 4.36.2
- Pytorch 2.1.0+cu121
- Datasets 2.16.0
- Tokenizers 0.15.0
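The hyperparameters listed above map roughly onto the `Seq2SeqTrainingArguments` configuration sketched here. This is a minimal illustration, assuming a standard `Seq2SeqTrainer` fine-tuning setup; the `output_dir`, `fp16`, and evaluation settings are assumptions, not values reported for this model.

```python
# Minimal sketch of the training configuration implied by the hyperparameters
# above. The actual training script (data collator, metrics, Trainer setup)
# is not part of this model card.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./ru_whisper_small",   # hypothetical output directory
    learning_rate=1e-4,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    seed=42,
    lr_scheduler_type="linear",
    warmup_steps=500,
    max_steps=5000,
    fp16=True,                         # assumption: mixed precision on GPU
    evaluation_strategy="steps",       # assumption: periodic evaluation
    predict_with_generate=True,
)
```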