	---
	language:
	- ru
	license: apache-2.0
	base_model: openai/whisper-small
	tags:
	- generated_from_trainer
	datasets:
	- bond005/sberdevices_golos_10h_crowd
	model-index:
	- name: ru_whisper_small - Val123val
	  results: []
	---


	# ru_whisper_small - Val123val

	This model is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small) on the Sberdevices_golos_10h_crowd dataset.

	## Model description

	Whisper is a Transformer-based encoder-decoder model, also referred to as a sequence-to-sequence model. It was trained on 680k hours of labelled speech data annotated using large-scale weak supervision. Russian accounts for only 5k of those hours.
	ru_whisper_small is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small) on the Sberdevices_golos_10h_crowd dataset. It is potentially quite useful as an ASR solution for developers, especially for Russian speech recognition. It may exhibit additional capabilities, particularly if fine-tuned further on specific business tasks.

	## Intended uses & limitations

	```python
	from transformers import WhisperProcessor, WhisperForConditionalGeneration
	from datasets import load_dataset

	# load model and processor
	processor = WhisperProcessor.from_pretrained("Val123val/ru_whisper_small")
	model = WhisperForConditionalGeneration.from_pretrained("Val123val/ru_whisper_small")
	# do not force any language or task tokens during generation
	model.config.forced_decoder_ids = None

	# load dataset and read audio files (token=True uses your saved Hugging Face login token)
	ds = load_dataset("bond005/sberdevices_golos_10h_crowd", split="validation", token=True)
	sample = ds[0]["audio"]
	input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features

	# generate token ids
	predicted_ids = model.generate(input_features)

	# decode token ids to text, keeping special tokens
	transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)

	# decode again without special tokens to get the plain transcription
	transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
	print(transcription)
	```


	## Long-Form Transcription

	The Whisper model is intrinsically designed to work on audio samples of up to 30 s in duration. However, by using a chunking algorithm, it can transcribe audio samples of arbitrary length. This is possible through the Transformers `pipeline` method. Chunking is enabled by setting `chunk_length_s=30` when instantiating the pipeline. With chunking enabled, the pipeline can be run with batched inference. It can also be extended to predict sequence-level timestamps by passing `return_timestamps=True`:

	```python
	import torch
	from transformers import pipeline
	from datasets import load_dataset

	device = "cuda:0" if torch.cuda.is_available() else "cpu"

	pipe = pipeline(
	    "automatic-speech-recognition",
	    model="Val123val/ru_whisper_small",
	    chunk_length_s=30,
	    device=device,
	)

	ds = load_dataset("bond005/sberdevices_golos_10h_crowd", split="validation", token=True)
	sample = ds[0]["audio"]

	prediction = pipe(sample.copy(), batch_size=8)["text"]

	# we can also return timestamps for the predictions
	prediction = pipe(sample.copy(), batch_size=8, return_timestamps=True)["chunks"]
	```
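
To make the chunking idea concrete, here is a minimal sketch of how a long recording could be cut into overlapping 30 s windows. It is a simplified illustration, not the pipeline's actual implementation: the function name `chunk_spans` is hypothetical, and the 5 s overlap on each side is an assumption chosen for the example.

```python
def chunk_spans(num_samples, sampling_rate=16_000, chunk_length_s=30.0, stride_s=5.0):
    """Return (start, end) sample indices of overlapping windows covering the audio.

    Illustrative only: stride_s (overlap kept on each side of a window) is an
    assumed value, not something read from the model or the pipeline.
    """
    chunk = int(chunk_length_s * sampling_rate)
    stride = int(stride_s * sampling_rate)
    # Advance by less than a full window so consecutive windows
    # overlap by 2 * stride samples.
    step = chunk - 2 * stride
    if step <= 0:
        raise ValueError("chunk_length_s must be larger than 2 * stride_s")

    spans = []
    start = 0
    while True:
        end = min(start + chunk, num_samples)
        spans.append((start, end))
        if end == num_samples:
            break
        start += step
    return spans


# A 70 s recording at 16 kHz is covered by three overlapping windows.
for start, end in chunk_spans(70 * 16_000):
    print(f"{start / 16_000:.0f}s - {end / 16_000:.0f}s")
```

Each window is transcribed independently, which is what allows batched inference; the overlap gives the merging step shared context at the window boundaries.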


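	## Faster inference with Speculative Decoding

	Speculative decoding was proposed in *Fast Inference from Transformers via Speculative Decoding* by Yaniv Leviathan et al. from Google. It works on the premise that a faster assistant model very often generates the same tokens as a larger main model. The assistant drafts candidate tokens and the main model verifies them, so the main model still decides the final output while needing fewer decoding steps.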


	```python
	import torch
	from datasets import load_dataset
	from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
	from transformers import pipeline

	device = "cuda:0" if torch.cuda.is_available() else "cpu"
	torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

	# load dataset
	dataset = load_dataset("bond005/sberdevices_golos_10h_crowd", split="validation", token=True)

	# load model
	model_id = "Val123val/ru_whisper_small"

	model = AutoModelForSpeechSeq2Seq.from_pretrained(
	    model_id,
	    torch_dtype=torch_dtype,
	    low_cpu_mem_usage=True,
	    use_safetensors=True,
	    attn_implementation="sdpa",
	)
	model.to(device)

	processor = AutoProcessor.from_pretrained(model_id)

	# load assistant model
	assistant_model_id = "openai/whisper-tiny"

	assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained(
	    assistant_model_id,
	    torch_dtype=torch_dtype,
	    low_cpu_mem_usage=True,
	    use_safetensors=True,
	    attn_implementation="sdpa",
	)

	assistant_model.to(device)

	# make pipe; the assistant model is passed through to generate()
	pipe = pipeline(
	    "automatic-speech-recognition",
	    model=model,
	    tokenizer=processor.tokenizer,
	    feature_extractor=processor.feature_extractor,
	    max_new_tokens=128,
	    chunk_length_s=15,
	    batch_size=4,
	    generate_kwargs={"assistant_model": assistant_model},
	    torch_dtype=torch_dtype,
	    device=device,
	)

	sample = dataset[0]["audio"]
	result = pipe(sample)
	print(result["text"])
	```
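
The draft-and-verify idea behind speculative decoding can be sketched with toy models. The snippet below is a minimal illustration of the greedy variant only, not the Transformers implementation: `speculative_greedy`, `main`, and `draft` are hypothetical names, and each "model" is just a function that maps a token prefix to its next token.

```python
from typing import Callable, List

# A toy "model": maps a token prefix to its greedy next token (hypothetical stand-in).
NextToken = Callable[[List[int]], int]


def speculative_greedy(main: NextToken, draft: NextToken, prompt: List[int],
                       max_new_tokens: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding sketch: the draft proposes, the main model verifies."""
    tokens = list(prompt)
    target_len = len(prompt) + max_new_tokens
    while len(tokens) < target_len:
        # 1) The cheap draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, target_len - len(tokens))):
            proposal.append(draft(tokens + proposal))

        # 2) The main model checks every proposed position. A real system does
        #    this in one batched forward pass; here it is a plain loop.
        for i, guess in enumerate(proposal):
            expected = main(tokens + proposal[:i])
            if guess != expected:
                # First mismatch: keep the agreed prefix plus the main model's token.
                tokens.extend(proposal[:i])
                tokens.append(expected)
                break
        else:
            # Every proposed token matched, so all of them are accepted.
            tokens.extend(proposal)
    return tokens


# Toy usage: the draft agrees with the main model except after tokens divisible by 3.
main = lambda prefix: (prefix[-1] + 1) % 10
draft = lambda prefix: 0 if prefix[-1] % 3 == 0 else (prefix[-1] + 1) % 10
print(speculative_greedy(main, draft, [1], max_new_tokens=8))
```

Because the main model has the final say at every position, the result is the same sequence that plain greedy decoding with the main model would produce; the speed-up in practice comes from verifying several drafted tokens per main-model pass.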


	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 0.0001
	- train_batch_size: 32
	- eval_batch_size: 16
	- seed: 42
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: linear
	- lr_scheduler_warmup_steps: 500
	- training_steps: 5000
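
As a rough illustration of what these scheduler settings imply, the sketch below computes the learning rate at a given step, assuming the usual linear schedule: a linear warmup from 0 to the peak rate over the first 500 steps, then a linear decay to 0 at step 5000. The function name `linear_lr` is hypothetical, and the shape of the schedule is an assumption based on the listed `lr_scheduler_type`, not something taken from the training logs.

```python
def linear_lr(step, base_lr=1e-4, warmup_steps=500, total_steps=5000):
    """Learning rate at `step` for linear warmup followed by linear decay (assumed shape)."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay: fall linearly from base_lr at the end of warmup to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 250, 500, 2750, 5000):
    print(step, linear_lr(step))
```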

	### Framework versions

	- Transformers 4.36.2
	- Pytorch 2.1.0+cu121
	- Datasets 2.16.0
	- Tokenizers 0.15.0