Val123val committed on
Commit
4c205d8
1 Parent(s): d2915fd

Update README.md

Files changed (1): README.md +103 -5
@@ -21,17 +21,115 @@ This model is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small)
 
 ## Model description
 
- More information needed
+ Whisper is a Transformer-based encoder-decoder model, also referred to as a sequence-to-sequence model. It was trained on 680k hours of labelled speech data annotated using large-scale weak supervision; only about 5k of those hours are Russian speech.
+ ru_whisper_small is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small) on the Sberdevices_golos_10h_crowd dataset. ru_whisper_small is also potentially useful as an ASR solution for developers, especially for Russian speech recognition, and it may exhibit additional capabilities if further fine-tuned on specific tasks.
 
 ## Intended uses & limitations
 
- More information needed
- 
- ## Training and evaluation data
- 
- More information needed
- 
- ## Training procedure
+ from transformers import WhisperProcessor, WhisperForConditionalGeneration
+ from datasets import load_dataset
+
+ # load model and processor
+ processor = WhisperProcessor.from_pretrained("Val123val/ru_whisper_small")
+ model = WhisperForConditionalGeneration.from_pretrained("Val123val/ru_whisper_small")
+ model.config.forced_decoder_ids = None
+
+ # load dataset and read audio files
+ ds = load_dataset("bond005/sberdevices_golos_10h_crowd", split="validation", token=True)
+ sample = ds[0]["audio"]
+ input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features
+
+ # generate token ids
+ predicted_ids = model.generate(input_features)
+
+ # decode token ids to text, keeping the special tokens
+ transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)
+ # or decode while removing the special tokens
+ transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
 
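Editorial note, not part of the committed card: continuing from the snippet above, one way to sanity-check the output is to print the decoded text and compare it with the dataset reference using word error rate. The `evaluate` library and the `transcription` column name are assumptions here; check the actual column names of bond005/sberdevices_golos_10h_crowd before relying on this sketch.

```python
import evaluate  # assumed extra dependency: pip install evaluate jiwer

# `ds` and `transcription` come from the snippet above
prediction = transcription[0]       # batch_decode returns a list, one string per sample
reference = ds[0]["transcription"]  # assumption: reference text lives in a "transcription" column

print("prediction:", prediction)
print("reference: ", reference)

# word error rate on this single example (lower is better)
wer_metric = evaluate.load("wer")
print("WER:", wer_metric.compute(predictions=[prediction], references=[reference]))
```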
+ ## Long-Form Transcription
+
+ The Whisper model is intrinsically designed to work on audio samples of up to 30s in duration. However, by using a chunking algorithm, it can be used to transcribe audio samples of arbitrary length. This is possible through the Transformers pipeline method. Chunking is enabled by setting chunk_length_s=30 when instantiating the pipeline. With chunking enabled, the pipeline can be run with batched inference. It can also be extended to predict sequence-level timestamps by passing return_timestamps=True:
+
+ import torch
+ from transformers import pipeline
+ from datasets import load_dataset
+
+ device = "cuda:0" if torch.cuda.is_available() else "cpu"
+
+ pipe = pipeline(
+     "automatic-speech-recognition",
+     model="Val123val/ru_whisper_small",
+     chunk_length_s=30,
+     device=device,
+ )
+
+ ds = load_dataset("bond005/sberdevices_golos_10h_crowd", split="validation", token=True)
+ sample = ds[0]["audio"]
+
+ # transcribe the whole sample with batched, chunked inference
+ prediction = pipe(sample.copy(), batch_size=8)["text"]
+
+ # we can also return timestamps for the predictions
+ prediction = pipe(sample.copy(), batch_size=8, return_timestamps=True)["chunks"]
+
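A brief editorial note on the timestamped call above: with return_timestamps=True, the "chunks" entry is a list of segment dictionaries, each holding a (start, end) tuple in seconds and the segment text. The sketch below only illustrates how such output is typically consumed; the sample values in the comment are made up.

```python
# `prediction` here is the list returned by the timestamped call above, roughly:
# [{"timestamp": (0.0, 4.2), "text": "..."}, {"timestamp": (4.2, 8.7), "text": "..."}, ...]
for chunk in prediction:
    start, end = chunk["timestamp"]
    print(f"[{start:.1f}s - {end:.1f}s] {chunk['text']}")
```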
+ ## Faster inference with Speculative Decoding
+
+ Speculative Decoding was proposed in "Fast Inference from Transformers via Speculative Decoding" by Yaniv Leviathan et al. from Google. It works on the premise that a smaller, faster assistant model very often generates the same tokens as the larger main model, so the main model mostly verifies candidate tokens rather than generating every token itself.
+
+ import torch
+ from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
+ from datasets import load_dataset
+
+ device = "cuda:0" if torch.cuda.is_available() else "cpu"
+ torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
+
+ dataset = load_dataset("bond005/sberdevices_golos_10h_crowd", split="validation", token=True)
+
+ # main model
+ model_id = "Val123val/ru_whisper_small"
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(
+     model_id,
+     torch_dtype=torch_dtype,
+     low_cpu_mem_usage=True,
+     use_safetensors=True,
+     attn_implementation="sdpa",
+ )
+ model.to(device)
+
+ processor = AutoProcessor.from_pretrained(model_id)
+
+ # smaller assistant (draft) model
+ assistant_model_id = "openai/whisper-tiny"
+ assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained(
+     assistant_model_id,
+     torch_dtype=torch_dtype,
+     low_cpu_mem_usage=True,
+     use_safetensors=True,
+     attn_implementation="sdpa",
+ )
+ assistant_model.to(device)
+
+ # pass the assistant model to the pipeline via generate_kwargs
+ pipe = pipeline(
+     "automatic-speech-recognition",
+     model=model,
+     tokenizer=processor.tokenizer,
+     feature_extractor=processor.feature_extractor,
+     max_new_tokens=128,
+     chunk_length_s=15,
+     batch_size=4,
+     generate_kwargs={"assistant_model": assistant_model},
+     torch_dtype=torch_dtype,
+     device=device,
+ )
+
+ sample = dataset[0]["audio"]
+ result = pipe(sample)
+ print(result["text"])
+
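Two editorial notes on the example above: assisted generation requires the assistant to share the main model's tokenizer and vocabulary, which holds for the multilingual openai/whisper-tiny and whisper-small checkpoints, and the actual speed-up is hardware-dependent. The sketch below, which continues from the variables defined above and is not part of the original card, is one way to measure the difference; the helper name transcribe_and_time is made up for illustration.

```python
import time

def transcribe_and_time(asr_pipe, audio):
    # time a single transcription with the given pipeline
    start = time.perf_counter()
    text = asr_pipe(audio.copy())["text"]
    return text, time.perf_counter() - start

# the same pipeline without the assistant model, as a baseline
pipe_baseline = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=15,
    batch_size=4,
    torch_dtype=torch_dtype,
    device=device,
)

_, t_assisted = transcribe_and_time(pipe, sample)
_, t_baseline = transcribe_and_time(pipe_baseline, sample)
print(f"assisted: {t_assisted:.2f}s vs. baseline: {t_baseline:.2f}s")
```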
  ### Training hyperparameters