---
language:
- pl
tags:
- audio
- automatic-speech-recognition
- transformers.js
pipeline_tag: automatic-speech-recognition
license: mit
library_name: transformers
---

# Polish Distil-Whisper: distil-large-v3

Distil-Whisper was proposed in the paper [Robust Knowledge Distillation via Large-Scale Pseudo Labelling](https://arxiv.org/abs/2311.00430).

It is a distilled version of the Whisper model that is **3 times faster** and 49% smaller. This is the repository for distil-large-v3-pl, a distilled variant of [Whisper large-v3](https://huggingface.co/openai/whisper-large-v3).

## Usage

Distil-Whisper is supported in Hugging Face 🤗 Transformers from version 4.35 onwards. To run the model, first
install the latest version of the Transformers library. For this example, we'll also install 🤗 Datasets to load a toy
audio dataset from the Hugging Face Hub:

```bash
pip install --upgrade pip
pip install --upgrade transformers accelerate datasets[audio]
```
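
Before moving on, you can confirm that the installed version meets the 4.35 requirement (a minimal sanity check, not part of the original card):

```python
# Print the installed Transformers version; Distil-Whisper support
# requires 4.35.0 or later.
import transformers

print(transformers.__version__)
```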

### Short-Form Transcription

The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
class to transcribe short-form audio files (< 30 seconds) as follows:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

# Use the GPU in half precision when available, otherwise CPU in full precision
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "Aspik101/distil-whisper-large-v3-pl"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

# Load a sample from the Polish test split of Common Voice 13
dataset = load_dataset("mozilla-foundation/common_voice_13_0", "pl", split="test")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```

To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:

```diff
- result = pipe(sample)
+ result = pipe("audio.mp3")
```
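
The pipeline can also return segment-level timestamps via the standard `return_timestamps` argument of the Transformers ASR pipeline (a brief sketch; this option comes from the pipeline API rather than the original card):

```python
# Request segment-level timestamps alongside the transcription
result = pipe(sample, return_timestamps=True)
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["text"])
```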

### Long-Form Transcription

Distil-Whisper uses a chunked algorithm to transcribe long-form audio files (> 30 seconds). In practice, this chunked long-form algorithm
is 9x faster than the sequential algorithm proposed by OpenAI in the Whisper paper (see Table 7 of the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430)).

To enable chunking, pass the `chunk_length_s` parameter to the `pipeline`. For Distil-Whisper, a chunk length of 15 seconds
is optimal. To activate batching, pass the argument `batch_size`:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "Aspik101/distil-whisper-large-v3-pl"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=15,  # chunked long-form algorithm: 15-second windows
    batch_size=16,      # transcribe chunks in batches of 16
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("mozilla-foundation/common_voice_13_0", "pl", split="test")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```
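
Following on from the snippet above, you can score the transcription against the reference transcript with the 🤗 Evaluate library (a minimal sketch assuming the Common Voice sample keeps its reference text in the `sentence` column; `evaluate` and `jiwer` are extra dependencies not listed in the original card):

```python
# pip install evaluate jiwer
from evaluate import load

wer_metric = load("wer")

# Common Voice stores the reference transcript in the "sentence" column
reference = dataset[0]["sentence"]
wer = wer_metric.compute(references=[reference], predictions=[result["text"]])
print(f"WER: {wer:.2%}")
```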

<!---
**Tip:** The pipeline can also be used to transcribe an audio file from a remote URL, for example:

```python
result = pipe("https://huggingface.co/datasets/sanchit-gandhi/librispeech_long/resolve/main/audio.wav")
```
--->