Update README.md
README.md CHANGED
@@ -9,7 +9,7 @@ language: multilingual
 datasets:
 - owsm_v3.2_ctc
 base_model:
-- espnet/owsm_ctc_v3.
+- espnet/owsm_ctc_v3.2_ft_1B
 license: cc-by-4.0
 ---

@@ -27,4 +27,129 @@ espnet_model_zoo

**The recipe can be found in ESPnet:** https://github.com/espnet/espnet/tree/master/egs2/owsm_ctc_v3.1/s2t1

### Example script for batched inference

`Speech2TextGreedySearch` now provides a unified batched inference method `batch_decode`. It performs CTC greedy decoding for a batch of short-form or long-form audios. If an audio is shorter than 30s, it will be padded to 30s; otherwise it will be split into overlapping segments (the same approach as the "long-form ASR/ST" method below).

```python
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.2_ft_1B",
    device="cuda",
    use_flash_attn=False,  # set to True for better efficiency if flash attn is installed and dtype is float16 or bfloat16
    lang_sym='<eng>',
    task_sym='<asr>',
)

res = s2t.batch_decode(
    "audio.wav",  # a single audio (path or 1-D array/tensor) as input
    batch_size=16,
    context_len_in_secs=4,
)  # res is a single str, i.e., the predicted text without special tokens

res = s2t.batch_decode(
    ["audio1.wav", "audio2.wav", "audio3.wav"],  # a list of audios as input
    batch_size=16,
    context_len_in_secs=4,
)  # res is a list of str

# Please check the code of `batch_decode` for all supported inputs
```
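`batch_decode` also accepts audio that is already loaded in memory (the "1-D array/tensor" input mentioned in the comments above). A minimal sketch, reusing the `s2t` object created above and assuming a file `audio.wav` that may not be sampled at 16kHz, so it is resampled first:

```python
import librosa

# Load and resample to 16kHz, the rate the model expects (see the note in the next section).
speech, _ = librosa.load("audio.wav", sr=16000)  # 1-D numpy array

res = s2t.batch_decode(
    speech,  # pre-loaded 1-D array instead of a file path
    batch_size=16,
    context_len_in_secs=4,
)
print(res)
```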

### Example script for short-form ASR/ST/LID

Our models are trained on 16kHz audio with a fixed duration of 30s. When using the pre-trained model, please ensure the input speech is 16kHz and pad or truncate it to 30s.

```python
import librosa
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.2_ft_1B",
    device="cuda",
    generate_interctc_outputs=False,
    lang_sym='<eng>',
    task_sym='<asr>',
)

# NOTE: OWSM-CTC is trained on 16kHz audio with a fixed 30s duration.
# Please ensure the input has the correct sample rate; otherwise resample it to 16kHz before feeding it to the model.
speech, rate = librosa.load("xxx.wav", sr=16000)
speech = librosa.util.fix_length(speech, size=(16000 * 30))

res = s2t(speech)[0]
print(res)
```
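The same interface covers the other short-form tasks by switching the language and task tokens. The sketch below is an assumption about the token set rather than a documented example: `<deu>` is taken to be the source-language symbol and `<st_eng>` the "translate to English" task symbol; check the model's vocabulary before relying on these names. It reuses the padded `speech` array from the block above.

```python
# Hypothetical speech-translation setup: German speech -> English text.
# <deu> and <st_eng> are assumed token names; verify them against the model's token list.
s2t_st = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.2_ft_1B",
    device="cuda",
    generate_interctc_outputs=False,
    lang_sym='<deu>',     # language spoken in the input audio
    task_sym='<st_eng>',  # assumed task token for translation into English
)
print(s2t_st(speech)[0])
```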

### Example script for long-form ASR/ST

```python
import soundfile as sf
import torch
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

context_len_in_secs = 4  # left and right context when doing buffered inference
batch_size = 32  # depends on the GPU memory
s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.2_ft_1B",
    device='cuda' if torch.cuda.is_available() else 'cpu',
    generate_interctc_outputs=False,
    lang_sym='<eng>',
    task_sym='<asr>',
)

speech, rate = sf.read(
    "xxx.wav"
)

text = s2t.decode_long_batched_buffered(
    speech,
    batch_size=batch_size,
    context_len_in_secs=context_len_in_secs,
)
print(text)
```
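`sf.read` returns audio at the file's native sample rate, while the model expects 16kHz input (see the note in the short-form section). A minimal sketch of a resampling guard to place before `decode_long_batched_buffered`, assuming librosa is available and the audio is mono:

```python
import librosa
import soundfile as sf

speech, rate = sf.read("xxx.wav")  # assumes a mono recording
if rate != 16000:
    # Resample to the 16kHz rate the model was trained on.
    speech = librosa.resample(speech, orig_sr=rate, target_sr=16000)
    rate = 16000
```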

### Example of CTC forced alignment using `ctc-segmentation`

CTC segmentation can be efficiently applied to audio of an arbitrary length.

```python
import soundfile as sf
from espnet2.bin.s2t_ctc_align import CTCSegmentation
from espnet_model_zoo.downloader import ModelDownloader

# Download the model first
d = ModelDownloader()
downloaded = d.download_and_unpack("espnet/owsm_ctc_v3.2_ft_1B")

aligner = CTCSegmentation(
    **downloaded,
    fs=16000,
    ngpu=1,
    batch_size=32,  # batched parallel decoding; reduce it if your GPU memory is smaller
    kaldi_style_text=True,
    time_stamps="auto",  # "auto" can be more accurate than "fixed" when converting token indices to timestamps
    lang_sym="<eng>",
    task_sym="<asr>",
    context_len_in_secs=2,  # left and right context in buffered decoding
)

speech, rate = sf.read(
    "./test_utils/ctc_align_test.wav"
)
print(f"speech duration: {len(speech) / rate : .2f} seconds")
text = """
utt1 THE SALE OF THE HOTELS
utt2 IS PART OF HOLIDAY'S STRATEGY
utt3 TO SELL OFF ASSETS
utt4 AND CONCENTRATE ON PROPERTY MANAGEMENT
"""

segments = aligner(speech, text)
print(segments)
```
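With `kaldi_style_text=True`, `print(segments)` shows one line per utterance with its timing information (the exact columns depend on the `CTCSegmentation` configuration). A small, hypothetical follow-up that simply saves that output to a file:

```python
# Write the printed alignment to a file for later use.
with open("segments.txt", "w") as f:
    f.write(str(segments))
```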