pyf98 committed on
Commit
24122a4
1 Parent(s): 7769403

Update README.md

Files changed (1)
  1. README.md +98 -0
README.md CHANGED
@@ -15,3 +15,101 @@ license: cc-by-4.0

It is trained on 180k hours of public audio data for multilingual speech recognition, any-to-any speech translation, and language identification, following the design of the previous [encoder-decoder OWSM](https://arxiv.org/abs/2401.16658).

Due to time constraints, the model used in the paper was trained for 40 "epochs". A new model trained for 45 "epochs" has also been added to this repo to match the setup of the encoder-decoder OWSM; it can achieve better performance than the old one on many test sets.

Currently, the code for OWSM-CTC has not been merged into the ESPnet main branch. Instead, it is available as follows:
- Code in my repo: https://github.com/pyf98/espnet/tree/owsm-ctc
- Current model on HF: https://huggingface.co/pyf98/owsm_ctc_v3.1_1B

An example script to run short-form ASR/ST:
```python
import soundfile as sf
import librosa
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch


# Load the pre-trained OWSM-CTC model; the language and task tokens are fixed at decoding time
s2t = Speech2TextGreedySearch.from_pretrained(
    "pyf98/owsm_ctc_v3.1_1B",
    device="cuda",
    generate_interctc_outputs=False,
    lang_sym='<eng>',
    task_sym='<asr>',
)

# Read the audio and pad/trim it to the fixed 30-second input window (16 kHz)
speech, rate = sf.read("xxx.wav")
speech = librosa.util.fix_length(speech, size=(16000 * 30))

res = s2t(speech)[0]
print(res)
```
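
To run speech translation instead of ASR, the task token can be changed when loading the model. A minimal sketch, reusing `speech` from the example above and assuming OWSM-style translation task tokens of the form `<st_xxx>` (e.g. `<st_deu>` for translation into German); please check the model's vocabulary for the exact token names:
```python
# Hedged sketch: reuses `speech` and Speech2TextGreedySearch from the example above.
# The `<st_deu>` task token (English speech -> German text) is an assumption
# based on the OWSM token convention; verify it against the model vocabulary.
s2t_st = Speech2TextGreedySearch.from_pretrained(
    "pyf98/owsm_ctc_v3.1_1B",
    device="cuda",
    generate_interctc_outputs=False,
    lang_sym='<eng>',     # source language of the audio
    task_sym='<st_deu>',  # assumed token for translation into German
)
print(s2t_st(speech)[0])
```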

An example script to run long-form ASR:
```python
import soundfile as sf
import torch
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch


if __name__ == "__main__":
    context_len_in_secs = 4  # left and right context when doing buffered inference
    batch_size = 32  # depends on the GPU memory

    s2t = Speech2TextGreedySearch.from_pretrained(
        "pyf98/owsm_ctc_v3.1_1B",
        device='cuda' if torch.cuda.is_available() else 'cpu',
        generate_interctc_outputs=False,
        lang_sym='<eng>',
        task_sym='<asr>',
    )

    speech, rate = sf.read("xxx.wav")

    # Buffered decoding over the long recording, with left/right context at chunk boundaries
    text = s2t.decode_long_batched_buffered(
        speech,
        batch_size=batch_size,
        context_len_in_secs=context_len_in_secs,
        frames_per_sec=12.5,  # 80ms shift; model-dependent, don't change
    )
    print(text)
```
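
The examples above assume 16 kHz audio, the rate the model was trained on (reflected in the `16000 * 30` padding above and `fs=16000` in the aligner below). If a recording has a different sampling rate, it can be resampled first; a minimal sketch using `librosa`, not part of the original scripts:
```python
import librosa

# Hedged sketch: resample `speech` (read with soundfile as above) to the 16 kHz rate the model expects
if rate != 16000:
    speech = librosa.resample(speech, orig_sr=rate, target_sr=16000)
    rate = 16000
```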

An example of CTC forced alignment using `ctc-segmentation`; it can be applied efficiently to audio of arbitrary length.
For model downloading, please refer to https://github.com/espnet/espnet?tab=readme-ov-file#ctc-segmentation-demo

```python
import soundfile as sf
from espnet2.bin.s2t_ctc_align import CTCSegmentation


if __name__ == "__main__":
    ## Please download the model first
    aligner = CTCSegmentation(
        s2t_model_file="exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_raw_bpe50000/valid.total_count.ave_5best.till45epoch.pth",
        fs=16000,
        ngpu=1,
        batch_size=16,  # batched parallel decoding; reduce it if your GPU memory is smaller
        kaldi_style_text=True,
        time_stamps="fixed",
        samples_to_frames_ratio=1280,  # 80ms time shift; don't change as it depends on the pre-trained model
        lang_sym="<eng>",
        task_sym="<asr>",
        context_len_in_secs=2,  # left and right context in buffered decoding
        frames_per_sec=12.5,  # 80ms time shift; don't change as it depends on the pre-trained model
    )

    speech, rate = sf.read("example.wav")
    print(f"speech duration: {len(speech) / rate : .2f} seconds")

    # Kaldi-style text: one "<utterance-id> <transcript>" pair per line
    text = '''
utt1 hello there
utt2 welcome to this repo
'''

    segments = aligner(speech, text)
    print(segments)
```
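
The returned `segments` object can be printed directly, as above; a small hedged sketch for saving the alignment to a file, assuming (as in the ESPnet CTC segmentation demo) that `str(segments)` yields one line per utterance with its start/end times:
```python
# Hedged sketch: write the alignment result to disk.
# Assumes str(segments) produces one Kaldi-style line per utterance,
# as shown by print(segments) in the example above.
with open("aligned_segments", "w") as f:
    f.write(str(segments))
```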