Automatic Speech Recognition · ESPnet · multilingual · audio · speech-translation
pyf98 committed
Commit a4b693c
1 Parent(s): f522784
Files changed (27)
  1. README.md +88 -0
  2. data/token_list/bpe_unigram50000/bpe.model +3 -0
  3. exp/s2t_stats_raw_bpe50000/train/feats_stats.npz +3 -0
  4. exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/RESULTS.md +0 -0
  5. exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/config.yaml +0 -0
  6. exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/images/acc.png +0 -0
  7. exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/images/backward_time.png +0 -0
  8. exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/images/cer.png +0 -0
  9. exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/images/cer_ctc.png +0 -0
  10. exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/images/clip.png +0 -0
  11. exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/images/forward_time.png +0 -0
  12. exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/images/gpu_max_cached_mem_GB.png +0 -0
  13. exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/images/grad_norm.png +0 -0
  14. exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/images/iter_time.png +0 -0
  15. exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/images/loss.png +0 -0
  16. exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/images/loss_att.png +0 -0
  17. exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/images/loss_ctc.png +0 -0
  18. exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/images/loss_scale.png +0 -0
  19. exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/images/optim0_lr0.png +0 -0
  20. exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/images/optim_step_time.png +0 -0
  21. exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/images/train_time.png +0 -0
  22. exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/images/wer.png +0 -0
  23. exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/train.2.log +0 -0
  24. exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/train.log +0 -0
  25. exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/valid.acc.ave_5best.pth +3 -0
  26. exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/valid.total_count.ave_5best.pth +3 -0
  27. meta.yaml +8 -0
README.md ADDED
@@ -0,0 +1,88 @@
+ ---
+ tags:
+ - espnet
+ - audio
+ - automatic-speech-recognition
+ - speech-translation
+ language: multilingual
+ datasets:
+ - owsm_v3.1_lowrestriction
+ license: cc-by-4.0
+ ---
+
+ ## OWSM: Open Whisper-style Speech Model
+
+ [OWSM](https://arxiv.org/abs/2309.13876) is an Open Whisper-style Speech Model from [CMU WAVLab](https://www.wavlab.org/). It reproduces Whisper-style training using publicly available data and the open-source toolkit [ESPnet](https://github.com/espnet/espnet).
+
+ Our demo is available [here](https://huggingface.co/spaces/pyf98/OWSM_v3_demo). The [project page](https://www.wavlab.org/activities/2024/owsm/) contains various resources.
+
+ [OWSM v3.1](https://arxiv.org/abs/2401.16658) is an improved version of OWSM v3. It significantly outperforms OWSM v3 on almost all evaluation benchmarks.
+ We do not include any new training data. Instead, we utilize a state-of-the-art speech encoder, [E-Branchformer](https://arxiv.org/abs/2210.00077).
+
+ This is a small-size model with 367M parameters, trained on 70k hours of public speech data with lower restrictions (compared to the full OWSM data). Please check our [project page](https://www.wavlab.org/activities/2024/owsm/) for more information.
+ Specifically, it supports the following speech-to-text tasks (a usage sketch follows the list):
+ - Speech recognition
+ - Utterance-level alignment
+ - Long-form transcription
+ - Language identification
+
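+ As a minimal usage sketch with espnet2's `Speech2Text` S2T inference API (the model tag and `example.wav` below are placeholders; substitute this repository's Hugging Face ID and your own 16 kHz mono audio; `ctc_weight=0.0` and `maxlenratio=0.0` mirror the settings used in other OWSM examples):
+
+ ```python
+ import soundfile as sf
+ from espnet2.bin.s2t_inference import Speech2Text
+
+ # Build the model from a pretrained tag (hypothetical tag; use this repo's ID).
+ speech2text = Speech2Text.from_pretrained(
+     "espnet/owsm-v3.1-small-lowrestriction",  # placeholder model tag
+     device="cpu",
+     beam_size=5,
+     ctc_weight=0.0,
+     maxlenratio=0.0,
+     lang_sym="<eng>",  # target language token (ISO 639-3 code)
+     task_sym="<asr>",  # speech recognition; other tasks use other task symbols
+ )
+
+ # Decode one short utterance of 16 kHz mono audio.
+ speech, rate = sf.read("example.wav")
+ text, *_ = speech2text(speech)[0]  # best hypothesis from the n-best list
+ print(text)
+ ```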
+
+ ### Citing OWSM, Branchformers and ESPnet
+
+ ```BibTeX
+ @misc{peng2024owsm,
+ title={OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer},
+ author={Yifan Peng and Jinchuan Tian and William Chen and Siddhant Arora and Brian Yan and Yui Sudo and Muhammad Shakeel and Kwanghee Choi and Jiatong Shi and Xuankai Chang and Jee-weon Jung and Shinji Watanabe},
+ year={2024},
+ eprint={2401.16658},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL}
+ }
+ @INPROCEEDINGS{owsm-asru23,
+ author={Peng, Yifan and Tian, Jinchuan and Yan, Brian and Berrebbi, Dan and Chang, Xuankai and Li, Xinjian and Shi, Jiatong and Arora, Siddhant and Chen, William and Sharma, Roshan and Zhang, Wangyou and Sudo, Yui and Shakeel, Muhammad and Jung, Jee-Weon and Maiti, Soumi and Watanabe, Shinji},
+ booktitle={2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
+ title={Reproducing Whisper-Style Training Using An Open-Source Toolkit And Publicly Available Data},
+ year={2023},
+ doi={10.1109/ASRU57964.2023.10389676}
+ }
+ @inproceedings{peng23b_interspeech,
+ author={Yifan Peng and Kwangyoun Kim and Felix Wu and Brian Yan and Siddhant Arora and William Chen and Jiyang Tang and Suwon Shon and Prashant Sridhar and Shinji Watanabe},
+ title={{A Comparative Study on E-Branchformer vs Conformer in Speech Recognition, Translation, and Understanding Tasks}},
+ year=2023,
+ booktitle={Proc. INTERSPEECH 2023},
+ pages={2208--2212},
+ doi={10.21437/Interspeech.2023-1194}
+ }
+ @inproceedings{kim2023branchformer,
+ title={E-branchformer: Branchformer with enhanced merging for speech recognition},
+ author={Kim, Kwangyoun and Wu, Felix and Peng, Yifan and Pan, Jing and Sridhar, Prashant and Han, Kyu J and Watanabe, Shinji},
+ booktitle={2022 IEEE Spoken Language Technology Workshop (SLT)},
+ pages={84--91},
+ year={2023},
+ organization={IEEE}
+ }
+ @InProceedings{pmlr-v162-peng22a,
+ title = {Branchformer: Parallel {MLP}-Attention Architectures to Capture Local and Global Context for Speech Recognition and Understanding},
+ author = {Peng, Yifan and Dalmia, Siddharth and Lane, Ian and Watanabe, Shinji},
+ booktitle = {Proceedings of the 39th International Conference on Machine Learning},
+ pages = {17627--17643},
+ year = {2022},
+ editor = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
+ volume = {162},
+ series = {Proceedings of Machine Learning Research},
+ month = {17--23 Jul},
+ publisher = {PMLR},
+ pdf = {https://proceedings.mlr.press/v162/peng22a/peng22a.pdf},
+ url = {https://proceedings.mlr.press/v162/peng22a.html},
+ abstract = {Conformer has proven to be effective in many speech processing tasks. It combines the benefits of extracting local dependencies using convolutions and global dependencies using self-attention. Inspired by this, we propose a more flexible, interpretable and customizable encoder alternative, Branchformer, with parallel branches for modeling various ranged dependencies in end-to-end speech processing. In each encoder layer, one branch employs self-attention or its variant to capture long-range dependencies, while the other branch utilizes an MLP module with convolutional gating (cgMLP) to extract local relationships. We conduct experiments on several speech recognition and spoken language understanding benchmarks. Results show that our model outperforms both Transformer and cgMLP. It also matches with or outperforms state-of-the-art results achieved by Conformer. Furthermore, we show various strategies to reduce computation thanks to the two-branch architecture, including the ability to have variable inference complexity in a single trained model. The weights learned for merging branches indicate how local and global dependencies are utilized in different layers, which benefits model designing.}
+ }
+ @inproceedings{watanabe2018espnet,
+ author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson Yalta and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
+ title={{ESPnet}: End-to-End Speech Processing Toolkit},
+ year={2018},
+ booktitle={Proceedings of Interspeech},
+ pages={2207--2211},
+ doi={10.21437/Interspeech.2018-1456},
+ url={http://dx.doi.org/10.21437/Interspeech.2018-1456}
+ }
+ ```
data/token_list/bpe_unigram50000/bpe.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:af3103a5e6dbaea47c4ca88e2ae92e536ff335c40876de2b2963d2af52746453
+ size 1053520
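The pointer above resolves to a 50k-piece SentencePiece unigram model. As a minimal inspection sketch with the `sentencepiece` package, assuming the real file has been fetched via `git lfs pull`:

```python
import sentencepiece as spm

# Load the unigram BPE model that defines the 50k-token vocabulary.
sp = spm.SentencePieceProcessor(model_file="data/token_list/bpe_unigram50000/bpe.model")

print(sp.vocab_size())  # expected: 50000 pieces
pieces = sp.encode("open whisper-style speech model", out_type=str)
print(pieces)             # subword pieces
print(sp.decode(pieces))  # round-trips back to the input text
```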
exp/s2t_stats_raw_bpe50000/train/feats_stats.npz ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f1171609665f1de3c8b99eb2f23392e69a32c6d7027361b9645e0ffaa5fdeae6
+ size 1402
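This small npz archive holds the training-set feature statistics that ESPnet uses for global mean-variance normalization of input features. A quick way to inspect it (assuming the real file has been pulled via `git lfs pull`):

```python
import numpy as np

# Open the statistics archive and list the stored arrays.
stats = np.load("exp/s2t_stats_raw_bpe50000/train/feats_stats.npz")
for key in stats.files:
    print(key, stats[key].shape)

# ESPnet derives per-dimension mean and variance from these accumulators
# to normalize input features at training and inference time.
```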
exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/RESULTS.md ADDED
File without changes
exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/config.yaml ADDED
The diff for this file is too large to render. See raw diff
 
exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/images/acc.png ADDED
exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/images/backward_time.png ADDED
exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/images/cer.png ADDED
exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/images/cer_ctc.png ADDED
exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/images/clip.png ADDED
exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/images/forward_time.png ADDED
exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/images/gpu_max_cached_mem_GB.png ADDED
exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/images/grad_norm.png ADDED
exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/images/iter_time.png ADDED
exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/images/loss.png ADDED
exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/images/loss_att.png ADDED
exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/images/loss_ctc.png ADDED
exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/images/loss_scale.png ADDED
exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/images/optim0_lr0.png ADDED
exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/images/optim_step_time.png ADDED
exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/images/train_time.png ADDED
exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/images/wer.png ADDED
exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/train.2.log ADDED
The diff for this file is too large to render. See raw diff
 
exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/train.log ADDED
The diff for this file is too large to render. See raw diff
 
exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/valid.acc.ave_5best.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:03a1dbd2d9498777c9de4f97ab2faae563f6d13f89f4cc85481c916038eb1646
+ size 1466924749
exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/valid.total_count.ave_5best.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2412ee08e7f7c2dfd2a19f26289d93a92ae1f3beb7bc319e98cd74a6b05347e1
+ size 1466929645
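The two `.pth` files above are git-lfs pointers to roughly 1.5 GB checkpoints obtained by averaging the five best snapshots (selected by validation accuracy and by training count, respectively). A minimal inspection sketch, assuming the averaged checkpoint is a flat state dict of tensors as ESPnet's model averaging produces:

```python
import torch

# Load the averaged weights on CPU (pull the real file via `git lfs pull` first).
path = (
    "exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_"
    "warmup60k_flashattn_raw_bpe50000/valid.acc.ave_5best.pth"
)
state_dict = torch.load(path, map_location="cpu")

# Print a few parameter names/shapes and the total parameter count (~367M).
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))
print("total parameters:", sum(t.numel() for t in state_dict.values()))
```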
meta.yaml ADDED
@@ -0,0 +1,8 @@
+ espnet: '202310'
+ files:
+   s2t_model_file: exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/valid.acc.ave_5best.pth
+ python: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0]
+ timestamp: 1725247130.37343
+ torch: 1.13.1
+ yaml_files:
+   s2t_train_config: exp/s2t_train_s2t_ebf_conv2d_size768_e9_d9_piecewise_lr5e-4_warmup60k_flashattn_raw_bpe50000/config.yaml
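meta.yaml records which exported files ESPnet should load. A minimal sketch, assuming the config and checkpoint have been pulled via git-lfs, of wiring those two paths into espnet2's S2T inference class:

```python
import yaml
from espnet2.bin.s2t_inference import Speech2Text

# Read the exported file paths recorded in meta.yaml.
with open("meta.yaml") as f:
    meta = yaml.safe_load(f)

# Construct the inference wrapper directly from the local config and weights.
speech2text = Speech2Text(
    s2t_train_config=meta["yaml_files"]["s2t_train_config"],
    s2t_model_file=meta["files"]["s2t_model_file"],
    device="cpu",
)
```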