---
thumbnail: https://github.com/rinnakk/japanese-pretrained-models/blob/master/rinna.png
language: ja
datasets:
- reazon-research/reazonspeech
tags:
- automatic-speech-recognition
- speech
- audio
- hubert
- gpt_neox
- asr
- nlp
license: apache-2.0
---

# `rinna/nue-asr`

![rinna-icon](./rinna.png)

# Overview
[[Paper]](https://arxiv.org/abs/2312.03668)
[[GitHub]](https://github.com/rinnakk/nue-asr)

We propose a novel end-to-end speech recognition model, `Nue ASR`, which integrates pre-trained speech and language models.

The name `Nue` comes from the Japanese word ([`鵺/ぬえ/Nue`](https://en.wikipedia.org/wiki/Nue)), one of the Japanese legendary creatures ([`妖怪/ようかい/Yōkai`](https://en.wikipedia.org/wiki/Y%C5%8Dkai)).

This model performs highly accurate Japanese speech recognition.
With a GPU, it can recognize speech faster than real time.

Benchmark scores, including those of our models, are available at https://rinnakk.github.io/research/benchmarks/asr/

* **Model architecture**

    This model consists of three main components: a HuBERT audio encoder, a bridge network, and a GPT-NeoX decoder.
    The HuBERT and GPT-NeoX components were initialized with the following pre-trained weights, respectively.
    - [japanese-hubert-base](https://huggingface.co/rinna/japanese-hubert-base)
    - [japanese-gpt-neox-3.6b](https://huggingface.co/rinna/japanese-gpt-neox-3.6b)

* **Training**

    The model was trained on approximately 19,000 hours of the following Japanese speech corpus.
    - [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech)

* **Authors**

    - [Yukiya Hono](https://huggingface.co/yky-h)
    - [Koh Mitsuda](https://huggingface.co/mitsu-koh)
    - [Tianyu Zhao](https://huggingface.co/tianyuz)
    - [Kentaro Mitsui](https://huggingface.co/Kentaro321)
    - [Toshiaki Wakatsuki](https://huggingface.co/t-w)
    - [Kei Sawada](https://huggingface.co/keisawada)
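
The encoder, bridge, and decoder flow described above can be sketched abstractly. This is a toy illustration only: the function bodies, names, and hidden dimensions are stand-ins chosen for the sketch, not the actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumed for this sketch, not from the real config).
AUDIO_DIM, TEXT_DIM = 768, 2816

def hubert_encoder(waveform: np.ndarray) -> np.ndarray:
    """Stand-in for the HuBERT encoder: one feature vector per 20 ms frame."""
    n_frames = len(waveform) // 320  # 16 kHz audio, 320-sample frame stride
    return rng.standard_normal((n_frames, AUDIO_DIM))

# Stand-in for the bridge network: a single linear projection from the
# audio feature space into the decoder's embedding space.
bridge = rng.standard_normal((AUDIO_DIM, TEXT_DIM)) * 0.01

def gpt_neox_decoder(prefix: np.ndarray) -> str:
    """Stand-in for the GPT-NeoX decoder, which would generate text
    autoregressively conditioned on the projected speech features."""
    return f"<transcription conditioned on {prefix.shape[0]} speech frames>"

waveform = rng.standard_normal(16000)       # 1 second of 16 kHz audio
speech_features = hubert_encoder(waveform)  # shape: (50, AUDIO_DIM)
prefix = speech_features @ bridge           # shape: (50, TEXT_DIM)
text = gpt_neox_decoder(prefix)
print(text)
```

The key design point is that the bridge maps continuous speech features into the language model's input space, so the decoder can treat them as a prompt prefix.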

---

# How to use the model

First, install the inference code for this model.

```bash
pip install git+https://github.com/rinnakk/nue-asr.git
```

Both a command-line interface and a Python interface are available.
67
+
68
+ ## Command-line usage
69
+ The following command will transcribe the audio file via the command line interface.
70
+ Audio files will be automatically downsampled to 16kHz.
71
+ ```bash
72
+ nue-asr audio1.wav
73
+ ```
74
+ You can specify multiple audio files.
75
+ ```bash
76
+ nue-asr audio1.wav audio2.flac audio3.mp3
77
+ ```
78
+
79
+ We can use DeepSpeed-Inference to accelerate the inference speed of GPT-NeoX module.
80
+ If you use DeepSpeed-Inference, you need to install DeepSpeed.
81
+ ```bash
82
+ pip install deepspeed
83
+ ```
84
+
85
+ Then, you can use DeepSpeed-Inference as follows:
86
+ ```bash
87
+ nue-asr --use-deepspeed audio1.wav
88
+ ```
89
+
90
+ Run `nue-asr --help` for more information.

## Python usage
An example of the Python interface is as follows:
```python
import nue_asr

model = nue_asr.load_model("rinna/nue-asr")
tokenizer = nue_asr.load_tokenizer("rinna/nue-asr")

result = nue_asr.transcribe(model, tokenizer, "path_to_audio.wav")
print(result.text)
```
The `nue_asr.transcribe` function accepts audio data as a `numpy.ndarray` or a `torch.Tensor`, in addition to audio file paths.
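
The automatic 16 kHz downsampling described above applies to audio files; when passing a raw array, making sure it is already sampled at 16 kHz is the safe assumption. Below is a minimal sketch, not part of the `nue-asr` API, of naive linear-interpolation resampling with NumPy; a dedicated resampler such as librosa or torchaudio is generally preferable for real use.

```python
import numpy as np

def resample_to_16k(waveform: np.ndarray, orig_sr: int) -> np.ndarray:
    """Naive linear-interpolation resampling to 16 kHz (illustration only)."""
    target_sr = 16000
    if orig_sr == target_sr:
        return waveform
    duration = len(waveform) / orig_sr
    n_target = int(duration * target_sr)
    x_old = np.linspace(0.0, duration, num=len(waveform), endpoint=False)
    x_new = np.linspace(0.0, duration, num=n_target, endpoint=False)
    return np.interp(x_new, x_old, waveform)

# Synthetic 1-second, 44.1 kHz sine wave standing in for real audio.
sr = 44100
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 440.0 * t).astype(np.float32)

audio_16k = resample_to_16k(audio, sr)
print(len(audio_16k))  # 16000
# The resampled array can then be passed directly:
# result = nue_asr.transcribe(model, tokenizer, audio_16k)
```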

DeepSpeed-Inference acceleration is also available through the Python interface.
```python
import nue_asr

model = nue_asr.load_model("rinna/nue-asr", use_deepspeed=True)
tokenizer = nue_asr.load_tokenizer("rinna/nue-asr")

result = nue_asr.transcribe(model, tokenizer, "path_to_audio.wav")
print(result.text)
```

---

# Tokenization
The model uses the same sentencepiece-based tokenizer as [japanese-gpt-neox-3.6b](https://huggingface.co/rinna/japanese-gpt-neox-3.6b).

---

# How to cite
```bibtex
@article{hono2023integration,
    title={An Integration of Pre-Trained Speech and Language Models for End-to-End Speech Recognition},
    author={Hono, Yukiya and Mitsuda, Koh and Zhao, Tianyu and Mitsui, Kentaro and Wakatsuki, Toshiaki and Sawada, Kei},
    journal={arXiv preprint arXiv:2312.03668},
    year={2023}
}

@misc{rinna-nue-asr,
    title={rinna/nue-asr},
    author={Hono, Yukiya and Mitsuda, Koh and Zhao, Tianyu and Mitsui, Kentaro and Wakatsuki, Toshiaki and Sawada, Kei},
    url={https://huggingface.co/rinna/nue-asr}
}
```

---

# Citations
```bibtex
@article{hsu2021hubert,
    title={HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units},
    author={Hsu, Wei-Ning and Bolte, Benjamin and Tsai, Yao-Hung Hubert and Lakhotia, Kushal and Salakhutdinov, Ruslan and Mohamed, Abdelrahman},
    journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
    year={2021},
    volume={29},
    pages={3451-3460},
    doi={10.1109/TASLP.2021.3122291}
}

@software{andoniangpt2021gpt,
    title={{GPT-NeoX: Large Scale Autoregressive Language Modeling in PyTorch}},
    author={Andonian, Alex and Anthony, Quentin and Biderman, Stella and Black, Sid and Gali, Preetham and Gao, Leo and Hallahan, Eric and Levy-Kramer, Josh and Leahy, Connor and Nestler, Lucas and Parker, Kip and Pieler, Michael and Purohit, Shivanshu and Songz, Tri and Phil, Wang and Weinbach, Samuel},
    url={https://www.github.com/eleutherai/gpt-neox},
    doi={10.5281/zenodo.5879544},
    month={8},
    year={2021},
    version={0.0.1},
}
```

---

# License
[The Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0)