---
thumbnail: https://github.com/rinnakk/japanese-pretrained-models/blob/master/rinna.png
language: ja
datasets:
  - reazon-research/reazonspeech
tags:
  - automatic-speech-recognition
  - speech
  - audio
  - hubert
  - gpt_neox
  - asr
  - nlp
license: apache-2.0
inference: false
---

# `rinna/nue-asr`

![rinna-icon](./rinna.png)

# Overview
[[Paper]](https://arxiv.org/abs/2312.03668)
[[GitHub]](https://github.com/rinnakk/nue-asr)

We propose a novel end-to-end speech recognition model, `Nue ASR`, which integrates pre-trained speech and language models.

The name `Nue` comes from the [`鵺/ぬえ/Nue`](https://en.wikipedia.org/wiki/Nue), a legendary creature ([`妖怪/ようかい/Yōkai`](https://en.wikipedia.org/wiki/Y%C5%8Dkai)) from Japanese folklore.

This model provides end-to-end Japanese speech recognition with accuracy comparable to recent ASR models.
With a GPU, it can transcribe speech faster than real time.
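
"Faster than real time" is commonly quantified with the real-time factor (RTF): processing time divided by audio duration, where a value below 1 means the transcription finishes before the audio would have finished playing. The sketch below shows that arithmetic; the helper names `real_time_factor` and `timed` are hypothetical, and the numbers are illustrative rather than measurements of this model.

```python
import time


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (RTF)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Illustrative numbers only: a 10-second clip processed in 2.5 seconds.
rtf = real_time_factor(processing_seconds=2.5, audio_seconds=10.0)
print(f"RTF = {rtf:.2f}, faster than real time: {rtf < 1.0}")
```

To measure an actual run, wrap the transcription call with `timed` and divide the elapsed time by the clip length in seconds.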

Benchmark scores, including our models, can be found at https://rinnakk.github.io/research/benchmarks/asr/

* **Model architecture**

    This model consists of three main components: a HuBERT audio encoder, a bridge network, and a GPT-NeoX decoder.
    The HuBERT and GPT-NeoX weights were initialized from the following pre-trained models, respectively.
    - [japanese-hubert-base](https://huggingface.co/rinna/japanese-hubert-base)
    - [japanese-gpt-neox-3.6b](https://huggingface.co/rinna/japanese-gpt-neox-3.6b)

* **Training**

    The model was trained on approximately 19,000 hours of speech from the following Japanese corpus, ReazonSpeech v1.
    Note that speech samples longer than 16 seconds were excluded before training.
    - [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech)


* **Contributors**

    - [Yukiya Hono](https://huggingface.co/yky-h)
    - [Koh Mitsuda](https://huggingface.co/mitsu-koh)
    - [Tianyu Zhao](https://huggingface.co/tianyuz)
    - [Kentaro Mitsui](https://huggingface.co/Kentaro321)
    - [Toshiaki Wakatsuki](https://huggingface.co/t-w)
    - [Kei Sawada](https://huggingface.co/keisawada)
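
The length filter described under **Training** above (samples longer than 16 seconds were excluded) can be sketched as a simple duration check. This is a minimal illustration, not the authors' preprocessing code: the function names, the 16 kHz sample rate, and the clip records are assumptions made for the example.

```python
MAX_DURATION_SECONDS = 16.0  # threshold stated in the Training note


def duration_seconds(num_samples: int, sample_rate: int) -> float:
    """Duration of a clip given its sample count and sample rate."""
    return num_samples / sample_rate


def keep_for_training(num_samples: int, sample_rate: int,
                      max_seconds: float = MAX_DURATION_SECONDS) -> bool:
    """Keep a clip unless it is longer than max_seconds."""
    return duration_seconds(num_samples, sample_rate) <= max_seconds


# Hypothetical clips as (name, number of samples), assumed to be 16 kHz audio.
clips = [("short", 160_000), ("exactly_16s", 256_000), ("too_long", 256_001)]
kept = [name for name, n in clips if keep_for_training(n, 16_000)]
print(kept)
```

A practical consequence of such a filter is that the model saw no long utterances during training, so long recordings may need to be split into shorter segments before transcription.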

---

# How to use the model

We tested our code using Python 3.8.10 and 3.10.12 with [PyTorch](https://pytorch.org/) 2.1.1 and [Transformers](https://huggingface.co/docs/transformers) 4.35.2. 
This codebase is expected to be compatible with Python 3.8 or later and recent PyTorch versions. 
The version of Transformers should be 4.33.0 or higher.
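
The version requirement above can be checked before installing or running anything. The sketch below uses only the standard library; `version_tuple`, `meets_minimum`, and `check_transformers` are hypothetical helper names, and the parsing is deliberately simple (it compares only the first three numeric components).

```python
import re
from importlib import metadata

MIN_TRANSFORMERS = "4.33.0"  # minimum version stated above


def version_tuple(version: str) -> tuple:
    """Turn '4.35.2' or '4.36.0.dev0' into a comparable tuple of three ints."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    # Pad so that '4.33' compares equal to '4.33.0'.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = MIN_TRANSFORMERS) -> bool:
    """True if the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_transformers() -> str:
    """Return a short status message for the installed Transformers package."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return "transformers is not installed"
    if meets_minimum(installed):
        return f"transformers {installed} is new enough"
    return f"transformers {installed} is older than {MIN_TRANSFORMERS}"


print(check_transformers())
```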

First, install the code for inference of this model.

```bash
pip install git+https://github.com/rinnakk/nue-asr.git
```

Both a command-line interface and a Python interface are available.

## Command-line usage
The following command transcribes an audio file using the command-line interface.
Audio files are automatically downsampled to 16 kHz.
```bash
nue-asr audio1.wav
```
You can specify multiple audio files.
```bash
nue-asr audio1.wav audio2.flac audio3.mp3
```

You can use [DeepSpeed-Inference](https://www.deepspeed.ai/inference/) to accelerate inference in the GPT-NeoX module.
To do so, first install DeepSpeed.
```bash
pip install deepspeed
```

Then, you can use DeepSpeed-Inference as follows:
```bash
nue-asr --use-deepspeed audio1.wav
```

Run `nue-asr --help` for more information.

## Python usage
An example of the Python interface is as follows:
```python
import nue_asr

model = nue_asr.load_model("rinna/nue-asr")
tokenizer = nue_asr.load_tokenizer("rinna/nue-asr")

result = nue_asr.transcribe(model, tokenizer, "path_to_audio.wav")
print(result.text)
```
The `nue_asr.transcribe` function accepts audio data as either a `numpy.array` or a `torch.Tensor`, in addition to audio file paths.
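
When passing an array instead of a file path, the audio usually has to be prepared by the caller. The sketch below assumes that array input should be mono, `float32`, and sampled at 16 kHz; this README does not state that requirement, so treat it as an assumption. The helpers `to_mono_float32`, `resample_linear`, and `transcribe_array` are hypothetical names, and the linear-interpolation resampler has no anti-aliasing filter, so a dedicated resampling library is preferable for real use.

```python
import numpy as np

TARGET_SAMPLE_RATE = 16_000  # the rate mentioned in the command-line section


def to_mono_float32(waveform: np.ndarray) -> np.ndarray:
    """Collapse (samples, channels) audio to a 1-D float32 array."""
    waveform = np.asarray(waveform, dtype=np.float32)
    if waveform.ndim == 2:
        waveform = waveform.mean(axis=1)
    return waveform


def resample_linear(waveform: np.ndarray, source_rate: int,
                    target_rate: int = TARGET_SAMPLE_RATE) -> np.ndarray:
    """Naive linear-interpolation resampling (no anti-aliasing filter)."""
    if source_rate == target_rate:
        return waveform
    target_len = int(round(len(waveform) / source_rate * target_rate))
    source_times = np.arange(len(waveform)) / source_rate
    target_times = np.arange(target_len) / target_rate
    return np.interp(target_times, source_times, waveform).astype(np.float32)


def transcribe_array(waveform: np.ndarray, source_rate: int) -> str:
    """Prepare an in-memory waveform and pass it to nue_asr.transcribe."""
    import nue_asr  # imported lazily so the helpers above work without it

    model = nue_asr.load_model("rinna/nue-asr")
    tokenizer = nue_asr.load_tokenizer("rinna/nue-asr")
    audio = resample_linear(to_mono_float32(waveform), source_rate)
    return nue_asr.transcribe(model, tokenizer, audio).text
```

For example, `transcribe_array(stereo_48k, 48_000)` would average the two channels, resample to 16 kHz, and return the recognized text.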

DeepSpeed-Inference acceleration is also available through the Python interface.
```python
import nue_asr

model = nue_asr.load_model("rinna/nue-asr", use_deepspeed=True)
tokenizer = nue_asr.load_tokenizer("rinna/nue-asr")

result = nue_asr.transcribe(model, tokenizer, "path_to_audio.wav")
print(result.text)
```
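
To transcribe several files from Python, it is usually worth loading the model and tokenizer once and reusing them. The sketch below uses only the `nue_asr` calls shown above; `transcribe_many` and `make_nue_transcriber` are hypothetical helper names, and the per-file function is passed in as an argument so the loop can be tried without the model.

```python
from typing import Callable, Dict, Iterable


def transcribe_many(paths: Iterable[str],
                    transcribe_one: Callable[[str], str]) -> Dict[str, str]:
    """Apply transcribe_one to each path and collect {path: text}."""
    results = {}
    for path in paths:
        results[str(path)] = transcribe_one(str(path))
    return results


def make_nue_transcriber(use_deepspeed: bool = False) -> Callable[[str], str]:
    """Load the model and tokenizer once and return a per-file function."""
    import nue_asr  # imported lazily so transcribe_many works without it

    model = nue_asr.load_model("rinna/nue-asr", use_deepspeed=use_deepspeed)
    tokenizer = nue_asr.load_tokenizer("rinna/nue-asr")

    def transcribe_one(path: str) -> str:
        return nue_asr.transcribe(model, tokenizer, path).text

    return transcribe_one
```

With the package installed, `transcribe_many(["audio1.wav", "audio2.flac"], make_nue_transcriber())` would return a dictionary mapping each path to its transcription.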

---

# Tokenization
The model uses the same sentencepiece-based tokenizer as [japanese-gpt-neox-3.6b](https://huggingface.co/rinna/japanese-gpt-neox-3.6b).

---

# How to cite
```bibtex
@inproceedings{hono2024integrating,
    title = {Integrating Pre-Trained Speech and Language Models for End-to-End Speech Recognition},
    author = {Hono, Yukiya and Mitsuda, Koh and Zhao, Tianyu and Mitsui, Kentaro and Wakatsuki, Toshiaki and Sawada, Kei},
    booktitle = {Findings of the Association for Computational Linguistics: ACL 2024},
    year = {2024}
}

@misc{rinna-nue-asr,
    title = {rinna/nue-asr},
    author = {Hono, Yukiya and Mitsuda, Koh and Zhao, Tianyu and Mitsui, Kentaro and Wakatsuki, Toshiaki and Sawada, Kei},
    url = {https://huggingface.co/rinna/nue-asr}
}
```
---

# References
```bibtex
@inproceedings{sawada2024release,
    title = {Release of Pre-Trained Models for the {J}apanese Language},
    author = {Sawada, Kei and Zhao, Tianyu and Shing, Makoto and Mitsui, Kentaro and Kaga, Akio and Hono, Yukiya and Wakatsuki, Toshiaki and Mitsuda, Koh},
    booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
    month = {5},
    year = {2024},
    pages = {13898--13905},
    url = {https://aclanthology.org/2024.lrec-main.1213},
    note = {\url{https://arxiv.org/abs/2404.01657}}
}

@article{hsu2021hubert,
    title = {{HuBERT}: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units},
    author = {Hsu, Wei-Ning and Bolte, Benjamin and Tsai, Yao-Hung Hubert and Lakhotia, Kushal and Salakhutdinov, Ruslan and Mohamed, Abdelrahman},
    journal = {IEEE/ACM Transactions on Audio, Speech, and Language Processing},
    month = {10},
    year = {2021},
    volume = {29},
    pages = {3451--3460},
    doi = {10.1109/TASLP.2021.3122291}
}

@software{andoniangpt2021gpt,
    title = {{GPT}-{N}eo{X}: Large Scale Autoregressive Language Modeling in {P}y{T}orch},
    author = {Andonian, Alex and Anthony, Quentin and Biderman, Stella and Black, Sid and Gali, Preetham and Gao, Leo and Hallahan, Eric and Levy-Kramer, Josh and Leahy, Connor and Nestler, Lucas and Parker, Kip and Pieler, Michael and Purohit, Shivanshu and Songz, Tri and Phil, Wang and Weinbach, Samuel},
    month = {8},
    year = {2021},
    version = {0.0.1},
    doi = {10.5281/zenodo.5879544},
    url = {https://www.github.com/eleutherai/gpt-neox}
}

@inproceedings{aminabadi2022deepspeed,
    title = {{DeepSpeed-Inference}: enabling efficient inference of transformer models at unprecedented scale},
    author = {Aminabadi, Reza Yazdani and Rajbhandari, Samyam and Awan, Ammar Ahmad and Li, Cheng and Li, Du and Zheng, Elton and Ruwase, Olatunji and Smith, Shaden and Zhang, Minjia and Rasley, Jeff and others},
    booktitle = {SC22: International Conference for High Performance Computing, Networking, Storage and Analysis},
    pages = {1--15},
    year = {2022},
    doi = {10.1109/SC41404.2022.00051}
}
```
---

# License
[The Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0)