|
--- |
|
thumbnail: https://github.com/rinnakk/japanese-pretrained-models/blob/master/rinna.png |
|
language: ja |
|
datasets: |
|
- reazon-research/reazonspeech |
|
tags: |
|
- automatic-speech-recognition |
|
- speech |
|
- audio |
|
- hubert |
|
- gpt_neox |
|
- asr |
|
- nlp |
|
license: apache-2.0 |
|
--- |
|
|
|
# `rinna/nue-asr` |
|
|
|
![rinna-icon](./rinna.png) |
|
|
|
# Overview |
|
[[Paper]](https://arxiv.org/abs/2312.03668) |
|
[[GitHub]](https://github.com/rinnakk/nue-asr) |
|
|
|
We propose a novel end-to-end speech recognition model, `Nue ASR`, which integrates pre-trained speech and language models. |
|
|
|
The name `Nue` comes from the Japanese word [`鵺/ぬえ/Nue`](https://en.wikipedia.org/wiki/Nue), the name of a legendary Japanese creature ([`妖怪/ようかい/Yōkai`](https://en.wikipedia.org/wiki/Y%C5%8Dkai)). |
|
|
|
This model performs highly accurate Japanese speech recognition. |
|
On a GPU, it can recognize speech faster than real time. |
|
|
|
Benchmark scores, including those of our models, are available at https://rinnakk.github.io/research/benchmarks/asr/ |
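
"Faster than real time" is commonly expressed as a real-time factor (RTF): processing time divided by audio duration, where a value below 1.0 means the audio is transcribed faster than it plays. The following is a minimal sketch of that calculation. The function name and the numbers are hypothetical and serve only as illustration; they are not measured results for this model.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (the RTF)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Hypothetical numbers for illustration only, not measured results.
rtf = real_time_factor(processing_seconds=2.0, audio_seconds=10.0)
print(f"RTF = {rtf:.2f}")  # a value below 1.0 means faster than real time
```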
|
|
|
* **Model architecture** |
|
|
|
This model consists of three main components: a HuBERT audio encoder, a bridge network, and a GPT-NeoX decoder. |
|
The HuBERT and GPT-NeoX weights were initialized from the following pre-trained models, respectively: |
|
- [japanese-hubert-base](https://huggingface.co/rinna/japanese-hubert-base) |
|
- [japanese-gpt-neox-3.6b](https://huggingface.co/rinna/japanese-gpt-neox-3.6b) |
|
|
|
* **Training** |
|
|
|
The model was trained on approximately 19,000 hours of speech from the following Japanese speech corpus: |
|
- [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech) |
|
|
|
|
|
* **Authors** |
|
|
|
- [Yukiya Hono](https://huggingface.co/yky-h) |
|
- [Koh Mitsuda](https://huggingface.co/mitsu-koh) |
|
- [Tianyu Zhao](https://huggingface.co/tianyuz) |
|
- [Kentaro Mitsui](https://huggingface.co/Kentaro321) |
|
- [Toshiaki Wakatsuki](https://huggingface.co/t-w) |
|
- [Kei Sawada](https://huggingface.co/keisawada) |
|
|
|
--- |
|
|
|
# How to use the model |
|
|
|
First, install the inference code for this model. |
|
|
|
```bash |
|
pip install git+https://github.com/rinnakk/nue-asr.git |
|
``` |
|
|
|
Both a command-line interface and a Python interface are available. |
|
|
|
## Command-line usage |
|
The following command transcribes an audio file via the command-line interface. |
|
Audio files are automatically downsampled to 16 kHz. |
|
```bash |
|
nue-asr audio1.wav |
|
``` |
|
You can specify multiple audio files. |
|
```bash |
|
nue-asr audio1.wav audio2.flac audio3.mp3 |
|
``` |
|
|
|
DeepSpeed-Inference can be used to speed up inference of the GPT-NeoX module. |
|
To use DeepSpeed-Inference, first install DeepSpeed. |
|
```bash |
|
pip install deepspeed |
|
``` |
|
|
|
Then, you can use DeepSpeed-Inference as follows: |
|
```bash |
|
nue-asr --use-deepspeed audio1.wav |
|
``` |
|
|
|
Run `nue-asr --help` for more information. |
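
When the command-line interface is driven from a script, the commands above can be assembled programmatically. The following is a minimal sketch: `build_nue_asr_command` is a hypothetical helper, not part of the package, and it assumes that `--use-deepspeed` can be combined with multiple audio files in one invocation.

```python
import shlex
from typing import List, Sequence


def build_nue_asr_command(audio_files: Sequence[str], use_deepspeed: bool = False) -> List[str]:
    """Build the argument list for the `nue-asr` command (hypothetical helper)."""
    if not audio_files:
        raise ValueError("at least one audio file is required")
    command = ["nue-asr"]
    if use_deepspeed:
        # Assumption: the flag may precede any number of audio files.
        command.append("--use-deepspeed")
    command.extend(audio_files)
    return command


command = build_nue_asr_command(["audio1.wav", "audio2.flac"], use_deepspeed=True)
print(shlex.join(command))

# To actually run it (requires the nue-asr package to be installed):
# import subprocess
# subprocess.run(command, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.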
|
|
|
## Python usage |
|
An example of the Python interface is as follows: |
|
```python |
|
import nue_asr |
|
|
|
model = nue_asr.load_model("rinna/nue-asr") |
|
tokenizer = nue_asr.load_tokenizer("rinna/nue-asr") |
|
|
|
result = nue_asr.transcribe(model, tokenizer, "path_to_audio.wav") |
|
print(result.text) |
|
``` |
|
The `nue_asr.transcribe` function accepts audio data as either a `numpy.array` or a `torch.Tensor`, in addition to audio file paths. |
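
To illustrate the in-memory path, the sketch below reads a WAV file into floating-point samples using only the Python standard library. `read_wav_mono16` is a hypothetical helper, not part of `nue_asr`. The sample format that `nue_asr.transcribe` expects for in-memory audio (for example mono, 16 kHz, floats in [-1, 1]) is an assumption here and should be checked against the repository.

```python
import struct
import wave
from typing import List, Tuple


def read_wav_mono16(path: str) -> Tuple[List[float], int]:
    """Read a 16-bit PCM mono WAV file into floats in [-1.0, 1.0)."""
    with wave.open(path, "rb") as wav_file:
        if wav_file.getnchannels() != 1 or wav_file.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM mono audio")
        sample_rate = wav_file.getframerate()
        frames = wav_file.readframes(wav_file.getnframes())
    count = len(frames) // 2  # two bytes per 16-bit sample
    samples = struct.unpack(f"<{count}h", frames)  # little-endian signed shorts
    return [s / 32768.0 for s in samples], sample_rate


# Possible use with the interface above (assumed input format, not verified):
# import numpy
# samples, sample_rate = read_wav_mono16("path_to_audio.wav")
# waveform = numpy.asarray(samples, dtype=numpy.float32)
# result = nue_asr.transcribe(model, tokenizer, waveform)
```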
|
|
|
DeepSpeed-Inference acceleration is also available through the Python interface. |
|
```python |
|
import nue_asr |
|
|
|
model = nue_asr.load_model("rinna/nue-asr", use_deepspeed=True) |
|
tokenizer = nue_asr.load_tokenizer("rinna/nue-asr") |
|
|
|
result = nue_asr.transcribe(model, tokenizer, "path_to_audio.wav") |
|
print(result.text) |
|
``` |
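
Because DeepSpeed is an optional dependency, a script may want to enable it only when the package is installed. The following is a minimal sketch of such a check; `deepspeed_available` is a hypothetical helper. An importable package does not guarantee that DeepSpeed-Inference will work on a given machine, since that also depends on the GPU and CUDA setup.

```python
import importlib.util


def deepspeed_available() -> bool:
    """Return True if the `deepspeed` package can be found in this environment."""
    return importlib.util.find_spec("deepspeed") is not None


use_deepspeed = deepspeed_available()
print(f"use_deepspeed={use_deepspeed}")

# The flag can then be passed to the loader shown above:
# model = nue_asr.load_model("rinna/nue-asr", use_deepspeed=use_deepspeed)
```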
|
|
|
--- |
|
|
|
# Tokenization |
|
The model uses the same sentencepiece-based tokenizer as [japanese-gpt-neox-3.6b](https://huggingface.co/rinna/japanese-gpt-neox-3.6b). |
|
|
|
--- |
|
|
|
# How to cite |
|
```bibtex |
|
@article{hono2023integration, |
|
title={An Integration of Pre-Trained Speech and Language Models for End-to-End Speech Recognition}, |
|
author={Hono, Yukiya and Mitsuda, Koh and Zhao, Tianyu and Mitsui, Kentaro and Wakatsuki, Toshiaki and Sawada, Kei}, |
|
journal={arXiv preprint arXiv:2312.03668}, |
|
year={2023} |
|
} |
|
|
|
@misc{rinna-nue-asr, |
|
title={rinna/nue-asr}, |
|
author={Hono, Yukiya and Mitsuda, Koh and Zhao, Tianyu and Mitsui, Kentaro and Wakatsuki, Toshiaki and Sawada, Kei}, |
|
url={https://huggingface.co/rinna/nue-asr} |
|
} |
|
``` |
|
--- |
|
|
|
# Citations |
|
```bibtex |
|
@article{hsu2021hubert, |
|
title={HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units}, |
|
author={Hsu, Wei-Ning and Bolte, Benjamin and Tsai, Yao-Hung Hubert and Lakhotia, Kushal and Salakhutdinov, Ruslan and Mohamed, Abdelrahman}, |
|
journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing}, |
|
year={2021}, |
|
volume={29}, |
|
pages={3451-3460}, |
|
doi={10.1109/TASLP.2021.3122291} |
|
} |
|
|
|
@software{andoniangpt2021gpt, |
|
title={{GPT-NeoX: Large Scale Autoregressive Language Modeling in PyTorch}}, |
|
author={Andonian, Alex and Anthony, Quentin and Biderman, Stella and Black, Sid and Gali, Preetham and Gao, Leo and Hallahan, Eric and Levy-Kramer, Josh and Leahy, Connor and Nestler, Lucas and Parker, Kip and Pieler, Michael and Purohit, Shivanshu and Songz, Tri and Phil, Wang and Weinbach, Samuel}, |
|
url={https://www.github.com/eleutherai/gpt-neox}, |
|
doi={10.5281/zenodo.5879544}, |
|
month={8}, |
|
year={2021}, |
|
version={0.0.1}, |
|
} |
|
``` |
|
--- |
|
|
|
# License |
|
[The Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0) |
|
|