---
language: ja
datasets:
- reazon-research/reazonspeech
tags:
- automatic-speech-recognition
- speech
- audio
- hubert
- gpt_neox
- asr
- nlp
license: apache-2.0
---
# `rinna/nue-asr`
# Overview
We propose a novel end-to-end speech recognition model, `Nue ASR`, which integrates pre-trained speech and language models.
The name `Nue` comes from the Japanese word `鵺/ぬえ/Nue`, one of the Japanese legendary creatures (`妖怪/ようかい/Yōkai`).
This model is capable of performing highly accurate Japanese speech recognition.
On a GPU, it can recognize speech faster than real time.
Benchmark scores, including those of our models, can be seen at
* **Model architecture**
This model consists of three main components: a HuBERT audio encoder, a bridge network, and a GPT-NeoX decoder.
The HuBERT and GPT-NeoX weights were initialized from the following pre-trained models, respectively:
- japanese-hubert-base
- japanese-gpt-neox-3.6b
* **Training**
The model was trained on approximately 19,000 hours of the following Japanese speech corpus.
- ReazonSpeech
* **Authors**
- Yukiya Hono
- Koh Mitsuda
- Tianyu Zhao
- Kentaro Mitsui
- Toshiaki Wakatsuki
- Kei Sawada
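As a rough illustration of the architecture described in this overview (HuBERT audio encoder, bridge network, GPT-NeoX decoder), the sketch below estimates how many audio frames are produced for a clip of a given length. It assumes the convolutional front end of the standard HuBERT Base configuration, which this card does not state explicitly. The `bridge_factor` parameter is a hypothetical downsampling factor for the bridge network, included only to show how the sequence could shrink before it reaches the decoder; it is not taken from this card.

```python
# Kernel sizes and strides of the convolutional feature extractor in the
# standard HuBERT Base configuration (assumed, not stated in this card).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def hubert_frames(num_samples: int) -> int:
    """Number of encoder frames for a waveform of `num_samples` samples."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        if length < kernel:
            return 0  # input too short to yield a single frame
        length = (length - kernel) // stride + 1
    return length


def decoder_prefix_length(num_samples: int, bridge_factor: int) -> int:
    """Frames left after a hypothetical bridge downsampling (ceiling division)."""
    if bridge_factor < 1:
        raise ValueError("bridge_factor must be a positive integer")
    return -(-hubert_frames(num_samples) // bridge_factor)


if __name__ == "__main__":
    one_second = 16_000  # samples in one second of 16 kHz audio
    print(hubert_frames(one_second))                           # 49
    print(decoder_prefix_length(one_second, bridge_factor=4))  # 13
```

Under these assumptions the encoder emits roughly 50 frames per second of audio, so the decoder sees a far shorter sequence than the raw waveform.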
# How to use the model
First, install the inference code for this model.

    pip install git+

Both a command-line interface and a Python interface are available.
## Command-line usage
The following command transcribes an audio file via the command-line interface.
Audio files are automatically downsampled to 16 kHz.

    nue-asr audio1.wav

You can also specify multiple audio files.

    nue-asr audio1.wav audio2.flac audio3.mp3
DeepSpeed-Inference can be used to accelerate inference in the GPT-NeoX module.
To use it, first install DeepSpeed.

    pip install deepspeed

Then enable DeepSpeed-Inference as follows:

    nue-asr --use-deepspeed audio1.wav

Run `nue-asr --help` for more information.
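If you want to drive the command-line interface from a script, a minimal sketch along the following lines may help. It uses only the `nue-asr` invocation forms shown above (one or more audio files and the optional `--use-deepspeed` flag). The helper names `deepspeed_installed`, `build_command`, and `run_cli` are hypothetical, and the format of the CLI output is not assumed.

```python
import importlib.util
import subprocess
from typing import List, Sequence


def deepspeed_installed() -> bool:
    """Return True if the `deepspeed` package can be found in this environment."""
    return importlib.util.find_spec("deepspeed") is not None


def build_command(audio_files: Sequence[str], use_deepspeed: bool = False) -> List[str]:
    """Build the argument list for a `nue-asr` invocation."""
    if not audio_files:
        raise ValueError("at least one audio file is required")
    command = ["nue-asr"]
    if use_deepspeed:
        command.append("--use-deepspeed")
    command.extend(audio_files)
    return command


def run_cli(audio_files: Sequence[str], use_deepspeed: bool = False) -> str:
    """Run `nue-asr` and return whatever it writes to standard output."""
    completed = subprocess.run(
        build_command(audio_files, use_deepspeed),
        check=True,           # raise CalledProcessError on a non-zero exit code
        capture_output=True,
        text=True,
    )
    return completed.stdout
```

For example, `run_cli(["audio1.wav", "audio2.flac"], use_deepspeed=deepspeed_installed())` would request DeepSpeed-Inference only when DeepSpeed is actually installed.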
## Python usage
An example of the Python interface is as follows:

    import nue_asr

    model = nue_asr.load_model("rinna/nue-asr")
    tokenizer = nue_asr.load_tokenizer("rinna/nue-asr")

    result = nue_asr.transcribe(model, tokenizer, "path_to_audio.wav")

The `nue_asr.transcribe` function accepts audio data as either a `numpy.array` or a `torch.Tensor`, in addition to audio file paths.
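The following sketch shows one way to pass in-memory audio rather than a file path. It assumes that an array input should already be mono, sampled at 16 kHz, and scaled to floating-point values in [-1.0, 1.0); this card does not spell out those requirements, so treat them as assumptions. The helpers `read_wav_mono16` and `transcribe_samples` are hypothetical names introduced for illustration.

```python
import array
import sys
import wave
from typing import List


def read_wav_mono16(path: str, expected_rate: int = 16_000) -> List[float]:
    """Read a mono 16-bit PCM WAV file into floats in [-1.0, 1.0)."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected mono 16-bit PCM audio")
        if wav.getframerate() != expected_rate:
            raise ValueError(f"expected {expected_rate} Hz audio")
        raw = wav.readframes(wav.getnframes())
    pcm = array.array("h")
    pcm.frombytes(raw)
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV sample data is little-endian
    return [sample / 32768.0 for sample in pcm]


def transcribe_samples(samples: List[float]):
    """Transcribe in-memory samples (assumed: mono, 16 kHz, float)."""
    # Imported lazily so the reader above works without these packages.
    import numpy as np
    import nue_asr

    model = nue_asr.load_model("rinna/nue-asr")
    tokenizer = nue_asr.load_tokenizer("rinna/nue-asr")
    waveform = np.asarray(samples, dtype=np.float32)
    return nue_asr.transcribe(model, tokenizer, waveform)
```

A call such as `transcribe_samples(read_wav_mono16("path_to_audio.wav"))` would then run recognition on the in-memory waveform. Passing a file path directly remains the simpler option when the audio is already on disk.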
DeepSpeed-Inference acceleration is also available through the Python interface.

    import nue_asr

    model = nue_asr.load_model("rinna/nue-asr", use_deepspeed=True)
    tokenizer = nue_asr.load_tokenizer("rinna/nue-asr")

    result = nue_asr.transcribe(model, tokenizer, "path_to_audio.wav")
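To check on your own hardware whether recognition runs faster than real time, with or without DeepSpeed-Inference, one option is to measure the real-time factor: processing time divided by audio duration. The sketch below is a minimal version that works for WAV files only. `real_time_factor` and `wav_duration_seconds` are hypothetical helper names, and the transcription call is passed in as a function so the timing code does not depend on any particular setup.

```python
import time
import wave
from typing import Callable


def wav_duration_seconds(path: str) -> float:
    """Duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def real_time_factor(transcribe_fn: Callable[[str], object], path: str) -> float:
    """Return processing time divided by audio duration for one WAV file."""
    duration = wav_duration_seconds(path)
    if duration <= 0:
        raise ValueError("audio file is empty")
    start = time.perf_counter()
    transcribe_fn(path)  # the call being timed
    elapsed = time.perf_counter() - start
    return elapsed / duration
```

With a loaded model and tokenizer, this could be called as `real_time_factor(lambda p: nue_asr.transcribe(model, tokenizer, p), "path_to_audio.wav")`. A value below 1.0 means the audio was processed faster than real time. The first call may include one-off warm-up costs, so timing a second run is usually more representative.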
# Tokenization
The model uses the same sentencepiece-based tokenizer as japanese-gpt-neox-3.6b.
# How to cite
Hono, Yukiya; Mitsuda, Koh; Zhao, Tianyu; Mitsui, Kentaro; Wakatsuki, Toshiaki; Sawada, Kei. "An Integration of Pre-Trained Speech and Language Models for End-to-End Speech Recognition." arXiv preprint arXiv:2312.03668.
# Citations
- Hsu, Wei-Ning; Bolte, Benjamin; Tsai, Yao-Hung Hubert; Lakhotia, Kushal; Salakhutdinov, Ruslan; Mohamed, Abdelrahman. "HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units." IEEE/ACM Transactions on Audio, Speech, and Language Processing.
- Andonian, Alex; Anthony, Quentin; Biderman, Stella; Black, Sid; Gali, Preetham; Gao, Leo; Hallahan, Eric; Levy-Kramer, Josh; Leahy, Connor; Nestler, Lucas; Parker, Kip; Pieler, Michael; Purohit, Shivanshu; Songz, Tri; Phil, Wang; Weinbach, Samuel. "GPT-NeoX: Large Scale Autoregressive Language Modeling in PyTorch."
# License
The Apache 2.0 license