	---
	thumbnail: https://github.com/rinnakk/japanese-pretrained-models/blob/master/rinna.png
	language: ja
	datasets:
	- reazon-research/reazonspeech
	tags:
	- automatic-speech-recognition
	- speech
	- audio
	- hubert
	- gpt_neox
	- asr
	- nlp
	license: apache-2.0
	---

	# `rinna/nue-asr`

	![rinna-icon](./rinna.png)

	# Overview
	[[Paper]](https://arxiv.org/abs/2312.03668)
	[[GitHub]](https://github.com/rinnakk/nue-asr)

	We propose a novel end-to-end speech recognition model, `Nue ASR`, which integrates pre-trained speech and language models.

	The name `Nue` comes from the Japanese word [`鵺/ぬえ/Nue`](https://en.wikipedia.org/wiki/Nue), the name of a legendary Japanese creature ([`妖怪/ようかい/Yōkai`](https://en.wikipedia.org/wiki/Y%C5%8Dkai)).

	This model performs highly accurate Japanese speech recognition.
	On a GPU, it can transcribe speech faster than real time.
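
"Faster than real time" is commonly expressed as a real-time factor (RTF): processing time divided by audio duration, where a value below 1.0 means the audio is transcribed faster than it plays. The following is a minimal sketch of that arithmetic. The helper names `real_time_factor` and `timed` are hypothetical and not part of `nue_asr`, and the numbers are illustrative, not measurements of this model.

```python
import time
from typing import Callable


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (the RTF)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def timed(fn: Callable[[], object]) -> float:
    """Run fn once and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


# Illustrative numbers only (not a benchmark of this model):
# 2.5 s of processing for a 10 s clip.
rtf = real_time_factor(processing_seconds=2.5, audio_seconds=10.0)
print(f"RTF = {rtf:.2f}")  # an RTF below 1.0 means faster than real time
```

To measure a real run, one could wrap the transcription call in `timed(...)` and divide by the clip length in seconds.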

	Benchmark scores, including those of our models, are available at https://rinnakk.github.io/research/benchmarks/asr/

	* Model architecture

	This model consists of three main components: HuBERT audio encoder, bridge network, and GPT-NeoX decoder.
	The HuBERT and GPT-NeoX weights were initialized from the following pre-trained models, respectively.
	- [japanese-hubert-base](https://huggingface.co/rinna/japanese-hubert-base)
	- [japanese-gpt-neox-3.6b](https://huggingface.co/rinna/japanese-gpt-neox-3.6b)
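
To get a feel for the data flow through the three components, the sketch below estimates how many frames the audio encoder emits for a 16 kHz waveform. It assumes the standard HuBERT Base convolutional feature extractor (seven layers, roughly one frame per 20 ms), which is presumed but not confirmed here to apply to `japanese-hubert-base`. This README does not describe the bridge network's internals, so `bridge_downsample` is a purely hypothetical placeholder parameter, and the function names are illustrative only.

```python
# Kernel sizes and strides of the convolutional feature extractor in the
# standard HuBERT Base configuration (assumed, not taken from this README).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def encoder_frames(num_samples: int) -> int:
    """Number of encoder output frames for a 16 kHz waveform."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        # Output length of an unpadded 1-D convolution.
        length = (length - kernel) // stride + 1
    return max(length, 0)


def decoder_audio_positions(num_frames: int, bridge_downsample: int) -> int:
    """Sequence length after a hypothetical bridge that shortens the input."""
    return num_frames // bridge_downsample


frames = encoder_frames(16_000)  # one second of 16 kHz audio
# bridge_downsample=4 is a placeholder value, not a documented setting.
positions = decoder_audio_positions(frames, bridge_downsample=4)
print(frames, positions)
```

The point of the sketch is only that the decoder sees a sequence far shorter than the raw waveform; the exact lengths depend on details documented in the paper and repository.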

	* Training

	The model was trained on approximately 19,000 hours of the following Japanese speech corpus.
	- [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech)


	* Authors

	- [Yukiya Hono](https://huggingface.co/yky-h)
	- [Koh Mitsuda](https://huggingface.co/mitsu-koh)
	- [Tianyu Zhao](https://huggingface.co/tianyuz)
	- [Kentaro Mitsui](https://huggingface.co/Kentaro321)
	- [Toshiaki Wakatsuki](https://huggingface.co/t-w)
	- [Kei Sawada](https://huggingface.co/keisawada)

	---

	# How to use the model

	First, install the inference code for this model.

	```bash
	pip install git+https://github.com/rinnakk/nue-asr.git
	```

	Both a command-line interface and a Python interface are available.

	## Command-line usage
	The following command transcribes an audio file from the command line.
	Audio files are automatically downsampled to 16 kHz.
	```bash
	nue-asr audio1.wav
	```
	You can specify multiple audio files.
	```bash
	nue-asr audio1.wav audio2.flac audio3.mp3
	```

	DeepSpeed-Inference can be used to accelerate inference in the GPT-NeoX module.
	To use it, first install DeepSpeed.
	```bash
	pip install deepspeed
	```

	Then, you can use DeepSpeed-Inference as follows:
	```bash
	nue-asr --use-deepspeed audio1.wav
	```

	Run `nue-asr --help` for more information.
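
When a directory holds many recordings, the command line can be assembled programmatically. The sketch below collects files with the extensions used in the examples above and builds the `nue-asr` argument list. The helper `build_nue_asr_command` is hypothetical, and combining `--use-deepspeed` with several files is an assumption based on the separate examples above.

```python
from pathlib import Path
from typing import List

# Extensions taken from the examples in this section.
AUDIO_SUFFIXES = {".wav", ".flac", ".mp3"}


def build_nue_asr_command(directory: str, use_deepspeed: bool = False) -> List[str]:
    """Build a `nue-asr` argument list for all audio files in a directory."""
    files = sorted(
        path
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() in AUDIO_SUFFIXES
    )
    if not files:
        raise ValueError(f"no audio files found in {directory}")
    command = ["nue-asr"]
    if use_deepspeed:
        command.append("--use-deepspeed")
    command.extend(str(path) for path in files)
    return command


# Example usage (requires nue-asr to be installed):
#   import subprocess
#   subprocess.run(build_nue_asr_command("recordings"), check=True)
```

Passing the command as a list to `subprocess.run` avoids shell quoting problems with file names that contain spaces.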

	## Python usage
	An example of the Python interface is as follows:
	```python
	import nue_asr

	model = nue_asr.load_model("rinna/nue-asr")
	tokenizer = nue_asr.load_tokenizer("rinna/nue-asr")

	result = nue_asr.transcribe(model, tokenizer, "path_to_audio.wav")
	print(result.text)
	```
	The `nue_asr.transcribe` function accepts audio data as either a `numpy.array` or a `torch.Tensor`, in addition to audio file paths.
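
As a sketch of the in-memory route, the snippet below reads a 16-bit PCM WAV file into a mono `float32` NumPy array using only the standard library and NumPy. It assumes that arrays passed to `nue_asr.transcribe` should already be 16 kHz mono with values in [-1, 1]; this README does not state that, so check the repository before relying on it. The helper `read_wav_as_float32` is hypothetical and not part of `nue_asr`.

```python
import wave

import numpy as np

TARGET_SAMPLE_RATE = 16_000  # assumed expected rate for in-memory audio


def read_wav_as_float32(path: str) -> np.ndarray:
    """Read a 16-bit PCM WAV file as a mono float32 array in [-1, 1]."""
    with wave.open(path, "rb") as wav_file:
        if wav_file.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM WAV files")
        if wav_file.getframerate() != TARGET_SAMPLE_RATE:
            raise ValueError("resample the audio to 16 kHz first")
        channels = wav_file.getnchannels()
        pcm = wav_file.readframes(wav_file.getnframes())
    # Convert little-endian int16 samples to floats in [-1, 1].
    samples = np.frombuffer(pcm, dtype="<i2").astype(np.float32) / 32768.0
    # Average the interleaved channels to obtain a mono signal.
    return samples.reshape(-1, channels).mean(axis=1).astype(np.float32)


# Example usage with the model and tokenizer loaded as shown above:
#   waveform = read_wav_as_float32("path_to_audio.wav")
#   result = nue_asr.transcribe(model, tokenizer, waveform)
#   print(result.text)
```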

	DeepSpeed-Inference acceleration is also available through the Python interface.
	```python
	import nue_asr

	model = nue_asr.load_model("rinna/nue-asr", use_deepspeed=True)
	tokenizer = nue_asr.load_tokenizer("rinna/nue-asr")

	result = nue_asr.transcribe(model, tokenizer, "path_to_audio.wav")
	print(result.text)
	```

	---

	# Tokenization
	The model uses the same sentencepiece-based tokenizer as [japanese-gpt-neox-3.6b](https://huggingface.co/rinna/japanese-gpt-neox-3.6b).

	---

	# How to cite
	```bibtex
	@article{hono2023integration,
	title={An Integration of Pre-Trained Speech and Language Models for End-to-End Speech Recognition},
	author={Hono, Yukiya and Mitsuda, Koh and Zhao, Tianyu and Mitsui, Kentaro and Wakatsuki, Toshiaki and Sawada, Kei},
	journal={arXiv preprint arXiv:2312.03668},
	year={2023}
	}

	@misc{rinna-nue-asr,
	title={rinna/nue-asr},
	author={Hono, Yukiya and Mitsuda, Koh and Zhao, Tianyu and Mitsui, Kentaro and Wakatsuki, Toshiaki and Sawada, Kei},
	url={https://huggingface.co/rinna/nue-asr}
	}
	```
	---

	# Citations
	```bibtex
	@article{hsu2021hubert,
	title={HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units},
	author={Hsu, Wei-Ning and Bolte, Benjamin and Tsai, Yao-Hung Hubert and Lakhotia, Kushal and Salakhutdinov, Ruslan and Mohamed, Abdelrahman},
	journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
	year={2021},
	volume={29},
	pages={3451-3460},
	doi={10.1109/TASLP.2021.3122291}
	}

	@software{andoniangpt2021gpt,
	title={{GPT-NeoX: Large Scale Autoregressive Language Modeling in PyTorch}},
	author={Andonian, Alex and Anthony, Quentin and Biderman, Stella and Black, Sid and Gali, Preetham and Gao, Leo and Hallahan, Eric and Levy-Kramer, Josh and Leahy, Connor and Nestler, Lucas and Parker, Kip and Pieler, Michael and Purohit, Shivanshu and Songz, Tri and Phil, Wang and Weinbach, Samuel},
	url={https://www.github.com/eleutherai/gpt-neox},
	doi={10.5281/zenodo.5879544},
	month={8},
	year={2021},
	version={0.0.1},
	}
	```
	---

	# License
	[The Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0)