---
language:
- ja
---

# Japanese GSLM

This is a Japanese implementation of the [Generative Spoken Language Model](https://github.com/facebookresearch/fairseq/tree/main/examples/textless_nlp/gslm) to support textless NLP in Japanese.
Submitted to the Acoustical Society of Japan, 2023 Spring meeting.
## How to use

- PyTorch version >= 1.10.0
- Python version >= 3.8

### Install requirements

The [fairseq](https://github.com/facebookresearch/fairseq/) library and all of its dependencies must be installed first:

```
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./
pip install librosa unidecode inflect
```

## Re-synthesis of voice signal

### speech2unit

The procedure for speech2unit is the same as in the gslm example in [fairseq](https://github.com/facebookresearch/fairseq/tree/main/examples/textless_nlp/gslm/speech2unit). You can convert a Japanese voice signal to discrete units with this [pre-trained quantization model](https://huggingface.co/nonmetal/gslm-japanese/resolve/main/hubert200_JPN.bin). Point ```KM_MODEL_PATH``` to the downloaded model. This file replaces the ```HuBERT Base + KM200``` model provided by fairseq, so the ```HuBERT-Base``` checkpoint must also be downloaded as the pretrained acoustic model. The manifest format expected at ```MANIFEST``` and the layout of the quantized output file are both sketched in the appendix at the end of this README.

```
TYPE='hubert'
CKPT_PATH=
LAYER=6
KM_MODEL_PATH=
MANIFEST=
OUT_QUANTIZED_FILE=

python examples/textless_nlp/gslm/speech2unit/clustering/quantize_with_kmeans.py \
    --feature_type $TYPE \
    --kmeans_model_path $KM_MODEL_PATH \
    --acoustic_model_path $CKPT_PATH \
    --layer $LAYER \
    --manifest_path $MANIFEST \
    --out_quantized_file_path $OUT_QUANTIZED_FILE \
    --extension ".wav"
```

### unit2speech

The unit2speech model is a modified Tacotron2 model that learns to synthesize speech from discrete speech units. You can convert the discrete units to a synthesized voice with this [model](https://huggingface.co/nonmetal/gslm-japanese/resolve/main/checkpoint_125k.pt). The [WaveGlow checkpoint](https://dl.fbaipublicfiles.com/textless_nlp/gslm/waveglow_256channels_new.pt) is also required as the vocoder.

Conversion from units to speech is done with ```unit2speech_ja.py``` from this repository; ```hparam.py``` is also required for extended compatibility.

```
TTS_MODEL_PATH=
OUT_DIR=
WAVEGLOW_PATH=

python unit2speech_ja.py \
    --tts_model_path $TTS_MODEL_PATH \
    --out_audio_dir $OUT_DIR \
    --waveglow_path $WAVEGLOW_PATH
```

## References

- Lakhotia, Kushal et al. On Generative Spoken Language Modeling from Raw Audio. Transactions of the Association for Computational Linguistics, 9:1336–1354, 2021.
- Ott, Myle et al. fairseq: A Fast, Extensible Toolkit for Sequence Modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, 2019.
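## Appendix: manifest format (sketch)

The file passed as ```MANIFEST``` to ```quantize_with_kmeans.py``` follows the standard fairseq wav2vec/HuBERT manifest layout: the first line is the root directory of the audio, and each subsequent line is a tab-separated pair of a relative wav path and its length in samples. A minimal sketch, with hypothetical file names:

```
/path/to/wav_root
utt001.wav	160000
utt002.wav	93680
```

fairseq ships a helper script that generates this file. Assuming the fairseq checkout from the install step, it can be run roughly as follows (```--valid-percent 0``` keeps all files in a single split):

```
# writes train.tsv under ./manifest, listing every .wav below /path/to/wav_root
python examples/wav2vec/wav2vec_manifest.py /path/to/wav_root \
    --dest ./manifest --ext wav --valid-percent 0
```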
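## Appendix: quantized unit format (assumed)

In the upstream gslm example, the quantization step writes one line per utterance to ```OUT_QUANTIZED_FILE```: the file id, a ```|``` separator, and the space-separated K-means unit ids (0–199 for a KM200 model). A hypothetical line:

```
utt001|71 12 57 57 63 102 4
```

These unit sequences are the discrete representation that unit2speech resynthesizes back into a waveform.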