---
language: 
  - ja
---

# Japanese GSLM

This is a Japanese implementation of the [Generative Spoken Language Model](https://github.com/facebookresearch/fairseq/tree/main/examples/textless_nlp/gslm) to support textless NLP in Japanese. Submitted to the Acoustical Society of Japan, 2023 Spring.

## How to use
- PyTorch version >= 1.10.0
- Python version >= 3.8

### Install requirements
First install the [fairseq](https://github.com/facebookresearch/fairseq/) library and all of its requirements.

```
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./

pip install librosa unidecode inflect
```
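
As a quick optional check (not part of the upstream instructions), you can verify that the installation meets the version requirements listed above:

```
# Optional sanity check: confirm Python >= 3.8, PyTorch >= 1.10.0,
# and that fairseq and the audio dependencies import cleanly.
python --version
python -c "import torch, fairseq, librosa; print(torch.__version__)"
```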

## Re-synthesis of voice signal
### speech2unit

The procedure for speech2unit is the same as in the GSLM example in [fairseq](https://github.com/facebookresearch/fairseq/tree/main/examples/textless_nlp/gslm/speech2unit).

You can convert a Japanese voice signal to discrete units through this [pre-trained quantization model](https://huggingface.co/nonmetal/gslm-japanese/resolve/main/hubert200_JPN.bin). Set ```KM_MODEL_PATH``` to the path of the downloaded model.

This file replaces the ```HuBERT Base + KM200``` model provided by fairseq, so you also need to download the ```HuBERT-Base``` model as the pretrained acoustic model.

```
TYPE='hubert'
CKPT_PATH=<path_of_pretrained_acoustic_model>
LAYER=6
KM_MODEL_PATH=<output_path_of_the_kmeans_model>
MANIFEST=<tab_separated_manifest_of_audio_files_to_quantize>
OUT_QUANTIZED_FILE=<output_quantized_audio_file_path>

python examples/textless_nlp/gslm/speech2unit/clustering/quantize_with_kmeans.py \
    --feature_type $TYPE \
    --kmeans_model_path $KM_MODEL_PATH \
    --acoustic_model_path $CKPT_PATH \
    --layer $LAYER \
    --manifest_path $MANIFEST \
    --out_quantized_file_path $OUT_QUANTIZED_FILE \
    --extension ".wav"
```
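
For reference, the manifest follows fairseq's usual layout: the first line is the root directory of the audio files, and each subsequent line gives a path relative to that root and the file's sample count, separated by a tab. A minimal sketch with placeholder file names and sample counts:

```
# Hypothetical manifest (manifest.tsv): the first line is the audio
# root, then one "<relative_path>\t<num_samples>" entry per file.
printf '%s\n' '/path/to/audio_root' > manifest.tsv
printf '%s\t%s\n' 'utt0001.wav' '160000' >> manifest.tsv
printf '%s\t%s\n' 'utt0002.wav' '224000' >> manifest.tsv
```

After the script finishes, ```$OUT_QUANTIZED_FILE``` should contain one line per utterance of the form ```<file_name>|<space-separated unit ids>```, as in the upstream GSLM example.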

### unit2speech

The unit2speech model is a modified Tacotron2 model that learns to synthesize speech from discrete speech units.
You can convert discrete units to synthesized speech through this [model](https://huggingface.co/nonmetal/gslm-japanese/resolve/main/checkpoint_125k.pt). You also need to download the [WaveGlow checkpoint](https://dl.fbaipublicfiles.com/textless_nlp/gslm/waveglow_256channels_new.pt), which serves as the vocoder.

Conversion from units to speech is done with ```unit2speech_ja.py``` from this repository. It also requires ```hparam.py```, included here for compatibility.

```
TTS_MODEL_PATH=<unit2speech_model_file_path>
OUT_DIR=<dir_to_dump_synthesized_audio_files>
WAVEGLOW_PATH=<path_where_you_have_downloaded_waveglow_checkpoint>

python unit2speech_ja.py \
    --tts_model_path $TTS_MODEL_PATH \
    --out_audio_dir $OUT_DIR \
    --waveglow_path $WAVEGLOW_PATH
```
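
A minimal end-to-end invocation might look like the following; the paths are placeholders for wherever you saved the downloaded checkpoints, not files shipped with this repository:

```
# Hypothetical checkpoint locations; substitute your own paths.
TTS_MODEL_PATH=checkpoints/checkpoint_125k.pt
OUT_DIR=synthesized_audio
WAVEGLOW_PATH=checkpoints/waveglow_256channels_new.pt

mkdir -p $OUT_DIR
python unit2speech_ja.py \
    --tts_model_path $TTS_MODEL_PATH \
    --out_audio_dir $OUT_DIR \
    --waveglow_path $WAVEGLOW_PATH
```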

## References
- Lakhotia, Kushal et al. On Generative Spoken Language Modeling from Raw Audio. Transactions of the Association for Computational Linguistics, 9:1336–1354, 2021. 
- Ott, Myle et al. fairseq: A Fast, Extensible Toolkit for Sequence Modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, 2019.