---
language:
- ja
---

# Japanese GSLM

This is a Japanese implementation of the [Generative Spoken Language Model](https://github.com/facebookresearch/fairseq/tree/main/examples/textless_nlp/gslm) (GSLM) to support textless NLP in Japanese. Submitted to the Acoustical Society of Japan, 2023 Spring meeting.

## How to use
- PyTorch version >= 1.10.0
- Python version >= 3.8

### Install requirements
You first need to install the [fairseq](https://github.com/facebookresearch/fairseq/) library and all of its dependencies:

```
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./

pip install librosa unidecode inflect
```
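
Optionally, you can sanity-check the installation with a one-liner that just imports the packages installed above:

```
python -c "import fairseq, librosa, unidecode, inflect; print(fairseq.__version__)"
```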

## Re-synthesis of a voice signal
### speech2unit

The procedure for speech2unit is the same as in the GSLM example in [fairseq](https://github.com/facebookresearch/fairseq/tree/main/examples/textless_nlp/gslm/speech2unit).

You can convert a Japanese speech signal to discrete units with this [pre-trained quantization model](https://huggingface.co/nonmetal/gslm-japanese/resolve/main/hubert200_JPN.bin). Point `KM_MODEL_PATH` to the downloaded model.

This file replaces the `HuBERT Base + KM200` model provided by fairseq, so you also need to download the `HuBERT-Base` model to use as the pretrained acoustic model.
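
As an example, the `HuBERT-Base` checkpoint is the one distributed with fairseq; the URL below is taken from the fairseq HuBERT documentation, so verify it is still current:

```
wget https://dl.fbaipublicfiles.com/hubert/hubert_base_ls960.pt
CKPT_PATH=./hubert_base_ls960.pt
```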

```
TYPE='hubert'
CKPT_PATH=<path_of_pretrained_acoustic_model>
LAYER=6
KM_MODEL_PATH=<path_of_the_downloaded_kmeans_model>
MANIFEST=<tab_separated_manifest_of_audio_files_to_quantize>
OUT_QUANTIZED_FILE=<output_quantized_audio_file_path>

python examples/textless_nlp/gslm/speech2unit/clustering/quantize_with_kmeans.py \
    --feature_type $TYPE \
    --kmeans_model_path $KM_MODEL_PATH \
    --acoustic_model_path $CKPT_PATH \
    --layer $LAYER \
    --manifest_path $MANIFEST \
    --out_quantized_file_path $OUT_QUANTIZED_FILE \
    --extension ".wav"
```
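
The manifest uses the standard fairseq wav2vec-style format: the first line is the root directory of the audio files, and each following line gives a relative path and its sample count, separated by a tab. A minimal illustrative example (file names and sample counts are made up):

```
/path/to/japanese_audio
utt0001.wav	160000
utt0002.wav	240000
```

The resulting quantized file contains one line per utterance of the form `<utterance_id>|<space-separated unit IDs>`; these units are what the unit2speech step below synthesizes from.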

### unit2speech

The unit2speech model is a modified Tacotron2 model that learns to synthesize speech from discrete speech units.
You can convert the discrete units to synthesized speech with this [model](https://huggingface.co/nonmetal/gslm-japanese/resolve/main/checkpoint_125k.pt). You also need to download the [WaveGlow checkpoint](https://dl.fbaipublicfiles.com/textless_nlp/gslm/waveglow_256channels_new.pt) used as the vocoder.

```
TTS_MODEL_PATH=<unit2speech_model_file_path>
OUT_DIR=<dir_to_dump_synthesized_audio_files>
WAVEGLOW_PATH=<path_where_you_have_downloaded_waveglow_checkpoint>

python unit2speech_ja.py \
    --tts_model_path $TTS_MODEL_PATH \
    --out_audio_dir $OUT_DIR \
    --waveglow_path $WAVEGLOW_PATH
```
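
Before synthesis, GSLM-style pipelines typically collapse runs of repeated units into a single unit. A minimal Python sketch of that step, assuming the `<utterance_id>|<unit IDs>` line format produced by the quantization step above (`dedupe_units` is an illustrative helper, not part of this repository):

```
from itertools import groupby
from typing import List, Tuple

def dedupe_units(line: str) -> Tuple[str, List[int]]:
    """Parse one quantized line and collapse consecutive repeated unit IDs."""
    utt_id, units = line.strip().split("|")
    ids = [int(u) for u in units.split()]
    return utt_id, [key for key, _ in groupby(ids)]

# Made-up example line: the unit run '5 5 5 9 9 2' collapses to [5, 9, 2]
print(dedupe_units("utt0001|5 5 5 9 9 2"))  # ('utt0001', [5, 9, 2])
```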

## References
- Lakhotia, Kushal, et al. "On Generative Spoken Language Modeling from Raw Audio." Transactions of the Association for Computational Linguistics, 9:1336–1354, 2021.
- Ott, Myle, et al. "fairseq: A Fast, Extensible Toolkit for Sequence Modeling." In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, 2019.