ddd committed on
Commit 40e984c
1 Parent(s): c4e83e4

pndm codes

.gitattributes CHANGED
@@ -30,3 +30,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zstandard filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 model_ckpt_steps* filter=lfs diff=lfs merge=lfs -text
+ checkpoints/0831_opencpop_ds1000 filter=lfs diff=lfs merge=lfs -text
docs/README-SVS-opencpop-cascade.md CHANGED
@@ -3,7 +3,7 @@
 [![GitHub Stars](https://img.shields.io/github/stars/MoonInTheRiver/DiffSinger?style=social)](https://github.com/MoonInTheRiver/DiffSinger)
 [![downloads](https://img.shields.io/github/downloads/MoonInTheRiver/DiffSinger/total.svg)](https://github.com/MoonInTheRiver/DiffSinger/releases)

- ## DiffSinger (MIDI version SVS)
+ ## DiffSinger (MIDI SVS | A version)
 ### 0. Data Acquirement
 For Opencpop dataset: Please strictly follow the instructions of [Opencpop](https://wenet.org.cn/opencpop/). We have no right to give you access to Opencpop.

@@ -67,7 +67,7 @@ CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/midi/cascade/ope

 Remember to adjust the "fs2_ckpt" parameter in `usr/configs/midi/cascade/opencs/ds60_rel.yaml` to fit your path.

- ### 3. Inference Example
+ ### 3. Inference from packed test set
 ```sh
 CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/midi/cascade/opencs/ds60_rel.yaml --exp_name $MY_DS_EXP_NAME --reset --infer
 ```
@@ -82,7 +82,7 @@ Remember to put the pre-trained models in `checkpoints` directory.

 ### 4. Inference from raw inputs
 ```sh
- python inference/svs/ds_e2e.py --config usr/configs/midi/cascade/opencs/ds60_rel.yaml --exp_name $MY_DS_EXP_NAME
+ python inference/svs/ds_cascade.py --config usr/configs/midi/cascade/opencs/ds60_rel.yaml --exp_name $MY_DS_EXP_NAME
 ```
 Raw inputs:
 ```
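The raw-input entry point above now points at `ds_cascade.py`, which (like `ds_e2e.py` later in this commit) takes a dictionary of aligned sequences and hands it to `example_run`. A minimal sketch of a phoneme-type input; the keys follow `base_svs_infer.py` and `ds_cascade.py` in this commit, while the note names, durations, and slur flags below are illustrative placeholders, not real annotations:

```python
# Hypothetical input for inference/svs/ds_cascade.py; the keys mirror
# base_svs_infer.py and ds_e2e.py in this commit, the values are made up.
inp = {
    'text': '小酒窝',
    'ph_seq': 'x iao j iu w o',                                      # space-separated phonemes
    'note_seq': 'C#4/Db4 C#4/Db4 F#4/Gb4 F#4/Gb4 G#4/Ab4 G#4/Ab4',   # one note per phoneme
    'note_dur_seq': '0.4 0.4 0.37 0.37 0.24 0.24',                   # note durations in seconds
    'is_slur_seq': '0 0 0 0 0 0',                                    # 1 marks a slurred phoneme
    'input_type': 'phoneme',                                         # as in ds_cascade.py's example
}
# DiffSingerCascadeInfer.example_run(inp)  # requires the repo, config and checkpoints
```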
docs/README-SVS-opencpop-e2e.md CHANGED
@@ -2,13 +2,14 @@
 [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2105.02446)
 [![GitHub Stars](https://img.shields.io/github/stars/MoonInTheRiver/DiffSinger?style=social)](https://github.com/MoonInTheRiver/DiffSinger)
 [![downloads](https://img.shields.io/github/downloads/MoonInTheRiver/DiffSinger/total.svg)](https://github.com/MoonInTheRiver/DiffSinger/releases)
+ | [Interactive🤗 SVS](https://huggingface.co/spaces/Silentlin/DiffSinger)

 Substantial update: We 1) **abandon** the explicit prediction of the F0 curve; 2) increase the receptive field of the denoiser; 3) make the linguistic encoder more robust.
 **By doing so, 1) the synthesized recordings are more natural in terms of pitch; 2) the pipeline is simpler.**

 In short: we hand the dynamics of the F0 curve over to the generative model to capture, instead of constraining log-domain F0 with an MSE loss as before.

- ## DiffSinger (MIDI version SVS)
+ ## DiffSinger (MIDI SVS | B version)
 ### 0. Data Acquirement
 For Opencpop dataset: Please strictly follow the instructions of [Opencpop](https://wenet.org.cn/opencpop/). We have no right to give you access to Opencpop.
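To make "MSE-constrained log-domain F0" concrete: the abandoned baseline regresses log-F0 with a mean-squared error on voiced frames, which is exactly where bad uv/v decisions bite. A toy sketch with random placeholder tensors (none of this is the repo's code):

```python
import torch
import torch.nn.functional as F

# Toy illustration of the abandoned objective: regressing log-F0 with MSE.
# Random placeholders stand in for model outputs and ground truth.
f0_pred = 100 + 200 * torch.rand(2, 100)   # predicted F0 in Hz, [B, T]
f0_gt = 100 + 200 * torch.rand(2, 100)     # ground-truth F0 in Hz, [B, T]
voiced = torch.rand(2, 100) > 0.2          # uv/v mask; mispredicting it is the pain point
loss = F.mse_loss(torch.log(f0_pred[voiced]), torch.log(f0_gt[voiced]))
```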
 
docs/README-SVS-popcs.md CHANGED
@@ -54,7 +54,7 @@ Remember to put the pre-trained models in `checkpoints` directory.
 *Note that:*

 - *the original PWG version vocoder we used in the paper has been put into commercial use, so we provide this HifiGAN version vocoder as a substitute.*
- - *we assume the ground-truth F0 to be given as the pitch information following [1][2][3]. If you want to conduct experiments on MIDI data, you need an external F0 predictor (like [MIDI-old-version](README-SVS-opencpop-cascade.md)) or a joint prediction with spectrograms (like [MIDI-new-version](README-SVS-opencpop-e2e.md)).*
+ - *we assume the ground-truth F0 to be given as the pitch information following [1][2][3]. If you want to conduct experiments on MIDI data, you need an external F0 predictor (like [MIDI-A-version](README-SVS-opencpop-cascade.md)) or a joint prediction with spectrograms (like [MIDI-B-version](README-SVS-opencpop-e2e.md)).*

 [1] Adversarially trained multi-singer sequence-to-sequence singing synthesizer. Interspeech 2020.

docs/README-SVS.md CHANGED
@@ -1,7 +1,13 @@
- ## DiffSinger (SVS version)
+ # DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism
+ [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2105.02446)
+ [![GitHub Stars](https://img.shields.io/github/stars/MoonInTheRiver/DiffSinger?style=social)](https://github.com/MoonInTheRiver/DiffSinger)
+ [![downloads](https://img.shields.io/github/downloads/MoonInTheRiver/DiffSinger/total.svg)](https://github.com/MoonInTheRiver/DiffSinger/releases)
+ | [Interactive🤗 SVS](https://huggingface.co/spaces/Silentlin/DiffSinger)
+
+ ## DiffSinger (SVS)

 ### PART1. [Run DiffSinger on PopCS](README-SVS-popcs.md)
- In this part, we only focus on spectrum modeling (acoustic model) and assume the ground-truth (GT) F0 to be given as the pitch information following these papers [1][2][3].
+ In PART1, we only focus on spectrum modeling (acoustic model) and assume the ground-truth (GT) F0 to be given as the pitch information following these papers [1][2][3]. If you want to conduct experiments with F0 prediction, please move to PART2.

 Thus, the pipeline of this part can be summarized as:

@@ -18,13 +24,16 @@ Thus, the pipeline of this part can be summarized as:

 [3] DeepSinger: Singing Voice Synthesis with Data Mined From the Web. KDD 2020.

+ Click here for detailed instructions: [link](README-SVS-popcs.md).
+
+
 ### PART2. [Run DiffSinger on Opencpop](README-SVS-opencpop-cascade.md)
- Thanks to the [Opencpop team](https://wenet.org.cn/opencpop/) for releasing their SVS dataset with MIDI labels, **Jan. 20, 2022**. (Also thanks to my co-author [Yi Ren](https://github.com/RayeRen), who applied for the dataset and did some preprocessing work for this part.)
+ Thanks to the [Opencpop team](https://wenet.org.cn/opencpop/) for releasing their SVS dataset with MIDI labels on **Jan. 20, 2022** (after we published our paper).

 Since there are elaborately annotated MIDI labels, we are able to supplement the pipeline in PART 1 by adding a naive melody frontend.

- #### 2.1
- Thus, the pipeline of [this part](README-SVS-opencpop-cascade.md) can be summarized as:
+ #### 2.A
+ Thus, the pipeline of [2.A](README-SVS-opencpop-cascade.md) can be summarized as:

 ```
 [lyrics] + [MIDI] -> [linguistic representation (with MIDI information)] + [predicted F0] + [predicted phoneme duration] (Melody frontend)
@@ -32,13 +41,36 @@ Thus, the pipeline of [this part](README-SVS-opencpop-cascade.md) can be summari
 [mel-spectrogram] + [predicted F0] -> [waveform] (Vocoder)
 ```

- #### 2.2
- In 2.1, we find that if we predict F0 explicitly in the melody frontend, there will be many bad cases of uv/v prediction. Then, we abandon the explicit prediction of the F0 curve in the melody frontend but make a joint prediction with spectrograms.
+ Click here for detailed instructions: [link](README-SVS-opencpop-cascade.md).
+
+ #### 2.B
+ In 2.A, we find that if we predict F0 explicitly in the melody frontend, there will be many bad cases of uv/v prediction. Therefore, we abandon the explicit prediction of the F0 curve in the melody frontend and make a joint prediction with the spectrogram.

- Thus, the pipeline of [this part](README-SVS-opencpop-e2e.md) can be summarized as:
+ Thus, the pipeline of [2.B](README-SVS-opencpop-e2e.md) can be summarized as:
 ```
 [lyrics] + [MIDI] -> [linguistic representation] + [predicted phoneme duration] (Melody frontend)
 [linguistic representation (with MIDI information)] + [predicted phoneme duration] -> [mel-spectrogram] (Acoustic model)
 [mel-spectrogram] -> [predicted F0] (Pitch extractor)
 [mel-spectrogram] + [predicted F0] -> [waveform] (Vocoder)
- ```
+ ```
+
+ Click here for detailed instructions: [link](README-SVS-opencpop-e2e.md).
+
+ ### FAQ
+ Q1: Why do I need F0 in the vocoder?
+
+ A1: See the vocoder parts of HiFiSinger, DiffSinger, or SingGAN; this is common practice now.
+
+ Q2: Why not run the MIDI version of SVS on the PopCS dataset? Or why not release MIDI labels for the PopCS dataset?
+
+ A2: Our laboratory has no funds to label the PopCS dataset, but there are funds for labeling another singing dataset, which is coming soon.
+
+ Q3: Why do I get "'HifiGAN' object has no attribute 'model'"?
+
+ A3: Please put the pretrained vocoders in your `checkpoints` directory.
+
+ Q4: How can I check whether GT or predicted information is used during inference from the packed test set?
+
+ A4: Please see the code [here](https://github.com/MoonInTheRiver/DiffSinger/blob/55e2f46068af6e69940a9f8f02d306c24a940cab/tasks/tts/fs2.py#L343).
+
+ ...
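The 2.B pipeline summary above reads naturally as a straight composition of four stages. A schematic sketch with stub functions (this is not the repo's API; every name, rate, and shape here is an illustrative assumption):

```python
import numpy as np

# Stage stubs only: in the repo each stage is a trained neural network.
def melody_frontend(lyrics, midi_notes):
    """[lyrics] + [MIDI] -> linguistic representation + predicted phoneme durations."""
    ling = list(zip(lyrics.split(), midi_notes))
    return ling, [0.3] * len(ling)                    # placeholder durations (seconds)

def acoustic_model(ling, durs, hop_s=512 / 24000):
    """Linguistic representation (with MIDI) + durations -> mel-spectrogram."""
    return np.zeros((int(sum(durs) / hop_s), 80))     # [T_frames, n_mels]

def pitch_extractor(mel):
    """Mel-spectrogram -> predicted F0 (jointly, not by the frontend)."""
    return np.zeros(len(mel))

def vocoder(mel, f0, hop=512):
    """Mel-spectrogram + F0 -> waveform."""
    return np.zeros(len(mel) * hop)

ling, durs = melody_frontend('x iao', ['C#4/Db4', 'C#4/Db4'])
mel = acoustic_model(ling, durs)
wav = vocoder(mel, pitch_extractor(mel))
print(mel.shape, wav.shape)
```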
docs/README-TTS.md CHANGED
@@ -1,4 +1,10 @@
- ## DiffSpeech (TTS version)
+ # DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism
+ [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2105.02446)
+ [![GitHub Stars](https://img.shields.io/github/stars/MoonInTheRiver/DiffSinger?style=social)](https://github.com/MoonInTheRiver/DiffSinger)
+ [![downloads](https://img.shields.io/github/downloads/MoonInTheRiver/DiffSinger/total.svg)](https://github.com/MoonInTheRiver/DiffSinger/releases)
+ | [Interactive🤗 TTS](https://huggingface.co/spaces/NATSpeech/DiffSpeech)
+
+ ## DiffSpeech (TTS)
 ### 1. Preparation

 #### Data Preparation
inference/svs/base_svs_infer.py CHANGED
@@ -142,7 +142,7 @@ class BaseSVSInfer:
         ph_seq = inp['ph_seq']
         note_lst = inp['note_seq'].split()
         midi_dur_lst = inp['note_dur_seq'].split()
-         is_slur = inp['is_slur_seq'].split()
+         is_slur = [float(x) for x in inp['is_slur_seq'].split()]
         print(len(note_lst), len(ph_seq.split()), len(midi_dur_lst))
         if len(note_lst) == len(ph_seq.split()) == len(midi_dur_lst):
             print('Pass word-notes check.')
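The one-line change above converts the slur flags from strings to numbers at parse time, presumably so they can feed numeric ops and embeddings downstream without a later cast. Its effect in isolation:

```python
# Before vs. after the change above, on a toy slur sequence.
inp = {'is_slur_seq': '0 0 1 0'}
old = inp['is_slur_seq'].split()                      # ['0', '0', '1', '0'] -- strings
new = [float(x) for x in inp['is_slur_seq'].split()]  # [0.0, 0.0, 1.0, 0.0] -- numbers
assert all(isinstance(x, float) for x in new)
```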
inference/svs/ds_cascade.py CHANGED
@@ -52,3 +52,5 @@ if __name__ == '__main__':
         'input_type': 'phoneme'
     }  # input like Opencpop dataset.
     DiffSingerCascadeInfer.example_run(inp)
+
+ # CUDA_VISIBLE_DEVICES=1 python inference/svs/ds_cascade.py --config usr/configs/midi/cascade/opencs/ds60_rel.yaml --exp_name 0303_opencpop_ds58_midi
inference/svs/ds_e2e.py CHANGED
@@ -53,7 +53,7 @@ if __name__ == '__main__':
         'notes_duration': '0.407140 | 0.376190 | 0.242180 | 0.509550 0.183420 | 0.315400 0.235020 | 0.361660 | 0.223070 | 0.377270 | 0.340550 | 0.299620 | 0.344510 | 0.283770 | 0.323390 | 0.360340',
         'input_type': 'word'
     }  # user input: Chinese characters
-     c = {
+     inp = {
         'text': '小酒窝长睫毛AP是你最美的记号',
         'ph_seq': 'x iao j iu w o ch ang ang j ie ie m ao AP sh i n i z ui m ei d e j i h ao',
         'note_seq': 'C#4/Db4 C#4/Db4 F#4/Gb4 F#4/Gb4 G#4/Ab4 G#4/Ab4 A#4/Bb4 A#4/Bb4 F#4/Gb4 F#4/Gb4 F#4/Gb4 C#4/Db4 C#4/Db4 C#4/Db4 rest C#4/Db4 C#4/Db4 A#4/Bb4 A#4/Bb4 G#4/Ab4 G#4/Ab4 A#4/Bb4 A#4/Bb4 G#4/Ab4 G#4/Ab4 F4 F4 C#4/Db4 C#4/Db4',
@@ -64,4 +64,4 @@ if __name__ == '__main__':
     DiffSingerE2EInfer.example_run(inp)


- # python inference/svs/ds_e2e.py --config usr/configs/midi/e2e/opencpop/ds100_adj_rel.yaml --exp_name 0228_opencpop_ds100_rel
+ # CUDA_VISIBLE_DEVICES=3 python inference/svs/ds_e2e.py --config usr/configs/midi/e2e/opencpop/ds100_adj_rel.yaml --exp_name 0228_opencpop_ds100_rel
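The `c = {` to `inp = {` rename above matters because of Python name binding: the script builds two example inputs, and `example_run(inp)` only sees whatever `inp` last pointed to. Before the fix, the second (phoneme-level) example was bound to an unused name `c` and never ran:

```python
# Minimal illustration of the rebinding fixed above.
inp = {'input_type': 'word'}      # first example input
inp = {'input_type': 'phoneme'}   # rebinding: this is what example_run(inp) receives
assert inp['input_type'] == 'phoneme'
```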
inference/svs/gradio/infer.py CHANGED
@@ -88,4 +88,4 @@ if __name__ == '__main__':

 # python inference/svs/gradio/infer.py --config usr/configs/midi/cascade/opencs/ds60_rel.yaml --exp_name 0303_opencpop_ds58_midi
 # python inference/svs/ds_cascade.py --config usr/configs/midi/cascade/opencs/ds60_rel.yaml --exp_name 0303_opencpop_ds58_midi
- # CUDA_VISIBLE_DEVICES=3 python inference/svs/gradio/infer.py --config usr/configs/midi/e2e/opencpop/ds100_adj_rel.yaml --exp_name 0228_opencpop_ds100_rel
+ # CUDA_VISIBLE_DEVICES=3 python inference/svs/gradio/infer.py --config usr/configs/midi/e2e/opencpop/ds100_adj_rel.yaml --exp_name 0228_opencpop_ds100_rel
modules/diffsinger_midi/fs2.py CHANGED
@@ -116,3 +116,113 @@ class FastSpeech2MIDI(FastSpeech2):
         ret['mel_out'] = self.run_decoder(decoder_inp, tgt_nonpadding, ret, infer=infer, **kwargs)

         return ret
+
+     def add_pitch(self, decoder_inp, f0, uv, mel2ph, ret, encoder_out=None):
+         decoder_inp = decoder_inp.detach() + hparams['predictor_grad'] * (decoder_inp - decoder_inp.detach())
+         pitch_padding = mel2ph == 0
+         if hparams['pitch_ar']:
+             ret['pitch_pred'] = pitch_pred = self.pitch_predictor(decoder_inp, f0 if self.training else None)
+             if f0 is None:
+                 f0 = pitch_pred[:, :, 0]
+         else:
+             ret['pitch_pred'] = pitch_pred = self.pitch_predictor(decoder_inp)
+             if f0 is None:
+                 f0 = pitch_pred[:, :, 0]
+             if hparams['use_uv'] and uv is None:
+                 uv = pitch_pred[:, :, 1] > 0
+
+         # here f0_denorm for pitch prediction
+         ret['f0_denorm'] = denorm_f0(f0, uv, hparams, pitch_padding=pitch_padding)
+
+         # here f0_denorm for mel prediction
+         if self.training:
+             mask = torch.full(uv.shape, hparams.get('mask_uv_prob', 0.)).to(f0.device)
+             masked_uv = torch.bernoulli(mask).bool().to(f0.device)  # emit a random uv flag with probability `mask_uv_prob`
+             uv_masked = uv.bool() | masked_uv
+             # print((uv.float() - uv_masked.float()).mean(dim=1))
+             f0_denorm = denorm_f0(f0, uv_masked, hparams, pitch_padding=pitch_padding)
+         else:
+             f0_denorm = ret['f0_denorm']
+
+         if pitch_padding is not None:
+             f0[pitch_padding] = 0
+
+         pitch = f0_to_coarse(f0_denorm)  # start from 0
+         pitch_embed = self.pitch_embed(pitch)
+         return pitch_embed
+
+
+ class FastSpeech2MIDIMasked(FastSpeech2MIDI):
+     def forward(self, txt_tokens, mel2ph=None, spk_embed=None,
+                 ref_mels=None, f0=None, uv=None, energy=None, skip_decoder=False,
+                 spk_embed_dur_id=None, spk_embed_f0_id=None, infer=False, **kwargs):
+         ret = {}
+
+         midi_dur_embedding, slur_embedding = 0, 0
+         if kwargs.get('midi_dur') is not None:
+             midi_dur_embedding = self.midi_dur_layer(kwargs['midi_dur'][:, :, None])  # [B, T, 1] -> [B, T, H]
+         if kwargs.get('is_slur') is not None:
+             slur_embedding = self.is_slur_embed(kwargs['is_slur'])
+         encoder_out = self.encoder(txt_tokens, 0, midi_dur_embedding, slur_embedding)  # [B, T, C]
+         src_nonpadding = (txt_tokens > 0).float()[:, :, None]
+
+         # add ref style embed
+         # Not implemented
+         # variance encoder
+         var_embed = 0
+
+         # encoder_out_dur denotes encoder outputs for the duration predictor
+         # in speech adaptation, the duration predictor uses the old speaker embedding
+         if hparams['use_spk_embed']:
+             spk_embed_dur = spk_embed_f0 = spk_embed = self.spk_embed_proj(spk_embed)[:, None, :]
+         elif hparams['use_spk_id']:
+             spk_embed_id = spk_embed
+             if spk_embed_dur_id is None:
+                 spk_embed_dur_id = spk_embed_id
+             if spk_embed_f0_id is None:
+                 spk_embed_f0_id = spk_embed_id
+             spk_embed = self.spk_embed_proj(spk_embed_id)[:, None, :]
+             spk_embed_dur = spk_embed_f0 = spk_embed
+             if hparams['use_split_spk_id']:
+                 spk_embed_dur = self.spk_embed_dur(spk_embed_dur_id)[:, None, :]
+                 spk_embed_f0 = self.spk_embed_f0(spk_embed_f0_id)[:, None, :]
+         else:
+             spk_embed_dur = spk_embed_f0 = spk_embed = 0
+
+         # add dur
+         dur_inp = (encoder_out + var_embed + spk_embed_dur) * src_nonpadding
+
+         mel2ph = self.add_dur(dur_inp, mel2ph, txt_tokens, ret)
+
+         decoder_inp = F.pad(encoder_out, [0, 0, 1, 0])
+
+         mel2ph_ = mel2ph[..., None].repeat([1, 1, encoder_out.shape[-1]])
+         decoder_inp = torch.gather(decoder_inp, 1, mel2ph_)  # [B, T, H]
+
+         # expanded midi
+         midi_embedding = self.midi_embed(kwargs['pitch_midi'])
+         midi_embedding = F.pad(midi_embedding, [0, 0, 1, 0])
+         midi_embedding = torch.gather(midi_embedding, 1, mel2ph_)
+         print(midi_embedding.shape, decoder_inp.shape)
+         midi_mask = torch.full(midi_embedding.shape, hparams.get('mask_uv_prob', 0.)).to(midi_embedding.device)
+         midi_mask = 1 - torch.bernoulli(midi_mask).bool().to(midi_embedding.device)  # emit a random mask with probability `mask_uv_prob`
+
+         tgt_nonpadding = (mel2ph > 0).float()[:, :, None]
+
+         decoder_inp += midi_embedding
+         decoder_inp_origin = decoder_inp
+         # add pitch and energy embed
+         pitch_inp = (decoder_inp_origin + var_embed + spk_embed_f0) * tgt_nonpadding
+         if hparams['use_pitch_embed']:
+             pitch_inp_ph = (encoder_out + var_embed + spk_embed_f0) * src_nonpadding
+             decoder_inp = decoder_inp + self.add_pitch(pitch_inp, f0, uv, mel2ph, ret, encoder_out=pitch_inp_ph)
+         if hparams['use_energy_embed']:
+             decoder_inp = decoder_inp + self.add_energy(pitch_inp, energy, ret)
+
+         ret['decoder_inp'] = decoder_inp = (decoder_inp + spk_embed) * tgt_nonpadding
+
+         if skip_decoder:
+             return ret
+         ret['mel_out'] = self.run_decoder(decoder_inp, tgt_nonpadding, ret, infer=infer, **kwargs)
+
+         return ret
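The `torch.bernoulli` trick in `add_pitch` above is worth isolating: during training, each frame's unvoiced flag is ORed with a random flag drawn with probability `mask_uv_prob`, so the mel decoder occasionally sees voiced frames rendered as unvoiced. A self-contained sketch (toy shapes, `p` stands in for the hparam):

```python
import torch

p = 0.1                                    # stands in for hparams.get('mask_uv_prob', 0.)
uv = torch.zeros(2, 8)                     # 0 = voiced, 1 = unvoiced; toy batch [B, T]
mask = torch.full(uv.shape, p)             # per-frame Bernoulli probability
masked_uv = torch.bernoulli(mask).bool()   # True with probability p
uv_masked = uv.bool() | masked_uv          # OR in the random unvoiced flags
print(uv_masked.float().mean().item())     # ~p on average
```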
modules/hifigan/hifigan.py CHANGED
@@ -1,365 +1,365 @@
+ import torch
+ import torch.nn.functional as F
+ import torch.nn as nn
+ from torch.nn import Conv1d, ConvTranspose1d, AvgPool1d, Conv2d
+ from torch.nn.utils import weight_norm, remove_weight_norm, spectral_norm
+
+ from modules.parallel_wavegan.layers import UpsampleNetwork, ConvInUpsampleNetwork
+ from modules.parallel_wavegan.models.source import SourceModuleHnNSF
+ import numpy as np
+
+ LRELU_SLOPE = 0.1
+
+
+ def init_weights(m, mean=0.0, std=0.01):
+     classname = m.__class__.__name__
+     if classname.find("Conv") != -1:
+         m.weight.data.normal_(mean, std)
+
+
+ def apply_weight_norm(m):
+     classname = m.__class__.__name__
+     if classname.find("Conv") != -1:
+         weight_norm(m)
+
+
+ def get_padding(kernel_size, dilation=1):
+     return int((kernel_size * dilation - dilation) / 2)
+
+
+ class ResBlock1(torch.nn.Module):
+     def __init__(self, h, channels, kernel_size=3, dilation=(1, 3, 5)):
+         super(ResBlock1, self).__init__()
+         self.h = h
+         self.convs1 = nn.ModuleList([
+             weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[0],
+                                padding=get_padding(kernel_size, dilation[0]))),
+             weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[1],
+                                padding=get_padding(kernel_size, dilation[1]))),
+             weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[2],
+                                padding=get_padding(kernel_size, dilation[2])))
+         ])
+         self.convs1.apply(init_weights)
+
+         self.convs2 = nn.ModuleList([
+             weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1,
+                                padding=get_padding(kernel_size, 1))),
+             weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1,
+                                padding=get_padding(kernel_size, 1))),
+             weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1,
+                                padding=get_padding(kernel_size, 1)))
+         ])
+         self.convs2.apply(init_weights)
+
+     def forward(self, x):
+         for c1, c2 in zip(self.convs1, self.convs2):
+             xt = F.leaky_relu(x, LRELU_SLOPE)
+             xt = c1(xt)
+             xt = F.leaky_relu(xt, LRELU_SLOPE)
+             xt = c2(xt)
+             x = xt + x
+         return x
+
+     def remove_weight_norm(self):
+         for l in self.convs1:
+             remove_weight_norm(l)
+         for l in self.convs2:
+             remove_weight_norm(l)
+
+
+ class ResBlock2(torch.nn.Module):
+     def __init__(self, h, channels, kernel_size=3, dilation=(1, 3)):
+         super(ResBlock2, self).__init__()
+         self.h = h
+         self.convs = nn.ModuleList([
+             weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[0],
+                                padding=get_padding(kernel_size, dilation[0]))),
+             weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[1],
+                                padding=get_padding(kernel_size, dilation[1])))
+         ])
+         self.convs.apply(init_weights)
+
+     def forward(self, x):
+         for c in self.convs:
+             xt = F.leaky_relu(x, LRELU_SLOPE)
+             xt = c(xt)
+             x = xt + x
+         return x
+
+     def remove_weight_norm(self):
+         for l in self.convs:
+             remove_weight_norm(l)
+
+
+ class Conv1d1x1(Conv1d):
+     """1x1 Conv1d with customized initialization."""
+
+     def __init__(self, in_channels, out_channels, bias):
+         """Initialize 1x1 Conv1d module."""
+         super(Conv1d1x1, self).__init__(in_channels, out_channels,
+                                         kernel_size=1, padding=0,
+                                         dilation=1, bias=bias)
+
+
+ class HifiGanGenerator(torch.nn.Module):
+     def __init__(self, h, c_out=1):
+         super(HifiGanGenerator, self).__init__()
+         self.h = h
+         self.num_kernels = len(h['resblock_kernel_sizes'])
+         self.num_upsamples = len(h['upsample_rates'])
+
+         if h['use_pitch_embed']:
+             self.harmonic_num = 8
+             self.f0_upsamp = torch.nn.Upsample(scale_factor=np.prod(h['upsample_rates']))
+             self.m_source = SourceModuleHnNSF(
+                 sampling_rate=h['audio_sample_rate'],
+                 harmonic_num=self.harmonic_num)
+             self.noise_convs = nn.ModuleList()
+         self.conv_pre = weight_norm(Conv1d(80, h['upsample_initial_channel'], 7, 1, padding=3))
+         resblock = ResBlock1 if h['resblock'] == '1' else ResBlock2
+
+         self.ups = nn.ModuleList()
+         for i, (u, k) in enumerate(zip(h['upsample_rates'], h['upsample_kernel_sizes'])):
+             c_cur = h['upsample_initial_channel'] // (2 ** (i + 1))
+             self.ups.append(weight_norm(
+                 ConvTranspose1d(c_cur * 2, c_cur, k, u, padding=(k - u) // 2)))
+             if h['use_pitch_embed']:
+                 if i + 1 < len(h['upsample_rates']):
+                     stride_f0 = np.prod(h['upsample_rates'][i + 1:])
+                     self.noise_convs.append(Conv1d(
+                         1, c_cur, kernel_size=stride_f0 * 2, stride=stride_f0, padding=stride_f0 // 2))
+                 else:
+                     self.noise_convs.append(Conv1d(1, c_cur, kernel_size=1))
+
+         self.resblocks = nn.ModuleList()
+         for i in range(len(self.ups)):
+             ch = h['upsample_initial_channel'] // (2 ** (i + 1))
+             for j, (k, d) in enumerate(zip(h['resblock_kernel_sizes'], h['resblock_dilation_sizes'])):
+                 self.resblocks.append(resblock(h, ch, k, d))
+
+         self.conv_post = weight_norm(Conv1d(ch, c_out, 7, 1, padding=3))
+         self.ups.apply(init_weights)
+         self.conv_post.apply(init_weights)
+
+     def forward(self, x, f0=None):
+         if f0 is not None:
+             # harmonic-source signal, noise-source signal, uv flag
+             f0 = self.f0_upsamp(f0[:, None]).transpose(1, 2)
+             har_source, noi_source, uv = self.m_source(f0)
+             har_source = har_source.transpose(1, 2)
+
+         x = self.conv_pre(x)
+         for i in range(self.num_upsamples):
+             x = F.leaky_relu(x, LRELU_SLOPE)
+             x = self.ups[i](x)
+             if f0 is not None:
+                 x_source = self.noise_convs[i](har_source)
+                 x = x + x_source
+             xs = None
+             for j in range(self.num_kernels):
+                 if xs is None:
+                     xs = self.resblocks[i * self.num_kernels + j](x)
+                 else:
+                     xs += self.resblocks[i * self.num_kernels + j](x)
+             x = xs / self.num_kernels
+         x = F.leaky_relu(x)
+         x = self.conv_post(x)
+         x = torch.tanh(x)
+
+         return x
+
+     def remove_weight_norm(self):
+         print('Removing weight norm...')
+         for l in self.ups:
+             remove_weight_norm(l)
+         for l in self.resblocks:
+             l.remove_weight_norm()
+         remove_weight_norm(self.conv_pre)
+         remove_weight_norm(self.conv_post)
+
+
+ class DiscriminatorP(torch.nn.Module):
+     def __init__(self, period, kernel_size=5, stride=3, use_spectral_norm=False, use_cond=False, c_in=1):
+         super(DiscriminatorP, self).__init__()
+         self.use_cond = use_cond
+         if use_cond:
+             from utils.hparams import hparams
+             t = hparams['hop_size']
+             self.cond_net = torch.nn.ConvTranspose1d(80, 1, t * 2, stride=t, padding=t // 2)
+             c_in = 2
+
+         self.period = period
+         norm_f = weight_norm if use_spectral_norm == False else spectral_norm
+         self.convs = nn.ModuleList([
+             norm_f(Conv2d(c_in, 32, (kernel_size, 1), (stride, 1), padding=(get_padding(5, 1), 0))),
+             norm_f(Conv2d(32, 128, (kernel_size, 1), (stride, 1), padding=(get_padding(5, 1), 0))),
+             norm_f(Conv2d(128, 512, (kernel_size, 1), (stride, 1), padding=(get_padding(5, 1), 0))),
+             norm_f(Conv2d(512, 1024, (kernel_size, 1), (stride, 1), padding=(get_padding(5, 1), 0))),
+             norm_f(Conv2d(1024, 1024, (kernel_size, 1), 1, padding=(2, 0))),
+         ])
+         self.conv_post = norm_f(Conv2d(1024, 1, (3, 1), 1, padding=(1, 0)))
+
+     def forward(self, x, mel):
+         fmap = []
+         if self.use_cond:
+             x_mel = self.cond_net(mel)
+             x = torch.cat([x_mel, x], 1)
+         # 1d to 2d
+         b, c, t = x.shape
+         if t % self.period != 0:  # pad first
+             n_pad = self.period - (t % self.period)
+             x = F.pad(x, (0, n_pad), "reflect")
+             t = t + n_pad
+         x = x.view(b, c, t // self.period, self.period)
+
+         for l in self.convs:
+             x = l(x)
+             x = F.leaky_relu(x, LRELU_SLOPE)
+             fmap.append(x)
+         x = self.conv_post(x)
+         fmap.append(x)
+         x = torch.flatten(x, 1, -1)
+
+         return x, fmap
+
+
+ class MultiPeriodDiscriminator(torch.nn.Module):
+     def __init__(self, use_cond=False, c_in=1):
+         super(MultiPeriodDiscriminator, self).__init__()
+         self.discriminators = nn.ModuleList([
+             DiscriminatorP(2, use_cond=use_cond, c_in=c_in),
+             DiscriminatorP(3, use_cond=use_cond, c_in=c_in),
+             DiscriminatorP(5, use_cond=use_cond, c_in=c_in),
+             DiscriminatorP(7, use_cond=use_cond, c_in=c_in),
+             DiscriminatorP(11, use_cond=use_cond, c_in=c_in),
+         ])
+
+     def forward(self, y, y_hat, mel=None):
+         y_d_rs = []
+         y_d_gs = []
+         fmap_rs = []
+         fmap_gs = []
+         for i, d in enumerate(self.discriminators):
+             y_d_r, fmap_r = d(y, mel)
+             y_d_g, fmap_g = d(y_hat, mel)
+             y_d_rs.append(y_d_r)
+             fmap_rs.append(fmap_r)
+             y_d_gs.append(y_d_g)
+             fmap_gs.append(fmap_g)
+
+         return y_d_rs, y_d_gs, fmap_rs, fmap_gs
+
+
+ class DiscriminatorS(torch.nn.Module):
+     def __init__(self, use_spectral_norm=False, use_cond=False, upsample_rates=None, c_in=1):
+         super(DiscriminatorS, self).__init__()
+         self.use_cond = use_cond
+         if use_cond:
+             t = np.prod(upsample_rates)
+             self.cond_net = torch.nn.ConvTranspose1d(80, 1, t * 2, stride=t, padding=t // 2)
+             c_in = 2
+         norm_f = weight_norm if use_spectral_norm == False else spectral_norm
+         self.convs = nn.ModuleList([
+             norm_f(Conv1d(c_in, 128, 15, 1, padding=7)),
+             norm_f(Conv1d(128, 128, 41, 2, groups=4, padding=20)),
+             norm_f(Conv1d(128, 256, 41, 2, groups=16, padding=20)),
+             norm_f(Conv1d(256, 512, 41, 4, groups=16, padding=20)),
+             norm_f(Conv1d(512, 1024, 41, 4, groups=16, padding=20)),
+             norm_f(Conv1d(1024, 1024, 41, 1, groups=16, padding=20)),
+             norm_f(Conv1d(1024, 1024, 5, 1, padding=2)),
+         ])
+         self.conv_post = norm_f(Conv1d(1024, 1, 3, 1, padding=1))
+
+     def forward(self, x, mel):
+         if self.use_cond:
+             x_mel = self.cond_net(mel)
+             x = torch.cat([x_mel, x], 1)
+         fmap = []
+         for l in self.convs:
+             x = l(x)
+             x = F.leaky_relu(x, LRELU_SLOPE)
+             fmap.append(x)
+         x = self.conv_post(x)
+         fmap.append(x)
+         x = torch.flatten(x, 1, -1)
+
+         return x, fmap
+
+
+ class MultiScaleDiscriminator(torch.nn.Module):
+     def __init__(self, use_cond=False, c_in=1):
+         super(MultiScaleDiscriminator, self).__init__()
+         from utils.hparams import hparams
+         self.discriminators = nn.ModuleList([
+             DiscriminatorS(use_spectral_norm=True, use_cond=use_cond,
+                            upsample_rates=[4, 4, hparams['hop_size'] // 16],
+                            c_in=c_in),
+             DiscriminatorS(use_cond=use_cond,
+                            upsample_rates=[4, 4, hparams['hop_size'] // 32],
+                            c_in=c_in),
+             DiscriminatorS(use_cond=use_cond,
+                            upsample_rates=[4, 4, hparams['hop_size'] // 64],
+                            c_in=c_in),
+         ])
+         self.meanpools = nn.ModuleList([
+             AvgPool1d(4, 2, padding=1),
+             AvgPool1d(4, 2, padding=1)
+         ])
+
+     def forward(self, y, y_hat, mel=None):
+         y_d_rs = []
+         y_d_gs = []
+         fmap_rs = []
+         fmap_gs = []
+         for i, d in enumerate(self.discriminators):
+             if i != 0:
+                 y = self.meanpools[i - 1](y)
+                 y_hat = self.meanpools[i - 1](y_hat)
+             y_d_r, fmap_r = d(y, mel)
+             y_d_g, fmap_g = d(y_hat, mel)
+             y_d_rs.append(y_d_r)
+             fmap_rs.append(fmap_r)
+             y_d_gs.append(y_d_g)
+             fmap_gs.append(fmap_g)
+
+         return y_d_rs, y_d_gs, fmap_rs, fmap_gs
+
+
+ def feature_loss(fmap_r, fmap_g):
+     loss = 0
+     for dr, dg in zip(fmap_r, fmap_g):
+         for rl, gl in zip(dr, dg):
+             loss += torch.mean(torch.abs(rl - gl))
+
+     return loss * 2
+
+
+ def discriminator_loss(disc_real_outputs, disc_generated_outputs):
+     r_losses = 0
+     g_losses = 0
+     for dr, dg in zip(disc_real_outputs, disc_generated_outputs):
+         r_loss = torch.mean((1 - dr) ** 2)
+         g_loss = torch.mean(dg ** 2)
+         r_losses += r_loss
+         g_losses += g_loss
+     r_losses = r_losses / len(disc_real_outputs)
+     g_losses = g_losses / len(disc_real_outputs)
+     return r_losses, g_losses
+
+
+ def cond_discriminator_loss(outputs):
+     loss = 0
+     for dg in outputs:
+         g_loss = torch.mean(dg ** 2)
+         loss += g_loss
+     loss = loss / len(outputs)
+     return loss
+
+
+ def generator_loss(disc_outputs):
+     loss = 0
+     for dg in disc_outputs:
+         l = torch.mean((1 - dg) ** 2)
+         loss += l
+     loss = loss / len(disc_outputs)
+     return loss
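For orientation, a minimal smoke test of `HifiGanGenerator` under an assumed config: the key names are exactly the ones the class reads above, while the values (standard HiFi-GAN V1-style settings) are illustrative and should be replaced by the repo's real hparams.

```python
import torch

# Assumed config; only the key names are taken from the class above.
h = {
    'resblock': '1',                                   # selects ResBlock1
    'resblock_kernel_sizes': [3, 7, 11],
    'resblock_dilation_sizes': [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
    'upsample_rates': [8, 8, 2, 2],                    # product = 256 samples per frame
    'upsample_kernel_sizes': [16, 16, 4, 4],
    'upsample_initial_channel': 512,
    'use_pitch_embed': False,                          # skip the NSF source module here
    'audio_sample_rate': 22050,                        # only read when use_pitch_embed=True
}
net = HifiGanGenerator(h)
mel = torch.randn(1, 80, 100)                          # [B, n_mels, T_frames]
wav = net(mel)                                         # [B, 1, T_frames * prod(upsample_rates)]
print(wav.shape)                                       # torch.Size([1, 1, 25600])
```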
modules/hifigan/mel_utils.py CHANGED
@@ -1,80 +1,80 @@
+ import numpy as np
+ import torch
+ import torch.utils.data
+ from librosa.filters import mel as librosa_mel_fn
+ from scipy.io.wavfile import read
+
+ MAX_WAV_VALUE = 32768.0
+
+
+ def load_wav(full_path):
+     sampling_rate, data = read(full_path)
+     return data, sampling_rate
+
+
+ def dynamic_range_compression(x, C=1, clip_val=1e-5):
+     return np.log(np.clip(x, a_min=clip_val, a_max=None) * C)
+
+
+ def dynamic_range_decompression(x, C=1):
+     return np.exp(x) / C
+
+
+ def dynamic_range_compression_torch(x, C=1, clip_val=1e-5):
+     return torch.log(torch.clamp(x, min=clip_val) * C)
+
+
+ def dynamic_range_decompression_torch(x, C=1):
+     return torch.exp(x) / C
+
+
+ def spectral_normalize_torch(magnitudes):
+     output = dynamic_range_compression_torch(magnitudes)
+     return output
+
+
+ def spectral_de_normalize_torch(magnitudes):
+     output = dynamic_range_decompression_torch(magnitudes)
+     return output
+
+
+ mel_basis = {}
+ hann_window = {}
+
+
+ def mel_spectrogram(y, hparams, center=False, complex=False):
+     # hop_size: 512  # For 22050Hz, 275 ~= 12.5 ms (0.0125 * sample_rate)
+     # win_size: 2048  # For 22050Hz, 1100 ~= 50 ms (If None, win_size: fft_size) (0.05 * sample_rate)
+     # fmin: 55  # Set this to 55 if your speaker is male! if female, 95 should help taking off noise. (To test depending on dataset. Pitch info: male~[65, 260], female~[100, 525])
+     # fmax: 10000  # To be increased/reduced depending on data.
+     # fft_size: 2048  # Extra window size is filled with 0 paddings to match this parameter
+     # n_fft, num_mels, sampling_rate, hop_size, win_size, fmin, fmax,
+     n_fft = hparams['fft_size']
+     num_mels = hparams['audio_num_mel_bins']
+     sampling_rate = hparams['audio_sample_rate']
+     hop_size = hparams['hop_size']
+     win_size = hparams['win_size']
+     fmin = hparams['fmin']
+     fmax = hparams['fmax']
+     y = y.clamp(min=-1., max=1.)
+     global mel_basis, hann_window
+     if fmax not in mel_basis:
+         mel = librosa_mel_fn(sampling_rate, n_fft, num_mels, fmin, fmax)
+         mel_basis[str(fmax) + '_' + str(y.device)] = torch.from_numpy(mel).float().to(y.device)
+         hann_window[str(y.device)] = torch.hann_window(win_size).to(y.device)
+
+     y = torch.nn.functional.pad(y.unsqueeze(1), (int((n_fft - hop_size) / 2), int((n_fft - hop_size) / 2)),
+                                 mode='reflect')
+     y = y.squeeze(1)
+
+     spec = torch.stft(y, n_fft, hop_length=hop_size, win_length=win_size, window=hann_window[str(y.device)],
+                       center=center, pad_mode='reflect', normalized=False, onesided=True)
+
+     if not complex:
+         spec = torch.sqrt(spec.pow(2).sum(-1) + (1e-9))
+         spec = torch.matmul(mel_basis[str(fmax) + '_' + str(y.device)], spec)
+         spec = spectral_normalize_torch(spec)
+     else:
+         B, C, T, _ = spec.shape
+         spec = spec.transpose(1, 2)  # [B, T, n_fft, 2]
+     return spec
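A usage sketch for `mel_spectrogram` with a hypothetical `hparams` dict; the keys are exactly the ones the function reads, and the values follow the comments inside it. (Note in passing that the cache test `if fmax not in mel_basis` never matches the `f'{fmax}_{device}'` keys actually stored, so the mel basis is rebuilt on every call.)

```python
import torch

# Hypothetical hparams; key names taken from mel_spectrogram above,
# values from the comments inside the function.
hparams = {
    'fft_size': 2048,
    'audio_num_mel_bins': 80,
    'audio_sample_rate': 22050,
    'hop_size': 512,
    'win_size': 2048,
    'fmin': 55,
    'fmax': 10000,
}
y = torch.randn(1, 22050).clamp(-1., 1.)  # one second of placeholder audio, [B, T]
mel = mel_spectrogram(y, hparams)         # [B, audio_num_mel_bins, T_frames]
print(mel.shape)
```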
modules/parallel_wavegan/models/parallel_wavegan.py CHANGED
@@ -1,434 +1,434 @@
- # -*- coding: utf-8 -*-
-
- # Copyright 2019 Tomoki Hayashi
- # MIT License (https://opensource.org/licenses/MIT)
-
- """Parallel WaveGAN Modules."""
-
- import logging
- import math
-
- import torch
- from torch import nn
-
- from modules.parallel_wavegan.layers import Conv1d
- from modules.parallel_wavegan.layers import Conv1d1x1
- from modules.parallel_wavegan.layers import ResidualBlock
- from modules.parallel_wavegan.layers import upsample
- from modules.parallel_wavegan import models
-
-
- class ParallelWaveGANGenerator(torch.nn.Module):
-     """Parallel WaveGAN Generator module."""
-
-     def __init__(self,
-                  in_channels=1,
-                  out_channels=1,
-                  kernel_size=3,
-                  layers=30,
-                  stacks=3,
-                  residual_channels=64,
-                  gate_channels=128,
-                  skip_channels=64,
-                  aux_channels=80,
-                  aux_context_window=2,
-                  dropout=0.0,
-                  bias=True,
-                  use_weight_norm=True,
-                  use_causal_conv=False,
-                  upsample_conditional_features=True,
-                  upsample_net="ConvInUpsampleNetwork",
-                  upsample_params={"upsample_scales": [4, 4, 4, 4]},
-                  use_pitch_embed=False,
-                  ):
-         """Initialize Parallel WaveGAN Generator module.
-
-         Args:
-             in_channels (int): Number of input channels.
-             out_channels (int): Number of output channels.
-             kernel_size (int): Kernel size of dilated convolution.
-             layers (int): Number of residual block layers.
-             stacks (int): Number of stacks i.e., dilation cycles.
-             residual_channels (int): Number of channels in residual conv.
-             gate_channels (int): Number of channels in gated conv.
-             skip_channels (int): Number of channels in skip conv.
-             aux_channels (int): Number of channels for auxiliary feature conv.
-             aux_context_window (int): Context window size for auxiliary feature.
-             dropout (float): Dropout rate. 0.0 means no dropout applied.
-             bias (bool): Whether to use bias parameter in conv layer.
-             use_weight_norm (bool): Whether to use weight norm.
-                 If set to true, it will be applied to all of the conv layers.
-             use_causal_conv (bool): Whether to use causal structure.
-             upsample_conditional_features (bool): Whether to use upsampling network.
-             upsample_net (str): Upsampling network architecture.
-             upsample_params (dict): Upsampling network parameters.
-
-         """
-         super(ParallelWaveGANGenerator, self).__init__()
-         self.in_channels = in_channels
-         self.out_channels = out_channels
-         self.aux_channels = aux_channels
-         self.layers = layers
-         self.stacks = stacks
-         self.kernel_size = kernel_size
-
-         # check the number of layers and stacks
-         assert layers % stacks == 0
-         layers_per_stack = layers // stacks
-
-         # define first convolution
-         self.first_conv = Conv1d1x1(in_channels, residual_channels, bias=True)
-
-         # define conv + upsampling network
-         if upsample_conditional_features:
-             upsample_params.update({
-                 "use_causal_conv": use_causal_conv,
-             })
-             if upsample_net == "MelGANGenerator":
-                 assert aux_context_window == 0
-                 upsample_params.update({
-                     "use_weight_norm": False,  # not to apply twice
-                     "use_final_nonlinear_activation": False,
-                 })
-                 self.upsample_net = getattr(models, upsample_net)(**upsample_params)
-             else:
-                 if upsample_net == "ConvInUpsampleNetwork":
-                     upsample_params.update({
-                         "aux_channels": aux_channels,
-                         "aux_context_window": aux_context_window,
-                     })
-                 self.upsample_net = getattr(upsample, upsample_net)(**upsample_params)
-         else:
-             self.upsample_net = None
-
-         # define residual blocks
-         self.conv_layers = torch.nn.ModuleList()
-         for layer in range(layers):
-             dilation = 2 ** (layer % layers_per_stack)
-             conv = ResidualBlock(
-                 kernel_size=kernel_size,
-                 residual_channels=residual_channels,
-                 gate_channels=gate_channels,
-                 skip_channels=skip_channels,
-                 aux_channels=aux_channels,
-                 dilation=dilation,
-                 dropout=dropout,
-                 bias=bias,
-                 use_causal_conv=use_causal_conv,
-             )
-             self.conv_layers += [conv]
-
-         # define output layers
-         self.last_conv_layers = torch.nn.ModuleList([
-             torch.nn.ReLU(inplace=True),
-             Conv1d1x1(skip_channels, skip_channels, bias=True),
-             torch.nn.ReLU(inplace=True),
-             Conv1d1x1(skip_channels, out_channels, bias=True),
-         ])
-
-         self.use_pitch_embed = use_pitch_embed
-         if use_pitch_embed:
-             self.pitch_embed = nn.Embedding(300, aux_channels, 0)
-             self.c_proj = nn.Linear(2 * aux_channels, aux_channels)
-
-         # apply weight norm
-         if use_weight_norm:
-             self.apply_weight_norm()
-
-     def forward(self, x, c=None, pitch=None, **kwargs):
-         """Calculate forward propagation.
-
-         Args:
-             x (Tensor): Input noise signal (B, C_in, T).
-             c (Tensor): Local conditioning auxiliary features (B, C, T').
-             pitch (Tensor): Local conditioning pitch (B, T').
-
-         Returns:
-             Tensor: Output tensor (B, C_out, T)
-
-         """
-         # perform upsampling
-         if c is not None and self.upsample_net is not None:
-             if self.use_pitch_embed:
-                 p = self.pitch_embed(pitch)
-                 c = self.c_proj(torch.cat([c.transpose(1, 2), p], -1)).transpose(1, 2)
-             c = self.upsample_net(c)
-             assert c.size(-1) == x.size(-1), (c.size(-1), x.size(-1))
-
-         # encode to hidden representation
-         x = self.first_conv(x)
-         skips = 0
-         for f in self.conv_layers:
-             x, h = f(x, c)
-             skips += h
-         skips *= math.sqrt(1.0 / len(self.conv_layers))
-
-         # apply final layers
-         x = skips
-         for f in self.last_conv_layers:
-             x = f(x)
-
-         return x
-
-     def remove_weight_norm(self):
-         """Remove weight normalization module from all of the layers."""
-         def _remove_weight_norm(m):
-             try:
-                 logging.debug(f"Weight norm is removed from {m}.")
-                 torch.nn.utils.remove_weight_norm(m)
-             except ValueError:  # this module didn't have weight norm
-                 return
-
-         self.apply(_remove_weight_norm)
-
-     def apply_weight_norm(self):
-         """Apply weight normalization module from all of the layers."""
-         def _apply_weight_norm(m):
-             if isinstance(m, torch.nn.Conv1d) or isinstance(m, torch.nn.Conv2d):
-                 torch.nn.utils.weight_norm(m)
-                 logging.debug(f"Weight norm is applied to {m}.")
-
-         self.apply(_apply_weight_norm)
-
-     @staticmethod
-     def _get_receptive_field_size(layers, stacks, kernel_size,
-                                   dilation=lambda x: 2 ** x):
196
- assert layers % stacks == 0
197
- layers_per_cycle = layers // stacks
198
- dilations = [dilation(i % layers_per_cycle) for i in range(layers)]
199
- return (kernel_size - 1) * sum(dilations) + 1
200
-
201
- @property
202
- def receptive_field_size(self):
203
- """Return receptive field size."""
204
- return self._get_receptive_field_size(self.layers, self.stacks, self.kernel_size)
205
-
206
-
207
- class ParallelWaveGANDiscriminator(torch.nn.Module):
208
- """Parallel WaveGAN Discriminator module."""
209
-
210
- def __init__(self,
211
- in_channels=1,
212
- out_channels=1,
213
- kernel_size=3,
214
- layers=10,
215
- conv_channels=64,
216
- dilation_factor=1,
217
- nonlinear_activation="LeakyReLU",
218
- nonlinear_activation_params={"negative_slope": 0.2},
219
- bias=True,
220
- use_weight_norm=True,
221
- ):
222
- """Initialize Parallel WaveGAN Discriminator module.
223
-
224
- Args:
225
- in_channels (int): Number of input channels.
226
- out_channels (int): Number of output channels.
227
- kernel_size (int): Kernel size of conv layers.
228
- layers (int): Number of conv layers.
229
- conv_channels (int): Number of channels in conv layers.
230
- dilation_factor (int): Dilation factor. For example, if dilation_factor = 2,
231
- the dilation will be 2, 4, 8, ..., and so on.
232
- nonlinear_activation (str): Nonlinear function after each conv.
233
- nonlinear_activation_params (dict): Nonlinear function parameters
234
- bias (bool): Whether to use bias parameter in conv.
235
- use_weight_norm (bool): Whether to use weight norm.
236
- If set to true, it will be applied to all of the conv layers.
237
-
238
- """
239
- super(ParallelWaveGANDiscriminator, self).__init__()
240
- assert (kernel_size - 1) % 2 == 0, "Not support even number kernel size."
241
- assert dilation_factor > 0, "Dilation factor must be > 0."
242
- self.conv_layers = torch.nn.ModuleList()
243
- conv_in_channels = in_channels
244
- for i in range(layers - 1):
245
- if i == 0:
246
- dilation = 1
247
- else:
248
- dilation = i if dilation_factor == 1 else dilation_factor ** i
249
- conv_in_channels = conv_channels
250
- padding = (kernel_size - 1) // 2 * dilation
251
- conv_layer = [
252
- Conv1d(conv_in_channels, conv_channels,
253
- kernel_size=kernel_size, padding=padding,
254
- dilation=dilation, bias=bias),
255
- getattr(torch.nn, nonlinear_activation)(inplace=True, **nonlinear_activation_params)
256
- ]
257
- self.conv_layers += conv_layer
258
- padding = (kernel_size - 1) // 2
259
- last_conv_layer = Conv1d(
260
- conv_in_channels, out_channels,
261
- kernel_size=kernel_size, padding=padding, bias=bias)
262
- self.conv_layers += [last_conv_layer]
263
-
264
- # apply weight norm
265
- if use_weight_norm:
266
- self.apply_weight_norm()
267
-
268
- def forward(self, x):
269
- """Calculate forward propagation.
270
-
271
- Args:
272
- x (Tensor): Input noise signal (B, 1, T).
273
-
274
- Returns:
275
- Tensor: Output tensor (B, 1, T)
276
-
277
- """
278
- for f in self.conv_layers:
279
- x = f(x)
280
- return x
281
-
282
- def apply_weight_norm(self):
283
- """Apply weight normalization module from all of the layers."""
284
- def _apply_weight_norm(m):
285
- if isinstance(m, torch.nn.Conv1d) or isinstance(m, torch.nn.Conv2d):
286
- torch.nn.utils.weight_norm(m)
287
- logging.debug(f"Weight norm is applied to {m}.")
288
-
289
- self.apply(_apply_weight_norm)
290
-
291
- def remove_weight_norm(self):
292
- """Remove weight normalization module from all of the layers."""
293
- def _remove_weight_norm(m):
294
- try:
295
- logging.debug(f"Weight norm is removed from {m}.")
296
- torch.nn.utils.remove_weight_norm(m)
297
- except ValueError: # this module didn't have weight norm
298
- return
299
-
300
- self.apply(_remove_weight_norm)
301
-
302
-
303
- class ResidualParallelWaveGANDiscriminator(torch.nn.Module):
304
- """Parallel WaveGAN Discriminator module."""
305
-
306
- def __init__(self,
307
- in_channels=1,
308
- out_channels=1,
309
- kernel_size=3,
310
- layers=30,
311
- stacks=3,
312
- residual_channels=64,
313
- gate_channels=128,
314
- skip_channels=64,
315
- dropout=0.0,
316
- bias=True,
317
- use_weight_norm=True,
318
- use_causal_conv=False,
319
- nonlinear_activation="LeakyReLU",
320
- nonlinear_activation_params={"negative_slope": 0.2},
321
- ):
322
- """Initialize Parallel WaveGAN Discriminator module.
323
-
324
- Args:
325
- in_channels (int): Number of input channels.
326
- out_channels (int): Number of output channels.
327
- kernel_size (int): Kernel size of dilated convolution.
328
- layers (int): Number of residual block layers.
329
- stacks (int): Number of stacks i.e., dilation cycles.
330
- residual_channels (int): Number of channels in residual conv.
331
- gate_channels (int): Number of channels in gated conv.
332
- skip_channels (int): Number of channels in skip conv.
333
- dropout (float): Dropout rate. 0.0 means no dropout applied.
334
- bias (bool): Whether to use bias parameter in conv.
335
- use_weight_norm (bool): Whether to use weight norm.
336
- If set to true, it will be applied to all of the conv layers.
337
- use_causal_conv (bool): Whether to use causal structure.
338
- nonlinear_activation_params (dict): Nonlinear function parameters
339
-
340
- """
341
- super(ResidualParallelWaveGANDiscriminator, self).__init__()
342
- assert (kernel_size - 1) % 2 == 0, "Even kernel sizes are not supported."
343
-
344
- self.in_channels = in_channels
345
- self.out_channels = out_channels
346
- self.layers = layers
347
- self.stacks = stacks
348
- self.kernel_size = kernel_size
349
-
350
- # check the number of layers and stacks
351
- assert layers % stacks == 0
352
- layers_per_stack = layers // stacks
353
-
354
- # define first convolution
355
- self.first_conv = torch.nn.Sequential(
356
- Conv1d1x1(in_channels, residual_channels, bias=True),
357
- getattr(torch.nn, nonlinear_activation)(
358
- inplace=True, **nonlinear_activation_params),
359
- )
360
-
361
- # define residual blocks
362
- self.conv_layers = torch.nn.ModuleList()
363
- for layer in range(layers):
364
- dilation = 2 ** (layer % layers_per_stack)
365
- conv = ResidualBlock(
366
- kernel_size=kernel_size,
367
- residual_channels=residual_channels,
368
- gate_channels=gate_channels,
369
- skip_channels=skip_channels,
370
- aux_channels=-1,
371
- dilation=dilation,
372
- dropout=dropout,
373
- bias=bias,
374
- use_causal_conv=use_causal_conv,
375
- )
376
- self.conv_layers += [conv]
377
-
378
- # define output layers
379
- self.last_conv_layers = torch.nn.ModuleList([
380
- getattr(torch.nn, nonlinear_activation)(
381
- inplace=True, **nonlinear_activation_params),
382
- Conv1d1x1(skip_channels, skip_channels, bias=True),
383
- getattr(torch.nn, nonlinear_activation)(
384
- inplace=True, **nonlinear_activation_params),
385
- Conv1d1x1(skip_channels, out_channels, bias=True),
386
- ])
387
-
388
- # apply weight norm
389
- if use_weight_norm:
390
- self.apply_weight_norm()
391
-
392
- def forward(self, x):
393
- """Calculate forward propagation.
394
-
395
- Args:
396
- x (Tensor): Input noise signal (B, 1, T).
397
-
398
- Returns:
399
- Tensor: Output tensor (B, 1, T)
400
-
401
- """
402
- x = self.first_conv(x)
403
-
404
- skips = 0
405
- for f in self.conv_layers:
406
- x, h = f(x, None)
407
- skips += h
408
- skips *= math.sqrt(1.0 / len(self.conv_layers))
409
-
410
- # apply final layers
411
- x = skips
412
- for f in self.last_conv_layers:
413
- x = f(x)
414
- return x
415
-
416
- def apply_weight_norm(self):
417
- """Apply weight normalization module from all of the layers."""
418
- def _apply_weight_norm(m):
419
- if isinstance(m, torch.nn.Conv1d) or isinstance(m, torch.nn.Conv2d):
420
- torch.nn.utils.weight_norm(m)
421
- logging.debug(f"Weight norm is applied to {m}.")
422
-
423
- self.apply(_apply_weight_norm)
424
-
425
- def remove_weight_norm(self):
426
- """Remove weight normalization module from all of the layers."""
427
- def _remove_weight_norm(m):
428
- try:
429
- logging.debug(f"Weight norm is removed from {m}.")
430
- torch.nn.utils.remove_weight_norm(m)
431
- except ValueError: # this module didn't have weight norm
432
- return
433
-
434
- self.apply(_remove_weight_norm)
 
1
+ # -*- coding: utf-8 -*-
2
+
3
+ # Copyright 2019 Tomoki Hayashi
4
+ # MIT License (https://opensource.org/licenses/MIT)
5
+
6
+ """Parallel WaveGAN Modules."""
7
+
8
+ import logging
9
+ import math
10
+
11
+ import torch
12
+ from torch import nn
13
+
14
+ from modules.parallel_wavegan.layers import Conv1d
15
+ from modules.parallel_wavegan.layers import Conv1d1x1
16
+ from modules.parallel_wavegan.layers import ResidualBlock
17
+ from modules.parallel_wavegan.layers import upsample
18
+ from modules.parallel_wavegan import models
19
+
20
+
21
+ class ParallelWaveGANGenerator(torch.nn.Module):
22
+ """Parallel WaveGAN Generator module."""
23
+
24
+ def __init__(self,
25
+ in_channels=1,
26
+ out_channels=1,
27
+ kernel_size=3,
28
+ layers=30,
29
+ stacks=3,
30
+ residual_channels=64,
31
+ gate_channels=128,
32
+ skip_channels=64,
33
+ aux_channels=80,
34
+ aux_context_window=2,
35
+ dropout=0.0,
36
+ bias=True,
37
+ use_weight_norm=True,
38
+ use_causal_conv=False,
39
+ upsample_conditional_features=True,
40
+ upsample_net="ConvInUpsampleNetwork",
41
+ upsample_params={"upsample_scales": [4, 4, 4, 4]},
42
+ use_pitch_embed=False,
43
+ ):
44
+ """Initialize Parallel WaveGAN Generator module.
45
+
46
+ Args:
47
+ in_channels (int): Number of input channels.
48
+ out_channels (int): Number of output channels.
49
+ kernel_size (int): Kernel size of dilated convolution.
50
+ layers (int): Number of residual block layers.
51
+ stacks (int): Number of stacks i.e., dilation cycles.
52
+ residual_channels (int): Number of channels in residual conv.
53
+ gate_channels (int): Number of channels in gated conv.
54
+ skip_channels (int): Number of channels in skip conv.
55
+ aux_channels (int): Number of channels for auxiliary feature conv.
56
+ aux_context_window (int): Context window size for auxiliary feature.
57
+ dropout (float): Dropout rate. 0.0 means no dropout applied.
58
+ bias (bool): Whether to use bias parameter in conv layer.
59
+ use_weight_norm (bool): Whether to use weight norm.
60
+ If set to true, it will be applied to all of the conv layers.
61
+ use_causal_conv (bool): Whether to use causal structure.
62
+ upsample_conditional_features (bool): Whether to use upsampling network.
63
+ upsample_net (str): Upsampling network architecture.
64
+ upsample_params (dict): Upsampling network parameters.
65
+
66
+ """
67
+ super(ParallelWaveGANGenerator, self).__init__()
68
+ self.in_channels = in_channels
69
+ self.out_channels = out_channels
70
+ self.aux_channels = aux_channels
71
+ self.layers = layers
72
+ self.stacks = stacks
73
+ self.kernel_size = kernel_size
74
+
75
+ # check the number of layers and stacks
76
+ assert layers % stacks == 0
77
+ layers_per_stack = layers // stacks
78
+
79
+ # define first convolution
80
+ self.first_conv = Conv1d1x1(in_channels, residual_channels, bias=True)
81
+
82
+ # define conv + upsampling network
83
+ if upsample_conditional_features:
84
+ upsample_params.update({
85
+ "use_causal_conv": use_causal_conv,
86
+ })
87
+ if upsample_net == "MelGANGenerator":
88
+ assert aux_context_window == 0
89
+ upsample_params.update({
90
+ "use_weight_norm": False, # not to apply twice
91
+ "use_final_nonlinear_activation": False,
92
+ })
93
+ self.upsample_net = getattr(models, upsample_net)(**upsample_params)
94
+ else:
95
+ if upsample_net == "ConvInUpsampleNetwork":
96
+ upsample_params.update({
97
+ "aux_channels": aux_channels,
98
+ "aux_context_window": aux_context_window,
99
+ })
100
+ self.upsample_net = getattr(upsample, upsample_net)(**upsample_params)
101
+ else:
102
+ self.upsample_net = None
103
+
104
+ # define residual blocks
105
+ self.conv_layers = torch.nn.ModuleList()
106
+ for layer in range(layers):
107
+ dilation = 2 ** (layer % layers_per_stack)
108
+ conv = ResidualBlock(
109
+ kernel_size=kernel_size,
110
+ residual_channels=residual_channels,
111
+ gate_channels=gate_channels,
112
+ skip_channels=skip_channels,
113
+ aux_channels=aux_channels,
114
+ dilation=dilation,
115
+ dropout=dropout,
116
+ bias=bias,
117
+ use_causal_conv=use_causal_conv,
118
+ )
119
+ self.conv_layers += [conv]
120
+
121
+ # define output layers
122
+ self.last_conv_layers = torch.nn.ModuleList([
123
+ torch.nn.ReLU(inplace=True),
124
+ Conv1d1x1(skip_channels, skip_channels, bias=True),
125
+ torch.nn.ReLU(inplace=True),
126
+ Conv1d1x1(skip_channels, out_channels, bias=True),
127
+ ])
128
+
129
+ self.use_pitch_embed = use_pitch_embed
130
+ if use_pitch_embed:
131
+ self.pitch_embed = nn.Embedding(300, aux_channels, 0)
132
+ self.c_proj = nn.Linear(2 * aux_channels, aux_channels)
133
+
134
+ # apply weight norm
135
+ if use_weight_norm:
136
+ self.apply_weight_norm()
137
+
138
+ def forward(self, x, c=None, pitch=None, **kwargs):
139
+ """Calculate forward propagation.
140
+
141
+ Args:
142
+ x (Tensor): Input noise signal (B, C_in, T).
143
+ c (Tensor): Local conditioning auxiliary features (B, C ,T').
144
+ pitch (Tensor): Local conditioning pitch (B, T').
145
+
146
+ Returns:
147
+ Tensor: Output tensor (B, C_out, T)
148
+
149
+ """
150
+ # perform upsampling
151
+ if c is not None and self.upsample_net is not None:
152
+ if self.use_pitch_embed:
153
+ p = self.pitch_embed(pitch)
154
+ c = self.c_proj(torch.cat([c.transpose(1, 2), p], -1)).transpose(1, 2)
155
+ c = self.upsample_net(c)
156
+ assert c.size(-1) == x.size(-1), (c.size(-1), x.size(-1))
157
+
158
+ # encode to hidden representation
159
+ x = self.first_conv(x)
160
+ skips = 0
161
+ for f in self.conv_layers:
162
+ x, h = f(x, c)
163
+ skips += h
164
+ skips *= math.sqrt(1.0 / len(self.conv_layers))
165
+
166
+ # apply final layers
167
+ x = skips
168
+ for f in self.last_conv_layers:
169
+ x = f(x)
170
+
171
+ return x
172
+
173
+ def remove_weight_norm(self):
174
+ """Remove weight normalization module from all of the layers."""
175
+ def _remove_weight_norm(m):
176
+ try:
177
+ logging.debug(f"Weight norm is removed from {m}.")
178
+ torch.nn.utils.remove_weight_norm(m)
179
+ except ValueError: # this module didn't have weight norm
180
+ return
181
+
182
+ self.apply(_remove_weight_norm)
183
+
184
+ def apply_weight_norm(self):
185
+ """Apply weight normalization module from all of the layers."""
186
+ def _apply_weight_norm(m):
187
+ if isinstance(m, torch.nn.Conv1d) or isinstance(m, torch.nn.Conv2d):
188
+ torch.nn.utils.weight_norm(m)
189
+ logging.debug(f"Weight norm is applied to {m}.")
190
+
191
+ self.apply(_apply_weight_norm)
192
+
193
+ @staticmethod
194
+ def _get_receptive_field_size(layers, stacks, kernel_size,
195
+ dilation=lambda x: 2 ** x):
196
+ assert layers % stacks == 0
197
+ layers_per_cycle = layers // stacks
198
+ dilations = [dilation(i % layers_per_cycle) for i in range(layers)]
199
+ return (kernel_size - 1) * sum(dilations) + 1
200
+
201
+ @property
202
+ def receptive_field_size(self):
203
+ """Return receptive field size."""
204
+ return self._get_receptive_field_size(self.layers, self.stacks, self.kernel_size)
205
+
206
+
207
+ class ParallelWaveGANDiscriminator(torch.nn.Module):
208
+ """Parallel WaveGAN Discriminator module."""
209
+
210
+ def __init__(self,
211
+ in_channels=1,
212
+ out_channels=1,
213
+ kernel_size=3,
214
+ layers=10,
215
+ conv_channels=64,
216
+ dilation_factor=1,
217
+ nonlinear_activation="LeakyReLU",
218
+ nonlinear_activation_params={"negative_slope": 0.2},
219
+ bias=True,
220
+ use_weight_norm=True,
221
+ ):
222
+ """Initialize Parallel WaveGAN Discriminator module.
223
+
224
+ Args:
225
+ in_channels (int): Number of input channels.
226
+ out_channels (int): Number of output channels.
227
+ kernel_size (int): Kernel size of conv layers.
228
+ layers (int): Number of conv layers.
229
+ conv_channels (int): Number of channels in conv layers.
230
+ dilation_factor (int): Dilation factor. For example, if dilation_factor = 2,
231
+ the dilation will be 2, 4, 8, ..., and so on.
232
+ nonlinear_activation (str): Nonlinear function after each conv.
233
+ nonlinear_activation_params (dict): Nonlinear function parameters
234
+ bias (bool): Whether to use bias parameter in conv.
235
+ use_weight_norm (bool): Whether to use weight norm.
236
+ If set to true, it will be applied to all of the conv layers.
237
+
238
+ """
239
+ super(ParallelWaveGANDiscriminator, self).__init__()
240
+ assert (kernel_size - 1) % 2 == 0, "Even kernel sizes are not supported."
241
+ assert dilation_factor > 0, "Dilation factor must be > 0."
242
+ self.conv_layers = torch.nn.ModuleList()
243
+ conv_in_channels = in_channels
244
+ for i in range(layers - 1):
245
+ if i == 0:
246
+ dilation = 1
247
+ else:
248
+ dilation = i if dilation_factor == 1 else dilation_factor ** i
249
+ conv_in_channels = conv_channels
250
+ padding = (kernel_size - 1) // 2 * dilation
251
+ conv_layer = [
252
+ Conv1d(conv_in_channels, conv_channels,
253
+ kernel_size=kernel_size, padding=padding,
254
+ dilation=dilation, bias=bias),
255
+ getattr(torch.nn, nonlinear_activation)(inplace=True, **nonlinear_activation_params)
256
+ ]
257
+ self.conv_layers += conv_layer
258
+ padding = (kernel_size - 1) // 2
259
+ last_conv_layer = Conv1d(
260
+ conv_in_channels, out_channels,
261
+ kernel_size=kernel_size, padding=padding, bias=bias)
262
+ self.conv_layers += [last_conv_layer]
263
+
264
+ # apply weight norm
265
+ if use_weight_norm:
266
+ self.apply_weight_norm()
267
+
268
+ def forward(self, x):
269
+ """Calculate forward propagation.
270
+
271
+ Args:
272
+ x (Tensor): Input noise signal (B, 1, T).
273
+
274
+ Returns:
275
+ Tensor: Output tensor (B, 1, T)
276
+
277
+ """
278
+ for f in self.conv_layers:
279
+ x = f(x)
280
+ return x
281
+
282
+ def apply_weight_norm(self):
283
+ """Apply weight normalization module from all of the layers."""
284
+ def _apply_weight_norm(m):
285
+ if isinstance(m, torch.nn.Conv1d) or isinstance(m, torch.nn.Conv2d):
286
+ torch.nn.utils.weight_norm(m)
287
+ logging.debug(f"Weight norm is applied to {m}.")
288
+
289
+ self.apply(_apply_weight_norm)
290
+
291
+ def remove_weight_norm(self):
292
+ """Remove weight normalization module from all of the layers."""
293
+ def _remove_weight_norm(m):
294
+ try:
295
+ logging.debug(f"Weight norm is removed from {m}.")
296
+ torch.nn.utils.remove_weight_norm(m)
297
+ except ValueError: # this module didn't have weight norm
298
+ return
299
+
300
+ self.apply(_remove_weight_norm)
301
+
302
+
303
+ class ResidualParallelWaveGANDiscriminator(torch.nn.Module):
304
+ """Parallel WaveGAN Discriminator module."""
305
+
306
+ def __init__(self,
307
+ in_channels=1,
308
+ out_channels=1,
309
+ kernel_size=3,
310
+ layers=30,
311
+ stacks=3,
312
+ residual_channels=64,
313
+ gate_channels=128,
314
+ skip_channels=64,
315
+ dropout=0.0,
316
+ bias=True,
317
+ use_weight_norm=True,
318
+ use_causal_conv=False,
319
+ nonlinear_activation="LeakyReLU",
320
+ nonlinear_activation_params={"negative_slope": 0.2},
321
+ ):
322
+ """Initialize Parallel WaveGAN Discriminator module.
323
+
324
+ Args:
325
+ in_channels (int): Number of input channels.
326
+ out_channels (int): Number of output channels.
327
+ kernel_size (int): Kernel size of dilated convolution.
328
+ layers (int): Number of residual block layers.
329
+ stacks (int): Number of stacks i.e., dilation cycles.
330
+ residual_channels (int): Number of channels in residual conv.
331
+ gate_channels (int): Number of channels in gated conv.
332
+ skip_channels (int): Number of channels in skip conv.
333
+ dropout (float): Dropout rate. 0.0 means no dropout applied.
334
+ bias (bool): Whether to use bias parameter in conv.
335
+ use_weight_norm (bool): Whether to use weight norm.
336
+ If set to true, it will be applied to all of the conv layers.
337
+ use_causal_conv (bool): Whether to use causal structure.
338
+ nonlinear_activation_params (dict): Nonlinear function parameters
339
+
340
+ """
341
+ super(ResidualParallelWaveGANDiscriminator, self).__init__()
342
+ assert (kernel_size - 1) % 2 == 0, "Even kernel sizes are not supported."
343
+
344
+ self.in_channels = in_channels
345
+ self.out_channels = out_channels
346
+ self.layers = layers
347
+ self.stacks = stacks
348
+ self.kernel_size = kernel_size
349
+
350
+ # check the number of layers and stacks
351
+ assert layers % stacks == 0
352
+ layers_per_stack = layers // stacks
353
+
354
+ # define first convolution
355
+ self.first_conv = torch.nn.Sequential(
356
+ Conv1d1x1(in_channels, residual_channels, bias=True),
357
+ getattr(torch.nn, nonlinear_activation)(
358
+ inplace=True, **nonlinear_activation_params),
359
+ )
360
+
361
+ # define residual blocks
362
+ self.conv_layers = torch.nn.ModuleList()
363
+ for layer in range(layers):
364
+ dilation = 2 ** (layer % layers_per_stack)
365
+ conv = ResidualBlock(
366
+ kernel_size=kernel_size,
367
+ residual_channels=residual_channels,
368
+ gate_channels=gate_channels,
369
+ skip_channels=skip_channels,
370
+ aux_channels=-1,
371
+ dilation=dilation,
372
+ dropout=dropout,
373
+ bias=bias,
374
+ use_causal_conv=use_causal_conv,
375
+ )
376
+ self.conv_layers += [conv]
377
+
378
+ # define output layers
379
+ self.last_conv_layers = torch.nn.ModuleList([
380
+ getattr(torch.nn, nonlinear_activation)(
381
+ inplace=True, **nonlinear_activation_params),
382
+ Conv1d1x1(skip_channels, skip_channels, bias=True),
383
+ getattr(torch.nn, nonlinear_activation)(
384
+ inplace=True, **nonlinear_activation_params),
385
+ Conv1d1x1(skip_channels, out_channels, bias=True),
386
+ ])
387
+
388
+ # apply weight norm
389
+ if use_weight_norm:
390
+ self.apply_weight_norm()
391
+
392
+ def forward(self, x):
393
+ """Calculate forward propagation.
394
+
395
+ Args:
396
+ x (Tensor): Input noise signal (B, 1, T).
397
+
398
+ Returns:
399
+ Tensor: Output tensor (B, 1, T)
400
+
401
+ """
402
+ x = self.first_conv(x)
403
+
404
+ skips = 0
405
+ for f in self.conv_layers:
406
+ x, h = f(x, None)
407
+ skips += h
408
+ skips *= math.sqrt(1.0 / len(self.conv_layers))
409
+
410
+ # apply final layers
411
+ x = skips
412
+ for f in self.last_conv_layers:
413
+ x = f(x)
414
+ return x
415
+
416
+ def apply_weight_norm(self):
417
+ """Apply weight normalization module from all of the layers."""
418
+ def _apply_weight_norm(m):
419
+ if isinstance(m, torch.nn.Conv1d) or isinstance(m, torch.nn.Conv2d):
420
+ torch.nn.utils.weight_norm(m)
421
+ logging.debug(f"Weight norm is applied to {m}.")
422
+
423
+ self.apply(_apply_weight_norm)
424
+
425
+ def remove_weight_norm(self):
426
+ """Remove weight normalization module from all of the layers."""
427
+ def _remove_weight_norm(m):
428
+ try:
429
+ logging.debug(f"Weight norm is removed from {m}.")
430
+ torch.nn.utils.remove_weight_norm(m)
431
+ except ValueError: # this module didn't have weight norm
432
+ return
433
+
434
+ self.apply(_remove_weight_norm)
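The two sides of this file are content-identical, so the change is most likely whitespace or line-ending normalization. For orientation, a hypothetical smoke test of the generator's interface; it assumes the reference `ConvInUpsampleNetwork` behavior (an unpadded context conv followed by 4x4x4x4 upsampling, i.e. a hop of 256), so the exact length bookkeeping may differ in this repo:

```python
import torch
from modules.parallel_wavegan.models.parallel_wavegan import ParallelWaveGANGenerator

gen = ParallelWaveGANGenerator()        # defaults: aux_channels=80, aux_context_window=2

B, frames, hop, ctx = 2, 50, 256, 2     # hop = prod([4, 4, 4, 4])
mels = torch.randn(B, 80, frames)       # local conditioning c, [B, C, T']
# The context conv consumes ctx frames on each side before upsampling, so the
# noise input has to match the trimmed, upsampled conditioning length.
noise = torch.randn(B, 1, (frames - 2 * ctx) * hop)

with torch.no_grad():
    wav = gen(noise, c=mels)            # [B, 1, (frames - 4) * 256]
print(wav.shape, gen.receptive_field_size)
```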
usr/configs/midi/cascade/opencs/ds60_rel.yaml CHANGED
@@ -24,10 +24,11 @@ fs2_ckpt: 'checkpoints/0302_opencpop_fs_midi/model_ckpt_steps_160000.ckpt' #
24
  task_cls: usr.diffsinger_task.DiffSingerMIDITask
25
 
26
  K_step: 60
27
- max_tokens: 40000
28
  predictor_layers: 5
29
  dilation_cycle_length: 4 # *
30
  rel_pos: true
31
  dur_predictor_layers: 5 # *
32
  max_updates: 160000
33
  gaussian_start: false
 
 
24
  task_cls: usr.diffsinger_task.DiffSingerMIDITask
25
 
26
  K_step: 60
27
+ max_tokens: 36000
28
  predictor_layers: 5
29
  dilation_cycle_length: 4 # *
30
  rel_pos: true
31
  dur_predictor_layers: 5 # *
32
  max_updates: 160000
33
  gaussian_start: false
34
+ mask_uv_prob: 0.15
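Two training knobs change here: `max_tokens` (the per-batch frame budget) drops from 40000 to 36000, and `mask_uv_prob: 0.15` is introduced. The name suggests randomly dropping the unvoiced (UV) flag during training so the model does not over-rely on it; the consuming code is not part of this diff, so the helper below is only a plausible sketch of how such a probability is typically applied, not the repo's actual logic:

```python
import torch

def maybe_mask_uv(uv: torch.Tensor, mask_uv_prob: float, training: bool) -> torch.Tensor:
    """Hypothetical helper: zero the UV flag for a random subset of samples.

    uv: [B, T] binary unvoiced mask; whole rows are dropped with prob mask_uv_prob.
    """
    if training and mask_uv_prob > 0:
        keep = (torch.rand(uv.size(0), 1, device=uv.device) >= mask_uv_prob).float()
        uv = uv * keep
    return uv

uv = torch.randint(0, 2, (4, 100)).float()
print(maybe_mask_uv(uv, mask_uv_prob=0.15, training=True).shape)  # torch.Size([4, 100])
```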
usr/diff/shallow_diffusion_tts.py CHANGED
@@ -1,273 +1,324 @@
1
- import math
2
- import random
3
- from functools import partial
4
- from inspect import isfunction
5
- from pathlib import Path
6
- import numpy as np
7
- import torch
8
- import torch.nn.functional as F
9
- from torch import nn
10
- from tqdm import tqdm
11
- from einops import rearrange
12
-
13
- from modules.fastspeech.fs2 import FastSpeech2
14
- from modules.diffsinger_midi.fs2 import FastSpeech2MIDI
15
- from utils.hparams import hparams
16
-
17
-
18
-
19
- def exists(x):
20
- return x is not None
21
-
22
-
23
- def default(val, d):
24
- if exists(val):
25
- return val
26
- return d() if isfunction(d) else d
27
-
28
-
29
- # gaussian diffusion trainer class
30
-
31
- def extract(a, t, x_shape):
32
- b, *_ = t.shape
33
- out = a.gather(-1, t)
34
- return out.reshape(b, *((1,) * (len(x_shape) - 1)))
35
-
36
-
37
- def noise_like(shape, device, repeat=False):
38
- repeat_noise = lambda: torch.randn((1, *shape[1:]), device=device).repeat(shape[0], *((1,) * (len(shape) - 1)))
39
- noise = lambda: torch.randn(shape, device=device)
40
- return repeat_noise() if repeat else noise()
41
-
42
-
43
- def linear_beta_schedule(timesteps, max_beta=hparams.get('max_beta', 0.01)):
44
- """
45
- linear schedule
46
- """
47
- betas = np.linspace(1e-4, max_beta, timesteps)
48
- return betas
49
-
50
-
51
- def cosine_beta_schedule(timesteps, s=0.008):
52
- """
53
- cosine schedule
54
- as proposed in https://openreview.net/forum?id=-NEXDKk8gZ
55
- """
56
- steps = timesteps + 1
57
- x = np.linspace(0, steps, steps)
58
- alphas_cumprod = np.cos(((x / steps) + s) / (1 + s) * np.pi * 0.5) ** 2
59
- alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
60
- betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
61
- return np.clip(betas, a_min=0, a_max=0.999)
62
-
63
-
64
- beta_schedule = {
65
- "cosine": cosine_beta_schedule,
66
- "linear": linear_beta_schedule,
67
- }
68
-
69
-
70
- class GaussianDiffusion(nn.Module):
71
- def __init__(self, phone_encoder, out_dims, denoise_fn,
72
- timesteps=1000, K_step=1000, loss_type=hparams.get('diff_loss_type', 'l1'), betas=None, spec_min=None, spec_max=None):
73
- super().__init__()
74
- self.denoise_fn = denoise_fn
75
- if hparams.get('use_midi') is not None and hparams['use_midi']:
76
- self.fs2 = FastSpeech2MIDI(phone_encoder, out_dims)
77
- else:
78
- self.fs2 = FastSpeech2(phone_encoder, out_dims)
79
- self.mel_bins = out_dims
80
-
81
- if exists(betas):
82
- betas = betas.detach().cpu().numpy() if isinstance(betas, torch.Tensor) else betas
83
- else:
84
- if 'schedule_type' in hparams.keys():
85
- betas = beta_schedule[hparams['schedule_type']](timesteps)
86
- else:
87
- betas = cosine_beta_schedule(timesteps)
88
-
89
- alphas = 1. - betas
90
- alphas_cumprod = np.cumprod(alphas, axis=0)
91
- alphas_cumprod_prev = np.append(1., alphas_cumprod[:-1])
92
-
93
- timesteps, = betas.shape
94
- self.num_timesteps = int(timesteps)
95
- self.K_step = K_step
96
- self.loss_type = loss_type
97
-
98
- to_torch = partial(torch.tensor, dtype=torch.float32)
99
-
100
- self.register_buffer('betas', to_torch(betas))
101
- self.register_buffer('alphas_cumprod', to_torch(alphas_cumprod))
102
- self.register_buffer('alphas_cumprod_prev', to_torch(alphas_cumprod_prev))
103
-
104
- # calculations for diffusion q(x_t | x_{t-1}) and others
105
- self.register_buffer('sqrt_alphas_cumprod', to_torch(np.sqrt(alphas_cumprod)))
106
- self.register_buffer('sqrt_one_minus_alphas_cumprod', to_torch(np.sqrt(1. - alphas_cumprod)))
107
- self.register_buffer('log_one_minus_alphas_cumprod', to_torch(np.log(1. - alphas_cumprod)))
108
- self.register_buffer('sqrt_recip_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod)))
109
- self.register_buffer('sqrt_recipm1_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod - 1)))
110
-
111
- # calculations for posterior q(x_{t-1} | x_t, x_0)
112
- posterior_variance = betas * (1. - alphas_cumprod_prev) / (1. - alphas_cumprod)
113
- # above: equal to 1. / (1. / (1. - alpha_cumprod_tm1) + alpha_t / beta_t)
114
- self.register_buffer('posterior_variance', to_torch(posterior_variance))
115
- # below: log calculation clipped because the posterior variance is 0 at the beginning of the diffusion chain
116
- self.register_buffer('posterior_log_variance_clipped', to_torch(np.log(np.maximum(posterior_variance, 1e-20))))
117
- self.register_buffer('posterior_mean_coef1', to_torch(
118
- betas * np.sqrt(alphas_cumprod_prev) / (1. - alphas_cumprod)))
119
- self.register_buffer('posterior_mean_coef2', to_torch(
120
- (1. - alphas_cumprod_prev) * np.sqrt(alphas) / (1. - alphas_cumprod)))
121
-
122
- self.register_buffer('spec_min', torch.FloatTensor(spec_min)[None, None, :hparams['keep_bins']])
123
- self.register_buffer('spec_max', torch.FloatTensor(spec_max)[None, None, :hparams['keep_bins']])
124
-
125
- def q_mean_variance(self, x_start, t):
126
- mean = extract(self.sqrt_alphas_cumprod, t, x_start.shape) * x_start
127
- variance = extract(1. - self.alphas_cumprod, t, x_start.shape)
128
- log_variance = extract(self.log_one_minus_alphas_cumprod, t, x_start.shape)
129
- return mean, variance, log_variance
130
-
131
- def predict_start_from_noise(self, x_t, t, noise):
132
- return (
133
- extract(self.sqrt_recip_alphas_cumprod, t, x_t.shape) * x_t -
134
- extract(self.sqrt_recipm1_alphas_cumprod, t, x_t.shape) * noise
135
- )
136
-
137
- def q_posterior(self, x_start, x_t, t):
138
- posterior_mean = (
139
- extract(self.posterior_mean_coef1, t, x_t.shape) * x_start +
140
- extract(self.posterior_mean_coef2, t, x_t.shape) * x_t
141
- )
142
- posterior_variance = extract(self.posterior_variance, t, x_t.shape)
143
- posterior_log_variance_clipped = extract(self.posterior_log_variance_clipped, t, x_t.shape)
144
- return posterior_mean, posterior_variance, posterior_log_variance_clipped
145
-
146
- def p_mean_variance(self, x, t, cond, clip_denoised: bool):
147
- noise_pred = self.denoise_fn(x, t, cond=cond)
148
- x_recon = self.predict_start_from_noise(x, t=t, noise=noise_pred)
149
-
150
- if clip_denoised:
151
- x_recon.clamp_(-1., 1.)
152
-
153
- model_mean, posterior_variance, posterior_log_variance = self.q_posterior(x_start=x_recon, x_t=x, t=t)
154
- return model_mean, posterior_variance, posterior_log_variance
155
-
156
- @torch.no_grad()
157
- def p_sample(self, x, t, cond, clip_denoised=True, repeat_noise=False):
158
- b, *_, device = *x.shape, x.device
159
- model_mean, _, model_log_variance = self.p_mean_variance(x=x, t=t, cond=cond, clip_denoised=clip_denoised)
160
- noise = noise_like(x.shape, device, repeat_noise)
161
- # no noise when t == 0
162
- nonzero_mask = (1 - (t == 0).float()).reshape(b, *((1,) * (len(x.shape) - 1)))
163
- return model_mean + nonzero_mask * (0.5 * model_log_variance).exp() * noise
164
-
165
- def q_sample(self, x_start, t, noise=None):
166
- noise = default(noise, lambda: torch.randn_like(x_start))
167
- return (
168
- extract(self.sqrt_alphas_cumprod, t, x_start.shape) * x_start +
169
- extract(self.sqrt_one_minus_alphas_cumprod, t, x_start.shape) * noise
170
- )
171
-
172
- def p_losses(self, x_start, t, cond, noise=None, nonpadding=None):
173
- noise = default(noise, lambda: torch.randn_like(x_start))
174
-
175
- x_noisy = self.q_sample(x_start=x_start, t=t, noise=noise)
176
- x_recon = self.denoise_fn(x_noisy, t, cond)
177
-
178
- if self.loss_type == 'l1':
179
- if nonpadding is not None:
180
- loss = ((noise - x_recon).abs() * nonpadding.unsqueeze(1)).mean()
181
- else:
182
- # print('are you sure w/o nonpadding?')
183
- loss = (noise - x_recon).abs().mean()
184
-
185
- elif self.loss_type == 'l2':
186
- loss = F.mse_loss(noise, x_recon)
187
- else:
188
- raise NotImplementedError()
189
-
190
- return loss
191
-
192
- def forward(self, txt_tokens, mel2ph=None, spk_embed=None,
193
- ref_mels=None, f0=None, uv=None, energy=None, infer=False, **kwargs):
194
- b, *_, device = *txt_tokens.shape, txt_tokens.device
195
- ret = self.fs2(txt_tokens, mel2ph, spk_embed, ref_mels, f0, uv, energy,
196
- skip_decoder=(not infer), infer=infer, **kwargs)
197
- cond = ret['decoder_inp'].transpose(1, 2)
198
-
199
- if not infer:
200
- t = torch.randint(0, self.K_step, (b,), device=device).long()
201
- x = ref_mels
202
- x = self.norm_spec(x)
203
- x = x.transpose(1, 2)[:, None, :, :] # [B, 1, M, T]
204
- ret['diff_loss'] = self.p_losses(x, t, cond)
205
- # nonpadding = (mel2ph != 0).float()
206
- # ret['diff_loss'] = self.p_losses(x, t, cond, nonpadding=nonpadding)
207
- else:
208
- ret['fs2_mel'] = ret['mel_out']
209
- fs2_mels = ret['mel_out']
210
- t = self.K_step
211
- fs2_mels = self.norm_spec(fs2_mels)
212
- fs2_mels = fs2_mels.transpose(1, 2)[:, None, :, :]
213
-
214
- x = self.q_sample(x_start=fs2_mels, t=torch.tensor([t - 1], device=device).long())
215
- if hparams.get('gaussian_start') is not None and hparams['gaussian_start']:
216
- print('===> gaussion start.')
217
- shape = (cond.shape[0], 1, self.mel_bins, cond.shape[2])
218
- x = torch.randn(shape, device=device)
219
- for i in tqdm(reversed(range(0, t)), desc='sample time step', total=t):
220
- x = self.p_sample(x, torch.full((b,), i, device=device, dtype=torch.long), cond)
221
- x = x[:, 0].transpose(1, 2)
222
- if mel2ph is not None: # for singing
223
- ret['mel_out'] = self.denorm_spec(x) * ((mel2ph > 0).float()[:, :, None])
224
- else:
225
- ret['mel_out'] = self.denorm_spec(x)
226
- return ret
227
-
228
- def norm_spec(self, x):
229
- return (x - self.spec_min) / (self.spec_max - self.spec_min) * 2 - 1
230
-
231
- def denorm_spec(self, x):
232
- return (x + 1) / 2 * (self.spec_max - self.spec_min) + self.spec_min
233
-
234
- def cwt2f0_norm(self, cwt_spec, mean, std, mel2ph):
235
- return self.fs2.cwt2f0_norm(cwt_spec, mean, std, mel2ph)
236
-
237
- def out2mel(self, x):
238
- return x
239
-
240
-
241
- class OfflineGaussianDiffusion(GaussianDiffusion):
242
- def forward(self, txt_tokens, mel2ph=None, spk_embed=None,
243
- ref_mels=None, f0=None, uv=None, energy=None, infer=False, **kwargs):
244
- b, *_, device = *txt_tokens.shape, txt_tokens.device
245
-
246
- ret = self.fs2(txt_tokens, mel2ph, spk_embed, ref_mels, f0, uv, energy,
247
- skip_decoder=True, infer=True, **kwargs)
248
- cond = ret['decoder_inp'].transpose(1, 2)
249
- fs2_mels = ref_mels[1]
250
- ref_mels = ref_mels[0]
251
-
252
- if not infer:
253
- t = torch.randint(0, self.K_step, (b,), device=device).long()
254
- x = ref_mels
255
- x = self.norm_spec(x)
256
- x = x.transpose(1, 2)[:, None, :, :] # [B, 1, M, T]
257
- ret['diff_loss'] = self.p_losses(x, t, cond)
258
- else:
259
- t = self.K_step
260
- fs2_mels = self.norm_spec(fs2_mels)
261
- fs2_mels = fs2_mels.transpose(1, 2)[:, None, :, :]
262
-
263
- x = self.q_sample(x_start=fs2_mels, t=torch.tensor([t - 1], device=device).long())
264
-
265
- if hparams.get('gaussian_start') is not None and hparams['gaussian_start']:
266
- print('===> gaussion start.')
267
- shape = (cond.shape[0], 1, self.mel_bins, cond.shape[2])
268
- x = torch.randn(shape, device=device)
269
- for i in tqdm(reversed(range(0, t)), desc='sample time step', total=t):
270
- x = self.p_sample(x, torch.full((b,), i, device=device, dtype=torch.long), cond)
271
- x = x[:, 0].transpose(1, 2)
272
- ret['mel_out'] = self.denorm_spec(x)
273
- return ret
1
+ import math
2
+ import random
3
+ from collections import deque
4
+ from functools import partial
5
+ from inspect import isfunction
6
+ from pathlib import Path
7
+ import numpy as np
8
+ import torch
9
+ import torch.nn.functional as F
10
+ from torch import nn
11
+ from tqdm import tqdm
12
+ from einops import rearrange
13
+
14
+ from modules.fastspeech.fs2 import FastSpeech2
15
+ from modules.diffsinger_midi.fs2 import FastSpeech2MIDI
16
+ from utils.hparams import hparams
17
+
18
+
19
+
20
+ def exists(x):
21
+ return x is not None
22
+
23
+
24
+ def default(val, d):
25
+ if exists(val):
26
+ return val
27
+ return d() if isfunction(d) else d
28
+
29
+
30
+ # gaussian diffusion trainer class
31
+
32
+ def extract(a, t, x_shape):
33
+ b, *_ = t.shape
34
+ out = a.gather(-1, t)
35
+ return out.reshape(b, *((1,) * (len(x_shape) - 1)))
36
+
37
+
38
+ def noise_like(shape, device, repeat=False):
39
+ repeat_noise = lambda: torch.randn((1, *shape[1:]), device=device).repeat(shape[0], *((1,) * (len(shape) - 1)))
40
+ noise = lambda: torch.randn(shape, device=device)
41
+ return repeat_noise() if repeat else noise()
42
+
43
+
44
+ def linear_beta_schedule(timesteps, max_beta=hparams.get('max_beta', 0.01)):
45
+ """
46
+ linear schedule
47
+ """
48
+ betas = np.linspace(1e-4, max_beta, timesteps)
49
+ return betas
50
+
51
+
52
+ def cosine_beta_schedule(timesteps, s=0.008):
53
+ """
54
+ cosine schedule
55
+ as proposed in https://openreview.net/forum?id=-NEXDKk8gZ
56
+ """
57
+ steps = timesteps + 1
58
+ x = np.linspace(0, steps, steps)
59
+ alphas_cumprod = np.cos(((x / steps) + s) / (1 + s) * np.pi * 0.5) ** 2
60
+ alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
61
+ betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
62
+ return np.clip(betas, a_min=0, a_max=0.999)
63
+
64
+
65
+ beta_schedule = {
66
+ "cosine": cosine_beta_schedule,
67
+ "linear": linear_beta_schedule,
68
+ }
69
+
70
+
71
+ class GaussianDiffusion(nn.Module):
72
+ def __init__(self, phone_encoder, out_dims, denoise_fn,
73
+ timesteps=1000, K_step=1000, loss_type=hparams.get('diff_loss_type', 'l1'), betas=None, spec_min=None, spec_max=None):
74
+ super().__init__()
75
+ self.denoise_fn = denoise_fn
76
+ if hparams.get('use_midi') is not None and hparams['use_midi']:
77
+ self.fs2 = FastSpeech2MIDI(phone_encoder, out_dims)
78
+ else:
79
+ self.fs2 = FastSpeech2(phone_encoder, out_dims)
80
+ self.mel_bins = out_dims
81
+
82
+ if exists(betas):
83
+ betas = betas.detach().cpu().numpy() if isinstance(betas, torch.Tensor) else betas
84
+ else:
85
+ if 'schedule_type' in hparams.keys():
86
+ betas = beta_schedule[hparams['schedule_type']](timesteps)
87
+ else:
88
+ betas = cosine_beta_schedule(timesteps)
89
+
90
+ alphas = 1. - betas
91
+ alphas_cumprod = np.cumprod(alphas, axis=0)
92
+ alphas_cumprod_prev = np.append(1., alphas_cumprod[:-1])
93
+
94
+ timesteps, = betas.shape
95
+ self.num_timesteps = int(timesteps)
96
+ self.K_step = K_step
97
+ self.loss_type = loss_type
98
+
99
+ self.noise_list = deque(maxlen=4)
100
+
101
+ to_torch = partial(torch.tensor, dtype=torch.float32)
102
+
103
+ self.register_buffer('betas', to_torch(betas))
104
+ self.register_buffer('alphas_cumprod', to_torch(alphas_cumprod))
105
+ self.register_buffer('alphas_cumprod_prev', to_torch(alphas_cumprod_prev))
106
+
107
+ # calculations for diffusion q(x_t | x_{t-1}) and others
108
+ self.register_buffer('sqrt_alphas_cumprod', to_torch(np.sqrt(alphas_cumprod)))
109
+ self.register_buffer('sqrt_one_minus_alphas_cumprod', to_torch(np.sqrt(1. - alphas_cumprod)))
110
+ self.register_buffer('log_one_minus_alphas_cumprod', to_torch(np.log(1. - alphas_cumprod)))
111
+ self.register_buffer('sqrt_recip_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod)))
112
+ self.register_buffer('sqrt_recipm1_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod - 1)))
113
+
114
+ # calculations for posterior q(x_{t-1} | x_t, x_0)
115
+ posterior_variance = betas * (1. - alphas_cumprod_prev) / (1. - alphas_cumprod)
116
+ # above: equal to 1. / (1. / (1. - alpha_cumprod_tm1) + alpha_t / beta_t)
117
+ self.register_buffer('posterior_variance', to_torch(posterior_variance))
118
+ # below: log calculation clipped because the posterior variance is 0 at the beginning of the diffusion chain
119
+ self.register_buffer('posterior_log_variance_clipped', to_torch(np.log(np.maximum(posterior_variance, 1e-20))))
120
+ self.register_buffer('posterior_mean_coef1', to_torch(
121
+ betas * np.sqrt(alphas_cumprod_prev) / (1. - alphas_cumprod)))
122
+ self.register_buffer('posterior_mean_coef2', to_torch(
123
+ (1. - alphas_cumprod_prev) * np.sqrt(alphas) / (1. - alphas_cumprod)))
124
+
125
+ self.register_buffer('spec_min', torch.FloatTensor(spec_min)[None, None, :hparams['keep_bins']])
126
+ self.register_buffer('spec_max', torch.FloatTensor(spec_max)[None, None, :hparams['keep_bins']])
127
+
128
+ def q_mean_variance(self, x_start, t):
129
+ mean = extract(self.sqrt_alphas_cumprod, t, x_start.shape) * x_start
130
+ variance = extract(1. - self.alphas_cumprod, t, x_start.shape)
131
+ log_variance = extract(self.log_one_minus_alphas_cumprod, t, x_start.shape)
132
+ return mean, variance, log_variance
133
+
134
+ def predict_start_from_noise(self, x_t, t, noise):
135
+ return (
136
+ extract(self.sqrt_recip_alphas_cumprod, t, x_t.shape) * x_t -
137
+ extract(self.sqrt_recipm1_alphas_cumprod, t, x_t.shape) * noise
138
+ )
139
+
140
+ def q_posterior(self, x_start, x_t, t):
141
+ posterior_mean = (
142
+ extract(self.posterior_mean_coef1, t, x_t.shape) * x_start +
143
+ extract(self.posterior_mean_coef2, t, x_t.shape) * x_t
144
+ )
145
+ posterior_variance = extract(self.posterior_variance, t, x_t.shape)
146
+ posterior_log_variance_clipped = extract(self.posterior_log_variance_clipped, t, x_t.shape)
147
+ return posterior_mean, posterior_variance, posterior_log_variance_clipped
148
+
149
+ def p_mean_variance(self, x, t, cond, clip_denoised: bool):
150
+ noise_pred = self.denoise_fn(x, t, cond=cond)
151
+ x_recon = self.predict_start_from_noise(x, t=t, noise=noise_pred)
152
+
153
+ if clip_denoised:
154
+ x_recon.clamp_(-1., 1.)
155
+
156
+ model_mean, posterior_variance, posterior_log_variance = self.q_posterior(x_start=x_recon, x_t=x, t=t)
157
+ return model_mean, posterior_variance, posterior_log_variance
158
+
159
+ @torch.no_grad()
160
+ def p_sample(self, x, t, cond, clip_denoised=True, repeat_noise=False):
161
+ b, *_, device = *x.shape, x.device
162
+ model_mean, _, model_log_variance = self.p_mean_variance(x=x, t=t, cond=cond, clip_denoised=clip_denoised)
163
+ noise = noise_like(x.shape, device, repeat_noise)
164
+ # no noise when t == 0
165
+ nonzero_mask = (1 - (t == 0).float()).reshape(b, *((1,) * (len(x.shape) - 1)))
166
+ return model_mean + nonzero_mask * (0.5 * model_log_variance).exp() * noise
167
+
168
+ @torch.no_grad()
169
+ def p_sample_plms(self, x, t, interval, cond, clip_denoised=True, repeat_noise=False):
170
+ """
171
+ Use the PLMS method from [Pseudo Numerical Methods for Diffusion Models on Manifolds](https://arxiv.org/abs/2202.09778).
172
+ """
173
+
174
+ def get_x_pred(x, noise_t, t):
175
+ a_t = extract(self.alphas_cumprod, t, x.shape)
176
+ if t[0] < interval:
177
+ a_prev = torch.ones_like(a_t)
178
+ else:
179
+ a_prev = extract(self.alphas_cumprod, torch.max(t-interval, torch.zeros_like(t)), x.shape)
180
+ a_t_sq, a_prev_sq = a_t.sqrt(), a_prev.sqrt()
181
+
182
+ x_delta = (a_prev - a_t) * ((1 / (a_t_sq * (a_t_sq + a_prev_sq))) * x - 1 / (a_t_sq * (((1 - a_prev) * a_t).sqrt() + ((1 - a_t) * a_prev).sqrt())) * noise_t)
183
+ x_pred = x + x_delta
184
+
185
+ return x_pred
186
+
187
+ noise_list = self.noise_list
188
+ noise_pred = self.denoise_fn(x, t, cond=cond)
189
+
190
+ if len(noise_list) == 0:
191
+ x_pred = get_x_pred(x, noise_pred, t)
192
+ noise_pred_prev = self.denoise_fn(x_pred, max(t-interval, 0), cond=cond)
193
+ noise_pred_prime = (noise_pred + noise_pred_prev) / 2
194
+ elif len(noise_list) == 1:
195
+ noise_pred_prime = (3 * noise_pred - noise_list[-1]) / 2
196
+ elif len(noise_list) == 2:
197
+ noise_pred_prime = (23 * noise_pred - 16 * noise_list[-1] + 5 * noise_list[-2]) / 12
198
+ elif len(noise_list) >= 3:
199
+ noise_pred_prime = (55 * noise_pred - 59 * noise_list[-1] + 37 * noise_list[-2] - 9 * noise_list[-3]) / 24
200
+
201
+ x_prev = get_x_pred(x, noise_pred_prime, t)
202
+ noise_list.append(noise_pred)
203
+
204
+ return x_prev
205
+
206
+ def q_sample(self, x_start, t, noise=None):
207
+ noise = default(noise, lambda: torch.randn_like(x_start))
208
+ return (
209
+ extract(self.sqrt_alphas_cumprod, t, x_start.shape) * x_start +
210
+ extract(self.sqrt_one_minus_alphas_cumprod, t, x_start.shape) * noise
211
+ )
212
+
213
+ def p_losses(self, x_start, t, cond, noise=None, nonpadding=None):
214
+ noise = default(noise, lambda: torch.randn_like(x_start))
215
+
216
+ x_noisy = self.q_sample(x_start=x_start, t=t, noise=noise)
217
+ x_recon = self.denoise_fn(x_noisy, t, cond)
218
+
219
+ if self.loss_type == 'l1':
220
+ if nonpadding is not None:
221
+ loss = ((noise - x_recon).abs() * nonpadding.unsqueeze(1)).mean()
222
+ else:
223
+ # print('are you sure w/o nonpadding?')
224
+ loss = (noise - x_recon).abs().mean()
225
+
226
+ elif self.loss_type == 'l2':
227
+ loss = F.mse_loss(noise, x_recon)
228
+ else:
229
+ raise NotImplementedError()
230
+
231
+ return loss
232
+
233
+ def forward(self, txt_tokens, mel2ph=None, spk_embed=None,
234
+ ref_mels=None, f0=None, uv=None, energy=None, infer=False, **kwargs):
235
+ b, *_, device = *txt_tokens.shape, txt_tokens.device
236
+ ret = self.fs2(txt_tokens, mel2ph, spk_embed, ref_mels, f0, uv, energy,
237
+ skip_decoder=(not infer), infer=infer, **kwargs)
238
+ cond = ret['decoder_inp'].transpose(1, 2)
239
+
240
+ if not infer:
241
+ t = torch.randint(0, self.K_step, (b,), device=device).long()
242
+ x = ref_mels
243
+ x = self.norm_spec(x)
244
+ x = x.transpose(1, 2)[:, None, :, :] # [B, 1, M, T]
245
+ ret['diff_loss'] = self.p_losses(x, t, cond)
246
+ # nonpadding = (mel2ph != 0).float()
247
+ # ret['diff_loss'] = self.p_losses(x, t, cond, nonpadding=nonpadding)
248
+ else:
249
+ ret['fs2_mel'] = ret['mel_out']
250
+ fs2_mels = ret['mel_out']
251
+ t = self.K_step
252
+ fs2_mels = self.norm_spec(fs2_mels)
253
+ fs2_mels = fs2_mels.transpose(1, 2)[:, None, :, :]
254
+
255
+ x = self.q_sample(x_start=fs2_mels, t=torch.tensor([t - 1], device=device).long())
256
+ if hparams.get('gaussian_start') is not None and hparams['gaussian_start']:
257
+ print('===> gaussian start.')
258
+ shape = (cond.shape[0], 1, self.mel_bins, cond.shape[2])
259
+ x = torch.randn(shape, device=device)
260
+
261
+ if hparams.get('pndm_speedup'):
262
+ print('===> pndm speed:', hparams['pndm_speedup'])
263
+ self.noise_list = deque(maxlen=4)
264
+ iteration_interval = hparams['pndm_speedup']
265
+ for i in tqdm(reversed(range(0, t, iteration_interval)), desc='sample time step',
266
+ total=t // iteration_interval):
267
+ x = self.p_sample_plms(x, torch.full((b,), i, device=device, dtype=torch.long), iteration_interval,
268
+ cond)
269
+ else:
270
+ for i in tqdm(reversed(range(0, t)), desc='sample time step', total=t):
271
+ x = self.p_sample(x, torch.full((b,), i, device=device, dtype=torch.long), cond)
272
+ x = x[:, 0].transpose(1, 2)
273
+ if mel2ph is not None: # for singing
274
+ ret['mel_out'] = self.denorm_spec(x) * ((mel2ph > 0).float()[:, :, None])
275
+ else:
276
+ ret['mel_out'] = self.denorm_spec(x)
277
+ return ret
278
+
279
+ def norm_spec(self, x):
280
+ return (x - self.spec_min) / (self.spec_max - self.spec_min) * 2 - 1
281
+
282
+ def denorm_spec(self, x):
283
+ return (x + 1) / 2 * (self.spec_max - self.spec_min) + self.spec_min
284
+
285
+ def cwt2f0_norm(self, cwt_spec, mean, std, mel2ph):
286
+ return self.fs2.cwt2f0_norm(cwt_spec, mean, std, mel2ph)
287
+
288
+ def out2mel(self, x):
289
+ return x
290
+
291
+
292
+ class OfflineGaussianDiffusion(GaussianDiffusion):
293
+ def forward(self, txt_tokens, mel2ph=None, spk_embed=None,
294
+ ref_mels=None, f0=None, uv=None, energy=None, infer=False, **kwargs):
295
+ b, *_, device = *txt_tokens.shape, txt_tokens.device
296
+
297
+ ret = self.fs2(txt_tokens, mel2ph, spk_embed, ref_mels, f0, uv, energy,
298
+ skip_decoder=True, infer=True, **kwargs)
299
+ cond = ret['decoder_inp'].transpose(1, 2)
300
+ fs2_mels = ref_mels[1]
301
+ ref_mels = ref_mels[0]
302
+
303
+ if not infer:
304
+ t = torch.randint(0, self.K_step, (b,), device=device).long()
305
+ x = ref_mels
306
+ x = self.norm_spec(x)
307
+ x = x.transpose(1, 2)[:, None, :, :] # [B, 1, M, T]
308
+ ret['diff_loss'] = self.p_losses(x, t, cond)
309
+ else:
310
+ t = self.K_step
311
+ fs2_mels = self.norm_spec(fs2_mels)
312
+ fs2_mels = fs2_mels.transpose(1, 2)[:, None, :, :]
313
+
314
+ x = self.q_sample(x_start=fs2_mels, t=torch.tensor([t - 1], device=device).long())
315
+
316
+ if hparams.get('gaussian_start') is not None and hparams['gaussian_start']:
317
+ print('===> gaussian start.')
318
+ shape = (cond.shape[0], 1, self.mel_bins, cond.shape[2])
319
+ x = torch.randn(shape, device=device)
320
+ for i in tqdm(reversed(range(0, t)), desc='sample time step', total=t):
321
+ x = self.p_sample(x, torch.full((b,), i, device=device, dtype=torch.long), cond)
322
+ x = x[:, 0].transpose(1, 2)
323
+ ret['mel_out'] = self.denorm_spec(x)
324
+ return ret
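The substantive addition in this file is PLMS sampling from PNDM ([Pseudo Numerical Methods for Diffusion Models on Manifolds](https://arxiv.org/abs/2202.09778)): instead of one denoiser call per diffusion step, the sampler strides `pndm_speedup` steps at a time and combines the current noise prediction with up to three cached ones using linear multistep (Adams-Bashforth-style) coefficients. With `K_step: 60` and, say, `pndm_speedup: 10`, inference drops from 60 denoiser calls to 6 strided ones plus one warm-up evaluation. A toy sketch of just the coefficient logic mirrored from `p_sample_plms` above:

```python
from collections import deque
import torch

def plms_combine(noise_pred, noise_list):
    """Linear multistep combination used by p_sample_plms (orders 2-4).

    The empty-cache case is omitted: in the diff above it runs the denoiser a
    second time at the predicted point and averages (a Heun-style warm-up).
    """
    if len(noise_list) == 1:
        return (3 * noise_pred - noise_list[-1]) / 2
    if len(noise_list) == 2:
        return (23 * noise_pred - 16 * noise_list[-1] + 5 * noise_list[-2]) / 12
    return (55 * noise_pred - 59 * noise_list[-1]
            + 37 * noise_list[-2] - 9 * noise_list[-3]) / 24

# For a linearly drifting noise estimate, the 4th-order rule returns the
# average of the trend over the next step:
history = deque([torch.full((1,), v) for v in (0.9, 1.0, 1.1)], maxlen=4)
print(plms_combine(torch.full((1,), 1.2), history))  # tensor([1.2500])
```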
utils/hparams.py CHANGED
@@ -21,35 +21,30 @@ def override_config(old_config: dict, new_config: dict):
21
 
22
 
23
  def set_hparams(config='', exp_name='', hparams_str='', print_hparams=True, global_hparams=True):
24
- if config == '' and exp_name == '':
25
- parser = argparse.ArgumentParser(description='')
26
  parser.add_argument('--config', type=str, default='',
27
help='path to the config file')
28
  parser.add_argument('--exp_name', type=str, default='', help='exp_name')
29
- parser.add_argument('-hp', '--hparams', type=str, default='',
30
help='hparams overrides, e.g. "k1=v1,k2=v2"')
31
  parser.add_argument('--infer', action='store_true', help='infer')
32
  parser.add_argument('--validate', action='store_true', help='validate')
33
  parser.add_argument('--reset', action='store_true', help='reset hparams')
34
- parser.add_argument('--remove', action='store_true', help='remove old ckpt')
35
  parser.add_argument('--debug', action='store_true', help='debug')
36
  args, unknown = parser.parse_known_args()
37
- print("| Unknow hparams: ", unknown)
38
  else:
39
  args = Args(config=config, exp_name=exp_name, hparams=hparams_str,
40
- infer=False, validate=False, reset=False, debug=False, remove=False)
41
- global hparams
42
- assert args.config != '' or args.exp_name != ''
43
- if args.config != '':
44
- assert os.path.exists(args.config)
45
 
46
  config_chains = []
47
  loaded_config = set()
48
 
49
- def load_config(config_fn):
50
- # depth-first inheritance, avoiding a second visit of any node
51
- if not os.path.exists(config_fn):
52
- return {}
53
  with open(config_fn) as f:
54
  hparams_ = yaml.safe_load(f)
55
  loaded_config.add(config_fn)
@@ -58,10 +53,10 @@ def set_hparams(config='', exp_name='', hparams_str='', print_hparams=True, glob
58
  if not isinstance(hparams_['base_config'], list):
59
  hparams_['base_config'] = [hparams_['base_config']]
60
  for c in hparams_['base_config']:
61
- if c.startswith('.'):
62
- c = f'{os.path.dirname(config_fn)}/{c}'
63
- c = os.path.normpath(c)
64
  if c not in loaded_config:
65
  override_config(ret_hparams, load_config(c))
66
  override_config(ret_hparams, hparams_)
67
  else:
@@ -69,43 +64,36 @@ def set_hparams(config='', exp_name='', hparams_str='', print_hparams=True, glob
69
  config_chains.append(config_fn)
70
  return ret_hparams
71
 
 
 
72
  saved_hparams = {}
73
- args_work_dir = ''
74
- if args.exp_name != '':
75
- args_work_dir = f'checkpoints/{args.exp_name}'
76
  ckpt_config_path = f'{args_work_dir}/config.yaml'
77
  if os.path.exists(ckpt_config_path):
78
- with open(ckpt_config_path) as f:
79
- saved_hparams_ = yaml.safe_load(f)
80
- if saved_hparams_ is not None:
81
- saved_hparams.update(saved_hparams_)
82
  hparams_ = {}
83
- if args.config != '':
84
- hparams_.update(load_config(args.config))
 
85
  if not args.reset:
86
  hparams_.update(saved_hparams)
87
  hparams_['work_dir'] = args_work_dir
88
 
89
- # Support config overriding in command line. Support list type config overriding.
90
- # Examples: --hparams="a=1,b.c=2,d=[1 1 1]"
91
  if args.hparams != "":
92
  for new_hparam in args.hparams.split(","):
93
  k, v = new_hparam.split("=")
94
- v = v.strip("\'\" ")
95
- config_node = hparams_
96
- for k_ in k.split(".")[:-1]:
97
- config_node = config_node[k_]
98
- k = k.split(".")[-1]
99
- if v in ['True', 'False'] or type(config_node[k]) in [bool, list, dict]:
100
- if type(config_node[k]) == list:
101
- v = v.replace(" ", ",")
102
- config_node[k] = eval(v)
103
  else:
104
- config_node[k] = type(config_node[k])(v)
105
- if args_work_dir != '' and args.remove:
106
- answer = input("REMOVE old checkpoint? Y/N [Default: N]: ")
107
- if answer.lower() == "y":
108
- remove_file(args_work_dir)
109
  if args_work_dir != '' and (not os.path.exists(ckpt_config_path) or args.reset) and not args.infer:
110
  os.makedirs(hparams_['work_dir'], exist_ok=True)
111
  with open(ckpt_config_path, 'w') as f:
@@ -114,11 +102,11 @@ def set_hparams(config='', exp_name='', hparams_str='', print_hparams=True, glob
114
  hparams_['infer'] = args.infer
115
  hparams_['debug'] = args.debug
116
  hparams_['validate'] = args.validate
117
- hparams_['exp_name'] = args.exp_name
118
  global global_print_hparams
119
  if global_hparams:
120
  hparams.clear()
121
  hparams.update(hparams_)
 
122
  if print_hparams and global_print_hparams and global_hparams:
123
  print('| Hparams chains: ', config_chains)
124
  print('| Hparams: ')
@@ -126,5 +114,9 @@ def set_hparams(config='', exp_name='', hparams_str='', print_hparams=True, glob
126
  print(f"\033[;33;m{k}\033[0m: {v}, ", end="\n" if i % 5 == 4 else "")
127
  print("")
128
  global_print_hparams = False
129
  return hparams_
130
-
 
21
 
22
 
23
  def set_hparams(config='', exp_name='', hparams_str='', print_hparams=True, global_hparams=True):
24
+ if config == '':
25
+ parser = argparse.ArgumentParser(description='neural music')
26
  parser.add_argument('--config', type=str, default='',
27
help='path to the config file')
28
  parser.add_argument('--exp_name', type=str, default='', help='exp_name')
29
+ parser.add_argument('--hparams', type=str, default='',
30
help='hparams overrides, e.g. "k1=v1,k2=v2"')
31
  parser.add_argument('--infer', action='store_true', help='infer')
32
  parser.add_argument('--validate', action='store_true', help='validate')
33
  parser.add_argument('--reset', action='store_true', help='reset hparams')
 
34
  parser.add_argument('--debug', action='store_true', help='debug')
35
  args, unknown = parser.parse_known_args()
 
36
  else:
37
  args = Args(config=config, exp_name=exp_name, hparams=hparams_str,
38
+ infer=False, validate=False, reset=False, debug=False)
39
+ args_work_dir = ''
40
+ if args.exp_name != '':
41
+ args.work_dir = args.exp_name
42
+ args_work_dir = f'checkpoints/{args.work_dir}'
43
 
44
  config_chains = []
45
  loaded_config = set()
46
 
47
+ def load_config(config_fn): # depth-first
48
  with open(config_fn) as f:
49
  hparams_ = yaml.safe_load(f)
50
  loaded_config.add(config_fn)
 
53
  if not isinstance(hparams_['base_config'], list):
54
  hparams_['base_config'] = [hparams_['base_config']]
55
  for c in hparams_['base_config']:
56
  if c not in loaded_config:
57
+ if c.startswith('.'):
58
+ c = f'{os.path.dirname(config_fn)}/{c}'
59
+ c = os.path.normpath(c)
60
  override_config(ret_hparams, load_config(c))
61
  override_config(ret_hparams, hparams_)
62
  else:
 
64
  config_chains.append(config_fn)
65
  return ret_hparams
66
 
67
+ global hparams
68
+ assert args.config != '' or args_work_dir != ''
69
  saved_hparams = {}
70
+ if args_work_dir != 'checkpoints/':
 
 
71
  ckpt_config_path = f'{args_work_dir}/config.yaml'
72
  if os.path.exists(ckpt_config_path):
73
+ try:
74
+ with open(ckpt_config_path) as f:
75
+ saved_hparams.update(yaml.safe_load(f))
76
+ except:
77
+ pass
78
+ if args.config == '':
79
+ args.config = ckpt_config_path
80
+
81
  hparams_ = {}
82
+
83
+ hparams_.update(load_config(args.config))
84
+
85
  if not args.reset:
86
  hparams_.update(saved_hparams)
87
  hparams_['work_dir'] = args_work_dir
88
 
 
 
89
  if args.hparams != "":
90
  for new_hparam in args.hparams.split(","):
91
  k, v = new_hparam.split("=")
92
+ if v in ['True', 'False'] or type(hparams_[k]) == bool:
93
+ hparams_[k] = eval(v)
94
  else:
95
+ hparams_[k] = type(hparams_[k])(v)
96
+
97
  if args_work_dir != '' and (not os.path.exists(ckpt_config_path) or args.reset) and not args.infer:
98
  os.makedirs(hparams_['work_dir'], exist_ok=True)
99
  with open(ckpt_config_path, 'w') as f:
 
102
  hparams_['infer'] = args.infer
103
  hparams_['debug'] = args.debug
104
  hparams_['validate'] = args.validate
 
105
  global global_print_hparams
106
  if global_hparams:
107
  hparams.clear()
108
  hparams.update(hparams_)
109
+
110
  if print_hparams and global_print_hparams and global_hparams:
111
  print('| Hparams chains: ', config_chains)
112
  print('| Hparams: ')
 
114
  print(f"\033[;33;m{k}\033[0m: {v}, ", end="\n" if i % 5 == 4 else "")
115
  print("")
116
  global_print_hparams = False
117
+ # print(hparams_.keys())
118
+ if hparams.get('exp_name') is None:
119
+ hparams['exp_name'] = args.exp_name
120
+ if hparams_.get('exp_name') is None:
121
+ hparams_['exp_name'] = args.exp_name
122
  return hparams_
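Net effect of the `set_hparams` rewrite: the function falls back to argparse only when `config` is empty; when `--config` is omitted but an experiment exists, the config is picked up from `checkpoints/<exp_name>/config.yaml`; and `exp_name` is recorded only if the loaded config did not already carry one. Note also that the new flat override loop drops the old dotted-key support (`b.c=2` style), so `--hparams` values must now be top-level keys. A hypothetical programmatic call (the YAML path is the one from this commit; the experiment name is made up):

```python
from utils.hparams import set_hparams, hparams

# Programmatic use must pass a non-empty `config`: with config == '' the new
# code goes down the argparse branch and ignores the keyword arguments.
set_hparams(config='usr/configs/midi/cascade/opencs/ds60_rel.yaml',
            exp_name='my_ds_exp')  # hypothetical experiment name

# Side effect: since infer/reset default to False and no saved config exists,
# this writes checkpoints/my_ds_exp/config.yaml on first use.
print(hparams['work_dir'])   # checkpoints/my_ds_exp
print(hparams['K_step'])     # 60, from the YAML above
```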