inLine-XJY committed
Commit c1b1764 • 1 Parent(s): d276afe

Delete README.md

Files changed (1):
  1. README.md +0 -127
README.md DELETED
@@ -1,127 +0,0 @@
# AudioLCM: Text-to-Audio Generation with Latent Consistency Models

#### Huadai Liu, Rongjie Huang, Yang Liu, Hengyuan Cao, Jialei Wang, Xize Cheng, Siqi Zheng, Zhou Zhao

PyTorch implementation of [AudioLCM]: efficient, high-quality text-to-audio generation with a latent consistency model.

<!-- [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2301.12661)
[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-blue)](https://huggingface.co/spaces/AIGC-Audio/Make_An_Audio)
[![GitHub Stars](https://img.shields.io/github/stars/Text-to-Audio/Make-An-Audio?style=social)](https://github.com/Text-to-Audio/Make-An-Audio) -->

We provide our implementation and pretrained models as open source in this repository.

Visit our [demo page](https://audiolcm.github.io/) for audio samples.

<!-- [Text-to-Audio HuggingFace Space](https://huggingface.co/spaces/AIGC-Audio/Make_An_Audio) | [Audio Inpainting HuggingFace Space](https://huggingface.co/spaces/AIGC-Audio/Make_An_Audio_inpaint) -->

## News
<!-- - Jan, 2023: **[Make-An-Audio](https://arxiv.org/abs/2207.06389)** submitted to arxiv. -->
- June 2024: **[AudioLCM]** released on GitHub.

## Quick Start
We provide an example of how you can quickly generate high-fidelity samples using AudioLCM.

To try it on your own dataset, clone this repo onto a machine with an NVIDIA GPU (CUDA + cuDNN) and follow the instructions below.

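For example (a minimal sketch; `REPO_URL` is a placeholder for this repository's actual address):
```bash
# REPO_URL is a placeholder: substitute this repository's address before running
REPO_URL=https://example.com/AudioLCM.git
git clone "$REPO_URL" AudioLCM
cd AudioLCM
```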
### Supported Datasets and Pretrained Models

Download the pretrained weights from [Google Drive](https://drive.google.com/drive/folders/1zZTI3-nHrUIywKFqwxlFO6PjB66JA8jI?usp=drive_link).
Download the bert-base-uncased weights from [Hugging Face](https://huggingface.co/google-bert/bert-base-uncased), the t5-v1_1-large weights from [Hugging Face](https://huggingface.co/google/t5-v1_1-large), and the CLAP weights from [Hugging Face](https://huggingface.co/microsoft/msclap/blob/main/CLAP_weights_2022.pth).

```
Download:
audiolcm.ckpt and put it into ./ckpts
BigVGAN vocoder and put it into ./vocoder/logs/bigvnat16k93.5w
t5-v1_1-large and put it into ./ldm/modules/encoders/CLAP
bert-base-uncased and put it into ./ldm/modules/encoders/CLAP
CLAP_weights_2022.pth and put it into ./wav_evaluation/useful_ckpts/CLAP
```
<!-- The directory structure should be:
```
useful_ckpts/
├── bigvgan
│   ├── args.yml
│   └── best_netG.pt
├── CLAP
│   ├── config.yml
│   └── CLAP_weights_2022.pth
└── maa1_full.ckpt
``` -->
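
For example, one way to fetch the Hugging Face weights and place everything in the locations listed above (a sketch: it assumes the Google Drive files are already in the current directory, that `git-lfs` is installed, and that the subdirectory names under `ldm/modules/encoders/CLAP` are our guess rather than a requirement stated by the repo):
```bash
# create the target directories listed above
mkdir -p ckpts vocoder/logs/bigvnat16k93.5w ldm/modules/encoders/CLAP wav_evaluation/useful_ckpts/CLAP

# fetch the Hugging Face weights (git-lfs required; subdirectory names are assumptions)
git lfs install
git clone https://huggingface.co/google-bert/bert-base-uncased ldm/modules/encoders/CLAP/bert-base-uncased
git clone https://huggingface.co/google/t5-v1_1-large ldm/modules/encoders/CLAP/t5-v1_1-large

# move the Google Drive downloads into place
mv audiolcm.ckpt ckpts/
mv CLAP_weights_2022.pth wav_evaluation/useful_ckpts/CLAP/
# put the downloaded BigVGAN vocoder files under vocoder/logs/bigvnat16k93.5w/
```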

### Dependencies
See the requirements in `requirement.txt`.

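For example (a sketch; the conda environment and Python version shown are illustrative, not requirements stated by this repo):
```bash
# optional: create an isolated environment (name and Python version are illustrative)
conda create -n audiolcm python=3.9 -y
conda activate audiolcm

# install the project dependencies
pip install -r requirement.txt
```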
## Inference with pretrained model
```bash
python scripts/txt2audio_for_lcm.py --ddim_steps 2 -b configs/audiolcm.yaml --sample_rate 16000 --vocoder-ckpt vocoder/logs/bigvnat16k93.5w --outdir results --test-dataset audiocaps -r ckpt/audiolcm.ckpt
```
# Train
## Dataset preparation
We cannot provide dataset download links due to copyright issues, but we provide the preprocessing code to generate mel-spectrograms.
Before training, construct the dataset information into a tsv file with the following columns: name (an id for each audio clip), dataset (which dataset the audio belongs to), audio_path (the path of the .wav file), caption (the caption of the audio), and mel_path (the path of the processed mel-spectrogram file for each audio clip). We provide a tsv file for the AudioCaps test set as a sample: ./audiocaps_test_16000_struct.tsv.
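
For illustration, the starting tsv (before preprocessing) might look like this; a minimal sketch where the name, path, and caption values are placeholders, and where the mel_path column is presumably filled in by the preprocessing step below:
```bash
# write a tiny example tsv with the columns required before preprocessing
# (all values below are placeholders)
printf 'name\tdataset\taudio_path\tcaption\n' > tmp.tsv
printf 'demo_0001\taudiocaps\t/data/audiocaps/wav/demo_0001.wav\tA dog barks while birds chirp nearby\n' >> tmp.tsv
```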
### Generate the mel-spectrogram files
Assume you already have a tsv file linking each caption to its audio_path, i.e. the tsv file has "name", "audio_path", "dataset", and "caption" columns.
To compute the mel-spectrogram of each audio clip, run the following command, which saves the mels in ./processed:
```bash
python ldm/data/preprocess/mel_spec.py --tsv_path tmp.tsv
```
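
A quick sanity check that the mels were written (a sketch; it assumes the default ./processed output directory mentioned above):
```bash
# count and peek at the generated mel files
ls processed | wc -l
ls processed | head -n 3
```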
Then add the duration of each audio clip to the tsv file:
```bash
python ldm/data/preprocess/add_duration.py
```
## Train the variational autoencoder
Assume we have processed several datasets and saved the .tsv files as data/*.tsv. In the config file, set **data.params.spec_dir_path** to **data** (the directory that contains the tsv files). Then train the VAE with the following command. If you don't have 8 GPUs on your machine, replace --gpus with the ids of the GPUs you do have (e.g. --gpus 0,1,...):
```bash
python main.py --base configs/train/vae.yaml -t --gpus 0,1,2,3,4,5,6,7
```
The training results will be saved in ./logs/.
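
For example, on a machine with only two GPUs, the note above becomes (a sketch; adjust the GPU ids to your hardware):
```bash
# same VAE training run, restricted to GPUs 0 and 1
python main.py --base configs/train/vae.yaml -t --gpus 0,1
```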
## Train the latent diffusion model
After training the VAE, set model.params.first_stage_config.params.ckpt_path in the config file to your trained VAE checkpoint path.
Run the following command to train the diffusion model:
```bash
python main.py --base configs/autoencoder1d.yaml -t --gpus 0,1,2,3,4,5,6,7
```
The training results will be saved in ./logs/.
# Evaluation
## Generate AudioCaps samples
```bash
python scripts/txt2audio_for_lcm.py --ddim_steps 2 -b configs/audiolcm.yaml --sample_rate 16000 --vocoder-ckpt vocoder/logs/bigvnat16k93.5w --outdir results --test-dataset audiocaps -r ckpt/audiolcm.ckpt
```

## Calculate FD, FAD, IS, and KL
Install [audioldm_eval](https://github.com/haoheliu/audioldm_eval):
```bash
git clone git@github.com:haoheliu/audioldm_eval.git
```
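
To make the package importable, install it into your environment (a sketch; we assume the cloned repo installs from source with pip):
```bash
# install the evaluation toolkit from the cloned source
cd audioldm_eval
pip install -e .
cd ..
```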
Then test with:
```bash
python scripts/test.py --pred_wavsdir {the directory that saves the audios you generated} --gt_wavsdir {the directory that saves audiocaps test set waves}
```
## Calculate the CLAP score
```bash
python wav_evaluation/cal_clap_score.py --tsv_path {the directory that saves the audios you generated}/result.tsv
```

## Acknowledgements
This implementation uses parts of the code from the following GitHub repos, as noted in our code:
[Make-An-Audio](https://github.com/Text-to-Audio/Make-An-Audio),
[CLAP](https://github.com/LAION-AI/CLAP),
[Stable Diffusion](https://github.com/CompVis/stable-diffusion).

<!-- ## Citations ##
If you find this code useful in your research, please consider citing:
```bibtex
@article{huang2023make,
  title={Make-an-audio: Text-to-audio generation with prompt-enhanced diffusion models},
  author={Huang, Rongjie and Huang, Jiawei and Yang, Dongchao and Ren, Yi and Liu, Luping and Li, Mingze and Ye, Zhenhui and Liu, Jinglin and Yin, Xiang and Zhao, Zhou},
  journal={arXiv preprint arXiv:2301.12661},
  year={2023}
}
``` -->

# Disclaimer
Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without their consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.