primepake committed
Commit 4a1f5f8 · 1 Parent(s): 75e50c2

update LLM

- README.md +58 -12
- assets/image.png +0 -0
- dac-vae/audiotools/__pycache__/__init__.cpython-310.pyc +0 -0
- dac-vae/audiotools/core/__pycache__/__init__.cpython-310.pyc +0 -0
- dac-vae/audiotools/core/__pycache__/audio_signal.cpython-310.pyc +0 -0
- dac-vae/audiotools/core/__pycache__/display.cpython-310.pyc +0 -0
- dac-vae/audiotools/core/__pycache__/dsp.cpython-310.pyc +0 -0
- dac-vae/audiotools/core/__pycache__/effects.cpython-310.pyc +0 -0
- dac-vae/audiotools/core/__pycache__/ffmpeg.cpython-310.pyc +0 -0
- dac-vae/audiotools/core/__pycache__/loudness.cpython-310.pyc +0 -0
- dac-vae/audiotools/core/__pycache__/playback.cpython-310.pyc +0 -0
- dac-vae/audiotools/core/__pycache__/util.cpython-310.pyc +0 -0
- dac-vae/audiotools/core/__pycache__/whisper.cpython-310.pyc +0 -0
- dac-vae/audiotools/core/templates/__pycache__/__init__.cpython-310.pyc +0 -0
- dac-vae/audiotools/data/__pycache__/__init__.cpython-310.pyc +0 -0
- dac-vae/audiotools/data/__pycache__/datasets.cpython-310.pyc +0 -0
- dac-vae/audiotools/data/__pycache__/preprocess.cpython-310.pyc +0 -0
- dac-vae/audiotools/data/__pycache__/transforms.cpython-310.pyc +0 -0
- dac-vae/audiotools/metrics/__pycache__/__init__.cpython-310.pyc +0 -0
- dac-vae/audiotools/metrics/__pycache__/distance.cpython-310.pyc +0 -0
- dac-vae/audiotools/metrics/__pycache__/quality.cpython-310.pyc +0 -0
- dac-vae/audiotools/metrics/__pycache__/spectral.cpython-310.pyc +0 -0
- dac-vae/audiotools/ml/__pycache__/__init__.cpython-310.pyc +0 -0
- dac-vae/audiotools/ml/__pycache__/accelerator.cpython-310.pyc +0 -0
- dac-vae/audiotools/ml/__pycache__/decorators.cpython-310.pyc +0 -0
- dac-vae/audiotools/ml/__pycache__/experiment.cpython-310.pyc +0 -0
- dac-vae/audiotools/ml/layers/__pycache__/__init__.cpython-310.pyc +0 -0
- dac-vae/audiotools/ml/layers/__pycache__/base.cpython-310.pyc +0 -0
- dac-vae/audiotools/ml/layers/__pycache__/spectral_gate.cpython-310.pyc +0 -0
- dac-vae/model.py +1 -0
- requirements.txt +5 -1
- speech/config.yaml +1 -1
README.md
CHANGED
@@ -10,15 +10,11 @@ This repository provides an implementation of the MiniMax-Speech model, featurin
 
 ## Key Features
 
-- [
-- [
-- [
-- [
-- [ ] **
-- [ ] **Flow matching AE**: Flow matching training for autoencoders
-- [ ] **Immiscible assignment**: Support immiscible adding noise while training
-- [ ] **Contrastive Flow matching**: Support Contrastive training
-
+- [x] **24kHz Audio Support**: High-quality audio generation at 24kHz sampling rate
+- [x] **Flow matching AE**: Flow matching training for autoencoders
+- [x] **Immiscible assignment**: Support immiscible adding noise while training
+- [x] **Contrastive Flow matching**: Support Contrastive training
+- [ ] **Checkpoint release**: Release LLM and Contrastive FM checkpoint
 ## Architecture
 
 ### Stage 1: Audio to Discrete Tokens
@@ -76,12 +72,63 @@ pip install -r requirements.txt
 
 3. **Stage 1: Auto Regressive Transformer**
 ```bash
-
+#!/bin/bash
+pretrained_model_dir=./pretrained_models/CosyVoice2-0.5B
+
+export CUDA_VISIBLE_DEVICES="0"
+num_gpus=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')
+job_id=1986
+dist_backend="nccl"
+num_workers=2
+prefetch=100
+train_engine=torch_ddp
+model=llm
+
+torchrun --nnodes=1 --nproc_per_node=$num_gpus --rdzv_id=$job_id --rdzv_backend="c10d" --rdzv_endpoint="localhost:1234" \
+train.py \
+--train_engine $train_engine \
+--config config.yaml \
+--train_data data/data.list \
+--cv_data data/data.list \
+--qwen_pretrain_path $pretrained_model_dir/CosyVoice-BlankEN \
+--model $model \
+--model_dir /data/checkpoint/$model/ \
+--num_workers ${num_workers} \
+--prefetch ${prefetch} \
+--pin_memory \
+--use_amp \
+--comet_disabled
+
 ```
 
 4. **Stage 2: Flow matching decoder**
 ```bash
-
+#!/bin/bash
+pretrained_model_dir=./pretrained_models/CosyVoice2-0.5B
+export CUDA_VISIBLE_DEVICES="0"
+num_gpus=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')
+job_id=1986
+dist_backend="nccl"
+num_workers=2
+prefetch=100
+train_engine=torch_ddp
+model=llm
+
+torchrun --nnodes=1 --nproc_per_node=$num_gpus --rdzv_id=$job_id --rdzv_backend="c10d" --rdzv_endpoint="localhost:1234" \
+train.py \
+--train_engine $train_engine \
+--config config.yaml \
+--train_data data/data.list \
+--cv_data data/data.list \
+--qwen_pretrain_path $pretrained_model_dir/CosyVoice-BlankEN \
+--model $model \
+--model_dir /data/checkpoint/$model/ \
+--num_workers ${num_workers} \
+--prefetch ${prefetch} \
+--pin_memory \
+--use_amp \
+--comet_disabled
+
 ```
 
 ## Project Structure
@@ -134,7 +181,6 @@ If you use this code in your research, please cite:
 This project follows the licensing terms of its dependencies:
 - CosyVoice2 components: [Check CosyVoice2 License](https://github.com/FunAudioLLM/CosyVoice/blob/main/LICENSE)
 - FSQ components: [Apache 2.0 License](https://github.com/xingchensong/S3Tokenizer/blob/main/LICENSE)
-- Original contributions: [Specify your license here]
 
 ## Acknowledgments
 
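Both training scripts in the README pass `--nproc_per_node=$num_gpus`, where `num_gpus` is the number of comma-separated device IDs in `CUDA_VISIBLE_DEVICES`. A rough Python equivalent of that counting step is sketched below for illustration only. The helper name `count_visible_gpus` is hypothetical and not part of the repository, and unlike the `awk` one-liner it ignores empty fields.

```python
import os
from typing import Mapping, Optional


def count_visible_gpus(env: Optional[Mapping[str, str]] = None) -> int:
    """Count the device IDs listed in CUDA_VISIBLE_DEVICES.

    Hypothetical helper: mirrors `awk -F "," '{print NF}'` from the scripts,
    except that empty fields (e.g. a trailing comma) are not counted.
    """
    env = os.environ if env is None else env
    devices = env.get("CUDA_VISIBLE_DEVICES", "")
    return len([d for d in devices.split(",") if d.strip()])


print(count_visible_gpus({"CUDA_VISIBLE_DEVICES": "0"}))        # 1
print(count_visible_gpus({"CUDA_VISIBLE_DEVICES": "0,1,2,3"}))  # 4
```

With the value `"0"` used in the scripts, this yields one process per node.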
assets/image.png
CHANGED
Git LFS Details

dac-vae/audiotools/__pycache__/__init__.cpython-310.pyc
CHANGED
Binary files a/dac-vae/audiotools/__pycache__/__init__.cpython-310.pyc and b/dac-vae/audiotools/__pycache__/__init__.cpython-310.pyc differ

dac-vae/audiotools/core/__pycache__/__init__.cpython-310.pyc
CHANGED
Binary files a/dac-vae/audiotools/core/__pycache__/__init__.cpython-310.pyc and b/dac-vae/audiotools/core/__pycache__/__init__.cpython-310.pyc differ

dac-vae/audiotools/core/__pycache__/audio_signal.cpython-310.pyc
CHANGED
Binary files a/dac-vae/audiotools/core/__pycache__/audio_signal.cpython-310.pyc and b/dac-vae/audiotools/core/__pycache__/audio_signal.cpython-310.pyc differ

dac-vae/audiotools/core/__pycache__/display.cpython-310.pyc
CHANGED
Binary files a/dac-vae/audiotools/core/__pycache__/display.cpython-310.pyc and b/dac-vae/audiotools/core/__pycache__/display.cpython-310.pyc differ

dac-vae/audiotools/core/__pycache__/dsp.cpython-310.pyc
CHANGED
Binary files a/dac-vae/audiotools/core/__pycache__/dsp.cpython-310.pyc and b/dac-vae/audiotools/core/__pycache__/dsp.cpython-310.pyc differ

dac-vae/audiotools/core/__pycache__/effects.cpython-310.pyc
CHANGED
Binary files a/dac-vae/audiotools/core/__pycache__/effects.cpython-310.pyc and b/dac-vae/audiotools/core/__pycache__/effects.cpython-310.pyc differ

dac-vae/audiotools/core/__pycache__/ffmpeg.cpython-310.pyc
CHANGED
Binary files a/dac-vae/audiotools/core/__pycache__/ffmpeg.cpython-310.pyc and b/dac-vae/audiotools/core/__pycache__/ffmpeg.cpython-310.pyc differ

dac-vae/audiotools/core/__pycache__/loudness.cpython-310.pyc
CHANGED
Binary files a/dac-vae/audiotools/core/__pycache__/loudness.cpython-310.pyc and b/dac-vae/audiotools/core/__pycache__/loudness.cpython-310.pyc differ

dac-vae/audiotools/core/__pycache__/playback.cpython-310.pyc
CHANGED
Binary files a/dac-vae/audiotools/core/__pycache__/playback.cpython-310.pyc and b/dac-vae/audiotools/core/__pycache__/playback.cpython-310.pyc differ

dac-vae/audiotools/core/__pycache__/util.cpython-310.pyc
CHANGED
Binary files a/dac-vae/audiotools/core/__pycache__/util.cpython-310.pyc and b/dac-vae/audiotools/core/__pycache__/util.cpython-310.pyc differ

dac-vae/audiotools/core/__pycache__/whisper.cpython-310.pyc
CHANGED
Binary files a/dac-vae/audiotools/core/__pycache__/whisper.cpython-310.pyc and b/dac-vae/audiotools/core/__pycache__/whisper.cpython-310.pyc differ

dac-vae/audiotools/core/templates/__pycache__/__init__.cpython-310.pyc
CHANGED
Binary files a/dac-vae/audiotools/core/templates/__pycache__/__init__.cpython-310.pyc and b/dac-vae/audiotools/core/templates/__pycache__/__init__.cpython-310.pyc differ

dac-vae/audiotools/data/__pycache__/__init__.cpython-310.pyc
CHANGED
Binary files a/dac-vae/audiotools/data/__pycache__/__init__.cpython-310.pyc and b/dac-vae/audiotools/data/__pycache__/__init__.cpython-310.pyc differ

dac-vae/audiotools/data/__pycache__/datasets.cpython-310.pyc
CHANGED
Binary files a/dac-vae/audiotools/data/__pycache__/datasets.cpython-310.pyc and b/dac-vae/audiotools/data/__pycache__/datasets.cpython-310.pyc differ

dac-vae/audiotools/data/__pycache__/preprocess.cpython-310.pyc
CHANGED
Binary files a/dac-vae/audiotools/data/__pycache__/preprocess.cpython-310.pyc and b/dac-vae/audiotools/data/__pycache__/preprocess.cpython-310.pyc differ

dac-vae/audiotools/data/__pycache__/transforms.cpython-310.pyc
CHANGED
Binary files a/dac-vae/audiotools/data/__pycache__/transforms.cpython-310.pyc and b/dac-vae/audiotools/data/__pycache__/transforms.cpython-310.pyc differ

dac-vae/audiotools/metrics/__pycache__/__init__.cpython-310.pyc
CHANGED
Binary files a/dac-vae/audiotools/metrics/__pycache__/__init__.cpython-310.pyc and b/dac-vae/audiotools/metrics/__pycache__/__init__.cpython-310.pyc differ

dac-vae/audiotools/metrics/__pycache__/distance.cpython-310.pyc
CHANGED
Binary files a/dac-vae/audiotools/metrics/__pycache__/distance.cpython-310.pyc and b/dac-vae/audiotools/metrics/__pycache__/distance.cpython-310.pyc differ

dac-vae/audiotools/metrics/__pycache__/quality.cpython-310.pyc
CHANGED
Binary files a/dac-vae/audiotools/metrics/__pycache__/quality.cpython-310.pyc and b/dac-vae/audiotools/metrics/__pycache__/quality.cpython-310.pyc differ

dac-vae/audiotools/metrics/__pycache__/spectral.cpython-310.pyc
CHANGED
Binary files a/dac-vae/audiotools/metrics/__pycache__/spectral.cpython-310.pyc and b/dac-vae/audiotools/metrics/__pycache__/spectral.cpython-310.pyc differ

dac-vae/audiotools/ml/__pycache__/__init__.cpython-310.pyc
CHANGED
Binary files a/dac-vae/audiotools/ml/__pycache__/__init__.cpython-310.pyc and b/dac-vae/audiotools/ml/__pycache__/__init__.cpython-310.pyc differ

dac-vae/audiotools/ml/__pycache__/accelerator.cpython-310.pyc
CHANGED
Binary files a/dac-vae/audiotools/ml/__pycache__/accelerator.cpython-310.pyc and b/dac-vae/audiotools/ml/__pycache__/accelerator.cpython-310.pyc differ

dac-vae/audiotools/ml/__pycache__/decorators.cpython-310.pyc
CHANGED
Binary files a/dac-vae/audiotools/ml/__pycache__/decorators.cpython-310.pyc and b/dac-vae/audiotools/ml/__pycache__/decorators.cpython-310.pyc differ

dac-vae/audiotools/ml/__pycache__/experiment.cpython-310.pyc
CHANGED
Binary files a/dac-vae/audiotools/ml/__pycache__/experiment.cpython-310.pyc and b/dac-vae/audiotools/ml/__pycache__/experiment.cpython-310.pyc differ

dac-vae/audiotools/ml/layers/__pycache__/__init__.cpython-310.pyc
CHANGED
Binary files a/dac-vae/audiotools/ml/layers/__pycache__/__init__.cpython-310.pyc and b/dac-vae/audiotools/ml/layers/__pycache__/__init__.cpython-310.pyc differ

dac-vae/audiotools/ml/layers/__pycache__/base.cpython-310.pyc
CHANGED
Binary files a/dac-vae/audiotools/ml/layers/__pycache__/base.cpython-310.pyc and b/dac-vae/audiotools/ml/layers/__pycache__/base.cpython-310.pyc differ

dac-vae/audiotools/ml/layers/__pycache__/spectral_gate.cpython-310.pyc
CHANGED
Binary files a/dac-vae/audiotools/ml/layers/__pycache__/spectral_gate.cpython-310.pyc and b/dac-vae/audiotools/ml/layers/__pycache__/spectral_gate.cpython-310.pyc differ

dac-vae/model.py
CHANGED
@@ -495,6 +495,7 @@ class DACVAE(BaseModel, CodecMixin):
 # print(f"Audio data shape: {audio_data.shape}")
 length = audio_data.shape[-1]
 audio_data = self.preprocess(audio_data, sample_rate)
+print('audio_data: ', audio_data.shape)
 z, m, logs = self.encode(audio_data)
 x = self.decode(z)
 return {

requirements.txt
CHANGED
@@ -37,4 +37,8 @@ torchaudio==2.3.1
 transformers==4.40.1
 uvicorn==0.30.0
 wetext==0.0.4
-wget==3.2
+wget==3.2
+flatten_dict
+julius
+importlib_resources
+randomname

speech/config.yaml
CHANGED
@@ -198,7 +198,7 @@ sort: !name:cosyvoice.dataset.processor.sort
 sort_size: 500 # sort_size should be less than shuffle_size
 batch: !name:cosyvoice.dataset.processor.batch
 batch_type: 'dynamic'
-max_frames_in_batch:
+max_frames_in_batch: 50000
 padding: !name:cosyvoice.dataset.processor.padding
 use_spk_embedding: False # change to True during sft
 use_speaker_encoder: !ref <use_speaker_encoder>
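
The `speech/config.yaml` change sets `max_frames_in_batch: 50000` under `batch_type: 'dynamic'`. As a rough illustration of what a frame budget means for dynamic batching, the sketch below groups utterances so that the padded size of each batch (longest utterance times number of utterances) stays within the budget. It is a simplified stand-in written for this note, not the `cosyvoice.dataset.processor.batch` implementation; the function name `dynamic_batches` and the use of plain frame counts instead of feature tensors are assumptions.

```python
from typing import Iterable, Iterator, List


def dynamic_batches(
    frame_counts: Iterable[int], max_frames_in_batch: int
) -> Iterator[List[int]]:
    """Yield batches whose padded size stays within max_frames_in_batch.

    Padded size is (longest utterance in the batch) * (number of utterances),
    because every item is padded to the longest one.
    """
    batch: List[int] = []
    longest = 0
    for frames in frame_counts:
        new_longest = max(longest, frames)
        padded = new_longest * (len(batch) + 1)
        if batch and padded > max_frames_in_batch:
            # Adding this utterance would exceed the budget: emit and restart.
            yield batch
            batch, longest = [frames], frames
        else:
            batch.append(frames)
            longest = new_longest
    if batch:
        yield batch


# Hypothetical utterance lengths in frames, sorted by length beforehand.
lengths = [100, 200, 300, 400]
batches = list(dynamic_batches(lengths, max_frames_in_batch=600))
print(batches)  # [[100, 200], [300], [400]]
```

In this sketch a larger budget packs more utterances into each batch at the cost of more memory per step, and a single utterance longer than the budget still forms a batch of its own.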