primepake committed on
Commit 4a1f5f8 · 1 Parent(s): 75e50c2

update LLM

Files changed (32)
  1. README.md +58 -12
  2. assets/image.png +0 -0
  3. dac-vae/audiotools/__pycache__/__init__.cpython-310.pyc +0 -0
  4. dac-vae/audiotools/core/__pycache__/__init__.cpython-310.pyc +0 -0
  5. dac-vae/audiotools/core/__pycache__/audio_signal.cpython-310.pyc +0 -0
  6. dac-vae/audiotools/core/__pycache__/display.cpython-310.pyc +0 -0
  7. dac-vae/audiotools/core/__pycache__/dsp.cpython-310.pyc +0 -0
  8. dac-vae/audiotools/core/__pycache__/effects.cpython-310.pyc +0 -0
  9. dac-vae/audiotools/core/__pycache__/ffmpeg.cpython-310.pyc +0 -0
  10. dac-vae/audiotools/core/__pycache__/loudness.cpython-310.pyc +0 -0
  11. dac-vae/audiotools/core/__pycache__/playback.cpython-310.pyc +0 -0
  12. dac-vae/audiotools/core/__pycache__/util.cpython-310.pyc +0 -0
  13. dac-vae/audiotools/core/__pycache__/whisper.cpython-310.pyc +0 -0
  14. dac-vae/audiotools/core/templates/__pycache__/__init__.cpython-310.pyc +0 -0
  15. dac-vae/audiotools/data/__pycache__/__init__.cpython-310.pyc +0 -0
  16. dac-vae/audiotools/data/__pycache__/datasets.cpython-310.pyc +0 -0
  17. dac-vae/audiotools/data/__pycache__/preprocess.cpython-310.pyc +0 -0
  18. dac-vae/audiotools/data/__pycache__/transforms.cpython-310.pyc +0 -0
  19. dac-vae/audiotools/metrics/__pycache__/__init__.cpython-310.pyc +0 -0
  20. dac-vae/audiotools/metrics/__pycache__/distance.cpython-310.pyc +0 -0
  21. dac-vae/audiotools/metrics/__pycache__/quality.cpython-310.pyc +0 -0
  22. dac-vae/audiotools/metrics/__pycache__/spectral.cpython-310.pyc +0 -0
  23. dac-vae/audiotools/ml/__pycache__/__init__.cpython-310.pyc +0 -0
  24. dac-vae/audiotools/ml/__pycache__/accelerator.cpython-310.pyc +0 -0
  25. dac-vae/audiotools/ml/__pycache__/decorators.cpython-310.pyc +0 -0
  26. dac-vae/audiotools/ml/__pycache__/experiment.cpython-310.pyc +0 -0
  27. dac-vae/audiotools/ml/layers/__pycache__/__init__.cpython-310.pyc +0 -0
  28. dac-vae/audiotools/ml/layers/__pycache__/base.cpython-310.pyc +0 -0
  29. dac-vae/audiotools/ml/layers/__pycache__/spectral_gate.cpython-310.pyc +0 -0
  30. dac-vae/model.py +1 -0
  31. requirements.txt +5 -1
  32. speech/config.yaml +1 -1
README.md CHANGED
@@ -10,15 +10,11 @@ This repository provides an implementation of the MiniMax-Speech model, featurin
 
  ## Key Features
 
- - [ ] **24kHz Audio Support**: High-quality audio generation at 24kHz sampling rate
- - [ ] **FSQ tokenizer training**: Training FSQ from scratch
- - [ ] **Two-Stage Architecture**: Optimized training pipeline with discrete and continuous representations
- - [ ] **Modular Design**: Separate components for audio codec and variational autoencoder
- - [ ] **CosyVoice2 Decoder**: Leverages proven components from the CosyVoice2's Decoder framework
- - [ ] **Flow matching AE**: Flow matching training for autoencoders
- - [ ] **Immiscible assignment**: Support immiscible adding noise while training
- - [ ] **Contrastive Flow matching**: Support Contrastive training
-
+ - [x] **24kHz Audio Support**: High-quality audio generation at 24kHz sampling rate
+ - [x] **Flow matching AE**: Flow matching training for autoencoders
+ - [x] **Immiscible assignment**: Support immiscible adding noise while training
+ - [x] **Contrastive Flow matching**: Support Contrastive training
+ - [ ] **Checkpoint release**: Release LLM and Contrastive FM checkpoint
  ## Architecture
 
  ### Stage 1: Audio to Discrete Tokens
@@ -76,12 +72,63 @@ pip install -r requirements.txt
 
  3. **Stage 1: Auto Regressive Transformer**
  ```bash
- # Add feature extraction commands
+ #!/bin/bash
+ pretrained_model_dir=./pretrained_models/CosyVoice2-0.5B
+
+ export CUDA_VISIBLE_DEVICES="0"
+ num_gpus=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')
+ job_id=1986
+ dist_backend="nccl"
+ num_workers=2
+ prefetch=100
+ train_engine=torch_ddp
+ model=llm
+
+ torchrun --nnodes=1 --nproc_per_node=$num_gpus --rdzv_id=$job_id --rdzv_backend="c10d" --rdzv_endpoint="localhost:1234" \
+ train.py \
+ --train_engine $train_engine \
+ --config config.yaml \
+ --train_data data/data.list \
+ --cv_data data/data.list \
+ --qwen_pretrain_path $pretrained_model_dir/CosyVoice-BlankEN \
+ --model $model \
+ --model_dir /data/checkpoint/$model/ \
+ --num_workers ${num_workers} \
+ --prefetch ${prefetch} \
+ --pin_memory \
+ --use_amp \
+ --comet_disabled
+
  ```
 
  4. **Stage 2: FLow matching decoder**
  ```bash
- # Add main training command
+ #!/bin/bash
+ pretrained_model_dir=./pretrained_models/CosyVoice2-0.5B
+ export CUDA_VISIBLE_DEVICES="0"
+ num_gpus=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')
+ job_id=1986
+ dist_backend="nccl"
+ num_workers=2
+ prefetch=100
+ train_engine=torch_ddp
+ model=llm
+
+ torchrun --nnodes=1 --nproc_per_node=$num_gpus --rdzv_id=$job_id --rdzv_backend="c10d" --rdzv_endpoint="localhost:1234" \
+ train.py \
+ --train_engine $train_engine \
+ --config config.yaml \
+ --train_data data/data.list \
+ --cv_data data/data.list \
+ --qwen_pretrain_path $pretrained_model_dir/CosyVoice-BlankEN \
+ --model $model \
+ --model_dir /data/checkpoint/$model/ \
+ --num_workers ${num_workers} \
+ --prefetch ${prefetch} \
+ --pin_memory \
+ --use_amp \
+ --comet_disabled
+
  ```
 
  ## Project Structure
@@ -134,7 +181,6 @@ If you use this code in your research, please cite:
  This project follows the licensing terms of its dependencies:
  - CosyVoice2 components: [Check CosyVoice2 License](https://github.com/FunAudioLLM/CosyVoice/blob/main/LICENSE)
  - FSQ components: [Apache 2.0 License](https://github.com/xingchensong/S3Tokenizer/blob/main/LICENSE)
- - Original contributions: [Specify your license here]
 
  ## Acknowledgments
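
Note on the training scripts added to README.md: both derive `--nproc_per_node` from `CUDA_VISIBLE_DEVICES` by counting comma-separated fields with `awk`. The Python sketch below mirrors that counting rule so you can check how many processes a given device list would launch. The function name `count_visible_gpus` is hypothetical and not part of the repository.

```python
import os


def count_visible_gpus(value: str) -> int:
    """Count comma-separated fields, like `awk -F "," '{print NF}'`."""
    # awk reports NF = 0 for an empty input line, so an empty value maps to 0.
    if value == "":
        return 0
    # Every comma starts a new field, including a trailing empty one.
    return len(value.split(","))


if __name__ == "__main__":
    # The README scripts export CUDA_VISIBLE_DEVICES="0", i.e. one process.
    devices = os.environ.get("CUDA_VISIBLE_DEVICES", "0")
    print(count_visible_gpus(devices))
```

With the value `"0"` used in the scripts this gives 1, so `torchrun` starts a single worker. A list such as `"0,1,2,3"` gives 4.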
 
assets/image.png CHANGED

Git LFS Details

  • SHA256: f10f503661fd5331b31f6a2450391c12df4042ae1b7333d8b4c8646852d2ebae
  • Pointer size: 130 Bytes
  • Size of remote file: 32.6 kB

dac-vae/audiotools/__pycache__/__init__.cpython-310.pyc CHANGED
Binary files a/dac-vae/audiotools/__pycache__/__init__.cpython-310.pyc and b/dac-vae/audiotools/__pycache__/__init__.cpython-310.pyc differ
 
dac-vae/audiotools/core/__pycache__/__init__.cpython-310.pyc CHANGED
Binary files a/dac-vae/audiotools/core/__pycache__/__init__.cpython-310.pyc and b/dac-vae/audiotools/core/__pycache__/__init__.cpython-310.pyc differ
 
dac-vae/audiotools/core/__pycache__/audio_signal.cpython-310.pyc CHANGED
Binary files a/dac-vae/audiotools/core/__pycache__/audio_signal.cpython-310.pyc and b/dac-vae/audiotools/core/__pycache__/audio_signal.cpython-310.pyc differ
 
dac-vae/audiotools/core/__pycache__/display.cpython-310.pyc CHANGED
Binary files a/dac-vae/audiotools/core/__pycache__/display.cpython-310.pyc and b/dac-vae/audiotools/core/__pycache__/display.cpython-310.pyc differ
 
dac-vae/audiotools/core/__pycache__/dsp.cpython-310.pyc CHANGED
Binary files a/dac-vae/audiotools/core/__pycache__/dsp.cpython-310.pyc and b/dac-vae/audiotools/core/__pycache__/dsp.cpython-310.pyc differ
 
dac-vae/audiotools/core/__pycache__/effects.cpython-310.pyc CHANGED
Binary files a/dac-vae/audiotools/core/__pycache__/effects.cpython-310.pyc and b/dac-vae/audiotools/core/__pycache__/effects.cpython-310.pyc differ
 
dac-vae/audiotools/core/__pycache__/ffmpeg.cpython-310.pyc CHANGED
Binary files a/dac-vae/audiotools/core/__pycache__/ffmpeg.cpython-310.pyc and b/dac-vae/audiotools/core/__pycache__/ffmpeg.cpython-310.pyc differ
 
dac-vae/audiotools/core/__pycache__/loudness.cpython-310.pyc CHANGED
Binary files a/dac-vae/audiotools/core/__pycache__/loudness.cpython-310.pyc and b/dac-vae/audiotools/core/__pycache__/loudness.cpython-310.pyc differ
 
dac-vae/audiotools/core/__pycache__/playback.cpython-310.pyc CHANGED
Binary files a/dac-vae/audiotools/core/__pycache__/playback.cpython-310.pyc and b/dac-vae/audiotools/core/__pycache__/playback.cpython-310.pyc differ
 
dac-vae/audiotools/core/__pycache__/util.cpython-310.pyc CHANGED
Binary files a/dac-vae/audiotools/core/__pycache__/util.cpython-310.pyc and b/dac-vae/audiotools/core/__pycache__/util.cpython-310.pyc differ
 
dac-vae/audiotools/core/__pycache__/whisper.cpython-310.pyc CHANGED
Binary files a/dac-vae/audiotools/core/__pycache__/whisper.cpython-310.pyc and b/dac-vae/audiotools/core/__pycache__/whisper.cpython-310.pyc differ
 
dac-vae/audiotools/core/templates/__pycache__/__init__.cpython-310.pyc CHANGED
Binary files a/dac-vae/audiotools/core/templates/__pycache__/__init__.cpython-310.pyc and b/dac-vae/audiotools/core/templates/__pycache__/__init__.cpython-310.pyc differ
 
dac-vae/audiotools/data/__pycache__/__init__.cpython-310.pyc CHANGED
Binary files a/dac-vae/audiotools/data/__pycache__/__init__.cpython-310.pyc and b/dac-vae/audiotools/data/__pycache__/__init__.cpython-310.pyc differ
 
dac-vae/audiotools/data/__pycache__/datasets.cpython-310.pyc CHANGED
Binary files a/dac-vae/audiotools/data/__pycache__/datasets.cpython-310.pyc and b/dac-vae/audiotools/data/__pycache__/datasets.cpython-310.pyc differ
 
dac-vae/audiotools/data/__pycache__/preprocess.cpython-310.pyc CHANGED
Binary files a/dac-vae/audiotools/data/__pycache__/preprocess.cpython-310.pyc and b/dac-vae/audiotools/data/__pycache__/preprocess.cpython-310.pyc differ
 
dac-vae/audiotools/data/__pycache__/transforms.cpython-310.pyc CHANGED
Binary files a/dac-vae/audiotools/data/__pycache__/transforms.cpython-310.pyc and b/dac-vae/audiotools/data/__pycache__/transforms.cpython-310.pyc differ
 
dac-vae/audiotools/metrics/__pycache__/__init__.cpython-310.pyc CHANGED
Binary files a/dac-vae/audiotools/metrics/__pycache__/__init__.cpython-310.pyc and b/dac-vae/audiotools/metrics/__pycache__/__init__.cpython-310.pyc differ
 
dac-vae/audiotools/metrics/__pycache__/distance.cpython-310.pyc CHANGED
Binary files a/dac-vae/audiotools/metrics/__pycache__/distance.cpython-310.pyc and b/dac-vae/audiotools/metrics/__pycache__/distance.cpython-310.pyc differ
 
dac-vae/audiotools/metrics/__pycache__/quality.cpython-310.pyc CHANGED
Binary files a/dac-vae/audiotools/metrics/__pycache__/quality.cpython-310.pyc and b/dac-vae/audiotools/metrics/__pycache__/quality.cpython-310.pyc differ
 
dac-vae/audiotools/metrics/__pycache__/spectral.cpython-310.pyc CHANGED
Binary files a/dac-vae/audiotools/metrics/__pycache__/spectral.cpython-310.pyc and b/dac-vae/audiotools/metrics/__pycache__/spectral.cpython-310.pyc differ
 
dac-vae/audiotools/ml/__pycache__/__init__.cpython-310.pyc CHANGED
Binary files a/dac-vae/audiotools/ml/__pycache__/__init__.cpython-310.pyc and b/dac-vae/audiotools/ml/__pycache__/__init__.cpython-310.pyc differ
 
dac-vae/audiotools/ml/__pycache__/accelerator.cpython-310.pyc CHANGED
Binary files a/dac-vae/audiotools/ml/__pycache__/accelerator.cpython-310.pyc and b/dac-vae/audiotools/ml/__pycache__/accelerator.cpython-310.pyc differ
 
dac-vae/audiotools/ml/__pycache__/decorators.cpython-310.pyc CHANGED
Binary files a/dac-vae/audiotools/ml/__pycache__/decorators.cpython-310.pyc and b/dac-vae/audiotools/ml/__pycache__/decorators.cpython-310.pyc differ
 
dac-vae/audiotools/ml/__pycache__/experiment.cpython-310.pyc CHANGED
Binary files a/dac-vae/audiotools/ml/__pycache__/experiment.cpython-310.pyc and b/dac-vae/audiotools/ml/__pycache__/experiment.cpython-310.pyc differ
 
dac-vae/audiotools/ml/layers/__pycache__/__init__.cpython-310.pyc CHANGED
Binary files a/dac-vae/audiotools/ml/layers/__pycache__/__init__.cpython-310.pyc and b/dac-vae/audiotools/ml/layers/__pycache__/__init__.cpython-310.pyc differ
 
dac-vae/audiotools/ml/layers/__pycache__/base.cpython-310.pyc CHANGED
Binary files a/dac-vae/audiotools/ml/layers/__pycache__/base.cpython-310.pyc and b/dac-vae/audiotools/ml/layers/__pycache__/base.cpython-310.pyc differ
 
dac-vae/audiotools/ml/layers/__pycache__/spectral_gate.cpython-310.pyc CHANGED
Binary files a/dac-vae/audiotools/ml/layers/__pycache__/spectral_gate.cpython-310.pyc and b/dac-vae/audiotools/ml/layers/__pycache__/spectral_gate.cpython-310.pyc differ
 
dac-vae/model.py CHANGED
@@ -495,6 +495,7 @@ class DACVAE(BaseModel, CodecMixin):
  # print(f"Audio data shape: {audio_data.shape}")
  length = audio_data.shape[-1]
  audio_data = self.preprocess(audio_data, sample_rate)
+ print('audio_data: ', audio_data.shape)
  z, m, logs = self.encode(audio_data)
  x = self.decode(z)
  return {
requirements.txt CHANGED
@@ -37,4 +37,8 @@ torchaudio==2.3.1
  transformers==4.40.1
  uvicorn==0.30.0
  wetext==0.0.4
- wget==3.2
+ wget==3.2
+ flatten_dict
+ julius
+ importlib_resources
+ randomname
@@ -198,7 +198,7 @@ sort: !name:cosyvoice.dataset.processor.sort
  sort_size: 500 # sort_size should be less than shuffle_size
  batch: !name:cosyvoice.dataset.processor.batch
  batch_type: 'dynamic'
- max_frames_in_batch: 25000
+ max_frames_in_batch: 50000
  padding: !name:cosyvoice.dataset.processor.padding
  use_spk_embedding: False # change to True during sft
  use_speaker_encoder: !ref <use_speaker_encoder>
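
Note on the `max_frames_in_batch` change (25000 to 50000): with `batch_type: 'dynamic'`, a budget like this typically caps the batch size measured in frames after padding every sample to the longest one. Doubling it should therefore let roughly twice as many utterances into each batch, at a correspondingly higher memory cost. The sketch below illustrates that rule on plain frame counts. It is an assumption about how such a budget is commonly applied, not the code behind `cosyvoice.dataset.processor.batch`, and `dynamic_batches` is a hypothetical name.

```python
from typing import Iterable, Iterator, List


def dynamic_batches(
    frame_counts: Iterable[int], max_frames_in_batch: int
) -> Iterator[List[int]]:
    """Group samples so that (longest sample) * (batch size) stays in budget."""
    buf: List[int] = []
    longest = 0
    for frames in frame_counts:
        new_longest = max(longest, frames)
        # Cost of the batch if this sample joins and all are padded to the longest.
        padded = new_longest * (len(buf) + 1)
        if buf and padded > max_frames_in_batch:
            # Budget exceeded: emit the current batch and start a new one.
            yield buf
            buf, longest = [frames], frames
        else:
            buf.append(frames)
            longest = new_longest
    if buf:
        yield buf


if __name__ == "__main__":
    lengths = [100, 200, 300, 50]
    print(list(dynamic_batches(lengths, 600)))   # [[100, 200], [300, 50]]
    print(list(dynamic_batches(lengths, 1200)))  # [[100, 200, 300, 50]]
```

In this toy run, doubling the budget from 600 to 1200 merges the two batches into one, which is the effect the config change is expected to have at the 25000 to 50000 scale.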