# DualCodec: A Low-Frame-Rate, Semantically-Enhanced Neural Audio Codec for Speech Generation

## About

## Installation

```bash
pip install dualcodec
```

## News
- 2025-01-22: Added training and finetuning instructions for DualCodec (v0.3.0).
- 2025-01-16: Released the DualCodec inference code (v0.1.0).

## Available models
<!-- - 12hz_v1: DualCodec model trained with 12Hz sampling rate.
- 25hz_v1: DualCodec model trained with 25Hz sampling rate. -->

| 25hz_v1 | 25Hz | Any from 1-12 (maximum 12) | 16384 | 1024 | 100K hours Emilia |
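
As a rough illustration of what these numbers mean for bitrate: each frame costs log2(codebook size) bits per VQ layer, multiplied by the frame rate. A minimal sketch, assuming (as the model naming suggests) one semantic VQ layer plus `n_vq - 1` acoustic layers; the helper name is ours, not part of the `dualcodec` package:

```python
import math

def bitrate_bps(frame_rate_hz, semantic_codebook, acoustic_codebook, n_vq):
    """Approximate bitrate: one semantic VQ layer plus (n_vq - 1) acoustic
    layers, each costing log2(codebook size) bits per frame."""
    bits_per_frame = math.log2(semantic_codebook) + (n_vq - 1) * math.log2(acoustic_codebook)
    return frame_rate_hz * bits_per_frame

# 25hz_v1 with all 12 VQ layers: 25 * (14 + 11 * 10) = 3100 bits/s
print(bitrate_bps(25, 16384, 1024, 12))
```

With only the first (semantic) layer, the same model runs at 25 * 14 = 350 bits/s, which is why low-frame-rate codecs are attractive for speech language models.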

## How to run DualCodec inference

### 1. Download the checkpoints

```bash
# export HF_ENDPOINT=https://hf-mirror.com  # uncomment to use the Hugging Face mirror if you're in China
huggingface-cli download facebook/w2v-bert-2.0 --local-dir w2v-bert-2.0
huggingface-cli download amphion/dualcodec dualcodec_12hz_16384_4096.safetensors dualcodec_25hz_16384_1024.safetensors w2vbert2_mean_var_stats_emilia.pt --local-dir dualcodec_ckpts
```
The second command downloads the two DualCodec model checkpoints (12hz_v1 and 25hz_v1) and the w2v-bert-2 mean and variance statistics to the local directory `dualcodec_ckpts`.
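
If a later step fails to find a checkpoint, it helps to verify that the expected files landed where the loader will look. A small sketch (the filenames are taken from the download command above; the helper is ours, not a `dualcodec` API):

```python
from pathlib import Path

EXPECTED_FILES = [
    "dualcodec_12hz_16384_4096.safetensors",
    "dualcodec_25hz_16384_1024.safetensors",
    "w2vbert2_mean_var_stats_emilia.pt",
]

def missing_checkpoints(ckpt_dir):
    """Return the expected checkpoint filenames not present in ckpt_dir."""
    root = Path(ckpt_dir)
    return [name for name in EXPECTED_FILES if not (root / name).exists()]

# After a successful download, missing_checkpoints("dualcodec_ckpts") is [].
```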

### 2. Run inference on an audio file in a Python script

```python
import dualcodec
import torchaudio

w2v_path = "./w2v-bert-2.0"  # your downloaded path
dualcodec_model_path = "./dualcodec_ckpts"  # your downloaded path
model_id = "12hz_v1"  # select from the available Model_IDs: "12hz_v1" or "25hz_v1"

dualcodec_model = dualcodec.get_model(model_id, dualcodec_model_path)
inference = dualcodec.Inference(
    dualcodec_model=dualcodec_model,
    dualcodec_path=dualcodec_model_path,
    w2v_path=w2v_path,
    device="cuda",
)
# (encoding step omitted here; see example.ipynb for how to obtain
#  semantic_codes and acoustic_codes from an input waveform)
out_audio = dualcodec_model.decode_from_codes(semantic_codes, acoustic_codes)
torchaudio.save("out.wav", out_audio.cpu().squeeze(0), 24000)
```

See `example.ipynb` for a complete running example.

## DualCodec-based TTS models

### DualCodec-based TTS

## Benchmark results

### DualCodec audio quality

### DualCodec-based TTS

## Finetuning DualCodec
1. Install the extra components needed for training:
```bash
pip install "dualcodec[train]"
```
2. Clone this repository and `cd` to the project root folder.

3. Download the discriminator checkpoints:
```bash
huggingface-cli download amphion/dualcodec --local-dir dualcodec_ckpts
```

4. Run an example finetuning on Emilia German data (the data is streamed, so no files need to be downloaded in advance; access to Hugging Face is required):
```bash
accelerate launch train.py --config-name=dualcodec_ft_12hzv1 \
trainer.batch_size=3 \
data.segment_speech.segment_length=24000
```
This finetunes a 12hz_v1 model with a training batch size of 3 (in practice you will typically want a larger batch size).

To finetune a 25hz_v1 model:
```bash
accelerate launch train.py --config-name=dualcodec_ft_25hzv1 \
trainer.batch_size=3 \
data.segment_speech.segment_length=24000
```

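The `data.segment_speech.segment_length=24000` override appears to be measured in audio samples: at the codec's 24 kHz sample rate (the rate used by `torchaudio.save` above), 24000 samples is one second of audio per training segment. A small sketch of the arithmetic (the helper names are ours, and the samples interpretation is our reading of the value):

```python
def segment_length_samples(seconds, sample_rate_hz=24000):
    """Convert a segment duration in seconds to a sample count,
    as data.segment_speech.segment_length seems to expect."""
    return int(seconds * sample_rate_hz)

def frames_per_segment(seconds, frame_rate_hz):
    """Number of codec frames (token positions) produced per segment."""
    return int(seconds * frame_rate_hz)

print(segment_length_samples(1.0))   # 24000, the value used above
print(frames_per_segment(1.0, 12))   # 12 frames for a 12Hz model
print(frames_per_segment(1.0, 25))   # 25 frames for a 25Hz model
```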

## Training DualCodec from scratch
1. Install the extra components needed for training:
```bash
pip install "dualcodec[train]"
```
2. Clone this repository and `cd` to the project root folder.

3. Run an example training on Emilia German data:
```bash
accelerate launch train.py --config-name=codec_train \
model=dualcodec_12hz_16384_4096_8vq \
trainer.batch_size=3 \
data.segment_speech.segment_length=24000
```
This trains a dualcodec_12hz_16384_4096_8vq model from scratch with a training batch size of 3 (in practice you will typically want a larger batch size).

To train a 25Hz model:
```bash
accelerate launch train.py --config-name=codec_train \
model=dualcodec_25hz_16384_1024_12vq \
trainer.batch_size=3 \
data.segment_speech.segment_length=24000
```
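
The model config names above appear to encode their configuration: e.g. `dualcodec_12hz_16384_4096_8vq` reads as a 12 Hz model with a 16384-entry semantic codebook, 4096-entry acoustic codebooks, and 8 VQ layers. A small parser sketch for this naming convention (our reading of the names, not an official `dualcodec` API):

```python
import re

def parse_model_name(name):
    """Parse names like 'dualcodec_12hz_16384_4096_8vq' into their parts
    (our reading of the naming convention, not an official API)."""
    m = re.fullmatch(r"dualcodec_(\d+)hz_(\d+)_(\d+)_(\d+)vq", name)
    if m is None:
        raise ValueError(f"unrecognized model name: {name}")
    frame_rate, semantic, acoustic, n_vq = map(int, m.groups())
    return {
        "frame_rate_hz": frame_rate,
        "semantic_codebook": semantic,
        "acoustic_codebook": acoustic,
        "n_vq": n_vq,
    }

print(parse_model_name("dualcodec_25hz_16384_1024_12vq"))
```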

## Citation