---
language: en
tags:
- speech quantization
license: mit
datasets:
- LibriTTS
---
# Highlights
This model is a speech codec (quantizer) for English utterances.
- Lower frame rate: 25 tokens/s for each quantizer
- Higher codec quality at low bandwidths
- Trained with structured dropout, so a single model supports various bandwidths during inference
- Quantizes a raw speech waveform into a sequence of discrete tokens
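One way to picture structured dropout is as sampling, at each training step, how many leading quantizers stay active, so that every prefix of the quantizer stack learns to produce usable output. The sketch below is only an illustration under stated assumptions: it samples uniformly over 1..32 quantizers (32 is inferred from `nq32` in the model name), and the function name and sampling distribution are hypothetical rather than taken from the FunCodec code.

```python
import random

def sample_active_quantizers(num_quantizers: int, rng: random.Random) -> int:
    """Pick how many leading quantizers to keep for one training step.

    Assumption: uniform sampling over 1..num_quantizers; the actual
    training recipe may use a different distribution.
    """
    return rng.randint(1, num_quantizers)

rng = random.Random(0)
FRAME_RATE = 25  # tokens per second for each quantizer (from this model card)
for step in range(3):
    n_active = sample_active_quantizers(32, rng)
    print(f"step {step}: keep first {n_active} quantizers "
          f"-> {n_active * FRAME_RATE} tokens/s")
```

At inference time the same idea applies in reverse: choosing fewer quantizers lowers the token rate, and therefore the bandwidth, without retraining.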
# FunCodec model
This model is trained with [FunCodec](https://github.com/alibaba-damo-academy/FunCodec),
an open-source toolkit for speech quantization (codec) from DAMO Academy, Alibaba Group.
This repository provides a model pre-trained on the LibriTTS corpus.
It can be applied to low-bandwidth speech communication, speech quantization, zero-shot speech synthesis
and other academic research topics.
Compared with [EnCodec](https://arxiv.org/abs/2210.13438) and [SoundStream](https://arxiv.org/abs/2107.03312),
the following techniques are used to train the model, resulting in higher codec quality and
[ViSQOL](https://github.com/google/visqol) scores at the same bandwidth:
- A magnitude spectrum loss is employed to enhance the middle- and high-frequency signals
- Structured dropout is employed to smooth the code space and to enable various bandwidths in a single model
- Codes are initialized with k-means clusters rather than random values
- Codebooks are maintained with an exponential moving average and a dead-code elimination mechanism, resulting in a high codebook utilization factor.
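The codebook maintenance described in the last bullet can be sketched as follows. This is a minimal pure-Python illustration, not the FunCodec implementation: the `decay` and `dead_threshold` values, the function names, and the choice to re-seed a dead code from a random batch vector are all assumptions, and the k-means initialization is not shown.

```python
import random

def nearest(codebook, x):
    """Index of the code with the smallest squared Euclidean distance to x."""
    return min(range(len(codebook)),
               key=lambda k: sum((c - v) ** 2 for c, v in zip(codebook[k], x)))

def ema_step(codebook, ema_count, ema_sum, batch,
             decay=0.99, dead_threshold=0.01, rng=random):
    """One in-place EMA codebook update followed by dead-code replacement."""
    dim = len(codebook[0])
    count = [0.0] * len(codebook)
    total = [[0.0] * dim for _ in codebook]
    for x in batch:
        k = nearest(codebook, x)
        count[k] += 1.0
        total[k] = [t + v for t, v in zip(total[k], x)]
    for k in range(len(codebook)):
        ema_count[k] = decay * ema_count[k] + (1 - decay) * count[k]
        ema_sum[k] = [decay * s + (1 - decay) * t
                      for s, t in zip(ema_sum[k], total[k])]
        if ema_count[k] < dead_threshold:
            # Dead code: re-seed it from a random vector of the current batch.
            x = list(rng.choice(batch))
            codebook[k], ema_count[k], ema_sum[k] = x, 1.0, list(x)
        else:
            codebook[k] = [s / ema_count[k] for s in ema_sum[k]]

# Toy run: the second code starts far from the data and is never selected.
codebook = [[0.0, 0.0], [10.0, 10.0]]
ema_count = [1.0, 1.0]
ema_sum = [list(c) for c in codebook]
rng = random.Random(0)
for _ in range(500):
    ema_step(codebook, ema_count, ema_sum, [[0.9, 1.1], [1.1, 0.9]], rng=rng)
print(codebook)
```

In the toy run the unused code's usage count decays until it falls below the threshold, at which point it is moved onto the data and starts being selected, which is the effect that keeps codebook utilization high.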
## Model description
This model is a variational autoencoder that uses residual vector quantization (RVQ) to obtain
several parallel sequences of discrete latent representations. The figure below gives an overview of FunCodec models.
<p align="center">
<img src="fig/framework.png" alt="FunCodec architecture"/>
</p>
In general, FunCodec models consist of five modules: a domain transformation module,
an encoder, an RVQ module, a decoder and a domain inversion module.
- Domain Transformation: transform signals into the time domain, short-time frequency domain, magnitude-angle domain or magnitude-phase domain.
- Encoder: encode signals into compact representations with stacked convolutional and LSTM layers.
- Semantic tokens (Optional): augment encoder outputs with semantic tokens to enhance the content information; not used in this model.
- RVQ: quantize the representations into parallel sequences of discrete tokens with cascaded vector quantizers.
- Decoder: decode quantized embeddings back into the same signal domain as the inputs.
- Domain Inversion: re-synthesize perceptible waveforms from the different domains.
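To make the RVQ step concrete, the sketch below quantizes a single frame with cascaded quantizers, where each stage codes the residual left by the previous one. It is a toy illustration in pure Python: the codebook values, vector dimension and function names are hypothetical, and it is not the FunCodec implementation.

```python
def quantize(codebook, x):
    """Return (index, code) of the code nearest to x (squared Euclidean distance)."""
    idx = min(range(len(codebook)),
              key=lambda k: sum((c - v) ** 2 for c, v in zip(codebook[k], x)))
    return idx, codebook[idx]

def rvq_encode(codebooks, x, num_quantizers=None):
    """Quantize x with cascaded quantizers; each one codes the previous residual."""
    residual = list(x)
    tokens = []
    for codebook in codebooks[:num_quantizers]:
        idx, code = quantize(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, code)]
    return tokens

def rvq_decode(codebooks, tokens):
    """Sum the selected codes of the quantizers that were actually used."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, tokens):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Toy codebooks: a coarse stage followed by a finer stage (hypothetical values).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, 0.0], [0.0, -0.25]],
]
frame = [1.2, 1.0]
coarse = rvq_encode(codebooks, frame, num_quantizers=1)
full = rvq_encode(codebooks, frame)
print(coarse, rvq_decode(codebooks, coarse))
print(full, rvq_decode(codebooks, full))
```

Each frame yields one token per quantizer, so a whole utterance produces one token sequence per quantizer. Decoding with only the first few quantizers gives a coarser reconstruction at a lower token rate.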
More details can be found at:
- Paper: [FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec](https://arxiv.org/abs/2309.07405)
- Codebase: [FunCodec](https://github.com/alibaba-damo-academy/FunCodec)
## Intended uses & scenarios
### Inference with FunCodec
You can extract codec tokens and reconstruct waveforms from them with the FunCodec repository.
#### FunCodec installation
```sh
# Install PyTorch with GPU support (version >= 1.12.0):
conda install pytorch==1.12.0
# For other versions, please refer to https://pytorch.org/get-started/locally
# Download codebase:
git clone https://github.com/alibaba-damo-academy/FunCodec.git
# Install FunCodec codebase:
cd FunCodec
pip install --editable ./
```
#### Codec extraction
```sh
# Enter the example directory
cd egs/LibriTTS/codec
# Specify the model name
model_name="audio_codec-encodec-en-libritts-16k-nq32ds640-pytorch"
# Download the model
git lfs install
git clone https://huggingface.co/alibaba-damo/${model_name}
mkdir exp
mv ${model_name} exp/$model_name
# Extract codecs for the files listed in "input_wav.scp"; the codecs are saved under "outputs/codecs"
bash encoding_decoding.sh --stage 1 --batch_size 16 --num_workers 4 --gpu_devices "0,1" \
--model_dir exp/${model_name} --bit_width 16000 --file_sampling_rate 16000 \
--wav_scp input_wav.scp --out_dir outputs/codecs
# input_wav.scp has the following format:
# uttid1 path/to/file1.wav
# uttid2 path/to/file2.wav
# ...
```
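If your recordings sit in a single directory, a file in the `input_wav.scp` format shown above can be generated with a few lines of Python. This is a convenience sketch rather than part of FunCodec: the helper name is hypothetical, and it assumes that each file stem is unique and can serve as the utterance id, and that paths contain no whitespace.

```python
import tempfile
from pathlib import Path

def write_wav_scp(wav_dir, scp_path) -> int:
    """Write one "<uttid> <path>" line per .wav file found under wav_dir."""
    wav_files = sorted(Path(wav_dir).rglob("*.wav"))
    with open(scp_path, "w", encoding="utf-8") as f:
        for wav in wav_files:
            # Assumption: the file stem is unique and serves as the utterance id.
            f.write(f"{wav.stem} {wav.resolve()}\n")
    return len(wav_files)

# Demo on a temporary directory holding two empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("spk1_utt1.wav", "spk1_utt2.wav"):
        (Path(tmp) / name).touch()
    scp = Path(tmp) / "input_wav.scp"
    print(write_wav_scp(tmp, scp), "entries")
    print(scp.read_text(encoding="utf-8"))
```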
#### Reconstruct waveforms from codecs
```shell
# Reconstruct waveforms into "outputs/recon_wavs"
bash encoding_decoding.sh --stage 2 --batch_size 16 --num_workers 4 --gpu_devices "0,1" \
--model_dir exp/${model_name} --bit_width 16000 --file_sampling_rate 16000 \
--wav_scp outputs/codecs/codecs.txt --out_dir outputs/recon_wavs
# codecs.txt is the output of stage 1, which has the following format:
# uttid1 [[[1, 2, 3, ...],[2, 3, 4, ...], ...]]
# uttid2 [[[9, 7, 5, ...],[3, 1, 2, ...], ...]]
# ...
```
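For downstream use, the `codecs.txt` lines described above can be loaded into Python. The sketch below assumes that each line is an utterance id followed by a nested integer list that is valid JSON, as the format comment suggests. The helper names are hypothetical, and the meaning of each nesting level is not specified here, so it should be checked against the real output.

```python
import json

def parse_codec_line(line: str):
    """Split one "<uttid> <nested list>" line into (uttid, tokens)."""
    uttid, payload = line.strip().split(maxsplit=1)
    return uttid, json.loads(payload)

def load_codecs(path: str) -> dict:
    """Read a codecs.txt-style file into {uttid: nested token list}."""
    codecs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                uttid, tokens = parse_codec_line(line)
                codecs[uttid] = tokens
    return codecs

uttid, tokens = parse_codec_line("uttid1 [[[1, 2, 3], [2, 3, 4]]]")
print(uttid, len(tokens[0]), "rows of", len(tokens[0][0]), "tokens")
```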
### Inference with Hugging Face Transformers
Inference with the Hugging Face Transformers package is under development.
### Application scenarios
Running environment
- Currently, the model has only been tested on Linux-x86_64. macOS and Windows are untested.
Intended usage scenarios
- This model is suitable for academic use
- Speech quantization, codec and tokenization for English utterances
## Evaluation results
### Training configuration
- Feature info: raw waveform input
- Train info: Adam, lr 3e-4, batch_size 32, 2 GPUs (Tesla V100), acc_grad 1, 300000 steps, speech_max_length 51200
- Loss info: L1, L2, discriminative loss
- Model info: SEANet, Conv, LSTM
- Train config: config.yaml
- Model size: 57.83 M parameters
### Experimental Results
ViSQOL scores on the LibriTTS test set at different total token rates (tokens per second, tk/s):
| Test set | 50 tk/s | 100 tk/s | 200 tk/s | 400 tk/s |
|:--------:|:--------:|:--------:|:--------:|:--------:|
| LibriTTS | 3.64 | 3.94 | 4.16 | 4.29 |
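The token rates in the table can be related to the number of quantizers in use. The arithmetic below assumes that `ds640` in the model name denotes a hop of 640 samples per frame, which at 16 kHz gives the 25 tokens/s per quantizer stated in the Highlights section; the helper name is hypothetical.

```python
SAMPLE_RATE = 16000  # Hz, matches --file_sampling_rate
HOP = 640            # samples per frame; assumed from "ds640" in the model name
FRAME_RATE = SAMPLE_RATE // HOP  # 25 frames/s, i.e. tokens/s per quantizer

def quantizers_for(token_rate: int, frame_rate: int = FRAME_RATE) -> int:
    """Number of quantizers whose combined output equals token_rate tokens/s."""
    if token_rate % frame_rate:
        raise ValueError("token rate must be a multiple of the frame rate")
    return token_rate // frame_rate

for rate in (50, 100, 200, 400):
    print(f"{rate} tk/s -> {quantizers_for(rate)} quantizers")
```

Under these assumptions the four columns correspond to 2, 4, 8 and 16 active quantizers.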
### Limitations and bias
- Not very robust to background noise and reverberation
### BibTeX entry and citation info
```BibTeX
@misc{du2023funcodec,
title={FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec},
author={Zhihao Du and Shiliang Zhang and Kai Hu and Siqi Zheng},
year={2023},
eprint={2309.07405},
archivePrefix={arXiv},
primaryClass={cs.SD}
}
```