English
speech quantization
File size: 6,104 Bytes
0310dcd
02f860a
 
 
0310dcd
02f860a
 
0310dcd
02f860a
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
---
language: en
tags:
- speech quantization
license: mit
datasets:
- LibriTTS
---

# Highlights
This model is used for speech codec or quantization on English utterances.
- Lower frame rate, 25 token/s for each quantizer
- Achieving higher codec quality under low band widths
- Training with structured dropout, enabling various band widths during inference with a single model
- Quantizing a raw speech waveform into a sequence of discrete tokens

# FunCodec model
This model is trained with [FunCodec](https://github.com/alibaba-damo-academy/FunCodec), 
an open-source toolkits for speech quantization (codec) from the Damo academy, Alibaba Group.
This repository provides a pre-trained model on the LibriTTS corpus.
It can be applied to low-band-width speech communication, speech quantization, zero-shot speech synthesis 
and other academic research topics.
Compared with [EnCodec](https://arxiv.org/abs/2210.13438) and [SoundStream](https://arxiv.org/abs/2107.03312), 
the following improved techniques are utilized to train the model, resulting in higher codec quality and 
[ViSQOL](https://github.com/google/visqol) scores under the same band width:
- The magnitude spectrum loss is employed to enhance the middle and high frequency signals
- Structured dropout is employed to smooth the code space, as well as enable various band widths in a single model
- Codes are initialized by k-means clusters rather than random values
- Codebooks are maintained with exponential moving average and dead-code-elimination mechanism, resulting in high utilization factor for codebooks.

## Model description
This model is a variational autoencoder that uses residual vector quantisation (RVQ) to obtain 
several parallel sequences of discrete latent representations. Here is an overview of FunCodec models.
<p align="center">
<img src="fig/framework.png" alt="FunCodec architecture"/>
</p>

In general, FunCodec models consist of five modules: a domain transformation module, 
an encoder, a RVQ module, a decoder and a domain inversion module.
- Domain Transformation:transfer signals into time domain, short-time frequency domain, magnitude-angle domain or magnitude-phase domain.
- Encoder:encode signals into compact representations with stacked convolutional and LSTM layers.
- Semantic tokens (Optional): augment encoder outputs with semantic tokens to enhance the content information, not used in this model.
- RVQ:quantize the representations into parallel sequences of discrete tokens with cascaded vector quantizers.
- Decoder:decode quantized embeddings into different signal domains the same as inputs.
- Domain Inversion:re-synthesize perceptible waveforms from different domains.

More details can be found at:
- Paper: [FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec](https://arxiv.org/abs/2309.07405)
- Codebase: [FunCodec](https://github.com/alibaba-damo-academy/FunCodec)

## Intended uses & sceneries
### Inference with FunCodec

You can extract codecs and reconstruct them back to waveforms with FunCodec repository.

#### FunCodec installation
```sh
# Install Pytorch GPU (version >= 1.12.0):
conda install pytorch==1.12.0
# for other versions, please refer: https://pytorch.org/get-started/locally

# Download codebase:
git clone https://github.com/alibaba-damo-academy/FunCodec.git

# Install FunCodec codebase:
cd FunCodec
pip install --editable ./
```

#### Codec extraction
```sh
# Enter the example directory 
cd egs/LibriTTS/codec
# Specify the model name
model_name="audio_codec-encodec-en-libritts-16k-nq32ds640-pytorch"
# Download the model
git lfs install
git clone https://huggingface.co/alibaba-damo/${model_name}
mkdir exp
mv ${model_name} exp/$model_name
# Extracting codec within the input file "input_wav.scp" and the codecs are saved under "outputs/codecs"
bash encoding_decoding.sh --stage 1 --batch_size 16 --num_workers 4 --gpu_devices "0,1" \
  --model_dir exp/${model_name} --bit_width 16000 --file_sampling_rate 16000 \
  --wav_scp input_wav.scp --out_dir outputs/codecs
# input_wav.scp has the following format:
# uttid1 path/to/file1.wav
# uttid2 path/to/file2.wav
# ...
```

### Reconstruct waveforms from codecs
```shell
# Reconstruct waveforms into "outputs/recon_wavs"
bash encoding_decoding.sh --stage 2 --batch_size 16 --num_workers 4 --gpu_devices "0,1" \
  --model_dir exp/${model_name} --bit_width 16000 --file_sampling_rate 16000 \
  --wav_scp outputs/codecs/codecs.txt --out_dir outputs/recon_wavs 
# codecs.txt is the output of stage 1, which has the following format:
# uttid1 [[[1, 2, 3, ...],[2, 3, 4, ...], ...]]
# uttid2 [[[9, 7, 5, ...],[3, 1, 2, ...], ...]]
# ...
```

### Inference with Huggingface Transformers
Inference with Huggingface transformers package is under development.

### Application sceneries
Running environment
- Currently, the model only passed the tests on Linux-x86_64. Mac and Windows systems are not tested.

Intended using sceneries
- This model is suitable for academic usages
- Speech quantization, codec and tokenization for English utterances

## Evaluation results

### Training configuration
- Feature info: raw waveform input
- Train info: Adam, lr 3e-4, batch_size 32, 2 gpu(Tesla V100), acc_grad 1, 300000 steps, speech_max_length 51200
- Loss info: L1, L2, discriminative loss
- Model info: SEANet, Conv, LSTM
- Train config: config.yaml
- Model size: 57.83 M parameters

### Experimental Results

Test set: LibriTTS-test, ViSQOL scores
| testset  | 50 tk/s  | 100 tk/s | 200 tk/s | 400 tk/s |
|:--------:|:--------:|:--------:|:--------:|:--------:|
| LibriTTS |   3.64   |   3.94   |   4.16   |   4.29   |

### Limitations and bias
- Not very robust to background noises and reverberation

### BibTeX entry and citation info
```BibTeX
@misc{du2023funcodec,
      title={FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec},
      author={Zhihao Du, Shiliang Zhang, Kai Hu, Siqi Zheng},
      year={2023},
      eprint={2309.07405},
      archivePrefix={arXiv},
      primaryClass={cs.Sound}
}
```