![encodec image](https://github.com/facebookresearch/encodec/raw/2d29d9353c2ff0ab1aeadc6a3d439854ee77da3e/architecture.png)

# Model Card for EnCodec

This model card provides details and information about EnCodec, a state-of-the-art real-time audio codec developed by Meta AI.

## Model Details

### Model Description

EnCodec is a high-fidelity audio codec leveraging neural networks. It introduces a streaming encoder-decoder architecture with quantized latent space, trained in an end-to-end fashion.
The model simplifies and speeds up training using a single multiscale spectrogram adversary that efficiently reduces artifacts and produces high-quality samples.
It also includes a novel loss balancer mechanism that stabilizes training by decoupling the choice of hyperparameters from the typical scale of the loss.
Additionally, lightweight Transformer models are used to further compress the obtained representation while maintaining real-time performance.

- **Developed by:** Meta AI
- **Model type:** Audio Codec

### Model Sources

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

EnCodec can be used directly as an audio codec for real-time compression and decompression of audio signals.
It provides high-quality audio compression and efficient decoding. The model was trained on various bandwidths, which can be specified when encoding (compressing) and decoding (decompressing).
Two different setups exist for EnCodec:

- Non-streamable: the input audio is split into chunks of 1 second, with an overlap of 10 ms, which are then encoded.
- Streamable: weight normalization is used on the convolution layers, and the input is not split into chunks but rather padded on the left.
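The non-streamable chunking can be illustrated with a minimal sketch (the helper is ours, not part of the EnCodec API; it assumes the 24 kHz model and drops a trailing partial chunk):

```python
def chunk_audio(samples, sample_rate=24000, chunk_seconds=1.0, overlap_seconds=0.010):
    """Split a waveform (list of samples) into fixed-size chunks with a small
    overlap, mimicking the non-streamable setup (1 s chunks, 10 ms overlap).
    Trailing samples shorter than a full chunk are dropped in this sketch."""
    chunk = int(chunk_seconds * sample_rate)          # 24,000 samples per chunk
    hop = chunk - int(overlap_seconds * sample_rate)  # 23,760-sample stride
    return [samples[i:i + chunk] for i in range(0, len(samples) - chunk + 1, hop)]

chunks = chunk_audio([0.0] * 60000)  # 2.5 s of silence at 24 kHz
```

Each chunk overlaps the next by 240 samples (10 ms at 24 kHz), so chunk starts are 23,760 samples apart.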

### Downstream Use

EnCodec can be fine-tuned for specific audio tasks or integrated into larger audio processing pipelines for applications such as speech generation, music generation, or text-to-speech tasks.

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

[More Information Needed]

## How to Get Started with the Model

Use the following code to get started with the EnCodec model:

## Training Details

The model was trained for 300 epochs, with one epoch being 2,000 updates with the Adam optimizer, a batch size of 64 examples of 1 second each, a learning rate of 3 · 10⁻⁴, β1 = 0.5, and β2 = 0.9. All models were trained using 8 A100 GPUs.
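As a rough sanity check, the schedule above implies the following totals (only the epoch/update/batch numbers come from the card; the derived figures are ours):

```python
# Back-of-the-envelope totals implied by the training schedule above.
epochs = 300
updates_per_epoch = 2_000
batch_size = 64            # examples per optimizer update
seconds_per_example = 1

total_updates = epochs * updates_per_epoch                        # 600,000 Adam steps
audio_seconds = total_updates * batch_size * seconds_per_example
audio_hours = audio_seconds / 3600                                # ~10,667 hours of audio seen
```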

### Training Data

<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

- For speech:
  - DNS Challenge 4
  - [Common Voice](https://huggingface.co/datasets/common_voice)
- For general audio:
  - [AudioSet](https://huggingface.co/datasets/Fhrozen/AudioSet2K22)
  - [FSD50K](https://huggingface.co/datasets/Fhrozen/FSD50k)
- For music:
  - [Jamendo dataset](https://huggingface.co/datasets/rkstgr/mtg-jamendo)

They used four different training strategies to sample from these datasets:

- (s1) sample a single source from Jamendo with probability 0.32;
- (s2) sample a single source from the other datasets with the same probability;
- (s3) mix two sources from all datasets with a probability of 0.24;
- (s4) mix three sources from all datasets except music with a probability of 0.12.
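Drawing one of these strategies can be sketched as follows (the strategy names and helper are ours; only the probabilities come from the list above):

```python
import random

# Hypothetical sketch of the four sampling strategies; note the probabilities sum to 1.
STRATEGY_PROBS = {
    "s1_single_jamendo": 0.32,
    "s2_single_other": 0.32,
    "s3_mix_two_all": 0.24,
    "s4_mix_three_no_music": 0.12,
}

def pick_strategy(rng):
    """Draw one strategy name according to the weights above."""
    names = list(STRATEGY_PROBS)
    return rng.choices(names, weights=[STRATEGY_PROBS[n] for n in names], k=1)[0]

rng = random.Random(0)
draws = [pick_strategy(rng) for _ in range(10_000)]
```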

The audio is normalized by file and a random gain between -10 and 6 dB is applied.
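The gain step can be sketched as follows (hypothetical helper, not EnCodec code; only the [-10, 6] dB range comes from the card):

```python
def apply_gain_db(samples, gain_db):
    """Scale a waveform by a gain expressed in dB (amplitude factor 10**(dB/20))."""
    factor = 10 ** (gain_db / 20)
    return [s * factor for s in samples]
```

A gain of 6 dB roughly doubles the amplitude, while -20 dB scales it by 0.1.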
## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

### Subjective metric for restoration

This model was evaluated using the MUSHRA protocol (Series, 2014), using both a hidden reference and a low anchor. Annotators were recruited using a crowd-sourcing platform, in which they were asked to rate the perceptual quality of the provided samples in a range between 1 and 100. They randomly selected 50 samples of 5 seconds from each category of the test set and forced at least 10 annotations per sample. To filter noisy annotations and outliers, annotators who rated the reference recordings below 90 in at least 20% of the cases, or rated the low-anchor recording above 80 more than 50% of the time, were removed.
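The filtering rule can be sketched as follows (the `(kind, score)` data layout and helper are our assumptions; only the thresholds come from the protocol above):

```python
def keep_annotator(ratings):
    """Sketch of the MUSHRA outlier-filtering rule. `ratings` is a list of
    (kind, score) pairs with kind in {"reference", "low_anchor", "system"}."""
    refs = [score for kind, score in ratings if kind == "reference"]
    lows = [score for kind, score in ratings if kind == "low_anchor"]
    if refs and sum(s < 90 for s in refs) / len(refs) >= 0.2:
        return False  # rates the hidden reference below 90 in >= 20% of cases
    if lows and sum(s > 80 for s in lows) / len(lows) > 0.5:
        return False  # rates the low anchor above 80 more than 50% of the time
    return True
```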

### Objective metric for restoration

The ViSQOL metric was used together with the Scale-Invariant Signal-to-Noise Ratio (SI-SNR) (Luo & Mesgarani, 2019; Nachmani et al., 2020; Chazan et al., 2021).
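SI-SNR has a standard formulation, sketched below for reference (an illustration, not the exact evaluation code):

```python
import math

def si_snr(reference, estimate):
    """Scale-invariant signal-to-noise ratio in dB: both signals are zero-meaned,
    the estimate is projected onto the reference, and the residual is noise."""
    ref_mean = sum(reference) / len(reference)
    est_mean = sum(estimate) / len(estimate)
    ref = [x - ref_mean for x in reference]
    est = [x - est_mean for x in estimate]
    scale = sum(r * e for r, e in zip(ref, est)) / sum(r * r for r in ref)
    target = [scale * r for r in ref]             # component aligned with the reference
    noise = [e - t for e, t in zip(est, target)]  # residual treated as noise
    return 10 * math.log10(sum(t * t for t in target) / sum(n * n for n in noise))
```

Because of the projection, rescaling the estimate leaves the score unchanged, hence "scale-invariant".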

### Results

The results of the evaluation demonstrate the superiority of EnCodec compared to the baselines across different bandwidths (1.5, 3, 6, and 12 kbps). Figure 3 provides an overview of the streamable setup results, while Table 1 offers a category-wise breakdown. Although alternative quantizers such as Gumbel-Softmax and DiffQ were explored, their preliminary results did not surpass or match the performance of EnCodec, so they are not included in the report.

When comparing EnCodec with the baselines at the same bandwidth, EnCodec consistently outperforms them in terms of MUSHRA score. Notably, EnCodec achieves better performance, on average, at 3 kbps compared to Lyra-v2 at 6 kbps and Opus at 12 kbps. Additionally, by incorporating the language model over the codes, it is possible to achieve a bandwidth reduction of approximately 25-40%. For example, the bandwidth of the 3 kbps model can be reduced to 1.9 kbps.
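The reported figures are mutually consistent, as a quick calculation shows (a sketch; only the 25-40% range and the 3 → 1.9 kbps example come from the text):

```python
# Worked example for the language-model bandwidth-reduction figures above.
def reduced_bandwidth(base_kbps, saving_fraction):
    """Bandwidth left after saving the given fraction via entropy coding."""
    return base_kbps * (1 - saving_fraction)

saving = 1 - 1.9 / 3  # ~0.367: the 3 -> 1.9 kbps example is a ~36.7% saving
```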

Furthermore, it is observed that as the bandwidth increases, the compression ratio decreases. This behavior can be attributed to the small size of the Transformer model used, which makes it challenging to effectively model all codebooks together.

#### Summary