ArthurZ HF staff committed on
Commit
533ecd8
1 Parent(s): 5b038b4

Create README.md

---
# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/model-cards
{}
---

![](https://github.com/facebookresearch/encodec/raw/2d29d9353c2ff0ab1aeadc6a3d439854ee77da3e/architecture.png)

# Model Card for EnCodec

This model card provides details and information about EnCodec, a state-of-the-art real-time audio codec developed by Facebook Research.

## Model Details

### Model Description

EnCodec is a high-fidelity audio codec leveraging neural networks. It introduces a streaming encoder-decoder architecture with a quantized latent space, trained in an end-to-end fashion. The model simplifies and speeds up training by using a single multiscale spectrogram adversary that efficiently reduces artifacts and produces high-quality samples. It also includes a novel loss balancer mechanism that stabilizes training by decoupling the choice of hyperparameters from the typical scale of the loss. Additionally, lightweight Transformer models are used to further compress the obtained representation while maintaining real-time performance.
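To illustrate the idea of a quantized latent space, here is a hypothetical, minimal sketch of residual vector quantization, the family of quantizer EnCodec builds on. The codebooks, vectors, and function names are invented toy values for illustration, not the model's actual parameters or API.

```python
def quantize(vector, codebook):
    """Return the index of the nearest codebook entry (squared L2 distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((v - c) ** 2 for v, c in zip(vector, codebook[i])))

def rvq_encode(vector, codebooks):
    """Each quantizer stage encodes the residual left over by the previous stage."""
    residual, codes = list(vector), []
    for codebook in codebooks:
        idx = quantize(residual, codebook)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes

def rvq_decode(codes, codebooks):
    """Decoding sums the selected entries from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out
```

Later stages refine the error of earlier ones, which is why dropping trailing codebooks trades quality for bandwidth.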

- **Developed by:** Facebook Research
- **Model type:** Audio Codec

### Model Sources

- **Repository:** [GitHub Repository](https://github.com/facebookresearch/encodec)
- **Paper:** [High Fidelity Neural Audio Compression](https://arxiv.org/abs/2210.13438)

## Uses
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

### Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

EnCodec can be used directly as an audio codec for real-time compression and decompression of audio signals. It provides high-quality audio compression and efficient decoding.

### Downstream Use

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

EnCodec can be fine-tuned for specific audio tasks or integrated into larger audio processing pipelines for applications such as speech recognition, audio streaming, or voice communication systems.

[More Information Needed]

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

[More Information Needed]

## How to Get Started with the Model

Use the following code to get started with the EnCodec model:

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Load the pre-trained 24 kHz EnCodec model and choose a target bandwidth (in kbps)
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

# Load the audio and convert it to the model's sample rate and channel count
wav, sr = torchaudio.load("audio.wav")
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

# Compress the audio into discrete codes
with torch.no_grad():
    encoded_frames = model.encode(wav)

# Decompress the codes back into a waveform
with torch.no_grad():
    reconstructed_audio = model.decode(encoded_frames)
```

## Training Details

### Training Data

We train all models for 300 epochs, with one epoch being 2,000 updates with the Adam optimizer, a batch size of 64 examples of 1 second each, a learning rate of 3 · 10⁻⁴, β1 = 0.5, and β2 = 0.9. All the models are trained using 8 A100 GPUs. We use the balancer introduced in Section 3.4 with weights λt = 0.1, λf = 1, λg = 3, λfeat = 3 for the 24 kHz models. For the 48 kHz model, we use instead λg = 4, λfeat = 4.
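As a quick sanity check on the training budget described above, the total amount of audio seen per run follows from pure arithmetic on the quoted numbers:

```python
# Numbers quoted above: 300 epochs x 2,000 updates x 64 one-second examples per batch
epochs = 300
updates_per_epoch = 2_000
batch_size = 64
seconds_per_example = 1

total_seconds = epochs * updates_per_epoch * batch_size * seconds_per_example
total_hours = total_seconds / 3600
print(f"{total_hours:.0f}")  # -> 10667 hours of (possibly repeated) one-second crops
```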

<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

- For speech:
  - DNS Challenge 4
  - [Common Voice](https://huggingface.co/datasets/common_voice)
- For general audio:
  - AudioSet
  - FSD50K
- For music:
  - Jamendo dataset (Bogdanov et al., 2019)

[More Information Needed]

We use four different sampling strategies:

- (s1) we sample a single source from Jamendo with probability 0.32;
- (s2) we sample a single source from the other datasets with the same probability;
- (s3) we mix two sources from all datasets with a probability of 0.24;
- (s4) we mix three sources from all datasets except music with a probability of 0.12.
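The four-way sampling scheme above can be sketched as follows; the strategy names and structure are illustrative, and only the probabilities come from the text:

```python
import random

# Mixing probabilities quoted above; they sum to 1.
STRATEGIES = [
    ("s1_single_jamendo", 0.32),
    ("s2_single_other", 0.32),
    ("s3_mix_two_all", 0.24),
    ("s4_mix_three_no_music", 0.12),
]

def sample_strategy(rng: random.Random) -> str:
    """Draw one of the four mixing strategies with the stated probabilities."""
    names = [name for name, _ in STRATEGIES]
    weights = [weight for _, weight in STRATEGIES]
    return rng.choices(names, weights=weights, k=1)[0]
```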

The audio is normalized by file and we apply a random gain between -10 and 6 dB. We reject any sample that has been clipped. Finally, we add reverberation using room impulse responses provided by the DNS challenge with probability 0.2, and RT60 in the range [0.3, 1.3], except for the single-source music samples. For testing, we use four categories: a clean speech sample from DNS alone, a clean speech sample mixed with an FSD50K sample, a Jamendo sample alone, and a proprietary music sample alone.
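The gain-and-rejection step described above can be sketched as a hypothetical helper; the function name and plain-list audio representation are illustrative, and only the dB range and rejection rule come from the text:

```python
import random

def apply_random_gain(samples, rng, low_db=-10.0, high_db=6.0):
    """Apply a random gain drawn uniformly in [-10, 6] dB.

    Returns None to reject the sample if the gain would clip it
    (any value outside [-1, 1]), mirroring the rejection rule above.
    """
    gain_db = rng.uniform(low_db, high_db)
    gain = 10.0 ** (gain_db / 20.0)  # convert dB to a linear amplitude factor
    out = [s * gain for s in samples]
    if any(abs(s) > 1.0 for s in out):
        return None
    return out
```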

## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

We consider both subjective and objective evaluation metrics. For the subjective tests we follow the MUSHRA protocol (Series, 2014), using both a hidden reference and a low anchor. Annotators were recruited using a crowd-sourcing platform, on which they were asked to rate the perceptual quality of the provided samples on a scale from 1 to 100. We randomly select 50 samples of 5 seconds from each category of the test set and force at least 10 annotations per sample. To filter noisy annotations and outliers, we remove annotators who rate the reference recordings less than 90 in at least 20% of the cases, or rate the low-anchor recording above 80 more than 50% of the time. For objective metrics, we use ViSQOL (Hines et al., 2012; Chinen et al., 2020), together with the Scale-Invariant Signal-to-Noise Ratio (SI-SNR) (Luo & Mesgarani, 2019; Nachmani et al., 2020; Chazan et al., 2021).
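The annotator-filtering rule can be sketched as a hypothetical helper; the thresholds are the ones stated above:

```python
def keep_annotator(reference_scores, low_anchor_scores):
    """Keep an annotator unless they rate the hidden reference below 90
    in at least 20% of cases, or the low anchor above 80 in more than
    50% of cases."""
    ref_bad = sum(s < 90 for s in reference_scores) / len(reference_scores)
    anchor_bad = sum(s > 80 for s in low_anchor_scores) / len(low_anchor_scores)
    return ref_bad < 0.2 and anchor_bad <= 0.5
```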

### Results

We start with the results for EnCodec with a bandwidth in {1.5, 3, 6, 12} kbps and compare them to the baselines. Results for the streamable setup are reported in Figure 3, with a breakdown per category in Table 1. We additionally explored other quantizers such as Gumbel-Softmax and DiffQ (see details in Appendix A.2); however, we found in preliminary results that they provide similar or worse results, hence we do not report them. At the same bandwidth, EnCodec is superior to all evaluated baselines in terms of MUSHRA score. Notably, EnCodec at 3 kbps reaches better performance on average than Lyra-v2 at 6 kbps and Opus at 12 kbps. When adding the language model over the codes, we can reduce the bandwidth by roughly 25-40%. For instance, we can reduce the bandwidth of the 3 kbps model to 1.9 kbps. We observe that for higher bandwidths the compression ratio is lower, which could be explained by the small size of the Transformer model used, making it hard to model all codebooks together.
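The quoted 3 kbps to 1.9 kbps example indeed lands inside the reported 25-40% range, as a line of arithmetic confirms:

```python
base_kbps = 3.0
with_lm_kbps = 1.9  # bandwidth after adding the language model over the codes
reduction = 1.0 - with_lm_kbps / base_kbps
print(f"{reduction:.1%}")  # -> 36.7%
```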


#### Summary

EnCodec is a state-of-the-art real-time neural audio compression model that excels in producing high-fidelity audio samples at various sample rates and bandwidths. The model's performance was evaluated across different settings, ranging from 24 kHz monophonic at 1.5 kbps to 48 kHz stereophonic, showcasing both subjective and objective results (Figure 3 and Table 4). Notably, EnCodec incorporates a novel spectrogram-only adversarial loss, effectively reducing artifacts and enhancing sample quality. Training stability and interpretability were further enhanced through the introduction of a gradient balancer for the loss weights. Additionally, the study demonstrated that a compact Transformer model can be employed to achieve an additional bandwidth reduction of up to 40% without compromising quality, particularly in applications where low latency is not critical (e.g., music streaming).
137
+
138
+
139
+ ## Citation
140
+
141
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
142
+
143
+ **BibTeX:**
144
+
145
+ @misc{défossez2022high,
146
+ title={High Fidelity Neural Audio Compression},
147
+ author={Alexandre Défossez and Jade Copet and Gabriel Synnaeve and Yossi Adi},
148
+ year={2022},
149
+ eprint={2210.13438},
150
+ archivePrefix={arXiv},
151
+ primaryClass={eess.AS}
152
+ }
153
+