FAQ
Here are some questions I encountered and my answers; I hope this helps you.
I'm new to this, so sorry if the formatting isn't perfect.
1. Running on offline servers
Download the model folder and upload it to your server (a sketch of one way to download it is at the end of this section), then load it with `local_files_only=True`:
```python
import os
from transformers import AutoProcessor, EncodecModel

# Force offline mode so nothing tries to reach the Hugging Face Hub
os.environ['HF_DATASETS_OFFLINE'] = "1"
os.environ['TRANSFORMERS_OFFLINE'] = "1"

vae_path = 'path_to_model_direc'  # local path to the uploaded model folder
model = EncodecModel.from_pretrained(vae_path, local_files_only=True)
model = model.cuda()
processor = AutoProcessor.from_pretrained(vae_path, local_files_only=True)
```
You will receive this warning, which looks a lot like an error message:
```
Could not find image processor class in the image processor config or the model config. Loading based on pattern matching with the model's feature extractor configuration. Please open a PR/issue to update `preprocessor_config.json` to use `image_processor_type` instead of `feature_extractor_type`. This warning will be removed in v4.40.
```
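One way to download the model folder in the first place, on a machine that does have internet access, is `snapshot_download` from `huggingface_hub` (a sketch; the repo id and local path are just examples, adjust them to whichever Encodec checkpoint you use):

```python
from huggingface_hub import snapshot_download

# Download the full model repo into a local folder, then copy that folder to the offline server
snapshot_download(repo_id="facebook/encodec_24khz", local_dir="path_to_model_direc")
```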
2. Bandwidth
The default bandwidth is 6 kbps; we can change it when calling `model.encode`:
```python
encoder_outputs = model.encode(data["input_values"], data["padding_mask"], bandwidth=24)
```
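For context, here is a minimal sketch of how that call fits into a full encode/decode round trip, reusing the `model` and `processor` loaded above; `raw_audio` is my placeholder for a 1-D waveform array at the model's sampling rate:

```python
import torch

# Prepare inputs: raw_audio is a 1-D numpy array (or list of floats) at processor.sampling_rate
data = processor(raw_audio=raw_audio, sampling_rate=processor.sampling_rate, return_tensors="pt")
data = {k: v.cuda() for k, v in data.items()}  # the model was moved to the GPU above

with torch.no_grad():
    # Encode at 24 kbps instead of the default 6 kbps
    encoder_outputs = model.encode(data["input_values"], data["padding_mask"], bandwidth=24.0)
    # Decode the discrete codes back into a waveform
    audio_values = model.decode(
        encoder_outputs.audio_codes, encoder_outputs.audio_scales, data["padding_mask"]
    )[0]
```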
3. Frame/chunking
By default the model encodes the entire waveform in one pass.
We can change this so that the model first cuts the waveform into small chunks and encodes each chunk on its own, without information from the other chunks. This should be done only for very long waveforms, as it decreases performance according to my tests.
```python
model.config.chunk_length_s = 1   # length of each chunk, in seconds
model.config.overlap = 0.4        # overlap ratio between consecutive chunks
print(model.config.chunk_length)  # result: 24000
print(model.config.chunk_stride)  # result: 14400
```
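As far as I can tell, those two printed values follow directly from the config (a small check, assuming the 24 kHz model so that `model.config.sampling_rate` is 24000):

```python
sampling_rate = model.config.sampling_rate    # 24000 for the 24 kHz model
chunk_length = int(1 * sampling_rate)         # chunk_length_s * sampling_rate = 24000 samples
chunk_stride = int((1 - 0.4) * chunk_length)  # (1 - overlap) * chunk_length = 14400 samples
```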
The length check could be flawed (or is it intended?) and does not allow `overlap >= 0.5`; it should arguably be `assert (length - offset) % stride == 0` instead of `assert length % stride == offset`.
4. Code to pad the waveform, if needed:
```python
import numpy as np
import torch

def pad_waveform_to_nearest_length(waveform, sample_rate, length=1, overlap=0.4):
    # Chunk stride and offset in seconds, matching chunk_length_s and overlap above
    stride = length * (1 - overlap)
    offset = length * overlap
    waveform_length_sec = waveform.size(1) / sample_rate
    # Round up to the nearest duration that is a whole number of strides plus the offset
    target_length_sec = np.ceil((waveform_length_sec - offset) / stride) * stride + offset
    target_num_samples = int(target_length_sec * sample_rate)
    pad_length = target_num_samples - waveform.size(1)
    if pad_length > 0:
        return torch.nn.functional.pad(waveform, (0, pad_length), mode='constant', value=0)
    return waveform
```
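A quick usage example (the 2-second random waveform and the 24000 Hz sample rate are only for illustration):

```python
waveform = torch.randn(1, 2 * 24000)  # (channels, samples): 2 s of audio at 24 kHz
padded = pad_waveform_to_nearest_length(waveform, sample_rate=24000, length=1, overlap=0.4)
print(waveform.shape, padded.shape)   # padded up to the next duration of the form n * stride + offset
```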
Check this page for the model class: https://github.com/huggingface/transformers/blob/096f304695f7e7b169b031f7814352e900ad71c4/src/transformers/models/encodec/modeling_encodec.py#L526
@YS1619538093
How did you choose an overlap of 0.4? The 48 kHz version of Encodec is configured with `"chunk_length_s": 1.0` and `"overlap": 0.01` by default:
https://huggingface.co/facebook/encodec_48khz/blob/c3def8e7185ac8c8efdce6eb8c4a651e487a503e/config.json#L21
Also in the original Encodec repo from Facebook, the default overlap is 0.01:
https://github.com/facebookresearch/encodec/blob/0e2d0aed29362c8e8f52494baf3e6f99056b214f/encodec/model.py#L90
Ok, so I did some testing of my own, using a 30-minute audio file (recorded speech), basically just comparing the L1 error between the Encodec reconstruction and the original audio. The results show that larger overlap leads to slightly better reconstruction at a significant cost in processing time.
- Overlap of 0.01 is slightly worse than encoding the whole waveform in one shot, and is 3x slower.
- Overlap of 0.4 is slightly better than encoding the whole waveform in one shot, but is 5x slower.
The differences in the reconstruction error were imperceptible, at least to me (this is subject to individual hearing differences). I also repeated this for a 30-second cut from the beginning of the audio file and got similar results.
Note: You don't need a separate method to pad the waveform; just set `chunk_length_s` and `overlap` on the processor in addition to the model, and it will take care of this automatically.
See the code and results here: https://gist.github.com/AbrahamSanders/6157f1fcf5b1b9b020483a3c87470cae
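If I understand that note correctly, the setup looks roughly like this (a sketch, not taken from the gist; `raw_audio` is again a placeholder for the waveform array):

```python
# Set chunking on both the feature extractor (processor) and the model config
processor.chunk_length_s = 1.0
processor.overlap = 0.01
model.config.chunk_length_s = 1.0
model.config.overlap = 0.01

# With chunk_length_s set, the processor pads the audio to a valid chunked length on its own
data = processor(raw_audio=raw_audio, sampling_rate=processor.sampling_rate, return_tensors="pt")
encoder_outputs = model.encode(data["input_values"].cuda(), data["padding_mask"].cuda())
```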