FAQ


Here are some questions I ran into and my answers; I hope this helps you.

1. Running on offline servers

Download the model folder and upload it to your server, then load it locally with the offline flags set:

```python
import os

# set the offline flags before importing transformers
os.environ['HF_DATASETS_OFFLINE'] = "1"
os.environ['TRANSFORMERS_OFFLINE'] = "1"

from transformers import EncodecModel, AutoProcessor

vae_path = 'path_to_model_direc'  # local path of the uploaded model folder
model = EncodecModel.from_pretrained(vae_path, local_files_only=True)
model = model.cuda()
processor = AutoProcessor.from_pretrained(vae_path, local_files_only=True)
```

You will receive the following warning, which looks a lot like an error message but can be ignored:

```
Could not find image processor class in the image processor config or the model config. Loading based on pattern matching with the model's feature extractor configuration. Please open a PR/issue to update preprocessor_config.json to use image_processor_type instead of feature_extractor_type. This warning will be removed in v4.40.
```
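
The later sections use a `data` dict of processed inputs. A minimal sketch of how to build it with the processor, following the usage documented for EncodecModel; the silent placeholder audio is only an assumption for illustration:

```python
import numpy as np

# hypothetical input: 3 seconds of silence at the model's sampling rate
raw_audio = np.zeros(processor.sampling_rate * 3, dtype=np.float32)

# the processor returns "input_values" and "padding_mask", used as `data` below
data = processor(raw_audio=raw_audio,
                 sampling_rate=processor.sampling_rate,
                 return_tensors="pt")
data = {k: v.cuda() for k, v in data.items()}
```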

2. Bandwidth

The default bandwidth is 6; we can change it when calling model.encode:

```python
encoder_outputs = model.encode(data["input_values"], data["padding_mask"], bandwidth=24)
```
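
To listen to the result, the codes can be decoded back to audio. A sketch based on the usage documented for EncodecModel in transformers, reusing the `data` dict from the sketch above:

```python
# decode the discrete codes back into a waveform
audio_values = model.decode(
    encoder_outputs.audio_codes,
    encoder_outputs.audio_scales,
    data["padding_mask"],
)[0]

# equivalently, the forward pass encodes and decodes in one call
audio_values = model(data["input_values"], data["padding_mask"], bandwidth=24).audio_values
```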

3. Frames/chunking

By default, the model encodes the entire waveform in one shot.

We can change this so the model first cuts the waveform into small chunks and encodes each chunk independently, without information from the other chunks. This should only be done for very long waveforms, because it decreases reconstruction quality according to my tests.

```python
model.config.chunk_length_s = 1  # seconds per chunk
model.config.overlap = 0.4       # overlap ratio

print(model.config.chunk_length)  # result: 24000
print(model.config.chunk_stride)  # result: 14400
```

The length-check code may be flawed (or is this intended?) and does not allow overlap >= 0.5.
It should arguably be `assert (length - offset) % stride == 0` instead of `assert length % stride == offset`.
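
To make those numbers concrete, here is a small sanity check of the chunk arithmetic. It only restates the config values printed above; the loop lists the input lengths that the padding helper in the next section targets:

```python
sampling_rate = 24000                         # the 24 kHz model
chunk_length = int(1 * sampling_rate)         # chunk_length_s * sampling_rate -> 24000
chunk_stride = int((1 - 0.4) * chunk_length)  # (1 - overlap) * chunk_length   -> 14400
offset = chunk_length - chunk_stride          # overlap * chunk_length         -> 9600

# valid (padded) input lengths are offset + k * stride: 24000, 38400, 52800, ...
for k in range(1, 4):
    print(offset + k * chunk_stride)
```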

4. Code to pad the waveform if needed

```python
import numpy as np
import torch

def pad_waveform_to_nearest_length(waveform, sample_rate, length=1, overlap=0.4):
    # stride and offset in seconds, matching chunk_length_s and overlap above
    stride = length * (1 - overlap)
    offset = length * overlap
    waveform_length_sec = waveform.size(1) / sample_rate
    # round up to the nearest valid length: an integer number of strides plus the offset
    target_length_sec = np.ceil((waveform_length_sec - offset) / stride) * stride + offset
    target_num_samples = int(target_length_sec * sample_rate)
    pad_length = target_num_samples - waveform.size(1)
    if pad_length > 0:
        return torch.nn.functional.pad(waveform, (0, pad_length), mode='constant', value=0)
    return waveform
```
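
A hypothetical usage example (the tensor shape and the 24 kHz sample rate are assumptions for illustration):

```python
waveform = torch.randn(1, 30000)  # hypothetical (channels, samples) tensor, 1.25 s at 24 kHz
padded = pad_waveform_to_nearest_length(waveform, sample_rate=24000, length=1, overlap=0.4)
print(padded.shape)  # torch.Size([1, 38400]) -- the next valid chunked length
```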

Check this page for the model class: https://github.com/huggingface/transformers/blob/096f304695f7e7b169b031f7814352e900ad71c4/src/transformers/models/encodec/modeling_encodec.py#L526

@YS1619538093
How did you choose an overlap of 0.4? The 48 kHz version of Encodec is configured with "chunk_length_s": 1.0 and "overlap": 0.01 by default:
https://huggingface.co/facebook/encodec_48khz/blob/c3def8e7185ac8c8efdce6eb8c4a651e487a503e/config.json#L21

Also in the original Encodec repo from Facebook, the default overlap is 0.01:
https://github.com/facebookresearch/encodec/blob/0e2d0aed29362c8e8f52494baf3e6f99056b214f/encodec/model.py#L90

OK, so I did some testing of my own using a 30-minute audio file (recorded speech), basically just comparing the L1 error between the Encodec reconstruction and the original audio. The results show that larger overlap leads to slightly better reconstruction at a significant cost in processing time:

  • Overlap of 0.01 is slightly worse than encoding the whole waveform in one shot, and is 3x slower.
  • Overlap of 0.4 is slightly better than encoding the whole waveform in one shot, but is 5x slower.

The differences in reconstruction error were imperceptible, at least to me (this is subject to individual hearing differences). I also repeated this for a 30-second cut from the beginning of the audio file and got similar results.

Note: You don't need a separate method to pad the waveform - just set chunk_length_s and overlap on the processor in addition to the model and it will take care of this automatically.
See the code and results here: https://gist.github.com/AbrahamSanders/6157f1fcf5b1b9b020483a3c87470cae
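
For reference, a minimal sketch of that processor-based approach (the checkpoint name and the chunking values are placeholders based on the note above, not anything from the original post):

```python
from transformers import EncodecModel, AutoProcessor

model = EncodecModel.from_pretrained("facebook/encodec_24khz").cuda()
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")

# set chunking on both the processor and the model; the processor then pads
# the waveform to a valid chunked length for you
processor.chunk_length_s = 1.0
processor.overlap = 0.01
model.config.chunk_length_s = 1.0
model.config.overlap = 0.01
```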
