Diffusers
AudioLDMPipeline

class_labels should be provided when num_class_embeds > 0

#7
by ctrlMarcio - opened

Hello, excellent work!

I'm running into an error I can't get past when using the AudioLDM models. What I'm trying to do is build the complete pipeline manually, following the tutorial here.

My code is the following:

# %%
from diffusers import DDPMScheduler, UNet2DConditionModel, AutoencoderKL
from transformers import RobertaTokenizer, ClapTextModelWithProjection
import torch

# %%
model = "cvssp/audioldm"
device = "cuda"

# %%
vae = AutoencoderKL.from_pretrained(model, subfolder="vae").to(device)
tokenizer = RobertaTokenizer.from_pretrained(model, subfolder="tokenizer")
text_encoder = ClapTextModelWithProjection.from_pretrained(model, subfolder="text_encoder").to(device)
unet = UNet2DConditionModel.from_pretrained(model, subfolder="unet", num_class_embeds=0).to(device)
scheduler = DDPMScheduler.from_pretrained(model, subfolder="scheduler")

# %%
prompt = ["a photograph of an astronaut riding a horse"]
height = 512  # default height of Stable Diffusion
width = 512  # default width of Stable Diffusion
num_inference_steps = 25  # Number of denoising steps
guidance_scale = 7.5  # Scale for classifier-free guidance
generator = torch.manual_seed(0)  # Seed generator to create the initial latent noise
batch_size = len(prompt)

# %%
text_input = tokenizer(
    prompt, padding="max_length", max_length=tokenizer.model_max_length, truncation=True, return_tensors="pt"
)

with torch.no_grad():
    text_embeddings = text_encoder(text_input.input_ids.to(device))[0]

# %%
max_length = text_input.input_ids.shape[-1]
uncond_input = tokenizer([""] * batch_size, padding="max_length", max_length=max_length, return_tensors="pt")
uncond_embeddings = text_encoder(uncond_input.input_ids.to(device))[0]

# %%
text_embeddings = torch.cat([uncond_embeddings, text_embeddings])

# %%
latents = torch.randn(
    (batch_size, unet.in_channels, height // 8, width // 8),
    generator=generator,
)
latents = latents.to(device)

# %%
latents = latents * scheduler.init_noise_sigma

# %%
from tqdm.auto import tqdm

scheduler.set_timesteps(num_inference_steps)

for t in tqdm(scheduler.timesteps):
    # expand the latents if we are doing classifier-free guidance to avoid doing two forward passes.
    latent_model_input = torch.cat([latents] * 2)

    latent_model_input = scheduler.scale_model_input(latent_model_input, timestep=t)

    # predict the noise residual
    with torch.no_grad():
        noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample

    # perform guidance
    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
    noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

    # compute the previous noisy sample x_t -> x_t-1
    latents = scheduler.step(noise_pred, t, latents).prev_sample

At the noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample line I get:

Traceback (most recent call last)

in <module>:14

/home/admin/.local/lib/python3.8/site-packages/torch/nn/modules/module.py:1501 in _call_impl

  1498         if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks
  1499                 or _global_backward_pre_hooks or _global_backward_hooks
  1500                 or _global_forward_hooks or _global_forward_pre_hooks):
❱ 1501             return forward_call(*args, **kwargs)
  1502         # Do not call functions when jit is used
  1503         full_backward_hooks, non_full_backward_hooks = [], []
  1504         backward_pre_hooks = []

/home/admin/.local/lib/python3.8/site-packages/diffusers/models/unet_2d_condition.py:691 in forward

   688
   689         if self.class_embedding is not None:
   690             if class_labels is None:
❱  691                 raise ValueError("class_labels should be provided when num_class_embeds > 0")
   692
   693             if self.config.class_embed_type == "timestep":
   694                 class_labels = self.time_proj(class_labels)
ValueError: class_labels should be provided when num_class_embeds > 0

As far as I understand, in the config of the UNet you set num_class_embeds to null/None, so this shouldn't raise an error. As you can see in my code, I also tried forcing num_class_embeds to 0 when initializing my UNet object, without success.
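
For reference, this is roughly how I'm checking the loaded config (just a quick sketch; class_embed_type is the other attribute the failing code path mentions):

from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained("cvssp/audioldm", subfolder="unet")
print(unet.config.num_class_embeds)   # comes back as None for me
print(unet.config.class_embed_type)   # the attribute checked right after the raise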

Can you help me?

Thank you! 🤗

I'm having the same issue here...

Here is what I found:
In the UNet config, class_embed_type is set to simple_projection, which makes class_embedding not None. And when class_embedding is not None, class_labels must be passed to the forward call.

But I'm not sure how to fix this problem yet...
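
From skimming the AudioLDMPipeline source, it looks like the UNet is meant to receive the projected CLAP text embedding through class_labels rather than encoder_hidden_states. A rough, untested sketch of what the calls might need to look like, reusing the variables from the code above (the .text_embeds attribute and the L2 normalization are my reading of the pipeline source, not something I've verified end to end):

import torch
import torch.nn.functional as F

# ClapTextModelWithProjection returns a pooled, projected embedding
# (text_embeds), not per-token hidden states.
with torch.no_grad():
    prompt_embeds = text_encoder(
        text_input.input_ids.to(device),
        attention_mask=text_input.attention_mask.to(device),
    ).text_embeds
    uncond_embeds = text_encoder(
        uncond_input.input_ids.to(device),
        attention_mask=uncond_input.attention_mask.to(device),
    ).text_embeds

# The pipeline appears to L2-normalize the embeddings before use (assumption).
prompt_embeds = F.normalize(prompt_embeds, dim=-1)
uncond_embeds = F.normalize(uncond_embeds, dim=-1)
class_embeds = torch.cat([uncond_embeds, prompt_embeds])

# Inside the denoising loop: because class_embed_type == "simple_projection",
# the embedding has to go through class_embedding, i.e. be passed as
# class_labels; encoder_hidden_states is left as None.
with torch.no_grad():
    noise_pred = unet(
        latent_model_input,
        t,
        encoder_hidden_states=None,
        class_labels=class_embeds,
    ).sample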

@melissachen I did find that as well, but couldn't get past it. I sent an email to the main author, to no avail. I believe this project might be deprecated.
