Problem generating pooled prompt embedding

#132
by pfeaster - opened

I'm trying to break up SDXL prompt generation into smaller steps for experimentation, and I'm running into a problem with generating the pooled prompt embedding.

Here's where things seem to go wrong (final_raw_embedding is the sum of the looked-up 1280-dimensional embedding for each of the 77 tokens and the position embeddings):

text_encoder_2 = CLIPTextModelWithProjection.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="text_encoder_2"
)
encoder_outputs = text_encoder_2.text_model.encoder(
    inputs_embeds=final_raw_embedding,
    attention_mask=None,
    causal_attention_mask=causal_attention_mask.to(torch_device),
    output_attentions=None,
    output_hidden_states=True,
    return_dict=None,
)

At this point, encoder_outputs.hidden_states[-2] gives me the correct second piece of the concatenated "regular" embedding. Based on my reading of the SDXL pipeline script, I'd have thought encoder_outputs[0] would then give me the correct pooled embedding.

But it doesn't. The usual output of text_encoder_2 (when it's fed a text prompt directly) looks like this, with its first element, text_embeds, being a tensor of shape (1, 1280):

CLIPTextModelOutput(text_embeds=tensor([[-0.1136,  0.5139, -1.2741,  ..., -0.8194, -0.5301,  0.8064]],
       grad_fn=...), ...)

Instead, I'm getting this: a BaseModelOutput whose first element, last_hidden_state, has the wrong values and a shape of (1, 77, 1280):

BaseModelOutput(last_hidden_state=tensor([[[ 0.1234, -0.5610,  0.3213,  ...,  0.2541,  0.3055, -0.1040],
        [-0.3271, -0.2220,  0.7678,  ...,  0.3586, -0.0037,  0.0753], ...

Can anyone point out what I'm doing wrong, or suggest how I could go about generating a correct pooled embedding from my final_raw_embedding?
