Token size limit

#22
by gebaltso - opened

Hello, I would like to ask what the size limit of the prompt tokens is in SD3. Is it 2 x 77, or did I misunderstand? Thanks in advance.

For now it's 77, and this applies to all three text encoders. There's a PR to raise the limit for the T5 encoder only, which can go as high as 512, but for the CLIP ones it will still be 77.
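For reference, here's a minimal sketch of what that looks like from the user side once the T5 limit is raised. It assumes a diffusers build where StableDiffusion3Pipeline exposes a max_sequence_length argument for the T5 branch (the behavior that PR adds); the CLIP encoders stay at 77 either way.

import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    prompt="a very long prompt with well over 77 tokens ...",
    # Raises the token window for the T5 branch only; both CLIP
    # encoders still truncate at 77 tokens internally.
    max_sequence_length=512,
    num_inference_steps=28,
).images[0]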

hi @gebaltso do you mean 77 for the prompt + 77 for the negative prompt?
According to the code it should be 77, but on my side it truncates after 75 and I don't know why.

The real usable tokens are 75; the other two slots are taken by the BOS and EOS tokens. Also, 2 x 77 means that each CLIP model uses 77 tokens, and since there are two of them, you get 2 x 77.
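You can check the 75 + 2 split with a standalone CLIP tokenizer. The public CLIP-L checkpoint below is just a stand-in for the tokenizers bundled with SD3:

from transformers import CLIPTokenizer

tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
print(tok.model_max_length)  # 77

ids = tok("a photo of a cat", truncation=True).input_ids
# The first id is BOS and the last is EOS, so only 75 of the
# 77 positions are left for the prompt itself.
print(ids[0] == tok.bos_token_id, ids[-1] == tok.eos_token_id)  # True True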

Because the example prompts have more than 77 tokens, I previously modified diffusers to support long T5 prompts of up to 512 tokens.
But unfortunately this space is rarely used by anyone 😂
https://huggingface.co/spaces/vilarin/sd3m-long

This almost works:

from compel import Compel, ReturnedEmbeddingsType

# `pipeline` is an already-loaded pipeline; `prompt`, `negative_prompt`,
# `num_inference_steps`, and `num_images_per_prompt` are defined elsewhere.
compel = Compel(
    truncate_long_prompts=False,
    tokenizer=[pipeline.tokenizer, pipeline.tokenizer_2],
    text_encoder=[pipeline.text_encoder, pipeline.text_encoder_2],
    returned_embeddings_type=ReturnedEmbeddingsType.PENULTIMATE_HIDDEN_STATES_NON_NORMALIZED,
    requires_pooled=[False, True],
)

conditioning, pooled = compel(prompt)
negative_embed, negative_pooled = compel(negative_prompt)
# Long prompts can produce embeddings of different lengths, so pad the
# positive and negative tensors to match before calling the pipeline.
[conditioning, negative_embed] = compel.pad_conditioning_tensors_to_same_length(
    [conditioning, negative_embed]
)

images = pipeline(
    output_type="pil",
    num_inference_steps=num_inference_steps,
    num_images_per_prompt=num_images_per_prompt,
    width=512,
    height=512,
    prompt_embeds=conditioning,
    pooled_prompt_embeds=pooled,
    negative_prompt_embeds=negative_embed,
    negative_pooled_prompt_embeds=negative_pooled,
).images
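Note that this wires up only the two CLIP encoders and their tokenizers, which is compel's SDXL-style setup; SD3's third text encoder (T5) never sees the long prompt, which is presumably why it only almost works.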
