LTX Video
LTX Video is the first DiT-based video generation model capable of generating high-quality videos in real time. It produces 24 FPS videos at a 768x512 resolution faster than they can be watched. Trained on a large-scale dataset of diverse videos, the model generates high-resolution videos with realistic and varied content. We provide pipelines for both text-to-video and image + text-to-video use cases.
Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.
Available models:

| Model name | Recommended dtype |
|:---:|:---:|
| LTX Video 0.9 (`ltx-video-2b-v0.9.safetensors`) | `torch.bfloat16` |
| LTX Video 0.9.1 (`ltx-video-2b-v0.9.1.safetensors`) | `torch.bfloat16` |
Note: The recommended dtype is for the transformer component. The VAE and text encoders can be `torch.float32`, `torch.bfloat16`, or `torch.float16`, but the recommended dtype is `torch.bfloat16` as used in the original repository.
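For example, a minimal sketch of loading with the recommended dtypes and then reusing the already-loaded components for image + text-to-video. This also illustrates the component-reuse tip above, and assumes both LTX pipelines accept the same component set, as their class signatures later on this page suggest:

```python
import torch
from diffusers import LTXPipeline, LTXImageToVideoPipeline

# torch_dtype=torch.bfloat16 covers the transformer (the component the recommendation targets).
pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16)

# The VAE and text encoder also work in float16 or float32; cast them explicitly if needed:
# pipe.vae.to(torch.float16)
# pipe.text_encoder.to(torch.float16)

# Reuse the same components for image + text-to-video instead of loading the model twice.
image_pipe = LTXImageToVideoPipeline(**pipe.components)
```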
Loading Single Files
Loading the original LTX Video checkpoints is also possible with `~ModelMixin.from_single_file`. We recommend using `from_single_file` for the Lightricks series of models, as they plan to release multiple models in the single file format in the future.
```python
import torch
from diffusers import AutoencoderKLLTXVideo, LTXImageToVideoPipeline, LTXVideoTransformer3DModel

# `single_file_url` could also be https://huggingface.co/Lightricks/LTX-Video/ltx-video-2b-v0.9.1.safetensors
single_file_url = "https://huggingface.co/Lightricks/LTX-Video/ltx-video-2b-v0.9.safetensors"
transformer = LTXVideoTransformer3DModel.from_single_file(
    single_file_url, torch_dtype=torch.bfloat16
)
vae = AutoencoderKLLTXVideo.from_single_file(single_file_url, torch_dtype=torch.bfloat16)
pipe = LTXImageToVideoPipeline.from_pretrained(
    "Lightricks/LTX-Video", transformer=transformer, vae=vae, torch_dtype=torch.bfloat16
)

# ... inference code ...
```
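The `# ... inference code ...` placeholder is left as-is. As a minimal sketch of the inference step, mirroring the image-to-video example later on this page (the image URL and prompt are only illustrative):

```python
from diffusers.utils import export_to_video, load_image

pipe.to("cuda")  # or pipe.enable_model_cpu_offload() to trade speed for memory

image = load_image(
    "https://huggingface.co/datasets/a-r-r-o-w/tiny-meme-dataset-captioned/resolve/main/images/8.png"
)
prompt = "A young girl stands calmly in the foreground, looking directly at the camera, as a house fire rages in the background."
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

video = pipe(
    image=image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=704,
    height=480,
    num_frames=161,
    num_inference_steps=50,
).frames[0]
export_to_video(video, "output_single_file.mp4", fps=24)
```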
Alternatively, the pipeline can be used to load the weights with `~FromSingleFileMixin.from_single_file`.
```python
import torch
from diffusers import LTXImageToVideoPipeline
from transformers import T5EncoderModel, T5Tokenizer

single_file_url = "https://huggingface.co/Lightricks/LTX-Video/ltx-video-2b-v0.9.safetensors"
text_encoder = T5EncoderModel.from_pretrained(
    "Lightricks/LTX-Video", subfolder="text_encoder", torch_dtype=torch.bfloat16
)
tokenizer = T5Tokenizer.from_pretrained(
    "Lightricks/LTX-Video", subfolder="tokenizer", torch_dtype=torch.bfloat16
)
pipe = LTXImageToVideoPipeline.from_single_file(
    single_file_url, text_encoder=text_encoder, tokenizer=tokenizer, torch_dtype=torch.bfloat16
)
```
Loading LTX GGUF checkpoints is also supported:
```python
import torch
from diffusers.utils import export_to_video
from diffusers import LTXPipeline, LTXVideoTransformer3DModel, GGUFQuantizationConfig

ckpt_path = (
    "https://huggingface.co/city96/LTX-Video-gguf/blob/main/ltx-video-2b-v0.9-Q3_K_S.gguf"
)
transformer = LTXVideoTransformer3DModel.from_single_file(
    ckpt_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)
pipe = LTXPipeline.from_pretrained(
    "Lightricks/LTX-Video",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()

prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage"
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

video = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=704,
    height=480,
    num_frames=161,
    num_inference_steps=50,
).frames[0]
export_to_video(video, "output_gguf_ltx.mp4", fps=24)
```
Make sure to read the documentation on GGUF to learn more about our GGUF support.
Loading and running inference with the LTX Video 0.9.1 weights:
```python
import torch
from diffusers import LTXPipeline
from diffusers.utils import export_to_video

pipe = LTXPipeline.from_pretrained("a-r-r-o-w/LTX-Video-0.9.1-diffusers", torch_dtype=torch.bfloat16)
pipe.to("cuda")

prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage"
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

video = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=768,
    height=512,
    num_frames=161,
    decode_timestep=0.03,
    decode_noise_scale=0.025,
    num_inference_steps=50,
).frames[0]
export_to_video(video, "output.mp4", fps=24)
```
Refer to this section to learn more about optimizing memory consumption.
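As a rough sketch of the usual levers (model CPU offloading plus tiled VAE decoding), assuming the LTX VAE exposes `enable_tiling()` like the other video VAEs in the library:

```python
import torch
from diffusers import LTXPipeline

pipe = LTXPipeline.from_pretrained("a-r-r-o-w/LTX-Video-0.9.1-diffusers", torch_dtype=torch.bfloat16)

# Move each sub-model to the GPU only while it is running, at some cost in speed.
pipe.enable_model_cpu_offload()

# Decode the video latents in tiles to lower the peak memory of the VAE decode step.
pipe.vae.enable_tiling()
```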
Quantization
Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on video quality depending on the video model.
Refer to the Quantization overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized LTXPipeline for inference with bitsandbytes.
```python
import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, LTXVideoTransformer3DModel, LTXPipeline
from diffusers.utils import export_to_video
from transformers import BitsAndBytesConfig, T5EncoderModel

quant_config = BitsAndBytesConfig(load_in_8bit=True)
text_encoder_8bit = T5EncoderModel.from_pretrained(
    "Lightricks/LTX-Video",
    subfolder="text_encoder",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
transformer_8bit = LTXVideoTransformer3DModel.from_pretrained(
    "Lightricks/LTX-Video",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

pipeline = LTXPipeline.from_pretrained(
    "Lightricks/LTX-Video",
    text_encoder=text_encoder_8bit,
    transformer=transformer_8bit,
    torch_dtype=torch.float16,
    device_map="balanced",
)

prompt = "A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea. The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse. Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood, with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting."
video = pipeline(prompt=prompt, num_frames=161, num_inference_steps=50).frames[0]
export_to_video(video, "ship.mp4", fps=24)
```
LTXPipeline
class diffusers.LTXPipeline
( scheduler: FlowMatchEulerDiscreteScheduler, vae: AutoencoderKLLTXVideo, text_encoder: T5EncoderModel, tokenizer: T5TokenizerFast, transformer: LTXVideoTransformer3DModel )
Parameters
- transformer (LTXVideoTransformer3DModel) — Conditional Transformer architecture to denoise the encoded video latents.
- scheduler (FlowMatchEulerDiscreteScheduler) — A scheduler to be used in combination with `transformer` to denoise the encoded video latents.
- vae (AutoencoderKLLTXVideo) — Variational Auto-Encoder (VAE) model to encode and decode videos to and from latent representations.
- text_encoder (`T5EncoderModel`) — T5, specifically the google/t5-v1_1-xxl variant.
- tokenizer (`T5TokenizerFast`) — Tokenizer of class T5TokenizerFast.
Pipeline for text-to-video generation.
Reference: https://github.com/Lightricks/LTX-Video
__call__
( prompt: typing.Union[str, typing.List[str]] = None, negative_prompt: typing.Union[str, typing.List[str], NoneType] = None, height: int = 512, width: int = 704, num_frames: int = 161, frame_rate: int = 25, num_inference_steps: int = 50, timesteps: typing.List[int] = None, guidance_scale: float = 3, num_videos_per_prompt: typing.Optional[int] = 1, generator: typing.Union[torch.Generator, typing.List[torch.Generator], NoneType] = None, latents: typing.Optional[torch.Tensor] = None, prompt_embeds: typing.Optional[torch.Tensor] = None, prompt_attention_mask: typing.Optional[torch.Tensor] = None, negative_prompt_embeds: typing.Optional[torch.Tensor] = None, negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None, decode_timestep: typing.Union[float, typing.List[float]] = 0.0, decode_noise_scale: typing.Union[float, typing.List[float], NoneType] = None, output_type: typing.Optional[str] = 'pil', return_dict: bool = True, attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None, callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None, callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'], max_sequence_length: int = 128 ) → `~pipelines.ltx.LTXPipelineOutput` or `tuple`
Parameters
- prompt (`str` or `List[str]`, optional) — The prompt or prompts to guide the generation. If not defined, one has to pass `prompt_embeds` instead.
- height (`int`, defaults to `512`) — The height in pixels of the generated video.
- width (`int`, defaults to `704`) — The width in pixels of the generated video.
- num_frames (`int`, defaults to `161`) — The number of video frames to generate.
- num_inference_steps (`int`, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to higher quality at the expense of slower inference.
- timesteps (`List[int]`, optional) — Custom timesteps to use for the denoising process with schedulers which support a `timesteps` argument in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed will be used. Must be in descending order.
- guidance_scale (`float`, defaults to `3`) — Guidance scale as defined in Classifier-Free Diffusion Guidance. `guidance_scale` is defined as `w` of equation 2 of the Imagen paper. Guidance scale is enabled by setting `guidance_scale > 1`. A higher guidance scale encourages the model to generate videos that are closely linked to the text `prompt`, usually at the expense of lower quality.
- num_videos_per_prompt (`int`, optional, defaults to 1) — The number of videos to generate per prompt.
- generator (`torch.Generator` or `List[torch.Generator]`, optional) — One or a list of torch generator(s) to make generation deterministic.
- latents (`torch.Tensor`, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random `generator`.
- prompt_embeds (`torch.Tensor`, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the `prompt` input argument.
- prompt_attention_mask (`torch.Tensor`, optional) — Pre-generated attention mask for text embeddings.
- negative_prompt_embeds (`torch.FloatTensor`, optional) — Pre-generated negative text embeddings. If not provided, `negative_prompt_embeds` will be generated from the `negative_prompt` input argument.
- negative_prompt_attention_mask (`torch.FloatTensor`, optional) — Pre-generated attention mask for negative text embeddings.
- decode_timestep (`float`, defaults to `0.0`) — The timestep at which the generated video is decoded.
- decode_noise_scale (`float`, defaults to `None`) — The interpolation factor between random noise and denoised latents at the decode timestep.
- output_type (`str`, optional, defaults to `"pil"`) — The output format of the generated video. Choose between `PIL.Image.Image` or `np.array`.
- return_dict (`bool`, optional, defaults to `True`) — Whether or not to return a `~pipelines.ltx.LTXPipelineOutput` instead of a plain tuple.
- attention_kwargs (`dict`, optional) — A kwargs dictionary that, if specified, is passed along to the `AttentionProcessor` as defined under `self.processor` in diffusers.models.attention_processor.
- callback_on_step_end (`Callable`, optional) — A function that is called at the end of each denoising step during inference. The function is called with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by `callback_on_step_end_tensor_inputs`.
- callback_on_step_end_tensor_inputs (`List`, optional) — The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list will be passed as the `callback_kwargs` argument. You will only be able to include variables listed in the `._callback_tensor_inputs` attribute of your pipeline class.
- max_sequence_length (`int`, defaults to `128`) — Maximum sequence length to use with the `prompt`.
Returns
~pipelines.ltx.LTXPipelineOutput or tuple
If `return_dict` is `True`, `~pipelines.ltx.LTXPipelineOutput` is returned, otherwise a `tuple` is returned where the first element is a list with the generated frames.
Function invoked when calling the pipeline for generation.
Examples:
```python
>>> import torch
>>> from diffusers import LTXPipeline
>>> from diffusers.utils import export_to_video

>>> pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")

>>> prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage"
>>> negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

>>> video = pipe(
...     prompt=prompt,
...     negative_prompt=negative_prompt,
...     width=704,
...     height=480,
...     num_frames=161,
...     num_inference_steps=50,
... ).frames[0]
>>> export_to_video(video, "output.mp4", fps=24)
```
encode_prompt
( prompt: typing.Union[str, typing.List[str]], negative_prompt: typing.Union[str, typing.List[str], NoneType] = None, do_classifier_free_guidance: bool = True, num_videos_per_prompt: int = 1, prompt_embeds: typing.Optional[torch.Tensor] = None, negative_prompt_embeds: typing.Optional[torch.Tensor] = None, prompt_attention_mask: typing.Optional[torch.Tensor] = None, negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None, max_sequence_length: int = 128, device: typing.Optional[torch.device] = None, dtype: typing.Optional[torch.dtype] = None )
Parameters
- prompt (`str` or `List[str]`, optional) — prompt to be encoded
- negative_prompt (`str` or `List[str]`, optional) — The prompt or prompts not to guide the generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`).
- do_classifier_free_guidance (`bool`, optional, defaults to `True`) — Whether to use classifier-free guidance or not.
- num_videos_per_prompt (`int`, optional, defaults to 1) — Number of videos that should be generated per prompt.
- prompt_embeds (`torch.Tensor`, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the `prompt` input argument.
- negative_prompt_embeds (`torch.Tensor`, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, `negative_prompt_embeds` will be generated from the `negative_prompt` input argument.
- device (`torch.device`, optional) — torch device to place the resulting embeddings on
- dtype (`torch.dtype`, optional) — torch dtype
Encodes the prompt into text encoder hidden states.
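For instance, a sketch of pre-computing embeddings once and reusing them across pipeline calls. The four-value return order shown here is an assumption based on the argument names above, so check it against your installed version:

```python
import torch
from diffusers import LTXPipeline
from diffusers.utils import export_to_video

pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16).to("cuda")

# Encode the prompts once, then reuse the embeddings for several generations.
(
    prompt_embeds,
    prompt_attention_mask,
    negative_prompt_embeds,
    negative_prompt_attention_mask,
) = pipe.encode_prompt(
    prompt="A close-up of a woman smiling in warm evening light",
    negative_prompt="worst quality, inconsistent motion, blurry, jittery, distorted",
)

video = pipe(
    prompt_embeds=prompt_embeds,
    prompt_attention_mask=prompt_attention_mask,
    negative_prompt_embeds=negative_prompt_embeds,
    negative_prompt_attention_mask=negative_prompt_attention_mask,
    width=704,
    height=480,
    num_frames=161,
    num_inference_steps=50,
).frames[0]
export_to_video(video, "output_from_embeds.mp4", fps=24)
```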
LTXImageToVideoPipeline
class diffusers.LTXImageToVideoPipeline
( scheduler: FlowMatchEulerDiscreteScheduler, vae: AutoencoderKLLTXVideo, text_encoder: T5EncoderModel, tokenizer: T5TokenizerFast, transformer: LTXVideoTransformer3DModel )
Parameters
- transformer (LTXVideoTransformer3DModel) — Conditional Transformer architecture to denoise the encoded video latents.
- scheduler (FlowMatchEulerDiscreteScheduler) — A scheduler to be used in combination with `transformer` to denoise the encoded video latents.
- vae (AutoencoderKLLTXVideo) — Variational Auto-Encoder (VAE) model to encode and decode videos to and from latent representations.
- text_encoder (`T5EncoderModel`) — T5, specifically the google/t5-v1_1-xxl variant.
- tokenizer (`T5TokenizerFast`) — Tokenizer of class T5TokenizerFast.
Pipeline for image-to-video generation.
Reference: https://github.com/Lightricks/LTX-Video
__call__
( image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None, prompt: typing.Union[str, typing.List[str]] = None, negative_prompt: typing.Union[str, typing.List[str], NoneType] = None, height: int = 512, width: int = 704, num_frames: int = 161, frame_rate: int = 25, num_inference_steps: int = 50, timesteps: typing.List[int] = None, guidance_scale: float = 3, num_videos_per_prompt: typing.Optional[int] = 1, generator: typing.Union[torch.Generator, typing.List[torch.Generator], NoneType] = None, latents: typing.Optional[torch.Tensor] = None, prompt_embeds: typing.Optional[torch.Tensor] = None, prompt_attention_mask: typing.Optional[torch.Tensor] = None, negative_prompt_embeds: typing.Optional[torch.Tensor] = None, negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None, decode_timestep: typing.Union[float, typing.List[float]] = 0.0, decode_noise_scale: typing.Union[float, typing.List[float], NoneType] = None, output_type: typing.Optional[str] = 'pil', return_dict: bool = True, attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None, callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None, callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'], max_sequence_length: int = 128 ) → `~pipelines.ltx.LTXPipelineOutput` or `tuple`
Parameters
- image (`PipelineImageInput`) — The input image to condition the generation on. Must be an image, a list of images, or a `torch.Tensor`.
- prompt (`str` or `List[str]`, optional) — The prompt or prompts to guide the generation. If not defined, one has to pass `prompt_embeds` instead.
- height (`int`, defaults to `512`) — The height in pixels of the generated video.
- width (`int`, defaults to `704`) — The width in pixels of the generated video.
- num_frames (`int`, defaults to `161`) — The number of video frames to generate.
- num_inference_steps (`int`, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to higher quality at the expense of slower inference.
- timesteps (`List[int]`, optional) — Custom timesteps to use for the denoising process with schedulers which support a `timesteps` argument in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed will be used. Must be in descending order.
- guidance_scale (`float`, defaults to `3`) — Guidance scale as defined in Classifier-Free Diffusion Guidance. `guidance_scale` is defined as `w` of equation 2 of the Imagen paper. Guidance scale is enabled by setting `guidance_scale > 1`. A higher guidance scale encourages the model to generate videos that are closely linked to the text `prompt`, usually at the expense of lower quality.
- num_videos_per_prompt (`int`, optional, defaults to 1) — The number of videos to generate per prompt.
- generator (`torch.Generator` or `List[torch.Generator]`, optional) — One or a list of torch generator(s) to make generation deterministic.
- latents (`torch.Tensor`, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random `generator`.
- prompt_embeds (`torch.Tensor`, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the `prompt` input argument.
- prompt_attention_mask (`torch.Tensor`, optional) — Pre-generated attention mask for text embeddings.
- negative_prompt_embeds (`torch.FloatTensor`, optional) — Pre-generated negative text embeddings. If not provided, `negative_prompt_embeds` will be generated from the `negative_prompt` input argument.
- negative_prompt_attention_mask (`torch.FloatTensor`, optional) — Pre-generated attention mask for negative text embeddings.
- decode_timestep (`float`, defaults to `0.0`) — The timestep at which the generated video is decoded.
- decode_noise_scale (`float`, defaults to `None`) — The interpolation factor between random noise and denoised latents at the decode timestep.
- output_type (`str`, optional, defaults to `"pil"`) — The output format of the generated video. Choose between `PIL.Image.Image` or `np.array`.
- return_dict (`bool`, optional, defaults to `True`) — Whether or not to return a `~pipelines.ltx.LTXPipelineOutput` instead of a plain tuple.
- attention_kwargs (`dict`, optional) — A kwargs dictionary that, if specified, is passed along to the `AttentionProcessor` as defined under `self.processor` in diffusers.models.attention_processor.
- callback_on_step_end (`Callable`, optional) — A function that is called at the end of each denoising step during inference. The function is called with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by `callback_on_step_end_tensor_inputs`.
- callback_on_step_end_tensor_inputs (`List`, optional) — The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list will be passed as the `callback_kwargs` argument. You will only be able to include variables listed in the `._callback_tensor_inputs` attribute of your pipeline class.
- max_sequence_length (`int`, defaults to `128`) — Maximum sequence length to use with the `prompt`.
Returns
~pipelines.ltx.LTXPipelineOutput or tuple
If `return_dict` is `True`, `~pipelines.ltx.LTXPipelineOutput` is returned, otherwise a `tuple` is returned where the first element is a list with the generated frames.
Function invoked when calling the pipeline for generation.
Examples:
```python
>>> import torch
>>> from diffusers import LTXImageToVideoPipeline
>>> from diffusers.utils import export_to_video, load_image

>>> pipe = LTXImageToVideoPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")

>>> image = load_image(
...     "https://huggingface.co/datasets/a-r-r-o-w/tiny-meme-dataset-captioned/resolve/main/images/8.png"
... )
>>> prompt = "A young girl stands calmly in the foreground, looking directly at the camera, as a house fire rages in the background. Flames engulf the structure, with smoke billowing into the air. Firefighters in protective gear rush to the scene, a fire truck labeled '38' visible behind them. The girl's neutral expression contrasts sharply with the chaos of the fire, creating a poignant and emotionally charged scene."
>>> negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

>>> video = pipe(
...     image=image,
...     prompt=prompt,
...     negative_prompt=negative_prompt,
...     width=704,
...     height=480,
...     num_frames=161,
...     num_inference_steps=50,
... ).frames[0]
>>> export_to_video(video, "output.mp4", fps=24)
```
encode_prompt
( prompt: typing.Union[str, typing.List[str]], negative_prompt: typing.Union[str, typing.List[str], NoneType] = None, do_classifier_free_guidance: bool = True, num_videos_per_prompt: int = 1, prompt_embeds: typing.Optional[torch.Tensor] = None, negative_prompt_embeds: typing.Optional[torch.Tensor] = None, prompt_attention_mask: typing.Optional[torch.Tensor] = None, negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None, max_sequence_length: int = 128, device: typing.Optional[torch.device] = None, dtype: typing.Optional[torch.dtype] = None )
Parameters
- prompt (`str` or `List[str]`, optional) — prompt to be encoded
- negative_prompt (`str` or `List[str]`, optional) — The prompt or prompts not to guide the generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`).
- do_classifier_free_guidance (`bool`, optional, defaults to `True`) — Whether to use classifier-free guidance or not.
- num_videos_per_prompt (`int`, optional, defaults to 1) — Number of videos that should be generated per prompt.
- prompt_embeds (`torch.Tensor`, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the `prompt` input argument.
- negative_prompt_embeds (`torch.Tensor`, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, `negative_prompt_embeds` will be generated from the `negative_prompt` input argument.
- device (`torch.device`, optional) — torch device to place the resulting embeddings on
- dtype (`torch.dtype`, optional) — torch dtype
Encodes the prompt into text encoder hidden states.
LTXPipelineOutput
class diffusers.pipelines.ltx.pipeline_output.LTXPipelineOutput
( frames: Tensor )
Parameters
- frames (`torch.Tensor`, `np.ndarray`, or `List[List[PIL.Image.Image]]`) — List of video outputs. It can be a nested list of length `batch_size`, with each sub-list containing denoised PIL image sequences of length `num_frames`. It can also be a NumPy array or Torch tensor of shape `(batch_size, num_frames, channels, height, width)`.
Output class for LTX pipelines.
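A short sketch of how the output class is typically consumed (the prompt is only illustrative; with the default `output_type="pil"`, `frames` is a nested list of PIL images):

```python
import torch
from diffusers import LTXPipeline
from diffusers.utils import export_to_video

pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16).to("cuda")
prompt = "A paper boat drifting along a rain-soaked street gutter at dusk"

# return_dict=True (the default) returns an LTXPipelineOutput; index `frames` by prompt.
output = pipe(prompt=prompt, num_frames=161, num_inference_steps=50)
export_to_video(output.frames[0], "output.mp4", fps=24)

# return_dict=False returns a plain tuple whose first element is the same frames object.
(frames,) = pipe(prompt=prompt, num_frames=161, num_inference_steps=50, return_dict=False)
```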