How to generate video from text

#2
by Peterkkk - opened

The demo and example code show how to generate an image from text.
Does the current model and library interface support generating video from text? If so, how should I call the interface? Thanks!

Also, could you give a code example of calling Emu2-Gen with multiple GPUs?

This is my demo running Emu2-Gen on 2 GPUs:

from accelerate import infer_auto_device_map, load_checkpoint_and_dispatch

# Split the multimodal encoder across both GPUs, keeping transformer
# blocks intact so no single layer is sheared across devices
device_map = infer_auto_device_map(
    model,
    max_memory={0: '30GiB', 1: '80GiB'},
    no_split_module_classes=['Block', 'LlamaDecoderLayer'],
)
# Pin the LM head to GPU 0, alongside the diffusion components below
device_map["model.decoder.lm.lm_head"] = 0

model = load_checkpoint_and_dispatch(
    model,
    f'{path}/multimodal_encoder',
    device_map=device_map,
).eval()

# Note: the pipeline should not be built under init_empty_weights() --
# that would leave the UNet/VAE weights on the meta device, and the
# .to("cuda:0") calls below would fail
pipe = DiffusionPipeline.from_pretrained(
    path,
    custom_pipeline="pipeline_emu2_gen",
    torch_dtype=torch.bfloat16,
    use_safetensors=True,
    variant="bf16",
    multimodal_encoder=model,
    tokenizer=tokenizer,
)

pipe.safety_checker.to("cuda:0")
pipe.unet.to("cuda:0")
pipe.vae.to("cuda:0")

The rest is the same as the code provided by the author.
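For anyone unfamiliar with what infer_auto_device_map does here, the placement idea can be sketched in pure Python: walk the modules in order and put each one on the first device that still has room, falling back to CPU offload. This is only a simplified illustration of the greedy placement concept (the function name, layer names, and sizes below are made up), not accelerate's actual implementation:

```python
def greedy_device_map(layer_sizes, max_memory):
    """Assign each layer to the first device with enough free memory.

    layer_sizes: dict of layer name -> size (arbitrary units)
    max_memory: dict of device id -> capacity (same units)
    """
    remaining = dict(max_memory)
    device_map = {}
    for name, size in layer_sizes.items():
        for device, free in remaining.items():
            if size <= free:
                device_map[name] = device
                remaining[device] = free - size
                break
        else:
            # No GPU has room left: offload this layer to CPU
            device_map[name] = "cpu"
    return device_map

# Toy example: three equal blocks plus a small head, two uneven GPUs
layers = {"block.0": 30, "block.1": 30, "block.2": 30, "lm_head": 10}
print(greedy_device_map(layers, {0: 50, 1: 60}))
# → {'block.0': 0, 'block.1': 1, 'block.2': 1, 'lm_head': 0}
```

This also shows why the demo above overrides device_map["model.decoder.lm.lm_head"] by hand after the automatic pass: a greedy split has no idea that the LM head should sit on the same GPU as the UNet/VAE, so that constraint has to be applied manually.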
