Workflow : I2V & T2V Music Video Creator with multi-scenes and custom audio

#92

by RuneXX - opened Apr 12

Owner Apr 12

I2V & T2V Music Video Creator with Multi-Scenes and custom audio
On-the-fly music video creation with any input audio and images.
Creates a multi-scene video from your input images.

The example video is low resolution, so results can look even better if you run it at full HD.

Each scene can be timed correctly with a node for setting independent length for each scene.

For music creation: Ace Step XL (native workflow inside ComfyUI available). You can of course use any and all other music sources ;-)

For scene creation: Qwen Image Edit / FireRed Image Edit (native workflow inside ComfyUI available). With multi-scene lora, and multi-angles lora, creating the input images can be easy in ComfyUI.
Or you can use Nano Banana, Qwen Chat, ChatGPT etc. (For some music it might be more suitable with multiple singers, or illustration scenes etc. All up to you)

In the folder there are also two "helper" workflows: one for creating music with Ace Step XL, and one for creating new scenes from an input image with Qwen Image Edit. The Ace Step XL workflow has an optional LLM text-generation part, using Kijai's great text generate node, to help describe music style/elements and write lyrics. Gemma 4 will be supported soon, and it should work even better.

Inspired by @vrgamedevgirl84 and @MattHVisual, who have both done some great music-video-related generations ;-)

Feel free to play around with it. It's a bit of a work in progress, so changes might come ;-)
(I will also add a single-pass variant that can both generate a new scene and extend a scene, and can combine both in the same workflow)
https://huggingface.co/RuneXX/LTX-2.3-Workflows/tree/main/Music-Video-Creator

NB! This workflow might need a lot of RAM, so if you are on a weaker computer, it is probably better to create the video part by part and edit it together in a video editor, instead of using a workflow that generates multiple videos in a row ;-)

RuneXX

Owner Apr 12

edited Apr 12

From @MattHVisual
The above video is not made with this workflow, solely meant as a bit of inspiration to what can be possible ;-)

( you can read how he creates his LTX music video here: https://huggingface.co/RuneXX/LTX-2.3-Workflows/discussions/61 )

WanApp

Apr 13

edited Apr 13

Man, you are a legend. I use ONLY your workflows. I'm not as experienced as you, but I'm an old man (47) and I make great videos for my young daughter. I also teach it, and mostly do troubleshooting for normal people or big companies. Will try this one for sure.

I also dumb your workflows down for the common mortal (let's face it, not every friend we have can understand ComfyUI), so I make them in APP MODE and reduce the options, or make a dumbed-down new one for ease of use. CHEERS MATE, YOU ARE MY COMFY HERO. Learning so much by understanding your workflows. Where can I grab you a beer or a coffee?

I also mention and credit you for every APP MODE using your workflow as a base. https://civitai.com/user/WanApp

RuneXX

Owner Apr 13

edited Apr 13

I already saw your page earlier, and I'm following it ;-) Plus I already put a link to it on the front page of this repo, since app mode is probably a super user-friendly way to try Comfy.
I haven't used it much yet, since it's a new feature in Comfy, and I've been using the old way for quite a while and am used to it. Old habits die hard ;-)
But app mode is definitely interesting, and I will explore it more.

WanApp

Apr 14

Yeah, the big plus with APP Mode is that it lets you focus on the actual generation instead of having your mind fumble to tweak things again and again. The main step in making your awesome workflow easier to install for noobs was changing the models back to the ones in the OG LTX2.3 templates of ComfyUI Desktop (which most noobs should try first), so the user can download them directly. The power of Googling is fading with the noobs lol... (working senior tech support for 2 decades teaches you that). The most complex task left is updating the required custom nodes (which is already hard for someone new or not IT-friendly).
Basically I do with your workflow what I do at work: understand the GREAT MINDS and dumb it down for regular non-IT people. Cheers mate.

tornyak

Apr 21

lovely workflow but for me it is ruined by neon junk during last 0.5 sec of each segment. The neon junk is not produced by the first stage (that video looks good without it); it is added during the upscale phase at the very end...

RuneXX

Owner Apr 21

edited Apr 21

lovely workflow but for me it is ruined by neon junk during last 0.5 sec of each segment,

That is probably the upscaler; it was a bug in the v1.0 upscaler.
So try the fixed upscaler, version 1.1, which fixes the artifacts in the last frames:
https://huggingface.co/Lightricks/LTX-2.3/tree/main

See if that works better ;-)

tornyak

Apr 21

Yes, I figured that out soon after I posted here. Gemini kept telling me that it is a known bug in the upscaler I use, but then I remembered reading about a new version of the upscaler, released weeks ago for long videos, that I forgot to download (usually I download new stuff immediately). The new fixed upscaler works fine, no neon horrors at the end.
One of my findings with the wf is that using vocals only frequently breaks lip-sync in a segment, while with the full track lip-sync is perfect. I am thinking of modding the wf and adding a few more segments so I can pull off a 3-minute song with a good set of storyboard photos; I think it should be possible. This is pretty much the wf I was looking for as an alternative to Vrgamedevgirl's automation wf, which relies on Nanobanana (paywall).

This is the first test run with upscaler 1.1: 480p > DaVinci Resolve (added 10-frame crossfades between segments) > Topaz upscale 2x and film grain.

RuneXX

Owner Apr 21

edited Apr 21

That looks great ;-)

And yes, the workflow is modular, so you can easily extend it with more groups...
I might also add a looping one (where it's just one group, looped for as many iterations as you specify). I already have such a wf called "long video", but that's for a single shot (single scene).

If you do extend to more groups, you simply have to connect them the same way as the other groups (the output frames from the previous group).
Additionally, change the seed to a higher value than the last group's seed.

And lastly (the only "tricky" part), the length is hardcoded. In other words, if the length of group 4 is 10 seconds, this 10-second value is stored in the window_seconds_04 variable/param (look under the node that sets the length in the group). This param is then used for the duration at the load audio node.
You can simply remove the duration part of the initial audio though, especially if you plan to use the full track, or almost all of it
(look under the Trim Audio duration nodes and remove the calculated duration part)
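
As a rough sketch of the arithmetic described above (plain Python, not the workflow's actual nodes): the audio trim duration is the sum of the per-group lengths, so every group you add has to be included in that sum. Only `window_seconds_04` is a name from the workflow; `total_audio_duration` and the example values are hypothetical.

```python
from typing import Sequence

def total_audio_duration(window_seconds: Sequence[float]) -> float:
    """Sum the per-group lengths (in seconds) to get the audio duration to load."""
    return float(sum(window_seconds))

# Five groups; group 4 is 10 seconds, as in the example above (other values made up).
groups = [8.0, 6.0, 7.5, 10.0, 9.0]
print(total_audio_duration(groups))            # 40.5

# Adding a sixth group means its length must be added to the same sum.
print(total_audio_duration(groups + [12.0]))   # 52.5
```

Dropping the calculated duration, as suggested, removes the need to keep this sum up to date at the cost of decoding more audio than the video needs.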

Having said all this, I will see if I can find a way to make it even easier to copy-paste groups to extend the workflow. Maybe some other way to calculate the duration of the audio (or perhaps just drop that entirely; it's not strictly needed, it just saves a bit of time/RAM, and probably very little).

vrgamedevgirl84

Apr 23

FYI, I have two workflows, and one of them does not require Nano banana. There is also a z-image one or you can swap it out for any text to image model. Just so you know. thanks!

tornyak

Apr 23

me thinks I would sooner find 60% enriched Uranium in Iran than that workflow, or as chatgpt said "I know exactly what you’re talking about. That workflow does exist, but it’s a bit “underground” and not always packaged as a single clean download like people expect." :D

vrgamedevgirl84

Apr 23

LOL - my workflows are all in the workflows folder of my custom nodes. I made it soooo user friendly: you only have to load an audio file, update some text, and run the first workflow; the 2nd workflow then grabs the data from the first one, so you only have to run it. You don't even have to do anything in the workflow that creates the video. It's like so easy it's hard lol
I have a bunch of walkthrough videos, not sure if you watched them. If you ever want to give it another chance, I would be happy to walk you through it so you can see just how easy it is.

tornyak

Apr 23

I laughed so hard when chatgpt wrote that it cannot find the link and then started explaining literally like it is somewhere hidden on the dark net, I found it in the node directory!

JohnJohnSmith

Apr 24

edited Apr 24

Hi, first, thank you for the amazing wf! I've tried "LTX-2.3_-_I2V_T2V_Music-Video-Creator_multi-scene_custom_audio" and I found some strange behaviour. The ref image from the last scene (number 5) is used as the start reference, so the clip starts from the image linked to scene number 5! But I can see the right ref image from scene 1 flash once (for one frame only). I have no idea what is wrong on my side. In my clip project all 5 scenes have different images. I'm on ComfyUI 0.19. Of course, the configuration with only one scene works. Thanks in advance if you can give me a hint.

RuneXX

Owner Apr 24

Try downloading the workflow again and see if it still happens.
I also saw this in an earlier version; it seemed like ComfyUI went a bit too far with caching or optimization, so that it just used the last image input.
But if you are on the latest version of the workflow and it still happens, I'll find a fix ;-)

JohnJohnSmith

Apr 24

edited Apr 24

Thanks! The "LTX-2.3_-_I2V_T2V_Music-Video-Creator_multi-scene_custom_audio_Low_RAM.json" wf (aka the temporary-folder version) works as expected! All images are split correctly. Some non-critical exceptions occurred. 1) On Windows 10, LoadVideosFromFolder cannot find the relative path "output\MusicVideo...", so I used the full path. 2) Temporary scene files were doubled (a no-audio and an audio version) during final video compilation; I guess some setting in ComfyUI or some node produces these. 3) If a scene is bypassed, SimpleCalculatorKJ cannot calculate the total duration automatically and gives an error.
To repeat, these problems aren't critical and can be corrected easily by manual input. Big thanks for sharing a great wf collection!

RuneXX

Owner Apr 24

edited Apr 24

The double video thing is the Video Combine node. It saves the workflow as a PNG, then it gets confused and saves a silent video, and then the actual video (the node was made in the era when all videos were silent).
So for now, in that node's settings (ComfyUI settings, under that node), you can turn off "save png"; "save meta" I think you can turn off in the node itself (or in the settings).
That will stop it from saving doubles.

Actually, I forgot about that whole thing, since I turned that off long ago. It might mess up the "merge all saved files to one video" step at the very end. Will see if I can figure out a fix (using some naming convention).
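
One possible naming-convention filter, sketched in plain Python rather than as a ComfyUI node. It assumes the doubles look like `scene_00001.mp4` (silent) and `scene_00001-audio.mp4` (with sound); that `-audio` suffix is an assumption about how the files are named, so check your own temporary folder first. `pick_final_clips` is a hypothetical helper, not part of the workflow.

```python
from pathlib import PurePath
from typing import Dict, Iterable, List

def pick_final_clips(filenames: Iterable[str], audio_suffix: str = "-audio") -> List[str]:
    """Keep one file per clip, preferring the version with audio when both exist."""
    chosen: Dict[str, str] = {}
    for name in sorted(filenames):
        path = PurePath(name)
        if path.suffix.lower() != ".mp4":
            continue  # skip workflow PNGs and anything else that is not a clip
        stem = path.stem
        has_audio = stem.endswith(audio_suffix)
        key = stem[: -len(audio_suffix)] if has_audio else stem
        # The audio version always wins; a silent file is kept only as a fallback.
        if has_audio or key not in chosen:
            chosen[key] = name
    return [chosen[key] for key in sorted(chosen)]

files = [
    "scene_00002.mp4", "scene_00001-audio.mp4", "scene_00001.mp4",
    "scene_00001.png", "scene_00002-audio.mp4", "scene_00003.mp4",
]
print(pick_final_clips(files))
```

With the example list, each scene contributes exactly one file: the `-audio` version where it exists, otherwise the only clip available.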

And a bit related: the path.. hmm... maybe it should be an absolute path in that first "set folder" after all. I was thinking that ComfyUI would automagically use /output/.... and then the rest as the path. But maybe Win10 won't do that.
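
A minimal sketch of sidestepping the relative-path question by building an absolute path up front, outside the workflow. `comfy_root` and the folder names are placeholders, and `absolute_output_folder` is a hypothetical helper; how the folder node itself resolves paths is not confirmed here.

```python
from pathlib import PureWindowsPath

def absolute_output_folder(comfy_root: str, relative: str) -> str:
    """Join a relative folder onto the ComfyUI root and return it with forward slashes."""
    rel = PureWindowsPath(relative)
    if rel.is_absolute():
        return rel.as_posix()  # already absolute: only normalise the slashes
    return (PureWindowsPath(comfy_root) / rel).as_posix()

print(absolute_output_folder(r"C:\ComfyUI", r"output\MusicVideo"))
# C:/ComfyUI/output/MusicVideo
```

`PureWindowsPath` accepts both slash directions as input, so the same call works whether the folder was typed with `\` or `/`.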

And yes, the workflow is a bit "hard-coded" in that it uses all the groups, specifically to calculate the duration for the trim audio node at the audio input.
But this might not be strictly needed. It was there to avoid decoding, say, a 5-minute-long mp3 file when you only need 30-40 seconds for the LTX video.

I'll take a new look at the wf soon, see if it can be made a bit more "bulletproof" and flexible to how many groups used etc ;-)

JohnJohnSmith

Apr 24

edited Apr 24

The double video thing, is the Video Combine node. It saves the workflow as a PNG, then it gets confused and saves a silent video, and then the actual video (the node was made in the era when all videos were silent).
So for now, in that node settings (comfyUI settings, under that node), you can turn off save png, and turn off save meta i think you can do in the node itself (or the settings).
And that will stop it from saving doubles.
Actually forgot about that whole thing, since i turned that off long ago. That might mess up the "merge all saved files to one video" at the very end. Will see if i can figure out a fix (using some naming convention)

I found a "keep required intermediate files" option in VHS that enables the no-audio files. When it is switched off, the doubling problem is gone.

I was thinking that comfyUI would automagically use /output/...

Yes, and so the scene files were saved relative to the "output/" folder. But at the final build stage it seems to be a LoadVideosFromFolder KJ node problem. Maybe it uses the "input" folder))

I'll take a new look at the wf soon, see if it can be made a bit more "bulletproof" and flexible to how many groups used etc ;-)

Excellent! I guess music clip generation is in popular demand in the AI video area, and this wf will be very useful.

tornyak

Apr 28

Coco - Old Money Queen 🤭
pipeline:
WF 720p (70 minutes) > D. Resolve crossdissolve added between segments > Topaz AI Video upscale 1080p and refining with Iris2 model.

RuneXX

Owner Apr 28

That looks amazing ;-) really nice ;-)
well done

tornyak

Apr 30

This took some effort to make as originally intended. I also modified the workflow and added two more segments, and used the Licon VBVR lora at 0.5 strength. It helps a bit with very complex prompting, but at higher strength it tends to change faces or create blur around faces in lip-sync videos.

RuneXX

Owner Apr 30

wow ... that looks impressive ;-) artistic and all
And the 3 characters and extreme close up, realistic human with animated background... i bet that took a few runs to get right ;-)
But the end result looks amazing ;-)

tornyak

Apr 30

Yeah, getting the prompting right took a whole day. The key was to get immaculate starting images with clear separation of character from background. Also, klein9b was a better choice than Qwen or Z image, just because the images looked more natural, or to be more accurate, less AI-polished and more suitable for what I had in mind. But it meant more seed rerolls because of multiple fingers and legs.. 😁 It's annoying: it generates a perfect image but the person has 3 legs, and even the anatomy slider lora can't always fix it...

tornyak

Apr 30

Could prompt relay encode with timeline (single image ) be implemented in this music video creator wf?

RuneXX

Owner May 1

edited May 1

Could prompt relay encode with timeline (single image ) be implemented in this music video creator wf?

Yes, it could do a lot of the same, and I already uploaded a workflow with custom audio.
It will probably take a few more runs, since ideally each Prompt Relay run is 20-30 seconds or so at most (before LTX starts to degrade).
But with a few runs, each continuing from the last, you can make a nice little music video.

(already made a little music video test run, worked well. Will do a little demo one for inspiration)

The uploaded Prompt Relay with custom audio just has one image input (but you can prompt for scene changes etc).
But i'll upload a multi-image variant as well

tornyak

May 1

So the limit of this wf for me is a 115-second 720p video; beyond that it reports OOMs and crashes during batch combine. Basically, for longer videos at 720p or 1080p there are length limitations. I have come to the point where I would need audio slices to be generated individually for each segment, and then the segment videos saved and merged into one final file. I played a bit with the low res version of the wf; I think it does something like that. I need to revisit it and try to use it for 720p gen.

aurobet

May 1

Yes, I've been able to get very good results by using the latest workflow (with the temporary folder option) and disabling 'save metadata' on the video combine nodes. Thanks very much RuneXX, this really is a groundbreaking step forward!

RuneXX

Owner May 1

edited May 1

Happy to hear ;-)

'save metadata' on the video combine nodes.

Ah yes, I should add a note about that one. I keep forgetting it since I turned off that feature long, long ago.

Edit: Ah, of course Kijai has a node ;-) Will add that to the workflow when updating. It basically saves the video during the VAE decode, frees the intermediate image memory, and writes no metadata. (will test run it)

RuneXX

Owner May 3

edited May 3

(just a low res 720p example video)

Updated "low ram" save video per segment workflow :

More accurate frames to audio: it now uses the actual calculated frames to set the audio duration instead of the window seconds input (since it follows the actual frames, it shouldn't matter whether it's round or ceil)
Better fps condition inside each group
Better saving of intermediate files using the native "Save Video" node (instead of VHS combine, which can end up saving an extra silent video and a PNG for workflow metadata)
Changed slash direction for the directory path to match that of the node (should hopefully fix some OSes not finding the path for the full video)
"Window seconds" input removed, to make extending groups easier (the trim audio input node at the very start is set to 2 minutes, but you can set it higher or lower to match the workflow duration)
(the trim audio node at the very start is also useful to continue the song with a 2nd, 3rd etc generation, by using the audio start from to match where the previous video ended)
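
To illustrate the "frames to audio" point and the audio start offset, here is a small sketch in plain Python (not the workflow's nodes). It assumes 24 fps and made-up segment lengths; `frames_for` and `audio_plan` are hypothetical names. The idea is that each segment's audio duration is derived from its actual frame count, so the choice of round or ceil only changes the frame count, never the sync between frames and audio.

```python
import math
from typing import List, Sequence, Tuple

def frames_for(seconds: float, fps: float, use_ceil: bool = False) -> int:
    """Frame count for a requested length, using either round or ceil."""
    raw = seconds * fps
    return math.ceil(raw) if use_ceil else round(raw)

def audio_plan(segment_seconds: Sequence[float], fps: float,
               use_ceil: bool = False, start_at: float = 0.0) -> List[Tuple[float, float]]:
    """Return (audio_start, audio_duration) in seconds for each segment."""
    plan = []
    cursor = start_at
    for seconds in segment_seconds:
        # Duration follows the frames actually generated, not the requested seconds.
        duration = frames_for(seconds, fps, use_ceil) / fps
        plan.append((cursor, duration))
        cursor += duration
    return plan

first_run = audio_plan([5.3, 8.0], fps=24)
# Continue the song in a second generation from where the first video ended.
end_of_first = first_run[-1][0] + first_run[-1][1]
second_run = audio_plan([6.0], fps=24, start_at=end_of_first)
print(first_run, second_run)
```

The second call mirrors the "audio start from" usage described above: the next generation starts its audio at the end time of the previous video.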

Should hopefully work better ;-)

If it breaks at the very end where it combines all the saved intermediate video files, I'll upload a manual combine video "helper" workflow so one can easily combine the intermediate video files manually (the intermediate video files are saved in a temporary folder even if the auto-combine node at the end fails).
This helper workflow also has an optional frame interpolation and video upscale part.
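
Until such a helper is available, the saved segments could also be joined outside ComfyUI. The sketch below only builds the inputs for ffmpeg's concat demuxer (a list file and a command line) and does not run anything. It assumes ffmpeg is installed and that all segments share the same codec, resolution, and frame rate; the file names are placeholders and both functions are hypothetical helpers.

```python
from typing import List, Sequence

def concat_list_text(clips: Sequence[str]) -> str:
    """Build the list file for ffmpeg's concat demuxer, one clip per line."""
    lines = []
    for clip in clips:
        escaped = clip.replace("'", "'\\''")  # escape single quotes inside the path
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"

def concat_command(list_file: str, output: str) -> List[str]:
    """Stream-copy concat: no re-encode, so the segments must be encoded alike."""
    return ["ffmpeg", "-f", "concat", "-safe", "0",
            "-i", list_file, "-c", "copy", output]

print(concat_list_text(["segment_01.mp4", "segment_02.mp4"]), end="")
print(" ".join(concat_command("segments.txt", "music_video.mp4")))
```

Save the first output as `segments.txt` next to the clips, then run the printed command. Because it copies streams instead of re-encoding, it needs very little RAM compared with merging frames inside the workflow.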

RuneXX

Owner May 4

edited May 4

Updated regular "music video" workflow
(this variant of the workflow does not save intermediate files, but rather generates it all in one go...)

Removed previous batch image nodes with overlap and forward (these are more typical for overlapping frames, and this workflow does not have that)
Each group's output frames stored in set variables, and merged at end with image batch multi for the full video (instead of accumulating frames and forward to next groups). Might be a little less RAM heavy.
... plus all the small things from above (more accurate audio duration, better fps conditioning etc)
A bit simpler alternative than the "save each segment" wf.
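
The "store per group, merge once at the end" idea can be pictured with a small plain-Python sketch. Frames are stand-in strings here and `merge_at_end` is a hypothetical helper; the actual workflow does this with set variables and an image batch node, and any RAM benefit is the expectation stated above, not something measured here.

```python
from typing import Dict, List

def merge_at_end(group_frames: Dict[int, List[str]]) -> List[str]:
    """Concatenate each group's stored frames once, in group order."""
    merged: List[str] = []
    for group_id in sorted(group_frames):
        merged.extend(group_frames[group_id])
    return merged

# Each group keeps its own output; nothing is forwarded to the next group.
stored = {
    1: ["g1_f0", "g1_f1"],
    2: ["g2_f0"],
    3: ["g3_f0", "g3_f1"],
}
print(merge_at_end(stored))
```

Compared with forwarding an ever-growing batch from group to group, each group here only ever holds its own frames until the single merge at the end.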

(not really sure both workflows are needed, since the save intermediate files wf might be the better option for many/most. But since it used to be 2 variants.. They are quite similar though)

Plus made an excuse for a little rock video when testing ;-) 🤘🎸🎶
