---
title: Mini-dalle Video-clip Maker
emoji: 馃惃
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: 4.19.2
app_file: app.py
pinned: false
license: mit
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# DALL·E Video Clip Maker

This python project uses DALL·E Mini, through Replicate's API, to generate a photo-montage video from a song. Given a YouTube url, the program extracts the audio and transcript of the video and uses the lyrics in the transcript as text prompts for DALL·E Mini.

## Usage

The project can be run with a single command:

`python3 main.py --token <your Replicate API token>`

An example output for "Here Comes the Sun" by The Beatles:

Note that the project only works with YouTube videos that have a transcription.

# Blog Post

## 1. Interacting with the Replicate API to run DALL·E Mini

[Replicate](https://replicate.com) is a service for running open-source machine learning models in the cloud. The Replicate API lets you use any Replicate model inside a python script, which is the core of this project. All of the machinery is wrapped in the `DalleImageGenerator` class in `dall_e.py`, which handles all the interaction with Replicate. Let's have a look at the code it runs in order to generate images from text.

To create an API object and specify the model we'd like to use, we first need an API token, which is available [here](https://replicate.com/docs/api) after subscribing to Replicate.

```
import os
import replicate

os.environ["REPLICATE_API_TOKEN"] = "<your API token>"
dalle = replicate.models.get("kuprel/min-dalle")
# prompt, grid_size and factor are the text prompt, the grid dimension
# and the supercondition (text-adherence) factor
urls = dalle.predict(text=prompt, grid_size=grid_size, log2_supercondition_factor=factor)
```

In this case, the model returns a list of urls to all the intermediate images generated by DALL·E Mini. We want the final output, so we call

```
get_image(list(urls)[-1])
```

to download the last one using python's urllib library.

## 2. Downloading content from YouTube

All the code in this section appears in **download_from_youtube.py**.

### Downloading the transcript

There is a very cool python package called **YouTubeTranscriptApi** and, as its name implies, it's going to be very useful. The **YouTubeTranscriptApi.get_transcript** function needs a YouTube video ID, so we'll first extract it from the video url using urllib. The function **get_video_id** in the file does exactly that, and the main lines of code to get the transcript are:

```
id = get_video_id(url)
transcript = YouTubeTranscriptApi.get_transcript(id, languages=['en'])
```

Each entry of the returned transcript is a python dictionary with the keys 'text', 'start' and 'duration', giving a line of the lyrics, its starting time, and its duration.

### Downloading the audio

I used a library called **youtube_dl** that can download an .mp3 file with the sound of a YouTube video. The usage is fairly simple and is wrapped in the download_mp3 function in the file:

```
import youtube_dl

ydl_opts = {
    'outtmpl': output_path,  # where the downloaded audio is written
    'format': 'bestaudio/best',
    'postprocessors': [{
        'key': 'FFmpegExtractAudio',
        'preferredcodec': 'mp3',
        'preferredquality': '192',
    }],
}
with youtube_dl.YoutubeDL(ydl_opts) as ydl:
    ydl.download([url])
```

## 3. Making a video clip

The rest of the code is conceptually simple: using the transcript lines as prompts for DALL·E Mini, we get images and combine them with the .mp3 into a video clip. In practice, there are a few things to pay attention to in order to make the timing of the lyrics, sound and visuals play together.
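YouTubeTranscriptApi returns a list of dictionaries (section 2), while the loop below unpacks (text, start, end) tuples, so a small conversion step is needed in between. Here is a minimal sketch of what that step could look like; this helper is hypothetical and main.py may do the conversion differently (e.g. inline):

```
# Hypothetical helper: turn raw YouTubeTranscriptApi entries
# ({'text': ..., 'start': ..., 'duration': ...}) into the
# (text, start, end) tuples that the loop below iterates over.
def to_timed_lines(raw_transcript):
    lines = []
    for entry in raw_transcript:
        start = entry['start']
        end = start + entry['duration']
        lines.append((entry['text'], start, end))
    return lines
```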
Let's go through the code. We loop over the transcript we previously downloaded, one (text, start, end) tuple per lyric line:

```
for (text, start, end) in transcript:
```

Given the duration of the current line (end - start) and an input argument args.sec_per_img, we calculate how many images we need. DALL·E Mini generates a square grid of images, so if we want N images, we need to tell it to generate a grid of side $\sqrt{N}$.
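The helper **get_sqrt** isn't shown in this post; presumably it rounds the square root up so the grid is large enough to hold the requested number of images. A minimal sketch under that assumption (the version in the repo may round differently):

```
import math

# Assumed behaviour of get_sqrt: round the square root up so that
# grid_size * grid_size >= the number of images we asked for.
def get_sqrt(n):
    return int(math.ceil(math.sqrt(n)))
```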
The calculation is:

```
grid_size = max(get_sqrt(duration / args.sec_per_img), 1)
```

Now we ask Replicate for images from DALL·E Mini:

```
images = dalle.generate_images(text, grid_size, text_adherence=3)
```

If we want to generate a movie clip at a specific fps (higher fps means more accurate timing, because we can switch images more frequently), we usually need to write each image for multiple frames. The calculation I did is:

```
frames_per_image = int(duration * args.fps) // len(images)
```

Now we use the **opencv** package to write the lyrics as subtitles on each frame:

```
frame = cv2.cvtColor(images[j], cv2.COLOR_RGBA2BGR)  # convert to OpenCV's BGR layout
frame = put_subtitles_on_frame(frame, text, resize_factor)  # overlay the lyrics
frames.append(frame)
```

where **put_subtitles_on_frame** is a function in utils.py that makes use of the **cv2.putText** function. Finally, we can write all the aggregated frames into a file:

```
video = cv2.VideoWriter(vid_path, 0, args.fps, (img_dim, img_dim))
for i, frame in enumerate(frames):
    video.write(frame)
cv2.destroyAllWindows()
video.release()
```

The code itself is in the **get_frames** function in **main.py** and is a little more elaborate. It also fills the parts of the song where there are no lyrics with images prompted by the last sentence or the song's name.

## 4. Sound and video mixing

Now that we have a video, we only need to mix it with the downloaded .mp3 file. We'll use FFmpeg for this, with shell commands executed from python. The first of the two commands below cuts the mp3 file to fit the length of the generated video, for cases where the lyrics don't cover the whole song. The second command mixes the two into a new file containing both video and song (a small helper wrapping these two calls is sketched after the TODO list below):

```
# Trim the mp3 to the length of the generated video
os.system(f"ffmpeg -ss 00:00:00 -t {video_duration} -i '{mp3_path}' -map 0:a -acodec libmp3lame '{f'data/{args.song_name}/tmp.mp3'}'")
# Mux the silent video with the trimmed audio track
os.system(f"ffmpeg -i '{vid_path}' -i '{f'data/{args.song_name}/tmp.mp3'}' -map 0 -map 1:a -c:v copy -shortest '{final_vid_path}'")
```

# TODO

- [ ] Fix subtitles no-whitespace problems
- [ ] Allow working on raw .mp3 and .srt files instead of urls only
- [ ] Support automatically generated YouTube transcriptions
- [ ] Better timing of subtitles and sound
- [ ] Find a way to upload videos without copyright infringement
- [ ] Use other text-to-image models from Replicate
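For reference, here is the helper mentioned in section 4: a sketch that wraps the two FFmpeg commands in a single Python function, using subprocess instead of os.system. The function and path names are illustrative, not the ones used in main.py, and ffmpeg is assumed to be on the PATH:

```
import subprocess

def mix_audio_and_video(mp3_path, vid_path, final_vid_path, video_duration, tmp_mp3_path):
    # Trim the mp3 so it is no longer than the generated video
    subprocess.run([
        "ffmpeg", "-ss", "00:00:00", "-t", str(video_duration),
        "-i", mp3_path, "-map", "0:a", "-acodec", "libmp3lame", tmp_mp3_path,
    ], check=True)
    # Mux the silent video with the trimmed audio track
    subprocess.run([
        "ffmpeg", "-i", vid_path, "-i", tmp_mp3_path,
        "-map", "0", "-map", "1:a", "-c:v", "copy", "-shortest", final_vid_path,
    ], check=True)
```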