---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# DALL·E Video Clip Maker

This Python project uses DALL·E Mini, through Replicate's API, to generate a photo-montage video from a song.

Given a YouTube URL, the program extracts the audio and transcript of the video and uses the lyrics in the transcript as text prompts for DALL·E Mini.

## Usage

The project is run with a single command:

`python3 main.py <youtube url> --token <your replicate API token>`

Example output for "Here Comes the Sun" by The Beatles:

<img src="misc/frame-432.png" width="250"/> <img src="misc/frame-177.png" width="250"/>
<img src="misc/frame-316.png" width="250"/> <img src="misc/frame-633.png" width="250"/>
<img src="misc/frame-1724.png" width="250"/> <img src="misc/frame-1328.png" width="250"/>

Note that the project only works with YouTube videos that have a transcription.
36
+ # Blog Post
37
+
38
+ ## 1. Interacting with the Replicate API to run DALL·E Mini
39
+
40
+ [Replicate](https://replicate.com) is a service to run open-source machine learning models from the cloud. The Replicate API enables you to use all Replicate models inside a python script, which is the core of this project.
41
+
42
+ All of the machinery is wrapped in the `DalleImageGenerator` class in `dall_e.py`, which does all the interaction with Replicate.
43
+
44
+ Let's have a look at the code it runs in order to generate images from text.
45
+
46
+ In order to create an API object and specify the model we'd like to use, we first need an API token
47
+ which is available [here](https://replicate.com/docs/api) after subscribing to Replicate.

```
import os
import replicate

os.environ["REPLICATE_API_TOKEN"] = "<your Replicate API token>"

# Get a handle to the DALL·E Mini model hosted on Replicate
dalle = replicate.models.get("kuprel/min-dalle")

urls = dalle.predict(
    text=<your prompt>,
    grid_size=<how many images to generate per side of the grid>,
    log2_supercondition_factor=<a parameter controlling the output's relevance to the text>,
)
```

In this case, the model returns a list of URLs to all the intermediate images generated by DALL·E Mini.

We want the final output, so we call

```
get_image(list(urls)[-1])
```

to download the last one using Python's urllib library.
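
**get_image** itself isn't reproduced in the post. Here is a minimal sketch of what it could look like, assuming it fetches the file with urllib and returns the image as an RGBA array (which would match the cv2.COLOR_RGBA2BGR conversion used later on):

```
import io
import urllib.request

import numpy as np
from PIL import Image


def get_image(url):
    # Hypothetical sketch: fetch the image bytes with urllib
    # and decode them with PIL into an RGBA array
    with urllib.request.urlopen(url) as response:
        data = response.read()
    return np.array(Image.open(io.BytesIO(data)).convert("RGBA"))
```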

## 2. Downloading content from YouTube

All the code in this section appears in **download_from_youtube.py**.

### Downloading the transcript

There is a very cool Python package called **YouTubeTranscriptApi** and, as its name implies, it's going to be very useful.

The **YouTubeTranscriptApi.get_transcript** function needs a YouTube video ID, so we'll first extract it from the video URL using urllib. The function **get_video_id** in the file does exactly that; a sketch of such a helper is shown below.
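
This is only a minimal sketch, assuming standard `watch?v=` URLs; the real **get_video_id** may handle more URL formats:

```
from urllib.parse import parse_qs, urlparse


def get_video_id(url):
    # e.g. "https://www.youtube.com/watch?v=abc123" -> "abc123"
    return parse_qs(urlparse(url).query)["v"][0]
```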

The main lines of code to get the transcript are:

```
from youtube_transcript_api import YouTubeTranscriptApi

video_id = get_video_id(url)
transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=['en'])
```

`transcript` is a list of Python dictionaries with keys 'text', 'start', and 'duration', giving each line of the lyrics, its starting time, and its duration.
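
For example, a single entry might look like this (values are hypothetical):

```
{'text': 'Here comes the sun', 'start': 12.3, 'duration': 4.1}
```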

### Downloading the audio

I used a library called **youtube_dl** that can download an .mp3 file with the sound of a YouTube video.

The usage is fairly simple and is wrapped in the **download_mp3** function in the file:

```
import youtube_dl

ydl_opts = {
    'outtmpl': <specify output file path>,
    'format': 'bestaudio/best',
    # Extract the audio track and re-encode it to .mp3 with ffmpeg
    'postprocessors': [{
        'key': 'FFmpegExtractAudio',
        'preferredcodec': 'mp3',
        'preferredquality': '192',
    }],
}
with youtube_dl.YoutubeDL(ydl_opts) as ydl:
    ydl.download([url])
```
103
+
104
+ ## 3. Making a video clip
105
+ The rest of the code is conceptually simple. Using the transcript lines as prompts to DALL·E Mini, we get images and combine
106
+ them with the .mp3 to video clip.
107
+
108
+ In practice, there are some things to pay attention to in order to make the timing of the lyrics sound and visuals play together.
109
+
110
+ Let's go through the code:
111
+
112
+ We loop over the transcript dictionary we previously downloaded:
113
+
114
+ ```
115
+ for (text, start, end) in transcript:
116
+ ```
117
+
118
+ Given the duration of the current and an input argument args.sec_per_img we calculate how many images wee need.
119
+ Also, DALL·E Mini generates a square grid of images, so if we want N images, we need to tell it to generate a grid of
120
+ dimension <pre xml:lang="latex">\sqrt{N}</pre>. The calculation is:
121
+
122
+ ```
123
+ grid_size = max(get_sqrt(duration / args.sec_per_img), 1)
124
+ ```
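
**get_sqrt** isn't shown in the post; presumably it rounds the square root up to a whole grid dimension, something like:

```
import math


def get_sqrt(n):
    # Hypothetical: the smallest integer d such that d * d >= n
    return math.ceil(math.sqrt(n))
```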

Now we ask Replicate for images from DALL·E Mini:

```
images = dalle.generate_images(text, grid_size, text_adherence=3)
```
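
The **generate_images** wrapper itself isn't reproduced in the post. A hypothetical sketch of how it might tie together the predict call and the get_image helper from section 1 (the actual class in dall_e.py may differ, especially in how it splits the grid into individual images):

```
import numpy as np
import replicate


class DalleImageGenerator:
    def __init__(self):
        self.dalle = replicate.models.get("kuprel/min-dalle")

    def generate_images(self, text, grid_size, text_adherence):
        urls = self.dalle.predict(
            text=text,
            grid_size=grid_size,
            log2_supercondition_factor=text_adherence,
        )
        # The last URL points to the final grid_size x grid_size image grid
        grid = get_image(list(urls)[-1])
        # Split the grid into a flat list of individual images
        rows = np.split(grid, grid_size, axis=0)
        return [img for row in rows for img in np.split(row, grid_size, axis=1)]
```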

If we want to generate a movie clip at a specific fps (a higher fps means more accurate timing, because we can change the image more frequently), we usually need to write each image for multiple frames.
132
+
133
+ The calculation I did is:
134
+ ```
135
+ frames_per_image = int(duration * args.fps) // len(images)
136
+ ```
137
+
138
+ Now, we use **opencv** package to write the lyrics as subtitles on the frame
139
+ ```
140
+ frame = cv2.cvtColor(images[j], cv2.COLOR_RGBA2BGR)
141
+ frame = put_subtitles_on_frame(frame, text, resize_factor)
142
+ frames.append(frame)
143
+ ```
144
+ Where **put_subtitles_on_frame** is a function in utils.py that makes use of the **cv2.putText** function
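
A hypothetical sketch of such a function (the actual implementation in utils.py may differ):

```
import cv2


def put_subtitles_on_frame(frame, text, resize_factor):
    # Hypothetical sketch: draw the lyric line near the bottom of the frame
    height = frame.shape[0]
    cv2.putText(
        frame,
        text,
        (10, height - 20),        # bottom-left corner of the text
        cv2.FONT_HERSHEY_SIMPLEX,
        0.5 * resize_factor,      # font scale
        (255, 255, 255),          # white text (BGR)
        1,                        # thickness
        cv2.LINE_AA,
    )
    return frame
```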
145
+
146
+ Finally, we can write all the aggregated frames into a file:
147
+ ```
148
+ video = cv2.VideoWriter(vid_path, 0, args.fps, (img_dim , img_dim))
149
+ for i, frame in enumerate(frames):
150
+ video.write(frame)
151
+ cv2.destroyAllWindows()
152
+ video.release()
153
+ ```
154
+
155
+ The code itself is in the **get_frames** function in **main.py** and is a little bit more elaborated. It also fills the
156
+ gaps parts of the song where there are no lyrics with images prompted by the last sentence or the song's name.
157
+

## 4. Sound and video mixing

Now that we have a video, we only need to mix it with the downloaded .mp3 file.

We'll use FFmpeg for this, with shell commands executed from Python.

The first of the two commands below cuts the .mp3 file to fit the length of the generated video in cases where the lyrics don't cover the whole song. The second command mixes the two into a new file with video and sound:
166
+
167
+ ```
168
+ os.system(f"ffmpeg -ss 00:00:00 -t {video_duration} -i '{mp3_path}' -map 0:a -acodec libmp3lame '{f'data/{args.song_name}/tmp.mp3'}'")
169
+ os.system(f"ffmpeg -i '{vid_path}' -i '{f'data/{args.song_name}/tmp.mp3'}' -map 0 -map 1:a -c:v copy -shortest '{final_vid_path}'")
170
+ ```
171
+
172
+ # TODO
173
+ - [ ] Fix subtitles no whitespace problems
174
+ - [ ] Allow working on raw .mp3 and .srt files instead of urls only
175
+ - [ ] Support automatic generated youtube transcriptions
176
+ - [ ] Better timing of subtitles and sound
177
+ - [ ] Find way to upload video without copyrights infringement
178
+ - [ ] Use other text to image models from Replicate