llinahosna
commited on
Update README.md
README.md
CHANGED
@@ -11,3 +11,168 @@ license: mit
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# DALL·E Video Clip Maker

This Python project uses DALL·E Mini, through Replicate's API, to generate a photo-montage video
from a song.

Given a YouTube URL, the program extracts the audio and transcript of the video and uses the lyrics
in the transcript as text prompts for DALL·E Mini.

## Usage

The project can be run with a single command:

`python3 main.py <youtube url> --token <your replicate API token>`

Example output frames for the video of "Here Comes the Sun" by The Beatles:

<img src="misc/frame-432.png" width="250"/> <img src="misc/frame-177.png" width="250"/>
<img src="misc/frame-316.png" width="250"/> <img src="misc/frame-633.png" width="250"/>
<img src="misc/frame-1724.png" width="250"/> <img src="misc/frame-1328.png" width="250"/>

Note that the project only works with YouTube videos that have a transcript.

# Blog Post

## 1. Interacting with the Replicate API to run DALL·E Mini

[Replicate](https://replicate.com) is a service for running open-source machine learning models in the cloud. The Replicate API lets you use any Replicate model from a Python script, which is the core of this project.

All of the machinery is wrapped in the `DalleImageGenerator` class in `dall_e.py`, which handles all the interaction with Replicate.

Let's have a look at the code it runs to generate images from text.

To create an API object and specify the model we'd like to use, we first need an API token,
which is available [here](https://replicate.com/docs/api) after subscribing to Replicate.

```
import os
import replicate
# Replace the placeholder strings and the example values below with your own.
os.environ["REPLICATE_API_TOKEN"] = "<your API access token from Replicate>"
dalle = replicate.models.get("kuprel/min-dalle")
urls = dalle.predict(
    text="<your prompt>",
    grid_size=3,  # how many images to generate
    log2_supercondition_factor=3,  # controls the output's relevance to the text
)
```

In this case, the model returns a list of URLs to all the intermediate images generated by DALL·E Mini.

We want the final output, so we call
```
get_image(list(urls)[-1])
```
to download the last one using Python's urllib library.
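
As a rough illustration of that download step, here is a minimal sketch of fetching the raw bytes behind a URL with `urllib.request`. The name `download_bytes` and its `timeout` parameter are assumptions made for this example. The project's own `get_image` helper is not shown in this post and presumably also decodes the bytes into an image.

```python
import urllib.request


def download_bytes(url: str, timeout: float = 30.0) -> bytes:
    """Return the raw content served at `url`.

    Hypothetical helper for illustration only, not the project's get_image.
    """
    # urlopen returns a file-like response; the context manager closes it.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()
```

With a list of URLs like the one returned above, a call such as `download_bytes(list(urls)[-1])` would fetch only the final image.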

## 2. Downloading content from YouTube

All the code in this section appears in **download_from_youtube.py**.

### Downloading the transcript

There is a very cool Python package called **YouTubeTranscriptApi** and, as its name implies, it's going to be very useful.

The **YouTubeTranscriptApi.get_transcript** function needs a YouTube video ID, so we'll first extract it from the video URL using urllib.
The function **get_video_id** in the file does exactly that.

The main lines of code to get the transcript are:

```
from youtube_transcript_api import YouTubeTranscriptApi

video_id = get_video_id(url)
transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=['en'])
```
`transcript` is a list of Python dictionaries with the entries 'text', 'start' and 'duration',
giving the text of each line of the lyrics, its starting time and its duration.
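
The ID extraction mentioned above can be sketched with `urllib.parse`. This is only an illustration under stated assumptions: the name `extract_video_id` is made up for this example (the project's `get_video_id` may work differently), and only the common `youtube.com/watch?v=...` and `youtu.be/...` URL shapes are handled.

```python
from urllib.parse import parse_qs, urlparse


def extract_video_id(url: str) -> str:
    """Pull the video ID out of a YouTube URL (hypothetical helper)."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the ID as the path: https://youtu.be/<id>
        video_id = parsed.path.lstrip("/")
    elif host == "youtube.com" or host.endswith(".youtube.com"):
        # Watch links carry the ID in the "v" query parameter.
        video_id = parse_qs(parsed.query).get("v", [""])[0]
    else:
        video_id = ""
    if not video_id:
        raise ValueError(f"Could not find a video ID in {url!r}")
    return video_id
```

Raising an error for unrecognised URLs keeps a bad link from reaching the transcript request with an empty ID.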

### Downloading the audio

I used a library called **youtube_dl** that can download the audio of a YouTube video as an .mp3 file.

The usage is fairly simple and is wrapped in the **download_mp3** function in the file:
```
import youtube_dl

ydl_opts = {
    'outtmpl': '<output file path>',  # replace with your output path
    'format': 'bestaudio/best',
    # Convert the downloaded audio to a 192 kbps .mp3 with FFmpeg
    'postprocessors': [{
        'key': 'FFmpegExtractAudio',
        'preferredcodec': 'mp3',
        'preferredquality': '192',
    }],
}
with youtube_dl.YoutubeDL(ydl_opts) as ydl:
    ydl.download([url])
```

## 3. Making a video clip

The rest of the code is conceptually simple. Using the transcript lines as prompts to DALL·E Mini, we get images and combine
them with the .mp3 into a video clip.

In practice, there are some things to pay attention to in order to keep the lyrics, sound and visuals in sync.

Let's go through the code:

We loop over the transcript lines we previously downloaded:

```
for line in transcript:
    text, start, duration = line['text'], line['start'], line['duration']
```

Given the duration of the current line and the input argument `args.sec_per_img`, we calculate how many images we need.
Also, DALL·E Mini generates a square grid of images, so if we want N images, we need to tell it to generate a grid of
dimension $\sqrt{N}$. The calculation is:

```
grid_size = max(get_sqrt(duration / args.sec_per_img), 1)
```
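
To make the arithmetic concrete, here is a small sketch of the same calculation. The helper name `grid_size_for` is hypothetical, and rounding the square root up is an assumption made here so that the grid holds at least the required number of images; the project's `get_sqrt` is not shown and may round differently.

```python
import math


def grid_size_for(duration: float, sec_per_img: float) -> int:
    """Side length of a square grid holding enough images for one lyric line."""
    n_images = duration / sec_per_img  # images needed to cover the line
    # Assumption: round up so that grid_size ** 2 >= n_images; never go below 1.
    return max(math.ceil(math.sqrt(n_images)), 1)


print(grid_size_for(8, 2))    # 4 images needed  -> 2 x 2 grid
print(grid_size_for(10, 2))   # 5 images needed  -> 3 x 3 grid
print(grid_size_for(0.5, 2))  # very short line  -> 1 x 1 grid
```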

Now we ask Replicate for images from DALL·E Mini:
```
images = dalle.generate_images(text, grid_size, text_adherence=3)
```
To generate a movie clip at a specific fps (a higher fps means more accurate timing, because we can
change images more frequently), we usually need to write each image for multiple frames.

The calculation I did is:
```
frames_per_image = int(duration * args.fps) // len(images)
```
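
A quick worked example of this formula shows why timing needs care: integer division can leave a few frames unassigned. The helper `split_frames` is a hypothetical name used only for this sketch, and how the project deals with the leftover frames is not covered here.

```python
from typing import Tuple


def split_frames(duration: float, fps: int, n_images: int) -> Tuple[int, int]:
    """Return (frames shown per image, frames left over) for one lyric line."""
    total_frames = int(duration * fps)
    frames_per_image = total_frames // n_images
    leftover = total_frames - frames_per_image * n_images
    return frames_per_image, leftover


# A 4.5 s line at 10 fps with 4 images: 45 frames in total,
# 11 frames per image, and 1 frame that no image accounts for.
print(split_frames(4.5, 10, 4))
```

Over many lines these leftovers could add up, so the video may run slightly shorter than the lyrics unless they are compensated for.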

Now we use the **opencv** package to write the lyrics as subtitles on the frame:
```
frame = cv2.cvtColor(images[j], cv2.COLOR_RGBA2BGR)
frame = put_subtitles_on_frame(frame, text, resize_factor)
frames.append(frame)
```
where **put_subtitles_on_frame** is a function in utils.py that makes use of the **cv2.putText** function.

Finally, we can write all the aggregated frames into a file:
```
video = cv2.VideoWriter(vid_path, 0, args.fps, (img_dim, img_dim))
for frame in frames:
    video.write(frame)
cv2.destroyAllWindows()
video.release()
```

The code itself is in the **get_frames** function in **main.py** and is a little more elaborate. It also fills the
parts of the song where there are no lyrics with images prompted by the last sentence or the song's name.
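
One way to locate those lyric-free parts is to compare each line's start time with the end of the previous line. The sketch below is an illustration, not the logic in `get_frames`: the name `find_gaps` and the `min_gap` threshold are assumptions, and it only relies on the 'start' and 'duration' entries of the transcript.

```python
from typing import Dict, List, Tuple


def find_gaps(transcript: List[Dict], min_gap: float = 0.5) -> List[Tuple[float, float]]:
    """Return (gap_start, gap_end) intervals, in seconds, that have no lyrics."""
    gaps = []
    previous_end = 0.0  # the song starts at 0 s
    for line in transcript:
        if line["start"] - previous_end >= min_gap:
            gaps.append((previous_end, line["start"]))
        # Lines may overlap, so never move the end marker backwards.
        previous_end = max(previous_end, line["start"] + line["duration"])
    return gaps


example = [
    {"text": "first line", "start": 1.0, "duration": 2.0},
    {"text": "second line", "start": 6.0, "duration": 1.5},
    {"text": "third line", "start": 7.5, "duration": 2.0},
]
print(find_gaps(example))  # an intro gap and one gap between the first two lines
```

Each returned interval could then be filled with images prompted by the previous line or the song's name, as described above.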

## 4. Sound and video mixing

Now that we have a video, we only need to mix it with the downloaded .mp3 file.

We'll use FFmpeg for this, with shell commands executed from Python.

The first of the two commands below cuts the .mp3 file to fit the length of the generated video, in cases where the lyrics
don't cover the whole song. The second command mixes the two into a new file with both video and audio:

```
tmp_mp3_path = f"data/{args.song_name}/tmp.mp3"
os.system(f"ffmpeg -ss 00:00:00 -t {video_duration} -i '{mp3_path}' -map 0:a -acodec libmp3lame '{tmp_mp3_path}'")
os.system(f"ffmpeg -i '{vid_path}' -i '{tmp_mp3_path}' -map 0 -map 1:a -c:v copy -shortest '{final_vid_path}'")
```
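
Song names often contain spaces or quotes, which can break a shell string. As a hedged alternative sketch (not what the project does), the same two FFmpeg invocations can be built as argument lists and passed to `subprocess.run`, which avoids shell quoting altogether. The function names below are hypothetical; the FFmpeg flags are the ones used in the commands above.

```python
import subprocess
from typing import List


def build_trim_command(mp3_path: str, duration: float, out_path: str) -> List[str]:
    """Arguments that cut the audio to `duration` seconds, starting at 0."""
    return ["ffmpeg", "-ss", "00:00:00", "-t", str(duration), "-i", mp3_path,
            "-map", "0:a", "-acodec", "libmp3lame", out_path]


def build_mux_command(vid_path: str, audio_path: str, out_path: str) -> List[str]:
    """Arguments that combine the video stream with the trimmed audio."""
    return ["ffmpeg", "-i", vid_path, "-i", audio_path,
            "-map", "0", "-map", "1:a", "-c:v", "copy", "-shortest", out_path]


def run_command(command: List[str]) -> None:
    """Run one command without a shell; raises CalledProcessError on failure."""
    subprocess.run(command, check=True)
```

Calling `run_command(build_trim_command(mp3_path, video_duration, tmp_mp3_path))` followed by `run_command(build_mux_command(vid_path, tmp_mp3_path, final_vid_path))` would mirror the two steps, assuming FFmpeg is installed and on the PATH.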

# TODO
- [ ] Fix the missing-whitespace problems in the subtitles
- [ ] Allow working on raw .mp3 and .srt files instead of URLs only
- [ ] Support automatically generated YouTube transcriptions
- [ ] Improve the timing of subtitles and sound
- [ ] Find a way to upload videos without copyright infringement
- [ ] Use other text-to-image models from Replicate