Spaces:
Build error
update blog
app.py
CHANGED
@@ -185,17 +185,9 @@ demo = gr.Blocks()
 with demo:
     gr.Markdown("""# **Create Any GIF From Your Favorite Videos!** """)
     gr.Markdown("""
-    In this Gradio Space blog I will take you through my efforts in reproducing the brilliant app [Edit Video By Editing Text](https://huggingface.co/spaces/radames/edit-video-by-editing-text) by [@radames](https://huggingface.co/radames).
-
-    My value-adds are:
-    - A permanent supply of your own new GIFs
-    - This Space is written in the form of a notebook, or a blog if I may, to help someone understand how they too can build this kind of an app.
-
-    I will start with a short note about Radames's app and the tools used in it:
-    - It is a super cool and handy proof of concept of a simple video editor in which you can edit a video by playing with its audio transcription (the ASR pipeline output).
-    - The app uses the Hugging Face [Automatic Speech Recognition pipeline](https://huggingface.co/tasks/automatic-speech-recognition), built over a Wav2Vec2 model using CTC, which lets you predict the text transcription along with timestamps for every character and pause.
-    - The app makes good use of the FFmpeg library to clip and merge videos. FFmpeg is an open-source suite of tools for handling video, audio, and other multimedia files.
-
     """)

     with gr.Row():
@@ -235,7 +227,72 @@ with demo:
         return video[0]

     examples.click(load_examples, examples, input_video)
     button_transcript.click(generate_transcripts, input_video, [text_transcript, text_words, text_wordstimestamps])
     button_gifs.click(generate_gifs, [text_gif_transcript, text_words, text_wordstimestamps], out_gif)
with demo:
    gr.Markdown("""# **Create Any GIF From Your Favorite Videos!** """)
    gr.Markdown("""
    In this Gradio Space blog I will take you through my efforts in reproducing the brilliant app [Edit Video By Editing Text](https://huggingface.co/spaces/radames/edit-video-by-editing-text) by [@radames](https://huggingface.co/radames). My value-adds are:
    - A permanent supply of your own new GIFs
    - This Space is written in the form of a notebook, or a blog if I may, to help someone understand how they too can build this kind of an app.
    """)

    with gr.Row():

    # ... (lines 194-226 unchanged) ...

        return video[0]

    examples.click(load_examples, examples, input_video)
    with gr.Row():
        gr.Markdown("""
        I will start with a short note on my understanding of Radames's app and the tools used in it:

        - His is a super cool and handy proof of concept of a simple video editor in which you can edit a video by playing with its audio transcription (the ASR pipeline output).
        - Both of our apps use **Hugging Face's [Automatic Speech Recognition pipeline](https://huggingface.co/tasks/automatic-speech-recognition)**, built over a **Wav2Vec2** model that internally uses CTC to improve its predictions. The pipeline lets you predict the text transcription along with timestamps for every character and pause in the audio.
        - His app makes good use of the FFmpeg library to clip and merge videos. FFmpeg is an open-source suite of tools for handling video, audio, and other multimedia files. My app uses FFmpeg as well as MoviePy to do the bulk of the video and audio processing.
        Let me briefly take you through the code and the process involved in building this app, *step by step*, lol:

        - Firstly, I have used ffmpeg to extract the audio from the video (this code line comes directly from Radames's app above):
        ```
        audio_memory, _ = (
            ffmpeg.input(video_path)
            .output('-', format="wav", ac=1, ar='16k')
            .overwrite_output()
            .global_args('-loglevel', 'quiet')
            .run(capture_stdout=True)
        )
        ```

        - Then I call the ASR model as a service, using the Accelerated Inference API. Below is the code snippet for doing so:
        ```
        def query(in_audio):
            # Send base64-encoded audio to the hosted ASR pipeline and ask for
            # character-level timestamps
            payload = json.dumps({
                "inputs": base64.b64encode(in_audio).decode("utf-8"),
                "parameters": {
                    "return_timestamps": "char",
                    "chunk_length_s": 10,
                    "stride_length_s": [4, 2]
                },
                "options": {"use_gpu": False}
            }).encode("utf-8")

            response = requests.request("POST", API_URL, data=payload)
            json_response = json.loads(response.content.decode("utf-8"))
            return json_response
        ```
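        For clarity, the JSON that comes back has the same shape as the pipeline's local output: a full-text transcription plus a list of per-character chunks. A tiny illustrative sketch (the values below are made up for demonstration):

        ```python
        # Illustrative response in the shape the ASR pipeline returns
        json_response = {
            "text": "DO IT",
            "chunks": [
                {"text": "D", "timestamp": [2.36, 2.38]},
                {"text": "O", "timestamp": [2.52, 2.56]},
            ],
        }

        transcript = json_response["text"]         # full transcription string
        char_timestamps = json_response["chunks"]  # per-character timing info
        print(transcript)            # DO IT
        print(len(char_timestamps))  # 2
        ```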

        - The transcript thus generated might have some words that are not correctly interpreted; for example, *tomorrow* comes out as 'to morrow', *hard at it* as 'hot ati', and so on. However, this won't hinder the use case I am demoing here, so let's move on.

        > do it just do it don't let your dreams be dreams yesterday you said to morrow so just do it make you dreams can't yro just do it some people dream of success while you're going to wake up and work hot ati nothing is impossible you should get to the point where any one else would quit and you're luck in a stop there no what are you waiting for do et jot do it just you can just do it if you're tired is starting over stop giving up
        - The other output generated by this ASR pipeline is a list of character-timestamp dictionaries; look at the sample below to get an idea:
        ```
        {'text': 'D', 'timestamp': [2.36, 2.38]},
        {'text': 'O', 'timestamp': [2.52, 2.56]},
        {'text': ' ', 'timestamp': [2.68, 2.72]},
        {'text': 'I', 'timestamp': [2.84, 2.86]},
        {'text': 'T', 'timestamp': [2.88, 2.92]},
        {'text': ' ', 'timestamp': [2.94, 2.98]},
        {'text': 'J', 'timestamp': [4.48, 4.52]},
        ```
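        From these character chunks, the word-level outputs that the app wires into `text_words` and `text_wordstimestamps` can be derived by grouping characters between spaces. This is my own minimal reconstruction of that step (`chars_to_words` is a hypothetical helper, not necessarily the exact code in either Space):

        ```python
        def chars_to_words(chunks):
            """Group character-level ASR chunks into words with [start, end] timestamps."""
            words = []
            current, start, end = "", None, None
            for chunk in chunks:
                char = chunk["text"]
                t0, t1 = chunk["timestamp"]
                if char == " ":
                    # A space closes the current word, if any
                    if current:
                        words.append({"word": current, "timestamp": [start, end]})
                    current, start, end = "", None, None
                else:
                    if not current:
                        start = t0  # first character of a new word
                    current += char
                    end = t1        # keep extending the word's end time
            if current:  # flush the trailing word
                words.append({"word": current, "timestamp": [start, end]})
            return words

        chunks = [
            {"text": "D", "timestamp": [2.36, 2.38]},
            {"text": "O", "timestamp": [2.52, 2.56]},
            {"text": " ", "timestamp": [2.68, 2.72]},
            {"text": "I", "timestamp": [2.84, 2.86]},
            {"text": "T", "timestamp": [2.88, 2.92]},
        ]
        print(chars_to_words(chunks))
        # [{'word': 'DO', 'timestamp': [2.36, 2.56]}, {'word': 'IT', 'timestamp': [2.84, 2.92]}]
        ```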
        - I have then used the *moviepy* library to extract and concatenate videos into smaller clips, and also to save the final processed video file as a .GIF image.
        ```
        import moviepy.editor as mp

        video = mp.VideoFileClip(video_path)
        final_clip = video.subclip(start_seconds, end_seconds)

        # write the selected clip out as a GIF (plus an mp4 copy)
        final_clip.write_gif("gifimage.gif")  # , program='ffmpeg', tempfiles=True, fps=15, fuzz=3)
        final_clip.write_videofile("gifimage.mp4")
        final_clip.close()
        ```
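        The `start_seconds` / `end_seconds` fed into `subclip` have to come from matching the phrase the user kept in the GIF transcript against the word timestamps. Here is a minimal sketch of that lookup (`phrase_to_span` is my own hypothetical helper, not code from either Space):

        ```python
        def phrase_to_span(phrase, words, word_timestamps):
            """Find `phrase` as a consecutive run of `words`; return its [start, end] seconds."""
            target = phrase.lower().split()
            lowered = [w.lower() for w in words]
            n = len(target)
            for i in range(len(lowered) - n + 1):
                if lowered[i:i + n] == target:
                    start = word_timestamps[i][0]         # start of first matched word
                    end = word_timestamps[i + n - 1][1]   # end of last matched word
                    return [start, end]
            return None  # phrase not found in the transcript

        words = ["just", "do", "it"]
        stamps = [[0.5, 0.9], [1.0, 1.2], [1.3, 1.5]]
        print(phrase_to_span("do it", words, stamps))  # [1.0, 1.5]
        ```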
        """)

    button_transcript.click(generate_transcripts, input_video, [text_transcript, text_words, text_wordstimestamps])
    button_gifs.click(generate_gifs, [text_gif_transcript, text_words, text_wordstimestamps], out_gif)