Spaces:
Running
A newer version of the Gradio SDK is available:
5.38.0
title: ClipScript
emoji: π¬
colorFrom: pink
colorTo: gray
sdk: gradio
sdk_version: 5.33.1
app_file: app.py
pinned: false
license: mit
short_description: Transforms videos and audio into ready-to-publish blogs.
tags:
- agent-demo-track
video_overview: https://youtu.be/8DUxlj79NqM
π¬ ClipScript: Video-to-Blog Transformer
ClipScript is a powerful application that transforms any video or audio content into a polished, ready-to-publish blog post. Simply provide a YouTube URL or upload an audio file, and let our AI agent handle the rest.
Video Overview
Watch a video demonstrating how to use ClipScript and what it is abut here!
Features
- YouTube & File Uploads: Works with YouTube links or direct audio/video file uploads.
- AI-Powered Transcription: Utilizes a state-of-the-art ASR model for highly accurate transcription.
- Agentic Blog Generation: An expert AI writing agent converts the raw transcript into a structured, engaging blog post, automatically removing conversational filler and adding SEO-friendly formatting.
- Interactive Refinement: Chat with the AI agent to refine the generated blog post until it's perfect.
- Secure & Scalable: Powered by Modal for secure, scalable, and efficient backend processing.
Hugging Face Agent Demo Track
This application has been submitted to the Agent Demo Track. It showcases an "AI agent" that acts as an expert blog writer and editor, taking a high-level goal (transforming a transcript) and executing a series of steps to achieve it.
Core Technology
Speech-to-Text: NVIDIA Parakeet TDT 0.6B V2
The transcription engine is powered by nvidia/parakeet-tdt-0.6b-v2
. This model is ranked #1 on the Hugging Face Open ASR Leaderboard, achieving the best overall average Word Error Rate (WER) and RTFx (real-time factor) score, making it one of the fastest and most accurate ASR models available.
For a deep dive into the model's architecture and performance, check out the official model card and the Open ASR Leaderboard.
For audio longer than 30 minutes, the SST model automatically segments content into optimal chunks and processes them in parallel, enabling fast transcription of hours-long content while maintaining accuracy and context.
Content Generation: AI Writing Agent
An AI writing agent, accessed via OpenRouter, converts the raw transcript into a polished, structured blog post, ready for publishing.
Backend Infrastructure: Modal
The backend is built on Modal for security, scalability, and performance.
Secure Sandboxed Execution: All media processing occurs in isolated Modal environments, keeping potentially malicious files separate from the Gradio server.
High-Performance File System: Modal Volumes provide fast, reliable file transfer and access for user uploads.
This architecture keeps the frontend lightweight while offloading intensive tasks to secure, scalable cloud resources.
Architecture
The following diagram illustrates the complete data flow, from user input in the Gradio application to the final blog post generation.
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference