---
title: ElevenClip AI
emoji: ✂️
colorFrom: red
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: mit
---

# ElevenClip AI ✂️

**AMD Developer Hackathon 2026 — Track 3: Vision & Multimodal AI**

Turn livestream recordings or uploaded videos into TikTok-ready highlight clips using true multimodal AI — vision, audio, and text analyzed simultaneously on AMD Instinct MI300X.



## Demo

**Try it live:** [HuggingFace Space](https://huggingface.co/spaces/lablab-ai-amd-developer-hackathon/ElevenClip-AI)


## What It Does

ElevenClip AI ingests an uploaded video and automatically finds the best moments to clip for TikTok, using three AI modalities working together. The backend keeps optional yt-dlp/YouTube support, but the public demo focuses on uploads because public video platforms can trigger anti-bot restrictions.

| Modality | Model | What it detects |
| --- | --- | --- |
| Vision | Qwen2.5-VL-7B on ROCm | Excitement, faces, action type, humor, TikTok potential |
| Audio | insanely-fast-whisper (ROCm) | Word-level transcript + language detection |
| Audio signal | librosa | RMS energy → loud/quiet moments |
| Vision + Text | Qwen2.5-VL (multimodal) | Frame + transcript context fused together |
| Text | Python keyword scorer + Qwen2.5-VL text prompt | Style keyword matching, emoji selection |
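
The audio-signal row above is plain librosa: per-scene RMS energy, normalized so loud moments score high. A minimal sketch, assuming scenes arrive as `(start, end)` second ranges (the real pipeline's data model may differ):

```python
import numpy as np
import librosa

def scene_energies(audio_path: str, scenes: list[tuple[float, float]]) -> list[float]:
    """Mean RMS energy per scene, normalized to 0..1 across the whole video."""
    y, sr = librosa.load(audio_path, sr=16000, mono=True)
    rms = librosa.feature.rms(y=y)[0]  # one RMS value per analysis frame
    times = librosa.frames_to_time(np.arange(len(rms)), sr=sr)
    raw = [float(rms[(times >= s) & (times < e)].mean()) for s, e in scenes]
    peak = max(raw) if raw and max(raw) > 0 else 1.0
    return [v / peak for v in raw]
```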

## Highlight Scoring Formula

```
final_score = 0.40 × vision_score + 0.35 × audio_energy + 0.25 × text_keywords

where:
  vision_score = 0.5 × excitement + 0.3 × tiktok_potential + 0.2 × humor_level
```
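
The same formula as a quick runnable sketch (the argument names are illustrative; each signal is assumed normalized to 0..1):

```python
def vision_score(excitement: float, tiktok_potential: float, humor_level: float) -> float:
    # Fuse the per-scene Qwen2.5-VL outputs into one vision signal
    return 0.5 * excitement + 0.3 * tiktok_potential + 0.2 * humor_level

def final_score(vision: float, audio_energy: float, text_keywords: float) -> float:
    # Weighted fusion of the three modalities
    return 0.40 * vision + 0.35 * audio_energy + 0.25 * text_keywords

# Example: an exciting, loud scene with few style keywords
v = vision_score(excitement=0.9, tiktok_potential=0.8, humor_level=0.3)  # 0.75
print(final_score(vision=v, audio_energy=0.7, text_keywords=0.2))        # ≈ 0.595
```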

## AI Pipeline

```
┌─ Input ──────────────────────────────────────────────────────────┐
│  Uploaded video file (YouTube backend support is optional)       │
└──────────────────────────────────────────────────────────────────┘
           │
           ▼
┌─ Audio Extraction (ffmpeg) ──────────────────────────────────────┐
│  16kHz mono WAV for Whisper                                      │
└──────────────────────────────────────────────────────────────────┘
           │
    ┌──────┴──────┐
    │             │  ← PARALLEL on AMD GPU ─────────────────────────
    ▼             ▼
┌─ Scene     ┌─ Whisper ROCm ────────────────────────────────────┐
│  Detection │  insanely-fast-whisper (SDPA attention, 4.45×)    │
│  PyScene   │  → transcript + word-level timestamps             │
│  Detect    │  → auto language detection                        │
└─────┬──────┴───────────────────────────────────────────────────┘
      │                    │
      ▼                    ▼
┌─ Frame Sampling ─────────────────────────────────────────────────┐
│  3 frames per scene (20%, 50%, 80% of scene)                     │
└──────────────────────────────────────────────────────────────────┘
           │
           ▼  ← CONCURRENT requests to vLLM ──────────────────────
┌─ Qwen2.5-VL Multimodal Analysis ─────────────────────────────────┐
│  Input per scene: [frame1] [frame2] [frame3] + transcript text   │
│  Output: excitement_score, tiktok_potential, face_bbox,          │
│          emotion, action_type, humor_level, highlight_reason     │
│  All scenes sent concurrently — vLLM batches on AMD MI300X       │
└──────────────────────────────────────────────────────────────────┘
           │
           ▼
┌─ Multi-Signal Scoring ───────────────────────────────────────────┐
│  score = 0.40×vision + 0.35×audio_energy + 0.25×text_keywords    │
│  Select top-N non-overlapping clips (min 30s gap)                │
└──────────────────────────────────────────────────────────────────┘
           │
           ▼
┌─ Branch ─────────────────────────────────────────────────────────┐
│                                                                  │
│  Normal Mode              HRE (High-Retention Editing)           │
│  ─────────────            ──────────────────────────────         │
│  • pysubs2 ASS            • Per-segment AI edit plan             │
│  • User style config      • Auto-zoom per segment (zoompan)      │
│  • Font/color/animation   • Word / phrase / sentence captions    │
│  • Karaoke/pop/fade       • Top / bottom / left / right captions │
│  • AMD AMF encode         • Qwen2.5-VL emoji selection           │
└──────────────────────────────────────────────────────────────────┘
           │
           ▼
┌─ Editor (/editor) ───────────────────────────────────────────────┐
│  • Per-clip subtitle timeline editing                            │
│  • Global style override (live preview)                          │
│  • Re-render + download MP4                                      │
└──────────────────────────────────────────────────────────────────┘
```
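
The concurrent fan-out is the key throughput trick: every scene's frames and transcript go out at once, and the server's continuous batching soaks them up. A minimal sketch, assuming vLLM exposes its OpenAI-compatible API on :8000 (the model name, prompt, and `as_image_part` helper are illustrative):

```python
import asyncio
import base64

from openai import AsyncOpenAI

# vLLM's OpenAI-compatible server; the api_key is unused but required
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def as_image_part(path: str) -> dict:
    """Encode a sampled frame as a data-URL image part."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}

async def analyze_scene(frames: list[str], transcript: str) -> str:
    resp = await client.chat.completions.create(
        model="Qwen/Qwen2.5-VL-7B-Instruct",
        messages=[{"role": "user", "content": [
            *[as_image_part(p) for p in frames],
            {"type": "text", "text": f"Transcript: {transcript}\nScore this scene as JSON."},
        ]}],
    )
    return resp.choices[0].message.content

async def analyze_all(scenes: list[tuple[list[str], str]]) -> list[str]:
    # All scenes in flight at once; vLLM batches them on the GPU
    return await asyncio.gather(*(analyze_scene(f, t) for f, t in scenes))
```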

## AMD GPU Optimizations

- **ROCm 6.3** — all model inference on AMD Instinct MI300X
- **vLLM** — serves Qwen2.5-VL with continuous batching and PagedAttention
- **SDPA attention** — PyTorch 2.0 Scaled Dot-Product Attention for Whisper (4.45× faster on ROCm)
- **float16 inference** — the 7B model fits in ~14 GB VRAM, leaving 50+ GB for large videos
- **h264_amf** — AMD VCE hardware encoder for clip extraction, with libx264 fallback (see the sketch below)
- **Parallel pipeline** — scene detection (CPU) and Whisper (GPU) run simultaneously
- **Concurrent vLLM requests** — all scenes sent to Qwen2.5-VL in parallel; the server batches them
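
A minimal sketch of that encoder fallback, assuming clips are cut by shelling out to ffmpeg (flags and output settings are illustrative):

```python
import subprocess

def cut_clip(src: str, start: float, duration: float, dst: str) -> None:
    """Extract one clip, preferring AMD hardware encode over software x264."""
    for codec in ("h264_amf", "libx264"):
        cmd = ["ffmpeg", "-y", "-ss", str(start), "-t", str(duration),
               "-i", src, "-c:v", codec, "-c:a", "aac", dst]
        if subprocess.run(cmd, capture_output=True).returncode == 0:
            return  # this encoder worked
    raise RuntimeError("both h264_amf and libx264 failed")
```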

## Two Output Modes

### Normal Subtitles

Full creative control over:

- Font family (Noto Sans Thai, Noto Sans SC, Montserrat, Impact, ...)
- Font size, bold/italic/underline
- 4-layer ASS colors: primary, secondary, outline, shadow
- Display mode: word-by-word or sentence
- Animation: Fade / Karaoke / Pop / Typewriter / Bounce (see the pysubs2 sketch below)
- Alignment (3×3 grid) + margin sliders
- Per-subtitle-line style overrides in the editor
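
A minimal sketch of this path with pysubs2, assuming Whisper-style word timestamps (style values and timings are illustrative):

```python
import pysubs2

subs = pysubs2.SSAFile()
subs.styles["TikTok"] = pysubs2.SSAStyle(
    fontname="Montserrat", fontsize=64, bold=True,
    primarycolor=pysubs2.Color(255, 255, 255),
    outlinecolor=pysubs2.Color(0, 0, 0), outline=3)

# One karaoke line: {\kNN} holds each word for NN centiseconds
words = [("Wait", 0.0, 0.4), ("for", 0.4, 0.6), ("it!", 0.6, 1.2)]
text = "".join(rf"{{\k{int((end - start) * 100)}}}{w} " for w, start, end in words).strip()
subs.events.append(pysubs2.SSAEvent(
    start=pysubs2.make_time(s=words[0][1]),
    end=pysubs2.make_time(s=words[-1][2]),
    style="TikTok", text=text))
subs.save("clip.ass")  # burn in later with ffmpeg's subtitles/ass filter
```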

### High-Retention Editing (HRE)

AI chooses everything:

- A per-segment edit plan with timestamps (a hypothetical shape is sketched below)
- Auto-zoom direction and speed per segment (ffmpeg zoompan)
- Caption mode per segment: word, phrase, or sentence
- Caption placement per segment: top, bottom, left, right, or center
- Caption color, size, and pop emphasis based on segment energy
- Qwen2.5-VL selects a contextually appropriate emoji overlay
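
To make the idea concrete, here is one hypothetical entry of such an edit plan and the kind of zoompan expression it could drive (field names are illustrative, not the project's actual schema):

```python
# Hypothetical per-segment plan; the real schema may differ
segment_plan = {
    "start": 12.4, "end": 17.9,
    "zoom": {"direction": "in", "speed": 0.0015},   # feeds ffmpeg zoompan
    "caption": {"mode": "word", "position": "top",
                "color": "#FFE14D", "size": 72, "pop": True},
    "emoji": "🔥",                                   # chosen by Qwen2.5-VL
}

# A slow zoom-in on a 9:16 frame via ffmpeg's zoompan filter
zoompan = (
    f"zoompan=z='min(zoom+{segment_plan['zoom']['speed']},1.3)'"
    ":x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':d=1:s=1080x1920"
)
```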

## Multilingual Support

| Layer | Coverage |
| --- | --- |
| UI language | ไทย · English · 中文 |
| Video input language | Auto-detect, 15+ languages (Whisper) |
| Subtitle output language | Thai (Noto Sans Thai) · Chinese (Noto Sans SC) · Japanese (Noto Sans JP) · Korean (Noto Sans KR) · English + more |
| Cross-lingual | Whisper translate → English when English subtitles are requested; other output languages rely on Whisper's multilingual transcription and timing |
| Character-level splitting | Thai and Chinese use character-level subtitle timing, since neither uses word spaces (see the sketch below) |
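
A minimal sketch of the character-level splitting, assuming a segment's duration is spread evenly across its characters (the real splitter likely also handles punctuation and Thai combining marks):

```python
def char_timings(text: str, start: float, end: float) -> list[tuple[str, float, float]]:
    """Per-character subtitle cues for scripts without word spaces."""
    chars = [c for c in text if not c.isspace()]
    step = (end - start) / max(len(chars), 1)
    return [(c, start + i * step, start + (i + 1) * step)
            for i, c in enumerate(chars)]

print(char_timings("สวัสดีครับ", 0.0, 1.0))  # ten evenly spaced Thai cues
```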

## Tech Stack

| Layer | Technology |
| --- | --- |
| Vision AI | Qwen2.5-VL-7B-Instruct (Apache 2.0) via vLLM |
| Speech-to-text | insanely-fast-whisper with PyTorch SDPA on ROCm |
| Audio analysis | librosa — RMS energy per scene |
| Scene detection | PySceneDetect — ContentDetector |
| Video download | yt-dlp |
| Video processing | ffmpeg (AMD AMF hardware encode) |
| Subtitle engine | pysubs2 — full ASS format with karaoke tags |
| GPU | AMD Instinct MI300X via ROCm 6.3 |
| Frontend | Next.js 16.2.4 App Router + Tailwind CSS |
| Backend | FastAPI + WebSocket (real-time progress) |
| Deployment | HuggingFace Spaces public demo + AMD GPU Cloud backend |

## Judge Demo

Public visitors can open the HuggingFace Space and click **Try Demo** to see a simulated flow without using AMD GPU credits. Full AMD MI300X generation is protected by an access code shared only with judges via the lablab.ai submission notes.

Recommended judging flow:

  1. Open the HuggingFace Space.
  2. Click **Try Demo** for the instant public demo.
  3. Enter the judge access code from the lablab.ai submission notes to run real generation on AMD GPU Cloud.
  4. Upload a short MP4 sample for the real run.

## Local Development

For the real development/demo path, run the frontend locally and point it at the AMD GPU Cloud backend:

```bash
# frontend/.env.local
NEXT_PUBLIC_API_URL=http://129.212.178.101:8080
NEXT_PUBLIC_DEMO_ENABLED=true
NEXT_PUBLIC_DEMO_ONLY=false
```

```bash
cd frontend
npm install
npm run dev  # http://localhost:3000
```

The AMD GPU Cloud backend runs FastAPI on :8080 and vLLM/Qwen2.5-VL on :8000. For development without a GPU, the backend can still run with fallback stubs (stubbed Whisper, fallback vision scores).


## Safe Public Demo Setup

ElevenClip AI supports three deployment modes:

| Mode | Frontend runs on | Backend/vLLM runs on | Use when |
| --- | --- | --- | --- |
| Local dev | Your laptop (localhost:3000) | AMD GPU Cloud (129.212.178.101:8080) | Iterating quickly while using MI300X remotely |
| HF public shell | HuggingFace Space CPU | AMD GPU Cloud | Public hackathon page; real generation gated by access code |
| HF self-contained GPU | HuggingFace Space | HuggingFace Space GPU | Only if the Space has suitable ROCm/AMD GPU hardware |

For the current CPU Basic HuggingFace Space, use it as the public UI and keep real generation on AMD GPU Cloud:

```bash
# frontend/.env.local for local development
NEXT_PUBLIC_API_URL=http://129.212.178.101:8080
NEXT_PUBLIC_DEMO_ENABLED=true
NEXT_PUBLIC_DEMO_ONLY=false
```

On the AMD GPU Cloud backend, protect expensive GPU endpoints before exposing the demo:

```bash
export DEMO_ACCESS_CODE="share-this-only-with-judges"
export MAX_CONCURRENT_JOBS=1
export MAX_UPLOAD_MB=300
export VLLM_IDLE_TIMEOUT=300
```

When `DEMO_ACCESS_CODE` is set, `/api/process`, `/api/video-info`, and the vLLM start/stop endpoints require the `X-Demo-Key` header. The frontend shows a Demo Access Code field and sends that header automatically. Leave `DEMO_ACCESS_CODE` unset only for private/local testing.
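
A minimal sketch of that gate as a FastAPI dependency (route and helper names are illustrative; the real implementation may differ):

```python
import os

from fastapi import Depends, FastAPI, Header, HTTPException

app = FastAPI()

def require_demo_key(x_demo_key: str | None = Header(default=None)) -> None:
    """Reject requests without the judge access code when one is configured."""
    code = os.environ.get("DEMO_ACCESS_CODE")
    if code and x_demo_key != code:  # gate stays open when the env var is unset
        raise HTTPException(status_code=401, detail="invalid demo access code")

@app.post("/api/process", dependencies=[Depends(require_demo_key)])
async def process_video() -> dict:
    return {"status": "queued"}
```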

For a self-contained HuggingFace GPU Space, leave `NEXT_PUBLIC_API_URL=""` so nginx routes `/api`, `/ws`, and `/downloads` to FastAPI inside the same Space. Only use this mode if the Space hardware is actually GPU-capable.

For the public HuggingFace Space, set `NEXT_PUBLIC_DEMO_ONLY=true`. Visitors can open the UI and run the simulated demo without touching AMD GPU credits. Judges can enter the access code to run real generation against the protected AMD GPU Cloud backend.

The current Docker setup keeps `NEXT_PUBLIC_API_URL=""` so the browser calls the HF Space on the same origin, and FastAPI then forwards real judge requests to `REMOTE_BACKEND_URL`. This avoids browser mixed-content blocking from an HTTPS Space calling an HTTP AMD Cloud IP directly.

```bash
# HF Space / Docker runtime
NEXT_PUBLIC_API_URL=
NEXT_PUBLIC_DEMO_ONLY=true
REMOTE_BACKEND_URL=http://129.212.178.101:8080
```
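
A minimal sketch of that forwarding with httpx (route and header handling are illustrative; the real proxy likely covers more endpoints):

```python
import os

import httpx
from fastapi import FastAPI, Request, Response

app = FastAPI()
REMOTE = os.environ["REMOTE_BACKEND_URL"]  # e.g. the AMD GPU Cloud backend

@app.post("/api/process")
async def forward_process(request: Request) -> Response:
    """Relay the browser's same-origin request to the remote backend."""
    async with httpx.AsyncClient(timeout=None) as client:
        upstream = await client.post(
            f"{REMOTE}/api/process",
            content=await request.body(),
            headers={"X-Demo-Key": request.headers.get("x-demo-key", "")},
        )
    return Response(content=upstream.content,
                    status_code=upstream.status_code,
                    media_type=upstream.headers.get("content-type"))
```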

## Hackathon Compliance

| Requirement | Status |
| --- | --- |
| Track 3: Vision & Multimodal AI | ✅ Qwen2.5-VL analyzes frames + transcript together |
| AMD Developer Cloud | ✅ All inference on AMD Instinct MI300X via ROCm 6.3 |
| ROCm acceleration | ✅ vLLM + SDPA Whisper + h264_amf encoder |
| Qwen partner integration | ✅ Qwen2.5-VL as primary multimodal model and text/emoji prompt model |
| HuggingFace Space | lablab-ai-amd-developer-hackathon/ElevenClip-AI |
| Public GitHub repo | JakgritB/ElevenClip-AI |
| Ship It challenge | ✅ Social posts tagging @AIatAMD + @lablab |
| MIT license | ✅ |

## License

MIT — see LICENSE