algorembrant committed on
Commit 04850df · verified · 1 Parent(s): ae25742

Upload 4 files

Files changed (4)
  1. .gitignore +1 -0
  2. README.md +135 -0
  3. requirements.txt +7 -0
  4. transcriber.py +639 -0
.gitignore ADDED
@@ -0,0 +1 @@
venv/
README.md ADDED
@@ -0,0 +1,135 @@
# ⚡ Universal Media Transcriber

Convert **YouTube, YouTube Music, Spotify, and direct audio/video URLs** into transcript `.txt` files — extremely fast.

---

## ✨ Features

| Feature | Detail |
|---|---|
| **Native captions** | YouTube captions grabbed instantly — no audio download |
| **Whisper fallback** | `faster-whisper` (up to 4× faster than OpenAI Whisper) |
| **Spotify** | Tracks / albums / playlists via `spotdl` → Whisper |
| **Direct audio** | `.mp3 .mp4 .wav .m4a .webm .ogg` and more |
| **Playlist support** | Auto-expand playlists, channels, albums |
| **Batch + parallel** | Multiple URLs, concurrent workers |
| **Smart cache** | Re-run the same URL instantly |
| **Auto-install** | Deps install themselves on first run |

---
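The caption-first flow in the table above can be sketched as a tiny routing function. This is a minimal sketch, not the script's actual code: `fetch_captions` and `whisper_transcribe` are hypothetical stand-ins for the real backends in `transcriber.py`.

```python
# Minimal sketch of the caption-first strategy: try the instant native-caption
# path first, fall back to Whisper only when no captions exist.
# fetch_captions / whisper_transcribe are hypothetical stand-ins here.
def transcribe(url, fetch_captions, whisper_transcribe):
    text = fetch_captions(url)          # instant path: native captions, no download
    if text is not None:
        return text, "native_captions"
    return whisper_transcribe(url), "whisper"  # fallback: download audio + ASR

# Usage with stubbed backends (pretend the video has no captions):
text, method = transcribe(
    "https://youtu.be/VIDEO_ID",
    fetch_captions=lambda u: None,
    whisper_transcribe=lambda u: "hello world",
)
```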
## 🚀 Quick Start

```bash
# Single YouTube video
python transcriber.py https://youtu.be/VIDEO_ID

# Multiple URLs
python transcriber.py URL1 URL2 URL3

# From a file (one URL per line)
python transcriber.py --file urls.txt

# Full YouTube playlist
python transcriber.py --playlist https://youtube.com/playlist?list=PLAYLIST_ID

# Spotify track
python transcriber.py https://open.spotify.com/track/TRACK_ID

# Force Whisper (ignore captions)
python transcriber.py URL --whisper

# Larger model for better accuracy
python transcriber.py URL --model large-v3

# Merge all into one file
python transcriber.py URL1 URL2 --merge

# Custom output folder
python transcriber.py URL --output ./my_transcripts
```

---

## 🛠 Options

```
urls            One or more media URLs
--file, -f      Text file with one URL per line
--output, -o    Output directory (default: ./transcripts)
--merge, -m     Merge all transcripts into one file
--whisper, -w   Force Whisper (skip caption check)
--model         tiny | base | small | medium | large-v2 | large-v3
--workers       Parallel workers (default: 4)
--no-cache      Disable transcript cache
--playlist      Expand playlist/channel into individual videos
--clear-cache   Wipe the cache
```

---
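The option surface above maps directly onto `argparse`. A trimmed sketch covering a subset of the real flags, with the defaults as documented:

```python
import argparse

# Trimmed sketch of the CLI described above (subset of the real flags).
parser = argparse.ArgumentParser(prog="transcriber.py")
parser.add_argument("urls", nargs="*", help="One or more media URLs")
parser.add_argument("--file", "-f", help="Text file with one URL per line")
parser.add_argument("--output", "-o", default="./transcripts")
parser.add_argument("--merge", "-m", action="store_true")
parser.add_argument("--whisper", "-w", action="store_true")
parser.add_argument("--model", default="base",
                    choices=["tiny", "base", "small", "medium", "large-v2", "large-v3"])
parser.add_argument("--workers", type=int, default=4)

# Parsing an example command line:
args = parser.parse_args(["URL1", "URL2", "--merge", "--model", "large-v3"])
```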
## 📦 Dependencies (auto-installed)

- `yt-dlp` — universal media downloader
- `youtube-transcript-api` — instant YouTube captions
- `faster-whisper` — optimized Whisper (CTranslate2)
- `spotdl` — Spotify downloader
- `rich` — terminal UI

**Requires:** `ffmpeg` installed on your system:
```bash
# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt install ffmpeg

# Windows
winget install ffmpeg
```
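Since `ffmpeg` is a hard requirement, a quick preflight check can save a confusing mid-download failure. A minimal sketch using only the standard library (`have_ffmpeg` is a hypothetical helper, not part of the script):

```python
import shutil

# Preflight: is an ffmpeg executable resolvable on PATH?
def have_ffmpeg() -> bool:
    return shutil.which("ffmpeg") is not None

if not have_ffmpeg():
    print("ffmpeg not found — install it first (see the commands above)")
```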
---

## ⚡ Speed Guide

| Source | Method | Speed |
|---|---|---|
| YouTube with captions | Native API | < 2 sec |
| YouTube, no captions | Whisper `base` | ~realtime |
| Spotify music | spotdl + Whisper | depends on length |
| Direct audio | Whisper | ~realtime |

**Model accuracy vs speed:**
`tiny` → `base` → `small` → `medium` → `large-v3`
(fastest → most accurate)

---

## 📄 Output Format

```
======================================================================
TITLE    : My Video Title
UPLOADER : Channel Name
DURATION : 0:15:42
SOURCE   : youtube
METHOD   : native_captions
URL      : https://youtu.be/...
======================================================================

[0:00:00] Hello and welcome to this video...
[0:00:05] Today we're going to talk about...
```
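The bracketed timestamps in the sample above come from plain `datetime.timedelta` formatting of each segment's start time. A minimal sketch matching the sample output (`stamp` is a hypothetical helper name):

```python
from datetime import timedelta

# Seconds -> "[H:MM:SS]" the way the transcript lines above are stamped.
def stamp(seconds: float) -> str:
    return f"[{timedelta(seconds=int(seconds))}]"

print(stamp(5))    # [0:00:05]
print(stamp(942))  # [0:15:42]
```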
---

## 🗂 URL File Format

```
# urls.txt — lines starting with # are comments
https://youtu.be/VIDEO1
https://youtu.be/VIDEO2
https://open.spotify.com/track/TRACK_ID
https://example.com/podcast.mp3
```
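Parsing a file in this format is a one-pass filter: strip whitespace, skip blank lines and `#` comments. A minimal sketch (`load_urls` is a hypothetical name; the real script inlines this logic):

```python
from pathlib import Path

# Read a urls.txt in the format above, skipping blanks and comment lines.
def load_urls(path: str) -> list[str]:
    out = []
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        if line and not line.startswith("#"):
            out.append(line)
    return out
```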
requirements.txt ADDED
@@ -0,0 +1,7 @@
yt-dlp
youtube-transcript-api
faster-whisper
rich
spotdl
requests
torch
transcriber.py ADDED
@@ -0,0 +1,639 @@
#!/usr/bin/env python3
"""
Universal Media Transcriber
Supports: YouTube, YouTube Music, Spotify, Direct Audio/Video URLs
Blazing fast: uses native captions when available, falls back to faster-whisper
"""

import os
import sys
import re
import json
import time
import shutil
import hashlib
import argparse
import tempfile
import subprocess
from pathlib import Path
from datetime import timedelta
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.parse import urlparse, parse_qs

# Add current directory to PATH so a local ffmpeg.exe can be found
script_dir = str(Path(__file__).parent.absolute())
if script_dir not in os.environ["PATH"]:
    os.environ["PATH"] = script_dir + os.pathsep + os.environ["PATH"]

# ─────────────────────────────────────────────────────────────────────────────
# AUTO-INSTALLER — installs missing deps silently on first run
# ─────────────────────────────────────────────────────────────────────────────

REQUIRED = {
    "yt_dlp": "yt-dlp",
    "youtube_transcript_api": "youtube-transcript-api",
    "faster_whisper": "faster-whisper",
    "rich": "rich",
    "spotdl": "spotdl",
    "requests": "requests",
}

def ensure_deps():
    missing = []
    for module, pkg in REQUIRED.items():
        try:
            __import__(module)
        except ImportError:
            missing.append(pkg)
    if missing:
        print(f"[setup] Installing: {', '.join(missing)} ...")
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", "--quiet", "--break-system-packages"] + missing
        )
        print("[setup] Done.\n")

ensure_deps()

# ─────────────────────────────────────────────────────────────────────────────
# IMPORTS (after install)
# ─────────────────────────────────────────────────────────────────────────────

import yt_dlp
import requests
from rich.console import Console
from rich.progress import Progress, SpinnerColumn, TextColumn, BarColumn, TimeElapsedColumn
from rich.panel import Panel
from rich.table import Table
from rich import print as rprint
from youtube_transcript_api import YouTubeTranscriptApi, TranscriptsDisabled, NoTranscriptFound

console = Console()

# ─────────────────────────────────────────────────────────────────────────────
# CONSTANTS & CONFIG
# ─────────────────────────────────────────────────────────────────────────────

WHISPER_MODEL = "base"            # tiny / base / small / medium / large-v3
WHISPER_DEVICE = "auto"           # auto / cpu / cuda
WHISPER_THREADS = os.cpu_count()  # use all cores
AUDIO_FORMAT = "mp3"
MAX_WORKERS = 4                   # parallel jobs
CACHE_DIR = Path.home() / ".transcriber_cache"
CACHE_DIR.mkdir(exist_ok=True)

# Language preference order for YouTube captions
LANG_PREF = ["en", "en-US", "en-GB", "en-AU", "en-CA", "en-IN", "en-IE", "en-NZ", "en-PH", "en-ZA", "en-orig", "a.en"]

# ─────────────────────────────────────────────────────────────────────────────
# URL DETECTION
# ─────────────────────────────────────────────────────────────────────────────

def detect_source(url: str) -> str:
    """Returns: youtube | youtube_music | spotify | audio | unknown"""
    parsed = urlparse(url)
    host = parsed.netloc.lower().replace("www.", "")

    if host in ("youtube.com", "youtu.be", "m.youtube.com"):
        return "youtube"
    if host in ("music.youtube.com",):
        return "youtube_music"
    if host in ("open.spotify.com", "spotify.com"):
        return "spotify"
    if any(url.lower().endswith(ext) for ext in [
        ".mp3", ".mp4", ".wav", ".ogg", ".flac", ".m4a", ".webm",
        ".aac", ".opus", ".mkv", ".avi", ".mov"
    ]):
        return "audio"
    # Try to detect by content-type via HEAD
    try:
        r = requests.head(url, timeout=5, allow_redirects=True)
        ct = r.headers.get("content-type", "")
        if "audio" in ct or "video" in ct:
            return "audio"
    except Exception:
        pass
    return "unknown"


def extract_youtube_id(url: str) -> str | None:
    """Extract video ID from any YouTube URL format."""
    patterns = [
        r"(?:v=|youtu\.be/|embed/|shorts/)([A-Za-z0-9_-]{11})",
    ]
    for p in patterns:
        m = re.search(p, url)
        if m:
            return m.group(1)
    return None


def extract_spotify_type(url: str) -> tuple[str, str]:
    """Returns (type, id) e.g. ('track', 'abc123')"""
    m = re.search(r"spotify\.com/(track|album|playlist|episode|show)/([A-Za-z0-9]+)", url)
    if m:
        return m.group(1), m.group(2)
    return "unknown", ""

# ─────────────────────────────────────────────────────────────────────────────
# CACHE
# ─────────────────────────────────────────────────────────────────────────────

def cache_key(url: str) -> str:
    return hashlib.md5(url.encode()).hexdigest()

def cache_get(url: str) -> str | None:
    path = CACHE_DIR / f"{cache_key(url)}.txt"
    if path.exists():
        return path.read_text(encoding="utf-8")
    return None

def cache_set(url: str, text: str):
    path = CACHE_DIR / f"{cache_key(url)}.txt"
    path.write_text(text, encoding="utf-8")

# ─────────────────────────────────────────────────────────────────────────────
# WHISPER ENGINE (lazy-loaded, singleton)
# ─────────────────────────────────────────────────────────────────────────────

_whisper_model = None

def get_whisper():
    global _whisper_model
    if _whisper_model is None:
        from faster_whisper import WhisperModel
        device = WHISPER_DEVICE
        if device == "auto":
            try:
                import torch
                device = "cuda" if torch.cuda.is_available() else "cpu"
            except ImportError:
                device = "cpu"
        console.log(f"[cyan]Loading Whisper [{WHISPER_MODEL}] on {device}...[/cyan]")
        compute = "float16" if device == "cuda" else "int8"
        _whisper_model = WhisperModel(WHISPER_MODEL, device=device, compute_type=compute,
                                      num_workers=WHISPER_THREADS, cpu_threads=WHISPER_THREADS)
    return _whisper_model


def transcribe_audio_file(audio_path: str, lang: str | None = None) -> str:
    """Transcribe a local audio file with faster-whisper. Returns full transcript text."""
    model = get_whisper()
    opts = dict(beam_size=5, word_timestamps=False, vad_filter=True, vad_parameters=dict(min_silence_duration_ms=500))
    if lang:
        opts["language"] = lang
    segments, info = model.transcribe(audio_path, **opts)
    lines = []
    for seg in segments:
        ts = str(timedelta(seconds=int(seg.start))).zfill(8)
        lines.append(f"[{ts}] {seg.text.strip()}")
    return "\n".join(lines)

# ─────────────────────────────────────────────────────────────────────────────
# YOUTUBE / YOUTUBE MUSIC — caption-first, whisper fallback
# ─────────────────────────────────────────────────────────────────────────────

def fetch_youtube_captions(video_id: str) -> str | None:
    """Try to get native captions (instant, no download)."""
    try:
        transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)
        # prefer manual > auto-generated, English first
        transcript = None
        for lang in LANG_PREF:
            try:
                transcript = transcript_list.find_transcript([lang])
                break
            except Exception:
                pass
        if transcript is None:
            # grab whatever is first
            transcript = next(iter(transcript_list))
        entries = transcript.fetch()
        lines = []
        for e in entries:
            ts = str(timedelta(seconds=int(e["start"]))).zfill(8)
            lines.append(f"[{ts}] {e['text'].strip()}")
        return "\n".join(lines)
    except (TranscriptsDisabled, NoTranscriptFound):
        return None
    except Exception as exc:
        console.log(f"[yellow]Caption fetch warning: {exc}[/yellow]")
        return None


def download_audio_yt(url: str, out_dir: str) -> str:
    """Download audio from YouTube/YouTube Music using yt-dlp. Returns file path."""
    ydl_opts = {
        "format": "bestaudio/best",
        "outtmpl": os.path.join(out_dir, "%(id)s.%(ext)s"),
        "postprocessors": [{
            "key": "FFmpegExtractAudio",
            "preferredcodec": AUDIO_FORMAT,
            "preferredquality": "128",
        }],
        "quiet": True,
        "no_warnings": True,
        "concurrent_fragment_downloads": 8,
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(url, download=True)
        video_id = info.get("id", "audio")
    return os.path.join(out_dir, f"{video_id}.{AUDIO_FORMAT}")


def get_video_metadata(url: str) -> dict:
    """Get title, uploader, duration without downloading."""
    ydl_opts = {"quiet": True, "no_warnings": True, "skip_download": True}
    try:
        with yt_dlp.YoutubeDL(ydl_opts) as ydl:
            info = ydl.extract_info(url, download=False)
        return {
            "title": info.get("title", "Unknown"),
            "uploader": info.get("uploader", "Unknown"),
            "duration": info.get("duration", 0),
            "description": info.get("description", ""),
            "upload_date": info.get("upload_date", ""),
        }
    except Exception:
        return {"title": "Unknown", "uploader": "Unknown", "duration": 0}


def transcribe_youtube(url: str, force_whisper: bool = False) -> dict:
    """Full pipeline for YouTube / YouTube Music."""
    video_id = extract_youtube_id(url) or "unknown"
    meta = get_video_metadata(url)

    transcript_text = None
    method = "unknown"

    if not force_whisper:
        console.log(f"[cyan]Trying native captions for[/cyan] [bold]{meta['title']}[/bold]")
        transcript_text = fetch_youtube_captions(video_id)
        if transcript_text:
            method = "native_captions"
            console.log("[green]✓ Got captions instantly (no download needed)[/green]")

    if transcript_text is None:
        console.log("[yellow]No captions → downloading audio for Whisper...[/yellow]")
        with tempfile.TemporaryDirectory() as tmpdir:
            audio_path = download_audio_yt(url, tmpdir)
            console.log(f"[cyan]Transcribing with Whisper [{WHISPER_MODEL}]...[/cyan]")
            transcript_text = transcribe_audio_file(audio_path)
        method = f"whisper_{WHISPER_MODEL}"

    return {
        "url": url,
        "source": "youtube",
        "method": method,
        "meta": meta,
        "transcript": transcript_text,
    }

# ─────────────────────────────────────────────────────────────────────────────
# SPOTIFY
# ─────────────────────────────────────────────────────────────────────────────

def transcribe_spotify(url: str) -> dict:
    """Download Spotify track/episode then transcribe."""
    sp_type, sp_id = extract_spotify_type(url)

    # Podcasts/episodes: yt-dlp can sometimes handle them directly
    if sp_type == "episode":
        console.log("[cyan]Spotify episode — trying yt-dlp...[/cyan]")
        try:
            with tempfile.TemporaryDirectory() as tmpdir:
                audio_path = download_audio_yt(url, tmpdir)
                meta = get_video_metadata(url)
                transcript_text = transcribe_audio_file(audio_path)
            return {
                "url": url,
                "source": "spotify_episode",
                "method": f"whisper_{WHISPER_MODEL}",
                "meta": meta,
                "transcript": transcript_text,
            }
        except Exception as e:
            console.log(f"[yellow]yt-dlp failed for Spotify episode: {e}[/yellow]")

    # Music tracks / albums / playlists → spotdl
    console.log("[cyan]Spotify music — downloading via spotdl...[/cyan]")
    with tempfile.TemporaryDirectory() as tmpdir:
        result = subprocess.run(
            [sys.executable, "-m", "spotdl", url, "--output", tmpdir,
             "--format", "mp3", "--bitrate", "128k", "--print-errors"],
            capture_output=True, text=True
        )
        # find downloaded files
        audio_files = list(Path(tmpdir).glob("*.mp3")) + list(Path(tmpdir).glob("*.m4a"))
        if not audio_files:
            raise RuntimeError(f"spotdl produced no files.\n{result.stderr}")

        transcripts = []
        for af in sorted(audio_files):
            console.log(f"[cyan]Transcribing:[/cyan] {af.name}")
            t = transcribe_audio_file(str(af))
            transcripts.append(f"=== {af.stem} ===\n{t}")

    return {
        "url": url,
        "source": f"spotify_{sp_type}",
        "method": f"spotdl+whisper_{WHISPER_MODEL}",
        "meta": {"title": f"Spotify {sp_type.title()}", "uploader": "Spotify"},
        "transcript": "\n\n".join(transcripts),
    }

# ─────────────────────────────────────────────────────────────────────────────
# DIRECT AUDIO / VIDEO URL
# ─────────────────────────────────────────────────────────────────────────────

def transcribe_direct_audio(url: str) -> dict:
    """Download a direct audio/video file and transcribe."""
    console.log(f"[cyan]Downloading direct audio:[/cyan] {url}")
    with tempfile.TemporaryDirectory() as tmpdir:
        ydl_opts = {
            "outtmpl": os.path.join(tmpdir, "audio.%(ext)s"),
            "quiet": True,
            "no_warnings": True,
            "concurrent_fragment_downloads": 8,
        }
        with yt_dlp.YoutubeDL(ydl_opts) as ydl:
            info = ydl.extract_info(url, download=True)
        title = info.get("title", Path(url).stem) if info else Path(url).stem

        audio_files = list(Path(tmpdir).iterdir())
        if not audio_files:
            raise RuntimeError("No file downloaded")
        audio_path = str(audio_files[0])
        console.log(f"[cyan]Transcribing:[/cyan] {Path(audio_path).name}")
        transcript_text = transcribe_audio_file(audio_path)

    return {
        "url": url,
        "source": "audio",
        "method": f"whisper_{WHISPER_MODEL}",
        "meta": {"title": title, "uploader": "Direct"},
        "transcript": transcript_text,
    }

# ─────────────────────────────────────────────────────────────────────────────
# PLAYLIST / BATCH EXPANSION
# ─────────────────────────────────────────────────────────────────────────────

def expand_playlist(url: str) -> list[str]:
    """Return list of individual video URLs from a playlist/album/channel."""
    ydl_opts = {
        "quiet": True,
        "no_warnings": True,
        "extract_flat": True,
        "skip_download": True,
    }
    try:
        with yt_dlp.YoutubeDL(ydl_opts) as ydl:
            info = ydl.extract_info(url, download=False)
            if "entries" in info:
                urls = []
                for e in info["entries"]:
                    if e and e.get("url"):
                        urls.append(e["url"])
                    elif e and e.get("id"):
                        urls.append(f"https://www.youtube.com/watch?v={e['id']}")
                return urls
    except Exception as exc:
        console.log(f"[yellow]Playlist expansion warning: {exc}[/yellow]")
    return [url]

# ─────────────────────────────────────────────────────────────────────────────
# MAIN ROUTER
# ─────────────────────────────────────────────────────────────────────────────

def transcribe_url(url: str, force_whisper: bool = False, use_cache: bool = True) -> dict:
    """Route URL to the correct transcription pipeline."""
    url = url.strip()

    if use_cache:
        cached = cache_get(url)
        if cached:
            console.log(f"[green]✓ Cache hit:[/green] {url[:60]}")
            return {"url": url, "source": "cache", "method": "cache",
                    "meta": {"title": "Cached"}, "transcript": cached}

    source = detect_source(url)
    console.log(f"[bold blue]Source detected:[/bold blue] {source} → {url[:70]}")

    if source in ("youtube", "youtube_music"):
        result = transcribe_youtube(url, force_whisper=force_whisper)
    elif source == "spotify":
        result = transcribe_spotify(url)
    elif source == "audio":
        result = transcribe_direct_audio(url)
    else:
        # Try yt-dlp as generic fallback (handles 1000+ sites)
        console.log("[yellow]Unknown source — trying yt-dlp generic handler...[/yellow]")
        result = transcribe_direct_audio(url)

    if use_cache:
        cache_set(url, result["transcript"])

    return result

# ─────────────────────────────────────────────────────────────────────────────
# OUTPUT FORMATTING
# ─────────────────────────────────────────────────────────────────────────────

def format_transcript(result: dict, include_header: bool = True) -> str:
    meta = result.get("meta", {})
    title = meta.get("title", "Unknown")
    uploader = meta.get("uploader", "Unknown")
    duration = meta.get("duration", 0)
    dur_str = str(timedelta(seconds=int(duration))) if duration else "N/A"
    method = result.get("method", "unknown")
    url = result.get("url", "")

    header = ""
    if include_header:
        header = (
            f"{'='*70}\n"
            f"TITLE    : {title}\n"
            f"UPLOADER : {uploader}\n"
            f"DURATION : {dur_str}\n"
            f"SOURCE   : {result.get('source','')}\n"
            f"METHOD   : {method}\n"
            f"URL      : {url}\n"
            f"{'='*70}\n\n"
        )

    return header + result["transcript"] + "\n"


def safe_filename(title: str) -> str:
    title = re.sub(r'[<>:"/\\|?*]', "_", title)
    title = title.strip(". ")[:80]
    return title or "transcript"

# ─────────────────────────────────────────────────────────────────────────────
# BATCH PROCESSING
# ─────────────────────────────────────────────────────────────────────────────

def process_batch(urls: list[str], output_dir: Path, force_whisper: bool,
                  use_cache: bool, merge: bool, workers: int):
    output_dir.mkdir(parents=True, exist_ok=True)
    results = []
    errors = []

    console.rule("[bold green]Universal Media Transcriber[/bold green]")
    console.print(f"[dim]URLs: {len(urls)} | Workers: {workers} | Model: {WHISPER_MODEL}[/dim]\n")

    def job(url):
        t0 = time.time()
        try:
            r = transcribe_url(url, force_whisper=force_whisper, use_cache=use_cache)
            r["elapsed"] = round(time.time() - t0, 1)
            return r
        except Exception as exc:
            return {"url": url, "error": str(exc), "elapsed": round(time.time() - t0, 1)}

    with Progress(
        SpinnerColumn(),
        TextColumn("[progress.description]{task.description}"),
        BarColumn(),
        TextColumn("[progress.percentage]{task.percentage:>3.0f}%"),
        TimeElapsedColumn(),
        console=console,
    ) as progress:
        task = progress.add_task("Transcribing...", total=len(urls))
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = {pool.submit(job, u): u for u in urls}
            for fut in as_completed(futures):
                result = fut.result()
                if "error" in result:
                    errors.append(result)
                    console.log(f"[red]✗ Error:[/red] {result['url'][:60]} → {result['error']}")
                else:
                    results.append(result)
                    console.log(f"[green]✓[/green] {result['meta'].get('title','?')[:50]} [{result['elapsed']}s]")
                progress.advance(task)

    # ── Write files ──────────────────────────────────────────────────────────
    if merge and results:
        merged_path = output_dir / "merged_transcript.txt"
        with open(merged_path, "w", encoding="utf-8") as f:
            for r in results:
                f.write(format_transcript(r))
                f.write("\n" + "─" * 70 + "\n\n")
        console.print(f"\n[bold green]✓ Merged transcript:[/bold green] {merged_path}")
    else:
        for r in results:
            title = r["meta"].get("title", "transcript")
            fname = safe_filename(title) + ".txt"
            out_path = output_dir / fname
            # avoid collisions
            if out_path.exists():
                stem = out_path.stem
                out_path = output_dir / f"{stem}_{cache_key(r['url'])[:6]}.txt"
            out_path.write_text(format_transcript(r), encoding="utf-8")
            console.print(f"[green]✓ Saved:[/green] {out_path}")

    # ── Summary table ────────────────────────────────────────────────────────
    table = Table(title="\n Summary", show_lines=True)
    table.add_column("Title", style="cyan", max_width=40)
    table.add_column("Method", style="magenta")
    table.add_column("Time", justify="right")
    table.add_column("Status", justify="center")

    for r in results:
        table.add_row(
            r["meta"].get("title", "?")[:38],
            r.get("method", "?"),
            f"{r['elapsed']}s",
            "[green]✓[/green]",
        )
    for r in errors:
        table.add_row(r["url"][:38], "—", f"{r['elapsed']}s", "[red]✗[/red]")

    console.print(table)
    console.print(f"\n[bold]Done:[/bold] {len(results)} ok, {len(errors)} failed → [dim]{output_dir}[/dim]")

# ─────────────────────────────────────────────────────────────────────────────
# CLI
# ─────────────────────────────────────────────────────────────────────────────

def main():
    global WHISPER_MODEL
    parser = argparse.ArgumentParser(
        description="Universal Media Transcriber — YouTube, Spotify, Audio & more",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  python transcriber.py https://youtu.be/dQw4w9WgXcQ
  python transcriber.py URL1 URL2 URL3 --merge
  python transcriber.py --file urls.txt --output ./transcripts
  python transcriber.py https://open.spotify.com/track/... --whisper
  python transcriber.py https://youtu.be/... --model large-v3
  python transcriber.py --playlist https://youtube.com/playlist?list=...
"""
    )
    parser.add_argument("urls", nargs="*", help="One or more media URLs")
    parser.add_argument("--file", "-f", help="Text file with one URL per line")
    parser.add_argument("--output", "-o", default="./transcripts", help="Output directory (default: ./transcripts)")
    parser.add_argument("--merge", "-m", action="store_true", help="Merge all transcripts into one file")
    parser.add_argument("--whisper", "-w", action="store_true", help="Force Whisper (skip caption check)")
    parser.add_argument("--model", default=WHISPER_MODEL,
                        choices=["tiny", "base", "small", "medium", "large-v2", "large-v3"],
                        help="Whisper model size (default: base)")
    parser.add_argument("--workers", type=int, default=MAX_WORKERS, help="Parallel workers (default: 4)")
    parser.add_argument("--no-cache", action="store_true", help="Disable transcript cache")
    parser.add_argument("--playlist", action="store_true", help="Treat URL as playlist — expand all videos")
    parser.add_argument("--clear-cache", action="store_true", help="Clear the transcript cache and exit")

    args = parser.parse_args()
    WHISPER_MODEL = args.model  # apply the --model choice globally

    if args.clear_cache:
        shutil.rmtree(CACHE_DIR, ignore_errors=True)
        CACHE_DIR.mkdir(exist_ok=True)
        console.print("[green]Cache cleared.[/green]")
        return

    # Collect URLs
    all_urls = list(args.urls)
    if args.file:
        path = Path(args.file)
        if not path.exists():
            console.print(f"[red]File not found: {path}[/red]")
            sys.exit(1)
        lines = path.read_text().splitlines()
        all_urls += [l.strip() for l in lines if l.strip() and not l.startswith("#")]

    if not all_urls:
        parser.print_help()
        sys.exit(0)

    # Expand playlists
    if args.playlist or len(all_urls) == 1:
        expanded = []
        for u in all_urls:
            exp = expand_playlist(u)
            if len(exp) > 1:
                console.log(f"[cyan]Playlist expanded:[/cyan] {len(exp)} items")
            expanded.extend(exp)
        all_urls = expanded

    # Deduplicate preserving order
    seen = set()
    deduped = []
    for u in all_urls:
        if u not in seen:
            seen.add(u)
            deduped.append(u)
    all_urls = deduped

    process_batch(
        urls=all_urls,
        output_dir=Path(args.output),
        force_whisper=args.whisper,
        use_cache=not args.no_cache,
        merge=args.merge,
        workers=args.workers,
    )


if __name__ == "__main__":
    main()