amitgcode committed on
Commit
e6580d2
·
verified ·
1 Parent(s): 5314046

Initial Commit

Files changed (11)
  1. ReadMe.md +179 -0
  2. YouTubeAgent.py +169 -0
  3. app.py +222 -0
  4. config.py +88 -0
  5. dbcone.py +39 -0
  6. embeddings.py +173 -0
  7. fetch_youtube_videos.py +115 -0
  8. main.py +73 -0
  9. requirements.txt +0 -0
  10. summary.py +122 -0
  11. transcribe_videos.py +124 -0
ReadMe.md ADDED
@@ -0,0 +1,179 @@
+ # VidInsight AI: AI-Powered YouTube Content Analyzer
+
+ ## Overview
+ VidInsight AI is an AI-powered application that analyzes YouTube videos on a given subject, extracts insights, and provides transcriptions, a topic, a summary, key points, and a new content idea.
+ The application is built to assist:
+ - content creators,
+ - educators & researchers, and
+ - everyday users in understanding video content quickly and effectively.
+
+ ---
+ This ReadMe file documents the current phase of the project and will be updated as new features are implemented.
+
+ **Current Features (Asif's Code):**
+
+ 1. YouTube Video Retrieval:
+ • Fetches up to 10 YouTube videos based on a user-provided topic.
+ • Filters videos based on criteria such as keywords, view counts, and trusted channels.
+ • Selects the top 3 videos based on relevance and view counts.
+
+ 2. Transcription:
+ • Transcribes audio from the top 3 selected videos using OpenAI’s Whisper model.
+ • Saves the complete transcripts in an `output` folder for further processing.
+
+ 3. User Interface:
+ • Input
+ • Provides a user-friendly interface built with Gradio.
+ • Output
+ • Displays video details (title, channel, views) and a preview of the transcription.
+ • Analysis (Topic, Summary & Key Points)
+ • Content Idea with comprehensive details
+ ---
+
+ ## Project Structure
+
+ VidInsight-AI/\
+ ├── app.py              # Gradio web interface for user interaction\
+ ├── config.py              # Configuration file for API keys and filters\
+ ├── fetch_youtube_videos.py              # Fetches and filters YouTube videos\
+ ├── transcribe_videos.py              # Transcribes videos and saves transcripts\
+ ├── summary.py              # Generates summaries from transcriptions\
+ ├── YouTubeAgent.py              # Creates content ideas using Gemini AI\
+ ├── main.py              # CLI-based alternative to run the app\
+ ├── requirements.txt              # Project dependencies\
+ ├── keys1.env              # Environment variables (API keys)\
+ └── output/              # Folder for saved transcripts\
+ &nbsp;&nbsp;&nbsp; └── <video_id>.txt &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # Transcripts saved as text files\
+
+ ### Key Components:
+ 1. Interface Files:
+ • `app.py`: Web interface using Gradio
+ • `main.py`: Command-line interface
+ 2. Core Processing Files:
+ • `fetch_youtube_videos.py`: Video retrieval
+ • `transcribe_videos.py`: Audio transcription
+ • `summary.py`: Content summarization
+ • `YouTubeAgent.py`: Content idea generation
+ 3. Configuration Files:
+ • `config.py`: Settings and filters
+ • `keys1.env`: API keys
+ • `requirements.txt`: Dependencies
+ 4. Output Directory:
+ • `output/`: Stores generated transcripts
+
+ ---
+
+ ## Setup Instructions (to be completed)
+
+ 1. Prerequisites\
+ • Python 3.8 or higher\
+ • FFmpeg installed on the system (for audio processing)
+ • A YouTube Data API key (create one via Google Cloud Console)
+ • A Gemini API key
+ • A Tavily API key
+
+ 2. Installation
+ 1. Clone the repository:
+ ```bash
+ git clone <repository_url>
+ ```
+ 2. Install the required dependencies from `requirements.txt`.
+ 3. Set up your API keys:
+ • Create a `.env` file or update `keys1.env` with your API keys:
+ ```env
+ YOUTUBE_API_KEY="your_api_key_here"
+ GEMINI_API_KEY="your_api_key_here"
+ TAVILY_API_KEY="your_api_key_here"
+ ```
+
+ 3. Running the Application\
+ • Using the Gradio Interface:
+ ```bash
+ python app.py
+ ```
+ • Using the CLI:
+ ```bash
+ python main.py
+ ```
+ ---
+
+ ## Usage
+
+ #### Gradio App
+ 1. Enter a topic in the “Enter learning topic” field (e.g., “Machine Learning”).
+ 2. Click “Submit” to fetch and analyze videos.
+ 3. View results, including:
+ • Video title, channel name, view count.
+ • A preview of the transcription.
+ • The path to the saved transcript file.
+ • Topic, Summary, and Key Points
+ • A New Content Idea with Comprehensive Details
+ #### Output Folder
+ • Complete transcripts are saved in the `output/` folder as `.txt` files.
+ • File names are based on unique YouTube video IDs (e.g., `ukzFI9rgwfU.txt`).
+
+ ---
+
+ ## Configuration
+
+ The `config.py` file allows customization of filtering criteria:
+ ```python
+ FILTER_CONFIG = {
+     "videoDuration": "medium",  # Focus on videos between 4 and 20 minutes
+     "order": "relevance",  # Sort by relevance
+     "trusted_channels": {
+         "Khan Academy": "UC4a-Gbdw7vOaccHmFo40b9g",
+         "edX": "UCEBb1b_L6zDS3xTUrIALZOw",
+         "Coursera": "UC58aowNEXHHnflR_5YTtP4g",
+     },
+     "teaching_keywords": {"tutorial", "lesson", "course", "how-to", "introduction", "basics"},
+     "non_teaching_keywords": {"fun", "experiment", "joke", "prank", "vlog"},
+     "max_results": 10,  # Maximum number of videos fetched from the YouTube API
+     "min_view_count": 10000  # Minimum view count for relevance
+ }
+ ```
+
+ ---
+
+ ## Known Issues
+ 1. If no results are found or an error occurs during video fetching, the app displays an error message in JSON format.
+ 2. Ensure that valid topics are entered; overly broad or unrelated topics may not yield meaningful results.
+
+ ---
+
+ ## Future Features
+ 1. Multilingual Support (Future):
+ • Add support for transcription in other languages (e.g., Spanish, French).
+
+ 2. Interactive Q&A (Future):
+ • Allow users to ask questions about analyzed video content.
+
+ ---
+
+ ## 🛠️ Technology Stack
+
+ | Task | Technology |
+ | -------- | ------- |
+ | Video Retrieval | YouTube Data API, google-api-python-client |
+ | Transcription | yt-dlp, OpenAI Whisper |
+ | Summarization | Gemini AI, LangChain |
+ | Content Generation | Gemini AI, LangChain |
+ | Vectorization | sentence-transformers (all-MiniLM-L6-v2) |
+ | Vector Database | Pinecone |
+
+
+ ---
+ ## 📌 Contributors
+ • Asif Khan – Developer and Project Lead
+ • Kade Thomas – Summarization Specialist
+ • Amit Gaikwad – Vector Database Specialist
+ • Simranpreet Saini – AI Agent Specialist
+ • Jason Brooks – Documentation Specialist
+
+ ---
+ ## 🙏 Acknowledgements
+ - Special thanks to Firas Obeid for being an advisor on the project
+ - Special thanks to OpenAI, Hugging Face, and the YouTube, Gemini, and Tavily APIs for providing the tools that made this project possible. 🚀
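The `output/<video_id>.txt` naming convention described above can be sketched with a small helper. This function is illustrative and not part of the repository; it assumes the `https://youtu.be/<id>` share URLs that `fetch_videos` produces.

```python
from pathlib import Path

def transcript_path(video_url: str, output_dir: str = "output") -> Path:
    """Derive the transcript file path from a youtu.be share URL.

    Hypothetical helper: the repo hard-codes this convention inside
    transcribe_videos.py rather than exposing a function for it.
    """
    # The video ID is the last path segment of the share URL
    video_id = video_url.rstrip("/").rsplit("/", 1)[-1]
    return Path(output_dir) / f"{video_id}.txt"
```

For example, `transcript_path("https://youtu.be/ukzFI9rgwfU")` yields `output/ukzFI9rgwfU.txt`, matching the example file name in the Usage section.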
YouTubeAgent.py ADDED
@@ -0,0 +1,169 @@
+ """
+ # YouTube Content Idea Generator Module
+
+ This module leverages Google's Gemini AI to generate structured content ideas for YouTube videos
+ based on provided summaries and key points.
+
+ ## Summary
+ - Uses Gemini AI model for content generation
+ - Creates detailed video proposals including:
+   - Title and hook
+   - Main talking points
+   - Video structure
+   - Thumbnail concepts
+   - Target audience
+   - SEO keywords
+ - Formats output with clear section separation
+
+ ## Dependencies
+
+ ### System Requirements
+ - Python 3.8+
+ - Internet connection for API calls
+
+ ### Package Dependencies
+ 1. **langchain-google-genai**
+    - Install: `pip install langchain-google-genai`
+    - Purpose: Interface with Gemini AI model
+
+ 2. **langchain-community**
+    - Install: `pip install langchain-community`
+    - Purpose: Access to Tavily search tools
+
+ 3. **python-dotenv**
+    - Install: `pip install python-dotenv`
+    - Purpose: Load environment variables
+
+ ### Project Dependencies
+ 1. **keys1.env file**
+    - Must contain:
+      - GEMINI_API_KEY
+      - TAVILY_API_KEY
+    - Format:
+      ```
+      GEMINI_API_KEY=your_gemini_api_key
+      TAVILY_API_KEY=your_tavily_api_key
+      ```
+
+ 2. **Input Requirements**
+    - Dictionary containing:
+      - summary: Text summarizing content
+      - keypoints: List of key points
+
+ ## Functions
+ generateidea(input)
+ - Args: Dictionary with 'summary' and 'keypoints'
+ - Returns: Formatted string containing structured content idea
+ - Error Returns: Error message if generation fails
+
+ ## Returns
+ Structured string containing:
+ 1. Title
+ 2. Description/Hook
+ 3. Main Talking Points
+ 4. Video Structure
+ 5. Thumbnail Concepts
+ 6. Target Audience
+ 7. Estimated Length
+ 8. SEO Keywords
+
+ ## Error Handling
+ - Returns error message if:
+   - API keys are missing
+   - API calls fail
+   - Response formatting fails
+ """
+
+
+ from langchain_google_genai import ChatGoogleGenerativeAI
+ from langchain_community.tools.tavily_search import TavilySearchResults
+ from dotenv import load_dotenv, find_dotenv
+ import os
+ from langchain.agents import initialize_agent
+ from langchain_community.agent_toolkits.load_tools import load_tools
+
+ # Load environment variables
+ load_dotenv(find_dotenv('keys1.env'))
+
+ # Set the model name and API keys
+ GEMINI_MODEL = "gemini-1.5-flash"
+ GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
+ os.environ["TAVILY_API_KEY"] = os.getenv("TAVILY_API_KEY")
+
+ def generateidea(input):
+     """Generate content ideas based on summary and key points."""
+     try:
+         # Initialize the model with higher temperature for creativity
+         llm = ChatGoogleGenerativeAI(
+             google_api_key=GEMINI_API_KEY,
+             model=GEMINI_MODEL,
+             temperature=0.7,
+             top_p=0.9,
+             max_output_tokens=2048  # Ensure longer output
+         )
+
+         # Create a specific prompt template
+         prompt = f"""
+         Based on this content:
+         Summary: {input["summary"]}
+         Key Points: {input["keypoints"]}
+
+         Generate a detailed YouTube video idea using exactly this format:
+
+         1. **Title:**
+         [Create an attention-grabbing, SEO-friendly title]
+
+         2. **Description/Hook:**
+         [Write 2-3 compelling sentences that hook viewers]
+
+         3. **Main Talking Points:**
+         • [Main point 1]
+         • [Main point 2]
+         • [Main point 3]
+         • [Main point 4]
+         • [Main point 5]
+
+         4. **Suggested Video Structure:**
+         • [00:00-02:00] Introduction
+         • [02:00-05:00] First Topic
+         • [05:00-08:00] Second Topic
+         • [08:00-12:00] Third Topic
+         • [12:00-15:00] Examples and Applications
+         • [15:00-17:00] Conclusion
+
+         5. **Potential Thumbnail Concepts:**
+         • [Thumbnail idea 1]
+         • [Thumbnail idea 2]
+         • [Thumbnail idea 3]
+
+         6. **Target Audience:**
+         [Describe ideal viewer demographic and background]
+
+         7. **Estimated Video Length:**
+         [Specify length in minutes]
+
+         8. **Keywords for SEO:**
+         [List 8-10 relevant keywords separated by commas]
+
+         Ensure each section is detailed and properly formatted.
+         """
+
+         # Generate response directly with LLM
+         response = llm.predict(prompt)
+
+         # Format the response: add blank lines before each numbered section header
+         formatted_response = response.replace("1. **", "\n\n1. **")
+         formatted_response = formatted_response.replace("2. **", "\n\n2. **")
+         formatted_response = formatted_response.replace("3. **", "\n\n3. **")
+         formatted_response = formatted_response.replace("4. **", "\n\n4. **")
+         formatted_response = formatted_response.replace("5. **", "\n\n5. **")
+         formatted_response = formatted_response.replace("6. **", "\n\n6. **")
+         formatted_response = formatted_response.replace("7. **", "\n\n7. **")
+         formatted_response = formatted_response.replace("8. **", "\n\n8. **")
+
+         return formatted_response.strip()
+
+     except Exception as e:
+         return f"Error generating content idea: {str(e)}"
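The chain of `replace` calls at the end of `generateidea` can be collapsed into a loop. A minimal standalone sketch (the helper name is illustrative, not part of the module):

```python
def format_sections(response: str, n_sections: int = 8) -> str:
    """Insert a blank line before each numbered bold section header,
    then strip leading/trailing whitespace — same effect as the
    repeated .replace() calls in generateidea."""
    for i in range(1, n_sections + 1):
        response = response.replace(f"{i}. **", f"\n\n{i}. **")
    return response.strip()
```

The loop behaves identically for any number of sections, which makes it easy to adjust if the prompt template grows beyond eight headings.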
app.py ADDED
@@ -0,0 +1,222 @@
+ """
+ # Main Application Module (Gradio Interface)
+
+ This module provides the web interface and core functionality for the VidInsight AI application,
+ integrating video fetching, transcription, summarization, and content idea generation.
+
+ ## Summary
+ - Creates a Gradio web interface
+ - Processes user topic input
+ - Coordinates video fetching and transcription
+ - Generates summaries and content ideas
+ - Displays results in a formatted JSON output
+
+ ## Dependencies
+
+ ### System Requirements
+ - Python 3.8+
+ - Internet connection for API calls
+ - FFmpeg for audio processing
+
+ ### Package Dependencies
+ 1. **gradio==3.50.2**
+    - Install: `pip install gradio`
+    - Purpose: Web interface creation
+
+ 2. **Other Project Packages**
+    - fetch_youtube_videos
+    - transcribe_videos
+    - summary
+    - YouTubeAgent
+
+ ### Project Dependencies
+ 1. **Local Modules**
+    - fetch_youtube_videos.py: For YouTube video retrieval
+    - transcribe_videos.py: For video transcription
+    - summary.py: For generating summaries
+    - YouTubeAgent.py: For content idea generation
+
+ 2. **Output Directory**
+    - 'output/' folder for saving transcriptions
+
+ ## Functions
+
+ 1. format_results(results)
+    - Formats view counts with commas
+    - Cleans transcript preview text
+
+ 2. analyze(topic)
+    - Main processing function
+    - Coordinates all operations:
+      - Video fetching
+      - Transcription
+      - Summary generation
+      - Content idea creation
+
+ ## Returns
+ JSON output containing:
+ 1. Video Information
+    - Title
+    - Channel
+    - Views
+    - Transcript preview
+    - File paths
+ 2. Analysis
+    - Topic title
+    - Summary
+    - Key points
+    - Content ideas
+
+ ## Error Handling
+ - Empty topic validation
+ - Video fetching errors
+ - Transcription failures
+ - Analysis generation issues
+
+ """
+
+
+ import gradio as gr
+ from fetch_youtube_videos import fetch_videos
+ from transcribe_videos import transcribe_and_save
+ from summary import generate_combined_summary_and_key_points
+ from YouTubeAgent import generateidea
+ from embeddings import mainApp
+
+ def format_results(results):
+     """Format results for better display"""
+     if isinstance(results, list):
+         for result in results:
+             if 'Views' in result:
+                 result['Views'] = f"{result['Views']:,}"  # Format numbers with commas
+             if 'Transcript Preview' in result:
+                 result['Transcript Preview'] = result['Transcript Preview'].replace('\n', ' ')
+     return results
+
+ def analyze(topic):
+     """
+     Fetch videos, transcribe them, and generate analysis including summaries and content ideas.
+     """
+     if not topic.strip():
+         return {"error": "⚠️ Please enter a topic to analyze"}
+
+     try:
+         # Fetch videos based on topic
+         videos = fetch_videos(topic)
+
+         if isinstance(videos, str):
+             return {"error": f"⚠️ {videos}"}
+
+         if not videos:
+             return {"error": "⚠️ No relevant videos found for this topic."}
+
+         results = []
+         transcriptions = []  # Store transcriptions for summary generation
+
+         # Process each video
+         for video in videos:
+             transcription_result = transcribe_and_save(video['url'])
+
+             if "error" in transcription_result:
+                 results.append({
+                     'Video': video['title'],
+                     'Channel': video['channel'],
+                     'Views': video['views'],
+                     'Transcript Preview': transcription_result["error"]
+                 })
+             else:
+                 results.append({
+                     'Video': video['title'],
+                     'Channel': video['channel'],
+                     'Views': video['views'],
+                     'Transcript Preview': transcription_result["transcription"][:500] + "...",
+                     'Transcript File': transcription_result["file_path"]
+                 })
+                 # Add transcription for summary generation
+                 transcriptions.append(transcription_result["transcription"])
+
+         # Generate summary and content ideas if transcriptions exist
+         if transcriptions:
+
+             mainApp(topic)
+
+             topic_title, summary, key_points = generate_combined_summary_and_key_points(transcriptions)
+
+             # Generate content idea
+             input_for_idea = {
+                 "summary": summary,
+                 "keypoints": key_points
+             }
+             content_idea = generateidea(input_for_idea)
+
+             # Add analysis to results
+             results.append({
+                 "Analysis": {
+                     "Topic Title": topic_title,
+                     "Summary": summary,
+                     "Key Points": key_points,
+                     "Content Idea": content_idea
+                 }
+             })
+
+         return format_results(results)
+
+     except Exception as e:
+         return {"error": f"⚠️ An unexpected error occurred: {str(e)}"}
+
+ # Create Gradio interface with improved styling
+ with gr.Blocks(theme=gr.themes.Soft()) as app:
+     gr.Markdown(
+         """
+         # 🎥 VidInsight AI
+         ### AI-Powered YouTube Content Analyzer
+
+         This tool helps you:
+         - 📝 Get transcriptions of educational videos
+         - 📊 Generate summaries and key points
+         - 💡 Create content ideas
+         """
+     )
+
+     with gr.Row():
+         with gr.Column(scale=2):
+             topic_input = gr.Textbox(
+                 label="Enter Topic",
+                 placeholder="e.g., Machine Learning, Data Science, Python Programming",
+                 lines=2
+             )
+
+         with gr.Column(scale=1):
+             submit_btn = gr.Button("🔍 Analyze", variant="primary")
+             clear_btn = gr.Button("🗑️ Clear")
+
+     with gr.Row():
+         output = gr.JSON(
+             label="Analysis Results",
+             show_label=True
+         )
+
+     # Add footer
+     gr.Markdown(
+         """
+         ---
+         📌 **Note**: This tool analyzes educational YouTube videos and generates AI-powered insights.
+
+         Made by VidInsight Team 🤖
+         """
+     )
+
+     # Set up button actions
+     submit_btn.click(
+         fn=analyze,
+         inputs=topic_input,
+         outputs=output,
+         api_name="analyze"
+     )
+     clear_btn.click(lambda: None, None, topic_input, queue=False)
+
+ if __name__ == "__main__":
+     app.launch()
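The display formatting inside `format_results` can be exercised on a single record. A minimal standalone sketch (the per-record helper is hypothetical; the module applies the same logic across a list):

```python
def format_result(result: dict) -> dict:
    """Format one result record for display: comma-group the view
    count and flatten newlines in the transcript preview."""
    if "Views" in result:
        result["Views"] = f"{result['Views']:,}"
    if "Transcript Preview" in result:
        result["Transcript Preview"] = result["Transcript Preview"].replace("\n", " ")
    return result
```

Note the helper mutates the record in place, as `format_results` does; callers that need the raw view count should keep a copy before formatting.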
config.py ADDED
@@ -0,0 +1,88 @@
+ """
+ # Configuration Module for VidInsight AI
+
+ This module manages configuration settings and environment variables for the VidInsight AI project.
+
+ ## Summary
+ - Loads API keys from environment file
+ - Defines filtering criteria for YouTube video search
+ - Configures trusted educational channels
+ - Sets up keyword-based content filtering
+ - Establishes quality thresholds (views, duration)
+
+ ## Dependencies
+
+ ### System Requirements
+ - Python 3.8+
+ - Read access to environment file location
+
+ ### Package Dependencies
+ 1. **python-dotenv**
+    - Install: `pip install python-dotenv`
+    - Purpose: Load environment variables from file
+
+ ### Project Dependencies
+ 1. **keys1.env file**
+    - Must contain:
+      - YOUTUBE_API_KEY
+    - Format:
+      ```
+      YOUTUBE_API_KEY=your_youtube_api_key_here
+      ```
+    - Location: Project root directory
+
+ ## Configuration Parameters
+
+ ### Video Search Settings
+ - videoDuration: "medium" (4-20 minutes)
+ - order: "relevance"
+ - max_results: 10 videos per search
+ - min_view_count: 10,000 views threshold
+
+ ### Content Filtering
+ 1. Trusted Channels (Whitelist):
+    - Khan Academy
+    - edX
+    - Coursera
+
+ 2. Keyword Filters:
+    - Teaching Keywords (Positive):
+      {tutorial, lesson, course, how-to, introduction, basics}
+    - Non-Teaching Keywords (Negative):
+      {fun, experiment, joke, prank, vlog}
+
+ ## Notes
+ - Keep keys1.env secure and never commit to version control
+ - Adjust filter criteria as needed for different use cases
+ - Channel IDs must be exact matches for trusted channel filtering
+
+ """
+
+
+
+ # dependencies
+ from dotenv import load_dotenv, find_dotenv
+ import os
+
+ # Load environment variables from .env file
+ load_dotenv(find_dotenv('keys1.env'))
+
+ # YouTube API Configuration
+ YOUTUBE_API_KEY = os.getenv("YOUTUBE_API_KEY")
+
+ # Content Filter Settings
+ FILTER_CONFIG = {
+     "videoDuration": "medium",  # Focus on videos between 4 and 20 minutes
+     "order": "relevance",  # Sort by relevance
+     # Trusted Channels: Only videos from these channels will bypass keyword filters
+     "trusted_channels": {
+         "Khan Academy": "UC4a-Gbdw7vOaccHmFo40b9g",
+         "edX": "UCEBb1b_L6zDS3xTUrIALZOw",
+         "Coursera": "UC58aowNEXHHnflR_5YTtP4g",
+     },
+     "teaching_keywords": {"tutorial", "lesson", "course", "how-to", "introduction", "basics"},  # Videos containing these words are prioritized
+     "non_teaching_keywords": {"fun", "experiment", "joke", "prank", "vlog"},  # Videos containing these words are deprioritized or ignored
+     # "blocked_keywords": {"fun", "experiment", "joke", "prank", "vlog"},
+     "max_results": 10,  # Limit search results to 10 videos
+     "min_view_count": 10000  # Minimum view count for relevance
+ }
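The keyword sets above drive a simple set-intersection score in `fetch_youtube_videos.py`: a video passes when its teaching score beats its noise score (or it comes from a trusted channel). That scoring in isolation, as a minimal sketch (the function name is illustrative):

```python
# The two keyword sets from FILTER_CONFIG
TEACHING = {"tutorial", "lesson", "course", "how-to", "introduction", "basics"}
NON_TEACHING = {"fun", "experiment", "joke", "prank", "vlog"}

def keyword_scores(title: str, description: str) -> tuple:
    """Count how many teaching vs non-teaching keywords appear in the
    combined title + description word set (case-insensitive)."""
    words = set(title.lower().split() + description.lower().split())
    return len(words & TEACHING), len(words & NON_TEACHING)
```

Because matching is on whitespace-split tokens, a keyword adjacent to punctuation (e.g. "tutorial:") will not match; the module shares this limitation.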
dbcone.py ADDED
@@ -0,0 +1,39 @@
+ from pinecone import Pinecone, ServerlessSpec
+ import time
+ import os
+
+ pc_database = None
+
+
+ def getDatabase():
+     pine_cone_key = os.getenv("PINECONE_API_KEY")
+
+     global pc_database
+
+     if pc_database is None:
+         pc_database = Pinecone(api_key=pine_cone_key)
+
+     return pc_database
+
+ def getDatabaseIndex(index_name):
+
+     local_db = getDatabase()
+
+     if not local_db.has_index(index_name):
+         local_db.create_index(
+             name=index_name,
+             dimension=384,  # Replace with your model dimensions
+             metric="cosine",  # Replace with your model metric
+             spec=ServerlessSpec(
+                 cloud="aws",
+                 region="us-east-1"
+             )
+         )
+
+     # Wait until the index reports ready before returning it
+     while not local_db.describe_index(index_name).status['ready']:
+         time.sleep(1)
+
+     index = local_db.Index(index_name)
+     return index
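`getDatabase` uses a lazy module-level singleton so the Pinecone client is constructed once and reused across calls. The pattern in isolation, as a generic sketch with no Pinecone dependency (the names here are illustrative):

```python
_client = None

def get_client(factory):
    """Create the client on first call via factory(), then reuse the
    cached instance — the same pattern getDatabase follows."""
    global _client
    if _client is None:
        _client = factory()
    return _client
```

One caveat of this pattern: the cached client never refreshes, so a change to `PINECONE_API_KEY` after the first call has no effect until the process restarts.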
embeddings.py ADDED
@@ -0,0 +1,173 @@
+ from sentence_transformers import SentenceTransformer
+ from dotenv import load_dotenv, find_dotenv
+ from dbcone import getDatabase
+ from dbcone import getDatabaseIndex
+ import os
+ import uuid
+ import pandas as pd
+ import numpy as np
+ from pathlib import Path
+ from summary import generate_combined_summary_and_key_points
+
+ sentence_model = None
+ inputDir = None
+ outputDir = None
+ topic = None
+ db_index_name = None
+ db_namespace_name = None
+
+
+ def initialize_model():
+     global sentence_model
+     sentence_model = SentenceTransformer('all-MiniLM-L6-v2')
+
+ def get_model():
+     if sentence_model is None:
+         initialize_model()
+     return sentence_model
+
+ def get_sentence_embedding(sentence):
+     model = get_model()
+     return model.encode(sentence)
+
+ def getOutputDir(outputDirectory):
+
+     outputDir = Path(outputDirectory)
+
+     if not os.path.exists(outputDir):
+         os.makedirs(outputDir)
+     return outputDir
+
+ def read_files(inputDirectory, outputDirectory, topic=None):
+
+     inputDir = Path(inputDirectory)
+
+     embeded_lst = []
+
+     if (not os.path.exists(inputDir)) or (not os.path.isdir(inputDir)):
+         return embeded_lst
+
+     files = os.listdir(inputDir)
+
+     if topic is None:
+         topic = os.path.basename(inputDir)
+
+     if len(files) <= 0:
+         return embeded_lst
+
+     outputDir = getOutputDir(outputDirectory)
+
+     for file in files:
+         if file.endswith(".txt"):
+             file_path = os.path.join(inputDir, file)
+
+             if os.path.isfile(file_path):
+
+                 # The with-block closes the file; no explicit close needed
+                 with open(file_path, 'r') as f:
+                     text = f.read()
+                 embedding = get_sentence_embedding(text)
+
+                 # Move the transcript to the processed folder (or drop it if already there)
+                 if not os.path.isfile(os.path.join(outputDir, file)):
+                     os.rename(file_path, os.path.join(outputDir, file))
+                 else:
+                     os.remove(file_path)
+
+                 (topic_gen, summary, keypoints) = generate_combined_summary_and_key_points(text)
+
+                 if topic_gen is not None:
+                     topic += " - " + topic_gen
+
+                 embeded_lst.append(
+                     {
+                         "id": str(uuid.uuid4().hex),
+                         "metadata": {
+                             'text': text,
+                             "topic": topic,
+                             "summary": summary,
+                             "keypoints": keypoints
+                         },
+                         "values": embedding.tolist()
+                     }
+                 )
+
+     return embeded_lst
+
+ def save_to_database(embeded_lst, index_name='test_videos', namespace="sample-namespace"):
+
+     if len(embeded_lst) > 0:
+         db_index = getDatabaseIndex(index_name)
+
+         db_index.upsert(
+             vectors=embeded_lst,
+             namespace=namespace
+         )
+
+
+ def embed_text_files(inputDir, outputDir, topic):
+
+     return read_files(inputDirectory=inputDir, outputDirectory=outputDir, topic=topic)
+
+ def configureApp(given_topic):
+
+     global inputDir, outputDir, topic, db_index_name, db_namespace_name
+
+     currPath = Path.cwd()
+
+     inputDir = os.path.join(currPath, 'output')
+     outputDir = os.path.join(currPath, 'processed')
+
+     topic = given_topic
+     db_index_name = 'samplevideos'
+     db_namespace_name = "video-namespace"
+
+     load_dotenv(find_dotenv('keys1.env'))
+     initialize_model()
+     getDatabase()
+
+     return True
+
+ def fetch_from_database(search_text, topics=[], top_k=5, index_name='test-videos', namespace="sample-namespace"):
+
+     db_index = getDatabaseIndex(index_name)
+
+     results = db_index.query(
+         namespace=namespace,
+         vector=np.array(get_sentence_embedding(search_text)).tolist(),
+         top_k=top_k,
+         include_values=True,
+         include_metadata=True,
+         filter={
+             "topic": {"$in": topics},
+         }
+     )
+
+     return results
+
+ def captureData():
+
+     global inputDir, outputDir, topic, db_index_name, db_namespace_name
+
+     embeded_lst = embed_text_files(inputDir, outputDir, topic)
+
+     save_to_database(embeded_lst, index_name=db_index_name, namespace=db_namespace_name)
+
+
+ def queryRepository(search_text, topic):
+
+     global db_index_name, db_namespace_name
+
+     result = fetch_from_database(search_text, topics=[topic], index_name=db_index_name, namespace=db_namespace_name)
+
+     print(f'Results: {result}')
+
+
+ def mainApp(topic):
+
+     configureApp(topic)
+     captureData()
+
+
+ if __name__ == "__main__":
+     mainApp("machine learning")  # mainApp requires a topic; the original called it with no argument
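The record layout that `read_files` upserts to Pinecone can be shown in isolation. A minimal sketch (the helper name is illustrative; the embedding is passed as a plain list instead of a SentenceTransformer output):

```python
import uuid

def build_record(text, topic, summary, keypoints, embedding):
    """Assemble one upsert record in the id/metadata/values shape
    that read_files produces for Pinecone."""
    return {
        "id": uuid.uuid4().hex,  # unique vector id
        "metadata": {
            "text": text,
            "topic": topic,
            "summary": summary,
            "keypoints": keypoints,
        },
        "values": list(embedding),  # 384-dim for all-MiniLM-L6-v2
    }
```

Keeping the full transcript text in `metadata` makes query results self-describing, at the cost of larger metadata payloads per vector.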
fetch_youtube_videos.py ADDED
@@ -0,0 +1,115 @@
1
+ """
2
+ # YouTube Video Fetcher Module
3
+
4
+ This module is responsible for searching, filtering, and retrieving educational YouTube videos based on user queries.
5
+
6
+ ## Summary
7
+ - Fetches videos using YouTube Data API v3
8
+ - Filters videos based on:
9
+ - Duration (medium length: 4-20 minutes)
10
+ - View count (minimum threshold)
11
+ - Teaching vs non-teaching keywords
12
+ - Trusted educational channels
13
+ - Returns top 3 most relevant videos sorted by view count
14
+
15
+ ## Dependencies
16
+
17
+ ### System Requirements
18
+ - Python 3.8+
19
+ - Internet connection for API calls
20
+
21
+ ### Package Dependencies
22
+ - google-api-python-client==2.104.0
23
+ Install: `pip install google-api-python-client`
24
+
25
+ ### Project Dependencies
26
+ 1. config.py
27
+ - Provides YOUTUBE_API_KEY
28
+    - Contains FILTER_CONFIG dictionary with:
+      - videoDuration
+      - order
+      - trusted_channels
+      - teaching_keywords
+      - non_teaching_keywords
+      - max_results
+      - min_view_count
+
+ 2. Environment Setup
+    - keys1.env file with YouTube API key
+    - YouTube Data API access enabled in Google Cloud Console
+
+ ## Returns
+ - List of dictionaries, each containing:
+   - title: Video title
+   - url: YouTube video URL
+   - channel: Channel name
+   - views: View count
+ - Or an error message string if the fetch fails
+
+ ## Error Handling
+ - Returns an error message if:
+   - API key is invalid
+   - API quota is exceeded
+   - Network connection fails
+   - YouTube API request fails
+ """
+
+ # Import dependencies
+ from googleapiclient.discovery import build
+ from config import YOUTUBE_API_KEY, FILTER_CONFIG
+
+ def fetch_videos(topic):
+     """Fetch relevant YouTube videos based on topic and filter criteria."""
+     try:
+         youtube = build('youtube', 'v3', developerKey=YOUTUBE_API_KEY)
+
+         # Fetch candidate videos from the YouTube Data API
+         search_response = youtube.search().list(
+             q=topic,
+             part="snippet",
+             type="video",
+             maxResults=FILTER_CONFIG["max_results"],  # limit results at the API level
+             videoDuration=FILTER_CONFIG["videoDuration"],
+             order=FILTER_CONFIG["order"]
+         ).execute()
+
+         # Process search results
+         videos = []
+         for item in search_response.get('items', []):
+             video_id = item['id']['videoId']
+             title = item['snippet']['title'].lower()
+             description = item['snippet']['description'].lower()
+             channel_id = item['snippet']['channelId']
+
+             # Fetch video statistics (view count)
+             stats_response = youtube.videos().list(
+                 part="statistics",
+                 id=video_id
+             ).execute()
+
+             stats = stats_response.get('items', [{}])[0].get('statistics', {})
+             view_count = int(stats.get("viewCount", 0))
+
+             # Apply filters: minimum views, keywords, trusted channels
+             if view_count < FILTER_CONFIG["min_view_count"]:
+                 continue
+
+             words = set(title.split() + description.split())
+             teaching_score = len(words & FILTER_CONFIG["teaching_keywords"])
+             noise_score = len(words & FILTER_CONFIG["non_teaching_keywords"])
+
+             is_trusted_channel = channel_id in FILTER_CONFIG["trusted_channels"].values()
+
+             if teaching_score > noise_score or is_trusted_channel:
+                 videos.append({
+                     'title': item['snippet']['title'],
+                     'url': f'https://youtu.be/{video_id}',
+                     'channel': item['snippet']['channelTitle'],
+                     'views': view_count,
+                 })
+
+         # Sort by views (descending) and return the top 3 videos
+         return sorted(videos, key=lambda x: x['views'], reverse=True)[:3]
+
+     except Exception as e:
+         return f"Error fetching videos: {str(e)}"
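The teaching-vs-noise keyword scoring used above can be exercised without any API access. Below is a minimal sketch of just that filter step, assuming a hypothetical `FILTER_CONFIG` with a few sample keywords (the real keyword sets live in `config.py`):

```python
# Standalone sketch of the keyword-scoring filter in fetch_videos.
# FILTER_CONFIG here is a stand-in for the real dictionary in config.py.
FILTER_CONFIG = {
    "teaching_keywords": {"tutorial", "course", "explained", "learn"},
    "non_teaching_keywords": {"reaction", "shorts", "meme"},
}

def passes_keyword_filter(title, description, is_trusted_channel=False):
    """Return True when teaching keywords outnumber noise keywords,
    or when the video comes from a trusted channel."""
    words = set(title.lower().split() + description.lower().split())
    teaching_score = len(words & FILTER_CONFIG["teaching_keywords"])
    noise_score = len(words & FILTER_CONFIG["non_teaching_keywords"])
    return teaching_score > noise_score or is_trusted_channel

print(passes_keyword_filter("Python tutorial for beginners", "Learn Python step by step"))  # True
print(passes_keyword_filter("My reaction video", "funny shorts compilation"))               # False
```

Note that the set intersection only matches whole words, so "tutorials" would not count toward `teaching_score`; that is a limitation of the scoring approach, not of this sketch.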
main.py ADDED
@@ -0,0 +1,73 @@
+ """
+ # Command Line Interface Module
+
+ This module provides a CLI alternative to the Gradio web interface for the VidInsight AI application.
+
+ ## Summary
+ - Offers command-line interaction for video analysis
+ - Processes videos sequentially
+ - Displays results directly in the terminal
+ - Serves as a debugging and testing tool
+
+ ## Dependencies
+
+ ### System Requirements
+ - Python 3.8+
+ - Internet connection for API calls
+ - FFmpeg for audio processing
+
+ ### Package Dependencies
+ No additional package installations required beyond project dependencies
+
+ ### Project Dependencies
+ 1. **Local Modules**
+    - fetch_youtube_videos.py: For YouTube video retrieval
+    - transcribe_videos.py: For video transcription
+
+ ## Functions
+ main()
+ - Gets user input for the topic
+ - Coordinates video fetching and transcription
+ - Displays results in the terminal
+
+ ## Usage Example
+ python main.py
+ Enter topic to analyze: Machine Learning
+
+ ## Returns
+ Terminal output containing:
+ 1. Video Information
+    - Title
+    - URL
+ 2. Transcription Status
+    - Success/failure messages
+    - Transcription text or error
+
+ ## Error Handling
+ - Video fetching errors
+ - Transcription failures
+ - Invalid input handling
+ """
+
+ from fetch_youtube_videos import fetch_videos
+ from transcribe_videos import transcribe_and_save
+
+ def main():
+     topic = input("Enter topic to analyze: ")
+     print("\nFetching videos...")
+
+     videos = fetch_videos(topic)
+     if isinstance(videos, str):  # fetch_videos returns an error string on failure
+         print(f"Error: {videos}")
+         return
+
+     for idx, video in enumerate(videos, 1):
+         print(f"\nVideo {idx}: {video['title']}")
+         print(f"URL: {video['url']}")
+         print("Transcribing...")
+         result = transcribe_and_save(video['url'])
+         print(result.get("error") or result.get("transcription", ""))
+
+ if __name__ == "__main__":
+     main()
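The CLI relies on a loose convention: `fetch_videos` returns either a list of video dicts or a plain error string, so the caller type-checks the result. A small sketch of that dispatch (`handle_fetch_result` is a hypothetical helper, not part of the repo):

```python
# Sketch of main.py's error convention: a str result means failure,
# a list result means success.
def handle_fetch_result(result):
    if isinstance(result, str):          # error path
        return f"Error: {result}"
    return [v["title"] for v in result]  # success path: collect titles

print(handle_fetch_result("quota exceeded"))            # Error: quota exceeded
print(handle_fetch_result([{"title": "Intro to ML"}]))  # ['Intro to ML']
```

Returning a union of `str | list` works for a small project, though raising an exception or returning a `(videos, error)` pair would make the contract harder to misuse.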
requirements.txt ADDED
Binary file (16.7 kB).
 
summary.py ADDED
@@ -0,0 +1,122 @@
+ """
+ # Video Summary Generator Module
+
+ This module processes transcribed video content to generate a topic title, summary, and key points
+ using Google's Gemini AI model.
+
+ ## Summary
+ - Takes multiple video transcriptions as input
+ - Concatenates transcriptions for unified analysis
+ - Uses Gemini AI to generate:
+   - Relevant topic title
+   - Concise content summary
+   - Key points from the content
+ - Handles response parsing and error cases
+
+ ## Dependencies
+
+ ### System Requirements
+ - Python 3.8+
+ - Internet connection for API calls
+
+ ### Package Dependencies
+ 1. **langchain-google-genai**
+    - Install: `pip install langchain-google-genai`
+    - Purpose: Interface with the Gemini AI model
+
+ 2. **python-dotenv**
+    - Install: `pip install python-dotenv`
+    - Purpose: Load environment variables
+
+ ### Project Dependencies
+ 1. **keys1.env file**
+    - Must contain: GEMINI_API_KEY
+    - Format: GEMINI_API_KEY=your_api_key_here
+
+ 2. **Input Requirements**
+    - Transcription texts from processed videos
+    - Non-empty transcription content
+
+ ## Functions
+ generate_combined_summary_and_key_points(transcriptions)
+ - Args: List of transcription texts
+ - Returns: Tuple of (topic_title, summary, key_points)
+ - Error Returns: Error message strings with an empty key-points list if processing fails
+
+ ## Returns
+ Tuple containing:
+ 1. topic_title (str): Generated title for the content
+ 2. summary (str): Concise summary of all transcriptions
+ 3. key_points (list): List of main points extracted
+
+ ## Error Handling
+ - Returns error messages if:
+   - Transcriptions are empty
+   - Gemini API fails to respond
+   - Response parsing fails
+ """
+
+ import os
+ from dotenv import load_dotenv, find_dotenv
+ from langchain_google_genai import ChatGoogleGenerativeAI
+
+ def generate_combined_summary_and_key_points(transcriptions):
+     if not all(transcriptions):
+         return "Error: No transcription text provided.", "", []
+
+     # Concatenate the transcriptions into one string
+     concatenated_transcriptions = "\n".join(transcriptions)
+
+     prompt = f"""
+     The following are transcriptions of videos:
+     ---
+     {concatenated_transcriptions}
+     ---
+     Based on the content, generate a relevant topic title for the transcriptions.
+     Then, summarize the key insights and extract the main points from these transcriptions together.
+     Ignore sponsors and focus more on the details rather than the overall outline.
+     Format your response as:
+     Topic Title: [Generated topic title]
+
+     Summary:
+     [Concise summary of the transcriptions]
+
+     Key Points:
+     - [Key point 1]
+     - [Key point 2]
+     - [Key point 3]
+     """
+     # Load environment variables
+     load_dotenv(find_dotenv('keys1.env'))
+
+     # Get API key
+     GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
+     GEMINI_MODEL = "gemini-1.5-flash"
+
+     # Initialize the Gemini chat model
+     llm = ChatGoogleGenerativeAI(model=GEMINI_MODEL, api_key=GEMINI_API_KEY)
+
+     # Generate the response from the model (invoke replaces the deprecated predict)
+     response = llm.invoke(prompt).content
+
+     if not response:
+         return "Error: No response generated.", "", []
+
+     # Extract topic title, summary, and key points from the response
+     topic_title_start = response.find("Topic Title:")
+     summary_start = response.find("Summary:")
+     key_points_start = response.find("Key Points:")
+
+     if topic_title_start != -1 and summary_start != -1 and key_points_start != -1:
+         topic_title = response[topic_title_start + len("Topic Title:"): summary_start].strip()
+         summary = response[summary_start + len("Summary:"): key_points_start].strip()
+         key_points_str = response[key_points_start + len("Key Points:"):].strip()
+         key_points = [point.strip(" -") for point in key_points_str.split("\n")]
+     else:
+         topic_title = "Error: Unable to generate topic title."
+         summary = "Error: Unable to extract summary."
+         key_points = []
+
+     return topic_title, summary, key_points
+
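The section-parsing step at the end of `generate_combined_summary_and_key_points` can be tested on a canned model response rather than a live Gemini call. A minimal sketch (`parse_response` is a hypothetical extraction of that logic; unlike the module code, it also drops empty lines from the key-points list):

```python
# Sketch of the "Topic Title: / Summary: / Key Points:" parsing on a
# fixed sample string, with no model call involved.
def parse_response(response):
    topic_title_start = response.find("Topic Title:")
    summary_start = response.find("Summary:")
    key_points_start = response.find("Key Points:")
    if -1 in (topic_title_start, summary_start, key_points_start):
        return "Error: Unable to generate topic title.", "Error: Unable to extract summary.", []
    topic_title = response[topic_title_start + len("Topic Title:"):summary_start].strip()
    summary = response[summary_start + len("Summary:"):key_points_start].strip()
    key_points_str = response[key_points_start + len("Key Points:"):].strip()
    key_points = [p.strip(" -") for p in key_points_str.split("\n") if p.strip(" -")]
    return topic_title, summary, key_points

sample = """Topic Title: Intro to Neural Networks

Summary:
A walkthrough of perceptrons and backpropagation.

Key Points:
- Perceptrons are the building block
- Backpropagation updates weights"""

title, summary, points = parse_response(sample)
print(title)   # Intro to Neural Networks
print(points)  # ['Perceptrons are the building block', 'Backpropagation updates weights']
```

Since the parsing depends on the model echoing the exact section labels from the prompt, any drift in the response format falls through to the error branch.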
transcribe_videos.py ADDED
@@ -0,0 +1,124 @@
+ """
+ # Video Transcription Module
+
+ This module handles audio extraction and transcription of YouTube videos using Whisper AI.
+
+ ## Summary
+ - Resolves audio streams from YouTube videos using yt-dlp
+ - Transcribes audio using OpenAI's Whisper model
+ - Saves transcriptions as text files
+ - Handles various YouTube URL formats
+ - Provides error handling for failed downloads/transcriptions
+
+ ## Dependencies
+
+ ### System Requirements
+ 1. **FFmpeg**
+    - Windows: Install via Chocolatey `choco install ffmpeg`
+    - Mac: Install via Homebrew `brew install ffmpeg`
+    - Linux: `sudo apt-get install ffmpeg`
+ 2. Python 3.8+
+ 3. Sufficient disk space for temporary audio files
+
+ ### Package Dependencies
+ 1. **openai-whisper==20231106**
+    - Install: `pip install openai-whisper`
+    - Purpose: Audio transcription
+
+ 2. **yt-dlp==2023.11.16**
+    - Install: `pip install yt-dlp`
+    - Purpose: YouTube audio downloading
+
+ 3. **torch**
+    - Install: `pip install torch`
+    - Purpose: Required by Whisper for model operations
+
+ ### Project Dependencies
+ 1. **output/** directory
+    - Must exist or be creatable with current permissions
+    - Stores transcription text files
+
+ ## Functions
+ 1. extract_video_id(url)
+    - Extracts the YouTube video ID from various URL formats
+    - Handles both youtube.com and youtu.be URLs
+
+ 2. transcribe_and_save(url, output_dir="output")
+    - Resolves the audio stream
+    - Performs transcription
+    - Saves the result to a file
+    - Returns the file path and transcription text
+
+ ## Returns
+ Dictionary containing:
+ - file_path: Path to saved transcription
+ - transcription: Full transcription text
+ - error: Error message if transcription fails
+
+ ## Error Handling
+ - Returns an error dictionary if:
+   - Video URL is invalid
+   - Audio download fails
+   - Transcription fails
+   - File writing fails
+ """
+
+ # Import dependencies
+ import whisper
+ import yt_dlp
+ import os
+
+ # Load Whisper model once at import time
+ MODEL = whisper.load_model("base")
+
+ def extract_video_id(url):
+     """
+     Extracts the video ID from a YouTube URL.
+     Args:
+         url (str): YouTube video URL.
+     Returns:
+         str: Video ID.
+     """
+     if "v=" in url:
+         return url.split("v=")[-1].split("&")[0]
+     elif "youtu.be/" in url:
+         return url.split("youtu.be/")[-1].split("?")[0]
+     return "unknown_video_id"
+
+ def transcribe_and_save(url, output_dir="output"):
+     """
+     Transcribe audio from a YouTube video and save it to a file.
+     Args:
+         url (str): YouTube video URL.
+         output_dir (str): Directory to save the transcription.
+     Returns:
+         dict: Contains the file path and transcription text, or an error message.
+     """
+     try:
+         # Resolve the direct audio stream URL with yt-dlp (no local download)
+         with yt_dlp.YoutubeDL({'format': 'bestaudio'}) as ydl:
+             info = ydl.extract_info(url, download=False)
+             audio_url = info['url']
+
+         # Transcribe audio (Whisper hands the URL to ffmpeg, which streams it)
+         result = MODEL.transcribe(audio_url)
+         transcription = result['text']
+
+         # Create the output directory if it doesn't exist
+         os.makedirs(output_dir, exist_ok=True)
+
+         # Use the video ID as the file name
+         video_id = extract_video_id(url)
+         file_path = os.path.join(output_dir, f"{video_id}.txt")
+
+         # Save the transcription to a file
+         with open(file_path, "w", encoding="utf-8") as file:
+             file.write(transcription)
+
+         return {"file_path": file_path, "transcription": transcription}
+
+     except Exception as e:
+         return {"error": f"Transcription failed: {str(e)}"}
+
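Because the extracted video ID becomes the transcription's file name, URL parsing is worth a quick check. Below is a local sketch of `extract_video_id`, slightly hardened over the naive split so that trailing query parameters (`&t=42`, `?si=...`) do not leak into the file name:

```python
# Local variant of extract_video_id that strips trailing query parameters
# before the ID is used as a file name.
def extract_video_id(url):
    if "v=" in url:
        return url.split("v=")[-1].split("&")[0]
    elif "youtu.be/" in url:
        return url.split("youtu.be/")[-1].split("?")[0]
    return "unknown_video_id"

print(extract_video_id("https://www.youtube.com/watch?v=dQw4w9WgXcQ&t=42"))  # dQw4w9WgXcQ
print(extract_video_id("https://youtu.be/dQw4w9WgXcQ?si=abc"))               # dQw4w9WgXcQ
print(extract_video_id("https://example.com/clip"))                          # unknown_video_id
```

Without the extra split, a timestamped share link would produce a file such as `dQw4w9WgXcQ&t=42.txt`, which is invalid on Windows.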