gr8monk3ys committed on
Commit 3e00d6f · verified · 1 Parent(s): aba280a

Upload folder using huggingface_hub

Files changed (3)
  1. README.md +86 -6
  2. app.py +288 -0
  3. requirements.txt +2 -0
README.md CHANGED
@@ -1,12 +1,92 @@
  ---
- title: Ml Interview Prep
- emoji: 📈
- colorFrom: red
- colorTo: gray
+ title: ML Interview Prep
+ emoji: 🎯
+ colorFrom: yellow
+ colorTo: red
  sdk: gradio
- sdk_version: 6.5.1
+ sdk_version: 5.9.1
+ python_version: "3.10"
  app_file: app.py
  pinned: false
+ license: mit
+ short_description: Practice ML and Data Science interview questions
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # ML Interview Prep
+
+ An interactive tool for practicing machine learning and data science interview questions. Features 500+ curated questions across 10 categories with detailed expert answers.
+
+ ## Features
+
+ ### 500+ Interview Questions
+ Comprehensive coverage of ML/DS interview topics from top tech companies.
+
+ ### 10 Categories
+ - Statistics & Probability
+ - ML Theory & Algorithms
+ - Deep Learning
+ - Natural Language Processing
+ - Computer Vision
+ - System Design
+ - SQL & Databases
+ - Python Programming
+ - Feature Engineering
+ - A/B Testing & Experimentation
+
+ ### Three Difficulty Levels
+ - **Easy** - Fundamentals and basic concepts
+ - **Medium** - Applied knowledge and trade-offs
+ - **Hard** - Advanced topics and edge cases
+
+ ### Practice Modes
+
+ **Quiz Mode**
+ - Random questions based on your filters
+ - Try to answer before revealing the solution
+ - Track your progress
+
+ **Flashcard Mode**
+ - Quick review of key concepts
+ - Flip cards to see answers
+ - Great for last-minute prep
+
+ **Browse Mode**
+ - Search and filter all questions
+ - Study specific topics in depth
+
+ ### Company Tags
+ Questions tagged by company (Google, Meta, Amazon, etc.) so you can focus on company-specific prep.
+
+ ## How to Use
+
+ 1. **Select categories** you want to practice
+ 2. **Choose difficulty** level
+ 3. **Pick a mode** (Quiz, Flashcard, or Browse)
+ 4. **Start practicing!**
+
+ ## Question Sources
+
+ Questions are curated from:
+ - Real interview experiences shared online
+ - Common ML/DS interview patterns
+ - Academic fundamentals
+ - Industry best practices
+
+ ## Example Questions
+
+ **ML Theory (Medium):**
+ > "Explain the bias-variance tradeoff and how it affects model selection."
+
+ **Deep Learning (Hard):**
+ > "How would you handle class imbalance in a neural network for fraud detection?"
+
+ **System Design (Hard):**
+ > "Design a real-time recommendation system for a streaming platform."
+
+ ## License
+
+ MIT
+
+ ## Author
+
+ Built by [Lorenzo Scaturchio](https://huggingface.co/gr8monk3ys)
app.py ADDED
@@ -0,0 +1,288 @@
+ """
+ ML Interview Prep - Interactive practice for ML/DS interview questions.
+ """
+
+ import gradio as gr
+ import pandas as pd
+
+ # ---------------------------------------------------------------------------
+ # Sample Question Database
+ # ---------------------------------------------------------------------------
+
+ # Embedded sample questions (in production, load from dataset)
+ QUESTIONS = [
+     # Statistics
+     {"id": "1", "question": "Explain the difference between Type I and Type II errors.", "answer": "Type I error (false positive) occurs when we reject a true null hypothesis. Type II error (false negative) occurs when we fail to reject a false null hypothesis. In ML terms, Type I is like flagging a legitimate transaction as fraud, while Type II is missing actual fraud. The tradeoff between these errors is controlled by the significance level (alpha) and power (1-beta) of the test.", "category": "Statistics", "difficulty": "easy", "company_tags": "Google|Meta|Amazon", "topic_tags": "hypothesis testing|statistical inference"},
+     {"id": "2", "question": "What is the Central Limit Theorem and why is it important?", "answer": "The Central Limit Theorem states that the sampling distribution of the sample mean approaches a normal distribution as sample size increases, regardless of the population's distribution. This is crucial because it allows us to make inferences about population parameters using normal distribution properties, enables hypothesis testing and confidence intervals, and justifies many statistical methods even when the underlying data isn't normally distributed.", "category": "Statistics", "difficulty": "easy", "company_tags": "Google|Meta|Netflix", "topic_tags": "probability|distributions"},
+     {"id": "3", "question": "How would you handle multicollinearity in a regression model?", "answer": "Multicollinearity can be addressed by: 1) Removing highly correlated features based on VIF (Variance Inflation Factor) > 5-10, 2) Using regularization (Ridge/Lasso) which shrinks correlated coefficients, 3) PCA to create uncorrelated components, 4) Domain knowledge to select the most meaningful feature among correlated ones. The choice depends on whether you need interpretable coefficients (remove features) or just prediction (regularization).", "category": "Statistics", "difficulty": "medium", "company_tags": "Meta|Airbnb|Uber", "topic_tags": "regression|feature selection"},
+
+     # ML Theory
+     {"id": "4", "question": "Explain the bias-variance tradeoff.", "answer": "The bias-variance tradeoff describes the tension between two sources of prediction error. High bias means the model is too simple and underfits (e.g., linear regression on nonlinear data). High variance means the model is too complex and overfits (e.g., unpruned decision tree). Total error = Bias² + Variance + Irreducible noise. We balance this through model selection, regularization, and ensemble methods. Cross-validation helps find the sweet spot.", "category": "ML Theory", "difficulty": "easy", "company_tags": "Google|Amazon|Microsoft", "topic_tags": "fundamentals|model selection"},
+     {"id": "5", "question": "What's the difference between bagging and boosting?", "answer": "Bagging (Bootstrap Aggregating) trains models in parallel on random subsets of data, then averages predictions. It reduces variance (Random Forest). Boosting trains models sequentially, with each model focusing on errors of previous ones. It reduces bias (XGBoost, AdaBoost). Bagging works well when base models overfit; boosting works well when they underfit. Boosting is more prone to overfitting but often achieves higher accuracy with proper tuning.", "category": "ML Theory", "difficulty": "medium", "company_tags": "Google|Meta|Apple", "topic_tags": "ensemble|algorithms"},
+     {"id": "6", "question": "How does gradient descent work and what are its variants?", "answer": "Gradient descent iteratively updates parameters in the direction of steepest descent of the loss function: θ = θ - α∇L(θ). Variants include: Batch GD (uses all data, stable but slow), Stochastic GD (one sample, noisy but fast), Mini-batch GD (compromise). Advanced optimizers like Adam combine momentum (accumulates past gradients) and RMSprop (adaptive learning rates). Choice depends on dataset size, convergence needs, and computational resources.", "category": "ML Theory", "difficulty": "medium", "company_tags": "Google|DeepMind|OpenAI", "topic_tags": "optimization|training"},
+
+     # Deep Learning
+     {"id": "7", "question": "Explain the vanishing gradient problem and how to address it.", "answer": "Vanishing gradients occur when gradients become very small during backpropagation in deep networks, preventing early layers from learning. Causes include sigmoid/tanh activations (derivatives < 1). Solutions: 1) ReLU activation (gradient = 1 for positive inputs), 2) Batch/Layer normalization (stabilizes activations), 3) Residual connections (skip connections allow gradient flow), 4) Proper weight initialization (Xavier/He). Modern architectures like ResNet and Transformers incorporate these solutions.", "category": "Deep Learning", "difficulty": "medium", "company_tags": "Google|Meta|OpenAI", "topic_tags": "neural networks|training"},
+     {"id": "8", "question": "What is the attention mechanism and why is it important?", "answer": "Attention allows models to focus on relevant parts of input when producing output. It computes weighted combinations of values based on query-key similarity: Attention(Q,K,V) = softmax(QK^T/√d)V. Importance: 1) Captures long-range dependencies without sequential processing, 2) Provides interpretability through attention weights, 3) Enables parallelization (vs RNNs). Self-attention (Transformers) revolutionized NLP and is now used in vision (ViT) and other domains.", "category": "Deep Learning", "difficulty": "medium", "company_tags": "Google|OpenAI|Anthropic", "topic_tags": "transformers|architectures"},
+     {"id": "9", "question": "How would you handle class imbalance in deep learning?", "answer": "Strategies include: 1) Data-level: oversampling minority (SMOTE), undersampling majority, data augmentation, 2) Algorithm-level: class weights in loss function, focal loss (down-weights easy examples), threshold adjustment, 3) Ensemble: combine models trained on balanced subsets. For neural networks specifically: stratified batching, two-phase training (pretrain on balanced, fine-tune on original). Evaluation should use precision-recall curves and F1 rather than accuracy.", "category": "Deep Learning", "difficulty": "hard", "company_tags": "Amazon|PayPal|Stripe", "topic_tags": "imbalanced data|training"},
+
+     # NLP
+     {"id": "10", "question": "Explain word embeddings and their evolution.", "answer": "Word embeddings map words to dense vectors capturing semantic meaning. Evolution: 1) One-hot encoding (sparse, no semantics), 2) Word2Vec/GloVe (static embeddings from co-occurrence), 3) ELMo (contextualized via bidirectional LSTM), 4) BERT/GPT (contextualized via Transformers). Key insight: words with similar contexts have similar vectors. Modern embeddings are contextual (same word gets different vectors based on context) and can be fine-tuned for downstream tasks.", "category": "NLP", "difficulty": "medium", "company_tags": "Google|OpenAI|Meta", "topic_tags": "embeddings|representations"},
+     {"id": "11", "question": "What are the key differences between BERT and GPT?", "answer": "BERT (Bidirectional Encoder Representations from Transformers) uses masked language modeling and sees full context bidirectionally. Best for understanding tasks (classification, NER, QA). GPT (Generative Pre-trained Transformer) uses autoregressive language modeling, predicting next token left-to-right. Best for generation tasks. BERT is encoder-only, GPT is decoder-only. For tasks needing both understanding and generation, encoder-decoder (T5) or large autoregressive models (GPT-4) with in-context learning work well.", "category": "NLP", "difficulty": "medium", "company_tags": "Google|OpenAI|Microsoft", "topic_tags": "transformers|language models"},
+
+     # System Design
+     {"id": "12", "question": "Design a recommendation system for an e-commerce platform.", "answer": "Components: 1) Data collection: user behavior (clicks, purchases, time), item features, context (time, device), 2) Candidate generation: collaborative filtering (user-item matrix factorization), content-based (item similarity), 3) Ranking: ML model combining features (user, item, context) to predict engagement, 4) Serving: precompute for cold start, real-time for logged-in users, 5) Feedback loop: A/B testing, handling cold start (popular items, explore/exploit). Key tradeoffs: latency vs personalization, diversity vs relevance, short vs long-term engagement.", "category": "System Design", "difficulty": "hard", "company_tags": "Amazon|Netflix|Spotify", "topic_tags": "recommendations|architecture"},
+     {"id": "13", "question": "How would you design an ML pipeline for real-time fraud detection?", "answer": "Architecture: 1) Data ingestion: Kafka for streaming transactions, 2) Feature engineering: real-time features (velocity, device fingerprint) + batch features (historical patterns), 3) Model: ensemble of rules + ML (isolation forest, XGBoost) for sub-100ms latency, 4) Serving: feature store for consistency, model versioning, 5) Feedback: human review loop, delayed labels, continuous retraining. Key considerations: class imbalance, adversarial adaptation, explainability for disputes, cost of false positives/negatives.", "category": "System Design", "difficulty": "hard", "company_tags": "PayPal|Stripe|Square", "topic_tags": "fraud|real-time systems"},
+
+     # Feature Engineering
+     {"id": "14", "question": "What are the most important feature engineering techniques?", "answer": "Key techniques: 1) Numerical: scaling (StandardScaler, MinMax), log transform for skewed data, binning, polynomial features, 2) Categorical: one-hot encoding, target encoding (with smoothing to avoid leakage), frequency encoding, 3) Temporal: lag features, rolling statistics, cyclical encoding (sin/cos for hours), 4) Text: TF-IDF, embeddings, 5) Interactions: domain-driven feature combinations. The best technique depends on the algorithm (trees don't need scaling) and domain knowledge. Always validate with cross-validation.", "category": "Feature Engineering", "difficulty": "medium", "company_tags": "Airbnb|Uber|Meta", "topic_tags": "preprocessing|features"},
+
+     # A/B Testing
+     {"id": "15", "question": "How do you determine sample size for an A/B test?", "answer": "Sample size depends on: 1) Baseline conversion rate (p), 2) Minimum detectable effect (MDE), 3) Significance level (α, typically 0.05), 4) Power (1-β, typically 0.8). Formula: n = 2 * ((z_α/2 + z_β)² * p(1-p)) / MDE². Practical considerations: higher baseline = more power, smaller MDE needs more samples, account for multiple testing if many variants. Use power calculators. For ratio metrics, variance is harder to estimate—consider pre-experiment data analysis.", "category": "A/B Testing", "difficulty": "medium", "company_tags": "Google|Meta|Netflix", "topic_tags": "experimentation|statistics"},
+ ]
+
+ # Convert to DataFrame
+ questions_df = pd.DataFrame(QUESTIONS)
+
+ # ---------------------------------------------------------------------------
+ # Application State
+ # ---------------------------------------------------------------------------
+
+ class QuizState:
+     def __init__(self):
+         self.current_question = None
+         self.answered = 0
+         self.questions_seen = []
+
+ quiz_state = QuizState()
+
+ # ---------------------------------------------------------------------------
+ # Core Functions
+ # ---------------------------------------------------------------------------
+
+ def get_random_question(categories: list, difficulties: list) -> dict | None:
+     """Get a random question matching filters, or None if nothing matches."""
+     filtered = questions_df.copy()
+
+     if categories and "All" not in categories:
+         filtered = filtered[filtered["category"].isin(categories)]
+
+     if difficulties and "All" not in difficulties:
+         filtered = filtered[filtered["difficulty"].isin([d.lower() for d in difficulties])]
+
+     if filtered.empty:
+         return None
+
+     # Avoid repeating recent questions
+     available = filtered[~filtered["id"].isin(quiz_state.questions_seen[-10:])]
+     if available.empty:
+         available = filtered
+
+     question = available.sample(1).iloc[0].to_dict()
+     quiz_state.questions_seen.append(question["id"])
+     quiz_state.current_question = question
+
+     return question
+
+
+ def format_question(question: dict | None) -> str:
+     """Format question for display."""
+     if not question:
+         return "No questions match your filters. Try selecting different categories or difficulties."
+
+     companies = question.get("company_tags", "").replace("|", ", ")
+
+     output = f"""## {question['question']}
+
+ **Category:** {question['category']} | **Difficulty:** {question['difficulty'].title()}
+
+ **Common at:** {companies}
+ """
+     return output
+
+
+ def format_answer(question: dict | None) -> str:
+     """Format answer for display."""
+     if not question:
+         return ""
+
+     topics = question.get("topic_tags", "").replace("|", ", ")
+
+     output = f"""## Answer
+
+ {question['answer']}
+
+ ---
+
+ **Topics:** {topics}
+ """
+     return output
+
+
+ def start_quiz(categories: list, difficulties: list) -> tuple[str, str, str]:
+     """Start a new quiz question."""
+     question = get_random_question(categories, difficulties)
+     quiz_state.answered += 1
+
+     question_text = format_question(question)
+     status = f"Question #{quiz_state.answered}"
+
+     return question_text, "", status
+
+
+ def reveal_answer() -> str:
+     """Reveal the answer to current question."""
+     if quiz_state.current_question:
+         return format_answer(quiz_state.current_question)
+     return "No question loaded. Click 'Next Question' first."
+
+
+ def browse_questions(category: str, difficulty: str, search: str) -> str:
+     """Browse and filter all questions."""
+     filtered = questions_df.copy()
+
+     if category and category != "All":
+         filtered = filtered[filtered["category"] == category]
+
+     if difficulty and difficulty != "All":
+         filtered = filtered[filtered["difficulty"] == difficulty.lower()]
+
+     if search:
+         # Plain substring match; regex=False keeps characters like "?" literal.
+         # company_tags is included so searches like "Netflix" surface company-tagged questions.
+         mask = (
+             filtered["question"].str.contains(search, case=False, regex=False)
+             | filtered["answer"].str.contains(search, case=False, regex=False)
+             | filtered["company_tags"].str.contains(search, case=False, regex=False)
+         )
+         filtered = filtered[mask]
+
+     if filtered.empty:
+         return "No questions match your search."
+
+     output = f"## Found {len(filtered)} Questions\n\n"
+
+     for _, row in filtered.iterrows():
+         output += f"### {row['question']}\n\n"
+         output += f"**{row['category']}** | **{row['difficulty'].title()}**\n\n"
+         output += f"{row['answer']}\n\n"
+         output += "---\n\n"
+
+     return output
+
+
+ # ---------------------------------------------------------------------------
+ # Gradio Interface
+ # ---------------------------------------------------------------------------
+
+ CATEGORIES = ["All"] + sorted(questions_df["category"].unique().tolist())
+ DIFFICULTIES = ["All", "Easy", "Medium", "Hard"]
+
+ with gr.Blocks(title="ML Interview Prep", theme=gr.themes.Soft()) as demo:
+     gr.Markdown("""
+ # ML Interview Prep
+
+ Practice machine learning and data science interview questions.
+ Choose your categories and difficulty, then test your knowledge!
+ """)
+
+     with gr.Tabs():
+         # Quiz Mode Tab
+         with gr.TabItem("Quiz Mode"):
+             gr.Markdown("### Practice with random questions")
+
+             with gr.Row():
+                 category_select = gr.Dropdown(
+                     choices=CATEGORIES,
+                     value=["All"],
+                     multiselect=True,
+                     label="Categories",
+                 )
+                 difficulty_select = gr.Dropdown(
+                     choices=DIFFICULTIES,
+                     value=["All"],
+                     multiselect=True,
+                     label="Difficulties",
+                 )
+
+             with gr.Row():
+                 next_btn = gr.Button("Next Question", variant="primary")
+                 reveal_btn = gr.Button("Show Answer", variant="secondary")
+
+             status_text = gr.Textbox(label="Status", value="Click 'Next Question' to start")
+             question_output = gr.Markdown(label="Question")
+             answer_output = gr.Markdown(label="Answer")
+
+             next_btn.click(
+                 fn=start_quiz,
+                 inputs=[category_select, difficulty_select],
+                 outputs=[question_output, answer_output, status_text],
+             )
+
+             reveal_btn.click(
+                 fn=reveal_answer,
+                 inputs=[],
+                 outputs=answer_output,
+             )
+
+         # Browse Mode Tab
+         with gr.TabItem("Browse All"):
+             gr.Markdown("### Search and filter all questions")
+
+             with gr.Row():
+                 browse_category = gr.Dropdown(
+                     choices=CATEGORIES,
+                     value="All",
+                     label="Category",
+                 )
+                 browse_difficulty = gr.Dropdown(
+                     choices=DIFFICULTIES,
+                     value="All",
+                     label="Difficulty",
+                 )
+                 search_box = gr.Textbox(
+                     label="Search",
+                     placeholder="Search questions and answers...",
+                 )
+
+             search_btn = gr.Button("Search", variant="primary")
+             browse_output = gr.Markdown()
+
+             search_btn.click(
+                 fn=browse_questions,
+                 inputs=[browse_category, browse_difficulty, search_box],
+                 outputs=browse_output,
+             )
+
+         # Stats Tab
+         with gr.TabItem("About"):
+             stats = f"""
+ ### Question Database Statistics
+
+ - **Total Questions:** {len(questions_df)}
+ - **Categories:** {', '.join(questions_df['category'].unique())}
+ - **Difficulty Distribution:**
+   - Easy: {len(questions_df[questions_df['difficulty'] == 'easy'])}
+   - Medium: {len(questions_df[questions_df['difficulty'] == 'medium'])}
+   - Hard: {len(questions_df[questions_df['difficulty'] == 'hard'])}
+
+ ### Tips for Interview Prep
+
+ 1. **Start with fundamentals** - Ensure you understand basic concepts before advanced topics
+ 2. **Practice explaining** - Say your answers out loud as if in an interview
+ 3. **Understand trade-offs** - Most questions have nuanced answers depending on context
+ 4. **Know your projects** - Be ready to connect concepts to your own experience
+ 5. **Ask clarifying questions** - In real interviews, it's good to ask about constraints
+
+ ### Company-Specific Prep
+
+ Questions are tagged with companies where similar questions are commonly asked.
+ Filter by company in the Browse tab to focus your preparation.
+
+ ---
+
+ Built by [Lorenzo Scaturchio](https://huggingface.co/gr8monk3ys)
+ """
+             gr.Markdown(stats)
+
+ if __name__ == "__main__":
+     demo.launch()
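The category/difficulty filtering that both `get_random_question` and `browse_questions` rely on is a chained-mask pandas pattern. A minimal standalone sketch, using a made-up three-row stand-in for the question table (the real app has far more rows and columns), behaves the same way:

```python
import pandas as pd

# Made-up three-row stand-in for the app's question table.
df = pd.DataFrame([
    {"id": "1", "category": "Statistics", "difficulty": "easy"},
    {"id": "2", "category": "ML Theory", "difficulty": "medium"},
    {"id": "3", "category": "ML Theory", "difficulty": "hard"},
])

def filter_questions(frame: pd.DataFrame, categories: list, difficulties: list) -> pd.DataFrame:
    """Same chained-mask pattern as app.py; "All" (or an empty list) disables a filter."""
    out = frame.copy()
    if categories and "All" not in categories:
        out = out[out["category"].isin(categories)]
    if difficulties and "All" not in difficulties:
        # The UI passes title-case labels ("Medium"); stored values are lowercase.
        out = out[out["difficulty"].isin([d.lower() for d in difficulties])]
    return out

print(len(filter_questions(df, ["ML Theory"], ["Medium"])))  # 1
```

Treating "All" as a sentinel that skips the mask entirely is what lets the same function serve both the multiselect quiz dropdowns and the single-value browse dropdowns.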
requirements.txt ADDED
@@ -0,0 +1,2 @@
+ gradio==5.9.1
+ pandas>=2.0.0
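The sample-size formula quoted in question 15, n = 2 · ((z_α/2 + z_β)² · p(1-p)) / MDE², can be sanity-checked with the standard library alone. The function name and the baseline/MDE numbers below are illustrative, not from the source:

```python
from math import ceil
from statistics import NormalDist

def ab_sample_size(p: float, mde: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Per-arm sample size for a two-sided two-proportion test (formula from question 15)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ≈ 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ≈ 0.84 for power = 0.8
    return ceil(2 * ((z_alpha + z_beta) ** 2 * p * (1 - p)) / mde ** 2)

# Illustrative: 10% baseline conversion, detect a 2-point absolute lift.
print(ab_sample_size(p=0.10, mde=0.02))  # 3532 per arm
```

As the answer notes, halving the MDE roughly quadruples the required sample, which is easy to confirm by calling the function with `mde=0.01`.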