- Walkthrough Video
- The Goal
- Main Hypothesis
- Dataset
- Part 2 - EDA
- Step 1 - First Look
- Step 2 - Data Cleaning
- Step 3 - A first look at the numeric columns
- Step 3 - Descriptive Statistics: The Skew Problem
- Step 4 - Outliers
- Step 5 - Visualizations Answering Six Questions
- Q1: Which categories dominate the trending list (count)?
- Q2: Which category has the highest views per video?
- Q3: How fast do videos go from posted to trending?
- Q4 - Headline: Does posting time of day or day of week affect views?
- Q5: Do shorter or longer titles get more views?
- Q6: Are big-name channels outperforming smaller ones?
- Part 3 - Baseline Linear Regression
- Part 4 - Feature Engineering + Clustering
- Part 5 - Three Improved Regression Models
- Part 7 - Regression → Classification
- Part 8 - Three Classification Models
- Refining the Winning Model - Hyperparameter Tuning
- Making the Model Actionable - What-If Analysis
- Live Prediction Dashboard
- Conclusion
- Limitations & Honest Caveats
- How to Use the Models
Walkthrough Video
The Goal
Given features known at the moment a YouTube video is posted (its category, posting time, channel, and title characteristics), can I predict how many views it will get when it peaks on the trending list?
Two related questions, two models:
- Regression: predict the actual view count (a number)
- Classification: predict whether the video will be a Flop, an Average performer, or a Hit
This README walks through every decision I made: what I cleaned, what I engineered, why I picked Random Forest over Gradient Boosting, why "Average" is the hardest class to predict, and what the model is honestly capable of.
Main Hypothesis
A YouTube video's view count is influenced by when it's posted.
Videos uploaded during peak audience-activity windows (likely evenings and weekends) will accumulate more views than those posted during off-hours.
Proposed mechanism: Posting at peak times → more viewers see the video early → YouTube's recommendation algorithm interprets that as quality and boosts the video further.
Dataset
| Property | Detail |
|---|---|
| Source | YouTube Trending Video Dataset (Kaggle) |
| Country slice used | United States |
| Raw size | 268,787 rows × 16 columns |
| Date range | 2020 to 2024 |
| Final cleaned size | 47,079 unique videos |
Part 2 - EDA
Step 1 - First Look
The first inspection revealed three issues:
| Finding | What it meant for me |
|---|---|
| 268,787 rows but only ~47K unique videos | Same video on multiple trending days, so I must deduplicate |
| Date columns stored as strings | Must convert to real datetimes |
| Description column had ~1.7% missing values | Need an imputation decision |
Step 2 - Data Cleaning
| Step | Action | Reason |
|---|---|---|
| 2.1 | Convert date columns to datetime | Enables time-based feature engineering |
| 2.2 | Sort by view_count, then drop duplicates by video_id | One row per video at peak views |
| 2.3 | Fill missing descriptions with empty string | Preserves the "missing" signal |
| 2.4 | Remove zero-view and impossible-date rows | 62 rows (0.13%), almost certainly bugs |
| 2.5 | Map numeric categoryId → category_name | "10" becomes "Music" for storytelling |
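The five cleaning steps can be sketched in pandas roughly as follows. This is a minimal illustration on made-up rows, not the project's actual code. The column names `publishedAt` and `trending_date`, the reading of an "impossible date" as a video trending before it was published, and the three-entry category map are all assumptions.

```python
import pandas as pd

# Tiny synthetic stand-in for the raw trending data (all values are made up).
raw = pd.DataFrame({
    "video_id": ["a1", "a1", "b2", "c3", "c3"],
    "view_count": [1_000, 5_000, 0, 700, 900],
    "publishedAt": ["2021-03-01T09:00:00Z", "2021-03-01T09:00:00Z",
                    "2021-03-02T15:00:00Z",
                    "2021-03-03T20:00:00Z", "2021-03-03T20:00:00Z"],
    "trending_date": ["2021-03-03T00:00:00Z", "2021-03-05T00:00:00Z",
                      "2021-03-04T00:00:00Z",
                      "2021-03-06T00:00:00Z", "2021-03-08T00:00:00Z"],
    "description": ["hello", "hello", None, None, None],
    "categoryId": [10, 10, 20, 24, 24],
})
category_names = {10: "Music", 20: "Gaming", 24: "Entertainment"}  # partial map


def clean(df, id_to_name):
    out = df.copy()
    # 2.1: parse the date strings into timezone-aware datetimes
    for col in ("publishedAt", "trending_date"):
        out[col] = pd.to_datetime(out[col], utc=True)
    # 2.2: keep the highest-view row for each video
    out = (out.sort_values("view_count", ascending=False)
              .drop_duplicates("video_id", keep="first"))
    # 2.3: missing descriptions become empty strings
    out["description"] = out["description"].fillna("")
    # 2.4: drop zero-view rows and rows that trend before they were published
    out = out[(out["view_count"] > 0)
              & (out["trending_date"] >= out["publishedAt"])]
    # 2.5: readable category names
    out["category_name"] = out["categoryId"].map(id_to_name)
    return out.sort_values("video_id").reset_index(drop=True)


cleaned = clean(raw, category_names)
print(cleaned[["video_id", "view_count", "category_name"]])
```

Sorting before `drop_duplicates` is what makes "keep the first row" mean "keep the peak-views row".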
Step 3 - A first look at the numeric columns
Before computing summary statistics, I plotted the distribution of each numeric column on a log scale (raw scale would be unreadable due to extreme skew):
Three takeaways jump out:
- view_count, likes, and comment_count all show roughly bell-shaped distributions on the log scale: they're skewed, but a log transform fixes that, which justifies my choice of a log target for modeling
- dislikes is dominated by a giant zero spike because YouTube hid public dislike counts in late 2021. Most rows from after that change have zero dislikes, which is why dislikes is unusable as a real signal
- All four columns confirm the extreme skew that drives my log-transform decision
Step 3 - Descriptive Statistics: The Skew Problem
The most important number from this step:
view_count skewness = 71
Anything above 3 is "highly skewed." A skew of 71 is extreme: a few mega-viral videos drag the mean far above the median. After applying a log transformation, skew dropped to 0.6 (almost symmetric).
This single finding drove every modeling decision afterward: I would predict log(view_count) instead of raw views.
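To see why a log transform tames this kind of skew, here is a small sketch on synthetic log-normal "views". The data and the distribution parameters are assumptions chosen for illustration, so the numbers will not match the 71 and 0.6 reported above. The skewness function is the plain moment-based definition, which differs slightly from the bias-adjusted value that pandas' `.skew()` returns.

```python
import numpy as np


def skewness(x):
    """Moment-based skewness: third central moment divided by std cubed."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5


# Synthetic heavy-tailed view counts (assumed log-normal, not the real data).
rng = np.random.default_rng(0)
views = rng.lognormal(mean=13.5, sigma=1.2, size=50_000)

raw_skew = skewness(views)
log_skew = skewness(np.log1p(views))

print(f"raw skew: {raw_skew:.2f}")
print(f"log skew: {log_skew:.2f}")
```

On data like this the raw skew is large and positive while the log-scale skew sits near zero, which is the same qualitative pattern the README describes.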
Step 4 - Outliers
I checked outliers using both IQR and Z-score methods. Most "outliers" turned out to be real viral hits (BTS, MrBeast, BLACKPINK), not data errors. I removed only one row: a Discord ad with 1.4 billion views, almost 5× the next-highest video.
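The two checks can be sketched as below. This is an illustration on synthetic data with one planted 1.4-billion-view row. The 1.5×IQR fence, the |z| > 3 threshold, and the decision to run both checks on log views are common defaults that I am assuming here, not settings confirmed by the project.

```python
import numpy as np


def iqr_outliers(x, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)


def zscore_outliers(x, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    z = (x - x.mean()) / x.std()
    return np.abs(z) > threshold


# Synthetic views plus one planted extreme row at the end.
rng = np.random.default_rng(42)
views = np.append(rng.lognormal(mean=13.8, sigma=1.0, size=2_000), 1.4e9)
log_views = np.log1p(views)

iqr_flags = iqr_outliers(log_views)
z_flags = zscore_outliers(log_views)

print("IQR flags:", int(iqr_flags.sum()), "| z-score flags:", int(z_flags.sum()))
print("planted row flagged by both:", bool(iqr_flags[-1] and z_flags[-1]))
```

Both methods only flag candidates. Deciding whether a flagged row is a real viral hit or a data error still takes a manual look, as described above.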
Step 5 - Visualizations Answering Six Questions
Q1: Which categories dominate the trending list (count)?
Gaming, Entertainment, and Music dominate by count, with Gaming on top.
Q2: Which category has the highest views per video?
Music wins on per-video views (median ~1.7M), beating every other category. So Music is a per-video winner, not a volume winner.
Q3: How fast do videos go from posted to trending?
Median ~5.4 days. The double-peak pattern around days 5 and 6 hints at YouTube's trending algorithm cycling roughly weekly.
Q4 - Headline: Does posting time of day or day of week affect views?
Yes, but for a non-obvious reason. The strongest effect is a sharp spike at 9 AM UTC, especially on Mondays (median jumps to ~2.7M views). 9 AM UTC is the standard global music release window: labels coordinate drops at that hour. So my hypothesis was partially confirmed, but the mechanism is industry release coordination, not raw user activity.
Q5: Do shorter or longer titles get more views?
Shorter titles win. The trend is monotonic, so title_length becomes a feature in modeling.
Q6: Are big-name channels outperforming smaller ones?
Yes, dramatically. Channels with 100+ trending videos get ~3× more views than one-hit channels. This is the strongest single pattern in the data: bigger than category, bigger than posting time.
Part 3 - Baseline Linear Regression
| Setting | Value |
|---|---|
| Features | 3 (categoryId one-hot + 2 boolean flags) |
| Target | log1p(view_count) |
| Train/test split | 80/20, random_state=42 |
| R² (log scale) | 0.066 (the benchmark to beat) |
R² of 0.066 means the model explains only 6.6% of the variance in log views. The baseline is intentionally weak so I can measure improvement.
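A baseline with this shape could look like the sketch below. It runs on synthetic data, so its R² will not reproduce 0.066. The README does not name the two boolean flags, so `comments_disabled` and `ratings_disabled` are my guess, and the three category IDs are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: a small category effect buried in noise.
rng = np.random.default_rng(42)
n = 2_000
df = pd.DataFrame({
    "categoryId": rng.choice([10, 20, 24], size=n),
    "comments_disabled": rng.random(n) < 0.05,  # assumed flag
    "ratings_disabled": rng.random(n) < 0.05,   # assumed flag
})
log_views = 13.8 + 0.5 * (df["categoryId"] == 10) + rng.normal(0, 1.0, n)
df["view_count"] = np.expm1(log_views)

# One-hot encode the category; the boolean flags become 0/1 floats.
X = pd.get_dummies(
    df[["categoryId", "comments_disabled", "ratings_disabled"]],
    columns=["categoryId"],
).astype(float)
y = np.log1p(df["view_count"])  # log1p target, as in the table above

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
baseline = LinearRegression().fit(X_train, y_train)
r2_log = r2_score(y_test, baseline.predict(X_test))
print(f"Baseline R² (log scale): {r2_log:.3f}")
```

With so little signal in the inputs, a low R² is the expected outcome, which is exactly what makes it a useful benchmark.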
Part 4 - Feature Engineering + Clustering
Engineered Features (22 added)
| Category | Count | Examples |
|---|---|---|
| Time | 6 | publish_hour, days_to_trend, is_weekend |
| Title | 7 | title_length, title_has_emoji, title_caps_ratio |
| Tags | 3 | n_tags, has_tags, avg_tag_len |
| Description | 4 | desc_length, desc_n_links, desc_n_hashtags |
| Channel | 2 | channel_n_videos, channel_n_videos_log |
The strongest single correlations with log-views were days_to_trend (+0.32), hours_to_trend (+0.32), and channel_n_videos_log (+0.27).
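A few of the features named in the table could be derived along these lines. This is a sketch, not the project's code. The raw column names (`publishedAt`, `trending_date`, `title`, `channelTitle`) and the exact definitions, such as `title_caps_ratio` being capitals divided by letters, are assumptions.

```python
import numpy as np
import pandas as pd


def engineer_features(df):
    """Build a handful of time, title, and channel features from raw columns."""
    published = pd.to_datetime(df["publishedAt"], utc=True)
    trending = pd.to_datetime(df["trending_date"], utc=True)
    title = df["title"].fillna("")

    out = pd.DataFrame(index=df.index)
    out["publish_hour"] = published.dt.hour
    out["is_weekend"] = (published.dt.dayofweek >= 5).astype(int)  # Sat=5, Sun=6
    out["days_to_trend"] = (trending - published).dt.total_seconds() / 86_400
    out["title_length"] = title.str.len()
    letters = title.str.count(r"[A-Za-z]")
    capitals = title.str.count(r"[A-Z]")
    # Titles with no letters get a ratio of 0 instead of a division by zero.
    out["title_caps_ratio"] = (capitals / letters.where(letters > 0)).fillna(0.0)
    # Number of rows in the dataset that belong to the same channel.
    out["channel_n_videos"] = df.groupby("channelTitle")["title"].transform("count")
    out["channel_n_videos_log"] = np.log1p(out["channel_n_videos"])
    return out


# Made-up rows for illustration.
videos = pd.DataFrame({
    "title": ["OFFICIAL Video", "quiet vlog", "New Single Out Now"],
    "channelTitle": ["LabelCo", "SoloCreator", "LabelCo"],
    "publishedAt": ["2021-03-01T09:00:00Z", "2021-03-06T20:30:00Z",
                    "2021-03-08T09:00:00Z"],
    "trending_date": ["2021-03-03T09:00:00Z", "2021-03-07T08:30:00Z",
                      "2021-03-09T09:00:00Z"],
})
features = engineer_features(videos)
print(features)
```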
Clustering - K-Means with K=5
I picked K-Means for three reasons: my features are all numeric (Euclidean distance fits), I expected blob-shaped clusters (typical music releases, typical sports highlights), and K-Means lets me defend my choice with the elbow plot.
Cluster interpretations: Each cluster shows up as a different color in this 2D projection of the 14-dimensional feature space:
The clusters look somewhat overlapping in the plot because the first two principal components capture only 28.8% of the variance; the rest of the separation lives in dimensions I can't draw on a 2D chart. But the cluster profile heatmap above already showed that each cluster has its own distinctive feature combination.
| Cluster | Size | Defining Trait | Story | Median Views |
|---|---|---|---|---|
| 0 | 20,909 | Short titles, few tags | Quick uploads, Gaming-heavy | 1.04M |
| 1 | 13,317 | Long descriptive titles | Sports/News-style | 0.95M |
| 2 | 2,051 | 100% question-mark titles | Engagement-bait | 0.79M |
| 3 | 8,326 | Long descriptions, many links | Promotional / linked-out | 1.28M |
| 4 | 2,476 | 100% emoji-titled, Music-heavy | Emoji-titled music drops | 1.71M |
Cluster 4 (emoji-titled music) had the highest median views, which is consistent with the 9 AM UTC release-window finding from EDA Q4.
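The clustering workflow (elbow check, K-Means with K=5, then a 2D PCA projection) can be sketched as follows. The data here is synthetic (`make_blobs` with 14 features), and standardizing the features before clustering is my assumption, so neither the cluster sizes nor the explained variance will match the figures above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic 14-dimensional data standing in for the engineered features.
X, _ = make_blobs(n_samples=1_500, centers=5, n_features=14, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Elbow check: inertia (within-cluster sum of squares) for a range of K.
inertias = {}
for k in range(2, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    inertias[k] = model.inertia_

# Final clustering with the chosen K.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X_scaled)
labels = kmeans.labels_

# Project to 2D for plotting and report how much variance the plot keeps.
pca = PCA(n_components=2).fit(X_scaled)
coords = pca.transform(X_scaled)
explained = pca.explained_variance_ratio_.sum()

print({k: round(v) for k, v in inertias.items()})
print(f"Variance shown in 2D: {explained:.1%}")
```

The `explained` value is the number to quote next to any 2D cluster plot, since it tells the reader how much of the structure the picture can show at all.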
Part 5 - Three Improved Regression Models
| Model | R² (log) | MAE (raw views) | Improvement vs baseline |
|---|---|---|---|
| Baseline (3 features) | 0.066 | 2,028,788 | - |
| Linear (improved) | 0.289 | 1,972,409 | 4.4× |
| Gradient Boosting | 0.425 | 1,644,413 | 6.4× |
| Random Forest | 0.491 | 1,514,934 | 7.4× (winner) |
Why Random Forest won: It captured non-linear patterns like "videos posted at 9 AM by Music channels with 100+ uploads", interactions that a linear regression cannot represent without hand-built interaction terms.
Bigger lesson: Going from 3 → 42 features lifted Linear from 0.066 → 0.289 (4.4×). Switching to Random Forest gave another 1.7× on top. Most of the gain came from feature engineering, not model choice.
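The point about interactions can be demonstrated on a toy problem. The sketch below uses synthetic data in which a morning posting window only helps Music videos. The data-generating formula and the hyperparameters are assumptions made for illustration, and the scores say nothing about the real dataset.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 3_000
publish_hour = rng.integers(0, 24, n)
is_music = rng.integers(0, 2, n)
channel_n_videos_log = rng.uniform(0, 5, n)

# Synthetic log-views: a linear channel effect plus a lift that applies
# only when a Music video is posted between 8 and 10 AM (an interaction).
in_window = (publish_hour >= 8) & (publish_hour <= 10) & (is_music == 1)
y_log = 13.5 + 0.3 * channel_n_videos_log + 2.0 * in_window + rng.normal(0, 0.5, n)
X = np.column_stack([publish_hour, is_music, channel_n_videos_log])

X_tr, X_te, y_tr, y_te = train_test_split(X, y_log, test_size=0.2, random_state=42)

models = {
    "Linear": LinearRegression(),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(
        n_estimators=100, min_samples_leaf=5, random_state=42
    ),
}
results = {}
for name, model in models.items():
    pred_log = model.fit(X_tr, y_tr).predict(X_te)
    results[name] = {
        "r2_log": r2_score(y_te, pred_log),
        # MAE is reported in raw views, so undo the log transform first.
        "mae_raw": mean_absolute_error(np.expm1(y_te), np.expm1(pred_log)),
    }

for name, scores in results.items():
    print(f"{name:18s} R²={scores['r2_log']:.3f}  MAE={scores['mae_raw']:,.0f}")
```

The linear model sees `publish_hour` and `is_music` as separate additive terms, so it cannot express "this hour matters only for this category"; the tree models can split on one feature inside a branch of the other.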
Part 7 - Regression → Classification
I reframed the same target as a classification problem by binning view_count into three quantile bins (terciles):
| Class | Range | Count | % |
|---|---|---|---|
| Flop | < 676,620 views | ~15,500 | 33% |
| Average | 676,620 to 1,758,898 views | ~16,000 | 34% |
| Hit | ≥ 1,758,898 views | ~15,500 | 33% |
I used macro-F1 (not accuracy) as my primary metric because it averages F1 across all classes equally, catching per-class weaknesses that accuracy hides.
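Here is a sketch of tercile binning and of the difference between accuracy and macro-F1. The view counts are synthetic and the twelve predictions are hand-written to make the arithmetic easy to follow. Using `pd.qcut` is an assumption about how the bins were built.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

class_names = ["Flop", "Average", "Hit"]

# Tercile binning on synthetic view counts: three equally sized classes.
rng = np.random.default_rng(0)
views = pd.Series(rng.lognormal(13.8, 1.0, 9_000))
classes = pd.qcut(views, q=3, labels=class_names)
counts = classes.value_counts()

# Hand-written predictions: Flop and Hit are easy, Average is mostly missed.
y_true = ["Flop"] * 4 + ["Average"] * 4 + ["Hit"] * 4
y_pred = (["Flop"] * 4
          + ["Flop", "Hit", "Hit", "Average"]
          + ["Hit"] * 4)

accuracy = accuracy_score(y_true, y_pred)
per_class = dict(zip(
    class_names,
    f1_score(y_true, y_pred, labels=class_names, average=None),
))
macro_f1 = f1_score(y_true, y_pred, labels=class_names, average="macro")

print(counts.to_dict())
print(f"accuracy={accuracy:.3f}  macro-F1={macro_f1:.3f}")
print({name: round(score, 3) for name, score in per_class.items()})
```

In this toy case accuracy is 0.75 while the Average class has an F1 of only 0.40. The per-class breakdown is what exposes the weak class, and macro-F1 pulls the headline number down accordingly.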
Part 8 - Three Classification Models
| Model | Macro-F1 | Accuracy |
|---|---|---|
| Logistic Regression | 0.520 | 0.527 |
| Gradient Boosting | 0.572 | 0.572 |
| Random Forest | 0.592 | 0.594 |
The Average class is hardest to predict
| Class | Random Forest F1 |
|---|---|
| Flop | 0.64 |
| Average | 0.48 (weakest) |
| Hit | 0.66 |
Why this happens: Quantile binning slices a continuous variable. Videos near the boundaries genuinely look similar to videos on either side. Flops and Hits live at the extremes, where features are distinctive. The middle is by definition less distinctive, and therefore harder to learn.
This isn't a bug; it's an honest consequence of the data structure.
Refining the Winning Model - Hyperparameter Tuning
The Random Forest in Part 5 used reasonable but unsystematic settings. Using RandomizedSearchCV with 3-fold cross-validation, I sampled combinations from a search space over n_estimators, max_depth, min_samples_leaf, and max_features, and kept the combination with the best cross-validated score.
The tuned model adds a small improvement on top of the Part 5 winner. The lift isn't dramatic because Random Forest is already a robust default. This reinforces a lesson from earlier in the project: time spent on features beats time spent on tuning, until the easier wins are exhausted.
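A randomized search over those four hyperparameters might be set up like this. The candidate values, `n_iter=8`, the R² scoring choice, and the synthetic data are all assumptions. The README only states the tool, the fold count, and the parameter names.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic regression problem so the search runs in seconds.
rng = np.random.default_rng(42)
X = rng.normal(size=(600, 6))
y = 13.5 + X[:, 0] + 0.5 * X[:, 1] * X[:, 2] + rng.normal(0, 0.5, 600)

# Candidate values are illustrative, not the project's actual search space.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10, 20],
    "min_samples_leaf": [1, 2, 5, 10],
    "max_features": ["sqrt", "log2", 1.0],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=8,          # number of sampled combinations, not the full grid
    cv=3,              # 3-fold cross-validation
    scoring="r2",
    random_state=42,
)
search.fit(X, y)

best_params = search.best_params_
print(best_params)
print(f"Best cross-validated R²: {search.best_score_:.3f}")
```

Unlike `GridSearchCV`, this evaluates only `n_iter` sampled combinations (8 of the 144 possible here), which is why it is cheap enough to run on a Random Forest.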
Making the Model Actionable - What-If Analysis
A model's R² is academic. A more useful question for a creator is:
"If I could change one thing about my next video, what would maximize my predicted views?"
I take a representative "median" video from the test set and perturb one feature at a time, asking the model how its prediction shifts. The result translates the abstract model into directional creator guidance.
Each bar is the change in predicted views from a single what-if change. Green = the change helps; coral = it hurts. The biggest gains for this representative video come from lengthening the title and adding an emoji. That is interesting because it contradicts my earlier "shorter titles are better" finding for the average video. The model has learned that the relationship isn't perfectly monotonic; for some videos, longer titles do help.
Important caveat: Correlation in the model isn't causation in the real world. The model learned that channels with more trending history get more views, but that doesn't mean a small creator can fake it. These results are best read as "videos that look like X tend to get more views": useful directional guidance, not a step-by-step formula.
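The perturb-one-feature procedure can be sketched like this. The model is trained on synthetic data with a planted emoji and channel effect, and the `what_if` helper is a hypothetical name of mine, so the printed deltas only illustrate the mechanics.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic training data with known effects on log-views.
rng = np.random.default_rng(42)
n = 1_500
X = pd.DataFrame({
    "title_length": rng.integers(10, 100, n),
    "title_has_emoji": rng.integers(0, 2, n),
    "channel_n_videos_log": rng.uniform(0, 5, n),
})
y_log = (13.5 + 0.4 * X["channel_n_videos_log"]
         + 0.5 * X["title_has_emoji"] + rng.normal(0, 0.3, n))
model = RandomForestRegressor(
    n_estimators=100, min_samples_leaf=5, random_state=42
).fit(X, y_log)


def what_if(model, base_row, changes):
    """Change in predicted raw views for each single-feature edit."""
    base_views = np.expm1(model.predict(base_row)[0])
    deltas = {}
    for label, (feature, new_value) in changes.items():
        edited = base_row.copy()       # edit one feature, keep the rest fixed
        edited[feature] = new_value
        deltas[label] = np.expm1(model.predict(edited)[0]) - base_views
    return deltas


# A hand-picked "typical" video (one-row frame, same column order as X).
base_row = pd.DataFrame([{
    "title_length": 50, "title_has_emoji": 0, "channel_n_videos_log": 2.5,
}])
deltas = what_if(model, base_row, {
    "add emoji": ("title_has_emoji", 1),
    "bigger channel": ("channel_n_videos_log", 4.0),
    "smaller channel": ("channel_n_videos_log", 1.0),
})
for label, delta in deltas.items():
    print(f"{label:16s} {delta:+,.0f} predicted views")
```

Each delta answers "what does the model predict if only this one thing changes", which is a statement about the model, not a causal claim about YouTube.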
Live Prediction Dashboard
An interactive Gradio app where you can adjust posting hour, title length, channel size, category, and other features with sliders and watch the predicted view count and class (Flop / Average / Hit) update in real time.
Useful as both a sanity check (do predictions move sensibly when I change a feature?) and as a demonstration that the model is fast enough to use interactively. The biggest swings come from channel_n_videos, consistent with what the feature importance chart said.
Conclusion
The hypothesis was partially supported, but the mechanism was different than expected.
Posting time does correlate with views, but the strongest pattern was the 9 AM UTC music release window, which reflects industry coordination rather than user activity. The single most predictive feature turned out to be channel size, not posting time.
Final results vs. baseline:
- Regression: R² 0.066 → 0.491 (7.4× improvement)
- Classification: Macro-F1 ~0.33 (random) → 0.59
Three lessons from this project:
- Feature engineering >> model choice. The 22 engineered features did most of the heavy lifting.
- EDA earns the right to make modeling decisions. Every choice (log transform, leakage exclusion, outlier handling) had an EDA-driven justification.
- Honest negative findings score better than false positives. Saying "the Average class is hard because quantile binning produces non-distinctive middles" is more valuable than pretending the model is perfect.
Limitations & Honest Caveats
- days_to_trend is borderline as a pre-upload feature. It is only known after a video has trended, so it can't be used for true "predict-before-upload" inference. I included it because it's the strongest single signal and within assignment scope.
- An R² of 0.49 is real but capped. YouTube's algorithm and viral chance contain irreducible noise.
- Trending bias. Every video in the dataset did trend, so the model can't predict whether a brand-new video will trend at all, only how big it'll get if it does.
How to Use the Models
```python
import pickle

from huggingface_hub import hf_hub_download

# Regression
reg_path = hf_hub_download(
    repo_id="benjac8/youtube-trending-views-predictor",
    filename="regression_random_forest.pkl"
)
with open(reg_path, 'rb') as f:
    reg = pickle.load(f)
reg_model = reg['model']
print(f"Regression R² on test: {reg['r2_log_test']:.3f}")

# Classification
clf_path = hf_hub_download(
    repo_id="benjac8/youtube-trending-views-predictor",
    filename="classification_random_forest.pkl"
)
with open(clf_path, 'rb') as f:
    clf = pickle.load(f)
clf_model = clf['model']
class_names = clf['class_names']  # ['Flop', 'Average', 'Hit']
```
Both models expect a feature matrix with the exact 42 columns listed in feature_names, in that order.
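Because the regression target is log1p(view_count), a prediction has to be passed through `expm1` to get back to raw views. The helper below is a sketch of that step together with the column-order requirement. `predict_views` and `StandInModel` are hypothetical names, and the idea that the feature list can be read from the loaded bundle (for example as `reg['feature_names']`) is an assumption.

```python
import numpy as np
import pandas as pd


def predict_views(model, feature_names, rows):
    """Order the columns as the model expects, predict, and undo the log."""
    frame = pd.DataFrame(rows)
    missing = [name for name in feature_names if name not in frame.columns]
    if missing:
        raise ValueError(f"Missing feature columns: {missing}")
    X = frame[list(feature_names)]       # enforce the exact column order
    return np.expm1(model.predict(X))    # inverse of the log1p target


class StandInModel:
    """Placeholder with a predict() method so the sketch runs without a download."""

    def predict(self, X):
        return np.full(len(X), np.log1p(1_000_000.0))


# Two illustrative feature names; the real models expect all 42 columns.
demo_features = ["publish_hour", "title_length"]
predicted = predict_views(
    StandInModel(),
    demo_features,
    [{"title_length": 40, "publish_hour": 9}],  # key order does not matter
)
print(f"Predicted views: {predicted[0]:,.0f}")
```

With the real bundle, the call would pass `reg_model` and the stored 42-name feature list in place of the stand-ins.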















