🎬 Walkthrough Video


The Goal

Given features known at the moment a YouTube video is posted (its category, posting time, channel, and title characteristics), can I predict how many views it will have when it peaks on the trending list?

Two related questions, two models:

  • Regression: predict the actual view count (a number)
  • Classification: predict whether the video will be a Flop, Average performer, or Hit

This README walks through every decision I made: what I cleaned, what I engineered, why I picked Random Forest over Gradient Boosting, why "Average" is the hardest class to predict, and what the model is honestly capable of.


Main Hypothesis

A YouTube video's view count is influenced by when it's posted.

Videos uploaded during peak audience-activity windows (likely evenings and weekends) will accumulate more views than those posted during off-hours.

Proposed mechanism: Posting at peak times → more viewers see the video early → YouTube's recommendation algorithm interprets that as quality and boosts the video further.


Dataset

| Property | Detail |
| --- | --- |
| Source | YouTube Trending Video Dataset (Kaggle) |
| Country slice used | United States |
| Raw size | 268,787 rows × 16 columns |
| Date range | 2020–2024 |
| Final cleaned size | 47,079 unique videos |

Part 2: EDA

Step 1: First Look

The first inspection revealed three issues:

| Finding | What it meant for me |
| --- | --- |
| 268,787 rows but only ~47K unique videos | The same video appears on multiple trending days, so I must deduplicate |
| Date columns stored as strings | Must convert to real datetimes |
| Description column had ~1.7% missing values | Need an imputation decision |

Step 2: Data Cleaning

| Step | Action | Reason |
| --- | --- | --- |
| 2.1 | Convert date columns to datetime | Enables time-based feature engineering |
| 2.2 | Sort by view_count → drop duplicates by video_id | One row per video at peak views |
| 2.3 | Fill missing descriptions with empty string | Preserves the "missing" signal |
| 2.4 | Remove zero-view and impossible-date rows | 62 rows (0.13%), almost certainly bugs |
| 2.5 | Map numeric categoryId → category_name | "10" becomes "Music" for storytelling |
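Steps 2.1 to 2.4 can be sketched in pandas roughly as follows. This is a minimal illustration, not the project's actual cleaning code: the column names `publishedAt`, `trending_date`, and `description` are assumed from the usual layout of the Kaggle file, and "impossible date" is interpreted here as a video that trends before it was published.

```python
import pandas as pd

def clean_trending(df: pd.DataFrame) -> pd.DataFrame:
    """Apply steps 2.1-2.4 to a raw trending DataFrame (assumed column names)."""
    out = df.copy()
    # 2.1 Parse date strings into timezone-aware datetimes
    for col in ("publishedAt", "trending_date"):
        out[col] = pd.to_datetime(out[col], utc=True, errors="coerce")
    # 2.2 Keep the row with the highest view_count for each video
    out = (out.sort_values("view_count", ascending=False)
              .drop_duplicates(subset="video_id", keep="first"))
    # 2.3 Missing descriptions become empty strings
    out["description"] = out["description"].fillna("")
    # 2.4 Drop zero-view rows and rows that trend before they were published
    valid = (out["view_count"] > 0) & (out["trending_date"] >= out["publishedAt"])
    return out[valid].reset_index(drop=True)

# Tiny made-up example: video "a" appears twice, "b" has zero views,
# and "c" has a trending date earlier than its publish date.
raw = pd.DataFrame({
    "video_id": ["a", "a", "b", "c"],
    "publishedAt": ["2021-01-01T10:00:00Z"] * 4,
    "trending_date": ["2021-01-02T00:00:00Z", "2021-01-03T00:00:00Z",
                      "2021-01-02T00:00:00Z", "2020-12-30T00:00:00Z"],
    "view_count": [100, 250, 0, 500],
    "description": ["x", None, "y", "z"],
})
cleaned = clean_trending(raw)
print(cleaned[["video_id", "view_count", "description"]])
```

Sorting in descending order before `drop_duplicates(keep="first")` is what guarantees the surviving row is the peak-view snapshot of each video.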

Step 3a: A first look at the numeric columns

Before computing summary statistics, I plotted the distribution of each numeric column on a log scale (raw scale would be unreadable due to extreme skew):

Distributions of all numeric columns

Three takeaways jump out:

  • view_count, likes, and comment_count all show roughly bell-shaped distributions on the log scale. They are skewed in raw form, but a log transform fixes that, which justifies my log target choice for modeling
  • dislikes is dominated by a giant zero spike because YouTube hid public dislike counts in late 2021. Most rows from after that change have zero dislikes, which is why dislikes is unusable as a real signal
  • All four columns confirm the extreme skew that drives my log-transform decision

Step 3b: Descriptive Statistics and the Skew Problem

The most important number from this step:

view_count skewness = 71

Anything above 3 is "highly skewed." A skew of 71 is extreme: a few mega-viral videos drag the mean far above the median. After applying a log transformation, skew dropped to 0.6 (close to symmetric).

View count distribution: raw vs log

This single finding drove every modeling decision afterward: I would predict log(view_count) instead of raw views.
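To see why a log transform tames this kind of skew, here is a small sketch on synthetic heavy-tailed data. The lognormal parameters are invented for illustration and are not fitted to the dataset. The function computes the plain third standardized moment; pandas' `Series.skew()` applies a small-sample bias correction, so its numbers differ slightly.

```python
import numpy as np

def skewness(x) -> float:
    """Population skewness: the third standardized moment."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return float((centered ** 3).mean() / (centered ** 2).mean() ** 1.5)

rng = np.random.default_rng(0)
# Synthetic stand-in for view counts: most values modest, a few enormous
views = rng.lognormal(mean=13.5, sigma=1.5, size=50_000)

raw_skew = skewness(views)
log_skew = skewness(np.log1p(views))
print(f"raw skew: {raw_skew:.1f}, log1p skew: {log_skew:.2f}")
```

`log1p` is used rather than `log` so that a zero count would still map to a finite value, which matches the `log1p(view_count)` target used for the baseline model.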

Step 4 โ€” Outliers

I checked outliers using both IQR and Z-score methods. Most "outliers" turned out to be real viral hits (BTS, MrBeast, BLACKPINK), not data errors. I removed only one row: a Discord ad with 1.4 billion views, almost 5× the next highest video.
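The two checks mentioned above can be sketched like this. The thresholds (1.5 × IQR and |z| > 3) are the conventional defaults and are an assumption here, since the exact cutoffs used in the project are not stated; the sample values are made up.

```python
import numpy as np

def iqr_outliers(x: np.ndarray, k: float = 1.5) -> np.ndarray:
    """Boolean mask of points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

def zscore_outliers(x: np.ndarray, threshold: float = 3.0) -> np.ndarray:
    """Boolean mask of points more than `threshold` standard deviations from the mean."""
    z = (x - x.mean()) / x.std()
    return np.abs(z) > threshold

# Made-up view counts in millions; the last value plays the role of the extreme row
views = np.array([1.0, 1.2, 0.9, 1.1, 1.3, 0.8, 1.0, 1.1, 0.95, 1.05, 1.15, 50.0])
iqr_mask = iqr_outliers(views)
z_mask = zscore_outliers(views)
print(int(iqr_mask.sum()), int(z_mask.sum()))
```

Both masks only flag candidates. Deciding whether a flagged row is a data error or a genuine viral hit still takes a manual look, which is how the single removed row was chosen.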

Outlier box plots

Step 5: Visualizations Answering Six Questions

Q1: Which categories dominate the trending list (count)?

Gaming, Entertainment, and Music dominate by count, with Gaming on top.

Trending videos per category

Q2: Which category has the highest views per video?

Music wins on per-video views (median ~1.7M), beating every other category. So Music is a per-video winner, not a volume winner.

Median views per category

Q3: How fast do videos go from posted to trending?

Median ~5.4 days. The double-peak pattern around days 5 and 6 hints at YouTube's trending algorithm cycling roughly weekly.

Days to trend

Q4 (Headline): Does posting time of day or day of week affect views?

Yes, but for a non-obvious reason. The strongest effect is a sharp spike at 9 AM UTC, especially on Mondays (median jumps to ~2.7M views). 9 AM UTC is the standard global music release window, when labels coordinate drops. So my hypothesis was partially confirmed, but the mechanism is industry release coordination, not raw user activity.
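A heatmap like this one is typically built from a day-by-hour table of median views. The sketch below shows one way to compute that table; the `publishedAt` column name and the toy rows are assumptions for illustration only.

```python
import pandas as pd

def median_views_by_slot(df: pd.DataFrame) -> pd.DataFrame:
    """Median view_count for every (day of week, UTC hour) posting slot."""
    published = pd.to_datetime(df["publishedAt"], utc=True)
    slots = df.assign(
        publish_dow=published.dt.day_name(),
        publish_hour=published.dt.hour,
    )
    return slots.pivot_table(index="publish_dow", columns="publish_hour",
                             values="view_count", aggfunc="median")

# Made-up rows: two Monday 9 AM uploads, one Monday evening, one Saturday morning
toy = pd.DataFrame({
    "publishedAt": ["2021-03-01T09:00:00Z", "2021-03-01T09:30:00Z",
                    "2021-03-01T18:00:00Z", "2021-03-06T09:00:00Z"],
    "view_count": [2_000_000, 3_000_000, 900_000, 1_000_000],
})
grid = median_views_by_slot(toy)
print(grid)
```

Using the median rather than the mean keeps a single mega-viral upload from dominating a whole cell, which matters given how skewed view counts are.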

Posting time heatmap

Q5: Do shorter or longer titles get more views?

Shorter titles win. The trend is monotonic, so title_length becomes a feature in modeling.

Title length effect

Q6: Are big-name channels outperforming smaller ones?

Yes, dramatically. Channels with 100+ trending videos get ~3× more views than one-hit channels. This is the strongest single pattern in the data: bigger than category, bigger than posting time.

Channel size effect


Part 3: Baseline Linear Regression

| Setting | Value |
| --- | --- |
| Features | 3 (categoryId one-hot + 2 boolean flags) |
| Target | log1p(view_count) |
| Train/test split | 80/20, random_state=42 |
| R² (log scale) | 0.066 ← the benchmark to beat |

R² of 0.066 means the model explains only 6.6% of the variance in log views. It is intentionally weak so I can measure improvement.


Part 4: Feature Engineering + Clustering

Engineered Features (22 added)

| Category | Count | Examples |
| --- | --- | --- |
| Time | 6 | publish_hour, days_to_trend, is_weekend |
| Title | 7 | title_length, title_has_emoji, title_caps_ratio |
| Tags | 3 | n_tags, has_tags, avg_tag_len |
| Description | 4 | desc_length, desc_n_links, desc_n_hashtags |
| Channel | 2 | channel_n_videos, channel_n_videos_log |
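A few of the features in the table could be derived along these lines. This is a sketch under assumptions: the raw column names (`publishedAt`, `trending_date`, `title`, `channelId`) are guessed, and the exact definitions (for example, caps ratio as uppercase ASCII letters divided by title length) are illustrative rather than the project's own.

```python
import numpy as np
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive a handful of the engineered features from raw columns."""
    out = df.copy()
    published = pd.to_datetime(out["publishedAt"], utc=True)
    trending = pd.to_datetime(out["trending_date"], utc=True)
    # Time features
    out["publish_hour"] = published.dt.hour
    out["is_weekend"] = published.dt.dayofweek >= 5  # Saturday=5, Sunday=6
    out["days_to_trend"] = (trending - published).dt.total_seconds() / 86_400
    # Title features
    out["title_length"] = out["title"].str.len()
    n_upper = out["title"].str.count(r"[A-Z]")
    out["title_caps_ratio"] = n_upper / out["title_length"].clip(lower=1)
    # Channel features: number of distinct trending videos per channel
    out["channel_n_videos"] = out.groupby("channelId")["video_id"].transform("nunique")
    out["channel_n_videos_log"] = np.log1p(out["channel_n_videos"])
    return out

# Made-up rows for illustration
toy = pd.DataFrame({
    "video_id": ["v1", "v2", "v3"],
    "channelId": ["c1", "c1", "c2"],
    "title": ["HELLO world", "quiet title", "New Song"],
    "publishedAt": ["2021-03-06T12:00:00Z", "2021-03-01T09:00:00Z", "2021-03-02T09:00:00Z"],
    "trending_date": ["2021-03-08T00:00:00Z", "2021-03-03T09:00:00Z", "2021-03-05T09:00:00Z"],
})
features = add_features(toy)
print(features[["publish_hour", "is_weekend", "days_to_trend", "title_caps_ratio", "channel_n_videos"]])
```

Note that `channel_n_videos` is computed from the whole table. In a strict train/test setup it would be safer to compute it on the training rows only and map it onto the test rows.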

The strongest single correlations with log-views were days_to_trend (+0.32), hours_to_trend (+0.32), and channel_n_videos_log (+0.27).

Engineered feature correlations

Clustering: K-Means with K=5

I picked K-Means for three reasons: my features are all numeric (Euclidean distance fits), I expected blob-shaped clusters (typical music releases, typical sports highlights), and K-Means lets me defend my choice with the elbow plot.
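An elbow plot is drawn from the K-Means inertia (within-cluster sum of squares) at each candidate K. The sketch below computes those values on synthetic blobs. Standardizing the features first is an assumption on my part (it is common with Euclidean distance but not stated above), and the data is random, not the project's 14 features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def elbow_inertias(X: np.ndarray, k_values=range(1, 9), seed: int = 42) -> dict:
    """Fit K-Means for each k on standardized features and return {k: inertia}."""
    X_scaled = StandardScaler().fit_transform(X)
    return {
        k: KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X_scaled).inertia_
        for k in k_values
    }

# Synthetic data: five well-separated blobs in four dimensions
rng = np.random.default_rng(0)
centers = rng.normal(scale=6.0, size=(5, 4))
X = np.vstack([c + rng.normal(size=(200, 4)) for c in centers])

inertias = elbow_inertias(X)
for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

The "elbow" is the K after which inertia stops dropping sharply. On real data the bend is usually softer than on clean blobs, so the choice of K remains a judgment call.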

Cluster profile heatmap

Cluster interpretations: Each cluster shows up as a different color in this 2D projection of the 14-dimensional feature space:

Cluster PCA scatter

The clusters look somewhat overlapping in the plot because PCA only shows 28.8% of the variance. The rest of the separation lives in dimensions I can't draw on a 2D chart. But the cluster profile heatmap above already proved each cluster has its own distinctive feature combination.

| Cluster | Size | Defining Trait | Story | Median Views |
| --- | --- | --- | --- | --- |
| 0 | 20,909 | Short titles, few tags | Quick uploads, Gaming-heavy | 1.04M |
| 1 | 13,317 | Long descriptive titles | Sports/News-style | 0.95M |
| 2 | 2,051 | 100% question-mark titles | Engagement-bait | 0.79M |
| 3 | 8,326 | Long descriptions, many links | Promotional / linked-out | 1.28M |
| 4 | 2,476 | 100% emoji-titled, Music-heavy | Emoji-titled music drops | 1.71M |

Cluster 4 (emoji-titled music) had the highest median views, which is consistent with the 9 AM UTC release-window finding from EDA Q4.


Part 5: Three Improved Regression Models

| Model | R² (log) | MAE (raw views) | Improvement vs baseline |
| --- | --- | --- | --- |
| Baseline (3 features) | 0.066 | 2,028,788 | - |
| Linear (improved) | 0.289 | 1,972,409 | 4.4× |
| Gradient Boosting | 0.425 | 1,644,413 | 6.4× |
| Random Forest | 0.491 | 1,514,934 | 7.4× ← winner |

Model comparison

Why Random Forest won: It captured non-linear patterns like "videos posted at 9 AM by Music channels with 100+ uploads", interactions that a linear regression without explicit interaction terms can't represent.

Bigger lesson: Going from 3 → 42 features lifted Linear from 0.066 → 0.289 (4.4×). Switching to Random Forest gave another 1.7× on top. Most of the gain came from feature engineering, not model choice.
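The multipliers quoted above follow directly from the R² values in the comparison table. A short check of the arithmetic, which also shows the gains in absolute R² points:

```python
# R² (log scale) values copied from the model comparison table
r2 = {
    "Baseline (3 features)": 0.066,
    "Linear (improved)": 0.289,
    "Gradient Boosting": 0.425,
    "Random Forest": 0.491,
}
baseline = r2["Baseline (3 features)"]

# Ratio of each model's R² to the baseline R²
improvement = {name: round(score / baseline, 1) for name, score in r2.items()}
# Extra multiplier from switching model family with the same features
rf_over_linear = round(r2["Random Forest"] / r2["Linear (improved)"], 1)
# The same comparison in absolute R² points
gain_from_features = round(r2["Linear (improved)"] - baseline, 3)
gain_from_model = round(r2["Random Forest"] - r2["Linear (improved)"], 3)

print(improvement)
print(rf_over_linear, gain_from_features, gain_from_model)
```

In ratio terms the feature step (4.4×) dwarfs the model step (1.7×). In absolute points the two are closer (0.223 versus 0.202), so "most of the gain" holds in both views, but by a much narrower margin in the second.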


Part 7: Regression → Classification

I reframed the same target as a classification problem by splitting view_count into three quantile-based bins (terciles):

| Class | Range | Count | % |
| --- | --- | --- | --- |
| Flop | < 676,620 views | ~15,500 | 33% |
| Average | 676,620 – 1,758,898 views | ~16,000 | 34% |
| Hit | ≥ 1,758,898 views | ~15,500 | 33% |
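Equal-frequency binning of this kind can be done with `pandas.qcut`. The sketch below uses synthetic view counts, so its thresholds will not match the table. Also note that `qcut` builds right-closed intervals, so values exactly on a boundary may land one class lower than with the `<` / `≥` rule shown in the table.

```python
import numpy as np
import pandas as pd

def bin_views(view_count: pd.Series) -> pd.Series:
    """Split view counts into three equal-frequency classes."""
    return pd.qcut(view_count, q=3, labels=["Flop", "Average", "Hit"])

# Synthetic stand-in for view counts (parameters are invented)
rng = np.random.default_rng(1)
views = pd.Series(rng.lognormal(mean=13.5, sigma=1.2, size=9_000))

classes = bin_views(views)
counts = classes.value_counts()
print(counts)
```

Because the bins are defined by quantiles, the classes come out balanced by construction, which is why a random guesser scores roughly 0.33 on this task.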

Class distribution

I used macro-F1 (not accuracy) as my primary metric because it averages F1 across all classes equally, catching per-class weaknesses that accuracy hides.
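To make the difference between macro-F1 and accuracy concrete, here is a from-scratch sketch on six made-up predictions in which the classifier never predicts "Average". The labels and predictions are invented for illustration.

```python
def f1_per_class(y_true, y_pred, labels):
    """Return {label: F1}, computed from true positives, false positives, false negatives."""
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores."""
    per_class = f1_per_class(y_true, y_pred, labels)
    return sum(per_class.values()) / len(labels)

labels = ["Flop", "Average", "Hit"]
y_true = ["Flop", "Flop", "Average", "Average", "Hit", "Hit"]
y_pred = ["Flop", "Flop", "Flop", "Hit", "Hit", "Hit"]  # "Average" is never predicted

per_class = f1_per_class(y_true, y_pred, labels)
macro = macro_f1(y_true, y_pred, labels)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(per_class, round(macro, 3), round(accuracy, 3))
```

Accuracy still looks respectable here because four of six predictions are right, while macro-F1 is pulled down by the class that is missed entirely. In practice `sklearn.metrics.f1_score(..., average="macro")` computes the same quantity.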


Part 8: Three Classification Models

| Model | Macro-F1 | Accuracy |
| --- | --- | --- |
| Logistic Regression | 0.520 | 0.527 |
| Gradient Boosting | 0.572 | 0.572 |
| Random Forest | 0.592 | 0.594 |

F1 by class

The Average class is hardest to predict

| Class | Random Forest F1 |
| --- | --- |
| Flop | 0.64 |
| Average | 0.48 ← weakest |
| Hit | 0.66 |

Why this happens: Quantile binning slices a continuous variable. Videos near the boundaries genuinely look similar to videos on either side. Flops and Hits each have only one boundary and extend out to the extremes, where features are distinctive. The Average class is squeezed between two boundaries, so the middle is by definition less distinctive and therefore harder to learn.

This isn't a bug; it's an honest consequence of the data structure.

Refining the Winning Model: Hyperparameter Tuning

The Random Forest in Part 5 used reasonable but unsystematic settings. Using RandomizedSearchCV with 3-fold cross-validation, I sampled hyperparameter combinations from a search space covering n_estimators, max_depth, min_samples_leaf, and max_features, and kept the best-scoring one.
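A randomized search over those four hyperparameters might look like the sketch below. The value ranges, `n_iter`, and the synthetic data are all assumptions for illustration; the ranges actually searched in the project are not stated.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical search space (not the project's actual ranges)
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10, 20],
    "min_samples_leaf": [1, 2, 5, 10],
    "max_features": ["sqrt", 0.5, 1.0],
}

# Small synthetic regression problem so the example runs quickly
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = X[:, 0] * 2 + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=300)

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,          # number of sampled combinations
    cv=3,              # 3-fold cross-validation, as described above
    scoring="r2",
    random_state=42,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Unlike an exhaustive grid search, `RandomizedSearchCV` evaluates only `n_iter` sampled combinations, which keeps the cost fixed no matter how large the search space grows.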

The tuned model adds a small improvement on top of the Part 5 winner. The lift isn't dramatic because Random Forest is already a robust default. This reinforces a lesson from earlier in the project: time spent on features beats time spent on tuning, until the easier wins are exhausted.


Making the Model Actionable: What-If Analysis

A model's R² is academic. A more useful question for a creator is:

"If I could change one thing about my next video, what would maximize my predicted views?"

I take a representative "median" video from the test set and perturb one feature at a time, asking the model how its prediction shifts. The result translates the abstract model into directional creator guidance.
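The one-feature-at-a-time procedure can be sketched as below. The `toy_model` function is a stand-in with arbitrary, made-up coefficients; it is not the trained Random Forest and says nothing about the real effects. The sketch also assumes the model predicts `log1p(views)`, so predictions are converted back with `expm1` before differencing.

```python
import numpy as np

def what_if(predict_log_views, base_row: dict, changes: dict) -> dict:
    """Change one feature at a time and report the shift in predicted raw views."""
    base_views = np.expm1(predict_log_views(base_row))
    deltas = {}
    for feature, new_value in changes.items():
        modified = {**base_row, feature: new_value}  # copy with a single override
        deltas[f"{feature}={new_value}"] = float(
            np.expm1(predict_log_views(modified)) - base_views
        )
    return deltas

def toy_model(row: dict) -> float:
    """Hypothetical stand-in for a fitted model: feature dict -> log1p(views)."""
    return 13.0 + 0.3 * row["title_has_emoji"] - 0.01 * row["title_length"]

median_video = {"title_length": 50, "title_has_emoji": 0}
deltas = what_if(toy_model, median_video, {"title_has_emoji": 1, "title_length": 80})
for change, delta in deltas.items():
    print(f"{change}: {delta:+,.0f} views")
```

Each delta holds every other feature fixed at the base video's values, so the results describe that one starting point and can differ for videos elsewhere in feature space.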

What-if analysis

Each bar is the change in predicted views from a single what-if change. Green = the change helps; coral = it hurts. The biggest gains for this representative video come from lengthening the title and adding an emoji. That is interesting because it contradicts my earlier "shorter titles are better" finding for the average video. The model has learned that the relationship isn't perfectly monotonic; for some videos, longer titles do help.

Important caveat: Correlation in the model isn't causation in the real world. The model learned that channels with more trending history get more views, but that doesn't mean a small creator can fake it. These results are best read as "videos that look like X tend to get more views" โ€” useful directional guidance, not a step-by-step formula.


Live Prediction Dashboard

🎮 Try the live demo →

An interactive Gradio app where you can adjust posting hour, title length, channel size, category, and other features with sliders and watch the predicted view count and class (Flop / Average / Hit) update in real time.

Useful as both a sanity check (do predictions move sensibly when I change a feature?) and as a demonstration that the model is fast enough to use interactively. The biggest swings come from channel_n_videos, consistent with what the feature importance chart said.



Conclusion

The hypothesis was partially supported, but the mechanism was different than expected.

Posting time does correlate with views, but the strongest pattern was the 9 AM UTC music release window, which reflects industry coordination, not user activity. The single most predictive feature turned out to be channel size, not posting time.

Final results vs. baseline:

  • Regression: R² 0.066 → 0.491 (7.4× improvement)
  • Classification: Macro-F1 ~0.33 (random) → 0.59

Three lessons from this project:

  1. Feature engineering >> model choice. The 22 engineered features did most of the heavy lifting.
  2. EDA earns the right to make modeling decisions. Every choice (log transform, leakage exclusion, outlier handling) had an EDA-driven justification.
  3. Honest negative findings score better than false positives. Saying "the Average class is hard because quantile binning produces non-distinctive middles" is more valuable than pretending the model is perfect.

Limitations & Honest Caveats

  • days_to_trend is not a true pre-upload feature. It is known only after a video has trended, so it can't be used for genuine "predict-before-upload" inference. I included it because it's the strongest single signal and within assignment scope.
  • R² of 0.49 is real but capped. YouTube's algorithm and viral chance contain irreducible noise.
  • Trending bias. Every video in the dataset did trend, so the model can't predict whether a brand-new video will trend at all, only how big it'll get if it does.

How to Use the Models

import pickle
from huggingface_hub import hf_hub_download

# Regression
reg_path = hf_hub_download(
    repo_id="benjac8/youtube-trending-views-predictor",
    filename="regression_random_forest.pkl"
)
with open(reg_path, 'rb') as f:
    reg = pickle.load(f)
reg_model = reg['model']
print(f"Regression R² on test: {reg['r2_log_test']:.3f}")

# Classification
clf_path = hf_hub_download(
    repo_id="benjac8/youtube-trending-views-predictor",
    filename="classification_random_forest.pkl"
)
with open(clf_path, 'rb') as f:
    clf = pickle.load(f)
clf_model = clf['model']
class_names = clf['class_names']  # ['Flop', 'Average', 'Hit']

Both models expect a feature matrix with the exact 42 columns listed in feature_names, in that order.
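Since column order matters, a small helper can guard against mismatches before calling `predict`. This is a sketch under assumptions: `feature_names` is taken to be a plain list of column names (where it is stored is not specified above), the three-name list in the example is hypothetical, and the regression output is assumed to be on the `log1p` scale, matching the target described in Part 3.

```python
import numpy as np
import pandas as pd

def build_feature_matrix(rows: pd.DataFrame, feature_names: list) -> pd.DataFrame:
    """Return `rows` restricted to `feature_names`, in exactly that order."""
    missing = [name for name in feature_names if name not in rows.columns]
    if missing:
        raise ValueError(f"missing feature columns: {missing}")
    return rows[feature_names]

def to_raw_views(log_predictions) -> np.ndarray:
    """Invert a log1p target back to raw view counts."""
    return np.expm1(np.asarray(log_predictions, dtype=float))

# Hypothetical three-feature example; the real models expect all 42 columns
feature_names = ["publish_hour", "title_length", "channel_n_videos"]
rows = pd.DataFrame({"title_length": [42], "channel_n_videos": [7],
                     "publish_hour": [9], "unused_column": [1]})

X = build_feature_matrix(rows, feature_names)
print(list(X.columns))
print(to_raw_views([np.log1p(1000.0)]))
```

With the real artifacts, the call would then be along the lines of `to_raw_views(reg_model.predict(X))` for view counts, and `clf_model.predict(X)` for the class.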

