The Goal

Given features known at the moment a YouTube video is posted — its category, posting time, channel, title characteristics — can I predict how many views it will get when it peaks on the trending list?

Two related questions, two models:

Regression — predict the actual view count (a number)
Classification — predict whether the video will be a Flop, Average performer, or Hit

This README walks through every decision I made — what I cleaned, what I engineered, why I picked Random Forest over Gradient Boosting, why "Average" is the hardest class to predict, and what the model is honestly capable of.

Main Hypothesis

A YouTube video's view count is influenced by when it's posted.

Videos uploaded during peak audience-activity windows (likely evenings and weekends) will accumulate more views than those posted during off-hours.

Proposed mechanism: Posting at peak times → more viewers see the video early → YouTube's recommendation algorithm interprets that as quality and boosts the video further.

Dataset

Property	Detail
Source	YouTube Trending Video Dataset — Kaggle
Country slice used	United States
Raw size	268,787 rows × 16 columns
Date range	2020 – 2024
Final cleaned size	47,079 unique videos

Part 2 — EDA

Step 1 — First Look

The first inspection revealed three issues:

Finding	What it meant for me
268,787 rows but only ~47K unique videos	Same video on multiple trending days — must deduplicate
Date columns stored as strings	Must convert to real datetimes
Description column had ~1.7% missing values	Need an imputation decision

Step 2 — Data Cleaning

Step	Action	Reason
2.1	Convert date columns to datetime	Enables time-based feature engineering
2.2	Sort by view_count → drop duplicates by video_id	One row per video at peak views
2.3	Fill missing descriptions with empty string	Preserves the "missing" signal
2.4	Remove zero-view and impossible-date rows	62 rows (0.13%) — almost certainly bugs
2.5	Map numeric categoryId → category_name	"10" becomes "Music" for storytelling
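The cleaning steps above could be sketched in pandas roughly as follows. This is an illustration, not the notebook's actual code: the column names `publishedAt`, `trending_date`, and `description` are assumptions about the Kaggle file, the `category_names` mapping is supplied by the caller, and "impossible date" is read here as a trending date earlier than the publish date.

```python
import pandas as pd


def clean_trending(raw: pd.DataFrame, category_names: dict) -> pd.DataFrame:
    """Apply cleaning steps 2.1 to 2.5 to the raw trending table."""
    df = raw.copy()

    # 2.1 Parse date strings into timezone-aware datetimes (column names assumed).
    for col in ("publishedAt", "trending_date"):
        df[col] = pd.to_datetime(df[col], utc=True, errors="coerce")

    # 2.2 Keep one row per video: the one with the highest view_count.
    df = (df.sort_values("view_count", ascending=False)
            .drop_duplicates(subset="video_id", keep="first"))

    # 2.3 Missing descriptions become empty strings.
    df["description"] = df["description"].fillna("")

    # 2.4 Drop zero-view rows and rows that trend before they were published
    #     (one possible reading of "impossible date").
    valid = (df["view_count"] > 0) & (df["trending_date"] >= df["publishedAt"])
    df = df[valid].copy()

    # 2.5 Map numeric category IDs to readable names, e.g. {10: "Music"}.
    df["category_name"] = df["categoryId"].map(category_names)
    return df.reset_index(drop=True)
```

Sorting by `view_count` in descending order before `drop_duplicates(keep="first")` is what makes the surviving row the peak-views row for each video.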

Step 3 — A first look at the numeric columns

Before computing summary statistics, I plotted the distribution of each numeric column on a log scale (raw scale would be unreadable due to extreme skew):

Three takeaways jump out:

view_count, likes, and comment_count all show roughly bell-shaped log distributions — they're skewed but log-transforms will fix that, which justifies my log target choice for modeling
dislikes is dominated by a giant zero spike because YouTube hid public dislike counts in late 2021. Most rows from after that change have zero dislikes — that's why dislikes is unusable as a real signal
All four columns confirm the extreme skew that drives my log-transform decision

Step 3 — Descriptive Statistics: The Skew Problem

The most important number from this step:

view_count skewness = 71

By a common rule of thumb, skewness above 1 already counts as "highly skewed." A skew of 71 is extreme: a few mega-viral videos drag the mean far above the median. After applying a log transformation, skew dropped to 0.6 (much closer to symmetric).

This single finding drove every modeling decision afterward — I would predict log(view_count) instead of raw views.
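The effect of the log transform on skewness can be reproduced on synthetic data. The sketch below uses a log-normal sample as a stand-in for `view_count`; the distribution parameters are made up for illustration and the numbers it prints are not the project's results.

```python
import numpy as np
import pandas as pd

# Synthetic heavy-tailed "view counts" (log-normal), standing in for the real column.
rng = np.random.default_rng(0)
views = pd.Series(rng.lognormal(mean=13.5, sigma=1.2, size=50_000))

raw_skew = views.skew()            # large and positive: a long right tail
log_skew = np.log1p(views).skew()  # close to 0: roughly symmetric after the transform

print(f"raw skew: {raw_skew:.2f}, log1p skew: {log_skew:.2f}")
```

`log1p` is used rather than `log` so that a zero count would map to 0 instead of negative infinity, matching the `log1p(view_count)` target used later.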

Step 4 — Outliers

I checked outliers using both IQR and Z-score methods. Most "outliers" turned out to be real viral hits (BTS, MrBeast, BLACKPINK), not data errors. I removed only one row — a Discord ad with 1.4 billion views, almost 5× the next highest video.
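For reference, the two outlier checks can be written as small helpers. This is a generic sketch of the IQR and Z-score rules with conventional default thresholds (1.5 × IQR and |z| > 3); the thresholds actually used in the project are not stated, so treat them as assumptions.

```python
import numpy as np
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


def zscore_outliers(values: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Flag values more than `threshold` standard deviations from the mean."""
    z = (values - values.mean()) / values.std(ddof=0)
    return z.abs() > threshold


# Toy data: 99 ordinary values plus one extreme one.
toy = pd.Series(list(range(1, 100)) + [10_000])
flagged_by_iqr = toy[iqr_outliers(toy)]
flagged_by_z = toy[zscore_outliers(toy)]
```

Both rules only flag candidates. Deciding whether a flagged row is a data error or a genuine viral hit still requires looking at the row itself.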

Step 5 — Visualizations Answering Six Questions

Q1: Which categories dominate the trending list (count)?

Gaming, Entertainment, and Music dominate by count, with Gaming on top.

Q2: Which category has the highest views per video?

Music wins on per-video views (median ~1.7M), beating every other category. So Music is a per-video winner, not a volume winner.

Q3: How fast do videos go from posted to trending?

Median ~5.4 days. The double-peak pattern around days 5 and 6 hints at YouTube's trending algorithm cycling roughly weekly.

Q4 — Headline: Does posting time of day or day of week affect views?

Yes, but for a non-obvious reason. The strongest effect is a sharp spike at 9 AM UTC, especially on Mondays (median jumps to ~2.7M views). That hour appears to be a coordinated music release window, with labels timing drops for it. So my hypothesis was partially confirmed, but the likely mechanism is industry release coordination, not raw user activity.
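A day-by-hour table of median views is one way to surface a spike like this. The sketch below assumes a timezone-aware UTC datetime column named `publishedAt` (an assumed name) and a `view_count` column; the toy rows are invented.

```python
import pandas as pd

DAY_ORDER = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def median_views_by_slot(df: pd.DataFrame) -> pd.DataFrame:
    """Median view_count for each (day of week, publish hour) cell."""
    slots = pd.DataFrame({
        "day": df["publishedAt"].dt.day_name(),
        "hour": df["publishedAt"].dt.hour,
        "view_count": df["view_count"],
    })
    table = slots.pivot_table(index="day", columns="hour",
                              values="view_count", aggfunc="median")
    # Put the days in calendar order instead of alphabetical order.
    return table.reindex([d for d in DAY_ORDER if d in table.index])


# Invented rows: three Monday 9 AM uploads and one Tuesday afternoon upload.
demo = pd.DataFrame({
    "publishedAt": pd.to_datetime(
        ["2021-01-04 09:00", "2021-01-04 09:30",
         "2021-01-11 09:10", "2021-01-05 15:00"], utc=True),
    "view_count": [1_000_000, 3_000_000, 5_000_000, 200_000],
})
slot_table = median_views_by_slot(demo)
```

The median is used instead of the mean because a single mega-viral video would otherwise dominate its cell.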

Q5: Do shorter or longer titles get more views?

Shorter titles win. Monotonic trend — title_length becomes a feature in modeling.

Q6: Are big-name channels outperforming smaller ones?

Yes — dramatically. Channels with 100+ trending videos get ~3× more views than one-hit channels. This is the strongest single pattern in the data — bigger than category, bigger than posting time.

Part 3 — Baseline Linear Regression

Setting	Value
Features	3 (categoryId one-hot + 2 boolean flags)
Target	`log1p(view_count)`
Train/test split	80/20, `random_state=42`
R² (log scale)	0.066 ← the benchmark to beat

R² of 0.066 means the model explains only 6.6% of the variance in log view count. It is intentionally weak so I can measure improvement.
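A baseline with this setup might look like the sketch below. The names of the two boolean flag columns are not given above, so they are passed in as `flag_cols` (hypothetical), and the demo data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def fit_baseline(df: pd.DataFrame, flag_cols: list):
    """One-hot categoryId plus boolean flags -> linear model on log1p(views)."""
    X = pd.get_dummies(df[["categoryId"] + list(flag_cols)],
                       columns=["categoryId"]).astype(float)
    y = np.log1p(df["view_count"])
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    model = LinearRegression().fit(X_train, y_train)
    return model, r2_score(y_test, model.predict(X_test))


# Synthetic demo: one category gets systematically more views.
rng = np.random.default_rng(0)
n = 1000
demo = pd.DataFrame({
    "categoryId": rng.choice([10, 20, 24], size=n),
    "flag_a": rng.random(n) < 0.1,   # hypothetical boolean flag
    "flag_b": rng.random(n) < 0.1,   # hypothetical boolean flag
})
log_views = 12 + 2.0 * (demo["categoryId"] == 10) + rng.normal(0, 1, n)
demo["view_count"] = np.expm1(log_views).round()

baseline_model, baseline_r2 = fit_baseline(demo, ["flag_a", "flag_b"])
```

The R² returned here is measured on the log scale, the same scale as the 0.066 benchmark.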

Part 4 — Feature Engineering + Clustering

Engineered Features (22 added)

Category	Count	Examples
Time	6	publish_hour, days_to_trend, is_weekend
Title	7	title_length, title_has_emoji, title_caps_ratio
Tags	3	n_tags, has_tags, avg_tag_len
Description	4	desc_length, desc_n_links, desc_n_hashtags
Channel	2	channel_n_videos, channel_n_videos_log

The strongest single correlations with log-views were days_to_trend (+0.32), hours_to_trend (+0.32), and channel_n_videos_log (+0.27).
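A few of the title and time features could be computed along these lines. This is a sketch: the emoji regex covers only some common Unicode emoji ranges, `title_has_question` is an illustrative name that does not appear in the table above, and the exact definitions used in the project may differ.

```python
import re

import pandas as pd

# Rough emoji matcher covering common pictograph and symbol ranges (not exhaustive).
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def title_features(title: str) -> dict:
    """Length, emoji flag, share of capital letters, question-mark flag."""
    letters = [c for c in title if c.isalpha()]
    caps_ratio = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    return {
        "title_length": len(title),
        "title_has_emoji": bool(EMOJI.search(title)),
        "title_caps_ratio": caps_ratio,
        "title_has_question": "?" in title,  # illustrative name
    }


def time_features(published_at: pd.Timestamp, trending_at: pd.Timestamp) -> dict:
    """Publish hour, weekend flag, and days between publishing and trending."""
    return {
        "publish_hour": published_at.hour,
        "is_weekend": published_at.dayofweek >= 5,  # Saturday=5, Sunday=6
        "days_to_trend": (trending_at - published_at).total_seconds() / 86400,
    }
```

Note that `days_to_trend` needs the trending timestamp, so unlike the title features it is not available at the moment of upload.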

Clustering — K-Means with K=5

I picked K-Means for three reasons: my features are all numeric (Euclidean distance fits), I expected blob-shaped clusters (typical music releases, typical sports highlights), and K-Means lets me defend my choice with the elbow plot.

Cluster interpretations: Each cluster shows up as a different color in this 2D projection of the 14-dimensional feature space:

The clusters look somewhat overlapping in the plot because PCA only shows 28.8% of the variance — the rest of the separation lives in dimensions I can't draw on a 2D chart. But the cluster profile heatmap above already proved each cluster has its own distinctive feature combination.

Cluster	Size	Defining Trait	Story	Median Views
0	20,909	Short titles, few tags	Quick uploads — Gaming-heavy	1.04M
1	13,317	Long descriptive titles	Sports/News-style	0.95M
2	2,051	100% question-mark titles	Engagement-bait	0.79M
3	8,326	Long descriptions, many links	Promotional / linked-out	1.28M
4	2,476	100% emoji-titled, Music-heavy	Emoji-titled music drops	1.71M

Cluster 4 (emoji-titled music) had the highest median views, which is consistent with the 9 AM UTC release-window finding from EDA Q4.
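The clustering workflow (scale, inspect the elbow, fit K=5, project with PCA) can be sketched as below. Standardizing the features first and scanning K from 2 to 8 are assumptions made for this illustration, not details confirmed above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def cluster_and_project(X, k: int = 5, seed: int = 42):
    """Return cluster labels, elbow inertias, 2D PCA coordinates, and 2D variance share."""
    X_scaled = StandardScaler().fit_transform(X)

    # Elbow data: within-cluster sum of squares for a range of K values.
    inertias = {
        n: KMeans(n_clusters=n, n_init=10, random_state=seed).fit(X_scaled).inertia_
        for n in range(2, 9)
    }

    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X_scaled)

    # 2D projection for plotting; the variance share shows how much the plot can reveal.
    pca = PCA(n_components=2).fit(X_scaled)
    coords = pca.transform(X_scaled)
    return labels, inertias, coords, float(pca.explained_variance_ratio_.sum())
```

When the returned variance share is low, as with the 28.8% reported above, overlapping colors in the 2D plot do not by themselves mean the clusters overlap in the full feature space.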

Part 5 — Three Improved Regression Models

Model	R² (log)	MAE (raw views)	Improvement vs baseline
Baseline (3 features)	0.066	2,028,788	—
Linear (improved)	0.289	1,972,409	4.4×
Gradient Boosting	0.425	1,644,413	6.4×
Random Forest	0.491	1,514,934	7.4× ← winner

Why Random Forest won: It captured non-linear patterns like "videos posted at 9 AM by Music channels with 100+ uploads", interactions that linear regression can't represent without hand-built interaction terms.

Bigger lesson: Going from 3 → 42 features lifted Linear from 0.066 → 0.289 (4.4×). Switching to Random Forest gave another 1.7× on top. Most of the gain came from feature engineering, not model choice.
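The two metrics in the table live on different scales: R² is computed on log views, while MAE is computed after converting predictions back to raw views. The sketch below shows that evaluation pattern on synthetic data where the target depends on an interaction between two features; the data and model settings are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate_log_model(model, X_test, y_test_log) -> dict:
    """Score a model trained on log1p(views): R² in log space, MAE in raw views."""
    pred_log = model.predict(X_test)
    return {
        "r2_log": r2_score(y_test_log, pred_log),
        "mae_raw": mean_absolute_error(np.expm1(y_test_log), np.expm1(pred_log)),
    }


# Synthetic target: a boost that applies only when two conditions hold together.
rng = np.random.default_rng(42)
X = rng.random((2000, 3))
both = (X[:, 0] > 0.5) & (X[:, 1] > 0.5)
y_log = 12 + 2.0 * both + rng.normal(0, 0.3, 2000)
X_train, X_test, y_train, y_test = train_test_split(
    X, y_log, test_size=0.2, random_state=42)

linear = LinearRegression().fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)
scores = {
    "linear": evaluate_log_model(linear, X_test, y_test),
    "random_forest": evaluate_log_model(forest, X_test, y_test),
}
```

On data like this the forest scores higher because tree splits can isolate the "both conditions" region, whereas a purely additive linear model can only approximate it.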

Part 7 — Regression → Classification

I reframed the same target as a classification problem by binning view_count into three equal-frequency quantile bins (terciles):

Class	Range	Count	%
Flop	< 676,620 views	~15,500	33%
Average	676,620 – 1,758,898 views	~16,000	34%
Hit	≥ 1,758,898 views	~15,500	33%

I used macro-F1 (not accuracy) as my primary metric because it averages F1 across all classes equally, catching per-class weaknesses that accuracy hides.
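Both ideas, tercile binning and macro-F1, fit in a few lines. The sketch uses `pandas.qcut` for the bins, which is one way to produce them and not necessarily the call used in the project, and a four-row toy example to show how accuracy can hide a class the model never predicts.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

CLASS_NAMES = ["Flop", "Average", "Hit"]


def to_view_classes(view_count: pd.Series) -> pd.Series:
    """Split views into three equal-frequency bins (terciles)."""
    return pd.qcut(view_count, q=3, labels=CLASS_NAMES)


# Toy predictions: the model never predicts "Average".
y_true = ["Flop", "Average", "Hit", "Hit"]
y_pred = ["Flop", "Hit", "Hit", "Hit"]

accuracy = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro",
                    labels=CLASS_NAMES, zero_division=0)
# accuracy = 0.75, macro-F1 = 0.60 (per-class F1: Flop 1.0, Average 0.0, Hit 0.8)
```

The missed class pulls macro-F1 down by a full third of its weight, while accuracy only drops by one sample in four.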

Part 8 — Three Classification Models

Model	Macro-F1	Accuracy
Logistic Regression	0.520	0.527
Gradient Boosting	0.572	0.572
Random Forest	0.592	0.594

The Average class is hardest to predict

Class	Random Forest F1
Flop	0.64
Average	0.48 ← weakest
Hit	0.66

Why this happens: Quantile binning slices a continuous variable. Videos near the boundaries genuinely look similar to videos on either side. Flops and Hits live at the extremes, where features are distinctive. The middle is by definition less distinctive — and therefore harder to learn.

This isn't a bug; it's an honest consequence of the data structure.

Refining the Winning Model — Hyperparameter Tuning

The Random Forest in Part 5 used reasonable but unsystematic settings. Using RandomizedSearchCV with 3-fold cross-validation, I randomly sampled hyperparameter combinations (n_estimators, max_depth, min_samples_leaf, and max_features) from a predefined search space and kept the best-scoring one.

The tuned model adds a small improvement on top of the Part 5 winner. The lift isn't dramatic because Random Forest is already a robust default. This reinforces a lesson from earlier in the project: time spent on features beats time spent on tuning, until the easier wins are exhausted.
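A randomized search over those four hyperparameters might be set up as follows. The ranges in `PARAM_DISTRIBUTIONS` and the number of sampled candidates are placeholders chosen for illustration; the actual search space is not listed above.

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical search space: lists are sampled uniformly, randint draws integers.
PARAM_DISTRIBUTIONS = {
    "n_estimators": randint(100, 400),
    "max_depth": [None, 10, 20, 30],
    "min_samples_leaf": randint(1, 10),
    "max_features": ["sqrt", "log2", 0.5],
}


def tune_random_forest(X, y, n_iter: int = 20, seed: int = 42) -> RandomizedSearchCV:
    """Sample n_iter hyperparameter combinations and score each with 3-fold CV."""
    search = RandomizedSearchCV(
        RandomForestRegressor(random_state=seed),
        param_distributions=PARAM_DISTRIBUTIONS,
        n_iter=n_iter,
        cv=3,
        scoring="r2",
        random_state=seed,
    )
    return search.fit(X, y)
```

After fitting, `search.best_params_` holds the winning combination and `search.best_estimator_` is the model refit on all of the training data with those settings.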

Making the Model Actionable — What-If Analysis

A model's R² is academic. A more useful question for a creator is:

"If I could change one thing about my next video, what would maximize my predicted views?"

I take a representative "median" video from the test set and perturb one feature at a time, asking the model how its prediction shifts. The result translates the abstract model into directional creator guidance.

Each bar is the change in predicted views from a single what-if change. Green = the change helps; coral = it hurts. The biggest gains for this representative video come from lengthening the title and adding an emoji — interesting because it contradicts my earlier "shorter titles are better" finding for the average video. The model has learned that the relationship isn't perfectly monotonic; for some videos, longer titles do help.

Important caveat: Correlation in the model isn't causation in the real world. The model learned that channels with more trending history get more views, but that doesn't mean a small creator can fake it. These results are best read as "videos that look like X tend to get more views" — useful directional guidance, not a step-by-step formula.
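The one-feature-at-a-time perturbation can be sketched as a small helper. It assumes a model trained on log1p(views) and a baseline row given as a pandas Series; the stand-in linear model and its two features below exist only to make the example runnable.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def what_if(model, base_row: pd.Series, changes: dict) -> pd.Series:
    """Change in predicted raw views when each feature is altered on its own."""
    base_pred = np.expm1(model.predict(base_row.to_frame().T)[0])
    deltas = {}
    for feature, new_value in changes.items():
        row = base_row.copy()
        row[feature] = new_value  # perturb exactly one feature
        deltas[feature] = np.expm1(model.predict(row.to_frame().T)[0]) - base_pred
    return pd.Series(deltas).sort_values(ascending=False)


# Stand-in model on two made-up features, for illustration only.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "title_length": rng.integers(10, 100, 200).astype(float),
    "n_tags": rng.integers(0, 30, 200).astype(float),
})
y_demo = 10 + 0.01 * X_demo["title_length"] - 0.02 * X_demo["n_tags"]
demo_model = LinearRegression().fit(X_demo, y_demo)

base = X_demo.median()  # a representative "median" video
effects = what_if(demo_model, base, {
    "title_length": base["title_length"] + 10,
    "n_tags": base["n_tags"] + 5,
})
```

Because each change is applied in isolation, the deltas do not add up when several features are changed together, especially for a model that has learned interactions.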

Live Prediction Dashboard

🎮 Try the live demo →

An interactive Gradio app where you can adjust posting hour, title length, channel size, category, and other features with sliders and watch the predicted view count and class (Flop / Average / Hit) update in real time.

Useful as both a sanity check (do predictions move sensibly when I change a feature?) and as a demonstration that the model is fast enough to use interactively. The biggest swings come from channel_n_videos — consistent with what the feature importance chart said.

Conclusion

The hypothesis was partially supported, but the mechanism was different than expected.

Posting time does correlate with views, but the strongest pattern was the 9 AM UTC music release window: industry coordination, not user activity. The most predictive feature available before upload turned out to be channel size (channel_n_videos), not posting time.

Final results vs. baseline:

Regression: R² 0.066 → 0.491 (7.4× improvement)
Classification: Macro-F1 ~0.33 (random) → 0.59

Three lessons from this project:

Feature engineering >> model choice. The 22 engineered features did most of the heavy lifting.
EDA earns the right to make modeling decisions. Every choice (log transform, leakage exclusion, outlier handling) had an EDA-driven justification.
Honest negative findings score better than false positives. Saying "the Average class is hard because quantile binning produces non-distinctive middles" is more valuable than pretending the model is perfect.

Limitations & Honest Caveats

days_to_trend is not truly a pre-upload feature. It is only known after a video has trended, so it can't be used for true "predict-before-upload" inference. I included it because it's the strongest single signal and within assignment scope.
R² of 0.49 is real but capped. YouTube's algorithm and viral chance contain irreducible noise.
Trending bias. Every video in the dataset did trend, so the model can't predict whether a brand-new video will trend at all — only how big it'll get if it does.

How to Use the Models

```python
import pickle
from huggingface_hub import hf_hub_download

# Regression: download the pickled bundle and unpack the fitted model
reg_path = hf_hub_download(
    repo_id="benjac8/youtube-trending-views-predictor",
    filename="regression_random_forest.pkl"
)
with open(reg_path, 'rb') as f:
    reg = pickle.load(f)
reg_model = reg['model']  # predicts log1p(view_count)
print(f"Regression R² on test: {reg['r2_log_test']:.3f}")

# Classification: same pattern, plus the class labels
clf_path = hf_hub_download(
    repo_id="benjac8/youtube-trending-views-predictor",
    filename="classification_random_forest.pkl"
)
with open(clf_path, 'rb') as f:
    clf = pickle.load(f)
clf_model = clf['model']
class_names = clf['class_names']  # ['Flop', 'Average', 'Hit']
```

Both models expect a feature matrix with the exact 42 columns listed in feature_names, in that order.
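One way to guarantee the column order is to build the input row from a dictionary and select the columns by name. The helper below is a sketch: `predict_views` is a hypothetical function, it assumes the feature names are available as a Python list, and the tiny stand-in model exists only so the example runs without downloading anything.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def predict_views(model, feature_names: list, video: dict) -> float:
    """Predict raw views for one video given as a {feature: value} dict."""
    missing = set(feature_names) - set(video)
    if missing:
        raise ValueError(f"Missing features: {sorted(missing)}")
    # Selecting by name enforces the training-time column order.
    X = pd.DataFrame([video])[list(feature_names)]
    # The regression target is log1p(view_count), so invert it with expm1.
    return float(np.expm1(model.predict(X)[0]))


# Stand-in model and feature list for illustration; the real ones come from the pickle.
demo_features = ["title_length", "publish_hour"]
demo_X = pd.DataFrame([[10, 9], [50, 18], [30, 12]], columns=demo_features)
demo_model = LinearRegression().fit(demo_X, np.log1p([1e6, 2e5, 5e5]))

demo_views = predict_views(demo_model, demo_features,
                           {"publish_hour": 9, "title_length": 10})
```

Raising on missing features is a deliberate choice here: silently filling a gap with zeros would still produce a number, just not a meaningful one.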
