Revenue Prediction and User Classification Pipeline
This repository contains a comprehensive data science project focused on predicting revenue and classifying user spending behavior. The workflow moves from raw behavioral data to advanced machine learning modeling, utilizing clustering to define user personas and boosting algorithms for high-precision predictions.
Project Objectives
- Regression: Predict the exact dollar amount paid by a user to provide precise revenue forecasting and understand specific behavioral spend-drivers.
- Classification: Categorize users into Low vs. High Payer groups by leveraging regression results to identify and target the critical 13% minority segment.
Dataset Overview & EDA Transitions
The original dataset consisted of 276,843 entries with 16 raw features covering demographics and transactional history. During the Exploratory Data Analysis (EDA) stage, two major structural changes were implemented:
- Noise Reduction: Streamlined the dataset to 128,595 records, removing extreme anomalies to improve model generalization.
- Dimensionality Reduction: Used PCA to collapse multi-collinear variables and DBSCAN to create new "Cluster" features representing distinct behavioral personas.
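
The dimensionality-reduction step above can be sketched roughly as follows. This is a minimal illustration on synthetic data, assuming scikit-learn's `PCA` and `DBSCAN`; the feature matrix, `eps`, `min_samples`, and the number of components are placeholder values, not the settings used in the notebook.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the behavioral features (hypothetical data):
# two dense groups plus a handful of scattered points.
group_a = rng.normal(loc=0.0, scale=0.3, size=(200, 4))
group_b = rng.normal(loc=5.0, scale=0.3, size=(200, 4))
scattered = rng.uniform(low=-10.0, high=15.0, size=(10, 4))
X = np.vstack([group_a, group_b, scattered])

# Standardize, then project onto two principal components.
X_scaled = StandardScaler().fit_transform(X)
components = PCA(n_components=2, random_state=0).fit_transform(X_scaled)

# DBSCAN labels dense regions 0, 1, ... and marks noise points as -1.
# eps and min_samples are illustrative values, not the project's settings.
cluster = DBSCAN(eps=0.5, min_samples=5).fit_predict(components)

# The label can then be appended as a "Cluster" feature for later models.
features = np.column_stack([components, cluster])
```

Points labeled `-1` are the ones DBSCAN treats as noise, which corresponds to the idea of separating outliers from the dense behavioral personas.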
Key Insights from EDA
- Target Imbalance: Identified a significant 87/13 split, shifting our evaluation focus to PR-AUC.
- Behavioral Dominance: Features like `playtime_per_day` and `is_auto_renew` showed much higher predictive power than demographic data.
- Persona Clusters: Users form specific "behavioral islands" rather than a linear spectrum, making clustering essential for accuracy.
Research and Findings
- Auto renewal: The most striking insight is that customers who auto-renew are far less likely to churn.
- Playtime per day: Customer engagement, measured by playtime per day, appears to be a much stronger predictor of retention than registration year or plan price.
Visual Analysis
1. Behavioral Clustering (DBSCAN & PCA)
We reduced the feature space into Principal Components to distinguish between stable "Standard Personas" and high-variance behavioral outliers.

In the plot, the gold line represents the average user.
The teal, blue, and purple groups are niche users: each has one extreme feature that pushes it away from the average, as identified by the algorithm.
The black dots are the Outlier Group, which hurt the predictions and lowered the $R^2$ score in the previous attempt. Isolating them is the main win here.
2. Regression Model Evolution (Revenue Prediction)
Focusing on minimizing the mean absolute error (MAE), CatBoost emerged as the winner with a 23.3% $R^2$ improvement and an average error of only \$10.91.
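
For reference, the two regression metrics mentioned here have the following standard definitions. The notation is an assumption introduced for illustration: $y_i$ is the actual amount paid by user $i$, $\hat{y}_i$ the predicted amount, $\bar{y}$ the mean of the actual amounts, and $n$ the number of users.

```latex
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert,
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

Read this way, the reported MAE means that predictions miss the actual amount by about 10.91 dollars on average, in absolute terms.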

3. Targeting the 13% Minority (Classification)
Prioritizing PR-AUC allowed us to catch the minority segment effectively.
Winner - CatBoost achieved a PR-AUC of 0.922, successfully identifying 47x more "Low Payers" than the baseline.
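
A small sketch of why PR-AUC is more informative than accuracy under an 87/13 split. It uses a hand-written average precision (one common way to summarize the precision-recall curve; the source does not say which estimator was used) on hypothetical toy labels, treating the 13% minority as the positive class.

```python
def average_precision(y_true, scores):
    """Mean of the precision values at each rank where a positive appears.

    Ties in `scores` are broken by input order (a simplification).
    """
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Hypothetical toy labels mirroring the 87/13 split: 13 positives in 100 users.
y_true = [1] * 13 + [0] * 87

# Always predicting the majority class is 87% accurate but finds no positives.
majority_accuracy = sum(1 for y in y_true if y == 0) / len(y_true)

# A scorer that ranks every positive above every negative.
perfect_ap = average_precision(y_true, [1.0] * 13 + [0.0] * 87)

# A scorer that ranks every positive below every negative.
worst_ap = average_precision(y_true, [0.0] * 13 + [1.0] * 87)
```

Accuracy rewards the trivial majority predictor, while average precision only rises when minority users are ranked near the top, which is the behavior the classification task cares about.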

Repository Structure
| File | Description |
|---|---|
| `Assignment #2.2 - Ohad Kamhaji.ipynb` | Full Python pipeline. |
| `catboost_regression_model.pkl` | Model for numerical revenue prediction. |
| `catboost_classification_model.pkl` | Model for payer segment classification. |
| `scaler.pkl` | Object for input normalization. |
| `pca_transformer.pkl` | Object for feature reduction. |
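
The saved artifacts could be loaded along these lines. This is a sketch under assumptions: the file names come from the table above, but `load_artifacts` is a hypothetical helper, and it assumes the files were written with the standard `pickle` module (if they were saved with joblib, `joblib.load` would be needed instead).

```python
import pickle
from pathlib import Path

# File names are taken from the repository table; the keys are illustrative.
ARTIFACT_FILES = {
    "regressor": "catboost_regression_model.pkl",
    "classifier": "catboost_classification_model.pkl",
    "scaler": "scaler.pkl",
    "pca": "pca_transformer.pkl",
}


def load_artifacts(directory="."):
    """Unpickle the four artifacts from `directory` (hypothetical helper).

    Raises FileNotFoundError if any of the files is missing.
    """
    artifacts = {}
    for name, filename in ARTIFACT_FILES.items():
        with open(Path(directory) / filename, "rb") as handle:
            artifacts[name] = pickle.load(handle)
    return artifacts
```

Unpickling requires the libraries that created the objects (here presumably catboost and scikit-learn) to be installed, and pickle files should only be loaded from trusted sources. The order in which the scaler, PCA transformer, and models are applied at inference time is defined in the notebook and is not assumed here.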
Installation
pip install catboost xgboost scikit-learn pandas seaborn matplotlib

