Revenue Prediction and User Classification Pipeline

This repository contains a data science project that predicts per-user revenue and classifies user spending behavior. The workflow moves from raw behavioral data to machine learning models, using clustering to define user personas and boosting algorithms for the final predictions.

Project Objectives

  1. Regression: Predict the exact dollar amount paid by a user to provide precise revenue forecasting and understand specific behavioral spend-drivers.
  2. Classification: Categorize users into Low vs. High Payer groups by leveraging regression results to identify and target the critical 13% minority segment.

Dataset Overview & EDA Transitions

The original dataset consisted of 276,843 entries with 16 raw features covering demographics and transactional history. During the Exploratory Data Analysis (EDA) stage, two major structural changes were implemented:

  • Noise Reduction: Streamlined the dataset to 128,595 records, removing extreme anomalies to improve model generalization.
  • Dimensionality Reduction: Used PCA to collapse multicollinear variables and DBSCAN to create new "Cluster" features representing distinct behavioral personas.
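
The PCA and DBSCAN step can be illustrated with a minimal sketch. It assumes scikit-learn, synthetic data, and a scale, project, cluster order. The function name `add_cluster_feature` and the `eps` and `min_samples` values are hypothetical and are not taken from the notebook.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def add_cluster_feature(X, n_components=2, eps=0.5, min_samples=5):
    """Scale X, project it with PCA, and label each row with a DBSCAN cluster id.

    All parameter defaults are illustrative assumptions.
    """
    X_scaled = StandardScaler().fit_transform(X)
    components = PCA(n_components=n_components).fit_transform(X_scaled)
    # DBSCAN assigns -1 to points it treats as noise (outliers).
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(components)
    return components, labels


# Synthetic stand-in for the behavioral features: two dense groups and one extreme row.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(50, 4))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(50, 4))
outlier = np.array([[20.0, -20.0, 20.0, -20.0]])
X = np.vstack([blob_a, blob_b, outlier])

components, labels = add_cluster_feature(X)
```

Here `labels` can be appended to the feature table as a categorical "Cluster" column, with `-1` marking rows that DBSCAN did not assign to any dense group.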

Key Insights from EDA

  • Target Imbalance: Identified a significant 87/13 split, shifting our evaluation focus to PR-AUC.
  • Behavioral Dominance: Features like playtime_per_day and is_auto_renew showed much higher predictive power than demographic data.
  • Persona Clusters: Users form specific "behavioral islands" rather than a linear spectrum, making clustering essential for accuracy.
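
To see why an 87/13 split pushes evaluation toward PR-AUC, the sketch below compares accuracy with average precision (a common estimate of PR-AUC) on toy labels. It assumes the 13% minority is the positive class. The `average_precision` helper is written out by hand for illustration and breaks ties by input order, so it may differ slightly from library implementations.

```python
def average_precision(y_true, scores):
    """Mean of the precision values at each rank where a positive is retrieved."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(y_true)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_pos


# Toy labels with the same 87/13 imbalance (1 = minority class, an assumption).
y_true = [1] * 13 + [0] * 87

# A model that always predicts the majority class looks good on accuracy
# while finding none of the minority users.
majority_pred = [0] * 100
accuracy = sum(p == t for p, t in zip(majority_pred, y_true)) / len(y_true)

# Average precision depends on how well the scores rank the minority class.
perfect_scores = [1.0] * 13 + [0.0] * 87
inverted_scores = [0.0] * 13 + [1.0] * 87
ap_perfect = average_precision(y_true, perfect_scores)
ap_inverted = average_precision(y_true, inverted_scores)
```

The majority-class predictor reaches 87% accuracy without identifying a single minority user, whereas average precision falls below the 13% prevalence when the ranking is inverted and reaches 1.0 only when every minority user is ranked first.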

Research and findings

  • Auto renewal: The most striking insight is that customers who auto-renew are far less likely to churn.

    [Figure: Auto renewal]

  • Playtime per day: Customer engagement, as measured by playtime per day, appears to be a much stronger predictor of retention than registration year or plan price.

    [Figure: Playtime per day]

Visual Analysis

1. Behavioral Clustering (DBSCAN & PCA)

We reduced the feature space to principal components to distinguish stable "Standard Personas" from high-variance behavioral outliers.

[Figure: Behavioral Clusters]

In the plot, the gold line represents the average user.

The teal, blue, and purple points are niche users: each has one extreme feature that pushes it away from the average, as identified by the clustering step.

The black dots are the Outlier Group. In the previous iteration these points degraded the predictions and lowered the $R^2$ score, so isolating them is the main gain of this approach.

2. Regression Model Evolution (Revenue Prediction)

With the focus on minimizing the mean absolute error (MAE), CatBoost emerged as the winner, with a 23.3% improvement in $R^2$ and an MAE of only $10.91.

[Figure: Regression Metrics]
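
For reference, the two regression metrics quoted above can be computed as in the following sketch. The numbers are made-up dollar amounts, not project data, and the helper functions are plain-Python illustrations of the standard definitions.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative amounts paid (in dollars) and hypothetical model predictions.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

mae = mean_absolute_error(actual, predicted)  # average error in dollars
r2 = r2_score(actual, predicted)              # share of variance explained
```

MAE is expressed in the same unit as the target, so a value of $10.91 reads directly as the typical dollar error per user, while $R^2$ measures how much of the variance in spend the model explains.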

3. Targeting the 13% Minority (Classification)

Prioritizing PR-AUC allowed us to catch the minority segment effectively. CatBoost was again the winner, achieving a PR-AUC of 0.922 and identifying 47x more "Low Payers" than the baseline.

[Figure: Classification Comparison]


Repository Structure

| File | Description |
|---|---|
| Assignment #2.2 - Ohad Kamhaji.ipynb | Full Python pipeline. |
| catboost_regression_model.pkl | Model for numerical revenue prediction. |
| catboost_classification_model.pkl | Model for payer segment classification. |
| scaler.pkl | Object for input normalization. |
| pca_transformer.pkl | Object for feature reduction. |
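
A minimal sketch of loading the saved artifacts is shown below. It assumes the `.pkl` files were written with Python's standard `pickle` module and sit in one directory. The `load_artifacts` helper and the dictionary keys are hypothetical names, and the expected input columns and transform order should be taken from the notebook rather than from this example.

```python
import pickle
from pathlib import Path

# File names as listed in the repository; the dictionary keys are illustrative.
ARTIFACT_FILES = {
    "scaler": "scaler.pkl",
    "pca": "pca_transformer.pkl",
    "regressor": "catboost_regression_model.pkl",
    "classifier": "catboost_classification_model.pkl",
}


def load_artifacts(directory="."):
    """Unpickle every artifact listed above from `directory`.

    Only load pickle files from a source you trust: unpickling can run arbitrary code.
    """
    artifacts = {}
    for name, filename in ARTIFACT_FILES.items():
        with open(Path(directory) / filename, "rb") as handle:
            artifacts[name] = pickle.load(handle)
    return artifacts
```

Calling `load_artifacts("path/to/repo")` returns a dictionary with the four objects. Unpickling the CatBoost models requires the `catboost` package to be installed, and the scaler and PCA objects are most reliably loaded with the same scikit-learn version that saved them.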

Installation

pip install catboost xgboost scikit-learn pandas seaborn matplotlib