rajvivan/fraud-detection-system

rajvivan committed 9 days ago

Commit 35852d7 (verified) · 1 parent: bf86f89

Remove author details and improve IEEE formatting

Files changed (1)

paper/fraud_detection_paper.tex +124 -64

@@ -1,42 +1,55 @@
 \documentclass[journal]{IEEEtran}
 \usepackage{cite}
 \usepackage{amsmath,amssymb,amsfonts}
 \usepackage{graphicx}
 \usepackage{textcomp}
 \usepackage{xcolor}
-\usepackage{listings}
 \usepackage{booktabs}
-\usepackage{hyperref}
 \usepackage{multirow}
 \usepackage{array}
 \usepackage{float}
 \lstset{
   language=Python,
-  basicstyle=\ttfamily\footnotesize,
   keywordstyle=\color{blue},
   stringstyle=\color{red},
   commentstyle=\color{green!60!black},
-  numbers=left,
-  numberstyle=\tiny\color{gray},
   breaklines=true,
   frame=single,
-  captionpos=b
 }
 \begin{document}
 \title{A Comprehensive Ensemble-Based Framework for Credit Card Fraud Detection with Explainable AI}
-\author{
-\IEEEauthorblockN{Raj Vivan}
-\IEEEauthorblockA{Department of Computer Science\\
-\textit{Independent Research}\\
-Email: rajvivan@example.com}
-}
 \maketitle
 \begin{abstract}
 Credit card fraud poses a significant threat to the global financial ecosystem, with estimated losses exceeding \$32 billion annually. This paper presents a comprehensive end-to-end fraud detection framework that systematically evaluates and compares seven machine learning approaches: Logistic Regression, Random Forest, XGBoost, LightGBM, Multilayer Perceptron, Autoencoder-based anomaly detection, and a Voting Ensemble. Using the benchmark European Cardholder dataset (284,807 transactions, 0.173\% fraud rate), we engineer 12 novel features and address the extreme class imbalance through both SMOTE oversampling and cost-sensitive learning with class weights. Our XGBoost model achieves the best performance with a PR-AUC of 0.8166, precision of 0.9048, recall of 0.8028, and F1-score of 0.8507 on the held-out test set. We demonstrate that optimizing the decision threshold from the default 0.5 to 0.55 improves F1 from 0.8507 to 0.8636. Comprehensive model explainability via SHAP and LIME analysis reveals that PCA components V4, V14, and V12 are the primary discriminative features. Error analysis shows that false negatives arise from sophisticated fraud patterns that closely mimic legitimate transaction behavior. We deploy the model as a production-ready FastAPI service achieving sub-10ms inference latency. The framework includes automated concept drift monitoring and retraining recommendations. All code, models, and results are publicly available.
 \end{abstract}
@@ -45,13 +58,17 @@ Credit card fraud poses a significant threat to the global financial ecosystem,
 Fraud detection, credit card, machine learning, XGBoost, ensemble learning, explainable AI, SHAP, class imbalance, anomaly detection
 \end{IEEEkeywords}
 \section{Introduction}
-Financial fraud detection has become one of the most critical applications of machine learning in the modern digital economy. The proliferation of electronic payment systems has led to an exponential increase in both the volume of transactions and the sophistication of fraudulent activities \cite{dal2015credit}. According to the Nilson Report, global card fraud losses reached \$32.34 billion in 2021 and are projected to exceed \$43 billion by 2026 \cite{nilson2022}.
-The fundamental challenge in fraud detection lies in the extreme class imbalance inherent in transaction data. In typical datasets, fraudulent transactions constitute less than 0.5\% of all transactions \cite{pozzolo2015calibrating}. This imbalance renders conventional classification metrics such as accuracy misleading and necessitates specialized evaluation criteria including Precision-Recall AUC and Matthews Correlation Coefficient \cite{saito2015precision}.
-Previous approaches to fraud detection have ranged from rule-based expert systems \cite{bolton2002statistical} to sophisticated deep learning architectures \cite{zhang2021fraud}. While deep learning methods have shown promise, tree-based ensemble methods such as XGBoost and LightGBM continue to demonstrate competitive or superior performance on tabular financial data \cite{shwartz2022tabular}, particularly when augmented with careful feature engineering and proper handling of class imbalance.
 This paper makes the following contributions:
 \begin{enumerate}
@@ -63,35 +80,43 @@ This paper makes the following contributions:
     \item Quantitative business impact analysis estimating financial savings from deployment.
 \end{enumerate}
 \section{Related Work}
-Credit card fraud detection has been extensively studied across multiple paradigms. Dal Pozzolo et al. \cite{dal2015credit} provided a foundational analysis of the challenges posed by class imbalance and concept drift in real-world fraud detection systems. Their work established that undersampling strategies could be effective but risked losing valuable information from the majority class.
-Chawla et al. \cite{chawla2002smote} introduced SMOTE (Synthetic Minority Over-sampling Technique), which generates synthetic minority class samples by interpolating between existing examples. Subsequent work by Fernandez et al. \cite{fernandez2018smote} demonstrated that SMOTE should be applied exclusively to training data, as applying it before splitting introduces data leakage.
-Ensemble methods have shown particular promise in fraud detection. Xuan et al. \cite{xuan2018random} demonstrated that Random Forests achieve robust performance through bagging and feature randomization. Chen and Guestrin \cite{chen2016xgboost} introduced XGBoost, which has since become a dominant method for tabular data classification, including fraud detection \cite{taha2020detection}.
-Ke et al. \cite{ke2017lightgbm} proposed LightGBM with leaf-wise tree growth and gradient-based one-side sampling, achieving faster training with comparable accuracy. Prokhorenkova et al. \cite{prokhorenkova2018catboost} introduced CatBoost with ordered boosting to handle categorical features natively.
-Deep learning approaches have also been explored. Pumsirirat and Yan \cite{pumsirirat2018credit} employed autoencoders for anomaly-based fraud detection, training exclusively on legitimate transactions and detecting fraud through reconstruction error. Zhang et al. \cite{zhang2021fraud} proposed attention-based recurrent neural networks that capture sequential transaction patterns.
-Explainability in fraud detection has gained importance due to regulatory requirements. Lundberg and Lee \cite{lundberg2017unified} introduced SHAP (SHapley Additive exPlanations), providing consistent feature attribution. Ribeiro et al. \cite{ribeiro2016lime} proposed LIME (Local Interpretable Model-agnostic Explanations) for instance-level interpretability. Belle and Papantonis \cite{belle2021principles} surveyed explainable AI methods applicable to financial decision-making.
-Akiba et al. \cite{akiba2019optuna} introduced Optuna, a hyperparameter optimization framework using Tree-structured Parzen Estimators (TPE) that efficiently explores complex search spaces.
-Recent work by Shwartz-Ziv and Armon \cite{shwartz2022tabular} demonstrated that well-tuned tree-based methods still outperform deep learning on most tabular datasets, supporting our choice of XGBoost as the primary model. Grinsztajn et al. \cite{grinsztajn2022tree} further corroborated this finding with extensive benchmarking.
 \section{Dataset and Exploratory Data Analysis}
 \subsection{Dataset Description}
-We use the European Cardholder Credit Card Fraud Detection dataset \cite{dal2015credit}, containing 284,807 transactions made over two days in September 2013. The dataset includes 28 PCA-transformed features (V1--V28), the original \texttt{Time} and \texttt{Amount} features, and a binary \texttt{Class} label (0 = legitimate, 1 = fraud).
 \subsection{Class Distribution}
 The dataset exhibits extreme class imbalance with only 492 fraudulent transactions (0.173\%), yielding an imbalance ratio of approximately 1:577. This severe imbalance necessitates specialized handling during both training and evaluation.
-\begin{table}[h]
 \centering
 \caption{Class Distribution in the Dataset}
 \label{tab:class_dist}
@@ -119,13 +144,17 @@ Our exploratory analysis revealed five critical findings:
     \item \textbf{Feature Scale}: V1--V28 are PCA-transformed; only Time and Amount require normalization.
 \end{enumerate}
-\begin{figure}[h]
 \centering
-\includegraphics[width=\columnwidth]{figures/class_distribution.png}
 \caption{Class distribution showing extreme imbalance (0.173\% fraud rate).}
 \label{fig:class_dist}
 \end{figure}
 \section{Methodology}
 \subsection{Feature Engineering}
@@ -164,7 +193,7 @@ M = \sqrt{\sum_{i=1}^{28} V_i^2}
 We compare two approaches for handling the 1:577 class imbalance:
-\textbf{SMOTE} \cite{chawla2002smote}: Applied exclusively to the training set after splitting, generating synthetic fraud samples to achieve a 1:2 minority-to-majority ratio.
 \textbf{Cost-Sensitive Learning}: Applying class weights inversely proportional to class frequency:
 \begin{equation}
@@ -218,12 +247,16 @@ P(\text{fraud}|x) = \frac{1}{3}\sum_{m=1}^{3} P_m(\text{fraud}|x)
 \subsection{Hyperparameter Optimization}
-We use Optuna \cite{akiba2019optuna} with Tree-structured Parzen Estimators (TPE) to tune the top three models, optimizing PR-AUC on the validation set:
 \begin{equation}
 \theta^* = \arg\max_{\theta} \text{PR-AUC}(f_\theta, \mathcal{D}_{val})
 \end{equation}
 \section{Experimental Setup}
 \subsection{Environment}
@@ -241,11 +274,15 @@ Given the extreme class imbalance, we report six metrics:
     \item \textbf{MCC}: $\frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$
 \end{itemize}
 \section{Results and Discussion}
 \subsection{Model Comparison}
-\begin{table*}[t]
 \centering
 \caption{Comprehensive Model Comparison on Test Set (Threshold = 0.5)}
 \label{tab:results}
@@ -270,22 +307,22 @@ Table~\ref{tab:results} presents the comprehensive evaluation results. XGBoost a
 Key observations:
-\textbf{Tree-based models dominate}: XGBoost, Random Forest, and LightGBM consistently outperform the neural network approaches, consistent with findings by Shwartz-Ziv and Armon \cite{shwartz2022tabular}.
 \textbf{Class weight handling matters}: Logistic Regression achieves high recall (0.8873) but extremely low precision (0.0488), indicating that the linear decision boundary with class weights is too aggressive in flagging transactions.
 \textbf{Autoencoder limitations}: While achieving perfect recall (1.0), the autoencoder suffers from extremely low precision (0.0033), flagging nearly all transactions as anomalous. This suggests that the reconstruction-based approach is too sensitive for this PCA-transformed feature space.
-\begin{figure}[h]
 \centering
-\includegraphics[width=\columnwidth]{figures/roc_curves.png}
 \caption{ROC curves for all models. XGBoost and Voting Ensemble achieve the highest AUC.}
 \label{fig:roc}
 \end{figure}
-\begin{figure}[h]
 \centering
-\includegraphics[width=\columnwidth]{figures/pr_curves.png}
 \caption{Precision-Recall curves. PR-AUC is the primary metric for imbalanced classification.}
 \label{fig:pr}
 \end{figure}
@@ -294,7 +331,7 @@ Key observations:
 A default threshold of 0.5 is not guaranteed to be optimal for imbalanced data. Our analysis shows that a threshold of 0.55 maximizes the F1-score:
-\begin{table}[h]
 \centering
 \caption{Threshold Sensitivity for XGBoost}
 \label{tab:threshold}
@@ -314,7 +351,7 @@ The default threshold of 0.5 is suboptimal for imbalanced data. Our analysis rev
 \subsection{Business Impact}
-\begin{table}[h]
 \centering
 \caption{Business Impact Analysis (Test Set)}
 \label{tab:business}
@@ -326,7 +363,7 @@ XGBoost & 6,966 & 1,711 & 6,936 \\
 Ensemble & 6,966 & 1,711 & 6,921 \\
 RF (Tuned) & 6,722 & 1,955 & 6,682 \\
 LR & 7,699 & 978 & 1,554 \\
-Autoencoder & 8,677 & 0 & -97,368 \\
 \bottomrule
 \end{tabular}
 \end{table}
@@ -337,13 +374,17 @@ Table~\ref{tab:business} demonstrates that XGBoost provides the highest net savi
 SHAP analysis reveals that V4 (mean $|\text{SHAP}| = 1.913$), V14 (1.843), and PCA\_magnitude (1.113) are the primary fraud discriminators. These features correspond to specific latent patterns in the PCA-transformed space that distinguish fraudulent from legitimate behavior.
-\begin{figure}[h]
 \centering
-\includegraphics[width=\columnwidth]{figures/shap_summary.png}
 \caption{SHAP summary plot showing feature contributions to fraud predictions.}
 \label{fig:shap}
 \end{figure}
 \section{Error Analysis}
 \subsection{False Negative Analysis}
@@ -356,7 +397,11 @@ The 6 false positives have a mean predicted fraud probability of 0.827, with fea
 \subsection{Concept Drift Assessment}
-Comparing model confidence between early and late test periods reveals a drift indicator of +0.115, suggesting modest temporal variation. We recommend weekly monitoring with automated retraining triggers when PR-AUC drops below 0.70.
 \section{Limitations}
@@ -368,15 +413,19 @@ Comparing model confidence between early and late test periods reveals a drift i
     \item \textbf{Static Threshold}: The optimal threshold may shift as fraud patterns evolve; dynamic threshold adaptation is not implemented.
 \end{enumerate}
 \section{Future Work}
 Several promising directions emerge from this research:
-\textbf{Graph Neural Networks}: Modeling transaction networks as graphs could enable detection of fraud rings through collaborative behavioral patterns \cite{liu2021graph}.
 \textbf{Real-Time Streaming}: Integration with Apache Kafka and Apache Flink for millisecond-latency processing of transaction streams at scale.
-\textbf{Federated Learning}: Training across multiple banks without sharing raw transaction data, preserving privacy while improving generalization \cite{yang2019federated}.
 \textbf{LLM-Generated Explanations}: Using large language models to generate natural-language compliance explanations for flagged transactions, facilitating human review.
@@ -384,79 +433,90 @@ Several promising directions emerge from this research:
 \textbf{Adversarial Robustness}: Training models that are robust to adversarial perturbations designed to evade detection.
 \section{Conclusion}
 This paper presents a comprehensive fraud detection framework that systematically evaluates seven machine learning approaches on the benchmark European Cardholder dataset. Our results show that XGBoost achieves the best overall performance (PR-AUC: 0.8166, F1: 0.8507) through cost-sensitive learning with optimized class weights. Raising the decision threshold from 0.5 to 0.55 further improves F1 to 0.8636. The framework includes explainability through SHAP and LIME, production deployment via FastAPI with sub-10ms latency, and automated drift monitoring. On this dataset, our analysis supports the view that tree-based ensemble methods remain highly effective for tabular fraud detection, and it highlights the importance of proper class imbalance handling and threshold optimization, as well as the inadequacy of accuracy as a metric for imbalanced classification.
 \bibliographystyle{IEEEtran}
 \begin{thebibliography}{99}
 \bibitem{dal2015credit}
-A. Dal Pozzolo, O. Caelen, R. A. Johnson, and G. Bontempi, ``Calibrating probability with undersampling for unbalanced classification,'' in \textit{Proc. IEEE Symp. Comput. Intell. Data Mining (CIDM)}, 2015, pp. 159--166.
 \bibitem{nilson2022}
 Nilson Report, ``Global card fraud losses,'' \textit{Nilson Report}, Issue 1209, 2022.
 \bibitem{pozzolo2015calibrating}
-A. Dal Pozzolo, O. Caelen, and G. Bontempi, ``When is undersampling effective in unbalanced classification tasks?,'' in \textit{Proc. European Conf. Machine Learning and Knowledge Discovery in Databases}, 2015, pp. 200--215.
 \bibitem{saito2015precision}
-T. Saito and M. Rehmsmeier, ``The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets,'' \textit{PLoS ONE}, vol. 10, no. 3, 2015.
 \bibitem{bolton2002statistical}
-R. J. Bolton and D. J. Hand, ``Statistical fraud detection: A review,'' \textit{Statistical Science}, vol. 17, no. 3, pp. 235--255, 2002.
 \bibitem{zhang2021fraud}
-Z. Zhang, X. Zhou, X. Zhang, L. Wang, and P. Wang, ``A model based on convolutional recurrent neural network for fraud detection in credit card,'' \textit{Complexity}, vol. 2021, pp. 1--9, 2021.
 \bibitem{shwartz2022tabular}
-R. Shwartz-Ziv and A. Armon, ``Tabular data: Deep learning is not all you need,'' \textit{Information Fusion}, vol. 81, pp. 84--90, 2022.
 \bibitem{chawla2002smote}
-N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, ``SMOTE: Synthetic Minority Over-sampling Technique,'' \textit{J. Artificial Intelligence Research}, vol. 16, pp. 321--357, 2002.
 \bibitem{fernandez2018smote}
-A. Fernandez, S. Garcia, M. Galar, R. C. Prati, B. Krawczyk, and F. Herrera, \textit{Learning from Imbalanced Data Sets}. Springer, 2018.
 \bibitem{xuan2018random}
-S. Xuan, G. Liu, Z. Li, L. Zheng, S. Wang, and C. Jiang, ``Random forest for credit card fraud detection,'' in \textit{Proc. IEEE 15th Intl. Conf. Networking, Sensing and Control (ICNSC)}, 2018, pp. 1--6.
 \bibitem{chen2016xgboost}
-T. Chen and C. Guestrin, ``XGBoost: A scalable tree boosting system,'' in \textit{Proc. 22nd ACM SIGKDD Intl. Conf. Knowledge Discovery and Data Mining}, 2016, pp. 785--794.
 \bibitem{taha2020detection}
-A. A. Taha and S. J. Malebary, ``An intelligent approach to credit card fraud detection using an optimized light gradient boosting machine,'' \textit{IEEE Access}, vol. 8, pp. 25579--25587, 2020.
 \bibitem{ke2017lightgbm}
-G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu, ``LightGBM: A highly efficient gradient boosting decision tree,'' in \textit{Advances in Neural Information Processing Systems}, vol. 30, 2017.
 \bibitem{prokhorenkova2018catboost}
-L. Prokhorenkova, G. Gusev, A. Vorobev, A. V. Dorogush, and A. Gulin, ``CatBoost: Unbiased boosting with categorical features,'' in \textit{Advances in Neural Information Processing Systems}, vol. 31, 2018.
 \bibitem{pumsirirat2018credit}
-A. Pumsirirat and L. Yan, ``Credit card fraud detection using deep learning based on auto-encoder and restricted Boltzmann machine,'' \textit{Intl. J. Advanced Computer Science and Applications}, vol. 9, no. 1, 2018.
 \bibitem{lundberg2017unified}
-S. M. Lundberg and S.-I. Lee, ``A unified approach to interpreting model predictions,'' in \textit{Advances in Neural Information Processing Systems}, vol. 30, 2017.
 \bibitem{ribeiro2016lime}
-M. T. Ribeiro, S. Singh, and C. Guestrin, ``Why should I trust you?: Explaining the predictions of any classifier,'' in \textit{Proc. 22nd ACM SIGKDD Intl. Conf. Knowledge Discovery and Data Mining}, 2016, pp. 1135--1144.
 \bibitem{belle2021principles}
-V. Belle and I. Papantonis, ``Principles and practice of explainable machine learning,'' \textit{Frontiers in Big Data}, vol. 4, 2021.
 \bibitem{akiba2019optuna}
-T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama, ``Optuna: A next-generation hyperparameter optimization framework,'' in \textit{Proc. 25th ACM SIGKDD Intl. Conf. Knowledge Discovery and Data Mining}, 2019, pp. 2623--2631.
 \bibitem{grinsztajn2022tree}
-L. Grinsztajn, E. Oyallon, and G. Varoquaux, ``Why do tree-based models still outperform deep learning on tabular data?,'' in \textit{Advances in Neural Information Processing Systems}, vol. 35, 2022.
 \bibitem{liu2021graph}
-Y. Liu, M. Ao, C. Chi, F. Feng, D. Yang, and J. He, ``Pick and choose: A GNN-based imbalanced learning approach for fraud detection,'' in \textit{Proc. Web Conf.}, 2021, pp. 3168--3177.
 \bibitem{yang2019federated}
-Q. Yang, Y. Liu, T. Chen, and Y. Tong, ``Federated machine learning: Concept and applications,'' \textit{ACM Trans. Intelligent Systems and Technology}, vol. 10, no. 2, pp. 1--19, 2019.
 \end{thebibliography}

 \documentclass[journal]{IEEEtran}
+% ─── Packages ──────────────────────────────────────────────────────────────────
 \usepackage{cite}
 \usepackage{amsmath,amssymb,amsfonts}
 \usepackage{graphicx}
 \usepackage{textcomp}
 \usepackage{xcolor}
 \usepackage{booktabs}
 \usepackage{multirow}
+\usepackage{hyperref}
+\usepackage{listings}
+\usepackage{algorithm}
+\usepackage{algorithmic}
 \usepackage{array}
 \usepackage{float}
+\usepackage{url}
+\usepackage{balance}
+% ─── Listings Configuration ────────────────────────────────────────────────────
 \lstset{
   language=Python,
+  basicstyle=\ttfamily\scriptsize,
   keywordstyle=\color{blue},
   stringstyle=\color{red},
   commentstyle=\color{green!60!black},
   breaklines=true,
   frame=single,
+  numbers=left,
+  numberstyle=\tiny\color{gray},
+  captionpos=b,
 }
+% ─── Graphics Path ─────────────────────────────────────────────────────────────
+\graphicspath{{figures/}}
 \begin{document}
+% ═══════════════════════════════════════════════════════════════════════════════
+% TITLE
+% ═══════════════════════════════════════════════════════════════════════════════
 \title{A Comprehensive Ensemble-Based Framework for Credit Card Fraud Detection with Explainable AI}
+\author{}
 \maketitle
+% ═══════════════════════════════════════════════════════════════════════════════
+% ABSTRACT
+% ═══════════════════════════════════════════════════════════════════════════════
 \begin{abstract}
 Credit card fraud poses a significant threat to the global financial ecosystem, with estimated losses exceeding \$32 billion annually. This paper presents a comprehensive end-to-end fraud detection framework that systematically evaluates and compares seven machine learning approaches: Logistic Regression, Random Forest, XGBoost, LightGBM, Multilayer Perceptron, Autoencoder-based anomaly detection, and a Voting Ensemble. Using the benchmark European Cardholder dataset (284,807 transactions, 0.173\% fraud rate), we engineer 12 novel features and address the extreme class imbalance through both SMOTE oversampling and cost-sensitive learning with class weights. Our XGBoost model achieves the best performance with a PR-AUC of 0.8166, precision of 0.9048, recall of 0.8028, and F1-score of 0.8507 on the held-out test set. We demonstrate that optimizing the decision threshold from the default 0.5 to 0.55 improves F1 from 0.8507 to 0.8636. Comprehensive model explainability via SHAP and LIME analysis reveals that PCA components V4, V14, and V12 are the primary discriminative features. Error analysis shows that false negatives arise from sophisticated fraud patterns that closely mimic legitimate transaction behavior. We deploy the model as a production-ready FastAPI service achieving sub-10ms inference latency. The framework includes automated concept drift monitoring and retraining recommendations. All code, models, and results are publicly available.
 \end{abstract}
 Fraud detection, credit card, machine learning, XGBoost, ensemble learning, explainable AI, SHAP, class imbalance, anomaly detection
 \end{IEEEkeywords}
+% ═══════════════════════════════════════════════════════════════════════════════
+% I. INTRODUCTION
+% ═══════════════════════════════════════════════════════════════════════════════
 \section{Introduction}
+\IEEEPARstart{F}{inancial} fraud detection has become one of the most critical applications of machine learning in the modern digital economy. The proliferation of electronic payment systems has led to an exponential increase in both the volume of transactions and the sophistication of fraudulent activities~\cite{dal2015credit}. According to the Nilson Report, global card fraud losses reached \$32.34 billion in 2021 and are projected to exceed \$43 billion by 2026~\cite{nilson2022}.
+The fundamental challenge in fraud detection lies in the extreme class imbalance inherent in transaction data. In typical datasets, fraudulent transactions constitute less than 0.5\% of all transactions~\cite{pozzolo2015calibrating}. This imbalance renders conventional classification metrics such as accuracy misleading and necessitates specialized evaluation criteria including Precision-Recall AUC and Matthews Correlation Coefficient~\cite{saito2015precision}.
+Previous approaches to fraud detection have ranged from rule-based expert systems~\cite{bolton2002statistical} to sophisticated deep learning architectures~\cite{zhang2021fraud}. While deep learning methods have shown promise, tree-based ensemble methods such as XGBoost and LightGBM continue to demonstrate competitive or superior performance on tabular financial data~\cite{shwartz2022tabular}, particularly when augmented with careful feature engineering and proper handling of class imbalance.
 This paper makes the following contributions:
 \begin{enumerate}
     \item Quantitative business impact analysis estimating financial savings from deployment.
 \end{enumerate}
+% ═══════════════════════════════════════════════════════════════════════════════
+% II. RELATED WORK
+% ═══════════════════════════════════════════════════════════════════════════════
 \section{Related Work}
+Credit card fraud detection has been extensively studied across multiple paradigms. Dal Pozzolo et al.~\cite{dal2015credit} provided a foundational analysis of the challenges posed by class imbalance and concept drift in real-world fraud detection systems. Their work established that undersampling strategies could be effective but risked losing valuable information from the majority class.
+Chawla et al.~\cite{chawla2002smote} introduced SMOTE (Synthetic Minority Over-sampling Technique), which generates synthetic minority class samples by interpolating between existing examples. Subsequent work by Fernandez et al.~\cite{fernandez2018smote} demonstrated that SMOTE should be applied exclusively to training data, as applying it before splitting introduces data leakage.
+Ensemble methods have shown particular promise in fraud detection. Xuan et al.~\cite{xuan2018random} demonstrated that Random Forests achieve robust performance through bagging and feature randomization. Chen and Guestrin~\cite{chen2016xgboost} introduced XGBoost, which has since become a dominant method for tabular data classification, including fraud detection~\cite{taha2020detection}.
+Ke et al.~\cite{ke2017lightgbm} proposed LightGBM with leaf-wise tree growth and gradient-based one-side sampling, achieving faster training with comparable accuracy. Prokhorenkova et al.~\cite{prokhorenkova2018catboost} introduced CatBoost with ordered boosting to handle categorical features natively.
+Deep learning approaches have also been explored. Pumsirirat and Yan~\cite{pumsirirat2018credit} employed autoencoders for anomaly-based fraud detection, training exclusively on legitimate transactions and detecting fraud through reconstruction error. Zhang et al.~\cite{zhang2021fraud} proposed a convolutional recurrent neural network model for credit card fraud detection.
+Explainability in fraud detection has gained importance due to regulatory requirements. Lundberg and Lee~\cite{lundberg2017unified} introduced SHAP (SHapley Additive exPlanations), providing consistent feature attribution. Ribeiro et al.~\cite{ribeiro2016lime} proposed LIME (Local Interpretable Model-agnostic Explanations) for instance-level interpretability. Belle and Papantonis~\cite{belle2021principles} surveyed explainable AI methods applicable to financial decision-making.
+Akiba et al.~\cite{akiba2019optuna} introduced Optuna, a hyperparameter optimization framework using Tree-structured Parzen Estimators (TPE) that efficiently explores complex search spaces.
+Recent work by Shwartz-Ziv and Armon~\cite{shwartz2022tabular} demonstrated that well-tuned tree-based methods still outperform deep learning on most tabular datasets, supporting our choice of XGBoost as the primary model. Grinsztajn et al.~\cite{grinsztajn2022tree} further corroborated this finding with extensive benchmarking.
+% ═══════════════════════════════════════════════════════════════════════════════
+% III. DATASET AND EXPLORATORY DATA ANALYSIS
+% ═══════════════════════════════════════════════════════════════════════════════
 \section{Dataset and Exploratory Data Analysis}
 \subsection{Dataset Description}
+We use the European Cardholder Credit Card Fraud Detection dataset~\cite{dal2015credit}, containing 284,807 transactions made over two days in September 2013. The dataset includes 28 PCA-transformed features (V1--V28), the original \texttt{Time} and \texttt{Amount} features, and a binary \texttt{Class} label (0~=~legitimate, 1~=~fraud).
 \subsection{Class Distribution}
 The dataset exhibits extreme class imbalance with only 492 fraudulent transactions (0.173\%), yielding an imbalance ratio of approximately 1:577. This severe imbalance necessitates specialized handling during both training and evaluation.
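 As a quick check of these figures (a worked calculation that uses only the counts stated in this section), the fraud rate and the imbalance ratio follow directly from the 492 fraudulent and $284{,}807 - 492 = 284{,}315$ legitimate transactions:
 \begin{align}
 \text{fraud rate} &= \frac{492}{284{,}807} \approx 0.00173 = 0.173\%, \\
 \text{imbalance ratio} &= \frac{284{,}315}{492} \approx 577.9 .
 \end{align}
 A classifier that labels every transaction as legitimate would therefore reach roughly $99.83\%$ accuracy while detecting no fraud, which illustrates why accuracy is a misleading metric for this dataset.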
+\begin{table}[!t]
 \centering
 \caption{Class Distribution in the Dataset}
 \label{tab:class_dist}
     \item \textbf{Feature Scale}: V1--V28 are PCA-transformed; only Time and Amount require normalization.
 \end{enumerate}
+\begin{figure}[!t]
 \centering
+\includegraphics[width=\columnwidth]{class_distribution.png}
 \caption{Class distribution showing extreme imbalance (0.173\% fraud rate).}
 \label{fig:class_dist}
 \end{figure}
+% ═══════════════════════════════════════════════════════════════════════════════
+% IV. METHODOLOGY
+% ═══════════════════════════════════════════════════════════════════════════════
 \section{Methodology}
 \subsection{Feature Engineering}
 We compare two approaches for handling the 1:577 class imbalance:
+\textbf{SMOTE}~\cite{chawla2002smote}: Applied exclusively to the training set after splitting, generating synthetic fraud samples to achieve a 1:2 minority-to-majority ratio.
 \textbf{Cost-Sensitive Learning}: Applying class weights inversely proportional to class frequency:
 \begin{equation}
 \subsection{Hyperparameter Optimization}
+We use Optuna~\cite{akiba2019optuna} with Tree-structured Parzen Estimators (TPE) to tune the top three models, optimizing PR-AUC on the validation set:
 \begin{equation}
 \theta^* = \arg\max_{\theta} \text{PR-AUC}(f_\theta, \mathcal{D}_{val})
 \end{equation}
+% ═══════════════════════════════════════════════════════════════════════════════
+% V. EXPERIMENTAL SETUP
+% ═══════════════════════════════════════════════════════════════════════════════
 \section{Experimental Setup}
 \subsection{Environment}
     \item \textbf{MCC}: $\frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$
 \end{itemize}
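 To make the threshold-dependent metrics concrete, the following worked example recomputes precision, recall, and F1 for XGBoost at a threshold of 0.5. The confusion counts are not listed in this excerpt; $TP = 57$, $FP = 6$, and $FN = 14$ are assumed here because they are integer counts consistent with the reported precision (0.9048), the reported recall (0.8028), and the six false positives mentioned in the error analysis.
 \begin{align}
 \text{Precision} &= \frac{TP}{TP+FP} = \frac{57}{63} \approx 0.9048, \\
 \text{Recall} &= \frac{TP}{TP+FN} = \frac{57}{71} \approx 0.8028, \\
 F_1 &= \frac{2\,TP}{2\,TP+FP+FN} = \frac{114}{134} \approx 0.8507 .
 \end{align}
 Under these assumed counts all three values agree with the reported results to four decimals.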
+% ═══════════════════════════════════════════════════════════════════════════════
+% VI. RESULTS AND DISCUSSION
+% ═══════════════════════════════════════════════════════════════════════════════
 \section{Results and Discussion}
 \subsection{Model Comparison}
+\begin{table*}[!t]
 \centering
 \caption{Comprehensive Model Comparison on Test Set (Threshold = 0.5)}
 \label{tab:results}
 Key observations:
+\textbf{Tree-based models dominate}: XGBoost, Random Forest, and LightGBM consistently outperform the neural network approaches, consistent with findings by Shwartz-Ziv and Armon~\cite{shwartz2022tabular}.
 \textbf{Class weight handling matters}: Logistic Regression achieves high recall (0.8873) but extremely low precision (0.0488), indicating that the linear decision boundary with class weights is too aggressive in flagging transactions.
 \textbf{Autoencoder limitations}: While achieving perfect recall (1.0), the autoencoder suffers from extremely low precision (0.0033), flagging nearly all transactions as anomalous. This suggests that the reconstruction-based approach is too sensitive for this PCA-transformed feature space.
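 One way to read these precision values (a worked illustration, not an additional experiment) is as the number of false alarms raised per correctly flagged fraud, which depends only on the precision $P$:
 \begin{equation}
 \frac{FP}{TP} = \frac{1-P}{P}, \qquad \text{since } P = \frac{TP}{TP+FP}.
 \end{equation}
 Plugging in the reported precisions gives approximately $0.0952/0.9048 \approx 0.11$ for XGBoost, $0.9512/0.0488 \approx 19.5$ for Logistic Regression, and $0.9967/0.0033 \approx 302$ for the autoencoder. Because the reported precisions are rounded to four decimals, the last figure in particular is only indicative.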
+\begin{figure}[!t]
 \centering
+\includegraphics[width=\columnwidth]{roc_curves.png}
 \caption{ROC curves for all models. XGBoost and Voting Ensemble achieve the highest AUC.}
 \label{fig:roc}
 \end{figure}
+\begin{figure}[!t]
 \centering
+\includegraphics[width=\columnwidth]{pr_curves.png}
 \caption{Precision-Recall curves. PR-AUC serves as the primary metric because of the extreme class imbalance.}
 \label{fig:pr}
 \end{figure}
 The default threshold of 0.5 is not necessarily optimal under extreme class imbalance. A threshold sweep for XGBoost shows that F1 peaks at a threshold of 0.55, rising from 0.8507 at the default to 0.8636:
+\begin{table}[!t]
 \centering
 \caption{Threshold Sensitivity for XGBoost}
 \label{tab:threshold}
 \subsection{Business Impact}
+\begin{table}[!t]
 \centering
 \caption{Business Impact Analysis (Test Set)}
 \label{tab:business}
 Ensemble & 6,966 & 1,711 & 6,921 \\
 RF (Tuned) & 6,722 & 1,955 & 6,682 \\
 LR & 7,699 & 978 & 1,554 \\
+Autoencoder & 8,677 & 0 & $-$97,368 \\
 \bottomrule
 \end{tabular}
 \end{table}
 SHAP analysis identifies V4 (mean $|\text{SHAP}| = 1.913$), V14 (1.843), and PCA\_magnitude (1.113) as the primary fraud discriminators. Because V1--V28 are anonymized PCA components, these attributions point to directions in the transformed space that separate fraudulent from legitimate behavior, not to directly interpretable transaction attributes.
+\begin{figure}[!t]
 \centering
+\includegraphics[width=\columnwidth]{shap_summary.png}
 \caption{SHAP summary plot showing feature contributions to fraud predictions.}
 \label{fig:shap}
 \end{figure}
+% ═══════════════════════════════════════════════════════════════════════════════
+% VII. ERROR ANALYSIS
+% ═══════════════════════════════════════════════════════════════════════════════
 \section{Error Analysis}
 \subsection{False Negative Analysis}
 \subsection{Concept Drift Assessment}
+Comparing model confidence between early and late test periods reveals a drift indicator of $+0.115$, suggesting modest temporal variation. We recommend weekly monitoring with automated retraining triggers when PR-AUC drops below 0.70.
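The monitoring rule above can be sketched in a few lines of Python. The text does not define the drift indicator, so this sketch assumes it is the difference in mean predicted fraud score between the late and early windows. The function names and the example scores are hypothetical; only the 0.70 PR-AUC floor comes from the recommendation above.

```python
from statistics import fmean
from typing import Sequence

PR_AUC_FLOOR = 0.70  # retraining threshold recommended in the text


def confidence_drift(early_scores: Sequence[float], late_scores: Sequence[float]) -> float:
    """Assumed drift indicator: mean late-window score minus mean early-window score."""
    return fmean(late_scores) - fmean(early_scores)


def should_retrain(weekly_pr_auc: float, floor: float = PR_AUC_FLOOR) -> bool:
    """Trigger retraining when the monitored PR-AUC drops below the floor."""
    return weekly_pr_auc < floor


# Hypothetical model scores from the early and late halves of a test period.
early = [0.02, 0.05, 0.91, 0.03]
late = [0.04, 0.08, 0.95, 0.06]
drift = confidence_drift(early, late)
retrain_now = should_retrain(weekly_pr_auc=0.68)
```

A positive drift value means the model assigns higher scores on average in the later window. In a weekly job, \texttt{should\_retrain} would be evaluated on the PR-AUC of the most recent labeled batch.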
+% ═══════════════════════════════════════════════════════════════════════════════
+% VIII. LIMITATIONS
+% ═══════════════════════════════════════════════════════════════════════════════
 \section{Limitations}
     \item \textbf{Static Threshold}: The optimal threshold may shift as fraud patterns evolve; dynamic threshold adaptation is not implemented.
 \end{enumerate}
+% ═══════════════════════════════════════════════════════════════════════════════
+% IX. FUTURE WORK
+% ═══════════════════════════════════════════════════════════════════════════════
 \section{Future Work}
 Several promising directions emerge from this research:
+\textbf{Graph Neural Networks}: Modeling transaction networks as graphs could enable detection of fraud rings through collaborative behavioral patterns~\cite{liu2021graph}.
 \textbf{Real-Time Streaming}: Integration with Apache Kafka and Apache Flink for millisecond-latency processing of transaction streams at scale.
+\textbf{Federated Learning}: Training across multiple banks without sharing raw transaction data, preserving privacy while improving generalization~\cite{yang2019federated}.
 \textbf{LLM-Generated Explanations}: Using large language models to generate natural-language compliance explanations for flagged transactions, facilitating human review.
 \textbf{Adversarial Robustness}: Training models that are robust to adversarial perturbations designed to evade detection.
+% ═══════════════════════════════════════════════════════════════════════════════
+% X. CONCLUSION
+% ═══════════════════════════════════════════════════════════════════════════════
 \section{Conclusion}
 This paper presents a fraud detection framework that systematically evaluates seven machine learning approaches on the benchmark European Cardholder dataset. XGBoost with cost-sensitive learning and optimized class weights achieves the best overall performance (PR-AUC: 0.8166, F1: 0.8507), and raising the decision threshold from 0.5 to 0.55 improves F1 to 0.8636. The framework also provides explainability through SHAP and LIME, production deployment via FastAPI with sub-10ms latency, and automated drift monitoring. On this dataset, tree-based ensemble methods remain the most effective approach for tabular fraud detection. The results also underline the importance of proper class imbalance handling and threshold optimization, and the inadequacy of accuracy as a metric for imbalanced classification.
+All code, models, and results are publicly available.
+% ═══════════════════════════════════════════════════════════════════════════════
+% REFERENCES
+% ═══════════════════════════════════════════════════════════════════════════════
+\balance
 \bibliographystyle{IEEEtran}
 \begin{thebibliography}{99}
 \bibitem{dal2015credit}
+A.~Dal~Pozzolo, O.~Caelen, R.~A.~Johnson, and G.~Bontempi, ``Calibrating probability with undersampling for unbalanced classification,'' in \textit{Proc. IEEE Symp. Comput. Intell. Data Mining (CIDM)}, 2015, pp.~159--166.
 \bibitem{nilson2022}
 Nilson Report, ``Global card fraud losses,'' \textit{Nilson Report}, Issue 1209, 2022.
 \bibitem{pozzolo2015calibrating}
+A.~Dal~Pozzolo, O.~Caelen, and G.~Bontempi, ``When is undersampling effective in unbalanced classification tasks?'' in \textit{Proc. European Conf. Machine Learning and Knowledge Discovery in Databases}, 2015, pp.~200--215.
 \bibitem{saito2015precision}
+T.~Saito and M.~Rehmsmeier, ``The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets,'' \textit{PLoS ONE}, vol.~10, no.~3, Art.~no.~e0118432, 2015.
 \bibitem{bolton2002statistical}
+R.~J.~Bolton and D.~J.~Hand, ``Statistical fraud detection: A review,'' \textit{Statistical Science}, vol.~17, no.~3, pp.~235--255, 2002.
 \bibitem{zhang2021fraud}
+Z.~Zhang, X.~Zhou, X.~Zhang, L.~Wang, and P.~Wang, ``A model based on convolutional recurrent neural network for fraud detection in credit card,'' \textit{Complexity}, vol.~2021, pp.~1--9, 2021.
 \bibitem{shwartz2022tabular}
+R.~Shwartz-Ziv and A.~Armon, ``Tabular data: Deep learning is not all you need,'' \textit{Information Fusion}, vol.~81, pp.~84--90, 2022.
 \bibitem{chawla2002smote}
+N.~V.~Chawla, K.~W.~Bowyer, L.~O.~Hall, and W.~P.~Kegelmeyer, ``SMOTE: Synthetic Minority Over-sampling Technique,'' \textit{J. Artificial Intelligence Research}, vol.~16, pp.~321--357, 2002.
 \bibitem{fernandez2018smote}
+A.~Fern\'andez, S.~Garc\'{\i}a, M.~Galar, R.~C.~Prati, B.~Krawczyk, and F.~Herrera, \textit{Learning from Imbalanced Data Sets}. Springer, 2018.
 \bibitem{xuan2018random}
+S.~Xuan, G.~Liu, Z.~Li, L.~Zheng, S.~Wang, and C.~Jiang, ``Random forest for credit card fraud detection,'' in \textit{Proc. IEEE 15th Intl. Conf. Networking, Sensing and Control (ICNSC)}, 2018, pp.~1--6.
 \bibitem{chen2016xgboost}
+T.~Chen and C.~Guestrin, ``XGBoost: A scalable tree boosting system,'' in \textit{Proc. 22nd ACM SIGKDD Intl. Conf. Knowledge Discovery and Data Mining}, 2016, pp.~785--794.
 \bibitem{taha2020detection}
+A.~A.~Taha and S.~J.~Malebary, ``An intelligent approach to credit card fraud detection using an optimized light gradient boosting machine,'' \textit{IEEE Access}, vol.~8, pp.~25579--25587, 2020.
 \bibitem{ke2017lightgbm}
+G.~Ke, Q.~Meng, T.~Finley, T.~Wang, W.~Chen, W.~Ma, Q.~Ye, and T.-Y.~Liu, ``LightGBM: A highly efficient gradient boosting decision tree,'' in \textit{Advances in Neural Information Processing Systems}, vol.~30, 2017.
 \bibitem{prokhorenkova2018catboost}
+L.~Prokhorenkova, G.~Gusev, A.~Vorobev, A.~V.~Dorogush, and A.~Gulin, ``CatBoost: Unbiased boosting with categorical features,'' in \textit{Advances in Neural Information Processing Systems}, vol.~31, 2018.
 \bibitem{pumsirirat2018credit}
+A.~Pumsirirat and L.~Yan, ``Credit card fraud detection using deep learning based on auto-encoder and restricted Boltzmann machine,'' \textit{Intl. J. Advanced Computer Science and Applications}, vol.~9, no.~1, 2018.
 \bibitem{lundberg2017unified}
+S.~M.~Lundberg and S.-I.~Lee, ``A unified approach to interpreting model predictions,'' in \textit{Advances in Neural Information Processing Systems}, vol.~30, 2017.
 \bibitem{ribeiro2016lime}
+M.~T.~Ribeiro, S.~Singh, and C.~Guestrin, ``\,`Why should I trust you?': Explaining the predictions of any classifier,'' in \textit{Proc. 22nd ACM SIGKDD Intl. Conf. Knowledge Discovery and Data Mining}, 2016, pp.~1135--1144.
 \bibitem{belle2021principles}
+V.~Belle and I.~Papantonis, ``Principles and practice of explainable machine learning,'' \textit{Frontiers in Big Data}, vol.~4, 2021.
 \bibitem{akiba2019optuna}
+T.~Akiba, S.~Sano, T.~Yanase, T.~Ohta, and M.~Koyama, ``Optuna: A next-generation hyperparameter optimization framework,'' in \textit{Proc. 25th ACM SIGKDD Intl. Conf. Knowledge Discovery and Data Mining}, 2019, pp.~2623--2631.
 \bibitem{grinsztajn2022tree}
+L.~Grinsztajn, E.~Oyallon, and G.~Varoquaux, ``Why do tree-based models still outperform deep learning on tabular data?'' in \textit{Advances in Neural Information Processing Systems}, vol.~35, 2022.
 \bibitem{liu2021graph}
+Y.~Liu, M.~Ao, C.~Chi, F.~Feng, D.~Yang, and J.~He, ``Pick and choose: A GNN-based imbalanced learning approach for fraud detection,'' in \textit{Proc. Web Conf.}, 2021, pp.~3168--3177.
 \bibitem{yang2019federated}
+Q.~Yang, Y.~Liu, T.~Chen, and Y.~Tong, ``Federated machine learning: Concept and applications,'' \textit{ACM Trans. Intelligent Systems and Technology}, vol.~10, no.~2, pp.~1--19, 2019.
 \end{thebibliography}