🔍 Chicago Crime — Arrest Prediction (Classification & Regression)

Can we predict whether a crime in Chicago will result in an arrest — based purely on environment, time, and location?

🎥 Project Video Walkthrough

📁 Repository Contents

File	Description
`notebook.ipynb`	Full Colab notebook with all code, analysis, and results
`chicago_crime_gb_model.pkl`	Winning Regression model (Gradient Boosting Regressor)
`chicago_crime_gb_classifier.pkl`	Winning Classification model (Gradient Boosting Classifier)
`presentation.mp4`	Full walkthrough video presentation
`README.md`	This file

🗂️ Dataset

Source: gymprathap/Chicago-Crime-Dataset on Hugging Face

The raw dataset contains ~8 million rows of real Chicago Police Department crime reports with 21 features covering crime type, location, date/time, district, arrest status, and more.

After preprocessing:

Filtered to crimes reported in 2010 or later
Dropped low-signal or redundant columns (ID, Case Number, IUCR, Beat, Ward, FBI Code, Block, X/Y Coordinates)
Removed duplicate rows
Parsed and extracted temporal features (Hour, Month, Year) from the Date column
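The preprocessing steps above can be sketched in pandas roughly as follows. This is a minimal illustration, not the notebook's code: the tiny `raw` frame, the `preprocess` helper, the exact coordinate column names, and the date format string are all assumptions. The date is parsed first because the year filter depends on it.

```python
import pandas as pd

# Tiny stand-in for the raw data (hypothetical rows; column names assumed).
raw = pd.DataFrame({
    "ID": [1, 2, 2, 3],
    "Case Number": ["A1", "B2", "B2", "C3"],
    "Date": ["03/15/2009 10:30:00 PM", "07/04/2015 01:05:00 AM",
             "07/04/2015 01:05:00 AM", "12/31/2021 11:59:00 PM"],
    "District": [11, 10, 10, 15],
    "Location Description": ["STREET", "RESIDENCE", "RESIDENCE", "AIRPORT"],
    "Arrest": [True, False, False, True],
})

# Columns listed as dropped in this README (coordinate names are assumed).
DROP_COLS = ["ID", "Case Number", "IUCR", "Beat", "Ward", "FBI Code",
             "Block", "X Coordinate", "Y Coordinate"]


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Parse timestamps (format string is an assumption about the raw data).
    out["Date"] = pd.to_datetime(out["Date"], format="%m/%d/%Y %I:%M:%S %p")
    # Keep crimes from 2010 onward.
    out = out[out["Date"].dt.year >= 2010]
    # Drop low-signal columns; ignore any that are absent.
    out = out.drop(columns=DROP_COLS, errors="ignore")
    # Remove exact duplicate rows.
    out = out.drop_duplicates()
    # Extract the temporal features.
    out["Hour"] = out["Date"].dt.hour
    out["Month"] = out["Date"].dt.month
    out["Year"] = out["Date"].dt.year
    return out.reset_index(drop=True)


clean = preprocess(raw)
print(clean[["Hour", "Month", "Year", "District", "Arrest"]])
```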

❓ Central Question

What are the factors that most influence whether a person who has committed a crime is arrested?

This project approaches that question from two angles:

Regression — predict the exact arrest rate for a given time / location / district combination.
Classification — predict whether any arrest will occur at all (binary: arrest vs. no arrest).

🗺️ Project Pipeline

The pipeline runs from raw dataset → EDA → baseline model → feature engineering (including clustering) → three improved regression models → conversion to classification → three classification models → export of the winning models to Hugging Face.

🔎 Exploratory Data Analysis (EDA)

EDA was used to tell the story of the data and identify the strongest predictors of arrest before any model was built. Four key questions were explored:

Q1 — Does geographic district affect arrest rate?

A quadrant analysis (crime volume × arrest rate per district) revealed extreme policing differences across Chicago.

District 11 operates as a high-volume, high-arrest environment.
Districts 10 & 15 have similarly high crime volume but below-average arrest rates.
High crime density amplifies extreme outcomes rather than producing a uniform result.

The data suggests a more nuanced conclusion: geographic policing outcomes are driven by structural and resource factors, not by crime volume alone.
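One way to reproduce a quadrant analysis like this is to compute crime volume and arrest rate per district and split each at its mean. The sketch below uses made-up rows, and the choice of the mean as the cut line is an assumption; the notebook may use a different threshold.

```python
import pandas as pd

# Hypothetical crime records: one row per crime.
crimes = pd.DataFrame({
    "District": [11] * 6 + [10] * 6 + [1] * 2 + [2] * 2,
    "Arrest": [True, True, True, True, False, False,      # district 11: 4 of 6
               False, False, False, False, False, True,   # district 10: 1 of 6
               True, False,                               # district 1: 1 of 2
               False, False],                             # district 2: 0 of 2
})

# Volume and arrest rate per district.
per_district = crimes.groupby("District")["Arrest"].agg(
    volume="size", arrest_rate="mean"
)

# Split each axis at its mean (assumed cut line) and label the quadrant.
high_volume = per_district["volume"] > per_district["volume"].mean()
high_arrest = per_district["arrest_rate"] > per_district["arrest_rate"].mean()
per_district["quadrant"] = (
    high_volume.map({True: "high-volume", False: "low-volume"})
    + " / "
    + high_arrest.map({True: "high-arrest", False: "low-arrest"})
)
print(per_district)
```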

Q2 — Which crime types lead to arrests most and least often?

High arrest rates correlate strongly with police-initiated incidents (narcotics, interference with officers) — crimes where the arrest is the intended outcome of the police interaction. Reactive crimes committed against people or property (burglary, theft) show low arrest rates because police arrive after the fact.

This distinction between proactive and reactive policing became a core conceptual framework for the entire project.

Q3 — Does time (hour, month, year) affect arrest probability?

Month: Minimal effect — only ~4% variation across seasons.
Year: Major drops in arrest rates around 2015–2016 and 2019–2022, likely driven by city leadership changes and COVID-19.
Hour: The most interesting finding. Crime volume is high throughout the day, but arrest rates spike sharply after midnight. One plausible explanation, which the data alone cannot confirm, is that fewer bystanders and less ambient noise at night let police respond more decisively.

Q4 — Does micro-location (location type) affect arrest rate?

Location type is a strong predictor. In private spaces (homes, near ATMs) police typically arrive long after the crime and rarely catch anyone. In monitored or controlled spaces (airports, stairways, elevators) arrest rates are far higher due to constant police presence or camera coverage.

Your surroundings at the moment of a crime are a significant factor in whether an arrest follows.

EDA Summary

Factor	Impact on Arrest Rate
Crime Type (proactive vs. reactive)	⬆️ Very High
Micro-location (monitored vs. private)	⬆️ High
Hour of day	⬆️ Moderate–High
Police District	⬆️ Moderate
Month of year	➡️ Low

⚙️ Feature Engineering

Features were selected and built based on the EDA findings:

Temporal features extracted from Date: Hour, Month, Year
Group-level aggregation: each (Hour, District, Location Description) combination was summarized into a historical arrest rate — the regression target
One-Hot Encoding of Location Description and District
K-Means Clustering on district profiles (average hourly crime volume × average arrest rate) → produced 3 District Archetypes added as a new feature:

Archetype	Description
High-Volume / High-Arrest	e.g. District 11
High-Volume / Low-Arrest	e.g. Districts 10, 15
Lower-Volume / Mixed	Remaining districts
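The clustering step might look roughly like the sketch below. The district profiles are invented numbers, and the use of `StandardScaler` before K-Means is an assumption (it keeps the volume axis from dominating the distance); only the two profile dimensions and the choice of 3 clusters come from the description above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

SEED = 42

# Hypothetical district profiles: [avg hourly crime volume, avg arrest rate].
districts = [11, 10, 15, 1, 2, 3]
profiles = np.array([
    [95.0, 0.40],
    [90.0, 0.12],
    [92.0, 0.10],
    [30.0, 0.22],
    [28.0, 0.25],
    [35.0, 0.20],
])

# Put both axes on a comparable scale before clustering (assumed step).
scaled = StandardScaler().fit_transform(profiles)

# Three clusters, matching the three archetypes in the table above.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=SEED)
labels = kmeans.fit_predict(scaled)

# Map each district to its cluster id; this id becomes the new feature.
archetype = dict(zip(districts, labels.tolist()))
print(archetype)
```

Cluster ids are arbitrary integers, so any human-readable archetype name has to be assigned afterwards by inspecting the cluster centers.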

📊 Regression Models

Target: Arrest rate (continuous, 0.0–1.0) for each (Hour, District, Location Description) group.
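As a rough illustration of how such a target can be built, the sketch below aggregates individual crimes into groups and one-hot encodes the categorical keys. The rows are hypothetical and the `n_crimes` column is an added convenience, not something described in this README.

```python
import pandas as pd

# Hypothetical cleaned crime records.
crimes = pd.DataFrame({
    "Hour": [1, 1, 1, 14, 14],
    "District": [11, 11, 11, 10, 10],
    "Location Description": ["STREET", "STREET", "STREET",
                             "RESIDENCE", "RESIDENCE"],
    "Arrest": [True, True, False, False, False],
})

group_cols = ["Hour", "District", "Location Description"]

# One row per group; the mean of the boolean Arrest column is the arrest rate.
groups = crimes.groupby(group_cols, as_index=False).agg(
    arrest_rate=("Arrest", "mean"),
    n_crimes=("Arrest", "size"),
)

# One-hot encode the categorical keys; Hour stays numeric.
X = pd.get_dummies(groups[group_cols],
                   columns=["District", "Location Description"])
y = groups["arrest_rate"]
print(X.join(y))
```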

Baseline — Linear Regression (raw features)

Metric	Score
MAE	0.2264
RMSE	0.2796

On average the baseline prediction was off by ~22.6 percentage points. It captured mild central cases reasonably but failed at the extremes.
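For readers unfamiliar with the two metrics, here is a minimal standard-library sketch of how MAE and RMSE are computed. The sample values are made up. RMSE squares each error before averaging, so it is never smaller than MAE and grows faster when a few predictions miss badly.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average size of the miss."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large misses more heavily."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


# Hypothetical arrest rates and predictions for four groups.
y_true = [0.0, 0.0, 0.5, 1.0]
y_pred = [0.2, 0.1, 0.4, 0.6]

print(f"MAE  = {mae(y_true, y_pred):.4f}")
print(f"RMSE = {rmse(y_true, y_pred):.4f}")
```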

Improved Models — Engineered Features

Model	Performance vs. Baseline
Linear Regression (engineered)	Improved
Random Forest Regressor	Lower MAE / Higher R²
Gradient Boosting Regressor ✅	Best overall

Winner: Gradient Boosting Regressor — exported as chicago_crime_gb_model.pkl

The improvement from baseline to engineered features was meaningful but not dramatic — which is expected. Predicting whether a person will be arrested involves social and psychological dynamics that no finite feature set can fully encode.

🏷️ Regression → Classification

The continuous arrest rate target was converted into a binary classification problem.

A median split was attempted first, but since most groups have an arrest rate of zero, the median was exactly 0.0. The most meaningful and natural threshold became Zero vs. Non-Zero:

Class	Meaning
Class 0	Arrest rate = 0.0 — zero arrests made ("total getaway")
Class 1	Arrest rate > 0.0 — at least one arrest made
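The small sketch below shows why the median split collapses and how the Zero vs. Non-Zero labels are derived. The rates are invented; only the thresholding rule comes from the table above.

```python
from statistics import median

# Hypothetical group-level arrest rates, mostly zero as in the real data.
rates = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25, 0.5, 1.0]

# With more than half the groups at zero, the median is zero as well,
# so "at or above the median" would put every group in the same class.
med = median(rates)
all_at_or_above = all(r >= med for r in rates)

# Zero vs. Non-Zero: Class 1 if at least one arrest was made in the group.
labels = [int(r > 0.0) for r in rates]

print(med, all_at_or_above, labels)
```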

Class Balance

Class	Approximate Share
Class 0 (No Arrest)	~85%
Class 1 (Arrest)	~15%

The dataset is heavily imbalanced. A model that always guesses "no arrest" would score ~85% accuracy while being completely useless. For this reason, Precision and F1-Score for Class 1 are the primary evaluation metrics.
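The accuracy trap can be checked with a few lines of plain Python. The 85/15 split below mirrors the approximate class shares in the table; the 100-sample size is arbitrary.

```python
# 100 hypothetical groups with the approximate class shares from the table.
y_true = [0] * 85 + [1] * 15

# A "model" that always predicts no arrest.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for Class 1: how many of the real arrests were found.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall_class_1 = true_positives / sum(y_true)

print(f"accuracy = {accuracy:.2f}, recall (Class 1) = {recall_class_1:.2f}")
```

Precision for Class 1 is undefined here because the model never predicts an arrest, which is another sign that accuracy alone says little on this dataset.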

Precision vs. Recall — The Operational Choice

From the perspective of a real-world policing tool, False Positives are more damaging than False Negatives. Sending patrol resources to a location that yields no arrest wastes time and erodes trust. Therefore Precision (of predicted arrests, how many were real) is prioritized over Recall (how many real arrests were found).

🤖 Classification Models

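Three classifiers were trained on the engineered features, with class_weight='balanced' used to account for the imbalance where the estimator supports it. Note that scikit-learn's GradientBoostingClassifier does not accept a class_weight parameter, so weighting that model requires passing sample_weight to fit; see notebook.ipynb for the exact configuration.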
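For context, scikit-learn documents the "balanced" heuristic as `n_samples / (n_classes * np.bincount(y))`. The standard-library sketch below reimplements that formula for illustration; the helper name `balanced_class_weights` is hypothetical and the 85/15 labels are made up to match the approximate class shares.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Hypothetical labels with an 85/15 split.
y = [0] * 85 + [1] * 15
weights = balanced_class_weights(y)

# The rare class gets the larger weight, so both classes end up
# contributing the same total weight to the training loss.
total_0 = 85 * weights[0]
total_1 = 15 * weights[1]
print(weights, total_0, total_1)
```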

Logistic Regression

Logistic Regression's primary failure mode is a very large number of False Positives (13,832). As a linear model trained with balanced class weights, it predicts "Arrest" aggressively, catching many true positives but triggering far too many false alarms. It is not suitable for resource allocation.

Random Forest

Random Forest is more balanced than Logistic Regression — fewer False Positives (12,439) with a similar True Positive count (17,370). A stronger model but still generates a significant volume of false alarms.

Gradient Boosting ✅ — Winner

Gradient Boosting is the most conservative and precise model. It achieves the fewest False Positives (5,720) by being selective — only predicting "Arrest" when the environmental signal is strong. This comes at the cost of more False Negatives (16,715), but for operational use that trade-off is correct: it is far better to miss some arrests than to waste resources on false alarms.

Model Comparison

Model	Precision (Class 1)	Recall (Class 1)	F1 (Class 1)	False Positives
Logistic Regression	lower	high	~0.60	13,832 ❌
Random Forest	medium	medium	~0.60	12,439 ⚠️
Gradient Boosting ✅	0.67 (highest)	lower	~0.51	5,720 ✅
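The table entries can be related to the confusion-matrix counts quoted earlier with a few helper functions. The Random Forest precision below uses the True Positive and False Positive counts from the text. The Gradient Boosting recall is not reported, so the value computed here is only a back-solved estimate from the rounded precision and F1, not a figure from the notebook.

```python
def precision(tp: int, fp: int) -> float:
    """Of the predicted arrests, the share that were real."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Of the real arrests, the share that were found."""
    return tp / (tp + fn)


def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


def implied_recall(p: float, f1_score: float) -> float:
    """Rearranged F1 formula: the recall consistent with a given P and F1."""
    return f1_score * p / (2 * p - f1_score)


# Random Forest: counts quoted in the text (roughly 0.58).
rf_precision = precision(tp=17_370, fp=12_439)

# Gradient Boosting: estimate only, derived from rounded values (about 0.41).
gb_recall_estimate = implied_recall(p=0.67, f1_score=0.51)

print(f"RF precision       = {rf_precision:.3f}")
print(f"GB recall (approx) = {gb_recall_estimate:.3f}")
```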

Winner: Gradient Boosting Classifier — exported as chicago_crime_gb_classifier.pkl

🧠 Key Takeaways

The nature of the crime (proactive vs. reactive policing) is the single strongest predictor of arrest.
Location type provides strong signal — monitored spaces have far higher arrest rates regardless of crime type.
Hour of day matters more than month of year; late-night hours see elevated arrest rates relative to crime volume.
District archetypes (K-Means engineered feature) improved model performance over raw district IDs.
Class imbalance reflects reality. The right response is choosing the correct metric (Precision, F1) rather than artificially resampling the data; class weights were applied during training instead.
Human behavior has limits as a prediction target: the models perform well on typical cases, but the extremes remain hard to predict, which is expected given the features available.

🔧 How to Use the Models

```python
import pickle

# Load the regression model
with open('chicago_crime_gb_model.pkl', 'rb') as f:
    reg_model = pickle.load(f)

# Load the classification model
with open('chicago_crime_gb_classifier.pkl', 'rb') as f:
    clf_model = pickle.load(f)

# Both models expect the engineered feature set from Part 4 of the notebook.
# See notebook.ipynb for the full feature engineering pipeline.

# Predict arrest rate (regression)
# y_pred_rate = reg_model.predict(X_new)       # float 0.0–1.0

# Predict arrest / no arrest (classification)
# y_pred_class = clf_model.predict(X_new)      # 0 = no arrest, 1 = arrest
```

📦 Environment & Libraries

Python 3.10+
pandas, numpy
scikit-learn
plotly, matplotlib, seaborn
folium
datasets (Hugging Face)
scipy, statsmodels

All random operations use SEED = 42 for full reproducibility.
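A common way to apply a single seed across the libraries listed above is sketched below. The `set_seed` helper is hypothetical; the notebook may seed things differently. scikit-learn estimators and splitters take the seed separately through their `random_state` argument.

```python
import random

import numpy as np

SEED = 42


def set_seed(seed: int = SEED) -> None:
    """Seed Python's and NumPy's global random number generators."""
    random.seed(seed)
    np.random.seed(seed)


# Re-seeding reproduces the same draws.
set_seed()
first = (random.random(), float(np.random.rand()))
set_seed()
second = (random.random(), float(np.random.rand()))
print(first == second)

# For scikit-learn, pass the seed explicitly, for example:
# GradientBoostingClassifier(random_state=SEED)
```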

👤 Author

adam lambez

Assignment #2: Classification, Regression, Clustering, Evaluation (April 2026)

