Assignment #2: Classification, Regression, Clustering, Evaluation
April 2026
Chicago Crime: Arrest Prediction (Classification & Regression)
Can we predict whether a crime in Chicago will result in an arrest, based purely on environment, time, and location?
Project Video Walkthrough
Repository Contents
| File | Description |
|---|---|
| `notebook.ipynb` | Full Colab notebook with all code, analysis, and results |
| `chicago_crime_gb_model.pkl` | Winning Regression model (Gradient Boosting Regressor) |
| `chicago_crime_gb_classifier.pkl` | Winning Classification model (Gradient Boosting Classifier) |
| `presentation.mp4` | Full walkthrough video presentation |
| `README.md` | This file |
Dataset
Source: gymprathap/Chicago-Crime-Dataset on Hugging Face
The raw dataset contains ~8 million rows of real Chicago Police Department crime reports with 21 features covering crime type, location, date/time, district, arrest status, and more.
After preprocessing:
- Filtered to crimes reported in 2010 and later
- Dropped low-signal or redundant columns (`ID`, `Case Number`, `IUCR`, `Beat`, `Ward`, `FBI Code`, `Block`, `X/Y Coordinates`)
- Removed duplicate rows
- Parsed and extracted temporal features (`Hour`, `Month`, `Year`) from the `Date` column
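The preprocessing steps above can be sketched with pandas roughly as follows. This is a minimal illustration on a four-row stand-in table, not the notebook's actual code: the timestamp format string and the exact coordinate column names (`X Coordinate`, `Y Coordinate`) are assumptions about the raw data.

```python
import pandas as pd

# Tiny stand-in for the raw crime table (the real data has ~8 million rows).
raw = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Case Number": ["A1", "A2", "A3", "A4"],
    "Date": ["03/15/2009 10:30:00 PM", "07/04/2015 01:15:00 AM",
             "07/04/2015 01:15:00 AM", "12/31/2020 11:59:00 PM"],
    "Primary Type": ["THEFT", "NARCOTICS", "NARCOTICS", "BURGLARY"],
    "District": [11, 11, 11, 10],
    "Arrest": [False, True, True, False],
})

# Column names follow the README; the two coordinate names are assumed.
DROP_COLUMNS = ["ID", "Case Number", "IUCR", "Beat", "Ward",
                "FBI Code", "Block", "X Coordinate", "Y Coordinate"]


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Parse the timestamp; the format string is an assumption.
    df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y %I:%M:%S %p")
    # Keep crimes reported in 2010 and later.
    df = df[df["Date"].dt.year >= 2010]
    # Drop identifier-like columns; errors="ignore" skips any that are absent.
    df = df.drop(columns=DROP_COLUMNS, errors="ignore")
    # Rows that differed only by ID / Case Number now collapse into one.
    df = df.drop_duplicates()
    # Extract the temporal features used later in the pipeline.
    df = df.assign(Hour=df["Date"].dt.hour,
                   Month=df["Date"].dt.month,
                   Year=df["Date"].dt.year)
    return df.reset_index(drop=True)


clean = preprocess(raw)
print(clean[["Hour", "Month", "Year", "District", "Arrest"]])
```

Dropping the identifier columns before `drop_duplicates()` matters here: with `ID` still present, no two rows would ever compare equal.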
Central Question
What are the factors that most influence whether a person who has committed a crime is arrested?
This project approaches that question from two angles:
- Regression: predict the exact arrest rate for a given time / location / district combination.
- Classification: predict whether any arrest will occur at all (binary: arrest vs. no arrest).
Project Pipeline
The pipeline runs from raw dataset → EDA → baseline model → feature engineering (including clustering) → three improved regression models → convert to classification → three classification models → export winning models to Hugging Face.
Exploratory Data Analysis (EDA)
EDA was used to tell the story of the data and identify the strongest predictors of arrest before any model was built. Four key questions were explored:
Q1: Does geographic district affect arrest rate?
A quadrant analysis (crime volume × arrest rate per district) revealed extreme policing differences across Chicago.
- District 11 operates as a high-volume, high-arrest environment.
- Districts 10 & 15 have similarly high crime volume but below-average arrest rates.
- High crime density amplifies extreme outcomes rather than producing a uniform result.
The data suggests a more nuanced conclusion: geographic policing outcomes appear to be driven by structural and resource factors, not by crime volume alone.
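A quadrant analysis of this kind can be sketched as below. The incident counts are made up for illustration, and the cut-offs (median volume, mean district arrest rate) are an assumed choice; the notebook may split the quadrants differently.

```python
import pandas as pd

# Made-up incidents: two busy districts and two quiet ones.
incidents = pd.DataFrame({
    "District": [11] * 6 + [10] * 6 + [20] * 2 + [16] * 2,
    "Arrest":   [1, 1, 1, 1, 0, 0,   1, 0, 0, 0, 0, 0,   1, 1,   0, 0],
})

# One row per district: how many crimes, and what share ended in arrest.
profile = incidents.groupby("District").agg(
    volume=("Arrest", "size"),
    arrest_rate=("Arrest", "mean"),
)

# Assumed thresholds: median volume and mean arrest rate across districts.
high_volume = profile["volume"] > profile["volume"].median()
high_arrest = profile["arrest_rate"] > profile["arrest_rate"].mean()

profile["quadrant"] = (
    high_volume.map({True: "high-volume", False: "low-volume"})
    + " / "
    + high_arrest.map({True: "high-arrest", False: "low-arrest"})
)
print(profile)
```

In this toy data, districts 11 and 10 have the same volume but land in different quadrants, which is the pattern the analysis above describes.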
Q2: Which crime types lead to arrests most and least often?
High arrest rates correlate strongly with police-initiated incidents (narcotics, interference with officers), crimes where the arrest is the intended outcome of the police interaction. Reactive crimes committed against people or property (burglary, theft) show low arrest rates because police arrive after the fact.
This distinction between proactive and reactive policing became a core conceptual framework for the entire project.
Q3: Does time (hour, month, year) affect arrest probability?
- Month: Minimal effect, with only ~4% variation across seasons.
- Year: Major drops in arrest rates around 2015–2016 and 2019–2022, likely driven by city leadership changes and COVID-19.
- Hour: The most interesting finding. Crime volume is high throughout the day, but arrest rates spike sharply after midnight. A plausible explanation is that fewer bystanders and less ambient noise at night let police respond more decisively.
Q4: Does micro-location (location type) affect arrest rate?
Location type is a strong predictor. In private spaces (homes, near ATMs) police typically arrive long after the crime and rarely catch anyone. In monitored or controlled spaces (airports, stairways, elevators) arrest rates are far higher due to constant police presence or camera coverage.
Your surroundings at the moment of a crime are a significant factor in whether an arrest follows.
EDA Summary
| Factor | Impact on Arrest Rate |
|---|---|
| Crime Type (proactive vs. reactive) | Very High |
| Micro-location (monitored vs. private) | High |
| Hour of day | Moderate–High |
| Police District | Moderate |
| Month of year | Low |
Feature Engineering
Features were selected and built based on the EDA findings:
- Temporal features extracted from `Date`: `Hour`, `Month`, `Year`
- Group-level aggregation: each (Hour, District, Location Description) combination was summarized into a historical arrest rate, the regression target
- One-Hot Encoding of `Location Description` and `District`
- K-Means Clustering on district profiles (average hourly crime volume × average arrest rate), which produced 3 District Archetypes added as a new feature:
| Archetype | Description |
|---|---|
| High-Volume / High-Arrest | e.g. District 11 |
| High-Volume / Low-Arrest | e.g. Districts 10, 15 |
| Lower-Volume / Mixed | Remaining districts |
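The aggregation, clustering, and encoding steps listed above could look roughly like this. It is a sketch on invented counts, not the notebook's code: the names `avg_hourly_volume`, `avg_arrest_rate`, and `archetype` are hypothetical, and scaling the profiles before K-Means and one-hot encoding the archetype are assumed choices.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

SEED = 42

# Made-up counts: (district, hour, location, n_crimes, n_arrests).
spec = [
    (11, 0, "STREET", 40, 24),    (11, 13, "STREET", 40, 20),
    (10, 0, "STREET", 40, 6),     (10, 13, "RESIDENCE", 36, 4),
    (15, 0, "RESIDENCE", 38, 5),  (15, 13, "STREET", 38, 6),
    (20, 0, "STREET", 8, 3),      (20, 13, "AIRPORT", 6, 3),
    (16, 0, "RESIDENCE", 6, 1),   (16, 13, "STREET", 8, 2),
    (1, 0, "STREET", 10, 3),      (1, 13, "AIRPORT", 6, 2),
]
rows = []
for district, hour, location, n_crimes, n_arrests in spec:
    rows += [(district, hour, location, i < n_arrests) for i in range(n_crimes)]
incidents = pd.DataFrame(
    rows, columns=["District", "Hour", "Location Description", "Arrest"]
)

# 1) Group-level target: historical arrest rate per (Hour, District, Location).
keys = ["Hour", "District", "Location Description"]
groups = incidents.groupby(keys, as_index=False).agg(
    n_crimes=("Arrest", "size"),
    arrest_rate=("Arrest", "mean"),
)

# 2) District profiles: average hourly crime volume and average arrest rate.
hourly_volume = incidents.groupby(["District", "Hour"]).size()
profiles = pd.DataFrame({
    "avg_hourly_volume": hourly_volume.groupby(level="District").mean(),
    "avg_arrest_rate": incidents.groupby("District")["Arrest"].mean(),
})

# 3) K-Means with 3 clusters on the scaled profiles gives the archetypes.
scaled = StandardScaler().fit_transform(profiles)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=SEED)
profiles["archetype"] = kmeans.fit_predict(scaled)

# 4) Attach the archetype to each group and one-hot encode the categoricals.
groups["archetype"] = groups["District"].map(profiles["archetype"])
categorical = ["District", "Location Description", "archetype"]
X = pd.get_dummies(groups[["Hour"] + categorical], columns=categorical, dtype=int)
y = groups["arrest_rate"]
```

The cluster ids returned by K-Means are arbitrary integers, so the human-readable archetype names in the table have to be assigned afterwards by inspecting each cluster's profile.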
Regression Models
Target: Hourly arrest rate (continuous, 0.0 to 1.0) for each (Time, District, Location) group.
Baseline: Linear Regression (raw features)
| Metric | Score |
|---|---|
| MAE | 0.2264 |
| RMSE | 0.2796 |
On average the baseline prediction was off by ~22.6 percentage points. It captured mild central cases reasonably but failed at the extremes.
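For reference, MAE and RMSE for a linear baseline can be computed as in the sketch below. The data is synthetic and the resulting scores are unrelated to the figures in the table; only the metric calculation is the point.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

SEED = 42
rng = np.random.default_rng(SEED)

# Synthetic stand-in: three numeric features and a noisy rate clipped to [0, 1].
X = rng.uniform(0, 1, size=(500, 3))
y = np.clip(0.2 + 0.4 * X[:, 0] - 0.1 * X[:, 1] + rng.normal(0, 0.2, 500), 0, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED
)

baseline = LinearRegression().fit(X_train, y_train)
pred = baseline.predict(X_test)

# MAE: mean absolute error, in the same units as the arrest rate.
mae = mean_absolute_error(y_test, pred)
# RMSE: square root of the mean squared error; penalizes large misses more.
rmse = np.sqrt(mean_squared_error(y_test, pred))

print(f"MAE={mae:.4f}  RMSE={rmse:.4f}")
```

RMSE is never smaller than MAE, and a wide gap between the two points to a few large errors, which fits the observation that the baseline struggled at the extremes.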
Improved Models: Engineered Features
| Model | Performance vs. Baseline |
|---|---|
| Linear Regression (engineered) | Improved |
| Random Forest Regressor | Lower MAE / Higher R² |
| Gradient Boosting Regressor | Best overall |
Winner: Gradient Boosting Regressor, exported as `chicago_crime_gb_model.pkl`
The improvement from baseline to engineered features was meaningful but not dramatic, which is expected. Predicting whether a person will be arrested involves social and psychological dynamics that no finite feature set can fully encode.
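A comparison of the three regressors can be run in a loop like the one below. This is a sketch on synthetic data with default hyperparameters (an assumption; the notebook's settings are not stated here), so which model wins in this toy setup says nothing about the real results.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

SEED = 42
rng = np.random.default_rng(SEED)

# Synthetic target with an interaction term a plain linear model cannot express.
X = rng.uniform(0, 1, size=(600, 4))
y = np.clip(0.1 + 0.6 * X[:, 0] * X[:, 1] + rng.normal(0, 0.05, 600), 0, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED
)

candidates = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(random_state=SEED),
    "Gradient Boosting": GradientBoostingRegressor(random_state=SEED),
}

scores = {}
for name, model in candidates.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    scores[name] = {
        "mae": mean_absolute_error(y_test, pred),
        "r2": r2_score(y_test, pred),
    }

# Pick the model with the lowest MAE on the held-out split.
best = min(scores, key=lambda name: scores[name]["mae"])
for name, s in scores.items():
    print(f"{name:18s} MAE={s['mae']:.4f}  R2={s['r2']:.3f}")
print("Lowest MAE:", best)
```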
Regression → Classification
The continuous arrest rate target was converted into a binary classification problem.
A median split was attempted first, but since most groups have zero arrests, the median was exactly 0.0. The most meaningful and natural threshold became Zero vs. Non-Zero:
| Class | Meaning |
|---|---|
| Class 0 | Arrest rate = 0.0: zero arrests made ("total getaway") |
| Class 1 | Arrest rate > 0.0: at least one arrest made |
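The thresholding can be illustrated with a few made-up arrest rates. When more than half of the values are zero, the median is 0.0, so a "greater than median" rule and the Zero vs. Non-Zero rule produce the same labels.

```python
import numpy as np

# Made-up group-level arrest rates; most groups have no arrests at all.
arrest_rate = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.25, 1.0])

median = np.median(arrest_rate)

# Zero vs. Non-Zero labels: 1 if at least one arrest was made in the group.
y_class = (arrest_rate > 0).astype(int)

# With a median of 0.0, a strict median split gives identical labels,
# while a ">= median" rule would put every group into Class 1.
y_median_split = (arrest_rate > median).astype(int)
all_positive_if_inclusive = bool((arrest_rate >= median).all())

print(median, y_class.tolist(), all_positive_if_inclusive)
```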
Class Balance
| Class | Approximate Share |
|---|---|
| Class 0 (No Arrest) | ~85% |
| Class 1 (Arrest) | ~15% |
The dataset is heavily imbalanced. A model that always guesses "no arrest" would score ~85% accuracy while being completely useless. For this reason, Precision and F1-Score for Class 1 are the primary evaluation metrics.
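The "always guess no arrest" baseline is easy to reproduce with scikit-learn's `DummyClassifier`. The 85/15 split below is a toy array built to mirror the approximate class shares in the table.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score

# Toy labels mirroring the approximate class balance: 85% Class 0, 15% Class 1.
y = np.array([0] * 85 + [1] * 15)
X = np.zeros((len(y), 1))  # features are irrelevant to this baseline

# Always predicts the most frequent class, i.e. "no arrest".
dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = dummy.predict(X)

accuracy = accuracy_score(y, pred)
# zero_division=0 avoids a warning when no positive predictions exist.
precision_1 = precision_score(y, pred, zero_division=0)
f1_1 = f1_score(y, pred, zero_division=0)

print(f"accuracy={accuracy:.2f}  precision(1)={precision_1:.2f}  F1(1)={f1_1:.2f}")
```

Accuracy looks respectable, while Precision and F1 for Class 1 are both zero, which is why accuracy is not used as the headline metric here.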
Precision vs. Recall β The Operational Choice
From the perspective of a real-world policing tool, False Positives are more damaging than False Negatives. Sending patrol resources to a location that yields no arrest wastes time and erodes trust. Therefore Precision (of predicted arrests, how many were real) is prioritized over Recall (how many real arrests were found).
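Classification Models
Three classifiers were trained on the engineered features, using `class_weight='balanced'` to account for the imbalance. Note that scikit-learn's `GradientBoostingClassifier` does not accept a `class_weight` argument, so this setting can only apply to Logistic Regression and Random Forest.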
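A training loop for the three classifiers might look like the sketch below. It uses a synthetic imbalanced dataset from `make_classification` and default hyperparameters, both assumptions. `class_weight="balanced"` is passed only to the two estimators that accept it; scikit-learn's `GradientBoostingClassifier` has no such parameter.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split

SEED = 42

# Synthetic stand-in with roughly 85% negatives and 15% positives.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.85, 0.15], random_state=SEED
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=SEED
)

models = {
    "Logistic Regression": LogisticRegression(class_weight="balanced", max_iter=1000),
    "Random Forest": RandomForestClassifier(class_weight="balanced", random_state=SEED),
    # GradientBoostingClassifier does not take class_weight.
    "Gradient Boosting": GradientBoostingClassifier(random_state=SEED),
}

results = {}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    # ravel() on a 2x2 confusion matrix yields TN, FP, FN, TP in that order.
    tn, fp, fn, tp = confusion_matrix(y_test, pred, labels=[0, 1]).ravel()
    results[name] = {
        "precision": precision_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred, zero_division=0),
        "false_positives": int(fp),
        "false_negatives": int(fn),
    }

for name, r in results.items():
    print(name, r)
```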
Logistic Regression
Logistic Regression's primary failure mode is a very large number of False Positives (13,832). As a linear model trained with balanced class weights, it predicts "Arrest" aggressively, catching many true positives but triggering far too many false alarms. It is not suitable for resource allocation.
Random Forest
Random Forest is more balanced than Logistic Regression, with fewer False Positives (12,439) and a similar True Positive count (17,370). It is a stronger model but still generates a significant volume of false alarms.
Gradient Boosting (Winner)
Gradient Boosting is the most conservative and precise model. It achieves the fewest False Positives (5,720) by being selective, predicting "Arrest" only when the environmental signal is strong. This comes at the cost of more False Negatives (16,715), but for operational use that trade-off is correct: it is far better to miss some arrests than to waste resources on false alarms.
Model Comparison
| Model | Precision (Class 1) | Recall (Class 1) | F1 (Class 1) | False Positives |
|---|---|---|---|---|
| Logistic Regression | lower | high | ~0.60 | 13,832 (most) |
| Random Forest | medium | medium | ~0.60 | 12,439 |
| Gradient Boosting (winner) | 0.67 (highest) | lower | ~0.51 | 5,720 (fewest) |
Winner: Gradient Boosting Classifier, exported as `chicago_crime_gb_classifier.pkl`
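The reported counts can be cross-checked with a few lines of arithmetic. Random Forest's precision follows directly from its stated True and False Positives. For Gradient Boosting the True Positive count is not stated, so the sketch below backs it out from the rounded precision of 0.67, which makes the resulting recall and F1 approximate.

```python
def precision(tp: float, fp: float) -> float:
    """Of all predicted arrests, the share that were real."""
    return tp / (tp + fp)


def recall(tp: float, fn: float) -> float:
    """Of all real arrests, the share that were found."""
    return tp / (tp + fn)


def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# Random Forest: both counts are stated in the text above.
rf_precision = precision(tp=17_370, fp=12_439)

# Gradient Boosting: FP, FN and precision are stated; TP is inferred.
gb_fp, gb_fn, gb_precision = 5_720, 16_715, 0.67
gb_tp = gb_precision * gb_fp / (1 - gb_precision)  # rearranged precision formula
gb_recall = recall(gb_tp, gb_fn)
gb_f1 = f1(gb_precision, gb_recall)

print(f"RF precision ~ {rf_precision:.3f}")
print(f"GB implied TP ~ {gb_tp:,.0f}, recall ~ {gb_recall:.2f}, F1 ~ {gb_f1:.2f}")
```

The implied Gradient Boosting F1 comes out close to the ~0.51 in the table, so the reported precision, error counts, and F1 are mutually consistent.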
Key Takeaways
- The nature of the crime (proactive vs. reactive policing) is the single strongest predictor of arrest.
- Location type provides strong signal: monitored spaces have far higher arrest rates regardless of crime type.
- Hour of day matters more than month of year; late-night hours see elevated arrest rates relative to crime volume.
- District archetypes (K-Means engineered feature) improved model performance over raw district IDs.
- Class imbalance reflects reality. The right response is choosing the correct metric (Precision, F1), not artificially rebalancing the data.
- Human behavior has limits as a prediction target. The models perform well on typical cases but the extremes remain hard to predict, which is honest and expected.
How to Use the Models
```python
import pickle

# Load the regression model
with open('chicago_crime_gb_model.pkl', 'rb') as f:
    reg_model = pickle.load(f)

# Load the classification model
with open('chicago_crime_gb_classifier.pkl', 'rb') as f:
    clf_model = pickle.load(f)

# Both models expect the engineered feature set from Part 4 of the notebook.
# See notebook.ipynb for the full feature engineering pipeline.

# Predict arrest rate (regression)
# y_pred_rate = reg_model.predict(X_new)    # float, 0.0 to 1.0

# Predict arrest / no arrest (classification)
# y_pred_class = clf_model.predict(X_new)   # 0 = no arrest, 1 = arrest
```
Environment & Libraries
Python 3.10+
pandas, numpy
scikit-learn
plotly, matplotlib, seaborn
folium
datasets (HuggingFace)
scipy, statsmodels
All random operations use SEED = 42 for full reproducibility.
Author
adam lambez



