# Insurance Fraud Prediction Model
 
This project focuses on building and evaluating a machine learning model to detect fraudulent insurance claims. 
The project involves data preprocessing, model training using a RandomForestClassifier, model evaluation with 
various metrics and visualizations, and a Streamlit UI for interacting with the model.

### Installation

Create and activate a virtual environment:

```bash
python -m venv env
source env/bin/activate  # On Windows use `env\Scripts\activate`
```

Install the required packages:

```bash
pip install -r requirements.txt
```

### Project Structure
```bash
insurance-fraud-detection/
│
├── dataset/
│   └── insurance_claims.csv
│
├── model/
│   └── only_model.joblib
│
├── train.py
├── prediction.py
├── app.py
├── requirements.txt
└── README.md
```


### Data Preprocessing
#### Data Loading
The data is loaded from a CSV file located at `dataset/insurance_claims.csv`. During loading, the following steps are performed:

- Drop the `_c39` column.
- Replace `'?'` with `NaN`.
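The loading steps above can be sketched as a small helper (`load_claims` is an illustrative name, not code from the repo; the column names follow this README):

```python
import numpy as np
import pandas as pd

def load_claims(path_or_buffer):
    """Read the claims CSV, drop the empty _c39 column, and mark '?' as missing."""
    df = pd.read_csv(path_or_buffer)
    # errors="ignore" keeps the helper usable if the column is already gone
    df = df.drop(columns=["_c39"], errors="ignore")
    return df.replace("?", np.nan)
```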

#### Data Cleaning
- Fill missing values in the `property_damage`, `police_report_available`, and `collision_type` columns with their mode.
- Drop duplicate records.
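As a sketch, the cleaning rules might look like this (`clean_claims` is a hypothetical helper, not code from the repo):

```python
import pandas as pd

def clean_claims(df: pd.DataFrame) -> pd.DataFrame:
    """Mode-fill the listed categorical columns, then drop duplicate rows."""
    df = df.copy()
    for col in ["property_damage", "police_report_available", "collision_type"]:
        # mode() returns a Series of the most frequent values; take the first
        df[col] = df[col].fillna(df[col].mode()[0])
    return df.drop_duplicates()
```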

#### Encoding and Feature Selection
- Encode categorical variables using Label Encoding.
- Drop unnecessary columns that are not relevant to the model.
- Select the final set of features for the model.
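A minimal label-encoding pass over the object-typed columns might look like this (a sketch only; the repo may encode specific columns differently):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_categoricals(df: pd.DataFrame) -> pd.DataFrame:
    """Replace string values in every object-typed column with integer codes."""
    df = df.copy()
    for col in df.select_dtypes(include="object").columns:
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))
    return df
```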

#### Preprocessed Features
The final set of features used for model training:

- `incident_severity`
- `insured_hobbies`
- `total_claim_amount`
- `months_as_customer`
- `policy_annual_premium`
- `incident_date`
- `capital-loss`
- `capital-gains`
- `insured_education_level`
- `incident_city`
- `fraud_reported` (target variable)
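Splitting a preprocessed frame into inputs and target from this list could be as simple as (`select_features` is an illustrative helper; the feature names are taken from the list above):

```python
import pandas as pd

FEATURES = [
    "incident_severity", "insured_hobbies", "total_claim_amount",
    "months_as_customer", "policy_annual_premium", "incident_date",
    "capital-loss", "capital-gains", "insured_education_level", "incident_city",
]

def select_features(df: pd.DataFrame):
    """Split a preprocessed frame into the model inputs X and the target y."""
    return df[FEATURES], df["fraud_reported"]
```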



### Model Training

The model is trained using a RandomForestClassifier within a pipeline that includes preprocessing steps and hyperparameter tuning via GridSearchCV.



#### Training Steps

1. **Train-test split:** The data is split into training and testing sets with a 70-30 split.
2. **Pipeline setup:** A pipeline is created to combine preprocessing and model training.
3. **Hyperparameter tuning:** A grid search is performed to find the best hyperparameters.
4. **Model training:** The best model is trained on the training data.
5. **Model saving:** The trained model is saved as `fraud_insurance_pipeline.joblib`.
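The steps above can be sketched end to end. Everything here, the synthetic data, the scaler choice, and the grid values, is illustrative rather than the repo's actual configuration:

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the preprocessed claims data
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# 1. Train-test split (70-30)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 2. Pipeline: preprocessing + classifier
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(random_state=42)),
])

# 3. Grid search over a small hyperparameter grid
search = GridSearchCV(
    pipe,
    {"clf__n_estimators": [50, 100], "clf__max_depth": [None, 10]},
    cv=3,
    scoring="roc_auc",
)

# 4. Fit: GridSearchCV refits the best model on the full training set
search.fit(X_train, y_train)

# 5. Persist the best pipeline
joblib.dump(search.best_estimator_, "fraud_insurance_pipeline.joblib")
```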



### Model Evaluation

The trained model is evaluated on the test set using the following metrics:

- **Classification Report:** Precision, Recall, F1-score.
- **AUC Score:** Area Under the ROC Curve.
- **Confusion Matrix:** Visual comparison of true vs. predicted labels.
- **ROC Curve:** Receiver Operating Characteristic curve.
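These metrics can all be computed with scikit-learn. This sketch fits a synthetic model in place of the repo's saved pipeline, purely to show the metric calls:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    classification_report, confusion_matrix, roc_auc_score, roc_curve,
)
from sklearn.model_selection import train_test_split

# Stand-in for the trained pipeline and the held-out test split
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred))   # precision / recall / F1
print("AUC:", roc_auc_score(y_test, y_prob))   # area under the ROC curve
print(confusion_matrix(y_test, y_pred))        # true vs. predicted counts
fpr, tpr, _ = roc_curve(y_test, y_prob)        # points for plotting the ROC curve
```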





### Usage

#### Training the Model

To train the model, run:

```bash
python train.py
```

#### Evaluating the Model

To evaluate the model, run:

```bash
python prediction.py
```

#### Running the Streamlit App

To launch the Streamlit app, run:

```bash
streamlit run app.py
```