# Insurance Fraud Prediction Model

This project focuses on building and evaluating a machine learning model that detects fraudulent insurance claims. It covers data preprocessing, model training with a RandomForestClassifier, model evaluation with various metrics and visualizations, and a Streamlit UI for interacting with the model.

### Installation

Create and activate a virtual environment:

```bash
python -m venv env
source env/bin/activate  # On Windows use `env\Scripts\activate`
```

Install the required packages:

```bash
pip install -r requirements.txt
```

### Project Structure

```bash
insurance-fraud-detection/
│
├── dataset/
│   └── insurance_claims.csv
│
├── model/
│   └── only_model.joblib
│
├── train.py
├── prediction.py
├── app.py
├── requirements.txt
└── README.md
```

### Data Preprocessing

#### Data Loading

The data is loaded from a CSV file located at `dataset/insurance_claims.csv`. During loading, the following steps are performed:

- Drop the `_c39` column.
- Replace `'?'` with `NaN`.

#### Data Cleaning

- Fill missing values in the `property_damage`, `police_report_available`, and `collision_type` columns with their mode.
- Drop duplicate records.
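The loading and cleaning steps above map to a few pandas calls. Here is a minimal sketch, assuming pandas and NumPy; variable names are illustrative and may differ from `train.py`:

```python
import numpy as np
import pandas as pd

# Load the raw claims, drop the empty _c39 column, and treat
# '?' placeholders as missing values.
df = pd.read_csv("dataset/insurance_claims.csv")
df = df.drop(columns=["_c39"]).replace("?", np.nan)

# Fill the sparse categorical columns with their mode.
for col in ["property_damage", "police_report_available", "collision_type"]:
    df[col] = df[col].fillna(df[col].mode()[0])

# Drop duplicate records.
df = df.drop_duplicates()
```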
#### Encoding and Feature Selection

- Encode categorical variables using label encoding.
- Drop columns that are not relevant for the model.
- Select the final set of features (see the sketch after the feature list below).

#### Preprocessed Features

The final set of features used for model training:

- `incident_severity`
- `insured_hobbies`
- `total_claim_amount`
- `months_as_customer`
- `policy_annual_premium`
- `incident_date`
- `capital-loss`
- `capital-gains`
- `insured_education_level`
- `incident_city`
- `fraud_reported` (target variable)
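A minimal sketch of the encoding and selection steps, assuming scikit-learn's `LabelEncoder` and the `df` produced above; label-encoding `incident_date` as a plain category is a simplification of whatever `train.py` actually does with dates:

```python
from sklearn.preprocessing import LabelEncoder

# Label-encode every remaining categorical (object-dtype) column.
for col in df.select_dtypes(include="object").columns:
    df[col] = LabelEncoder().fit_transform(df[col])

# Keep only the features listed above.
features = [
    "incident_severity", "insured_hobbies", "total_claim_amount",
    "months_as_customer", "policy_annual_premium", "incident_date",
    "capital-loss", "capital-gains", "insured_education_level",
    "incident_city",
]
X = df[features]
y = df["fraud_reported"]
```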
### Model Training

The model is a RandomForestClassifier trained inside a pipeline that combines the preprocessing steps with hyperparameter tuning via GridSearchCV.

#### Training Steps

1. Train-test split: the data is split into training and testing sets with a 70/30 split.
2. Pipeline setup: a pipeline is created that includes preprocessing and model training.
3. Hyperparameter tuning: a grid search is performed to find the best hyperparameters.
4. Model training: the best model is fitted on the training data.
5. Model saving: the trained model is saved as `fraud_insurance_pipeline.joblib`.
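These steps correspond to the following sketch, assuming scikit-learn; the scaler, parameter grid, and scoring metric are illustrative assumptions rather than the exact settings in `train.py`:

```python
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# 1. 70/30 train-test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 2. Preprocessing and the classifier in one pipeline.
pipeline = Pipeline([
    ("scaler", StandardScaler()),  # illustrative preprocessing step
    ("model", RandomForestClassifier(random_state=42)),
])

# 3. Grid search over an illustrative hyperparameter grid.
param_grid = {
    "model__n_estimators": [100, 200],
    "model__max_depth": [None, 10, 20],
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="roc_auc")

# 4. Fit; GridSearchCV refits the best model on the full training set.
search.fit(X_train, y_train)

# 5. Persist the best pipeline.
joblib.dump(search.best_estimator_, "fraud_insurance_pipeline.joblib")
```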
### Model Evaluation

The trained model is evaluated on the test set using the following metrics:

- Classification report: precision, recall, and F1-score.
- AUC score: area under the ROC curve.
- Confusion matrix: visual comparison of true vs. predicted labels.
- ROC curve: receiver operating characteristic curve.
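A minimal evaluation sketch using scikit-learn's metrics and display helpers; it reuses `X_test`, `y_test`, and `search` from the training sketch above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import (
    ConfusionMatrixDisplay, RocCurveDisplay,
    classification_report, roc_auc_score,
)

best_model = search.best_estimator_
y_pred = best_model.predict(X_test)
y_proba = best_model.predict_proba(X_test)[:, 1]

# Precision, recall, and F1-score per class.
print(classification_report(y_test, y_pred))

# Area under the ROC curve, computed from predicted probabilities.
print("AUC:", roc_auc_score(y_test, y_proba))

# Confusion matrix and ROC curve plots.
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
RocCurveDisplay.from_predictions(y_test, y_proba)
plt.show()
```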
### Usage

#### Training the Model

To train the model, run:

```bash
python train.py
```

#### Evaluating the Model

To evaluate the model, run:

```bash
python prediction.py
```

#### Running the Streamlit App

To launch the Streamlit app, run:

```bash
streamlit run app.py
```
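For reference, `app.py` might look roughly like the following minimal sketch; the widget choices, the model filename, and entering encoded categories as raw numbers are all illustrative assumptions:

```python
import joblib
import pandas as pd
import streamlit as st

st.title("Insurance Fraud Prediction")

# Load the trained pipeline (filename is an assumption; see train.py).
model = joblib.load("fraud_insurance_pipeline.joblib")

# One input per model feature. Encoded categorical features are entered
# as numbers here for brevity; a real app would map readable labels to codes.
features = [
    "incident_severity", "insured_hobbies", "total_claim_amount",
    "months_as_customer", "policy_annual_premium", "incident_date",
    "capital-loss", "capital-gains", "insured_education_level",
    "incident_city",
]
values = {name: st.number_input(name, value=0.0) for name in features}

if st.button("Predict"):
    prediction = model.predict(pd.DataFrame([values]))[0]
    st.write("Fraud reported." if prediction == 1 else "No fraud detected.")
```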