# Insurance Fraud Prediction Model

This project builds and evaluates a machine learning model to detect fraudulent insurance claims. It covers data preprocessing, model training using a RandomForestClassifier, model evaluation with various metrics and visualizations, and a Streamlit UI for interacting with the model.

### Installation

Create and activate a virtual environment:

```bash
python -m venv env
source env/bin/activate  # On Windows use `env\Scripts\activate`
```

Install the required packages:

```bash
pip install -r requirements.txt
```

### Project Structure

```bash
insurance-fraud-detection/
│
├── dataset/
│   └── insurance_claims.csv
│
├── model/
│   └── only_model.joblib
│
├── train.py
├── prediction.py
├── app.py
├── requirements.txt
└── README.md
```

### Data Preprocessing

#### Data Loading

The data is loaded from a CSV file located at `dataset/insurance_claims.csv`. During loading, the following steps are performed:

- Drop the `_c39` column.
- Replace `'?'` with `NaN`.

#### Data Cleaning

- Fill missing values in the `property_damage`, `police_report_available`, and `collision_type` columns with their mode.
- Drop duplicate records.

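A minimal sketch of these cleaning steps (the `clean_claims` helper is illustrative; the column names are the ones listed above):

```python
import pandas as pd

def clean_claims(df):
    """Fill missing categorical values with the column mode and drop duplicates."""
    df = df.copy()
    for col in ["property_damage", "police_report_available", "collision_type"]:
        df[col] = df[col].fillna(df[col].mode()[0])
    return df.drop_duplicates()
```
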
#### Encoding and Feature Selection

- Encode categorical variables using label encoding.
- Drop columns that are not relevant for the model.
- Select the final set of features for the model.

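Label encoding of the categorical columns might look like this sketch, which encodes every object-typed column with scikit-learn's `LabelEncoder` (the helper name is illustrative):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_categoricals(df):
    """Label-encode every object-typed column; numeric columns pass through."""
    df = df.copy()
    for col in df.select_dtypes(include="object").columns:
        df[col] = LabelEncoder().fit_transform(df[col])
    return df
```
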
#### Preprocessed Features

The final set of features used for model training:

- `incident_severity`
- `insured_hobbies`
- `total_claim_amount`
- `months_as_customer`
- `policy_annual_premium`
- `incident_date`
- `capital-loss`
- `capital-gains`
- `insured_education_level`
- `incident_city`
- `fraud_reported` (target variable)

### Model Training

The model is trained using a RandomForestClassifier in a pipeline that combines the preprocessing steps with the classifier, with hyperparameters tuned via GridSearchCV.

#### Training Steps

1. Train-test split: the data is split into training and testing sets with a 70/30 split.
2. Pipeline setup: a pipeline is created that includes preprocessing and model training.
3. Hyperparameter tuning: a grid search is performed to find the best hyperparameters.
4. Model training: the best model is trained on the training data.
5. Model saving: the trained model is saved as `fraud_insurance_pipeline.joblib`.

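The training steps above can be sketched as follows. The 70/30 split, the pipeline with GridSearchCV, and the saved file name come from this README; the scaler stand-in for the preprocessing step and the hyperparameter grid are illustrative assumptions:

```python
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def train_model(X, y):
    """70/30 split, pipeline + grid search, then persist the best estimator."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y
    )
    pipe = Pipeline([
        ("scaler", StandardScaler()),            # stand-in for preprocessing
        ("clf", RandomForestClassifier(random_state=42)),
    ])
    grid = GridSearchCV(
        pipe,
        param_grid={"clf__n_estimators": [50, 100]},  # illustrative grid
        cv=3,
        scoring="roc_auc",
    )
    grid.fit(X_train, y_train)
    joblib.dump(grid.best_estimator_, "fraud_insurance_pipeline.joblib")
    return grid.best_estimator_, X_test, y_test
```
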
### Model Evaluation

The trained model is evaluated on the test set. The evaluation metrics include:

- Classification Report: precision, recall, F1-score.
- AUC Score: area under the ROC curve.
- Confusion Matrix: visual comparison of true vs. predicted labels.
- ROC Curve: receiver operating characteristic curve.

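A sketch of the evaluation step, assuming a fitted binary classifier; the helper name and the returned dict are illustrative (the ROC curve itself could be plotted with scikit-learn's `RocCurveDisplay`):

```python
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

def evaluate_model(model, X_test, y_test):
    """Compute the metrics listed above for a fitted binary classifier."""
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]  # probability of the fraud class
    return {
        "report": classification_report(y_test, y_pred),
        "auc": roc_auc_score(y_test, y_prob),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
    }
```
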
### Usage

#### Training the Model

To train the model, run the following command:

```bash
python train.py
```

#### Evaluating the Model

To evaluate the model, run the following command:

```bash
python prediction.py
```

#### Running the Streamlit App

To run the Streamlit app, use the following command:

```bash
streamlit run app.py
```