Tiburoncin committed on
Commit
396ba65
•
1 Parent(s): d9904c7

Upload 2 files

🦀_Breast_Cancer_Prediction_Using_Machine_Learning.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
🦀_breast_cancer_prediction_using_machine_learning.py ADDED
@@ -0,0 +1,857 @@
+ # -*- coding: utf-8 -*-
+ """🦀 Breast Cancer Prediction Using Machine Learning
+
+ Automatically generated by Colab.
+
+ Original file is located at
+ https://colab.research.google.com/#fileId=https%3A//storage.googleapis.com/kaggle-colab-exported-notebooks/breast-cancer-prediction-using-machine-learning-64dbd263-f311-46a0-9f3a-6d5379802a34.ipynb%3FX-Goog-Algorithm%3DGOOG4-RSA-SHA256%26X-Goog-Credential%3Dgcp-kaggle-com%2540kaggle-161607.iam.gserviceaccount.com/20240706/auto/storage/goog4_request%26X-Goog-Date%3D20240706T233729Z%26X-Goog-Expires%3D259200%26X-Goog-SignedHeaders%3Dhost%26X-Goog-Signature%3D770b61f66b57f06cbdb54f5d4bc4ba32650abee908c284002eeb0472828613c36367d32e0a38bde7138192f4066bfd1989608bbf31e1f46626f2f9cf0ca2e8845b9e2b421ac0b2af146b3e14860f016c245a0909ac13965a6f7ea58b4f3425f3e42c50b8ddffc177dd6cecb561b8c4d47054356112477f0f1c5819cba3750f4737d50937a291458ce7a92ba56dd0f3dd2b91bac287210da2318d5f4e74d79aa63b496369ed514c57b8e8953a3b1b9cdf673261822f27b2e488f4c2d7c225be9fa7d959fa1afa6fb5455d6f2a8db1f67711c39e69e654183c88e15fb420a0b8696bc1d6420a2d81f03eb8b5ebb8e80c40d7cf7664fb585951d3ae1dc04093d6a0
+ """
+
+ # IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES
+ # TO THE CORRECT LOCATION (/kaggle/input) IN YOUR NOTEBOOK,
+ # THEN FEEL FREE TO DELETE THIS CELL.
+ # NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
+ # ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
+ # NOTEBOOK.
+
+ import os
+ import sys
+ from tempfile import NamedTemporaryFile
+ from urllib.request import urlopen
+ from urllib.parse import unquote, urlparse
+ from urllib.error import HTTPError
+ from zipfile import ZipFile
+ import tarfile
+ import shutil
+
+ CHUNK_SIZE = 40960
+ DATA_SOURCE_MAPPING = 'breast-cancer-wisconsin-data:https%3A%2F%2Fstorage.googleapis.com%2Fkaggle-data-sets%2F180%2F408%2Fbundle%2Farchive.zip%3FX-Goog-Algorithm%3DGOOG4-RSA-SHA256%26X-Goog-Credential%3Dgcp-kaggle-com%2540kaggle-161607.iam.gserviceaccount.com%252F20240706%252Fauto%252Fstorage%252Fgoog4_request%26X-Goog-Date%3D20240706T233729Z%26X-Goog-Expires%3D259200%26X-Goog-SignedHeaders%3Dhost%26X-Goog-Signature%3D2a42b19591dbfb7e3dadedf38ba5c4a2f41943260a2d6207aadbbdd6dc68ac198d85c58f17405095296f8c79de5c9517c6b9fdead7a5db588fea525cfb3a0474d6648706bd7ed55b1eec6b7718d64035647349365aa3b684519ef9f3ee4b750db4f314a520cd629a09d7a6ab3553ca46600d66b8613a67f2335fcfb93a051a47237d3adde9a5dbeccff7f24f0de64e5dc4346b7d5fcf85ce9ef16e62007599a879c970761ea4b4dfdc90568736428bca9722b7c679b20b5843c031092316569902ec1e5e413c2fb039207260c95e5cea134c8a4bc1f27e559256bb1c78141d4a53f01b9253fa597423bf463719f5f3d47f21afdf5c9030c3fd43009a347010b5'
+
+ KAGGLE_INPUT_PATH='/kaggle/input'
+ KAGGLE_WORKING_PATH='/kaggle/working'
+ KAGGLE_SYMLINK='kaggle'
+
+ !umount /kaggle/input/ 2> /dev/null
+ shutil.rmtree('/kaggle/input', ignore_errors=True)
+ os.makedirs(KAGGLE_INPUT_PATH, 0o777, exist_ok=True)
+ os.makedirs(KAGGLE_WORKING_PATH, 0o777, exist_ok=True)
+
+ try:
+     os.symlink(KAGGLE_INPUT_PATH, os.path.join("..", 'input'), target_is_directory=True)
+ except FileExistsError:
+     pass
+ try:
+     os.symlink(KAGGLE_WORKING_PATH, os.path.join("..", 'working'), target_is_directory=True)
+ except FileExistsError:
+     pass
+
+ for data_source_mapping in DATA_SOURCE_MAPPING.split(','):
+     directory, download_url_encoded = data_source_mapping.split(':')
+     download_url = unquote(download_url_encoded)
+     filename = urlparse(download_url).path
+     destination_path = os.path.join(KAGGLE_INPUT_PATH, directory)
+     try:
+         with urlopen(download_url) as fileres, NamedTemporaryFile() as tfile:
+             total_length = fileres.headers['content-length']
+             print(f'Downloading {directory}, {total_length} bytes compressed')
+             dl = 0
+             data = fileres.read(CHUNK_SIZE)
+             while len(data) > 0:
+                 dl += len(data)
+                 tfile.write(data)
+                 done = int(50 * dl / int(total_length))
+                 sys.stdout.write(f"\r[{'=' * done}{' ' * (50-done)}] {dl} bytes downloaded")
+                 sys.stdout.flush()
+                 data = fileres.read(CHUNK_SIZE)
+             if filename.endswith('.zip'):
+                 with ZipFile(tfile) as zfile:
+                     zfile.extractall(destination_path)
+             else:
+                 with tarfile.open(tfile.name) as tar_archive:  # distinct name, so the tarfile module is not shadowed
+                     tar_archive.extractall(destination_path)
+             print(f'\nDownloaded and uncompressed: {directory}')
+     except HTTPError as e:
+         print(f'Failed to load (likely expired) {download_url} to path {destination_path}')
+         continue
+     except OSError as e:
+         print(f'Failed to load {download_url} to path {destination_path}')
+         continue
+
+ print('Data source import complete.')
+
+
+
+ """# 🦀 Breast Cancer Prediction Using Machine Learning
+
+ <div class="text-success"><h3>Table of Contents</h3></div>
+
+ ---
+
+ > ### Steps are:
+
+ 1. [Gathering Data](#1)
+ 2. [Exploratory Data Analysis](#2)
+ 3. [Data Visualizations](#3)
+ 4. [Model Implementation](#4)
+ 5. [ML Model Selection and Prediction](#5)
+ 6. [Hyperparameter Tuning the ML Model](#6)
+ 7. [Deploy Model](#7)
+
+ Hope you love it and get a better learning experience. 🙏
+
+ <center><img src="https://healthitanalytics.com/images/site/article_headers/_normal/ThinkstockPhotos-495951912.jpg" alt="Breast Cancer Prediction Using Machine Learning" height="70%" width="100%" /></center>
+
+ ### Attribute Information:
+
+ 1. ID number
+ 2. Diagnosis (M = malignant, B = benign)
+
+ Ten real-valued features are computed for each cell nucleus:
+
+ 1. radius (mean of distances from center to points on the perimeter)
+ 2. texture (standard deviation of gray-scale values)
+ 3. perimeter
+ 4. area
+ 5. smoothness (local variation in radius lengths)
+ 6. compactness (perimeter^2 / area - 1.0)
+ 7. concavity (severity of concave portions of the contour)
+ 8. concave points (number of concave portions of the contour)
+ 9. symmetry
+ 10. fractal dimension ("coastline approximation" - 1)
+ """
+
+ import numpy as np # linear algebra
+ import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
+
+ pd.options.display.max_columns = 100
+
+ """After installing the NumPy and pandas packages, we are ready to fetch the data with pandas. Before we use it, we need to know where the dataset is located, i.e. its path."""
+
+ import os
+ for dirname, _, filenames in os.walk('/kaggle/input'):
+     for filename in filenames:
+         print(os.path.join(dirname, filename))
+
+ """<a id="1"></a><br>
142
+
143
+ # 1. Data Collection.
144
+ """
145
+
146
+ data = pd.read_csv("/kaggle/input/breast-cancer-wisconsin-data/data.csv")
147
+
148
+ """After collecting data, we need to know what are the shape of this dataset, Here we have attribute(`property`) called `data.shape`
149
+
150
+ For that we have 2 type of methods to show the shape of the datasets.
151
+
152
+ 1. `len(data.index), len(data.columns)`
153
+ - `data.shape`
154
+
155
+ Both methods are giving us the same output, As you can see in the below cells`
156
+ """
157
+
158
+ # Cell 1
159
+ len(data.index), len(data.columns)
160
+
161
+ # Cell 2
162
+ data.shape
163
+
164
+ data.head()
165
+
166
+ data.tail()
167
+
168
+ """<a id="2"></a><br>
169
+ # 2. Exploring Data Analysis
170
+ """
171
+
172
+ data.info()
173
+
174
+ data.isna()
175
+
176
+ data.isna().any()
177
+
178
+ data.isna().sum()
179
+
180
+ data = data.dropna(axis='columns')
181
+
182
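+ """`dropna(axis='columns')` removes every column that contains a missing value; in this CSV that is typically the empty `Unnamed: 32` column. A quick sketch (not in the original notebook) to confirm nothing is missing anymore:"""
+
+ print(data.shape)               # one column fewer than before the drop
+ print(data.isna().sum().sum())  # 0 -> no missing values remain
+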
+ """### Get object features
183
+
184
+ - Using this method, we can see how many `object(categorical)` type of feature exists in dataset
185
+ """
186
+
187
+ data.describe(include="O")
188
+
189
+ """- *As we can see abouve result there are only one single feature is categorical and it's values are `B` and `M`*
190
+
191
+ ### To know how many unique values
192
+ """
193
+
+ data.diagnosis.value_counts()
+
+ """Using the `value_counts` method, we can see the number of occurrences of each unique value in a categorical feature.
+
+ ### Identify dependent and independent features
+ """
+
+ data.head(2)
+
+ diagnosis_unique = data.diagnosis.unique()
+
+ diagnosis_unique
+
+ """<a id="3"></a><br>
+
+ # 3. Data Visualization.
+ """
+
+ # Commented out IPython magic to ensure Python compatibility.
+ import matplotlib.pyplot as plt
+ import seaborn as sns
+ import plotly.express as px
+ import plotly.graph_objects as go
+
+ # %matplotlib inline
+ sns.set_style('darkgrid')
+
+ plt.figure(figsize=(15, 5))
+
+ plt.subplot(1, 2, 1)
+ plt.hist( data.diagnosis)
+ # plt.legend()
+ plt.title("Counts of Diagnosis")
+ plt.xlabel("Diagnosis")
+
+
+ plt.subplot(1, 2, 2)
+
+ #sns.countplot('diagnosis', data=data); # ";" to remove output like this > <matplotlib.axes._subplots.AxesSubplot at 0x7f3a1dddba50>
+
+ # plt.show()
+
+ # plt.figure(figsize=(7,12))
+ px.histogram(data, x='diagnosis')
+ # plt.show()
+
+ cols = ["diagnosis", "radius_mean", "texture_mean", "perimeter_mean", "area_mean"]
+
+ sns.pairplot(data[cols], hue="diagnosis")
+ plt.show()
+
+ size = len(data['texture_mean'])
+
+ area = np.pi * (15 * np.random.rand( size ))**2
+ colors = np.random.rand( size )
+
+ plt.xlabel("texture mean")
+ plt.ylabel("radius mean")
+ plt.scatter(data['texture_mean'], data['radius_mean'], s=area, c=colors, alpha=0.5);
+
+ """### Data Filtering
+
+ - We have one categorical feature, so we need to convert it into numeric values using `LabelEncoder` from the `sklearn.preprocessing` package.
+ """
+
+ from sklearn.preprocessing import LabelEncoder
+
+ data.head(2)
+
+ """* LabelEncoder can be used to normalize labels.
+ """
+
+ labelencoder_Y = LabelEncoder()
+ data.diagnosis = labelencoder_Y.fit_transform(data.diagnosis)
+
+ """After converting to numerical values, we can check the result like this:"""
+
+ data.head(2)
+
+ print(data.diagnosis.value_counts())
+ print("\n", data.diagnosis.value_counts().sum())
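+
+ """If you ever need to map the encoded 0/1 values back to the original `B`/`M` strings, the fitted encoder keeps the mapping (a small sketch using the `labelencoder_Y` object from above):"""
+
+ print(labelencoder_Y.classes_)                   # index i is the label encoded as i, e.g. ['B' 'M']
+ print(labelencoder_Y.inverse_transform([0, 1]))  # recover the original string labels
+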
+
+ """Finally, we can see in this output that the categorical values were converted into 0 and 1.
+
+ #### Find the correlation between features (mean features only)
+ """
+
+ cols = ['diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
+         'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
+         'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean']
+ print(len(cols))
+ data[cols].corr()
+
+ plt.figure(figsize=(12, 9))
+
+ plt.title("Correlation Graph")
+
+ cmap = sns.diverging_palette( 1000, 120, as_cmap=True)
+ sns.heatmap(data[cols].corr(), annot=True, fmt='.1%', linewidths=.05, cmap=cmap);
+
+ """Using the Plotly package, we can show it as an interactive graph:"""
+
+ plt.figure(figsize=(15, 10))
+
+ fig = px.imshow(data[cols].corr());
+ fig.show()
+
+ """<a id="4"></a><br>
+
+ # Model Implementation
+
+ ---
+ ---
+
+
+ #### Train Test Splitting
+
+ ##### Preprocessing and model selection
+ """
+
+ from sklearn.model_selection import train_test_split
+
+ from sklearn.preprocessing import StandardScaler
+
+ """### Import Machine Learning Models
+
+ """
+
+ from sklearn.linear_model import LogisticRegression
+
+ from sklearn.tree import DecisionTreeClassifier
+
+ from sklearn.ensemble import RandomForestClassifier
+
+ from sklearn.naive_bayes import GaussianNB
+
+ from sklearn.neighbors import KNeighborsClassifier
+
+ """### Check the Model Accuracy, Errors and its Validation"""
+
+ from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
+
+ from sklearn.metrics import classification_report
+
+ from sklearn.model_selection import KFold
+
+ from sklearn.model_selection import cross_validate, cross_val_score
+
+ from sklearn.svm import SVC
+
+ from sklearn import metrics
+
+ """### Feature Selection
+
+ Select the features to use for prediction.
+ """
+
+ data.columns
+
+ """- Take the dependent and independent features for prediction."""
+
+ prediction_feature = [ "radius_mean", 'perimeter_mean', 'area_mean', 'symmetry_mean', 'compactness_mean', 'concave points_mean']
+
+ targeted_feature = 'diagnosis'
+
+ len(prediction_feature)
+
+ X = data[prediction_feature]
+ X
+
+ # print(X.shape)
+ # print(X.values)
+
+ y = data.diagnosis
+ y
+
+ # print(y.values)
+
+ """- Split the dataset into a training set and a testing set (33% test), with `random_state=15` fixed for reproducibility."""
+
+ X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=15)
+
+ print(X_train)
+ # print(X_test)
+
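+ """Optional sketch (not in the original notebook): because the classes are imbalanced (357 benign vs 212 malignant), passing `stratify=y` keeps the class ratio identical in both splits."""
+
+ # Stratified variant of the same split
+ X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
+     X, y, test_size=0.33, random_state=15, stratify=y)
+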
+ """### Perform Feature Standerd Scalling
382
+
383
+ Standardize features by removing the mean and scaling to unit variance
384
+
385
+ The standard score of a sample x is calculated as:
386
+
387
+ - z = (x - u) / s
388
+ """
389
+
390
+ # Scale the data to keep all the values in the same magnitude of 0 -1
391
+
392
+ sc = StandardScaler()
393
+
394
+ X_train = sc.fit_transform(X_train)
395
+ X_test = sc.fit_transform(X_test)
396
+
397
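+ """A quick sanity-check sketch: after scaling, every training column should have mean ≈ 0 and standard deviation ≈ 1."""
+
+ print(X_train.mean(axis=0).round(6))  # ~0 for each feature
+ print(X_train.std(axis=0).round(6))   # ~1 for each feature
+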
+ """<a id="5"></a><br>
398
+ # ML Model Selecting and Model PredPrediction
399
+
400
+
401
+
402
+ ---
403
+ ---
404
+
405
+ #### Model Building
406
+
407
+ Now, we are ready to build our model for prediction, for the I made function for model building and preforming prediction and measure it's prediction and accuracy score.
408
+
409
+ #### Arguments
410
+ 1. model => ML Model Object
411
+ 2. Feature Training Set data
412
+ 3. Feature Testing Set data
413
+ 4. Targetd Training Set data
414
+ 5. Targetd Testing Set data
415
+ """
416
+
417
+ def model_building(model, X_train, X_test, y_train, y_test):
+     """
+     Model fitting, prediction, and scoring.
+     Returns ('score', 'accuracy_score', 'predictions').
+     """
+
+     model.fit(X_train, y_train)
+     score = model.score(X_train, y_train)
+     predictions = model.predict(X_test)
+     accuracy = accuracy_score(predictions, y_test)
+
+     return (score, accuracy, predictions)
+
+ """Let's make a dictionary for multiple models for bulk predictions"""
432
+
433
+ models_list = {
434
+ "LogisticRegression" : LogisticRegression(),
435
+ "RandomForestClassifier" : RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=5),
436
+ "DecisionTreeClassifier" : DecisionTreeClassifier(criterion='entropy', random_state=0),
437
+ "SVC" : SVC(),
438
+ }
439
+
440
+ # print(models_list)
441
+
442
+ """Before, sending it to the prediction check the key and values to store it's values in DataFrame below."""
443
+
444
+ print(list(models_list.keys()))
445
+ print(list(models_list.values()))
446
+
447
+ # print(zip(list(models_list.keys()), list(models_list.values())))
448
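+
+ """Aside (a sketch): `dict.items()` yields the same (name, model) pairs more directly than zipping the keys and values lists."""
+
+ for name, model in models_list.items():
+     print(name, "->", type(model).__name__)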
+
+ """### Model Implementing
+
+ Now train the models one by one and show the classification report for each particular model.
+ """
+
+ # Let's define the function for confusion matrix graphs
+
+ def cm_metrix_graph(cm):
+     sns.heatmap(cm, annot=True, fmt="d")
+     plt.show()
+
+ df_prediction = []
+ confusion_matrixs = []
+ df_prediction_cols = [ 'model_name', 'score', 'accuracy_score' , "accuracy_percentage"]
+
+ for name, model in zip(list(models_list.keys()), list(models_list.values())):
+
+     (score, accuracy, predictions) = model_building(model, X_train, X_test, y_train, y_test )
+
+     print("\n\nClassification Report of '"+ str(name), "'\n")
+
+     print(classification_report(y_test, predictions))
+
+     df_prediction.append([name, score, accuracy, "{0:.2%}".format(accuracy)])
+
+     # For showing metrics
+     confusion_matrixs.append(confusion_matrix(y_test, predictions))
+
+
+ df_pred = pd.DataFrame(df_prediction, columns=df_prediction_cols)
+
+ print(len(confusion_matrixs))
+
+ plt.figure(figsize=(10, 2))
+ # plt.title("Confusion Metric Graph")
+
+ for index, cm in enumerate(confusion_matrixs):
+     # plt.xlabel("Negative Positive")
+     # plt.ylabel("True Positive")
+
+     # Show the metrics graph
+     cm_metrix_graph(cm)  # call the confusion matrix graph
+     plt.tight_layout(pad=1.0)
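+
+ """Alternatively (a sketch, not in the original notebook), scikit-learn's `ConfusionMatrixDisplay` draws a labelled confusion matrix without a hand-rolled heatmap:"""
+
+ from sklearn.metrics import ConfusionMatrixDisplay
+
+ for name, cm in zip(models_list.keys(), confusion_matrixs):
+     ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['B', 'M']).plot()
+     plt.title(name)
+     plt.show()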
+
+ """While predicting, we can store each model's score and prediction values in a newly generated DataFrame."""
+
+ df_pred
+
+ """- Print the highest accuracy score using sort_values."""
+
+ df_pred.sort_values('score', ascending=False)
+ # df_pred.sort_values('accuracy_score', ascending=False)
+
+ """### Applying K-Fold Cross-Validation ..."""
+
+ len(data)
+ # print(len(X))
+
+ # Sample for testing only
+
+ cv_score = cross_validate(LogisticRegression(), X, y, cv=3,
+                           scoring=('r2', 'neg_mean_squared_error'),
+                           return_train_score=True)
+
+ pd.DataFrame(cv_score).describe().T
+
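+ """Note that `r2` and `neg_mean_squared_error` are regression metrics; for a classifier, a classification metric reads more naturally. A sketch using accuracy (`max_iter` raised only to silence convergence warnings):"""
+
+ acc_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring='accuracy')
+ print(acc_scores, acc_scores.mean())
+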
+ """Let's define a functino for cross validation scorring for multiple ML models
522
+
523
+ """
524
+
525
+ def cross_val_scorring(model):
526
+
527
+ # (score, accuracy, predictions) = model_building(model, X_train, X_test, y_train, y_test )
528
+
529
+ model.fit(data[prediction_feature], data[targeted_feature])
530
+
531
+ # score = model.score(X_train, y_train)
532
+
533
+ predictions = model.predict(data[prediction_feature])
534
+ accuracy = accuracy_score(predictions, data[targeted_feature])
535
+ print("\nFull-Data Accuracy:", round(accuracy, 2))
536
+ print("Cross Validation Score of'"+ str(name), "'\n")
537
+
538
+
539
+ # Initialize K folds.
540
+ kFold = KFold(n_splits=5) # define 5 diffrent data folds
541
+
542
+ err = []
543
+
544
+ for train_index, test_index in kFold.split(data):
545
+ # print("TRAIN:", train_index, "TEST:", test_index)
546
+
547
+ # Data Spliting via fold indexes
548
+ X_train = data[prediction_feature].iloc[train_index, :] # train_index = rows and all columns for Prediction_features
549
+ y_train = data[targeted_feature].iloc[train_index] # all targeted features trains
550
+
551
+ X_test = data[prediction_feature].iloc[test_index, :] # testing all rows and cols
552
+ y_test = data[targeted_feature].iloc[test_index] # all targeted tests
553
+
554
+ # Again Model Fitting
555
+ model.fit(X_train, y_train)
556
+
557
+ err.append(model.score(X_train, y_train))
558
+
559
+ print("Score:", round(np.mean(err), 2) )
560
+
561
+ """Call the function to know the cross validation function by mean for our select model predictions."""
562
+
563
+ for name, model in zip(list(models_list.keys()), list(models_list.values())):
564
+ cross_val_scorring(model)
565
+
566
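+ """The fold scores above are computed on each fold's *training* split, which is why they look optimistic. A sketch of scoring on the held-out split instead, via `cross_val_score`:"""
+
+ for name, model in models_list.items():
+     fold_scores = cross_val_score(model, data[prediction_feature], data[targeted_feature], cv=5)
+     print(name, "held-out CV accuracy:", round(fold_scores.mean(), 2))
+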
+ """- Some of the model are giving prefect scorring. it means sometimes overfitting occurs
567
+
568
+ <a id="6"></a><br>
569
+ # HyperTunning the ML Model
570
+
571
+
572
+ ---
573
+ ---
574
+
575
+
576
+
577
+ ### Tuning Parameters applying...
578
+
579
+ <!-- https://www.kaggle.com/gargmanish/basic-machine-learning-with-cancer -->
580
+ """
581
+
582
+ from sklearn.model_selection import GridSearchCV
583
+
584
+ """For HyperTunning we can use `GridSearchCV` to know the best performing parameters
585
+
586
+ - GridSearchCV implements a β€œfit” and a β€œscore” method. It also implements β€œpredict”, β€œpredict_proba”, β€œdecision_function”, β€œtransform” and β€œinverse_transform” if they are implemented in the estimator used.
587
+
588
+ - The parameters of the estimator used to apply these methods are optimized by cross-validated grid-search over a parameter grid.
589
+ """
590
+
+ # Let's implement the grid search algorithm
+
+ # Pick the model
+ model = DecisionTreeClassifier()
+
+ # Tuning params ('auto' is deprecated/removed for trees in newer scikit-learn; kept as in the original run)
+ param_grid = {'max_features': ['auto', 'sqrt', 'log2'],
+               'min_samples_split': [2,3,4,5,6,7,8,9,10],
+               'min_samples_leaf':[2,3,4,5,6,7,8,9,10] }
+
+ # Implement GridSearchCV
+ gsc = GridSearchCV(model, param_grid, cv=10) # 10-fold cross-validation
+
+ gsc.fit(X_train, y_train) # model fitting
+
+ print("\n Best Score is ")
+ print(gsc.best_score_)
+
+ print("\n Best Estimator is ")
+ print(gsc.best_estimator_)
+
+ print("\n Best Parameters are")
+ print(gsc.best_params_)
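+
+ """`GridSearchCV` refits the best parameter combination on the full training data and exposes it as `best_estimator_`; a quick sketch of checking it against our held-out test set:"""
+
+ print("Test accuracy of best estimator:", gsc.best_estimator_.score(X_test, y_test))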
+
+ """### Observation
+
+ Using this algorithm, we can see that:
+ - the best score increases,
+ - we learn the best estimator parameters for the final model,
+ - and we get the best parameters for it.
+
+ - *Let's apply the same criteria to* **K Neighbors Classification**.
+
+ [**To know the right params, check out the docs**](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)
+ """
+
+ # Pick the model
+ model = KNeighborsClassifier()
+
+ # Tuning params
+ param_grid = {
+     'n_neighbors': list(range(1, 30)),
+     'leaf_size': list(range(1,30)),
+     'weights': [ 'distance', 'uniform' ]
+ }
+
+ # Implement GridSearchCV
+ gsc = GridSearchCV(model, param_grid, cv=10)
+
+ # Model fitting
+ gsc.fit(X_train, y_train)
+
+ print("\n Best Score is ")
+ print(gsc.best_score_)
+
+ print("\n Best Estimator is ")
+ print(gsc.best_estimator_)
+
+ print("\n Best Parameters are")
+ print(gsc.best_params_)
+
+ """### Observation
+
+ Using this algorithm, we can see that:
+ - the score improved a little compared to the previous model,
+ - it shows the best estimator parameters for the final model,
+ - and we can see the best parameters for the KNN model.
+
+ - Finally, apply the same strategy to **SVM**.
+ """
+
+ # Pick the model
+ model = SVC()
+
+ # Tuning params
+ param_grid = [
+     {'C': [1, 10, 100, 1000],
+      'kernel': ['linear']
+     },
+     {'C': [1, 10, 100, 1000],
+      'gamma': [0.001, 0.0001],
+      'kernel': ['rbf']
+     }
+ ]
+
+ # Implement GridSearchCV
+ gsc = GridSearchCV(model, param_grid, cv=10) # 10-fold cross-validation
+
+ # Model fitting
+ gsc.fit(X_train, y_train)
+
+ print("\n Best Score is ")
+ print(gsc.best_score_)
+
+ print("\n Best Estimator is ")
+ print(gsc.best_estimator_)
+
+ print("\n Best Parameters are")
+ print(gsc.best_params_)
+
+ """### Observation
+
+ Using this algorithm, we can see that:
+ - it gives a slightly better score,
+ - and it shows the best estimator parameters for the final model.
+
+ Let's implement RandomForestClassifier for hyperparameter tuning.
+
+ > Remember that the cell below takes a while to run before it returns the best params and estimator.
+ """
+
+ # Pick the model
+ model = RandomForestClassifier()
+
+ # Tuning params
+ random_grid = {'bootstrap': [True, False],
+                'max_depth': [40, 50, None], # 10, 20, 30, 60, 70, 100,
+                'max_features': ['auto', 'sqrt'],
+                'min_samples_leaf': [1, 2], # , 4
+                'min_samples_split': [2, 5], # , 10
+                'n_estimators': [200, 400]} # , 600, 800, 1000, 1200, 1400, 1600, 1800, 2000
+
+ # Implement GridSearchCV
+ gsc = GridSearchCV(model, random_grid, cv=10) # 10-fold cross-validation
+
+ # Model fitting
+ gsc.fit(X_train, y_train)
+
+ print("\n Best Score is ")
+ print(gsc.best_score_)
+
+ print("\n Best Estimator is ")
+ print(gsc.best_estimator_)
+
+ print("\n Best Parameters are")
+ print(gsc.best_params_)
+
+ """### Observation
+
+
+ Using this algorithm, we can see that:
+ - it gives a slightly better score,
+ - and it shows the best estimator parameters for the final model.
+
+
+ ---
+
+ <a id="7"></a><br>
+ # 7. Deploy Model
+
+ - Finally, we are almost done. The last step is to deploy our model to production. For that, we need to export the model and bind it to a web application API.
+
+ Using pickle, we can export our model and store it in a `model.pkl` file, so we can easily access the file and compute custom predictions through a web app API.
+
+
+ ### A little bit of information about pickle:
+
+ `Pickle` is the standard way of serializing objects in Python. You can use the pickle operation to serialize your machine learning algorithms and save the serialized format to a file. Later you can load this file to deserialize your model and use it to make new predictions.
+
+
+ >> Here is an example of exporting a model with pickle:
+
+
+
+ ```
+ model.fit(X_train, Y_train)
+ # save the model to disk
+ filename = 'finalized_model.sav'
+ pickle.dump(model, open(filename, 'wb'))
+
+ # some time later...
+
+ # load the model from disk
+ loaded_model = pickle.load(open(filename, 'rb'))
+ result = loaded_model.score(X_test, Y_test)
+ print(result)
+ ```
+ """
+
+
+
+ import pickle as pkl
+
+ # Trained model # You can also use your own trained model
+ logistic_model = LogisticRegression()
+ logistic_model.fit(X_train, y_train)
+
+ filename = 'logistic_model.pkl'
+ pkl.dump(logistic_model, open(filename, 'wb')) # wb means write as binary
+
+ !pip install datasets
+ !pip install huggingface_hub
+
+ from huggingface_hub import login
+ from datasets import Dataset
+
+ login()
+
+ import pandas as pd
+
+ input_df = pd.read_csv('/kaggle/input/breast-cancer-wisconsin-data/data.csv')  # renamed so we don't shadow the built-in input()
+ input_df
+
+ dataset = Dataset.from_pandas(input_df)
+ dataset = dataset.train_test_split(test_size=0.3)
+
+ print(dataset)
+
+ dataset.push_to_hub('Tiburoncin/mom-cancer2')
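+
+ """Once pushed, the dataset can be pulled back down with `load_dataset` (a sketch; it assumes access to the `Tiburoncin/mom-cancer2` repo):"""
+
+ from datasets import load_dataset
+
+ reloaded = load_dataset('Tiburoncin/mom-cancer2')
+ print(reloaded)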
+
+ """#### Now you can check your current directory; you should see a file named "logistic_model.pkl".
+
+ - To read the model back from the file:
+
+ ```
+ # load the model from disk
+ loaded_model = pkl.load(open(filename, 'rb')) # rb means read as binary
+ result = loaded_model.score(X_test, Y_test)
+ ```
+ """
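+
+ """The conclusion below mentions serving the exported model from a web framework such as `Flask`. A minimal sketch (hypothetical `app.py`, not part of the original notebook; it assumes `logistic_model.pkl` expects the six scaled features used above):"""
+
+ # app.py -- minimal prediction API sketch
+ import pickle
+
+ from flask import Flask, jsonify, request
+
+ app = Flask(__name__)
+ model = pickle.load(open('logistic_model.pkl', 'rb'))
+
+ @app.route('/predict', methods=['POST'])
+ def predict():
+     # expects JSON like {"features": [[0.1, -1.2, 0.3, 0.5, -0.7, 1.1]]}
+     features = request.get_json()['features']
+     prediction = model.predict(features).tolist()  # 0 = benign, 1 = malignant
+     return jsonify({'prediction': prediction})
+
+ if __name__ == '__main__':
+     app.run(debug=True)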
+
+
+ """---
+
+
+ ---
+
+
+ ---
+
+ ### Conclusion
+
+ - In this kernel, we covered data cleaning and EDA using pandas methods, showed some visual graphs to understand the behaviour of this dataset, and finally trained several models, computed their predictions and accuracy scores, and performed hyperparameter tuning. I have written some basic code in this notebook. After successfully completing it, we can deploy our models to live production by **exporting the models and wiring them into a Python web application.** For that we can use the `Flask`, `Django` or `FastAPI` frameworks.
+
+ ### I hope you enjoyed this kernel; please upvote it. 👍
+
+ ---
+ ---
+
+ <div class="text-center">
+ <h1>That's it, guys!</h1>
+ <h1>🙏</h1>
+
+
+ I hope you like and enjoy it, and learn something interesting from this notebook.
+
+ Even I learned a lot of things while creating this notebook.
+
+ Keep learning,
+ Regards,
+ Vikas Ukani.
+
+ </div>
+
+ ---
+ ---
+
+ <img src="https://static.wixstatic.com/media/3592ed_5453a1ea302b4c4588413007ac4fcb93~mv2.gif" align="center" alt="Thank You" style="min-height:20%; max-height:20%" width="90%" />
+ """