Atulit23 committed on
Commit 8401d32 · verified · 1 Parent(s): 2ad747d

Upload folder using huggingface_hub

5.urldata.csv ADDED
The diff for this file is too large to render. See raw diff
 
Phishing Website Detection_Models & Training.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
README.md CHANGED
@@ -1,12 +1,6 @@
  ---
  title: PhishingURLs
- emoji: 💻
- colorFrom: yellow
- colorTo: purple
- sdk: gradio
- sdk_version: 4.19.2
  app_file: app.py
- pinned: false
+ sdk: gradio
+ sdk_version: 3.44.4
  ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
URL Feature Extraction.ipynb ADDED
@@ -0,0 +1,2043 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": 1,
6
+ "metadata": {
7
+ "colab": {},
8
+ "colab_type": "code",
9
+ "id": "PH13wfswmyDv"
10
+ },
11
+ "outputs": [],
12
+ "source": [
13
+ "#importing required packages for this module\n",
14
+ "import pandas as pd"
15
+ ]
16
+ },
17
+ {
18
+ "cell_type": "code",
19
+ "execution_count": 3,
20
+ "metadata": {
21
+ "colab": {
22
+ "base_uri": "https://localhost:8080/",
23
+ "height": 392
24
+ },
25
+ "colab_type": "code",
26
+ "id": "FF5vM84YriWc",
27
+ "outputId": "34b29509-57f2-48c9-a862-db8390b6af1c"
28
+ },
29
+ "outputs": [
30
+ {
31
+ "name": "stderr",
32
+ "output_type": "stream",
33
+ "text": [
34
+ "--2024-02-25 16:01:08-- http://data.phishtank.com/data/online-valid.csv\n",
35
+ "Resolving data.phishtank.com (data.phishtank.com)... 2606:4700:8392:2ee:d039:14a:6810:654b, 104.16.101.75, 104.17.177.85\n",
36
+ "Connecting to data.phishtank.com (data.phishtank.com)|2606:4700:8392:2ee:d039:14a:6810:654b|:80... connected.\n",
37
+ "HTTP request sent, awaiting response... 301 Moved Permanently\n",
38
+ "Location: https://data.phishtank.com/data/online-valid.csv [following]\n",
39
+ "--2024-02-25 16:01:08-- https://data.phishtank.com/data/online-valid.csv\n",
40
+ "Connecting to data.phishtank.com (data.phishtank.com)|2606:4700:8392:2ee:d039:14a:6810:654b|:443... connected.\n",
41
+ "HTTP request sent, awaiting response... 429 Too Many Requests\n",
42
+ "2024-02-25 16:01:09 ERROR 429: Too Many Requests.\n",
43
+ "\n"
44
+ ]
45
+ }
46
+ ],
47
+ "source": [
48
+ "#Downloading the phishing URLs file\n",
49
+ "!wget http://data.phishtank.com/data/online-valid.csv"
50
+ ]
51
+ },
52
+ {
53
+ "cell_type": "code",
54
+ "execution_count": null,
55
+ "metadata": {
56
+ "colab": {
57
+ "base_uri": "https://localhost:8080/",
58
+ "height": 305
59
+ },
60
+ "colab_type": "code",
61
+ "id": "GaGVL9gYKXma",
62
+ "outputId": "fad0a947-4996-44bf-d46f-89abc4306e62"
63
+ },
64
+ "outputs": [
65
+ {
66
+ "data": {
67
+ "text/html": [
68
+ "<div>\n",
69
+ "<style scoped>\n",
70
+ " .dataframe tbody tr th:only-of-type {\n",
71
+ " vertical-align: middle;\n",
72
+ " }\n",
73
+ "\n",
74
+ " .dataframe tbody tr th {\n",
75
+ " vertical-align: top;\n",
76
+ " }\n",
77
+ "\n",
78
+ " .dataframe thead th {\n",
79
+ " text-align: right;\n",
80
+ " }\n",
81
+ "</style>\n",
82
+ "<table border=\"1\" class=\"dataframe\">\n",
83
+ " <thead>\n",
84
+ " <tr style=\"text-align: right;\">\n",
85
+ " <th></th>\n",
86
+ " <th>phish_id</th>\n",
87
+ " <th>url</th>\n",
88
+ " <th>phish_detail_url</th>\n",
89
+ " <th>submission_time</th>\n",
90
+ " <th>verified</th>\n",
91
+ " <th>verification_time</th>\n",
92
+ " <th>online</th>\n",
93
+ " <th>target</th>\n",
94
+ " </tr>\n",
95
+ " </thead>\n",
96
+ " <tbody>\n",
97
+ " <tr>\n",
98
+ " <th>0</th>\n",
99
+ " <td>6557033</td>\n",
100
+ " <td>http://u1047531.cp.regruhosting.ru/acces-inges...</td>\n",
101
+ " <td>http://www.phishtank.com/phish_detail.php?phis...</td>\n",
102
+ " <td>2020-05-09T22:01:43+00:00</td>\n",
103
+ " <td>yes</td>\n",
104
+ " <td>2020-05-09T22:03:07+00:00</td>\n",
105
+ " <td>yes</td>\n",
106
+ " <td>Other</td>\n",
107
+ " </tr>\n",
108
+ " <tr>\n",
109
+ " <th>1</th>\n",
110
+ " <td>6557032</td>\n",
111
+ " <td>http://hoysalacreations.com/wp-content/plugins...</td>\n",
112
+ " <td>http://www.phishtank.com/phish_detail.php?phis...</td>\n",
113
+ " <td>2020-05-09T22:01:37+00:00</td>\n",
114
+ " <td>yes</td>\n",
115
+ " <td>2020-05-09T22:03:07+00:00</td>\n",
116
+ " <td>yes</td>\n",
117
+ " <td>Other</td>\n",
118
+ " </tr>\n",
119
+ " <tr>\n",
120
+ " <th>2</th>\n",
121
+ " <td>6557011</td>\n",
122
+ " <td>http://www.accsystemprblemhelp.site/checkpoint...</td>\n",
123
+ " <td>http://www.phishtank.com/phish_detail.php?phis...</td>\n",
124
+ " <td>2020-05-09T21:54:31+00:00</td>\n",
125
+ " <td>yes</td>\n",
126
+ " <td>2020-05-09T21:55:38+00:00</td>\n",
127
+ " <td>yes</td>\n",
128
+ " <td>Facebook</td>\n",
129
+ " </tr>\n",
130
+ " <tr>\n",
131
+ " <th>3</th>\n",
132
+ " <td>6557010</td>\n",
133
+ " <td>http://www.accsystemprblemhelp.site/login_atte...</td>\n",
134
+ " <td>http://www.phishtank.com/phish_detail.php?phis...</td>\n",
135
+ " <td>2020-05-09T21:53:48+00:00</td>\n",
136
+ " <td>yes</td>\n",
137
+ " <td>2020-05-09T21:54:34+00:00</td>\n",
138
+ " <td>yes</td>\n",
139
+ " <td>Facebook</td>\n",
140
+ " </tr>\n",
141
+ " <tr>\n",
142
+ " <th>4</th>\n",
143
+ " <td>6557009</td>\n",
144
+ " <td>https://firebasestorage.googleapis.com/v0/b/so...</td>\n",
145
+ " <td>http://www.phishtank.com/phish_detail.php?phis...</td>\n",
146
+ " <td>2020-05-09T21:49:27+00:00</td>\n",
147
+ " <td>yes</td>\n",
148
+ " <td>2020-05-09T21:51:24+00:00</td>\n",
149
+ " <td>yes</td>\n",
150
+ " <td>Microsoft</td>\n",
151
+ " </tr>\n",
152
+ " </tbody>\n",
153
+ "</table>\n",
154
+ "</div>"
155
+ ],
156
+ "text/plain": [
157
+ " phish_id ... target\n",
158
+ "0 6557033 ... Other\n",
159
+ "1 6557032 ... Other\n",
160
+ "2 6557011 ... Facebook\n",
161
+ "3 6557010 ... Facebook\n",
162
+ "4 6557009 ... Microsoft\n",
163
+ "\n",
164
+ "[5 rows x 8 columns]"
165
+ ]
166
+ },
167
+ "execution_count": 3,
168
+ "metadata": {
169
+ "tags": []
170
+ },
171
+ "output_type": "execute_result"
172
+ }
173
+ ],
174
+ "source": [
175
+ "#loading the phishing URLs data to dataframe\n",
176
+ "data0 = pd.read_csv(\"online-valid.csv\")\n",
177
+ "data0.head()"
178
+ ]
179
+ },
180
+ {
181
+ "cell_type": "code",
182
+ "execution_count": null,
183
+ "metadata": {
184
+ "colab": {
185
+ "base_uri": "https://localhost:8080/",
186
+ "height": 35
187
+ },
188
+ "colab_type": "code",
189
+ "id": "mAZAvSe2n1oT",
190
+ "outputId": "da2fbbb6-871f-4070-df86-cc9a135ac37a"
191
+ },
192
+ "outputs": [
193
+ {
194
+ "data": {
195
+ "text/plain": [
196
+ "(14858, 8)"
197
+ ]
198
+ },
199
+ "execution_count": 4,
200
+ "metadata": {
201
+ "tags": []
202
+ },
203
+ "output_type": "execute_result"
204
+ }
205
+ ],
206
+ "source": [
207
+ "data0.shape"
208
+ ]
209
+ },
210
+ {
211
+ "cell_type": "code",
212
+ "execution_count": null,
213
+ "metadata": {
214
+ "colab": {
215
+ "base_uri": "https://localhost:8080/",
216
+ "height": 305
217
+ },
218
+ "colab_type": "code",
219
+ "id": "9CTCI_EgERPM",
220
+ "outputId": "cb74e74c-5591-4523-e077-bbf13ef89245"
221
+ },
222
+ "outputs": [
223
+ {
224
+ "data": {
225
+ "text/html": [
226
+ "<div>\n",
227
+ "<style scoped>\n",
228
+ " .dataframe tbody tr th:only-of-type {\n",
229
+ " vertical-align: middle;\n",
230
+ " }\n",
231
+ "\n",
232
+ " .dataframe tbody tr th {\n",
233
+ " vertical-align: top;\n",
234
+ " }\n",
235
+ "\n",
236
+ " .dataframe thead th {\n",
237
+ " text-align: right;\n",
238
+ " }\n",
239
+ "</style>\n",
240
+ "<table border=\"1\" class=\"dataframe\">\n",
241
+ " <thead>\n",
242
+ " <tr style=\"text-align: right;\">\n",
243
+ " <th></th>\n",
244
+ " <th>phish_id</th>\n",
245
+ " <th>url</th>\n",
246
+ " <th>phish_detail_url</th>\n",
247
+ " <th>submission_time</th>\n",
248
+ " <th>verified</th>\n",
249
+ " <th>verification_time</th>\n",
250
+ " <th>online</th>\n",
251
+ " <th>target</th>\n",
252
+ " </tr>\n",
253
+ " </thead>\n",
254
+ " <tbody>\n",
255
+ " <tr>\n",
256
+ " <th>0</th>\n",
257
+ " <td>6485787</td>\n",
258
+ " <td>https://eevee.tv/Bootstrap/assets/css/acces</td>\n",
259
+ " <td>http://www.phishtank.com/phish_detail.php?phis...</td>\n",
260
+ " <td>2020-04-04T03:01:00+00:00</td>\n",
261
+ " <td>yes</td>\n",
262
+ " <td>2020-04-04T03:03:56+00:00</td>\n",
263
+ " <td>yes</td>\n",
264
+ " <td>Other</td>\n",
265
+ " </tr>\n",
266
+ " <tr>\n",
267
+ " <th>1</th>\n",
268
+ " <td>6422543</td>\n",
269
+ " <td>https://appleid.apple.com-sa.pm/appleid/?</td>\n",
270
+ " <td>http://www.phishtank.com/phish_detail.php?phis...</td>\n",
271
+ " <td>2020-02-27T17:01:01+00:00</td>\n",
272
+ " <td>yes</td>\n",
273
+ " <td>2020-03-17T01:50:51+00:00</td>\n",
274
+ " <td>yes</td>\n",
275
+ " <td>Other</td>\n",
276
+ " </tr>\n",
277
+ " <tr>\n",
278
+ " <th>2</th>\n",
279
+ " <td>6543602</td>\n",
280
+ " <td>https://grandcup.xyz/</td>\n",
281
+ " <td>http://www.phishtank.com/phish_detail.php?phis...</td>\n",
282
+ " <td>2020-05-02T23:07:29+00:00</td>\n",
283
+ " <td>yes</td>\n",
284
+ " <td>2020-05-02T23:09:03+00:00</td>\n",
285
+ " <td>yes</td>\n",
286
+ " <td>Steam</td>\n",
287
+ " </tr>\n",
288
+ " <tr>\n",
289
+ " <th>3</th>\n",
290
+ " <td>6528783</td>\n",
291
+ " <td>https://villa-azzurro.com/onedrive/</td>\n",
292
+ " <td>http://www.phishtank.com/phish_detail.php?phis...</td>\n",
293
+ " <td>2020-04-25T20:54:02+00:00</td>\n",
294
+ " <td>yes</td>\n",
295
+ " <td>2020-04-25T21:46:55+00:00</td>\n",
296
+ " <td>yes</td>\n",
297
+ " <td>Other</td>\n",
298
+ " </tr>\n",
299
+ " <tr>\n",
300
+ " <th>4</th>\n",
301
+ " <td>6498136</td>\n",
302
+ " <td>http://mygpstrip.net/ii/u.php</td>\n",
303
+ " <td>http://www.phishtank.com/phish_detail.php?phis...</td>\n",
304
+ " <td>2020-04-10T15:01:56+00:00</td>\n",
305
+ " <td>yes</td>\n",
306
+ " <td>2020-04-10T16:01:37+00:00</td>\n",
307
+ " <td>yes</td>\n",
308
+ " <td>Other</td>\n",
309
+ " </tr>\n",
310
+ " </tbody>\n",
311
+ "</table>\n",
312
+ "</div>"
313
+ ],
314
+ "text/plain": [
315
+ " phish_id url ... online target\n",
316
+ "0 6485787 https://eevee.tv/Bootstrap/assets/css/acces ... yes Other\n",
317
+ "1 6422543 https://appleid.apple.com-sa.pm/appleid/? ... yes Other\n",
318
+ "2 6543602 https://grandcup.xyz/ ... yes Steam\n",
319
+ "3 6528783 https://villa-azzurro.com/onedrive/ ... yes Other\n",
320
+ "4 6498136 http://mygpstrip.net/ii/u.php ... yes Other\n",
321
+ "\n",
322
+ "[5 rows x 8 columns]"
323
+ ]
324
+ },
325
+ "execution_count": 5,
326
+ "metadata": {
327
+ "tags": []
328
+ },
329
+ "output_type": "execute_result"
330
+ }
331
+ ],
332
+ "source": [
333
+ "#Collecting 5,000 Phishing URLs randomly\n",
334
+ "phishurl = data0.sample(n = 5000, random_state = 12).copy()\n",
335
+ "phishurl = phishurl.reset_index(drop=True)\n",
336
+ "phishurl.head()"
337
+ ]
338
+ },
339
+ {
340
+ "cell_type": "code",
341
+ "execution_count": null,
342
+ "metadata": {
343
+ "colab": {
344
+ "base_uri": "https://localhost:8080/",
345
+ "height": 35
346
+ },
347
+ "colab_type": "code",
348
+ "id": "-FOfv0bspc8N",
349
+ "outputId": "48e76e11-37d7-4ba1-e04a-c2fa661e9219"
350
+ },
351
+ "outputs": [
352
+ {
353
+ "data": {
354
+ "text/plain": [
355
+ "(5000, 8)"
356
+ ]
357
+ },
358
+ "execution_count": 6,
359
+ "metadata": {
360
+ "tags": []
361
+ },
362
+ "output_type": "execute_result"
363
+ }
364
+ ],
365
+ "source": [
366
+ "phishurl.shape"
367
+ ]
368
+ },
369
+ {
370
+ "cell_type": "code",
371
+ "execution_count": null,
372
+ "metadata": {
373
+ "colab": {
374
+ "base_uri": "https://localhost:8080/",
375
+ "height": 200
376
+ },
377
+ "colab_type": "code",
378
+ "id": "0wkw4wGAsIbT",
379
+ "outputId": "4395a2bd-dd8b-49ea-fb1e-36cf0b67e75f"
380
+ },
381
+ "outputs": [
382
+ {
383
+ "data": {
384
+ "text/html": [
385
+ "<div>\n",
386
+ "<style scoped>\n",
387
+ " .dataframe tbody tr th:only-of-type {\n",
388
+ " vertical-align: middle;\n",
389
+ " }\n",
390
+ "\n",
391
+ " .dataframe tbody tr th {\n",
392
+ " vertical-align: top;\n",
393
+ " }\n",
394
+ "\n",
395
+ " .dataframe thead th {\n",
396
+ " text-align: right;\n",
397
+ " }\n",
398
+ "</style>\n",
399
+ "<table border=\"1\" class=\"dataframe\">\n",
400
+ " <thead>\n",
401
+ " <tr style=\"text-align: right;\">\n",
402
+ " <th></th>\n",
403
+ " <th>URLs</th>\n",
404
+ " </tr>\n",
405
+ " </thead>\n",
406
+ " <tbody>\n",
407
+ " <tr>\n",
408
+ " <th>0</th>\n",
409
+ " <td>http://1337x.to/torrent/1110018/Blackhat-2015-...</td>\n",
410
+ " </tr>\n",
411
+ " <tr>\n",
412
+ " <th>1</th>\n",
413
+ " <td>http://1337x.to/torrent/1122940/Blackhat-2015-...</td>\n",
414
+ " </tr>\n",
415
+ " <tr>\n",
416
+ " <th>2</th>\n",
417
+ " <td>http://1337x.to/torrent/1124395/Fast-and-Furio...</td>\n",
418
+ " </tr>\n",
419
+ " <tr>\n",
420
+ " <th>3</th>\n",
421
+ " <td>http://1337x.to/torrent/1145504/Avengers-Age-o...</td>\n",
422
+ " </tr>\n",
423
+ " <tr>\n",
424
+ " <th>4</th>\n",
425
+ " <td>http://1337x.to/torrent/1160078/Avengers-age-o...</td>\n",
426
+ " </tr>\n",
427
+ " </tbody>\n",
428
+ "</table>\n",
429
+ "</div>"
430
+ ],
431
+ "text/plain": [
432
+ " URLs\n",
433
+ "0 http://1337x.to/torrent/1110018/Blackhat-2015-...\n",
434
+ "1 http://1337x.to/torrent/1122940/Blackhat-2015-...\n",
435
+ "2 http://1337x.to/torrent/1124395/Fast-and-Furio...\n",
436
+ "3 http://1337x.to/torrent/1145504/Avengers-Age-o...\n",
437
+ "4 http://1337x.to/torrent/1160078/Avengers-age-o..."
438
+ ]
439
+ },
440
+ "execution_count": 7,
441
+ "metadata": {
442
+ "tags": []
443
+ },
444
+ "output_type": "execute_result"
445
+ }
446
+ ],
447
+ "source": [
448
+ "#Loading legitimate files \n",
449
+ "data1 = pd.read_csv(\"Benign_list_big_final.csv\")\n",
450
+ "data1.columns = ['URLs']\n",
451
+ "data1.head()"
452
+ ]
453
+ },
454
+ {
455
+ "cell_type": "code",
456
+ "execution_count": null,
457
+ "metadata": {
458
+ "colab": {
459
+ "base_uri": "https://localhost:8080/",
460
+ "height": 200
461
+ },
462
+ "colab_type": "code",
463
+ "id": "EQRtf9Ybs5sv",
464
+ "outputId": "227e262b-1483-4549-8bdf-49da2f321b06"
465
+ },
466
+ "outputs": [
467
+ {
468
+ "data": {
469
+ "text/html": [
470
+ "<div>\n",
471
+ "<style scoped>\n",
472
+ " .dataframe tbody tr th:only-of-type {\n",
473
+ " vertical-align: middle;\n",
474
+ " }\n",
475
+ "\n",
476
+ " .dataframe tbody tr th {\n",
477
+ " vertical-align: top;\n",
478
+ " }\n",
479
+ "\n",
480
+ " .dataframe thead th {\n",
481
+ " text-align: right;\n",
482
+ " }\n",
483
+ "</style>\n",
484
+ "<table border=\"1\" class=\"dataframe\">\n",
485
+ " <thead>\n",
486
+ " <tr style=\"text-align: right;\">\n",
487
+ " <th></th>\n",
488
+ " <th>URLs</th>\n",
489
+ " </tr>\n",
490
+ " </thead>\n",
491
+ " <tbody>\n",
492
+ " <tr>\n",
493
+ " <th>0</th>\n",
494
+ " <td>http://graphicriver.net/search?date=this-month...</td>\n",
495
+ " </tr>\n",
496
+ " <tr>\n",
497
+ " <th>1</th>\n",
498
+ " <td>http://ecnavi.jp/redirect/?url=http://www.cros...</td>\n",
499
+ " </tr>\n",
500
+ " <tr>\n",
501
+ " <th>2</th>\n",
502
+ " <td>https://hubpages.com/signin?explain=follow+Hub...</td>\n",
503
+ " </tr>\n",
504
+ " <tr>\n",
505
+ " <th>3</th>\n",
506
+ " <td>http://extratorrent.cc/torrent/4190536/AOMEI+B...</td>\n",
507
+ " </tr>\n",
508
+ " <tr>\n",
509
+ " <th>4</th>\n",
510
+ " <td>http://icicibank.com/Personal-Banking/offers/o...</td>\n",
511
+ " </tr>\n",
512
+ " </tbody>\n",
513
+ "</table>\n",
514
+ "</div>"
515
+ ],
516
+ "text/plain": [
517
+ " URLs\n",
518
+ "0 http://graphicriver.net/search?date=this-month...\n",
519
+ "1 http://ecnavi.jp/redirect/?url=http://www.cros...\n",
520
+ "2 https://hubpages.com/signin?explain=follow+Hub...\n",
521
+ "3 http://extratorrent.cc/torrent/4190536/AOMEI+B...\n",
522
+ "4 http://icicibank.com/Personal-Banking/offers/o..."
523
+ ]
524
+ },
525
+ "execution_count": 8,
526
+ "metadata": {
527
+ "tags": []
528
+ },
529
+ "output_type": "execute_result"
530
+ }
531
+ ],
532
+ "source": [
533
+ "#Collecting 5,000 Legitimate URLs randomly\n",
534
+ "legiurl = data1.sample(n = 5000, random_state = 12).copy()\n",
535
+ "legiurl = legiurl.reset_index(drop=True)\n",
536
+ "legiurl.head()"
537
+ ]
538
+ },
539
+ {
540
+ "cell_type": "code",
541
+ "execution_count": null,
542
+ "metadata": {
543
+ "colab": {
544
+ "base_uri": "https://localhost:8080/",
545
+ "height": 35
546
+ },
547
+ "colab_type": "code",
548
+ "id": "QrpSRXzDuKwW",
549
+ "outputId": "8b8e5220-be59-4893-9dd5-3ffc381d2b1d"
550
+ },
551
+ "outputs": [
552
+ {
553
+ "data": {
554
+ "text/plain": [
555
+ "(5000, 1)"
556
+ ]
557
+ },
558
+ "execution_count": 9,
559
+ "metadata": {
560
+ "tags": []
561
+ },
562
+ "output_type": "execute_result"
563
+ }
564
+ ],
565
+ "source": [
566
+ "legiurl.shape"
567
+ ]
568
+ },
569
+ {
570
+ "cell_type": "code",
571
+ "execution_count": null,
572
+ "metadata": {
573
+ "colab": {},
574
+ "colab_type": "code",
575
+ "id": "Rk4HFWsEKXpS"
576
+ },
577
+ "outputs": [],
578
+ "source": [
579
+ "# importing required packages for this section\n",
580
+ "from urllib.parse import urlparse,urlencode\n",
581
+ "import ipaddress\n",
582
+ "import re"
583
+ ]
584
+ },
585
+ {
586
+ "cell_type": "code",
587
+ "execution_count": null,
588
+ "metadata": {
589
+ "colab": {},
590
+ "colab_type": "code",
591
+ "id": "S0QorYenhaOD"
592
+ },
593
+ "outputs": [],
594
+ "source": [
595
+ "# 1. Domain of the URL (Domain): netloc without a leading \"www.\"\n",
596
+ "def getDomain(url):\n",
597
+ "    domain = urlparse(url).netloc\n",
598
+ "    if re.match(r\"^www\\.\", domain):\n",
599
+ "        domain = domain.replace(\"www.\", \"\", 1)\n",
600
+ "    return domain"
601
+ ]
602
+ },
603
+ {
604
+ "cell_type": "code",
605
+ "execution_count": null,
606
+ "metadata": {
607
+ "colab": {},
608
+ "colab_type": "code",
609
+ "id": "SX-4mbq27QBj"
610
+ },
611
+ "outputs": [],
612
+ "source": [
613
+ "# 2. Checks whether the host part of the URL is an IP address (Have_IP)\n",
614
+ "def havingIP(url):\n",
615
+ "    try:\n",
616
+ "        ipaddress.ip_address(urlparse(url).hostname)\n",
617
+ "        ip = 1\n",
618
+ "    except ValueError:\n",
619
+ "        ip = 0\n",
620
+ "    return ip\n"
621
+ ]
622
+ },
623
+ {
624
+ "cell_type": "code",
625
+ "execution_count": null,
626
+ "metadata": {
627
+ "colab": {},
628
+ "colab_type": "code",
629
+ "id": "XZQZi3K17TcR"
630
+ },
631
+ "outputs": [],
632
+ "source": [
633
+ "# 3.Checks the presence of @ in URL (Have_At)\n",
634
+ "def haveAtSign(url):\n",
635
+ "    if \"@\" in url:\n",
636
+ "        at = 1\n",
637
+ "    else:\n",
638
+ "        at = 0\n",
639
+ "    return at"
640
+ ]
641
+ },
642
+ {
643
+ "cell_type": "code",
644
+ "execution_count": null,
645
+ "metadata": {
646
+ "colab": {},
647
+ "colab_type": "code",
648
+ "id": "fnQazil39Kra"
649
+ },
650
+ "outputs": [],
651
+ "source": [
652
+ "# 4.Finding the length of URL and categorizing (URL_Length)\n",
653
+ "def getLength(url):\n",
654
+ "    if len(url) < 54:\n",
655
+ "        length = 0\n",
656
+ "    else:\n",
657
+ "        length = 1\n",
658
+ "    return length"
659
+ ]
660
+ },
661
+ {
662
+ "cell_type": "code",
663
+ "execution_count": null,
664
+ "metadata": {
665
+ "colab": {},
666
+ "colab_type": "code",
667
+ "id": "yILgNFf_9L3X"
668
+ },
669
+ "outputs": [],
670
+ "source": [
671
+ "# 5. Counts the non-empty '/'-separated segments in the URL path (URL_Depth)\n",
672
+ "def getDepth(url):\n",
673
+ "    s = urlparse(url).path.split('/')\n",
674
+ "    depth = 0\n",
675
+ "    for j in range(len(s)):\n",
676
+ "        if len(s[j]) != 0:\n",
677
+ "            depth = depth+1\n",
678
+ "    return depth"
679
+ ]
680
+ },
681
+ {
682
+ "cell_type": "code",
683
+ "execution_count": null,
684
+ "metadata": {
685
+ "colab": {},
686
+ "colab_type": "code",
687
+ "id": "RIJEiq51BSy0"
688
+ },
689
+ "outputs": [],
690
+ "source": [
691
+ "# 6.Checking for redirection '//' in the url (Redirection)\n",
692
+ "def redirection(url):\n",
693
+ "    pos = url.rfind('//')\n",
694
+ "    if pos > 6:\n",
695
+ "        if pos > 7:\n",
696
+ "            return 1\n",
697
+ "        else:\n",
698
+ "            return 0\n",
699
+ "    else:\n",
700
+ "        return 0"
701
+ ]
702
+ },
703
+ {
704
+ "cell_type": "code",
705
+ "execution_count": null,
706
+ "metadata": {
707
+ "colab": {},
708
+ "colab_type": "code",
709
+ "id": "h2vW23O1BbWl"
710
+ },
711
+ "outputs": [],
712
+ "source": [
713
+ "# 7.Existence of “HTTPS” Token in the Domain Part of the URL (https_Domain)\n",
714
+ "def httpDomain(url):\n",
715
+ "    domain = urlparse(url).netloc\n",
716
+ "    if 'https' in domain:\n",
717
+ "        return 1\n",
718
+ "    else:\n",
719
+ "        return 0"
720
+ ]
721
+ },
722
+ {
723
+ "cell_type": "code",
724
+ "execution_count": null,
725
+ "metadata": {
726
+ "colab": {},
727
+ "colab_type": "code",
728
+ "id": "UdC9pUdTAVRU"
729
+ },
730
+ "outputs": [],
731
+ "source": [
732
+ "#listing shortening services\n",
733
+ "shortening_services = r\"bit\\.ly|goo\\.gl|shorte\\.st|go2l\\.ink|x\\.co|ow\\.ly|t\\.co|tinyurl|tr\\.im|is\\.gd|cli\\.gs|\" \\\n",
734
+ " r\"yfrog\\.com|migre\\.me|ff\\.im|tiny\\.cc|url4\\.eu|twit\\.ac|su\\.pr|twurl\\.nl|snipurl\\.com|\" \\\n",
735
+ " r\"short\\.to|BudURL\\.com|ping\\.fm|post\\.ly|Just\\.as|bkite\\.com|snipr\\.com|fic\\.kr|loopt\\.us|\" \\\n",
736
+ " r\"doiop\\.com|short\\.ie|kl\\.am|wp\\.me|rubyurl\\.com|om\\.ly|to\\.ly|bit\\.do|t\\.co|lnkd\\.in|db\\.tt|\" \\\n",
737
+ " r\"qr\\.ae|adf\\.ly|goo\\.gl|bitly\\.com|cur\\.lv|tinyurl\\.com|ow\\.ly|bit\\.ly|ity\\.im|q\\.gs|is\\.gd|\" \\\n",
738
+ " r\"po\\.st|bc\\.vc|twitthis\\.com|u\\.to|j\\.mp|buzurl\\.com|cutt\\.us|u\\.bb|yourls\\.org|x\\.co|\" \\\n",
739
+ " r\"prettylinkpro\\.com|scrnch\\.me|filoops\\.info|vzturl\\.com|qr\\.net|1url\\.com|tweez\\.me|v\\.gd|\" \\\n",
740
+ " r\"tr\\.im|link\\.zip\\.net\""
741
+ ]
742
+ },
743
+ {
744
+ "cell_type": "code",
745
+ "execution_count": null,
746
+ "metadata": {
747
+ "colab": {},
748
+ "colab_type": "code",
749
+ "id": "IUkU9UbbnKpY"
750
+ },
751
+ "outputs": [],
752
+ "source": [
753
+ "# 8. Checking for Shortening Services in URL (Tiny_URL)\n",
754
+ "def tinyURL(url):\n",
755
+ "    match = re.search(shortening_services, url)\n",
756
+ "    if match:\n",
757
+ "        return 1\n",
758
+ "    else:\n",
759
+ "        return 0"
760
+ ]
761
+ },
762
+ {
763
+ "cell_type": "code",
764
+ "execution_count": null,
765
+ "metadata": {
766
+ "colab": {},
767
+ "colab_type": "code",
768
+ "id": "vLyjiIUgPjuw"
769
+ },
770
+ "outputs": [],
771
+ "source": [
772
+ "# 9.Checking for Prefix or Suffix Separated by (-) in the Domain (Prefix/Suffix)\n",
773
+ "def prefixSuffix(url):\n",
774
+ "    if '-' in urlparse(url).netloc:\n",
775
+ "        return 1 # phishing\n",
776
+ "    else:\n",
777
+ "        return 0 # legitimate"
778
+ ]
779
+ },
780
+ {
781
+ "cell_type": "code",
782
+ "execution_count": null,
783
+ "metadata": {
784
+ "colab": {
785
+ "base_uri": "https://localhost:8080/",
786
+ "height": 232
787
+ },
788
+ "colab_type": "code",
789
+ "id": "NbkEYJ_JOVa7",
790
+ "outputId": "f08b25f8-3852-432c-e141-8eb57ff916d8"
791
+ },
792
+ "outputs": [
793
+ {
794
+ "name": "stdout",
795
+ "output_type": "stream",
796
+ "text": [
797
+ "Collecting python-whois\n",
798
+ "\u001b[?25l Downloading https://files.pythonhosted.org/packages/f0/ab/11c2d01db2554bbaabb2c32b06b6a73f7277372533484c320c78a304dfd7/python-whois-0.7.2.tar.gz (90kB)\n",
799
+ "\r\u001b[K |███▋ | 10kB 24.0MB/s eta 0:00:01\r\u001b[K |███████▎ | 20kB 6.5MB/s eta 0:00:01\r\u001b[K |███████████ | 30kB 6.8MB/s eta 0:00:01\r\u001b[K |██████████████▋ | 40kB 7.8MB/s eta 0:00:01\r\u001b[K |██████████████████▏ | 51kB 7.6MB/s eta 0:00:01\r\u001b[K |█████████████████████▉ | 61kB 8.6MB/s eta 0:00:01\r\u001b[K |█████████████████████████▌ | 71kB 8.4MB/s eta 0:00:01\r\u001b[K |█████████████████████████████▏ | 81kB 9.3MB/s eta 0:00:01\r\u001b[K |████████████████████████████████| 92kB 5.5MB/s \n",
800
+ "\u001b[?25hRequirement already satisfied: future in /usr/local/lib/python3.6/dist-packages (from python-whois) (0.16.0)\n",
801
+ "Building wheels for collected packages: python-whois\n",
802
+ " Building wheel for python-whois (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
803
+ " Created wheel for python-whois: filename=python_whois-0.7.2-cp36-none-any.whl size=85245 sha256=900afbc18f144913762a57978778098dda65b687b3b5a1f14f7998e9631564e8\n",
804
+ " Stored in directory: /root/.cache/pip/wheels/69/e6/62/1e6a746ca8e690f472611511b6948c325b232aaf693245ce46\n",
805
+ "Successfully built python-whois\n",
806
+ "Installing collected packages: python-whois\n",
807
+ "Successfully installed python-whois-0.7.2\n"
808
+ ]
809
+ }
810
+ ],
811
+ "source": [
812
+ "%pip install python-whois"
813
+ ]
814
+ },
815
+ {
816
+ "cell_type": "code",
817
+ "execution_count": null,
818
+ "metadata": {
819
+ "colab": {},
820
+ "colab_type": "code",
821
+ "id": "esZ7FcvlOMZu"
822
+ },
823
+ "outputs": [],
824
+ "source": [
825
+ "# importing required packages for this section\n",
826
+ "import re\n",
827
+ "from bs4 import BeautifulSoup\n",
828
+ "import whois\n",
829
+ "import urllib\n",
830
+ "import urllib.request\n",
831
+ "from datetime import datetime"
832
+ ]
833
+ },
834
+ {
835
+ "cell_type": "code",
836
+ "execution_count": null,
837
+ "metadata": {
838
+ "colab": {},
839
+ "colab_type": "code",
840
+ "id": "8O5D1jH0IDgf"
841
+ },
842
+ "outputs": [],
843
+ "source": [
844
+ "# 11.DNS Record availability (DNS_Record)\n",
845
+ "# obtained in the featureExtraction function itself"
846
+ ]
847
+ },
848
+ {
849
+ "cell_type": "code",
850
+ "execution_count": null,
851
+ "metadata": {
852
+ "colab": {},
853
+ "colab_type": "code",
854
+ "id": "mtwQiRotZ2GD"
855
+ },
856
+ "outputs": [],
857
+ "source": [
858
+ "# 12.Web traffic (Web_Traffic)\n",
859
+ "def web_traffic(url):\n",
860
+ "    try:\n",
861
+ "        # Percent-encoding whitespace and other unsafe characters in the URL, if any\n",
862
+ "        url = urllib.parse.quote(url)\n",
863
+ "        rank = BeautifulSoup(urllib.request.urlopen(\"http://data.alexa.com/data?cli=10&dat=s&url=\" + url).read(), \"xml\").find(\n",
864
+ "            \"REACH\")['RANK']\n",
865
+ "        rank = int(rank)\n",
866
+ "    except TypeError:\n",
867
+ "        return 1\n",
868
+ "    if rank < 100000:\n",
869
+ "        return 1\n",
870
+ "    else:\n",
871
+ "        return 0"
872
+ ]
873
+ },
874
+ {
875
+ "cell_type": "code",
876
+ "execution_count": null,
877
+ "metadata": {
878
+ "colab": {},
879
+ "colab_type": "code",
880
+ "id": "li03hqJgH__j"
881
+ },
882
+ "outputs": [],
883
+ "source": [
884
+ "# 13.Survival time of domain: The difference between termination time and creation time (Domain_Age) \n",
885
+ "def domainAge(domain_name):\n",
886
+ "    creation_date = domain_name.creation_date\n",
887
+ "    expiration_date = domain_name.expiration_date\n",
888
+ "    if (isinstance(creation_date,str) or isinstance(expiration_date,str)):\n",
889
+ "        try:\n",
890
+ "            creation_date = datetime.strptime(creation_date,'%Y-%m-%d')\n",
891
+ "            expiration_date = datetime.strptime(expiration_date,\"%Y-%m-%d\")\n",
892
+ "        except:\n",
893
+ "            return 1\n",
894
+ "    if ((expiration_date is None) or (creation_date is None)):\n",
895
+ "        return 1\n",
896
+ "    elif ((type(expiration_date) is list) or (type(creation_date) is list)):\n",
897
+ "        return 1\n",
898
+ "    else:\n",
899
+ "        ageofdomain = abs((expiration_date - creation_date).days)\n",
900
+ "        if ((ageofdomain/30) < 6):\n",
901
+ "            age = 1\n",
902
+ "        else:\n",
903
+ "            age = 0\n",
904
+ "    return age"
905
+ ]
906
+ },
907
+ {
908
+ "cell_type": "code",
909
+ "execution_count": null,
910
+ "metadata": {
911
+ "colab": {},
912
+ "colab_type": "code",
913
+ "id": "NueO81-ttKYd"
914
+ },
915
+ "outputs": [],
916
+ "source": [
917
+ "# 14.End time of domain: The difference between termination time and current time (Domain_End) \n",
918
+ "def domainEnd(domain_name):\n",
919
+ "    expiration_date = domain_name.expiration_date\n",
920
+ "    if isinstance(expiration_date,str):\n",
921
+ "        try:\n",
922
+ "            expiration_date = datetime.strptime(expiration_date,\"%Y-%m-%d\")\n",
923
+ "        except:\n",
924
+ "            return 1\n",
925
+ "    if (expiration_date is None):\n",
926
+ "        return 1\n",
927
+ "    elif (type(expiration_date) is list):\n",
928
+ "        return 1\n",
929
+ "    else:\n",
930
+ "        today = datetime.now()\n",
931
+ "        end = abs((expiration_date - today).days)\n",
932
+ "        if ((end/30) < 6):\n",
933
+ "            end = 0\n",
934
+ "        else:\n",
935
+ "            end = 1\n",
936
+ "    return end"
937
+ ]
938
+ },
939
+ {
940
+ "cell_type": "code",
941
+ "execution_count": null,
942
+ "metadata": {
943
+ "colab": {},
944
+ "colab_type": "code",
945
+ "id": "lw0JmOGEQPwb"
946
+ },
947
+ "outputs": [],
948
+ "source": [
949
+ "# importing required packages for this section\n",
950
+ "import requests"
951
+ ]
952
+ },
953
+ {
954
+ "cell_type": "code",
955
+ "execution_count": null,
956
+ "metadata": {
957
+ "colab": {},
958
+ "colab_type": "code",
959
+ "id": "F2gpZEMSQGpu"
960
+ },
961
+ "outputs": [],
962
+ "source": [
963
+ "# 15. IFrame Redirection (iFrame)\n",
964
+ "def iframe(response):\n",
965
+ " if response == \"\":\n",
966
+ " return 1\n",
967
+ " else:\n",
968
+ "      if re.findall(r\"<iframe|frameBorder\", response.text):\n",
969
+ " return 0\n",
970
+ " else:\n",
971
+ " return 1"
972
+ ]
973
+ },
974
+ {
975
+ "cell_type": "code",
976
+ "execution_count": null,
977
+ "metadata": {
978
+ "colab": {},
979
+ "colab_type": "code",
980
+ "id": "eapOq2afVGCF"
981
+ },
982
+ "outputs": [],
983
+ "source": [
984
+ "# 16.Checks the effect of mouse over on status bar (Mouse_Over)\n",
985
+ "def mouseOver(response): \n",
986
+ " if response == \"\" :\n",
987
+ " return 1\n",
988
+ " else:\n",
989
+ " if re.findall(\"<script>.+onmouseover.+</script>\", response.text):\n",
990
+ " return 1\n",
991
+ " else:\n",
992
+ " return 0"
993
+ ]
994
+ },
995
+ {
996
+ "cell_type": "code",
997
+ "execution_count": null,
998
+ "metadata": {
999
+ "colab": {},
1000
+ "colab_type": "code",
1001
+ "id": "9x3lR3lFIVj2"
1002
+ },
1003
+ "outputs": [],
1004
+ "source": [
1005
+ "# 17.Checks the status of the right click attribute (Right_Click)\n",
1006
+ "def rightClick(response):\n",
1007
+ " if response == \"\":\n",
1008
+ " return 1\n",
1009
+ " else:\n",
1010
+ " if re.findall(r\"event.button ?== ?2\", response.text):\n",
1011
+ " return 0\n",
1012
+ " else:\n",
1013
+ " return 1"
1014
+ ]
1015
+ },
1016
+ {
1017
+ "cell_type": "code",
1018
+ "execution_count": null,
1019
+ "metadata": {
1020
+ "colab": {},
1021
+ "colab_type": "code",
1022
+ "id": "GkpLyDIpKK0W"
1023
+ },
1024
+ "outputs": [],
1025
+ "source": [
1026
+ "# 18.Checks the number of forwardings (Web_Forwards) \n",
1027
+ "def forwarding(response):\n",
1028
+ " if response == \"\":\n",
1029
+ " return 1\n",
1030
+ " else:\n",
1031
+ " if len(response.history) <= 2:\n",
1032
+ " return 0\n",
1033
+ " else:\n",
1034
+ " return 1"
1035
+ ]
1036
+ },
1037
+ {
1038
+ "cell_type": "code",
1039
+ "execution_count": null,
1040
+ "metadata": {
1041
+ "colab": {},
1042
+ "colab_type": "code",
1043
+ "id": "8GzyvCg2rzWU"
1044
+ },
1045
+ "outputs": [],
1046
+ "source": [
1047
+ "#Function to extract features\n",
1048
+ "def featureExtraction(url,label):\n",
1049
+ "\n",
1050
+ " features = []\n",
1051
+ "  #Address bar based features (9)\n",
1052
+ " features.append(getDomain(url))\n",
1053
+ " features.append(havingIP(url))\n",
1054
+ " features.append(haveAtSign(url))\n",
1055
+ " features.append(getLength(url))\n",
1056
+ " features.append(getDepth(url))\n",
1057
+ " features.append(redirection(url))\n",
1058
+ " features.append(httpDomain(url))\n",
1059
+ " features.append(tinyURL(url))\n",
1060
+ " features.append(prefixSuffix(url))\n",
1061
+ " \n",
1062
+ " #Domain based features (4)\n",
1063
+ " dns = 0\n",
1064
+ " try:\n",
1065
+ " domain_name = whois.whois(urlparse(url).netloc)\n",
1066
+ "  except Exception:\n",
1067
+ " dns = 1\n",
1068
+ "\n",
1069
+ " features.append(dns)\n",
1070
+ " features.append(web_traffic(url))\n",
1071
+ " features.append(1 if dns == 1 else domainAge(domain_name))\n",
1072
+ " features.append(1 if dns == 1 else domainEnd(domain_name))\n",
1073
+ " \n",
1074
+ " # HTML & Javascript based features (4)\n",
1075
+ " try:\n",
1076
+ "    response = requests.get(url, timeout=10)  # timeout (10 s is an assumed value) avoids hanging on dead hosts\n",
1077
+ "  except Exception:\n",
1078
+ " response = \"\"\n",
1079
+ " features.append(iframe(response))\n",
1080
+ " features.append(mouseOver(response))\n",
1081
+ " features.append(rightClick(response))\n",
1082
+ " features.append(forwarding(response))\n",
1083
+ " features.append(label)\n",
1084
+ " \n",
1085
+ " return features"
1086
+ ]
1087
+ },
1088
+ {
1089
+ "cell_type": "code",
1090
+ "execution_count": null,
1091
+ "metadata": {
1092
+ "colab": {
1093
+ "base_uri": "https://localhost:8080/",
1094
+ "height": 35
1095
+ },
1096
+ "colab_type": "code",
1097
+ "id": "s_2AX4OPeJRP",
1098
+ "outputId": "a4ac4615-e723-4969-d3c4-c7eab93515e7"
1099
+ },
1100
+ "outputs": [
1101
+ {
1102
+ "data": {
1103
+ "text/plain": [
1104
+ "(5000, 1)"
1105
+ ]
1106
+ },
1107
+ "execution_count": 33,
1108
+ "metadata": {
1109
+ "tags": []
1110
+ },
1111
+ "output_type": "execute_result"
1112
+ }
1113
+ ],
1114
+ "source": [
1115
+ "legiurl.shape"
1116
+ ]
1117
+ },
1118
+ {
1119
+ "cell_type": "code",
1120
+ "execution_count": null,
1121
+ "metadata": {
1122
+ "colab": {},
1123
+ "colab_type": "code",
1124
+ "id": "BKNg26HEP5kN"
1125
+ },
1126
+ "outputs": [],
1127
+ "source": [
1128
+ "#Extracting the features & storing them in a list\n",
1129
+ "legi_features = []\n",
1130
+ "label = 0\n",
1131
+ "\n",
1132
+ "for i in range(0, 5000):\n",
1133
+ " url = legiurl['URLs'][i]\n",
1134
+ " legi_features.append(featureExtraction(url,label))"
1135
+ ]
1136
+ },
1137
+ {
1138
+ "cell_type": "code",
1139
+ "execution_count": null,
1140
+ "metadata": {
1141
+ "colab": {
1142
+ "base_uri": "https://localhost:8080/",
1143
+ "height": 220
1144
+ },
1145
+ "colab_type": "code",
1146
+ "id": "DSuxYREMi0fr",
1147
+ "outputId": "de1e393b-ed2b-4021-a05b-d5b696852365"
1148
+ },
1149
+ "outputs": [
1150
+ {
1151
+ "data": {
1152
+ "text/html": [
1153
+ "<div>\n",
1154
+ "<style scoped>\n",
1155
+ " .dataframe tbody tr th:only-of-type {\n",
1156
+ " vertical-align: middle;\n",
1157
+ " }\n",
1158
+ "\n",
1159
+ " .dataframe tbody tr th {\n",
1160
+ " vertical-align: top;\n",
1161
+ " }\n",
1162
+ "\n",
1163
+ " .dataframe thead th {\n",
1164
+ " text-align: right;\n",
1165
+ " }\n",
1166
+ "</style>\n",
1167
+ "<table border=\"1\" class=\"dataframe\">\n",
1168
+ " <thead>\n",
1169
+ " <tr style=\"text-align: right;\">\n",
1170
+ " <th></th>\n",
1171
+ " <th>Domain</th>\n",
1172
+ " <th>Have_IP</th>\n",
1173
+ " <th>Have_At</th>\n",
1174
+ " <th>URL_Length</th>\n",
1175
+ " <th>URL_Depth</th>\n",
1176
+ " <th>Redirection</th>\n",
1177
+ " <th>https_Domain</th>\n",
1178
+ " <th>TinyURL</th>\n",
1179
+ " <th>Prefix/Suffix</th>\n",
1180
+ " <th>DNS_Record</th>\n",
1181
+ " <th>Web_Traffic</th>\n",
1182
+ " <th>Domain_Age</th>\n",
1183
+ " <th>Domain_End</th>\n",
1184
+ " <th>iFrame</th>\n",
1185
+ " <th>Mouse_Over</th>\n",
1186
+ " <th>Right_Click</th>\n",
1187
+ " <th>Web_Forwards</th>\n",
1188
+ " <th>Label</th>\n",
1189
+ " </tr>\n",
1190
+ " </thead>\n",
1191
+ " <tbody>\n",
1192
+ " <tr>\n",
1193
+ " <th>0</th>\n",
1194
+ " <td>graphicriver.net</td>\n",
1195
+ " <td>0</td>\n",
1196
+ " <td>0</td>\n",
1197
+ " <td>1</td>\n",
1198
+ " <td>1</td>\n",
1199
+ " <td>0</td>\n",
1200
+ " <td>0</td>\n",
1201
+ " <td>0</td>\n",
1202
+ " <td>0</td>\n",
1203
+ " <td>0</td>\n",
1204
+ " <td>1</td>\n",
1205
+ " <td>1</td>\n",
1206
+ " <td>1</td>\n",
1207
+ " <td>0</td>\n",
1208
+ " <td>0</td>\n",
1209
+ " <td>1</td>\n",
1210
+ " <td>0</td>\n",
1211
+ " <td>0</td>\n",
1212
+ " </tr>\n",
1213
+ " <tr>\n",
1214
+ " <th>1</th>\n",
1215
+ " <td>ecnavi.jp</td>\n",
1216
+ " <td>0</td>\n",
1217
+ " <td>0</td>\n",
1218
+ " <td>1</td>\n",
1219
+ " <td>1</td>\n",
1220
+ " <td>1</td>\n",
1221
+ " <td>0</td>\n",
1222
+ " <td>0</td>\n",
1223
+ " <td>0</td>\n",
1224
+ " <td>0</td>\n",
1225
+ " <td>1</td>\n",
1226
+ " <td>1</td>\n",
1227
+ " <td>1</td>\n",
1228
+ " <td>0</td>\n",
1229
+ " <td>0</td>\n",
1230
+ " <td>1</td>\n",
1231
+ " <td>0</td>\n",
1232
+ " <td>0</td>\n",
1233
+ " </tr>\n",
1234
+ " <tr>\n",
1235
+ " <th>2</th>\n",
1236
+ " <td>hubpages.com</td>\n",
1237
+ " <td>0</td>\n",
1238
+ " <td>0</td>\n",
1239
+ " <td>1</td>\n",
1240
+ " <td>1</td>\n",
1241
+ " <td>0</td>\n",
1242
+ " <td>0</td>\n",
1243
+ " <td>0</td>\n",
1244
+ " <td>0</td>\n",
1245
+ " <td>0</td>\n",
1246
+ " <td>1</td>\n",
1247
+ " <td>0</td>\n",
1248
+ " <td>1</td>\n",
1249
+ " <td>0</td>\n",
1250
+ " <td>0</td>\n",
1251
+ " <td>1</td>\n",
1252
+ " <td>0</td>\n",
1253
+ " <td>0</td>\n",
1254
+ " </tr>\n",
1255
+ " <tr>\n",
1256
+ " <th>3</th>\n",
1257
+ " <td>extratorrent.cc</td>\n",
1258
+ " <td>0</td>\n",
1259
+ " <td>0</td>\n",
1260
+ " <td>1</td>\n",
1261
+ " <td>3</td>\n",
1262
+ " <td>0</td>\n",
1263
+ " <td>0</td>\n",
1264
+ " <td>0</td>\n",
1265
+ " <td>0</td>\n",
1266
+ " <td>0</td>\n",
1267
+ " <td>1</td>\n",
1268
+ " <td>0</td>\n",
1269
+ " <td>1</td>\n",
1270
+ " <td>0</td>\n",
1271
+ " <td>0</td>\n",
1272
+ " <td>1</td>\n",
1273
+ " <td>0</td>\n",
1274
+ " <td>0</td>\n",
1275
+ " </tr>\n",
1276
+ " <tr>\n",
1277
+ " <th>4</th>\n",
1278
+ " <td>icicibank.com</td>\n",
1279
+ " <td>0</td>\n",
1280
+ " <td>0</td>\n",
1281
+ " <td>1</td>\n",
1282
+ " <td>3</td>\n",
1283
+ " <td>0</td>\n",
1284
+ " <td>0</td>\n",
1285
+ " <td>0</td>\n",
1286
+ " <td>0</td>\n",
1287
+ " <td>0</td>\n",
1288
+ " <td>1</td>\n",
1289
+ " <td>0</td>\n",
1290
+ " <td>1</td>\n",
1291
+ " <td>0</td>\n",
1292
+ " <td>0</td>\n",
1293
+ " <td>1</td>\n",
1294
+ " <td>0</td>\n",
1295
+ " <td>0</td>\n",
1296
+ " </tr>\n",
1297
+ " </tbody>\n",
1298
+ "</table>\n",
1299
+ "</div>"
1300
+ ],
1301
+ "text/plain": [
1302
+ " Domain Have_IP Have_At ... Right_Click Web_Forwards Label\n",
1303
+ "0 graphicriver.net 0 0 ... 1 0 0\n",
1304
+ "1 ecnavi.jp 0 0 ... 1 0 0\n",
1305
+ "2 hubpages.com 0 0 ... 1 0 0\n",
1306
+ "3 extratorrent.cc 0 0 ... 1 0 0\n",
1307
+ "4 icicibank.com 0 0 ... 1 0 0\n",
1308
+ "\n",
1309
+ "[5 rows x 18 columns]"
1310
+ ]
1311
+ },
1312
+ "execution_count": 35,
1313
+ "metadata": {
1314
+ "tags": []
1315
+ },
1316
+ "output_type": "execute_result"
1317
+ }
1318
+ ],
1319
+ "source": [
1320
+ "#converting the list to dataframe\n",
1321
+ "feature_names = ['Domain', 'Have_IP', 'Have_At', 'URL_Length', 'URL_Depth','Redirection', \n",
1322
+ " 'https_Domain', 'TinyURL', 'Prefix/Suffix', 'DNS_Record', 'Web_Traffic', \n",
1323
+ " 'Domain_Age', 'Domain_End', 'iFrame', 'Mouse_Over','Right_Click', 'Web_Forwards', 'Label']\n",
1324
+ "\n",
1325
+ "legitimate = pd.DataFrame(legi_features, columns= feature_names)\n",
1326
+ "legitimate.head()"
1327
+ ]
1328
+ },
1329
+ {
1330
+ "cell_type": "code",
1331
+ "execution_count": null,
1332
+ "metadata": {
1333
+ "colab": {},
1334
+ "colab_type": "code",
1335
+ "id": "1jVcLHvXC031"
1336
+ },
1337
+ "outputs": [],
1338
+ "source": [
1339
+ "# Storing the extracted legitimate URL features in a CSV file\n",
1340
+ "legitimate.to_csv('legitimate.csv', index= False)"
1341
+ ]
1342
+ },
1343
+ {
1344
+ "cell_type": "code",
1345
+ "execution_count": null,
1346
+ "metadata": {
1347
+ "colab": {
1348
+ "base_uri": "https://localhost:8080/",
1349
+ "height": 35
1350
+ },
1351
+ "colab_type": "code",
1352
+ "id": "PSKf1PCeoeGd",
1353
+ "outputId": "953c12ff-7f25-4491-ce67-ffcf4b1251b0"
1354
+ },
1355
+ "outputs": [
1356
+ {
1357
+ "data": {
1358
+ "text/plain": [
1359
+ "(5000, 8)"
1360
+ ]
1361
+ },
1362
+ "execution_count": 37,
1363
+ "metadata": {
1364
+ "tags": []
1365
+ },
1366
+ "output_type": "execute_result"
1367
+ }
1368
+ ],
1369
+ "source": [
1370
+ "phishurl.shape"
1371
+ ]
1372
+ },
1373
+ {
1374
+ "cell_type": "code",
1375
+ "execution_count": null,
1376
+ "metadata": {
1377
+ "colab": {},
1378
+ "colab_type": "code",
1379
+ "id": "WGhZMkbaoeGh"
1380
+ },
1381
+ "outputs": [],
1382
+ "source": [
1383
+ "#Extracting the features & storing them in a list\n",
1384
+ "phish_features = []\n",
1385
+ "label = 1\n",
1386
+ "for i in range(0, 5000):\n",
1387
+ " url = phishurl['url'][i]\n",
1388
+ " phish_features.append(featureExtraction(url,label))"
1389
+ ]
1390
+ },
1391
+ {
1392
+ "cell_type": "code",
1393
+ "execution_count": null,
1394
+ "metadata": {
1395
+ "colab": {
1396
+ "base_uri": "https://localhost:8080/",
1397
+ "height": 237
1398
+ },
1399
+ "colab_type": "code",
1400
+ "id": "1brvc6kmoeGk",
1401
+ "outputId": "58a2e2cd-886f-414a-e2c6-2e5b7894efdd"
1402
+ },
1403
+ "outputs": [
1404
+ {
1405
+ "data": {
1406
+ "text/html": [
1407
+ "<div>\n",
1408
+ "<style scoped>\n",
1409
+ " .dataframe tbody tr th:only-of-type {\n",
1410
+ " vertical-align: middle;\n",
1411
+ " }\n",
1412
+ "\n",
1413
+ " .dataframe tbody tr th {\n",
1414
+ " vertical-align: top;\n",
1415
+ " }\n",
1416
+ "\n",
1417
+ " .dataframe thead th {\n",
1418
+ " text-align: right;\n",
1419
+ " }\n",
1420
+ "</style>\n",
1421
+ "<table border=\"1\" class=\"dataframe\">\n",
1422
+ " <thead>\n",
1423
+ " <tr style=\"text-align: right;\">\n",
1424
+ " <th></th>\n",
1425
+ " <th>Domain</th>\n",
1426
+ " <th>Have_IP</th>\n",
1427
+ " <th>Have_At</th>\n",
1428
+ " <th>URL_Length</th>\n",
1429
+ " <th>URL_Depth</th>\n",
1430
+ " <th>Redirection</th>\n",
1431
+ " <th>https_Domain</th>\n",
1432
+ " <th>Tiny_URL</th>\n",
1433
+ " <th>Prefix/Suffix</th>\n",
1434
+ " <th>DNS_Record</th>\n",
1435
+ " <th>Web_Traffic</th>\n",
1436
+ " <th>Domain_Age</th>\n",
1437
+ " <th>Domain_End</th>\n",
1438
+ " <th>iFrame</th>\n",
1439
+ " <th>Mouse_Over</th>\n",
1440
+ " <th>Right_Click</th>\n",
1441
+ " <th>Web_Forwards</th>\n",
1442
+ " <th>Label</th>\n",
1443
+ " </tr>\n",
1444
+ " </thead>\n",
1445
+ " <tbody>\n",
1446
+ " <tr>\n",
1447
+ " <th>0</th>\n",
1448
+ " <td>eevee.tv</td>\n",
1449
+ " <td>0</td>\n",
1450
+ " <td>0</td>\n",
1451
+ " <td>0</td>\n",
1452
+ " <td>4</td>\n",
1453
+ " <td>0</td>\n",
1454
+ " <td>0</td>\n",
1455
+ " <td>0</td>\n",
1456
+ " <td>0</td>\n",
1457
+ " <td>0</td>\n",
1458
+ " <td>1</td>\n",
1459
+ " <td>0</td>\n",
1460
+ " <td>0</td>\n",
1461
+ " <td>0</td>\n",
1462
+ " <td>0</td>\n",
1463
+ " <td>1</td>\n",
1464
+ " <td>0</td>\n",
1465
+ " <td>1</td>\n",
1466
+ " </tr>\n",
1467
+ " <tr>\n",
1468
+ " <th>1</th>\n",
1469
+ " <td>appleid.apple.com-sa.pm</td>\n",
1470
+ " <td>0</td>\n",
1471
+ " <td>0</td>\n",
1472
+ " <td>0</td>\n",
1473
+ " <td>1</td>\n",
1474
+ " <td>0</td>\n",
1475
+ " <td>0</td>\n",
1476
+ " <td>0</td>\n",
1477
+ " <td>1</td>\n",
1478
+ " <td>0</td>\n",
1479
+ " <td>1</td>\n",
1480
+ " <td>1</td>\n",
1481
+ " <td>1</td>\n",
1482
+ " <td>0</td>\n",
1483
+ " <td>0</td>\n",
1484
+ " <td>1</td>\n",
1485
+ " <td>0</td>\n",
1486
+ " <td>1</td>\n",
1487
+ " </tr>\n",
1488
+ " <tr>\n",
1489
+ " <th>2</th>\n",
1490
+ " <td>grandcup.xyz</td>\n",
1491
+ " <td>0</td>\n",
1492
+ " <td>0</td>\n",
1493
+ " <td>0</td>\n",
1494
+ " <td>0</td>\n",
1495
+ " <td>0</td>\n",
1496
+ " <td>0</td>\n",
1497
+ " <td>0</td>\n",
1498
+ " <td>0</td>\n",
1499
+ " <td>0</td>\n",
1500
+ " <td>1</td>\n",
1501
+ " <td>0</td>\n",
1502
+ " <td>1</td>\n",
1503
+ " <td>1</td>\n",
1504
+ " <td>1</td>\n",
1505
+ " <td>1</td>\n",
1506
+ " <td>1</td>\n",
1507
+ " <td>1</td>\n",
1508
+ " </tr>\n",
1509
+ " <tr>\n",
1510
+ " <th>3</th>\n",
1511
+ " <td>villa-azzurro.com</td>\n",
1512
+ " <td>0</td>\n",
1513
+ " <td>0</td>\n",
1514
+ " <td>0</td>\n",
1515
+ " <td>1</td>\n",
1516
+ " <td>0</td>\n",
1517
+ " <td>0</td>\n",
1518
+ " <td>0</td>\n",
1519
+ " <td>1</td>\n",
1520
+ " <td>0</td>\n",
1521
+ " <td>0</td>\n",
1522
+ " <td>0</td>\n",
1523
+ " <td>1</td>\n",
1524
+ " <td>0</td>\n",
1525
+ " <td>0</td>\n",
1526
+ " <td>1</td>\n",
1527
+ " <td>0</td>\n",
1528
+ " <td>1</td>\n",
1529
+ " </tr>\n",
1530
+ " <tr>\n",
1531
+ " <th>4</th>\n",
1532
+ " <td>mygpstrip.net</td>\n",
1533
+ " <td>0</td>\n",
1534
+ " <td>0</td>\n",
1535
+ " <td>0</td>\n",
1536
+ " <td>2</td>\n",
1537
+ " <td>0</td>\n",
1538
+ " <td>0</td>\n",
1539
+ " <td>0</td>\n",
1540
+ " <td>0</td>\n",
1541
+ " <td>0</td>\n",
1542
+ " <td>1</td>\n",
1543
+ " <td>0</td>\n",
1544
+ " <td>1</td>\n",
1545
+ " <td>0</td>\n",
1546
+ " <td>0</td>\n",
1547
+ " <td>1</td>\n",
1548
+ " <td>0</td>\n",
1549
+ " <td>1</td>\n",
1550
+ " </tr>\n",
1551
+ " </tbody>\n",
1552
+ "</table>\n",
1553
+ "</div>"
1554
+ ],
1555
+ "text/plain": [
1556
+ " Domain Have_IP Have_At ... Right_Click Web_Forwards Label\n",
1557
+ "0 eevee.tv 0 0 ... 1 0 1\n",
1558
+ "1 appleid.apple.com-sa.pm 0 0 ... 1 0 1\n",
1559
+ "2 grandcup.xyz 0 0 ... 1 1 1\n",
1560
+ "3 villa-azzurro.com 0 0 ... 1 0 1\n",
1561
+ "4 mygpstrip.net 0 0 ... 1 0 1\n",
1562
+ "\n",
1563
+ "[5 rows x 18 columns]"
1564
+ ]
1565
+ },
1566
+ "execution_count": 40,
1567
+ "metadata": {
1568
+ "tags": []
1569
+ },
1570
+ "output_type": "execute_result"
1571
+ }
1572
+ ],
1573
+ "source": [
1574
+ "#converting the list to dataframe\n",
1575
+ "feature_names = ['Domain', 'Have_IP', 'Have_At', 'URL_Length', 'URL_Depth','Redirection', \n",
1576
+ " 'https_Domain', 'TinyURL', 'Prefix/Suffix', 'DNS_Record', 'Web_Traffic', \n",
1577
+ " 'Domain_Age', 'Domain_End', 'iFrame', 'Mouse_Over','Right_Click', 'Web_Forwards', 'Label']\n",
1578
+ "\n",
1579
+ "phishing = pd.DataFrame(phish_features, columns= feature_names)\n",
1580
+ "phishing.head()"
1581
+ ]
1582
+ },
1583
+ {
1584
+ "cell_type": "code",
1585
+ "execution_count": null,
1586
+ "metadata": {
1587
+ "colab": {},
1588
+ "colab_type": "code",
1589
+ "id": "UhBbG2O7E30d"
1590
+ },
1591
+ "outputs": [],
1592
+ "source": [
1593
+ "# Storing the extracted phishing URL features in a CSV file\n",
1594
+ "phishing.to_csv('phishing.csv', index= False)"
1595
+ ]
1596
+ },
1597
+ {
1598
+ "cell_type": "code",
1599
+ "execution_count": null,
1600
+ "metadata": {
1601
+ "colab": {
1602
+ "base_uri": "https://localhost:8080/",
1603
+ "height": 220
1604
+ },
1605
+ "colab_type": "code",
1606
+ "id": "ktJNvDY5sUsX",
1607
+ "outputId": "b64aa1a8-59a1-44a9-92d4-fc9cd811521b"
1608
+ },
1609
+ "outputs": [
1610
+ {
1611
+ "data": {
1612
+ "text/html": [
1613
+ "<div>\n",
1614
+ "<style scoped>\n",
1615
+ " .dataframe tbody tr th:only-of-type {\n",
1616
+ " vertical-align: middle;\n",
1617
+ " }\n",
1618
+ "\n",
1619
+ " .dataframe tbody tr th {\n",
1620
+ " vertical-align: top;\n",
1621
+ " }\n",
1622
+ "\n",
1623
+ " .dataframe thead th {\n",
1624
+ " text-align: right;\n",
1625
+ " }\n",
1626
+ "</style>\n",
1627
+ "<table border=\"1\" class=\"dataframe\">\n",
1628
+ " <thead>\n",
1629
+ " <tr style=\"text-align: right;\">\n",
1630
+ " <th></th>\n",
1631
+ " <th>Domain</th>\n",
1632
+ " <th>Have_IP</th>\n",
1633
+ " <th>Have_At</th>\n",
1634
+ " <th>URL_Length</th>\n",
1635
+ " <th>URL_Depth</th>\n",
1636
+ " <th>Redirection</th>\n",
1637
+ " <th>https_Domain</th>\n",
1638
+ " <th>TinyURL</th>\n",
1639
+ " <th>Prefix/Suffix</th>\n",
1640
+ " <th>DNS_Record</th>\n",
1641
+ " <th>Web_Traffic</th>\n",
1642
+ " <th>Domain_Age</th>\n",
1643
+ " <th>Domain_End</th>\n",
1644
+ " <th>iFrame</th>\n",
1645
+ " <th>Mouse_Over</th>\n",
1646
+ " <th>Right_Click</th>\n",
1647
+ " <th>Web_Forwards</th>\n",
1648
+ " <th>Label</th>\n",
1649
+ " </tr>\n",
1650
+ " </thead>\n",
1651
+ " <tbody>\n",
1652
+ " <tr>\n",
1653
+ " <th>0</th>\n",
1654
+ " <td>graphicriver.net</td>\n",
1655
+ " <td>0</td>\n",
1656
+ " <td>0</td>\n",
1657
+ " <td>1</td>\n",
1658
+ " <td>1</td>\n",
1659
+ " <td>0</td>\n",
1660
+ " <td>0</td>\n",
1661
+ " <td>0</td>\n",
1662
+ " <td>0</td>\n",
1663
+ " <td>0</td>\n",
1664
+ " <td>1</td>\n",
1665
+ " <td>1</td>\n",
1666
+ " <td>1</td>\n",
1667
+ " <td>0</td>\n",
1668
+ " <td>0</td>\n",
1669
+ " <td>1</td>\n",
1670
+ " <td>0</td>\n",
1671
+ " <td>0</td>\n",
1672
+ " </tr>\n",
1673
+ " <tr>\n",
1674
+ " <th>1</th>\n",
1675
+ " <td>ecnavi.jp</td>\n",
1676
+ " <td>0</td>\n",
1677
+ " <td>0</td>\n",
1678
+ " <td>1</td>\n",
1679
+ " <td>1</td>\n",
1680
+ " <td>1</td>\n",
1681
+ " <td>0</td>\n",
1682
+ " <td>0</td>\n",
1683
+ " <td>0</td>\n",
1684
+ " <td>0</td>\n",
1685
+ " <td>1</td>\n",
1686
+ " <td>1</td>\n",
1687
+ " <td>1</td>\n",
1688
+ " <td>0</td>\n",
1689
+ " <td>0</td>\n",
1690
+ " <td>1</td>\n",
1691
+ " <td>0</td>\n",
1692
+ " <td>0</td>\n",
1693
+ " </tr>\n",
1694
+ " <tr>\n",
1695
+ " <th>2</th>\n",
1696
+ " <td>hubpages.com</td>\n",
1697
+ " <td>0</td>\n",
1698
+ " <td>0</td>\n",
1699
+ " <td>1</td>\n",
1700
+ " <td>1</td>\n",
1701
+ " <td>0</td>\n",
1702
+ " <td>0</td>\n",
1703
+ " <td>0</td>\n",
1704
+ " <td>0</td>\n",
1705
+ " <td>0</td>\n",
1706
+ " <td>1</td>\n",
1707
+ " <td>0</td>\n",
1708
+ " <td>1</td>\n",
1709
+ " <td>0</td>\n",
1710
+ " <td>0</td>\n",
1711
+ " <td>1</td>\n",
1712
+ " <td>0</td>\n",
1713
+ " <td>0</td>\n",
1714
+ " </tr>\n",
1715
+ " <tr>\n",
1716
+ " <th>3</th>\n",
1717
+ " <td>extratorrent.cc</td>\n",
1718
+ " <td>0</td>\n",
1719
+ " <td>0</td>\n",
1720
+ " <td>1</td>\n",
1721
+ " <td>3</td>\n",
1722
+ " <td>0</td>\n",
1723
+ " <td>0</td>\n",
1724
+ " <td>0</td>\n",
1725
+ " <td>0</td>\n",
1726
+ " <td>0</td>\n",
1727
+ " <td>1</td>\n",
1728
+ " <td>0</td>\n",
1729
+ " <td>1</td>\n",
1730
+ " <td>0</td>\n",
1731
+ " <td>0</td>\n",
1732
+ " <td>1</td>\n",
1733
+ " <td>0</td>\n",
1734
+ " <td>0</td>\n",
1735
+ " </tr>\n",
1736
+ " <tr>\n",
1737
+ " <th>4</th>\n",
1738
+ " <td>icicibank.com</td>\n",
1739
+ " <td>0</td>\n",
1740
+ " <td>0</td>\n",
1741
+ " <td>1</td>\n",
1742
+ " <td>3</td>\n",
1743
+ " <td>0</td>\n",
1744
+ " <td>0</td>\n",
1745
+ " <td>0</td>\n",
1746
+ " <td>0</td>\n",
1747
+ " <td>0</td>\n",
1748
+ " <td>1</td>\n",
1749
+ " <td>0</td>\n",
1750
+ " <td>1</td>\n",
1751
+ " <td>0</td>\n",
1752
+ " <td>0</td>\n",
1753
+ " <td>1</td>\n",
1754
+ " <td>0</td>\n",
1755
+ " <td>0</td>\n",
1756
+ " </tr>\n",
1757
+ " </tbody>\n",
1758
+ "</table>\n",
1759
+ "</div>"
1760
+ ],
1761
+ "text/plain": [
1762
+ " Domain Have_IP Have_At ... Right_Click Web_Forwards Label\n",
1763
+ "0 graphicriver.net 0 0 ... 1 0 0\n",
1764
+ "1 ecnavi.jp 0 0 ... 1 0 0\n",
1765
+ "2 hubpages.com 0 0 ... 1 0 0\n",
1766
+ "3 extratorrent.cc 0 0 ... 1 0 0\n",
1767
+ "4 icicibank.com 0 0 ... 1 0 0\n",
1768
+ "\n",
1769
+ "[5 rows x 18 columns]"
1770
+ ]
1771
+ },
1772
+ "execution_count": 45,
1773
+ "metadata": {
1774
+ "tags": []
1775
+ },
1776
+ "output_type": "execute_result"
1777
+ }
1778
+ ],
1779
+ "source": [
1780
+ "# Concatenating the legitimate and phishing dataframes into a single dataframe\n",
1781
+ "urldata = pd.concat([legitimate, phishing]).reset_index(drop=True)\n",
1782
+ "urldata.head()"
1783
+ ]
1784
+ },
1785
+ {
1786
+ "cell_type": "code",
1787
+ "execution_count": null,
1788
+ "metadata": {
1789
+ "colab": {
1790
+ "base_uri": "https://localhost:8080/",
1791
+ "height": 271
1792
+ },
1793
+ "colab_type": "code",
1794
+ "id": "u1Viw3jKh3_o",
1795
+ "outputId": "eb924dbb-e26a-4711-a46a-3e39b4a37718"
1796
+ },
1797
+ "outputs": [
1798
+ {
1799
+ "data": {
1800
+ "text/html": [
1801
+ "<div>\n",
1802
+ "<style scoped>\n",
1803
+ " .dataframe tbody tr th:only-of-type {\n",
1804
+ " vertical-align: middle;\n",
1805
+ " }\n",
1806
+ "\n",
1807
+ " .dataframe tbody tr th {\n",
1808
+ " vertical-align: top;\n",
1809
+ " }\n",
1810
+ "\n",
1811
+ " .dataframe thead th {\n",
1812
+ " text-align: right;\n",
1813
+ " }\n",
1814
+ "</style>\n",
1815
+ "<table border=\"1\" class=\"dataframe\">\n",
1816
+ " <thead>\n",
1817
+ " <tr style=\"text-align: right;\">\n",
1818
+ " <th></th>\n",
1819
+ " <th>Domain</th>\n",
1820
+ " <th>Have_IP</th>\n",
1821
+ " <th>Have_At</th>\n",
1822
+ " <th>URL_Length</th>\n",
1823
+ " <th>URL_Depth</th>\n",
1824
+ " <th>Redirection</th>\n",
1825
+ " <th>https_Domain</th>\n",
1826
+ " <th>TinyURL</th>\n",
1827
+ " <th>Prefix/Suffix</th>\n",
1828
+ " <th>DNS_Record</th>\n",
1829
+ " <th>Web_Traffic</th>\n",
1830
+ " <th>Domain_Age</th>\n",
1831
+ " <th>Domain_End</th>\n",
1832
+ " <th>iFrame</th>\n",
1833
+ " <th>Mouse_Over</th>\n",
1834
+ " <th>Right_Click</th>\n",
1835
+ " <th>Web_Forwards</th>\n",
1836
+ " <th>Label</th>\n",
1837
+ " </tr>\n",
1838
+ " </thead>\n",
1839
+ " <tbody>\n",
1840
+ " <tr>\n",
1841
+ " <th>9995</th>\n",
1842
+ " <td>wvk12-my.sharepoint.com</td>\n",
1843
+ " <td>0</td>\n",
1844
+ " <td>0</td>\n",
1845
+ " <td>1</td>\n",
1846
+ " <td>5</td>\n",
1847
+ " <td>0</td>\n",
1848
+ " <td>0</td>\n",
1849
+ " <td>1</td>\n",
1850
+ " <td>1</td>\n",
1851
+ " <td>0</td>\n",
1852
+ " <td>1</td>\n",
1853
+ " <td>1</td>\n",
1854
+ " <td>1</td>\n",
1855
+ " <td>0</td>\n",
1856
+ " <td>0</td>\n",
1857
+ " <td>1</td>\n",
1858
+ " <td>0</td>\n",
1859
+ " <td>1</td>\n",
1860
+ " </tr>\n",
1861
+ " <tr>\n",
1862
+ " <th>9996</th>\n",
1863
+ " <td>adplife.com</td>\n",
1864
+ " <td>0</td>\n",
1865
+ " <td>0</td>\n",
1866
+ " <td>1</td>\n",
1867
+ " <td>4</td>\n",
1868
+ " <td>0</td>\n",
1869
+ " <td>0</td>\n",
1870
+ " <td>0</td>\n",
1871
+ " <td>0</td>\n",
1872
+ " <td>0</td>\n",
1873
+ " <td>1</td>\n",
1874
+ " <td>0</td>\n",
1875
+ " <td>1</td>\n",
1876
+ " <td>0</td>\n",
1877
+ " <td>0</td>\n",
1878
+ " <td>1</td>\n",
1879
+ " <td>0</td>\n",
1880
+ " <td>1</td>\n",
1881
+ " </tr>\n",
1882
+ " <tr>\n",
1883
+ " <th>9997</th>\n",
1884
+ " <td>kurortnoye.com.ua</td>\n",
1885
+ " <td>0</td>\n",
1886
+ " <td>1</td>\n",
1887
+ " <td>1</td>\n",
1888
+ " <td>3</td>\n",
1889
+ " <td>0</td>\n",
1890
+ " <td>0</td>\n",
1891
+ " <td>1</td>\n",
1892
+ " <td>0</td>\n",
1893
+ " <td>0</td>\n",
1894
+ " <td>0</td>\n",
1895
+ " <td>1</td>\n",
1896
+ " <td>1</td>\n",
1897
+ " <td>1</td>\n",
1898
+ " <td>0</td>\n",
1899
+ " <td>1</td>\n",
1900
+ " <td>0</td>\n",
1901
+ " <td>1</td>\n",
1902
+ " </tr>\n",
1903
+ " <tr>\n",
1904
+ " <th>9998</th>\n",
1905
+ " <td>norcaltc-my.sharepoint.com</td>\n",
1906
+ " <td>0</td>\n",
1907
+ " <td>0</td>\n",
1908
+ " <td>1</td>\n",
1909
+ " <td>5</td>\n",
1910
+ " <td>0</td>\n",
1911
+ " <td>0</td>\n",
1912
+ " <td>1</td>\n",
1913
+ " <td>1</td>\n",
1914
+ " <td>0</td>\n",
1915
+ " <td>1</td>\n",
1916
+ " <td>1</td>\n",
1917
+ " <td>1</td>\n",
1918
+ " <td>0</td>\n",
1919
+ " <td>0</td>\n",
1920
+ " <td>1</td>\n",
1921
+ " <td>0</td>\n",
1922
+ " <td>1</td>\n",
1923
+ " </tr>\n",
1924
+ " <tr>\n",
1925
+ " <th>9999</th>\n",
1926
+ " <td>sieck-kuehlsysteme.de</td>\n",
1927
+ " <td>0</td>\n",
1928
+ " <td>1</td>\n",
1929
+ " <td>1</td>\n",
1930
+ " <td>4</td>\n",
1931
+ " <td>0</td>\n",
1932
+ " <td>0</td>\n",
1933
+ " <td>1</td>\n",
1934
+ " <td>1</td>\n",
1935
+ " <td>0</td>\n",
1936
+ " <td>1</td>\n",
1937
+ " <td>1</td>\n",
1938
+ " <td>1</td>\n",
1939
+ " <td>0</td>\n",
1940
+ " <td>0</td>\n",
1941
+ " <td>1</td>\n",
1942
+ " <td>0</td>\n",
1943
+ " <td>1</td>\n",
1944
+ " </tr>\n",
1945
+ " </tbody>\n",
1946
+ "</table>\n",
1947
+ "</div>"
1948
+ ],
1949
+ "text/plain": [
1950
+ " Domain Have_IP ... Web_Forwards Label\n",
1951
+ "9995 wvk12-my.sharepoint.com 0 ... 0 1\n",
1952
+ "9996 adplife.com 0 ... 0 1\n",
1953
+ "9997 kurortnoye.com.ua 0 ... 0 1\n",
1954
+ "9998 norcaltc-my.sharepoint.com 0 ... 0 1\n",
1955
+ "9999 sieck-kuehlsysteme.de 0 ... 0 1\n",
1956
+ "\n",
1957
+ "[5 rows x 18 columns]"
1958
+ ]
1959
+ },
1960
+ "execution_count": 46,
1961
+ "metadata": {
1962
+ "tags": []
1963
+ },
1964
+ "output_type": "execute_result"
1965
+ }
1966
+ ],
1967
+ "source": [
1968
+ "urldata.tail()"
1969
+ ]
1970
+ },
1971
+ {
1972
+ "cell_type": "code",
1973
+ "execution_count": null,
1974
+ "metadata": {
1975
+ "colab": {
1976
+ "base_uri": "https://localhost:8080/",
1977
+ "height": 35
1978
+ },
1979
+ "colab_type": "code",
1980
+ "id": "1NNxgbVCr7vt",
1981
+ "outputId": "3064082c-4508-4579-8128-bff9842a04b7"
1982
+ },
1983
+ "outputs": [
1984
+ {
1985
+ "data": {
1986
+ "text/plain": [
1987
+ "(10000, 18)"
1988
+ ]
1989
+ },
1990
+ "execution_count": 47,
1991
+ "metadata": {
1992
+ "tags": []
1993
+ },
1994
+ "output_type": "execute_result"
1995
+ }
1996
+ ],
1997
+ "source": [
1998
+ "urldata.shape"
1999
+ ]
2000
+ },
2001
+ {
2002
+ "cell_type": "code",
2003
+ "execution_count": null,
2004
+ "metadata": {
2005
+ "colab": {},
2006
+ "colab_type": "code",
2007
+ "id": "596496VUrhRI"
2008
+ },
2009
+ "outputs": [],
2010
+ "source": [
2011
+ "# Storing the data in CSV file\n",
2012
+ "urldata.to_csv('urldata.csv', index=False)"
2013
+ ]
2014
+ }
2015
+ ],
2016
+ "metadata": {
2017
+ "accelerator": "GPU",
2018
+ "colab": {
2019
+ "collapsed_sections": [],
2020
+ "name": "URL Feature Extraction.ipynb",
2021
+ "provenance": [],
2022
+ "toc_visible": true
2023
+ },
2024
+ "kernelspec": {
2025
+ "display_name": "Python 3",
2026
+ "name": "python3"
2027
+ },
2028
+ "language_info": {
2029
+ "codemirror_mode": {
2030
+ "name": "ipython",
2031
+ "version": 3
2032
+ },
2033
+ "file_extension": ".py",
2034
+ "mimetype": "text/x-python",
2035
+ "name": "python",
2036
+ "nbconvert_exporter": "python",
2037
+ "pygments_lexer": "ipython3",
2038
+ "version": "3.11.3"
2039
+ }
2040
+ },
2041
+ "nbformat": 4,
2042
+ "nbformat_minor": 0
2043
+ }
XGBoostClassifier1.pickle.dat ADDED
Binary file (250 kB).
 
app.py ADDED
@@ -0,0 +1,257 @@
+ from urllib.parse import urlparse, urlencode
+ import ipaddress
+ import re
+ from bs4 import BeautifulSoup
+ import whois
+ import urllib
+ import urllib.request
+ from datetime import datetime
+ import requests
+ import pickle
+ import gradio as gr
+
+ loaded_model = pickle.load(open("XGBoostClassifier1.pickle.dat", "rb"))
+
+ shortening_services = r"bit\.ly|goo\.gl|shorte\.st|go2l\.ink|x\.co|ow\.ly|t\.co|tinyurl|tr\.im|is\.gd|cli\.gs|" \
+                       r"yfrog\.com|migre\.me|ff\.im|tiny\.cc|url4\.eu|twit\.ac|su\.pr|twurl\.nl|snipurl\.com|" \
+                       r"short\.to|BudURL\.com|ping\.fm|post\.ly|Just\.as|bkite\.com|snipr\.com|fic\.kr|loopt\.us|" \
+                       r"doiop\.com|short\.ie|kl\.am|wp\.me|rubyurl\.com|om\.ly|to\.ly|bit\.do|t\.co|lnkd\.in|db\.tt|" \
+                       r"qr\.ae|adf\.ly|goo\.gl|bitly\.com|cur\.lv|tinyurl\.com|ow\.ly|bit\.ly|ity\.im|q\.gs|is\.gd|" \
+                       r"po\.st|bc\.vc|twitthis\.com|u\.to|j\.mp|buzurl\.com|cutt\.us|u\.bb|yourls\.org|x\.co|" \
+                       r"prettylinkpro\.com|scrnch\.me|filoops\.info|vzturl\.com|qr\.net|1url\.com|tweez\.me|v\.gd|" \
+                       r"tr\.im|link\.zip\.net"
+
+ def getDomain(url):
+     domain = urlparse(url).netloc
+     if re.match(r"^www.",domain):
+         domain = domain.replace("www.","")
+     return domain
+
+ def havingIP(url):
+     # Check only the host part; passing the full URL to ip_address() always raises
+     try:
+         ipaddress.ip_address(urlparse(url).hostname)
+         ip = 1
+     except:
+         ip = 0
+     return ip
+
38
+ def haveAtSign(url):
39
+ if "@" in url:
40
+ at = 1
41
+ else:
42
+ at = 0
43
+ return at
44
+
45
+ def getLength(url):
46
+ if len(url) < 54:
47
+ length = 0
48
+ else:
49
+ length = 1
50
+ return length
51
+
52
+ def getDepth(url):
53
+ s = urlparse(url).path.split('/')
54
+ depth = 0
55
+ for j in range(len(s)):
56
+ if len(s[j]) != 0:
57
+ depth = depth+1
58
+ return depth
59
+
60
+ def redirection(url):
61
+ pos = url.rfind('//')
62
+ if pos > 6:
63
+ if pos > 7:
64
+ return 1
65
+ else:
66
+ return 0
67
+ else:
68
+ return 0
69
+
70
+
71
+ def httpDomain(url):
72
+ domain = urlparse(url).netloc
73
+ if 'https' in domain:
74
+ return 1
75
+ else:
76
+ return 0
77
+
78
+
79
+ def tinyURL(url):
80
+ match=re.search(shortening_services,url)
81
+ if match:
82
+ return 1
83
+ else:
84
+ return 0
85
+
86
+ def prefixSuffix(url):
87
+ if '-' in urlparse(url).netloc:
88
+ return 1 # phishing
89
+ else:
90
+ return 0 # legitimate
91
+
92
+ def web_traffic(url):
93
+ # try:
94
+ # url = urllib.parse.quote(url)
95
+ # rank = BeautifulSoup(urllib.request.urlopen("http://data.alexa.com/data?cli=10&dat=s&url=" + url).read(), "xml").find(
96
+ # "REACH")['RANK']
97
+ # rank = int(rank)
98
+ # except TypeError:
99
+ # return 1
100
+ # if rank <100000:
101
+ # return 1
102
+ # else:
103
+ return 0
104
+
+ def domainAge(domain_name):
+     creation_date = domain_name.creation_date
+     expiration_date = domain_name.expiration_date
+     if (isinstance(creation_date,str) or isinstance(expiration_date,str)):
+         try:
+             creation_date = datetime.strptime(creation_date,'%Y-%m-%d')
+             expiration_date = datetime.strptime(expiration_date,"%Y-%m-%d")
+         except (TypeError, ValueError):
+             # Unparseable or missing date: treat as suspicious.
+             return 1
+     if ((expiration_date is None) or (creation_date is None)):
+         return 1
+     elif ((type(expiration_date) is list) or (type(creation_date) is list)):
+         return 1
+     else:
+         # Registration span shorter than about 6 months counts as phishing.
+         ageofdomain = abs((expiration_date - creation_date).days)
+         if ((ageofdomain/30) < 6):
+             age = 1
+         else:
+             age = 0
+     return age
+
+
+ def domainEnd(domain_name):
+     expiration_date = domain_name.expiration_date
+     if isinstance(expiration_date,str):
+         try:
+             expiration_date = datetime.strptime(expiration_date,"%Y-%m-%d")
+         except ValueError:
+             # Unparseable date: treat as suspicious.
+             return 1
+     if (expiration_date is None):
+         return 1
+     elif (type(expiration_date) is list):
+         return 1
+     else:
+         today = datetime.now()
+         end = abs((expiration_date - today).days)
+         if ((end/30) < 6):
+             end = 0
+         else:
+             end = 1
+     return end
+
+
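+ def iframe(response):
+     if response == "":
+         return 1
+     else:
+         # NOTE: the square brackets make this a character class, so it matches
+         # any single one of the listed characters rather than the literal tags.
+         if re.findall(r"[<iframe>|<frameBorder>]", response.text):
+             return 0
+         else:
+             return 1
+
+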
+ def mouseOver(response):
+     if response == "":
+         return 1
+     else:
+         if re.findall(r"<script>.+onmouseover.+</script>", response.text):
+             return 1
+         else:
+             return 0
+
+ def rightClick(response):
+     if response == "":
+         return 1
+     else:
+         # The dot is escaped so the pattern matches "event.button" literally.
+         if re.findall(r"event\.button ?== ?2", response.text):
+             return 0
+         else:
+             return 1
+
+ def forwarding(response):
+     if response == "":
+         return 1
+     else:
+         # response.history holds the chain of redirects that were followed.
+         if len(response.history) <= 2:
+             return 0
+         else:
+             return 1
+
+
+
+ def featureExtraction(url):
+     features = []
+     # Address bar based features
+     # features.append(getDomain(url))
+     features.append(havingIP(url))
+     features.append(haveAtSign(url))
+     features.append(getLength(url))
+     features.append(getDepth(url))
+     features.append(redirection(url))
+     features.append(httpDomain(url))
+     features.append(tinyURL(url))
+     features.append(prefixSuffix(url))
+
+     # Domain based features (4)
+     dns = 0
+     try:
+         domain_name = whois.whois(urlparse(url).netloc)
+     except Exception:
+         # WHOIS lookup failed; domain_name stays unset and is not used below.
+         dns = 1
+
+     features.append(dns)
+     features.append(web_traffic(url))
+     features.append(1 if dns == 1 else domainAge(domain_name))
+     features.append(1 if dns == 1 else domainEnd(domain_name))
+
+     # HTML & Javascript based features (4)
+     try:
+         # The 10-second timeout is an arbitrary choice to avoid hanging on dead hosts.
+         response = requests.get(url, timeout=10)
+     except Exception:
+         response = ""
+     features.append(iframe(response))
+     features.append(mouseOver(response))
+     features.append(rightClick(response))
+     features.append(forwarding(response))
+
+     return features
+
+ def index(url):
+     features = featureExtraction(url)
+     prediction = loaded_model.predict([features])
+     print(features)
+     print(prediction)
+
+     if prediction[0] == 0:
+         return "Safe"
+     else:
+         return "Unsafe"
+
+
+ inputs_image_url = [
+     gr.Textbox(type="text", label="URL"),
+ ]
+
+ # index() returns a plain string ("Safe" or "Unsafe"), not a dictionary.
+ outputs_result_dict = [
+     gr.Textbox(type="text", label="Result"),
+ ]
+
+ interface_image_url = gr.Interface(
+     fn=index,
+     inputs=inputs_image_url,
+     outputs=outputs_result_dict,
+     title="URL Detection",
+     cache_examples=False,
+ )
+
+ gr.TabbedInterface(
+     [interface_image_url],
+     tab_names=['URL inference']
+ ).queue().launch()
+
+ # 0 -> Legitimate
+ # 1 -> Phishing
requirements.txt ADDED
@@ -0,0 +1,10 @@
+ tensorflow
+ opencv-python
+ matplotlib
+ numpy
+ gradio
+ python-whois
+ datetime
+ bs4
+ ipaddress
+ requests