Commit
·
818bcad
1
Parent(s):
5b84d56
Add Writeup & codes
- Foong_Coding_Challenge_for_Fatima_Fellowship.ipynb +1264 -0
- README.md +15 -0
Foong_Coding_Challenge_for_Fatima_Fellowship.ipynb
ADDED
@@ -0,0 +1,1264 @@
1 |
+
{
|
2 |
+
"nbformat": 4,
|
3 |
+
"nbformat_minor": 0,
|
4 |
+
"metadata": {
|
5 |
+
"colab": {
|
6 |
+
"name": "Foong_Coding Challenge for Fatima Fellowship",
|
7 |
+
"provenance": [],
|
8 |
+
"collapsed_sections": []
|
9 |
+
},
|
10 |
+
"kernelspec": {
|
11 |
+
"name": "python3",
|
12 |
+
"display_name": "Python 3"
|
13 |
+
}
|
14 |
+
},
|
15 |
+
"cells": [
|
16 |
+
{
|
17 |
+
"cell_type": "markdown",
|
18 |
+
"metadata": {
|
19 |
+
"id": "eBpjBBZc6IvA"
|
20 |
+
},
|
21 |
+
"source": [
|
22 |
+
"# Fatima Fellowship Quick Coding Challenge (Pick 1)\n",
|
23 |
+
"\n",
|
24 |
+
"Thank you for applying to the Fatima Fellowship. To help us select the Fellows and assess your ability to do machine learning research, we are asking that you complete a short coding challenge. Please pick **1 of these 5** coding challenges, whichever is most aligned with your interests. \n",
|
25 |
+
"\n",
|
26 |
+
"**Due date: 1 week**\n",
|
27 |
+
"\n",
|
28 |
+
"**How to submit**: Please make a copy of this colab notebook, add your code and results, and submit your colab notebook to the submission link below. If you have never used a colab notebook, [check out this video](https://www.youtube.com/watch?v=i-HnvsehuSw).\n",
|
29 |
+
"\n",
|
30 |
+
"**Submission link**: https://airtable.com/shrXy3QKSsO2yALd3"
|
31 |
+
]
|
32 |
+
},
|
33 |
+
{
|
34 |
+
"cell_type": "markdown",
|
35 |
+
"metadata": {
|
36 |
+
"id": "sFU9LTOyMiMj"
|
37 |
+
},
|
38 |
+
"source": [
|
39 |
+
"# 2. Deep Learning for NLP\n",
|
40 |
+
"\n",
|
41 |
+
"**Fake news classifier**: Train a text classification model to detect fake news articles!\n",
|
42 |
+
"\n",
|
43 |
+
"* Download the dataset here: https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset\n",
|
44 |
+
"* Develop an NLP model for classification that uses a pretrained language model\n",
|
45 |
+
"* Finetune your model on the dataset, and generate an AUC curve of your model on the test set of your choice. \n",
|
46 |
+
"* [Upload the model to the Hugging Face Hub](https://huggingface.co/docs/hub/adding-a-model), and add a link to your model below.\n",
|
47 |
+
"* *Answer the following question*: Look at some of the news articles that were classified incorrectly. Please explain what you might do to improve your model's performance on these news articles in the future (you do not need to implement these suggestions)."
|
48 |
+
]
|
49 |
+
},
|
50 |
+
{
|
51 |
+
"cell_type": "code",
|
52 |
+
"source": [
|
53 |
+
"### WRITE YOUR CODE TO TRAIN THE MODEL HERE\n",
|
54 |
+
"import numpy as np\n",
|
55 |
+
"import pandas as pd\n",
|
56 |
+
"import csv\n",
|
57 |
+
"from sklearn.metrics import accuracy_score, precision_recall_fscore_support\n",
|
58 |
+
"\n"
|
59 |
+
],
|
60 |
+
"metadata": {
|
61 |
+
"id": "E90i018KyJH3"
|
62 |
+
},
|
63 |
+
"execution_count": 1,
|
64 |
+
"outputs": []
|
65 |
+
},
|
66 |
+
{
|
67 |
+
"cell_type": "markdown",
|
68 |
+
"source": [
|
69 |
+
"## Data Loading"
|
70 |
+
],
|
71 |
+
"metadata": {
|
72 |
+
"id": "HUDOBz2tRivY"
|
73 |
+
}
|
74 |
+
},
|
75 |
+
{
|
76 |
+
"cell_type": "code",
|
77 |
+
"source": [
|
78 |
+
"real_news = pd.read_csv(\"True.csv\", sep=',', engine='python', encoding='utf8',on_bad_lines='skip')\n",
|
79 |
+
"fake_news = pd.read_csv(\"Fake.csv\", sep=',', engine='python', encoding='utf8',on_bad_lines='skip')\n",
|
80 |
+
"\n",
|
81 |
+
"print(\"real_news: \" + str(real_news.shape))\n",
|
82 |
+
"print(\"fake_news: \" + str(fake_news.shape))"
|
83 |
+
],
|
84 |
+
"metadata": {
|
85 |
+
"colab": {
|
86 |
+
"base_uri": "https://localhost:8080/"
|
87 |
+
},
|
88 |
+
"id": "d60sCvRjOSWa",
|
89 |
+
"outputId": "99813f74-971d-41e2-8597-4913ca131fe1"
|
90 |
+
},
|
91 |
+
"execution_count": 2,
|
92 |
+
"outputs": [
|
93 |
+
{
|
94 |
+
"output_type": "stream",
|
95 |
+
"name": "stdout",
|
96 |
+
"text": [
|
97 |
+
"real_news: (21417, 4)\n",
|
98 |
+
"fake_news: (14568, 4)\n"
|
99 |
+
]
|
100 |
+
}
|
101 |
+
]
|
102 |
+
},
|
103 |
+
{
|
104 |
+
"cell_type": "code",
|
105 |
+
"source": [
|
106 |
+
"fake_news.head()"
|
107 |
+
],
|
108 |
+
"metadata": {
|
109 |
+
"colab": {
|
110 |
+
"base_uri": "https://localhost:8080/",
|
111 |
+
"height": 206
|
112 |
+
},
|
113 |
+
"id": "ywYW2xTuOVGy",
|
114 |
+
"outputId": "2e442a61-4634-4965-a6f7-822896f45dbb"
|
115 |
+
},
|
116 |
+
"execution_count": 3,
|
117 |
+
"outputs": [
|
118 |
+
{
|
119 |
+
"output_type": "execute_result",
|
120 |
+
"data": {
|
121 |
+
"text/plain": [
|
122 |
+
" title \\\n",
|
123 |
+
"0 Donald Trump Sends Out Embarrassing New Year’... \n",
|
124 |
+
"1 Drunk Bragging Trump Staffer Started Russian ... \n",
|
125 |
+
"2 Sheriff David Clarke Becomes An Internet Joke... \n",
|
126 |
+
"3 Trump Is So Obsessed He Even Has Obama’s Name... \n",
|
127 |
+
"4 Pope Francis Just Called Out Donald Trump Dur... \n",
|
128 |
+
"\n",
|
129 |
+
" text subject \\\n",
|
130 |
+
"0 Donald Trump just couldn t wish all Americans ... News \n",
|
131 |
+
"1 House Intelligence Committee Chairman Devin Nu... News \n",
|
132 |
+
"2 On Friday, it was revealed that former Milwauk... News \n",
|
133 |
+
"3 On Christmas day, Donald Trump announced that ... News \n",
|
134 |
+
"4 Pope Francis used his annual Christmas Day mes... News \n",
|
135 |
+
"\n",
|
136 |
+
" date \n",
|
137 |
+
"0 December 31, 2017 \n",
|
138 |
+
"1 December 31, 2017 \n",
|
139 |
+
"2 December 30, 2017 \n",
|
140 |
+
"3 December 29, 2017 \n",
|
141 |
+
"4 December 25, 2017 "
|
142 |
+
],
|
143 |
+
"text/html": [
|
144 |
+
"\n",
|
145 |
+
" <div id=\"df-afb2a3a4-f8e1-4dbb-9bb9-ad1bf6c5d39d\">\n",
|
146 |
+
" <div class=\"colab-df-container\">\n",
|
147 |
+
" <div>\n",
|
148 |
+
"<style scoped>\n",
|
149 |
+
" .dataframe tbody tr th:only-of-type {\n",
|
150 |
+
" vertical-align: middle;\n",
|
151 |
+
" }\n",
|
152 |
+
"\n",
|
153 |
+
" .dataframe tbody tr th {\n",
|
154 |
+
" vertical-align: top;\n",
|
155 |
+
" }\n",
|
156 |
+
"\n",
|
157 |
+
" .dataframe thead th {\n",
|
158 |
+
" text-align: right;\n",
|
159 |
+
" }\n",
|
160 |
+
"</style>\n",
|
161 |
+
"<table border=\"1\" class=\"dataframe\">\n",
|
162 |
+
" <thead>\n",
|
163 |
+
" <tr style=\"text-align: right;\">\n",
|
164 |
+
" <th></th>\n",
|
165 |
+
" <th>title</th>\n",
|
166 |
+
" <th>text</th>\n",
|
167 |
+
" <th>subject</th>\n",
|
168 |
+
" <th>date</th>\n",
|
169 |
+
" </tr>\n",
|
170 |
+
" </thead>\n",
|
171 |
+
" <tbody>\n",
|
172 |
+
" <tr>\n",
|
173 |
+
" <th>0</th>\n",
|
174 |
+
" <td>Donald Trump Sends Out Embarrassing New Year’...</td>\n",
|
175 |
+
" <td>Donald Trump just couldn t wish all Americans ...</td>\n",
|
176 |
+
" <td>News</td>\n",
|
177 |
+
" <td>December 31, 2017</td>\n",
|
178 |
+
" </tr>\n",
|
179 |
+
" <tr>\n",
|
180 |
+
" <th>1</th>\n",
|
181 |
+
" <td>Drunk Bragging Trump Staffer Started Russian ...</td>\n",
|
182 |
+
" <td>House Intelligence Committee Chairman Devin Nu...</td>\n",
|
183 |
+
" <td>News</td>\n",
|
184 |
+
" <td>December 31, 2017</td>\n",
|
185 |
+
" </tr>\n",
|
186 |
+
" <tr>\n",
|
187 |
+
" <th>2</th>\n",
|
188 |
+
" <td>Sheriff David Clarke Becomes An Internet Joke...</td>\n",
|
189 |
+
" <td>On Friday, it was revealed that former Milwauk...</td>\n",
|
190 |
+
" <td>News</td>\n",
|
191 |
+
" <td>December 30, 2017</td>\n",
|
192 |
+
" </tr>\n",
|
193 |
+
" <tr>\n",
|
194 |
+
" <th>3</th>\n",
|
195 |
+
" <td>Trump Is So Obsessed He Even Has Obama’s Name...</td>\n",
|
196 |
+
" <td>On Christmas day, Donald Trump announced that ...</td>\n",
|
197 |
+
" <td>News</td>\n",
|
198 |
+
" <td>December 29, 2017</td>\n",
|
199 |
+
" </tr>\n",
|
200 |
+
" <tr>\n",
|
201 |
+
" <th>4</th>\n",
|
202 |
+
" <td>Pope Francis Just Called Out Donald Trump Dur...</td>\n",
|
203 |
+
" <td>Pope Francis used his annual Christmas Day mes...</td>\n",
|
204 |
+
" <td>News</td>\n",
|
205 |
+
" <td>December 25, 2017</td>\n",
|
206 |
+
" </tr>\n",
|
207 |
+
" </tbody>\n",
|
208 |
+
"</table>\n",
|
209 |
+
"</div>\n",
|
210 |
+
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-afb2a3a4-f8e1-4dbb-9bb9-ad1bf6c5d39d')\"\n",
|
211 |
+
" title=\"Convert this dataframe to an interactive table.\"\n",
|
212 |
+
" style=\"display:none;\">\n",
|
213 |
+
" \n",
|
214 |
+
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
|
215 |
+
" width=\"24px\">\n",
|
216 |
+
" <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
|
217 |
+
" <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
|
218 |
+
" </svg>\n",
|
219 |
+
" </button>\n",
|
220 |
+
" \n",
|
221 |
+
" <style>\n",
|
222 |
+
" .colab-df-container {\n",
|
223 |
+
" display:flex;\n",
|
224 |
+
" flex-wrap:wrap;\n",
|
225 |
+
" gap: 12px;\n",
|
226 |
+
" }\n",
|
227 |
+
"\n",
|
228 |
+
" .colab-df-convert {\n",
|
229 |
+
" background-color: #E8F0FE;\n",
|
230 |
+
" border: none;\n",
|
231 |
+
" border-radius: 50%;\n",
|
232 |
+
" cursor: pointer;\n",
|
233 |
+
" display: none;\n",
|
234 |
+
" fill: #1967D2;\n",
|
235 |
+
" height: 32px;\n",
|
236 |
+
" padding: 0 0 0 0;\n",
|
237 |
+
" width: 32px;\n",
|
238 |
+
" }\n",
|
239 |
+
"\n",
|
240 |
+
" .colab-df-convert:hover {\n",
|
241 |
+
" background-color: #E2EBFA;\n",
|
242 |
+
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
|
243 |
+
" fill: #174EA6;\n",
|
244 |
+
" }\n",
|
245 |
+
"\n",
|
246 |
+
" [theme=dark] .colab-df-convert {\n",
|
247 |
+
" background-color: #3B4455;\n",
|
248 |
+
" fill: #D2E3FC;\n",
|
249 |
+
" }\n",
|
250 |
+
"\n",
|
251 |
+
" [theme=dark] .colab-df-convert:hover {\n",
|
252 |
+
" background-color: #434B5C;\n",
|
253 |
+
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
|
254 |
+
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
|
255 |
+
" fill: #FFFFFF;\n",
|
256 |
+
" }\n",
|
257 |
+
" </style>\n",
|
258 |
+
"\n",
|
259 |
+
" <script>\n",
|
260 |
+
" const buttonEl =\n",
|
261 |
+
" document.querySelector('#df-afb2a3a4-f8e1-4dbb-9bb9-ad1bf6c5d39d button.colab-df-convert');\n",
|
262 |
+
" buttonEl.style.display =\n",
|
263 |
+
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
|
264 |
+
"\n",
|
265 |
+
" async function convertToInteractive(key) {\n",
|
266 |
+
" const element = document.querySelector('#df-afb2a3a4-f8e1-4dbb-9bb9-ad1bf6c5d39d');\n",
|
267 |
+
" const dataTable =\n",
|
268 |
+
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
|
269 |
+
" [key], {});\n",
|
270 |
+
" if (!dataTable) return;\n",
|
271 |
+
"\n",
|
272 |
+
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
|
273 |
+
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
|
274 |
+
" + ' to learn more about interactive tables.';\n",
|
275 |
+
" element.innerHTML = '';\n",
|
276 |
+
" dataTable['output_type'] = 'display_data';\n",
|
277 |
+
" await google.colab.output.renderOutput(dataTable, element);\n",
|
278 |
+
" const docLink = document.createElement('div');\n",
|
279 |
+
" docLink.innerHTML = docLinkHtml;\n",
|
280 |
+
" element.appendChild(docLink);\n",
|
281 |
+
" }\n",
|
282 |
+
" </script>\n",
|
283 |
+
" </div>\n",
|
284 |
+
" </div>\n",
|
285 |
+
" "
|
286 |
+
]
|
287 |
+
},
|
288 |
+
"metadata": {},
|
289 |
+
"execution_count": 3
|
290 |
+
}
|
291 |
+
]
|
292 |
+
},
|
293 |
+
{
|
294 |
+
"cell_type": "markdown",
|
295 |
+
"source": [
|
296 |
+
"## Add labeling"
|
297 |
+
],
|
298 |
+
"metadata": {
|
299 |
+
"id": "ZghmfpC2SIVC"
|
300 |
+
}
|
301 |
+
},
|
302 |
+
{
|
303 |
+
"cell_type": "code",
|
304 |
+
"source": [
|
305 |
+
"fake_news['label'] = 0 \n",
|
306 |
+
"real_news['label'] = 1"
|
307 |
+
],
|
308 |
+
"metadata": {
|
309 |
+
"id": "rZ8pF-RtSJ6_"
|
310 |
+
},
|
311 |
+
"execution_count": 4,
|
312 |
+
"outputs": []
|
313 |
+
},
|
314 |
+
{
|
315 |
+
"cell_type": "code",
|
316 |
+
"source": [
|
317 |
+
"fake_news.head()"
|
318 |
+
],
|
319 |
+
"metadata": {
|
320 |
+
"colab": {
|
321 |
+
"base_uri": "https://localhost:8080/",
|
322 |
+
"height": 206
|
323 |
+
},
|
324 |
+
"id": "CR_yBlbRR6R4",
|
325 |
+
"outputId": "f2eff41d-8cfc-44cf-d68c-313cb692fb45"
|
326 |
+
},
|
327 |
+
"execution_count": 5,
|
328 |
+
"outputs": [
|
329 |
+
{
|
330 |
+
"output_type": "execute_result",
|
331 |
+
"data": {
|
332 |
+
"text/plain": [
|
333 |
+
" title \\\n",
|
334 |
+
"0 Donald Trump Sends Out Embarrassing New Year’... \n",
|
335 |
+
"1 Drunk Bragging Trump Staffer Started Russian ... \n",
|
336 |
+
"2 Sheriff David Clarke Becomes An Internet Joke... \n",
|
337 |
+
"3 Trump Is So Obsessed He Even Has Obama’s Name... \n",
|
338 |
+
"4 Pope Francis Just Called Out Donald Trump Dur... \n",
|
339 |
+
"\n",
|
340 |
+
" text subject \\\n",
|
341 |
+
"0 Donald Trump just couldn t wish all Americans ... News \n",
|
342 |
+
"1 House Intelligence Committee Chairman Devin Nu... News \n",
|
343 |
+
"2 On Friday, it was revealed that former Milwauk... News \n",
|
344 |
+
"3 On Christmas day, Donald Trump announced that ... News \n",
|
345 |
+
"4 Pope Francis used his annual Christmas Day mes... News \n",
|
346 |
+
"\n",
|
347 |
+
" date label \n",
|
348 |
+
"0 December 31, 2017 0 \n",
|
349 |
+
"1 December 31, 2017 0 \n",
|
350 |
+
"2 December 30, 2017 0 \n",
|
351 |
+
"3 December 29, 2017 0 \n",
|
352 |
+
"4 December 25, 2017 0 "
|
353 |
+
],
|
354 |
+
"text/html": [
|
355 |
+
"\n",
|
356 |
+
" <div id=\"df-1816b266-5128-4164-8426-5f4dfe2c9522\">\n",
|
357 |
+
" <div class=\"colab-df-container\">\n",
|
358 |
+
" <div>\n",
|
359 |
+
"<style scoped>\n",
|
360 |
+
" .dataframe tbody tr th:only-of-type {\n",
|
361 |
+
" vertical-align: middle;\n",
|
362 |
+
" }\n",
|
363 |
+
"\n",
|
364 |
+
" .dataframe tbody tr th {\n",
|
365 |
+
" vertical-align: top;\n",
|
366 |
+
" }\n",
|
367 |
+
"\n",
|
368 |
+
" .dataframe thead th {\n",
|
369 |
+
" text-align: right;\n",
|
370 |
+
" }\n",
|
371 |
+
"</style>\n",
|
372 |
+
"<table border=\"1\" class=\"dataframe\">\n",
|
373 |
+
" <thead>\n",
|
374 |
+
" <tr style=\"text-align: right;\">\n",
|
375 |
+
" <th></th>\n",
|
376 |
+
" <th>title</th>\n",
|
377 |
+
" <th>text</th>\n",
|
378 |
+
" <th>subject</th>\n",
|
379 |
+
" <th>date</th>\n",
|
380 |
+
" <th>label</th>\n",
|
381 |
+
" </tr>\n",
|
382 |
+
" </thead>\n",
|
383 |
+
" <tbody>\n",
|
384 |
+
" <tr>\n",
|
385 |
+
" <th>0</th>\n",
|
386 |
+
" <td>Donald Trump Sends Out Embarrassing New Year’...</td>\n",
|
387 |
+
" <td>Donald Trump just couldn t wish all Americans ...</td>\n",
|
388 |
+
" <td>News</td>\n",
|
389 |
+
" <td>December 31, 2017</td>\n",
|
390 |
+
" <td>0</td>\n",
|
391 |
+
" </tr>\n",
|
392 |
+
" <tr>\n",
|
393 |
+
" <th>1</th>\n",
|
394 |
+
" <td>Drunk Bragging Trump Staffer Started Russian ...</td>\n",
|
395 |
+
" <td>House Intelligence Committee Chairman Devin Nu...</td>\n",
|
396 |
+
" <td>News</td>\n",
|
397 |
+
" <td>December 31, 2017</td>\n",
|
398 |
+
" <td>0</td>\n",
|
399 |
+
" </tr>\n",
|
400 |
+
" <tr>\n",
|
401 |
+
" <th>2</th>\n",
|
402 |
+
" <td>Sheriff David Clarke Becomes An Internet Joke...</td>\n",
|
403 |
+
" <td>On Friday, it was revealed that former Milwauk...</td>\n",
|
404 |
+
" <td>News</td>\n",
|
405 |
+
" <td>December 30, 2017</td>\n",
|
406 |
+
" <td>0</td>\n",
|
407 |
+
" </tr>\n",
|
408 |
+
" <tr>\n",
|
409 |
+
" <th>3</th>\n",
|
410 |
+
" <td>Trump Is So Obsessed He Even Has Obama’s Name...</td>\n",
|
411 |
+
" <td>On Christmas day, Donald Trump announced that ...</td>\n",
|
412 |
+
" <td>News</td>\n",
|
413 |
+
" <td>December 29, 2017</td>\n",
|
414 |
+
" <td>0</td>\n",
|
415 |
+
" </tr>\n",
|
416 |
+
" <tr>\n",
|
417 |
+
" <th>4</th>\n",
|
418 |
+
" <td>Pope Francis Just Called Out Donald Trump Dur...</td>\n",
|
419 |
+
" <td>Pope Francis used his annual Christmas Day mes...</td>\n",
|
420 |
+
" <td>News</td>\n",
|
421 |
+
" <td>December 25, 2017</td>\n",
|
422 |
+
" <td>0</td>\n",
|
423 |
+
" </tr>\n",
|
424 |
+
" </tbody>\n",
|
425 |
+
"</table>\n",
|
426 |
+
"</div>\n",
|
427 |
+
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-1816b266-5128-4164-8426-5f4dfe2c9522')\"\n",
|
428 |
+
" title=\"Convert this dataframe to an interactive table.\"\n",
|
429 |
+
" style=\"display:none;\">\n",
|
430 |
+
" \n",
|
431 |
+
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
|
432 |
+
" width=\"24px\">\n",
|
433 |
+
" <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
|
434 |
+
" <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
|
435 |
+
" </svg>\n",
|
436 |
+
" </button>\n",
|
437 |
+
" \n",
|
438 |
+
" <style>\n",
|
439 |
+
" .colab-df-container {\n",
|
440 |
+
" display:flex;\n",
|
441 |
+
" flex-wrap:wrap;\n",
|
442 |
+
" gap: 12px;\n",
|
443 |
+
" }\n",
|
444 |
+
"\n",
|
445 |
+
" .colab-df-convert {\n",
|
446 |
+
" background-color: #E8F0FE;\n",
|
447 |
+
" border: none;\n",
|
448 |
+
" border-radius: 50%;\n",
|
449 |
+
" cursor: pointer;\n",
|
450 |
+
" display: none;\n",
|
451 |
+
" fill: #1967D2;\n",
|
452 |
+
" height: 32px;\n",
|
453 |
+
" padding: 0 0 0 0;\n",
|
454 |
+
" width: 32px;\n",
|
455 |
+
" }\n",
|
456 |
+
"\n",
|
457 |
+
" .colab-df-convert:hover {\n",
|
458 |
+
" background-color: #E2EBFA;\n",
|
459 |
+
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
|
460 |
+
" fill: #174EA6;\n",
|
461 |
+
" }\n",
|
462 |
+
"\n",
|
463 |
+
" [theme=dark] .colab-df-convert {\n",
|
464 |
+
" background-color: #3B4455;\n",
|
465 |
+
" fill: #D2E3FC;\n",
|
466 |
+
" }\n",
|
467 |
+
"\n",
|
468 |
+
" [theme=dark] .colab-df-convert:hover {\n",
|
469 |
+
" background-color: #434B5C;\n",
|
470 |
+
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
|
471 |
+
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
|
472 |
+
" fill: #FFFFFF;\n",
|
473 |
+
" }\n",
|
474 |
+
" </style>\n",
|
475 |
+
"\n",
|
476 |
+
" <script>\n",
|
477 |
+
" const buttonEl =\n",
|
478 |
+
" document.querySelector('#df-1816b266-5128-4164-8426-5f4dfe2c9522 button.colab-df-convert');\n",
|
479 |
+
" buttonEl.style.display =\n",
|
480 |
+
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
|
481 |
+
"\n",
|
482 |
+
" async function convertToInteractive(key) {\n",
|
483 |
+
" const element = document.querySelector('#df-1816b266-5128-4164-8426-5f4dfe2c9522');\n",
|
484 |
+
" const dataTable =\n",
|
485 |
+
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
|
486 |
+
" [key], {});\n",
|
487 |
+
" if (!dataTable) return;\n",
|
488 |
+
"\n",
|
489 |
+
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
|
490 |
+
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
|
491 |
+
" + ' to learn more about interactive tables.';\n",
|
492 |
+
" element.innerHTML = '';\n",
|
493 |
+
" dataTable['output_type'] = 'display_data';\n",
|
494 |
+
" await google.colab.output.renderOutput(dataTable, element);\n",
|
495 |
+
" const docLink = document.createElement('div');\n",
|
496 |
+
" docLink.innerHTML = docLinkHtml;\n",
|
497 |
+
" element.appendChild(docLink);\n",
|
498 |
+
" }\n",
|
499 |
+
" </script>\n",
|
500 |
+
" </div>\n",
|
501 |
+
" </div>\n",
|
502 |
+
" "
|
503 |
+
]
|
504 |
+
},
|
505 |
+
"metadata": {},
|
506 |
+
"execution_count": 5
|
507 |
+
}
|
508 |
+
]
|
509 |
+
},
|
510 |
+
{
|
511 |
+
"cell_type": "markdown",
|
512 |
+
"source": [
|
513 |
+
"## Combine Real & Fake News into one dataframe"
|
514 |
+
],
|
515 |
+
"metadata": {
|
516 |
+
"id": "ZB2C1ImfSUUg"
|
517 |
+
}
|
518 |
+
},
|
519 |
+
{
|
520 |
+
"cell_type": "code",
|
521 |
+
"source": [
|
522 |
+
"news = pd.concat([real_news,fake_news],axis=0,ignore_index=True)\n",
|
523 |
+
"news = news.sample(frac = 1).reset_index(drop = True)\n",
|
524 |
+
"news.head()"
|
525 |
+
],
|
526 |
+
"metadata": {
|
527 |
+
"colab": {
|
528 |
+
"base_uri": "https://localhost:8080/",
|
529 |
+
"height": 206
|
530 |
+
},
|
531 |
+
"id": "RTifEXcHSQJ0",
|
532 |
+
"outputId": "d2e996c9-9068-4cfb-dbe0-1fb84c4b0b2f"
|
533 |
+
},
|
534 |
+
"execution_count": 6,
|
535 |
+
"outputs": [
|
536 |
+
{
|
537 |
+
"output_type": "execute_result",
|
538 |
+
"data": {
|
539 |
+
"text/plain": [
|
540 |
+
" title \\\n",
|
541 |
+
"0 Trump’s Involvement In Houston Chemical Plant... \n",
|
542 |
+
"1 OOPS! Media Forgot Ted Kennedy Asked Russia To... \n",
|
543 |
+
"2 OBAMA GIVES FINAL THOUGHTS On Trump Presidency... \n",
|
544 |
+
"3 CNN ANCHOR DON LEMON: A Republican Winning in ... \n",
|
545 |
+
"4 Trump Confirms He Thinks GOP Healthcare Bill ... \n",
|
546 |
+
"\n",
|
547 |
+
" text subject \\\n",
|
548 |
+
"0 In the aftermath of the historic flooding that... News \n",
|
549 |
+
"1 In 1991 a reporter for the London Times found ... politics \n",
|
550 |
+
"2 The Obama family ended their eight-year reside... politics \n",
|
551 |
+
"3 CNN anchor Don Lemon got snarky during reporti... politics \n",
|
552 |
+
"4 Trump got into a bizarre pissing match with fo... News \n",
|
553 |
+
"\n",
|
554 |
+
" date label \n",
|
555 |
+
"0 September 1, 2017 0 \n",
|
556 |
+
"1 Feb 16, 2017 0 \n",
|
557 |
+
"2 Jan 20, 2017 0 \n",
|
558 |
+
"3 Jun 21, 2017 0 \n",
|
559 |
+
"4 June 25, 2017 0 "
|
560 |
+
],
|
561 |
+
"text/html": [
|
562 |
+
"\n",
|
563 |
+
" <div id=\"df-be529211-591d-4216-9d18-87f7badb81b3\">\n",
|
564 |
+
" <div class=\"colab-df-container\">\n",
|
565 |
+
" <div>\n",
|
566 |
+
"<style scoped>\n",
|
567 |
+
" .dataframe tbody tr th:only-of-type {\n",
|
568 |
+
" vertical-align: middle;\n",
|
569 |
+
" }\n",
|
570 |
+
"\n",
|
571 |
+
" .dataframe tbody tr th {\n",
|
572 |
+
" vertical-align: top;\n",
|
573 |
+
" }\n",
|
574 |
+
"\n",
|
575 |
+
" .dataframe thead th {\n",
|
576 |
+
" text-align: right;\n",
|
577 |
+
" }\n",
|
578 |
+
"</style>\n",
|
579 |
+
"<table border=\"1\" class=\"dataframe\">\n",
|
580 |
+
" <thead>\n",
|
581 |
+
" <tr style=\"text-align: right;\">\n",
|
582 |
+
" <th></th>\n",
|
583 |
+
" <th>title</th>\n",
|
584 |
+
" <th>text</th>\n",
|
585 |
+
" <th>subject</th>\n",
|
586 |
+
" <th>date</th>\n",
|
587 |
+
" <th>label</th>\n",
|
588 |
+
" </tr>\n",
|
589 |
+
" </thead>\n",
|
590 |
+
" <tbody>\n",
|
591 |
+
" <tr>\n",
|
592 |
+
" <th>0</th>\n",
|
593 |
+
" <td>Trump’s Involvement In Houston Chemical Plant...</td>\n",
|
594 |
+
" <td>In the aftermath of the historic flooding that...</td>\n",
|
595 |
+
" <td>News</td>\n",
|
596 |
+
" <td>September 1, 2017</td>\n",
|
597 |
+
" <td>0</td>\n",
|
598 |
+
" </tr>\n",
|
599 |
+
" <tr>\n",
|
600 |
+
" <th>1</th>\n",
|
601 |
+
" <td>OOPS! Media Forgot Ted Kennedy Asked Russia To...</td>\n",
|
602 |
+
" <td>In 1991 a reporter for the London Times found ...</td>\n",
|
603 |
+
" <td>politics</td>\n",
|
604 |
+
" <td>Feb 16, 2017</td>\n",
|
605 |
+
" <td>0</td>\n",
|
606 |
+
" </tr>\n",
|
607 |
+
" <tr>\n",
|
608 |
+
" <th>2</th>\n",
|
609 |
+
" <td>OBAMA GIVES FINAL THOUGHTS On Trump Presidency...</td>\n",
|
610 |
+
" <td>The Obama family ended their eight-year reside...</td>\n",
|
611 |
+
" <td>politics</td>\n",
|
612 |
+
" <td>Jan 20, 2017</td>\n",
|
613 |
+
" <td>0</td>\n",
|
614 |
+
" </tr>\n",
|
615 |
+
" <tr>\n",
|
616 |
+
" <th>3</th>\n",
|
617 |
+
" <td>CNN ANCHOR DON LEMON: A Republican Winning in ...</td>\n",
|
618 |
+
" <td>CNN anchor Don Lemon got snarky during reporti...</td>\n",
|
619 |
+
" <td>politics</td>\n",
|
620 |
+
" <td>Jun 21, 2017</td>\n",
|
621 |
+
" <td>0</td>\n",
|
622 |
+
" </tr>\n",
|
623 |
+
" <tr>\n",
|
624 |
+
" <th>4</th>\n",
|
625 |
+
" <td>Trump Confirms He Thinks GOP Healthcare Bill ...</td>\n",
|
626 |
+
" <td>Trump got into a bizarre pissing match with fo...</td>\n",
|
627 |
+
" <td>News</td>\n",
|
628 |
+
" <td>June 25, 2017</td>\n",
|
629 |
+
" <td>0</td>\n",
|
630 |
+
" </tr>\n",
|
631 |
+
" </tbody>\n",
|
632 |
+
"</table>\n",
|
633 |
+
"</div>\n",
|
634 |
+
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-be529211-591d-4216-9d18-87f7badb81b3')\"\n",
|
635 |
+
" title=\"Convert this dataframe to an interactive table.\"\n",
|
636 |
+
" style=\"display:none;\">\n",
|
637 |
+
" \n",
|
638 |
+
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
|
639 |
+
" width=\"24px\">\n",
|
640 |
+
" <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
|
641 |
+
" <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
|
642 |
+
" </svg>\n",
|
643 |
+
" </button>\n",
|
644 |
+
" \n",
|
645 |
+
" <style>\n",
|
646 |
+
" .colab-df-container {\n",
|
647 |
+
" display:flex;\n",
|
648 |
+
" flex-wrap:wrap;\n",
|
649 |
+
" gap: 12px;\n",
|
650 |
+
" }\n",
|
651 |
+
"\n",
|
652 |
+
" .colab-df-convert {\n",
|
653 |
+
" background-color: #E8F0FE;\n",
|
654 |
+
" border: none;\n",
|
655 |
+
" border-radius: 50%;\n",
|
656 |
+
" cursor: pointer;\n",
|
657 |
+
" display: none;\n",
|
658 |
+
" fill: #1967D2;\n",
|
659 |
+
" height: 32px;\n",
|
660 |
+
" padding: 0 0 0 0;\n",
|
661 |
+
" width: 32px;\n",
|
662 |
+
" }\n",
|
663 |
+
"\n",
|
664 |
+
" .colab-df-convert:hover {\n",
|
665 |
+
" background-color: #E2EBFA;\n",
|
666 |
+
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
|
667 |
+
" fill: #174EA6;\n",
|
668 |
+
" }\n",
|
669 |
+
"\n",
|
670 |
+
" [theme=dark] .colab-df-convert {\n",
|
671 |
+
" background-color: #3B4455;\n",
|
672 |
+
" fill: #D2E3FC;\n",
|
673 |
+
" }\n",
|
674 |
+
"\n",
|
675 |
+
" [theme=dark] .colab-df-convert:hover {\n",
|
676 |
+
" background-color: #434B5C;\n",
|
677 |
+
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
|
678 |
+
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
|
679 |
+
" fill: #FFFFFF;\n",
|
680 |
+
" }\n",
|
681 |
+
" </style>\n",
|
682 |
+
"\n",
|
683 |
+
" <script>\n",
|
684 |
+
" const buttonEl =\n",
|
685 |
+
" document.querySelector('#df-be529211-591d-4216-9d18-87f7badb81b3 button.colab-df-convert');\n",
|
686 |
+
" buttonEl.style.display =\n",
|
687 |
+
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
|
688 |
+
"\n",
|
689 |
+
" async function convertToInteractive(key) {\n",
|
690 |
+
" const element = document.querySelector('#df-be529211-591d-4216-9d18-87f7badb81b3');\n",
|
691 |
+
" const dataTable =\n",
|
692 |
+
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
|
693 |
+
" [key], {});\n",
|
694 |
+
" if (!dataTable) return;\n",
|
695 |
+
"\n",
|
696 |
+
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
|
697 |
+
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
|
698 |
+
" + ' to learn more about interactive tables.';\n",
|
699 |
+
" element.innerHTML = '';\n",
|
700 |
+
" dataTable['output_type'] = 'display_data';\n",
|
701 |
+
" await google.colab.output.renderOutput(dataTable, element);\n",
|
702 |
+
" const docLink = document.createElement('div');\n",
|
703 |
+
" docLink.innerHTML = docLinkHtml;\n",
|
704 |
+
" element.appendChild(docLink);\n",
|
705 |
+
" }\n",
|
706 |
+
" </script>\n",
|
707 |
+
" </div>\n",
|
708 |
+
" </div>\n",
|
709 |
+
" "
|
710 |
+
]
|
711 |
+
},
|
712 |
+
"metadata": {},
|
713 |
+
"execution_count": 6
|
714 |
+
}
|
715 |
+
]
|
716 |
+
},
|
717 |
+
{
|
718 |
+
"cell_type": "code",
|
719 |
+
"source": [
|
720 |
+
"news['combine'] = news['title'] + ' ' + news['text']\n",
|
721 |
+
"news.head()"
|
722 |
+
],
|
723 |
+
"metadata": {
|
724 |
+
"colab": {
|
725 |
+
"base_uri": "https://localhost:8080/",
|
726 |
+
"height": 337
|
727 |
+
},
|
728 |
+
"id": "N7QZ7Zk5VvDk",
|
729 |
+
"outputId": "1abb083b-33d5-4e82-a14f-7bd943231d9e"
|
730 |
+
},
|
731 |
+
"execution_count": 7,
|
732 |
+
"outputs": [
|
733 |
+
{
|
734 |
+
"output_type": "execute_result",
|
735 |
+
"data": {
|
736 |
+
"text/plain": [
|
737 |
+
" title \\\n",
|
738 |
+
"0 Trump’s Involvement In Houston Chemical Plant... \n",
|
739 |
+
"1 OOPS! Media Forgot Ted Kennedy Asked Russia To... \n",
|
740 |
+
"2 OBAMA GIVES FINAL THOUGHTS On Trump Presidency... \n",
|
741 |
+
"3 CNN ANCHOR DON LEMON: A Republican Winning in ... \n",
|
742 |
+
"4 Trump Confirms He Thinks GOP Healthcare Bill ... \n",
|
743 |
+
"\n",
|
744 |
+
" text subject \\\n",
|
745 |
+
"0 In the aftermath of the historic flooding that... News \n",
|
746 |
+
"1 In 1991 a reporter for the London Times found ... politics \n",
|
747 |
+
"2 The Obama family ended their eight-year reside... politics \n",
|
748 |
+
"3 CNN anchor Don Lemon got snarky during reporti... politics \n",
|
749 |
+
"4 Trump got into a bizarre pissing match with fo... News \n",
|
750 |
+
"\n",
|
751 |
+
" date label combine \n",
|
752 |
+
"0 September 1, 2017 0 Trump’s Involvement In Houston Chemical Plant... \n",
|
753 |
+
"1 Feb 16, 2017 0 OOPS! Media Forgot Ted Kennedy Asked Russia To... \n",
|
754 |
+
"2 Jan 20, 2017 0 OBAMA GIVES FINAL THOUGHTS On Trump Presidency... \n",
|
755 |
+
"3 Jun 21, 2017 0 CNN ANCHOR DON LEMON: A Republican Winning in ... \n",
|
756 |
+
"4 June 25, 2017 0 Trump Confirms He Thinks GOP Healthcare Bill ... "
|
757 |
+
],
|
758 |
+
"text/html": [
|
759 |
+
"\n",
|
760 |
+
" <div id=\"df-1f1e70e6-bf75-4424-8a1d-f4ceecb435e7\">\n",
|
761 |
+
" <div class=\"colab-df-container\">\n",
|
762 |
+
" <div>\n",
|
763 |
+
"<style scoped>\n",
|
764 |
+
" .dataframe tbody tr th:only-of-type {\n",
|
765 |
+
" vertical-align: middle;\n",
|
766 |
+
" }\n",
|
767 |
+
"\n",
|
768 |
+
" .dataframe tbody tr th {\n",
|
769 |
+
" vertical-align: top;\n",
|
770 |
+
" }\n",
|
771 |
+
"\n",
|
772 |
+
" .dataframe thead th {\n",
|
773 |
+
" text-align: right;\n",
|
774 |
+
" }\n",
|
775 |
+
"</style>\n",
|
776 |
+
"<table border=\"1\" class=\"dataframe\">\n",
|
777 |
+
" <thead>\n",
|
778 |
+
" <tr style=\"text-align: right;\">\n",
|
779 |
+
" <th></th>\n",
|
780 |
+
" <th>title</th>\n",
|
781 |
+
" <th>text</th>\n",
|
782 |
+
" <th>subject</th>\n",
|
783 |
+
" <th>date</th>\n",
|
784 |
+
" <th>label</th>\n",
|
785 |
+
" <th>combine</th>\n",
|
786 |
+
" </tr>\n",
|
787 |
+
" </thead>\n",
|
788 |
+
" <tbody>\n",
|
789 |
+
" <tr>\n",
|
790 |
+
" <th>0</th>\n",
|
791 |
+
" <td>Trump’s Involvement In Houston Chemical Plant...</td>\n",
|
792 |
+
" <td>In the aftermath of the historic flooding that...</td>\n",
|
793 |
+
" <td>News</td>\n",
|
794 |
+
" <td>September 1, 2017</td>\n",
|
795 |
+
" <td>0</td>\n",
|
796 |
+
" <td>Trump’s Involvement In Houston Chemical Plant...</td>\n",
|
797 |
+
" </tr>\n",
|
798 |
+
" <tr>\n",
|
799 |
+
" <th>1</th>\n",
|
800 |
+
" <td>OOPS! Media Forgot Ted Kennedy Asked Russia To...</td>\n",
|
801 |
+
" <td>In 1991 a reporter for the London Times found ...</td>\n",
|
802 |
+
" <td>politics</td>\n",
|
803 |
+
" <td>Feb 16, 2017</td>\n",
|
804 |
+
" <td>0</td>\n",
|
805 |
+
" <td>OOPS! Media Forgot Ted Kennedy Asked Russia To...</td>\n",
|
806 |
+
" </tr>\n",
|
807 |
+
" <tr>\n",
|
808 |
+
" <th>2</th>\n",
|
809 |
+
" <td>OBAMA GIVES FINAL THOUGHTS On Trump Presidency...</td>\n",
|
810 |
+
" <td>The Obama family ended their eight-year reside...</td>\n",
|
811 |
+
" <td>politics</td>\n",
|
812 |
+
" <td>Jan 20, 2017</td>\n",
|
813 |
+
" <td>0</td>\n",
|
814 |
+
" <td>OBAMA GIVES FINAL THOUGHTS On Trump Presidency...</td>\n",
|
815 |
+
" </tr>\n",
|
816 |
+
" <tr>\n",
|
817 |
+
" <th>3</th>\n",
|
818 |
+
" <td>CNN ANCHOR DON LEMON: A Republican Winning in ...</td>\n",
|
819 |
+
" <td>CNN anchor Don Lemon got snarky during reporti...</td>\n",
|
820 |
+
" <td>politics</td>\n",
|
821 |
+
" <td>Jun 21, 2017</td>\n",
|
822 |
+
" <td>0</td>\n",
|
823 |
+
" <td>CNN ANCHOR DON LEMON: A Republican Winning in ...</td>\n",
|
824 |
+
" </tr>\n",
|
825 |
+
" <tr>\n",
|
826 |
+
" <th>4</th>\n",
|
827 |
+
" <td>Trump Confirms He Thinks GOP Healthcare Bill ...</td>\n",
|
828 |
+
" <td>Trump got into a bizarre pissing match with fo...</td>\n",
|
829 |
+
" <td>News</td>\n",
|
830 |
+
" <td>June 25, 2017</td>\n",
|
831 |
+
" <td>0</td>\n",
|
832 |
+
" <td>Trump Confirms He Thinks GOP Healthcare Bill ...</td>\n",
|
833 |
+
" </tr>\n",
|
834 |
+
" </tbody>\n",
|
835 |
+
"</table>\n",
|
836 |
+
"</div>\n",
|
910 |
+
" </div>\n",
|
911 |
+
" </div>\n",
|
912 |
+
" "
|
913 |
+
]
|
914 |
+
},
|
915 |
+
"metadata": {},
|
916 |
+
"execution_count": 7
|
917 |
+
}
|
918 |
+
]
|
919 |
+
},
|
920 |
+
{
|
921 |
+
"cell_type": "markdown",
|
922 |
+
"source": [
|
923 |
+
"## TF-IDF Vectorization"
|
924 |
+
],
|
925 |
+
"metadata": {
|
926 |
+
"id": "_q-ySMGeThhL"
|
927 |
+
}
|
928 |
+
},
|
929 |
+
{
|
930 |
+
"cell_type": "code",
|
931 |
+
"source": [
|
932 |
+
"import nltk\n",
|
933 |
+
"from nltk import word_tokenize\n",
|
934 |
+
"from nltk.stem import SnowballStemmer\n",
|
935 |
+
"from nltk.corpus import stopwords \n",
|
936 |
+
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
|
937 |
+
"nltk.download('punkt')\n",
|
938 |
+
"\n",
|
939 |
+
"# Tokenizing\n",
|
940 |
+
"news['combine'] = news['combine'].apply(lambda x: word_tokenize(str(x)))"
|
941 |
+
],
|
942 |
+
"metadata": {
|
943 |
+
"colab": {
|
944 |
+
"base_uri": "https://localhost:8080/"
|
945 |
+
},
|
946 |
+
"id": "oLbvalr4Tort",
|
947 |
+
"outputId": "bd3a981d-2e13-46b3-e7df-fe9fc28a867d"
|
948 |
+
},
|
949 |
+
"execution_count": 8,
|
950 |
+
"outputs": [
|
951 |
+
{
|
952 |
+
"output_type": "stream",
|
953 |
+
"name": "stdout",
|
954 |
+
"text": [
|
955 |
+
"[nltk_data] Downloading package punkt to /root/nltk_data...\n",
|
956 |
+
"[nltk_data] Package punkt is already up-to-date!\n"
|
957 |
+
]
|
958 |
+
}
|
959 |
+
]
|
960 |
+
},
|
961 |
+
{
|
962 |
+
"cell_type": "code",
|
963 |
+
"source": [
|
964 |
+
"# Stemming\n",
|
965 |
+
"snowball = SnowballStemmer(language='english')\n",
|
966 |
+
"news['combine'] = news['combine'].apply(lambda x: [snowball.stem(y) for y in x])"
|
967 |
+
],
|
968 |
+
"metadata": {
|
969 |
+
"id": "KedYPGIFTsyz"
|
970 |
+
},
|
971 |
+
"execution_count": 9,
|
972 |
+
"outputs": []
|
973 |
+
},
|
974 |
+
{
|
975 |
+
"cell_type": "code",
|
976 |
+
"source": [
|
977 |
+
"news['combine'] = news['combine'].apply(lambda x: ' '.join(x))"
|
978 |
+
],
|
979 |
+
"metadata": {
|
980 |
+
"id": "QvVafpZVT5BJ"
|
981 |
+
},
|
982 |
+
"execution_count": 10,
|
983 |
+
"outputs": []
|
984 |
+
},
|
985 |
+
{
|
986 |
+
"cell_type": "code",
|
987 |
+
"source": [
|
988 |
+
"tfidf = TfidfVectorizer()\n",
|
989 |
+
"X_text = tfidf.fit_transform(news['combine'])"
|
990 |
+
],
|
991 |
+
"metadata": {
|
992 |
+
"id": "pMT8lagAT-hJ"
|
993 |
+
},
|
994 |
+
"execution_count": 11,
|
995 |
+
"outputs": []
|
996 |
+
},
|
997 |
+
{
|
998 |
+
"cell_type": "code",
|
999 |
+
"source": [
|
1000 |
+
"from sklearn.model_selection import train_test_split\n",
|
1001 |
+
"X_train, X_test, y_train, y_test = train_test_split(X_text, news['label'], test_size=0.3, random_state=1)"
|
1002 |
+
],
|
1003 |
+
"metadata": {
|
1004 |
+
"id": "RJ3rBOD7Sli2"
|
1005 |
+
},
|
1006 |
+
"execution_count": 29,
|
1007 |
+
"outputs": []
|
1008 |
+
},
|
1009 |
+
{
|
1010 |
+
"cell_type": "markdown",
|
1011 |
+
"source": [
|
1012 |
+
"## Data Modeling - Support Vector Machine"
|
1013 |
+
],
|
1014 |
+
"metadata": {
|
1015 |
+
"id": "vBd9mLdpWcg8"
|
1016 |
+
}
|
1017 |
+
},
|
1018 |
+
{
|
1019 |
+
"cell_type": "code",
|
1020 |
+
"source": [
|
1021 |
+
"from sklearn.svm import LinearSVC\n",
|
1022 |
+
"from sklearn.model_selection import cross_val_score\n",
|
1023 |
+
"from sklearn.metrics import accuracy_score\n",
|
1024 |
+
"from sklearn.metrics import confusion_matrix\n",
|
1025 |
+
"from sklearn.metrics import roc_auc_score\n",
|
1026 |
+
"\n",
|
1027 |
+
"clf = LinearSVC(max_iter=100, C=1.0)\n",
|
1028 |
+
"clf.fit(X_train, y_train)\n",
|
1029 |
+
"\n",
|
1030 |
+
"y_pred = clf.predict(X_test)\n",
|
1031 |
+
"print(\"Cross validation score:\")\n",
|
1032 |
+
"print(cross_val_score(clf, X_text, news['label'], cv=3))\n",
|
1033 |
+
"\n",
|
1034 |
+
"print(\"\\nAccuracy:\")\n",
|
1035 |
+
"print(accuracy_score(y_test, y_pred))\n",
|
1036 |
+
"\n",
|
1037 |
+
"print(\"\\nConfusion Matrix:\")\n",
|
1038 |
+
"print(confusion_matrix(y_pred, y_test))\n",
|
1039 |
+
"\n",
|
1040 |
+
"print(\"\\nROC AUC:\")\n",
|
1041 |
+
"print(roc_auc_score(y_pred, y_test))\n"
|
1042 |
+
],
|
1043 |
+
"metadata": {
|
1044 |
+
"colab": {
|
1045 |
+
"base_uri": "https://localhost:8080/"
|
1046 |
+
},
|
1047 |
+
"id": "ViT1BRrBWgTi",
|
1048 |
+
"outputId": "4ad7ab1b-faeb-4199-a4ad-36c7a2b506cf"
|
1049 |
+
},
|
1050 |
+
"execution_count": 30,
|
1051 |
+
"outputs": [
|
1052 |
+
{
|
1053 |
+
"output_type": "stream",
|
1054 |
+
"name": "stdout",
|
1055 |
+
"text": [
|
1056 |
+
"Cross validation score:\n",
|
1057 |
+
"[0.99566486 0.99433097 0.99516465]\n",
|
1058 |
+
"\n",
|
1059 |
+
"Accuracy:\n",
|
1060 |
+
"0.9954612819562801\n",
|
1061 |
+
"\n",
|
1062 |
+
"Confusion Matrix:\n",
|
1063 |
+
"[[4416 19]\n",
|
1064 |
+
" [ 30 6331]]\n",
|
1065 |
+
"\n",
|
1066 |
+
"ROC AUC:\n",
|
1067 |
+
"0.9954998283473117\n"
|
1068 |
+
]
|
1069 |
+
}
|
1070 |
+
]
|
1071 |
+
},
|
1072 |
+
{
|
1073 |
+
"cell_type": "markdown",
|
1074 |
+
"source": [
|
1075 |
+
"## Find the misclassified articles"
|
1076 |
+
],
|
1077 |
+
"metadata": {
|
1078 |
+
"id": "Gkb2PvDx7YY6"
|
1079 |
+
}
|
1080 |
+
},
|
1081 |
+
{
|
1082 |
+
"cell_type": "code",
|
1083 |
+
"source": [
|
1084 |
+
"y_test_1 = np.asarray(y_test)\n",
|
1085 |
+
"misclassified = np.where(y_test_1 != clf.predict(X_test))\n",
|
1086 |
+
"misclassified"
|
1087 |
+
],
|
1088 |
+
"metadata": {
|
1089 |
+
"colab": {
|
1090 |
+
"base_uri": "https://localhost:8080/"
|
1091 |
+
},
|
1092 |
+
"id": "HoAm7-MQ67e0",
|
1093 |
+
"outputId": "39cc09df-b7c1-4a12-b708-debb97a5493e"
|
1094 |
+
},
|
1095 |
+
"execution_count": 32,
|
1096 |
+
"outputs": [
|
1097 |
+
{
|
1098 |
+
"output_type": "execute_result",
|
1099 |
+
"data": {
|
1100 |
+
"text/plain": [
|
1101 |
+
"(array([ 479, 875, 900, 964, 1115, 1332, 1808, 2002, 2008,\n",
|
1102 |
+
" 2364, 2495, 2811, 3009, 3332, 4407, 4495, 4633, 4636,\n",
|
1103 |
+
" 4680, 4864, 4934, 5376, 5426, 5519, 5764, 6018, 6021,\n",
|
1104 |
+
" 6046, 6202, 6223, 6267, 6537, 6744, 6832, 6938, 7042,\n",
|
1105 |
+
" 7305, 7572, 7798, 7986, 8645, 8970, 9176, 9440, 9653,\n",
|
1106 |
+
" 10068, 10122, 10229, 10283]),)"
|
1107 |
+
]
|
1108 |
+
},
|
1109 |
+
"metadata": {},
|
1110 |
+
"execution_count": 32
|
1111 |
+
}
|
1112 |
+
]
|
1113 |
+
},
|
1114 |
+
{
|
1115 |
+
"cell_type": "code",
|
1116 |
+
"source": [
|
1117 |
+
"news.iloc[479]"
|
1118 |
+
],
|
1119 |
+
"metadata": {
|
1120 |
+
"colab": {
|
1121 |
+
"base_uri": "https://localhost:8080/"
|
1122 |
+
},
|
1123 |
+
"id": "Mpfzn-PQ7J1c",
|
1124 |
+
"outputId": "fe0e4277-03ba-4024-dbd3-321915e7d9e1"
|
1125 |
+
},
|
1126 |
+
"execution_count": 33,
|
1127 |
+
"outputs": [
|
1128 |
+
{
|
1129 |
+
"output_type": "execute_result",
|
1130 |
+
"data": {
|
1131 |
+
"text/plain": [
|
1132 |
+
"title Russia hits Islamic State with bomb raids, mis...\n",
|
1133 |
+
"text MOSCOW (Reuters) - Russia has carried out 18 b...\n",
|
1134 |
+
"subject worldnews\n",
|
1135 |
+
"date November 3, 2017 \n",
|
1136 |
+
"label 1\n",
|
1137 |
+
"combine russia hit islam state with bomb raid , missil...\n",
|
1138 |
+
"Name: 479, dtype: object"
|
1139 |
+
]
|
1140 |
+
},
|
1141 |
+
"metadata": {},
|
1142 |
+
"execution_count": 33
|
1143 |
+
}
|
1144 |
+
]
|
1145 |
+
},
|
1146 |
+
{
|
1147 |
+
"cell_type": "code",
|
1148 |
+
"source": [
|
1149 |
+
"y_pred[479]"
|
1150 |
+
],
|
1151 |
+
"metadata": {
|
1152 |
+
"colab": {
|
1153 |
+
"base_uri": "https://localhost:8080/"
|
1154 |
+
},
|
1155 |
+
"id": "i8i9FUnw7PkQ",
|
1156 |
+
"outputId": "1730ceba-1db0-41a3-e2f4-b3608d1dfd18"
|
1157 |
+
},
|
1158 |
+
"execution_count": 36,
|
1159 |
+
"outputs": [
|
1160 |
+
{
|
1161 |
+
"output_type": "execute_result",
|
1162 |
+
"data": {
|
1163 |
+
"text/plain": [
|
1164 |
+
"0"
|
1165 |
+
]
|
1166 |
+
},
|
1167 |
+
"metadata": {},
|
1168 |
+
"execution_count": 36
|
1169 |
+
}
|
1170 |
+
]
|
1171 |
+
},
|
1172 |
+
{
|
1173 |
+
"cell_type": "code",
|
1174 |
+
"source": [
|
1175 |
+
"news.iloc[875]"
|
1176 |
+
],
|
1177 |
+
"metadata": {
|
1178 |
+
"colab": {
|
1179 |
+
"base_uri": "https://localhost:8080/"
|
1180 |
+
},
|
1181 |
+
"id": "ha2q7JvE7dcw",
|
1182 |
+
"outputId": "d29668d1-9173-4fec-9b96-8d0139081486"
|
1183 |
+
},
|
1184 |
+
"execution_count": 37,
|
1185 |
+
"outputs": [
|
1186 |
+
{
|
1187 |
+
"output_type": "execute_result",
|
1188 |
+
"data": {
|
1189 |
+
"text/plain": [
|
1190 |
+
"title Trump Gets HUMILIATED After Whining About Sta...\n",
|
1191 |
+
"text Donald Trump threw a temper tantrum on Saturda...\n",
|
1192 |
+
"subject News\n",
|
1193 |
+
"date July 1, 2017\n",
|
1194 |
+
"label 0\n",
|
1195 |
+
"combine trump get humili after whine about state refus...\n",
|
1196 |
+
"Name: 875, dtype: object"
|
1197 |
+
]
|
1198 |
+
},
|
1199 |
+
"metadata": {},
|
1200 |
+
"execution_count": 37
|
1201 |
+
}
|
1202 |
+
]
|
1203 |
+
},
|
1204 |
+
{
|
1205 |
+
"cell_type": "code",
|
1206 |
+
"source": [
|
1207 |
+
"y_pred[875]"
|
1208 |
+
],
|
1209 |
+
"metadata": {
|
1210 |
+
"colab": {
|
1211 |
+
"base_uri": "https://localhost:8080/"
|
1212 |
+
},
|
1213 |
+
"id": "XZSWP-zk7gNk",
|
1214 |
+
"outputId": "51c10cdb-6729-4dda-88b9-2d28fd9919b1"
|
1215 |
+
},
|
1216 |
+
"execution_count": 38,
|
1217 |
+
"outputs": [
|
1218 |
+
{
|
1219 |
+
"output_type": "execute_result",
|
1220 |
+
"data": {
|
1221 |
+
"text/plain": [
|
1222 |
+
"1"
|
1223 |
+
]
|
1224 |
+
},
|
1225 |
+
"metadata": {},
|
1226 |
+
"execution_count": 38
|
1227 |
+
}
|
1228 |
+
]
|
1229 |
+
},
|
1230 |
+
{
|
1231 |
+
"cell_type": "markdown",
|
1232 |
+
"source": [
|
1233 |
+
"## Optional Code"
|
1234 |
+
],
|
1235 |
+
"metadata": {
|
1236 |
+
"id": "0OpbUEpq4IpV"
|
1237 |
+
}
|
1238 |
+
},
|
1239 |
+
{
|
1240 |
+
"cell_type": "code",
|
1241 |
+
"source": [
|
1242 |
+
"# news_out = pd.merge(news,y_test,how = 'left',left_index = True, right_index = True)\n",
|
1243 |
+
"# temp = news_out[~(news_out[['label_y']].isnull().any(axis=1))]\n",
|
1244 |
+
"# temp.loc[(temp['label_x'] == temp['label_y'])]"
|
1245 |
+
],
|
1246 |
+
"metadata": {
|
1247 |
+
"id": "_pwDcqx1qOby"
|
1248 |
+
},
|
1249 |
+
"execution_count": 14,
|
1250 |
+
"outputs": []
|
1251 |
+
},
|
1252 |
+
{
|
1253 |
+
"cell_type": "markdown",
|
1254 |
+
"source": [
|
1255 |
+
"**Write up**: \n",
|
1256 |
+
"* Link to the model on Hugging Face Hub: \n",
|
1257 |
+
"* Include some examples of misclassified news articles. Please explain what you might do to improve your model's performance on these news articles in the future (you do not need to implement these suggestions)"
|
1258 |
+
],
|
1259 |
+
"metadata": {
|
1260 |
+
"id": "kpInVUMLyJ24"
|
1261 |
+
}
|
1262 |
+
}
|
1263 |
+
]
|
1264 |
+
}
|
README.md
ADDED
@@ -0,0 +1,15 @@
|
1 |
+
## Coding Challenge - Deep Learning for NLP (Foong)
|
2 |
+
|
3 |
+
### Description:
|
4 |
+
This repository contains a notebook that uses a scikit-learn SVM (`LinearSVC`) on TF-IDF features to classify real and fake news.
|
5 |
+
|
6 |
+
Dataset: https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset
|
7 |
+
Libraries used: Scikit-learn, NLTK, pandas, numpy, csv
|
8 |
+
|
9 |
+
### Write-up:
|
10 |
+
The model reaches an accuracy of 0.995 on the held-out test set (30% of the data).
|
11 |
+
|
12 |
+
A few news articles are misclassified (49 of the 10,796 test articles). Some suggestions for improving the model's performance on them:
|
13 |
+
- Remove stop words: the article titles and text contain many very common words that should not be used as features, so more data cleaning should be performed before model building.
|
14 |
+
- Try a neural network, tuning the batch size, applying dropout, and fine-tuning it
|
15 |
+
- Run cross-validation
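
The stop-word and cross-validation suggestions above can be sketched together as a single scikit-learn pipeline. This is only a minimal illustration, not the notebook's code: the `texts` and `labels` below are made-up stand-ins for the `combine` and `label` columns (assuming 0 = fake and 1 = real, as the notebook output suggests), and `stop_words="english"` relies on scikit-learn's built-in English list, which is an assumption about what "stop words" should mean here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data; in the notebook this would be
# news['combine'] and news['label'].
texts = [
    "SHOCKING: you will not believe what the senator did",
    "OOPS! media forgot the scandal that everyone is hiding",
    "WATCH: anchor gets humiliated after whining on live tv",
    "WASHINGTON (Reuters) - the senate passed the budget bill on tuesday",
    "MOSCOW (Reuters) - the ministry said talks would resume next week",
    "LONDON (Reuters) - the central bank held interest rates steady",
]
labels = [0, 0, 0, 1, 1, 1]

# The vectorizer drops common English words before weighting terms.
# Keeping it inside the pipeline means each cross-validation fold
# fits TF-IDF on its own training portion only.
pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LinearSVC(C=1.0),
)

# 3-fold cross-validation, matching the cv=3 used in the notebook.
scores = cross_val_score(pipeline, texts, labels, cv=3)
print(scores.mean())
```

Whether stop-word removal actually helps on the misclassified articles would need to be checked on the real data; comparing the cross-validated scores with and without `stop_words` is one way to find out.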
|
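
When collecting examples of misclassified articles, it may help to keep in mind that positions returned by `np.where` on the test predictions refer to rows of the test split, not rows of the full `news` frame. The sketch below shows one way to translate them back; the small `news` frame, the chosen test rows, and the predictions are all hypothetical stand-ins.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the notebook's `news` frame.
news = pd.DataFrame({
    "title": ["a", "b", "c", "d", "e", "f"],
    "label": [0, 1, 0, 1, 0, 1],
})

# Pretend these rows ended up in the test split, in shuffled order,
# the way train_test_split would leave them.
y_test = news["label"].loc[[4, 1, 3]]
y_pred = np.array([0, 0, 1])  # made-up predictions; the second is wrong

# Positions within the test split where label and prediction disagree.
wrong_pos = np.where(y_test.to_numpy() != y_pred)[0]

# Translate split positions into original index labels before lookup.
wrong_idx = y_test.index[wrong_pos]
misclassified = news.loc[wrong_idx].assign(predicted=y_pred[wrong_pos])
print(misclassified)
```

Looking rows up with `news.loc[...]` on the recovered index labels, rather than `news.iloc[...]` on split positions, keeps each displayed article paired with its own prediction.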