{ "cells": [ { "metadata": {}, "cell_type": "markdown", "source": [ "# Homework 2 - DH-500\n", "## Analysing Twitter " ], "id": "6eee1bc06b03d39b" }, { "cell_type": "code", "id": "initial_id", "metadata": { "collapsed": true, "ExecuteTime": { "end_time": "2024-04-14T19:00:44.730100Z", "start_time": "2024-04-14T19:00:44.726958Z" } }, "source": [ "student, sciper = 'Carlos Alberto Vargas Rivera', '384891'\n", "print('Author:', f'{student} ({sciper})')" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Author: Carlos Alberto Vargas Rivera (384891)\n" ] } ], "execution_count": 268 }, { "metadata": {}, "cell_type": "markdown", "source": "### Q.1-2 - Media and Influencers that tweet about US Politics", "id": "e0890030a56f3933" }, { "metadata": { "ExecuteTime": { "end_time": "2024-04-14T18:42:04.917981Z", "start_time": "2024-04-14T18:42:04.912679Z" } }, "cell_type": "code", "source": [ "media_sources = [\n", " # 'people',\n", " 'usatoday',\n", " 'nytimes',\n", " # 'yahoo',\n", " # 'cnn',\n", " # 'washingtonpost',\n", " # 'nypost',\n", " # 'foxnews',\n", " # 'reuters',\n", " # 'newsweek',\n", " # 'bloomberg',\n", " 'theguardian',\n", " 'businessinsider',\n", " 'abcnews',\n", " # 'politico',\n", " 'WhiteHouse',\n", " # 'cnnbrk',\n", " 'cnni',\n", " 'usnews',\n", " 'TheLastofUsNews'\n", " 'USNewsMoney',\n", " 'CNNPolitics',\n", " 'foxnewspolitics',\n", " 'usbank',\n", " 'googlenews',\n", " 'detroitnews',\n", " 'GuardianUS',\n", " 'NewsWire_US',\n", " 'MirrorUSNews',\n", " 'ExpressUSNews',\n", " 'Interior',\n", " 'BuzzFeedNews'\n", "]\n", "print(len(media_sources))" ], "id": "16bff05215606233", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "20\n" ] } ], "execution_count": 221 }, { "metadata": { "ExecuteTime": { "end_time": "2024-04-14T18:42:04.933383Z", "start_time": "2024-04-14T18:42:04.920519Z" } }, "cell_type": "code", "source": [ "influencers = [\n", " 'realDonaldTrump',\n", " 'JoeBiden',\n", " # 'LennyDykstra',\n", " 
'POTUS',\n", " 'MikePence',\n", " 'BernieSanders',\n", " 'AndrewYang',\n", " # 'chrislhayes',\n", " # 'SenSchumer',\n", " # 'GenFlynn',\n", " # 'ChrisMurphyCT',\n", " # 'BretBaier',\n", " 'TuckerCarlson',\n", " 'seanhannity',\n", " # 'MarkWarner',\n", " 'MikeTyson',\n", " 'JudyWoodruff',\n", " # 'AdamBaldwin',\n", " 'johncardillo',\n", " 'AntonioSabatoJr',\n", " 'JulianCastro',\n", " 'brianschatz',\n", " 'MichelleObama',\n", " 'HillaryClinton',\n", " 'IvankaTrump',\n", " 'JessicaBiel',\n", " 'AlyssaMilano',\n", " 'JLo'\n", "]\n", "print(len(influencers))" ], "id": "dd3f74de67f8a5b3", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "20\n" ] } ], "execution_count": 222 }, { "metadata": { "ExecuteTime": { "end_time": "2024-04-14T18:42:05.455128Z", "start_time": "2024-04-14T18:42:04.940473Z" } }, "cell_type": "code", "source": [ "import pandas as pd\n", "import string\n", "import re\n", "\n", "path = 'data/'\n", "file = 'dataset.csv'\n", "names = ['TEXT', 'RETWEET_COUNT', 'FAVORITE_COUNT', 'TWEET_ID', 'TWEET_BY', 'TWEET_BY_ID', 'DATETIME', 'NUM_OF_URLS',\n", " 'RETWEETED', 'RETWEETED_TWEET_ID', 'RETWEETED_TWEET_BY', 'RETWEETED_TWEET_BY_ID', 'RETWEETED_TEXT',\n", " 'RETWEETED_URLS', 'RETWEETED_MEDIA']\n", "d_types = {\n", " 'TEXT': 'string',\n", " 'RETWEET_COUNT': 'int64',\n", " 'FAVORITE_COUNT': 'int64',\n", " 'TWEET_ID': 'string',\n", " 'TWEET_BY': 'string',\n", " 'TWEET_BY_ID': 'string',\n", " 'DATETIME': 'string',\n", " 'NUM_OF_URLS': 'int64',\n", " 'RETWEETED': 'boolean',\n", " 'RETWEETED_TWEET_ID': 'string',\n", " 'RETWEETED_TWEET_BY': 'string',\n", " 'RETWEETED_TWEET_BY_ID': 'string',\n", " 'RETWEETED_TEXT': 'string',\n", " 'RETWEETED_URLS': 'Float64',\n", " 'RETWEETED_MEDIA': 'Float64'\n", "}\n", "tw_df = pd.read_csv(path + file, sep=',', header=0, dtype=d_types)\n", "columns = list(tw_df.columns)\n", "print(columns, tw_df.shape)" ], "id": "5408e62813b4c435", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['TEXT', 
'RETWEET_COUNT', 'FAVORITE_COUNT', 'TWEET_ID', 'TWEET_BY', 'TWEET_BY_ID', 'DATETIME', 'NUM_OF_URLS', 'RETWEETED', 'RETWEETED_TWEET_ID', 'RETWEETED_TWEET_BY', 'RETWEETED_TWEET_BY_ID', 'RETWEETED_TEXT', 'RETWEETED_URLS', 'RETWEETED_MEDIA'] (92414, 15)\n" ] } ], "execution_count": 223 }, { "metadata": { "ExecuteTime": { "end_time": "2024-04-14T18:42:05.469236Z", "start_time": "2024-04-14T18:42:05.457294Z" } }, "cell_type": "code", "source": "tw_df.head(9)", "id": "deb8dc84b0d9012e", "outputs": [ { "data": { "text/plain": [ " TEXT RETWEET_COUNT \\\n", "0 While you are at home socially distancing, che... 12 \n", "1 Americans have made incredible sacrifices to g... 19 \n", "2 The @SenSchumer & @SpeakerPelosi paycheck ... 5360 \n", "3 The #PaycheckProtectionProgram is helping work... 21 \n", "4 RT @SenateGOP: We all saw the MSNBC clip.\n", "\n", "STO... 108 \n", "5 I want to make sure we stop the spread of #cor... 32 \n", "6 RT @SenateGOP: Conference Chairman @SenJohnBar... 44 \n", "7 Congrats to my good friend @SenatorTimScott on... 10 \n", "8 RT @EPWGOP: Chairman @SenJohnBarrasso: “@NRCgo... 
4 \n", "\n", " FAVORITE_COUNT TWEET_ID TWEET_BY TWEET_BY_ID \\\n", "0 27 #1251250167837347841 SenJohnBarrasso #202206694 \n", "1 60 #1250850558249926656 SenJohnBarrasso #202206694 \n", "2 14837 #1250772931015258112 SenJohnBarrasso #202206694 \n", "3 80 #1250525652736004098 SenJohnBarrasso #202206694 \n", "4 0 #1250487291895742468 SenJohnBarrasso #202206694 \n", "5 87 #1250439402389532672 SenJohnBarrasso #202206694 \n", "6 0 #1250099010226053121 SenJohnBarrasso #202206694 \n", "7 32 #1250098752280571910 SenJohnBarrasso #202206694 \n", "8 0 #1249787975279374336 SenJohnBarrasso #202206694 \n", "\n", " DATETIME NUM_OF_URLS RETWEETED \\\n", "0 Fri Apr 17 20:44:21 +0000 2020 1 False \n", "1 Thu Apr 16 18:16:27 +0000 2020 1 False \n", "2 Thu Apr 16 13:07:59 +0000 2020 1 False \n", "3 Wed Apr 15 20:45:23 +0000 2020 0 False \n", "4 Wed Apr 15 18:12:57 +0000 2020 0 True \n", "5 Wed Apr 15 15:02:40 +0000 2020 0 False \n", "6 Tue Apr 14 16:30:04 +0000 2020 0 True \n", "7 Tue Apr 14 16:29:02 +0000 2020 0 False \n", "8 Mon Apr 13 19:54:07 +0000 2020 0 False \n", "\n", " RETWEETED_TWEET_ID RETWEETED_TWEET_BY RETWEETED_TWEET_BY_ID \\\n", "0 \n", "1 \n", "2 \n", "3 \n", "4 #1250477604097966084 SenateGOP #14344823 \n", "5 \n", "6 #1250078167609618432 SenateGOP #14344823 \n", "7 \n", "8 \n", "\n", " RETWEETED_TEXT RETWEETED_URLS \\\n", "0 \n", "1 \n", "2 \n", "3 \n", "4 We all saw the MSNBC clip.\n", "\n", "STOP using this “c... 1.0 \n", "5 \n", "6 Conference Chairman @SenJohnBarrasso on @Varne... 0.0 \n", "7 \n", "8 \n", "\n", " RETWEETED_MEDIA \n", "0 \n", "1 \n", "2 \n", "3 \n", "4 1.0 \n", "5 \n", "6 1.0 \n", "7 \n", "8 " ], "text/html": [ "
" ] }, "execution_count": 224, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 224 }, { "metadata": { "ExecuteTime": { "end_time": "2024-04-14T18:48:09.638698Z", "start_time": "2024-04-14T18:48:09.634021Z" } }, "cell_type": "code", "source": [ "puncs = re.sub(\"[@#]\", \"\", string.punctuation)\n", "# print(puncs)\n", "translator = str.maketrans('', '', puncs)\n", "\n", "my_string = 'Hola!!! soy @Carlos, estoy en @EFPL #studying para DH-500...'\n", "new_string = my_string.translate(translator)\n", "print(new_string)" ], "id": "acee9e1e20269509", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Hola soy @Carlos estoy en @EFPL #studying para DH500\n" ] } ], "execution_count": 258 }, { "metadata": { "ExecuteTime": { "end_time": "2024-04-14T18:42:05.775162Z", "start_time": "2024-04-14T18:42:05.488490Z" } }, "cell_type": "code", "source": [ "def find_at_word(text):\n", " # word = re.findall(r'@([a-zA-Z0-9]{1,15})', text)\n", " word = re.findall(r'(?<=@)\\w{1,15}', text)\n", " mentions = \" \".join(word)\n", " return mentions.translate(translator)\n", "\n", "tw_df['mentions'] = tw_df['TEXT'].apply(lambda x: find_at_word(x))\n", "print(\"Extracting @mentions from dataframe columns:\")\n", "print(len(tw_df.mentions), tw_df.mentions[:9])" ], "id": "68ba54d11940a88c", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Extracting @mentions from dataframe columns:\n", "92414 0 JumpShotMovie WyoAthletics StephenCurry30 jake...\n", "1 \n", "2 SenSchumer SpeakerPelosi\n", "3 \n", "4 SenateGOP\n", "5 SpeakerPelosi\n", "6 SenateGOP SenJohnBarrasso Varneyco\n", "7 SenatorTimScott\n", "8 EPWGOP SenJohnBarrasso NRCgov\n", "Name: mentions, dtype: object\n" ] } ], "execution_count": 226 }, { "metadata": { "ExecuteTime": { "end_time": "2024-04-14T18:48:46.778364Z", "start_time": "2024-04-14T18:48:46.742993Z" } }, "cell_type": "code", "source": [ "from collections import Counter\n", "from pprint import pprint\n", "\n", "fq_m = 
dict(Counter([m for i in tw_df.mentions for m in i.split()]))\n", "fq_m = dict(sorted(fq_m.items(), key=lambda x: x[1], reverse=True))\n", "\n", "print('TOTAL unique @mentions:', len(list(fq_m.items())))\n", "pprint(list(fq_m.items())[:9])" ], "id": "38f45b963ee0a166", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TOTAL unique @mentions: 17734\n", "[('realDonaldTrump', 1946),\n", " ('JoeBiden', 407),\n", " ('WhiteHouse', 307),\n", " ('SBAgov', 272),\n", " ('LennyDykstra', 271),\n", " ('SpeakerPelosi', 257),\n", " ('FoxNews', 252),\n", " ('iheartmindy', 250),\n", " ('jaketapper', 232)]\n" ] } ], "execution_count": 259 }, { "metadata": {}, "cell_type": "markdown", "source": "### Q.3 - Descriptive statistics", "id": "d35bf59121c67002" }, { "metadata": {}, "cell_type": "markdown", "source": "#### Q.3.1 - The Percentage of tweets that contain URLs", "id": "6ca70296deeba0c" }, { "metadata": { "ExecuteTime": { "end_time": "2024-04-14T18:48:52.612575Z", "start_time": "2024-04-14T18:48:52.605681Z" } }, "cell_type": "code", "source": "print(tw_df.NUM_OF_URLS.describe())", "id": "aa80b763e471f2c4", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "count 92414.000000\n", "mean 0.569621\n", "std 0.521984\n", "min 0.000000\n", "25% 0.000000\n", "50% 1.000000\n", "75% 1.000000\n", "max 5.000000\n", "Name: NUM_OF_URLS, dtype: float64\n" ] } ], "execution_count": 260 }, { "metadata": { "ExecuteTime": { "end_time": "2024-04-14T18:48:53.484281Z", "start_time": "2024-04-14T18:48:53.475185Z" } }, "cell_type": "code", "source": [ "tw_url = tw_df.loc[tw_df['NUM_OF_URLS'] > 0, 'NUM_OF_URLS'].count() / tw_df['TEXT'].count()\n", "print('The Percentage of tweets that contain URLs is:', f'{round(tw_url, 3) * 100}%')" ], "id": "11e63cee9c34439a", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The Percentage of tweets that contain URLs is: 55.7%\n" ] } ], "execution_count": 261 }, { "metadata": {}, "cell_type": "markdown", 
"source": "#### Q3.2 - The Percentage of tweets that are (or contain) retweets", "id": "134541ad66fb3f6" }, { "metadata": { "ExecuteTime": { "end_time": "2024-04-14T18:48:57.640645Z", "start_time": "2024-04-14T18:48:57.634566Z" } }, "cell_type": "code", "source": "print(tw_df['RETWEETED'].describe())", "id": "480e1018c82db48a", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "count 92414\n", "unique 2\n", "top False\n", "freq 82505\n", "Name: RETWEETED, dtype: object\n" ] } ], "execution_count": 262 }, { "metadata": { "ExecuteTime": { "end_time": "2024-04-14T18:49:00.313487Z", "start_time": "2024-04-14T18:49:00.300808Z" } }, "cell_type": "code", "source": [ "tw_re_tw = tw_df.loc[tw_df['RETWEETED_MEDIA'].notnull(), 'RETWEETED_MEDIA'].count() / tw_df['TEXT'].count()\n", "print('The Percentage of tweets are/or contain RE-Tweets:', f'{round(tw_re_tw, 3) * 100}%')" ], "id": "2b149b590ae262b9", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The Percentage of tweets are/or contain RE-Tweets: 10.7%\n" ] } ], "execution_count": 263 }, { "metadata": { "ExecuteTime": { "end_time": "2024-04-14T18:49:11.925379Z", "start_time": "2024-04-14T18:49:11.915270Z" } }, "cell_type": "code", "source": [ "tw_re_tw = tw_df.loc[tw_df['RETWEETED'], 'RETWEETED'].count() / tw_df['TEXT'].count()\n", "print('The Percentage of tweets are/or contain RE-Tweets:', f'{round(tw_re_tw, 3) * 100}%', 'TOTAL TWEETS:',\n", " tw_df.loc[tw_df['RETWEETED'], 'RETWEETED'].count())" ], "id": "c39056dadb6ada84", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The Percentage of tweets are/or contain RE-Tweets: 10.7% TOTAL TWEETS: 9909\n" ] } ], "execution_count": 264 }, { "metadata": {}, "cell_type": "markdown", "source": "#### Q3.3 - Table of the 30 most frequent hashtags", "id": "6db470c9ced1f9ba" }, { "metadata": { "ExecuteTime": { "end_time": "2024-04-14T18:49:17.174986Z", "start_time": "2024-04-14T18:49:16.929491Z" } }, "cell_type": "code", 
"source": [ "def find_gato_word(text):\n", " # word = re.findall(r'@([a-zA-Z0-9]{1,15})', text)\n", " word = re.findall(r'(?<=#)\\w{1,280}', text)\n", " mentions = \" \".join(word)\n", " return mentions.translate(translator)\n", "\n", "tw_df['hashtags'] = tw_df['TEXT'].apply(lambda x: find_gato_word(x))\n", "print(\"Extracting #hashtag from dataframe columns:\")\n", "print(tw_df.hashtags[:9], len(tw_df.hashtags))" ], "id": "70e237aa6da7f40a", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Extracting #hashtag from dataframe columns:\n", "0 KennySailors 4Kenny\n", "1 \n", "2 PaycheckProtectionProgram\n", "3 PaycheckProtectionProgram PPP\n", "4 \n", "5 coronavirus PaycheckProtectionProgram\n", "6 \n", "7 \n", "8 \n", "Name: hashtags, dtype: object 92414\n" ] } ], "execution_count": 265 }, { "metadata": { "ExecuteTime": { "end_time": "2024-04-14T18:42:06.156768Z", "start_time": "2024-04-14T18:42:06.134086Z" } }, "cell_type": "code", "source": [ "fq_h = dict(Counter([m for i in tw_df.hashtags for m in i.split()]))\n", "fq_h = dict(sorted(fq_h.items(), key=lambda x: x[1], reverse=True))\n", "pprint(list(fq_h.items())[:30])\n", "print('TOTAL unique #hashtags:', len(list(fq_h.items())))" ], "id": "874735eee709faf7", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[('COVID19', 2100),\n", " ('coronavirus', 1350),\n", " ('PaycheckProtectionProgram', 598),\n", " ('Coronavirus', 302),\n", " ('CARESAct', 191),\n", " ('InThisTogetherOhio', 190),\n", " ('PPP', 189),\n", " ('COVIDー19', 176),\n", " ('AmericanIdol', 165),\n", " ('CNNTownHall', 120),\n", " ('FamiliesFirst', 119),\n", " ('madamextheatre', 116),\n", " ('Ohio', 109),\n", " ('DennisMillerOption', 106),\n", " ('Covid19', 105),\n", " ('911onFox', 99),\n", " ('smallbiz', 95),\n", " ('watchingwithrichard', 87),\n", " ('TheWestWing', 85),\n", " ('EarthDay', 83),\n", " ('covid19', 81),\n", " ('CombatCOVID19Challenge', 80),\n", " ('ICYMI', 79),\n", " ('NY21', 78),\n", " ('DemDebate', 
77),\n", " ('Vote4Mindy', 77),\n", " ('NV03', 75),\n", " ('2020Census', 74),\n", " ('foxnews', 70),\n", " ('CombatCOVID19', 69)]\n", "TOTAL unique #hashtags: 7695\n" ] } ], "execution_count": 234 }, { "metadata": { "ExecuteTime": { "end_time": "2024-04-14T18:42:06.179899Z", "start_time": "2024-04-14T18:42:06.158049Z" } }, "cell_type": "code", "source": [ "l_fq_h = [(i + 1, f'#{k}', int(v)) for i, (k, v) in enumerate(fq_h.items())]\n", "pprint(l_fq_h[:30])" ], "id": "f21f7fdd7d62d3b5", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[(1, '#COVID19', 2100),\n", " (2, '#coronavirus', 1350),\n", " (3, '#PaycheckProtectionProgram', 598),\n", " (4, '#Coronavirus', 302),\n", " (5, '#CARESAct', 191),\n", " (6, '#InThisTogetherOhio', 190),\n", " (7, '#PPP', 189),\n", " (8, '#COVIDー19', 176),\n", " (9, '#AmericanIdol', 165),\n", " (10, '#CNNTownHall', 120),\n", " (11, '#FamiliesFirst', 119),\n", " (12, '#madamextheatre', 116),\n", " (13, '#Ohio', 109),\n", " (14, '#DennisMillerOption', 106),\n", " (15, '#Covid19', 105),\n", " (16, '#911onFox', 99),\n", " (17, '#smallbiz', 95),\n", " (18, '#watchingwithrichard', 87),\n", " (19, '#TheWestWing', 85),\n", " (20, '#EarthDay', 83),\n", " (21, '#covid19', 81),\n", " (22, '#CombatCOVID19Challenge', 80),\n", " (23, '#ICYMI', 79),\n", " (24, '#NY21', 78),\n", " (25, '#DemDebate', 77),\n", " (26, '#Vote4Mindy', 77),\n", " (27, '#NV03', 75),\n", " (28, '#2020Census', 74),\n", " (29, '#foxnews', 70),\n", " (30, '#CombatCOVID19', 69)]\n" ] } ], "execution_count": 235 }, { "metadata": {}, "cell_type": "markdown", "source": "### Q4 - Filtering by Media & Influencers", "id": "25e37470cb8749ff" }, { "metadata": {}, "cell_type": "markdown", "source": "#### Q4.1 - Tweets generated by all the 20 media accounts", "id": "4294e53da1cba100" }, { "metadata": { "ExecuteTime": { "end_time": "2024-04-14T18:46:06.786188Z", "start_time": "2024-04-14T18:46:06.782790Z" } }, "cell_type": "code", "source": "print(media_sources[:9], 
len(media_sources))", "id": "cc903f624a16eb93", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['usatoday', 'nytimes', 'theguardian', 'businessinsider', 'abcnews', 'WhiteHouse', 'cnni', 'usnews', 'TheLastofUsNewsUSNewsMoney'] 20\n" ] } ], "execution_count": 249 }, { "metadata": { "ExecuteTime": { "end_time": "2024-04-14T18:46:06.992568Z", "start_time": "2024-04-14T18:46:06.969166Z" } }, "cell_type": "code", "source": [ "tw_by_media = tw_df['TWEET_BY'].isin(media_sources)\n", "print('The Percentage of *ALL* TWEET_BY SELECTED MEDIA:', f'{round(sum(tw_by_media)/len(tw_by_media), 3) * 100}%', 'TOTAL TWEET_BY MEDIA:', sum(tw_by_media), 'TOTAL TWEETS:', tw_df['TEXT'].count())" ], "id": "47530a55dd246920", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The Percentage of *ALL* TWEET_BY SELECTED MEDIA: 3.1% TOTAL TWEET_BY MEDIA: 2880 TOTAL TWEETS: 92414\n" ] } ], "execution_count": 250 }, { "metadata": { "ExecuteTime": { "end_time": "2024-04-14T18:46:07.482512Z", "start_time": "2024-04-14T18:46:07.128789Z" } }, "cell_type": "code", "source": [ "def direct_tweet(s):\n", " return True if (s['TWEET_BY'] in media_sources) and (s['RETWEETED'] == False) else False\n", "\n", "tw_df['DIRECT_BY_MEDIA'] = tw_df.apply(direct_tweet, axis=1)\n", "d_by_media = sum(tw_df['DIRECT_BY_MEDIA'])\n", "print('The Percentage of *DIRECTLY* TWEET_BY SELECTED MEDIA:', f'{round(d_by_media/len(tw_by_media), 3) * 100}%', 'TOTAL TWEET_BY MEDIA:', d_by_media, 'TOTAL TWEETS:', tw_df['TEXT'].count())" ], "id": "e09c336b433f7491", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The Percentage of *DIRECTLY* TWEET_BY SELECTED MEDIA: 3.0% TOTAL TWEET_BY MEDIA: 2776 TOTAL TWEETS: 92414\n" ] } ], "execution_count": 251 }, { "metadata": {}, "cell_type": "markdown", "source": "#### Q4.2 - Tweets generated by all the 20 influencers", "id": "677aacaa708b5008" }, { "metadata": { "ExecuteTime": { "end_time": "2024-04-14T18:46:07.486730Z", "start_time": 
"2024-04-14T18:46:07.483976Z" } }, "cell_type": "code", "source": "print(influencers[:9], len(influencers))", "id": "87ec7f2ee26191a4", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['realDonaldTrump', 'JoeBiden', 'POTUS', 'MikePence', 'BernieSanders', 'AndrewYang', 'TuckerCarlson', 'seanhannity', 'MikeTyson'] 20\n" ] } ], "execution_count": 252 }, { "metadata": { "ExecuteTime": { "end_time": "2024-04-14T18:46:07.600862Z", "start_time": "2024-04-14T18:46:07.582077Z" } }, "cell_type": "code", "source": [ "tw_by_influ = tw_df['TWEET_BY'].isin(influencers)\n", "print('The Percentage of *ALL* TWEET_BY SELECTED INFLUENCERS:', f'{round((sum(tw_by_influ)/len(tw_by_influ) * 100), 3)}%', 'TOTAL TWEET_BY INFLUENCERS:', sum(tw_by_influ), 'TOTAL TWEETS:', tw_df['TEXT'].count())" ], "id": "9c76c100eabe4bef", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The Percentage of *ALL* TWEET_BY SELECTED INFLUENCERS: 5.915% TOTAL TWEET_BY INFLUENCERS: 5466 TOTAL TWEETS: 92414\n" ] } ], "execution_count": 253 }, { "metadata": { "ExecuteTime": { "end_time": "2024-04-14T18:46:08.097943Z", "start_time": "2024-04-14T18:46:07.737084Z" } }, "cell_type": "code", "source": [ "def direct_tweet(s):\n", " return True if (s['TWEET_BY'] in influencers) and (s['RETWEETED'] == False) else False\n", "\n", "tw_df['DIRECT_BY_INFLUENCERS'] = tw_df.apply(direct_tweet, axis=1)\n", "d_by_media = sum(tw_df['DIRECT_BY_INFLUENCERS'])\n", "print('The Percentage of *DIRECTLY* TWEET_BY SELECTED INFLUENCERS:', f'{round(d_by_media/len(tw_by_media), 3) * 100}%', 'TOTAL TWEET_BY INFLUENCERS:', d_by_media, 'TOTAL TWEETS:', tw_df['TEXT'].count())" ], "id": "60420845b454170", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The Percentage of *DIRECTLY* TWEET_BY SELECTED INFLUENCERS: 5.2% TOTAL TWEET_BY INFLUENCERS: 4792 TOTAL TWEETS: 92414\n" ] } ], "execution_count": 254 }, { "metadata": {}, "cell_type": "markdown", "source": "#### Q4.3 - Tweets generated 
by all the 20 media sources as RETWEETS", "id": "e1ad45585a6ba533" }, { "metadata": { "ExecuteTime": { "end_time": "2024-04-14T18:46:08.459036Z", "start_time": "2024-04-14T18:46:08.099276Z" } }, "cell_type": "code", "source": [ "def in_direct_tweet(s):\n", " return True if (s['TWEET_BY'] in media_sources) and (s['RETWEETED'] == True) else False\n", "\n", "tw_df['INDIRECT_BY_MEDIA'] = tw_df.apply(in_direct_tweet, axis=1)\n", "total_ind_by_media = len(tw_df['INDIRECT_BY_MEDIA'])\n", "ind_by_media = sum(tw_df['INDIRECT_BY_MEDIA'])\n", "print('The Percentage of *INDIRECTLY* TWEET_BY SELECTED MEDIA:', f'{round(((ind_by_media/total_ind_by_media) * 100), 6)}%', 'TOTAL RETWEETED_BY MEDIA:', sum(tw_df['INDIRECT_BY_MEDIA']), 'TOTAL TWEETS:', tw_df['TEXT'].count())" ], "id": "b066d5a98e7effdf", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The Percentage of *INDIRECTLY* TWEET_BY SELECTED MEDIA: 0.112537% TOTAL RETWEETED_BY MEDIA: 104 TOTAL TWEETS: 92414\n" ] } ], "execution_count": 255 }, { "metadata": {}, "cell_type": "markdown", "source": "#### Q4.4 - Tweets generated by all the 20 influencers sources as RETWEETS", "id": "50ea65457a6f8508" }, { "metadata": { "ExecuteTime": { "end_time": "2024-04-14T18:46:08.825478Z", "start_time": "2024-04-14T18:46:08.460163Z" } }, "cell_type": "code", "source": [ "def in_direct_tweet(s):\n", " return True if (s['TWEET_BY'] in influencers) and (s['RETWEETED'] == True) else False\n", "\n", "tw_df['INDIRECT_BY_INFLU'] = tw_df.apply(in_direct_tweet, axis=1)\n", "total_ind_by_influ = len(tw_df['INDIRECT_BY_INFLU'])\n", "ind_by_influ = sum(tw_df['INDIRECT_BY_INFLU'])\n", "print('The Percentage of *INDIRECTLY* TWEET_BY SELECTED INFLUENCERS:', f'{round(((ind_by_influ/total_ind_by_influ) * 100), 6)}%', 'TOTAL RETWEETED_BY INFLUENCERS:', sum(tw_df['INDIRECT_BY_INFLU']), 'TOTAL TWEETS:', tw_df['TEXT'].count())" ], "id": "2930ba5bfcae31c8", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The Percentage of 
*INDIRECTLY* TWEET_BY SELECTED INFLUENCERS: 0.729327% TOTAL RETWEETED_BY INFLUENCERS: 674 TOTAL TWEETS: 92414\n" ] } ], "execution_count": 256 }, { "metadata": {}, "cell_type": "markdown", "source": [ "### Q5 - Discussion of the results of points 3 & 4\n", "\n", "Are any of the results unexpected? Why?" ], "id": "3da960163cc539a" }, { "metadata": {}, "cell_type": "markdown", "source": [ "It is interesting that although more than half of the tweets contain URLs (55.7%), only 10.7% of all tweets are retweets. This suggests that **most tweets cite or promote sources outside the Twitter ecosystem.**\n", "\n", "Additionally, subsampling my *arbitrarily* selected groups of MEDIA SOURCES and INFLUENCERS yields insights similar to those presented in class last Friday:\n", "\n", "- The selected MEDIA SOURCES alone produced ~3% of the tweets (2880 of 92414).\n", "- Nearly all of these are **DIRECT** tweets (2776 of 2880), i.e. written by the accounts themselves rather than retweeted.\n", "- The **MEDIA SOURCES** are active, but less so than the **INFLUENCERS**, who produce almost twice as many tweets (~6%).\n", "- **INFLUENCERS** retweet more than the **MEDIA SOURCES**; however, most of their tweets are also **DIRECT** (4792 of 5466).\n", "- MEDIA retweets account for ~0.11% of all tweets, compared with ~0.73% for INFLUENCERS retweets."
], "id": "48531149a66baebf" }, { "metadata": { "ExecuteTime": { "end_time": "2024-04-14T22:55:31.509477Z", "start_time": "2024-04-14T22:55:31.504609Z" } }, "cell_type": "code", "source": "print('Das is todo, merci!')", "id": "992263206e59f051", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Das is todo, merci!\n" ] } ], "execution_count": 272 }, { "metadata": {}, "cell_type": "code", "outputs": [], "execution_count": null, "source": "", "id": "48a770ca733f9401" } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 5 }