{
 "cells": [
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "# Homework 2 - DH-500\n",
    "## Analysing Twitter "
   ],
   "id": "6eee1bc06b03d39b"
  },
  {
   "cell_type": "code",
   "id": "initial_id",
   "metadata": {
    "collapsed": true,
    "ExecuteTime": {
     "end_time": "2024-04-14T19:00:44.730100Z",
     "start_time": "2024-04-14T19:00:44.726958Z"
    }
   },
   "source": [
    "student, sciper = 'Carlos Alberto Vargas Rivera', '384891'\n",
    "print('Author:', f'{student} ({sciper})')"
   ],
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Author: Carlos Alberto Vargas Rivera (384891)\n"
     ]
    }
   ],
   "execution_count": 268
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "### Q1-2 - Media and Influencers that tweet about US Politics",
   "id": "e0890030a56f3933"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-14T18:42:04.917981Z",
     "start_time": "2024-04-14T18:42:04.912679Z"
    }
   },
   "cell_type": "code",
   "source": [
    "media_sources = [\n",
    "    # 'people',\n",
    "    'usatoday',\n",
    "    'nytimes',\n",
    "    # 'yahoo',\n",
    "    # 'cnn',\n",
    "    # 'washingtonpost',\n",
    "    # 'nypost',\n",
    "    # 'foxnews',\n",
    "    # 'reuters',\n",
    "    # 'newsweek',\n",
    "    # 'bloomberg',\n",
    "    'theguardian',\n",
    "    'businessinsider',\n",
    "    'abcnews',\n",
    "    # 'politico',\n",
    "    'WhiteHouse',\n",
    "    # 'cnnbrk',\n",
    "    'cnni',\n",
    "    'usnews',\n",
    "    'TheLastofUsNews',\n",
    "    'USNewsMoney',\n",
    "    'CNNPolitics',\n",
    "    'foxnewspolitics',\n",
    "    'usbank',\n",
    "    'googlenews',\n",
    "    'detroitnews',\n",
    "    'GuardianUS',\n",
    "    'NewsWire_US',\n",
    "    'MirrorUSNews',\n",
    "    'ExpressUSNews',\n",
    "    'Interior',\n",
    "    'BuzzFeedNews'\n",
    "]\n",
    "print(len(media_sources))"
   ],
   "id": "16bff05215606233",
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "20\n"
     ]
    }
   ],
   "execution_count": 221
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-14T18:42:04.933383Z",
     "start_time": "2024-04-14T18:42:04.920519Z"
    }
   },
   "cell_type": "code",
   "source": [
    "influencers = [\n",
    "    'realDonaldTrump',\n",
    "    'JoeBiden',\n",
    "    # 'LennyDykstra',\n",
    "    'POTUS',\n",
    "    'MikePence',\n",
    "    'BernieSanders',\n",
    "    'AndrewYang',\n",
    "    # 'chrislhayes',\n",
    "    # 'SenSchumer',\n",
    "    # 'GenFlynn',\n",
    "    # 'ChrisMurphyCT',\n",
    "    # 'BretBaier',\n",
    "    'TuckerCarlson',\n",
    "    'seanhannity',\n",
    "    # 'MarkWarner',\n",
    "    'MikeTyson',\n",
    "    'JudyWoodruff',\n",
    "    # 'AdamBaldwin',\n",
    "    'johncardillo',\n",
    "    'AntonioSabatoJr',\n",
    "    'JulianCastro',\n",
    "    'brianschatz',\n",
    "    'MichelleObama',\n",
    "    'HillaryClinton',\n",
    "    'IvankaTrump',\n",
    "    'JessicaBiel',\n",
    "    'AlyssaMilano',\n",
    "    'JLo'\n",
    "]\n",
    "print(len(influencers))"
   ],
   "id": "dd3f74de67f8a5b3",
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "20\n"
     ]
    }
   ],
   "execution_count": 222
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-14T18:42:05.455128Z",
     "start_time": "2024-04-14T18:42:04.940473Z"
    }
   },
   "cell_type": "code",
   "source": [
    "import pandas as pd\n",
    "import string\n",
    "import re\n",
    "\n",
    "path = 'data/'\n",
    "file = 'dataset.csv'\n",
    "names = ['TEXT', 'RETWEET_COUNT', 'FAVORITE_COUNT', 'TWEET_ID', 'TWEET_BY', 'TWEET_BY_ID', 'DATETIME', 'NUM_OF_URLS',\n",
    "         'RETWEETED', 'RETWEETED_TWEET_ID', 'RETWEETED_TWEET_BY', 'RETWEETED_TWEET_BY_ID', 'RETWEETED_TEXT',\n",
    "         'RETWEETED_URLS', 'RETWEETED_MEDIA']\n",
    "d_types = {\n",
    "    'TEXT': 'string',\n",
    "    'RETWEET_COUNT': 'int64',\n",
    "    'FAVORITE_COUNT': 'int64',\n",
    "    'TWEET_ID': 'string',\n",
    "    'TWEET_BY': 'string',\n",
    "    'TWEET_BY_ID': 'string',\n",
    "    'DATETIME': 'string',\n",
    "    'NUM_OF_URLS': 'int64',\n",
    "    'RETWEETED': 'boolean',\n",
    "    'RETWEETED_TWEET_ID': 'string',\n",
    "    'RETWEETED_TWEET_BY': 'string',\n",
    "    'RETWEETED_TWEET_BY_ID': 'string',\n",
    "    'RETWEETED_TEXT': 'string',\n",
    "    'RETWEETED_URLS': 'Float64',\n",
    "    'RETWEETED_MEDIA': 'Float64'\n",
    "}\n",
    "tw_df = pd.read_csv(path + file, sep=',', header=0, dtype=d_types)\n",
    "columns = list(tw_df.columns)\n",
    "print(columns, tw_df.shape)"
   ],
   "id": "5408e62813b4c435",
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['TEXT', 'RETWEET_COUNT', 'FAVORITE_COUNT', 'TWEET_ID', 'TWEET_BY', 'TWEET_BY_ID', 'DATETIME', 'NUM_OF_URLS', 'RETWEETED', 'RETWEETED_TWEET_ID', 'RETWEETED_TWEET_BY', 'RETWEETED_TWEET_BY_ID', 'RETWEETED_TEXT', 'RETWEETED_URLS', 'RETWEETED_MEDIA'] (92414, 15)\n"
     ]
    }
   ],
   "execution_count": 223
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-14T18:42:05.469236Z",
     "start_time": "2024-04-14T18:42:05.457294Z"
    }
   },
   "cell_type": "code",
   "source": "tw_df.head(9)",
   "id": "deb8dc84b0d9012e",
   "outputs": [
    {
     "data": {
      "text/plain": [
       "                                                TEXT  RETWEET_COUNT  \\\n",
       "0  While you are at home socially distancing, che...             12   \n",
       "1  Americans have made incredible sacrifices to g...             19   \n",
       "2  The @SenSchumer &amp; @SpeakerPelosi paycheck ...           5360   \n",
       "3  The #PaycheckProtectionProgram is helping work...             21   \n",
       "4  RT @SenateGOP: We all saw the MSNBC clip.\n",
       "\n",
       "STO...            108   \n",
       "5  I want to make sure we stop the spread of #cor...             32   \n",
       "6  RT @SenateGOP: Conference Chairman @SenJohnBar...             44   \n",
       "7  Congrats to my good friend @SenatorTimScott on...             10   \n",
       "8  RT @EPWGOP: Chairman @SenJohnBarrasso: “@NRCgo...              4   \n",
       "\n",
       "   FAVORITE_COUNT              TWEET_ID         TWEET_BY TWEET_BY_ID  \\\n",
       "0              27  #1251250167837347841  SenJohnBarrasso  #202206694   \n",
       "1              60  #1250850558249926656  SenJohnBarrasso  #202206694   \n",
       "2           14837  #1250772931015258112  SenJohnBarrasso  #202206694   \n",
       "3              80  #1250525652736004098  SenJohnBarrasso  #202206694   \n",
       "4               0  #1250487291895742468  SenJohnBarrasso  #202206694   \n",
       "5              87  #1250439402389532672  SenJohnBarrasso  #202206694   \n",
       "6               0  #1250099010226053121  SenJohnBarrasso  #202206694   \n",
       "7              32  #1250098752280571910  SenJohnBarrasso  #202206694   \n",
       "8               0  #1249787975279374336  SenJohnBarrasso  #202206694   \n",
       "\n",
       "                         DATETIME  NUM_OF_URLS  RETWEETED  \\\n",
       "0  Fri Apr 17 20:44:21 +0000 2020            1      False   \n",
       "1  Thu Apr 16 18:16:27 +0000 2020            1      False   \n",
       "2  Thu Apr 16 13:07:59 +0000 2020            1      False   \n",
       "3  Wed Apr 15 20:45:23 +0000 2020            0      False   \n",
       "4  Wed Apr 15 18:12:57 +0000 2020            0       True   \n",
       "5  Wed Apr 15 15:02:40 +0000 2020            0      False   \n",
       "6  Tue Apr 14 16:30:04 +0000 2020            0       True   \n",
       "7  Tue Apr 14 16:29:02 +0000 2020            0      False   \n",
       "8  Mon Apr 13 19:54:07 +0000 2020            0      False   \n",
       "\n",
       "     RETWEETED_TWEET_ID RETWEETED_TWEET_BY RETWEETED_TWEET_BY_ID  \\\n",
       "0                  <NA>               <NA>                  <NA>   \n",
       "1                  <NA>               <NA>                  <NA>   \n",
       "2                  <NA>               <NA>                  <NA>   \n",
       "3                  <NA>               <NA>                  <NA>   \n",
       "4  #1250477604097966084          SenateGOP             #14344823   \n",
       "5                  <NA>               <NA>                  <NA>   \n",
       "6  #1250078167609618432          SenateGOP             #14344823   \n",
       "7                  <NA>               <NA>                  <NA>   \n",
       "8                  <NA>               <NA>                  <NA>   \n",
       "\n",
       "                                      RETWEETED_TEXT  RETWEETED_URLS  \\\n",
       "0                                               <NA>            <NA>   \n",
       "1                                               <NA>            <NA>   \n",
       "2                                               <NA>            <NA>   \n",
       "3                                               <NA>            <NA>   \n",
       "4  We all saw the MSNBC clip.\n",
       "\n",
       "STOP using this “c...             1.0   \n",
       "5                                               <NA>            <NA>   \n",
       "6  Conference Chairman @SenJohnBarrasso on @Varne...             0.0   \n",
       "7                                               <NA>            <NA>   \n",
       "8                                               <NA>            <NA>   \n",
       "\n",
       "   RETWEETED_MEDIA  \n",
       "0             <NA>  \n",
       "1             <NA>  \n",
       "2             <NA>  \n",
       "3             <NA>  \n",
       "4              1.0  \n",
       "5             <NA>  \n",
       "6              1.0  \n",
       "7             <NA>  \n",
       "8             <NA>  "
      ],
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>TEXT</th>\n",
       "      <th>RETWEET_COUNT</th>\n",
       "      <th>FAVORITE_COUNT</th>\n",
       "      <th>TWEET_ID</th>\n",
       "      <th>TWEET_BY</th>\n",
       "      <th>TWEET_BY_ID</th>\n",
       "      <th>DATETIME</th>\n",
       "      <th>NUM_OF_URLS</th>\n",
       "      <th>RETWEETED</th>\n",
       "      <th>RETWEETED_TWEET_ID</th>\n",
       "      <th>RETWEETED_TWEET_BY</th>\n",
       "      <th>RETWEETED_TWEET_BY_ID</th>\n",
       "      <th>RETWEETED_TEXT</th>\n",
       "      <th>RETWEETED_URLS</th>\n",
       "      <th>RETWEETED_MEDIA</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>While you are at home socially distancing, che...</td>\n",
       "      <td>12</td>\n",
       "      <td>27</td>\n",
       "      <td>#1251250167837347841</td>\n",
       "      <td>SenJohnBarrasso</td>\n",
       "      <td>#202206694</td>\n",
       "      <td>Fri Apr 17 20:44:21 +0000 2020</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Americans have made incredible sacrifices to g...</td>\n",
       "      <td>19</td>\n",
       "      <td>60</td>\n",
       "      <td>#1250850558249926656</td>\n",
       "      <td>SenJohnBarrasso</td>\n",
       "      <td>#202206694</td>\n",
       "      <td>Thu Apr 16 18:16:27 +0000 2020</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>The @SenSchumer &amp;amp; @SpeakerPelosi paycheck ...</td>\n",
       "      <td>5360</td>\n",
       "      <td>14837</td>\n",
       "      <td>#1250772931015258112</td>\n",
       "      <td>SenJohnBarrasso</td>\n",
       "      <td>#202206694</td>\n",
       "      <td>Thu Apr 16 13:07:59 +0000 2020</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>The #PaycheckProtectionProgram is helping work...</td>\n",
       "      <td>21</td>\n",
       "      <td>80</td>\n",
       "      <td>#1250525652736004098</td>\n",
       "      <td>SenJohnBarrasso</td>\n",
       "      <td>#202206694</td>\n",
       "      <td>Wed Apr 15 20:45:23 +0000 2020</td>\n",
       "      <td>0</td>\n",
       "      <td>False</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>RT @SenateGOP: We all saw the MSNBC clip.\n",
       "\n",
       "STO...</td>\n",
       "      <td>108</td>\n",
       "      <td>0</td>\n",
       "      <td>#1250487291895742468</td>\n",
       "      <td>SenJohnBarrasso</td>\n",
       "      <td>#202206694</td>\n",
       "      <td>Wed Apr 15 18:12:57 +0000 2020</td>\n",
       "      <td>0</td>\n",
       "      <td>True</td>\n",
       "      <td>#1250477604097966084</td>\n",
       "      <td>SenateGOP</td>\n",
       "      <td>#14344823</td>\n",
       "      <td>We all saw the MSNBC clip.\n",
       "\n",
       "STOP using this “c...</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>I want to make sure we stop the spread of #cor...</td>\n",
       "      <td>32</td>\n",
       "      <td>87</td>\n",
       "      <td>#1250439402389532672</td>\n",
       "      <td>SenJohnBarrasso</td>\n",
       "      <td>#202206694</td>\n",
       "      <td>Wed Apr 15 15:02:40 +0000 2020</td>\n",
       "      <td>0</td>\n",
       "      <td>False</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>RT @SenateGOP: Conference Chairman @SenJohnBar...</td>\n",
       "      <td>44</td>\n",
       "      <td>0</td>\n",
       "      <td>#1250099010226053121</td>\n",
       "      <td>SenJohnBarrasso</td>\n",
       "      <td>#202206694</td>\n",
       "      <td>Tue Apr 14 16:30:04 +0000 2020</td>\n",
       "      <td>0</td>\n",
       "      <td>True</td>\n",
       "      <td>#1250078167609618432</td>\n",
       "      <td>SenateGOP</td>\n",
       "      <td>#14344823</td>\n",
       "      <td>Conference Chairman @SenJohnBarrasso on @Varne...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Congrats to my good friend @SenatorTimScott on...</td>\n",
       "      <td>10</td>\n",
       "      <td>32</td>\n",
       "      <td>#1250098752280571910</td>\n",
       "      <td>SenJohnBarrasso</td>\n",
       "      <td>#202206694</td>\n",
       "      <td>Tue Apr 14 16:29:02 +0000 2020</td>\n",
       "      <td>0</td>\n",
       "      <td>False</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>RT @EPWGOP: Chairman @SenJohnBarrasso: “@NRCgo...</td>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>#1249787975279374336</td>\n",
       "      <td>SenJohnBarrasso</td>\n",
       "      <td>#202206694</td>\n",
       "      <td>Mon Apr 13 19:54:07 +0000 2020</td>\n",
       "      <td>0</td>\n",
       "      <td>False</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ]
     },
     "execution_count": 224,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "execution_count": 224
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-14T18:48:09.638698Z",
     "start_time": "2024-04-14T18:48:09.634021Z"
    }
   },
   "cell_type": "code",
   "source": [
    "puncs = re.sub(\"[@#]\", \"\", string.punctuation)\n",
    "# print(puncs)\n",
    "translator = str.maketrans('', '', puncs)\n",
    "\n",
    "my_string = 'Hola!!! soy @Carlos, estoy en @EFPL #studying para DH-500...'\n",
    "new_string = my_string.translate(translator)\n",
    "print(new_string)"
   ],
   "id": "acee9e1e20269509",
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Hola soy @Carlos estoy en @EFPL #studying para DH500\n"
     ]
    }
   ],
   "execution_count": 258
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-14T18:42:05.775162Z",
     "start_time": "2024-04-14T18:42:05.488490Z"
    }
   },
   "cell_type": "code",
   "source": [
    "def find_at_word(text):\n",
    "    # Collect every @mention: 1 to 15 word characters preceded by '@'\n",
    "    word = re.findall(r'(?<=@)\\w{1,15}', text)\n",
    "    mentions = \" \".join(word)\n",
    "    # translator strips punctuation other than '@' and '#', which includes '_'\n",
    "    return mentions.translate(translator)\n",
    "\n",
    "tw_df['mentions'] = tw_df['TEXT'].apply(find_at_word)\n",
    "print(\"Extracting @mentions from dataframe columns:\")\n",
    "print(len(tw_df.mentions), tw_df.mentions[:9])"
   ],
   "id": "68ba54d11940a88c",
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Extracting @mentions from dataframe columns:\n",
      "92414 0    JumpShotMovie WyoAthletics StephenCurry30 jake...\n",
      "1                                                     \n",
      "2                             SenSchumer SpeakerPelosi\n",
      "3                                                     \n",
      "4                                            SenateGOP\n",
      "5                                        SpeakerPelosi\n",
      "6                   SenateGOP SenJohnBarrasso Varneyco\n",
      "7                                      SenatorTimScott\n",
      "8                        EPWGOP SenJohnBarrasso NRCgov\n",
      "Name: mentions, dtype: object\n"
     ]
    }
   ],
   "execution_count": 226
  },
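  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "A minimal sketch of the look-behind pattern used in `find_at_word`, applied to a made-up sample string instead of the dataset. The names `sample` and `MENTION_RE` are assumptions introduced only for illustration.\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "# Hypothetical sample text, not taken from the dataset\n",
    "sample = 'RT @SenateGOP: thanks @SenJohnBarrasso, see you on @Varneyco!'\n",
    "\n",
    "# Same pattern as find_at_word: 1 to 15 word characters preceded by '@'\n",
    "MENTION_RE = re.compile(r'(?<=@)\\w{1,15}')\n",
    "mentions = MENTION_RE.findall(sample)\n",
    "print(mentions)\n",
    "```\n",
    "\n",
    "Because the quantifier stops at 15 characters, a longer run of word characters after `@` would be cut off rather than rejected, which may matter if the text contains e-mail addresses or malformed handles."
   ],
   "id": "sketch-mention-regex"
  },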
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-14T18:48:46.778364Z",
     "start_time": "2024-04-14T18:48:46.742993Z"
    }
   },
   "cell_type": "code",
   "source": [
    "from collections import Counter\n",
    "from pprint import pprint\n",
    "\n",
    "fq_m = dict(Counter([m for i in tw_df.mentions for m in i.split()]))\n",
    "fq_m = dict(sorted(fq_m.items(), key=lambda x: x[1], reverse=True))\n",
    "\n",
    "print('TOTAL unique @mentions:', len(list(fq_m.items())))\n",
    "pprint(list(fq_m.items())[:9])"
   ],
   "id": "38f45b963ee0a166",
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "TOTAL unique @mentions: 17734\n",
      "[('realDonaldTrump', 1946),\n",
      " ('JoeBiden', 407),\n",
      " ('WhiteHouse', 307),\n",
      " ('SBAgov', 272),\n",
      " ('LennyDykstra', 271),\n",
      " ('SpeakerPelosi', 257),\n",
      " ('FoxNews', 252),\n",
      " ('iheartmindy', 250),\n",
      " ('jaketapper', 232)]\n"
     ]
    }
   ],
   "execution_count": 259
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "### Q3 - Descriptive statistics",
   "id": "d35bf59121c67002"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "#### Q3.1 - The Percentage of tweets that contain URLs",
   "id": "6ca70296deeba0c"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-14T18:48:52.612575Z",
     "start_time": "2024-04-14T18:48:52.605681Z"
    }
   },
   "cell_type": "code",
   "source": "print(tw_df.NUM_OF_URLS.describe())",
   "id": "aa80b763e471f2c4",
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "count    92414.000000\n",
      "mean         0.569621\n",
      "std          0.521984\n",
      "min          0.000000\n",
      "25%          0.000000\n",
      "50%          1.000000\n",
      "75%          1.000000\n",
      "max          5.000000\n",
      "Name: NUM_OF_URLS, dtype: float64\n"
     ]
    }
   ],
   "execution_count": 260
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-14T18:48:53.484281Z",
     "start_time": "2024-04-14T18:48:53.475185Z"
    }
   },
   "cell_type": "code",
   "source": [
    "tw_url = tw_df.loc[tw_df['NUM_OF_URLS'] > 0, 'NUM_OF_URLS'].count() / tw_df['TEXT'].count()\n",
    "# Use the percent format specifier: round(x, 3) * 100 can print floating-point noise\n",
    "print('The Percentage of tweets that contain URLs is:', f'{tw_url:.1%}')"
   ],
   "id": "11e63cee9c34439a",
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The Percentage of tweets that contain URLs is: 55.7%\n"
     ]
    }
   ],
   "execution_count": 261
  },
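  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "When a fraction is printed as a percentage, multiplying a rounded float by 100 can expose binary floating-point noise, whereas the `%` format specifier scales first and rounds last. A small sketch with a hypothetical value (`share`, `naive` and `clean` are illustrative names, not taken from the dataset):\n",
    "\n",
    "```python\n",
    "# Hypothetical share, chosen to expose the rounding artefact\n",
    "share = 0.07\n",
    "\n",
    "# Rounding first and scaling afterwards leaves the product unrounded\n",
    "naive = f'{round(share, 3) * 100}%'\n",
    "\n",
    "# The percent specifier multiplies by 100 and then rounds to one decimal\n",
    "clean = f'{share:.1%}'\n",
    "print(naive, clean)\n",
    "```\n",
    "\n",
    "The second form also keeps the number of decimals fixed, so every percentage in a report lines up the same way."
   ],
   "id": "sketch-percent-format"
  },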
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "#### Q3.2 - The Percentage of tweets that are (or contain) retweets",
   "id": "134541ad66fb3f6"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-14T18:48:57.640645Z",
     "start_time": "2024-04-14T18:48:57.634566Z"
    }
   },
   "cell_type": "code",
   "source": "print(tw_df['RETWEETED'].describe())",
   "id": "480e1018c82db48a",
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "count     92414\n",
      "unique        2\n",
      "top       False\n",
      "freq      82505\n",
      "Name: RETWEETED, dtype: object\n"
     ]
    }
   ],
   "execution_count": 262
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-14T18:49:00.313487Z",
     "start_time": "2024-04-14T18:49:00.300808Z"
    }
   },
   "cell_type": "code",
   "source": [
    "# Proxy check: RETWEETED_MEDIA appears to be populated only for retweets\n",
    "tw_re_tw = tw_df.loc[tw_df['RETWEETED_MEDIA'].notnull(), 'RETWEETED_MEDIA'].count() / tw_df['TEXT'].count()\n",
    "print('The Percentage of tweets that are or contain retweets:', f'{tw_re_tw:.1%}')"
   ],
   "id": "2b149b590ae262b9",
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The Percentage of tweets that are or contain retweets: 10.7%\n"
     ]
    }
   ],
   "execution_count": 263
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-14T18:49:11.925379Z",
     "start_time": "2024-04-14T18:49:11.915270Z"
    }
   },
   "cell_type": "code",
   "source": [
    "# RETWEETED is a boolean column, so its sum is the number of retweets\n",
    "n_retweets = int(tw_df['RETWEETED'].sum())\n",
    "tw_re_tw = n_retweets / tw_df['TEXT'].count()\n",
    "print('The Percentage of tweets that are or contain retweets:', f'{tw_re_tw:.1%}', 'TOTAL RETWEETS:', n_retweets)"
   ],
   "id": "c39056dadb6ada84",
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The Percentage of tweets that are or contain retweets: 10.7% TOTAL RETWEETS: 9909\n"
     ]
    }
   ],
   "execution_count": 264
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "#### Q3.3 - Table of the 30 most frequent hashtags",
   "id": "6db470c9ced1f9ba"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-14T18:49:17.174986Z",
     "start_time": "2024-04-14T18:49:16.929491Z"
    }
   },
   "cell_type": "code",
   "source": [
    "def find_gato_word(text):\n",
    "    # Collect every hashtag: word characters preceded by '#' (at most 280, the tweet length limit)\n",
    "    word = re.findall(r'(?<=#)\\w{1,280}', text)\n",
    "    hashtags = \" \".join(word)\n",
    "    return hashtags.translate(translator)\n",
    "\n",
    "tw_df['hashtags'] = tw_df['TEXT'].apply(find_gato_word)\n",
    "print(\"Extracting #hashtag from dataframe columns:\")\n",
    "print(tw_df.hashtags[:9], len(tw_df.hashtags))"
   ],
   "id": "70e237aa6da7f40a",
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Extracting #hashtag from dataframe columns:\n",
      "0                      KennySailors 4Kenny\n",
      "1                                         \n",
      "2                PaycheckProtectionProgram\n",
      "3            PaycheckProtectionProgram PPP\n",
      "4                                         \n",
      "5    coronavirus PaycheckProtectionProgram\n",
      "6                                         \n",
      "7                                         \n",
      "8                                         \n",
      "Name: hashtags, dtype: object 92414\n"
     ]
    }
   ],
   "execution_count": 265
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-14T18:42:06.156768Z",
     "start_time": "2024-04-14T18:42:06.134086Z"
    }
   },
   "cell_type": "code",
   "source": [
    "fq_h = dict(Counter([m for i in tw_df.hashtags for m in i.split()]))\n",
    "fq_h = dict(sorted(fq_h.items(), key=lambda x: x[1], reverse=True))\n",
    "pprint(list(fq_h.items())[:30])\n",
    "print('TOTAL unique #hashtags:', len(list(fq_h.items())))"
   ],
   "id": "874735eee709faf7",
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[('COVID19', 2100),\n",
      " ('coronavirus', 1350),\n",
      " ('PaycheckProtectionProgram', 598),\n",
      " ('Coronavirus', 302),\n",
      " ('CARESAct', 191),\n",
      " ('InThisTogetherOhio', 190),\n",
      " ('PPP', 189),\n",
      " ('COVIDー19', 176),\n",
      " ('AmericanIdol', 165),\n",
      " ('CNNTownHall', 120),\n",
      " ('FamiliesFirst', 119),\n",
      " ('madamextheatre', 116),\n",
      " ('Ohio', 109),\n",
      " ('DennisMillerOption', 106),\n",
      " ('Covid19', 105),\n",
      " ('911onFox', 99),\n",
      " ('smallbiz', 95),\n",
      " ('watchingwithrichard', 87),\n",
      " ('TheWestWing', 85),\n",
      " ('EarthDay', 83),\n",
      " ('covid19', 81),\n",
      " ('CombatCOVID19Challenge', 80),\n",
      " ('ICYMI', 79),\n",
      " ('NY21', 78),\n",
      " ('DemDebate', 77),\n",
      " ('Vote4Mindy', 77),\n",
      " ('NV03', 75),\n",
      " ('2020Census', 74),\n",
      " ('foxnews', 70),\n",
      " ('CombatCOVID19', 69)]\n",
      "TOTAL unique #hashtags: 7695\n"
     ]
    }
   ],
   "execution_count": 234
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-14T18:42:06.179899Z",
     "start_time": "2024-04-14T18:42:06.158049Z"
    }
   },
   "cell_type": "code",
   "source": [
    "l_fq_h = [(i + 1, f'#{k}', int(v)) for i, (k, v) in enumerate(fq_h.items())]\n",
    "pprint(l_fq_h[:30])"
   ],
   "id": "f21f7fdd7d62d3b5",
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[(1, '#COVID19', 2100),\n",
      " (2, '#coronavirus', 1350),\n",
      " (3, '#PaycheckProtectionProgram', 598),\n",
      " (4, '#Coronavirus', 302),\n",
      " (5, '#CARESAct', 191),\n",
      " (6, '#InThisTogetherOhio', 190),\n",
      " (7, '#PPP', 189),\n",
      " (8, '#COVIDー19', 176),\n",
      " (9, '#AmericanIdol', 165),\n",
      " (10, '#CNNTownHall', 120),\n",
      " (11, '#FamiliesFirst', 119),\n",
      " (12, '#madamextheatre', 116),\n",
      " (13, '#Ohio', 109),\n",
      " (14, '#DennisMillerOption', 106),\n",
      " (15, '#Covid19', 105),\n",
      " (16, '#911onFox', 99),\n",
      " (17, '#smallbiz', 95),\n",
      " (18, '#watchingwithrichard', 87),\n",
      " (19, '#TheWestWing', 85),\n",
      " (20, '#EarthDay', 83),\n",
      " (21, '#covid19', 81),\n",
      " (22, '#CombatCOVID19Challenge', 80),\n",
      " (23, '#ICYMI', 79),\n",
      " (24, '#NY21', 78),\n",
      " (25, '#DemDebate', 77),\n",
      " (26, '#Vote4Mindy', 77),\n",
      " (27, '#NV03', 75),\n",
      " (28, '#2020Census', 74),\n",
      " (29, '#foxnews', 70),\n",
      " (30, '#CombatCOVID19', 69)]\n"
     ]
    }
   ],
   "execution_count": 235
  },
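  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "The table above counts `COVID19`, `Covid19` and `covid19` as separate hashtags. One possible refinement, sketched here on a hypothetical list rather than on `tw_df.hashtags`, is to fold case before counting; `tags`, `folded` and `top` are illustrative names.\n",
    "\n",
    "```python\n",
    "from collections import Counter\n",
    "\n",
    "# Hypothetical hashtag tokens; the real input would come from tw_df.hashtags\n",
    "tags = ['COVID19', 'covid19', 'Covid19', 'PPP', 'coronavirus', 'Coronavirus']\n",
    "\n",
    "# Fold case so that spelling variants share one bucket\n",
    "folded = Counter(tag.casefold() for tag in tags)\n",
    "\n",
    "# most_common returns (hashtag, count) pairs sorted by descending count\n",
    "top = folded.most_common(2)\n",
    "print(top)\n",
    "```\n",
    "\n",
    "Case folding would not merge variants that differ in other characters, such as `COVIDー19` in the table above, so those would still need a separate normalisation step."
   ],
   "id": "sketch-hashtag-casefold"
  },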
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "### Q4 - Filtering by Media & Influencers",
   "id": "25e37470cb8749ff"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "#### Q4.1 - Tweets generated by all the 20 media accounts",
   "id": "4294e53da1cba100"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-14T18:46:06.786188Z",
     "start_time": "2024-04-14T18:46:06.782790Z"
    }
   },
   "cell_type": "code",
   "source": "print(media_sources[:9], len(media_sources))",
   "id": "cc903f624a16eb93",
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['usatoday', 'nytimes', 'theguardian', 'businessinsider', 'abcnews', 'WhiteHouse', 'cnni', 'usnews', 'TheLastofUsNewsUSNewsMoney'] 20\n"
     ]
    }
   ],
   "execution_count": 249
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-14T18:46:06.992568Z",
     "start_time": "2024-04-14T18:46:06.969166Z"
    }
   },
   "cell_type": "code",
   "source": [
    "tw_by_media = tw_df['TWEET_BY'].isin(media_sources)\n",
    "# The mean of a boolean mask is the share of True rows\n",
    "print('The Percentage of *ALL* TWEET_BY SELECTED MEDIA:', f'{tw_by_media.mean():.1%}', 'TOTAL TWEET_BY MEDIA:', tw_by_media.sum(), 'TOTAL TWEETS:', tw_df['TEXT'].count())"
   ],
   "id": "47530a55dd246920",
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The Percentage of *ALL* TWEET_BY SELECTED MEDIA: 3.1% TOTAL TWEET_BY MEDIA: 2880 TOTAL TWEETS: 92414\n"
     ]
    }
   ],
   "execution_count": 250
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-14T18:46:07.482512Z",
     "start_time": "2024-04-14T18:46:07.128789Z"
    }
   },
   "cell_type": "code",
   "source": [
    "# Direct tweets: authored by a selected media account and not a retweet (vectorised, no row-wise apply)\n",
    "tw_df['DIRECT_BY_MEDIA'] = tw_df['TWEET_BY'].isin(media_sources) & ~tw_df['RETWEETED']\n",
    "d_by_media = int(tw_df['DIRECT_BY_MEDIA'].sum())\n",
    "print('The Percentage of *DIRECTLY* TWEET_BY SELECTED MEDIA:', f'{d_by_media / len(tw_df):.1%}', 'TOTAL TWEET_BY MEDIA:', d_by_media, 'TOTAL TWEETS:', tw_df['TEXT'].count())"
   ],
   "id": "e09c336b433f7491",
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The Percentage of *DIRECTLY* TWEET_BY SELECTED MEDIA: 3.0% TOTAL TWEET_BY MEDIA: 2776 TOTAL TWEETS: 92414\n"
     ]
    }
   ],
   "execution_count": 251
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "#### Q4.2 - Tweets generated by all the 20 influencers",
   "id": "677aacaa708b5008"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-14T18:46:07.486730Z",
     "start_time": "2024-04-14T18:46:07.483976Z"
    }
   },
   "cell_type": "code",
   "source": "print(influencers[:9], len(influencers))",
   "id": "87ec7f2ee26191a4",
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['realDonaldTrump', 'JoeBiden', 'POTUS', 'MikePence', 'BernieSanders', 'AndrewYang', 'TuckerCarlson', 'seanhannity', 'MikeTyson'] 20\n"
     ]
    }
   ],
   "execution_count": 252
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-14T18:46:07.600862Z",
     "start_time": "2024-04-14T18:46:07.582077Z"
    }
   },
   "cell_type": "code",
   "source": [
    "tw_by_influ = tw_df['TWEET_BY'].isin(influencers)\n",
    "print('The Percentage of *ALL* TWEET_BY SELECTED INFLUENCERS:', f'{round((sum(tw_by_influ)/len(tw_by_influ) * 100), 3)}%', 'TOTAL TWEET_BY INFLUENCERS:', sum(tw_by_influ), 'TOTAL TWEETS:', tw_df['TEXT'].count())"
   ],
   "id": "9c76c100eabe4bef",
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The Percentage of *ALL* TWEET_BY SELECTED INFLUENCERS: 5.915% TOTAL TWEET_BY INFLUENCERS: 5466 TOTAL TWEETS: 92414\n"
     ]
    }
   ],
   "execution_count": 253
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-14T18:46:08.097943Z",
     "start_time": "2024-04-14T18:46:07.737084Z"
    }
   },
   "cell_type": "code",
   "source": [
    "def direct_tweet(s):\n",
    "    # Direct tweet: authored by a selected influencer and not a retweet\n",
    "    return (s['TWEET_BY'] in influencers) and (s['RETWEETED'] == False)\n",
    "\n",
    "tw_df['DIRECT_BY_INFLUENCERS'] = tw_df.apply(direct_tweet, axis=1)\n",
    "# Use influencer-specific names so the media counts from Q4.1 are not overwritten\n",
    "d_by_influ = sum(tw_df['DIRECT_BY_INFLUENCERS'])\n",
    "print('The Percentage of *DIRECTLY* TWEET_BY SELECTED INFLUENCERS:', f'{d_by_influ/len(tw_by_influ):.1%}', 'TOTAL TWEET_BY INFLUENCERS:', d_by_influ, 'TOTAL TWEETS:', tw_df['TEXT'].count())"
   ],
   "id": "60420845b454170",
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The Percentage of *DIRECTLY* TWEET_BY SELECTED INFLUENCERS: 5.2% TOTAL TWEET_BY INFLUENCERS: 4792 TOTAL TWEETS: 92414\n"
     ]
    }
   ],
   "execution_count": 254
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "#### Q4.3 - Tweets generated by all the 20 media sources as RETWEETS",
   "id": "e1ad45585a6ba533"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-14T18:46:08.459036Z",
     "start_time": "2024-04-14T18:46:08.099276Z"
    }
   },
   "cell_type": "code",
   "source": [
    "def in_direct_tweet(s):\n",
    "    # Indirect tweet: posted by a selected media account as a retweet\n",
    "    return (s['TWEET_BY'] in media_sources) and (s['RETWEETED'] == True)\n",
    "\n",
    "tw_df['INDIRECT_BY_MEDIA'] = tw_df.apply(in_direct_tweet, axis=1)\n",
    "total_ind_by_media = len(tw_df['INDIRECT_BY_MEDIA'])\n",
    "ind_by_media = sum(tw_df['INDIRECT_BY_MEDIA'])\n",
    "print('The Percentage of *INDIRECTLY* TWEET_BY SELECTED MEDIA:', f'{round(((ind_by_media/total_ind_by_media) * 100), 6)}%', 'TOTAL RETWEETED_BY MEDIA:', ind_by_media, 'TOTAL TWEETS:', tw_df['TEXT'].count())"
   ],
   "id": "b066d5a98e7effdf",
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The Percentage of *INDIRECTLY* TWEET_BY SELECTED MEDIA: 0.112537% TOTAL RETWEETED_BY MEDIA: 104 TOTAL TWEETS: 92414\n"
     ]
    }
   ],
   "execution_count": 255
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "#### Q4.4 - Tweets generated by all the 20 influencers as RETWEETS",
   "id": "50ea65457a6f8508"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-14T18:46:08.825478Z",
     "start_time": "2024-04-14T18:46:08.460163Z"
    }
   },
   "cell_type": "code",
   "source": [
    "def in_direct_tweet(s):\n",
    "    # Indirect tweet: posted by a selected influencer as a retweet\n",
    "    return (s['TWEET_BY'] in influencers) and (s['RETWEETED'] == True)\n",
    "\n",
    "tw_df['INDIRECT_BY_INFLU'] = tw_df.apply(in_direct_tweet, axis=1)\n",
    "total_ind_by_influ = len(tw_df['INDIRECT_BY_INFLU'])\n",
    "ind_by_influ = sum(tw_df['INDIRECT_BY_INFLU'])\n",
    "print('The Percentage of *INDIRECTLY* TWEET_BY SELECTED INFLUENCERS:', f'{round(((ind_by_influ/total_ind_by_influ) * 100), 6)}%', 'TOTAL RETWEETED_BY INFLUENCERS:', ind_by_influ, 'TOTAL TWEETS:', tw_df['TEXT'].count())"
   ],
   "id": "2930ba5bfcae31c8",
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The Percentage of *INDIRECTLY* TWEET_BY SELECTED INFLUENCERS: 0.729327% TOTAL RETWEETED_BY INFLUENCERS: 674 TOTAL TWEETS: 92414\n"
     ]
    }
   ],
   "execution_count": 256
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "### Q5 - Discussion of the results of points 3 & 4\n",
    "\n",
    "Are any of the results unexpected? Why?"
   ],
   "id": "3da960163cc539a"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "It is interesting that, although more than half of the tweets contain URLs (55.7%), only 10.7% of them are retweets. This suggests that **most tweets cite or promote sources outside the Twitter ecosystem.**\n",
    "\n",
    "Additionally, subsampling my *arbitrarily* selected groups of MEDIA SOURCES and INFLUENCERS yields insights similar to those presented in class last Friday:\n",
    "\n",
    "- These arbitrarily chosen MEDIA SOURCES alone produced ~3% of the tweets.\n",
    "- Nearly all of these tweets are **DIRECT** tweets (2,776 of 2,880); that is, the media accounts write them themselves.\n",
    "- The **MEDIA SOURCES** are active, but not as active as the **INFLUENCERS**: the latter group produces almost twice as many tweets (~6%).\n",
    "- **INFLUENCERS** retweet more than the **MEDIA SOURCES**. However, most of their tweets are also **DIRECT**.\n",
    "- MEDIA retweets account for ~0.11% of all tweets, compared with ~0.73% for INFLUENCERS retweets."
   ],
   "id": "48531149a66baebf"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-14T22:55:31.509477Z",
     "start_time": "2024-04-14T22:55:31.504609Z"
    }
   },
   "cell_type": "code",
   "source": "print('Das is todo, merci!')",
   "id": "992263206e59f051",
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Das is todo, merci!\n"
     ]
    }
   ],
   "execution_count": 272
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": "",
   "id": "48a770ca733f9401"
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}