{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Briefly\n", "\n", "\n", "### __ Problem Statement __\n", "- Obtain news from google news articles\n", "- Sammarize the articles within 60 words\n", "- Obtain keywords from the articles\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "##### Importing all the necessary libraries required to run the following code " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from gnewsclient import gnewsclient # for fetching google news\n", "from newspaper import Article # to obtain text from news articles\n", "from transformers import pipeline # to summarize text\n", "import spacy # for named entity recognition" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Load sshleifer/distilbart-cnn-12-6 model" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "def load_model(): \n", " model = pipeline('summarization')\n", " return model\n", "data = gnewsclient.NewsClient(max_results=0)\n", "nlp = spacy.load(\"en_core_web_lg\") " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Obtain urls and it's content" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "def getNews(topic,location): \n", " count=0\n", " contents=[]\n", " titles=[]\n", " authors=[]\n", " urls=[]\n", " data = gnewsclient.NewsClient(language='english',location=location,topic=topic,max_results=10) \n", " news = data.get_news() \n", " for item in news:\n", " url=item['link']\n", " article = Article(url)\n", " try:\n", " article.download()\n", " article.parse()\n", " temp=item['title'][::-1]\n", " index=temp.find(\"-\")\n", " temp=temp[:index-1][::-1]\n", " urls.append(url)\n", " contents.append(article.text)\n", " titles.append(item['title'][:-index-1]) \n", " authors.append(temp)\n", " count+=1\n", " if(count==5):\n", " break\n", " except:\n", " continue \n", " return contents,titles,authors,urls " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Summarizes the content- minimum word limit 30 and maximum 60" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "def getNewsSummary(contents,summarizer): \n", " summaries=[] \n", " for content in contents:\n", " minimum=len(content.split())\n", " summaries.append(summarizer(content,max_length=60,min_length=min(30,minimum),do_sample=False,truncation=True)[0]['summary_text']) \n", " return summaries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Named Entity Recognition" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Obtain 4 keywords from content (person,organisation or geopolitical entity) \n", "def generateKeyword(contents): \n", " keywords=[]\n", " words=[] \n", " labels=[\"PERSON\",\"ORG\",\"GPE\"]\n", " for content in contents:\n", " doc=nlp(content)\n", " keys=[]\n", " limit=0\n", " for ent in doc.ents:\n", " key=ent.text.upper()\n", " label=ent.label_\n", " if(key not in words and key not in keywords and label in labels): \n", " keys.append(key)\n", " limit+=1\n", " for element in key.split():\n", " words.append(element)\n", " if(limit==4):\n", " keywords.append(keys)\n", " break \n", " return keywords\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Displaying keywords " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "def printKeywords(keywords):\n", " for keyword in keywords:\n", " print(keyword)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Displaying the Summary with keywords in it highlighted" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "def printSummary(summaries,titles):\n", " for summary,title in zip(summaries,titles):\n", " print(title.upper(),'\\n')\n", " print(summary)\n", " print(\"\\n\\n\")" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)\n" ] } ], "source": [ "summarizer=load_model() " ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "contents,titles,authors,urls=getNews(\"Sports\",\"India\")" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "summaries=getNewsSummary(contents,summarizer)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "keywords=generateKeyword(contents)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['INDIA', 'SCOTLAND', 'SUPER 12', 'DUBAI']\n", "[\"VIRAT KOHLI'S\", 'TEAM INDIA', 'DHONI', 'UAE']\n", "['AUSTRALIA', 'AFGHANISTAN', 'CRICKET AUSTRALIA', 'CRICBUZZ STAFF •']\n", "['GARY STEAD', 'TRENT BOULT', 'COLIN DE GRANDHOMME', 'BLACKCAPS']\n", "['DWAYNE BRAVO', 'SRI LANKA', 'ICC', 'THE WEST INDIES']\n" ] } ], "source": [ "printKeywords(keywords)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "T20 WORLD CUP 2021, IND VS SCO PREVIEW: INDIA FACE SCOTLAND, EYE ANOTHER BIG WIN \n", "\n", " India take on Scotland in a Super 12 clash of the 2021 T20 World Cup in Dubai on Friday . Virat Kohli-led side beat Afghanistan by 66 runs in Abu Dhabi on Wednesday . India must win their remaining two games while maintaining high run rates and hope for New Zealand to\n", "\n", "\n", "\n", "‘THERE ARE MANY CANDIDATES BUT HE’S THE BEST': SEHWAG PICKS NEXT INDIA CAPTAIN AFTER KOHLI STEPS DOWN AT END OF T20 WC \n", "\n", " Virat Kohli set to step down as T20I captain after this World Cup in UAE and Oman . Many experts are anticipating his deputy Rohit Sharma to fill up the position . Former India opener Virender Sehwag backed Rohit as the ideal candidate .\n", "\n", "\n", "\n", "ONE-OFF TEST VS AFGHANISTAN POSTPONED, CONFIRMS CRICKET AUSTRALIA | CRICBUZZ.COM - CRICBUZZ \n", "\n", " Cricket Australia's one-off Test against Afghanistan has officially been postponed . The historic Test has been hanging in the balance since the CA revealed that they wouldn't support the Taliban government's stance against the inclusion of women in sports . Instead of cancelling the Test match, CA has vowed to\n", "\n", "\n", "\n", "NEW ZEALAND INCLUDE FIVE SPINNERS FOR INDIA TOUR, TRENT BOULT OPTS OUT CITING BUBBLE FATIGUE \n", "\n", " New Zealand name five spinners in 15-man squad for two-Test series against India . Senior pacer Trent Boult and fast-bowling all-rounder Colin de Grandhomme will miss tour due to bio-bubble fatigue . Ajaz Patel, Will Somerville and\n", "\n", "\n", "\n", "T20 WORLD CUP 2021: WEST INDIES AND CHENNAI SUPER KINGS ALL-ROUNDER DWAYNE BRAVO TO RETIRE AFTER SHOWPIECE... \n", "\n", " West Indies all-rounder Dwayne Bravo will hang his boots at the end of the ICC T20 World Cup 2021 . Bravo told ICC on the post-match Facebook Live show that he will be drawing the curtains on his international career . West Indies lost to Sri Lanka by 20 runs in\n", "\n", "\n", "\n" ] } ], "source": [ "printSummary(summaries,titles)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }