Devika Nair M committed on
Commit
cd24a6a
1 Parent(s): 548f20a

Add files via upload

Files changed (1)
  1. Briefly.ipynb +337 -0
Briefly.ipynb ADDED
@@ -0,0 +1,337 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "## Briefly\n",
8
+ "\n",
9
+ "\n",
10
+ "### __Problem Statement__\n",
11
+ "- Obtain news articles from Google News\n",
12
+ "- Summarize the articles within 60 words\n",
13
+ "- Obtain keywords from the articles\n",
14
+ "\n",
15
+ "\n",
16
+ "\n",
17
+ "\n",
18
+ "\n",
19
+ "\n",
20
+ "\n",
21
+ "\n",
22
+ "\n",
23
+ "\n",
24
+ "##### Import the libraries required to run the code below"
25
+ ]
26
+ },
27
+ {
28
+ "cell_type": "code",
29
+ "execution_count": 1,
30
+ "metadata": {},
31
+ "outputs": [],
32
+ "source": [
33
+ "from gnewsclient import gnewsclient # for fetching google news\n",
34
+ "from newspaper import Article # to obtain text from news articles\n",
35
+ "from transformers import pipeline # to summarize text\n",
36
+ "import spacy # for named entity recognition"
37
+ ]
38
+ },
39
+ {
40
+ "cell_type": "markdown",
41
+ "metadata": {},
42
+ "source": [
43
+ "##### Load the summarization pipeline (defaults to sshleifer/distilbart-cnn-12-6) and the spaCy model"
44
+ ]
45
+ },
46
+ {
47
+ "cell_type": "code",
48
+ "execution_count": 2,
49
+ "metadata": {},
50
+ "outputs": [],
51
+ "source": [
52
+ "def load_model():\n",
53
+ "    model = pipeline('summarization')\n",
54
+ "    return model\n",
55
+ "data = gnewsclient.NewsClient(max_results=0)\n",
56
+ "nlp = spacy.load(\"en_core_web_lg\")"
57
+ ]
58
+ },
59
+ {
60
+ "cell_type": "markdown",
61
+ "metadata": {},
62
+ "source": [
63
+ "##### Obtain URLs and their content"
64
+ ]
65
+ },
66
+ {
67
+ "cell_type": "code",
68
+ "execution_count": 3,
69
+ "metadata": {},
70
+ "outputs": [],
71
+ "source": [
72
+ "def getNews(topic,location):\n",
73
+ "    count=0\n",
74
+ "    contents=[]\n",
75
+ "    titles=[]\n",
76
+ "    authors=[]\n",
77
+ "    urls=[]\n",
78
+ "    data = gnewsclient.NewsClient(language='english',location=location,topic=topic,max_results=10)\n",
79
+ "    news = data.get_news()\n",
80
+ "    for item in news:\n",
81
+ "        url=item['link']\n",
82
+ "        article = Article(url)\n",
83
+ "        try:\n",
84
+ "            article.download()\n",
85
+ "            article.parse()\n",
86
+ "            # titles look like '<headline> - <publisher>'; split on the last ' - '\n",
87
+ "            parts=item['title'].rsplit(' - ',1)\n",
88
+ "            publisher=parts[1] if len(parts)==2 else ''\n",
89
+ "            urls.append(url)\n",
90
+ "            contents.append(article.text)\n",
91
+ "            titles.append(parts[0])\n",
92
+ "            authors.append(publisher)\n",
93
+ "            count+=1\n",
94
+ "            if count==5:  # keep at most five articles\n",
95
+ "                break\n",
96
+ "        except Exception:  # skip articles that fail to download or parse\n",
97
+ "            continue\n",
98
+ "    return contents,titles,authors,urls"
99
+ ]
100
+ },
101
+ {
102
+ "cell_type": "markdown",
103
+ "metadata": {},
104
+ "source": [
105
+ "##### Summarize the content (min_length=30, max_length=60; both are counted in tokens, not words)"
106
+ ]
107
+ },
108
+ {
109
+ "cell_type": "code",
110
+ "execution_count": 4,
111
+ "metadata": {},
112
+ "outputs": [],
113
+ "source": [
114
+ "def getNewsSummary(contents,summarizer):\n",
115
+ "    summaries=[]\n",
116
+ "    for content in contents:\n",
117
+ "        minimum=len(content.split())\n",
118
+ "        summaries.append(summarizer(content,max_length=60,min_length=min(30,minimum),do_sample=False,truncation=True)[0]['summary_text'])\n",
119
+ "    return summaries"
120
+ ]
121
+ },
122
+ {
123
+ "cell_type": "markdown",
124
+ "metadata": {},
125
+ "source": [
126
+ "##### Named Entity Recognition"
127
+ ]
128
+ },
129
+ {
130
+ "cell_type": "code",
131
+ "execution_count": 5,
132
+ "metadata": {},
133
+ "outputs": [],
134
+ "source": [
135
+ "# Obtain up to 4 keywords per article (person, organisation or geopolitical entity)\n",
136
+ "def generateKeyword(contents):\n",
137
+ "    keywords=[]\n",
138
+ "    words=[]  # individual words already used, shared across articles\n",
139
+ "    labels=[\"PERSON\",\"ORG\",\"GPE\"]\n",
140
+ "    for content in contents:\n",
141
+ "        doc=nlp(content)\n",
142
+ "        keys=[]\n",
143
+ "        limit=0\n",
144
+ "        for ent in doc.ents:\n",
145
+ "            key=ent.text.upper()\n",
146
+ "            label=ent.label_\n",
147
+ "            if key not in words and label in labels:\n",
148
+ "                keys.append(key)\n",
149
+ "                limit+=1\n",
150
+ "                for element in key.split():\n",
151
+ "                    words.append(element)\n",
152
+ "            if limit==4:\n",
153
+ "                break\n",
154
+ "        keywords.append(keys)  # one list per article, even with fewer than 4 entities\n",
155
+ "    return keywords\n",
156
+ " "
157
+ ]
158
+ },
159
+ {
160
+ "cell_type": "markdown",
161
+ "metadata": {},
162
+ "source": [
163
+ "##### Display the keywords"
164
+ ]
165
+ },
166
+ {
167
+ "cell_type": "code",
168
+ "execution_count": 6,
169
+ "metadata": {},
170
+ "outputs": [],
171
+ "source": [
172
+ "def printKeywords(keywords):\n",
173
+ "    for keyword in keywords:\n",
174
+ "        print(keyword)"
175
+ ]
176
+ },
177
+ {
178
+ "cell_type": "markdown",
179
+ "metadata": {},
180
+ "source": [
181
+ "##### Display each title followed by its summary"
182
+ ]
183
+ },
184
+ {
185
+ "cell_type": "code",
186
+ "execution_count": 7,
187
+ "metadata": {},
188
+ "outputs": [],
189
+ "source": [
190
+ "def printSummary(summaries,titles):\n",
191
+ "    for summary,title in zip(summaries,titles):\n",
192
+ "        print(title.upper(),'\\n')\n",
193
+ "        print(summary)\n",
194
+ "        print(\"\\n\\n\")"
195
+ ]
196
+ },
197
+ {
198
+ "cell_type": "code",
199
+ "execution_count": 8,
200
+ "metadata": {},
201
+ "outputs": [
202
+ {
203
+ "name": "stderr",
204
+ "output_type": "stream",
205
+ "text": [
206
+ "No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)\n"
207
+ ]
208
+ }
209
+ ],
210
+ "source": [
211
+ "summarizer=load_model() "
212
+ ]
213
+ },
214
+ {
215
+ "cell_type": "code",
216
+ "execution_count": 9,
217
+ "metadata": {},
218
+ "outputs": [],
219
+ "source": [
220
+ "contents,titles,authors,urls=getNews(\"Sports\",\"India\")"
221
+ ]
222
+ },
223
+ {
224
+ "cell_type": "code",
225
+ "execution_count": 10,
226
+ "metadata": {},
227
+ "outputs": [],
228
+ "source": [
229
+ "summaries=getNewsSummary(contents,summarizer)"
230
+ ]
231
+ },
232
+ {
233
+ "cell_type": "code",
234
+ "execution_count": 11,
235
+ "metadata": {},
236
+ "outputs": [],
237
+ "source": [
238
+ "keywords=generateKeyword(contents)"
239
+ ]
240
+ },
241
+ {
242
+ "cell_type": "code",
243
+ "execution_count": 12,
244
+ "metadata": {},
245
+ "outputs": [
246
+ {
247
+ "name": "stdout",
248
+ "output_type": "stream",
249
+ "text": [
250
+ "['INDIA', 'SCOTLAND', 'SUPER 12', 'DUBAI']\n",
251
+ "[\"VIRAT KOHLI'S\", 'TEAM INDIA', 'DHONI', 'UAE']\n",
252
+ "['AUSTRALIA', 'AFGHANISTAN', 'CRICKET AUSTRALIA', 'CRICBUZZ STAFF •']\n",
253
+ "['GARY STEAD', 'TRENT BOULT', 'COLIN DE GRANDHOMME', 'BLACKCAPS']\n",
254
+ "['DWAYNE BRAVO', 'SRI LANKA', 'ICC', 'THE WEST INDIES']\n"
255
+ ]
256
+ }
257
+ ],
258
+ "source": [
259
+ "printKeywords(keywords)"
260
+ ]
261
+ },
262
+ {
263
+ "cell_type": "code",
264
+ "execution_count": 13,
265
+ "metadata": {},
266
+ "outputs": [
267
+ {
268
+ "name": "stdout",
269
+ "output_type": "stream",
270
+ "text": [
271
+ "T20 WORLD CUP 2021, IND VS SCO PREVIEW: INDIA FACE SCOTLAND, EYE ANOTHER BIG WIN \n",
272
+ "\n",
273
+ " India take on Scotland in a Super 12 clash of the 2021 T20 World Cup in Dubai on Friday . Virat Kohli-led side beat Afghanistan by 66 runs in Abu Dhabi on Wednesday . India must win their remaining two games while maintaining high run rates and hope for New Zealand to\n",
274
+ "\n",
275
+ "\n",
276
+ "\n",
277
+ "‘THERE ARE MANY CANDIDATES BUT HE’S THE BEST': SEHWAG PICKS NEXT INDIA CAPTAIN AFTER KOHLI STEPS DOWN AT END OF T20 WC \n",
278
+ "\n",
279
+ " Virat Kohli set to step down as T20I captain after this World Cup in UAE and Oman . Many experts are anticipating his deputy Rohit Sharma to fill up the position . Former India opener Virender Sehwag backed Rohit as the ideal candidate .\n",
280
+ "\n",
281
+ "\n",
282
+ "\n",
283
+ "ONE-OFF TEST VS AFGHANISTAN POSTPONED, CONFIRMS CRICKET AUSTRALIA | CRICBUZZ.COM - CRICBUZZ \n",
284
+ "\n",
285
+ " Cricket Australia's one-off Test against Afghanistan has officially been postponed . The historic Test has been hanging in the balance since the CA revealed that they wouldn't support the Taliban government's stance against the inclusion of women in sports . Instead of cancelling the Test match, CA has vowed to\n",
286
+ "\n",
287
+ "\n",
288
+ "\n",
289
+ "NEW ZEALAND INCLUDE FIVE SPINNERS FOR INDIA TOUR, TRENT BOULT OPTS OUT CITING BUBBLE FATIGUE \n",
290
+ "\n",
291
+ " New Zealand name five spinners in 15-man squad for two-Test series against India . Senior pacer Trent Boult and fast-bowling all-rounder Colin de Grandhomme will miss tour due to bio-bubble fatigue . Ajaz Patel, Will Somerville and\n",
292
+ "\n",
293
+ "\n",
294
+ "\n",
295
+ "T20 WORLD CUP 2021: WEST INDIES AND CHENNAI SUPER KINGS ALL-ROUNDER DWAYNE BRAVO TO RETIRE AFTER SHOWPIECE... \n",
296
+ "\n",
297
+ " West Indies all-rounder Dwayne Bravo will hang his boots at the end of the ICC T20 World Cup 2021 . Bravo told ICC on the post-match Facebook Live show that he will be drawing the curtains on his international career . West Indies lost to Sri Lanka by 20 runs in\n",
298
+ "\n",
299
+ "\n",
300
+ "\n"
301
+ ]
302
+ }
303
+ ],
304
+ "source": [
305
+ "printSummary(summaries,titles)"
306
+ ]
307
+ },
308
+ {
309
+ "cell_type": "code",
310
+ "execution_count": null,
311
+ "metadata": {},
312
+ "outputs": [],
313
+ "source": []
314
+ }
315
+ ],
316
+ "metadata": {
317
+ "kernelspec": {
318
+ "display_name": "Python 3",
319
+ "language": "python",
320
+ "name": "python3"
321
+ },
322
+ "language_info": {
323
+ "codemirror_mode": {
324
+ "name": "ipython",
325
+ "version": 3
326
+ },
327
+ "file_extension": ".py",
328
+ "mimetype": "text/x-python",
329
+ "name": "python",
330
+ "nbconvert_exporter": "python",
331
+ "pygments_lexer": "ipython3",
332
+ "version": "3.8.5"
333
+ }
334
+ },
335
+ "nbformat": 4,
336
+ "nbformat_minor": 4
337
+ }
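
Note on the keyword step: the selection logic in `generateKeyword` can be tried without loading spaCy. The sketch below is a minimal stand-in, not the notebook's code: `pick_keywords` and `KEYWORD_LABELS` are hypothetical names, and the entities are passed in as plain `(text, label)` tuples instead of coming from `doc.ents`. It assumes the same rules as the notebook: keep only PERSON, ORG and GPE entities, upper-case them, skip a key that exactly equals a word already recorded, and stop at four keys per article.

```python
from typing import Iterable, List, Set, Tuple

# Entity labels treated as keywords (same three labels as the notebook).
KEYWORD_LABELS = {"PERSON", "ORG", "GPE"}


def pick_keywords(
    entities: Iterable[Tuple[str, str]],
    seen_words: Set[str],
    limit: int = 4,
) -> List[str]:
    """Return up to `limit` upper-cased entity texts with an accepted label.

    `seen_words` is shared between calls so that a single-word key used for
    one article is not repeated for a later one. Multi-word keys are only
    compared as a whole, which mirrors the notebook's check.
    """
    keys: List[str] = []
    for text, label in entities:
        key = text.upper()
        if label not in KEYWORD_LABELS or key in seen_words:
            continue
        keys.append(key)
        # Record each word of the key, as the notebook does.
        seen_words.update(key.split())
        if len(keys) == limit:
            break
    return keys


if __name__ == "__main__":
    seen: Set[str] = set()
    first = [("India", "GPE"), ("Friday", "DATE"), ("Scotland", "GPE"),
             ("India", "GPE"), ("Virat Kohli", "PERSON"), ("ICC", "ORG"),
             ("Dubai", "GPE")]
    second = [("India", "GPE"), ("Dhoni", "PERSON")]
    print(pick_keywords(first, seen))
    print(pick_keywords(second, seen))
```

Passing the same `seen_words` set to every call reproduces the cross-article de-duplication; passing a fresh set per article would make each article's keywords independent.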