vtiyyal1 commited on
Commit
5e43d3e
·
verified ·
1 Parent(s): bfa79fd

Upload 10 files


Multi query type router

Files changed (7)
  1. about.txt +118 -0
  2. app.py +165 -34
  3. filter_options.json +87 -0
  4. full_chain.py +2 -5
  5. get_articles.py +112 -75
  6. help.txt +85 -0
  7. shots.json +34 -0
about.txt ADDED
@@ -0,0 +1,118 @@
+ About Tobacco Watcher
+ Our Vision
+ The global tobacco control landscape is changing more quickly than ever. In this fast-paced world, Tobacco Watcher helps you stay on top of these changes by finding, analyzing, and delivering the world's tobacco-related news. Powered by cutting-edge innovations in machine learning and artificial intelligence, Tobacco Watcher empowers you to combat global tobacco use.
+ Media monitoring provides valuable assessments of the tobacco control environment for tobacco control planning and evaluation. However, existing media monitoring approaches require scouring a substantial and growing number of disparate news sources, often leading to information overload. Tobacco Watcher addresses this challenge through a web application that automates the collection, querying, filtering, highlighting, integration, and analysis of tobacco-focused news media worldwide.
+
+ An Analysis Engine Fueled by Data
+ Tobacco Watcher rests on the largest assembly of data in the history of tobacco control. The AI behind Tobacco Watcher has investigated more than 25 million news articles.
+ The Challenge
+ Media monitoring is an essential tool for supervising the rapidly changing tobacco control environment. However, it comes with challenges:
+ What news sources can be accessed?
+ How should non-English sources be searched?
+ What news reports should be saved for future analyses?
+ How can news media be analyzed and shared efficiently with researchers, advocates, and regulators?
+ The Solution
+ Imagine having millions of global news reports collected in one place. Tobacco-related content is organized with precision around key topics (e.g., smoke-free laws) or products (e.g., electronic cigarettes), rivaling human-level judgment. Features include:
+ Curated reports on your device, organized by topics and products.
+ Emailed alerts tailored to your precise interests.
+ Trend plotting for agenda setting or campaign evaluations.
+ The Tobacco Watcher web application, a product of the Johns Hopkins Bloomberg School of Public Health's Institute for Global Tobacco Control, leverages automation for data gathering, filtering, and analysis.
+
+ Core Objectives
+ The principal objectives of Tobacco Watcher are to:
+ Warehouse the greatest number of tobacco-related news reports across diverse substantive, geographic, and linguistic areas.
+ Avoid overwhelming users with excess information.
+ Automate classification, provide flexible outputs, and support common media monitoring use cases.
+ Using AI strategies, Tobacco Watcher not only automates the gathering and filtering of news but also enables actionable insights for tobacco control leaders.
+
+ Research
+ Tobacco Watcher has been utilized in a range of research contexts, particularly in studies focusing on the technologies behind Tobacco Watcher and its applications:
+ Did Philip Morris International use the e-cigarette, or vaping, product use associated lung injury (EVALI) outbreak to market IQOS heated tobacco?
+ In that study we tracked how Philip Morris’ press releases were covered by the news media, including some misleading claims that were later edited.
+ Next generation media monitoring: Global coverage of electronic nicotine delivery systems (electronic cigarettes) on Bing, Google and Twitter, 2013-2018
+ In that study we described the development of Tobacco Watcher and how, as an analysis engine, it can be used for near-real-time needs assessment and tobacco control responses.
+
+ The Data Pipeline
+ Database
+ Tobacco Watcher queries more than 600,000 news sources, including:
+ News aggregators (e.g., Bing)
+ RSS news feeds
+ Curated news websites
+ Social media platforms (including 500,000 tweets with URLs daily)
+ With hundreds of tobacco-related keywords in 23 languages, Tobacco Watcher processes about 65,000 news reports per day.
+ Filtered: Reports are retained with a high degree of precision (0.95).
+ Translated: Retained reports are translated into English for deeper analysis.
+ Warehoused: Structured reports are made accessible to users.
+
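The 0.95 precision figure above means that, of the reports the filter retains, about 95% are genuinely tobacco-related. A quick illustration with made-up counts (the function and numbers are only for explanation, not part of the pipeline):

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Precision = relevant retained reports / all retained reports."""
    return true_positives / (true_positives + false_positives)

# e.g., if 95 of 100 retained reports are truly tobacco-related:
precision(95, 5)
```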
+ AI-Powered Filters
+ Primary Filters
+ The Tobacco Watcher system supports the filtering of news articles based on:
+ Subject: Articles are clustered into 16 policy domains inspired by WHO's MPOWER framework:
+ Advertising: Articles on promotions, sponsorships, and marketing of tobacco through media.
+
+ Age: Articles on age restrictions for purchasing or using tobacco, such as minimum age limits.
+
+ Agriculture: Articles primarily about the farming and cultivation of tobacco plants.
+
+ Air: Articles primarily about efforts to prevent the use of tobacco products (including vaping products) in indoor spaces to reduce secondhand smoke, environmental tobacco smoke, or passive smoking. Also covers bans on tobacco use in designated areas.
+
+ Criminal Justice: Articles primarily about crime, policing, or criminal prosecution directly linked to the possession, selling, or transport of tobacco products. Articles about a crime in which tobacco products are mentioned only in passing are not relevant.
+
+ Entertainment: Articles primarily about celebrities’ involvement with tobacco and tobacco use, or tobacco use in movies and other forms of entertainment.
+
+ Environment: Articles primarily about the impact of tobacco on the environment, covering pollution caused by smoking, cigarette butts, and anything that detrimentally affects nature.
+
+ Flavor: Articles on flavored tobacco products, including bans, usage, and reviews.
+
+ Industry: Articles on industry activities, including financial reports and mergers.
+
+ Nicotine: Articles primarily about regulations that reduce or restrict the amount of nicotine in tobacco products, including enforcement of these regulations, such as banning products that contain too much nicotine.
+
+ Packaging: Articles primarily about warning labels on tobacco packages, plain packaging, or graphic warnings on tobacco packaging.
+
+ Prevalence: Articles on tobacco use statistics and trends, including user demographics.
+
+ Prices: Articles on tobacco product taxes and minimum price regulations.
+
+ Quitting: Articles on cessation methods, products, and programs, including cessation statistics.
+
+ Health Warnings: Articles on dangers, diseases, and health risks associated with tobacco use.
+
+ Retail: Articles on issues involving the point of sale for tobacco products, including online.
+
+ Product: Articles are classified into 8 product categories:
+ Cigarettes, bidis, cigars, e-cigarettes, kreteks, smokeless tobacco, hookah, and heated tobacco.
+ Geographic Focus: Articles are classified nationally and regionally using entity recognition.
+ The system evolves dynamically: user feedback on classification accuracy improves future filtering capabilities.
+
+ Core Features
+ 1. Articles
+ Each article is presented in a standardized format, including headline, date, source, language, and topic classification.
+ Interactive features include "additional coverage" for related articles and user-driven feedback to refine classifications.
+ Users can favorite articles, save searches, and share insights with others.
+ 2. Alerts
+ Receive news reports directly in your inbox without visiting the platform.
+ Alerts can be tailored by:
+ Keywords, subject, location, and frequency (daily, weekly, bi-weekly, or monthly).
+ Alerts are grouped into:
+ My Alerts (user-created), Recommended Alerts (system-suggested), and Shared Alerts (created by other users).
+ 3. Analyses
+ Analyze global tobacco trends using Tobacco Watcher's time-series analysis tools.
+ Plot up to 5 trends simultaneously for:
+ Data exploration, program planning, or campaign evaluations.
+ Features include trend naming, customizable time windows, and data exports for external analysis.
+
+ Who We Support
+ Tobacco Watcher makes media monitoring easier and more effective, informing and improving global tobacco control efforts. Designed for:
+ Tobacco control advocates
+ Researchers
+ Policy makers
+ By classifying media by content and location in 23 languages, Tobacco Watcher enables exploration of the tobacco control landscape at an unprecedented scale. Spend less time searching and more time analyzing.
+
+ Who We Are
+ Tobacco Watcher was born out of the Institute for Global Tobacco Control at the Johns Hopkins Bloomberg School of Public Health.
+ Leadership Team
+ Project Lead: Joanna Cohen, PhD, MHSc
+ Bloomberg Professor of Disease Prevention and Director of IGTC at the Johns Hopkins Bloomberg School of Public Health. Dr. Cohen brings nearly three decades of tobacco policy research expertise, focusing on public health policy adoption and implementation.
+ Project Technical Lead: Mark Dredze, PhD
+ John C. Malone Professor of Computer Science at Johns Hopkins University. Dr. Dredze specializes in AI and NLP for public health applications, including tobacco control, infectious disease surveillance, and clinical informatics.
+ Project Manager: John W. Ayers, PhD, MA
+ Computational epidemiologist and Vice Chief of Innovation at UC San Diego. Dr. Ayers focuses on integrating big data and AI to derive actionable public health insights, including real-time media analysis for tobacco control.
app.py CHANGED
@@ -1,48 +1,179 @@
- import openai
  import gradio as gr
  from full_chain import get_response
  import os

- api_key = os.getenv("OPENAI_API_KEY")
- client = openai.OpenAI(api_key=api_key)

- def create_hyperlink(url, title, domain):
-     """Create HTML hyperlink with domain information."""
-     return f"<a href='{url}' target='_blank'>{title}</a> ({domain})"

  def predict(message, history):
-     """Process user message and return response with hyperlinked sources."""
-     # Get response and source information
-     responder, links, titles, domains, published_dates = get_response(message, rerank_type="crossencoder")
-
-     # The responder already contains the formatted response with numbered citations
-     # We just need to add the hyperlinked references at the bottom
-     hyperlinks = []
-     for i, (link, title, domain, published_date) in enumerate(zip(links, titles, domains, published_dates), 1):
-         hyperlink = f"[{i}] {create_hyperlink(link, title, domain)} {published_date}"
-         hyperlinks.append(hyperlink)
-
-     # Split the responder to separate the response from its references
-     response_parts = responder.split("References:")
-     main_response = response_parts[0].strip()
-
-     # Combine the response with hyperlinked references
-     final_response = (
-         f"{main_response}\n\n"
-         f"References:\n"
-         f"{chr(10).join(hyperlinks)}"
-     )
-
-     return final_response

  # Initialize and launch Gradio interface
  gr.ChatInterface(
      predict,
      examples=[
-         "How many Americans Smoke?",
-         "What are some measures taken by the Indian Government to reduce the smoking population?",
-         "Does smoking negatively affect my health?"
      ],
-     title="Tobacco Information Assistant",
-     description="Ask questions about tobacco-related topics and get answers with reliable sources."
  ).launch()

  import gradio as gr
  from full_chain import get_response
  import os
+ import urllib3
+ import json
+ from langchain_openai import ChatOpenAI
+ from langchain.schema import SystemMessage, HumanMessage

+ urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

+ # Initialize ChatOpenAI
+ llm = ChatOpenAI(
+     api_key=os.getenv("OPENAI_API_KEY"),
+     model="gpt-3.5-turbo",
+     temperature=0
+ )
+
+ def load_content(filename):
+     """Load content from text files."""
+     with open(os.path.join("prompts", filename), "r", encoding="utf-8") as f:
+         return f.read()
+
+ def load_filter_options():
+     with open(os.path.join("prompts", "filter_options.json"), "r") as f:
+         return json.load(f)
+
+ def load_example_shots():
+     with open(os.path.join("prompts", "shots.json"), "r") as f:
+         return json.load(f)

  def predict(message, history):
+     """Process user message and return appropriate response."""
+     try:
+         # Query classification prompt
+         classifier_prompt = """You are the Tobacco Watcher Assistant. Analyze the user's query and categorize it into exactly ONE of these types:
+
+ 1. HELP - Questions about using the website, its features, or navigation
+    Example: "How do I use filters?", "How to search for articles?"
+
+ 2. ABOUT - Questions about Tobacco Watcher's purpose, mission, or organization
+    Example: "What is Tobacco Watcher?", "Who runs this website?"
+
+ 3. FILTER - Requests for specific articles using filters
+    Example: "Show articles about smoking in India from 2023", "Find French articles about e-cigarettes"
+
+ 4. QUERY - Questions seeking tobacco-related information
+    Example: "How many people smoke in Asia?", "What are the effects of secondhand smoke?"
+
+ Respond with ONLY the category name (HELP, ABOUT, FILTER, or QUERY).
+ """
+
+         messages = [
+             SystemMessage(content=classifier_prompt),
+             HumanMessage(content=message)
+         ]
+
+         response = llm.invoke(messages)
+         query_type = response.content.strip().upper()
+         print(f"Query type: {query_type}")
+
+         if query_type == "HELP":
+             help_content = load_content("help.txt")
+             messages = [
+                 SystemMessage(content="""You are the Tobacco Watcher Help Assistant.
+ Use the provided help content to guide users on how to use the platform's features.
+ Be clear and specific in your instructions. If a feature isn't mentioned in the content, acknowledge that and suggest contacting support."""),
+                 HumanMessage(content=f"Using this help content:\n\n{help_content}\n\nAnswer this question: {message}")
+             ]
+             response = llm.invoke(messages)
+             return response.content
+
+         elif query_type == "ABOUT":
+             about_content = load_content("about.txt")
+             messages = [
+                 SystemMessage(content="""You are the Tobacco Watcher Assistant specializing in explaining the platform.
+ Use the provided content to answer questions about Tobacco Watcher's purpose, mission, features, and organization.
+ Be concise but informative. If a specific detail isn't in the content, say so rather than making assumptions."""),
+                 HumanMessage(content=f"Using this content:\n\n{about_content}\n\nAnswer this question: {message}")
+             ]
+             response = llm.invoke(messages)
+             return response.content
+
+         elif query_type == "FILTER":
+             filter_options = load_filter_options()
+             example_shots = load_example_shots()
+
+             url_prompt = """Generate a Tobacco Watcher article URL based on the query. Follow these rules:
+
+ 1. Base URL: https://tobaccowatcher.globaltobaccocontrol.org/articles/
+ 2. Parameters:
+    - Subject (c=): Can have multiple
+    - Product (pro=): Can have multiple
+    - Region (r=): Can have multiple
+    - Language (lang=)
+    - Always add: st=&e=&section=keywords&dups=0&sort=-timestamp
+
+ Available filters:
+ """ + json.dumps(filter_options, indent=2) + """
+
+ Example queries and URLs:
+ """
+
+             for shot in example_shots:
+                 url_prompt += f"\nQuery: {shot['query']}\nURL: {shot['url']}\n"
+
+             url_prompt += "\nGenerate a valid URL for this query. Return ONLY the complete URL."
+
+             messages = [
+                 SystemMessage(content=url_prompt),
+                 HumanMessage(content=message)
+             ]
+
+             try:
+                 response = llm.invoke(messages)
+                 url_response = response.content.strip()
+                 print(f"Generated URL: {url_response}")
+
+                 if url_response.startswith("http"):
+                     return f"Here are the filtered articles you requested:\n{url_response}"
+                 else:
+                     return "I couldn't create a proper filter URL. Please try rephrasing your request."
+             except Exception as e:
+                 print(f"Error creating filter URL: {str(e)}")
+                 return "I couldn't create a proper filter URL. Please try rephrasing your request."
+
+         else:  # QUERY
+             try:
+                 response = get_response(message, rerank_type="crossencoder")
+                 if not response or len(response) != 5:
+                     print(f"Invalid response format: {response}")
+                     return "I apologize, but I couldn't find relevant information. Please try rephrasing your question."
+
+                 responder, links, titles, domains, published_dates = response
+
+                 if not responder:
+                     print("Empty response content")
+                     return "I apologize, but I couldn't generate a meaningful response. Please try rephrasing your question."
+
+                 response_parts = responder.split("References:")
+                 main_response = response_parts[0].strip()
+
+                 if not any([links, titles, domains, published_dates]):
+                     print("Missing citation data")
+                     return main_response  # Return just the response without citations
+
+                 hyperlinks = [
+                     f"[{i}] <a href='{link}' target='_blank'>{title}</a> ({domain}) {date}"
+                     for i, (link, title, domain, date) in
+                     enumerate(zip(links, titles, domains, published_dates), 1)
+                     if link and title and domain  # Only create links for complete data
+                 ]
+
+                 if hyperlinks:
+                     return f"{main_response}\n\nReferences:\n{chr(10).join(hyperlinks)}"
+                 return main_response
+
+             except Exception as e:
+                 print(f"Error in QUERY handling: {str(e)}")
+                 return "I apologize, but I encountered an error processing your request. Please try again."
+
+     except Exception as e:
+         print(f"Error in predict: {str(e)}")
+         return "I apologize, but I encountered an error processing your request. Please try again."

  # Initialize and launch Gradio interface
  gr.ChatInterface(
      predict,
      examples=[
+         "What is Tobacco Watcher?",
+         "How do I use the search filters?",
+         "Show me articles about smoking in India from 2023",
+         "Find French articles about e-cigarettes",
+         "What are the health effects of secondhand smoke?",
+         "Show me articles about tobacco industry in Eastern Europe",
      ],
+     title="Tobacco Watcher Chatbot",
+     description="Ask questions about tobacco-related topics, get help with navigation, or learn about Tobacco Watcher."
  ).launch()
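The router above delegates the four-way classification to gpt-3.5-turbo. A minimal offline sketch of the same routing contract, using keyword heuristics instead of an LLM call (the keyword lists are illustrative assumptions, not the production logic):

```python
def route_query(message: str) -> str:
    """Toy stand-in for the LLM classifier: returns HELP, ABOUT, FILTER, or QUERY.

    The keyword lists below are hypothetical; the production app asks the
    model to choose a category instead of matching substrings."""
    text = message.lower()
    if any(kw in text for kw in ("how do i", "how to use", "navigate", "feature")):
        return "HELP"
    if any(kw in text for kw in ("what is tobacco watcher", "who runs", "mission")):
        return "ABOUT"
    if any(kw in text for kw in ("show articles", "show me articles", "find articles")):
        return "FILTER"
    return "QUERY"  # default: treat as an information-seeking question
```

Checking categories in this order (HELP before FILTER) mirrors the prompt's intent that navigation questions mentioning filters still route to HELP.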
filter_options.json ADDED
@@ -0,0 +1,87 @@
+ {
+   "popularity": {
+     "parameter": "popularity",
+     "paramvalues": [
+       "trending_on_social",
+       "domain"
+     ]
+   },
+   "subject": {
+     "parameter": "c",
+     "paramvalues": [
+       "advertising",
+       "age",
+       "agriculture",
+       "air",
+       "criminal justice",
+       "entertainment",
+       "environment",
+       "flavor",
+       "industry",
+       "nicotine",
+       "packaging",
+       "prevalence",
+       "prices",
+       "quitting",
+       "retailers",
+       "warnings",
+       "uncategorized"
+     ]
+   },
+   "product": {
+     "parameter": "pro",
+     "paramvalues": [
+       "bidis_primary",
+       "cigarettes_primary",
+       "cigars_primary",
+       "ecigs_primary",
+       "hookahs_primary",
+       "smokeless_tobacco_primary",
+       "heated_tobacco_primary",
+       "non_specific"
+     ]
+   },
+   "location": {
+     "parameter": "r",
+     "paramvalues": [
+       "africa",
+       "Central+America",
+       "Central+Asia",
+       "East+Asia",
+       "Eastern+Europe",
+       "North+America",
+       "South+America",
+       "South+Pacific",
+       "Southeastern+Asia",
+       "West+Asia",
+       "Western+Europe"
+     ]
+   },
+   "language": {
+     "parameter": "la",
+     "paramvalues": [
+       "en",
+       "ar",
+       "bn",
+       "zh",
+       "fr",
+       "de",
+       "hi",
+       "id",
+       "it",
+       "ja",
+       "ko",
+       "pl",
+       "pt",
+       "ru",
+       "es",
+       "tl",
+       "ta",
+       "th",
+       "tr",
+       "uk",
+       "ur",
+       "vi"
+     ]
+   }
+ }
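These parameter names map onto the article-URL scheme the FILTER branch generates. A sketch of how a filter selection could be assembled into such a URL (the exact query-string layout the site expects is an assumption here, inferred from the prompt rules and the JSON above):

```python
from urllib.parse import urlencode

# Hypothetical helper mirroring the URL rules in the FILTER prompt.
BASE_URL = "https://tobaccowatcher.globaltobaccocontrol.org/articles/"

def build_filter_url(subjects=(), products=(), regions=(), language=None):
    """Assemble an articles URL from filter selections.

    Repeatable parameters (c, pro, r) appear once per value; the trailing
    st/e/section/dups/sort parameters mirror the constants in the prompt."""
    params = [("c", s) for s in subjects]
    params += [("pro", p) for p in products]
    params += [("r", r) for r in regions]
    if language:
        params.append(("la", language))  # "la" per filter_options.json
    params += [("st", ""), ("e", ""), ("section", "keywords"),
               ("dups", "0"), ("sort", "-timestamp")]
    return BASE_URL + "?" + urlencode(params)
```

Note one inconsistency worth resolving upstream: the FILTER prompt says `lang=` while this JSON declares the language parameter as `la`; the sketch follows the JSON.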
full_chain.py CHANGED
@@ -7,13 +7,11 @@ from rerank import langchain_rerank_answer, langchain_with_sources, crossencoder
  #from feed_to_llm import feed_articles_to_gpt_with_links
  from feed_to_llm_v2 import feed_articles_to_gpt_with_links

-
  def get_response(question, rerank_type="crossencoder", llm_type="chat"):
-     csv_path = save_solr_articles_full(question, keyword_type="rake")
      reranked_out = crossencoder_rerank_answer(csv_path, question)
      return feed_articles_to_gpt_with_links(reranked_out, question)

-
  # save_path = save_solr_articles_full(question)
  # information = crossencoder_rerank_answer(save_path, question)
  # response, links, titles = feed_articles_to_gpt_with_links(information, question)
@@ -29,5 +27,4 @@
  response, links, titles, domains = get_response(question, rerank_type, llm_type)
  print(response)
  print(links)
- print(titles)
- print(domains)

  #from feed_to_llm import feed_articles_to_gpt_with_links
  from feed_to_llm_v2 import feed_articles_to_gpt_with_links

  def get_response(question, rerank_type="crossencoder", llm_type="chat"):
+     csv_path = save_solr_articles_full(question, keyword_type="rake", num_articles=10)
      reranked_out = crossencoder_rerank_answer(csv_path, question)
      return feed_articles_to_gpt_with_links(reranked_out, question)

  # save_path = save_solr_articles_full(question)
  # information = crossencoder_rerank_answer(save_path, question)
  # response, links, titles = feed_articles_to_gpt_with_links(information, question)

  response, links, titles, domains = get_response(question, rerank_type, llm_type)
  print(response)
  print(links)
+ print(titles)
+ print(domains)
 
get_articles.py CHANGED
@@ -6,7 +6,7 @@ import torch
  from datetime import datetime
  from get_keywords import get_keywords
  import os
-
  """
  This function creates top 15 articles from Solr and saves them in a csv file
  Input:
@@ -15,11 +15,26 @@ Input:
      keyword_type: str (openai, rake, or na)
  Output: path to csv file
  """
- def save_solr_articles_full(query: str, num_articles=15, keyword_type="openai") -> str:
-     keywords = get_keywords(query, keyword_type)
-     if keyword_type == "na":
-         keywords = query
-     return save_solr_articles(keywords, num_articles)

  """
@@ -55,76 +70,98 @@ Minor details:
  If one of title, uuid, cleaned_content, url are missing the article is skipped.
  """
  def save_solr_articles(keywords: str, num_articles=15) -> str:
-     solr_key = os.getenv("SOLR_KEY")
-     SOLR_ARTICLES_URL = f"https://website:{solr_key}@solr.machines.globalhealthwatcher.org:8080/solr/articles/"
-     solr = Solr(SOLR_ARTICLES_URL, verify=False)
-
-     # No duplicates
-     fq = ['-dups:0']
-
-     query = f'text:({keywords})' + " AND " + "dead_url:(false)"
-
-     # Get top 2*num_articles articles and then remove misformed or duplicate articles
-     outputs = solr.search(query, fq=fq, sort="score desc", rows=num_articles * 2)
-
-     article_count = 0
-
-     save_path = os.path.join("data", "articles.csv")
-     if not os.path.exists(os.path.dirname(save_path)):
-         os.makedirs(os.path.dirname(save_path))
-
-     with open(save_path, 'w', newline='') as csvfile:
-         fieldnames = ['title', 'uuid', 'content', 'url', 'domain', 'published_date']
-         writer = csv.DictWriter(csvfile, fieldnames=fieldnames, quoting=csv.QUOTE_NONNUMERIC)
-         writer.writeheader()
-
-         title_five_words = set()
-
-         for d in outputs.docs:
-             if article_count == num_articles:
-                 break
-
-             # skip if title returns a keyerror
-             if 'title' not in d or 'uuid' not in d or 'cleaned_content' not in d or 'url' not in d:
-                 continue
-
-             title_cleaned = remove_spaces_newlines(d['title'])
-
-             split = title_cleaned.split()
-             # skip if title is a duplicate
-             if not len(split) < 5:
-                 five_words = title_cleaned.split()[:5]
-                 five_words = ' '.join(five_words)
-                 if five_words in title_five_words:
-                     continue
-                 title_five_words.add(five_words)
-
-             article_count += 1
-
-             cleaned_content = remove_spaces_newlines(d['cleaned_content'])
-             cleaned_content = truncate_article(cleaned_content)
-
-             domain = ""
-             if 'domain' not in d:
-                 domain = "Not Specified"
-             else:
-                 domain = d['domain']
-             print(domain)
-             raw_date = d.get('year_month_day', "Unknown Date")
-
-             # Format the date from YYYY-MM-DD to MM/DD/YYYY if available
-             if raw_date != "Unknown Date":
-                 try:
-                     publication_date = datetime.strptime(raw_date, "%Y-%m-%d").strftime("%m/%d/%Y")
-                 except ValueError:
-                     publication_date = "Invalid Date"
-             else:
-                 publication_date = raw_date
-
-             writer.writerow({'title': title_cleaned, 'uuid': d['uuid'], 'content': cleaned_content, 'url': d['url'],
-                              'domain': domain, 'published_date': publication_date})
-             print(f"Article saved: {title_cleaned}, {d['uuid']}, {cleaned_content}, {d['url']}, {domain}, {publication_date}")
-     return save_path

  def save_embedding_base_articles(query, article_embeddings, titles, contents, uuids, urls, num_articles=15):
 
  from datetime import datetime
  from get_keywords import get_keywords
  import os
+ import re
  """
  This function creates top 15 articles from Solr and saves them in a csv file
  Input:
      keyword_type: str (openai, rake, or na)
  Output: path to csv file
  """
+
+ def sanitize_query(text):
+     """Sanitize the query text for Solr."""
+     # Remove special characters that could break Solr syntax
+     sanitized = re.sub(r'[\[\]{}()*+?\\^|;:!]', ' ', text)
+     # Normalize whitespace
+     sanitized = ' '.join(sanitized.split())
+     return sanitized
+
+ def save_solr_articles_full(query: str, num_articles: int, keyword_type: str = "openai") -> str:
+     keywords = get_keywords(query, keyword_type)
+     if keyword_type == "na":
+         keywords = query
+     # Sanitize keywords before creating the Solr query
+     keywords = sanitize_query(keywords)
+
+     return save_solr_articles(keywords, num_articles)

  """
  If one of title, uuid, cleaned_content, url are missing the article is skipped.
  """
  def save_solr_articles(keywords: str, num_articles=15) -> str:
+     """Save top articles from Solr search to CSV."""
+     try:
+         solr_key = os.getenv("SOLR_KEY")
+         SOLR_ARTICLES_URL = f"https://website:{solr_key}@solr.machines.globalhealthwatcher.org:8080/solr/articles/"
+         solr = Solr(SOLR_ARTICLES_URL, verify=False)
+
+         # No duplicates and must be in English
+         fq = ['-dups:0', 'is_english:(true)']
+
+         # Construct the query from the already-sanitized keywords
+         query = f'text:({keywords}) AND dead_url:(false)'
+
+         print(f"Executing Solr query: {query}")
+
+         # Use a boost function to combine relevance score with recency.
+         # This gives higher weight to more recent articles while still considering relevance.
+         boost_query = "sum(score,product(0.3,recip(ms(NOW,year_month_day),3.16e-11,1,1)))"
+
+         try:
+             outputs = solr.search(
+                 query,
+                 fq=fq,
+                 sort=boost_query + " desc",
+                 rows=num_articles * 2,
+                 fl='*,score'  # Include score in results
+             )
+         except Exception as e:
+             print(f"Solr query failed: {str(e)}")
+             raise
+
+         article_count = 0
+         save_path = os.path.join("data", "articles.csv")
+         if not os.path.exists(os.path.dirname(save_path)):
+             os.makedirs(os.path.dirname(save_path))
+
+         with open(save_path, 'w', newline='') as csvfile:
+             fieldnames = ['title', 'uuid', 'content', 'url', 'domain', 'published_date']
+             writer = csv.DictWriter(csvfile, fieldnames=fieldnames, quoting=csv.QUOTE_NONNUMERIC)
+             writer.writeheader()
+
+             title_five_words = set()
+
+             for d in outputs.docs:
+                 if article_count == num_articles:
+                     break
+
+                 # Skip if required fields are missing
+                 if 'title' not in d or 'uuid' not in d or 'cleaned_content' not in d or 'url' not in d:
+                     continue
+
+                 title_cleaned = remove_spaces_newlines(d['title'])
+
+                 # Skip duplicate titles based on first five words
+                 split = title_cleaned.split()
+                 if len(split) >= 5:
+                     five_words = ' '.join(split[:5])
+                     if five_words in title_five_words:
+                         continue
+                     title_five_words.add(five_words)
+
+                 article_count += 1
+
+                 cleaned_content = remove_spaces_newlines(d['cleaned_content'])
+                 cleaned_content = truncate_article(cleaned_content)
+
+                 domain = d.get('domain', "Not Specified")
+                 raw_date = d.get('year_month_day', "Unknown Date")
+
+                 # Format the date from YYYY-MM-DD to MM/DD/YYYY if available
+                 if raw_date != "Unknown Date":
+                     try:
+                         publication_date = datetime.strptime(raw_date, "%Y-%m-%d").strftime("%m/%d/%Y")
+                     except ValueError:
+                         publication_date = "Invalid Date"
+                 else:
+                     publication_date = raw_date
+
+                 writer.writerow({
+                     'title': title_cleaned,
+                     'uuid': d['uuid'],
+                     'content': cleaned_content,
+                     'url': d['url'],
+                     'domain': domain,
+                     'published_date': publication_date
+                 })
+                 print(f"Article saved: {title_cleaned}, {d['uuid']}, {domain}, {publication_date}")
+
+         return save_path
+
+     except Exception as e:
+         print(f"Error in save_solr_articles: {str(e)}")
+         raise

  def save_embedding_base_articles(query, article_embeddings, titles, contents, uuids, urls, num_articles=15):
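The boost expression above combines relevance with a reciprocal recency decay. Solr's `recip(x, m, a, b)` computes `a / (m*x + b)`, so with `m = 3.16e-11` per millisecond an article's recency term is roughly halved after one year. A sketch of the decay (plain Python mirroring the Solr function, for intuition only):

```python
def recency_weight(age_days: float) -> float:
    """Mirror recip(ms(NOW,date), 3.16e-11, 1, 1) = 1 / (3.16e-11 * age_ms + 1)."""
    age_ms = age_days * 24 * 60 * 60 * 1000
    return 1.0 / (3.16e-11 * age_ms + 1.0)

def boosted_score(relevance: float, age_days: float) -> float:
    """sum(score, product(0.3, recip(...))): recency adds at most 0.3 to the score."""
    return relevance + 0.3 * recency_weight(age_days)
```

A brand-new article gets the full 0.3 bonus; a year-old article gets about 0.15, so recency nudges the ranking without overwhelming relevance.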
help.txt ADDED
@@ -0,0 +1,85 @@
+ Tobacco Watcher Help: Sentinels
+ Sentinels are estimated in real time by comparing observed volumes of news coverage with expected volumes of news coverage. The above image presents a simplification of the method from a single time series of news coverage. At present, sentinels are events where news coverage is two standard deviations greater than what was expected for a given country or subject.
+
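The two-standard-deviation rule described above can be sketched in a few lines, assuming the expected volume is estimated from the historical mean (a simplification of the actual model, which the help text does not fully specify):

```python
from statistics import mean, stdev

def is_sentinel(history, observed):
    """Flag observed coverage more than two standard deviations above the
    historical mean. Illustrative only: the production expected-volume
    model may use a more sophisticated time-series estimate."""
    expected = mean(history)
    threshold = expected + 2 * stdev(history)
    return observed > threshold
```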
+ Tobacco Watcher Help: Articles Page
+ Introduction
+ The Articles Page in Tobacco Watcher allows you to explore the latest tobacco-related news collected from across the web. Articles are:
+ Sorted by date, with the newest articles shown first.
+ Automatically identified by subject, product focus, and location.
+ Translated into English for non-English articles, with access to the original text.
+
+ Keyword Search
+ Use Tobacco Watcher’s powerful keyword search to find relevant articles:
+ Combine terms using logical operators (e.g., AND, OR).
+ Add wildcards (e.g., cig*) to broaden searches.
+ Use quotes for exact matches.
+ Search in the translated text or the original foreign language.
+ Limit searches to:
+ Title
+ Opening text
+ Full text
+ Control duplicates: choose whether to hide or display duplicate articles in search results.
+ For more advanced search options, click the Advanced Search icon to the left of the search box.
23
+
24
+ Search Results
25
+ The Results section displays the number of articles matching your search criteria and filters:
26
+ By default, duplicate articles are hidden, and only unique articles are shown.
27
+ If duplicates are displayed, the full list of articles, including their duplicates, will appear.
28
+ For hidden duplicates, a link is provided to view all versions of the article.
29
+
30
+ Filters
31
+ Popularity
32
+ Filter for popular articles:
33
+ These articles come from prominent sources and/or are widely discussed on social media.
34
+ Popular articles are marked with a popularity icon ( ).
35
+ Subject
36
+ Articles are labeled based on key tobacco control topics, such as:
37
+ Smoking bans, prevalence, cessation, advertising restrictions, and more.
38
+ Product
39
+ Filter articles by the primary tobacco product being discussed, such as:
40
+ Cigarettes, electronic cigarettes, hookah, smokeless tobacco, and others.
41
+ Location
42
+ Filter articles based on their primary subject location:
43
+ Choose from specific countries or regions.
44
+ Start typing in the text box to select a location.
45
+ Language
46
+ Filter articles by their source language:
47
+ Tobacco Watcher supports 16 languages.
48
+ Articles are translated into English for convenience, with a link to view the original article in its source language.
49
+ Timeframe
50
+ Narrow results to a specific date range:
51
+ Use the date drop-down menu or manually enter start and end dates.
52
+
53
+ Additional Features
54
+ Explore more tools from the Articles Page:
55
+ Create an Alert: Go to the Alerts Page to set up email alerts based on your current search filters.
56
+ Analyze Trends: Go to the Analyses Page to analyze time trends using the same filters.
57
+
58
+ Article Box
59
+ Each article is presented in a compact summary box with interactive features:
60
+ Article Title
61
+ The article box includes:
62
+ The truncated title of the article.
63
+ The source of the article (e.g., Reuters).
64
+ Article Summary
65
+ Summarized information includes:
66
+ The subject of the article.
67
+ The product focus (e.g., electronic cigarettes).
68
+ The primary location.
69
+ For articles with duplicates, the number of duplicate articles is shown under "Additional Coverage."
70
+ Article Features
71
+ View More: Learn more about the article, including additional coverage (exact duplicates) and related coverage (similar stories).
72
+ Share: Send the article to colleagues.
73
+ Feedback: Report errors in classification to help improve the system.
74
+
75
+ Help & Support
76
+ Look for the ? icons throughout the platform for additional tips and information.
77
+ If you need further help, click "Help" in the top-right corner of the page to revisit this walkthrough.
78
+
79
+ Summary
80
+ With the Articles Page, you can:
81
+ Search for relevant articles using advanced keyword tools.
82
+ Apply filters for subjects, products, locations, languages, and timeframes.
83
+ Review articles in a user-friendly format with summaries and interactive options.
84
+ Use additional features like alerts and trend analyses to stay informed.
85
+ For more assistance, contact us at contact@tobaccowatcher.org.
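The two-standard-deviation rule described in the Sentinels section above can be sketched as follows. This is a simplification under stated assumptions: the function name `is_sentinel` and the sample counts are invented for illustration, and the production system's expected-volume estimation is more involved than a plain historical mean:

```python
import statistics

def is_sentinel(observed, history):
    # Flag an observed volume of news coverage as a sentinel when it
    # exceeds the historical mean by more than two standard deviations.
    mean = statistics.mean(history)
    sd = statistics.stdev(history)
    return observed > mean + 2 * sd

daily_counts = [40, 38, 45, 42, 41, 39, 44]  # illustrative daily article volumes
print(is_sentinel(120, daily_counts))  # True: far above expected coverage
print(is_sentinel(42, daily_counts))   # False: within normal variation
```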
shots.json ADDED
@@ -0,0 +1,34 @@
+[
+    {
+        "query": "Show me trending articles about age restrictions",
+        "url": "https://tobaccowatcher.globaltobactocontrol.org/articles/?popularity=trending_on_social&c=Age&st=&e=&lang=en&section=keywords&dups=0&sort=-timestamp"
+    },
+    {
+        "query": "Find popular articles about cigarettes and age limits",
+        "url": "https://tobaccowatcher.globaltobactocontrol.org/articles/?popularity=trending_on_social&c=Age&pro=cigarettes_primary&st=&e=&lang=en&section=keywords&dups=0&sort=-timestamp"
+    },
+    {
+        "query": "Show me trending articles about age and agriculture for cigarettes and cigars",
+        "url": "https://tobaccowatcher.globaltobactocontrol.org/articles/?popularity=trending_on_social&c=Age&c=Agriculture&pro=cigarettes_primary&pro=cigars_primary&st=&e=&lang=en&section=keywords&dups=0&sort=-timestamp"
+    },
+    {
+        "query": "Find trending articles about age and agriculture for cigarettes and cigars in Africa and Central America",
+        "url": "https://tobaccowatcher.globaltobactocontrol.org/articles/?popularity=trending_on_social&c=Age&c=Agriculture&pro=cigarettes_primary&pro=cigars_primary&st=&e=&r=Africa&r=Central+America&lang=en&section=keywords&dups=0&sort=-timestamp"
+    },
+    {
+        "query": "Show me Chinese articles about age and agriculture in East Asia",
+        "url": "https://tobaccowatcher.globaltobactocontrol.org/articles/?c=Age&c=Agriculture&la=zh&r=East+Asia&st=&e=&lang=en&section=keywords&dups=0&sort=-timestamp"
+    },
+    {
+        "query": "Find trending Chinese articles about cigarettes and cigars in East Asia",
+        "url": "https://tobaccowatcher.globaltobactocontrol.org/articles/?popularity=trending_on_social&pro=cigarettes_primary&pro=cigars_primary&la=zh&r=East+Asia&st=&e=&lang=en&section=keywords&dups=0&sort=-timestamp"
+    },
+    {
+        "query": "Show me French articles about e-cigarettes",
+        "url": "https://tobaccowatcher.globaltobactocontrol.org/articles/?pro=ecigs_primary&la=fr&st=&e=&lang=en&section=keywords&dups=0&sort=-timestamp"
+    },
+    {
+        "query": "Find articles about smoking prevalence in India",
+        "url": "https://tobaccowatcher.globaltobactocontrol.org/articles/?c=Prevalence&r=Central+Asia%3A+India&st=&e=&lang=en&section=keywords&dups=0&sort=-timestamp"
+    }
+]
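The few-shot pairs in shots.json map natural-language queries to filtered article URLs. A URL in that shape can be assembled with the standard library; the parameter names (`popularity`, `c`, `pro`, `la`, `r`, and the fixed trailing parameters) are taken from the examples above, while the `build_url` helper itself is a sketch, not part of the repo:

```python
from urllib.parse import urlencode

BASE = "https://tobaccowatcher.globaltobactocontrol.org/articles/"

def build_url(subjects=(), products=(), regions=(), language=None, trending=False):
    # Repeated keys (c=, pro=, r=) are expressed as a sequence of pairs,
    # which urlencode preserves in order; spaces become '+' as in the shots.
    params = []
    if trending:
        params.append(("popularity", "trending_on_social"))
    params += [("c", s) for s in subjects]
    params += [("pro", p) for p in products]
    if language:
        params.append(("la", language))
    params += [("r", r) for r in regions]
    params += [("st", ""), ("e", ""), ("lang", "en"),
               ("section", "keywords"), ("dups", "0"), ("sort", "-timestamp")]
    return BASE + "?" + urlencode(params)

# Reproduces the first shot above:
print(build_url(subjects=["Age"], trending=True))
```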