vtiyyal1 commited on
Commit
5e43d3e
·
verified ·
1 Parent(s): bfa79fd

Upload 10 files


Multi query type router

Files changed (7)
  1. about.txt +118 -0
  2. app.py +165 -34
  3. filter_options.json +87 -0
  4. full_chain.py +2 -5
  5. get_articles.py +112 -75
  6. help.txt +85 -0
  7. shots.json +34 -0
about.txt ADDED
@@ -0,0 +1,118 @@
+ About Tobacco Watcher
+ Our Vision
+ The global tobacco control landscape is changing more quickly than ever. In this fast-paced world, Tobacco Watcher helps you stay on top of these changes by finding, analyzing, and delivering the world's tobacco-related news. Powered by cutting-edge innovations in machine learning and artificial intelligence, Tobacco Watcher empowers you to combat global tobacco use.
+ Media monitoring provides valuable assessments of the tobacco control environment for tobacco control planning and evaluation. However, existing media monitoring approaches require scouring a substantial and growing number of disparate news sources, often leading to information overload. Tobacco Watcher addresses this challenge through a web application that automates the collection, querying, filtering, highlighting, integration, and analysis of tobacco-focused news media worldwide.
+
+ An Analysis Engine Fueled by Data
+ Tobacco Watcher rests on the largest assembly of data in the history of tobacco control. The AI behind Tobacco Watcher has investigated more than 25 million news articles.
+ The Challenge
+ Media monitoring is an essential tool for supervising the rapidly changing tobacco control environment. However, it comes with challenges:
+ What news sources can be accessed?
+ How should non-English sources be searched?
+ What news reports should be saved for future analyses?
+ How can news media be analyzed and shared efficiently with researchers, advocates, and regulators?
+ The Solution
+ Imagine having millions of global news reports collected in one place. Tobacco-related content is organized with precision around key topics (e.g., smoke-free laws) or products (e.g., electronic cigarettes), rivaling human-level judgment. Features include:
+ Curated reports on your device, organized by topics and products.
+ Emailed alerts tailored to your precise interests.
+ Trend plotting for agenda setting or campaign evaluations.
+ The Tobacco Watcher web application, a product of the Johns Hopkins Bloomberg School of Public Health's Institute for Global Tobacco Control, leverages automation for data gathering, filtering, and analysis.
+
+ Core Objectives
+ The principal objectives of Tobacco Watcher are to:
+ Warehouse the greatest number of tobacco-related news reports across diverse substantive, geographic, and linguistic areas.
+ Avoid overwhelming users with excess information.
+ Automate classification, provide flexible outputs, and support common media monitoring use cases.
+ Using AI strategies, Tobacco Watcher not only automates the gathering and filtering of news but also enables actionable insights for tobacco control leaders.
+
+ Research
+ Tobacco Watcher has been utilized in a range of research contexts, particularly in studies focusing on the technologies behind Tobacco Watcher and its applications:
+ Did Philip Morris International use the e-cigarette, or vaping, product use associated lung injury (EVALI) outbreak to market IQOS heated tobacco?
+ In that study we tracked how Philip Morris’ press releases were covered by the news media, including some misleading claims that were later edited.
+ Next generation media monitoring: Global coverage of electronic nicotine delivery systems (electronic cigarettes) on Bing, Google and Twitter, 2013-2018
+ In that study we described the development of Tobacco Watcher and how, as an analysis engine, it can be used for near-real-time needs assessment and tobacco control responses.
+
+ The Data Pipeline
+ Database
+ Tobacco Watcher queries more than 600,000 news sources, including:
+ News aggregators (e.g., Bing)
+ RSS news feeds
+ Curated news websites
+ Social media platforms (including 500,000 tweets with URLs daily)
+ With hundreds of tobacco-related keywords in 23 languages, Tobacco Watcher processes about 65,000 news reports per day.
+ Filtered: Reports are retained with a high degree of precision (0.95).
+ Translated: Retained reports are translated into English for deeper analysis.
+ Warehoused: Structured reports are made accessible to users.
+
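The 0.95 precision figure above means that, of the reports the filter retains, about 95% are genuinely tobacco-related. A quick illustration with made-up counts (the function and numbers are only for explanation, not part of the pipeline):

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Precision = relevant retained reports / all retained reports."""
    return true_positives / (true_positives + false_positives)

# e.g., if 95 of 100 retained reports are truly tobacco-related:
precision(95, 5)
```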
+ AI-Powered Filters
+ Primary Filters
+ The Tobacco Watcher system supports the filtering of news articles based on:
+ Subject: Articles are clustered into 16 policy domains inspired by WHO's MPOWER framework:
+ Advertising: Articles on promotions, sponsorships, and marketing of tobacco through media.
+
+ Age: Articles on age restrictions for purchasing or using tobacco, such as minimum age limits.
+
+ Agriculture: Articles primarily about the farming and cultivation of tobacco plants.
+
+ Air: Articles primarily about efforts to prevent the use of tobacco products (including vaping products) in indoor spaces to reduce secondhand smoke, environmental tobacco smoke, or passive smoking. Also covers bans on tobacco use in designated areas.
+
+ Criminal Justice: Articles primarily about crime, policing, or criminal prosecution directly linked to the possession, selling, or transport of tobacco products. Articles about a crime in which tobacco products are mentioned only in passing are not relevant.
+
+ Entertainment: Articles primarily about celebrities’ involvement with tobacco and tobacco use, or tobacco use in movies and other forms of entertainment.
+
+ Environment: Articles primarily about the impact of tobacco on the environment, covering pollution caused by smoking, cigarette butts, and anything that detrimentally affects nature.
+
+ Flavor: Articles on flavored tobacco products, including bans, usage, and reviews.
+
+ Industry: Articles on industry activities, including financial reports and mergers.
+
+ Nicotine: Articles primarily about regulations that reduce or restrict the amount of nicotine in tobacco products, including enforcement of these regulations, such as banning products that contain too much nicotine.
+
+ Packaging: Articles primarily about warning labels on tobacco packages, plain packaging, or graphic warnings on tobacco packaging.
+
+ Prevalence: Articles on tobacco use statistics and trends, including user demographics.
+
+ Prices: Articles on tobacco product taxes and minimum price regulations.
+
+ Quitting: Articles on cessation methods, products, and programs, including cessation statistics.
+
+ Health Warnings: Articles on dangers, diseases, and health risks associated with tobacco use.
+
+ Retail: Articles on issues involving the point of sale for tobacco products, including online.
+
+ Product: Articles are classified into 8 product categories:
+ Cigarettes, bidis, cigars, e-cigarettes, kreteks, smokeless tobacco, hookah, and heated tobacco.
+ Geographic Focus: Articles are classified nationally and regionally using entity recognition.
+ The system evolves dynamically: user feedback on classification accuracy improves future filtering capabilities.
+
+ Core Features
+ 1. Articles
+ Each article is presented in a standardized format, including headline, date, source, language, and topic classification.
+ Interactive features include "additional coverage" for related articles and user-driven feedback to refine classifications.
+ Users can favorite articles, save searches, and share insights with others.
+ 2. Alerts
+ Receive news reports directly in your inbox without visiting the platform.
+ Alerts can be tailored by:
+ Keywords, subject, location, and frequency (daily, weekly, bi-weekly, or monthly).
+ Alerts are grouped into:
+ My Alerts (user-created), Recommended Alerts (system-suggested), and Shared Alerts (created by other users).
+ 3. Analyses
+ Analyze global tobacco trends using Tobacco Watcher's time-series analysis tools.
+ Plot up to 5 trends simultaneously for:
+ Data exploration, program planning, or campaign evaluations.
+ Features include trend naming, customizable time windows, and data exports for external analysis.
+
+ Who We Support
+ Tobacco Watcher makes media monitoring easier and more effective, informing and improving global tobacco control efforts. Designed for:
+ Tobacco control advocates
+ Researchers
+ Policy makers
+ By classifying media by content and location in 23 languages, Tobacco Watcher enables exploration of the tobacco control landscape at an unprecedented scale. Spend less time searching and more time analyzing.
+
+ Who We Are
+ Tobacco Watcher was born out of the Institute for Global Tobacco Control at the Johns Hopkins Bloomberg School of Public Health.
+ Leadership Team
+ Project Lead: Joanna Cohen, PhD, MHSc
+ Bloomberg Professor of Disease Prevention and Director of IGTC at the Johns Hopkins Bloomberg School of Public Health. Dr. Cohen brings nearly three decades of tobacco policy research expertise, focusing on public health policy adoption and implementation.
+ Project Technical Lead: Mark Dredze, PhD
+ John C. Malone Professor of Computer Science at Johns Hopkins University. Dr. Dredze specializes in AI and NLP for public health applications, including tobacco control, infectious disease surveillance, and clinical informatics.
+ Project Manager: John W. Ayers, PhD, MA
+ Computational epidemiologist and Vice Chief of Innovation at UC San Diego. Dr. Ayers focuses on integrating big data and AI to derive actionable public health insights, including real-time media analysis for tobacco control.
app.py CHANGED
@@ -1,48 +1,179 @@
- import openai
  import gradio as gr
  from full_chain import get_response
  import os

- api_key = os.getenv("OPENAI_API_KEY")
- client = openai.OpenAI(api_key=api_key)

- def create_hyperlink(url, title, domain):
-     """Create HTML hyperlink with domain information."""
-     return f"<a href='{url}' target='_blank'>{title}</a> ({domain})"

  def predict(message, history):
-     """Process user message and return response with hyperlinked sources."""
-     # Get response and source information
-     responder, links, titles, domains, published_dates = get_response(message, rerank_type="crossencoder")
-
-     # The responder already contains the formatted response with numbered citations
-     # We just need to add the hyperlinked references at the bottom
-     hyperlinks = []
-     for i, (link, title, domain, published_date) in enumerate(zip(links, titles, domains, published_dates), 1):
-         hyperlink = f"[{i}] {create_hyperlink(link, title, domain)} {published_date}"
-         hyperlinks.append(hyperlink)
-
-     # Split the responder to separate the response from its references
-     response_parts = responder.split("References:")
-     main_response = response_parts[0].strip()
-
-     # Combine the response with hyperlinked references
-     final_response = (
-         f"{main_response}\n\n"
-         f"References:\n"
-         f"{chr(10).join(hyperlinks)}"
-     )
-
-     return final_response

  # Initialize and launch Gradio interface
  gr.ChatInterface(
      predict,
      examples=[
-         "How many Americans Smoke?",
-         "What are some measures taken by the Indian Government to reduce the smoking population?",
-         "Does smoking negatively affect my health?"
      ],
-     title="Tobacco Information Assistant",
-     description="Ask questions about tobacco-related topics and get answers with reliable sources."
  ).launch()

  import gradio as gr
  from full_chain import get_response
  import os
+ import urllib3
+ import json
+ from langchain_openai import ChatOpenAI
+ from langchain.schema import SystemMessage, HumanMessage

+ urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

+ # Initialize ChatOpenAI
+ llm = ChatOpenAI(
+     api_key=os.getenv("OPENAI_API_KEY"),
+     model="gpt-3.5-turbo",
+     temperature=0
+ )
+
+ def load_content(filename):
+     """Load content from text files."""
+     with open(os.path.join("prompts", filename), "r", encoding="utf-8") as f:
+         return f.read()
+
+ def load_filter_options():
+     with open(os.path.join("prompts", "filter_options.json"), "r") as f:
+         return json.load(f)
+
+ def load_example_shots():
+     with open(os.path.join("prompts", "shots.json"), "r") as f:
+         return json.load(f)

  def predict(message, history):
+     """Process user message and return appropriate response."""
+     try:
+         # Query classification prompt
+         classifier_prompt = """You are the Tobacco Watcher Assistant. Analyze the user's query and categorize it into exactly ONE of these types:
+
+ 1. HELP - Questions about using the website, its features, or navigation
+    Example: "How do I use filters?", "How to search for articles?"
+
+ 2. ABOUT - Questions about Tobacco Watcher's purpose, mission, or organization
+    Example: "What is Tobacco Watcher?", "Who runs this website?"
+
+ 3. FILTER - Requests for specific articles using filters
+    Example: "Show articles about smoking in India from 2023", "Find French articles about e-cigarettes"
+
+ 4. QUERY - Questions seeking tobacco-related information
+    Example: "How many people smoke in Asia?", "What are the effects of secondhand smoke?"
+
+ Respond with ONLY the category name (HELP, ABOUT, FILTER, or QUERY).
+ """
+
+         messages = [
+             SystemMessage(content=classifier_prompt),
+             HumanMessage(content=message)
+         ]
+
+         response = llm.invoke(messages)
+         query_type = response.content.strip().upper()
+         print(f"Query type: {query_type}")
+
+         if query_type == "HELP":
+             help_content = load_content("help.txt")
+             messages = [
+                 SystemMessage(content="""You are the Tobacco Watcher Help Assistant.
+ Use the provided help content to guide users on how to use the platform's features.
+ Be clear and specific in your instructions. If a feature isn't mentioned in the content, acknowledge that and suggest contacting support."""),
+                 HumanMessage(content=f"Using this help content:\n\n{help_content}\n\nAnswer this question: {message}")
+             ]
+             response = llm.invoke(messages)
+             return response.content
+
+         elif query_type == "ABOUT":
+             about_content = load_content("about.txt")
+             messages = [
+                 SystemMessage(content="""You are the Tobacco Watcher Assistant specializing in explaining the platform.
+ Use the provided content to answer questions about Tobacco Watcher's purpose, mission, features, and organization.
+ Be concise but informative. If a specific detail isn't in the content, say so rather than making assumptions."""),
+                 HumanMessage(content=f"Using this content:\n\n{about_content}\n\nAnswer this question: {message}")
+             ]
+             response = llm.invoke(messages)
+             return response.content
+
+         elif query_type == "FILTER":
+             filter_options = load_filter_options()
+             example_shots = load_example_shots()
+
+             url_prompt = """Generate a Tobacco Watcher article URL based on the query. Follow these rules:
+
+ 1. Base URL: https://tobaccowatcher.globaltobaccocontrol.org/articles/
+ 2. Parameters:
+    - Subject (c=): Can have multiple
+    - Product (pro=): Can have multiple
+    - Region (r=): Can have multiple
+    - Language (lang=)
+    - Always add: st=&e=&section=keywords&dups=0&sort=-timestamp
+
+ Available filters:
+ """ + json.dumps(filter_options, indent=2) + """
+
+ Example queries and URLs:
+ """
+
+             for shot in example_shots:
+                 url_prompt += f"\nQuery: {shot['query']}\nURL: {shot['url']}\n"
+
+             url_prompt += "\nGenerate a valid URL for this query. Return ONLY the complete URL."
+
+             messages = [
+                 SystemMessage(content=url_prompt),
+                 HumanMessage(content=message)
+             ]
+
+             try:
+                 response = llm.invoke(messages)
+                 url_response = response.content.strip()
+                 print(f"Generated URL: {url_response}")
+
+                 if url_response.startswith("http"):
+                     return f"Here are the filtered articles you requested:\n{url_response}"
+                 else:
+                     return "I couldn't create a proper filter URL. Please try rephrasing your request."
+             except Exception as e:
+                 print(f"Error creating filter URL: {str(e)}")
+                 return "I couldn't create a proper filter URL. Please try rephrasing your request."
+
+         else:  # QUERY
+             try:
+                 response = get_response(message, rerank_type="crossencoder")
+                 if not response or len(response) != 5:
+                     print(f"Invalid response format: {response}")
+                     return "I apologize, but I couldn't find relevant information. Please try rephrasing your question."
+
+                 responder, links, titles, domains, published_dates = response
+
+                 if not responder:
+                     print("Empty response content")
+                     return "I apologize, but I couldn't generate a meaningful response. Please try rephrasing your question."
+
+                 response_parts = responder.split("References:")
+                 main_response = response_parts[0].strip()
+
+                 if not any([links, titles, domains, published_dates]):
+                     print("Missing citation data")
+                     return main_response  # Return just the response without citations
+
+                 hyperlinks = [
+                     f"[{i}] <a href='{link}' target='_blank'>{title}</a> ({domain}) {date}"
+                     for i, (link, title, domain, date) in
+                     enumerate(zip(links, titles, domains, published_dates), 1)
+                     if link and title and domain  # Only create links for complete data
+                 ]
+
+                 if hyperlinks:
+                     return f"{main_response}\n\nReferences:\n{chr(10).join(hyperlinks)}"
+                 return main_response
+
+             except Exception as e:
+                 print(f"Error in QUERY handling: {str(e)}")
+                 return "I apologize, but I encountered an error processing your request. Please try again."
+
+     except Exception as e:
+         print(f"Error in predict: {str(e)}")
+         return "I apologize, but I encountered an error processing your request. Please try again."

  # Initialize and launch Gradio interface
  gr.ChatInterface(
      predict,
      examples=[
+         "What is Tobacco Watcher?",
+         "How do I use the search filters?",
+         "Show me articles about smoking in India from 2023",
+         "Find French articles about e-cigarettes",
+         "What are the health effects of secondhand smoke?",
+         "Show me articles about tobacco industry in Eastern Europe",
      ],
+     title="Tobacco Watcher Chatbot",
+     description="Ask questions about tobacco-related topics, get help with navigation, or learn about Tobacco Watcher."
  ).launch()
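The router above delegates the four-way classification to gpt-3.5-turbo. A minimal offline sketch of the same routing contract, using keyword heuristics instead of an LLM call (the keyword lists are illustrative assumptions, not the production logic):

```python
def route_query(message: str) -> str:
    """Toy stand-in for the LLM classifier: returns HELP, ABOUT, FILTER, or QUERY.

    The keyword lists below are hypothetical; the production app asks the
    model to choose a category instead of matching substrings."""
    text = message.lower()
    if any(kw in text for kw in ("how do i", "how to use", "navigate", "feature")):
        return "HELP"
    if any(kw in text for kw in ("what is tobacco watcher", "who runs", "mission")):
        return "ABOUT"
    if any(kw in text for kw in ("show articles", "show me articles", "find articles")):
        return "FILTER"
    return "QUERY"  # default: treat as an information-seeking question
```

Checking categories in this order (HELP before FILTER) mirrors the prompt's intent that navigation questions mentioning filters still route to HELP.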
filter_options.json ADDED
@@ -0,0 +1,87 @@
+ {
+   "popularity": {
+     "parameter": "popularity",
+     "paramvalues": [
+       "trending_on_social",
+       "domain"
+     ]
+   },
+   "subject": {
+     "parameter": "c",
+     "paramvalues": [
+       "advertising",
+       "age",
+       "agriculture",
+       "air",
+       "criminal justice",
+       "entertainment",
+       "environment",
+       "flavor",
+       "industry",
+       "nicotine",
+       "packaging",
+       "prevalence",
+       "prices",
+       "quitting",
+       "retailers",
+       "warnings",
+       "uncategorized"
+     ]
+   },
+   "product": {
+     "parameter": "pro",
+     "paramvalues": [
+       "bidis_primary",
+       "cigarettes_primary",
+       "cigars_primary",
+       "ecigs_primary",
+       "hookahs_primary",
+       "smokeless_tobacco_primary",
+       "heated_tobacco_primary",
+       "non_specific"
+     ]
+   },
+   "location": {
+     "parameter": "r",
+     "paramvalues": [
+       "africa",
+       "Central+America",
+       "Central+Asia",
+       "East+Asia",
+       "Eastern+Europe",
+       "North+America",
+       "South+America",
+       "South+Pacific",
+       "Southeastern+Asia",
+       "West+Asia",
+       "Western+Europe"
+     ]
+   },
+   "language": {
+     "parameter": "la",
+     "paramvalues": [
+       "en",
+       "ar",
+       "bn",
+       "zh",
+       "fr",
+       "de",
+       "hi",
+       "id",
+       "it",
+       "ja",
+       "ko",
+       "pl",
+       "pt",
+       "ru",
+       "es",
+       "tl",
+       "ta",
+       "th",
+       "tr",
+       "uk",
+       "ur",
+       "vi"
+     ]
+   }
+ }
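These parameter names map onto the article-URL scheme the FILTER branch generates. A sketch of how a filter selection could be assembled into such a URL (the exact query-string layout the site expects is an assumption here, inferred from the prompt rules and the JSON above):

```python
from urllib.parse import urlencode

# Hypothetical helper mirroring the URL rules in the FILTER prompt.
BASE_URL = "https://tobaccowatcher.globaltobaccocontrol.org/articles/"

def build_filter_url(subjects=(), products=(), regions=(), language=None):
    """Assemble an articles URL from filter selections.

    Repeatable parameters (c, pro, r) appear once per value; the trailing
    st/e/section/dups/sort parameters mirror the constants in the prompt."""
    params = [("c", s) for s in subjects]
    params += [("pro", p) for p in products]
    params += [("r", r) for r in regions]
    if language:
        params.append(("la", language))  # "la" per filter_options.json
    params += [("st", ""), ("e", ""), ("section", "keywords"),
               ("dups", "0"), ("sort", "-timestamp")]
    return BASE_URL + "?" + urlencode(params)
```

Note one inconsistency worth resolving upstream: the FILTER prompt says `lang=` while this JSON declares the language parameter as `la`; the sketch follows the JSON.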
full_chain.py CHANGED
@@ -7,13 +7,11 @@ from rerank import langchain_rerank_answer, langchain_with_sources, crossencoder
  #from feed_to_llm import feed_articles_to_gpt_with_links
  from feed_to_llm_v2 import feed_articles_to_gpt_with_links

-
  def get_response(question, rerank_type="crossencoder", llm_type="chat"):
-     csv_path = save_solr_articles_full(question, keyword_type="rake")
      reranked_out = crossencoder_rerank_answer(csv_path, question)
      return feed_articles_to_gpt_with_links(reranked_out, question)

-
  # save_path = save_solr_articles_full(question)
  # information = crossencoder_rerank_answer(save_path, question)
  # response, links, titles = feed_articles_to_gpt_with_links(information, question)
@@ -29,5 +27,4 @@
  response, links, titles, domains = get_response(question, rerank_type, llm_type)
  print(response)
  print(links)
- print(titles)
- print(domains)

  #from feed_to_llm import feed_articles_to_gpt_with_links
  from feed_to_llm_v2 import feed_articles_to_gpt_with_links

  def get_response(question, rerank_type="crossencoder", llm_type="chat"):
+     csv_path = save_solr_articles_full(question, keyword_type="rake", num_articles=10)
      reranked_out = crossencoder_rerank_answer(csv_path, question)
      return feed_articles_to_gpt_with_links(reranked_out, question)

  # save_path = save_solr_articles_full(question)
  # information = crossencoder_rerank_answer(save_path, question)
  # response, links, titles = feed_articles_to_gpt_with_links(information, question)

  response, links, titles, domains = get_response(question, rerank_type, llm_type)
  print(response)
  print(links)
+ print(titles)
+ print(domains)
 
get_articles.py CHANGED
@@ -6,7 +6,7 @@ import torch
  from datetime import datetime
  from get_keywords import get_keywords
  import os
-
  """
  This function creates top 15 articles from Solr and saves them in a csv file
  Input:
@@ -15,11 +15,26 @@ Input:
      keyword_type: str (openai, rake, or na)
  Output: path to csv file
  """
- def save_solr_articles_full(query: str, num_articles=15, keyword_type="openai") -> str:
-     keywords = get_keywords(query, keyword_type)
-     if keyword_type == "na":
-         keywords = query
-     return save_solr_articles(keywords, num_articles)

  """
@@ -55,76 +70,98 @@ Minor details:
  If one of title, uuid, cleaned_content, url are missing the article is skipped.
  """
  def save_solr_articles(keywords: str, num_articles=15) -> str:
-     solr_key = os.getenv("SOLR_KEY")
-     SOLR_ARTICLES_URL = f"https://website:{solr_key}@solr.machines.globalhealthwatcher.org:8080/solr/articles/"
-     solr = Solr(SOLR_ARTICLES_URL, verify=False)
-
-     # No duplicates
-     fq = ['-dups:0']
-
-     query = f'text:({keywords})' + " AND " + "dead_url:(false)"
-
-     # Get top 2*num_articles articles and then remove misformed or duplicate articles
-     outputs = solr.search(query, fq=fq, sort="score desc", rows=num_articles * 2)
-
-     article_count = 0
-
-     save_path = os.path.join("data", "articles.csv")
-     if not os.path.exists(os.path.dirname(save_path)):
-         os.makedirs(os.path.dirname(save_path))
-
-     with open(save_path, 'w', newline='') as csvfile:
-         fieldnames = ['title', 'uuid', 'content', 'url', 'domain', 'published_date']
-         writer = csv.DictWriter(csvfile, fieldnames=fieldnames, quoting=csv.QUOTE_NONNUMERIC)
-         writer.writeheader()
-
-         title_five_words = set()
-
-         for d in outputs.docs:
-             if article_count == num_articles:
-                 break
-
-             # skip if title returns a keyerror
-             if 'title' not in d or 'uuid' not in d or 'cleaned_content' not in d or 'url' not in d:
-                 continue
-
-             title_cleaned = remove_spaces_newlines(d['title'])
-
-             split = title_cleaned.split()
-             # skip if title is a duplicate
-             if not len(split) < 5:
-                 five_words = title_cleaned.split()[:5]
-                 five_words = ' '.join(five_words)
-                 if five_words in title_five_words:
-                     continue
-                 title_five_words.add(five_words)
-
-             article_count += 1
-
-             cleaned_content = remove_spaces_newlines(d['cleaned_content'])
-             cleaned_content = truncate_article(cleaned_content)
-
-             domain = ""
-             if 'domain' not in d:
-                 domain = "Not Specified"
-             else:
-                 domain = d['domain']
-             print(domain)
-             raw_date = d.get('year_month_day', "Unknown Date")
-
-             # Format the date from YYYY-MM-DD to MM/DD/YYYY if available
-             if raw_date != "Unknown Date":
-                 try:
-                     publication_date = datetime.strptime(raw_date, "%Y-%m-%d").strftime("%m/%d/%Y")
-                 except ValueError:
-                     publication_date = "Invalid Date"
-             else:
-                 publication_date = raw_date
-
-             writer.writerow({'title': title_cleaned, 'uuid': d['uuid'], 'content': cleaned_content, 'url': d['url'],
-                              'domain': domain, 'published_date': publication_date})
-             print(f"Article saved: {title_cleaned}, {d['uuid']}, {cleaned_content}, {d['url']}, {domain}, {publication_date}")
-     return save_path

  def save_embedding_base_articles(query, article_embeddings, titles, contents, uuids, urls, num_articles=15):
 
  from datetime import datetime
  from get_keywords import get_keywords
  import os
+ import re
  """
  This function creates top 15 articles from Solr and saves them in a csv file
  Input:
      keyword_type: str (openai, rake, or na)
  Output: path to csv file
  """
+
+ def sanitize_query(text):
+     """Sanitize the query text for Solr."""
+     # Remove special characters that could break Solr syntax
+     sanitized = re.sub(r'[\[\]{}()*+?\\^|;:!]', ' ', text)
+     # Normalize whitespace
+     sanitized = ' '.join(sanitized.split())
+     return sanitized
+
+ def save_solr_articles_full(query: str, num_articles: int, keyword_type: str = "openai") -> str:
+     keywords = get_keywords(query, keyword_type)
+     if keyword_type == "na":
+         keywords = query
+     # Sanitize keywords before creating the Solr query
+     keywords = sanitize_query(keywords)
+
+     return save_solr_articles(keywords, num_articles)

  """
  If one of title, uuid, cleaned_content, url are missing the article is skipped.
  """
  def save_solr_articles(keywords: str, num_articles=15) -> str:
+     """Save top articles from Solr search to CSV."""
+     try:
+         solr_key = os.getenv("SOLR_KEY")
+         SOLR_ARTICLES_URL = f"https://website:{solr_key}@solr.machines.globalhealthwatcher.org:8080/solr/articles/"
+         solr = Solr(SOLR_ARTICLES_URL, verify=False)
+
+         # No duplicates and must be in English
+         fq = ['-dups:0', 'is_english:(true)']
+
+         # Construct the query from the already-sanitized keywords
+         query = f'text:({keywords}) AND dead_url:(false)'
+
+         print(f"Executing Solr query: {query}")
+
+         # Use a boost function to combine relevance score with recency.
+         # This gives higher weight to more recent articles while still considering relevance.
+         boost_query = "sum(score,product(0.3,recip(ms(NOW,year_month_day),3.16e-11,1,1)))"
+
+         try:
+             outputs = solr.search(
+                 query,
+                 fq=fq,
+                 sort=boost_query + " desc",
+                 rows=num_articles * 2,
+                 fl='*,score'  # Include score in results
+             )
+         except Exception as e:
+             print(f"Solr query failed: {str(e)}")
+             raise
+
+         article_count = 0
+         save_path = os.path.join("data", "articles.csv")
+         if not os.path.exists(os.path.dirname(save_path)):
+             os.makedirs(os.path.dirname(save_path))
+
+         with open(save_path, 'w', newline='') as csvfile:
+             fieldnames = ['title', 'uuid', 'content', 'url', 'domain', 'published_date']
+             writer = csv.DictWriter(csvfile, fieldnames=fieldnames, quoting=csv.QUOTE_NONNUMERIC)
+             writer.writeheader()
+
+             title_five_words = set()
+
+             for d in outputs.docs:
+                 if article_count == num_articles:
+                     break
+
+                 # Skip if required fields are missing
+                 if 'title' not in d or 'uuid' not in d or 'cleaned_content' not in d or 'url' not in d:
+                     continue
+
+                 title_cleaned = remove_spaces_newlines(d['title'])
+
+                 # Skip duplicate titles based on first five words
+                 split = title_cleaned.split()
+                 if len(split) >= 5:
+                     five_words = ' '.join(split[:5])
+                     if five_words in title_five_words:
+                         continue
+                     title_five_words.add(five_words)
+
+                 article_count += 1
+
+                 cleaned_content = remove_spaces_newlines(d['cleaned_content'])
+                 cleaned_content = truncate_article(cleaned_content)
+
+                 domain = d.get('domain', "Not Specified")
+                 raw_date = d.get('year_month_day', "Unknown Date")
+
+                 # Format the date from YYYY-MM-DD to MM/DD/YYYY if available
+                 if raw_date != "Unknown Date":
+                     try:
+                         publication_date = datetime.strptime(raw_date, "%Y-%m-%d").strftime("%m/%d/%Y")
+                     except ValueError:
+                         publication_date = "Invalid Date"
+                 else:
+                     publication_date = raw_date
+
+                 writer.writerow({
+                     'title': title_cleaned,
+                     'uuid': d['uuid'],
+                     'content': cleaned_content,
+                     'url': d['url'],
+                     'domain': domain,
+                     'published_date': publication_date
+                 })
+                 print(f"Article saved: {title_cleaned}, {d['uuid']}, {domain}, {publication_date}")
+
+         return save_path
+
+     except Exception as e:
+         print(f"Error in save_solr_articles: {str(e)}")
+         raise

  def save_embedding_base_articles(query, article_embeddings, titles, contents, uuids, urls, num_articles=15):
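The boost expression above combines relevance with a reciprocal recency decay. Solr's `recip(x, m, a, b)` computes `a / (m*x + b)`, so with `m = 3.16e-11` per millisecond an article's recency term is roughly halved after one year. A sketch of the decay (plain Python mirroring the Solr function, for intuition only):

```python
def recency_weight(age_days: float) -> float:
    """Mirror recip(ms(NOW,date), 3.16e-11, 1, 1) = 1 / (3.16e-11 * age_ms + 1)."""
    age_ms = age_days * 24 * 60 * 60 * 1000
    return 1.0 / (3.16e-11 * age_ms + 1.0)

def boosted_score(relevance: float, age_days: float) -> float:
    """sum(score, product(0.3, recip(...))): recency adds at most 0.3 to the score."""
    return relevance + 0.3 * recency_weight(age_days)
```

A brand-new article gets the full 0.3 bonus; a year-old article gets about 0.15, so recency nudges the ranking without overwhelming relevance.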
help.txt ADDED
@@ -0,0 +1,85 @@
+ Tobacco Watcher Help: Sentinels
+ Sentinels are estimated in real time by comparing observed volumes of news coverage with expected volumes of news coverage. The above image presents a simplification of the method from a single time series of news coverage. At present, sentinels are events where news coverage is two standard deviations greater than what was expected for a given country or subject.
+
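The two-standard-deviation rule described above can be sketched in a few lines, assuming the expected volume is estimated from the historical mean (a simplification of the actual model, which the help text does not fully specify):

```python
from statistics import mean, stdev

def is_sentinel(history, observed):
    """Flag observed coverage more than two standard deviations above the
    historical mean. Illustrative only: the production expected-volume
    model may use a more sophisticated time-series estimate."""
    expected = mean(history)
    threshold = expected + 2 * stdev(history)
    return observed > threshold
```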
+ Tobacco Watcher Help: Articles Page
+ Introduction
+ The Articles Page in Tobacco Watcher allows you to explore the latest tobacco-related news collected from across the web. Articles are:
+ Sorted by date, with the newest articles shown first.
+ Automatically identified by subject, product focus, and location.
+ Translated into English for non-English articles, with access to the original text.
+
+ Keyword Search
+ Use Tobacco Watcher’s powerful keyword search to find relevant articles:
+ Combine terms using logical operators (e.g., AND, OR).
+ Add wildcards (e.g., cig*) to broaden searches.
+ Use quotes for exact matches.
+ Search in the translated text or the original foreign language.
+ Limit searches to:
+ Title
+ Opening text
+ Full text
+ Control duplicates: choose whether to hide or display duplicate articles in search results.
+ For more advanced search options, click the Advanced Search icon to the left of the search box.
23
+
24
+ Search Results
25
+ The Results section displays the number of articles matching your search criteria and filters:
26
+ By default, duplicate articles are hidden, and only unique articles are shown.
27
+ If duplicates are displayed, the full list of articles, including their duplicates, will appear.
28
+ For hidden duplicates, a link is provided to view all versions of the article.
29
+
30
+ Filters
31
+ Popularity
32
+ Filter for popular articles:
33
+ These articles come from prominent sources and/or are widely discussed on social media.
34
+ Popular articles are marked with a popularity icon ( ).
35
+ Subject
36
+ Articles are labeled based on key tobacco control topics, such as:
37
+ Smoking bans, prevalence, cessation, advertising restrictions, and more.
38
+ Product
39
+ Filter articles by the primary tobacco product being discussed, such as:
40
+ Cigarettes, electronic cigarettes, hookah, smokeless tobacco, and others.
41
+ Location
42
+ Filter articles based on their primary subject location:
43
+ Choose from specific countries or regions.
44
+ Start typing in the text box to select a location.
45
+ Language
46
+ Filter articles by their source language:
47
+ Tobacco Watcher supports 16 languages.
48
+ Articles are translated into English for convenience, with a link to view the original article in its source language.
49
+ Timeframe
50
+ Narrow results to a specific date range:
51
+ Use the date drop-down menu or manually enter start and end dates.
52
+
53
+ Additional Features
54
+ Explore more tools from the Articles Page:
55
+ Create an Alert: Go to the Alerts Page to set up email alerts based on your current search filters.
56
+ Analyze Trends: Go to the Analyses Page to analyze time trends using the same filters.
57
+
58
+ Article Box
59
+ Each article is presented in a compact summary box with interactive features:
60
+ Article Title
61
+ The article box includes:
62
+ The truncated title of the article.
63
+ The source of the article (e.g., Reuters).
64
+ Article Summary
65
+ Summarized information includes:
66
+ The subject of the article.
67
+ The product focus (e.g., electronic cigarettes).
68
+ The primary location.
69
+ For articles with duplicates, the number of duplicate articles is shown under "Additional Coverage."
70
+ Article Features
71
+ View More: Learn more about the article, including additional coverage (exact duplicates) and related coverage (similar stories).
72
+ Share: Send the article to colleagues.
73
+ Feedback: Report errors in classification to help improve the system.
74
+
75
+ Help & Support
76
+ Look for the ? icons throughout the platform for additional tips and information.
77
+ If you need further help, click "Help" in the top-right corner of the page to revisit this walkthrough.
78
+
79
+ Summary
80
+ With the Articles Page, you can:
81
+ Search for relevant articles using advanced keyword tools.
82
+ Apply filters for subjects, products, locations, languages, and timeframes.
83
+ Review articles in a user-friendly format with summaries and interactive options.
84
+ Use additional features like alerts and trend analyses to stay informed.
85
+ For more assistance, contact us at contact@tobaccowatcher.org.
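The two-standard-deviation rule described in the Sentinels section above can be sketched as follows. This is a simplification under stated assumptions: the function name `is_sentinel` and the sample counts are invented for illustration, and the production system's expected-volume estimation is more involved than a plain historical mean:

```python
import statistics

def is_sentinel(observed, history):
    # Flag an observed volume of news coverage as a sentinel when it
    # exceeds the historical mean by more than two standard deviations.
    mean = statistics.mean(history)
    sd = statistics.stdev(history)
    return observed > mean + 2 * sd

daily_counts = [40, 38, 45, 42, 41, 39, 44]  # illustrative daily article volumes
print(is_sentinel(120, daily_counts))  # True: far above expected coverage
print(is_sentinel(42, daily_counts))   # False: within normal variation
```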
shots.json ADDED
@@ -0,0 +1,34 @@
+[
+    {
+        "query": "Show me trending articles about age restrictions",
+        "url": "https://tobaccowatcher.globaltobactocontrol.org/articles/?popularity=trending_on_social&c=Age&st=&e=&lang=en&section=keywords&dups=0&sort=-timestamp"
+    },
+    {
+        "query": "Find popular articles about cigarettes and age limits",
+        "url": "https://tobaccowatcher.globaltobactocontrol.org/articles/?popularity=trending_on_social&c=Age&pro=cigarettes_primary&st=&e=&lang=en&section=keywords&dups=0&sort=-timestamp"
+    },
+    {
+        "query": "Show me trending articles about age and agriculture for cigarettes and cigars",
+        "url": "https://tobaccowatcher.globaltobactocontrol.org/articles/?popularity=trending_on_social&c=Age&c=Agriculture&pro=cigarettes_primary&pro=cigars_primary&st=&e=&lang=en&section=keywords&dups=0&sort=-timestamp"
+    },
+    {
+        "query": "Find trending articles about age and agriculture for cigarettes and cigars in Africa and Central America",
+        "url": "https://tobaccowatcher.globaltobactocontrol.org/articles/?popularity=trending_on_social&c=Age&c=Agriculture&pro=cigarettes_primary&pro=cigars_primary&st=&e=&r=Africa&r=Central+America&lang=en&section=keywords&dups=0&sort=-timestamp"
+    },
+    {
+        "query": "Show me Chinese articles about age and agriculture in East Asia",
+        "url": "https://tobaccowatcher.globaltobactocontrol.org/articles/?c=Age&c=Agriculture&la=zh&r=East+Asia&st=&e=&lang=en&section=keywords&dups=0&sort=-timestamp"
+    },
+    {
+        "query": "Find trending Chinese articles about cigarettes and cigars in East Asia",
+        "url": "https://tobaccowatcher.globaltobactocontrol.org/articles/?popularity=trending_on_social&pro=cigarettes_primary&pro=cigars_primary&la=zh&r=East+Asia&st=&e=&lang=en&section=keywords&dups=0&sort=-timestamp"
+    },
+    {
+        "query": "Show me French articles about e-cigarettes",
+        "url": "https://tobaccowatcher.globaltobactocontrol.org/articles/?pro=ecigs_primary&la=fr&st=&e=&lang=en&section=keywords&dups=0&sort=-timestamp"
+    },
+    {
+        "query": "Find articles about smoking prevalence in India",
+        "url": "https://tobaccowatcher.globaltobactocontrol.org/articles/?c=Prevalence&r=Central+Asia%3A+India&st=&e=&lang=en&section=keywords&dups=0&sort=-timestamp"
+    }
+]
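The few-shot pairs in shots.json map natural-language queries to filtered article URLs. A URL in that shape can be assembled with the standard library; the parameter names (`popularity`, `c`, `pro`, `la`, `r`, and the fixed trailing parameters) are taken from the examples above, while the `build_url` helper itself is a sketch, not part of the repo:

```python
from urllib.parse import urlencode

BASE = "https://tobaccowatcher.globaltobactocontrol.org/articles/"

def build_url(subjects=(), products=(), regions=(), language=None, trending=False):
    # Repeated keys (c=, pro=, r=) are expressed as a sequence of pairs,
    # which urlencode preserves in order; spaces become '+' as in the shots.
    params = []
    if trending:
        params.append(("popularity", "trending_on_social"))
    params += [("c", s) for s in subjects]
    params += [("pro", p) for p in products]
    if language:
        params.append(("la", language))
    params += [("r", r) for r in regions]
    params += [("st", ""), ("e", ""), ("lang", "en"),
               ("section", "keywords"), ("dups", "0"), ("sort", "-timestamp")]
    return BASE + "?" + urlencode(params)

# Reproduces the first shot above:
print(build_url(subjects=["Age"], trending=True))
```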