Danialebrat committed on
Commit 58db664 · 1 Parent(s): bf7e929

Adding HelpScout to UI

Files changed (33)
  1. .idea/vcs.xml +3 -1
  2. process_helpscout/README.md +339 -0
  3. process_helpscout/agents/README.md +310 -0
  4. process_helpscout/agents/__init__.py +0 -0
  5. process_helpscout/agents/base_agent.py +58 -0
  6. process_helpscout/agents/sentiment_analysis_agent.py +229 -0
  7. process_helpscout/agents/topic_extraction_agent.py +268 -0
  8. process_helpscout/config_files/processing_config.json +125 -0
  9. process_helpscout/config_files/topics.json +90 -0
  10. process_helpscout/data_fetcher.py +77 -0
  11. process_helpscout/fetch_and_export.py +183 -0
  12. process_helpscout/html_cleaner.py +169 -0
  13. process_helpscout/main.py +423 -0
  14. process_helpscout/snowflake_conn.py +106 -0
  15. process_helpscout/workflow/__init__.py +0 -0
  16. process_helpscout/workflow/conversation_processor.py +334 -0
  17. visualization/README.md +279 -140
  18. visualization/agents/helpscout_summary_agent.py +309 -0
  19. visualization/app.py +38 -10
  20. visualization/components/dashboard.py +55 -1
  21. visualization/components/helpscout_analysis.py +491 -0
  22. visualization/components/helpscout_dashboard.py +278 -0
  23. visualization/components/sentiment_analysis.py +38 -6
  24. visualization/config/viz_config.json +61 -1
  25. visualization/data/data_loader.py +25 -5
  26. visualization/data/helpscout_data_loader.py +382 -0
  27. visualization/utils/auth.py +0 -2
  28. visualization/utils/data_processor.py +46 -0
  29. visualization/utils/helpscout_pdf.py +471 -0
  30. visualization/utils/helpscout_utils.py +107 -0
  31. visualization/utils/pdf_exporter.py +80 -0
  32. visualization/visualizations/distribution_charts.py +131 -0
  33. visualization/visualizations/helpscout_charts.py +413 -0
.idea/vcs.xml CHANGED
@@ -1,4 +1,6 @@
 <?xml version="1.0" encoding="UTF-8"?>
 <project version="4">
-  <component name="VcsDirectoryMappings" defaultProject="true" />
+  <component name="VcsDirectoryMappings">
+    <mapping directory="" vcs="Git" />
+  </component>
 </project>
process_helpscout/README.md ADDED
@@ -0,0 +1,339 @@
# HelpScout Processing Pipeline

Extracts, cleans, and enriches customer support conversations from HelpScout.
The module has two distinct responsibilities:

1. **Data export** (`fetch_and_export.py`): fetches raw threads, cleans HTML, and exports CSVs for the Streamlit dashboard.
2. **AI processing pipeline** (`main.py`): fetches the same conversations, runs them through a two-step agentic workflow (sentiment + topic extraction), and writes enriched records to Snowflake.

---

## Folder Structure

```
process_helpscout/
│
├── main.py                          # Pipeline entry point (parallel processing)
├── data_fetcher.py                  # Fetches & aggregates conversations; deduplication check
├── fetch_and_export.py              # CSV export script (separate from the pipeline)
├── html_cleaner.py                  # HTML → clean plain text (shared by both workflows)
├── snowflake_conn.py                # Snowflake connection wrapper
│
├── agents/                          # LLM-based extraction agents
│   ├── README.md                    # Agent architecture docs (read this to extend)
│   ├── base_agent.py                # Abstract base class for all agents
│   ├── sentiment_analysis_agent.py  # Classifies sentiment polarity + emotions
│   └── topic_extraction_agent.py    # Assigns topic tags + billing flags
│
├── workflow/
│   └── conversation_processor.py    # LangGraph workflow: sentiment → topics → END
│
├── config_files/
│   ├── processing_config.json       # Agent models, batch settings, output table, sentiment categories
│   └── topics.json                  # HelpScout topic taxonomy (source of truth for topic extraction)
│
├── queries/
│   └── helpscout_conversations.sql  # SQL that fetches customer threads from Snowflake
│
├── sql/
│   └── create_features_table.sql    # DDL; run once before first pipeline execution
│
├── output/                          # Auto-created; holds CSV exports
│   ├── helpscout_threads.csv
│   └── helpscout_conversations.csv
│
└── visualization/                   # Streamlit dashboard (reads from CSV exports)
    ├── app.py
    ├── components/dashboard.py
    └── utils/data_processor.py
```

---

## Data Flow

### CSV Export (Dashboard)

```
Snowflake (STITCH.HELPSCOUT.CONVERSATION_THREADS)
    │  queries/helpscout_conversations.sql
    ▼
fetch_and_export.py
    │  process_threads(): clean HTML, add word_count, date columns
    │  aggregate_conversations(): one row per conversation_id
    ▼
output/helpscout_threads.csv        (one row per message thread)
output/helpscout_conversations.csv  (one row per conversation)
    │
    ▼
visualization/app.py → Streamlit dashboard
```

### AI Processing Pipeline

```
Snowflake (STITCH.HELPSCOUT.CONVERSATION_THREADS)
    │  Same SQL; customer threads only, Feb 17 2026+
    ▼
data_fetcher.fetch_conversations()
    │  Cleans HTML (html_cleaner.py)
    │  Aggregates to one row per conversation
    │  Checks HELPSCOUT_CONVERSATION_FEATURES for already-processed IDs
    ▼
main.py: splits into parallel batches
    │
    ├── Worker 1: ConversationProcessingWorkflow
    │     ├── Node 1: SentimentAnalysisAgent → polarity + emotions
    │     └── Node 2: TopicExtractionAgent → topics + billing flags
    │
    ├── Worker 2: ... (same)
    └── Worker N: ... (same)
    │
    ▼
SOCIAL_MEDIA_DB.ML_FEATURES.HELPSCOUT_CONVERSATION_FEATURES
```

---

## Setup

### 1. Environment variables

All credentials are read from the project root `.env` file.

| Key | Description |
|-----|-------------|
| `SNOWFLAKE_USER` | Snowflake username |
| `SNOWFLAKE_PASSWORD` | Snowflake password |
| `SNOWFLAKE_ACCOUNT` | Snowflake account identifier |
| `SNOWFLAKE_ROLE` | Role with access to `STITCH`, `ESTUARY`, and `SOCIAL_MEDIA_DB` |
| `SNOWFLAKE_WAREHOUSE` | Compute warehouse |
| `OPENAI_API_KEY` | Required for the AI pipeline only |
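
A minimal `.env` sketch with the keys from the table above (all values here are placeholders, not real credentials):

```bash
SNOWFLAKE_USER=jane_doe
SNOWFLAKE_PASSWORD=change-me
SNOWFLAKE_ACCOUNT=abc12345.us-east-1
SNOWFLAKE_ROLE=ANALYTICS_ROLE
SNOWFLAKE_WAREHOUSE=COMPUTE_WH
OPENAI_API_KEY=sk-...
```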

### 2. Dependencies

All dependencies are in the project root `requirements.txt`:
- `snowflake-snowpark-python`
- `beautifulsoup4`
- `pandas`, `numpy`
- `langchain-openai`, `langgraph`
- `python-dotenv`
- `streamlit`, `plotly` (dashboard only)

### 3. Create the output table (once)

Before running the pipeline for the first time, execute the DDL in Snowflake:

```sql
-- Run this in your Snowflake worksheet or via the Snowflake CLI
-- File: sql/create_features_table.sql
```

This creates `SOCIAL_MEDIA_DB.ML_FEATURES.HELPSCOUT_CONVERSATION_FEATURES` with a primary key on `CONVERSATION_ID`. The pipeline always appends; it never truncates the table.
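
For orientation, the shape of that DDL is roughly the following sketch; the real file is the source of truth, and the full column list is documented in the Output Table section below:

```sql
CREATE TABLE IF NOT EXISTS SOCIAL_MEDIA_DB.ML_FEATURES.HELPSCOUT_CONVERSATION_FEATURES (
    CONVERSATION_ID    VARCHAR NOT NULL PRIMARY KEY,
    CUSTOMER_EMAIL     VARCHAR,
    SENTIMENT_POLARITY VARCHAR,
    EMOTIONS           VARCHAR,
    TOPICS             VARCHAR,
    IS_REFUND_REQUEST  BOOLEAN,
    IS_CANCELLATION    BOOLEAN,
    IS_MEMBERSHIP      BOOLEAN,
    SUMMARY            TEXT,
    PROCESSING_ERRORS  TEXT,
    PROCESSED_AT       TIMESTAMP_NTZ,
    WORKFLOW_VERSION   VARCHAR
    -- ...remaining columns as documented in the Output Table section
);
```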

---

## Usage

### Run the AI processing pipeline

```bash
cd process_helpscout

# Process all new conversations (parallel, recommended)
python main.py

# Limit to 100 conversations; useful for a first test run
python main.py --limit 100

# Sequential mode: single process, easier to read logs when debugging
python main.py --sequential

# Use a custom config file
python main.py --config /path/to/my_config.json
```

On every run the pipeline:
1. Fetches all conversations (from Feb 17 2026 to today)
2. Queries the output table for already-processed `CONVERSATION_ID`s
3. Skips those, so only new conversations are sent to the LLM (see the sketch below)
4. Appends results to the Snowflake output table
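
The skip logic in steps 2-3 boils down to a set difference. A sketch, assuming a Snowpark session; the real implementation lives in `data_fetcher.py` and the variable names here are illustrative:

```python
# Collect IDs that are already in the output table
processed_ids = {
    row["CONVERSATION_ID"]
    for row in session.sql(
        "SELECT CONVERSATION_ID FROM SOCIAL_MEDIA_DB.ML_FEATURES.HELPSCOUT_CONVERSATION_FEATURES"
    ).collect()
}

# Keep only conversations the pipeline has not seen before
new_conversations = [c for c in conversations if c["conversation_id"] not in processed_ids]
```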

### Run the CSV export (dashboard data)

```bash
cd process_helpscout
python fetch_and_export.py
```

### Launch the Streamlit dashboard

```bash
cd process_helpscout
streamlit run visualization/app.py
```

---

## Output Table

**Table:** `SOCIAL_MEDIA_DB.ML_FEATURES.HELPSCOUT_CONVERSATION_FEATURES`

| Column | Type | Description |
|--------|------|-------------|
| `CONVERSATION_ID` | VARCHAR | HelpScout conversation ID (primary key) |
| `CUSTOMER_EMAIL` | VARCHAR | Customer email address |
| `CUSTOMER_FIRST` | VARCHAR | Customer first name |
| `CUSTOMER_LAST` | VARCHAR | Customer last name |
| `CUSTOMER_HS_ID` | NUMBER | HelpScout internal customer ID |
| `THREAD_COUNT` | NUMBER | Number of customer message threads |
| `FIRST_MESSAGE_AT` | TIMESTAMP_TZ | When the first customer message was sent |
| `LAST_MESSAGE_AT` | TIMESTAMP_TZ | When the last customer message was sent |
| `DURATION_HOURS` | FLOAT | Hours between first and last message |
| `STATUS` | VARCHAR | Last known HelpScout status |
| `STATE` | VARCHAR | Last known HelpScout state |
| `SOURCE_TYPE` | VARCHAR | e.g. `email`, `chat` |
| `SOURCE_VIA` | VARCHAR | e.g. `api`, `mailbox` |
| `COMBINED_TEXT` | TEXT | Raw aggregated customer messages |
| `CONVERSATION_TEXT_USED` | TEXT | Formatted + truncated text sent to the LLM |
| `SENTIMENT_POLARITY` | VARCHAR | `very_positive` / `positive` / `neutral` / `negative` / `very_negative` |
| `EMOTIONS` | VARCHAR | Comma-separated emotion values (NULL if none valid) |
| `SENTIMENT_CONFIDENCE` | VARCHAR | `high` / `medium` / `low` |
| `SENTIMENT_NOTES` | TEXT | 1-2 sentence LLM explanation of the sentiment |
| `TOPICS` | VARCHAR | Comma-separated topic IDs (multi-label) |
| `IS_REFUND_REQUEST` | BOOLEAN | Customer explicitly asked for a refund |
| `IS_CANCELLATION` | BOOLEAN | Customer explicitly wants to cancel |
| `IS_MEMBERSHIP` | BOOLEAN | Customer wants to join/rejoin and purchase membership |
| `TOPIC_CONFIDENCE` | VARCHAR | `high` / `medium` / `low` |
| `TOPIC_NOTES` | TEXT | 1-2 sentence LLM explanation of topics |
| `SUMMARY` | TEXT | 2-3 sentence neutral summary of the conversation |
| `PROCESSING_ERRORS` | TEXT | Semicolon-separated errors (NULL on full success) |
| `PROCESSED_AT` | TIMESTAMP_NTZ | When this record was written by the pipeline |
| `WORKFLOW_VERSION` | VARCHAR | Pipeline version for auditability |

---

## Configuration

All pipeline settings live in `config_files/processing_config.json`.

### Agent models

```json
"agents": {
  "sentiment_analysis": {
    "model": "gpt-4o-mini",
    "temperature": 0.2,
    "max_retries": 3
  },
  "topic_extraction": {
    "model": "gpt-4o-mini",
    "temperature": 0.2,
    "max_retries": 3
  }
}
```

Switch any agent to `gpt-4o` for higher accuracy (at higher cost) by changing the `"model"` value.

### Conversation length

```json
"processing": {
  "max_conversation_chars": 3000,
  "min_batch_size": 10,
  "max_batch_size": 50
}
```

`max_conversation_chars` controls how many characters of conversation text are sent to the LLM. Increasing this improves context for long conversations but raises token costs. The workflow formats messages as `[1] msg\n[2] msg…` and truncates at this limit.
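
Conceptually, the formatting and truncation step is equivalent to this sketch (illustrative only, not the actual workflow code):

```python
def format_conversation(messages: list[str], max_chars: int = 3000) -> str:
    """Number each customer message, join with newlines, truncate at the limit."""
    text = "\n".join(f"[{i}] {msg}" for i, msg in enumerate(messages, start=1))
    return text[:max_chars]
```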

### Output destination

```json
"output": {
  "database": "SOCIAL_MEDIA_DB",
  "schema": "ML_FEATURES",
  "table": "HELPSCOUT_CONVERSATION_FEATURES"
}
```

To write to a different table (e.g. a staging or test table), change these values and re-run the DDL in `sql/create_features_table.sql` for the new table name.

### Sentiment categories

The `sentiment_polarity` and `emotions` blocks in `processing_config.json` define the valid values for classification. Adding, removing, or renaming a category there is automatically reflected in both the LLM prompt and the output validation; no code changes are required.

### Topic taxonomy

Topic definitions live in `config_files/topics.json`. This file is the single source of truth: the `TopicExtractionAgent` builds its system prompt directly from it. To add a new topic:

1. Add an entry to the `"topics"` array with a unique `id`, `label`, and `description` (see the example below).
2. If the topic has boolean sub-flags (like billing), add a `"flags"` key, then update `topic_extraction_agent.py` to extract those flags.
3. Re-run the pipeline; the new topic will be available immediately.
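
For illustration, a hypothetical new entry might look like:

```json
{
  "id": "instrument_and_gear",
  "label": "Instrument & Gear",
  "description": "Questions about instruments, hardware, or accessories rather than the platform itself."
}
```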

---

## SQL Query

**File:** `queries/helpscout_conversations.sql`

| Design decision | Detail |
|-----------------|--------|
| Date filter | `CREATED_AT >= '2026-02-17'` to current date |
| Team exclusion | Anti-join with `USORA_USERS WHERE access_level = 'team'`, so only customer messages reach the pipeline |
| Thread types | `TYPE IN ('customer', 'message')`; excludes notes, forwarded threads, system messages |
| JSON extraction | Snowflake semi-structured syntax: `COLUMN:field::VARCHAR` |

To change the date range, edit the `WHERE ct.CREATED_AT >= '...'` line in the SQL file.
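
The semi-structured extraction pattern looks like this in practice (the column and field names here are illustrative, not copied from the real query):

```sql
SELECT
    ct.CONVERSATION_ID,
    ct.CUSTOMER:email::VARCHAR AS customer_email,
    ct.CUSTOMER:first::VARCHAR AS customer_first
FROM STITCH.HELPSCOUT.CONVERSATION_THREADS ct
WHERE ct.CREATED_AT >= '2026-02-17';
```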

---

## HTML Cleaner

`html_cleaner.py` runs a four-stage pipeline on every message body:

| Stage | What it removes |
|-------|----------------|
| `_remove_quoted_sections()` | `<blockquote>` tags and Gmail/Outlook/Yahoo quoted-reply CSS wrappers |
| `_remove_boilerplate()` | `<table>`, `<img>`, `<script>`, `<style>` tags and footer/unsubscribe blocks |
| `_extract_text()` | Extracts plain text while preserving line breaks |
| `_clean_text()` | Strips invisible Unicode, collapses whitespace, removes `>` quote lines, cuts off at "On … wrote:" markers |

To add a new boilerplate pattern, append a string to `footer_keywords` inside `_remove_boilerplate()`, or add a CSS class fragment to `_QUOTED_CLASS_PATTERNS` at the top of the file.
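
A stripped-down sketch of the overall shape (the real module handles many more edge cases, such as client-specific quoted-reply wrappers and invisible Unicode):

```python
from bs4 import BeautifulSoup

def clean_html(html: str) -> str:
    """Minimal four-stage cleaner sketch; see html_cleaner.py for the real logic."""
    soup = BeautifulSoup(html, "html.parser")
    # Stages 1-2: drop quoted replies and boilerplate markup
    for tag in soup.find_all(["blockquote", "table", "img", "script", "style"]):
        tag.decompose()
    # Stage 3: extract plain text while preserving line breaks
    text = soup.get_text(separator="\n")
    # Stage 4: strip "> " quote lines and collapse whitespace
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line and not line.startswith(">"))
```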

---

## Extending the Pipeline

### Add a third agentic step

1. Create `agents/your_new_agent.py` inheriting from `BaseAgent` (see `agents/README.md`).
2. Add a new node method `_your_node()` in `workflow/conversation_processor.py`.
3. Add the node and a new edge in `_build_workflow()`:
   ```python
   graph.add_node("your_step", self._your_node)
   graph.add_edge("topic_extraction", "your_step")
   graph.add_edge("your_step", END)
   ```
4. Add the corresponding output fields to `ConversationState`.
5. Map new columns in `main.py`'s `column_map` dict and add them to the DDL.

### Change the date range

Edit `queries/helpscout_conversations.sql`:
```sql
ct.CREATED_AT >= '2026-02-17 00:00:00'  -- ← change start date
```

### Include team replies

Remove the anti-join in `helpscout_conversations.sql` and broaden the `TYPE` filter to also include `'note'`. Be sure to update the HTML cleaning and aggregation if team messages need different handling.

### Process a different HelpScout mailbox

Add a `WHERE` clause on a mailbox ID column if available, or filter by `source_via` / `status`.

### Automate daily runs

Schedule `main.py` with a cron job, Airflow DAG, or any task scheduler. Because the pipeline skips already-processed conversations, re-running it daily processes only new conversations; no manual bookkeeping is needed.
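
For example, a crontab entry that runs the pipeline every morning at 06:00 (paths are illustrative):

```bash
0 6 * * * cd /path/to/repo/process_helpscout && /path/to/venv/bin/python main.py >> /var/log/helpscout_pipeline.log 2>&1
```
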
process_helpscout/agents/README.md ADDED
@@ -0,0 +1,310 @@
# Agents

The agents package contains the LLM-based extraction components used in the HelpScout processing pipeline. Each agent is a self-contained class responsible for one well-defined task.

---

## Architecture

```
BaseAgent (base_agent.py)
│
├── SentimentAnalysisAgent (sentiment_analysis_agent.py)
│       Classifies overall sentiment polarity and emotions
│       from a customer support conversation.
│
└── TopicExtractionAgent (topic_extraction_agent.py)
        Assigns one or more topic tags and extracts
        billing-specific boolean flags.
```

All agents follow the same contract defined in `BaseAgent`:

| Method | Required | Description |
|--------|----------|-------------|
| `validate_input(input_data)` | Yes | Returns `True` if the input dict has the required fields |
| `process(input_data)` | Yes | Main entry point: validates, calls the LLM, returns a result dict |
| `log_processing(message, level)` | Inherited | Logs `[AgentName] message` at the given level |
| `handle_error(error, context)` | Inherited | Returns a standardised `{"success": False, "error": ...}` dict |

The workflow (`workflow/conversation_processor.py`) calls `agent.process(input_data)` for each node. Agents never call each other; they are orchestrated exclusively by the workflow.

---

## BaseAgent (`base_agent.py`)

Defines the interface every agent must implement. Contains no LLM logic.

### Key attributes set from config

```python
self.model        # LLM model name, e.g. "gpt-4o-mini"
self.temperature  # Sampling temperature (default: 0.2)
self.max_retries  # Reserved for retry logic in subclasses
```

These are read from the agent's block in `config_files/processing_config.json`:
```json
"agents": {
  "sentiment_analysis": { "model": "gpt-4o-mini", "temperature": 0.2, "max_retries": 3 }
}
```

### Return contract

Every `process()` implementation must return a dict with at minimum:
```python
{"success": True, ...}                    # on success; include extracted fields
{"success": False, "error": "<reason>"}   # on failure
```

The workflow checks `success` to decide whether to mark a conversation as failed.

---

## SentimentAnalysisAgent (`sentiment_analysis_agent.py`)

Classifies the overall **sentiment polarity** and **emotions** expressed across a customer's conversation messages.

### Input

```python
agent.process({
    "conversation_text": "<formatted, truncated customer messages>"
})
```

The `conversation_text` is prepared by the workflow before calling the agent: it consists of numbered, pipe-delimited messages truncated to `max_conversation_chars`.

### Output (on success)

```python
{
    "success": True,
    "sentiment_polarity": "negative",           # one of the 5 polarity values
    "emotions": "frustration, disappointment",  # comma-separated, or None (soft-fail)
    "sentiment_confidence": "high",
    "sentiment_notes": "Customer is frustrated by repeated login failures."
}
```

### Validation rules

| Field | Behaviour on invalid value |
|-------|---------------------------|
| `sentiment_polarity` | Hard fail: the conversation is not stored |
| `emotions` | Soft fail: `None` is stored and the conversation is still written |
| `confidence` | Silently corrected to `"medium"` |

### Where categories are defined

Polarity and emotion categories (their `value` and `description` strings) live in `config_files/processing_config.json` under `"sentiment_polarity"` and `"emotions"`. The system prompt is **built at init time from the config**, so updating the config is all you need to change what the LLM is instructed to classify.

### Modifying the sentiment prompt

The system prompt is assembled in `_build_system_prompt()`. To change the framing or add additional instructions, edit that method directly. The category lists are injected automatically from config; do not hardcode them in the prompt.

---

## TopicExtractionAgent (`topic_extraction_agent.py`)

Assigns one or more **topic tags** from the Musora HelpScout taxonomy, extracts three **billing/membership boolean flags**, and produces a brief **neutral summary** of the conversation.

### Input

```python
agent.process({
    "conversation_text": "<formatted, truncated customer messages>"
})
```

### Output (on success)

```python
{
    "success": True,
    "topics": "billing_and_subscription, account_and_access",  # comma-separated IDs
    "is_refund_request": True,   # customer explicitly asked for money back
    "is_cancellation": False,    # customer did NOT explicitly ask to cancel
    "is_membership": False,      # customer wants to join/rejoin and purchase membership
    "topic_confidence": "high",
    "topic_notes": "Customer was unexpectedly charged and is requesting a refund.",
    "summary": "The customer reports being charged after believing they had cancelled their subscription. They are requesting a full refund and confirmation that no further charges will occur."
}
```

### Validation rules

| Field | Behaviour on invalid value |
|-------|---------------------------|
| `topics` | Hard fail if no valid topic IDs remain after filtering |
| `is_refund_request` / `is_cancellation` / `is_membership` | Coerced to `bool`; defaults to `False` if missing |
| `confidence` | Silently corrected to `"medium"` |
| `summary` | Soft fail: `""` stored if missing; conversation still written |

### Where topics are defined

All topic definitions live in `config_files/topics.json`. The agent builds its system prompt directly from this file at init time, so adding, removing, or rewriting a topic description requires only a config change.

### Billing and membership flags

`is_refund_request`, `is_cancellation`, and `is_membership` are extracted on every conversation regardless of which topics are assigned. They are defined in `topics.json` under `billing_and_subscription.flags` for documentation purposes, but the agent always asks the LLM to evaluate them independently.

### Summary

The `summary` field is a 2-3 sentence factual, third-person overview of the conversation: what the customer contacted support about, relevant context they provided, and their core request. It is designed to give a reader instant context without reading the full conversation, and can also be used as compact input when chaining LLM calls.

---

## How to Add a New Agent

Follow these steps to add a third extraction step (e.g. urgency scoring):

### Step 1: Create the agent file

```python
# agents/urgency_agent.py
from agents.base_agent import BaseAgent
from langchain_openai import ChatOpenAI
from langchain.schema import HumanMessage, SystemMessage
import json, logging

logger = logging.getLogger(__name__)

class UrgencyAgent(BaseAgent):

    def __init__(self, config, api_key):
        super().__init__("UrgencyAgent", config)
        self.llm = ChatOpenAI(
            model=self.model,
            temperature=self.temperature,
            api_key=api_key,
            model_kwargs={"response_format": {"type": "json_object"}},
        )
        self._system_prompt = (
            "Classify the urgency of this customer support conversation.\n"
            'Return JSON: {"urgency": "high"|"medium"|"low", "urgency_notes": "<reason>"}'
        )

    def validate_input(self, input_data):
        return "conversation_text" in input_data and bool(input_data["conversation_text"])

    def process(self, input_data):
        if not self.validate_input(input_data):
            return {"success": False, "error": "Missing conversation_text"}
        try:
            response = self.llm.invoke([
                SystemMessage(content=self._system_prompt),
                HumanMessage(content=input_data["conversation_text"]),
            ])
            raw = json.loads(response.content)
            urgency = raw.get("urgency", "medium")
            if urgency not in {"high", "medium", "low"}:
                urgency = "medium"
            return {
                "success": True,
                "urgency": urgency,
                "urgency_notes": raw.get("urgency_notes", ""),
            }
        except Exception as e:
            return self.handle_error(e, "urgency_classification")
```

### Step 2: Add config for the new agent

In `config_files/processing_config.json`:
```json
"agents": {
  "sentiment_analysis": { ... },
  "topic_extraction": { ... },
  "urgency": {
    "model": "gpt-4o-mini",
    "temperature": 0.1,
    "max_retries": 3
  }
}
```

### Step 3: Add a node to the workflow

In `workflow/conversation_processor.py`:

```python
# 1. Import the new agent
from agents.urgency_agent import UrgencyAgent

# 2. Instantiate in __init__
self.urgency_agent = UrgencyAgent(config["agents"]["urgency"], api_key)

# 3. Add fields to ConversationState
urgency: str
urgency_notes: str

# 4. Add the node method
def _urgency_node(self, state):
    try:
        result = self.urgency_agent.process({"conversation_text": state["conversation_text"]})
        if result.get("success"):
            state["urgency"] = result.get("urgency")
            state["urgency_notes"] = result.get("urgency_notes", "")
        else:
            state["processing_errors"] = state.get("processing_errors", []) + [
                f"Urgency failed: {result.get('error')}"
            ]
            state["urgency"] = None
    except Exception as e:
        state["processing_errors"] = state.get("processing_errors", []) + [str(e)]
    return state

# 5. Wire into the graph in _build_workflow()
graph.add_node("urgency", self._urgency_node)
graph.add_edge("topic_extraction", "urgency")  # replaces the old edge to END
graph.add_edge("urgency", END)
```

### Step 4: Add output columns

In `main.py`, add to the `column_map` dict:
```python
"urgency": "URGENCY",
"urgency_notes": "URGENCY_NOTES",
```

In `sql/create_features_table.sql`, add:
```sql
URGENCY VARCHAR(20),
URGENCY_NOTES TEXT,
```

Run `ALTER TABLE` or recreate the table for the new columns to appear.

---

## How to Modify an Existing Agent

### Change the LLM model or temperature

Edit `config_files/processing_config.json`; no code change is needed.

### Add or rename a sentiment category

In `config_files/processing_config.json`, update `sentiment_polarity.categories` or `emotions.categories`. The agent reads these at init and builds the prompt and validation set dynamically. The only code-level change is updating the output table column type/constraint if the new value is longer than the current `VARCHAR` size.
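
If a new value does outgrow the column, widening it in Snowflake is a one-liner (column name illustrative):

```sql
ALTER TABLE SOCIAL_MEDIA_DB.ML_FEATURES.HELPSCOUT_CONVERSATION_FEATURES
    ALTER COLUMN SENTIMENT_POLARITY SET DATA TYPE VARCHAR(50);
```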

### Add or rename a topic

In `config_files/topics.json`, add or edit an entry in the `"topics"` array. The `TopicExtractionAgent` reads this file at init; the new topic appears in the prompt and validation automatically.

### Change the conversation truncation limit

In `config_files/processing_config.json`:
```json
"processing": {
  "max_conversation_chars": 3000
}
```

This is read by the workflow (`conversation_processor.py`) before formatting the conversation text; no agent code changes are needed.

### Modify the system prompt framing

Each agent builds its prompt in a `_build_system_prompt()` method. Edit that method directly. Category lists are always injected from config; avoid hardcoding values that already live in the JSON.
process_helpscout/agents/__init__.py ADDED
File without changes
process_helpscout/agents/base_agent.py ADDED
@@ -0,0 +1,58 @@
"""
Base Agent class for all agents in the HelpScout processing workflow.
Provides a common interface and consistent error handling.
"""

from abc import ABC, abstractmethod
from typing import Dict, Any
import logging

logger = logging.getLogger(__name__)


class BaseAgent(ABC):
    """
    Abstract base class for all agents in the agentic workflow.
    Enforces a consistent interface and provides shared utilities.
    """

    def __init__(self, name: str, config: Dict[str, Any]):
        self.name = name
        self.config = config
        self.model = config.get("model", "gpt-5-nano")
        self.temperature = config.get("temperature", 0.2)
        self.max_retries = config.get("max_retries", 3)
        logger.info(f"Initialized {self.name} with model {self.model}")

    @abstractmethod
    def process(self, input_data: Dict[str, Any]) -> Dict[str, Any]:
        """
        Process input data and return results.
        Must be implemented by all concrete agent classes.
        """
        pass

    @abstractmethod
    def validate_input(self, input_data: Dict[str, Any]) -> bool:
        """
        Validate input data before processing.
        Returns True if input is valid, False otherwise.
        """
        pass

    def log_processing(self, message: str, level: str = "info"):
        log_method = getattr(logger, level, logger.info)
        log_method(f"[{self.name}] {message}")

    def handle_error(self, error: Exception, context: str = "") -> Dict[str, Any]:
        error_msg = f"Error in {self.name}"
        if context:
            error_msg += f" ({context})"
        error_msg += f": {str(error)}"
        logger.error(error_msg)
        return {
            "success": False,
            "error": str(error),
            "agent": self.name,
            "context": context,
        }
process_helpscout/agents/sentiment_analysis_agent.py ADDED
@@ -0,0 +1,229 @@
"""
Sentiment Analysis Agent for HelpScout customer support conversations.

Classifies the overall sentiment polarity and emotions from a customer's
conversation with Musora support. Unlike the social media variant, this
agent operates on full conversations (multiple messages) rather than
individual comments, and does not extract intents or compute requires_reply
(all support tickets inherently require a response).
"""

from typing import Dict, Any, List, Optional
import json
from langchain_openai import ChatOpenAI
from langchain.schema import HumanMessage, SystemMessage
from agents.base_agent import BaseAgent
import logging

logger = logging.getLogger(__name__)


class SentimentAnalysisAgent(BaseAgent):
    """
    Classifies the sentiment polarity and emotions of a customer support
    conversation from HelpScout.

    Design decisions:
    - System prompt is built once at init from the config categories
    - Emotions are soft-fail: None stored when the field is missing or invalid
    - Input is the formatted conversation text (already truncated upstream)
    """

    def __init__(self, config: Dict[str, Any], api_key: str, processing_config: Dict[str, Any]):
        """
        Args:
            config: Agent-level config dict (model, temperature, max_retries)
            api_key: OpenAI API key
            processing_config: Full processing_config.json content (for categories)
        """
        super().__init__("SentimentAnalysisAgent", config)
        self.api_key = api_key

        # Pre-compute valid value sets from config for O(1) validation
        self._valid_polarities = {
            cat["value"] for cat in processing_config["sentiment_polarity"]["categories"]
        }
        self._valid_emotions = {
            cat["value"] for cat in processing_config["emotions"]["categories"]
        }
        self._emotions_soft_fail = processing_config["emotions"].get("soft_fail", True)

        self.llm = ChatOpenAI(
            model=self.model,
            temperature=self.temperature,
            api_key=self.api_key,
            model_kwargs={"response_format": {"type": "json_object"}},
        )

        # Build system prompt once; reused for every LLM call
        self._system_prompt = self._build_system_prompt(processing_config)

    # ------------------------------------------------------------------
    # Prompt construction
    # ------------------------------------------------------------------

    def _build_system_prompt(self, processing_config: Dict[str, Any]) -> str:
        polarity_lines = "\n".join(
            f"- {cat['value']}: {cat['description']}"
            for cat in processing_config["sentiment_polarity"]["categories"]
        )
        emotion_lines = "\n".join(
            f"- {cat['value']}: {cat['description']}"
            for cat in processing_config["emotions"]["categories"]
        )

        return (
            "You are analyzing customer support conversations for Musora, a music education platform.\n\n"
            "You will receive one or more messages from a customer (team responses are excluded). "
            "Classify the overall sentiment and emotional tone of the CUSTOMER's messages as a whole.\n\n"
            "Return JSON only:\n"
            '{"sentiment_polarity": <value>, "emotions": [<values>], '
            '"confidence": "high"|"medium"|"low", "analysis_notes": "<1-2 sentences>"}\n\n'
            f"POLARITY (pick one):\n{polarity_lines}\n\n"
            f"EMOTIONS (multi-label, pick all that apply; use [\"neutral\"] if none detected):\n{emotion_lines}\n\n"
            "Guidelines:\n"
            "- Base your classification on the customer's overall tone, not isolated words\n"
            "- A customer reporting a technical issue with no emotional language → neutral\n"
            "- A customer expressing frustration alongside their issue → negative\n"
            "- analysis_notes: 1-2 sentences highlighting the key sentiment drivers"
        )

    def _build_user_prompt(self, conversation_text: str) -> str:
        return f"Customer conversation:\n\n{conversation_text}"

    # ------------------------------------------------------------------
    # Output validation
    # ------------------------------------------------------------------

    def _parse_emotions(self, raw_emotions: Any) -> Optional[List[str]]:
        """Soft-fail emotion parsing; returns None instead of raising."""
        if not raw_emotions:
            return None
        if isinstance(raw_emotions, str):
            raw_emotions = [e.strip() for e in raw_emotions.split(",")]
        if not isinstance(raw_emotions, list):
            return None
        valid = [e for e in raw_emotions if e in self._valid_emotions]
        return valid if valid else None

    def _validate_result(self, raw: Dict[str, Any]) -> Dict[str, Any]:
        """
        Validate LLM output against config-defined allowed values.
        - Invalid polarity → hard fail (conversation will not be stored)
        - Invalid emotions → soft fail (None; conversation still stored)
        - Invalid confidence → corrected to "medium"
        """
        polarity = raw.get("sentiment_polarity")
        if not polarity or polarity not in self._valid_polarities:
            return {
                "success": False,
                "error": (
                    f"Invalid sentiment_polarity '{polarity}'. "
                    f"Expected one of: {sorted(self._valid_polarities)}"
                ),
            }

        confidence = raw.get("confidence", "medium")
        if confidence not in {"high", "medium", "low"}:
            confidence = "medium"

        emotions = self._parse_emotions(raw.get("emotions"))

        return {
            "success": True,
            "sentiment_polarity": polarity,
            "emotions": emotions,
            "confidence": confidence,
            "analysis_notes": str(raw.get("analysis_notes", "")).strip(),
        }

    # ------------------------------------------------------------------
    # Core analysis
    # ------------------------------------------------------------------

    def analyze(self, conversation_text: str) -> Dict[str, Any]:
        """
        Call the LLM to classify sentiment of the customer conversation.

        Args:
            conversation_text: Pre-formatted, truncated conversation text

        Returns:
            Success dict with sentiment fields, or failure dict with error key.
        """
        user_prompt = self._build_user_prompt(conversation_text)

        try:
            messages = [
                SystemMessage(content=self._system_prompt),
                HumanMessage(content=user_prompt),
            ]
            response = self.llm.invoke(messages)
            raw = json.loads(response.content)

            validated = self._validate_result(raw)
            if not validated["success"]:
                self.log_processing(f"Validation failed: {validated['error']}", "warning")
                return validated

            emotions_list = validated.get("emotions")
            return {
                "success": True,
                "sentiment_polarity": validated["sentiment_polarity"],
                "emotions": ", ".join(emotions_list) if emotions_list else None,
                "sentiment_confidence": validated["confidence"],
                "sentiment_notes": validated["analysis_notes"],
            }

        except json.JSONDecodeError as e:
            self.log_processing(f"JSON decode error: {e}", "warning")
            return {"success": False, "error": f"JSON parse error: {e}"}

        except Exception as e:
            self.log_processing(f"Sentiment analysis failed: {e}", "error")
            return {"success": False, "error": str(e)}

    # ------------------------------------------------------------------
    # Agent interface
    # ------------------------------------------------------------------

    def validate_input(self, input_data: Dict[str, Any]) -> bool:
        return "conversation_text" in input_data and bool(input_data["conversation_text"])

    def process(self, input_data: Dict[str, Any]) -> Dict[str, Any]:
        """
        Args:
            input_data: Must contain 'conversation_text' (formatted, truncated).

        Returns:
            Dict with sentiment fields merged on top of input_data.
        """
        try:
            if not self.validate_input(input_data):
                return {
                    "success": False,
                    "error": "Invalid input: 'conversation_text' is required and must be non-empty",
                }

            self.log_processing("Analyzing conversation sentiment", "debug")
            result = self.analyze(input_data["conversation_text"])

            output = {
                "success": result.get("success", False),
                "sentiment_polarity": result.get("sentiment_polarity"),
                "emotions": result.get("emotions"),
                "sentiment_confidence": result.get("sentiment_confidence"),
                "sentiment_notes": result.get("sentiment_notes", ""),
            }
            if "error" in result:
                output["sentiment_error"] = result["error"]

            # Preserve all original input fields
            for key, value in input_data.items():
                if key not in output:
                    output[key] = value

            return output

        except Exception as e:
            return self.handle_error(e, "sentiment_analysis")
process_helpscout/agents/topic_extraction_agent.py ADDED
@@ -0,0 +1,268 @@
"""
Topic Extraction Agent for HelpScout customer support conversations.

Assigns one or more topic tags from the Musora HelpScout taxonomy to a
customer conversation. Also extracts three boolean billing signals:
- is_refund_request: customer explicitly wants their money back
- is_cancellation: customer wants to cancel their subscription
- is_membership: customer wants to join/rejoin and purchase membership

Also produces a brief neutral summary (2-3 sentences) of the conversation.

Topic definitions are loaded from config_files/topics.json so any taxonomy
update is automatically reflected in the prompt without code changes.
"""

from typing import Dict, Any, List, Optional
import json
from langchain_openai import ChatOpenAI
from langchain.schema import HumanMessage, SystemMessage
from agents.base_agent import BaseAgent
import logging

logger = logging.getLogger(__name__)


class TopicExtractionAgent(BaseAgent):
    """
    Extracts topic tags and billing flags from a customer support conversation.

    Design decisions:
    - Topics are multi-label: a conversation can receive multiple tags
    - The 'uncategorized' topic is valid but discouraged (see topics.json notes)
    - is_refund_request / is_cancellation are always extracted independently,
      even when billing_and_subscription is not the primary topic
    - System prompt is built once at init from topics.json
    """

    def __init__(self, config: Dict[str, Any], api_key: str, topics_config: Dict[str, Any]):
        """
        Args:
            config: Agent-level config dict (model, temperature, max_retries)
            api_key: OpenAI API key
            topics_config: Parsed topics.json content
        """
        super().__init__("TopicExtractionAgent", config)
        self.api_key = api_key
        self.topics_config = topics_config

        # Pre-compute valid topic ID set for O(1) validation
        self._valid_topics = {topic["id"] for topic in topics_config["topics"]}

        self.llm = ChatOpenAI(
            model=self.model,
            temperature=self.temperature,
            api_key=self.api_key,
            model_kwargs={"response_format": {"type": "json_object"}},
        )

        # Build system prompt once; reused for every LLM call
        self._system_prompt = self._build_system_prompt()

    # ------------------------------------------------------------------
    # Prompt construction
    # ------------------------------------------------------------------

    def _build_system_prompt(self) -> str:
        topic_lines = "\n".join(
            f"- {topic['id']}: {topic['description']}"
            for topic in self.topics_config["topics"]
        )

        usage_notes = "\n".join(
            f"  • {note}"
            for note in self.topics_config.get("_meta", {}).get("usage_notes", [])
        )

        return (
            "You are classifying customer support conversations for Musora, a music education platform.\n\n"
            "Assign one or more topic tags to the customer's conversation based on what they are "
            "contacting support about.\n\n"
            "Return JSON only:\n"
            '{\n'
            '  "topics": [<topic_ids>],\n'
            '  "is_refund_request": true|false,\n'
            '  "is_cancellation": true|false,\n'
            '  "is_membership": true|false,\n'
            '  "confidence": "high"|"medium"|"low",\n'
            '  "topic_notes": "<1-2 sentences explaining the classification>",\n'
            '  "summary": "<2-3 sentence neutral summary of the conversation>"\n'
            '}\n\n'
            f"AVAILABLE TOPICS (use the id values exactly):\n{topic_lines}\n\n"
            f"RULES:\n{usage_notes}\n\n"
            "BILLING FLAGS (always extract, regardless of topic):\n"
            "  • is_refund_request: true ONLY when the customer explicitly asks for money back\n"
            "  • is_cancellation: true ONLY when the customer explicitly wants to cancel their subscription\n"
            "  • is_membership: true ONLY when the customer wants to join or rejoin and purchase a membership\n\n"
            "SUMMARY GUIDELINES:\n"
            "  • Write 2-3 sentences maximum\n"
            "  • Be factual and neutral; do not repeat sentiment or topic labels\n"
            "  • Capture: what the customer contacted support about, any key context or history they provided, "
            "and the core request or outcome they are seeking\n"
            "  • Write in third person (e.g. 'The customer reports...')\n\n"
            "IMPORTANT:\n"
            "  - Focus on the customer's messages; ignore any team response context\n"
            "  - Use exact topic id strings from the list above\n"
            "  - topic_notes: briefly explain why you chose these topics"
        )

    def _build_user_prompt(self, conversation_text: str) -> str:
        return f"Customer conversation:\n\n{conversation_text}"

    # ------------------------------------------------------------------
    # Output validation
    # ------------------------------------------------------------------

    def _validate_topics(self, raw_topics: Any) -> Optional[List[str]]:
        """
        Validate and filter the topics list from LLM output.
        Returns None if no valid topics remain (hard fail).
        """
        if not raw_topics:
            return None
        if isinstance(raw_topics, str):
            raw_topics = [t.strip() for t in raw_topics.split(",")]
        if not isinstance(raw_topics, list):
            return None
        valid = [t for t in raw_topics if t in self._valid_topics]
        return valid if valid else None

    def _validate_result(self, raw: Dict[str, Any]) -> Dict[str, Any]:
        """
        Validate LLM output.
        - No valid topics → hard fail
        - Invalid confidence → corrected to "medium"
        - Boolean flags: default to False if missing or non-boolean
        """
        topics = self._validate_topics(raw.get("topics"))
        if not topics:
            return {
                "success": False,
                "error": (
                    f"No valid topics in response: {raw.get('topics')}. "
                    f"Expected values from: {sorted(self._valid_topics)}"
                ),
            }

        confidence = raw.get("confidence", "medium")
        if confidence not in {"high", "medium", "low"}:
            confidence = "medium"

        is_refund = raw.get("is_refund_request", False)
        is_cancel = raw.get("is_cancellation", False)
        is_membership = raw.get("is_membership", False)

        # Coerce to bool in case LLM returns strings
        if not isinstance(is_refund, bool):
            is_refund = str(is_refund).lower() in ("true", "1", "yes")
        if not isinstance(is_cancel, bool):
            is_cancel = str(is_cancel).lower() in ("true", "1", "yes")
        if not isinstance(is_membership, bool):
            is_membership = str(is_membership).lower() in ("true", "1", "yes")

        return {
            "success": True,
            "topics": topics,
            "is_refund_request": is_refund,
            "is_cancellation": is_cancel,
            "is_membership": is_membership,
            "confidence": confidence,
            "topic_notes": str(raw.get("topic_notes", "")).strip(),
            "summary": str(raw.get("summary", "")).strip(),
        }

    # ------------------------------------------------------------------
    # Core extraction
    # ------------------------------------------------------------------

    def extract(self, conversation_text: str) -> Dict[str, Any]:
        """
        Call the LLM to assign topics and billing flags.

        Args:
            conversation_text: Pre-formatted, truncated conversation text

        Returns:
            Success dict with topic fields, or failure dict with error key.
        """
        user_prompt = self._build_user_prompt(conversation_text)

        try:
            messages = [
                SystemMessage(content=self._system_prompt),
                HumanMessage(content=user_prompt),
            ]
            response = self.llm.invoke(messages)
            raw = json.loads(response.content)

            validated = self._validate_result(raw)
            if not validated["success"]:
                self.log_processing(f"Validation failed: {validated['error']}", "warning")
                return validated

            return {
                "success": True,
                "topics": ", ".join(validated["topics"]),  # comma-separated for DB storage
                "is_refund_request": validated["is_refund_request"],
                "is_cancellation": validated["is_cancellation"],
                "is_membership": validated["is_membership"],
                "topic_confidence": validated["confidence"],
                "topic_notes": validated["topic_notes"],
                "summary": validated["summary"],
            }

        except json.JSONDecodeError as e:
            self.log_processing(f"JSON decode error: {e}", "warning")
            return {"success": False, "error": f"JSON parse error: {e}"}

        except Exception as e:
            self.log_processing(f"Topic extraction failed: {e}", "error")
            return {"success": False, "error": str(e)}

    # ------------------------------------------------------------------
    # Agent interface
    # ------------------------------------------------------------------

    def validate_input(self, input_data: Dict[str, Any]) -> bool:
        return "conversation_text" in input_data and bool(input_data["conversation_text"])

    def process(self, input_data: Dict[str, Any]) -> Dict[str, Any]:
        """
        Args:
            input_data: Must contain 'conversation_text'.

        Returns:
            Dict with topic fields merged on top of input_data.
        """
        try:
            if not self.validate_input(input_data):
                return {
                    "success": False,
                    "error": "Invalid input: 'conversation_text' is required and must be non-empty",
                }

            self.log_processing("Extracting topics from conversation", "debug")
            result = self.extract(input_data["conversation_text"])

            output = {
                "success": result.get("success", False),
                "topics": result.get("topics"),
                "is_refund_request": result.get("is_refund_request", False),
                "is_cancellation": result.get("is_cancellation", False),
                "is_membership": result.get("is_membership", False),
                "topic_confidence": result.get("topic_confidence"),
                "topic_notes": result.get("topic_notes", ""),
                "summary": result.get("summary", ""),
            }
            if "error" in result:
                output["topic_error"] = result["error"]

            # Preserve all original input fields
            for key, value in input_data.items():
                if key not in output:
                    output[key] = value

            return output

        except Exception as e:
            return self.handle_error(e, "topic_extraction")
process_helpscout/config_files/processing_config.json ADDED
@@ -0,0 +1,125 @@
{
  "_meta": {
    "description": "Configuration for the HelpScout conversation processing pipeline. Controls agent models, processing limits, and output destination.",
    "version": "1.0.0"
  },

  "agents": {
    "sentiment_analysis": {
      "model": "gpt-5-nano",
      "temperature": 0.2,
      "max_retries": 3
    },
    "topic_extraction": {
      "model": "gpt-5-nano",
      "temperature": 0.2,
      "max_retries": 3
    }
  },

  "sentiment_polarity": {
    "categories": [
      {
        "value": "very_positive",
        "label": "Very Positive",
        "description": "Extremely enthusiastic, excited, deeply grateful, or highly satisfied"
      },
      {
        "value": "positive",
        "label": "Positive",
        "description": "Generally positive, appreciative, supportive, or encouraging"
      },
      {
        "value": "neutral",
        "label": "Neutral",
        "description": "Factual, informational, balanced, or lacking clear emotional tone"
      },
      {
        "value": "negative",
        "label": "Negative",
        "description": "Disappointed, critical, frustrated, or mildly dissatisfied"
      },
      {
        "value": "very_negative",
        "label": "Very Negative",
        "description": "Highly critical, angry, abusive, or extremely dissatisfied"
      }
    ]
  },

  "emotions": {
    "soft_fail": true,
    "multi_label": true,
    "categories": [
      {
        "value": "joy",
        "label": "Joy",
        "description": "Happiness, delight, or elation"
      },
      {
        "value": "excitement",
        "label": "Excitement",
        "description": "Enthusiasm, energy, or eagerness"
      },
      {
        "value": "gratitude",
        "label": "Gratitude",
        "description": "Thankfulness or appreciation"
      },
      {
        "value": "admiration",
        "label": "Admiration",
        "description": "Deep respect or positive regard for the platform, team or products"
      },
      {
        "value": "curiosity",
        "label": "Curiosity",
        "description": "Interest, eagerness to learn, or wondering about something"
      },
      {
        "value": "frustration",
        "label": "Frustration",
        "description": "Irritation, annoyance, or blocked goals"
      },
      {
        "value": "disappointment",
        "label": "Disappointment",
        "description": "Unmet expectations or a let-down feeling"
      },
      {
        "value": "sadness",
        "label": "Sadness",
        "description": "Sorrow, emotional heaviness, or distress"
      },
      {
        "value": "anger",
        "label": "Anger",
        "description": "Strong outrage or hostility"
      },
      {
        "value": "humor",
        "label": "Humor",
        "description": "Amusement, playfulness, or levity in tone"
      },
      {
        "value": "neutral",
        "label": "Neutral",
        "description": "No discernible emotion; use only when no other emotion applies"
      }
    ]
  },

  "processing": {
    "max_conversation_chars": 5000,
    "min_batch_size": 10,
    "max_batch_size": 50
  },

  "output": {
    "database": "SOCIAL_MEDIA_DB",
    "schema": "ML_FEATURES",
    "table": "HELPSCOUT_CONVERSATION_FEATURES"
  },

  "sql_query_file": "queries/helpscout_conversations.sql"
}
process_helpscout/config_files/topics.json ADDED
@@ -0,0 +1,90 @@
1
+ {
2
+ "_meta": {
3
+ "version": "1.0.0",
4
+ "last_updated": "2025-04-09",
5
+ "description": "Musora HelpScout auto-tagging taxonomy. Used as the source configuration for the LLM-based tagging pipeline. Topics are mutually exclusive at the top level; a conversation may receive multiple topic tags. Sub-categories are listed for reference and future use in a separate config. Special boolean flags are defined inline for high-signal billing events.",
6
+ "usage_notes": [
7
+ "Assign one or more topic tags per conversation.",
8
+ "Boolean flags under billing_and_subscription should be extracted independently even when the parent topic is detected.",
9
+ "Use the 'uncategorized' topic when no other topic clearly applies β€” never as a fallback for uncertain cases.",
10
+ "feedback_and_suggestions should be used as a supplementary tag alongside a primary topic when applicable."
11
+ ]
12
+ },
13
+
14
+ "topics": [
15
+
16
+ {
17
+ "id": "video_and_playback",
18
+ "label": "Video & Playback",
19
+ "description": "The student is experiencing a problem with audio or video content during viewing. The issue is with how media plays, not with the surrounding app or UI. "
20
+ },
21
+
22
+ {
23
+ "id": "app_and_technical_errors",
24
+ "label": "App & Technical Errors",
25
+ "description": "A software bug, crash, or system failure that is NOT limited to video playback. The app, website, technology related, or a specific feature is broken, unresponsive, or showing an error message. Use this when the problem is with the platform itself rather than the content being watched."
26
+ },
27
+
28
+ {
29
+ "id": "navigation_and_ux",
30
+ "label": "Navigation & UX",
31
+ "description": "The student is confused by the interface or cannot find something, but is not technically blocked from accessing it. The issue is about discoverability, layout clarity, or unintuitive design rather than a bug or access restriction. Often triggered by redesigns or renamed features."
32
+ },
33
+ {
34
+ "id": "account_and_access",
35
+ "label": "Account & Access",
36
+ "description": "The student cannot log in, is locked out, or cannot access content they are entitled to. Also covers profile and settings issues. Distinct from billing: use this when the problem is authentication or permissions, even if the underlying cause might be a billing state."
37
+ },
38
+
39
+ {
40
+ "id": "billing_and_subscription",
41
+ "label": "Billing & Subscription",
42
+ "description": "Any conversation involving money, charges, plan status, or membership. This includes unexpected charges, plan changes, promotions, and invoice requests. ",
43
+ "flags": {
44
+ "is_refund_request": {
45
+ "type": "boolean",
46
+ "description": "True when the student is explicitly asking for their money back, regardless of reason."
47
+ },
48
+ "is_cancellation": {
49
+ "type": "boolean",
50
+ "description": "True when the student wants to cancel their subscription or membership, even if they haven't asked for a refund."
51
+ },
52
+ "is_membership": {
53
+ "type": "boolean",
54
+ "description": "True when the student wants to join/rejoin and purchase membership."
55
+ }
56
+ }
57
+ },
58
+
59
+ {
60
+ "id": "learning_and_progress",
61
+ "label": "Learning & Progress",
62
+ "description": "Issues with how the student's learning journey, including asking for help or recommendations, is tracked or structured over time. Covers broken progress tracking, practice session logging, playlist management, curriculum navigation, and access to legacy or assigned content. The problem is with the learning system, not the content itself."
63
+ },
64
+
65
+ {
66
+ "id": "content_and_resources",
67
+ "label": "Content & Resources",
68
+ "description": "Problems with the lesson content itself or supplementary learning materials β€” not the video player. Covers missing PDFs, sheet music, backing tracks, incorrect lesson information, requests for new content, and missing assignment or review links."
69
+ },
70
+
71
+ {
72
+ "id": "community_and_notifications",
73
+ "label": "Community & Notifications",
74
+ "description": "Issues involving forums, comments, student profiles, social features, or the delivery of notifications. Use this when the problem is about communication and social interaction within the platform, not content access or playback."
75
+ },
76
+
77
+ {
78
+ "id": "feedback_and_suggestions",
79
+ "label": "Feedback & Suggestions",
80
+ "description": "The student is sharing an opinion, making a feature request, or expressing general satisfaction or dissatisfaction β€” not reporting a specific failure. This should typically be applied as a supplementary tag alongside a primary topic when a complaint conversation also carries strong sentiment or a request for new functionality."
81
+ },
82
+
83
+ {
84
+ "id": "uncategorized",
85
+ "label": "Uncategorized",
86
+ "description": "Assign ONLY when no other topic clearly applies after careful consideration. Do not use as a fallback for low-confidence cases where a topic still partially fits β€” prefer the closest matching topic. The primary purpose of this tag is to surface new conversation patterns that may warrant expanding the taxonomy."
87
+ }
88
+
89
+ ]
90
+ }
process_helpscout/data_fetcher.py ADDED
@@ -0,0 +1,77 @@
1
+ """
2
+ Data Fetcher for the HelpScout processing pipeline.
3
+
4
+ Responsible for:
5
+ 1. Fetching raw customer threads from Snowflake (reusing fetch_and_export logic)
6
+ 2. Cleaning HTML and aggregating to conversation level
7
+ 3. Checking which conversations have already been processed (for deduplication)
8
+
9
+ Reuses fetch_raw(), process_threads(), and aggregate_conversations() from
10
+ fetch_and_export.py so the cleaning and aggregation logic stays in one place.
11
+ """
12
+
13
+ import logging
14
+ from pathlib import Path
15
+ from typing import Set
16
+
17
+ import pandas as pd
18
+
19
+ from snowflake_conn import SnowflakeConn
20
+ from fetch_and_export import fetch_raw, process_threads, aggregate_conversations
21
+
22
+ logger = logging.getLogger(__name__)
23
+
24
+
25
+ def fetch_conversations(conn: SnowflakeConn) -> pd.DataFrame:
26
+ """
27
+ Fetch, clean, and aggregate all customer conversations from HelpScout.
28
+
29
+ Returns one row per conversation_id with the following key columns:
30
+ - conversation_id
31
+ - combined_text (all customer messages joined with ' | ')
32
+ - customer_email, customer_first, customer_last, customer_hs_id
33
+ - thread_count, first_message_at, last_message_at, duration_hours
34
+ - status, state, source_type, source_via
35
+
36
+ Returns an empty DataFrame if no data is available.
37
+ """
38
+ raw_df = fetch_raw(conn)
39
+ if raw_df.empty:
40
+ logger.warning("No raw threads returned from Snowflake.")
41
+ return pd.DataFrame()
42
+
43
+ threads_df = process_threads(raw_df)
44
+ if threads_df.empty:
45
+ logger.warning("All threads were empty after HTML cleaning.")
46
+ return pd.DataFrame()
47
+
48
+ conversations_df = aggregate_conversations(threads_df)
49
+ logger.info(f"Ready to process: {len(conversations_df):,} conversations")
50
+ return conversations_df
51
+
52
+
53
+ def fetch_processed_ids(
54
+ conn: SnowflakeConn,
55
+ database: str,
56
+ schema: str,
57
+ table: str,
58
+ ) -> Set[str]:
59
+ """
60
+ Return the set of conversation_ids already stored in the output table.
61
+
62
+ Returns an empty set if the table does not exist yet (first run) or if
63
+ the query fails for any other reason β€” the pipeline will then process
64
+ all conversations.
65
+ """
66
+ try:
67
+ query = f"SELECT CONVERSATION_ID FROM {database}.{schema}.{table}"
68
+ df = conn.run_query(query, description="fetch_processed_ids")
69
+ ids = set(df["conversation_id"].dropna().astype(str).tolist())
70
+ logger.info(f"Found {len(ids):,} already-processed conversations in {table}")
71
+ return ids
72
+ except Exception as exc:
73
+ logger.warning(
74
+ f"Could not fetch processed IDs from {database}.{schema}.{table} "
75
+ f"(table may not exist yet): {exc}"
76
+ )
77
+ return set()
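+
+
+ # Usage sketch (illustrative; the output-table names come from processing_config.json):
+ # conn = SnowflakeConn()
+ # conversations = fetch_conversations(conn)
+ # done = fetch_processed_ids(conn, "SOCIAL_MEDIA_DB", "ML_FEATURES",
+ # "HELPSCOUT_CONVERSATION_FEATURES")
+ # new = conversations[~conversations["conversation_id"].astype(str).isin(done)]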
process_helpscout/fetch_and_export.py ADDED
@@ -0,0 +1,183 @@
1
+ """
2
+ HelpScout Data Fetcher & Exporter
3
+ ==================================
4
+ Fetches raw conversation data from Snowflake, cleans HTML bodies,
5
+ computes derived columns, and exports two CSV files:
6
+
7
+ output/helpscout_threads.csv β€” one row per message thread
8
+ output/helpscout_conversations.csv β€” one row per conversation (aggregated)
9
+
10
+ Run:
11
+ python fetch_and_export.py
12
+ """
13
+
14
+ import logging
15
+ import sys
16
+ from pathlib import Path
17
+
18
+ import pandas as pd
19
+ import numpy as np
20
+
21
+ # Local modules
22
+ from snowflake_conn import SnowflakeConn
23
+ from html_cleaner import clean_html_series
24
+
25
+ logging.basicConfig(
26
+ level=logging.INFO,
27
+ format="%(asctime)s [%(levelname)s] %(message)s",
28
+ handlers=[logging.StreamHandler(sys.stdout)],
29
+ )
30
+ logger = logging.getLogger(__name__)
31
+
32
+ # ---------------------------------------------------------------------------
33
+ # Paths
34
+ # ---------------------------------------------------------------------------
35
+ BASE_DIR = Path(__file__).resolve().parent
36
+ SQL_FILE = BASE_DIR / "queries" / "helpscout_conversations.sql"
37
+ OUTPUT_DIR = BASE_DIR / "output"
38
+ OUTPUT_DIR.mkdir(exist_ok=True)
39
+
40
+ THREADS_CSV = OUTPUT_DIR / "helpscout_threads.csv"
41
+ CONVERSATIONS_CSV = OUTPUT_DIR / "helpscout_conversations.csv"
42
+
43
+
44
+ # ---------------------------------------------------------------------------
45
+ # Fetch
46
+ # ---------------------------------------------------------------------------
47
+ def fetch_raw(conn: SnowflakeConn) -> pd.DataFrame:
48
+ logger.info("Fetching HelpScout threads from Snowflake…")
49
+ df = conn.run_query_from_file(SQL_FILE, description="helpscout_conversations")
50
+ logger.info(f"Fetched {len(df):,} raw thread rows.")
51
+ return df
52
+
53
+
54
+ # ---------------------------------------------------------------------------
55
+ # Clean & enrich threads
56
+ # ---------------------------------------------------------------------------
57
+ def process_threads(df: pd.DataFrame) -> pd.DataFrame:
58
+ logger.info("Cleaning HTML bodies…")
59
+ df = df.copy()
60
+
61
+ # Parse timestamps
62
+ for col in ("created_at", "opened_at"):
63
+ if col in df.columns:
64
+ df[col] = pd.to_datetime(df[col], utc=True, errors="coerce")
65
+
66
+ # Clean HTML β†’ plain text
67
+ df["body_clean"] = clean_html_series(df["body"])
68
+
69
+ # Drop rows where cleaning produced empty text
70
+ before = len(df)
71
+ df = df[df["body_clean"].str.strip().str.len() > 0].copy()
72
+ logger.info(f"Dropped {before - len(df):,} rows with empty body after cleaning.")
73
+
74
+ # Derived columns
75
+ df["word_count"] = df["body_clean"].str.split().str.len().fillna(0).astype(int)
76
+ df["char_count"] = df["body_clean"].str.len().fillna(0).astype(int)
77
+
78
+ # Date helpers
79
+ df["date"] = df["created_at"].dt.date
80
+ df["week"] = df["created_at"].dt.to_period("W").dt.start_time
81
+ df["month"] = df["created_at"].dt.to_period("M").dt.start_time
82
+ df["hour_of_day"] = df["created_at"].dt.hour
83
+ df["day_of_week"] = df["created_at"].dt.day_name()
84
+
85
+ # Normalise free-text columns
86
+ for col in ("source_type", "source_via", "status", "state", "type"):
87
+ if col in df.columns:
88
+ df[col] = df[col].fillna("unknown").str.lower().str.strip()
89
+
90
+ # Identify the display name for the sender
91
+ df["sender_name"] = (
92
+ (df.get("created_by_first", "").fillna("") + " " +
93
+ df.get("created_by_last", "").fillna("")).str.strip()
94
+ )
95
+ df["sender_name"] = df["sender_name"].replace("", "Unknown")
96
+
97
+ logger.info(f"Processed threads: {len(df):,} rows.")
98
+ return df
99
+
100
+
101
+ # ---------------------------------------------------------------------------
102
+ # Aggregate to conversation level
103
+ # ---------------------------------------------------------------------------
104
+ def aggregate_conversations(threads: pd.DataFrame) -> pd.DataFrame:
105
+ logger.info("Aggregating to conversation level…")
106
+
107
+ agg = (
108
+ threads.groupby("conversation_id")
109
+ .agg(
110
+ first_message_at=("created_at", "min"),
111
+ last_message_at=("created_at", "max"),
112
+ thread_count=("thread_id", "count"),
113
+ customer_email=("customer_email", "first"),
114
+ customer_first=("customer_first", "first"),
115
+ customer_last=("customer_last", "first"),
116
+ customer_hs_id=("customer_hs_id", "first"),
117
+ source_type=("source_type", "first"),
118
+ source_via=("source_via", "first"),
119
+ status=("status", "last"), # last known status
120
+ state=("state", "last"),
121
+ total_word_count=("word_count", "sum"),
122
+ avg_word_count=("word_count", "mean"),
123
+ combined_text=("body_clean", lambda x: " | ".join(x.dropna())),
124
+ )
125
+ .reset_index()
126
+ )
127
+
128
+ # Duration in hours from first to last message
129
+ agg["duration_hours"] = (
130
+ (agg["last_message_at"] - agg["first_message_at"])
131
+ .dt.total_seconds()
132
+ .div(3600)
133
+ .round(2)
134
+ )
135
+
136
+ agg["date"] = agg["first_message_at"].dt.date
137
+ agg["week"] = agg["first_message_at"].dt.to_period("W").dt.start_time
138
+ agg["month"] = agg["first_message_at"].dt.to_period("M").dt.start_time
139
+
140
+ logger.info(f"Aggregated {len(agg):,} unique conversations.")
141
+ return agg
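+ # Worked example (illustrative): three threads in one conversation ("Hi",
+ # "Still broken", "Thanks!") collapse to a single row with thread_count=3
+ # and combined_text "Hi | Still broken | Thanks!".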
142
+
143
+
144
+ # ---------------------------------------------------------------------------
145
+ # Export
146
+ # ---------------------------------------------------------------------------
147
+ def export(threads: pd.DataFrame, conversations: pd.DataFrame) -> None:
148
+ # Drop raw HTML before saving (keeps CSV manageable)
149
+ threads_export = threads.drop(columns=["body"], errors="ignore")
150
+
151
+ threads_export.to_csv(THREADS_CSV, index=False, encoding="utf-8-sig")
152
+ logger.info(f"Exported threads β†’ {THREADS_CSV}")
153
+
154
+ conversations.to_csv(CONVERSATIONS_CSV, index=False, encoding="utf-8-sig")
155
+ logger.info(f"Exported conversations β†’ {CONVERSATIONS_CSV}")
156
+
157
+
158
+ # ---------------------------------------------------------------------------
159
+ # Main
160
+ # ---------------------------------------------------------------------------
161
+ def main():
162
+ conn = SnowflakeConn()
163
+ try:
164
+ raw_df = fetch_raw(conn)
165
+ if raw_df.empty:
166
+ logger.warning("No data returned. Check date range and table access.")
167
+ return
168
+
169
+ threads_df = process_threads(raw_df)
170
+ conversations_df = aggregate_conversations(threads_df)
171
+ export(threads_df, conversations_df)
172
+
173
+ logger.info("Done.")
174
+ logger.info(f" Threads: {len(threads_df):,}")
175
+ logger.info(f" Conversations: {len(conversations_df):,}")
176
+ logger.info(f" Unique customers: {conversations_df['customer_email'].nunique():,}")
177
+
178
+ finally:
179
+ conn.close()
180
+
181
+
182
+ if __name__ == "__main__":
183
+ main()
process_helpscout/html_cleaner.py ADDED
@@ -0,0 +1,169 @@
1
+ """
2
+ HTML Cleaner for HelpScout message bodies.
3
+
4
+ Strategy:
5
+ 1. Remove blockquotes (quoted previous email threads).
6
+ 2. Remove Gmail/Outlook quoted-reply wrappers (ex-gmail_extra, gmail_quote, etc.).
7
+ 3. Remove HelpScout / marketing email boilerplate sections.
8
+ 4. Extract plain text from the remaining DOM.
9
+ 5. Strip invisible Unicode spacers (\\u200c, \\u00ad, etc.) and collapse whitespace.
10
+ """
11
+
12
+ import re
13
+ import unicodedata
14
+ from bs4 import BeautifulSoup, Comment
15
+
16
+ # CSS class / id fragments that indicate quoted / boilerplate content
17
+ _QUOTED_CLASS_PATTERNS = [
18
+ "gmail_extra",
19
+ "gmail_quote",
20
+ "ex-gmail",
21
+ "yahoo_quoted",
22
+ "moz-cite-prefix",
23
+ "OutlookMessageHeader",
24
+ "protonmail_quote",
25
+ "apple-mail-previous",
26
+ ]
27
+
28
+ # Markers that indicate the start of a quoted section (text-based heuristics)
29
+ _QUOTE_TEXT_MARKERS = [
30
+ r"On .{5,80} wrote:", # "On Mar 2, 2026 ... wrote:"
31
+ r"From:\s",
32
+ r"Sent:\s",
33
+ r"To:\s.*\nCc:",
34
+ r">{1,}", # > quoted lines (plain text fallback)
35
+ ]
36
+
37
+ _COMPILED_QUOTE_MARKERS = [re.compile(p, re.IGNORECASE) for p in _QUOTE_TEXT_MARKERS]
38
+
39
+ # Tags whose entire sub-tree we drop unconditionally
40
+ _DROP_TAGS = {"script", "style", "head", "meta", "link", "img", "table"}
41
+
42
+ # Invisible / spacer Unicode characters
43
+ _INVISIBLE_CHARS = re.compile(
44
+ r"[\u00ad\u200b\u200c\u200d\u2060\ufeff\u00a0\u034f]"
45
+ )
46
+
47
+ # Collapse multiple blank lines to one
48
+ _MULTI_BLANK = re.compile(r"\n{3,}")
49
+
50
+
51
+ def _remove_quoted_sections(soup: BeautifulSoup) -> None:
52
+ """Remove DOM nodes that represent quoted/threaded email history."""
53
+
54
+ # 1. All <blockquote> tags
55
+ for tag in soup.find_all("blockquote"):
56
+ tag.decompose()
57
+
58
+ # 2. Divs / spans with known quoted-reply class names
59
+ # Collect candidates first; decompose() invalidates attrs on child nodes
60
+ # that may still appear later in the iteration, so we guard with a check.
61
+ candidates = soup.find_all(True)
62
+ for tag in candidates:
63
+ if tag.attrs is None:
64
+ # Already decomposed (child of a previously decomposed parent)
65
+ continue
66
+ css_classes = " ".join(tag.get("class") or []).lower()
67
+ tag_id = (tag.get("id") or "").lower()
68
+ combined = css_classes + " " + tag_id
69
+ if any(pattern in combined for pattern in _QUOTED_CLASS_PATTERNS):
70
+ tag.decompose()
71
+
72
+ # 3. HTML comments (<!-- --> contain no user text)
73
+ for comment in soup.find_all(string=lambda t: isinstance(t, Comment)):
74
+ comment.extract()
75
+
76
+
77
+ def _remove_boilerplate(soup: BeautifulSoup) -> None:
78
+ """Remove marketing / footer / unsubscribe sections."""
79
+
80
+ # Drop heavy layout tags entirely (tables, images carry no message text)
81
+ for tag in soup.find_all(_DROP_TAGS):
82
+ tag.decompose()
83
+
84
+ # Drop any element whose text is purely an unsubscribe / footer line
85
+ footer_keywords = ["unsubscribe", "musora media", "31265 wheel", "customeriomail"]
86
+ for tag in soup.find_all(True):
87
+ if tag.attrs is None:
88
+ continue
89
+ text = tag.get_text(separator=" ", strip=True).lower()
90
+ if any(kw in text for kw in footer_keywords) and len(text) < 300:
91
+ tag.decompose()
92
+
93
+
94
+ def _extract_text(soup: BeautifulSoup) -> str:
95
+ """Get plain text from the cleaned soup, preserving line breaks."""
96
+ lines = []
97
+ for element in soup.recursiveChildGenerator():
98
+ if isinstance(element, str):
99
+ stripped = element.strip()
100
+ if stripped:
101
+ lines.append(stripped)
102
+ elif hasattr(element, "name") and element.name in {"br", "p", "div", "li", "h1", "h2", "h3"}:
103
+ lines.append("\n")
104
+ return " ".join(lines)
105
+
106
+
107
+ def _clean_text(raw: str) -> str:
108
+ """Final text cleanup: invisible chars, excessive whitespace, quote markers."""
109
+
110
+ # Remove invisible spacers
111
+ text = _INVISIBLE_CHARS.sub("", raw)
112
+
113
+ # Normalize unicode (e.g. soft-hyphen variants)
114
+ text = unicodedata.normalize("NFKC", text)
115
+
116
+ # Collapse whitespace sequences (keep single newlines intentional)
117
+ text = re.sub(r"[ \t]+", " ", text)
118
+ text = re.sub(r" \n", "\n", text)
119
+ text = re.sub(r"\n ", "\n", text)
120
+ text = _MULTI_BLANK.sub("\n\n", text)
121
+
122
+ # Remove lines that are purely quote markers ("> some text")
123
+ lines = text.split("\n")
124
+ lines = [ln for ln in lines if not ln.strip().startswith(">")]
125
+ text = "\n".join(lines)
126
+
127
+ # Cut off at first "On <date> wrote:" marker (inline quoted replies)
128
+ for pattern in _COMPILED_QUOTE_MARKERS:
129
+ match = pattern.search(text)
130
+ if match and match.start() > 20: # don't cut if marker is at very start
131
+ text = text[: match.start()].strip()
132
+ break
133
+
134
+ return text.strip()
135
+
136
+
137
+ def clean_html(html_body: str) -> str:
138
+ """
139
+ Full pipeline: HTML β†’ clean plain text containing only the customer's message.
140
+
141
+ Args:
142
+ html_body: Raw HTML string from CONVERSATION_THREADS.BODY
143
+
144
+ Returns:
145
+ Clean UTF-8 plain text string.
146
+ """
147
+ if not html_body or not html_body.strip():
148
+ return ""
149
+
150
+ soup = BeautifulSoup(html_body, "html.parser")
151
+
152
+ _remove_quoted_sections(soup)
153
+ _remove_boilerplate(soup)
154
+
155
+ raw_text = _extract_text(soup)
156
+ return _clean_text(raw_text)
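+ # Illustrative example (hypothetical input, not real customer data):
+ # clean_html('<div>Thanks!<blockquote>On Jan 2, 2025 support wrote: ...</blockquote></div>')
+ # returns 'Thanks!' (the quoted reply history is dropped, keeping only the new message)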
157
+
158
+
159
+ def clean_html_series(series):
160
+ """
161
+ Vectorized version for a pandas Series.
162
+
163
+ Args:
164
+ series: pd.Series of HTML strings
165
+
166
+ Returns:
167
+ pd.Series of cleaned plain text strings
168
+ """
169
+ return series.fillna("").apply(clean_html)
process_helpscout/main.py ADDED
@@ -0,0 +1,423 @@
1
+ """
2
+ Main execution script for the HelpScout conversation processing pipeline.
3
+
4
+ Steps:
5
+ 1. Fetch all customer conversations from Snowflake (HTML cleaned + aggregated)
6
+ 2. Filter out conversations already in the output table
7
+ 3. Run sentiment analysis + topic extraction in parallel batches
8
+ 4. Append results to SOCIAL_MEDIA_DB.ML_FEATURES.HELPSCOUT_CONVERSATION_FEATURES
9
+
10
+ Run:
11
+ python main.py # process all new conversations, parallel
12
+ python main.py --limit 100 # process at most 100 conversations
13
+ python main.py --sequential # single-process mode (useful for debugging)
14
+ python main.py --config <path> # use a custom config file
15
+ """
16
+
17
+ import json
18
+ import logging
19
+ import os
20
+ import sys
21
+ import argparse
22
+ import traceback
23
+ from datetime import datetime
24
+ from multiprocessing import Pool, cpu_count
25
+ from pathlib import Path
26
+ from typing import Any, Dict, List
27
+
28
+ import pandas as pd
29
+ from dotenv import load_dotenv
30
+
31
+ # ---------------------------------------------------------------------------
32
+ # Path setup β€” allows imports from the process_helpscout package directory
33
+ # ---------------------------------------------------------------------------
34
+ SCRIPT_DIR = Path(__file__).resolve().parent
35
+ ROOT_DIR = SCRIPT_DIR.parent
36
+
37
+ load_dotenv(ROOT_DIR / ".env")
38
+ sys.path.insert(0, str(SCRIPT_DIR))
39
+
40
+ # ---------------------------------------------------------------------------
41
+ # Logging β€” file + console; log directory is created on first run
42
+ # ---------------------------------------------------------------------------
43
+ _logs_dir = SCRIPT_DIR / "logs"
44
+ _logs_dir.mkdir(exist_ok=True)
45
+
46
+ logging.basicConfig(
47
+ level=logging.INFO,
48
+ format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
49
+ handlers=[
50
+ logging.FileHandler(
51
+ _logs_dir / f"helpscout_processing_{datetime.now().strftime('%Y%m%d_%H%M%S')}.log"
52
+ ),
53
+ logging.StreamHandler(),
54
+ ],
55
+ )
56
+ logger = logging.getLogger(__name__)
57
+
58
+ # ---------------------------------------------------------------------------
59
+ # Local imports (after sys.path is set)
60
+ # ---------------------------------------------------------------------------
61
+ from snowflake_conn import SnowflakeConn
62
+ from data_fetcher import fetch_conversations, fetch_processed_ids
63
+ from workflow.conversation_processor import ConversationProcessingWorkflow
64
+
65
+
66
+ # ---------------------------------------------------------------------------
67
+ # Batch size helper
68
+ # ---------------------------------------------------------------------------
69
+
70
+ def calculate_optimal_batch_size(
71
+ total: int,
72
+ num_workers: int,
73
+ min_batch: int = 10,
74
+ max_batch: int = 50,
75
+ ) -> int:
76
+ """
77
+ Distribute work evenly across workers within the configured min/max bounds.
78
+
79
+ Args:
80
+ total: Total number of conversations to process
81
+ num_workers: Number of parallel worker processes
82
+ min_batch: Minimum conversations per batch
83
+ max_batch: Maximum conversations per batch
84
+
85
+ Returns:
86
+ Optimal batch size
87
+ """
88
+ if total <= min_batch:
89
+ return total
90
+ batch_size = total // num_workers
91
+ return max(min_batch, min(max_batch, batch_size))
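+ # e.g. total=200, num_workers=5: 200 // 5 = 40, within [10, 50], so 5 batches of 40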
92
+
93
+
94
+ # ---------------------------------------------------------------------------
95
+ # Batch worker β€” runs in a separate process (must be module-level for pickle)
96
+ # ---------------------------------------------------------------------------
97
+
98
+ def process_batch_worker(batch_data: tuple) -> dict:
99
+ """
100
+ Worker function executed in a separate process for one batch of conversations.
101
+
102
+ Each worker creates its own Snowflake connection and workflow instance so
103
+ resources are not shared across processes.
104
+
105
+ Args:
106
+ batch_data: (batch_num, conversations, config, api_key)
107
+
108
+ Returns:
109
+ Statistics dict for this batch.
110
+ """
111
+ batch_num, batch_conversations, config, api_key = batch_data
112
+ worker_logger = logging.getLogger(f"Worker-{batch_num}")
113
+
114
+ try:
115
+ worker_logger.info(f"Batch {batch_num}: Processing {len(batch_conversations)} conversations")
116
+
117
+ # Worker-local Snowflake connection and workflow
118
+ conn = SnowflakeConn()
119
+ workflow = ConversationProcessingWorkflow(config, api_key)
120
+
121
+ # Run the workflow
122
+ results = workflow.process_batch(batch_conversations)
123
+ results_df = pd.DataFrame(results)
124
+
125
+ # Separate successful results
126
+ initial_count = len(results_df)
127
+ df_ok = results_df[results_df["success"] == True].copy()
128
+ failed_count = initial_count - len(df_ok)
129
+
130
+ worker_logger.info(
131
+ f"Batch {batch_num}: {len(df_ok)} successful, {failed_count} failed"
132
+ )
133
+
134
+ # ----------------------------------------------------------------
135
+ # Build output DataFrame with Snowflake column names
136
+ # ----------------------------------------------------------------
137
+ column_map = {
138
+ "conversation_id": "CONVERSATION_ID",
139
+ "customer_email": "CUSTOMER_EMAIL",
140
+ "customer_first": "CUSTOMER_FIRST",
141
+ "customer_last": "CUSTOMER_LAST",
142
+ "customer_hs_id": "CUSTOMER_HS_ID",
143
+ "thread_count": "THREAD_COUNT",
144
+ "first_message_at": "FIRST_MESSAGE_AT",
145
+ "last_message_at": "LAST_MESSAGE_AT",
146
+ "duration_hours": "DURATION_HOURS",
147
+ "status": "STATUS",
148
+ "state": "STATE",
149
+ "source_type": "SOURCE_TYPE",
150
+ "source_via": "SOURCE_VIA",
151
+ "combined_text": "COMBINED_TEXT",
152
+ "conversation_text": "CONVERSATION_TEXT_USED",
153
+ "sentiment_polarity": "SENTIMENT_POLARITY",
154
+ "emotions": "EMOTIONS",
155
+ "sentiment_confidence": "SENTIMENT_CONFIDENCE",
156
+ "sentiment_notes": "SENTIMENT_NOTES",
157
+ "topics": "TOPICS",
158
+ "is_refund_request": "IS_REFUND_REQUEST",
159
+ "is_cancellation": "IS_CANCELLATION",
160
+ "is_membership": "IS_MEMBERSHIP",
161
+ "topic_confidence": "TOPIC_CONFIDENCE",
162
+ "topic_notes": "TOPIC_NOTES",
163
+ "summary": "SUMMARY",
164
+ "processing_errors": "PROCESSING_ERRORS",
165
+ }
166
+
167
+ output_df = pd.DataFrame()
168
+ for src_col, tgt_col in column_map.items():
169
+ output_df[tgt_col] = df_ok[src_col] if src_col in df_ok.columns else None
170
+
171
+ # Flatten processing_errors list to a semicolon-separated string
172
+ if "PROCESSING_ERRORS" in output_df.columns:
173
+ output_df["PROCESSING_ERRORS"] = output_df["PROCESSING_ERRORS"].apply(
174
+ lambda x: "; ".join(x) if isinstance(x, list) else (str(x) if x else None)
175
+ )
176
+
177
+ # Pipeline metadata
178
+ output_df["PROCESSED_AT"] = datetime.now()
179
+ output_df["WORKFLOW_VERSION"] = "1.0"
180
+
181
+ # ----------------------------------------------------------------
182
+ # Store to Snowflake
183
+ # ----------------------------------------------------------------
184
+ out_cfg = config["output"]
185
+ if not output_df.empty:
186
+ conn.store_df_to_snowflake(
187
+ table_name=out_cfg["table"],
188
+ dataframe=output_df,
189
+ database=out_cfg["database"],
190
+ schema=out_cfg["schema"],
191
+ overwrite=False, # Always append; deduplication is handled upstream
192
+ )
193
+
194
+ conn.close()
195
+
196
+ return {
197
+ "batch_num": batch_num,
198
+ "success": True,
199
+ "total_processed": initial_count,
200
+ "total_stored": len(output_df),
201
+ "failed_count": failed_count,
202
+ "error": None,
203
+ }
204
+
205
+ except Exception as exc:
206
+ error_msg = f"Batch {batch_num} failed: {exc}"
207
+ worker_logger.error(error_msg)
208
+ worker_logger.error(traceback.format_exc())
209
+ return {
210
+ "batch_num": batch_num,
211
+ "success": False,
212
+ "total_processed": len(batch_conversations),
213
+ "total_stored": 0,
214
+ "failed_count": len(batch_conversations),
215
+ "error": str(exc),
216
+ }
217
+
218
+
219
+ # ---------------------------------------------------------------------------
220
+ # Main processor class
221
+ # ---------------------------------------------------------------------------
222
+
223
+ class HelpScoutProcessor:
224
+ """
225
+ Orchestrates the end-to-end HelpScout conversation processing pipeline.
226
+
227
+ Typical usage:
228
+ processor = HelpScoutProcessor()
229
+ processor.run(limit=500)
230
+ """
231
+
232
+ def __init__(self, config_path: str = None):
233
+ """
234
+ Args:
235
+ config_path: Path to processing_config.json.
236
+ Defaults to config_files/processing_config.json
237
+ relative to this script.
238
+ """
239
+ if config_path is None:
240
+ config_path = SCRIPT_DIR / "config_files" / "processing_config.json"
241
+
242
+ with open(config_path, "r") as f:
243
+ self.config = json.load(f)
244
+
245
+ self.conn = SnowflakeConn()
246
+
247
+ self.api_key = os.getenv("OPENAI_API_KEY")
248
+ if not self.api_key:
249
+ raise ValueError("OPENAI_API_KEY not found in environment variables")
250
+
251
+ logger.info("HelpScoutProcessor initialized")
252
+
253
+ def _calculate_num_workers(self) -> int:
254
+ """CPU count minus 2, capped at 5 β€” mirrors the processing_comments pattern."""
255
+ num_cpus = cpu_count()
256
+ num_workers = max(1, min(5, num_cpus - 2))
257
+ logger.info(f"Using {num_workers} parallel workers (CPU count: {num_cpus})")
258
+ return num_workers
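+ # e.g. 8 CPUs: min(5, 8 - 2) = 5 workers; 4 CPUs: 2 workers; 2 CPUs: max(1, 0) = 1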
259
+
260
+ def run(self, limit: int = None, sequential: bool = False):
261
+ """
262
+ Execute the full pipeline.
263
+
264
+ Args:
265
+ limit: Cap the number of conversations processed in this run.
266
+ Useful for incremental or test runs. Default: process all new.
267
+ sequential: If True, bypass multiprocessing (single-process debug mode).
268
+ """
269
+ try:
270
+ logger.info("=" * 70)
271
+ logger.info("HelpScout Conversation Processing Pipeline")
272
+ logger.info(f"Mode: {'SEQUENTIAL (debug)' if sequential else 'PARALLEL'}")
273
+ logger.info("=" * 70)
274
+
275
+ # ------------------------------------------------------------------
276
+ # Step 1: Fetch + preprocess conversations
277
+ # ------------------------------------------------------------------
278
+ logger.info("Step 1: Fetching conversations from Snowflake...")
279
+ conversations_df = fetch_conversations(self.conn)
280
+
281
+ if conversations_df.empty:
282
+ logger.warning("No conversations returned. Exiting.")
283
+ return
284
+
285
+ logger.info(f"Fetched {len(conversations_df):,} total conversations")
286
+
287
+ # ------------------------------------------------------------------
288
+ # Step 2: Skip already-processed conversations
289
+ # ------------------------------------------------------------------
290
+ out_cfg = self.config["output"]
291
+ processed_ids = fetch_processed_ids(
292
+ self.conn,
293
+ out_cfg["database"],
294
+ out_cfg["schema"],
295
+ out_cfg["table"],
296
+ )
297
+
298
+ if processed_ids:
299
+ before = len(conversations_df)
300
+ conversations_df = conversations_df[
301
+ ~conversations_df["conversation_id"].astype(str).isin(processed_ids)
302
+ ].copy()
303
+ skipped = before - len(conversations_df)
304
+ logger.info(f"Skipped {skipped:,} already-processed conversations")
305
+
306
+ if conversations_df.empty:
307
+ logger.info("All conversations are already processed. Nothing to do.")
308
+ return
309
+
310
+ # ------------------------------------------------------------------
311
+ # Step 3: Apply optional limit
312
+ # ------------------------------------------------------------------
313
+ if limit:
314
+ conversations_df = conversations_df.head(limit)
315
+ logger.info(f"Limit applied: processing {len(conversations_df):,} conversations")
316
+
317
+ total = len(conversations_df)
318
+ logger.info(f"Processing {total:,} new conversations...")
319
+
320
+ # ------------------------------------------------------------------
321
+ # Step 4: Split into batches
322
+ # ------------------------------------------------------------------
323
+ num_workers = self._calculate_num_workers()
324
+ proc_cfg = self.config.get("processing", {})
325
+ batch_size = calculate_optimal_batch_size(
326
+ total,
327
+ num_workers,
328
+ min_batch=proc_cfg.get("min_batch_size", 10),
329
+ max_batch=proc_cfg.get("max_batch_size", 50),
330
+ )
331
+
332
+ conversations = conversations_df.to_dict("records")
333
+ batches = []
334
+ for i in range(0, total, batch_size):
335
+ batch = conversations[i : i + batch_size]
336
+ batch_num = (i // batch_size) + 1
337
+ batches.append((batch_num, batch, self.config, self.api_key))
338
+
339
+ logger.info(
340
+ f"Split into {len(batches)} batch(es) "
341
+ f"(batch size: {batch_size}, workers: {num_workers})"
342
+ )
343
+
344
+ # ------------------------------------------------------------------
345
+ # Step 5: Run batches
346
+ # ------------------------------------------------------------------
347
+ start_time = datetime.now()
348
+
349
+ if sequential:
350
+ results = [process_batch_worker(b) for b in batches]
351
+ else:
352
+ with Pool(processes=num_workers) as pool:
353
+ results = pool.map(process_batch_worker, batches)
354
+
355
+ elapsed = (datetime.now() - start_time).total_seconds()
356
+
357
+ # ------------------------------------------------------------------
358
+ # Step 6: Summary
359
+ # ------------------------------------------------------------------
360
+ total_processed = sum(r["total_processed"] for r in results)
361
+ total_stored = sum(r["total_stored"] for r in results)
362
+ total_failed = sum(r["failed_count"] for r in results)
363
+ failed_batches = [r for r in results if not r["success"]]
364
+
365
+ logger.info("=" * 70)
366
+ logger.info("Pipeline Summary")
367
+ logger.info(f" Output table : {out_cfg['database']}.{out_cfg['schema']}.{out_cfg['table']}")
368
+ logger.info(f" Processed : {total_processed:,}")
369
+ logger.info(f" Stored : {total_stored:,}")
370
+ logger.info(f" Failed : {total_failed:,}")
371
+ if failed_batches:
372
+ logger.error(f" Failed batches ({len(failed_batches)}):")
373
+ for fb in failed_batches:
374
+ logger.error(f" Batch {fb['batch_num']}: {fb['error']}")
375
+ logger.info(f" Elapsed : {elapsed:.1f}s")
376
+ logger.info(
377
+ f" Avg per conv : {elapsed / max(total_processed, 1):.2f}s"
378
+ )
379
+ logger.info("=" * 70)
380
+
381
+ except Exception as exc:
382
+ logger.error(f"Pipeline failed: {exc}", exc_info=True)
383
+ raise
384
+
385
+ finally:
386
+ self.conn.close()
387
+ logger.info("Snowflake connection closed")
388
+
389
+
390
+ # ---------------------------------------------------------------------------
391
+ # CLI entry point
392
+ # ---------------------------------------------------------------------------
393
+
394
+ def main():
395
+ parser = argparse.ArgumentParser(
396
+ description="Process HelpScout conversations: sentiment analysis + topic extraction"
397
+ )
398
+ parser.add_argument(
399
+ "--limit",
400
+ type=int,
401
+ default=None,
402
+ help="Maximum number of new conversations to process in this run (default: all)",
403
+ )
404
+ parser.add_argument(
405
+ "--sequential",
406
+ action="store_true",
407
+ default=False,
408
+ help="Single-process mode β€” useful for debugging (default: parallel)",
409
+ )
410
+ parser.add_argument(
411
+ "--config",
412
+ type=str,
413
+ default=None,
414
+ help="Path to processing_config.json (default: config_files/processing_config.json)",
415
+ )
416
+ args = parser.parse_args()
417
+
418
+ processor = HelpScoutProcessor(config_path=args.config)
419
+ processor.run(limit=args.limit, sequential=args.sequential)
420
+
421
+
422
+ if __name__ == "__main__":
423
+ main()
process_helpscout/snowflake_conn.py ADDED
@@ -0,0 +1,106 @@
1
+ """
2
+ Snowflake connection layer for the HelpScout processing module.
3
+ Adapted from processing_comments/SnowFlakeConnection.py.
4
+ """
5
+ import os
6
+ from pathlib import Path
7
+ from snowflake.snowpark import Session
8
+ from dotenv import load_dotenv
9
+ import logging
10
+
11
+ logger = logging.getLogger(__name__)
12
+
13
+ # Load .env from the project root (two levels up from this file)
14
+ _root_env = Path(__file__).resolve().parent.parent / ".env"
15
+ load_dotenv(dotenv_path=_root_env)
16
+
17
+
18
+ class SnowflakeConn:
19
+ """Thin wrapper around Snowpark Session for running read queries."""
20
+
21
+ def __init__(self):
22
+ self.session = self._connect()
23
+
24
+ # ------------------------------------------------------------------
25
+ def _connect(self) -> Session:
26
+ conn_params = dict(
27
+ user=os.getenv("SNOWFLAKE_USER"),
28
+ password=os.getenv("SNOWFLAKE_PASSWORD"),
29
+ account=os.getenv("SNOWFLAKE_ACCOUNT"),
30
+ role=os.getenv("SNOWFLAKE_ROLE"),
31
+ warehouse=os.getenv("SNOWFLAKE_WAREHOUSE"),
32
+ # No default database/schema β€” queries use fully-qualified names
33
+ )
34
+ session = Session.builder.configs(conn_params).create()
35
+ logger.info("Snowflake session created successfully.")
36
+ return session
37
+
38
+ # ------------------------------------------------------------------
39
+ def run_query(self, query: str, description: str = "query"):
40
+ """Execute a SELECT query and return a pandas DataFrame."""
41
+ try:
42
+ df = self.session.sql(query).to_pandas()
43
+ df.columns = df.columns.str.lower()
44
+ logger.info(f"Query '{description}' returned {len(df):,} rows.")
45
+ return df
46
+ except Exception as exc:
47
+ logger.error(f"Error executing '{description}': {exc}")
48
+ raise
49
+
50
+ # ------------------------------------------------------------------
51
+ def run_query_from_file(self, sql_path: str, description: str = ""):
52
+ """Read a .sql file and execute it, returning a pandas DataFrame."""
53
+ sql_path = Path(sql_path)
54
+ if not sql_path.exists():
55
+ raise FileNotFoundError(f"SQL file not found: {sql_path}")
56
+ query = sql_path.read_text(encoding="utf-8")
57
+ return self.run_query(query, description or sql_path.name)
58
+
59
+ # ------------------------------------------------------------------
60
+ def store_df_to_snowflake(
61
+ self,
62
+ table_name: str,
63
+ dataframe,
64
+ database: str,
65
+ schema: str,
66
+ overwrite: bool = False,
67
+ ):
68
+ """
69
+ Write a pandas DataFrame to a Snowflake table.
70
+
71
+ Args:
72
+ table_name: Target table name (without database/schema prefix)
73
+ dataframe: pandas DataFrame to write
74
+ database: Target Snowflake database
75
+ schema: Target Snowflake schema
76
+ overwrite: If True, truncate the table before inserting;
77
+ if False (default), append rows
78
+ """
79
+ if dataframe is None or len(dataframe) == 0:
80
+ logger.warning(f"store_df_to_snowflake: empty DataFrame, skipping write to {table_name}")
81
+ return
82
+
83
+ mode = "overwrite" if overwrite else "append"
84
+ try:
85
+ self.session.write_pandas(
86
+ df=dataframe,
87
+ table_name=table_name,
88
+ database=database,
89
+ schema=schema,
90
+ overwrite=overwrite,
91
+ auto_create_table=False, # Table must be created via SQL first
92
+ quote_identifiers=False,
93
+ use_logical_type=True,
94
+ )
95
+ logger.info(
96
+ f"Stored {len(dataframe):,} rows to {database}.{schema}.{table_name} "
97
+ f"(mode={mode})"
98
+ )
99
+ except Exception as exc:
100
+ logger.error(f"Error storing to {database}.{schema}.{table_name}: {exc}")
101
+ raise
102
+
103
+ # ------------------------------------------------------------------
104
+ def close(self):
105
+ self.session.close()
106
+ logger.info("Snowflake session closed.")
process_helpscout/workflow/__init__.py ADDED
File without changes
process_helpscout/workflow/conversation_processor.py ADDED
@@ -0,0 +1,334 @@
1
+ """
2
+ Conversation Processing Workflow using LangGraph.
3
+
4
+ Two-node linear graph:
5
+ sentiment_analysis β†’ topic_extraction β†’ END
6
+
7
+ All conversations are assumed to be in English (no translation step).
8
+ The workflow operates on the full customer conversation text, pre-formatted
9
+ and truncated upstream before entering the graph.
10
+ """
11
+
12
+ from typing import Dict, Any, List, TypedDict, Annotated
13
+ import operator
14
+ import json
15
+ import os
16
+ from pathlib import Path
17
+ from langgraph.graph import StateGraph, END
18
+ from agents.sentiment_analysis_agent import SentimentAnalysisAgent
19
+ from agents.topic_extraction_agent import TopicExtractionAgent
20
+ import logging
21
+
22
+ logger = logging.getLogger(__name__)
23
+
24
+ # Maximum characters to send to the LLM β€” balances context richness vs. cost
25
+ _MAX_CONVERSATION_CHARS = 5000
26
+
27
+
28
+ class ConversationState(TypedDict):
29
+ """
30
+ State flowing through the conversation processing workflow.
31
+
32
+ Source fields come from the aggregated conversations DataFrame.
33
+ Processing fields are added/updated by each workflow node.
34
+ """
35
+ # --- Source / aggregation fields ---
36
+ conversation_id: str
37
+ customer_email: str
38
+ customer_first: str
39
+ customer_last: str
40
+ customer_hs_id: Any
41
+ thread_count: int
42
+ first_message_at: Any
43
+ last_message_at: Any
44
+ duration_hours: float
45
+ status: str
46
+ state: str
47
+ source_type: str
48
+ source_via: str
49
+ combined_text: str # Raw aggregated customer messages (pipe-separated)
50
+
51
+ # --- Pipeline input ---
52
+ conversation_text: str # Formatted + truncated text sent to agents
53
+
54
+ # --- Sentiment analysis outputs ---
55
+ sentiment_polarity: str
56
+ emotions: str # Comma-separated emotion values, or None
57
+ sentiment_confidence: str
58
+ sentiment_notes: str
59
+
60
+ # --- Topic extraction outputs ---
61
+ topics: str # Comma-separated topic IDs
62
+ is_refund_request: bool
63
+ is_cancellation: bool
64
+ is_membership: bool
65
+ topic_confidence: str
66
+ topic_notes: str
67
+ summary: str # 2-3 sentence neutral conversation summary
68
+
69
+ # --- Metadata ---
70
+ processing_errors: List[str] # accumulated manually by the nodes; an operator.add reducer would double-count entries because each node returns the full state
71
+ success: bool
72
+
73
+
74
+ class ConversationProcessingWorkflow:
75
+ """
76
+ LangGraph-based workflow for processing HelpScout conversations.
77
+
78
+ Graph structure:
79
+ [START] β†’ sentiment_analysis β†’ topic_extraction β†’ [END]
80
+
81
+ Both nodes receive the same conversation_text. The workflow is
82
+ intentionally linear β€” no conditional edges β€” because every
83
+ conversation goes through both steps.
84
+ """
85
+
86
+ def __init__(self, config: Dict[str, Any], api_key: str):
87
+ """
88
+ Args:
89
+ config: Full processing_config.json content
90
+ api_key: OpenAI API key
91
+ """
92
+ self.config = config
93
+ self.api_key = api_key
94
+
95
+ # Agent-level configs
96
+ sentiment_agent_config = config["agents"]["sentiment_analysis"]
97
+ topic_agent_config = config["agents"]["topic_extraction"]
98
+
99
+ # Load topics.json (resolved relative to the package directory, one level above workflow/)
100
+ workflow_dir = Path(__file__).resolve().parent
101
+ module_dir = workflow_dir.parent
102
+ topics_path = module_dir / "config_files" / "topics.json"
103
+
104
+ with open(topics_path, "r") as f:
105
+ topics_config = json.load(f)
106
+
107
+ # Override max chars from config if provided
108
+ proc = config.get("processing", {})
109
+ self._max_chars = proc.get("max_conversation_chars", _MAX_CONVERSATION_CHARS)
110
+
111
+ # Initialize agents
112
+ self.sentiment_agent = SentimentAnalysisAgent(sentiment_agent_config, api_key, config)
113
+ self.topic_agent = TopicExtractionAgent(topic_agent_config, api_key, topics_config)
114
+
115
+ # Compile workflow graph
116
+ self.workflow = self._build_workflow()
117
+ logger.info("ConversationProcessingWorkflow initialized")
118
+
119
+ # ------------------------------------------------------------------
120
+ # Graph construction
121
+ # ------------------------------------------------------------------
122
+
123
+ def _build_workflow(self) -> StateGraph:
124
+ graph = StateGraph(ConversationState)
125
+
126
+ graph.add_node("sentiment_analysis", self._sentiment_node)
127
+ graph.add_node("topic_extraction", self._topic_node)
128
+
129
+ graph.set_entry_point("sentiment_analysis")
130
+ graph.add_edge("sentiment_analysis", "topic_extraction")
131
+ graph.add_edge("topic_extraction", END)
132
+
133
+ return graph.compile()
134
+
135
+ # ------------------------------------------------------------------
136
+ # Preprocessing
137
+ # ------------------------------------------------------------------
138
+
139
+ def _format_conversation(self, combined_text: str) -> str:
140
+ """
141
+ Convert pipe-separated combined_text into a numbered message format
142
+ suitable for the LLM, truncated to self._max_chars.
143
+
144
+ Input: "I can't log in | Still not working | Please help!"
145
+ Output: "[1] I can't log in\n[2] Still not working\n[3] Please help!"
146
+ """
147
+ if not combined_text or not str(combined_text).strip():
148
+ return ""
149
+
150
+ messages = [m.strip() for m in str(combined_text).split("|") if m.strip()]
151
+ total_messages = len(messages)
152
+
153
+ parts = []
154
+ char_count = 0
155
+
156
+ for i, msg in enumerate(messages, 1):
157
+ entry = f"[{i}] {msg}"
158
+ if char_count + len(entry) + 1 > self._max_chars:
159
+ parts.append(f"[...truncated after {i - 1} of {total_messages} messages]")
160
+ break
161
+ parts.append(entry)
162
+ char_count += len(entry) + 1
163
+
164
+ return "\n".join(parts)
165
+
166
+ # ------------------------------------------------------------------
167
+ # Workflow nodes
168
+ # ------------------------------------------------------------------
169
+
170
+ def _sentiment_node(self, state: ConversationState) -> ConversationState:
171
+ """Node 1: Classify sentiment polarity and emotions."""
172
+ try:
173
+ # Format conversation text once β€” reused by both nodes
174
+ state["conversation_text"] = self._format_conversation(state.get("combined_text", ""))
175
+
176
+ if not state["conversation_text"]:
177
+ state["processing_errors"] = state.get("processing_errors", []) + [
178
+ "Empty conversation text after formatting"
179
+ ]
180
+ state["success"] = False
181
+ return state
182
+
183
+ result = self.sentiment_agent.process({"conversation_text": state["conversation_text"]})
184
+
185
+ if result.get("success", False):
186
+ state["sentiment_polarity"] = result.get("sentiment_polarity")
187
+ state["emotions"] = result.get("emotions")
188
+ state["sentiment_confidence"] = result.get("sentiment_confidence")
189
+ state["sentiment_notes"] = result.get("sentiment_notes", "")
190
+ state["success"] = True
191
+ else:
192
+ error_msg = f"Sentiment analysis failed: {result.get('error', 'Unknown error')}"
193
+ state["processing_errors"] = state.get("processing_errors", []) + [error_msg]
194
+ state["success"] = False
195
+ state["sentiment_polarity"] = None
196
+ state["emotions"] = None
197
+ state["sentiment_confidence"] = None
198
+ state["sentiment_notes"] = ""
199
+
200
+ logger.debug(f"Sentiment: {state['sentiment_polarity']} | Conversation: {state['conversation_id']}")
201
+
202
+ except Exception as e:
203
+ error_msg = f"Sentiment node error: {str(e)}"
204
+ logger.error(error_msg)
205
+ state["processing_errors"] = state.get("processing_errors", []) + [error_msg]
206
+ state["success"] = False
207
+
208
+ return state
209
+
210
+ def _topic_node(self, state: ConversationState) -> ConversationState:
211
+ """Node 2: Extract topic tags and billing flags."""
212
+ try:
213
+ # Skip topic extraction if sentiment already failed β€” no point in a partial record
214
+ if not state.get("success", False):
215
+ state["topics"] = None
216
+ state["is_refund_request"] = False
217
+ state["is_cancellation"] = False
218
+ state["is_membership"] = False
219
+ state["topic_confidence"] = None
220
+ state["topic_notes"] = ""
221
+ state["summary"] = ""
222
+ return state
223
+
224
+ result = self.topic_agent.process({"conversation_text": state["conversation_text"]})
225
+
226
+ if result.get("success", False):
227
+ state["topics"] = result.get("topics")
228
+ state["is_refund_request"] = result.get("is_refund_request", False)
229
+ state["is_cancellation"] = result.get("is_cancellation", False)
230
+ state["is_membership"] = result.get("is_membership", False)
231
+ state["topic_confidence"] = result.get("topic_confidence")
232
+ state["topic_notes"] = result.get("topic_notes", "")
233
+ state["summary"] = result.get("summary", "")
234
+ state["success"] = True
235
+ else:
236
+ error_msg = f"Topic extraction failed: {result.get('error', 'Unknown error')}"
237
+ state["processing_errors"] = state.get("processing_errors", []) + [error_msg]
238
+ state["success"] = False
239
+ state["topics"] = None
240
+ state["is_refund_request"] = False
241
+ state["is_cancellation"] = False
242
+ state["is_membership"] = False
243
+ state["topic_confidence"] = None
244
+ state["topic_notes"] = ""
245
+ state["summary"] = ""
246
+
247
+ logger.debug(f"Topics: {state['topics']} | Conversation: {state['conversation_id']}")
248
+
249
+ except Exception as e:
250
+ error_msg = f"Topic node error: {str(e)}"
251
+ logger.error(error_msg)
252
+ state["processing_errors"] = state.get("processing_errors", []) + [error_msg]
253
+ state["success"] = False
254
+
255
+ return state
256
+
257
+ # ------------------------------------------------------------------
258
+ # Public API
259
+ # ------------------------------------------------------------------
260
+
261
+ def process_conversation(self, conversation_data: Dict[str, Any]) -> Dict[str, Any]:
262
+ """
263
+ Process a single conversation through the full workflow.
264
+
265
+ Args:
266
+ conversation_data: Dict with aggregated conversation fields
267
+ (conversation_id, combined_text, customer_*, etc.)
268
+
269
+ Returns:
270
+ Dict with all original fields plus extracted sentiment and topic fields.
271
+ """
272
+ combined_text = conversation_data.get("combined_text", "")
273
+
274
+ if not combined_text or not str(combined_text).strip():
275
+ logger.warning(f"Skipping conversation with empty text: {conversation_data.get('conversation_id')}")
276
+ return {
277
+ **conversation_data,
278
+ "success": False,
279
+ "processing_errors": ["combined_text is empty β€” nothing to analyze"],
280
+ "conversation_text": "",
281
+ }
282
+
283
+ initial_state = {
284
+ "conversation_id": str(conversation_data.get("conversation_id", "")),
285
+ "customer_email": conversation_data.get("customer_email"),
286
+ "customer_first": conversation_data.get("customer_first"),
287
+ "customer_last": conversation_data.get("customer_last"),
288
+ "customer_hs_id": conversation_data.get("customer_hs_id"),
289
+ "thread_count": conversation_data.get("thread_count"),
290
+ "first_message_at": conversation_data.get("first_message_at"),
291
+ "last_message_at": conversation_data.get("last_message_at"),
292
+ "duration_hours": conversation_data.get("duration_hours"),
293
+ "status": conversation_data.get("status"),
294
+ "state": conversation_data.get("state"),
295
+ "source_type": conversation_data.get("source_type"),
296
+ "source_via": conversation_data.get("source_via"),
297
+ "combined_text": str(combined_text).strip(),
298
+ "conversation_text": "", # filled by sentiment node
299
+ "processing_errors": [],
300
+ "success": True,
301
+ }
302
+
303
+ try:
304
+ final_state = self.workflow.invoke(initial_state)
305
+
306
+ # Merge any extra fields from the source that weren't in initial_state
307
+ result = dict(final_state)
308
+ for key, value in conversation_data.items():
309
+ if key not in result:
310
+ result[key] = value
311
+
312
+ return result
313
+
314
+ except Exception as e:
315
+ logger.error(f"Workflow execution error for {conversation_data.get('conversation_id')}: {e}")
316
+ return {
317
+ **conversation_data,
318
+ "success": False,
319
+ "processing_errors": [str(e)],
320
+ "conversation_text": "",
321
+ }
322
+
323
+ def process_batch(self, conversations: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
324
+ """Process a list of conversations sequentially within the batch."""
325
+ results = []
326
+ total = len(conversations)
327
+
328
+ for idx, conv in enumerate(conversations, 1):
329
+ logger.info(f"Processing conversation {idx}/{total} (id={conv.get('conversation_id')})")
330
+ result = self.process_conversation(conv)
331
+ results.append(result)
332
+
333
+ logger.info(f"Batch complete: {total} conversations processed")
334
+ return results
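+
+
+ # Usage sketch (config and api_key as loaded by main.py):
+ # workflow = ConversationProcessingWorkflow(config, api_key)
+ # results = workflow.process_batch(conversations_df.to_dict("records"))
+ # successful = [r for r in results if r["success"]]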
visualization/README.md CHANGED
@@ -1,6 +1,6 @@
  # Musora Sentiment Analysis Dashboard
 
- A Streamlit dashboard for visualising sentiment analysis results from **social media comments** (Facebook, Instagram, YouTube, Twitter) and the **Musora internal app** across brands (Drumeo, Pianote, Guitareo, Singeo, Musora).
 
  ---
 
@@ -12,9 +12,12 @@ A Streamlit dashboard for visualising sentiment analysis results from **social m
  4. [Pages](#pages)
  5. [Global Filters & Session State](#global-filters--session-state)
  6. [Snowflake Queries](#snowflake-queries)
- 7. [Adding or Changing Things](#adding-or-changing-things)
- 8. [Running the App](#running-the-app)
- 9. [Configuration Reference](#configuration-reference)
 
  ---
 
@@ -22,28 +25,38 @@ A Streamlit dashboard for visualising sentiment analysis results from **social m
 
  ```
  visualization/
- β”œβ”€β”€ app.py                       # Entry point β€” routing, sidebar, session state
  β”œβ”€β”€ config/
- β”‚   └── viz_config.json          # Colors, query strings, dashboard settings
  β”œβ”€β”€ data/
- β”‚   └── data_loader.py           # All Snowflake queries and caching logic
  β”œβ”€β”€ utils/
- β”‚   β”œβ”€β”€ data_processor.py        # Pandas aggregations (intent dist, content summary, etc.)
- β”‚   └── metrics.py               # KPI calculations (sentiment score, urgency, etc.)
  β”œβ”€β”€ components/
- β”‚   β”œβ”€β”€ dashboard.py             # Dashboard page renderer
- β”‚   β”œβ”€β”€ sentiment_analysis.py    # Sentiment Analysis page renderer
- β”‚   └── reply_required.py        # Reply Required page renderer
  β”œβ”€β”€ visualizations/
- β”‚   β”œβ”€β”€ sentiment_charts.py      # Plotly sentiment chart functions
- β”‚   β”œβ”€β”€ distribution_charts.py   # Plotly distribution / heatmap / scatter functions
- β”‚   β”œβ”€β”€ demographic_charts.py    # Plotly demographic chart functions
- β”‚   └── content_cards.py         # Streamlit card components (comment cards, content cards)
  β”œβ”€β”€ agents/
- β”‚   └── content_summary_agent.py # AI analysis agent (OpenAI) for comment summarisation
  β”œβ”€β”€ img/
- β”‚   └── musora.png               # Sidebar logo
- └── SnowFlakeConnection.py       # Snowflake connection wrapper (Snowpark session)
  ```
 
  ---
@@ -53,213 +66,331 @@ visualization/
 
  ```
  Snowflake
  β”‚
- β–Ό
- data_loader.py  ← Three separate loading modes (see below)
  β”‚
- β”œβ”€β”€ load_dashboard_data() ──► st.session_state['dashboard_df']
- β”‚         └─► app.py sidebar (filter options, counts)
- β”‚         └─► dashboard.py (all charts)
- β”‚
- β”œβ”€β”€ load_sa_data() ──► st.session_state['sa_contents']
- β”‚   (on-demand, button)  st.session_state['sa_comments']
- β”‚         └─► sentiment_analysis.py
- β”‚
- └── load_reply_required_data() β–Ί st.session_state['rr_df']
-     (on-demand, button)  └─► reply_required.py
  ```
 
  **Key principle:** Data is loaded as little as possible, as late as possible.
 
- - The **Dashboard** uses a lightweight query (no text columns, no content join) cached for 24 hours.
- - The **Sentiment Analysis** and **Reply Required** pages never load data automatically β€” they wait for the user to click **Fetch Data**.
- - All data is stored in `st.session_state` so page navigation and widget interactions do not re-trigger Snowflake queries.
 
  ---
 
  ## Data Loading Strategy
 
- All loading logic lives in **`data/data_loader.py`** (`SentimentDataLoader` class).
 
- ### `load_dashboard_data()`
- - Uses `dashboard_query` from `viz_config.json`.
  - Fetches only: `comment_sk, content_sk, platform, brand, sentiment_polarity, intent, requires_reply, detected_language, comment_timestamp, processed_at, author_id`.
- - No text columns, no `DIM_CONTENT` join β€” significantly faster than the full query.
- - Also merges demographics data if `demographics_query` is configured.
- - Cached for **24 hours** (`@st.cache_data(ttl=86400)`).
- - Called once by `app.py` at startup; result stored in `st.session_state['dashboard_df']`.
 
- ### `load_sa_data(platform, brand, top_n, min_comments, sort_by, sentiments, intents, date_range)`
- - Runs **two** sequential Snowflake queries:
  1. **Content aggregation** β€” groups by `content_sk`, counts per sentiment, computes severity score, returns top N.
- 2. **Sampled comments** β€” for the top N `content_sk`s only, fetches up to 50 comments per sentiment group per content (negative, positive, other), using Snowflake `QUALIFY ROW_NUMBER()`. `display_text` is computed in SQL (`CASE WHEN IS_ENGLISH = FALSE AND TRANSLATED_TEXT IS NOT NULL THEN TRANSLATED_TEXT ELSE ORIGINAL_TEXT END`).
- - Returns a tuple `(contents_df, comments_df)`.
- - Cached for **24 hours**.
- - Called only when the user clicks **Fetch Data** on the Sentiment Analysis page.
 
- ### `load_reply_required_data(platforms, brands, date_range)`
- - Runs a single query filtering `REQUIRES_REPLY = TRUE`.
- - Dynamically includes/excludes the social media table and musora table based on selected platforms.
- - `display_text` computed in SQL.
- - Cached for **24 hours**.
- - Called only when the user clicks **Fetch Data** on the Reply Required page.
 
- ### Important: SQL Column Qualification
- Both the social media table (`COMMENT_SENTIMENT_FEATURES`) and the content dimension table (`DIM_CONTENT`) share column names. Any `WHERE` clause inside a query that joins these two tables **must** use the table alias prefix (e.g. `s.PLATFORM`, `s.COMMENT_TIMESTAMP`, `s.CHANNEL_NAME`) to avoid Snowflake `ambiguous column name` errors. The musora table (`MUSORA_COMMENT_SENTIMENT_FEATURES`) has no joins so unqualified column names are fine there.
 
  ---
 
  ## Pages
 
- ### Dashboard (`components/dashboard.py`)
 
- **Receives:** `filtered_df` β€” the lightweight dashboard dataframe (after optional global filter applied by `app.py`).
 
- **Does not need:** text, translations, content URLs. All charts work purely on aggregated columns (sentiment_polarity, brand, platform, intent, requires_reply, comment_timestamp).
 
  **Key sections:**
  - Summary stats + health indicator
  - Sentiment distribution (pie + gauge)
  - Sentiment by brand and platform (stacked + percentage bar charts)
- - Intent analysis
- - Brand-Platform heatmap
  - Reply requirements + urgency breakdown
- - Demographics (age, timezone, experience level) β€” only rendered if `author_id` is present and demographics were merged
-
- **To add a new chart:** create the chart function in `visualizations/` and call it from `render_dashboard()`. The function receives `filtered_df`.
 
  ---
 
- ### Sentiment Analysis (`components/sentiment_analysis.py`)
 
- **Receives:** `data_loader` instance only (no dataframe).
 
  **Flow:**
- 1. Reads `st.session_state['dashboard_df']` for filter option lists (platforms, brands, sentiments, intents).
  2. Pre-populates platform/brand dropdowns from `st.session_state['global_filters']`.
- 3. Shows filter controls (platform, brand, sentiment, intent, top_n, min_comments, sort_by).
- 4. On **Fetch Data** click: calls `data_loader.load_sa_data(...)` and stores results in `st.session_state['sa_contents']` and `['sa_comments']`.
- 5. Renders content cards, per-content sentiment + intent charts, AI analysis buttons, and sampled comment expanders.
 
  **Pagination:** `st.session_state['sentiment_page']` (5 contents per page). Reset on new fetch.
 
- **Comments:** Sampled (up to 50 negative + 50 positive + 50 neutral per content). These are already in memory after the fetch β€” no extra query is needed when the user expands a comment section.
-
- **AI Analysis:** Uses `ContentSummaryAgent` (see `agents/`). Results cached in `st.session_state['content_summaries']`.
-
  ---
 
- ### Reply Required (`components/reply_required.py`)
 
  **Receives:** `data_loader` instance only.
 
  **Flow:**
- 1. Reads `st.session_state['dashboard_df']` for filter option lists.
- 2. Pre-populates platform, brand, and date from `st.session_state['global_filters']`.
- 3. On **Fetch Data** click: calls `data_loader.load_reply_required_data(...)` and stores result in `st.session_state['rr_df']`.
- 4. Shows urgency breakdown, in-page view filters (priority, platform, brand, intent β€” applied in Python, no new query), paginated comment cards, and a "Reply by Content" summary.
 
  **Pagination:** `st.session_state['reply_page']` (10 comments per page). Reset on new fetch.
 
  ---
 
  ## Global Filters & Session State
 
- Global filters live in the sidebar (`app.py`) and are stored in `st.session_state['global_filters']` as a dict:
 
  ```python
- {
-     'platforms': ['facebook', 'instagram'],  # list or []
      'brands': ['drumeo'],
      'sentiments': [],
      'date_range': (date(2025, 1, 1), date(2025, 12, 31)),  # or None
  }
  ```
 
- - **Dashboard:** `app.py` applies global filters to `dashboard_df` using `data_loader.apply_filters()` and passes the result to `render_dashboard()`.
- - **Sentiment Analysis / Reply Required:** global filters are used to pre-populate their own filter widgets. The actual Snowflake query uses those values when the user clicks Fetch. The pages do **not** receive a pre-filtered dataframe.
-
  ### Full session state key reference
 
  | Key | Set by | Used by |
  |-----|--------|---------|
- | `dashboard_df` | `app.py` on startup | sidebar (filter options), dashboard, SA + RR (filter option lists) |
- | `global_filters` | sidebar "Apply Filters" button | app.py (dashboard filter), SA + RR (pre-populate widgets) |
- | `filters_applied` | sidebar buttons | app.py (whether to apply filters) |
- | `sa_contents` | SA fetch button | SA page rendering |
- | `sa_comments` | SA fetch button | SA page rendering |
- | `sa_fetch_key` | SA fetch button | SA page (detect stale data) |
- | `rr_df` | RR fetch button | RR page rendering |
- | `rr_fetch_key` | RR fetch button | RR page (detect stale data) |
  | `sentiment_page` | SA page / fetch | SA pagination |
  | `reply_page` | RR page / fetch | RR pagination |
- | `content_summaries` | AI analysis buttons | SA AI analysis display |
 
  ---
 
  ## Snowflake Queries
 
- All query strings are either stored in `config/viz_config.json` (static queries) or built dynamically in `data/data_loader.py` (page-specific queries).
 
  ### Static queries (in `viz_config.json`)
 
  | Key | Purpose |
  |-----|---------|
- | `query` | Full query with all columns (legacy, kept for compatibility) |
- | `dashboard_query` | Lightweight query β€” no text, no DIM_CONTENT join |
- | `demographics_query` | Joins `usora_users` with `preprocessed.users` to get age/timezone/experience |
 
- ### Dynamic queries (built in `data_loader.py`)
 
  | Method | Description |
  |--------|-------------|
- | `_build_sa_content_query()` | Content aggregation for SA page; filters by platform + brand + date |
- | `_build_sa_comments_query()` | Sampled comments for SA page; uses `QUALIFY ROW_NUMBER() <= 50` |
- | `_build_rr_query()` | Reply-required comments; filters by platform/brand/date; conditionally includes social media and/or musora table |
 
- ### Data source tables
 
- | Table | Platform | Notes |
- |-------|----------|-------|
- | `SOCIAL_MEDIA_DB.ML_FEATURES.COMMENT_SENTIMENT_FEATURES` | facebook, instagram, youtube, twitter | Needs `LEFT JOIN SOCIAL_MEDIA_DB.CORE.DIM_CONTENT` for `PERMALINK_URL` |
- | `SOCIAL_MEDIA_DB.ML_FEATURES.MUSORA_COMMENT_SENTIMENT_FEATURES` | musora_app | Has `PERMALINK_URL` and `THUMBNAIL_URL` natively; platform stored as `'musora'`, mapped to `'musora_app'` in queries |
 
  ---
 
  ## Adding or Changing Things
 
- ### Add a new chart to the Dashboard
  1. Write the chart function in the appropriate `visualizations/` file.
- 2. Call it from `render_dashboard()` in `components/dashboard.py`, passing `filtered_df`.
- 3. The chart function receives a lightweight df β€” it has no text columns but has all the columns listed in `dashboard_query`.
 
- ### Add a new filter to the Dashboard sidebar
- 1. Add the widget in `app.py` under the "Global Filters" section.
- 2. Store the selected value in the `global_filters` dict under `st.session_state`.
- 3. Pass it to `data_loader.apply_filters()`.
 
- ### Change what the Sentiment Analysis page queries
- - Edit `_build_sa_content_query()` and/or `_build_sa_comments_query()` in `data_loader.py`.
- - If you add new columns to the content aggregation result, also update `_process_sa_content_stats()` so they are available in `contents_df`.
- - If you add new columns to the comments result, update `_process_sa_comments()`.
 
- ### Change what the Reply Required page queries
- - Edit `_build_rr_query()` in `data_loader.py`.
- - Remember: all column references inside the social media block (which has a `JOIN`) must be prefixed with `s.` to avoid Snowflake ambiguity errors.
 
  ### Change the cache duration
- - `@st.cache_data(ttl=86400)` is set on `load_dashboard_data`, `_fetch_sa_data`, `_fetch_rr_data`, and `load_demographics_data`.
- - Change `86400` (seconds) to the desired TTL, or set `ttl=None` for no expiry.
- - Users can always force a refresh with the "Reload Data" button in the sidebar (which calls `st.cache_data.clear()` and deletes `st.session_state['dashboard_df']`).
 
  ### Add a new page
- 1. Create `components/new_page.py` with a `render_new_page(data_loader)` function.
  2. Import and add a radio option in `app.py`.
- 3. If the page needs its own Snowflake data, add a `load_new_page_data()` method to `SentimentDataLoader` following the same pattern as `load_sa_data`.
 
- ### Add a new column to the Dashboard query
- - Edit `dashboard_query` in `config/viz_config.json`.
- - Both UNION branches must select the same columns in the same order.
- - `_process_dashboard_dataframe()` in `data_loader.py` handles basic type casting β€” add processing there if needed.
 
  ---
 
@@ -280,6 +411,8 @@ SNOWFLAKE_ROLE
  SNOWFLAKE_DATABASE
  SNOWFLAKE_WAREHOUSE
  SNOWFLAKE_SCHEMA
  ```
 
  ---
@@ -291,19 +424,25 @@ SNOWFLAKE_SCHEMA
  | Section | What it configures |
  |---------|-------------------|
  | `color_schemes.sentiment_polarity` | Hex colors for each sentiment level |
- | `color_schemes.intent` | Hex colors for each intent label |
- | `color_schemes.platform` | Hex colors for each platform |
- | `color_schemes.brand` | Hex colors for each brand |
- | `sentiment_order` | Display order for sentiment categories in charts |
  | `intent_order` | Display order for intent categories |
  | `negative_sentiments` | Which sentiment values count as "negative" |
- | `dashboard.default_date_range_days` | Default date filter window (days) |
- | `dashboard.max_comments_display` | Max comments shown per pagination page |
- | `dashboard.chart_height` | Default Plotly chart height |
- | `dashboard.top_n_contents` | Default top-N for content ranking |
- | `snowflake.query` | Full query (legacy, all columns) |
- | `snowflake.dashboard_query` | Lightweight dashboard query (no text columns) |
- | `snowflake.demographics_query` | Demographics join query |
  | `demographics.age_groups` | Age bucket definitions (label β†’ [min, max]) |
  | `demographics.experience_groups` | Experience bucket definitions |
  | `demographics.top_timezones_count` | How many timezones to show in the geographic chart |
  # Musora Sentiment Analysis Dashboard
 
+ A Streamlit dashboard for visualising sentiment analysis results from **social media comments** (Facebook, Instagram, YouTube, Twitter), the **Musora internal app**, and **HelpScout customer support conversations** across brands (Drumeo, Pianote, Guitareo, Singeo, Musora).
 
  ---
 
  4. [Pages](#pages)
  5. [Global Filters & Session State](#global-filters--session-state)
  6. [Snowflake Queries](#snowflake-queries)
+ 7. [Authentication](#authentication)
+ 8. [PDF Reports](#pdf-reports)
+ 9. [AI Agents](#ai-agents)
+ 10. [Adding or Changing Things](#adding-or-changing-things)
+ 11. [Running the App](#running-the-app)
+ 12. [Configuration Reference](#configuration-reference)
 
  ---
 
  ```
  visualization/
+ β”œβ”€β”€ app.py                          # Entry point β€” routing, sidebar, session state
  β”œβ”€β”€ config/
+ β”‚   └── viz_config.json             # Colors, query strings, dashboard settings
  β”œβ”€β”€ data/
+ β”‚   β”œβ”€β”€ data_loader.py              # Comment Snowflake queries and caching
+ β”‚   └── helpscout_data_loader.py    # HelpScout Snowflake queries and caching
  β”œβ”€β”€ utils/
+ β”‚   β”œβ”€β”€ auth.py                     # Login page, authentication helpers
+ β”‚   β”œβ”€β”€ data_processor.py           # Pandas aggregations (intent dist, content summary, etc.)
+ β”‚   β”œβ”€β”€ metrics.py                  # KPI calculations (sentiment score, urgency, etc.)
+ β”‚   β”œβ”€β”€ pdf_exporter.py             # DashboardPDFExporter (comment dashboard PDF)
+ β”‚   β”œβ”€β”€ helpscout_utils.py          # Pure helpers: parse_topics, explode_topics, boolean_flag_counts
+ β”‚   └── helpscout_pdf.py            # HelpScoutDashboardPDF + HelpScoutAnalysisPDF
  β”œβ”€β”€ components/
+ β”‚   β”œβ”€β”€ dashboard.py                # Comment Dashboard page renderer
+ β”‚   β”œβ”€β”€ sentiment_analysis.py       # Sentiment Analysis page renderer
+ β”‚   β”œβ”€β”€ reply_required.py           # Reply Required page renderer
+ β”‚   β”œβ”€β”€ helpscout_dashboard.py      # HelpScout Dashboard page + compact summary widget
+ β”‚   └── helpscout_analysis.py       # HelpScout Analysis page (filter β†’ fetch β†’ charts β†’ LLM β†’ PDF)
  β”œβ”€β”€ visualizations/
+ β”‚   β”œβ”€β”€ sentiment_charts.py         # Plotly sentiment chart functions
+ β”‚   β”œβ”€β”€ distribution_charts.py      # Plotly distribution / heatmap / scatter functions
+ β”‚   β”œβ”€β”€ demographic_charts.py       # Plotly demographic chart functions
+ β”‚   β”œβ”€β”€ content_cards.py            # Streamlit card components (comment + content cards)
+ β”‚   └── helpscout_charts.py         # HelpScoutCharts Plotly factory (16 chart types)
  β”œβ”€β”€ agents/
+ β”‚   β”œβ”€β”€ base_agent.py               # BaseVisualizationAgent (shared interface)
+ β”‚   β”œβ”€β”€ content_summary_agent.py    # AI analysis for comment content summarisation
+ β”‚   └── helpscout_summary_agent.py  # HelpScoutSummaryAgent β€” page-level LLM summary from SUMMARY fields
  β”œβ”€β”€ img/
+ β”‚   └── musora.png                  # Sidebar logo
+ └── SnowFlakeConnection.py          # Snowflake connection wrapper (Snowpark session)
  ```
 
  ---
 
  ```
  Snowflake
  β”‚
+ β”œβ”€β”€ data_loader.py (SentimentDataLoader)
+ β”‚   β”œβ”€β”€ load_dashboard_data() ──► st.session_state['dashboard_df']
+ β”‚   β”‚       └─► sidebar (filter options, counts)
+ β”‚   β”‚       └─► dashboard.py (all charts)
+ β”‚   β”œβ”€β”€ load_sa_data() ──► st.session_state['sa_contents', 'sa_comments']
+ β”‚   β”‚       (on-demand, Fetch button) └─► sentiment_analysis.py
+ β”‚   └── load_reply_required_data() ──► st.session_state['rr_df']
+ β”‚           (on-demand, Fetch button) └─► reply_required.py
  β”‚
+ └── helpscout_data_loader.py (HelpScoutDataLoader)
+     β”œβ”€β”€ load_dashboard_data() ──► st.session_state['helpscout_df']
+     β”‚       └─► helpscout_dashboard.py
+     β”‚       └─► dashboard.py (compact summary)
+     └── load_analysis_data() ──► st.session_state['hs_analysis_df']
+             (on-demand, Fetch button) └─► helpscout_analysis.py
  ```
 
  **Key principle:** Data is loaded as little as possible, as late as possible.
 
+ - **Dashboard** queries are lightweight (no text columns, no content join) and cached 24 hours.
+ - **Sentiment Analysis**, **Reply Required**, and **HelpScout Analysis** pages wait for the user to click **Fetch Data**.
+ - All data lives in `st.session_state` so page navigation and widget interactions never re-trigger Snowflake queries.
 
  ---
 
  ## Data Loading Strategy
 
+ ### Comment data (`data/data_loader.py` β€” `SentimentDataLoader`)
 
+ #### `load_dashboard_data()`
  - Fetches only: `comment_sk, content_sk, platform, brand, sentiment_polarity, intent, requires_reply, detected_language, comment_timestamp, processed_at, author_id`.
+ - No text columns, no `DIM_CONTENT` join.
+ - Merges demographics data if `demographics_query` is configured.
+ - Cached **24 hours**. Called once at startup; stored in `st.session_state['dashboard_df']`.
 
+ #### `load_sa_data(platform, brand, top_n, min_comments, sort_by, sentiments, intents, emotions, date_range)`
+ - Runs two Snowflake queries:
  1. **Content aggregation** β€” groups by `content_sk`, counts per sentiment, computes severity score, returns top N.
+ 2. **Sampled comments** β€” up to 50 per sentiment group per content (`QUALIFY ROW_NUMBER() <= 50`). `display_text` computed in SQL.
+ - Returns `(contents_df, comments_df)`. Cached **24 hours**.
+
+ #### `load_reply_required_data(platforms, brands, date_range)`
+ - Filters `REQUIRES_REPLY = TRUE`. Conditionally includes the social media table and/or musora table. Cached **24 hours**.
+
+ #### SQL column qualification note
+ The social media table and `DIM_CONTENT` share column names. Any `WHERE` clause inside a query that joins them **must** use the table alias prefix (e.g. `s.PLATFORM`, `s.COMMENT_TIMESTAMP`) to avoid Snowflake `ambiguous column name` errors.
+
+ ---
+
+ ### HelpScout data (`data/helpscout_data_loader.py` β€” `HelpScoutDataLoader`)
+
+ #### `load_dashboard_data()`
+ - Lightweight query from `SOCIAL_MEDIA_DB.ML_FEATURES.HELPSCOUT_CONVERSATION_FEATURES`.
+ - Columns: `conversation_id, status, source, created_at, updated_at, duration_hours, sentiment_polarity, topics, is_refund_request, is_cancellation, is_membership, customer_email`.
+ - Merges demographics (age/timezone/experience) via email join (`LOWER(customer_email) = LOWER(usora_users.email)`).
+ - Cached **24 hours**. Stored in `st.session_state['helpscout_df']`.
 
+ #### `load_analysis_data(date_start, date_end, topics, sentiments, statuses, sources, is_refund, is_cancellation, is_membership)`
+ - Adds `summary, sentiment_notes, topic_notes, customer_first_name, customer_last_name` columns.
+ - SQL `WHERE` pushdown for all filters; the multi-label topic filter uses `ARRAY_CONTAINS('topic_id'::VARIANT, SPLIT(TOPICS, ','))` (see the sketch after this list).
+ - Cached **24 hours** keyed on filter tuple. Stored in `st.session_state['hs_analysis_df']`.
 
+ #### `get_filter_options(df)`
+ - Returns `sentiments`, `topics` (exploded and label-mapped from taxonomy), `statuses`, `states`, `sources`.
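
To make the `ARRAY_CONTAINS` pushdown concrete, here is a hedged Python sketch of how such a clause can be assembled. The helper name and topic ids are hypothetical β€” the real clause is built inside `_build_analysis_query()`:

```python
def topic_filter_sql(topic_ids: list[str]) -> str:
    """Hypothetical helper: OR-join ARRAY_CONTAINS clauses against the
    comma-separated TOPICS column. Illustration only, not the loader's code."""
    if not topic_ids:
        return ""  # no topic filter selected
    clauses = [
        f"ARRAY_CONTAINS('{tid}'::VARIANT, SPLIT(TOPICS, ','))"
        for tid in topic_ids
    ]
    return "AND (" + " OR ".join(clauses) + ")"

# topic_filter_sql(["refund", "cancellation"]) yields roughly:
# AND (ARRAY_CONTAINS('refund'::VARIANT, SPLIT(TOPICS, ','))
#      OR ARRAY_CONTAINS('cancellation'::VARIANT, SPLIT(TOPICS, ',')))
```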
 
  ---
 
  ## Pages
 
+ The app has **5 pages** navigated via the sidebar radio:
 
+ ### 1. Sentiment Dashboard (`components/dashboard.py`)
 
+ **Receives:** `filtered_df` β€” lightweight comment dataframe (after optional global filter from `app.py`).
 
  **Key sections:**
  - Summary stats + health indicator
  - Sentiment distribution (pie + gauge)
  - Sentiment by brand and platform (stacked + percentage bar charts)
+ - Intent analysis (bar + pie)
+ - Emotion analysis (bar + pie) β€” only when `emotions` column is non-null
+ - Brand–Platform heatmap
  - Reply requirements + urgency breakdown
+ - Demographics (age, timezone, experience) β€” only when demographics were merged
+ - **HelpScout compact summary** β€” appended at bottom; reads `st.session_state['helpscout_df']` directly (guarded by `try/except` so failures never break the main dashboard)
 
  ---
 
+ ### 2. Custom Sentiment Queries (`components/sentiment_analysis.py`)
 
+ **Receives:** `data_loader` instance only.
 
  **Flow:**
+ 1. Reads `st.session_state['dashboard_df']` for filter option lists.
  2. Pre-populates platform/brand dropdowns from `st.session_state['global_filters']`.
+ 3. On **Fetch Data**: calls `data_loader.load_sa_data(...)`, stores results in `st.session_state['sa_contents']` and `['sa_comments']`.
+ 4. Renders content cards, per-content sentiment + intent + emotion charts, AI analysis buttons, sampled comment expanders.
 
  **Pagination:** `st.session_state['sentiment_page']` (5 contents per page). Reset on new fetch. The same fetch-and-stale-check idiom is shared by all on-demand pages; a sketch follows.
 
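A minimal sketch of that idiom, assuming hypothetical widget variables β€” see the actual page renderers for the real code:

```python
import streamlit as st

# Hedged sketch β€” not the actual page code. The current widget state is folded
# into a tuple; results are cached in session_state alongside the tuple, so a
# later filter change can be detected as "stale" without re-querying Snowflake.
platforms = st.multiselect("Platforms", ["facebook", "instagram", "youtube", "twitter"])
brands = st.multiselect("Brands", ["drumeo", "pianote", "guitareo", "singeo"])
fetch_key = (tuple(platforms), tuple(brands))  # one entry per filter widget

if st.button("Fetch Data"):
    contents_df, comments_df = data_loader.load_sa_data(  # data_loader assumed in scope
        platform=platforms, brand=brands,                 # remaining args omitted
    )
    st.session_state["sa_contents"] = contents_df
    st.session_state["sa_comments"] = comments_df
    st.session_state["sa_fetch_key"] = fetch_key
    st.session_state["sentiment_page"] = 0  # reset pagination on new fetch

if st.session_state.get("sa_fetch_key") not in (None, fetch_key):
    st.warning("Filters changed since the last fetch β€” click Fetch Data to refresh.")
```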
  ---
 
+ ### 3. Reply Required (`components/reply_required.py`)
 
  **Receives:** `data_loader` instance only.
 
  **Flow:**
+ 1. Pre-populates platform/brand/date from `st.session_state['global_filters']`.
+ 2. On **Fetch Data**: calls `data_loader.load_reply_required_data(...)`, stores result in `st.session_state['rr_df']`.
+ 3. Shows urgency breakdown, in-page filters (applied in Python, no extra query), paginated comment cards, and "Reply by Content" summary.
 
  **Pagination:** `st.session_state['reply_page']` (10 comments per page). Reset on new fetch.
 
  ---
 
+ ### 4. HelpScout Dashboard (`components/helpscout_dashboard.py`)
+
+ **Receives:** `helpscout_loader` instance.
+
+ **Reads from:** `st.session_state['helpscout_df']` (loaded at app startup).
+
+ **Key sections:**
+ - PDF export button (HelpScout Dashboard PDF)
+ - 6 KPI metrics: total conversations, average duration, refund requests, cancellations, negative rate, membership joins
+ - Sentiment distribution (pie + bar)
+ - Topic distribution and sentiment heatmap (from `process_helpscout/config_files/topics.json` taxonomy)
+ - Boolean flags (refund, cancellation, membership) breakdown
+ - Status and source breakdown
+ - Timelines expander (daily conversation volume, refund/cancel trend)
+ - Depth expander (topic co-occurrence, escalation funnel)
+ - Demographics (age, timezone, experience)
+
+ > **Note:** Global sidebar filters (brand, platform, sentiment, date) do **not** apply to HelpScout pages β€” HelpScout is brand-agnostic and uses its own filter panel.
+
+ ---
+
+ ### 5. HelpScout Analysis (`components/helpscout_analysis.py`)
+
+ **Receives:** `helpscout_loader` instance.
+
+ **Flow:**
+ 1. **Filter panel** β€” date range, top_n, topics (multi-select with human-readable labels), sentiments, statuses, sources, and 3 boolean checkboxes (refund / cancellation / membership).
+ 2. **Fetch Data** button β€” calls `helpscout_loader.load_analysis_data(...)`, stale-checked via `fetch_key` tuple.
+ 3. **KPI row** + distribution charts (sentiment, topics, flags, status).
+ 4. **AI Summary section:**
+    - "Generate AI Summary" button β†’ calls `HelpScoutSummaryAgent`, stores result in `st.session_state['hs_analysis_summary']`.
+    - Renders: executive summary, top themes, top complaints, unexpected insights, notable quotes.
+    - "Export Analysis PDF" button β†’ generates `HelpScoutAnalysisPDF`.
+ 5. **Paginated conversation cards** β€” 10 per page; each card shows customer name, status, topics (label-mapped), summary, sentiment/topic notes.
+ 6. **CSV export** button.
+
+ **Pagination:** `st.session_state['hs_analysis_page']`. Reset on new fetch.
+
+ **Date range default:** Clamps to `max(min_date, max_date βˆ’ default_date_range_days)` so the default is always within the available data window.
+
+ ---
+
  ## Global Filters & Session State
 
+ Global filters apply **only to comment pages** (Dashboard, Sentiment Analysis, Reply Required). They have no effect on HelpScout pages.
 
  ```python
+ st.session_state['global_filters'] = {
+     'platforms': ['facebook', 'instagram'],
      'brands': ['drumeo'],
      'sentiments': [],
      'date_range': (date(2025, 1, 1), date(2025, 12, 31)),  # or None
  }
  ```
 
  ### Full session state key reference
 
  | Key | Set by | Used by |
  |-----|--------|---------|
+ | `dashboard_df` | `app.py` startup | sidebar, dashboard.py, SA + RR filter lists |
+ | `global_filters` | sidebar "Apply Filters" | app.py (dashboard filter), SA + RR pre-populate |
+ | `filters_applied` | sidebar buttons | app.py |
+ | `sa_contents` | SA fetch button | sentiment_analysis.py |
+ | `sa_comments` | SA fetch button | sentiment_analysis.py |
+ | `sa_fetch_key` | SA fetch button | SA stale-check |
+ | `rr_df` | RR fetch button | reply_required.py |
+ | `rr_fetch_key` | RR fetch button | RR stale-check |
  | `sentiment_page` | SA page / fetch | SA pagination |
  | `reply_page` | RR page / fetch | RR pagination |
+ | `content_summaries` | SA AI buttons | SA AI analysis display |
+ | `helpscout_df` | `app.py` startup | helpscout_dashboard.py, dashboard.py compact summary |
+ | `hs_analysis_df` | HS Analysis fetch | helpscout_analysis.py charts + cards |
+ | `hs_analysis_fetch_key` | HS Analysis fetch | HS Analysis stale-check |
+ | `hs_analysis_filter_desc` | HS Analysis fetch | human-readable filter string for PDF + agent |
+ | `hs_analysis_summary` | "Generate AI Summary" | HS Analysis summary renderer |
+ | `hs_analysis_summary_key` | "Generate AI Summary" | invalidated on re-fetch |
+ | `hs_analysis_page` | HS Analysis page / fetch | HS Analysis pagination |
 
  ---
 
  ## Snowflake Queries
 
+ ### Comment tables
+
+ | Table | Platform | Notes |
+ |-------|----------|-------|
+ | `SOCIAL_MEDIA_DB.ML_FEATURES.COMMENT_SENTIMENT_FEATURES` | facebook, instagram, youtube, twitter | Needs `LEFT JOIN DIM_CONTENT` for `PERMALINK_URL` |
+ | `SOCIAL_MEDIA_DB.ML_FEATURES.MUSORA_COMMENT_SENTIMENT_FEATURES` | musora_app | Has `PERMALINK_URL` and `THUMBNAIL_URL` natively |
+
+ ### HelpScout table
+
+ | Table | Notes |
+ |-------|-------|
+ | `SOCIAL_MEDIA_DB.ML_FEATURES.HELPSCOUT_CONVERSATION_FEATURES` | One row per conversation; multi-label topics in comma-separated `TOPICS` column |
 
  ### Static queries (in `viz_config.json`)
 
  | Key | Purpose |
  |-----|---------|
+ | `dashboard_query` | Lightweight comment query β€” no text, no DIM_CONTENT join |
+ | `demographics_query` | Joins `usora_users` + `preprocessed.users` for age/timezone/experience |
+ | `helpscout.dashboard_query` | Lightweight HelpScout query (no SUMMARY/notes) |
+ | `helpscout.demographics_query` | Same demographics join, keyed on `customer_email` |
 
+ ### Dynamic queries (built in `helpscout_data_loader.py`)
 
  | Method | Description |
  |--------|-------------|
+ | `_build_analysis_query()` | Full HelpScout query including SUMMARY/notes; multi-label topic filter via `ARRAY_CONTAINS` |
 
+ ---
+
+ ## Authentication
+
+ Module: `utils/auth.py`
+
+ - `AUTHORIZED_EMAILS` allowlist + `APP_TOKEN` env var.
+ - `render_login_page()` renders the login form and calls `st.stop()` when not authenticated.
+ - Gate is placed at the top of `app.py` (after `st.set_page_config`, before data loaders).
+ - Current user and logout button are shown in the sidebar.
+
+ **Required env vars:**
+ ```
+ APP_TOKEN=<shared token>
+ ```
+
+ ---
+
+ ## PDF Reports
+
+ ### Comment Dashboard PDF (`utils/pdf_exporter.py` β€” `DashboardPDFExporter`)
+
+ Generated from the "Export PDF Report" expander at the top of the Dashboard page.
+
+ Sections: cover, executive summary, sentiment, brand, platform, intent, cross-dimensional, volume, reply requirements, demographics (optional), language (optional), HelpScout summary (if data loaded), data summary.
+
+ ### HelpScout Dashboard PDF (`utils/helpscout_pdf.py` β€” `HelpScoutDashboardPDF`)
+
+ Generated from the HelpScout Dashboard page. Sections: cover, KPI summary, sentiment, topics, flags & escalation, status & source, timelines, demographics.
+
+ ### HelpScout Analysis PDF (`utils/helpscout_pdf.py` β€” `HelpScoutAnalysisPDF`)
+
+ Generated from the "Export Analysis PDF" button on the HelpScout Analysis page (only available after an AI Summary has been generated).
+
+ Sections: cover, filter summary, KPI summary, chart snapshots, AI summary (executive summary, top themes, top complaints, unexpected insights, notable quotes), conversation cards sample, metadata.
+
+ **Dependencies:** `fpdf2`, `kaleido` (for Plotly PNG rendering at 3Γ— scale).
+
+ ---
+
+ ## AI Agents
+
+ ### `ContentSummaryAgent` (`agents/content_summary_agent.py`)
+
+ Summarises sampled comments for a single content item on the Sentiment Analysis page. Called per-content when the user clicks the AI analysis button. Results cached in `st.session_state['content_summaries']`.
+
+ ### `HelpScoutSummaryAgent` (`agents/helpscout_summary_agent.py`)
+
+ Produces a **page-level** executive report from the filtered HelpScout conversations by reading their pre-extracted `SUMMARY` fields through an LLM.
+
+ - Stratified sample by `sentiment_polarity` β€” capped at `max_conversations` (default 300).
+ - Builds aggregate context: sentiment breakdown, top topics, flag counts, average duration, then per-conversation summaries (capped at 250 chars each).
+ - Prompt asks the LLM to surface patterns **beyond** the pre-tagged topics/sentiments.
+ - Output structure:
+
+ ```json
+ {
+   "executive_summary": "...",
+   "top_themes": [{"theme": "...", "description": "...", "prevalence": "..."}],
+   "top_complaints": ["..."],
+   "unexpected_insights": ["..."],
+   "notable_quotes": ["..."]
+ }
+ ```
+
+ - Uses `LLMHelper.get_structured_completion()` with up to 3 retries.
 
  ---
 
  ## Adding or Changing Things
 
+ ### Add a new chart to the Comment Dashboard
  1. Write the chart function in the appropriate `visualizations/` file.
+ 2. Call it from `render_dashboard()` in `components/dashboard.py`.
 
+ ### Add a new chart to the HelpScout Dashboard
+ 1. Add the chart method to `HelpScoutCharts` in `visualizations/helpscout_charts.py`.
+ 2. Call it from `render_helpscout_dashboard()` in `components/helpscout_dashboard.py`.
 
+ ### Add a new HelpScout filter
+ 1. Add the widget to the filter panel in `helpscout_analysis.py`.
+ 2. Include the new value in the `fetch_key` tuple.
+ 3. Add the corresponding `WHERE` clause condition to `_build_analysis_query()` in `helpscout_data_loader.py`.
 
+ ### Add a new HelpScout topic
+ - Edit `process_helpscout/config_files/topics.json` (the taxonomy file).
+ - `helpscout_utils.load_topic_taxonomy()` reloads it on each app start; no other changes needed.
 
  ### Change the cache duration
+ `@st.cache_data(ttl=86400)` appears on `load_dashboard_data`, `_fetch_sa_data`, `_fetch_rr_data`, `load_demographics_data`, and their HelpScout equivalents. Change `86400` to the desired TTL. Users can always force a refresh with "Reload Data" in the sidebar.
 
  ### Add a new page
+ 1. Create `components/new_page.py` with a `render_new_page(...)` function.
  2. Import and add a radio option in `app.py`.
+ 3. Add data loading to the appropriate loader class.
+ 4. If the page should be excluded from global comment filters, extend the `_hs_page` guard in `app.py`.
 
+ ### Change what the Sentiment Analysis page queries
+ - Edit `_build_sa_content_query()` and/or `_build_sa_comments_query()` in `data_loader.py`.
+ - Update `_process_sa_content_stats()` and/or `_process_sa_comments()` for new columns.
 
  ---
 
  SNOWFLAKE_DATABASE
  SNOWFLAKE_WAREHOUSE
  SNOWFLAKE_SCHEMA
+ OPENAI_API_KEY
+ APP_TOKEN
  ```
 
  ---
 
  | Section | What it configures |
  |---------|-------------------|
  | `color_schemes.sentiment_polarity` | Hex colors for each sentiment level |
+ | `color_schemes.intent` | Hex colors per intent label |
+ | `color_schemes.emotion` | Hex colors per emotion label |
+ | `color_schemes.platform` | Hex colors per platform |
+ | `color_schemes.brand` | Hex colors per brand |
+ | `color_schemes_helpscout.topics` | Hex colors for HelpScout topic bars |
+ | `color_schemes_helpscout.status` | Hex colors for conversation status values |
+ | `color_schemes_helpscout.boolean_flags` | Hex colors for refund/cancellation/membership flags |
+ | `sentiment_order` | Display order for sentiment categories |
  | `intent_order` | Display order for intent categories |
+ | `emotion_order` | Display order for emotion categories |
  | `negative_sentiments` | Which sentiment values count as "negative" |
+ | `dashboard.default_date_range_days` | Default date filter window for comment pages |
+ | `helpscout.default_date_range_days` | Default date filter window for HelpScout Analysis |
+ | `helpscout.max_summary_conversations` | Cap on conversations sent to LLM summary agent |
+ | `helpscout.escalation_sentiments` | Sentiment values that count as escalation |
+ | `snowflake.dashboard_query` | Lightweight comment dashboard query |
+ | `snowflake.demographics_query` | Demographics join query (comment pages) |
+ | `helpscout.dashboard_query` | Lightweight HelpScout dashboard query |
+ | `helpscout.demographics_query` | Demographics join query (HelpScout, keyed on email) |
  | `demographics.age_groups` | Age bucket definitions (label β†’ [min, max]) |
  | `demographics.experience_groups` | Experience bucket definitions |
  | `demographics.top_timezones_count` | How many timezones to show in the geographic chart |
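
All of the keys above are plain JSON, so chart and loader code can read them directly. A minimal sketch, assuming the keys listed in the table (config is normally loaded centrally, not per chart function):

```python
import json
from pathlib import Path

# Hedged sketch β€” path and access pattern assume the config layout documented above.
config = json.loads(Path("visualization/config/viz_config.json").read_text())

sentiment_colors = config["color_schemes"]["sentiment_polarity"]   # label β†’ hex
hs_topic_colors = config["color_schemes_helpscout"]["topics"]
hs_default_days = config["helpscout"]["default_date_range_days"]
```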
visualization/agents/helpscout_summary_agent.py ADDED
@@ -0,0 +1,309 @@
+ """
+ HelpScout Summary Agent
+ Generates a page-level summary report from filtered HelpScout conversations.
+ Analyses the already-extracted SUMMARY fields to surface patterns and insights
+ beyond the pre-tagged topics / sentiments.
+ """
+ import sys
+ from pathlib import Path
+ from typing import Any, Dict
+
+ import pandas as pd
+
+ # Ensure visualization/ is on sys.path so agents.*, utils.* imports resolve
+ _parent = Path(__file__).resolve().parent.parent
+ if str(_parent) not in sys.path:
+     sys.path.insert(0, str(_parent))
+
+ from agents.base_agent import BaseVisualizationAgent
+ from utils.llm_helper import LLMHelper
+ from utils.helpscout_utils import topic_label, load_topic_taxonomy
+
+
+ class HelpScoutSummaryAgent(BaseVisualizationAgent):
+     """
+     Produces an executive summary report from a filtered set of HelpScout
+     conversations by reading their SUMMARY fields through an LLM.
+     """
+
+     MAX_SUMMARY_CHARS = 250  # per conversation summary sent to LLM
+
+     def __init__(self, model: str = "gpt-5-nano", temperature: float = 1,
+                  max_conversations: int = 300):
+         super().__init__(name="HelpScoutSummaryAgent", model=model, temperature=temperature)
+         self.llm_helper = LLMHelper(model=model, temperature=temperature)
+         self.max_conversations = max_conversations
+         self.taxonomy = load_topic_taxonomy()
+
+     # ─────────────────────────────────────────────────────────────
+     # BaseVisualizationAgent interface
+     # ─────────────────────────────────────────────────────────────
+
+     def validate_input(self, input_data: Dict[str, Any]) -> bool:
+         if "conversations" not in input_data:
+             self.log_processing("Missing 'conversations' key", level="error")
+             return False
+         if not isinstance(input_data["conversations"], pd.DataFrame):
+             self.log_processing("'conversations' must be a DataFrame", level="error")
+             return False
+         if "summary" not in input_data["conversations"].columns:
+             self.log_processing("DataFrame must contain a 'summary' column", level="error")
+             return False
+         return True
+
+     def process(self, input_data: Dict[str, Any]) -> Dict[str, Any]:
+         """
+         Generate an aggregate summary report from filtered HelpScout conversations.
+
+         Args:
+             input_data: {
+                 'conversations': pd.DataFrame (must have 'summary' column),
+                 'filter_description': str (human-readable applied filters),
+                 'max_conversations': int (optional; overrides instance default),
+             }
+
+         Returns:
+             {
+                 'success': bool,
+                 'summary': {
+                     'executive_summary': str,
+                     'top_themes': [{'theme': str, 'description': str, 'prevalence': str}],
+                     'top_complaints': [str],
+                     'unexpected_insights': [str],
+                     'notable_quotes': [str],
+                 },
+                 'metadata': {
+                     'total_conversations_analyzed': int,
+                     'model_used': str,
+                     'tokens_used': int,
+                     'filter_applied': str,
+                 },
+                 'error': str | None,
+             }
+         """
+         try:
+             if not self.validate_input(input_data):
+                 return {"success": False, "error": "Invalid input data", "summary": None}
+
+             df = input_data["conversations"]
+             filter_desc = input_data.get("filter_description", "No filters applied")
+             max_convs = input_data.get("max_conversations", self.max_conversations)
+
+             total_available = len(df)
+
+             if total_available == 0:
+                 return self._empty_result(filter_desc)
+
+             # Sample if over cap β€” stratified by sentiment to preserve signal
+             df_sample = self._stratified_sample(df, max_convs)
+             n_analyzed = len(df_sample)
+
+             self.log_processing(
+                 f"Analysing {n_analyzed} of {total_available} conversations"
+                 f" (filter: {filter_desc[:60]})"
+             )
+
+             # Build aggregate context for the LLM
+             agg_context = self._build_aggregate_context(df_sample, df)
+             prompt = self._build_prompt(agg_context, filter_desc, n_analyzed)
+
+             system_msg = (
+                 "You are an expert customer support analyst for Musora, "
+                 "a music education platform (Drumeo, Pianote, Guitareo, Singeo, PlayBass). "
+                 "Your role is to synthesize customer support conversation summaries "
+                 "and surface actionable insights that go beyond simple tagging."
+             )
+
+             response = self.llm_helper.get_structured_completion(
+                 prompt=prompt,
+                 system_message=system_msg,
+                 max_retries=3,
+             )
+
+             if not response["success"]:
+                 return self.handle_error(
+                     Exception(response.get("error", "LLM call failed")),
+                     context=f"filter={filter_desc[:60]}"
+                 )
+
+             summary = response["content"]
+             summary = self._ensure_defaults(summary)
+
+             return {
+                 "success": True,
+                 "summary": summary,
+                 "metadata": {
+                     "total_conversations_analyzed": n_analyzed,
+                     "total_available": total_available,
+                     "model_used": response["model"],
+                     "tokens_used": response["usage"]["total_tokens"],
+                     "filter_applied": filter_desc,
+                 },
+                 "error": None,
+             }
+
+         except Exception as e:
+             return self.handle_error(e, context=input_data.get("filter_description", ""))
+
+     # ─────────────────────────────────────────────────────────────
+     # Private helpers
+     # ─────────────────────────────────────────────────────────────
+
+     def _stratified_sample(self, df: pd.DataFrame, cap: int) -> pd.DataFrame:
+         """Stratified sample by sentiment to keep signal diversity."""
+         if len(df) <= cap:
+             return df
+         try:
+             strat_col = "sentiment_polarity"
+             if strat_col in df.columns and df[strat_col].nunique() > 1:
+                 # Proportional allocation per sentiment group
+                 groups = df.groupby(strat_col, group_keys=False)
+                 sampled = groups.apply(
+                     lambda g: g.sample(
+                         n=max(1, int(cap * len(g) / len(df))),
+                         random_state=42,
+                     )
+                 )
+                 return sampled.head(cap)
+         except Exception:
+             pass
+         return df.sample(n=cap, random_state=42)
+
+     def _build_aggregate_context(self, df_sample: pd.DataFrame,
+                                  df_full: pd.DataFrame) -> str:
+         """Build a text block with aggregate stats + conversation summaries."""
+         total = len(df_full)
+         n_sample = len(df_sample)
+
+         # Aggregate stats from the full filtered set
+         stats = []
+         if "sentiment_polarity" in df_full.columns:
+             sent_counts = df_full["sentiment_polarity"].value_counts()
+             sent_pct = (sent_counts / total * 100).round(1)
+             stats.append("Sentiment breakdown: " +
+                          ", ".join(f"{s} {pct}%" for s, pct in sent_pct.items()))
+
+         if "topics" in df_full.columns:
+             from utils.helpscout_utils import explode_topics
+             exploded = explode_topics(df_full)
+             if not exploded.empty:
+                 top_topics = exploded["topic_id"].value_counts().head(8)
+                 topic_strs = [f"{topic_label(t, self.taxonomy)} ({c})" for t, c in top_topics.items()]
+                 stats.append("Top topics: " + ", ".join(topic_strs))
+
+         from utils.helpscout_utils import boolean_flag_counts
+         flags = boolean_flag_counts(df_full)
+         flag_parts = []
+         if flags["is_refund_request"]:
+             flag_parts.append(f"Refund requests: {flags['is_refund_request']}")
+         if flags["is_cancellation"]:
+             flag_parts.append(f"Cancellations: {flags['is_cancellation']}")
+         if flags["is_membership"]:
+             flag_parts.append(f"Membership joins: {flags['is_membership']}")
+         if flag_parts:
+             stats.append(", ".join(flag_parts))
+
+         if "duration_hours" in df_full.columns:
+             avg_dur = df_full["duration_hours"].mean()
+             stats.append(f"Average conversation duration: {avg_dur:.1f} hours")
+
+         stats_block = "\n".join(stats)
+
+         # Individual summaries (capped per conversation); skip rows with no summary
+         summaries = []
+         for i, row in enumerate(df_sample.itertuples(), 1):
+             s = str(getattr(row, "summary", None) or "").strip()
+             if not s:
+                 continue
+             s = s[:self.MAX_SUMMARY_CHARS] + ("…" if len(s) > self.MAX_SUMMARY_CHARS else "")
+             sent = getattr(row, "sentiment_polarity", "")
+             summaries.append(f"[{i}] ({sent}) {s}")
+
+         summaries_block = "\n".join(summaries) if summaries else "No summaries available."
+
+         note = (f"Note: Showing {n_sample} of {total} matched conversations."
+                 if n_sample < total else f"Showing all {total} matched conversations.")
+
+         return f"""=== AGGREGATE STATISTICS ===
+ {stats_block}
+ {note}
+
+ === CONVERSATION SUMMARIES ===
+ {summaries_block}"""
+
+     def _build_prompt(self, context: str, filter_desc: str,
+                       n_analyzed: int) -> str:
+         return f"""Analyze the following {n_analyzed} HelpScout customer support conversation summaries for Musora.
+
+ Applied filters: {filter_desc}
+
+ {context}
+
+ Your task: Synthesize these conversations and produce insights that go BEYOND the pre-extracted tags.
+ Look for underlying patterns, recurring pain points, emotional signals, product gaps, and operational issues
+ that would not be obvious from simple topic counts alone.
+
+ Respond in JSON with this exact structure:
+ {{
+     "executive_summary": "3-5 sentence high-level synthesis of what customers are experiencing",
+     "top_themes": [
+         {{
+             "theme": "Short theme name (not a topic tag)",
+             "description": "What customers are actually saying and feeling about this",
+             "prevalence": "Rough estimate: e.g. 'Appears in ~30% of conversations'"
+         }}
+     ],
+     "top_complaints": [
+         "Specific actionable complaint statement (not generic)"
+     ],
+     "unexpected_insights": [
+         "A pattern, contradiction, or insight that would surprise a product manager"
+     ],
+     "notable_quotes": [
+         "Paraphrased quote or representative statement from conversations (not verbatim)"
+     ]
+ }}
+
+ Guidelines:
+ - Top themes: 5-8 items, each distinct from pre-extracted topics
+ - Top complaints: 5-8 bullet points, specific and actionable
+ - Unexpected insights: 3-5 items, must genuinely go beyond the tag taxonomy
+ - Notable quotes: 3-5 representative paraphrases
+ - If a section has fewer relevant items, use fewer β€” quality over quantity
+ """
+
+     @staticmethod
+     def _ensure_defaults(summary: dict) -> dict:
+         defaults = {
+             "executive_summary": "",
+             "top_themes": [],
+             "top_complaints": [],
+             "unexpected_insights": [],
+             "notable_quotes": [],
+         }
+         for k, v in defaults.items():
+             if k not in summary:
+                 summary[k] = v
+         return summary
+
+     def _empty_result(self, filter_desc: str) -> dict:
+         return {
+             "success": True,
+             "summary": {
+                 "executive_summary": "No conversations matched the selected filters.",
+                 "top_themes": [],
+                 "top_complaints": [],
+                 "unexpected_insights": [],
+                 "notable_quotes": [],
+             },
+             "metadata": {
+                 "total_conversations_analyzed": 0,
+                 "total_available": 0,
+                 "model_used": self.model,
+                 "tokens_used": 0,
+                 "filter_applied": filter_desc,
+             },
+             "error": None,
+         }
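
A hedged usage sketch of the agent, mirroring how the README describes the HelpScout Analysis page calling it. `hs_analysis_df` stands in for the fetched dataframe; the actual call site is `components/helpscout_analysis.py`:

```python
# Illustration only β€” the page renderer owns the real wiring and session_state caching.
agent = HelpScoutSummaryAgent(max_conversations=300)

result = agent.process({
    "conversations": hs_analysis_df,  # DataFrame from load_analysis_data(); needs a 'summary' column
    "filter_description": "2025-01-01 β†’ 2025-03-31, sentiment=negative",
})

if result["success"]:
    summary = result["summary"]
    print(summary["executive_summary"])
    for theme in summary["top_themes"]:
        print(f"- {theme['theme']} ({theme['prevalence']})")
else:
    print("Summary failed:", result["error"])
```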
visualization/app.py CHANGED
@@ -14,9 +14,12 @@ parent_dir = Path(__file__).resolve().parent
  sys.path.append(str(parent_dir))
 
  from data.data_loader import SentimentDataLoader
  from components.dashboard import render_dashboard
  from components.sentiment_analysis import render_sentiment_analysis
  from components.reply_required import render_reply_required
  from utils.auth import check_authentication, render_login_page, logout, get_current_user
 
  # ── Load configuration ────────────────────────────────────────────────────────
@@ -38,15 +41,13 @@ st.set_page_config(
  if not check_authentication():
      render_login_page()
 
- # ── Single data-loader instance (cheap: just reads config) ────────────────────
  data_loader = SentimentDataLoader()
 
 
  def _ensure_dashboard_data():
-     """
-     Load dashboard data once and store in session_state.
-     Subsequent calls within the same session (or until cache expires) are free.
-     """
      if 'dashboard_df' not in st.session_state or st.session_state['dashboard_df'] is None:
          with st.spinner("Loading dashboard data…"):
              df = data_loader.load_dashboard_data()
@@ -54,6 +55,15 @@ def _ensure_dashboard_data():
      return st.session_state['dashboard_df']
 
 
  def main():
      # ── Sidebar ───────────────────────────────────────────────────────────────
      with st.sidebar:
@@ -72,15 +82,22 @@ def main():
 
          page = st.radio(
              "Select Page",
-             ["πŸ“Š Sentiment Dashboard", "πŸ” Custom Sentiment Queries", "πŸ’¬ Reply Required"],
              index=0
          )
 
          st.markdown("---")
          st.markdown("### πŸ” Global Filters")
 
-         # Load / retrieve dashboard data for filter options
          dashboard_df = _ensure_dashboard_data()
 
          if dashboard_df.empty:
              st.error("No data available. Please check your Snowflake connection.")
@@ -148,22 +165,27 @@ def main():
          if st.button("♻️ Reload Data", use_container_width=True):
              st.cache_data.clear()
              st.session_state.pop('dashboard_df', None)
              st.rerun()
 
          # Data info
          st.markdown("---")
          st.markdown("### ℹ️ Data Info")
-         st.info(f"**Total Records:** {len(dashboard_df):,}")
          if 'processed_at' in dashboard_df.columns and not dashboard_df.empty:
              last_update = dashboard_df['processed_at'].max()
              if hasattr(last_update, 'strftime'):
                  st.info(f"**Last Updated:** {last_update.strftime('%Y-%m-%d %H:%M')}")
 
-     # ── Build filtered dashboard_df for the Dashboard page ───────────────────
      filters_applied = st.session_state.get('filters_applied', False)
      global_filters = st.session_state.get('global_filters', {})
 
-     if filters_applied and global_filters:
          filtered_df = data_loader.apply_filters(
              dashboard_df,
              platforms=global_filters.get('platforms') or None,
@@ -190,6 +212,12 @@ def main():
      # RR page fetches its own data on demand; receives only data_loader
      render_reply_required(data_loader)
 
      # ── Footer ────────────────────────────────────────────────────────────────
      st.markdown("---")
      st.markdown(
  sys.path.append(str(parent_dir))
 
  from data.data_loader import SentimentDataLoader
+ from data.helpscout_data_loader import HelpScoutDataLoader
  from components.dashboard import render_dashboard
  from components.sentiment_analysis import render_sentiment_analysis
  from components.reply_required import render_reply_required
+ from components.helpscout_dashboard import render_helpscout_dashboard
+ from components.helpscout_analysis import render_helpscout_analysis
  from utils.auth import check_authentication, render_login_page, logout, get_current_user
 
  # ── Load configuration ────────────────────────────────────────────────────────
 
  if not check_authentication():
      render_login_page()
 
+ # ── Data loader instances (cheap: just read config) ───────────────────────────
  data_loader = SentimentDataLoader()
+ helpscout_loader = HelpScoutDataLoader()
 
 
  def _ensure_dashboard_data():
+     """Load comment dashboard data once and store in session_state."""
      if 'dashboard_df' not in st.session_state or st.session_state['dashboard_df'] is None:
          with st.spinner("Loading dashboard data…"):
              df = data_loader.load_dashboard_data()
      return st.session_state['dashboard_df']
 
 
+ def _ensure_helpscout_data():
+     """Load HelpScout dashboard data once and store in session_state."""
+     if 'helpscout_df' not in st.session_state or st.session_state['helpscout_df'] is None:
+         with st.spinner("Loading HelpScout data…"):
+             hs_df = helpscout_loader.load_dashboard_data()
+             st.session_state['helpscout_df'] = hs_df
+     return st.session_state['helpscout_df']
+
+
  def main():
      # ── Sidebar ───────────────────────────────────────────────────────────────
      with st.sidebar:
 
          page = st.radio(
              "Select Page",
+             [
+                 "πŸ“Š Sentiment Dashboard",
+                 "πŸ” Custom Sentiment Queries",
+                 "πŸ’¬ Reply Required",
+                 "🎧 HelpScout Dashboard",
+                 "πŸ”¬ HelpScout Analysis",
+             ],
              index=0
          )
 
          st.markdown("---")
          st.markdown("### πŸ” Global Filters")
 
+         # Load both data sources at startup
          dashboard_df = _ensure_dashboard_data()
+         _ensure_helpscout_data()
 
          if dashboard_df.empty:
              st.error("No data available. Please check your Snowflake connection.")
 
          if st.button("♻️ Reload Data", use_container_width=True):
              st.cache_data.clear()
              st.session_state.pop('dashboard_df', None)
+             st.session_state.pop('helpscout_df', None)
              st.rerun()
 
          # Data info
          st.markdown("---")
          st.markdown("### ℹ️ Data Info")
+         st.info(f"**Comments:** {len(dashboard_df):,}")
+         hs_df_info = st.session_state.get('helpscout_df')
+         if hs_df_info is not None and not hs_df_info.empty:
+             st.info(f"**HelpScout:** {len(hs_df_info):,} conversations")
          if 'processed_at' in dashboard_df.columns and not dashboard_df.empty:
              last_update = dashboard_df['processed_at'].max()
              if hasattr(last_update, 'strftime'):
                  st.info(f"**Last Updated:** {last_update.strftime('%Y-%m-%d %H:%M')}")
 
+     # ── Build filtered dashboard_df (only applies to comment pages) ─────────
+     _hs_page = page in ("🎧 HelpScout Dashboard", "πŸ”¬ HelpScout Analysis")
      filters_applied = st.session_state.get('filters_applied', False)
      global_filters = st.session_state.get('global_filters', {})
 
+     if not _hs_page and filters_applied and global_filters:
          filtered_df = data_loader.apply_filters(
              dashboard_df,
              platforms=global_filters.get('platforms') or None,
 
      # RR page fetches its own data on demand; receives only data_loader
      render_reply_required(data_loader)
 
+     elif page == "🎧 HelpScout Dashboard":
+         render_helpscout_dashboard(helpscout_loader)
+
+     elif page == "πŸ”¬ HelpScout Analysis":
+         render_helpscout_analysis(helpscout_loader)
+
      # ── Footer ────────────────────────────────────────────────────────────────
      st.markdown("---")
      st.markdown(
visualization/components/dashboard.py CHANGED
@@ -220,6 +220,51 @@ def render_dashboard(df):
 
      st.markdown("---")
 
      # Brand-Platform Matrix
      st.markdown("## πŸ”€ Cross-Dimensional Analysis")
 
@@ -580,4 +625,13 @@ def render_dashboard(df):
      sunburst = distribution_charts.create_combined_distribution_sunburst(
          df, title="Brand > Platform > Sentiment Distribution"
      )
-     st.plotly_chart(sunburst, use_container_width=True)
220
 
221
  st.markdown("---")
222
 
223
+ # Emotion Analysis
224
+ st.markdown("## πŸ’­ Emotion Analysis")
225
+
226
+ if 'emotions' in df.columns and df['emotions'].notna().any():
227
+ col1, col2 = st.columns(2)
228
+
229
+ with col1:
230
+ emotion_bar = distribution_charts.create_emotion_bar_chart(
231
+ df, title="Emotion Distribution", orientation='h'
232
+ )
233
+ st.plotly_chart(emotion_bar, use_container_width=True)
234
+
235
+ with col2:
236
+ emotion_pie = distribution_charts.create_emotion_pie_chart(
237
+ df, title="Emotion Distribution"
238
+ )
239
+ st.plotly_chart(emotion_pie, use_container_width=True)
240
+
241
+ with st.expander("πŸ’‘ Emotion Insights"):
242
+ emotion_dist = processor.get_emotion_distribution(df)
243
+ if not emotion_dist.empty:
244
+ top_emotion = emotion_dist.iloc[0]
245
+ st.write(f"**Most common emotion:** {top_emotion['emotions'].title()} "
246
+ f"({int(top_emotion['count']):,} comments, {top_emotion['percentage']:.1f}%)")
247
+
248
+ negative_emotions = ['frustration', 'disappointment', 'sadness', 'anger']
249
+ neg_emotion_dist = emotion_dist[emotion_dist['emotions'].isin(negative_emotions)]
250
+ if not neg_emotion_dist.empty:
251
+ total_neg = neg_emotion_dist['count'].sum()
252
+ total = emotion_dist['count'].sum()
253
+ st.write(f"**Negative emotions** (frustration, disappointment, sadness, anger): "
254
+ f"{int(total_neg):,} occurrences ({total_neg / total * 100:.1f}%)")
255
+
256
+ positive_emotions = ['joy', 'excitement', 'gratitude', 'admiration']
257
+ pos_emotion_dist = emotion_dist[emotion_dist['emotions'].isin(positive_emotions)]
258
+ if not pos_emotion_dist.empty:
259
+ total_pos = pos_emotion_dist['count'].sum()
260
+ total = emotion_dist['count'].sum()
261
+ st.write(f"**Positive emotions** (joy, excitement, gratitude, admiration): "
262
+ f"{int(total_pos):,} occurrences ({total_pos / total * 100:.1f}%)")
263
+ else:
264
+ st.info("No emotion data available. Emotions are extracted for newly processed comments.")
265
+
266
+ st.markdown("---")
267
+
268
  # Brand-Platform Matrix
269
  st.markdown("## πŸ”€ Cross-Dimensional Analysis")
270
 
 
625
  sunburst = distribution_charts.create_combined_distribution_sunburst(
626
  df, title="Brand > Platform > Sentiment Distribution"
627
  )
628
+ st.plotly_chart(sunburst, use_container_width=True)
629
+
630
+ # ── HelpScout compact summary (additive — no impact on existing charts) ──
631
+ hs_df = st.session_state.get("helpscout_df")
632
+ if hs_df is not None and not hs_df.empty:
633
+ try:
634
+ from components.helpscout_dashboard import render_helpscout_compact_summary
635
+ render_helpscout_compact_summary(hs_df)
636
+ except Exception:
637
+ pass # never break the main dashboard if helpscout module fails
visualization/components/helpscout_analysis.py ADDED
@@ -0,0 +1,491 @@
1
+ """
2
+ HelpScout Analysis Page
3
+ Purpose-built analysis page for HelpScout conversations.
4
+ Mirrors the SA page architecture: filter → fetch → charts → LLM summary → export.
5
+ One page-level summary report for the entire filtered set.
6
+ """
7
+ import sys
8
+ from datetime import date, timedelta
9
+ from pathlib import Path
10
+
11
+ import pandas as pd
12
+ import streamlit as st
13
+
14
+ parent_dir = Path(__file__).resolve().parent.parent
15
+ sys.path.append(str(parent_dir))
16
+
17
+ from visualizations.helpscout_charts import HelpScoutCharts
18
+ from utils.helpscout_utils import (
19
+ boolean_flag_counts, build_filter_description, topic_label, load_topic_taxonomy
20
+ )
21
+ from agents.helpscout_summary_agent import HelpScoutSummaryAgent
22
+
23
+
24
+ def render_helpscout_analysis(data_loader):
25
+ """
26
+ Render the HelpScout Analysis page.
27
+
28
+ Args:
29
+ data_loader: HelpScoutDataLoader instance
30
+ """
31
+ st.title("πŸ”¬ HelpScout Analysis")
32
+ st.markdown(
33
+ "Deep-dive into customer support conversations. Apply filters, fetch the data, "
34
+ "explore distributions, and generate an AI-powered summary report."
35
+ )
36
+ st.markdown("---")
37
+
38
+ charts = HelpScoutCharts()
39
+ taxonomy = load_topic_taxonomy()
40
+
41
+ # ── Filter options from already-loaded dashboard df ───────────────────────
42
+ hs_df = st.session_state.get("helpscout_df")
43
+ if hs_df is None or hs_df.empty:
44
+ st.warning("HelpScout dashboard data not loaded yet. Please wait for the app to initialise.")
45
+ return
46
+
47
+ filter_options = data_loader.get_filter_options(hs_df)
48
+
49
+ # ── Filters ───────────────────────────────────────────────────────────────
50
+ st.markdown("### 🎯 Filters")
51
+
52
+ row1_col1, row1_col2 = st.columns(2)
53
+ with row1_col1:
54
+ min_date = hs_df["first_message_at"].min().date() if "first_message_at" in hs_df.columns and not hs_df.empty else date.today() - timedelta(days=60)
55
+ max_date = hs_df["first_message_at"].max().date() if "first_message_at" in hs_df.columns and not hs_df.empty else date.today()
56
+ default_start = max(min_date, max_date - timedelta(days=data_loader.default_date_range_days))
57
+ date_range = st.date_input(
58
+ "Date Range (First Message At)",
59
+ value=(default_start, max_date),
60
+ min_value=min_date, max_value=max_date,
61
+ key="hs_analysis_date_range",
62
+ )
63
+ with row1_col2:
64
+ top_n_options = [("All", 0), ("50", 50), ("100", 100), ("200", 200), ("500", 500), ("1000", 1000)]
65
+ top_n_label = st.selectbox(
66
+ "Limit Results",
67
+ options=[x[0] for x in top_n_options],
68
+ index=0,
69
+ help="Limit number of conversations fetched. 'All' fetches everything matching your filters.",
70
+ key="hs_analysis_top_n",
71
+ )
72
+ top_n = dict(top_n_options)[top_n_label]
73
+
74
+ row2_col1, row2_col2, row2_col3, row2_col4 = st.columns(4)
75
+ with row2_col1:
76
+ topic_options = filter_options.get("topics", [])
77
+ topic_labels_map = {t: topic_label(t, taxonomy) for t in topic_options}
78
+ selected_topic_labels = st.multiselect(
79
+ "Topics",
80
+ options=[topic_labels_map[t] for t in topic_options],
81
+ default=[],
82
+ key="hs_analysis_topics",
83
+ )
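+ # Map display labels back to raw topic ids (assumes taxonomy labels are unique).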
84
+ label_to_id = {v: k for k, v in topic_labels_map.items()}
85
+ selected_topics = [label_to_id[l] for l in selected_topic_labels if l in label_to_id]
86
+
87
+ with row2_col2:
88
+ selected_sentiments = st.multiselect(
89
+ "Sentiments",
90
+ options=filter_options.get("sentiments", []),
91
+ default=[],
92
+ key="hs_analysis_sentiments",
93
+ )
94
+
95
+ with row2_col3:
96
+ selected_statuses = st.multiselect(
97
+ "Status",
98
+ options=filter_options.get("statuses", []),
99
+ default=[],
100
+ key="hs_analysis_statuses",
101
+ )
102
+
103
+ with row2_col4:
104
+ selected_sources = st.multiselect(
105
+ "Source Type",
106
+ options=filter_options.get("sources", []),
107
+ default=[],
108
+ key="hs_analysis_sources",
109
+ )
110
+
111
+ row3_col1, row3_col2, row3_col3 = st.columns(3)
112
+ with row3_col1:
113
+ refund_only = st.checkbox("Refund Requests Only", key="hs_analysis_refund")
114
+ with row3_col2:
115
+ cancel_only = st.checkbox("Cancellations Only", key="hs_analysis_cancel")
116
+ with row3_col3:
117
+ membership_only = st.checkbox("Membership Joins Only", key="hs_analysis_membership")
118
+
119
+ st.markdown("---")
120
+
121
+ # ── Fetch button ─────────────────────────────────────────────────────────
122
+ dr_tuple = (str(date_range[0]), str(date_range[1])) if date_range and len(date_range) == 2 else None
123
+
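+ # fetch_key is the cache-invalidation token: any change to the filter inputs
+ # below yields a new tuple, marking previously fetched data (and any
+ # generated summary) as stale.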
124
+ fetch_key = (
125
+ dr_tuple,
126
+ tuple(sorted(selected_sentiments)),
127
+ tuple(sorted(selected_topics)),
128
+ tuple(sorted(selected_statuses)),
129
+ tuple(sorted(selected_sources)),
130
+ bool(refund_only), bool(cancel_only), bool(membership_only),
131
+ top_n,
132
+ )
133
+
134
+ has_data = (
135
+ "hs_analysis_df" in st.session_state
136
+ and st.session_state.get("hs_analysis_fetch_key") == fetch_key
137
+ and not st.session_state["hs_analysis_df"].empty
138
+ )
139
+
140
+ fetch_col, info_col = st.columns([1, 3])
141
+ with fetch_col:
142
+ fetch_clicked = st.button("πŸš€ Fetch Data", type="primary",
143
+ use_container_width=True, key="hs_fetch_btn")
144
+ with info_col:
145
+ if has_data:
146
+ n = len(st.session_state["hs_analysis_df"])
147
+ st.success(f"βœ… Showing **{n:,}** conversations matching your filters")
148
+ elif not fetch_clicked:
149
+ st.info("πŸ‘† Set your filters and click **Fetch Data** to query Snowflake.")
150
+
151
+ if fetch_clicked:
152
+ with st.spinner("Fetching HelpScout data from Snowflake…"):
153
+ result_df = data_loader.load_analysis_data(
154
+ sentiments=selected_sentiments or None,
155
+ topics=selected_topics or None,
156
+ refund_only=refund_only,
157
+ cancel_only=cancel_only,
158
+ membership_only=membership_only,
159
+ statuses=selected_statuses or None,
160
+ sources=selected_sources or None,
161
+ date_range=(date_range[0], date_range[1]) if dr_tuple else None,
162
+ top_n=top_n or None,
163
+ )
164
+ applied_filters = {
165
+ "date_range": (date_range[0], date_range[1]) if dr_tuple else None,
166
+ "sentiments": selected_sentiments,
167
+ "topics": selected_topics,
168
+ "statuses": selected_statuses,
169
+ "sources": selected_sources,
170
+ "refund_only": refund_only,
171
+ "cancel_only": cancel_only,
172
+ "membership_only": membership_only,
173
+ }
174
+ st.session_state["hs_analysis_df"] = result_df
175
+ st.session_state["hs_analysis_fetch_key"] = fetch_key
176
+ st.session_state["hs_analysis_filter_desc"] = build_filter_description(applied_filters, taxonomy)
177
+ # Invalidate any prior summary when filters change
178
+ st.session_state.pop("hs_analysis_summary", None)
179
+ st.session_state.pop("hs_analysis_summary_key", None)
180
+ st.session_state["hs_analysis_page"] = 1
181
+ st.rerun()
182
+
183
+ if not has_data and not fetch_clicked:
184
+ return
185
+
186
+ analysis_df = st.session_state.get("hs_analysis_df", pd.DataFrame())
187
+ filter_desc = st.session_state.get("hs_analysis_filter_desc", "No filters applied")
188
+
189
+ if analysis_df.empty:
190
+ st.warning("No conversations found for the selected filters. Try adjusting and re-fetching.")
191
+ return
192
+
193
+ total = len(analysis_df)
194
+ flags = boolean_flag_counts(analysis_df)
195
+ neg_pct = analysis_df["sentiment_polarity"].isin(["negative", "very_negative"]).sum() / total * 100
196
+ avg_dur = float(analysis_df["duration_hours"].mean()) if "duration_hours" in analysis_df.columns else 0.0
197
+
198
+ # ── KPI Row ───────────────────────────────────────────────────────────────
199
+ st.markdown("### πŸ“Š Overview")
200
+ k1, k2, k3, k4, k5 = st.columns(5)
201
+ k1.metric("Conversations", f"{total:,}")
202
+ k2.metric("Negative %", f"{neg_pct:.1f}%")
203
+ k3.metric("Refund Requests", f"{flags['is_refund_request']:,}")
204
+ k4.metric("Cancellations", f"{flags['is_cancellation']:,}")
205
+ k5.metric("Avg Duration (h)", f"{avg_dur:.1f}")
206
+
207
+ st.caption(f"**Active filters:** {filter_desc}")
208
+ st.markdown("---")
209
+
210
+ # ── Distributions ─────────────────────────────────────────────────────────
211
+ st.markdown("### πŸ“ˆ Distributions")
212
+
213
+ col1, col2 = st.columns(2)
214
+ with col1:
215
+ st.plotly_chart(charts.create_sentiment_pie_chart(analysis_df, title="Sentiment Distribution"),
216
+ use_container_width=True, key="hs_analysis_sent_pie")
217
+ with col2:
218
+ st.plotly_chart(charts.create_topic_bar_chart(analysis_df, title="Topic Distribution"),
219
+ use_container_width=True, key="hs_analysis_topic_bar")
220
+
221
+ col1, col2 = st.columns(2)
222
+ with col1:
223
+ st.plotly_chart(charts.create_topic_sentiment_heatmap(analysis_df),
224
+ use_container_width=True, key="hs_analysis_topic_heatmap")
225
+ with col2:
226
+ st.plotly_chart(charts.create_boolean_flags_chart(analysis_df),
227
+ use_container_width=True, key="hs_analysis_flags")
228
+
229
+ if "emotions" in analysis_df.columns and analysis_df["emotions"].notna().any():
230
+ col1, col2 = st.columns(2)
231
+ with col1:
232
+ st.plotly_chart(charts.create_emotion_bar_chart(analysis_df, title="Emotion Distribution"),
233
+ use_container_width=True, key="hs_analysis_emotion")
234
+ with col2:
235
+ st.plotly_chart(charts.create_volume_timeline(analysis_df, title="Volume Over Time"),
236
+ use_container_width=True, key="hs_analysis_vol_timeline")
237
+ else:
238
+ st.plotly_chart(charts.create_volume_timeline(analysis_df, title="Volume Over Time"),
239
+ use_container_width=True, key="hs_analysis_vol_timeline2")
240
+
241
+ st.markdown("---")
242
+
243
+ # ── AI Summary Report ─────────────────────────────────────────────────────
244
+ st.markdown("### πŸ€– AI Summary Report")
245
+ st.markdown(
246
+ "Generate an LLM-powered report from the conversation summaries matching your filters. "
247
+ "The AI looks beyond the pre-extracted tags to surface patterns, pain points, "
248
+ "and actionable insights."
249
+ )
250
+
251
+ summary_available = (
252
+ "hs_analysis_summary" in st.session_state
253
+ and st.session_state.get("hs_analysis_summary_key") == fetch_key
254
+ and st.session_state["hs_analysis_summary"] is not None
255
+ )
256
+
257
+ gen_col, pdf_col = st.columns([1, 1])
258
+ with gen_col:
259
+ gen_clicked = st.button("🧠 Generate Summary Report", type="primary",
260
+ use_container_width=True, key="hs_gen_summary_btn")
261
+ with pdf_col:
262
+ export_pdf_clicked = st.button("πŸ“„ Export as PDF", use_container_width=True,
263
+ key="hs_export_pdf_btn")
264
+
265
+ if gen_clicked:
266
+ with st.spinner("Analysing conversations with AI… this may take 20–40 seconds…"):
267
+ agent = HelpScoutSummaryAgent()
268
+ result = agent.process({
269
+ "conversations": analysis_df,
270
+ "filter_description": filter_desc,
271
+ })
272
+ st.session_state["hs_analysis_summary"] = result
273
+ st.session_state["hs_analysis_summary_key"] = fetch_key
274
+ st.rerun()
275
+
276
+ if export_pdf_clicked:
277
+ with st.spinner("Generating PDF…"):
278
+ try:
279
+ from utils.helpscout_pdf import HelpScoutAnalysisPDF
280
+ import datetime
281
+ summary_result = st.session_state.get("hs_analysis_summary")
282
+ exporter = HelpScoutAnalysisPDF()
283
+ pdf_bytes = exporter.generate_report(
284
+ analysis_df,
285
+ filter_info={"Filters": filter_desc, "Total Conversations": str(total)},
286
+ summary_result=summary_result,
287
+ )
288
+ filename = f"helpscout_analysis_{datetime.datetime.now().strftime('%Y%m%d_%H%M')}.pdf"
289
+ st.success("Report generated!")
290
+ st.download_button(
291
+ label="Download Analysis PDF",
292
+ data=pdf_bytes,
293
+ file_name=filename,
294
+ mime="application/pdf",
295
+ use_container_width=True,
296
+ key="hs_download_pdf_btn",
297
+ )
298
+ except Exception as e:
299
+ st.error(f"Failed to generate PDF: {e}")
300
+ st.exception(e)
301
+
302
+ # Render the summary if available
303
+ if summary_available:
304
+ result = st.session_state["hs_analysis_summary"]
305
+ _render_summary_report(result)
306
+
307
+ st.markdown("---")
308
+
309
+ # ── Conversation Cards ────────────────────────────────────────────────────
310
+ st.markdown("### πŸ’¬ Conversations")
311
+
312
+ if "hs_analysis_page" not in st.session_state:
313
+ st.session_state.hs_analysis_page = 1
314
+
315
+ per_page = 10
316
+ total_pages = max(1, (total + per_page - 1) // per_page)
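+ # Integer ceiling division: e.g. 25 conversations at 10 per page -> 3 pages.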
317
+
318
+ if total > per_page:
319
+ st.info(f"Page {st.session_state.hs_analysis_page} of {total_pages} ({total:,} conversations)")
320
+ pc1, pc2, pc3 = st.columns([1, 2, 1])
321
+ with pc1:
322
+ if st.button("⬅️ Previous", key="hs_prev_top",
323
+ disabled=st.session_state.hs_analysis_page == 1):
324
+ st.session_state.hs_analysis_page -= 1
325
+ st.rerun()
326
+ with pc2:
327
+ st.markdown(
328
+ f"<div style='text-align:center;padding-top:8px;'>"
329
+ f"Page {st.session_state.hs_analysis_page} / {total_pages}</div>",
330
+ unsafe_allow_html=True,
331
+ )
332
+ with pc3:
333
+ if st.button("Next ➑️", key="hs_next_top",
334
+ disabled=st.session_state.hs_analysis_page >= total_pages):
335
+ st.session_state.hs_analysis_page += 1
336
+ st.rerun()
337
+ st.markdown("---")
338
+
339
+ start = (st.session_state.hs_analysis_page - 1) * per_page
340
+ end = min(start + per_page, total)
341
+ page_df = analysis_df.iloc[start:end]
342
+
343
+ for _, row in page_df.iterrows():
344
+ _render_conversation_card(row, taxonomy)
345
+
346
+ # Bottom pagination
347
+ if total > per_page:
348
+ pb1, pb2, pb3 = st.columns([1, 2, 1])
349
+ with pb1:
350
+ if st.button("⬅️ Previous", key="hs_prev_bot",
351
+ disabled=st.session_state.hs_analysis_page == 1):
352
+ st.session_state.hs_analysis_page -= 1
353
+ st.rerun()
354
+ with pb2:
355
+ st.markdown(
356
+ f"<div style='text-align:center;padding-top:8px;'>"
357
+ f"Page {st.session_state.hs_analysis_page} / {total_pages}</div>",
358
+ unsafe_allow_html=True,
359
+ )
360
+ with pb3:
361
+ if st.button("Next ➑️", key="hs_next_bot",
362
+ disabled=st.session_state.hs_analysis_page >= total_pages):
363
+ st.session_state.hs_analysis_page += 1
364
+ st.rerun()
365
+
366
+ st.markdown("---")
367
+
368
+ # ── Export CSV ────────────────────────────────────────────────────────────
369
+ st.markdown("### πŸ’Ύ Export Data")
370
+ export_cols = [c for c in ["conversation_id", "customer_email", "first_message_at",
371
+ "status", "sentiment_polarity", "topics", "summary",
372
+ "is_refund_request", "is_cancellation", "is_membership",
373
+ "duration_hours"] if c in analysis_df.columns]
374
+ csv = analysis_df[export_cols].to_csv(index=False)
375
+ st.download_button(
376
+ label="πŸ“₯ Download as CSV",
377
+ data=csv,
378
+ file_name=f"helpscout_analysis_{total}conversations.csv",
379
+ mime="text/csv",
380
+ key="hs_csv_download",
381
+ )
382
+
383
+
384
+ # ─────────────────────────────────────────────────────────────────────────────
385
+ # Helper renderers
386
+ # ─────────────────────────────────────────────────────────────────────────────
387
+
388
+ def _render_summary_report(result: dict):
389
+ """Render the LLM summary result with nice formatting."""
390
+ if not result.get("success"):
391
+ st.error(f"AI analysis failed: {result.get('error', 'Unknown error')}")
392
+ return
393
+
394
+ summary = result.get("summary", {})
395
+ meta = result.get("metadata", {})
396
+
397
+ with st.container():
398
+ st.markdown("---")
399
+ st.markdown("#### πŸ“‹ Executive Summary")
400
+ st.info(summary.get("executive_summary", ""))
401
+
402
+ col1, col2 = st.columns(2)
403
+
404
+ with col1:
405
+ themes = summary.get("top_themes", [])
406
+ if themes:
407
+ st.markdown("#### 🎯 Top Themes")
408
+ for t in themes:
409
+ st.markdown(
410
+ f"**{t.get('theme', '')}** _{t.get('prevalence', '')}_ \n"
411
+ f"{t.get('description', '')}"
412
+ )
413
+ st.markdown("")
414
+
415
+ insights = summary.get("unexpected_insights", [])
416
+ if insights:
417
+ st.markdown("#### πŸ’‘ Unexpected Insights")
418
+ for ins in insights:
419
+ st.markdown(f"- {ins}")
420
+
421
+ with col2:
422
+ complaints = summary.get("top_complaints", [])
423
+ if complaints:
424
+ st.markdown("#### ⚠️ Top Complaints")
425
+ for c in complaints:
426
+ st.markdown(f"- {c}")
427
+
428
+ quotes = summary.get("notable_quotes", [])
429
+ if quotes:
430
+ st.markdown("#### πŸ’¬ Notable Quotes")
431
+ for q in quotes:
432
+ st.markdown(f"> {q}")
433
+
434
+ with st.expander("ℹ️ Analysis Metadata"):
435
+ mc1, mc2, mc3 = st.columns(3)
436
+ mc1.metric("Conversations Analysed", meta.get("total_conversations_analyzed", 0))
437
+ mc2.metric("Model Used", meta.get("model_used", "N/A"))
438
+ mc3.metric("Tokens Used", meta.get("tokens_used", 0))
439
+ if meta.get("total_available", 0) > meta.get("total_conversations_analyzed", 0):
440
+ st.caption(
441
+ f"Sampled {meta['total_conversations_analyzed']} of "
442
+ f"{meta['total_available']} conversations for this analysis."
443
+ )
444
+
445
+
446
+ def _render_conversation_card(row, taxonomy: dict):
447
+ """Render a single conversation card."""
448
+ sent = str(row.get("sentiment_polarity", "unknown"))
449
+ sent_emoji = {
450
+ "very_positive": "🟒", "positive": "🟩", "neutral": "🟑",
451
+ "negative": "🟠", "very_negative": "πŸ”΄",
452
+ }.get(sent, "βšͺ")
453
+
454
+ topics_list = row.get("topics_list") or []
455
+ topic_labels_str = ", ".join(topic_label(t, taxonomy) for t in topics_list) if topics_list else "—"
456
+
457
+ first_name = str(row.get("customer_first") or "").strip()
458
+ last_name = str(row.get("customer_last") or "").strip()
459
+ customer_str = f"{first_name} {last_name[:1]}." if first_name or last_name else "Anonymous"
460
+
461
+ first_msg = row.get("first_message_at")
462
+ date_str = first_msg.strftime("%Y-%m-%d") if hasattr(first_msg, "strftime") else str(first_msg or "")
463
+
464
+ flags = []
465
+ if row.get("is_refund_request"): flags.append("πŸ’° Refund")
466
+ if row.get("is_cancellation"): flags.append("🚫 Cancel")
467
+ if row.get("is_membership"): flags.append("βœ… Membership")
468
+ flags_str = " | ".join(flags) if flags else ""
469
+
470
+ with st.expander(
471
+ f"{sent_emoji} {customer_str} β€” {topic_labels_str} | {sent.replace('_', ' ').title()} | {date_str}"
472
+ + (f" [{flags_str}]" if flags_str else ""),
473
+ expanded=False,
474
+ ):
475
+ info_col1, info_col2, info_col3 = st.columns(3)
476
+ info_col1.markdown(f"**Status:** {row.get('status', 'β€”')}")
477
+ info_col2.markdown(f"**Source:** {row.get('source_type', 'β€”')}")
478
+ info_col3.markdown(f"**Duration:** {row.get('duration_hours', 0):.1f}h | **Threads:** {row.get('thread_count', 0)}")
479
+
480
+ summary = str(row.get("summary") or "No summary available.")
481
+ st.markdown(f"**Summary:** {summary}")
482
+
483
+ notes_col1, notes_col2 = st.columns(2)
484
+ with notes_col1:
485
+ sent_note = str(row.get("sentiment_notes") or "")
486
+ if sent_note:
487
+ st.markdown(f"**Sentiment Note:** _{sent_note}_")
488
+ with notes_col2:
489
+ topic_note = str(row.get("topic_notes") or "")
490
+ if topic_note:
491
+ st.markdown(f"**Topic Note:** _{topic_note}_")
visualization/components/helpscout_dashboard.py ADDED
@@ -0,0 +1,278 @@
1
+ """
2
+ HelpScout Dashboard Page
3
+ Full dedicated dashboard for HelpScout customer support conversation analysis.
4
+ """
5
+ import sys
6
+ from pathlib import Path
7
+
8
+ import pandas as pd
9
+ import streamlit as st
10
+
11
+ parent_dir = Path(__file__).resolve().parent.parent
12
+ sys.path.append(str(parent_dir))
13
+
14
+ from utils.helpscout_utils import boolean_flag_counts, topic_label, load_topic_taxonomy
15
+ from visualizations.helpscout_charts import HelpScoutCharts
16
+ from visualizations.demographic_charts import DemographicCharts
17
+ from utils.data_processor import SentimentDataProcessor
18
+
19
+
20
+ def _sentiment_score(df) -> float:
21
+ """Compute average sentiment score on a -2 to +2 scale."""
22
+ score_map = {"very_positive": 2, "positive": 1, "neutral": 0,
23
+ "negative": -1, "very_negative": -2}
24
+ if "sentiment_polarity" not in df.columns or df.empty:
25
+ return 0.0
26
+ scores = df["sentiment_polarity"].map(score_map).fillna(0)
27
+ return float(scores.mean())
28
+
29
+
30
+ def render_helpscout_dashboard(data_loader):
31
+ """
32
+ Render the full HelpScout Dashboard page.
33
+
34
+ Args:
35
+ data_loader: HelpScoutDataLoader instance
36
+ """
37
+ st.title("🎧 HelpScout Support Dashboard")
38
+ st.markdown("Customer support conversation analysis from HelpScout.")
39
+
40
+ hs_df = st.session_state.get("helpscout_df")
41
+ if hs_df is None or hs_df.empty:
42
+ st.warning("No HelpScout data available. Please check your Snowflake connection.")
43
+ return
44
+
45
+ charts = HelpScoutCharts()
46
+ taxonomy = load_topic_taxonomy()
47
+
48
+ # ── PDF Export ────────────────────────────────────────────────────────────
49
+ with st.expander("πŸ“„ Export PDF Report", expanded=False):
50
+ st.markdown(
51
+ "Generate a comprehensive HelpScout support report. "
52
+ "Covers sentiment, topics, billing flags, timelines, and demographics."
53
+ )
54
+ if st.button("Generate HelpScout PDF Report", type="primary",
55
+ use_container_width=True, key="hs_dash_pdf_btn"):
56
+ with st.spinner("Generating HelpScout PDF report…"):
57
+ try:
58
+ from utils.helpscout_pdf import HelpScoutDashboardPDF
59
+ exporter = HelpScoutDashboardPDF()
60
+ pdf_bytes = exporter.generate_report(hs_df)
61
+ import datetime
62
+ filename = f"helpscout_dashboard_{datetime.datetime.now().strftime('%Y%m%d_%H%M')}.pdf"
63
+ st.success("Report generated successfully!")
64
+ st.download_button(
65
+ label="Download HelpScout Dashboard PDF",
66
+ data=pdf_bytes,
67
+ file_name=filename,
68
+ mime="application/pdf",
69
+ use_container_width=True,
70
+ )
71
+ except Exception as e:
72
+ st.error(f"Failed to generate report: {e}")
73
+ st.exception(e)
74
+
75
+ st.markdown("---")
76
+
77
+ # ── KPI Row ───────────────────────────────────────────────────────────────
78
+ total = len(hs_df)
79
+ escalation_count = int(hs_df["is_escalation"].sum()) if "is_escalation" in hs_df.columns else 0
80
+ flags = boolean_flag_counts(hs_df)
81
+ neg_pct = (hs_df["sentiment_polarity"].isin(["negative", "very_negative"]).sum() / total * 100) if total else 0
82
+ avg_duration = float(hs_df["duration_hours"].mean()) if "duration_hours" in hs_df.columns else 0.0
83
+
84
+ k1, k2, k3, k4, k5, k6 = st.columns(6)
85
+ k1.metric("Total Conversations", f"{total:,}")
86
+ k2.metric("Avg Duration (h)", f"{avg_duration:.1f}")
87
+ k3.metric("Escalations", f"{escalation_count:,}", delta=f"{escalation_count/total*100:.1f}% of total" if total else None, delta_color="inverse")
88
+ k4.metric("Refund Requests", f"{flags['is_refund_request']:,}")
89
+ k5.metric("Cancellations", f"{flags['is_cancellation']:,}")
90
+ k6.metric("Membership Joins",f"{flags['is_membership']:,}")
91
+
92
+ st.markdown("---")
93
+
94
+ # ── Sentiment ─────────────────────────────────────────────────────────────
95
+ st.markdown("## 🎯 Sentiment Distribution")
96
+ col1, col2 = st.columns(2)
97
+ with col1:
98
+ st.plotly_chart(charts.create_sentiment_pie_chart(hs_df), use_container_width=True)
99
+ with col2:
100
+ avg_score = _sentiment_score(hs_df)
101
+ st.plotly_chart(charts.create_sentiment_score_gauge(avg_score), use_container_width=True)
102
+ m1, m2 = st.columns(2)
103
+ pos_pct = hs_df["sentiment_polarity"].isin(["positive", "very_positive"]).sum() / total * 100 if total else 0
104
+ m1.metric("Positive %", f"{pos_pct:.1f}%")
105
+ m2.metric("Negative %", f"{neg_pct:.1f}%")
106
+
107
+ st.markdown("---")
108
+
109
+ # ── Topics ────────────────────────────────────────────────────────────────
110
+ st.markdown("## 🏷️ Topic Analysis")
111
+ col1, col2 = st.columns(2)
112
+ with col1:
113
+ st.plotly_chart(charts.create_topic_bar_chart(hs_df, title="Conversations by Topic"),
114
+ use_container_width=True)
115
+ with col2:
116
+ st.plotly_chart(charts.create_topic_pie_chart(hs_df, title="Topic Share"),
117
+ use_container_width=True)
118
+
119
+ st.plotly_chart(charts.create_topic_sentiment_heatmap(hs_df), use_container_width=True)
120
+
121
+ st.markdown("---")
122
+
123
+ # ── Emotions ─────────────────────────────────────────────────────────────
124
+ if "emotions" in hs_df.columns and hs_df["emotions"].notna().any():
125
+ st.markdown("## πŸ’­ Emotion Analysis")
126
+ col1, col2 = st.columns(2)
127
+ with col1:
128
+ st.plotly_chart(charts.create_emotion_bar_chart(hs_df, title="Emotion Distribution"),
129
+ use_container_width=True)
130
+ with col2:
131
+ # Reuse the existing DistributionCharts emotion pie (same df structure with emotions col)
132
+ from visualizations.distribution_charts import DistributionCharts
133
+ dist_charts = DistributionCharts()
134
+ st.plotly_chart(dist_charts.create_emotion_pie_chart(hs_df, title="Emotion Share"),
135
+ use_container_width=True)
136
+ st.markdown("---")
137
+
138
+ # ── Billing Flags ─────────────────────────────────────────────────────────
139
+ st.markdown("## πŸ’³ Billing & Membership Flags")
140
+ col1, col2 = st.columns(2)
141
+ with col1:
142
+ st.plotly_chart(charts.create_boolean_flags_chart(hs_df), use_container_width=True)
143
+ with col2:
144
+ st.plotly_chart(charts.create_escalation_breakdown(hs_df), use_container_width=True)
145
+
146
+ st.markdown("---")
147
+
148
+ # ── Status / Source ───────────────────────────────────────────────────────
149
+ st.markdown("## πŸ“¬ Status & Source Distribution")
150
+ col1, col2 = st.columns(2)
151
+ with col1:
152
+ st.plotly_chart(charts.create_status_distribution(hs_df), use_container_width=True)
153
+ with col2:
154
+ st.plotly_chart(charts.create_source_distribution(hs_df), use_container_width=True)
155
+
156
+ st.markdown("---")
157
+
158
+ # ── Volume & Timelines ────────────────────────────────────────────────────
159
+ with st.expander("πŸ“ˆ Volume & Trends", expanded=False):
160
+ freq_col, _ = st.columns([1, 3])
161
+ with freq_col:
162
+ freq = st.selectbox("Time Granularity", ["D", "W", "M"],
163
+ format_func=lambda x: {"D": "Daily", "W": "Weekly", "M": "Monthly"}[x],
164
+ index=1, key="hs_dash_freq")
165
+ st.plotly_chart(charts.create_volume_timeline(hs_df, freq=freq), use_container_width=True)
166
+ st.plotly_chart(charts.create_sentiment_timeline(hs_df, freq=freq), use_container_width=True)
167
+ st.plotly_chart(charts.create_topic_timeline(hs_df, freq=freq), use_container_width=True)
168
+ st.plotly_chart(charts.create_refund_cancel_timeline(hs_df, freq=freq), use_container_width=True)
169
+
170
+ # ── Duration & Thread Count ───────────────────────────────────────────────
171
+ with st.expander("πŸ“Š Conversation Depth", expanded=False):
172
+ col1, col2 = st.columns(2)
173
+ with col1:
174
+ st.plotly_chart(charts.create_duration_histogram(hs_df), use_container_width=True)
175
+ with col2:
176
+ st.plotly_chart(charts.create_thread_count_histogram(hs_df), use_container_width=True)
177
+
178
+ # ── Demographics ─────────────────────────────────────────────────────────
179
+ has_demographics = (
180
+ "age_group" in hs_df.columns
181
+ and "timezone_region" in hs_df.columns
182
+ and (hs_df["age_group"] != "Unknown").any()
183
+ )
184
+ if has_demographics:
185
+ st.markdown("---")
186
+ st.markdown("## πŸ‘₯ Customer Demographics")
187
+ st.info(f"Demographics available for customers whose email matched Musora user records.")
188
+
189
+ processor = SentimentDataProcessor()
190
+ demo_charts = DemographicCharts()
191
+
192
+ demo_col1, demo_col2, demo_col3, demo_col4 = st.columns(4)
193
+ known_demo = int((hs_df["age_group"] != "Unknown").sum())
194
+ demo_col1.metric("With Demographics", f"{known_demo:,}", f"{known_demo/total*100:.1f}% matched")
195
+
196
+ avg_age = hs_df["age"].mean() if "age" in hs_df.columns else None
197
+ demo_col2.metric("Average Age", f"{avg_age:.1f}" if avg_age else "N/A")
198
+
199
+ top_region = hs_df["timezone_region"].value_counts().index[0] if "timezone_region" in hs_df.columns and not hs_df.empty else "N/A"
200
+ demo_col3.metric("Top Region", str(top_region))
201
+
202
+ avg_exp = hs_df["experience_level"].mean() if "experience_level" in hs_df.columns else None
203
+ demo_col4.metric("Avg Experience", f"{avg_exp:.1f}/10" if avg_exp else "N/A")
204
+
205
+ st.markdown("---")
206
+ age_dist = processor.get_demographics_distribution(hs_df, "age_group")
207
+ if not age_dist.empty:
208
+ st.markdown("### Age Distribution")
209
+ col1, col2 = st.columns(2)
210
+ with col1:
211
+ st.plotly_chart(demo_charts.create_age_distribution_chart(age_dist), use_container_width=True)
212
+ with col2:
213
+ age_sent = processor.get_demographics_by_sentiment(hs_df, "age_group")
214
+ if not age_sent.empty:
215
+ st.plotly_chart(demo_charts.create_age_sentiment_chart(age_sent), use_container_width=True)
216
+
217
+ region_dist = processor.get_timezone_regions_distribution(hs_df)
218
+ if not region_dist.empty:
219
+ st.markdown("### Geographic Distribution")
220
+ col1, col2 = st.columns(2)
221
+ with col1:
222
+ st.plotly_chart(demo_charts.create_region_distribution_chart(region_dist), use_container_width=True)
223
+ with col2:
224
+ region_sent = processor.get_demographics_by_sentiment(hs_df, "timezone_region")
225
+ if not region_sent.empty:
226
+ st.plotly_chart(demo_charts.create_region_sentiment_chart(region_sent), use_container_width=True)
227
+
228
+ st.markdown("---")
229
+ st.caption(
230
+ "Data source: SOCIAL_MEDIA_DB.ML_FEATURES.HELPSCOUT_CONVERSATION_FEATURES | "
231
+ f"Last processed: {hs_df['processed_at'].max().strftime('%Y-%m-%d %H:%M') if 'processed_at' in hs_df.columns and not hs_df.empty else 'Unknown'}"
232
+ )
233
+
234
+
235
+ # ─────────────────────────────────────────────────────────────────────────────
236
+ # Compact summary for embedding in the main Sentiment Dashboard
237
+ # ─────────────────────────────────────────────────────────────────────────────
238
+
239
+ def render_helpscout_compact_summary(hs_df):
240
+ """
241
+ A one-screen HelpScout summary section embedded at the bottom of the
242
+ main Sentiment Dashboard. Kept purposely brief.
243
+ """
244
+ st.markdown("---")
245
+ st.markdown("## 🎧 HelpScout Support β€” Quick View")
246
+ st.caption(f"{len(hs_df):,} processed customer conversations")
247
+
248
+ total = len(hs_df)
249
+ if total == 0:
250
+ st.info("No HelpScout conversations available.")
251
+ return
252
+
253
+ charts = HelpScoutCharts()
254
+ flags = boolean_flag_counts(hs_df)
255
+ escalation_count = int(hs_df["is_escalation"].sum()) if "is_escalation" in hs_df.columns else 0
256
+ avg_dur = float(hs_df["duration_hours"].mean()) if "duration_hours" in hs_df.columns else 0.0
257
+
258
+ k1, k2, k3, k4 = st.columns(4)
259
+ k1.metric("Conversations", f"{total:,}")
260
+ k2.metric("Escalations", f"{escalation_count:,}", delta=f"{escalation_count/total*100:.1f}%", delta_color="inverse")
261
+ k3.metric("Refund Requests", f"{flags['is_refund_request']:,}")
262
+ k4.metric("Avg Duration (h)", f"{avg_dur:.1f}")
263
+
264
+ col1, col2 = st.columns(2)
265
+ with col1:
266
+ st.plotly_chart(
267
+ charts.create_sentiment_pie_chart(hs_df, title="HelpScout Sentiment"),
268
+ use_container_width=True,
269
+ key="hs_compact_sentiment_pie",
270
+ )
271
+ with col2:
272
+ st.plotly_chart(
273
+ charts.create_topic_bar_chart(hs_df, title="Top Topics", top_n=5),
274
+ use_container_width=True,
275
+ key="hs_compact_topic_bar",
276
+ )
277
+
278
+ st.info("πŸ‘‰ Navigate to **🎧 HelpScout Dashboard** for the full analysis.")
visualization/components/sentiment_analysis.py CHANGED
@@ -116,7 +116,7 @@ def render_sentiment_analysis(data_loader):
116
  mask = (dashboard_df['platform'] == selected_platform) & (dashboard_df['brand'] == selected_brand)
117
  preview_df = dashboard_df[mask]
118
 
119
- filter_col1, filter_col2, filter_col3, filter_col4 = st.columns(4)
120
 
121
  with filter_col1:
122
  sentiment_options = sorted(preview_df['sentiment_polarity'].unique().tolist())
@@ -141,6 +141,20 @@ def render_sentiment_analysis(data_loader):
141
  )
142
 
143
  with filter_col3:
144
  top_n = st.selectbox(
145
  "Top N Contents",
146
  options=[5, 10, 15, 20, 25],
@@ -148,12 +162,12 @@ def render_sentiment_analysis(data_loader):
148
  help="Number of contents to display"
149
  )
150
 
151
- with filter_col4:
152
- filter_active = bool(selected_sentiments or selected_intents)
153
  st.metric(
154
  "Filters Active",
155
  "βœ“ Yes" if filter_active else "βœ— No",
156
- help="Sentiment or intent filters applied" if filter_active else "Showing all sentiments"
157
  )
158
 
159
  st.markdown("---")
@@ -200,6 +214,7 @@ def render_sentiment_analysis(data_loader):
200
  fetch_key = (
201
  selected_platform, selected_brand, top_n, min_comments, sort_by_value,
202
  tuple(sorted(selected_sentiments)), tuple(sorted(selected_intents)),
 
203
  str(query_date_range)
204
  )
205
 
@@ -234,6 +249,7 @@ def render_sentiment_analysis(data_loader):
234
  sort_by=sort_by_value,
235
  sentiments=selected_sentiments or None,
236
  intents=selected_intents or None,
 
237
  date_range=query_date_range,
238
  )
239
  st.session_state['sa_contents'] = contents_df
@@ -332,7 +348,7 @@ def render_sentiment_analysis(data_loader):
332
  if content_comments.empty:
333
  st.info("No sampled comment details available for this content.")
334
  else:
335
- viz_col1, viz_col2 = st.columns(2)
336
  with viz_col1:
337
  pie = sentiment_charts.create_sentiment_pie_chart(
338
  content_comments, title="Sentiment Distribution (sample)"
@@ -345,6 +361,12 @@ def render_sentiment_analysis(data_loader):
345
  )
346
  st.plotly_chart(bar, use_container_width=True,
347
  key=f"intent_bar_{content_row['content_sk']}")
348
 
349
  # AI Analysis
350
  st.markdown("#### πŸ€– AI-Powered Analysis")
@@ -500,7 +522,7 @@ def render_sentiment_analysis(data_loader):
500
  comments_df['content_sk'].isin(filtered_contents['content_sk'])
501
  ] if not comments_df.empty else pd.DataFrame()
502
 
503
- insight_col1, insight_col2 = st.columns(2)
504
  with insight_col1:
505
  st.markdown("#### 🎯 Common Intent Patterns")
506
  if not all_sampled.empty:
@@ -509,6 +531,16 @@ def render_sentiment_analysis(data_loader):
509
  st.markdown(f"- **{row['intent']}**: {row['count']} ({row['percentage']:.1f}%)")
510
 
511
  with insight_col2:
512
  st.markdown("#### 🌐 Platform Breakdown")
513
  if not all_sampled.empty:
514
  for platform, count in all_sampled['platform'].value_counts().items():
 
116
  mask = (dashboard_df['platform'] == selected_platform) & (dashboard_df['brand'] == selected_brand)
117
  preview_df = dashboard_df[mask]
118
 
119
+ filter_col1, filter_col2, filter_col3, filter_col4, filter_col5 = st.columns(5)
120
 
121
  with filter_col1:
122
  sentiment_options = sorted(preview_df['sentiment_polarity'].unique().tolist())
 
141
  )
142
 
143
  with filter_col3:
144
+ emotion_list = (
145
+ preview_df['emotions']
146
+ .str.split(',').explode().str.strip()
147
+ .dropna().unique().tolist()
148
+ if 'emotions' in preview_df.columns else []
149
+ )
150
+ selected_emotions = st.multiselect(
151
+ "Emotion",
152
+ options=sorted(e for e in emotion_list if e),
153
+ default=[],
154
+ help="Filter contents that have comments with these emotions"
155
+ )
156
+
157
+ with filter_col4:
158
  top_n = st.selectbox(
159
  "Top N Contents",
160
  options=[5, 10, 15, 20, 25],
 
162
  help="Number of contents to display"
163
  )
164
 
165
+ with filter_col5:
166
+ filter_active = bool(selected_sentiments or selected_intents or selected_emotions)
167
  st.metric(
168
  "Filters Active",
169
  "βœ“ Yes" if filter_active else "βœ— No",
170
+ help="Sentiment, intent, or emotion filters applied" if filter_active else "Showing all sentiments"
171
  )
172
 
173
  st.markdown("---")
 
214
  fetch_key = (
215
  selected_platform, selected_brand, top_n, min_comments, sort_by_value,
216
  tuple(sorted(selected_sentiments)), tuple(sorted(selected_intents)),
217
+ tuple(sorted(selected_emotions)),
218
  str(query_date_range)
219
  )
220
 
 
249
  sort_by=sort_by_value,
250
  sentiments=selected_sentiments or None,
251
  intents=selected_intents or None,
252
+ emotions=selected_emotions or None,
253
  date_range=query_date_range,
254
  )
255
  st.session_state['sa_contents'] = contents_df
 
348
  if content_comments.empty:
349
  st.info("No sampled comment details available for this content.")
350
  else:
351
+ viz_col1, viz_col2, viz_col3 = st.columns(3)
352
  with viz_col1:
353
  pie = sentiment_charts.create_sentiment_pie_chart(
354
  content_comments, title="Sentiment Distribution (sample)"
 
361
  )
362
  st.plotly_chart(bar, use_container_width=True,
363
  key=f"intent_bar_{content_row['content_sk']}")
364
+ with viz_col3:
365
+ emotion_bar = distribution_charts.create_emotion_bar_chart(
366
+ content_comments, title="Emotion Distribution (sample)", orientation='h'
367
+ )
368
+ st.plotly_chart(emotion_bar, use_container_width=True,
369
+ key=f"emotion_bar_{content_row['content_sk']}")
370
 
371
  # AI Analysis
372
  st.markdown("#### πŸ€– AI-Powered Analysis")
 
522
  comments_df['content_sk'].isin(filtered_contents['content_sk'])
523
  ] if not comments_df.empty else pd.DataFrame()
524
 
525
+ insight_col1, insight_col2, insight_col3 = st.columns(3)
526
  with insight_col1:
527
  st.markdown("#### 🎯 Common Intent Patterns")
528
  if not all_sampled.empty:
 
531
  st.markdown(f"- **{row['intent']}**: {row['count']} ({row['percentage']:.1f}%)")
532
 
533
  with insight_col2:
534
+ st.markdown("#### πŸ’­ Top Emotions")
535
+ if not all_sampled.empty:
536
+ emotion_dist = processor.get_emotion_distribution(all_sampled)
537
+ if not emotion_dist.empty:
538
+ for _, row in emotion_dist.sort_values('count', ascending=False).head(5).iterrows():
539
+ st.markdown(f"- **{row['emotions'].title()}**: {row['count']} ({row['percentage']:.1f}%)")
540
+ else:
541
+ st.info("No emotion data available.")
542
+
543
+ with insight_col3:
544
  st.markdown("#### 🌐 Platform Breakdown")
545
  if not all_sampled.empty:
546
  for platform, count in all_sampled['platform'].value_counts().items():
visualization/config/viz_config.json CHANGED
@@ -17,6 +17,19 @@
17
  "off_topic": "#9E9E9E",
18
  "spam_selfpromo": "#795548"
19
  },
20
  "platform": {
21
  "facebook": "#1877F2",
22
  "instagram": "#E4405F",
@@ -49,6 +62,19 @@
49
  "off_topic",
50
  "spam_selfpromo"
51
  ],
52
  "negative_sentiments": [
53
  "negative",
54
  "very_negative"
@@ -67,7 +93,7 @@
67
  },
68
  "snowflake": {
69
  "query": "SELECT s.COMMENT_SK, s.COMMENT_ID, s.ORIGINAL_TEXT, s.PLATFORM, s.COMMENT_TIMESTAMP, s.AUTHOR_NAME, s.AUTHOR_ID, CAST(NULL AS VARCHAR(16777216)) as PARENT_COMMENT_ID, CAST(NULL AS VARCHAR(16777216)) as PARENT_COMMENT_TEXT, s.CONTENT_SK, s.CONTENT_ID, s.CONTENT_DESCRIPTION, s.CHANNEL_SK, s.CHANNEL_NAME, s.CHANNEL_DISPLAY_NAME, s.DETECTED_LANGUAGE, s.LANGUAGE_CODE, s.IS_ENGLISH, s.LANGUAGE_CONFIDENCE, s.DETECTION_METHOD, s.HAS_TEXT, s.TRANSLATED_TEXT, s.TRANSLATION_PERFORMED, s.TRANSLATION_CONFIDENCE, s.TRANSLATION_NOTES, s.SENTIMENT_POLARITY, s.INTENT, s.REQUIRES_REPLY, s.SENTIMENT_CONFIDENCE, s.ANALYSIS_NOTES, s.PROCESSING_SUCCESS, CAST(NULL AS VARCHAR(16777216)) as PROCESSING_ERRORS, s.PROCESSED_AT, s.WORKFLOW_VERSION, CAST(NULL AS TIMESTAMP_NTZ(9)) as CREATED_AT, CAST(NULL AS TIMESTAMP_NTZ(9)) as UPDATED_AT, s.CHANNEL_NAME as BRAND, c.PERMALINK_URL, CAST(NULL AS VARCHAR(16777216)) as THUMBNAIL_URL FROM SOCIAL_MEDIA_DB.ML_FEATURES.COMMENT_SENTIMENT_FEATURES s LEFT JOIN SOCIAL_MEDIA_DB.CORE.DIM_CONTENT c ON s.CONTENT_SK = c.CONTENT_SK UNION ALL SELECT COMMENT_SK, COMMENT_ID, ORIGINAL_TEXT, CASE WHEN PLATFORM = 'musora' THEN 'musora_app' ELSE PLATFORM END as PLATFORM, COMMENT_TIMESTAMP, AUTHOR_NAME, AUTHOR_ID, PARENT_COMMENT_ID, PARENT_COMMENT_TEXT, CONTENT_SK, CONTENT_ID, CONTENT_DESCRIPTION, CHANNEL_SK, CHANNEL_NAME, CHANNEL_DISPLAY_NAME, DETECTED_LANGUAGE, LANGUAGE_CODE, IS_ENGLISH, LANGUAGE_CONFIDENCE, DETECTION_METHOD, HAS_TEXT, TRANSLATED_TEXT, TRANSLATION_PERFORMED, TRANSLATION_CONFIDENCE, TRANSLATION_NOTES, SENTIMENT_POLARITY, INTENT, REQUIRES_REPLY, SENTIMENT_CONFIDENCE, ANALYSIS_NOTES, PROCESSING_SUCCESS, PROCESSING_ERRORS, PROCESSED_AT, WORKFLOW_VERSION, CREATED_AT, UPDATED_AT, CHANNEL_NAME as BRAND, PERMALINK_URL, THUMBNAIL_URL FROM SOCIAL_MEDIA_DB.ML_FEATURES.MUSORA_COMMENT_SENTIMENT_FEATURES",
70
- "dashboard_query": "SELECT s.COMMENT_SK, s.CONTENT_SK, LOWER(s.PLATFORM) AS PLATFORM, LOWER(s.CHANNEL_NAME) AS BRAND, s.SENTIMENT_POLARITY, s.INTENT, s.REQUIRES_REPLY, s.DETECTED_LANGUAGE, s.COMMENT_TIMESTAMP, s.PROCESSED_AT, s.AUTHOR_ID FROM SOCIAL_MEDIA_DB.ML_FEATURES.COMMENT_SENTIMENT_FEATURES s UNION ALL SELECT COMMENT_SK, CONTENT_SK, CASE WHEN LOWER(PLATFORM) = 'musora' THEN 'musora_app' ELSE LOWER(PLATFORM) END AS PLATFORM, LOWER(CHANNEL_NAME) AS BRAND, SENTIMENT_POLARITY, INTENT, REQUIRES_REPLY, DETECTED_LANGUAGE, COMMENT_TIMESTAMP, PROCESSED_AT, AUTHOR_ID FROM SOCIAL_MEDIA_DB.ML_FEATURES.MUSORA_COMMENT_SENTIMENT_FEATURES",
71
  "demographics_query": "SELECT u.id as USER_ID, u.birthday as BIRTHDAY, u.timezone as TIMEZONE, GREATEST(COALESCE(p.difficulty, 0), COALESCE(p.self_report_difficulty, 0), COALESCE(p.method_experience, 0)) AS EXPERIENCE_LEVEL FROM stitch.musora_ecom_db.usora_users u JOIN online_recsys.preprocessed.users p ON u.id = p.user_id"
72
  },
73
  "demographics": {
@@ -84,5 +110,39 @@
84
  "Advanced (8-10)": [8, 10]
85
  },
86
  "top_timezones_count": 15
87
  }
88
  }
 
17
  "off_topic": "#9E9E9E",
18
  "spam_selfpromo": "#795548"
19
  },
20
+ "emotion": {
21
+ "joy": "#FFD700",
22
+ "excitement": "#FF6B35",
23
+ "gratitude": "#4CAF50",
24
+ "admiration": "#2196F3",
25
+ "curiosity": "#00BCD4",
26
+ "humor": "#9C27B0",
27
+ "frustration": "#FF9800",
28
+ "disappointment": "#795548",
29
+ "sadness": "#607D8B",
30
+ "anger": "#D32F2F",
31
+ "neutral": "#9E9E9E"
32
+ },
33
  "platform": {
34
  "facebook": "#1877F2",
35
  "instagram": "#E4405F",
 
62
  "off_topic",
63
  "spam_selfpromo"
64
  ],
65
+ "emotion_order": [
66
+ "joy",
67
+ "excitement",
68
+ "gratitude",
69
+ "admiration",
70
+ "curiosity",
71
+ "humor",
72
+ "frustration",
73
+ "disappointment",
74
+ "sadness",
75
+ "anger",
76
+ "neutral"
77
+ ],
78
  "negative_sentiments": [
79
  "negative",
80
  "very_negative"
 
93
  },
94
  "snowflake": {
95
  "query": "SELECT s.COMMENT_SK, s.COMMENT_ID, s.ORIGINAL_TEXT, s.PLATFORM, s.COMMENT_TIMESTAMP, s.AUTHOR_NAME, s.AUTHOR_ID, CAST(NULL AS VARCHAR(16777216)) as PARENT_COMMENT_ID, CAST(NULL AS VARCHAR(16777216)) as PARENT_COMMENT_TEXT, s.CONTENT_SK, s.CONTENT_ID, s.CONTENT_DESCRIPTION, s.CHANNEL_SK, s.CHANNEL_NAME, s.CHANNEL_DISPLAY_NAME, s.DETECTED_LANGUAGE, s.LANGUAGE_CODE, s.IS_ENGLISH, s.LANGUAGE_CONFIDENCE, s.DETECTION_METHOD, s.HAS_TEXT, s.TRANSLATED_TEXT, s.TRANSLATION_PERFORMED, s.TRANSLATION_CONFIDENCE, s.TRANSLATION_NOTES, s.SENTIMENT_POLARITY, s.INTENT, s.REQUIRES_REPLY, s.SENTIMENT_CONFIDENCE, s.ANALYSIS_NOTES, s.PROCESSING_SUCCESS, CAST(NULL AS VARCHAR(16777216)) as PROCESSING_ERRORS, s.PROCESSED_AT, s.WORKFLOW_VERSION, CAST(NULL AS TIMESTAMP_NTZ(9)) as CREATED_AT, CAST(NULL AS TIMESTAMP_NTZ(9)) as UPDATED_AT, s.CHANNEL_NAME as BRAND, c.PERMALINK_URL, CAST(NULL AS VARCHAR(16777216)) as THUMBNAIL_URL FROM SOCIAL_MEDIA_DB.ML_FEATURES.COMMENT_SENTIMENT_FEATURES s LEFT JOIN SOCIAL_MEDIA_DB.CORE.DIM_CONTENT c ON s.CONTENT_SK = c.CONTENT_SK UNION ALL SELECT COMMENT_SK, COMMENT_ID, ORIGINAL_TEXT, CASE WHEN PLATFORM = 'musora' THEN 'musora_app' ELSE PLATFORM END as PLATFORM, COMMENT_TIMESTAMP, AUTHOR_NAME, AUTHOR_ID, PARENT_COMMENT_ID, PARENT_COMMENT_TEXT, CONTENT_SK, CONTENT_ID, CONTENT_DESCRIPTION, CHANNEL_SK, CHANNEL_NAME, CHANNEL_DISPLAY_NAME, DETECTED_LANGUAGE, LANGUAGE_CODE, IS_ENGLISH, LANGUAGE_CONFIDENCE, DETECTION_METHOD, HAS_TEXT, TRANSLATED_TEXT, TRANSLATION_PERFORMED, TRANSLATION_CONFIDENCE, TRANSLATION_NOTES, SENTIMENT_POLARITY, INTENT, REQUIRES_REPLY, SENTIMENT_CONFIDENCE, ANALYSIS_NOTES, PROCESSING_SUCCESS, PROCESSING_ERRORS, PROCESSED_AT, WORKFLOW_VERSION, CREATED_AT, UPDATED_AT, CHANNEL_NAME as BRAND, PERMALINK_URL, THUMBNAIL_URL FROM SOCIAL_MEDIA_DB.ML_FEATURES.MUSORA_COMMENT_SENTIMENT_FEATURES",
96
+ "dashboard_query": "SELECT s.COMMENT_SK, s.CONTENT_SK, LOWER(s.PLATFORM) AS PLATFORM, LOWER(s.CHANNEL_NAME) AS BRAND, s.SENTIMENT_POLARITY, s.INTENT, s.EMOTIONS, s.REQUIRES_REPLY, s.DETECTED_LANGUAGE, s.COMMENT_TIMESTAMP, s.PROCESSED_AT, s.AUTHOR_ID FROM SOCIAL_MEDIA_DB.ML_FEATURES.COMMENT_SENTIMENT_FEATURES s UNION ALL SELECT COMMENT_SK, CONTENT_SK, CASE WHEN LOWER(PLATFORM) = 'musora' THEN 'musora_app' ELSE LOWER(PLATFORM) END AS PLATFORM, LOWER(CHANNEL_NAME) AS BRAND, SENTIMENT_POLARITY, INTENT, EMOTIONS, REQUIRES_REPLY, DETECTED_LANGUAGE, COMMENT_TIMESTAMP, PROCESSED_AT, AUTHOR_ID FROM SOCIAL_MEDIA_DB.ML_FEATURES.MUSORA_COMMENT_SENTIMENT_FEATURES",
97
  "demographics_query": "SELECT u.id as USER_ID, u.birthday as BIRTHDAY, u.timezone as TIMEZONE, GREATEST(COALESCE(p.difficulty, 0), COALESCE(p.self_report_difficulty, 0), COALESCE(p.method_experience, 0)) AS EXPERIENCE_LEVEL FROM stitch.musora_ecom_db.usora_users u JOIN online_recsys.preprocessed.users p ON u.id = p.user_id"
98
  },
99
  "demographics": {
 
110
  "Advanced (8-10)": [8, 10]
111
  },
112
  "top_timezones_count": 15
113
+ },
114
+ "helpscout": {
115
+ "dashboard_query": "SELECT CONVERSATION_ID, LOWER(CUSTOMER_EMAIL) AS CUSTOMER_EMAIL, THREAD_COUNT, FIRST_MESSAGE_AT, LAST_MESSAGE_AT, DURATION_HOURS, STATUS, STATE, SOURCE_TYPE, SOURCE_VIA, SENTIMENT_POLARITY, EMOTIONS, TOPICS, IS_REFUND_REQUEST, IS_CANCELLATION, IS_MEMBERSHIP, SENTIMENT_CONFIDENCE, TOPIC_CONFIDENCE, PROCESSED_AT FROM SOCIAL_MEDIA_DB.ML_FEATURES.HELPSCOUT_CONVERSATION_FEATURES",
116
+ "demographics_query": "SELECT LOWER(u.email) AS CUSTOMER_EMAIL, TO_VARCHAR(u.birthday, 'YYYY-MM-DD HH24:MI:SS.FF6 TZHTZM') AS BIRTHDAY, u.timezone AS TIMEZONE, GREATEST(COALESCE(p.difficulty, 0), COALESCE(p.self_report_difficulty, 0), COALESCE(p.method_experience, 0)) AS EXPERIENCE_LEVEL FROM stitch.musora_ecom_db.usora_users u JOIN online_recsys.preprocessed.users p ON u.id = p.user_id WHERE u.email IS NOT NULL",
117
+ "default_top_n": 10,
118
+ "default_date_range_days": 60,
119
+ "escalation_sentiments": ["negative", "very_negative"],
120
+ "max_summary_conversations": 300
121
+ },
122
+ "color_schemes_helpscout": {
123
+ "topics": {
124
+ "video_and_playback": "#1982C4",
125
+ "app_and_technical_errors": "#D32F2F",
126
+ "navigation_and_ux": "#9C27B0",
127
+ "account_and_access": "#FF6F00",
128
+ "billing_and_subscription": "#00C851",
129
+ "learning_and_progress": "#2196F3",
130
+ "content_and_resources": "#4CAF50",
131
+ "community_and_notifications":"#FFB300",
132
+ "feedback_and_suggestions": "#00BCD4",
133
+ "uncategorized": "#9E9E9E"
134
+ },
135
+ "status": {
136
+ "active": "#FF6F00",
137
+ "pending": "#FFB300",
138
+ "closed": "#4CAF50",
139
+ "spam": "#9E9E9E",
140
+ "default": "#607D8B"
141
+ },
142
+ "boolean_flags": {
143
+ "is_refund_request": "#D32F2F",
144
+ "is_cancellation": "#FF6F00",
145
+ "is_membership": "#00C851"
146
+ }
147
  }
148
  }
visualization/data/data_loader.py CHANGED
@@ -90,6 +90,10 @@ class SentimentDataLoader:
90
  df['platform'] = df['platform'].fillna('unknown').str.lower()
91
  df['brand'] = df['brand'].fillna('unknown').str.lower()
92
 
93
  if 'requires_reply' in df.columns:
94
  df['requires_reply'] = df['requires_reply'].astype(bool)
95
 
@@ -166,7 +170,7 @@ class SentimentDataLoader:
166
 
167
  def load_sa_data(self, platform, brand, top_n=10, min_comments=10,
168
  sort_by='severity_score', sentiments=None, intents=None,
169
- date_range=None):
170
  """
171
  Load Sentiment Analysis page data:
172
  1. Content aggregation stats for top-N contents
@@ -180,6 +184,7 @@ class SentimentDataLoader:
180
  sort_by: 'severity_score' | 'sentiment_percentage' | 'sentiment_count' | 'total_comments'
181
  sentiments: List of sentiments to filter by (dominant_sentiment)
182
  intents: List of intents to filter by
 
183
  date_range: Tuple (start_date, end_date) or None
184
 
185
  Returns:
@@ -187,16 +192,17 @@ class SentimentDataLoader:
187
  """
188
  sentiments_key = tuple(sorted(sentiments)) if sentiments else ()
189
  intents_key = tuple(sorted(intents)) if intents else ()
 
190
  date_key = (str(date_range[0]), str(date_range[1])) if date_range and len(date_range) == 2 else ()
191
 
192
  return self._fetch_sa_data(
193
  platform, brand, top_n, min_comments, sort_by,
194
- sentiments_key, intents_key, date_key
195
  )
196
 
197
  @st.cache_data(ttl=86400)
198
  def _fetch_sa_data(_self, platform, brand, top_n, min_comments, sort_by,
199
- sentiments, intents, date_range):
200
  """Cached SA data fetch β€” returns (contents_df, comments_df)."""
201
  try:
202
  conn = SnowFlakeConn()
@@ -245,6 +251,16 @@ class SentimentDataLoader:
245
  ]['content_sk'].unique()
246
  contents_df = contents_df[contents_df['content_sk'].isin(valid_sks)]
247
  comments_df = comments_df[comments_df['content_sk'].isin(valid_sks)]
248
  else:
249
  comments_df = pd.DataFrame()
250
 
@@ -387,7 +403,7 @@ class SentimentDataLoader:
387
  LOWER(s.PLATFORM) AS PLATFORM,
388
  LOWER(s.CHANNEL_NAME) AS BRAND,
389
  s.COMMENT_TIMESTAMP, s.AUTHOR_NAME,
390
- s.DETECTED_LANGUAGE, s.SENTIMENT_POLARITY, s.INTENT,
391
  s.REQUIRES_REPLY, s.SENTIMENT_CONFIDENCE, s.IS_ENGLISH,
392
  c.PERMALINK_URL
393
  FROM SOCIAL_MEDIA_DB.ML_FEATURES.COMMENT_SENTIMENT_FEATURES s
@@ -407,7 +423,7 @@ class SentimentDataLoader:
407
  'musora_app' AS PLATFORM,
408
  LOWER(CHANNEL_NAME) AS BRAND,
409
  COMMENT_TIMESTAMP, AUTHOR_NAME,
410
- DETECTED_LANGUAGE, SENTIMENT_POLARITY, INTENT,
411
  REQUIRES_REPLY, SENTIMENT_CONFIDENCE, IS_ENGLISH,
412
  PERMALINK_URL
413
  FROM SOCIAL_MEDIA_DB.ML_FEATURES.MUSORA_COMMENT_SENTIMENT_FEATURES
@@ -448,6 +464,10 @@ class SentimentDataLoader:
448
  df['intent'] = df['intent'].fillna('unknown')
449
  df['platform'] = df['platform'].fillna('unknown').str.lower()
450
 
451
  if 'requires_reply' in df.columns:
452
  df['requires_reply'] = df['requires_reply'].astype(bool)
453
 
 
90
  df['platform'] = df['platform'].fillna('unknown').str.lower()
91
  df['brand'] = df['brand'].fillna('unknown').str.lower()
92
 
93
+ # emotions is optional (soft-fail); keep NaN as-is
94
+ if 'emotions' not in df.columns:
95
+ df['emotions'] = None
96
+
97
  if 'requires_reply' in df.columns:
98
  df['requires_reply'] = df['requires_reply'].astype(bool)
99
 
 
170
 
171
  def load_sa_data(self, platform, brand, top_n=10, min_comments=10,
172
  sort_by='severity_score', sentiments=None, intents=None,
173
+ emotions=None, date_range=None):
174
  """
175
  Load Sentiment Analysis page data:
176
  1. Content aggregation stats for top-N contents
 
184
  sort_by: 'severity_score' | 'sentiment_percentage' | 'sentiment_count' | 'total_comments'
185
  sentiments: List of sentiments to filter by (dominant_sentiment)
186
  intents: List of intents to filter by
187
+ emotions: List of emotions to filter by (content must have at least one comment with these emotions)
188
  date_range: Tuple (start_date, end_date) or None
189
 
190
  Returns:
 
192
  """
193
  sentiments_key = tuple(sorted(sentiments)) if sentiments else ()
194
  intents_key = tuple(sorted(intents)) if intents else ()
195
+ emotions_key = tuple(sorted(emotions)) if emotions else ()
196
  date_key = (str(date_range[0]), str(date_range[1])) if date_range and len(date_range) == 2 else ()
197
 
198
  return self._fetch_sa_data(
199
  platform, brand, top_n, min_comments, sort_by,
200
+ sentiments_key, intents_key, emotions_key, date_key
201
  )
202
 
203
  @st.cache_data(ttl=86400)
204
  def _fetch_sa_data(_self, platform, brand, top_n, min_comments, sort_by,
205
+ sentiments, intents, emotions, date_range):
206
  """Cached SA data fetch β€” returns (contents_df, comments_df)."""
207
  try:
208
  conn = SnowFlakeConn()
 
251
  ]['content_sk'].unique()
252
  contents_df = contents_df[contents_df['content_sk'].isin(valid_sks)]
253
  comments_df = comments_df[comments_df['content_sk'].isin(valid_sks)]
254
+
255
+ # Python-side emotion filter — keep only content_sks that have
256
+ # at least one comment matching any selected emotion
257
+ if emotions:
258
+ pattern = '|'.join(re.escape(e) for e in emotions)
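+ # e.g. ["joy", "anger"] -> "joy|anger", matched case-insensitively against
+ # the comma-separated emotions column (assumes `re` is imported in this module).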
259
+ valid_sks = comments_df[
260
+ comments_df['emotions'].str.contains(pattern, na=False, case=False)
261
+ ]['content_sk'].unique()
262
+ contents_df = contents_df[contents_df['content_sk'].isin(valid_sks)]
263
+ comments_df = comments_df[comments_df['content_sk'].isin(valid_sks)]
264
  else:
265
  comments_df = pd.DataFrame()
266
 
 
403
  LOWER(s.PLATFORM) AS PLATFORM,
404
  LOWER(s.CHANNEL_NAME) AS BRAND,
405
  s.COMMENT_TIMESTAMP, s.AUTHOR_NAME,
406
+ s.DETECTED_LANGUAGE, s.SENTIMENT_POLARITY, s.INTENT, s.EMOTIONS,
407
  s.REQUIRES_REPLY, s.SENTIMENT_CONFIDENCE, s.IS_ENGLISH,
408
  c.PERMALINK_URL
409
  FROM SOCIAL_MEDIA_DB.ML_FEATURES.COMMENT_SENTIMENT_FEATURES s
 
423
  'musora_app' AS PLATFORM,
424
  LOWER(CHANNEL_NAME) AS BRAND,
425
  COMMENT_TIMESTAMP, AUTHOR_NAME,
426
+ DETECTED_LANGUAGE, SENTIMENT_POLARITY, INTENT, EMOTIONS,
427
  REQUIRES_REPLY, SENTIMENT_CONFIDENCE, IS_ENGLISH,
428
  PERMALINK_URL
429
  FROM SOCIAL_MEDIA_DB.ML_FEATURES.MUSORA_COMMENT_SENTIMENT_FEATURES
 
464
  df['intent'] = df['intent'].fillna('unknown')
465
  df['platform'] = df['platform'].fillna('unknown').str.lower()
466
 
467
+ # emotions is optional (soft-fail); keep NaN as-is for chart filtering
468
+ if 'emotions' not in df.columns:
469
+ df['emotions'] = None
470
+
471
  if 'requires_reply' in df.columns:
472
  df['requires_reply'] = df['requires_reply'].astype(bool)
473
 
visualization/data/helpscout_data_loader.py ADDED
@@ -0,0 +1,382 @@
1
+ """
2
+ HelpScout data loader — mirrors SentimentDataLoader architecture.
3
+
4
+ Three loading modes:
5
+ - load_dashboard_data() : lightweight (no long text), cached 24 h
6
+ - load_analysis_data(...) : filtered with SUMMARY + notes, on-demand, cached 24 h
7
+ - load_demographics_data() : email-keyed user demographics, cached 24 h
8
+ """
9
+ import re
10
+ import sys
11
+ from datetime import datetime, timedelta
12
+ from pathlib import Path
13
+
14
+ import pandas as pd
15
+ import streamlit as st
16
+ from dateutil.relativedelta import relativedelta
17
+
18
+ root_dir = Path(__file__).resolve().parent.parent.parent
19
+ sys.path.append(str(root_dir))
20
+
21
+ from visualization.SnowFlakeConnection import SnowFlakeConn
22
+ from visualization.utils.helpscout_utils import (
23
+ load_topic_taxonomy, parse_topics, compute_escalation_flag
24
+ )
25
+ import json
26
+
27
+
28
+ class HelpScoutDataLoader:
29
+ """
30
+ Loads HelpScout conversation features from Snowflake with caching.
31
+ """
32
+
33
+ def __init__(self, config_path=None):
34
+ if config_path is None:
35
+ config_path = Path(__file__).parent.parent / "config" / "viz_config.json"
36
+ with open(config_path, "r") as f:
37
+ self.config = json.load(f)
38
+
39
+ self.hs_config = self.config.get("helpscout", {})
40
+ self.dashboard_query = self.hs_config.get("dashboard_query", "")
41
+ self.demographics_query = self.hs_config.get("demographics_query", "")
42
+ self.escalation_sentiments = self.hs_config.get("escalation_sentiments", ["negative", "very_negative"])
43
+ self.default_date_range_days = self.hs_config.get("default_date_range_days", 60)
44
+ self.max_summary_conversations = self.hs_config.get("max_summary_conversations", 300)
45
+ self.topic_colors = self.config.get("color_schemes_helpscout", {}).get("topics", {})
46
+ self.status_colors = self.config.get("color_schemes_helpscout", {}).get("status", {})
47
+ self.flag_colors = self.config.get("color_schemes_helpscout", {}).get("boolean_flags", {})
48
+ self.sentiment_colors = self.config.get("color_schemes", {}).get("sentiment_polarity", {})
49
+ self.demographics_config = self.config.get("demographics", {})
50
+
51
+ self.taxonomy = load_topic_taxonomy()
52
+
53
+ # ─────────────────────────────────────────────────────────────
54
+ # Dashboard data (lightweight, 24-hour cache)
55
+ # ─────────────────────────────────────────────────────────────
56
+
57
+ @st.cache_data(ttl=86400)
58
+ def load_dashboard_data(_self):
59
+ """Load lightweight HelpScout dashboard data β€” no long-form text columns."""
60
+ try:
61
+ conn = SnowFlakeConn()
62
+ df = conn.run_read_query(_self.dashboard_query, "HelpScout dashboard data")
63
+ conn.close_connection()
64
+
65
+ if df is None or df.empty:
66
+ st.error("No HelpScout data returned from Snowflake")
67
+ return pd.DataFrame()
68
+
69
+ df = _self._process_dashboard_df(df)
70
+
71
+ if _self.demographics_query:
72
+ demo_df = _self.load_demographics_data()
73
+ if not demo_df.empty:
74
+ df = _self.merge_demographics(df, demo_df)
75
+
76
+ return df
77
+ except Exception as e:
78
+ st.error(f"Error loading HelpScout dashboard data: {e}")
79
+ return pd.DataFrame()
80
+
81
+ def _process_dashboard_df(self, df):
82
+ df.columns = df.columns.str.lower()
83
+
84
+ for ts_col in ("first_message_at", "last_message_at", "processed_at"):
85
+ if ts_col in df.columns:
86
+ df[ts_col] = pd.to_datetime(df[ts_col], errors="coerce", utc=True).dt.tz_localize(None)
87
+
88
+ df["sentiment_polarity"] = df["sentiment_polarity"].fillna("unknown")
89
+ df["status"] = df["status"].fillna("unknown").str.lower()
90
+ df["state"] = df["state"].fillna("unknown").str.lower()
91
+ df["source_type"] = df["source_type"].fillna("unknown").str.lower()
92
+
93
+ for bool_col in ("is_refund_request", "is_cancellation", "is_membership"):
94
+ if bool_col in df.columns:
95
+ df[bool_col] = df[bool_col].fillna(False).astype(bool)
96
+
97
+ if "emotions" not in df.columns:
98
+ df["emotions"] = None
99
+
100
+ # topics_list for filter options
101
+ df["topics_list"] = df["topics"].apply(parse_topics)
102
+
103
+ # escalation flag
104
+ df["is_escalation"] = compute_escalation_flag(df, self.escalation_sentiments)
105
+
106
+ return df
107
+
108
+ # ─────────────────────────────────────────────────────────────
109
+ # Analysis page data (on-demand, 24-hour cache)
110
+ # ─────────────────────────────────────────────────────────────
111
+
112
+ def load_analysis_data(self, sentiments=None, topics=None,
113
+ refund_only=False, cancel_only=False,
114
+ membership_only=False, statuses=None,
115
+ sources=None, date_range=None, top_n=None):
116
+ """
117
+ Load filtered HelpScout conversations with full text for the Analysis page.
118
+ Caches based on argument tuple.
119
+ """
120
+ sentiments_key = tuple(sorted(sentiments)) if sentiments else ()
121
+ topics_key = tuple(sorted(topics)) if topics else ()
122
+ statuses_key = tuple(sorted(statuses)) if statuses else ()
123
+ sources_key = tuple(sorted(sources)) if sources else ()
124
+ date_key = (str(date_range[0]), str(date_range[1])) if date_range and len(date_range) == 2 else ()
125
+ return self._fetch_analysis_data(
126
+ sentiments_key, topics_key, bool(refund_only), bool(cancel_only),
127
+ bool(membership_only), statuses_key, sources_key, date_key, top_n or 0
128
+ )
129
+
130
+ @st.cache_data(ttl=86400)
131
+ def _fetch_analysis_data(_self, sentiments, topics, refund_only, cancel_only,
132
+ membership_only, statuses, sources, date_range, top_n):
133
+ """Cached analysis data fetch β€” returns full-detail conversation df."""
134
+ try:
135
+ query = _self._build_analysis_query(
136
+ sentiments, topics, refund_only, cancel_only,
137
+ membership_only, statuses, sources, date_range, top_n
138
+ )
139
+ conn = SnowFlakeConn()
140
+ df = conn.run_read_query(query, "HelpScout analysis data")
141
+ conn.close_connection()
142
+
143
+ if df is None or df.empty:
144
+ return pd.DataFrame()
145
+
146
+ df = _self._process_analysis_df(df)
147
+ return df
148
+ except Exception as e:
149
+ st.error(f"Error loading HelpScout analysis data: {e}")
150
+ return pd.DataFrame()
151
+
152
+ def _build_analysis_query(self, sentiments, topics, refund_only, cancel_only,
153
+ membership_only, statuses, sources, date_range, top_n):
154
+ """Build dynamic SQL for the analysis page with all filters pushed to Snowflake."""
155
+ where_clauses = []
156
+
157
+ if date_range and len(date_range) == 2:
158
+ where_clauses.append(f"FIRST_MESSAGE_AT >= '{date_range[0]}' AND FIRST_MESSAGE_AT <= '{date_range[1]}'")
159
+
160
+ if sentiments:
161
+ safe = "', '".join(self._sanitize(s) for s in sentiments)
162
+ where_clauses.append(f"SENTIMENT_POLARITY IN ('{safe}')")
163
+
164
+ if topics:
165
+ topic_conditions = []
166
+ for t in topics:
167
+ safe_t = self._sanitize(t)
168
+ topic_conditions.append(
169
+ f"ARRAY_CONTAINS('{safe_t}'::VARIANT, SPLIT(TOPICS, ','))"
170
+ )
171
+ where_clauses.append("(" + " OR ".join(topic_conditions) + ")")
172
+
173
+ if statuses:
174
+ safe = "', '".join(self._sanitize(s.lower()) for s in statuses)
175
+ where_clauses.append(f"LOWER(STATUS) IN ('{safe}')")
176
+
177
+ if sources:
178
+ safe = "', '".join(self._sanitize(s.lower()) for s in sources)
179
+ where_clauses.append(f"LOWER(SOURCE_TYPE) IN ('{safe}')")
180
+
181
+ if refund_only:
182
+ where_clauses.append("IS_REFUND_REQUEST = TRUE")
183
+ if cancel_only:
184
+ where_clauses.append("IS_CANCELLATION = TRUE")
185
+ if membership_only:
186
+ where_clauses.append("IS_MEMBERSHIP = TRUE")
187
+
188
+ where_sql = ("WHERE " + " AND ".join(where_clauses)) if where_clauses else ""
189
+ limit_sql = f"LIMIT {int(top_n)}" if top_n and top_n > 0 else ""
190
+
191
+ return f"""
192
+ SELECT
193
+ CONVERSATION_ID,
194
+ LOWER(CUSTOMER_EMAIL) AS CUSTOMER_EMAIL,
195
+ CUSTOMER_FIRST,
196
+ CUSTOMER_LAST,
197
+ THREAD_COUNT,
198
+ FIRST_MESSAGE_AT,
199
+ LAST_MESSAGE_AT,
200
+ DURATION_HOURS,
201
+ STATUS,
202
+ STATE,
203
+ SOURCE_TYPE,
204
+ SOURCE_VIA,
205
+ SENTIMENT_POLARITY,
206
+ EMOTIONS,
207
+ SENTIMENT_CONFIDENCE,
208
+ SENTIMENT_NOTES,
209
+ TOPICS,
210
+ IS_REFUND_REQUEST,
211
+ IS_CANCELLATION,
212
+ IS_MEMBERSHIP,
213
+ TOPIC_CONFIDENCE,
214
+ TOPIC_NOTES,
215
+ SUMMARY,
216
+ PROCESSED_AT
217
+ FROM SOCIAL_MEDIA_DB.ML_FEATURES.HELPSCOUT_CONVERSATION_FEATURES
218
+ {where_sql}
219
+ ORDER BY FIRST_MESSAGE_AT DESC
220
+ {limit_sql}
221
+ """
222
+
223
+ def _process_analysis_df(self, df):
224
+ df.columns = df.columns.str.lower()
225
+
226
+ for ts_col in ("first_message_at", "last_message_at", "processed_at"):
227
+ if ts_col in df.columns:
228
+ df[ts_col] = pd.to_datetime(df[ts_col], errors="coerce", utc=True).dt.tz_localize(None)
229
+
230
+ df["sentiment_polarity"] = df["sentiment_polarity"].fillna("unknown")
231
+ df["status"] = df["status"].fillna("unknown").str.lower()
232
+ df["source_type"] = df["source_type"].fillna("unknown").str.lower()
233
+
234
+ for bool_col in ("is_refund_request", "is_cancellation", "is_membership"):
235
+ if bool_col in df.columns:
236
+ df[bool_col] = df[bool_col].fillna(False).astype(bool)
237
+
238
+ if "emotions" not in df.columns:
239
+ df["emotions"] = None
240
+
241
+ df["topics_list"] = df["topics"].apply(parse_topics)
242
+ df["is_escalation"] = compute_escalation_flag(df, self.escalation_sentiments)
243
+
244
+ # Short summary for cards (truncated at 120 chars)
245
+ if "summary" in df.columns:
246
+ text = df["summary"].fillna("").astype(str)
247
+ df["summary_short"] = text.where(text.str.len() <= 120, text.str[:120] + "…")
248
+
249
+ return df
250
+
251
+ # ─────────────────────────────────────────────────────────────
252
+ # Demographics (email-keyed, 24-hour cache)
253
+ # ─────────────────────────────────────────────────────────────
254
+
255
+ @st.cache_data(ttl=86400)
256
+ def load_demographics_data(_self):
257
+ """Load user demographics keyed by email."""
258
+ if not _self.demographics_query:
259
+ return pd.DataFrame()
260
+ try:
261
+ conn = SnowFlakeConn()
262
+ df = conn.run_read_query(_self.demographics_query, "HelpScout user demographics")
263
+ conn.close_connection()
264
+
265
+ if df is None or df.empty:
266
+ return pd.DataFrame()
267
+
268
+ return _self._process_demographics_df(df)
269
+ except Exception as e:
270
+ st.warning(f"Could not load HelpScout demographics: {e}")
271
+ return pd.DataFrame()
272
+
273
+ def _process_demographics_df(self, df):
274
+ df.columns = df.columns.str.lower()
275
+
276
+ if "birthday" in df.columns:
277
+ df["birthday"] = df["birthday"].astype(str)
278
+ df["birthday"] = pd.to_datetime(df["birthday"], errors="coerce", utc=True)
279
+ df["birthday"] = df["birthday"].dt.tz_localize(None)
280
+ df["age"] = df["birthday"].apply(self._calculate_age)
281
+ df["age_group"] = df["age"].apply(self._categorize_age)
282
+
283
+ if "timezone" in df.columns:
284
+ df["timezone_region"] = df["timezone"].apply(self._extract_timezone_region)
285
+
286
+ if "experience_level" in df.columns:
287
+ df["experience_group"] = df["experience_level"].apply(self._categorize_experience)
288
+
289
+ if "customer_email" in df.columns:
290
+ df = df[df["customer_email"].notna()]
291
+ df["customer_email"] = df["customer_email"].str.lower()
292
+
293
+ return df
294
+
295
+ def merge_demographics(self, df, demo_df):
296
+ """Merge demographic data with HelpScout conversations on customer_email."""
297
+ if demo_df.empty or "customer_email" not in df.columns:
298
+ for col, val in [("age", None), ("age_group", "Unknown"),
299
+ ("timezone", None), ("timezone_region", "Unknown"),
300
+ ("experience_level", None), ("experience_group", "Unknown")]:
301
+ df[col] = val
302
+ return df
303
+
304
+ if "customer_email" not in demo_df.columns:
305
+ return df
306
+
307
+ merge_cols = ["customer_email"]
308
+ for c in ["age", "age_group", "timezone", "timezone_region", "experience_level", "experience_group"]:
309
+ if c in demo_df.columns:
310
+ merge_cols.append(c)
311
+
312
+ merged = df.merge(demo_df[merge_cols], on="customer_email", how="left")
313
+
314
+ for col in ["age_group", "timezone_region", "experience_group"]:
315
+ if col in merged.columns:
316
+ merged[col] = merged[col].fillna("Unknown")
317
+
318
+ return merged
319
+
320
+ # ─────────────────────────────────────────────────────────────
321
+ # Filter helpers
322
+ # ─────────────────────────────────────────────────────────────
323
+
324
+ def get_filter_options(self, df):
325
+ """Return unique values for all in-page filters from the dashboard df."""
326
+ topics_flat = df["topics_list"].explode().dropna().unique().tolist() if "topics_list" in df.columns else []
327
+ return {
328
+ "sentiments": sorted(df["sentiment_polarity"].dropna().unique().tolist()),
329
+ "topics": sorted(t for t in topics_flat if t),
330
+ "statuses": sorted(df["status"].dropna().unique().tolist()),
331
+ "states": sorted(df["state"].dropna().unique().tolist()) if "state" in df.columns else [],
332
+ "sources": sorted(df["source_type"].dropna().unique().tolist()),
333
+ }
334
+
335
+ # ─────────────────────────────────────────────────────────────
336
+ # Demographics calculation helpers (mirrors SentimentDataLoader)
337
+ # ─────────────────────────────────────────────────────────────
338
+
339
+ @staticmethod
340
+ def _calculate_age(birthday):
341
+ if pd.isna(birthday):
342
+ return None
343
+ try:
344
+ age = relativedelta(datetime.now(), birthday).years
345
+ return age if 0 <= age <= 120 else None
346
+ except Exception:
347
+ return None
348
+
349
+ def _categorize_age(self, age):
350
+ if pd.isna(age) or age is None:
351
+ return "Unknown"
352
+ for group_name, (min_age, max_age) in self.demographics_config.get("age_groups", {}).items():
353
+ if min_age <= age <= max_age:
354
+ return group_name
355
+ return "Unknown"
356
+
357
+ @staticmethod
358
+ def _extract_timezone_region(timezone):
359
+ if pd.isna(timezone) or not isinstance(timezone, str):
360
+ return "Unknown"
361
+ parts = timezone.split("/")
362
+ return parts[0] if parts else "Unknown"
363
+
364
+ def _categorize_experience(self, experience_level):
365
+ if pd.isna(experience_level):
366
+ return "Unknown"
367
+ try:
368
+ exp_level = float(experience_level)
369
+ except Exception:
370
+ return "Unknown"
371
+ for group_name, (min_exp, max_exp) in self.demographics_config.get("experience_groups", {}).items():
372
+ if min_exp <= exp_level <= max_exp:
373
+ return group_name
374
+ return "Unknown"
375
+
376
+ # ─────────────────────────────────────────────────────────────
377
+ # Internal helpers
378
+ # ─────────────────────────────────────────────────────────────
379
+
380
+ @staticmethod
381
+ def _sanitize(value: str) -> str:
382
+ return re.sub(r"['\";\\]", "", str(value))
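
A sketch of how the loading modes compose in a Streamlit page, assuming the `helpscout` block of `viz_config.json` is populated and Snowflake credentials are available:

```python
# Sketch only: dashboard load feeds filter options; analysis load is on-demand.
from visualization.data.helpscout_data_loader import HelpScoutDataLoader

loader = HelpScoutDataLoader()
dashboard_df = loader.load_dashboard_data()          # lightweight, 24 h cache
options = loader.get_filter_options(dashboard_df)

analysis_df = loader.load_analysis_data(             # full text, on-demand
    sentiments=["negative", "very_negative"],
    topics=options["topics"][:2],                    # illustrative selection
    refund_only=True,
    date_range=("2025-01-01", "2025-02-28"),
    top_n=200,
)
```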
visualization/utils/auth.py CHANGED
@@ -24,8 +24,6 @@ AUTHORIZED_EMAILS = {
24
  "gabriel@musora.com",
25
  "jmilligan@musora.com",
26
  "dave@musora.com",
27
- "amy@musora.com",
28
- "karissa@musora.com"
29
  }
30
 
31
 
 
24
  "gabriel@musora.com",
25
  "jmilligan@musora.com",
26
  "dave@musora.com",
27
  }
28
 
29
 
visualization/utils/data_processor.py CHANGED
@@ -113,6 +113,52 @@ class SentimentDataProcessor:
113
 
114
  return intent_counts
115
 
116
  @staticmethod
117
  def get_content_summary(df):
118
  """
 
113
 
114
  return intent_counts
115
 
116
+ @staticmethod
117
+ def get_emotion_distribution(df, group_by=None):
118
+ """
119
+ Calculate emotion distribution (handles multi-label).
120
+
121
+ Args:
122
+ df: Sentiment dataframe with 'emotions' column
123
+ group_by: Optional column(s) to group by
124
+
125
+ Returns:
126
+ pd.DataFrame: Emotion distribution with columns [emotion, count, percentage]
127
+ """
128
+ if 'emotions' not in df.columns:
129
+ return pd.DataFrame()
130
+
131
+ df_exploded = df.dropna(subset=['emotions']).copy()
132
+ df_exploded['emotions'] = df_exploded['emotions'].str.split(',')
133
+ df_exploded = df_exploded.explode('emotions')
134
+ df_exploded['emotions'] = df_exploded['emotions'].str.strip()
135
+ df_exploded = df_exploded[df_exploded['emotions'] != '']
136
+
137
+ if df_exploded.empty:
138
+ return pd.DataFrame()
139
+
140
+ if group_by:
141
+ if isinstance(group_by, str):
142
+ group_by = [group_by]
143
+
144
+ emotion_counts = df_exploded.groupby(
145
+ group_by + ['emotions'],
146
+ as_index=False
147
+ ).size().rename(columns={'size': 'count'})
148
+
149
+ emotion_counts['percentage'] = emotion_counts.groupby(group_by)['count'].transform(
150
+ lambda x: (x / x.sum() * 100).round(2)
151
+ )
152
+
153
+ else:
154
+ emotion_counts = df_exploded['emotions'].value_counts().reset_index()
155
+ emotion_counts.columns = ['emotions', 'count']
156
+ emotion_counts['percentage'] = (
157
+ emotion_counts['count'] / emotion_counts['count'].sum() * 100
158
+ ).round(2)
159
+
160
+ return emotion_counts
161
+
162
  @staticmethod
163
  def get_content_summary(df):
164
  """
visualization/utils/helpscout_pdf.py ADDED
@@ -0,0 +1,471 @@
1
+ """
2
+ HelpScout PDF Exporters.
3
+
4
+ Two classes sharing the MusoraPDF base from pdf_exporter.py:
5
+ - HelpScoutDashboardPDF : full HelpScout dashboard report
6
+ - HelpScoutAnalysisPDF : filtered analysis report + optional LLM summary
7
+ """
8
+ import logging
9
+ import os
10
+ import sys
11
+ import tempfile
12
+ from datetime import datetime
13
+ from pathlib import Path
14
+
15
+ import plotly.io as pio
16
+
17
+ _parent = Path(__file__).resolve().parent.parent
18
+ if str(_parent) not in sys.path:
19
+ sys.path.insert(0, str(_parent))
20
+
21
+ from utils.pdf_exporter import MusoraPDF # reuse base class
22
+ from utils.helpscout_utils import boolean_flag_counts, topic_label, load_topic_taxonomy
23
+ from visualizations.helpscout_charts import HelpScoutCharts
24
+
25
+ logger = logging.getLogger(__name__)
26
+
27
+ _RENDER_SCALE = 3
28
+
29
+
30
+ # ---------------------------------------------------------------------------
31
+ # Shared rendering helpers (module-level functions used by both exporters)
32
+ # ---------------------------------------------------------------------------
33
+
34
+ def _prepare_fig(fig, is_side_by_side=False):
35
+ base_fs = 13 if is_side_by_side else 14
36
+ fig.update_layout(
37
+ paper_bgcolor="white", plot_bgcolor="white",
38
+ font=dict(color="black", size=base_fs),
39
+ title_font_size=base_fs + 4,
40
+ margin=(dict(l=60, r=40, t=60, b=60) if is_side_by_side else dict(l=80, r=40, t=60, b=80)),
41
+ )
42
+ fig.update_xaxes(automargin=True)
43
+ fig.update_yaxes(automargin=True)
44
+
45
+
46
+ def _fig_to_tmp(fig, width=800, height=400, is_side_by_side=False) -> str:
47
+ _prepare_fig(fig, is_side_by_side)
48
+ img = pio.to_image(fig, format="png", width=width, height=height,
49
+ scale=_RENDER_SCALE, engine="kaleido")
50
+ tmp = tempfile.NamedTemporaryFile(suffix=".png", delete=False)
51
+ tmp.write(img)
52
+ tmp.close()
53
+ return tmp.name
54
+
55
+
56
+ def _cleanup(paths):
57
+ for p in paths:
58
+ try:
59
+ os.unlink(p)
60
+ except OSError:
61
+ pass
62
+
63
+
64
+ # ---------------------------------------------------------------------------
65
+ # HelpScoutDashboardPDF
66
+ # ---------------------------------------------------------------------------
67
+
68
+ class HelpScoutDashboardPDF:
69
+ """
70
+ Generates a comprehensive HelpScout dashboard PDF report.
71
+ """
72
+
73
+ def __init__(self):
74
+ self.charts = HelpScoutCharts()
75
+ self.taxonomy = load_topic_taxonomy()
76
+ self._tmp: list = []
77
+
78
+ def generate_report(self, df, filter_info: dict = None) -> bytes:
79
+ """Build and return the full dashboard PDF."""
80
+ self.pdf = MusoraPDF()
81
+ self._tmp = []
82
+ try:
83
+ self._cover(df, filter_info)
84
+ self._executive_summary(df)
85
+ self._sentiment_section(df)
86
+ self._topic_section(df)
87
+ self._emotion_section(df)
88
+ self._flags_section(df)
89
+ self._status_source_section(df)
90
+ self._timelines_section(df)
91
+ self._depth_section(df)
92
+ self._data_summary(df, filter_info)
93
+ return bytes(self.pdf.output())
94
+ finally:
95
+ _cleanup(self._tmp)
96
+
97
+ # ── Rendering helpers ──
98
+
99
+ def _add_chart(self, fig, width=180, img_w=800, img_h=400):
100
+ try:
101
+ p = _fig_to_tmp(fig, img_w, img_h)
102
+ self._tmp.append(p)
103
+ h_mm = width * (img_h / img_w)
104
+ self.pdf.check_page_break(h_mm + 5)
105
+ self.pdf.image(p, x=10, w=width)
106
+ self.pdf.ln(3)
107
+ except Exception:
108
+ logger.exception("Chart render failed")
109
+ self.pdf.body_text("[Chart could not be rendered]")
110
+
111
+ def _add_two_charts(self, fig1, fig2, width=92):
112
+ try:
113
+ p1 = _fig_to_tmp(fig1, 700, 450, is_side_by_side=True); self._tmp.append(p1)
114
+ p2 = _fig_to_tmp(fig2, 700, 450, is_side_by_side=True); self._tmp.append(p2)
115
+ h_mm = width * (450 / 700)
116
+ self.pdf.check_page_break(h_mm + 5)
117
+ y = self.pdf.get_y()
118
+ self.pdf.image(p1, x=10, y=y, w=width)
119
+ self.pdf.image(p2, x=10 + width + 4, y=y, w=width)
120
+ self.pdf.set_y(y + h_mm + 3)
121
+ except Exception:
122
+ logger.exception("Side-by-side render failed")
123
+ self.pdf.body_text("[Charts could not be rendered]")
124
+
125
+ # ── Sections ──
126
+
127
+ def _cover(self, df, filter_info):
128
+ self.pdf.add_page()
129
+ self.pdf.ln(40)
130
+ r, g, b = MusoraPDF.PRIMARY
131
+ self.pdf.set_fill_color(r, g, b)
132
+ self.pdf.rect(0, 60, 210, 4, style="F")
133
+ self.pdf.ln(20)
134
+ self.pdf.set_font("Helvetica", "B", 28)
135
+ self.pdf.set_text_color(r, g, b)
136
+ self.pdf.cell(0, 15, "Musora", align="C", new_x="LMARGIN", new_y="NEXT")
137
+ self.pdf.set_font("Helvetica", "", 16)
138
+ self.pdf.set_text_color(80, 80, 80)
139
+ self.pdf.cell(0, 10, "HelpScout Support Dashboard Report",
140
+ align="C", new_x="LMARGIN", new_y="NEXT")
141
+ self.pdf.ln(10)
142
+ self.pdf.set_font("Helvetica", "", 12)
143
+ self.pdf.set_text_color(100, 100, 100)
144
+ self.pdf.cell(0, 8, f"Generated: {datetime.now().strftime('%B %d, %Y at %H:%M')}",
145
+ align="C", new_x="LMARGIN", new_y="NEXT")
146
+ self.pdf.ln(5)
147
+ self.pdf.set_font("Helvetica", "", 10)
148
+ self.pdf.cell(0, 7, f"Total Conversations: {len(df):,}",
149
+ align="C", new_x="LMARGIN", new_y="NEXT")
150
+ if "first_message_at" in df.columns and not df.empty:
151
+ valid = df["first_message_at"].dropna()
152
+ if not valid.empty:
153
+ dr = f"{valid.min().strftime('%b %d, %Y')} to {valid.max().strftime('%b %d, %Y')}"
154
+ self.pdf.ln(3)
155
+ self.pdf.set_font("Helvetica", "I", 9)
156
+ self.pdf.set_text_color(120, 120, 120)
157
+ self.pdf.cell(0, 6, MusoraPDF._sanitize(f"Data period: {dr}"),
158
+ align="C", new_x="LMARGIN", new_y="NEXT")
159
+ self.pdf.ln(20)
160
+ self.pdf.set_font("Helvetica", "I", 8)
161
+ self.pdf.set_text_color(150, 150, 150)
162
+ self.pdf.cell(0, 6, "Confidential - For Internal Use Only",
163
+ align="C", new_x="LMARGIN", new_y="NEXT")
164
+
165
+ def _executive_summary(self, df):
166
+ self.pdf.add_page()
167
+ self.pdf.section_header("Executive Summary")
168
+ total = len(df)
169
+ flags = boolean_flag_counts(df)
170
+ neg = df["sentiment_polarity"].isin(["negative", "very_negative"]).sum()
171
+ pos = df["sentiment_polarity"].isin(["positive", "very_positive"]).sum()
172
+ neg_pct = neg / total * 100 if total else 0
173
+ pos_pct = pos / total * 100 if total else 0
174
+ esc = int(df["is_escalation"].sum()) if "is_escalation" in df.columns else 0
175
+ avg_dur = float(df["duration_hours"].mean()) if "duration_hours" in df.columns else 0
176
+
177
+ self.pdf.metric_row([
178
+ ("Total Conversations", f"{total:,}"),
179
+ ("Positive %", f"{pos_pct:.1f}%"),
180
+ ("Negative %", f"{neg_pct:.1f}%"),
181
+ ("Avg Duration (h)", f"{avg_dur:.1f}"),
182
+ ])
183
+ self.pdf.metric_row([
184
+ ("Escalations", f"{esc:,}"),
185
+ ("Refund Requests", f"{flags['is_refund_request']:,}"),
186
+ ("Cancellations", f"{flags['is_cancellation']:,}"),
187
+ ("Membership Joins", f"{flags['is_membership']:,}"),
188
+ ])
189
+
190
+ def _sentiment_section(self, df):
191
+ self.pdf.add_page()
192
+ self.pdf.section_header("Sentiment Distribution")
193
+ pie = self.charts.create_sentiment_pie_chart(df, title="Sentiment Distribution")
194
+ gauge = self.charts.create_sentiment_score_gauge(self._avg_score(df))
195
+ self._add_two_charts(pie, gauge)
196
+
197
+ def _topic_section(self, df):
198
+ self.pdf.add_page()
199
+ self.pdf.section_header("Topic Analysis")
200
+ bar = self.charts.create_topic_bar_chart(df, title="Conversations by Topic")
201
+ pie = self.charts.create_topic_pie_chart(df, title="Topic Share")
202
+ self._add_two_charts(bar, pie)
203
+ self._add_chart(self.charts.create_topic_sentiment_heatmap(df), img_h=500)
204
+
205
+ def _emotion_section(self, df):
206
+ if "emotions" not in df.columns or df["emotions"].dropna().empty:
207
+ return
208
+ self.pdf.add_page()
209
+ self.pdf.section_header("Emotion Analysis")
210
+ self._add_chart(self.charts.create_emotion_bar_chart(df, title="Emotion Distribution"))
211
+
212
+ def _flags_section(self, df):
213
+ self.pdf.add_page()
214
+ self.pdf.section_header("Billing & Membership Flags")
215
+ flags_chart = self.charts.create_boolean_flags_chart(df)
216
+ esc_chart = self.charts.create_escalation_breakdown(df)
217
+ self._add_two_charts(flags_chart, esc_chart)
218
+
219
+ def _status_source_section(self, df):
220
+ self.pdf.add_page()
221
+ self.pdf.section_header("Status & Source Distribution")
222
+ status_chart = self.charts.create_status_distribution(df)
223
+ source_chart = self.charts.create_source_distribution(df)
224
+ self._add_two_charts(status_chart, source_chart)
225
+
226
+ def _timelines_section(self, df):
227
+ self.pdf.add_page()
228
+ self.pdf.section_header("Volume & Trends (Weekly)")
229
+ self._add_chart(self.charts.create_volume_timeline(df, freq="W"))
230
+ self._add_chart(self.charts.create_sentiment_timeline(df, freq="W"))
231
+ self._add_chart(self.charts.create_refund_cancel_timeline(df, freq="W"))
232
+
233
+ def _depth_section(self, df):
234
+ self.pdf.add_page()
235
+ self.pdf.section_header("Conversation Depth")
236
+ dur = self.charts.create_duration_histogram(df)
237
+ thd = self.charts.create_thread_count_histogram(df)
238
+ self._add_two_charts(dur, thd)
239
+
240
+ def _data_summary(self, df, filter_info):
241
+ self.pdf.add_page()
242
+ self.pdf.section_header("Data Summary")
243
+ self.pdf.body_text(f"Report generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
244
+ self.pdf.body_text(f"Total conversations: {len(df):,}")
245
+ self.pdf.callout_box(
246
+ "Data source: SOCIAL_MEDIA_DB.ML_FEATURES.HELPSCOUT_CONVERSATION_FEATURES\n"
247
+ "This report is confidential and intended for internal Musora team use only.",
248
+ bg_color=(245, 245, 245),
249
+ )
250
+
251
+ @staticmethod
252
+ def _avg_score(df) -> float:
253
+ score_map = {"very_positive": 2, "positive": 1, "neutral": 0,
254
+ "negative": -1, "very_negative": -2}
255
+ if "sentiment_polarity" not in df.columns or df.empty:
256
+ return 0.0
257
+ return float(df["sentiment_polarity"].map(score_map).fillna(0).mean())
258
+
259
+
260
+ # ---------------------------------------------------------------------------
261
+ # HelpScoutAnalysisPDF
262
+ # ---------------------------------------------------------------------------
263
+
264
+ class HelpScoutAnalysisPDF:
265
+ """
266
+ Generates a focused analysis PDF from the HelpScout Analysis page.
267
+ Includes filter summary, distributions, and optionally the LLM summary report.
268
+ """
269
+
270
+ def __init__(self):
271
+ self.charts = HelpScoutCharts()
272
+ self.taxonomy = load_topic_taxonomy()
273
+ self._tmp: list = []
274
+
275
+ def generate_report(self, df, filter_info: dict = None,
276
+ summary_result: dict = None) -> bytes:
277
+ """
278
+ Build and return the analysis PDF.
279
+
280
+ Args:
281
+ df: Filtered HelpScout analysis DataFrame.
282
+ filter_info: Dict of filter descriptions for the cover.
283
+ summary_result: Output from HelpScoutSummaryAgent.process() or None.
284
+ """
285
+ self.pdf = MusoraPDF()
286
+ self._tmp = []
287
+ try:
288
+ self._cover(df, filter_info)
289
+ self._filter_summary_section(filter_info, df)
290
+ self._kpi_section(df)
291
+ self._distributions_section(df)
292
+ self._summary_section(summary_result)
293
+ self._data_summary(df, filter_info)
294
+ return bytes(self.pdf.output())
295
+ finally:
296
+ _cleanup(self._tmp)
297
+
298
+ # ── Rendering helpers ──
299
+
300
+ def _add_chart(self, fig, width=180, img_w=800, img_h=400):
301
+ try:
302
+ p = _fig_to_tmp(fig, img_w, img_h)
303
+ self._tmp.append(p)
304
+ h_mm = width * (img_h / img_w)
305
+ self.pdf.check_page_break(h_mm + 5)
306
+ self.pdf.image(p, x=10, w=width)
307
+ self.pdf.ln(3)
308
+ except Exception:
309
+ logger.exception("Chart render failed")
310
+ self.pdf.body_text("[Chart could not be rendered]")
311
+
312
+ def _add_two_charts(self, fig1, fig2, width=92):
313
+ try:
314
+ p1 = _fig_to_tmp(fig1, 700, 450, is_side_by_side=True); self._tmp.append(p1)
315
+ p2 = _fig_to_tmp(fig2, 700, 450, is_side_by_side=True); self._tmp.append(p2)
316
+ h_mm = width * (450 / 700)
317
+ self.pdf.check_page_break(h_mm + 5)
318
+ y = self.pdf.get_y()
319
+ self.pdf.image(p1, x=10, y=y, w=width)
320
+ self.pdf.image(p2, x=10 + width + 4, y=y, w=width)
321
+ self.pdf.set_y(y + h_mm + 3)
322
+ except Exception:
323
+ logger.exception("Side-by-side render failed")
324
+ self.pdf.body_text("[Charts could not be rendered]")
325
+
326
+ # ── Sections ──
327
+
328
+ def _cover(self, df, filter_info):
329
+ self.pdf.add_page()
330
+ self.pdf.ln(40)
331
+ r, g, b = MusoraPDF.PRIMARY
332
+ self.pdf.set_fill_color(r, g, b)
333
+ self.pdf.rect(0, 60, 210, 4, style="F")
334
+ self.pdf.ln(20)
335
+ self.pdf.set_font("Helvetica", "B", 28)
336
+ self.pdf.set_text_color(r, g, b)
337
+ self.pdf.cell(0, 15, "Musora", align="C", new_x="LMARGIN", new_y="NEXT")
338
+ self.pdf.set_font("Helvetica", "", 16)
339
+ self.pdf.set_text_color(80, 80, 80)
340
+ self.pdf.cell(0, 10, "HelpScout Analysis Report",
341
+ align="C", new_x="LMARGIN", new_y="NEXT")
342
+ self.pdf.ln(10)
343
+ self.pdf.set_font("Helvetica", "", 12)
344
+ self.pdf.set_text_color(100, 100, 100)
345
+ self.pdf.cell(0, 8, f"Generated: {datetime.now().strftime('%B %d, %Y at %H:%M')}",
346
+ align="C", new_x="LMARGIN", new_y="NEXT")
347
+ self.pdf.ln(5)
348
+ self.pdf.set_font("Helvetica", "", 10)
349
+ self.pdf.cell(0, 7, f"Matched Conversations: {len(df):,}",
350
+ align="C", new_x="LMARGIN", new_y="NEXT")
351
+ if filter_info:
352
+ self.pdf.ln(8)
353
+ self.pdf.set_font("Helvetica", "B", 9)
354
+ self.pdf.set_text_color(80, 80, 80)
355
+ self.pdf.cell(0, 6, "Applied Filters:", align="C", new_x="LMARGIN", new_y="NEXT")
356
+ self.pdf.set_font("Helvetica", "", 9)
357
+ for k, v in filter_info.items():
358
+ if v:
359
+ self.pdf.cell(0, 5, MusoraPDF._sanitize(f"{k}: {v}"),
360
+ align="C", new_x="LMARGIN", new_y="NEXT")
361
+ self.pdf.ln(20)
362
+ self.pdf.set_font("Helvetica", "I", 8)
363
+ self.pdf.set_text_color(150, 150, 150)
364
+ self.pdf.cell(0, 6, "Confidential - For Internal Use Only",
365
+ align="C", new_x="LMARGIN", new_y="NEXT")
366
+
367
+ def _filter_summary_section(self, filter_info, df):
368
+ self.pdf.add_page()
369
+ self.pdf.section_header("Filter Set Summary")
370
+ if filter_info:
371
+ rows = [(k, MusoraPDF._sanitize(str(v))) for k, v in filter_info.items() if v]
372
+ if rows:
373
+ self.pdf.add_table(["Filter", "Value"], rows, col_widths=[80, 110])
374
+ else:
375
+ self.pdf.body_text("No filters applied β€” report covers all available conversations.")
376
+
377
+ def _kpi_section(self, df):
378
+ total = len(df)
379
+ flags = boolean_flag_counts(df)
380
+ neg_pct = df["sentiment_polarity"].isin(["negative", "very_negative"]).sum() / total * 100 if total else 0
381
+ pos_pct = df["sentiment_polarity"].isin(["positive", "very_positive"]).sum() / total * 100 if total else 0
382
+ avg_dur = float(df["duration_hours"].mean()) if "duration_hours" in df.columns else 0
383
+ esc = int(df["is_escalation"].sum()) if "is_escalation" in df.columns else 0
384
+
385
+ self.pdf.section_header("Key Metrics")
386
+ self.pdf.metric_row([
387
+ ("Conversations", f"{total:,}"),
388
+ ("Positive %", f"{pos_pct:.1f}%"),
389
+ ("Negative %", f"{neg_pct:.1f}%"),
390
+ ("Avg Duration (h)", f"{avg_dur:.1f}"),
391
+ ])
392
+ self.pdf.metric_row([
393
+ ("Escalations", f"{esc:,}"),
394
+ ("Refund Requests", f"{flags['is_refund_request']:,}"),
395
+ ("Cancellations", f"{flags['is_cancellation']:,}"),
396
+ ("Membership Joins", f"{flags['is_membership']:,}"),
397
+ ])
398
+
399
+ def _distributions_section(self, df):
400
+ self.pdf.add_page()
401
+ self.pdf.section_header("Distributions")
402
+ pie = self.charts.create_sentiment_pie_chart(df, title="Sentiment Distribution")
403
+ tbar = self.charts.create_topic_bar_chart(df, title="Topic Distribution")
404
+ self._add_two_charts(pie, tbar)
405
+ self._add_chart(self.charts.create_topic_sentiment_heatmap(df), img_h=500)
406
+
407
+ def _summary_section(self, result: dict):
408
+ self.pdf.add_page()
409
+ self.pdf.section_header("AI Summary Report")
410
+
411
+ if result is None or not result.get("success"):
412
+ self.pdf.callout_box(
413
+ "AI summary not generated. To include it, click 'Generate Summary Report' "
414
+ "in the app before exporting the PDF.",
415
+ bg_color=(255, 250, 230),
416
+ )
417
+ return
418
+
419
+ summary = result.get("summary", {})
420
+ meta = result.get("metadata", {})
421
+
422
+ exec_summary = MusoraPDF._sanitize(summary.get("executive_summary", ""))
423
+ if exec_summary:
424
+ self.pdf.subsection_header("Executive Summary")
425
+ self.pdf.section_description(exec_summary)
426
+
427
+ themes = summary.get("top_themes", [])
428
+ if themes:
429
+ self.pdf.subsection_header("Top Themes")
430
+ for t in themes:
431
+ theme_text = MusoraPDF._sanitize(
432
+ f"{t.get('theme', '')} β€” {t.get('prevalence', '')}: {t.get('description', '')}"
433
+ )
434
+ self.pdf.body_text(f" * {theme_text}")
435
+
436
+ complaints = summary.get("top_complaints", [])
437
+ if complaints:
438
+ self.pdf.subsection_header("Top Complaints")
439
+ for c in complaints:
440
+ self.pdf.body_text(f" * {MusoraPDF._sanitize(c)}")
441
+
442
+ insights = summary.get("unexpected_insights", [])
443
+ if insights:
444
+ self.pdf.subsection_header("Unexpected Insights")
445
+ for ins in insights:
446
+ self.pdf.body_text(f" * {MusoraPDF._sanitize(ins)}")
447
+
448
+ quotes = summary.get("notable_quotes", [])
449
+ if quotes:
450
+ self.pdf.subsection_header("Notable Quotes")
451
+ for q in quotes:
452
+ self.pdf.body_text(f' "{MusoraPDF._sanitize(q)}"')
453
+
454
+ self.pdf.ln(4)
455
+ self.pdf.callout_box(
456
+ f"Analysis based on {meta.get('total_conversations_analyzed', 0)} conversations "
457
+ f"| Model: {meta.get('model_used', 'N/A')} "
458
+ f"| Tokens: {meta.get('tokens_used', 0):,}",
459
+ bg_color=(240, 248, 255),
460
+ )
461
+
462
+ def _data_summary(self, df, filter_info):
463
+ self.pdf.add_page()
464
+ self.pdf.section_header("Data Summary")
465
+ self.pdf.body_text(f"Report generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
466
+ self.pdf.body_text(f"Total conversations in report: {len(df):,}")
467
+ self.pdf.callout_box(
468
+ "Data source: SOCIAL_MEDIA_DB.ML_FEATURES.HELPSCOUT_CONVERSATION_FEATURES\n"
469
+ "This report is confidential and intended for internal Musora team use only.",
470
+ bg_color=(245, 245, 245),
471
+ )
visualization/utils/helpscout_utils.py ADDED
@@ -0,0 +1,107 @@
1
+ """
2
+ HelpScout utility helpers — pure functions, no Streamlit dependency.
3
+ """
4
+ import json
5
+ from pathlib import Path
6
+
7
+ import pandas as pd
8
+
9
+
10
+ # ---------------------------------------------------------------------------
11
+ # Topic taxonomy helpers
12
+ # ---------------------------------------------------------------------------
13
+
14
+ def load_topic_taxonomy(path: str = None) -> dict:
15
+ """
16
+ Load topics.json and return {id: {'label': str, 'description': str}}.
17
+ Default path resolves to process_helpscout/config_files/topics.json
18
+ relative to the project root.
19
+ """
20
+ if path is None:
21
+ root = Path(__file__).resolve().parent.parent.parent
22
+ path = root / "process_helpscout" / "config_files" / "topics.json"
23
+ with open(path, "r", encoding="utf-8") as f:
24
+ raw = json.load(f)
25
+ return {t["id"]: {"label": t["label"], "description": t.get("description", "")}
26
+ for t in raw.get("topics", [])}
27
+
28
+
29
+ def topic_label(topic_id: str, taxonomy: dict) -> str:
30
+ """Return human-readable label for a topic id. Falls back to title-cased id."""
31
+ if topic_id in taxonomy:
32
+ return taxonomy[topic_id]["label"]
33
+ return topic_id.replace("_", " ").title()
34
+
35
+
36
+ def parse_topics(value) -> list:
37
+ """Split a comma-separated TOPICS string into a list of stripped lowercase ids."""
38
+ if pd.isna(value) or not isinstance(value, str) or not value.strip():
39
+ return []
40
+ return [t.strip().lower() for t in value.split(",") if t.strip()]
41
+
42
+
43
+ def explode_topics(df: pd.DataFrame, topics_col: str = "topics") -> pd.DataFrame:
44
+ """
45
+ Return a new dataframe with one row per (conversation_id, topic_id).
46
+ Requires df to have a 'conversation_id' column and a topics_col column.
47
+ """
48
+ df = df.copy()
49
+ df["_topic_list"] = df[topics_col].apply(parse_topics)
50
+ exploded = df.explode("_topic_list").rename(columns={"_topic_list": "topic_id"})
51
+ exploded = exploded[exploded["topic_id"].notna() & (exploded["topic_id"] != "")]
52
+ return exploded.drop(columns=[topics_col], errors="ignore").reset_index(drop=True)
53
+
54
+
55
+ # ---------------------------------------------------------------------------
56
+ # Boolean flag helpers
57
+ # ---------------------------------------------------------------------------
58
+
59
+ def boolean_flag_counts(df: pd.DataFrame) -> dict:
60
+ """Return counts for refund / cancellation / membership flags."""
61
+ return {
62
+ "is_refund_request": int(df["is_refund_request"].sum()) if "is_refund_request" in df.columns else 0,
63
+ "is_cancellation": int(df["is_cancellation"].sum()) if "is_cancellation" in df.columns else 0,
64
+ "is_membership": int(df["is_membership"].sum()) if "is_membership" in df.columns else 0,
65
+ }
66
+
67
+
68
+ def compute_escalation_flag(df: pd.DataFrame, escalation_sentiments: list) -> pd.Series:
69
+ """
70
+ Boolean Series: True when conversation is negative-sentiment
71
+ OR is a refund request OR is a cancellation.
72
+ """
73
+ is_neg = df["sentiment_polarity"].isin(escalation_sentiments)
74
+ is_refund = df.get("is_refund_request", pd.Series(False, index=df.index)).fillna(False).astype(bool)
75
+ is_cancel = df.get("is_cancellation", pd.Series(False, index=df.index)).fillna(False).astype(bool)
76
+ return is_neg | is_refund | is_cancel
77
+
78
+
79
+ # ---------------------------------------------------------------------------
80
+ # Filter description builder
81
+ # ---------------------------------------------------------------------------
82
+
83
+ def build_filter_description(filters: dict, taxonomy: dict) -> str:
84
+ """
85
+ Convert the filter dict from the analysis page into a human-readable string
86
+ suitable for the agent prompt and PDF cover.
87
+ """
88
+ parts = []
89
+ if filters.get("date_range"):
90
+ s, e = filters["date_range"]
91
+ parts.append(f"Date: {s} to {e}")
92
+ if filters.get("sentiments"):
93
+ parts.append(f"Sentiments: {', '.join(filters['sentiments'])}")
94
+ if filters.get("topics"):
95
+ labels = [topic_label(t, taxonomy) for t in filters["topics"]]
96
+ parts.append(f"Topics: {', '.join(labels)}")
97
+ if filters.get("statuses"):
98
+ parts.append(f"Status: {', '.join(filters['statuses'])}")
99
+ if filters.get("sources"):
100
+ parts.append(f"Source: {', '.join(filters['sources'])}")
101
+ if filters.get("refund_only"):
102
+ parts.append("Refund requests only")
103
+ if filters.get("cancel_only"):
104
+ parts.append("Cancellations only")
105
+ if filters.get("membership_only"):
106
+ parts.append("Membership requests only")
107
+ return "; ".join(parts) if parts else "No filters applied β€” showing all conversations"
visualization/utils/pdf_exporter.py CHANGED
@@ -79,6 +79,13 @@ _DESCRIPTIONS = {
79
  "Note: These charts reflect only users who have filled in their profile information - "
80
  "they do not represent all community members."
81
  ),
82
  "language": (
83
  "Language distribution shows what languages comments are written in. "
84
  "Non-English comments are automatically translated for analysis."
@@ -342,6 +349,7 @@ class DashboardPDFExporter:
342
  self._add_brand_section(df)
343
  self._add_platform_section(df)
344
  self._add_intent_section(df)
 
345
  self._add_cross_dimensional_section(df)
346
  self._add_volume_section(df)
347
  self._add_reply_requirements_section(df)
@@ -350,6 +358,7 @@ class DashboardPDFExporter:
350
  if "detected_language" in df.columns:
351
  self._add_language_section(df)
352
  self._add_data_summary(df, filter_info)
 
353
 
354
  return bytes(self.pdf.output())
355
  finally:
@@ -782,6 +791,39 @@ class DashboardPDFExporter:
782
  )
783
  self._add_two_charts(intent_bar, intent_pie)
784
 
785
  def _add_cross_dimensional_section(self, df) -> None:
786
  if "brand" not in df.columns or "platform" not in df.columns:
787
  return
@@ -913,6 +955,44 @@ class DashboardPDFExporter:
913
  self.distribution_charts.create_language_distribution(df, top_n=10, title="Top 10 Languages")
914
  )
915
 
916
  def _add_data_summary(self, df, filter_info: dict) -> None:
917
  self.pdf.add_page()
918
  self.pdf.section_header("Data Summary")
 
79
  "Note: These charts reflect only users who have filled in their profile information - "
80
  "they do not represent all community members."
81
  ),
82
+ "emotion": (
83
+ "Beyond sentiment polarity, the AI identifies the underlying emotion in each comment: "
84
+ "joy, excitement, gratitude, admiration, curiosity, humor, frustration, "
85
+ "disappointment, sadness, anger, or neutral. "
86
+ "Comments can have multiple emotions (multi-label). "
87
+ "Emotions with no data are omitted from the charts."
88
+ ),
89
  "language": (
90
  "Language distribution shows what languages comments are written in. "
91
  "Non-English comments are automatically translated for analysis."
 
349
  self._add_brand_section(df)
350
  self._add_platform_section(df)
351
  self._add_intent_section(df)
352
+ self._add_emotion_section(df)
353
  self._add_cross_dimensional_section(df)
354
  self._add_volume_section(df)
355
  self._add_reply_requirements_section(df)
 
358
  if "detected_language" in df.columns:
359
  self._add_language_section(df)
360
  self._add_data_summary(df, filter_info)
361
+ self._add_helpscout_summary_section()
362
 
363
  return bytes(self.pdf.output())
364
  finally:
 
791
  )
792
  self._add_two_charts(intent_bar, intent_pie)
793
 
794
+ def _add_emotion_section(self, df) -> None:
795
+ if "emotions" not in df.columns or df["emotions"].dropna().empty:
796
+ return
797
+
798
+ self.pdf.add_page()
799
+ self.pdf.section_header("Emotion Analysis")
800
+ self.pdf.section_description(_DESCRIPTIONS["emotion"])
801
+
802
+ emotion_bar = self.distribution_charts.create_emotion_bar_chart(
803
+ df, title="Emotion Distribution", orientation="h"
804
+ )
805
+ emotion_pie = self.distribution_charts.create_emotion_pie_chart(
806
+ df, title="Emotion Distribution"
807
+ )
808
+ self._add_two_charts(emotion_bar, emotion_pie)
809
+
810
+ # Top 5 emotions summary
811
+ emotion_dist = self.processor.get_emotion_distribution(df)
812
+ if not emotion_dist.empty:
813
+ self.pdf.subsection_header("Top Emotions")
814
+ rows = []
815
+ for _, row in emotion_dist.sort_values('count', ascending=False).head(8).iterrows():
816
+ rows.append((
817
+ str(row['emotions']).title(),
818
+ f"{int(row['count']):,}",
819
+ f"{row['percentage']:.1f}%",
820
+ ))
821
+ self.pdf.add_table(
822
+ headers=["Emotion", "Count", "Percentage"],
823
+ rows=rows,
824
+ col_widths=[80, 55, 55],
825
+ )
826
+
827
  def _add_cross_dimensional_section(self, df) -> None:
828
  if "brand" not in df.columns or "platform" not in df.columns:
829
  return
 
955
  self.distribution_charts.create_language_distribution(df, top_n=10, title="Top 10 Languages")
956
  )
957
 
958
+ def _add_helpscout_summary_section(self) -> None:
959
+ """Short HelpScout overview appended to the combined dashboard PDF."""
960
+ try:
961
+ import streamlit as st
962
+ hs_df = st.session_state.get("helpscout_df")
963
+ if hs_df is None or hs_df.empty:
964
+ return
965
+
966
+ from utils.helpscout_utils import boolean_flag_counts
967
+ from visualizations.helpscout_charts import HelpScoutCharts
968
+
969
+ self.pdf.add_page()
970
+ self.pdf.section_header("HelpScout Support Overview")
971
+ self.pdf.section_description(
972
+ "Summary of customer support conversations processed through the "
973
+ "HelpScout sentiment pipeline."
974
+ )
975
+
976
+ total = len(hs_df)
977
+ flags = boolean_flag_counts(hs_df)
978
+ neg_pct = hs_df["sentiment_polarity"].isin(["negative", "very_negative"]).sum() / total * 100 if total else 0
979
+ esc = int(hs_df["is_escalation"].sum()) if "is_escalation" in hs_df.columns else 0
980
+
981
+ self.pdf.metric_row([
982
+ ("Conversations", f"{total:,}"),
983
+ ("Negative %", f"{neg_pct:.1f}%"),
984
+ ("Escalations", f"{esc:,}"),
985
+ ("Refund Requests", f"{flags['is_refund_request']:,}"),
986
+ ])
987
+
988
+ hs_charts = HelpScoutCharts()
989
+ pie = hs_charts.create_sentiment_pie_chart(hs_df, title="HelpScout Sentiment Distribution")
990
+ tbar = hs_charts.create_topic_bar_chart(hs_df, title="Top Topics", top_n=5)
991
+ self._add_two_charts(pie, tbar)
992
+
993
+ except Exception:
994
+ logger.exception("HelpScout summary section failed β€” skipping")
995
+
996
  def _add_data_summary(self, df, filter_info: dict) -> None:
997
  self.pdf.add_page()
998
  self.pdf.section_header("Data Summary")
visualization/visualizations/distribution_charts.py CHANGED
@@ -29,9 +29,11 @@ class DistributionCharts:
29
  self.config = json.load(f)
30
 
31
  self.intent_colors = self.config['color_schemes']['intent']
 
32
  self.platform_colors = self.config['color_schemes']['platform']
33
  self.brand_colors = self.config['color_schemes']['brand']
34
  self.intent_order = self.config['intent_order']
 
35
  self.chart_height = self.config['dashboard']['chart_height']
36
 
37
  def create_intent_bar_chart(self, df, title="Intent Distribution", orientation='h'):
@@ -141,6 +143,135 @@ class DistributionCharts:
141
 
142
  return fig
143
 
144
  def create_platform_distribution(self, df, title="Comments by Platform"):
145
  """
146
  Create bar chart for platform distribution
 
29
  self.config = json.load(f)
30
 
31
  self.intent_colors = self.config['color_schemes']['intent']
32
+ self.emotion_colors = self.config['color_schemes'].get('emotion', {})
33
  self.platform_colors = self.config['color_schemes']['platform']
34
  self.brand_colors = self.config['color_schemes']['brand']
35
  self.intent_order = self.config['intent_order']
36
+ self.emotion_order = self.config.get('emotion_order', [])
37
  self.chart_height = self.config['dashboard']['chart_height']
38
 
39
  def create_intent_bar_chart(self, df, title="Intent Distribution", orientation='h'):
 
143
 
144
  return fig
145
 
146
+ def create_emotion_bar_chart(self, df, title="Emotion Distribution", orientation='h'):
147
+ """
148
+ Create bar chart for emotion distribution (handles multi-label).
149
+
150
+ Args:
151
+ df: Sentiment dataframe with 'emotions' column
152
+ title: Chart title
153
+ orientation: 'h' for horizontal, 'v' for vertical
154
+
155
+ Returns:
156
+ plotly.graph_objects.Figure
157
+ """
158
+ if 'emotions' not in df.columns:
159
+ return go.Figure().add_annotation(
160
+ text="No emotion data available",
161
+ xref="paper", yref="paper", x=0.5, y=0.5, showarrow=False
162
+ )
163
+
164
+ df_exploded = df.dropna(subset=['emotions']).copy()
165
+ df_exploded['emotions'] = df_exploded['emotions'].str.split(',')
166
+ df_exploded = df_exploded.explode('emotions')
167
+ df_exploded['emotions'] = df_exploded['emotions'].str.strip()
168
+ df_exploded = df_exploded[df_exploded['emotions'] != '']
169
+
170
+ if df_exploded.empty:
171
+ return go.Figure().add_annotation(
172
+ text="No emotion data available",
173
+ xref="paper", yref="paper", x=0.5, y=0.5, showarrow=False
174
+ )
175
+
176
+ emotion_counts = df_exploded['emotions'].value_counts()
177
+ ordered_emotions = [e for e in self.emotion_order if e in emotion_counts.index]
178
+ # include any emotions not in the order list
179
+ remaining = [e for e in emotion_counts.index if e not in ordered_emotions]
180
+ ordered_emotions = ordered_emotions + remaining
181
+ emotion_counts = emotion_counts[ordered_emotions]
182
+
183
+ colors = [self.emotion_colors.get(e, '#CCCCCC') for e in emotion_counts.index]
184
+
185
+ if orientation == 'h':
186
+ fig = go.Figure(data=[go.Bar(
187
+ y=emotion_counts.index,
188
+ x=emotion_counts.values,
189
+ orientation='h',
190
+ marker=dict(color=colors),
191
+ text=emotion_counts.values,
192
+ textposition='auto',
193
+ hovertemplate='<b>%{y}</b><br>Count: %{x}<extra></extra>'
194
+ )])
195
+ fig.update_layout(
196
+ title=title,
197
+ xaxis_title="Number of Comments",
198
+ yaxis_title="Emotion",
199
+ height=self.chart_height,
200
+ yaxis={'categoryorder': 'total ascending'}
201
+ )
202
+ else:
203
+ fig = go.Figure(data=[go.Bar(
204
+ x=emotion_counts.index,
205
+ y=emotion_counts.values,
206
+ marker=dict(color=colors),
207
+ text=emotion_counts.values,
208
+ textposition='auto',
209
+ hovertemplate='<b>%{x}</b><br>Count: %{y}<extra></extra>'
210
+ )])
211
+ fig.update_layout(
212
+ title=title,
213
+ xaxis_title="Emotion",
214
+ yaxis_title="Number of Comments",
215
+ height=self.chart_height
216
+ )
217
+
218
+ return fig
219
+
220
+ def create_emotion_pie_chart(self, df, title="Emotion Distribution"):
221
+ """
222
+ Create pie chart for emotion distribution.
223
+
224
+ Args:
225
+ df: Sentiment dataframe with 'emotions' column
226
+ title: Chart title
227
+
228
+ Returns:
229
+ plotly.graph_objects.Figure
230
+ """
231
+ if 'emotions' not in df.columns:
232
+ return go.Figure().add_annotation(
233
+ text="No emotion data available",
234
+ xref="paper", yref="paper", x=0.5, y=0.5, showarrow=False
235
+ )
236
+
237
+ df_exploded = df.dropna(subset=['emotions']).copy()
238
+ df_exploded['emotions'] = df_exploded['emotions'].str.split(',')
239
+ df_exploded = df_exploded.explode('emotions')
240
+ df_exploded['emotions'] = df_exploded['emotions'].str.strip()
241
+ df_exploded = df_exploded[df_exploded['emotions'] != '']
242
+
243
+ if df_exploded.empty:
244
+ return go.Figure().add_annotation(
245
+ text="No emotion data available",
246
+ xref="paper", yref="paper", x=0.5, y=0.5, showarrow=False
247
+ )
248
+
249
+ emotion_counts = df_exploded['emotions'].value_counts()
250
+ ordered_emotions = [e for e in self.emotion_order if e in emotion_counts.index]
251
+ remaining = [e for e in emotion_counts.index if e not in ordered_emotions]
252
+ ordered_emotions = ordered_emotions + remaining
253
+ emotion_counts = emotion_counts[ordered_emotions]
254
+
255
+ colors = [self.emotion_colors.get(e, '#CCCCCC') for e in emotion_counts.index]
256
+
257
+ fig = go.Figure(data=[go.Pie(
258
+ labels=emotion_counts.index,
259
+ values=emotion_counts.values,
260
+ marker=dict(colors=colors),
261
+ textinfo='label+percent',
262
+ textposition='auto',
263
+ hovertemplate='<b>%{label}</b><br>Count: %{value}<br>Percentage: %{percent}<extra></extra>'
264
+ )])
265
+
266
+ fig.update_layout(
267
+ title=title,
268
+ height=self.chart_height,
269
+ showlegend=True,
270
+ legend=dict(orientation="v", yanchor="middle", y=0.5, xanchor="left", x=1.05)
271
+ )
272
+
273
+ return fig
274
+
275
  def create_platform_distribution(self, df, title="Comments by Platform"):
276
  """
277
  Create bar chart for platform distribution
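
The emotion charts read `color_schemes.emotion` and `emotion_order` from `viz_config.json`. A plausible fragment of that config, shown as a Python dict; the emotion names and colors below are placeholders, not the project's actual palette:

```python
# Sketch only: shape of the config keys the emotion charts consume.
# Color values and ordering are hypothetical placeholders.
viz_config_fragment = {
    "color_schemes": {
        "emotion": {
            "joy": "#00C851",
            "gratitude": "#7CB342",
            "frustration": "#FF6F00",
            "anger": "#D32F2F",
            "neutral": "#9E9E9E",
        }
    },
    "emotion_order": ["joy", "gratitude", "frustration", "anger", "neutral"],
}
```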
visualization/visualizations/helpscout_charts.py ADDED
@@ -0,0 +1,413 @@
+"""
+HelpScout-specific Plotly chart functions.
+All functions accept a HelpScout conversations DataFrame and return a
+plotly.graph_objects.Figure.
+"""
+import json
+import sys
+from pathlib import Path
+
+import pandas as pd
+import plotly.graph_objects as go
+
+# Ensure project root is on sys.path so visualization.* imports resolve
+_root = Path(__file__).resolve().parent.parent.parent
+if str(_root) not in sys.path:
+    sys.path.insert(0, str(_root))
+
+from visualization.utils.helpscout_utils import (
+    explode_topics, parse_topics, topic_label, load_topic_taxonomy
+)
+
+
+class HelpScoutCharts:
+    """Plotly chart factory for HelpScout conversation data."""
+
+    def __init__(self, config_path=None):
+        if config_path is None:
+            config_path = Path(__file__).parent.parent / "config" / "viz_config.json"
+        with open(config_path, "r") as f:
+            config = json.load(f)
+
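+        # Every lookup below falls back to a default, so a sparse viz_config.json still loads.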
+        hs_colors = config.get("color_schemes_helpscout", {})
+        self.topic_colors = hs_colors.get("topics", {})
+        self.status_colors = hs_colors.get("status", {})
+        self.flag_colors = hs_colors.get("boolean_flags", {})
+        self.sentiment_colors = config.get("color_schemes", {}).get("sentiment_polarity", {})
+        self.sentiment_order = config.get("sentiment_order", [])
+        self.chart_height = config.get("dashboard", {}).get("chart_height", 400)
+        self.taxonomy = load_topic_taxonomy()
+
+    # ─────────────────────────────────────────────────────────────
+    # Sentiment charts
+    # ─────────────────────────────────────────────────────────────
+
+    def create_sentiment_pie_chart(self, df, title="Sentiment Distribution"):
+        counts = df["sentiment_polarity"].value_counts()
+        ordered = [s for s in self.sentiment_order if s in counts.index]
+        counts = counts[ordered]
+        colors = [self.sentiment_colors.get(s, "#CCCCCC") for s in counts.index]
+
+        fig = go.Figure(go.Pie(
+            labels=counts.index,
+            values=counts.values,
+            marker=dict(colors=colors),
+            textinfo="label+percent",
+            hovertemplate="<b>%{label}</b><br>Count: %{value}<br>%{percent}<extra></extra>",
+        ))
+        fig.update_layout(title=title, height=self.chart_height,
+                          legend=dict(orientation="v", yanchor="middle", y=0.5))
+        return fig
+
+    def create_sentiment_score_gauge(self, avg_score, title="Sentiment Score"):
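+        # Assumes avg_score lies in [-2, 2]; rescale it linearly onto the 0-100 gauge axis.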
+        normalized = ((avg_score + 2) / 4) * 100
+        fig = go.Figure(go.Indicator(
+            mode="gauge+number",
+            value=normalized,
+            title={"text": title, "font": {"size": 18}},
+            number={"font": {"size": 36}},
+            gauge={
+                "axis": {"range": [0, 100]},
+                "bar": {"color": "darkblue"},
+                "steps": [
+                    {"range": [0, 20], "color": "#D32F2F"},
+                    {"range": [20, 40], "color": "#FF6F00"},
+                    {"range": [40, 60], "color": "#FFB300"},
+                    {"range": [60, 80], "color": "#7CB342"},
+                    {"range": [80, 100], "color": "#00C851"},
+                ],
+            },
+        ))
+        fig.update_layout(height=300, margin=dict(l=20, r=20, t=60, b=20))
+        return fig
+
+    def create_sentiment_timeline(self, df, title="Sentiment Over Time", freq="W"):
+        if "first_message_at" not in df.columns:
+            return self._empty_fig(title, "No timestamp data")
+        df_t = df.copy()
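+        # Bucket timestamps into calendar periods (weekly by default), anchored at the period start.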
+        df_t["date"] = pd.to_datetime(df_t["first_message_at"]).dt.to_period(freq).dt.to_timestamp()
+        agg = df_t.groupby(["date", "sentiment_polarity"]).size().reset_index(name="count")
+        fig = go.Figure()
+        for s in self.sentiment_order:
+            d = agg[agg["sentiment_polarity"] == s]
+            if not d.empty:
+                fig.add_trace(go.Scatter(
+                    x=d["date"], y=d["count"], name=s, mode="lines+markers",
+                    line=dict(color=self.sentiment_colors.get(s, "#CCCCCC"), width=2),
+                    hovertemplate="<b>%{x}</b><br>%{y}<extra></extra>",
+                ))
+        fig.update_layout(title=title, xaxis_title="Date",
+                          yaxis_title="Conversations", height=self.chart_height,
+                          hovermode="x unified")
+        return fig
+
+    # ─────────────────────────────────────────────────────────────
+    # Topic charts
+    # ─────────────────────────────────────────────────────────────
+
+    def create_topic_bar_chart(self, df, title="Topic Distribution",
+                               orientation="h", top_n=None):
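+        # explode_topics (see helpscout_utils) is expected to yield one row per (conversation, topic) pair.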
+        exploded = explode_topics(df)
+        if exploded.empty:
+            return self._empty_fig(title, "No topic data")
+        counts = exploded["topic_id"].value_counts()
+        if top_n:
+            counts = counts.head(top_n)
+        labels = [topic_label(t, self.taxonomy) for t in counts.index]
+        colors = [self.topic_colors.get(t, "#607D8B") for t in counts.index]
+
+        if orientation == "h":
+            fig = go.Figure(go.Bar(
+                y=labels, x=counts.values, orientation="h",
+                marker=dict(color=colors),
+                text=counts.values, textposition="auto",
+                hovertemplate="<b>%{y}</b><br>%{x} conversations<extra></extra>",
+            ))
+            fig.update_layout(title=title, xaxis_title="Conversations",
+                              yaxis_title="Topic", height=self.chart_height,
+                              yaxis={"categoryorder": "total ascending"})
+        else:
+            fig = go.Figure(go.Bar(
+                x=labels, y=counts.values,
+                marker=dict(color=colors),
+                text=counts.values, textposition="auto",
+                hovertemplate="<b>%{x}</b><br>%{y}<extra></extra>",
+            ))
+            fig.update_layout(title=title, xaxis_title="Topic",
+                              yaxis_title="Conversations", height=self.chart_height)
+        return fig
+
+    def create_topic_pie_chart(self, df, title="Topic Distribution"):
+        exploded = explode_topics(df)
+        if exploded.empty:
+            return self._empty_fig(title, "No topic data")
+        counts = exploded["topic_id"].value_counts()
+        labels = [topic_label(t, self.taxonomy) for t in counts.index]
+        colors = [self.topic_colors.get(t, "#607D8B") for t in counts.index]
+        fig = go.Figure(go.Pie(
+            labels=labels, values=counts.values,
+            marker=dict(colors=colors),
+            textinfo="label+percent",
+            hovertemplate="<b>%{label}</b><br>%{value}<br>%{percent}<extra></extra>",
+        ))
+        fig.update_layout(title=title, height=self.chart_height)
+        return fig
+
+    def create_topic_sentiment_heatmap(self, df, title="Topic Γ— Sentiment Heatmap"):
+        exploded = explode_topics(df)
+        if exploded.empty or "sentiment_polarity" not in exploded.columns:
+            return self._empty_fig(title, "No data")
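+        # Count conversations per (topic, sentiment) cell, then relabel rows for display.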
+        pivot = pd.crosstab(exploded["topic_id"], exploded["sentiment_polarity"])
+        pivot.index = [topic_label(t, self.taxonomy) for t in pivot.index]
+        ordered_cols = [s for s in self.sentiment_order if s in pivot.columns]
+        pivot = pivot[ordered_cols] if ordered_cols else pivot
+
+        fig = go.Figure(go.Heatmap(
+            z=pivot.values,
+            x=pivot.columns.tolist(),
+            y=pivot.index.tolist(),
+            colorscale="Blues",
+            text=pivot.values,
+            texttemplate="%{text}",
+            hovertemplate="<b>%{y} β€” %{x}</b><br>%{z}<extra></extra>",
+            colorbar=dict(title="Conversations"),
+        ))
+        fig.update_layout(title=title, xaxis_title="Sentiment",
+                          yaxis_title="Topic", height=self.chart_height + 100)
+        return fig
+
+    def create_topic_timeline(self, df, title="Topic Volume Over Time",
+                              freq="W", top_n=5):
+        if "first_message_at" not in df.columns:
+            return self._empty_fig(title, "No timestamp data")
+        exploded = explode_topics(df)
+        if exploded.empty:
+            return self._empty_fig(title, "No topic data")
+
+        top_topics = exploded["topic_id"].value_counts().head(top_n).index.tolist()
+        exploded = exploded[exploded["topic_id"].isin(top_topics)].copy()
+        exploded["date"] = pd.to_datetime(exploded["first_message_at"]).dt.to_period(freq).dt.to_timestamp()
+        agg = exploded.groupby(["date", "topic_id"]).size().reset_index(name="count")
+
+        fig = go.Figure()
+        for t in top_topics:
+            d = agg[agg["topic_id"] == t]
+            if not d.empty:
+                fig.add_trace(go.Scatter(
+                    x=d["date"], y=d["count"],
+                    name=topic_label(t, self.taxonomy), mode="lines+markers",
+                    line=dict(color=self.topic_colors.get(t, "#607D8B"), width=2),
+                    hovertemplate="<b>%{x}</b><br>%{y}<extra></extra>",
+                ))
+        fig.update_layout(title=title, xaxis_title="Date",
+                          yaxis_title="Conversations", height=self.chart_height,
+                          hovermode="x unified")
+        return fig
+
+    # ─────────────────────────────────────────────────────────────
+    # Volume & timelines
+    # ─────────────────────────────────────────────────────────────
+
+    def create_volume_timeline(self, df, title="Conversation Volume Over Time",
+                               freq="W"):
+        if "first_message_at" not in df.columns:
+            return self._empty_fig(title, "No timestamp data")
+        df_t = df.copy()
+        df_t["date"] = pd.to_datetime(df_t["first_message_at"]).dt.to_period(freq).dt.to_timestamp()
+        agg = df_t.groupby("date").size().reset_index(name="count")
+        fig = go.Figure(go.Bar(
+            x=agg["date"], y=agg["count"],
+            marker_color="#1982C4",
+            hovertemplate="<b>%{x}</b><br>%{y} conversations<extra></extra>",
+        ))
+        fig.update_layout(title=title, xaxis_title="Date",
+                          yaxis_title="Conversations", height=self.chart_height)
+        return fig
+
+    def create_refund_cancel_timeline(self, df, title="Refund & Cancellation Over Time",
+                                      freq="W"):
+        if "first_message_at" not in df.columns:
+            return self._empty_fig(title, "No timestamp data")
+        df_t = df.copy()
+        df_t["date"] = pd.to_datetime(df_t["first_message_at"]).dt.to_period(freq).dt.to_timestamp()
+
+        fig = go.Figure()
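+        # One trace per boolean flag; columns missing from the dataframe are skipped.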
+        for col, label, color in [
+            ("is_refund_request", "Refund Requests", "#D32F2F"),
+            ("is_cancellation", "Cancellations", "#FF6F00"),
+            ("is_membership", "Membership Joins", "#00C851"),
+        ]:
+            if col in df_t.columns:
+                agg = df_t[df_t[col] == True].groupby("date").size().reset_index(name="count")
+                if not agg.empty:
+                    fig.add_trace(go.Scatter(
+                        x=agg["date"], y=agg["count"], name=label,
+                        mode="lines+markers", line=dict(color=color, width=2),
+                        hovertemplate="<b>%{x}</b><br>%{y}<extra></extra>",
+                    ))
+        fig.update_layout(title=title, xaxis_title="Date",
+                          yaxis_title="Conversations", height=self.chart_height,
+                          hovermode="x unified")
+        return fig
+
+    # ─────────────────────────────────────────────────────────────
+    # Status / source / flags
+    # ─────────────────────────────────────────────────────────────
+
+    def create_status_distribution(self, df, title="Conversations by Status"):
+        if "status" not in df.columns:
+            return self._empty_fig(title, "No status data")
+        counts = df["status"].value_counts()
+        colors = [self.status_colors.get(s, self.status_colors.get("default", "#607D8B"))
+                  for s in counts.index]
+        fig = go.Figure(go.Bar(
+            x=counts.index, y=counts.values,
+            marker=dict(color=colors),
+            text=counts.values, textposition="auto",
+            hovertemplate="<b>%{x}</b><br>%{y}<extra></extra>",
+        ))
+        fig.update_layout(title=title, xaxis_title="Status",
+                          yaxis_title="Conversations", height=self.chart_height)
+        return fig
+
+    def create_source_distribution(self, df, title="Conversations by Source Type"):
+        if "source_type" not in df.columns:
+            return self._empty_fig(title, "No source data")
+        counts = df["source_type"].value_counts()
+        fig = go.Figure(go.Bar(
+            x=counts.index, y=counts.values,
+            marker_color="#1982C4",
+            text=counts.values, textposition="auto",
+            hovertemplate="<b>%{x}</b><br>%{y}<extra></extra>",
+        ))
+        fig.update_layout(title=title, xaxis_title="Source",
+                          yaxis_title="Conversations", height=self.chart_height)
+        return fig
+
+    def create_boolean_flags_chart(self, df, title="Key Billing & Membership Flags"):
+        labels, values, colors = [], [], []
+        for col, label in [("is_refund_request", "Refund Requests"),
+                           ("is_cancellation", "Cancellations"),
+                           ("is_membership", "Membership Joins")]:
+            if col in df.columns:
+                labels.append(label)
+                values.append(int(df[col].sum()))
+                colors.append(self.flag_colors.get(col, "#607D8B"))
+
+        if not values:
+            return self._empty_fig(title, "No flag data")
+
+        fig = go.Figure(go.Bar(
+            x=labels, y=values,
+            marker=dict(color=colors),
+            text=values, textposition="auto",
+            hovertemplate="<b>%{x}</b><br>%{y}<extra></extra>",
+        ))
+        fig.update_layout(title=title, xaxis_title="Flag",
+                          yaxis_title="Conversations", height=self.chart_height)
+        return fig
+
+    def create_escalation_breakdown(self, df, title="Escalation Queue by Topic"):
+        if "is_escalation" not in df.columns:
+            return self._empty_fig(title, "No escalation data")
+
+        exploded = explode_topics(df)
+        if exploded.empty:
+            return self._empty_fig(title, "No topic data")
+
+        pivot = pd.crosstab(exploded["topic_id"], exploded["is_escalation"])
+        pivot.index = [topic_label(t, self.taxonomy) for t in pivot.index]
+
+        fig = go.Figure()
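+        # Stack non-escalated vs. escalated counts per topic as horizontal bars.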
+        for flag, label, color in [(False, "Normal", "#4CAF50"), (True, "Escalation", "#D32F2F")]:
+            if flag in pivot.columns:
+                fig.add_trace(go.Bar(
+                    name=label, y=pivot.index, x=pivot[flag],
+                    orientation="h", marker_color=color,
+                    hovertemplate="<b>%{y}</b><br>%{x}<extra></extra>",
+                ))
+        fig.update_layout(title=title, barmode="stack", xaxis_title="Conversations",
+                          yaxis_title="Topic", height=self.chart_height,
+                          yaxis={"categoryorder": "total ascending"})
+        return fig
+
+    # ─────────────────────────────────────────────────────────────
+    # Duration & thread count
+    # ─────────────────────────────────────────────────────────────
+
+    def create_duration_histogram(self, df, title="Conversation Duration Distribution"):
+        if "duration_hours" not in df.columns:
+            return self._empty_fig(title, "No duration data")
+        d = df["duration_hours"].dropna()
+        fig = go.Figure(go.Histogram(
+            x=d, nbinsx=40, marker_color="#1982C4",
+            hovertemplate="Duration: %{x:.1f}h<br>Count: %{y}<extra></extra>",
+        ))
+        fig.update_layout(title=title, xaxis_title="Duration (hours)",
+                          yaxis_title="Conversations", height=self.chart_height)
+        return fig
+
+    def create_thread_count_histogram(self, df, title="Thread Count Distribution"):
+        if "thread_count" not in df.columns:
+            return self._empty_fig(title, "No thread data")
+        t = df["thread_count"].dropna()
+        fig = go.Figure(go.Histogram(
+            x=t, nbinsx=30, marker_color="#9C27B0",
+            hovertemplate="Threads: %{x}<br>Count: %{y}<extra></extra>",
+        ))
+        fig.update_layout(title=title, xaxis_title="Number of Threads",
+                          yaxis_title="Conversations", height=self.chart_height)
+        return fig
+
+    # ─────────────────────────────────────────────────────────────
+    # Emotion (same logic as DistributionCharts but with helpscout df)
+    # ─────────────────────────────────────────────────────────────
+
+    def create_emotion_bar_chart(self, df, title="Emotion Distribution",
+                                 orientation="h"):
+        if "emotions" not in df.columns or df["emotions"].isna().all():
+            return self._empty_fig(title, "No emotion data")
+
+        emotion_colors = {
+            "joy": "#FFD700", "excitement": "#FF6B35", "gratitude": "#4CAF50",
+            "admiration": "#2196F3", "curiosity": "#00BCD4", "humor": "#9C27B0",
+            "frustration": "#FF9800", "disappointment": "#795548",
+            "sadness": "#607D8B", "anger": "#D32F2F", "neutral": "#9E9E9E",
+        }
+        df_e = df.copy()
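+        # 'emotions' is a comma-separated string; split and explode to one emotion per row.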
+        df_e["emotions"] = df_e["emotions"].str.split(",")
+        df_e = df_e.explode("emotions")
+        df_e["emotions"] = df_e["emotions"].str.strip().str.lower()
+        counts = df_e["emotions"].dropna().value_counts()
+        colors = [emotion_colors.get(e, "#CCCCCC") for e in counts.index]
+
+        if orientation == "h":
+            fig = go.Figure(go.Bar(
+                y=counts.index, x=counts.values, orientation="h",
+                marker=dict(color=colors), text=counts.values, textposition="auto",
+                hovertemplate="<b>%{y}</b><br>%{x}<extra></extra>",
+            ))
+            fig.update_layout(title=title, xaxis_title="Conversations",
+                              yaxis_title="Emotion", height=self.chart_height,
+                              yaxis={"categoryorder": "total ascending"})
+        else:
+            fig = go.Figure(go.Bar(
+                x=counts.index, y=counts.values,
+                marker=dict(color=colors), text=counts.values, textposition="auto",
+                hovertemplate="<b>%{x}</b><br>%{y}<extra></extra>",
+            ))
+            fig.update_layout(title=title, xaxis_title="Emotion",
+                              yaxis_title="Conversations", height=self.chart_height)
+        return fig
+
+    # ─────────────────────────────────────────────────────────────
+    # Helpers
+    # ─────────────────────────────────────────────────────────────
+
+    @staticmethod
+    def _empty_fig(title, message):
+        fig = go.Figure()
+        fig.add_annotation(text=message, xref="paper", yref="paper",
+                           x=0.5, y=0.5, showarrow=False, font=dict(size=14))
+        fig.update_layout(title=title, height=300)
+        return fig
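
A hedged end-to-end sketch of the new `HelpScoutCharts` class. The column names match the ones the methods read (`sentiment_polarity`, `first_message_at`, `status`, the `is_*` flags); the sample values and the `topics` serialization are invented, since parsing is delegated to `explode_topics`, and the sketch assumes viz_config.json defines a `sentiment_order` covering these labels.

```python
import pandas as pd

from visualization.visualizations.helpscout_charts import HelpScoutCharts

# Toy conversations frame using the column names the chart methods read.
df = pd.DataFrame({
    "sentiment_polarity": ["positive", "negative", "neutral"],
    "topics": ['["billing"]', '["billing", "bugs"]', '["how_to"]'],  # format assumed
    "first_message_at": pd.to_datetime(["2024-01-02", "2024-01-09", "2024-01-16"]),
    "status": ["closed", "active", "closed"],
    "is_refund_request": [True, False, False],
    "is_cancellation": [False, True, False],
    "is_membership": [False, False, True],
})

charts = HelpScoutCharts()  # reads visualization/config/viz_config.json by default
charts.create_sentiment_pie_chart(df).show()
charts.create_refund_cancel_timeline(df, freq="W").show()
charts.create_boolean_flags_chart(df).show()
```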