{"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"name":"python","version":"3.11.11","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"},"kaggle":{"accelerator":"none","dataSources":[],"dockerImageVersionId":31012,"isInternetEnabled":true,"language":"python","sourceType":"notebook","isGpuEnabled":false}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"markdown","source":"## Spa Customer Service Chatbot with XML Sitemap-Based Scraping\n\n**Problem Statement**\n\nThe goal of this project is to create an LLM-Powered Assistant for \"Benefit Body Spa\" that can efficiently answer customer queries by leveraging website data scraped via XML sitemaps. The system should comprehensively cover the spa's services, business hours, contact information, and any other relevant details. Additionally, the solution should provide structured and unstructured data handling, embeddings generation, and a user-friendly interface for real-time interactions.","metadata":{}},{"cell_type":"markdown","source":"## 1. Setup and Installation","metadata":{}},{"cell_type":"markdown","source":"**1.1 Install Required Libraries and Packages**","metadata":{}},{"cell_type":"code","source":"import os\nimport pandas as pd\nimport numpy as np\nimport requests\nfrom bs4 import BeautifulSoup\nimport re\nimport json\nimport time\nfrom typing import List, Dict, Any\nfrom tqdm.notebook import tqdm\nfrom urllib.parse import urljoin\nfrom IPython.display import display, HTML, Markdown\nfrom datetime import datetime\nfrom sklearn.metrics.pairwise import cosine_similarity\n\n# Install required packages\n!pip install -q google-genai==1.7.0 beautifulsoup4>=4.12.0 requests>=2.31.0 lxml>=4.9.0\n\n# Import Google Generative AI\nfrom google import genai\nfrom google.genai import types\n\n# Set up API key\nfrom kaggle_secrets import UserSecretsClient\nGOOGLE_API_KEY = UserSecretsClient().get_secret(\"GOOGLE_API_KEY\")\nclient = genai.Client(api_key=GOOGLE_API_KEY)","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2025-04-20T23:06:36.691920Z","iopub.execute_input":"2025-04-20T23:06:36.692127Z","iopub.status.idle":"2025-04-20T23:06:41.318030Z","shell.execute_reply.started":"2025-04-20T23:06:36.692106Z","shell.execute_reply":"2025-04-20T23:06:41.316907Z"}},"outputs":[],"execution_count":1},{"cell_type":"markdown","source":"## 1.2 project Constants","metadata":{}},{"cell_type":"code","source":"# Project constants\nPROJECT_NAME = \"Benefit Body Spa Customer Service Agent\"\nWEBSITE_URL = \"https://benefitbodyspa.com\"\nSITEMAP_URL = \"https://benefitbodyspa.com/sitemap.xml\"\nDATA_DIR = \"data\"\nos.makedirs(DATA_DIR, exist_ok=True)\nEMBEDDING_MODEL = \"models/text-embedding-004\"\nLLM_MODEL = \"gemini-2.0-flash\"\nVECTOR_DB_PATH = f\"{DATA_DIR}/benefit_spa_vector_db.pkl\"\nSIMILARITY_THRESHOLD = 0.65","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2025-04-20T23:06:41.319502Z","iopub.execute_input":"2025-04-20T23:06:41.320066Z","iopub.status.idle":"2025-04-20T23:06:44.319728Z","shell.execute_reply.started":"2025-04-20T23:06:41.320035Z","shell.execute_reply":"2025-04-20T23:06:44.318805Z"}},"outputs":[],"execution_count":2},{"cell_type":"code","source":"# Manual information to supplement website data\nSUPPLEMENTAL_INFO = {\n    \"business_hours\": \"\"\"\n    Benefit Body Spa Business Hours:\n    Monday to Friday: 10:00 AM - 8:00 PM\n    Saturday: 10:00 AM - 5:00 PM\n    Sunday: Closed\n    \"\"\",\n    \"appointment_policy\": \"\"\"\n    Appointment Policy:\n    - Please arrive 10-15 minutes before your scheduled appointment time\n    - First-time clients should complete registration forms prior to treatment\n    - Wear comfortable clothing and avoid jewelry\n    - Please inform us of any medical conditions or concerns before treatment\n    \"\"\",\n    \"cancellation_policy\": \"\"\"\n    Cancellation Policy:\n    - Please provide at least 24 hours notice for cancellations or rescheduling\n    - Late cancellations (less than 24 hours) may incur a 50% fee\n    - No-shows will be charged the full service fee\n    - Repeated cancellations may affect future booking privileges\n    \"\"\"\n}","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2025-04-20T23:06:44.320544Z","iopub.execute_input":"2025-04-20T23:06:44.320749Z","iopub.status.idle":"2025-04-20T23:06:44.325946Z","shell.execute_reply.started":"2025-04-20T23:06:44.320728Z","shell.execute_reply":"2025-04-20T23:06:44.324688Z"}},"outputs":[],"execution_count":3},{"cell_type":"markdown","source":"## 2. Data Scraping and Preprocessing","metadata":{}},{"cell_type":"code","source":"# 2.1. XML Sitemap-Based Scraping\n\ndef fetch_url(url, retry_count=3, delay=2):\n    \"\"\"Fetch a URL with robust error handling.\"\"\"\n    headers = {\n        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'\n    }\n    \n    for attempt in range(retry_count):\n        try:\n            print(f\"Fetching URL: {url} (attempt {attempt+1}/{retry_count})\")\n            response = requests.get(url, headers=headers, timeout=30)\n            response.raise_for_status()\n            print(f\"✅ Successfully fetched {url} - {len(response.content)} bytes\")\n            return response.content\n        except requests.exceptions.RequestException as e:\n            print(f\"❌ Error fetching URL: {url} - {str(e)}\")\n            if attempt < retry_count - 1:\n                sleep_time = delay * (attempt + 1)\n                print(f\"Retrying in {sleep_time} seconds...\")\n                time.sleep(sleep_time)\n    \n    print(f\"❌ Failed to fetch URL after {retry_count} attempts: {url}\")\n    return None\n\ndef parse_sitemap_recursive(sitemap_url):\n    \"\"\"Parse sitemap.xml recursively to handle sitemap indexes.\"\"\"\n    print(f\"\\n🔍 Parsing sitemap: {sitemap_url}\")\n    content = fetch_url(sitemap_url)\n    \n    if not content:\n        print(\"❌ Failed to fetch sitemap\")\n        return []\n    \n    urls = []\n    try:\n        soup = BeautifulSoup(content, 'lxml-xml')  # Using lxml for better XML handling\n        \n        # Check if this is a sitemap index (contains other sitemaps)\n        sitemap_tags = soup.find_all('sitemap')\n        if sitemap_tags:\n            print(f\"📑 Found sitemap index with {len(sitemap_tags)} child sitemaps\")\n            for sitemap_tag in sitemap_tags:\n                loc = sitemap_tag.find('loc')\n                if loc:\n                    child_sitemap_url = loc.text.strip()\n                    # Recursively parse child sitemaps\n                    child_urls = parse_sitemap_recursive(child_sitemap_url)\n                    urls.extend(child_urls)\n        \n        # Regular sitemap with URLs\n        url_tags = soup.find_all('url')\n        if url_tags:\n            for url_tag in url_tags:\n                loc = url_tag.find('loc')\n                if loc:\n                    page_url = loc.text.strip()\n                    \n                    # Get last modification date if available\n                    lastmod = url_tag.find('lastmod')\n                    lastmod_date = lastmod.text if lastmod else None\n                    \n                    # Get priority if available\n                    priority = url_tag.find('priority')\n                    priority_value = float(priority.text) if priority else 0.5\n                    \n                    urls.append({\n                        'url': page_url,\n                        'lastmod': lastmod_date,\n                        'priority': priority_value\n                    })\n            \n            print(f\"✅ Found {len(url_tags)} URLs in sitemap\")\n    \n    except Exception as e:\n        print(f\"❌ Error parsing sitemap: {e}\")\n    \n    return urls\n\ndef extract_page_content(url_data):\n    \"\"\"Extract comprehensive content from a page.\"\"\"\n    url = url_data['url']\n    content = fetch_url(url)\n    \n    if not content:\n        return {\n            \"url\": url,\n            \"title\": \"\",\n            \"content\": \"\",\n            \"lastmod\": url_data.get('lastmod'),\n            \"priority\": url_data.get('priority', 0.5),\n            \"success\": False\n        }\n    \n    try:\n        soup = BeautifulSoup(content, 'html.parser')\n        \n        # Extract title\n        title = soup.title.string.strip() if soup.title else \"\"\n        \n        # Extract meta description\n        meta_description = \"\"\n        meta_tag = soup.find('meta', attrs={'name': 'description'})\n        if meta_tag and meta_tag.get('content'):\n            meta_description = meta_tag.get('content', '').strip()\n        \n        # Extract schema metadata for business info\n        structured_data = []\n        for script in soup.find_all('script', type='application/ld+json'):\n            try:\n                if script.string:\n                    data = json.loads(script.string)\n                    structured_data.append(data)\n            except json.JSONDecodeError:\n                pass\n        \n        # Look for business hours, address and contact info in structured data\n        business_info = \"\"\n        for data in structured_data:\n            if isinstance(data, dict):\n                # Look for business hours\n                if 'openingHoursSpecification' in data:\n                    business_info += \"BUSINESS HOURS:\\n\"\n                    hours_data = data['openingHoursSpecification']\n                    if isinstance(hours_data, list):\n                        for hours in hours_data:\n                            day = hours.get('dayOfWeek', '')\n                            opens = hours.get('opens', '')\n                            closes = hours.get('closes', '')\n                            business_info += f\"{day}: {opens} - {closes}\\n\"\n                    business_info += \"\\n\"\n                \n                # Look for address\n                if 'address' in data:\n                    business_info += \"ADDRESS:\\n\"\n                    address = data['address']\n                    if isinstance(address, dict):\n                        street = address.get('streetAddress', '')\n                        city = address.get('addressLocality', '')\n                        region = address.get('addressRegion', '')\n                        postal = address.get('postalCode', '')\n                        business_info += f\"{street}, {city}, {region} {postal}\\n\\n\"\n                \n                # Look for contact info\n                if 'telephone' in data:\n                    business_info += f\"PHONE: {data['telephone']}\\n\"\n                if 'email' in data:\n                    business_info += f\"EMAIL: {data['email']}\\n\"\n        \n        # Remove script and style elements\n        for tag in soup(['script', 'style']):\n            tag.decompose()\n        \n        # Extract all visible text with better formatting\n        text_parts = []\n        \n        # Add business info at the beginning if found\n        if business_info:\n            text_parts.append(business_info)\n        \n        # Extract headers with emphasis\n        for h_tag in soup.find_all(['h1', 'h2', 'h3']):\n            header_text = h_tag.get_text(strip=True)\n            if header_text:\n                text_parts.append(f\"SECTION: {header_text}\")\n        \n        # Try multiple content selectors\n        selectors = [\n            'main', 'article', '.content', '.entry-content',\n            'section', '.page-content', '#content',\n            '.post-content', '.page', '.main-content'\n        ]\n        \n        content_found = False\n        for selector in selectors:\n            elements = soup.select(selector)\n            if elements:\n                for element in elements:\n                    # Extract paragraphs\n                    paragraphs = element.find_all('p')\n                    for p in paragraphs:\n                        p_text = p.get_text(strip=True)\n                        if p_text and len(p_text) > 10:  # Skip very short paragraphs\n                            text_parts.append(p_text)\n                    \n                    # Extract list items\n                    list_items = element.find_all('li')\n                    if list_items:\n                        for li in list_items:\n                            li_text = li.get_text(strip=True)\n                            if li_text and len(li_text) > 5:  # Skip very short list items\n                                text_parts.append(f\"- {li_text}\")\n                \n                content_found = True\n                break  # Stop after first successful selector\n        \n        # If no content found with selectors, use body\n        if not content_found:\n            body = soup.find('body')\n            if body:\n                # Remove navigation, header, footer\n                for tag in body.find_all(['nav', 'header', 'footer']):\n                    tag.decompose()\n                \n                # Extract paragraphs\n                paragraphs = body.find_all('p')\n                for p in paragraphs:\n                    p_text = p.get_text(strip=True)\n                    if p_text and len(p_text) > 10:\n                        text_parts.append(p_text)\n                \n                # Extract list items\n                list_items = body.find_all('li')\n                for li in list_items:\n                    li_text = li.get_text(strip=True)\n                    if li_text and len(li_text) > 5:\n                        text_parts.append(f\"- {li_text}\")\n        \n        # Combine all text with proper formatting\n        if meta_description:\n            text_parts.insert(0, f\"DESCRIPTION: {meta_description}\")\n        \n        full_content = \"\\n\\n\".join(text_parts)\n        \n        # Look for contact information in the content\n        contact_patterns = {\n            'phone': r'(?:Phone|Tel|Telephone|Call)(?:\\s*(?:us|:))?\\s*(?:\\+\\d{1,2}\\s*)?(?:\\(?\\d{3}\\)?[\\s.-]?)?\\d{3}[\\s.-]?\\d{4}',\n            'email': r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}',\n            'hours': r'(?:Hours|Open|We are open)(?:\\s*(?:of operation|:))?\\s*(?:Monday|Mon|Tuesday|Tue|Wednesday|Wed|Thursday|Thu|Friday|Fri|Saturday|Sat|Sunday|Sun)',\n            'address': r'\\d+\\s+[A-Za-z0-9\\s,]+(?:Street|St|Avenue|Ave|Road|Rd|Boulevard|Blvd|Drive|Dr|Lane|Ln|Court|Ct|Way|Parkway|Pkwy|Plaza|Plz|Square|Sq)\\s*,\\s*[A-Za-z\\s]+,\\s*[A-Z]{2}\\s*\\d{5}'\n        }\n        \n        for info_type, pattern in contact_patterns.items():\n            matches = re.findall(pattern, full_content, re.IGNORECASE)\n            if matches and info_type not in full_content.upper():\n                additional_info = f\"{info_type.upper()}: {matches[0]}\\n\"\n                full_content = additional_info + full_content\n        \n        result = {\n            \"url\": url,\n            \"title\": title,\n            \"content\": full_content,\n            \"lastmod\": url_data.get('lastmod'),\n            \"priority\": url_data.get('priority', 0.5),\n            \"success\": bool(full_content)\n        }\n        \n        print(f\"✅ Successfully extracted content from {url}: {len(full_content)} chars\")\n        return result\n    except Exception as e:\n        print(f\"❌ Error extracting content from {url}: {e}\")\n        return {\n            \"url\": url,\n            \"title\": \"\",\n            \"content\": \"\",\n            \"lastmod\": url_data.get('lastmod'),\n            \"priority\": url_data.get('priority', 0.5),\n            \"success\": False\n        }\n\ndef scrape_website_from_sitemap(max_urls=None):\n    \"\"\"Scrape website content based on sitemap.xml with detailed logging.\"\"\"\n    print(\"\\n🔍 Starting complete website scraping from sitemap.xml\")\n    \n    # Get all URLs from sitemap\n    sitemap_urls = parse_sitemap_recursive(SITEMAP_URL)\n    \n    if not sitemap_urls:\n        print(\"❌ No URLs found in sitemap.xml\")\n        return None\n    \n    # Sort URLs by priority (if available)\n    sitemap_urls.sort(key=lambda x: x.get('priority', 0.5), reverse=True)\n    \n    if max_urls:\n        sitemap_urls = sitemap_urls[:max_urls]\n        print(f\"Limiting to {max_urls} URLs\")\n    \n    print(f\"📋 Processing {len(sitemap_urls)} URLs from sitemap\")\n    \n    results = []\n    for url_data in tqdm(sitemap_urls, desc=\"Scraping pages from sitemap\"):\n        page_data = extract_page_content(url_data)\n        if page_data[\"success\"]:\n            results.append(page_data)\n    \n    if not results:\n        print(\"❌ No content extracted from any page\")\n        return None\n    \n    df = pd.DataFrame(results)\n    df.to_csv(f\"{DATA_DIR}/spa_content_from_sitemap.csv\", index=False)\n    print(f\"✅ Successfully scraped {len(df)} pages from sitemap\")\n    \n    # Print a summary of scraped pages\n    print(\"\\n📊 Website Scraping Summary:\")\n    print(f\"Total URLs in sitemap: {len(sitemap_urls)}\")\n    print(f\"Successfully scraped: {len(df)} pages\")\n    print(f\"Failed: {len(sitemap_urls) - len(df)} pages\")\n    \n    return df","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2025-04-20T23:06:44.352033Z","iopub.execute_input":"2025-04-20T23:06:44.352328Z","iopub.status.idle":"2025-04-20T23:06:44.379827Z","shell.execute_reply.started":"2025-04-20T23:06:44.352303Z","shell.execute_reply":"2025-04-20T23:06:44.378690Z"}},"outputs":[],"execution_count":5},{"cell_type":"code","source":"# 2.2 Text Processing\ndef clean_text(text):\n    \"\"\"Clean and normalize text.\"\"\"\n    if not text:\n        return \"\"\n    \n    # Replace multiple spaces, newlines, tabs with single space\n    text = re.sub(r'\\s+', ' ', text)\n    # Remove HTML remnants\n    text = re.sub(r'<[^>]+>', '', text)\n    # Fix spacing around punctuation\n    text = re.sub(r'\\s+([.,;:!?])', r'\\1', text)\n    text = re.sub(r'([.,;:!?])([^\\s])', r'\\1 \\2', text)\n    \n    return text.strip()\n\ndef create_chunks(text, chunk_size=800, overlap=200):\n    \"\"\"Split text into chunks with semantic boundaries.\"\"\"\n    if not text or len(text) < 100:\n        return []\n    \n    text = clean_text(text)\n    chunks = []\n    start = 0\n    text_len = len(text)\n    \n    while start < text_len:\n        end = min(start + chunk_size, text_len)\n        \n        # Try to find a good break point\n        if end < text_len:\n            # Look for semantic boundaries\n            section_break = text.find(\"SECTION:\", start, end)\n            paragraph_break = text.rfind('\\n\\n', start, end)\n            sentence_break = max(\n                text.rfind('. ', start, end),\n                text.rfind('? ', start, end),\n                text.rfind('! ', start, end)\n            )\n            \n            if section_break != -1 and section_break > start + (chunk_size / 2):\n                end = section_break\n            elif paragraph_break != -1 and paragraph_break > start + (chunk_size / 2):\n                end = paragraph_break + 2\n            elif sentence_break != -1 and sentence_break > start + (chunk_size / 4):\n                end = sentence_break + 2\n        \n        chunk = text[start:end].strip()\n        if chunk:\n            chunks.append(chunk)\n        \n        start = max(start + 1, end - overlap)\n    \n    return chunks\n\ndef add_supplemental_info():\n    \"\"\"Create chunks from supplemental information.\"\"\"\n    supplemental_chunks = []\n    \n    for info_type, content in SUPPLEMENTAL_INFO.items():\n        chunk_data = {\n            'id': f\"supp_{info_type}\",\n            'url': f\"{WEBSITE_URL}/{info_type}\",\n            'title': f\"Benefit Body Spa {info_type.replace('_', ' ').title()}\",\n            'chunk_number': 1,\n            'total_chunks': 1,\n            'chunk_text': f\"Page: {info_type.replace('_', ' ').title()}\\nURL: {WEBSITE_URL}/{info_type}\\n\\n{content}\",\n            'original_text': content\n        }\n        supplemental_chunks.append(chunk_data)\n    \n    return supplemental_chunks\n\ndef process_content_to_chunks(df):\n    \"\"\"Process content into chunks with metadata.\"\"\"\n    if df is None or len(df) == 0:\n        print(\"❌ No content to process\")\n        return None\n    \n    all_chunks = []\n    for i, row in tqdm(df.iterrows(), total=len(df), desc=\"Creating chunks\"):\n        url = row['url']\n        title = row['title']\n        content = row['content']\n        \n        if not content or len(content) < 100:\n            print(f\"⚠️ Skipping {url}: content too short\")\n            continue\n        \n        chunks = create_chunks(content)\n        \n        if not chunks:\n            print(f\"⚠️ No chunks created for {url}\")\n            continue\n        \n        for j, chunk in enumerate(chunks):\n            chunk_id = f\"{i}_{j}\"\n            context_header = f\"Page: {title}\\nURL: {url}\\nChunk {j+1} of {len(chunks)}\\n\\n\"\n            \n            chunk_data = {\n                'id': chunk_id,\n                'url': url,\n                'title': title,\n                'chunk_number': j+1,\n                'total_chunks': len(chunks),\n                'chunk_text': context_header + chunk,\n                'original_text': chunk\n            }\n            all_chunks.append(chunk_data)\n    \n    # Add supplemental info chunks\n    supplemental_chunks = add_supplemental_info()\n    all_chunks.extend(supplemental_chunks)\n    \n    if not all_chunks:\n        print(\"❌ No chunks created\")\n        return None\n    \n    chunks_df = pd.DataFrame(all_chunks)\n    chunks_df.to_csv(f\"{DATA_DIR}/spa_chunks.csv\", index=False)\n    print(f\"✅ Created {len(chunks_df)} chunks (including {len(supplemental_chunks)} supplemental)\")\n    \n    return chunks_df","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2025-04-20T23:06:44.389011Z","iopub.execute_input":"2025-04-20T23:06:44.389332Z","iopub.status.idle":"2025-04-20T23:06:44.404389Z","shell.execute_reply.started":"2025-04-20T23:06:44.389306Z","shell.execute_reply":"2025-04-20T23:06:44.403452Z"}},"outputs":[],"execution_count":6},{"cell_type":"markdown","source":"## 4. Embedding Generation","metadata":{}},{"cell_type":"code","source":"# 4. Embedding Generation\nclass EmbeddingManager:\n    \"\"\"Manage embedding generation with robust error handling.\"\"\"\n    \n    def __init__(self, client, model=EMBEDDING_MODEL):\n        self.client = client\n        self.model = model\n        self.dimensions = None\n    \n    def generate_embeddings(self, texts, batch_size=5):\n        \"\"\"Generate embeddings for texts with batching.\"\"\"\n        all_embeddings = []\n        \n        # Process in batches\n        for i in range(0, len(texts), batch_size):\n            batch_texts = texts[i:i+batch_size]\n            batch_num = i//batch_size + 1\n            total_batches = (len(texts) + batch_size - 1)//batch_size\n            \n            print(f\"Generating embeddings batch {batch_num}/{total_batches}\")\n            \n            # Retry logic for API resilience\n            for attempt in range(3):\n                try:\n                    response = self.client.models.embed_content(\n                        model=self.model,\n                        contents=batch_texts,\n                        config=types.EmbedContentConfig(task_type='semantic_similarity')\n                    )\n                    \n                    batch_embeddings = [e.values for e in response.embeddings]\n                    \n                    if self.dimensions is None and batch_embeddings:\n                        self.dimensions = len(batch_embeddings[0])\n                        print(f\"✅ Embedding dimensions: {self.dimensions}\")\n                    \n                    all_embeddings.extend(batch_embeddings)\n                    \n                    # Rate limiting\n                    if i + batch_size < len(texts):\n                        time.sleep(0.5)\n                    \n                    break  # Success\n                except Exception as e:\n                    print(f\"❌ Error in batch {batch_num}/{total_batches} (attempt {attempt+1}/3): {e}\")\n                    if attempt < 2:\n                        wait_time = (attempt + 1) * 2\n                        print(f\"Retrying in {wait_time} seconds...\")\n                        time.sleep(wait_time)\n                    else:\n                        print(f\"❌ Failed to generate embeddings for batch. Using zero vectors.\")\n                        # Add zero vectors as placeholders\n                        all_embeddings.extend([[0.0] * (self.dimensions or 768) for _ in range(len(batch_texts))])\n        \n        return all_embeddings","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2025-04-20T23:59:30.567667Z","iopub.execute_input":"2025-04-20T23:59:30.568044Z","iopub.status.idle":"2025-04-20T23:59:30.577046Z","shell.execute_reply.started":"2025-04-20T23:59:30.568015Z","shell.execute_reply":"2025-04-20T23:59:30.576142Z"}},"outputs":[],"execution_count":26},{"cell_type":"markdown","source":"# 5. Vector Database","metadata":{}},{"cell_type":"code","source":"# 5.1 Vector Database\nclass VectorDatabase:\n    \"\"\"Vector database for semantic search.\"\"\"\n    \n    def __init__(self, client, embedding_model=EMBEDDING_MODEL):\n        self.client = client\n        self.embedding_model = embedding_model\n        self.embedding_manager = EmbeddingManager(client, embedding_model)\n        self.chunks = None\n        self.embeddings = None\n        self.initialized = False\n    \n    def initialize(self, chunks_df):\n        \"\"\"Initialize the vector database from chunks.\"\"\"\n        print(\"🔄 Initializing vector database...\")\n        \n        # Extract chunk texts for embedding\n        texts = chunks_df['chunk_text'].tolist()\n        \n        # Generate embeddings\n        print(f\"📊 Generating embeddings for {len(texts)} chunks\")\n        embeddings = self.embedding_manager.generate_embeddings(texts)\n        \n        # Store data\n        self.chunks = chunks_df.to_dict('records')\n        self.embeddings = np.array(embeddings)\n        self.initialized = True\n        \n        print(f\"✅ Vector database initialized with {len(self.chunks)} chunks\")\n        \n        # Save to disk\n        self.export_to_pickle()\n        \n        return self\n    \n    def search(self, query, top_k=5, threshold=SIMILARITY_THRESHOLD):\n        \"\"\"Search for relevant chunks with improved retrieval.\"\"\"\n        if not self.initialized:\n            print(\"❌ Vector database not initialized\")\n            return []\n        \n        # Generate embedding for query\n        query_embedding = self.embedding_manager.generate_embeddings([query])[0]\n        query_embedding_array = np.array([query_embedding])\n        \n        # Calculate cosine similarity\n        similarities = cosine_similarity(query_embedding_array, self.embeddings)[0]\n        \n        # Create results with similarity scores\n        results = []\n        for i, score in enumerate(similarities):\n            if score >= threshold:\n                result = dict(self.chunks[i])\n                result['similarity'] = float(score)\n                results.append(result)\n        \n        # Sort by similarity and get top_k\n        results.sort(key=lambda x: x['similarity'], reverse=True)\n        results = results[:top_k]\n        \n        # Special handling for business hours, appointment, and cancellation policy\n        if len(results) == 0:\n            query_lower = query.lower()\n            \n            # Check if query is about hours\n            if any(term in query_lower for term in ['hour', 'open', 'close', 'time', 'schedule']):\n                print(\"⚠️ Fallback: Using supplemental business hours info\")\n                for chunk in self.chunks:\n                    if 'id' in chunk and 'supp_business_hours' in chunk['id']:\n                        results.append({**chunk, 'similarity': 1.0})\n            \n            # Check if query is about appointments\n            elif any(term in query_lower for term in ['appointment', 'arrive', 'early', 'late', 'booking']):\n                print(\"⚠️ Fallback: Using supplemental appointment info\")\n                for chunk in self.chunks:\n                    if 'id' in chunk and 'supp_appointment_policy' in chunk['id']:\n                        results.append({**chunk, 'similarity': 1.0})\n            \n            # Check if query is about cancellation\n            elif any(term in query_lower for term in ['cancel', 'reschedule', 'policy']):\n                print(\"⚠️ Fallback: Using supplemental cancellation info\")\n                for chunk in self.chunks:\n                    if 'id' in chunk and 'supp_cancellation_policy' in chunk['id']:\n                        results.append({**chunk, 'similarity': 1.0})\n        \n        print(f\"🔍 Found {len(results)} relevant chunks for query: '{query}'\")\n        return results\n    \n    def export_to_pickle(self, path=VECTOR_DB_PATH):\n        \"\"\"Export the vector database to a pickle file.\"\"\"\n        if not self.initialized:\n            print(\"❌ Cannot export: vector database not initialized\")\n            return\n        \n        data = {\n            'chunks': self.chunks,\n            'embeddings': self.embeddings,\n            'metadata': {\n                'model': self.embedding_model,\n                'count': len(self.chunks),\n                'dimensions': self.embeddings.shape[1],\n                'created_at': datetime.now().strftime(\"%Y-%m-%d %H:%M:%S\")\n            }\n        }\n        \n        pd.to_pickle(data, path)\n        print(f\"💾 Vector database exported to {path}\")\n    \n    def import_from_pickle(self, path=VECTOR_DB_PATH):\n        \"\"\"Import the vector database from a pickle file.\"\"\"\n        if os.path.exists(path):\n            try:\n                data = pd.read_pickle(path)\n                self.chunks = data['chunks']\n                self.embeddings = data['embeddings']\n                self.initialized = True\n                \n                metadata = data.get('metadata', {})\n                print(f\"📥 Vector database imported from {path}\")\n                print(f\"   Model: {metadata.get('model', 'unknown')}\")\n                print(f\"   Chunks: {metadata.get('count', len(self.chunks))}\")\n                print(f\"   Dimensions: {metadata.get('dimensions', self.embeddings.shape[1])}\")\n                print(f\"   Created: {metadata.get('created_at', 'unknown')}\")\n                \n                return True\n            except Exception as e:\n                print(f\"❌ Error importing vector database: {e}\")\n                return False\n        else:\n            print(f\"❌ File not found: {path}\")\n            return False","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2025-04-21T00:01:15.544645Z","iopub.execute_input":"2025-04-21T00:01:15.544998Z","iopub.status.idle":"2025-04-21T00:01:15.561056Z","shell.execute_reply.started":"2025-04-21T00:01:15.544976Z","shell.execute_reply":"2025-04-21T00:01:15.560312Z"}},"outputs":[],"execution_count":28},{"cell_type":"markdown","source":"# 6. Response Generation","metadata":{}},{"cell_type":"code","source":"# 6 Response Generation\ndef generate_response(vector_db, query, history=None):\n    \"\"\"Generate a response using vector database and LLM.\"\"\"\n    if not vector_db or not vector_db.initialized:\n        return \"I'm sorry, but my knowledge base is not initialized. Please try again later.\"\n    \n    # Search for relevant chunks\n    results = vector_db.search(query, top_k=3)\n    \n    # Prepare context from search results\n    context = \"\"\n    if results:\n        for i, result in enumerate(results):\n            context += f\"\\nInformation {i+1}:\\n{result['chunk_text']}\\n\"\n    else:\n        # If no results, provide a general response\n        return (\n            \"I'm sorry, but I don't have specific information about that. \"\n            \"For the most accurate and up-to-date information, please contact Benefit Body Spa directly \"\n            \"at (506) 454-5403 or email info@benefitbodyspa.com.\"\n        )\n    \n    # Prepare the prompt\n    system_prompt = \"\"\"You are a helpful customer service assistant for Benefit Body Spa. \nAnswer the user's question based on the information provided. \nIf the information doesn't fully address the question, acknowledge what you know and suggest contacting the spa directly for more details.\nKeep your responses friendly, professional, and focused on the customer's needs.\nFormat your response in a clear, easy-to-read manner.\"\"\"\n    \n    messages = [\n        (\"system\", system_prompt),\n        (\"user\", f\"Here is information about Benefit Body Spa:\\n{context}\\n\\nBased on this information, please answer the following question: {query}\")\n    ]\n    \n    # Generate response\n    response = client.models.generate_content(\n        model=LLM_MODEL,\n        contents=messages\n    )\n    \n    return response.text","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2025-04-20T23:06:44.457468Z","iopub.execute_input":"2025-04-20T23:06:44.457740Z","iopub.status.idle":"2025-04-20T23:06:44.479437Z","shell.execute_reply.started":"2025-04-20T23:06:44.457718Z","shell.execute_reply":"2025-04-20T23:06:44.478158Z"}},"outputs":[],"execution_count":9},{"cell_type":"markdown","source":"# 7. Main Functions","metadata":{}},{"cell_type":"code","source":"# 7. Main Functions\ndef create_knowledge_base(from_scratch=False):\n    \"\"\"Create and initialize the knowledge base.\"\"\"\n    vector_db = VectorDatabase(client)\n    \n    # Try to load existing database first\n    if not from_scratch and os.path.exists(VECTOR_DB_PATH):\n        if vector_db.import_from_pickle():\n            print(\"📌 Using existing vector database\")\n            return vector_db\n    \n    # Otherwise build from scratch\n    print(\"🔄 Building new vector database...\")\n    \n    # Scrape and process data\n    spa_data = scrape_website_from_sitemap()\n    \n    if spa_data is None or len(spa_data) == 0:\n        print(\"❌ No data scraped. Cannot build vector database.\")\n        return None\n    \n    # Process content into chunks\n    chunks_df = process_content_to_chunks(spa_data)\n    \n    if chunks_df is None or len(chunks_df) == 0:\n        print(\"❌ No chunks created. Cannot build vector database.\")\n        return None\n    \n    # Initialize database with chunks\n    vector_db.initialize(chunks_df)\n    \n    return vector_db\n\ndef test_knowledge_base(vector_db):\n    \"\"\"Test the knowledge base with sample queries.\"\"\"\n    if not vector_db or not vector_db.initialized:\n        print(\"❌ Vector database not initialized\")\n        return\n    \n    test_queries = [\n        \"What massage services do you offer?\",\n        \"How much does a facial cost?\",\n        \"What are your business hours?\",\n        \"Do I need to arrive early for my appointment?\",\n        \"What is your cancellation policy?\",\n        \"Do you offer financing options?\",\n        \"Where are you located?\",\n        \"What treatments do you offer for skin rejuvenation?\",\n        \"Tell me about your Emsculpt treatments\",\n        \"Do you have laser hair removal services?\"\n    ]\n    \n    print(\"\\n=== Testing Knowledge Base with Sample Queries ===\")\n    \n    for query in test_queries:\n        print(f\"\\n🔍 Query: {query}\")\n        \n        # Search for relevant chunks\n        results = vector_db.search(query, top_k=2)\n        \n        if results:\n            for i, result in enumerate(results):\n                print(f\"\\nResult {i+1} (Similarity: {result['similarity']:.4f}):\")\n                print(f\"Title: {result['title']}\")\n                print(f\"URL: {result['url']}\")\n                # Print a preview of the content\n                content_preview = result['original_text'][:100] + \"...\" if len(result['original_text']) > 100 else result['original_text']\n                print(f\"Content: {content_preview}\")\n                \n            # Generate a response\n            response = generate_response(vector_db, query)\n            print(f\"\\n💬 Chatbot Response:\\n{response}\")\n        else:\n            print(\"❌ No relevant results found\")\n\ndef chat_with_spa_agent(vector_db):\n    \"\"\"Interactive chat with the spa agent.\"\"\"\n    if not vector_db or not vector_db.initialized:\n        print(\"❌ Vector database not initialized\")\n        return\n    \n    print(\"\\n=== Benefit Body Spa Customer Service Agent ===\")\n    print(\"Ask any question about our services, or type 'exit' to quit.\\n\")\n    \n    while True:\n        query = input(\"You: \")\n        if query.lower() in ['exit', 'quit', 'bye']:\n            print(\"\\nThank you for chatting with us!\")\n            break\n        \n        response = generate_response(vector_db, query)\n        print(f\"\\nSpa Agent: {response}\\n\")\n\ndef analyze_knowledge_coverage(vector_db):\n    \"\"\"Analyze the coverage of the knowledge base.\"\"\"\n    if not vector_db or not vector_db.initialized:\n        print(\"❌ Vector database not initialized\")\n        return\n    \n    # Get statistics about chunks\n    pages = set()\n    url_chunks = {}\n    \n    for chunk in vector_db.chunks:\n        url = chunk['url']\n        pages.add(url)\n        \n        if url not in url_chunks:\n            url_chunks[url] = []\n        \n        url_chunks[url].append(chunk)\n    \n    # Print statistics\n    print(\"\\n📊 Knowledge Base Coverage Analysis:\")\n    print(f\"Total pages: {len(pages)}\")\n    print(f\"Total chunks: {len(vector_db.chunks)}\")\n    print(f\"Average chunks per page: {len(vector_db.chunks)/len(pages):.2f}\")\n    \n    # Pages with most chunks\n    sorted_pages = sorted(url_chunks.items(), key=lambda x: len(x[1]), reverse=True)\n    \n    print(\"\\n📑 Pages with Most Content:\")\n    for url, chunks in sorted_pages[:5]:\n        # Find a chunk to get the title\n        title = chunks[0]['title'] if chunks else \"Unknown\"\n        print(f\"- {title} ({url}): {len(chunks)} chunks\")\n    \n    # Test some important business queries\n    important_queries = [\n        \"What are your hours?\",\n        \"How can I book an appointment?\",\n        \"What's your cancellation policy?\",\n        \"Where are you located?\",\n        \"What types of payment do you accept?\",\n        \"Do you offer gift cards?\"\n    ]\n    \n    print(\"\\n🔍 Testing Important Business Queries:\")\n    coverage_score = 0\n    \n    for query in important_queries:\n        results = vector_db.search(query, top_k=1)\n        if results:\n            coverage_score += 1\n            print(f\"✅ '{query}' - Found relevant information\")\n        else:\n            print(f\"❌ '{query}' - No relevant information found\")\n    \n    coverage_percentage = (coverage_score / len(important_queries)) * 100\n    print(f\"\\nBusiness Information Coverage: {coverage_percentage:.1f}%\")\n\n# Run the main process\ndef main():\n    \"\"\"Main function to run the entire process.\"\"\"\n    print(\"🏁 Starting Benefit Body Spa Customer Service Agent\")\n    \n    # Check if we should rebuild from scratch\n    rebuild = input(\"Rebuild knowledge base from scratch? (y/n): \").lower() == 'y'\n    \n    # Create knowledge base\n    vector_db = create_knowledge_base(from_scratch=rebuild)\n    \n    if not vector_db or not vector_db.initialized:\n        print(\"❌ Failed to create knowledge base\")\n        return\n    \n    # Analyze knowledge coverage\n    analyze_knowledge_coverage(vector_db)\n    \n    # Test knowledge base\n    test_knowledge_base(vector_db)\n    \n    # Ask if user wants to start chatting\n    start_chat = input(\"\\nStart interactive chat? (y/n): \").lower() == 'y'\n    if start_chat:\n        chat_with_spa_agent(vector_db)\n    \n    print(\"🏁 Process completed\")\n\n# Run the main process if executed directly\nif __name__ == \"__main__\":\n    main()","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2025-04-21T00:02:12.459863Z","iopub.execute_input":"2025-04-21T00:02:12.460165Z","iopub.status.idle":"2025-04-21T00:02:20.205469Z","shell.execute_reply.started":"2025-04-21T00:02:12.460145Z","shell.execute_reply":"2025-04-21T00:02:20.204554Z"}},"outputs":[{"name":"stdout","text":"🏁 Starting Benefit Body Spa Customer Service Agent\n","output_type":"stream"},{"traceback":["\u001b[0;31m---------------------------------------------------------------------------\u001b[0m","\u001b[0;31mKeyboardInterrupt\u001b[0m                         Traceback (most recent call last)","\u001b[0;32m/tmp/ipykernel_372/2571548771.py\u001b[0m in \u001b[0;36m<cell line: 0>\u001b[0;34m()\u001b[0m\n\u001b[1;32m    180\u001b[0m \u001b[0;31m# Run the main process if executed directly\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    181\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0m__name__\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;34m\"__main__\"\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 182\u001b[0;31m     \u001b[0mmain\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m","\u001b[0;32m/tmp/ipykernel_372/2571548771.py\u001b[0m in \u001b[0;36mmain\u001b[0;34m()\u001b[0m\n\u001b[1;32m    156\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    157\u001b[0m     \u001b[0;31m# Check if we should rebuild from scratch\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 158\u001b[0;31m     \u001b[0mrebuild\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0minput\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"Rebuild knowledge base from scratch? (y/n): \"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mlower\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;34m'y'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    159\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    160\u001b[0m     \u001b[0;31m# Create knowledge base\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n","\u001b[0;32m/usr/local/lib/python3.11/dist-packages/ipykernel/kernelbase.py\u001b[0m in \u001b[0;36mraw_input\u001b[0;34m(self, prompt)\u001b[0m\n\u001b[1;32m   1175\u001b[0m                 \u001b[0;34m\"raw_input was called, but this frontend does not support input requests.\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1176\u001b[0m             )\n\u001b[0;32m-> 1177\u001b[0;31m         return self._input_request(\n\u001b[0m\u001b[1;32m   1178\u001b[0m             \u001b[0mstr\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mprompt\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1179\u001b[0m             \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_parent_ident\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m\"shell\"\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n","\u001b[0;32m/usr/local/lib/python3.11/dist-packages/ipykernel/kernelbase.py\u001b[0m in \u001b[0;36m_input_request\u001b[0;34m(self, prompt, ident, parent, password)\u001b[0m\n\u001b[1;32m   1217\u001b[0m             \u001b[0;32mexcept\u001b[0m \u001b[0mKeyboardInterrupt\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1218\u001b[0m                 \u001b[0;31m# re-raise KeyboardInterrupt, to truncate traceback\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1219\u001b[0;31m                 \u001b[0;32mraise\u001b[0m \u001b[0mKeyboardInterrupt\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"Interrupted by user\"\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   1220\u001b[0m             \u001b[0;32mexcept\u001b[0m \u001b[0mException\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1221\u001b[0m                 \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mlog\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mwarning\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"Invalid Message:\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mexc_info\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n","\u001b[0;31mKeyboardInterrupt\u001b[0m: Interrupted by user"],"ename":"KeyboardInterrupt","evalue":"Interrupted by user","output_type":"error"}],"execution_count":29},{"cell_type":"markdown","source":"**7.2 Save Vector DB**","metadata":{}},{"cell_type":"code","source":"#7.2 Save Vector DB\n\n# Cell 1 - Save Vector DB to persistent storage\nimport shutil\n\n# Option 1: Save to Google Drive (mount first)\n#from google.colab import drive\n#drive.mount('/content/drive')\n#shutil.copy(VECTOR_DB_PATH, '/content/drive/MyDrive/benefit_spa_vector_db.pkl')\n\n# Option 2: Save to Kaggle Dataset (if running in Kaggle)\n#import opendatasets as od\n#od.download_kaggle_dataset('bellosabur/benefit-spa-vector-db')\n\n# Option 3: Download directly from notebook environment\nfrom IPython.display import FileLink\nFileLink(VECTOR_DB_PATH)  # Creates downloadable link","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2025-04-21T00:03:41.401108Z","iopub.execute_input":"2025-04-21T00:03:41.401407Z","iopub.status.idle":"2025-04-21T00:03:41.407353Z","shell.execute_reply.started":"2025-04-21T00:03:41.401389Z","shell.execute_reply":"2025-04-21T00:03:41.406434Z"}},"outputs":[{"execution_count":31,"output_type":"execute_result","data":{"text/plain":"/kaggle/working/data/benefit_spa_vector_db.pkl","text/html":"<a href='data/benefit_spa_vector_db.pkl' target='_blank'>data/benefit_spa_vector_db.pkl</a><br>"},"metadata":{}}],"execution_count":31},{"cell_type":"markdown","source":"**7.3 Deployment Preparation**","metadata":{}},{"cell_type":"code","source":"# Cell 2 - Deployment Preparation\ndef export_for_deployment(vector_db):\n    \"\"\"Package components for production deployment\"\"\"\n    deployment_package = {\n        'vector_db_path': VECTOR_DB_PATH,\n        'config': {\n            'EMBEDDING_MODEL': EMBEDDING_MODEL,\n            'LLM_MODEL': LLM_MODEL,\n            'SIMILARITY_THRESHOLD': SIMILARITY_THRESHOLD\n        },\n        'api_example': '''\n        # FastAPI Example Endpoint\n        from fastapi import FastAPI\n        app = FastAPI()\n        \n        @app.post(\"/chat\")\n        async def chat_endpoint(query: str):\n            return generate_response(vector_db, query)\n        '''\n    }\n    \n    print(\"Ready for integration with:\")\n    print(\"- Jane App: Use FastAPI example to create endpoints\")\n    print(\"- POS Systems: Export vector_db.pkl and config to POS environment\")\n    print(\"- Web Chat: Build frontend that calls generate_response()\")\n    return deployment_package\n\n# Execute deployment prep\ndeployment = export_for_deployment(vector_db)  # Removed the undefined parameter","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2025-04-21T00:05:32.139083Z","iopub.execute_input":"2025-04-21T00:05:32.139387Z","iopub.status.idle":"2025-04-21T00:05:32.146303Z","shell.execute_reply.started":"2025-04-21T00:05:32.139365Z","shell.execute_reply":"2025-04-21T00:05:32.144892Z"}},"outputs":[{"name":"stdout","text":"Ready for integration with:\n- Jane App: Use FastAPI example to create endpoints\n- POS Systems: Export vector_db.pkl and config to POS environment\n- Web Chat: Build frontend that calls generate_response()\n","output_type":"stream"}],"execution_count":32},{"cell_type":"code","source":"!pip install gradio -q\nimport gradio as gr\n\ndef enhanced_chat(query, history):\n    # Step 1: LLM processes raw input\n    processed_query = client.models.generate_content(\n        model=LLM_MODEL,\n        contents=[f\"Rephrase this spa service query for better search: {query}\"]\n    ).text\n    \n    # Step 2: Vector DB search\n    results = vector_db.search(processed_query)\n    \n    # Step 3: LLM generates response\n    context = \"\\n\".join([r['chunk_text'] for r in results])\n    response = client.models.generate_content(\n        model=LLM_MODEL,\n        contents=[f\"\"\"Answer this spa query concisely: {query} \n                  using only: {context}\"\"\"\n        ]\n    ).text\n    \n    # Improved follow-up suggestions\n    follow_ups = client.models.generate_content(\n        model=LLM_MODEL,\n        contents=[f\"\"\"Generate 2-3 natural follow-up questions about Benefit Body Spa services \n                  based on this response: '{response}'. Format as a bullet list without numbering.\"\"\"\n        ]\n    ).text\n    \n    # Clean up the output\n    follow_ups = [q.strip('•- ') for q in follow_ups.split('\\n') if q.strip()]\n    return response, follow_ups[:3]  # Return response + max 3 suggestions\n\nwith gr.Blocks() as demo:\n    gr.Markdown(\"## Benefit Body Spa - LLM-Powered Assistant\")\n    \n    with gr.Row():\n        chatbot = gr.Chatbot(height=400, bubble_full_width=False)\n        with gr.Accordion(\"💡 Suggested Questions\", open=False):\n            followup = gr.HTML()  # Using HTML for better formatting\n    \n    msg = gr.Textbox(label=\"Type your question...\", placeholder=\"Ask about services, pricing, or policies\")\n    btn = gr.Button(\"Send\", variant=\"primary\")\n    \n    def respond(message, chat_history):\n        response, suggestions = enhanced_chat(message, chat_history)\n        chat_history.append((message, response))\n        \n        # Format suggestions as clickable buttons\n        suggestions_html = \"<div style='margin-top:10px'>\" + \\\n                           \"\".join([f\"\"\"<button style='margin:5px; padding:8px 12px; \n                                       border-radius:5px; cursor:pointer'\n                                       onclick='this.innerHTML=this.innerHTML'>\n                                       {q}\n                                       </button><br>\"\"\" \n                                   for q in suggestions]) + \\\n                           \"</div>\"\n        \n        return \"\", chat_history, suggestions_html\n    \n    btn.click(respond, [msg, chatbot], [msg, chatbot, followup])\n    msg.submit(respond, [msg, chatbot], [msg, chatbot, followup])\n\ndemo.launch(share=True, debug=True)","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2025-04-21T00:12:08.351489Z","iopub.execute_input":"2025-04-21T00:12:08.351795Z"}},"outputs":[{"name":"stderr","text":"/tmp/ipykernel_372/3414328118.py:39: UserWarning: You have not specified a value for the `type` parameter. Defaulting to the 'tuples' format for chatbot messages, but this is deprecated and will be removed in a future version of Gradio. Please set type='messages' instead, which uses openai-style dictionaries with 'role' and 'content' keys.\n  chatbot = gr.Chatbot(height=400, bubble_full_width=False)\n/tmp/ipykernel_372/3414328118.py:39: DeprecationWarning: The 'bubble_full_width' parameter is deprecated and will be removed in a future version. This parameter no longer has any effect.\n  chatbot = gr.Chatbot(height=400, bubble_full_width=False)\n","output_type":"stream"},{"name":"stdout","text":"* Running on local URL:  http://127.0.0.1:7861\n* Running on public URL: https://5c38f15ce56f45caf2.gradio.live\n\nThis share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)\n","output_type":"stream"},{"output_type":"display_data","data":{"text/plain":"<IPython.core.display.HTML object>","text/html":"<div><iframe src=\"https://5c38f15ce56f45caf2.gradio.live\" width=\"100%\" height=\"500\" allow=\"autoplay; camera; microphone; clipboard-read; clipboard-write;\" frameborder=\"0\" allowfullscreen></iframe></div>"},"metadata":{}},{"name":"stdout","text":"Generating embeddings batch 1/1\n🔍 Found 5 relevant chunks for query: 'Here are a few ways to rephrase the spa service query \"price ofn facial\" for better search results, depending on what you're looking for:\n\n**More General:**\n\n*   **Facial price**\n*   **Cost of facial**\n*   **How much does a facial cost?**\n*   **Facial prices near me** (If you want local results)\n\n**More Specific (if you have a location in mind):**\n\n*   **Facial cost [city name]**\n*   **Price of facial at [spa name]**\n*   **[Spa name] facial prices**\n\n**Even More Specific (if you know the type of facial):**\n\n*   **Price of [type of facial]** (e.g., \"Price of hydrating facial\")\n*   **Cost of [type of facial] near me**\n\n**Why these are better:**\n\n*   **Eliminate the typo:** \"ofn\" is likely a typo and should be corrected to \"of\".\n*   **Clear and concise language:** Using keywords like \"price\" and \"cost\" is straightforward.\n*   **Local focus:** Adding \"near me\" or a city name helps narrow down results to your area.\n*   **Specificity:** Specifying the type of facial will give you more targeted pricing.\n'\nGenerating embeddings batch 1/1\n🔍 Found 5 relevant chunks for query: 'Here are some ways to rephrase the spa service query \"and emsella\" for better search, depending on what the user is looking for:\n\n**If the user is looking for spas that offer Emsella treatments:**\n\n*   **Emsella spa treatments**\n*   **Spas with Emsella**\n*   **Find Emsella near me** (if location is important)\n*   **Emsella for urinary incontinence spa** (if they are specifying the purpose)\n*   **Pelvic floor strengthening spa treatments** (if they don't know the name Emsella but know the benefit)\n*   **BTL Emsella treatment** (if they know the manufacturer)\n\n**If the user is looking for information about Emsella itself:**\n\n*   **What is Emsella?**\n*   **Emsella treatment cost**\n*   **Emsella reviews**\n*   **Emsella before and after**\n*   **Emsella benefits**\n*   **Emsella side effects**\n\n**If the user wants to combine Emsella with other spa services:**\n\n*   **Emsella and [another spa treatment, e.g., massage]**\n*   **Spas offering Emsella and [another spa treatment]**\n*   **Emsella treatment packages**\n\n**Why is \"and emsella\" a bad search query?**\n\n*   \"And\" is often ignored by search engines.\n*   \"Emsella\" alone is a fine search term if they just want general information.\n*   Adding context like \"spa\" or \"treatment\" will drastically improve results.\n\nThe best rephrased query depends entirely on the user's intent. The options above provide a range of possibilities based on different potential needs.\n'\n","output_type":"stream"}],"execution_count":null},{"cell_type":"code","source":"","metadata":{"trusted":true},"outputs":[],"execution_count":null}]}