Commit 9f84bcd · Parent(s): 01d3dd8

Backend added

Files changed:
- .gitignore       +9   -0
- Dockerfile       +50  -0
- README.md        +92  -0
- chatbot.py       +178 -0
- main.py          +190 -0
- requirements.txt +158 -0
- worker.py        +154 -0
.gitignore
ADDED
@@ -0,0 +1,9 @@
+venv
+cache
+model
+__pycache__
+venv
+t.py
+temp.py
+.env
+app.py
Dockerfile
ADDED
@@ -0,0 +1,50 @@
+# Use a Python base image compatible with Hugging Face Spaces
+FROM python:3.11-slim
+
+# Set the working directory inside the container
+WORKDIR /app
+
+# Prevent Python from writing pyc files and buffering stdout
+ENV PYTHONDONTWRITEBYTECODE=1
+ENV PYTHONUNBUFFERED=1
+
+# Install system dependencies required for Playwright + Crawl4AI
+RUN apt-get update && apt-get install -y \
+    curl \
+    wget \
+    unzip \
+    git \
+    xvfb \
+    libnss3 \
+    libatk-bridge2.0-0 \
+    libx11-xcb1 \
+    libxcomposite1 \
+    libxdamage1 \
+    libxrandr2 \
+    libgbm-dev \
+    libasound2 \
+    libatk1.0-0 \
+    libxkbcommon0 \
+    libcups2 \
+    libgtk-3-0 \
+    fonts-liberation \
+    && rm -rf /var/lib/apt/lists/*
+
+# Copy requirements and install Python dependencies
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+
+# Install Playwright Chromium (used by Crawl4AI)
+RUN playwright install --with-deps chromium
+
+# Copy the entire app (including .env)
+COPY . .
+
+# Expose the port expected by Hugging Face (7860)
+EXPOSE 7860
+
+# Hugging Face expects the app to listen on port 7860
+ENV PORT=7860
+
+# Command to run FastAPI with uvicorn
+CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "7860"]
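
Before pushing to the Space, the container can be exercised locally. The snippet below is a minimal smoke test and not part of the commit; the image tag `webiq-backend` and the `7860:7860` port mapping are assumptions.

```python
# Hypothetical local smoke test for the container built from the Dockerfile above.
# Assumed setup (not part of the commit):
#   docker build -t webiq-backend .
#   docker run --rm -p 7860:7860 webiq-backend
import requests

def check_root() -> None:
    # main.py defines GET "/" returning a readiness message.
    resp = requests.get("http://127.0.0.1:7860/", timeout=10)
    resp.raise_for_status()
    print(resp.json())  # expected: {"message": "...", "status": "Ready"}

if __name__ == "__main__":
    check_root()
```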
README.md
CHANGED
@@ -8,3 +8,95 @@ pinned: false
 ---
 
 Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+
+# WebIQ – Boost your web intelligence with AI-powered insights
+
+## Overview
+WebIQ is a **web scraping** and **question-answering (QA)** chatbot built on the **Retrieval-Augmented Generation (RAG)** pipeline. It extracts key insights from any website and generates **AI-powered** responses grounded in the extracted data. WebIQ uses **FAISS** for efficient similarity search, **LangChain** for retrieval orchestration, and state-of-the-art **LLMs** for response generation.
+
+## Features
+- **Automated Web Scraping**: Extracts text from webpages, caches it locally, and supports both targeted and full-site scraping.
+- **Vector Embeddings**: Uses FAISS to store and retrieve information efficiently.
+- **LLM Integration**: Supports OpenAI (GPT-4) and Hugging Face models (Llama-2, Mistral, etc.).
+- **Chunking for Optimization**: Splits documents into meaningful chunks to improve retrieval quality.
+- **Asynchronous Processing**: Uses `asyncio` for efficient execution.
+- **Caching Mechanism**: Ensures previously processed webpages are not reprocessed.
+- **Batch Processing**: Processes large numbers of URLs efficiently.
+- **Memory Usage Logging**: Tracks memory consumption before and after each batch for efficiency monitoring.
+- **Multi-Page Scraping**: Scrapes content from multiple webpages and aggregates insights.
+
+## Installation
+
+1. Clone the repository:
+   ```sh
+   git clone https://github.com/Siddharth-Chandel/WebIQ.git
+   cd WebIQ
+   ```
+
+2. Create a virtual environment and activate it:
+   ```sh
+   python -m venv venv
+   source venv/bin/activate  # On Windows use `venv\Scripts\activate`
+   ```
+
+3. Install the required dependencies:
+   ```sh
+   pip install -r requirements.txt
+   ```
+
+4. Set up environment variables by creating a `.env` file:
+   ```sh
+   HUGGINGFACEHUB_API_TOKEN=your_huggingface_token
+   OPENAI_API_KEY=your_openai_api_key  # If using OpenAI
+   ```
+
+## Usage
+
+1. Run the chatbot script:
+   ```sh
+   python chatbot.py
+   ```
+2. Enter a **URL** when prompted (e.g., `https://playwright.dev`).
+3. Enter your **query** (e.g., `Describe Playwright and its benefits`).
+4. The chatbot will scrape the webpage, process the data, and return an AI-generated response.
+
+## Example Output
+```
+====================* Answer *====================
+Playwright is an end-to-end testing framework that provides...
+
+=================* Source Documents *=================
+Source 1:
+file: cache/playwright-dev/pages/page_1.txt
+Content: Playwright is a Node.js library that automates browsers.
+```
+
+## Practical Use Cases
+- **Research Assistance**: Quickly extract and summarize information from research papers, blogs, or documentation.
+- **Competitive Analysis**: Monitor competitors' websites and extract relevant insights for business strategy.
+- **Customer Support**: Enhance chatbot capabilities by integrating real-time website data retrieval.
+- **Market Intelligence**: Gather structured data from news sites, product pages, or financial reports for analysis.
+- **SEO Optimization**: Analyze webpage content for better keyword targeting and content strategy.
+
+## Technologies Used
+- **RAG** (Provides better context to the LLM)
+- **LangChain** (Retrieval-based QA orchestration)
+- **FAISS** (Efficient similarity search)
+- **Hugging Face Transformers** (LLMs & embeddings)
+- **OpenAI GPT-4** (Optional, for LLM-based response generation)
+- **Crawl4AI** (LLM-friendly web scraper)
+- **AsyncIO** (Speeds up processing)
+- **Rich** (Colorful CLI output)
+
+## Future Enhancements
+- Develop an interactive web UI using Streamlit or FastAPI for a seamless user experience.
+- Enhance retrieval quality with advanced RAG tuning and improved embeddings.
+
+## License
+This project is licensed under the **MIT License**.
+
+## Author
+Siddharth Chandel - Developed as part of NLP & AI research.
+Let's connect on [LinkedIn](https://www.linkedin.com/in/siddharth-chandel-001097245/)!
+
+---
+_Contributions are welcome! Feel free to fork and enhance._ 🚀
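
Beyond the CLI flow described in the README, the `Chatbot` class added in chatbot.py (next file in this commit) can be driven programmatically. A minimal sketch, mirroring the file's own example runner; the URL and question are illustrative:

```python
import asyncio
from chatbot import Chatbot

async def run() -> None:
    # A one-element list uses the "cache/list_<site>/pages" cache layout.
    bot = Chatbot(["https://playwright.dev"])
    await bot.initialize()  # scrape (if not cached) and build/load the FAISS index
    answer = await bot.query("Describe Playwright and its benefits")
    print(answer)

if __name__ == "__main__":
    asyncio.run(run())
```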
chatbot.py
ADDED
@@ -0,0 +1,178 @@
+from dotenv import load_dotenv
+
+load_dotenv()
+import os
+import asyncio
+import logging
+from langchain_community.document_loaders import TextLoader
+from langchain_text_splitters import RecursiveCharacterTextSplitter
+from langchain_huggingface import HuggingFaceEmbeddings
+from langchain_community.vectorstores import FAISS
+from langchain_openai import ChatOpenAI
+from langchain_community.llms import CTransformers
+from langchain_core.prompts import PromptTemplate
+from transformers import pipeline
+from langchain_huggingface import HuggingFacePipeline
+from rich import print as rprint
+from worker import scrape_website
+logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
+
+HUGGINGFACEHUB_API_TOKEN = os.getenv("HUGGINGFACEHUB_API_TOKEN")
+os.environ["HUGGINGFACEHUB_API_TOKEN"] = os.getenv("HUGGINGFACEHUB_API_TOKEN")
+
+DEFAULT_MODEL = "TheBloke/Llama-2-7B-Chat-GGML"
+EMBEDDING_MODEL = "BAAI/bge-small-en"
+
+# -------------------- Document Preparation --------------------
+async def prepare_document(url: str | list[str]):
+    if isinstance(url, str):
+        folder = f"{url[8:].replace('.', '-').split('/')[0]}"
+        cache_path = os.path.join("cache", folder, "pages")
+    else:
+        folder = f"{url[0][8:].replace('.', '-').split('/')[0]}"
+        cache_path = os.path.join("cache", f"list_{folder}", "pages")
+
+    os.makedirs(cache_path, exist_ok=True)
+
+    if not os.path.exists(f"{cache_path}/page_1.txt"):
+        logging.info("Document not found. Scraping website...")
+        await scrape_website(url, cache_path)
+        logging.info("Scraping completed.")
+
+    return cache_path
+
+# -------------------- Embedding --------------------
+def get_embedding_model(embedding_model_name="", api_key=""):
+    # Use OpenAI if api_key provided or model name indicates OpenAI
+    if api_key or "openai" in embedding_model_name.lower():
+        if not api_key:
+            raise ValueError("OpenAI API key required for OpenAI embeddings")
+        from langchain_openai import OpenAIEmbeddings
+        return OpenAIEmbeddings(model="text-embedding-3-small", api_key=api_key)
+
+    # Use HuggingFace otherwise
+    else:
+        # Ensure HF token is set in env for this thread
+        HUGGINGFACEHUB_API_TOKEN = os.getenv("HUGGINGFACEHUB_API_TOKEN")
+        os.environ["HUGGINGFACEHUB_API_TOKEN"] = HUGGINGFACEHUB_API_TOKEN
+
+        return HuggingFaceEmbeddings(model_name=embedding_model_name or EMBEDDING_MODEL)
+
+
+# -------------------- Process & Build Vector Store --------------------
+def process_documents(file_path: str, embedding_model, chunk_size=500, chunk_overlap=100):
+    try:
+        cache_path = os.path.dirname(file_path)
+        faiss_path = f"{cache_path}/faiss_index_store"
+
+        if os.path.exists(faiss_path):
+            logging.info("FAISS index exists. Skipping rebuild.")
+            return
+
+        documents = []
+        for file in os.listdir(f"{cache_path}/pages"):
+            doc_loader = TextLoader(os.path.join(cache_path, "pages", file), encoding="utf-8")
+            documents.extend(doc_loader.load())
+
+        logging.info(f"Loaded {len(documents)} pages")
+        text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
+        chunks = text_splitter.split_documents(documents)
+
+        vector_db = FAISS.from_documents(chunks, embedding_model)
+        vector_db.save_local(faiss_path)
+        logging.info("FAISS store saved successfully")
+
+    except Exception as e:
+        logging.error(f"Error in document processing: {e}")
+
+# -------------------- Load Retriever --------------------
+async def load_retriever(file_path: str, embedding_model_name="", api_key=""):
+    cache_path = os.path.dirname(file_path)
+    embedding_model = get_embedding_model(embedding_model_name, api_key)
+    faiss_path = f"{cache_path}/faiss_index_store"
+
+    if not os.path.exists(faiss_path):
+        logging.warning("FAISS index missing. Rebuilding...")
+        process_documents(file_path, embedding_model)
+
+    vector_db = FAISS.load_local(faiss_path, embedding_model, allow_dangerous_deserialization=True)
+    return vector_db.as_retriever(search_kwargs={"k": 3})
+
+# -------------------- Build Custom QA Pipeline --------------------
+async def build_pipeline(url: str | list, llm_model="", embedding_model="", api_key=""):
+    # Force default model if llm_model is empty or 'default'
+    if not llm_model or llm_model.lower() == "default":
+        llm_model = DEFAULT_MODEL
+    logging.info(f"[LLM] Using model: {llm_model}")
+
+    file_path = await prepare_document(url)
+    retriever = await load_retriever(file_path, embedding_model, api_key)
+
+    llm_model_lower = llm_model.lower()
+    # OpenAI LLM
+    if "openai" in llm_model_lower:
+        llm = ChatOpenAI(model_name="gpt-3.5-turbo", openai_api_key=api_key)
+    # GGML model
+    elif llm_model_lower.endswith("-ggml"):
+        llm = CTransformers(model=llm_model, model_type="llama", config={"context_length": 4096})
+    # Hugging Face PyTorch model
+    else:
+        try:
+            hf_pipeline = pipeline(
+                "text-generation",
+                model=llm_model,
+                use_auth_token=HUGGINGFACEHUB_API_TOKEN
+            )
+            llm = HuggingFacePipeline(pipeline=hf_pipeline)
+        except Exception as e:
+            logging.error(f"Failed to load Hugging Face model '{llm_model}'. Error: {e}")
+            raise RuntimeError(f"Cannot load Hugging Face model: {e}")
+
+    prompt = PromptTemplate(
+        input_variables=["context", "question"],
+        template="You are a helpful assistant. Use the following context to answer.\n\nContext:\n{context}\n\nQuestion: {question}\n\nAnswer:"
+    )
+
+    return llm, retriever, prompt
+
+
+class Chatbot:
+    def __init__(self, url: str | list, llm_model="", embedding_model="", api_key=""):
+        self.url = url
+        self.llm_model = llm_model
+        self.embedding_model = embedding_model
+        self.api_key = api_key
+
+    async def initialize(self):
+        self.llm, self.retriever, self.prompt = await build_pipeline(
+            self.url, self.llm_model, self.embedding_model, self.api_key
+        )
+
+    async def query(self, question: str):
+        # Use async method if available
+        if hasattr(self.retriever, "aretrieve"):
+            docs = await self.retriever.aretrieve(question)
+        else:
+            # fallback: call the private method with run_manager=None
+            docs = await asyncio.to_thread(self.retriever._get_relevant_documents, question, run_manager=None)
+
+        context = "\n\n".join([d.page_content for d in docs])
+        prompt_text = self.prompt.format(context=context, question=question)
+        response = await asyncio.to_thread(self.llm.invoke, prompt_text)
+        return response
+
+
+# -------------------- Example Runner --------------------
+async def main():
+    url = input("Enter URL: ").strip()
+    query = input("Enter your question: ").strip()
+
+    bot = Chatbot([url])
+    await bot.initialize()
+    answer = await bot.query(query)
+    rprint(f"\n[bold cyan]=== Answer ===[/bold cyan]\n{answer}")
+
+
+if __name__ == "__main__":
+    asyncio.run(main())
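
A note on the cache layout produced by `prepare_document` above: the folder name is derived from the URL by dropping the scheme prefix and replacing dots with dashes, which is where the `cache/playwright-dev/pages/...` path in the README example comes from. A small illustration (not part of the commit):

```python
# Illustration of the folder naming used by prepare_document (illustrative URL).
url = "https://playwright.dev"
folder = url[8:].replace('.', '-').split('/')[0]
print(folder)  # "playwright-dev"
# Single URL   -> cache/playwright-dev/pages/page_<n>.txt
# List of URLs -> cache/list_playwright-dev/pages/page_<n>.txt (keyed by the first URL)
```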
main.py
ADDED
@@ -0,0 +1,190 @@
+# main.py
+from dotenv import load_dotenv
+load_dotenv()
+
+import sys
+import uuid
+import asyncio
+import logging
+from fastapi import FastAPI, HTTPException, WebSocket, WebSocketDisconnect, BackgroundTasks
+from fastapi.middleware.cors import CORSMiddleware
+from chatbot import Chatbot
+
+# -----------------------------
+# Windows Asyncio Fix
+# -----------------------------
+if sys.platform == "win32":
+    asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
+
+# -----------------------------
+# FastAPI app & CORS
+# -----------------------------
+app = FastAPI(
+    title="Session-Based RAG Chatbot API",
+    description="Session-based RAG Chatbot API with WebSocket support",
+    version="1.1.0"
+)
+
+origins = [
+    "http://localhost:8080",
+    "http://127.0.0.1:8080",
+    "http://127.0.0.1:5500",
+]
+
+app.add_middleware(
+    CORSMiddleware,
+    allow_origins=origins,
+    allow_credentials=True,
+    allow_methods=["*"],
+    allow_headers=["*"],
+)
+
+# -----------------------------
+# Session storage
+# -----------------------------
+chatbot_sessions = {}  # {session_id: Chatbot instance, None while initializing, "err" if failed}
+
+# -----------------------------
+# Root endpoint
+# -----------------------------
+@app.get("/")
+def read_root():
+    return {"message": "Welcome to the Session-Based RAG Chatbot API!", "status": "Ready"}
+
+@app.get("/create_session")
+def create_session():
+    return {"session": str(uuid.uuid4())}
+
+
+@app.get("/session_status/{session_id}")
+def session_status(session_id: str):
+    """
+    Returns the current status of a chatbot session.
+    Status can be:
+    - initializing (session exists but chatbot not ready)
+    - ready (chatbot instance ready)
+    - failed (chatbot initialization failed)
+    """
+    if session_id not in chatbot_sessions:
+        return {"status": "not_found"}
+
+    chatbot = chatbot_sessions[session_id]
+    if chatbot is None:
+        return {"status": "initializing"}
+    elif chatbot == "err":
+        return {"status": "failed"}
+
+    return {"status": "ready"}
+
+
+# -----------------------------
+# Helper: Run async init in background
+# -----------------------------
+def run_chatbot_init(session_id, urls, llm_model, embedding_model, api_key):
+    asyncio.create_task(initialize_chatbot(session_id, urls, llm_model, embedding_model, api_key))
+
+# -----------------------------
+# Scrape & initialize chatbot
+# -----------------------------
+@app.post("/scrape/")
+async def scrape_and_load(response: dict, background_tasks: BackgroundTasks):
+    session_id = response.get("session_id")
+    urls = response.get("urls")
+    llm_model = response.get("llm_model", "TheBloke/Llama-2-7B-Chat-GGML")
+    embedding_model = response.get("embedding_model", "BAAI/bge-small-en")
+    api_key = response.get("api_key", None)
+
+    if not urls:
+        raise HTTPException(status_code=400, detail="urls are required.")
+
+    if session_id in chatbot_sessions:
+        return {"message": f"Chatbot for session {session_id} already initialized.", "session_id": session_id}
+
+    # Mark session as initializing
+    chatbot_sessions[session_id] = None
+
+    # Run the async initialization as a background task
+    async def init_wrapper():
+        try:
+            await initialize_chatbot(session_id, urls, llm_model, embedding_model, api_key)
+        except Exception as e:
+            logging.error(f"[{session_id}] Initialization error: {e}", exc_info=True)
+            chatbot_sessions[session_id] = "err"
+
+    background_tasks.add_task(init_wrapper)
+
+    logging.info(f"[{session_id}] Chatbot initialization scheduled in background.")
+    return {"message": "Chatbot initialization started.", "session_id": session_id}
+
+# -----------------------------
+# Initialize chatbot
+# -----------------------------
+async def initialize_chatbot(session_id, urls, llm_model, embedding_model, api_key):
+    try:
+        logging.info(f"[{session_id}] Initializing chatbot...")
+        chatbot = Chatbot(
+            url=urls,
+            llm_model=llm_model,
+            embedding_model=embedding_model,
+            api_key=api_key
+        )
+        await chatbot.initialize()
+
+        chatbot_sessions[session_id] = chatbot
+        logging.info(f"[{session_id}] Chatbot ready.")
+    except NotImplementedError as e:
+        logging.error(f"[{session_id}] Playwright async not supported on Windows: {e}", exc_info=True)
+        chatbot_sessions[session_id] = "err"
+    except Exception as e:
+        logging.error(f"[{session_id}] Initialization failed: {e}", exc_info=True)
+        chatbot_sessions[session_id] = "err"
+
+# -----------------------------
+# WebSocket endpoint
+# -----------------------------
+@app.websocket("/ws/chat/{session_id}")
+async def websocket_endpoint(websocket: WebSocket, session_id: str):
+    await websocket.accept()
+    logging.info(f"[{session_id}] WebSocket connected.")
+
+    try:
+        # Wait until chatbot is ready (or initialization has failed)
+        while session_id not in chatbot_sessions or chatbot_sessions[session_id] is None:
+            await websocket.send_json({"text": "Initializing chatbot, please wait..."})
+            await asyncio.sleep(1)
+
+        chatbot_instance = chatbot_sessions[session_id]
+        if chatbot_instance == "err":
+            await websocket.send_json({
+                "text": "Chatbot initialization failed. Likely due to a Playwright async issue on Windows."
+            })
+            return
+
+        await websocket.send_json({"text": f"Chatbot session {session_id} is ready! You can start chatting."})
+
+        while True:
+            data = await websocket.receive_json()
+            query = data.get("query")
+            if not query:
+                continue
+
+            response_text = await chatbot_instance.query(query)
+            await websocket.send_json({"text": response_text})
+
+    except WebSocketDisconnect:
+        logging.info(f"[{session_id}] WebSocket disconnected.")
+    except Exception as e:
+        logging.error(f"[{session_id}] WebSocket error: {e}", exc_info=True)
+        try:
+            await websocket.send_json({"text": "An unexpected server error occurred."})
+        except Exception:
+            pass
+
+# -----------------------------
+# Run with: uvicorn main:app --reload
+# -----------------------------
+if __name__ == "__main__":
+    import uvicorn
+    uvicorn.run("main:app", host="127.0.0.1", port=8000, reload=True)
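
A typical client of this API creates a session, posts the URLs to `/scrape/`, polls `/session_status/{session_id}`, and then chats over the WebSocket. Below is a minimal client sketch, not part of the commit, using `requests` and `websockets` (both pinned in requirements.txt); the base URL assumes a local `uvicorn main:app` run on port 8000, and the target URL is illustrative.

```python
import asyncio
import json
import requests
import websockets

BASE = "http://127.0.0.1:8000"  # assumed local deployment

async def chat() -> None:
    session_id = requests.get(f"{BASE}/create_session", timeout=10).json()["session"]

    # Kick off scraping + chatbot initialization in the background.
    requests.post(f"{BASE}/scrape/", json={
        "session_id": session_id,
        "urls": ["https://playwright.dev"],  # illustrative URL
    }, timeout=30)

    # Poll until the chatbot is ready (or initialization failed).
    while True:
        status = requests.get(f"{BASE}/session_status/{session_id}", timeout=10).json()["status"]
        if status in ("ready", "failed"):
            break
        await asyncio.sleep(2)

    if status == "failed":
        raise RuntimeError("Chatbot initialization failed")

    async with websockets.connect(f"ws://127.0.0.1:8000/ws/chat/{session_id}") as ws:
        print(await ws.recv())  # readiness banner from the server
        await ws.send(json.dumps({"query": "Describe Playwright and its benefits"}))
        print(json.loads(await ws.recv())["text"])

if __name__ == "__main__":
    asyncio.run(chat())
```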
requirements.txt
ADDED
@@ -0,0 +1,158 @@
+accelerate==1.11.0
+aiofiles==25.1.0
+aiohappyeyeballs==2.6.1
+aiohttp==3.13.1
+aiosignal==1.4.0
+aiosqlite==0.21.0
+alphashape==1.3.1
+altair==5.5.0
+annotated-types==0.7.0
+anyio==4.11.0
+attrs==25.4.0
+beautifulsoup4==4.14.2
+bitsandbytes==0.42.0
+blinker==1.9.0
+Brotli==1.1.0
+cachetools==6.2.1
+certifi==2025.10.5
+cffi==2.0.0
+chardet==5.2.0
+charset-normalizer==3.4.4
+click==8.3.0
+click-log==0.4.0
+Crawl4AI==0.7.6
+cryptography==46.0.3
+cssselect==1.3.0
+ctransformers==0.2.27
+dataclasses-json==0.6.7
+distro==1.9.0
+dotenv==0.9.9
+faiss-cpu==1.12.0
+fake-http-header==0.3.5
+fake-useragent==2.2.0
+fastapi==0.119.1
+fastuuid==0.14.0
+filelock==3.20.0
+frozenlist==1.8.0
+fsspec==2025.9.0
+gitdb==4.0.12
+GitPython==3.1.45
+greenlet==3.2.4
+h11==0.16.0
+h2==4.3.0
+hf-xet==1.1.10
+hpack==4.1.0
+httpcore==1.0.9
+httptools==0.7.1
+httpx==0.28.1
+httpx-sse==0.4.3
+huggingface-hub==0.35.3
+humanize==4.14.0
+hyperframe==6.1.0
+idna==3.11
+importlib_metadata==8.7.0
+Jinja2==3.1.6
+jiter==0.11.1
+joblib==1.5.2
+jsonpatch==1.33
+jsonpointer==3.0.0
+jsonschema==4.25.1
+jsonschema-specifications==2025.9.1
+langchain==1.0.2
+langchain-classic==1.0.0
+langchain-community==0.4
+langchain-core==1.0.0
+langchain-huggingface==1.0.0
+langchain-openai==1.0.1
+langchain-text-splitters==1.0.0
+langgraph==1.0.1
+langgraph-checkpoint==3.0.0
+langgraph-prebuilt==1.0.1
+langgraph-sdk==0.2.9
+langsmith==0.4.37
+lark==1.3.0
+litellm==1.78.6
+lxml==5.4.0
+markdown-it-py==4.0.0
+MarkupSafe==3.0.3
+marshmallow==3.26.1
+mdurl==0.1.2
+mpmath==1.3.0
+multidict==6.7.0
+mypy_extensions==1.1.0
+narwhals==2.9.0
+networkx==3.5
+nltk==3.9.2
+numpy==2.3.4
+openai==2.6.0
+orjson==3.11.3
+ormsgpack==1.11.0
+packaging==25.0
+pandas==2.3.3
+patchright==1.55.2
+pillow==11.3.0
+playwright==1.55.0
+propcache==0.4.1
+protobuf==6.33.0
+psutil==7.1.1
+py-cpuinfo==9.0.0
+pyarrow==21.0.0
+pycparser==2.23
+pydantic==2.12.3
+pydantic-settings==2.11.0
+pydantic_core==2.41.4
+pydeck==0.9.1
+pyee==13.0.0
+Pygments==2.19.2
+pyOpenSSL==25.3.0
+python-dateutil==2.9.0.post0
+python-dotenv==1.1.1
+pytz==2025.2
+PyYAML==6.0.3
+rank-bm25==0.2.2
+referencing==0.37.0
+regex==2025.10.23
+requests==2.32.5
+requests-toolbelt==1.0.0
+rich==14.2.0
+rpds-py==0.27.1
+rtree==1.4.1
+safetensors==0.6.2
+scikit-learn==1.7.2
+scipy==1.16.2
+sentence-transformers==5.1.2
+setuptools==80.9.0
+shapely==2.1.2
+six==1.17.0
+smmap==5.0.2
+sniffio==1.3.1
+snowballstemmer==2.2.0
+soupsieve==2.8
+SQLAlchemy==2.0.44
+starlette==0.48.0
+streamlit==1.50.0
+sympy==1.14.0
+tenacity==9.1.2
+tf-playwright-stealth==1.2.0
+threadpoolctl==3.6.0
+tiktoken==0.12.0
+tokenizers==0.22.1
+toml==0.10.2
+torch==2.9.0
+tornado==6.5.2
+tqdm==4.67.1
+transformers==4.57.1
+trimesh==4.8.3
+typing-inspect==0.9.0
+typing-inspection==0.4.2
+typing_extensions==4.15.0
+tzdata==2025.2
+urllib3==2.5.0
+uvicorn==0.38.0
+uvloop==0.22.1
+watchfiles==1.1.1
+websockets==15.0.1
+xxhash==3.6.0
+yarl==1.22.0
+zipp==3.23.0
+zstandard==0.25.0
worker.py
ADDED
@@ -0,0 +1,154 @@
+# worker.py
+import os
+import asyncio
+import psutil
+from urllib.parse import urlparse, urlunparse
+from typing import List
+from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
+from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
+from crawl4ai.content_filter_strategy import PruningContentFilter
+import traceback
+
+# ------------------------------
+# File paths & config
+# ------------------------------
+__location__ = os.path.dirname(os.path.abspath(__file__))
+batch = 32  # max concurrent crawls
+goto_timeout = 60_000  # 1 minute
+
+# ------------------------------
+# Utility functions
+# ------------------------------
+def normalize_url(url: str) -> str:
+    """Normalize URL to avoid duplicates."""
+    parsed = urlparse(url)
+    return urlunparse((parsed.scheme, parsed.netloc, parsed.path, '', '', ''))
+
+async def get_internal_urls(url_set: set, visited: set, crawler) -> set:
+    """Collect internal links from a page."""
+    internal_urls = crawler.links.get("internal", [])
+    for link in internal_urls:
+        href = link.get("href")
+        if href and href.startswith("http"):
+            normalized_href = normalize_url(href)
+            if normalized_href not in visited:
+                url_set.add(normalized_href)
+    return url_set
+
+# ------------------------------
+# Core crawling function
+# ------------------------------
+async def crawl_parallel(urls: List[str] | str, file_path: str, max_concurrent: int = batch):
+    """Crawl multiple URLs asynchronously with retries, save pages, and track failures."""
+    text_pages = set()
+    not_visited = set(urls if isinstance(urls, list) else [urls])
+    visited = set()
+    retry = set()
+    failed = set()
+    was_str = isinstance(urls, str)
+    n = 1
+
+    os.makedirs(file_path, exist_ok=True)
+
+    process = psutil.Process()
+    peak_memory = 0
+    def log_memory(prefix: str = ""):
+        nonlocal peak_memory
+        current_mem = process.memory_info().rss
+        peak_memory = max(peak_memory, current_mem)
+        print(f"{prefix} Memory: {current_mem // (1024*1024)} MB | Peak: {peak_memory // (1024*1024)} MB")
+
+    # Browser & crawler config
+    browser_config = BrowserConfig(
+        headless=True,
+        verbose=False,
+        extra_args=["--disable-gpu", "--disable-dev-shm-usage", "--no-sandbox"],
+        text_mode=True
+    )
+    crawl_config = CrawlerRunConfig(
+        cache_mode=CacheMode.BYPASS,
+        markdown_generator=DefaultMarkdownGenerator(
+            content_filter=PruningContentFilter(threshold=0.6),
+            options={"ignore_links": True}
+        ),
+        page_timeout=goto_timeout
+    )
+
+    crawler = AsyncWebCrawler(config=browser_config)
+    await crawler.start()
+    print("\n=== Starting robust parallel crawling ===")
+
+    async def safe_crawl(url, session_id):
+        """Crawl a URL safely, return result or None."""
+        try:
+            result = await crawler.arun(url=url, config=crawl_config, session_id=session_id)
+            return result
+        except Exception as e:
+            print(f"[WARN] Failed to crawl {url}: {e}")
+            return None
+
+    try:
+        while not_visited:
+            urls_batch = list(not_visited)[:max_concurrent]
+            tasks = [safe_crawl(url, f"session_{i}") for i, url in enumerate(urls_batch)]
+
+            log_memory(prefix=f"Before batch {n}: ")
+            results = await asyncio.gather(*tasks, return_exceptions=True)
+            log_memory(prefix=f"After batch {n}: ")
+
+            for url, result in zip(urls_batch, results):
+                if isinstance(result, Exception) or result is None or not getattr(result, "success", False):
+                    if url not in retry:
+                        retry.add(url)
+                        print(f"[INFO] Retry scheduled for {url}")
+                    else:
+                        failed.add(url)
+                        not_visited.discard(url)
+                        visited.add(url)
+                        print(f"[ERROR] Crawling failed for {url} after retry")
+                else:
+                    text_pages.add(result.markdown.fit_markdown)
+                    if was_str:
+                        internal_urls = result.links.get("internal", [])
+                        for link in internal_urls:
+                            href = link.get("href")
+                            if href and href.startswith("http"):
+                                normalized_href = normalize_url(href)
+                                if normalized_href not in visited:
+                                    not_visited.add(normalized_href)
+                    visited.add(url)
+                    retry.discard(url)
+                    not_visited.discard(url)
+            n += 1
+
+    except Exception as e:
+        traceback.print_exc()
+        print(e)
+    finally:
+        await crawler.close()
+        log_memory(prefix="Final: ")
+
+    # Save pages
+    pages = [p for p in text_pages if p.strip()]
+    for i, page in enumerate(pages):
+        with open(os.path.join(file_path, f"page_{i+1}.txt"), "w", encoding="utf-8") as f:
+            f.write(page)
+
+    print(f"\nSummary:")
+    print(f"  - Successfully crawled pages: {len(pages)}")
+    print(f"  - Failed URLs: {len(failed)} -> {failed}")
+    print(f"Peak memory usage: {peak_memory // (1024*1024)} MB")
+
+    return {
+        "success_count": len(pages),
+        "failed_urls": list(failed),
+        "peak_memory_MB": peak_memory // (1024*1024)
+    }
+
+# ------------------------------
+# Public scrape function
+# ------------------------------
+async def scrape_website(urls: str | list, file_path: str):
+    """Wrapper to start crawling and return summary."""
+    summary = await crawl_parallel(urls, file_path, max_concurrent=batch)
+    return summary
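
For completeness, `scrape_website` can also be exercised on its own, in the same way `prepare_document` in chatbot.py calls it. A minimal sketch (the URL and output directory are illustrative, not part of the commit):

```python
import asyncio
from worker import scrape_website

async def run() -> None:
    # A single string URL makes crawl_parallel follow the site's internal links;
    # passing a list of URLs crawls only those pages.
    summary = await scrape_website("https://playwright.dev", "cache/playwright-dev/pages")
    print(summary)  # {"success_count": ..., "failed_urls": [...], "peak_memory_MB": ...}

if __name__ == "__main__":
    asyncio.run(run())
```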