Merge commit 'e34edc7cd55f292dd0b192dc00b782c22208fde6' as 'ingestion_python'
- ingestion_python/.dockerignore +46 -0
- ingestion_python/CURL.md +138 -0
- ingestion_python/Dockerfile +33 -0
- ingestion_python/README.md +246 -0
- ingestion_python/api/models.py +37 -0
- ingestion_python/api/routes.py +238 -0
- ingestion_python/app.py +58 -0
- ingestion_python/helpers/pages.py +24 -0
- ingestion_python/requirements.txt +17 -0
- ingestion_python/services/ingestion_service.py +119 -0
- ingestion_python/services/maverick_captioner.py +141 -0
- ingestion_python/test_upload1.sh +241 -0
- ingestion_python/test_upload2.sh +238 -0
- ingestion_python/test_upload3.sh +227 -0
- ingestion_python/utils/__init__.py +2 -0
- ingestion_python/utils/api/rotator.py +67 -0
- ingestion_python/utils/api/router.py +359 -0
- ingestion_python/utils/embedding.py +44 -0
- ingestion_python/utils/ingestion/chunker.py +130 -0
- ingestion_python/utils/ingestion/parser.py +63 -0
- ingestion_python/utils/logger.py +71 -0
- ingestion_python/utils/rag/embeddings.py +39 -0
- ingestion_python/utils/rag/rag.py +278 -0
- ingestion_python/utils/service/common.py +20 -0
- ingestion_python/utils/service/summarizer.py +48 -0
ingestion_python/.dockerignore
ADDED
@@ -0,0 +1,46 @@
# Ignore unnecessary files for Docker build
__pycache__/
*.pyc
*.pyo
*.pyd
.Python
env/
venv/
.venv/
pip-log.txt
pip-delete-this-directory.txt
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.log
.git/
.gitignore
.dockerignore

# Ignore documentation files (keep only main README)
*.md
!README.md

# Ignore test files
test_*.py
*_test.py
*.sh
tests/

# Ignore IDE files
.vscode/
.idea/
*.swp
*.swo
*~

# Ignore OS files
.DS_Store
Thumbs.db

# Ignore unnecessary directories
# config/ and services/ are needed for the application

ingestion_python/CURL.md
ADDED
@@ -0,0 +1,138 @@
# CURL Test Commands for Ingestion Pipeline

## Backend Configuration
- **URL**: `https://binkhoale1812-studdybuddy-ingestion1.hf.space/`
- **User ID**: `44e65346-8eaa-4f95-b17a-f6219953e7a8`
- **Project ID**: `496e2fad-ec7e-4562-b06a-ea2491f2460`
- **Test Files**: `Lecture5_ML.pdf`, `Lecture6_ANN_DL.pdf`

## 1. Health Check

```bash
curl -X GET "https://binkhoale1812-studdybuddy-ingestion1.hf.space/health" \
  -H "Content-Type: application/json"
```

## 2. Upload Files

```bash
curl -X POST "https://binkhoale1812-studdybuddy-ingestion1.hf.space/upload" \
  -F "user_id=44e65346-8eaa-4f95-b17a-f6219953e7a8" \
  -F "project_id=496e2fad-ec7e-4562-b06a-ea2491f2460" \
  -F "files=@../exefiles/Lecture5_ML.pdf" \
  -F "files=@../exefiles/Lecture6_ANN_DL.pdf"
```

## 3. Check Upload Status

Replace `{JOB_ID}` with the job_id from the upload response:

```bash
curl -X GET "https://binkhoale1812-studdybuddy-ingestion1.hf.space/upload/status?job_id={JOB_ID}" \
  -H "Content-Type: application/json"
```

## 4. List Uploaded Files

```bash
curl -X GET "https://binkhoale1812-studdybuddy-ingestion1.hf.space/files?user_id=44e65346-8eaa-4f95-b17a-f6219953e7a8&project_id=496e2fad-ec7e-4562-b06a-ea2491f2460" \
  -H "Content-Type: application/json"
```

## 5. Get File Chunks (Lecture5_ML.pdf)

```bash
curl -X GET "https://binkhoale1812-studdybuddy-ingestion1.hf.space/files/chunks?user_id=44e65346-8eaa-4f95-b17a-f6219953e7a8&project_id=496e2fad-ec7e-4562-b06a-ea2491f2460&filename=Lecture5_ML.pdf&limit=5" \
  -H "Content-Type: application/json"
```

## 6. Get File Chunks (Lecture6_ANN_DL.pdf)

```bash
curl -X GET "https://binkhoale1812-studdybuddy-ingestion1.hf.space/files/chunks?user_id=44e65346-8eaa-4f95-b17a-f6219953e7a8&project_id=496e2fad-ec7e-4562-b06a-ea2491f2460&filename=Lecture6_ANN_DL.pdf&limit=5" \
  -H "Content-Type: application/json"
```

## Expected Responses

### Health Check Response
```json
{
  "ok": true,
  "mongodb_connected": true,
  "service": "ingestion_pipeline"
}
```

### Upload Response
```json
{
  "job_id": "uuid-string",
  "status": "processing",
  "total_files": 2
}
```

### Status Response
```json
{
  "job_id": "uuid-string",
  "status": "completed",
  "total": 2,
  "completed": 2,
  "progress": 100.0,
  "last_error": null,
  "created_at": 1234567890.123
}
```

### Files List Response
```json
{
  "files": [
    {
      "filename": "Lecture5_ML.pdf",
      "summary": "Document summary..."
    },
    {
      "filename": "Lecture6_ANN_DL.pdf",
      "summary": "Document summary..."
    }
  ],
  "filenames": ["Lecture5_ML.pdf", "Lecture6_ANN_DL.pdf"]
}
```

### Chunks Response
```json
{
  "chunks": [
    {
      "user_id": "44e65346-8eaa-4f95-b17a-f6219953e7a8",
      "project_id": "496e2fad-ec7e-4562-b06a-ea2491f2460",
      "filename": "Lecture5_ML.pdf",
      "topic_name": "Machine Learning Introduction",
      "summary": "Chunk summary...",
      "content": "Chunk content...",
      "embedding": [0.1, 0.2, ...],
      "page_span": [1, 3],
      "card_id": "lecture5_ml-c0001"
    }
  ]
}
```

## Testing Steps

1. **Run Health Check**: Verify the service is running
2. **Upload Files**: Upload both PDF files
3. **Monitor Progress**: Check job status until completion
4. **Verify Files**: List uploaded files
5. **Inspect Chunks**: Get document chunks to verify processing

## Troubleshooting

- **Connection Issues**: Check if the backend URL is accessible
- **File Not Found**: Ensure PDF files exist in `../exefiles/` directory
- **Upload Fails**: Check file size limits and format support
- **Processing Stuck**: Monitor job status and check logs

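The same flow, scripted: a minimal Python sketch of the health-check, upload, and polling steps above, using the `requests` package pinned in this repo's requirements.txt. The endpoints and form fields are the documented ones; the 20-second poll interval mirrors the shell test scripts.

```python
# Minimal client sketch of the documented flow: health check, upload, poll.
import time
import requests

BACKEND = "https://binkhoale1812-studdybuddy-ingestion1.hf.space"

def upload_and_wait(user_id: str, project_id: str, paths: list, poll_s: int = 20) -> dict:
    # Step 1: verify the service is up before uploading.
    assert requests.get(f"{BACKEND}/health", timeout=30).json()["ok"]
    # Step 2: multipart upload; repeated "files" fields, as in the curl example.
    files = [("files", open(p, "rb")) for p in paths]
    try:
        resp = requests.post(
            f"{BACKEND}/upload",
            data={"user_id": user_id, "project_id": project_id},
            files=files,
            timeout=300,
        )
    finally:
        for _, fh in files:
            fh.close()
    resp.raise_for_status()
    job_id = resp.json()["job_id"]
    # Step 3: poll /upload/status until the job reports completion.
    while True:
        status = requests.get(
            f"{BACKEND}/upload/status", params={"job_id": job_id}, timeout=30
        ).json()
        if status["status"] == "completed":
            return status
        time.sleep(poll_s)
```
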
ingestion_python/Dockerfile
ADDED
@@ -0,0 +1,33 @@
# Hugging Face Spaces - Docker for Ingestion Pipeline
FROM python:3.11-slim

ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

# System deps (same as main system)
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential curl git libglib2.0-0 libgl1 \
    && rm -rf /var/lib/apt/lists/*

# Create and use a non-root user (same as main system)
RUN useradd -m -u 1000 user
USER user
ENV PATH="/home/user/.local/bin:$PATH"

# Set working directory
WORKDIR /app

# Copy ingestion pipeline files (includes utils and helpers)
COPY . .

# Install Python dependencies (same as main system)
RUN pip install --upgrade pip && pip install --no-cache-dir -r requirements.txt

# No local model caches or warmup needed (remote embedding service)

# Expose port for HF Spaces
ENV PORT=7860
EXPOSE 7860

# Start FastAPI (single worker so app.state.jobs remains consistent)
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860", "--workers", "1"]

ingestion_python/README.md
ADDED
@@ -0,0 +1,246 @@
---
title: StuddyBuddy Ingestion
emoji: ⚙️
colorFrom: blue
colorTo: pink
sdk: docker
pinned: false
license: mit
short_description: 'backend for data ingestion'
---

# Ingestion Pipeline

A dedicated service for processing file uploads and storing them in MongoDB Atlas. This service mirrors the main system's file processing functionality while running as a separate service to share the processing load.

[API docs](API.md) | [System docs](COMPATIBILITY.md)

## 🏗️ Architecture

```
┌─────────────────────────────────────────────────────────────────────────────────┐
│                                 USER INTERFACE                                  │
│  ┌─────────────────┐    ┌──────────────────┐    ┌──────────────────┐            │
│  │   Frontend UI   │    │  Load Balancer   │    │   Main System    │            │
│  │                 │◄──►│                  │◄──►│   (Port 7860)    │            │
│  │ - File Upload   │    │ - Route Requests │    │ - Chat & Reports │            │
│  │ - Chat Interface│    │ - Health Checks  │    │ - User Management│            │
│  │ - Project Mgmt  │    │ - Load Balancing │    │ - Analytics      │            │
│  └─────────────────┘    └──────────────────┘    └──────────────────┘            │
└─────────────────────────────────────────────────────────────────────────────────┘
                                        │
                                        ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                               INGESTION PIPELINE                                │
│  ┌─────────────────┐    ┌──────────────────┐    ┌──────────────────┐            │
│  │ File Processing │    │   Data Storage   │    │    Monitoring    │            │
│  │ - PDF/DOCX Parse│    │ - MongoDB Atlas  │    │ - Job Status     │            │
│  │ - Image Caption │    │ - Vector Search  │    │ - Health Checks  │            │
│  │ - Text Chunking │    │ - Embeddings     │    │ - Error Handling │            │
│  │ - Embedding Gen │    │ - User/Project   │    │ - Logging        │            │
│  └─────────────────┘    └──────────────────┘    └──────────────────┘            │
└─────────────────────────────────────────────────────────────────────────────────┘
                                        │
                                        ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                                 SHARED DATABASE                                 │
│  ┌─────────────────┐    ┌──────────────────┐    ┌──────────────────┐            │
│  │  MongoDB Atlas  │    │   Collections    │    │     Indexes      │            │
│  │                 │    │ - chunks         │    │ - Vector Search  │            │
│  │ - Same Cluster  │    │ - files          │    │ - Text Search    │            │
│  │ - Same Database │    │ - chat_sessions  │    │ - User/Project   │            │
│  │ - Same Schema   │    │ - chat_messages  │    │ - Performance    │            │
│  └─────────────────┘    └──────────────────┘    └──────────────────┘            │
└─────────────────────────────────────────────────────────────────────────────────┘
```

## 📁 Project Structure

```
ingestion_pipeline/
├── __init__.py
├── app.py               # Main FastAPI application
├── requirements.txt     # Python dependencies
├── Dockerfile           # HuggingFace deployment
├── deploy.sh            # Deployment script
├── test_pipeline.py     # Test script
├── README.md            # This file
├── config/              # Configuration
│   ├── __init__.py
│   └── settings.py
├── api/                 # API layer
│   ├── __init__.py
│   ├── models.py        # Pydantic models
│   └── routes.py        # API routes
└── services/            # Business logic
    ├── __init__.py
    └── ingestion_service.py
```

## 🚀 Quick Start

### Prerequisites
- Docker
- MongoDB Atlas cluster
- Python 3.11+

## 🔧 API Endpoints

### Health Check
```http
GET /health
```

### Upload Files
```http
POST /upload
Content-Type: multipart/form-data

user_id: string
project_id: string
files: File[]
replace_filenames: string (optional)
rename_map: string (optional)
```

### Job Status
```http
GET /upload/status?job_id={job_id}
```

### List Files
```http
GET /files?user_id={user_id}&project_id={project_id}
```

### Get File Chunks
```http
GET /files/chunks?user_id={user_id}&project_id={project_id}&filename={filename}&limit={limit}
```

## 🔄 Data Flow

### File Processing Pipeline
1. **File Upload**: User uploads files via frontend
2. **Load Balancing**: Request routed to ingestion pipeline
3. **File Processing**:
   - PDF/DOCX parsing with image extraction
   - BLIP image captioning
   - Semantic chunking with overlap
   - Embedding generation (all-MiniLM-L6-v2)
4. **Data Storage**:
   - Chunks stored in `chunks` collection
   - File summaries in `files` collection
   - Both scoped by `user_id` and `project_id`
5. **Response**: Job ID returned for progress tracking

### Data Consistency
- **Same Database**: Uses identical MongoDB Atlas cluster
- **Same Collections**: Stores in `chunks` and `files` collections
- **Same Schema**: Identical data structure and metadata
- **Same Scoping**: All data scoped by `user_id` and `project_id`
- **Same Indexes**: Uses identical database indexes

## 🐳 Docker Deployment

### HuggingFace Spaces
The service is designed for HuggingFace Spaces deployment with:
- Port 7860 (HuggingFace default)
- Non-root user for security
- HuggingFace cache directories
- Model preloading and warmup

### Logging
- Comprehensive logging for all operations
- Error tracking and debugging
- Performance monitoring

### Job Tracking
- Upload progress monitoring
- Error handling and reporting
- Status updates

## 🔧 Configuration

### Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `MONGO_URI` | Required | MongoDB connection string |
| `MONGO_DB` | `studybuddy` | Database name |
| `EMBED_MODEL` | `sentence-transformers/all-MiniLM-L6-v2` | Embedding model |
| `ATLAS_VECTOR` | `0` | Enable Atlas Vector Search |
| `MAX_FILES_PER_UPLOAD` | `15` | Maximum files per upload |
| `MAX_FILE_MB` | `50` | Maximum file size in MB |
| `INGESTION_PORT` | `7860` | Service port |

### Processing Configuration
- **Vector Dimension**: 384 (all-MiniLM-L6-v2)
- **Chunk Max Words**: 500
- **Chunk Min Words**: 150
- **Chunk Overlap**: 50 words

## 🔒 Security

### Security Features
- Non-root user in Docker container
- Input validation and sanitization
- Error handling and logging
- Rate limiting (configurable)

### Best Practices
- Use environment variables for secrets
- Regular security updates
- Monitor logs for anomalies
- Implement proper access controls

## 🚀 Performance

### Optimization Features
- Lazy loading of ML models
- Efficient file processing
- Background task processing
- Memory management

### Scaling
- Horizontal scaling support
- Load balancing ready
- Resource optimization
- Performance monitoring

## 📚 Integration

### Main System Integration
The ingestion pipeline is designed to work seamlessly with the main system:
- Same API endpoints
- Same data structures
- Same processing pipeline
- Same storage format

### Load Balancer Integration
- Automatic request routing
- Health check integration
- Failover support
- Performance monitoring

## 🐛 Troubleshooting

### Common Issues
1. **MongoDB Connection**: Verify `MONGO_URI` is correct
2. **Port Conflicts**: Ensure port 7860 is available
3. **Model Loading**: Check HuggingFace cache permissions
4. **File Processing**: Verify file format support

## 📈 Future Enhancements

### Planned Features
- Multiple file format support
- Advanced chunking strategies
- Performance optimizations
- Enhanced monitoring

### Scalability
- Kubernetes deployment
- Auto-scaling support
- Load balancing improvements
- Resource optimization

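The chunking numbers in the README's "Processing Configuration" are easiest to see in code. Below is a minimal, illustrative sketch of a 500-word window with 50-word overlap; the service's real chunker is `build_cards_from_pages` in `utils/ingestion/chunker.py` (part of this commit but driven semantically), so treat this only as a picture of the parameters, not the actual algorithm.

```python
# Illustrative only: fixed-width word chunking with the documented
# parameters (max 500 words per chunk, 50-word overlap). The service's
# actual chunker (utils/ingestion/chunker.py) is semantic and may differ.
def chunk_words(text: str, max_words: int = 500, overlap: int = 50) -> list:
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        end = min(start + max_words, len(words))
        chunks.append(" ".join(words[start:end]))
        if end == len(words):
            break
        start = end - overlap  # step back so adjacent chunks share context
    return chunks
```
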
ingestion_python/api/models.py
ADDED
@@ -0,0 +1,37 @@
"""
Pydantic models for the ingestion pipeline API
"""

from typing import List, Dict, Any, Optional
from pydantic import BaseModel

# Response models (same as main system)
class UploadResponse(BaseModel):
    job_id: str
    status: str
    total_files: Optional[int] = None

class JobStatusResponse(BaseModel):
    job_id: str
    status: str
    total: int
    completed: int
    progress: float
    last_error: Optional[str] = None
    created_at: float

class HealthResponse(BaseModel):
    ok: bool
    mongodb_connected: bool
    service: str = "ingestion_pipeline"

class FileResponse(BaseModel):
    filename: str
    summary: str

class FilesListResponse(BaseModel):
    files: List[FileResponse]
    filenames: List[str]

class ChunksResponse(BaseModel):
    chunks: List[Dict[str, Any]]

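A quick round-trip sketch against the models above, with field values taken from the sample responses in CURL.md. Note `model_dump()` assumes pydantic v2, which this repo's FastAPI pin pulls in by default; on pydantic v1 the method is `.dict()`.

```python
# Round-trip sketch: validate sample payloads against the models above.
# model_dump() assumes pydantic v2 (the default for fastapi==0.114.2).
from api.models import UploadResponse, JobStatusResponse

upload = UploadResponse(job_id="uuid-string", status="processing", total_files=2)
status = JobStatusResponse(
    job_id=upload.job_id, status="completed", total=2, completed=2,
    progress=100.0, last_error=None, created_at=1234567890.123,
)
print(upload.model_dump())
print(status.model_dump())
```
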
ingestion_python/api/routes.py
ADDED
@@ -0,0 +1,238 @@
"""
API routes for the ingestion pipeline
"""

import os
import asyncio
import uuid
import time
import json
from typing import List, Dict, Any, Optional
from fastapi import APIRouter, Form, File, UploadFile, HTTPException, BackgroundTasks, Request

from api.models import UploadResponse, JobStatusResponse, HealthResponse, FilesListResponse, ChunksResponse
from services.ingestion_service import IngestionService
from services.maverick_captioner import _normalize_caption
from utils.logger import get_logger

logger = get_logger("INGESTION_ROUTES", __name__)

# Create router
router = APIRouter()

# Global services (will be injected)
rag = None
embedder = None
captioner = None
ingestion_service = None

def initialize_services(rag_store, embedder_client, captioner_client):
    """Initialize services"""
    global rag, embedder, captioner, ingestion_service
    rag = rag_store
    embedder = embedder_client
    captioner = captioner_client
    ingestion_service = IngestionService(rag_store, embedder_client, captioner_client)

@router.get("/health", response_model=HealthResponse)
async def health():
    """Health check endpoint"""
    mongodb_connected = rag is not None
    return HealthResponse(
        ok=mongodb_connected,
        mongodb_connected=mongodb_connected
    )

@router.post("/upload", response_model=UploadResponse)
async def upload_files(
    request: Request,
    background_tasks: BackgroundTasks,
    user_id: str = Form(...),
    project_id: str = Form(...),
    files: List[UploadFile] = File(...),
    replace_filenames: Optional[str] = Form(None),
    rename_map: Optional[str] = Form(None),
):
    """
    Upload and process files

    This endpoint mirrors the main system's upload functionality exactly.
    """
    if not rag:
        raise HTTPException(500, detail="MongoDB connection not available")

    job_id = str(uuid.uuid4())

    # File limits (same as main system)
    max_files = int(os.getenv("MAX_FILES_PER_UPLOAD", "15"))
    max_mb = int(os.getenv("MAX_FILE_MB", "50"))

    if len(files) > max_files:
        raise HTTPException(400, detail=f"Too many files. Max {max_files} allowed per upload.")

    # Parse replace/rename directives (same as main system)
    replace_set = set()
    try:
        if replace_filenames:
            replace_set = set(json.loads(replace_filenames))
    except Exception:
        pass

    rename_dict: Dict[str, str] = {}
    try:
        if rename_map:
            rename_dict = json.loads(rename_map)
    except Exception:
        pass

    # Preload files (same as main system)
    preloaded_files = []
    for uf in files:
        raw = await uf.read()
        if len(raw) > max_mb * 1024 * 1024:
            raise HTTPException(400, detail=f"{uf.filename} exceeds {max_mb} MB limit")
        eff_name = rename_dict.get(uf.filename, uf.filename)
        preloaded_files.append((eff_name, raw))

    # Initialize job status (same as main system)
    from app import app
    app.state.jobs[job_id] = {
        "created_at": time.time(),
        "total": len(preloaded_files),
        "completed": 0,
        "status": "processing",
        "last_error": None,
    }

    # Background processing (mirrors main system exactly)
    async def _process_all():
        for idx, (fname, raw) in enumerate(preloaded_files, start=1):
            try:
                # Handle file replacement (same as main system)
                if fname in replace_set:
                    try:
                        rag.db["chunks"].delete_many({"user_id": user_id, "project_id": project_id, "filename": fname})
                        rag.db["files"].delete_many({"user_id": user_id, "project_id": project_id, "filename": fname})
                        logger.info(f"[{job_id}] Replaced prior data for {fname}")
                    except Exception as de:
                        logger.warning(f"[{job_id}] Replace delete failed for {fname}: {de}")

                logger.info(f"[{job_id}] ({idx}/{len(preloaded_files)}) Parsing {fname} ({len(raw)} bytes)")

                # Extract pages (same as main system)
                from helpers.pages import _extract_pages
                pages = _extract_pages(fname, raw)

                # Process images with captions (same as main system)
                num_imgs = sum(len(p.get("images", [])) for p in pages)
                captions = []
                if num_imgs > 0:
                    for p in pages:
                        caps = []
                        for im in p.get("images", []):
                            try:
                                cap = captioner.caption_image(im)
                                caps.append(cap)
                            except Exception as e:
                                logger.warning(f"[{job_id}] Caption error in {fname}: {e}")
                        captions.append(caps)
                else:
                    captions = [[] for _ in pages]

                # Merge captions into text (same as main system)
                for p, caps in zip(pages, captions):
                    if caps:
                        normalized = [_normalize_caption(c) for c in caps if c]
                        if normalized:
                            p["text"] = (p.get("text", "") + "\n\n" + "\n".join([f"[Image] {c}" for c in normalized])).strip()

                # Build cards (same as main system)
                from utils.ingestion.chunker import build_cards_from_pages
                cards = await build_cards_from_pages(pages, filename=fname, user_id=user_id, project_id=project_id)
                logger.info(f"[{job_id}] Built {len(cards)} cards for {fname}")

                # Generate embeddings (same as main system)
                embeddings = embedder.embed([c["content"] for c in cards])
                for c, vec in zip(cards, embeddings):
                    c["embedding"] = vec

                # Store in MongoDB (same as main system)
                rag.store_cards(cards)

                # Create file summary (same as main system)
                from utils.service.summarizer import cheap_summarize
                full_text = "\n\n".join(p.get("text", "") for p in pages)
                file_summary = await cheap_summarize(full_text, max_sentences=6)
                rag.upsert_file_summary(user_id=user_id, project_id=project_id, filename=fname, summary=file_summary)

                logger.info(f"[{job_id}] Completed {fname}")

                # Update job progress (same as main system)
                job = app.state.jobs.get(job_id)
                if job:
                    job["completed"] = idx
                    job["status"] = "processing" if idx < job.get("total", 0) else "completed"

            except Exception as e:
                logger.error(f"[{job_id}] Failed processing {fname}: {e}")
                job = app.state.jobs.get(job_id)
                if job:
                    job["last_error"] = str(e)
                    job["completed"] = idx
            finally:
                await asyncio.sleep(0)

        # Finalize job (same as main system)
        logger.info(f"[{job_id}] Ingestion complete for {len(preloaded_files)} files")
        job = app.state.jobs.get(job_id)
        if job:
            job["status"] = "completed"

    background_tasks.add_task(_process_all)

    return UploadResponse(
        job_id=job_id,
        status="processing",
        total_files=len(preloaded_files)
    )

@router.get("/upload/status", response_model=JobStatusResponse)
async def upload_status(job_id: str):
    """Get upload job status"""
    from app import app
    job = app.state.jobs.get(job_id)
    if not job:
        raise HTTPException(404, detail="Job not found")

    progress = (job["completed"] / job["total"]) * 100 if job["total"] > 0 else 0

    return JobStatusResponse(
        job_id=job_id,
        status=job["status"],
        total=job["total"],
        completed=job["completed"],
        progress=progress,
        last_error=job.get("last_error"),
        created_at=job["created_at"]
    )

@router.get("/files", response_model=FilesListResponse)
async def list_files(user_id: str, project_id: str):
    """List files for a project (compatible with main system)"""
    if not rag:
        raise HTTPException(500, detail="MongoDB connection not available")

    files = rag.list_files(user_id, project_id)
    return FilesListResponse(
        files=[{"filename": f["filename"], "summary": f["summary"]} for f in files],
        filenames=[f["filename"] for f in files]
    )

@router.get("/files/chunks", response_model=ChunksResponse)
async def get_file_chunks(user_id: str, project_id: str, filename: str, limit: int = 20):
    """Get chunks for a specific file (compatible with main system)"""
    if not rag:
        raise HTTPException(500, detail="MongoDB connection not available")

    chunks = rag.get_file_chunks(user_id, project_id, filename, limit)
    return ChunksResponse(chunks=chunks)

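The `replace_filenames` and `rename_map` form fields above arrive as JSON-encoded strings (the route silently falls back to no-ops if they fail to parse), and CURL.md never exercises them. A hedged client sketch; the filenames here are illustrative:

```python
# Sketch of the JSON-encoded directive fields parsed by /upload above.
# Filenames are illustrative; the host is the test Space from CURL.md.
import json
import requests

form = {
    "user_id": "44e65346-8eaa-4f95-b17a-f6219953e7a8",
    "project_id": "496e2fad-ec7e-4562-b06a-ea2491f2460",
    # Delete existing chunks/summaries for this file before re-ingesting.
    "replace_filenames": json.dumps(["Lecture5_ML.pdf"]),
    # Store the upload under a different effective filename.
    "rename_map": json.dumps({"draft.pdf": "Lecture5_ML.pdf"}),
}
with open("Lecture5_ML.pdf", "rb") as fh:
    resp = requests.post(
        "https://binkhoale1812-studdybuddy-ingestion1.hf.space/upload",
        data=form,
        files=[("files", ("draft.pdf", fh, "application/pdf"))],
        timeout=300,
    )
print(resp.json())
```
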
ingestion_python/app.py
ADDED
@@ -0,0 +1,58 @@
"""
Ingestion Pipeline Service

A dedicated service for processing file uploads and storing them in MongoDB Atlas.
This service mirrors the main system's file processing functionality while
running as a separate service to share the processing load.
"""

import os
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

# Import shared utilities (now local)
from utils.logger import get_logger
from utils.rag.rag import RAGStore, ensure_indexes
from utils.embedding import RemoteEmbeddingClient
from services.maverick_captioner import NvidiaMaverickCaptioner
from api.routes import router, initialize_services

logger = get_logger("INGESTION_PIPELINE", __name__)

# FastAPI app
app = FastAPI(title="Ingestion Pipeline", version="1.0.0")
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# In-memory job tracker (same as main system)
app.state.jobs = {}

# Global clients (same as main system)
try:
    rag = RAGStore(mongo_uri=os.getenv("MONGO_URI"), db_name=os.getenv("MONGO_DB", "studybuddy"))
    rag.client.admin.command('ping')
    logger.info("[INGESTION_PIPELINE] MongoDB connection successful")
    ensure_indexes(rag)
    logger.info("[INGESTION_PIPELINE] MongoDB indexes ensured")
except Exception as e:
    logger.error(f"[INGESTION_PIPELINE] Failed to initialize MongoDB: {e}")
    rag = None

embedder = RemoteEmbeddingClient()
captioner = NvidiaMaverickCaptioner()

# Initialize services
initialize_services(rag, embedder, captioner)

# Include API routes
app.include_router(router)

if __name__ == "__main__":
    import uvicorn
    port = int(os.getenv("INGESTION_PORT", "7860"))
    uvicorn.run(app, host="0.0.0.0", port=port)

ingestion_python/helpers/pages.py
ADDED
@@ -0,0 +1,24 @@
import os
from typing import List, Dict, Any
from fastapi import HTTPException
from utils.ingestion.parser import parse_pdf_bytes, parse_docx_bytes

# ────────────────────────────── Helpers ──────────────────────────────
def _infer_mime(filename: str) -> str:
    lower = filename.lower()
    if lower.endswith(".pdf"):
        return "application/pdf"
    if lower.endswith(".docx"):
        return "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    return "application/octet-stream"

def _extract_pages(filename: str, file_bytes: bytes) -> List[Dict[str, Any]]:
    mime = _infer_mime(filename)
    if mime == "application/pdf":
        return parse_pdf_bytes(file_bytes)
    elif mime == "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
        return parse_docx_bytes(file_bytes)
    else:
        raise HTTPException(status_code=400, detail=f"Unsupported file type: {filename}")

ingestion_python/requirements.txt
ADDED
@@ -0,0 +1,17 @@
# Ingestion Pipeline Requirements
# Same as main system but focused on processing

fastapi==0.114.2
uvicorn[standard]==0.30.6
python-multipart==0.0.9
pymongo==4.8.0
httpx==0.27.2
requests==2.32.3
python-docx==1.1.2
PyMuPDF==1.24.10
pillow==10.4.0
sumy==0.11.0
numpy==1.26.4
reportlab==4.0.9
markdown==3.6
python-dotenv==1.0.0

ingestion_python/services/ingestion_service.py
ADDED
@@ -0,0 +1,119 @@
"""
Ingestion service for processing files and storing them in MongoDB
"""

import asyncio
import uuid
import time
import json
from typing import List, Dict, Any, Optional
from utils.logger import get_logger
from utils.rag.rag import RAGStore
from utils.embedding import RemoteEmbeddingClient
from services.maverick_captioner import NvidiaMaverickCaptioner, _normalize_caption
from utils.ingestion.chunker import build_cards_from_pages
from utils.service.summarizer import cheap_summarize
from helpers.pages import _extract_pages

logger = get_logger("INGESTION_SERVICE", __name__)

class IngestionService:
    """Service for processing file uploads and storing them in MongoDB"""

    def __init__(self, rag_store: RAGStore, embedder: RemoteEmbeddingClient, captioner: NvidiaMaverickCaptioner):
        self.rag = rag_store
        self.embedder = embedder
        self.captioner = captioner

    async def process_files(
        self,
        user_id: str,
        project_id: str,
        files: List[tuple],  # (filename, raw_bytes)
        replace_filenames: Optional[List[str]] = None,
        rename_map: Optional[Dict[str, str]] = None,
        job_id: Optional[str] = None
    ) -> str:
        """
        Process files and store them in MongoDB

        Args:
            user_id: User identifier
            project_id: Project identifier
            files: List of (filename, raw_bytes) tuples
            replace_filenames: Optional list of filenames to replace
            rename_map: Optional mapping of old names to new names
            job_id: Optional job ID for tracking

        Returns:
            Job ID for tracking progress
        """
        if not job_id:
            job_id = str(uuid.uuid4())

        replace_set = set(replace_filenames or [])

        for idx, (fname, raw) in enumerate(files, start=1):
            try:
                # Handle file replacement
                if fname in replace_set:
                    try:
                        self.rag.db["chunks"].delete_many({"user_id": user_id, "project_id": project_id, "filename": fname})
                        self.rag.db["files"].delete_many({"user_id": user_id, "project_id": project_id, "filename": fname})
                        logger.info(f"[{job_id}] Replaced prior data for {fname}")
                    except Exception as de:
                        logger.warning(f"[{job_id}] Replace delete failed for {fname}: {de}")

                logger.info(f"[{job_id}] ({idx}/{len(files)}) Parsing {fname} ({len(raw)} bytes)")

                # Extract pages
                pages = _extract_pages(fname, raw)

                # Process images with captions
                num_imgs = sum(len(p.get("images", [])) for p in pages)
                captions = []
                if num_imgs > 0:
                    for p in pages:
                        caps = []
                        for im in p.get("images", []):
                            try:
                                cap = self.captioner.caption_image(im)
                                caps.append(cap)
                            except Exception as e:
                                logger.warning(f"[{job_id}] Caption error in {fname}: {e}")
                        captions.append(caps)
                else:
                    captions = [[] for _ in pages]

                # Merge captions into text
                for p, caps in zip(pages, captions):
                    if caps:
                        normalized = [_normalize_caption(c) for c in caps if c]
                        if normalized:
                            p["text"] = (p.get("text", "") + "\n\n" + "\n".join([f"[Image] {c}" for c in normalized])).strip()

                # Build cards
                cards = await build_cards_from_pages(pages, filename=fname, user_id=user_id, project_id=project_id)
                logger.info(f"[{job_id}] Built {len(cards)} cards for {fname}")

                # Generate embeddings
                embeddings = self.embedder.embed([c["content"] for c in cards])
                for c, vec in zip(cards, embeddings):
                    c["embedding"] = vec

                # Store in MongoDB
                self.rag.store_cards(cards)

                # Create file summary
                full_text = "\n\n".join(p.get("text", "") for p in pages)
                file_summary = await cheap_summarize(full_text, max_sentences=6)
                self.rag.upsert_file_summary(user_id=user_id, project_id=project_id, filename=fname, summary=file_summary)

                logger.info(f"[{job_id}] Completed {fname}")

            except Exception as e:
                logger.error(f"[{job_id}] Failed processing {fname}: {e}")
                raise

        logger.info(f"[{job_id}] Ingestion complete for {len(files)} files")
        return job_id

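Unlike the route above, `process_files` has no job-state side effects, so it can be driven directly as a coroutine. A minimal driver sketch, assuming the same client wiring as app.py; the PDF path and IDs are illustrative:

```python
# Minimal driver sketch for IngestionService, assuming the same client
# wiring as app.py. The PDF path and IDs are illustrative.
import asyncio
import os

from utils.rag.rag import RAGStore
from utils.embedding import RemoteEmbeddingClient
from services.maverick_captioner import NvidiaMaverickCaptioner
from services.ingestion_service import IngestionService

async def main() -> None:
    rag = RAGStore(mongo_uri=os.getenv("MONGO_URI"), db_name=os.getenv("MONGO_DB", "studybuddy"))
    service = IngestionService(rag, RemoteEmbeddingClient(), NvidiaMaverickCaptioner())
    with open("Lecture5_ML.pdf", "rb") as fh:
        raw = fh.read()
    job_id = await service.process_files(
        user_id="44e65346-8eaa-4f95-b17a-f6219953e7a8",
        project_id="496e2fad-ec7e-4562-b06a-ea2491f2460",
        files=[("Lecture5_ML.pdf", raw)],
        replace_filenames=["Lecture5_ML.pdf"],  # drop prior data, then re-ingest
    )
    print(f"Ingestion finished, job {job_id}")

asyncio.run(main())
```
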
ingestion_python/services/maverick_captioner.py
ADDED
@@ -0,0 +1,141 @@
import base64
import io
import os
from typing import Optional

import requests
from PIL import Image

from utils.logger import get_logger
try:
    from utils.api.rotator import APIKeyRotator  # available in full repo
except Exception:  # standalone fallback
    class APIKeyRotator:  # type: ignore
        def __init__(self, prefix: str = "NVIDIA_API_", max_slots: int = 5):
            self.keys = []
            for i in range(1, max_slots + 1):
                k = os.getenv(f"{prefix}{i}")
                if k:
                    self.keys.append(k)
            if not self.keys:
                single = os.getenv(prefix.rstrip("_"))
                if single:
                    self.keys.append(single)
            self._idx = 0

        def get_key(self) -> Optional[str]:
            if not self.keys:
                return None
            k = self.keys[self._idx % len(self.keys)]
            self._idx += 1
            return k


logger = get_logger("MAVERICK_CAPTIONER", __name__)


def _normalize_caption(text: str) -> str:
    if not text:
        return ""
    t = text.strip()
    # Remove common conversational openers and meta phrases
    banned_prefixes = [
        "sure,", "sure.", "sure", "here is", "here are", "this image", "the image", "image shows",
        "the picture", "the photo", "the text describes", "the text describe", "it shows", "it depicts",
        "caption:", "description:", "output:", "result:", "answer:", "analysis:", "observation:",
    ]
    t_lower = t.lower()
    for p in banned_prefixes:
        if t_lower.startswith(p):
            t = t[len(p):].lstrip(" :-\u2014\u2013")
            t_lower = t.lower()

    # Strip surrounding quotes and markdown artifacts
    t = t.strip().strip('"').strip("'").strip()
    # Collapse whitespace
    t = " ".join(t.split())
    return t


class NvidiaMaverickCaptioner:
    """Caption images using NVIDIA Integrate API (meta/llama-4-maverick-17b-128e-instruct)."""

    def __init__(self, rotator: Optional[APIKeyRotator] = None, model: Optional[str] = None):
        self.rotator = rotator or APIKeyRotator(prefix="NVIDIA_API_", max_slots=5)
        self.model = model or os.getenv("NVIDIA_MAVERICK_MODEL", "meta/llama-4-maverick-17b-128e-instruct")
        self.invoke_url = "https://integrate.api.nvidia.com/v1/chat/completions"

    def _encode_image_jpeg_b64(self, image: Image.Image) -> str:
        buf = io.BytesIO()
        # Convert to RGB to ensure JPEG-compatible
        image.convert("RGB").save(buf, format="JPEG", quality=90)
        return base64.b64encode(buf.getvalue()).decode("utf-8")

    def caption_image(self, image: Image.Image) -> str:
        try:
            key = self.rotator.get_key()
            if not key:
                logger.warning("NVIDIA API key not available; skipping image caption.")
                return ""

            img_b64 = self._encode_image_jpeg_b64(image)

            # Strict, non-conversational system prompt
            system_prompt = (
                "You are an expert vision captioner. Produce a precise, information-dense caption of the image. "
                "Do not include conversational phrases, prefaces, meta commentary, or apologies. "
                "Avoid starting with phrases like 'The image/picture/photo shows' or 'Here is'. "
                "Write a single concise paragraph with concrete entities, text in the image, and notable details."
            )

            user_prompt = (
                "Caption this image at the finest level of detail. Include any visible text verbatim. "
                "Return only the caption text."
            )

            # Multimodal content format for NVIDIA Integrate API
            messages = [
                {"role": "system", "content": system_prompt},
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": user_prompt},
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": f"data:image/jpeg;base64,{img_b64}"
                            }
                        },
                    ]
                },
            ]

            payload = {
                "model": self.model,
                "messages": messages,
                "max_tokens": 512,
                "temperature": 0.2,
                "top_p": 0.9,
                "frequency_penalty": 0.0,
                "presence_penalty": 0.0,
                "stream": False,
            }

            headers = {
                "Authorization": f"Bearer {key}",
                "Accept": "application/json",
                "Content-Type": "application/json",
            }

            resp = requests.post(self.invoke_url, headers=headers, json=payload, timeout=60)
            if resp.status_code >= 400:
                logger.warning(f"Maverick caption API error {resp.status_code}: {resp.text[:200]}")
                return ""
            data = resp.json()
            text = data.get("choices", [{}])[0].get("message", {}).get("content", "")
            return _normalize_caption(text)
        except Exception as e:
            logger.warning(f"Maverick caption failed: {e}")
            return ""

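Usage is a single call per image. Keys are picked up from `NVIDIA_API_1`..`NVIDIA_API_5` (or a single `NVIDIA_API` via the standalone fallback rotator above); a short sketch with an illustrative image path:

```python
# Usage sketch for NvidiaMaverickCaptioner. Set NVIDIA_API_1..NVIDIA_API_5
# (or NVIDIA_API with the fallback rotator); the image path is illustrative.
from PIL import Image

from services.maverick_captioner import NvidiaMaverickCaptioner

captioner = NvidiaMaverickCaptioner()   # rotates across configured keys
img = Image.open("slide_figure.png")
caption = captioner.caption_image(img)  # returns "" on missing key or API error
print(caption or "(no caption produced)")
```
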
ingestion_python/test_upload1.sh
ADDED
@@ -0,0 +1,241 @@
#!/bin/bash

set -euo pipefail

echo "🚀 Testing Ingestion Pipeline Upload"
echo "======================================"

# Configuration
BACKEND_URL="https://binkhoale1812-studdybuddy-ingestion1.hf.space"
USER_ID="44e65346-8eaa-4f95-b17a-f6219953e7a8"
PROJECT_ID="496e2fad-ec7e-4562-b06a-ea2491f2460"

# Test files
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
FILE1="$SCRIPT_DIR/../exefiles/Lecture5_ML.pdf"
FILE2="$SCRIPT_DIR/../exefiles/Lecture6_ANN_DL.pdf"

# Debug toggles
DEBUG=${DEBUG:-0}
TRACE=${TRACE:-0}

echo "📋 Configuration:"
echo "  Backend URL: $BACKEND_URL"
echo "  User ID: $USER_ID"
echo "  Project ID: $PROJECT_ID"
echo "  Files: $FILE1, $FILE2"
echo ""

# Validate files and resolve absolute paths
if [ ! -f "$FILE1" ]; then
  echo "❌ Missing file: $FILE1"; exit 26
fi
if [ ! -f "$FILE2" ]; then
  echo "❌ Missing file: $FILE2"; exit 26
fi
FILE1_DIR="$(cd "$(dirname "$FILE1")" && pwd)"; FILE1_BASENAME="$(basename "$FILE1")"; FILE1="$FILE1_DIR/$FILE1_BASENAME"
FILE2_DIR="$(cd "$(dirname "$FILE2")" && pwd)"; FILE2_BASENAME="$(basename "$FILE2")"; FILE2="$FILE2_DIR/$FILE2_BASENAME"

curl_base() {
  local method="$1"; shift
  local url="$1"; shift
  local extra=("$@")
  local common=(
    -L --http1.1 --fail-with-body -sS
    --connect-timeout 60
    --retry 5 --retry-delay 4 --retry-connrefused
  )
  if [ "$DEBUG" = "1" ]; then
    common+=( -v )
  fi
  if [ "$TRACE" = "1" ]; then
    common+=( --trace-time --trace-ascii - )
  fi
  curl -X "$method" "$url" "${common[@]}" "${extra[@]}"
}

json_with_status() {
  local method="$1"; shift
  local url="$1"; shift
  local extra=("$@")
  curl_base "$method" "$url" "${extra[@]}" \
    -w "\nHTTP Status: %{http_code}\n"
}

# Step 0: Preflight (for browser parity)
echo "🛰️ Step 0: OPTIONS /upload (preflight parity)"
echo "---------------------------------------------"
json_with_status OPTIONS "$BACKEND_URL/upload" -H "Origin: https://example.com" -H "Access-Control-Request-Method: POST" || true
echo ""; echo ""

# Step 1: Health Check
echo "🏥 Step 1: Health Check"
echo "------------------------"
json_with_status GET "$BACKEND_URL/health" -H "Accept: application/json" || true
echo ""; echo ""

# Step 2: Upload Files
echo "📁 Step 2: Upload Files (sequential)"
echo "------------------------------------"
echo "Uploading $(basename "$FILE1")..."

UPLOAD_HEADERS=$(mktemp)
UPLOAD_BODY=$(mktemp)

set +e
HTTP_CODE=$(curl -L --http1.1 --fail-with-body -sS \
  --connect-timeout 60 --retry 3 --retry-delay 4 --retry-connrefused \
  -H "Expect:" \
  -X POST "$BACKEND_URL/upload" \
  -F "user_id=$USER_ID" \
  -F "project_id=$PROJECT_ID" \
  -F "files=@$FILE1" \
  -D "$UPLOAD_HEADERS" -o "$UPLOAD_BODY" \
  -w "%{http_code}")
RET=$?
set -e

echo "HTTP Status: $HTTP_CODE"
echo "--- Response Headers ---"; sed -e 's/\r$//' "$UPLOAD_HEADERS" | sed 's/^/  /'
echo "--- Response Body ---"; sed 's/^/  /' "$UPLOAD_BODY"

if [ "$RET" -ne 0 ] || [ "$HTTP_CODE" = "000" ]; then
  echo "❌ Upload failed (curl exit=$RET, http=$HTTP_CODE)"; exit 1
fi

# Extract job_id (prefer jq; fall back to python3 reading the body from stdin)
if command -v jq >/dev/null 2>&1; then
  JOB_ID=$(jq -r '.job_id // empty' < "$UPLOAD_BODY")
else
  JOB_ID=$(python3 -c '
import sys, json
try:
    data = json.load(sys.stdin)
    print(data.get("job_id", ""))
except Exception:
    print("")
' < "$UPLOAD_BODY")
fi

if [ -z "${JOB_ID:-}" ]; then
  echo "❌ Failed to extract job_id from upload response"; exit 1
fi

echo ""
echo "✅ Upload 1 initiated successfully!"
echo "  Job ID: $JOB_ID"
echo ""

# Step 3: Monitor Upload Progress
echo "📊 Step 3: Monitor Upload Progress"
echo "----------------------------------"

for i in {1..48}; do
  echo "Checking progress (attempt $i/48)..."
  json_with_status GET "$BACKEND_URL/upload/status?job_id=$JOB_ID" -H "Accept: application/json" | sed 's/^/  /'
  STATUS_LINE=$(json_with_status GET "$BACKEND_URL/upload/status?job_id=$JOB_ID" -H "Accept: application/json" | tail -n +1)
  if echo "$STATUS_LINE" | grep -q '"status":"completed"'; then
    echo "✅ Upload completed successfully!"; break
  elif echo "$STATUS_LINE" | grep -q '"status":"processing"'; then
    echo "⏳ Still processing... waiting 20 seconds"; sleep 20
  else
    echo "❌ Upload failed or unknown status"; break
  fi
  echo ""
done

echo ""

# Now upload second file after first completes
echo "📁 Step 4: Upload second file"
echo "------------------------------"
echo "Uploading $(basename "$FILE2")..."

UPLOAD_HEADERS2=$(mktemp)
UPLOAD_BODY2=$(mktemp)

set +e
HTTP_CODE2=$(curl -L --http1.1 --fail-with-body -sS \
  --connect-timeout 60 --retry 3 --retry-delay 4 --retry-connrefused \
  -H "Expect:" \
  -X POST "$BACKEND_URL/upload" \
  -F "user_id=$USER_ID" \
  -F "project_id=$PROJECT_ID" \
  -F "files=@$FILE2" \
  -D "$UPLOAD_HEADERS2" -o "$UPLOAD_BODY2" \
  -w "%{http_code}")
RET2=$?
set -e

echo "HTTP Status: $HTTP_CODE2"
echo "--- Response Headers ---"; sed -e 's/\r$//' "$UPLOAD_HEADERS2" | sed 's/^/  /'
echo "--- Response Body ---"; sed 's/^/  /' "$UPLOAD_BODY2"

if [ "$RET2" -ne 0 ] || [ "$HTTP_CODE2" = "000" ]; then
  echo "❌ Upload 2 failed (curl exit=$RET2, http=$HTTP_CODE2)"; exit 1
fi

# Extract job_id2
if command -v jq >/dev/null 2>&1; then
  JOB_ID2=$(jq -r '.job_id // empty' < "$UPLOAD_BODY2")
else
  JOB_ID2=$(python3 -c '
import sys, json
try:
    data = json.load(sys.stdin)
    print(data.get("job_id", ""))
except Exception:
    print("")
' < "$UPLOAD_BODY2")
fi

if [ -z "${JOB_ID2:-}" ]; then
  echo "❌ Failed to extract job_id from second upload response"; exit 1
fi

echo ""
echo "✅ Upload 2 initiated successfully!"
echo "  Job ID: $JOB_ID2"
echo ""

echo "📊 Step 5: Monitor Upload 2 Progress"
echo "-------------------------------------"
for i in {1..48}; do
  echo "Checking progress (attempt $i/48)..."
  json_with_status GET "$BACKEND_URL/upload/status?job_id=$JOB_ID2" -H "Accept: application/json" | sed 's/^/  /'
|
| 209 |
+
STATUS_LINE=$(json_with_status GET "$BACKEND_URL/upload/status?job_id=$JOB_ID2" -H "Accept: application/json" | tail -n +1)
|
| 210 |
+
if echo "$STATUS_LINE" | grep -q '"status":"completed"'; then
|
| 211 |
+
echo "✅ Upload 2 completed successfully!"; break
|
| 212 |
+
elif echo "$STATUS_LINE" | grep -q '"status":"processing"'; then
|
| 213 |
+
echo "⏳ Still processing... waiting 20 seconds"; sleep 20
|
| 214 |
+
else
|
| 215 |
+
echo "❌ Upload 2 failed or unknown status"; break
|
| 216 |
+
fi
|
| 217 |
+
echo ""
|
| 218 |
+
done
|
| 219 |
+
|
| 220 |
+
echo ""
|
| 221 |
+
|
| 222 |
+
# Step 5: List Uploaded Files
|
| 223 |
+
echo "📋 Step 4: List Uploaded Files"
|
| 224 |
+
echo "-------------------------------"
|
| 225 |
+
json_with_status GET "$BACKEND_URL/files?user_id=$USER_ID&project_id=$PROJECT_ID" -H "Accept: application/json" | sed 's/^/ /'
|
| 226 |
+
echo ""; echo ""
|
| 227 |
+
|
| 228 |
+
# Step 5: Get File Chunks (for Lecture7_GA_EC.pdf)
|
| 229 |
+
echo "🔍 Step 5: Get File Chunks for Lecture7_GA_EC.pdf"
|
| 230 |
+
echo "----------------------------------------------"
|
| 231 |
+
json_with_status GET "$BACKEND_URL/files/chunks?user_id=$USER_ID&project_id=$PROJECT_ID&filename=Lecture7_GA_EC.pdf&limit=5" -H "Accept: application/json" | sed 's/^/ /'
|
| 232 |
+
echo ""; echo ""
|
| 233 |
+
|
| 234 |
+
# Step 6: Get File Chunks (for Tut7.pdf)
|
| 235 |
+
echo "🔍 Step 6: Get File Chunks for Tut7.pdf"
|
| 236 |
+
echo "------------------------------------------------"
|
| 237 |
+
json_with_status GET "$BACKEND_URL/files/chunks?user_id=$USER_ID&project_id=$PROJECT_ID&filename=Tut7.pdf&limit=5" -H "Accept: application/json" | sed 's/^/ /'
|
| 238 |
+
|
| 239 |
+
echo ""
|
| 240 |
+
echo "🎉 Test completed!"
|
| 241 |
+
echo "=================="
|
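All three test scripts exercise the same pattern: POST the file to `/upload`, read `job_id` from the JSON response, then poll `/upload/status` until the job leaves the `processing` state. For readers who want the same flow without curl, here is a minimal Python sketch; the `BACKEND_URL` value and the helper name `upload_and_wait` are illustrative, and only the `/upload` and `/upload/status` behavior shown above is assumed.

```python
# Minimal sketch of the upload-then-poll flow the shell scripts exercise.
# Assumptions: BACKEND_URL points at one of the ingestion Spaces, and the
# endpoints behave exactly as in the curl calls above. Not part of the repo.
import time
import requests

BACKEND_URL = "https://binkhoale1812-studdybuddy-ingestion2.hf.space"  # example Space


def upload_and_wait(path: str, user_id: str, project_id: str, attempts: int = 48) -> dict:
    with open(path, "rb") as fh:
        resp = requests.post(
            f"{BACKEND_URL}/upload",
            data={"user_id": user_id, "project_id": project_id},
            files={"files": fh},
            timeout=60,
        )
    resp.raise_for_status()
    job_id = resp.json().get("job_id")
    if not job_id:
        raise RuntimeError("no job_id in upload response")
    for _ in range(attempts):
        status = requests.get(
            f"{BACKEND_URL}/upload/status", params={"job_id": job_id}, timeout=60
        ).json()
        if status.get("status") != "processing":
            return status  # completed, failed, or unknown
        time.sleep(20)  # same cadence as the shell scripts
    return {"status": "timeout", "job_id": job_id}
```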
ingestion_python/test_upload2.sh
ADDED
@@ -0,0 +1,238 @@
#!/bin/bash

set -euo pipefail

echo "🚀 Testing Ingestion Pipeline Upload"
echo "======================================"

# Configuration
BACKEND_URL="https://binkhoale1812-studdybuddy-ingestion2.hf.space"
USER_ID="44e65346-8eaa-4f95-b17a-f6219953e7a8"
PROJECT_ID="496e2fad-ec7e-4562-b06a-ea2491f2460"

# Test files (resolve relative to this script's directory)
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
FILE1="$SCRIPT_DIR/../exefiles/Lecture7_GA_EC.pdf"
FILE2="$SCRIPT_DIR/../exefiles/Tut7.pdf"

# Debug toggles
DEBUG=${DEBUG:-0}
TRACE=${TRACE:-0}

echo "📋 Configuration:"
echo " Backend URL: $BACKEND_URL"
echo " User ID: $USER_ID"
echo " Project ID: $PROJECT_ID"
echo " Files: $FILE1, $FILE2"
echo ""

# Validate files and resolve absolute paths
if [ ! -f "$FILE1" ]; then
  echo "❌ Missing file: $FILE1"; exit 26
fi
if [ ! -f "$FILE2" ]; then
  echo "❌ Missing file: $FILE2"; exit 26
fi
FILE1_DIR="$(cd "$(dirname "$FILE1")" && pwd)"; FILE1_BASENAME="$(basename "$FILE1")"; FILE1="$FILE1_DIR/$FILE1_BASENAME"
FILE2_DIR="$(cd "$(dirname "$FILE2")" && pwd)"; FILE2_BASENAME="$(basename "$FILE2")"; FILE2="$FILE2_DIR/$FILE2_BASENAME"

curl_base() {
  local method="$1"; shift
  local url="$1"; shift
  local extra=("$@")
  local common=(
    -L --http1.1 --fail-with-body -sS
    --connect-timeout 60
    --retry 5 --retry-delay 4 --retry-connrefused
  )
  if [ "$DEBUG" = "1" ]; then
    common+=( -v )
  fi
  if [ "$TRACE" = "1" ]; then
    common+=( --trace-time --trace-ascii - )
  fi
  curl -X "$method" "$url" "${common[@]}" "${extra[@]}"
}

json_with_status() {
  local method="$1"; shift
  local url="$1"; shift
  local extra=("$@")
  curl_base "$method" "$url" "${extra[@]}" \
    -w "\nHTTP Status: %{http_code}\n"
}

# Step 0: Preflight (for browser parity)
echo "🛰️ Step 0: OPTIONS /upload (preflight parity)"
echo "---------------------------------------------"
json_with_status OPTIONS "$BACKEND_URL/upload" -H "Origin: https://example.com" -H "Access-Control-Request-Method: POST" || true
echo ""; echo ""

# Step 1: Health Check
echo "🏥 Step 1: Health Check"
echo "------------------------"
json_with_status GET "$BACKEND_URL/health" -H "Accept: application/json" || true
echo ""; echo ""

# Step 2: Upload first file
echo "📁 Step 2: Upload Files (sequential)"
echo "------------------------------------"
echo "Uploading $(basename "$FILE1")..."

UPLOAD_HEADERS=$(mktemp)
UPLOAD_BODY=$(mktemp)

set +e
HTTP_CODE=$(curl -L --http1.1 --fail-with-body -sS \
  --connect-timeout 60 --retry 3 --retry-delay 4 --retry-connrefused \
  -H "Expect:" \
  -X POST "$BACKEND_URL/upload" \
  -F "user_id=$USER_ID" \
  -F "project_id=$PROJECT_ID" \
  -F "files=@$FILE1" \
  -D "$UPLOAD_HEADERS" -o "$UPLOAD_BODY" \
  -w "%{http_code}")
RET=$?
set -e

echo "HTTP Status: $HTTP_CODE"
echo "--- Response Headers ---"; sed -e 's/\r$//' "$UPLOAD_HEADERS" | sed 's/^/  /'
echo "--- Response Body ---"; sed 's/^/  /' "$UPLOAD_BODY"

if [ "$RET" -ne 0 ] || [ "$HTTP_CODE" = "000" ]; then
  echo "❌ Upload failed (curl exit=$RET, http=$HTTP_CODE)"; exit 1
fi

# Extract job_id (prefer jq; the python3 fallback reads the body from stdin)
if command -v jq >/dev/null 2>&1; then
  JOB_ID=$(jq -r '.job_id // empty' < "$UPLOAD_BODY")
else
  JOB_ID=$(python3 -c 'import sys, json
try:
    print(json.load(sys.stdin).get("job_id", ""))
except Exception:
    print("")' < "$UPLOAD_BODY")
fi

if [ -z "${JOB_ID:-}" ]; then
  echo "❌ Failed to extract job_id from upload response"; exit 1
fi

echo ""
echo "✅ Upload 1 initiated successfully!"
echo " Job ID: $JOB_ID"
echo ""

# Step 3: Monitor upload 1 progress (fetch the status once per attempt and reuse it)
echo "📊 Step 3: Monitor Upload Progress"
echo "----------------------------------"

for i in {1..48}; do
  echo "Checking progress (attempt $i/48)..."
  STATUS_LINE=$(json_with_status GET "$BACKEND_URL/upload/status?job_id=$JOB_ID" -H "Accept: application/json")
  echo "$STATUS_LINE" | sed 's/^/  /'
  if echo "$STATUS_LINE" | grep -q '"status":"completed"'; then
    echo "✅ Upload completed successfully!"; break
  elif echo "$STATUS_LINE" | grep -q '"status":"processing"'; then
    echo "⏳ Still processing... waiting 20 seconds"; sleep 20
  else
    echo "❌ Upload failed or unknown status"; break
  fi
  echo ""
done

echo ""

# Step 4: Upload second file
echo "📁 Step 4: Upload second file"
echo "------------------------------"
echo "Uploading $(basename "$FILE2")..."

UPLOAD_HEADERS2=$(mktemp)
UPLOAD_BODY2=$(mktemp)

set +e
HTTP_CODE2=$(curl -L --http1.1 --fail-with-body -sS \
  --connect-timeout 60 --retry 3 --retry-delay 4 --retry-connrefused \
  -H "Expect:" \
  -X POST "$BACKEND_URL/upload" \
  -F "user_id=$USER_ID" \
  -F "project_id=$PROJECT_ID" \
  -F "files=@$FILE2" \
  -D "$UPLOAD_HEADERS2" -o "$UPLOAD_BODY2" \
  -w "%{http_code}")
RET2=$?
set -e

echo "HTTP Status: $HTTP_CODE2"
echo "--- Response Headers ---"; sed -e 's/\r$//' "$UPLOAD_HEADERS2" | sed 's/^/  /'
echo "--- Response Body ---"; sed 's/^/  /' "$UPLOAD_BODY2"

if [ "$RET2" -ne 0 ] || [ "$HTTP_CODE2" = "000" ]; then
  echo "❌ Upload 2 failed (curl exit=$RET2, http=$HTTP_CODE2)"; exit 1
fi

if command -v jq >/dev/null 2>&1; then
  JOB_ID2=$(jq -r '.job_id // empty' < "$UPLOAD_BODY2")
else
  JOB_ID2=$(python3 -c 'import sys, json
try:
    print(json.load(sys.stdin).get("job_id", ""))
except Exception:
    print("")' < "$UPLOAD_BODY2")
fi

if [ -z "${JOB_ID2:-}" ]; then
  echo "❌ Failed to extract job_id from second upload response"; exit 1
fi

echo ""
echo "✅ Upload 2 initiated successfully!"
echo " Job ID: $JOB_ID2"
echo ""

# Step 5: Monitor upload 2 progress
echo "📊 Step 5: Monitor Upload 2 Progress"
echo "-------------------------------------"
for i in {1..48}; do
  echo "Checking progress (attempt $i/48)..."
  STATUS_LINE=$(json_with_status GET "$BACKEND_URL/upload/status?job_id=$JOB_ID2" -H "Accept: application/json")
  echo "$STATUS_LINE" | sed 's/^/  /'
  if echo "$STATUS_LINE" | grep -q '"status":"completed"'; then
    echo "✅ Upload 2 completed successfully!"; break
  elif echo "$STATUS_LINE" | grep -q '"status":"processing"'; then
    echo "⏳ Still processing... waiting 20 seconds"; sleep 20
  else
    echo "❌ Upload 2 failed or unknown status"; break
  fi
  echo ""
done

echo ""

# Step 6: List Uploaded Files
echo "📋 Step 6: List Uploaded Files"
echo "-------------------------------"
json_with_status GET "$BACKEND_URL/files?user_id=$USER_ID&project_id=$PROJECT_ID" -H "Accept: application/json" | sed 's/^/  /'
echo ""; echo ""

# Step 7: Get File Chunks (for Lecture7_GA_EC.pdf)
echo "🔍 Step 7: Get File Chunks for Lecture7_GA_EC.pdf"
echo "----------------------------------------------"
json_with_status GET "$BACKEND_URL/files/chunks?user_id=$USER_ID&project_id=$PROJECT_ID&filename=Lecture7_GA_EC.pdf&limit=5" -H "Accept: application/json" | sed 's/^/  /'
echo ""; echo ""

# Step 8: Get File Chunks (for Tut7.pdf)
echo "🔍 Step 8: Get File Chunks for Tut7.pdf"
echo "------------------------------------------------"
json_with_status GET "$BACKEND_URL/files/chunks?user_id=$USER_ID&project_id=$PROJECT_ID&filename=Tut7.pdf&limit=5" -H "Accept: application/json" | sed 's/^/  /'

echo ""
echo "🎉 Test completed!"
echo "=================="
ingestion_python/test_upload3.sh
ADDED
@@ -0,0 +1,227 @@
#!/bin/bash

set -euo pipefail

echo "🚀 Testing Ingestion Pipeline Upload"
echo "======================================"

# Configuration
BACKEND_URL="https://binkhoale1812-studdybuddy-ingestion3.hf.space"
USER_ID="44e65346-8eaa-4f95-b17a-f6219953e7a8"
PROJECT_ID="496e2fad-ec7e-4562-b06a-ea2491f2460"

# Test files
FILE1="../exefiles/Lecture8_PSO_ACO.pdf"
FILE2="../exefiles/Tut8.pdf"

# Debug toggles
DEBUG=${DEBUG:-0}
TRACE=${TRACE:-0}

echo "📋 Configuration:"
echo " Backend URL: $BACKEND_URL"
echo " User ID: $USER_ID"
echo " Project ID: $PROJECT_ID"
echo " Files: $FILE1, $FILE2"
echo ""

curl_base() {
  local method="$1"; shift
  local url="$1"; shift
  local extra=("$@")
  local common=(
    -L --http1.1 --fail-with-body -sS
    --connect-timeout 60
    --retry 5 --retry-delay 4 --retry-connrefused
  )
  if [ "$DEBUG" = "1" ]; then
    common+=( -v )
  fi
  if [ "$TRACE" = "1" ]; then
    common+=( --trace-time --trace-ascii - )
  fi
  curl -X "$method" "$url" "${common[@]}" "${extra[@]}"
}

json_with_status() {
  local method="$1"; shift
  local url="$1"; shift
  local extra=("$@")
  curl_base "$method" "$url" "${extra[@]}" \
    -w "\nHTTP Status: %{http_code}\n"
}

# Step 0: Preflight (for browser parity)
echo "🛰️ Step 0: OPTIONS /upload (preflight parity)"
echo "---------------------------------------------"
json_with_status OPTIONS "$BACKEND_URL/upload" -H "Origin: https://example.com" -H "Access-Control-Request-Method: POST" || true
echo ""; echo ""

# Step 1: Health Check
echo "🏥 Step 1: Health Check"
echo "------------------------"
json_with_status GET "$BACKEND_URL/health" -H "Accept: application/json" || true
echo ""; echo ""

# Step 2: Upload first file
echo "📁 Step 2: Upload Files (sequential)"
echo "------------------------------------"
echo "Uploading $(basename "$FILE1")..."

UPLOAD_HEADERS=$(mktemp)
UPLOAD_BODY=$(mktemp)

set +e
HTTP_CODE=$(curl -L --http1.1 --fail-with-body -sS \
  --connect-timeout 60 --retry 3 --retry-delay 4 --retry-connrefused \
  -H "Expect:" \
  -X POST "$BACKEND_URL/upload" \
  -F "user_id=$USER_ID" \
  -F "project_id=$PROJECT_ID" \
  -F "files=@$FILE1" \
  -D "$UPLOAD_HEADERS" -o "$UPLOAD_BODY" \
  -w "%{http_code}")
RET=$?
set -e

echo "HTTP Status: $HTTP_CODE"
echo "--- Response Headers ---"; sed -e 's/\r$//' "$UPLOAD_HEADERS" | sed 's/^/  /'
echo "--- Response Body ---"; sed 's/^/  /' "$UPLOAD_BODY"

if [ "$RET" -ne 0 ] || [ "$HTTP_CODE" = "000" ]; then
  echo "❌ Upload failed (curl exit=$RET, http=$HTTP_CODE)"; exit 1
fi

# Extract job_id (prefer jq; the python3 fallback reads the body from stdin)
if command -v jq >/dev/null 2>&1; then
  JOB_ID=$(jq -r '.job_id // empty' < "$UPLOAD_BODY")
else
  JOB_ID=$(python3 -c 'import sys, json
try:
    print(json.load(sys.stdin).get("job_id", ""))
except Exception:
    print("")' < "$UPLOAD_BODY")
fi

if [ -z "${JOB_ID:-}" ]; then
  echo "❌ Failed to extract job_id from upload response"; exit 1
fi

echo ""
echo "✅ Upload 1 initiated successfully!"
echo " Job ID: $JOB_ID"
echo ""

# Step 3: Monitor upload 1 progress (fetch the status once per attempt and reuse it)
echo "📊 Step 3: Monitor Upload Progress"
echo "----------------------------------"

for i in {1..48}; do
  echo "Checking progress (attempt $i/48)..."
  STATUS_LINE=$(json_with_status GET "$BACKEND_URL/upload/status?job_id=$JOB_ID" -H "Accept: application/json")
  echo "$STATUS_LINE" | sed 's/^/  /'
  if echo "$STATUS_LINE" | grep -q '"status":"completed"'; then
    echo "✅ Upload completed successfully!"; break
  elif echo "$STATUS_LINE" | grep -q '"status":"processing"'; then
    echo "⏳ Still processing... waiting 20 seconds"; sleep 20
  else
    echo "❌ Upload failed or unknown status"; break
  fi
  echo ""
done

echo ""

# Step 4: Upload second file
echo "📁 Step 4: Upload second file"
echo "------------------------------"
echo "Uploading $(basename "$FILE2")..."

UPLOAD_HEADERS2=$(mktemp)
UPLOAD_BODY2=$(mktemp)

set +e
HTTP_CODE2=$(curl -L --http1.1 --fail-with-body -sS \
  --connect-timeout 60 --retry 3 --retry-delay 4 --retry-connrefused \
  -H "Expect:" \
  -X POST "$BACKEND_URL/upload" \
  -F "user_id=$USER_ID" \
  -F "project_id=$PROJECT_ID" \
  -F "files=@$FILE2" \
  -D "$UPLOAD_HEADERS2" -o "$UPLOAD_BODY2" \
  -w "%{http_code}")
RET2=$?
set -e

echo "HTTP Status: $HTTP_CODE2"
echo "--- Response Headers ---"; sed -e 's/\r$//' "$UPLOAD_HEADERS2" | sed 's/^/  /'
echo "--- Response Body ---"; sed 's/^/  /' "$UPLOAD_BODY2"

if [ "$RET2" -ne 0 ] || [ "$HTTP_CODE2" = "000" ]; then
  echo "❌ Upload 2 failed (curl exit=$RET2, http=$HTTP_CODE2)"; exit 1
fi

if command -v jq >/dev/null 2>&1; then
  JOB_ID2=$(jq -r '.job_id // empty' < "$UPLOAD_BODY2")
else
  JOB_ID2=$(python3 -c 'import sys, json
try:
    print(json.load(sys.stdin).get("job_id", ""))
except Exception:
    print("")' < "$UPLOAD_BODY2")
fi

if [ -z "${JOB_ID2:-}" ]; then
  echo "❌ Failed to extract job_id from second upload response"; exit 1
fi

echo ""
echo "✅ Upload 2 initiated successfully!"
echo " Job ID: $JOB_ID2"
echo ""

# Step 5: Monitor upload 2 progress
echo "📊 Step 5: Monitor Upload 2 Progress"
echo "-------------------------------------"
for i in {1..48}; do
  echo "Checking progress (attempt $i/48)..."
  STATUS_LINE=$(json_with_status GET "$BACKEND_URL/upload/status?job_id=$JOB_ID2" -H "Accept: application/json")
  echo "$STATUS_LINE" | sed 's/^/  /'
  if echo "$STATUS_LINE" | grep -q '"status":"completed"'; then
    echo "✅ Upload 2 completed successfully!"; break
  elif echo "$STATUS_LINE" | grep -q '"status":"processing"'; then
    echo "⏳ Still processing... waiting 20 seconds"; sleep 20
  else
    echo "❌ Upload 2 failed or unknown status"; break
  fi
  echo ""
done

echo ""

# Step 6: List Uploaded Files
echo "📋 Step 6: List Uploaded Files"
echo "-------------------------------"
json_with_status GET "$BACKEND_URL/files?user_id=$USER_ID&project_id=$PROJECT_ID" -H "Accept: application/json" | sed 's/^/  /'
echo ""; echo ""

# Step 7: Get File Chunks (for Lecture8_PSO_ACO.pdf)
echo "🔍 Step 7: Get File Chunks for Lecture8_PSO_ACO.pdf"
echo "----------------------------------------------"
json_with_status GET "$BACKEND_URL/files/chunks?user_id=$USER_ID&project_id=$PROJECT_ID&filename=Lecture8_PSO_ACO.pdf&limit=5" -H "Accept: application/json" | sed 's/^/  /'
echo ""; echo ""

# Step 8: Get File Chunks (for Tut8.pdf)
echo "🔍 Step 8: Get File Chunks for Tut8.pdf"
echo "------------------------------------------------"
json_with_status GET "$BACKEND_URL/files/chunks?user_id=$USER_ID&project_id=$PROJECT_ID&filename=Tut8.pdf&limit=5" -H "Accept: application/json" | sed 's/^/  /'

echo ""
echo "🎉 Test completed!"
echo "=================="
ingestion_python/utils/__init__.py
ADDED
@@ -0,0 +1,2 @@
ingestion_python/utils/api/rotator.py
ADDED
@@ -0,0 +1,67 @@
# ────────────────────────────── utils/rotator.py ──────────────────────────────
import os
import itertools
from ..logger import get_logger
from typing import Optional

import httpx

logger = get_logger("ROTATOR", __name__)


class APIKeyRotator:
    """
    Round-robin API key rotator.
    - Loads keys from env vars with given prefix (e.g., GEMINI_API_1..6)
    - get_key() returns current key
    - rotate() moves to next key
    - on HTTP 401/429/5xx you should call rotate() and retry (bounded)
    """
    def __init__(self, prefix: str, max_slots: int = 6):
        self.keys = []
        for i in range(1, max_slots + 1):
            v = os.getenv(f"{prefix}{i}")
            if v:
                self.keys.append(v.strip())
        if not self.keys:
            logger.warning(f"No API keys found for prefix {prefix}. Calls will likely fail.")
            self._cycle = itertools.cycle([""])
        else:
            self._cycle = itertools.cycle(self.keys)
        self.current = next(self._cycle)

    def get_key(self) -> Optional[str]:
        return self.current

    def rotate(self) -> Optional[str]:
        self.current = next(self._cycle)
        logger.info("Rotated API key.")
        return self.current


async def robust_post_json(url: str, headers: dict, payload: dict, rotator: APIKeyRotator, max_retries: int = 6):
    """
    POST JSON with simple retry+rotate on 401/403/429/5xx.
    Returns the parsed JSON response.
    Note: `url` and `headers` are built by the caller and may embed the current
    key, so a rotation here takes effect on the caller's next request, not on
    the in-flight retry.
    """
    for attempt in range(max_retries):
        try:
            async with httpx.AsyncClient(timeout=60) as client:
                r = await client.post(url, headers=headers, json=payload)
                logger.info(f"[ROTATOR] HTTP {r.status_code} response from {url}")

                if r.status_code in (401, 403, 429) or (500 <= r.status_code < 600):
                    logger.warning(f"HTTP {r.status_code} from provider. Rotating key and retrying ({attempt+1}/{max_retries})")
                    logger.warning(f"Response body: {r.text}")
                    rotator.rotate()
                    continue
                r.raise_for_status()

                response_data = r.json()
                logger.info(f"[ROTATOR] Successfully parsed JSON response with keys: {list(response_data.keys()) if isinstance(response_data, dict) else 'Not a dict'}")
                return response_data
        except Exception as e:
            logger.warning(f"Request error: {e}. Rotating and retrying ({attempt+1}/{max_retries})")
            logger.warning(f"Request URL: {url}")  # headers are not logged: they can carry API keys
            rotator.rotate()
    raise RuntimeError("Provider request failed after retries.")
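A minimal usage sketch for the rotator and `robust_post_json` follows. The env-slot prefix `NVIDIA_API_`, the endpoint, and the payload mirror how the router below wires things, but this snippet itself is illustrative, not repo code. Note that the key is baked into `headers` before the call, so a rotation triggered inside `robust_post_json` only affects the caller's next request.

```python
# Illustrative wiring of APIKeyRotator + robust_post_json (assumes env vars
# NVIDIA_API_1..NVIDIA_API_6 hold keys; endpoint/payload as used by the router).
import asyncio

from utils.api.rotator import APIKeyRotator, robust_post_json


async def main() -> None:
    rotator = APIKeyRotator(prefix="NVIDIA_API_", max_slots=6)
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {rotator.get_key()}",  # key captured at call time
    }
    payload = {
        "model": "meta/llama-3.1-8b-instruct",
        "messages": [{"role": "user", "content": "ping"}],
    }
    data = await robust_post_json(
        "https://integrate.api.nvidia.com/v1/chat/completions", headers, payload, rotator
    )
    print(data["choices"][0]["message"]["content"])


asyncio.run(main())
```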
ingestion_python/utils/api/router.py
ADDED
@@ -0,0 +1,359 @@
# ────────────────────────────── utils/router.py ──────────────────────────────
import os
import json

import httpx

from ..logger import get_logger
from typing import Dict, Any, Optional
from .rotator import robust_post_json, APIKeyRotator

logger = get_logger("ROUTER", __name__)

# Default model names (can be overridden via env)
GEMINI_SMALL = os.getenv("GEMINI_SMALL", "gemini-2.5-flash-lite")
GEMINI_MED = os.getenv("GEMINI_MED", "gemini-2.5-flash")
GEMINI_PRO = os.getenv("GEMINI_PRO", "gemini-2.5-pro")

# NVIDIA model hierarchy (can be overridden via env)
NVIDIA_SMALL = os.getenv("NVIDIA_SMALL", "meta/llama-3.1-8b-instruct")          # Llama model for easy-complexity tasks
NVIDIA_MEDIUM = os.getenv("NVIDIA_MEDIUM", "qwen/qwen3-next-80b-a3b-thinking")  # Qwen model for reasoning tasks
NVIDIA_LARGE = os.getenv("NVIDIA_LARGE", "openai/gpt-oss-120b")                 # GPT-OSS model for hard/long-context tasks


def select_model(question: str, context: str) -> Dict[str, Any]:
    """
    Enhanced three-tier model selection system:
    - Easy tasks (immediate execution, simple) -> Llama (NVIDIA small)
    - Reasoning tasks (analysis, decision-making, JSON parsing) -> Qwen (NVIDIA medium)
    - Hard/long-context tasks (complex synthesis, long-form) -> GPT-OSS (NVIDIA large)
    - Very complex tasks (research, comprehensive analysis) -> Gemini Pro
    """
    qlen = len(question.split())
    clen = len(context.split())

    # Very hard task keywords - require Gemini Pro (research, comprehensive analysis)
    very_hard_keywords = ("prove", "derivation", "complexity", "algorithm", "optimize", "theorem", "rigorous", "step-by-step", "policy critique", "ambiguity", "counterfactual", "comprehensive", "detailed analysis", "synthesis", "evaluation", "research", "investigation", "comprehensive study")

    # Hard/long context keywords - require NVIDIA Large (GPT-OSS)
    hard_keywords = ("analyze", "explain", "compare", "evaluate", "summarize", "extract", "classify", "identify", "describe", "discuss", "synthesis", "consolidate", "process", "generate", "create", "develop", "build", "construct")

    # Reasoning task keywords - require Qwen (thinking/reasoning)
    reasoning_keywords = ("reasoning", "context", "enhance", "select", "decide", "choose", "determine", "assess", "judge", "consider", "think", "reason", "logic", "inference", "deduction", "analysis", "interpretation")

    # Simple task keywords - immediate execution
    simple_keywords = ("what", "how", "when", "where", "who", "yes", "no", "count", "list", "find", "search", "lookup")

    # Determine complexity level
    is_very_hard = (
        any(k in question.lower() for k in very_hard_keywords) or
        qlen > 120 or
        clen > 4000 or
        "comprehensive" in question.lower() or
        "detailed" in question.lower() or
        "research" in question.lower()
    )

    is_hard = (
        any(k in question.lower() for k in hard_keywords) or
        qlen > 50 or
        clen > 1500 or
        "synthesis" in question.lower() or
        "generate" in question.lower() or
        "create" in question.lower()
    )

    is_reasoning = (
        any(k in question.lower() for k in reasoning_keywords) or
        qlen > 20 or
        clen > 800 or
        "enhance" in question.lower() or
        "context" in question.lower() or
        "select" in question.lower() or
        "decide" in question.lower()
    )

    # Simple tasks are the default branch below; this flag is kept for readability.
    is_simple = (
        any(k in question.lower() for k in simple_keywords) or
        qlen <= 10 or
        clen <= 200
    )

    if is_very_hard:
        # Use Gemini Pro for very complex tasks requiring advanced reasoning
        return {"provider": "gemini", "model": GEMINI_PRO}
    elif is_hard:
        # Use NVIDIA Large (GPT-OSS) for hard/long-context tasks
        return {"provider": "nvidia_large", "model": NVIDIA_LARGE}
    elif is_reasoning:
        # Use Qwen for reasoning tasks requiring thinking
        return {"provider": "qwen", "model": NVIDIA_MEDIUM}
    else:
        # Use NVIDIA small (Llama) for simple tasks requiring immediate execution
        return {"provider": "nvidia", "model": NVIDIA_SMALL}


async def generate_answer_with_model(selection: Dict[str, Any], system_prompt: str, user_prompt: str,
                                     gemini_rotator: APIKeyRotator, nvidia_rotator: APIKeyRotator,
                                     user_id: Optional[str] = None, context: str = "") -> str:
    provider = selection["provider"]
    model = selection["model"]

    if provider == "gemini":
        # Try Gemini first
        try:
            key = gemini_rotator.get_key() or ""
            url = f"https://generativelanguage.googleapis.com/v1beta/models/{model}:generateContent?key={key}"
            payload = {
                "contents": [
                    {"role": "user", "parts": [{"text": f"{system_prompt}\n\n{user_prompt}"}]}
                ],
                "generationConfig": {"temperature": 0.2}
            }
            headers = {"Content-Type": "application/json"}
            data = await robust_post_json(url, headers, payload, gemini_rotator)

            content = data["candidates"][0]["content"]["parts"][0]["text"]
            if not content or content.strip() == "":
                logger.warning(f"Empty content from Gemini model: {data}")
                raise Exception("Empty content from Gemini")
            return content
        except Exception as e:
            logger.warning(f"Gemini model {model} failed: {e}. Attempting fallback...")

            # Fallback logic: GEMINI_PRO/MED → NVIDIA_LARGE, GEMINI_SMALL → NVIDIA_SMALL
            if model in [GEMINI_PRO, GEMINI_MED]:
                logger.info(f"Falling back from {model} to NVIDIA_LARGE")
                fallback_selection = {"provider": "nvidia_large", "model": NVIDIA_LARGE}
                return await generate_answer_with_model(fallback_selection, system_prompt, user_prompt, gemini_rotator, nvidia_rotator, user_id, context)
            elif model == GEMINI_SMALL:
                logger.info(f"Falling back from {model} to NVIDIA_SMALL")
                fallback_selection = {"provider": "nvidia", "model": NVIDIA_SMALL}
                return await generate_answer_with_model(fallback_selection, system_prompt, user_prompt, gemini_rotator, nvidia_rotator, user_id, context)
            else:
                logger.error(f"No fallback defined for Gemini model: {model}")
                return "I couldn't parse the model response."

    elif provider == "nvidia":
        # Try NVIDIA small model first
        try:
            key = nvidia_rotator.get_key() or ""
            url = "https://integrate.api.nvidia.com/v1/chat/completions"
            payload = {
                "model": model,
                "temperature": 0.2,
                "messages": [
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": user_prompt},
                ]
            }
            headers = {"Content-Type": "application/json", "Authorization": f"Bearer {key}"}

            logger.info(f"[ROUTER] NVIDIA API call - Model: {model}, Key present: {bool(key)}")
            logger.info(f"[ROUTER] System prompt length: {len(system_prompt)}, User prompt length: {len(user_prompt)}")

            data = await robust_post_json(url, headers, payload, nvidia_rotator)

            logger.info(f"[ROUTER] NVIDIA API response type: {type(data)}, keys: {list(data.keys()) if isinstance(data, dict) else 'Not a dict'}")
            content = data["choices"][0]["message"]["content"]
            if not content or content.strip() == "":
                logger.warning(f"Empty content from NVIDIA model: {data}")
                raise Exception("Empty content from NVIDIA")
            return content
        except Exception as e:
            logger.warning(f"NVIDIA model {model} failed: {e}. Attempting fallback...")

            # Fallback: NVIDIA_SMALL → basic response (no smaller model to fall back to)
            if model == NVIDIA_SMALL:
                logger.info(f"Falling back from {model} to basic response")
                return "I'm experiencing technical difficulties with the AI model. Please try again later."
            else:
                logger.error(f"No fallback defined for NVIDIA model: {model}")
                return "I couldn't parse the model response."

    elif provider == "qwen":
        # Use Qwen for reasoning tasks with fallback
        try:
            return await qwen_chat_completion(system_prompt, user_prompt, nvidia_rotator, user_id, context)
        except Exception as e:
            logger.warning(f"Qwen model failed: {e}. Attempting fallback...")
            # Fallback: Qwen → NVIDIA_SMALL
            logger.info("Falling back from Qwen to NVIDIA_SMALL")
            fallback_selection = {"provider": "nvidia", "model": NVIDIA_SMALL}
            return await generate_answer_with_model(fallback_selection, system_prompt, user_prompt, gemini_rotator, nvidia_rotator, user_id, context)
    elif provider == "nvidia_large":
        # Use NVIDIA Large (GPT-OSS) for hard/long-context tasks with fallback
        try:
            return await nvidia_large_chat_completion(system_prompt, user_prompt, nvidia_rotator, user_id, context)
        except Exception as e:
            logger.warning(f"NVIDIA_LARGE model failed: {e}. Attempting fallback...")
            # Fallback: NVIDIA_LARGE → NVIDIA_SMALL
            logger.info("Falling back from NVIDIA_LARGE to NVIDIA_SMALL")
            fallback_selection = {"provider": "nvidia", "model": NVIDIA_SMALL}
            return await generate_answer_with_model(fallback_selection, system_prompt, user_prompt, gemini_rotator, nvidia_rotator, user_id, context)
    elif provider == "nvidia_coder":
        # Use NVIDIA Coder for code generation tasks with fallback
        try:
            from helpers.coder import nvidia_coder_completion
            return await nvidia_coder_completion(system_prompt, user_prompt, nvidia_rotator, user_id, context)
        except Exception as e:
            logger.warning(f"NVIDIA_CODER model failed: {e}. Attempting fallback...")
            # Fallback: NVIDIA_CODER → NVIDIA_SMALL
            logger.info("Falling back from NVIDIA_CODER to NVIDIA_SMALL")
            fallback_selection = {"provider": "nvidia", "model": NVIDIA_SMALL}
            return await generate_answer_with_model(fallback_selection, system_prompt, user_prompt, gemini_rotator, nvidia_rotator, user_id, context)

    return "Unsupported provider."


async def qwen_chat_completion(system_prompt: str, user_prompt: str, nvidia_rotator: APIKeyRotator, user_id: Optional[str] = None, context: str = "") -> str:
    """
    Qwen chat completion with thinking mode enabled.
    Uses the NVIDIA API rotator for key management.
    """
    key = nvidia_rotator.get_key() or ""
    url = "https://integrate.api.nvidia.com/v1/chat/completions"

    payload = {
        "model": NVIDIA_MEDIUM,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        "temperature": 0.6,
        "top_p": 0.7,
        "max_tokens": 8192,
        "stream": True
    }

    headers = {"Content-Type": "application/json", "Authorization": f"Bearer {key}"}

    logger.info(f"[QWEN] API call - Model: {NVIDIA_MEDIUM}, Key present: {bool(key)}")
    logger.info(f"[QWEN] System prompt length: {len(system_prompt)}, User prompt length: {len(user_prompt)}")

    try:
        # client.post() buffers the full SSE body; aiter_lines() below replays it from memory
        async with httpx.AsyncClient(timeout=60) as client:
            response = await client.post(url, headers=headers, json=payload)

            if response.status_code in (401, 403, 429) or (500 <= response.status_code < 600):
                logger.warning(f"HTTP {response.status_code} from Qwen provider. Rotating key and retrying")
                nvidia_rotator.rotate()
                # Retry once with new key
                key = nvidia_rotator.get_key() or ""
                headers = {"Content-Type": "application/json", "Authorization": f"Bearer {key}"}
                response = await client.post(url, headers=headers, json=payload)

            response.raise_for_status()

            # Handle streaming response
            content = ""
            async for line in response.aiter_lines():
                if line.startswith("data: "):
                    data = line[6:]  # Remove "data: " prefix
                    if data.strip() == "[DONE]":
                        break

                    try:
                        chunk_data = json.loads(data)
                        if "choices" in chunk_data and len(chunk_data["choices"]) > 0:
                            delta = chunk_data["choices"][0].get("delta", {})

                            # Handle reasoning content (thinking)
                            reasoning = delta.get("reasoning_content")
                            if reasoning:
                                logger.debug(f"[QWEN] Reasoning: {reasoning}")

                            # Handle regular content
                            chunk_content = delta.get("content")
                            if chunk_content:
                                content += chunk_content
                    except json.JSONDecodeError:
                        continue

            if not content or content.strip() == "":
                logger.warning("Empty content from Qwen model")
                return "I received an empty response from the model."

            return content.strip()

    except Exception as e:
        logger.warning(f"Qwen API error: {e}")
        return "I couldn't process the request with Qwen model."


async def nvidia_large_chat_completion(system_prompt: str, user_prompt: str, nvidia_rotator: APIKeyRotator, user_id: Optional[str] = None, context: str = "") -> str:
    """
    NVIDIA Large (GPT-OSS) chat completion for hard/long-context tasks.
    Uses the NVIDIA API rotator for key management.
    """
    key = nvidia_rotator.get_key() or ""
    url = "https://integrate.api.nvidia.com/v1/chat/completions"

    payload = {
        "model": NVIDIA_LARGE,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        "temperature": 1.0,
        "top_p": 1.0,
        "max_tokens": 4096,
        "stream": True
    }

    headers = {"Content-Type": "application/json", "Authorization": f"Bearer {key}"}

    logger.info(f"[NVIDIA_LARGE] API call - Model: {NVIDIA_LARGE}, Key present: {bool(key)}")
    logger.info(f"[NVIDIA_LARGE] System prompt length: {len(system_prompt)}, User prompt length: {len(user_prompt)}")

    try:
        # client.post() buffers the full SSE body; aiter_lines() below replays it from memory
        async with httpx.AsyncClient(timeout=60) as client:
            response = await client.post(url, headers=headers, json=payload)

            if response.status_code in (401, 403, 429) or (500 <= response.status_code < 600):
                logger.warning(f"HTTP {response.status_code} from NVIDIA Large provider. Rotating key and retrying")
                nvidia_rotator.rotate()
                # Retry once with new key
                key = nvidia_rotator.get_key() or ""
                headers = {"Content-Type": "application/json", "Authorization": f"Bearer {key}"}
                response = await client.post(url, headers=headers, json=payload)

            response.raise_for_status()

            # Handle streaming response
            content = ""
            async for line in response.aiter_lines():
                if line.startswith("data: "):
                    data = line[6:]  # Remove "data: " prefix
                    if data.strip() == "[DONE]":
                        break

                    try:
                        chunk_data = json.loads(data)
                        if "choices" in chunk_data and len(chunk_data["choices"]) > 0:
                            delta = chunk_data["choices"][0].get("delta", {})

                            # Handle reasoning content (thinking)
                            reasoning = delta.get("reasoning_content")
                            if reasoning:
                                logger.debug(f"[NVIDIA_LARGE] Reasoning: {reasoning}")

                            # Handle regular content
                            chunk_content = delta.get("content")
                            if chunk_content:
                                content += chunk_content
                    except json.JSONDecodeError:
                        continue

            if not content or content.strip() == "":
                logger.warning("Empty content from NVIDIA Large model")
                return "I received an empty response from the model."

            return content.strip()

    except Exception as e:
        logger.warning(f"NVIDIA Large API error: {e}")
        return "I couldn't process the request with NVIDIA Large model."
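Because keyword matching in `select_model` is by substring on the lowercased question, short probe questions make the tier boundaries easy to see (note, for example, that "optimizers" would also trip the very-hard keyword "optimize"). A quick illustrative check, with expected providers worked out by hand from the keyword tuples above:

```python
# Hand-checked probes of select_model's four tiers; expected outputs inferred
# from the keyword tuples and length thresholds above.
from utils.api.router import select_model

print(select_model("What is PSO?", "short context")["provider"])
# -> "nvidia"        (no tier keyword hits; falls through to the simple default)

print(select_model("Decide which context chunks to keep", "...")["provider"])
# -> "qwen"          ("decide" and "context" are reasoning keywords)

print(select_model("Compare the two update rules", "...")["provider"])
# -> "nvidia_large"  ("compare" is a hard keyword)

print(select_model("Prove the convergence theorem rigorously", "...")["provider"])
# -> "gemini"        ("prove"/"theorem"/"rigorous" are very-hard keywords)
```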
ingestion_python/utils/embedding.py
ADDED
@@ -0,0 +1,44 @@
import os
from typing import List

import requests

from utils.logger import get_logger


logger = get_logger("REMOTE_EMBED", __name__)


class RemoteEmbeddingClient:
    """Client to call external embedding service /embed endpoint.

    Expects env EMBED_BASE_URL, e.g. https://<space>.hf.space
    """

    def __init__(self, base_url: str | None = None, timeout: int = 60):
        self.base_url = (base_url or os.getenv("EMBED_BASE_URL", "https://binkhoale1812-embedding.hf.space")).rstrip("/")
        if not self.base_url:
            raise RuntimeError("EMBED_BASE_URL is required for RemoteEmbeddingClient")
        self.timeout = timeout

    def embed(self, texts: List[str]) -> List[list]:
        if not texts:
            return []
        url = f"{self.base_url}/embed"
        payload = {"texts": texts}
        headers = {"Content-Type": "application/json"}
        try:
            resp = requests.post(url, json=payload, headers=headers, timeout=self.timeout)
            resp.raise_for_status()
            data = resp.json()
            vectors = data.get("vectors", [])
            # Basic validation
            if not isinstance(vectors, list):
                raise ValueError("Invalid vectors format from remote embedder")
            return vectors
        except Exception as e:
            logger.warning(f"Remote embedding failed: {e}")
            # Fail closed with zero vectors to avoid crashes
            return [[0.0] * 384 for _ in texts]
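Usage is a one-liner; a minimal sketch, assuming the remote service returns `{"vectors": [[...], ...]}` as the client expects:

```python
# Minimal sketch: embed two strings through the remote /embed endpoint.
from utils.embedding import RemoteEmbeddingClient

client = RemoteEmbeddingClient()  # or RemoteEmbeddingClient(base_url="https://<space>.hf.space")
vectors = client.embed(["particle swarm optimization", "ant colony optimization"])
print(len(vectors), len(vectors[0]))  # on failure the client falls back to 384-dim zero vectors
```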
ingestion_python/utils/ingestion/chunker.py
ADDED
@@ -0,0 +1,130 @@
| 1 |
+
# ────────────────────────────── utils/chunker.py ──────────────────────────────
|
| 2 |
+
import re
|
| 3 |
+
from typing import List, Dict, Any
|
| 4 |
+
from utils.service.summarizer import cheap_summarize, clean_chunk_text
|
| 5 |
+
from utils.service.common import split_sentences, slugify
|
| 6 |
+
from ..logger import get_logger
|
| 7 |
+
|
| 8 |
+
# Enhanced semantic chunker with overlap and better structure:
|
| 9 |
+
# - Split by headings / numbered sections if present
|
| 10 |
+
# - Ensure each chunk ~ 300-600 words (configurable)
|
| 11 |
+
# - Add overlap between chunks for better context preservation
|
| 12 |
+
# - Generate a short summary + topic name
|
| 13 |
+
# - Better handling of semantic boundaries
|
| 14 |
+
|
| 15 |
+
MAX_WORDS = 500
|
| 16 |
+
MIN_WORDS = 150
|
| 17 |
+
OVERLAP_WORDS = 50 # Overlap between chunks for better context
|
| 18 |
+
logger = get_logger("CHUNKER", __name__)
|
| 19 |
+
|
| 20 |
+
|
| 21 |
+
def _by_headings(text: str):
|
| 22 |
+
# Enhanced split on markdown-like or outline headings with better patterns
|
| 23 |
+
patterns = [
|
| 24 |
+
r"(?m)^(#{1,6}\s.*)\s*$", # Markdown headers
|
| 25 |
+
r"(?m)^([0-9]+\.\s+[^\n]+)\s*$", # Numbered sections
|
| 26 |
+
r"(?m)^([A-Z][A-Za-z0-9\s\-]{2,}\n[-=]{3,})\s*$", # Underlined headers
|
| 27 |
+
r"(?m)^(Chapter\s+\d+.*|Section\s+\d+.*)\s*$", # Chapter/Section headers
|
| 28 |
+
r"(?m)^(Abstract|Introduction|Conclusion|References|Bibliography)\s*$", # Common academic sections
|
| 29 |
+
]
|
| 30 |
+
|
| 31 |
+
parts = []
|
| 32 |
+
last = 0
|
| 33 |
+
all_matches = []
|
| 34 |
+
|
| 35 |
+
# Find all matches from all patterns
|
| 36 |
+
for pattern in patterns:
|
| 37 |
+
for m in re.finditer(pattern, text):
|
| 38 |
+
all_matches.append((m.start(), m.end(), m.group(1).strip()))
|
| 39 |
+
|
| 40 |
+
# Sort matches by position
|
| 41 |
+
all_matches.sort(key=lambda x: x[0])
|
| 42 |
+
|
| 43 |
+
# Split text based on matches
|
| 44 |
+
for start, end, header in all_matches:
|
| 45 |
+
if start > last:
|
| 46 |
+
parts.append(text[last:start])
|
| 47 |
+
parts.append(text[start:end])
|
| 48 |
+
last = end
|
| 49 |
+
|
| 50 |
+
if last < len(text):
|
| 51 |
+
parts.append(text[last:])
|
| 52 |
+
|
| 53 |
+
if not parts:
|
| 54 |
+
parts = [text]
|
| 55 |
+
|
| 56 |
+
return parts
|
| 57 |
+
|
| 58 |
+
|
| 59 |
+
def _create_overlapping_chunks(text_blocks: List[str]) -> List[str]:
|
| 60 |
+
"""Create overlapping chunks from text blocks for better context preservation"""
|
| 61 |
+
chunks = []
|
| 62 |
+
|
| 63 |
+
for i, block in enumerate(text_blocks):
|
| 64 |
+
words = block.split()
|
| 65 |
+
if not words:
|
| 66 |
+
continue
|
| 67 |
+
|
| 68 |
+
# If block is small enough, use as-is
|
| 69 |
+
if len(words) <= MAX_WORDS:
|
| 70 |
+
chunks.append(block)
|
| 71 |
+
continue
|
| 72 |
+
|
| 73 |
+
# Split large blocks with overlap
|
| 74 |
+
start = 0
|
| 75 |
+
while start < len(words):
|
| 76 |
+
end = min(start + MAX_WORDS, len(words))
|
| 77 |
+
chunk_words = words[start:end]
|
| 78 |
+
|
| 79 |
+
# Add overlap from previous chunk if available
|
| 80 |
+
if start > 0 and len(chunks) > 0:
|
| 81 |
+
prev_words = chunks[-1].split()
|
| 82 |
+
overlap_start = max(0, len(prev_words) - OVERLAP_WORDS)
|
| 83 |
+
overlap_words = prev_words[overlap_start:]
|
| 84 |
+
chunk_words = overlap_words + chunk_words
|
| 85 |
+
|
| 86 |
+
chunks.append(" ".join(chunk_words))
|
| 87 |
+
start = end - OVERLAP_WORDS # Overlap with next chunk
|
| 88 |
+
|
| 89 |
+
return chunks
|
| 90 |
+
|
| 91 |
+
|
| 92 |
+
async def build_cards_from_pages(pages: List[Dict[str, Any]], filename: str, user_id: str, project_id: str) -> List[Dict[str, Any]]:
|
| 93 |
+
# Concatenate pages but keep page spans for metadata
|
| 94 |
+
full = ""
|
| 95 |
+
page_markers = []
|
| 96 |
+
for p in pages:
|
| 97 |
+
start = len(full)
|
| 98 |
+
full += f"\n\n[[Page {p['page_num']}]]\n{p.get('text','').strip()}\n"
|
| 99 |
+
page_markers.append((p['page_num'], start, len(full)))
|
| 100 |
+
|
| 101 |
+
# First split by headings
|
| 102 |
+
coarse = _by_headings(full)
|
| 103 |
+
|
| 104 |
+
# Create overlapping chunks for better context preservation
|
| 105 |
+
cards = _create_overlapping_chunks(coarse)
|
| 106 |
+
|
| 107 |
+
# Build card dicts
|
| 108 |
+
out = []
|
| 109 |
+
for i, raw_content in enumerate(cards, 1):
|
| 110 |
+
# Clean with LLM to remove headers/footers and IDs
|
| 111 |
+
cleaned = await clean_chunk_text(raw_content)
|
| 112 |
+
topic = await cheap_summarize(cleaned, max_sentences=1)
|
| 113 |
+
if not topic:
|
| 114 |
+
topic = cleaned[:80] + "..."
|
| 115 |
+
summary = await cheap_summarize(cleaned, max_sentences=3)
|
| 116 |
+
# Estimate page span
|
| 117 |
+
first_page = pages[0]['page_num'] if pages else 1
|
| 118 |
+
last_page = pages[-1]['page_num'] if pages else 1
|
| 119 |
+
out.append({
|
| 120 |
+
"user_id": user_id,
|
| 121 |
+
"project_id": project_id,
|
| 122 |
+
"filename": filename,
|
| 123 |
+
"topic_name": topic[:120],
|
| 124 |
+
"summary": summary,
|
| 125 |
+
"content": cleaned,
|
| 126 |
+
"page_span": [first_page, last_page],
|
| 127 |
+
"card_id": f"{slugify(filename)}-c{i:04d}"
|
| 128 |
+
})
|
| 129 |
+
logger.info(f"Built {len(out)} cards from {len(pages)} pages for {filename}")
|
| 130 |
+
return out
|
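
For reference, a minimal sketch of driving the chunker directly; the page dicts mimic the shape produced by parser.py below, and the filename and IDs are purely illustrative:

# Illustrative driver for build_cards_from_pages (filename and IDs are made up)
import asyncio
from utils.ingestion.chunker import build_cards_from_pages

pages = [
    {"page_num": 1, "text": "# Intro\nNeural networks learn representations.", "images": []},
    {"page_num": 2, "text": "## Training\nGradient descent updates the weights.", "images": []},
]

cards = asyncio.run(build_cards_from_pages(pages, "notes.pdf", user_id="u1", project_id="p1"))
print(cards[0]["card_id"], cards[0]["page_span"])  # e.g. notespdf-c0001 [1, 2]
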
ingestion_python/utils/ingestion/parser.py
ADDED
@@ -0,0 +1,63 @@

import io
from typing import List, Dict, Any
import fitz  # PyMuPDF
from docx import Document
from PIL import Image
from ..logger import get_logger

logger = get_logger("PARSER", __name__)


def parse_pdf_bytes(b: bytes) -> List[Dict[str, Any]]:
    """
    Returns a list of pages, each {'page_num': 1-based int, 'text': str, 'images': [PIL.Image]}.
    """
    pages = []
    with fitz.open(stream=b, filetype="pdf") as doc:
        for i, page in enumerate(doc):
            text = page.get_text("text")
            images = []
            for img in page.get_images(full=True):
                xref = img[0]
                pix = None
                try:
                    pix = fitz.Pixmap(doc, xref)
                    # Convert CMYK/alpha pixmaps to RGB safely
                    if pix.n - pix.alpha >= 4:
                        pix = fitz.Pixmap(fitz.csRGB, pix)
                    # Round-trip through PNG bytes to avoid 'not enough image data'
                    png_bytes = pix.tobytes("png")
                    im = Image.open(io.BytesIO(png_bytes)).convert("RGB")
                    images.append(im)
                except Exception as e:
                    logger.warning(f"Failed to extract image on page {i+1}: {e}")
                finally:
                    pix = None  # Release the pixmap
            pages.append({"page_num": i + 1, "text": text, "images": images})
    logger.info(f"Parsed PDF with {len(pages)} pages")
    return pages


def parse_docx_bytes(b: bytes) -> List[Dict[str, Any]]:
    f = io.BytesIO(b)
    doc = Document(f)
    text = []
    images = []
    for rel in doc.part.rels.values():
        if "image" in rel.reltype:
            data = rel.target_part.blob
            try:
                im = Image.open(io.BytesIO(data)).convert("RGB")
                images.append(im)
            except Exception:
                pass
    for p in doc.paragraphs:
        text.append(p.text)
    pages = [{"page_num": 1, "text": "\n".join(text), "images": images}]
    logger.info("Parsed DOCX into single concatenated page")
    return pages
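
A quick sketch of exercising the parser on a local file; the path is a stand-in:

# Illustrative: parse a PDF from disk and inspect the page dicts
from utils.ingestion.parser import parse_pdf_bytes

with open("lecture.pdf", "rb") as fh:  # hypothetical input file
    pages = parse_pdf_bytes(fh.read())

for p in pages:
    print(p["page_num"], len(p["text"]), "chars,", len(p["images"]), "images")
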
ingestion_python/utils/logger.py
ADDED
@@ -0,0 +1,71 @@

import logging
import sys
from typing import Optional


_DEFAULT_FORMAT = "%(asctime)s %(levelname)s %(message)s"


def _ensure_root_handler() -> None:
    root_logger = logging.getLogger()
    if root_logger.handlers:
        return
    handler = logging.StreamHandler(stream=sys.stdout)
    handler.setFormatter(logging.Formatter(_DEFAULT_FORMAT))
    root_logger.addHandler(handler)
    root_logger.setLevel(logging.INFO)


class _TaggedAdapter(logging.LoggerAdapter):
    def process(self, msg, kwargs):
        tag = self.extra.get("tag", "")
        if tag and not str(msg).startswith(tag):
            msg = f"{tag} {msg}"
        return msg, kwargs


def get_logger(tag: str, name: Optional[str] = None) -> logging.LoggerAdapter:
    """
    Return a logger that injects a [TAG] prefix into records.
    Example: logger = get_logger("APP") → logs like: [APP] message
    """
    _ensure_root_handler()
    base = logging.getLogger(name or __name__)
    return _TaggedAdapter(base, {"tag": f"[{tag}]"})
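
The adapter behaves like a standard logger; a minimal check:

# Minimal check of the tagged logger
from utils.logger import get_logger

log = get_logger("DEMO", __name__)
log.info("service started")      # prints: <timestamp> INFO [DEMO] service started
log.warning("disk at %s%%", 91)  # printf-style args pass through the adapter unchanged
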
ingestion_python/utils/rag/embeddings.py
ADDED
@@ -0,0 +1,39 @@

# ────────────────────────────── utils/embeddings.py ──────────────────────────────
import os
from typing import List, Optional
import requests

from utils.logger import get_logger


logger = get_logger("EMBED", __name__)


class EmbeddingClient:
    """Embedding client that calls an external embedding service via HTTP.

    Expects the environment variable EMBEDDER_BASE_URL to point at an API with:
        POST /embed {"texts": [..]} -> {"vectors": [[..], ...], "model": "..."}
    """

    def __init__(self, base_url: Optional[str] = None):
        self.base_url = (base_url or os.getenv("EMBEDDER_BASE_URL", "")).rstrip("/")
        if not self.base_url:
            logger.warning("EMBEDDER_BASE_URL not set; embedding calls will fail.")

    def embed(self, texts: List[str]) -> List[list]:
        if not texts:
            return []
        if not self.base_url:
            raise RuntimeError("EMBEDDER_BASE_URL not configured")
        url = f"{self.base_url}/embed"
        try:
            resp = requests.post(url, json={"texts": texts}, timeout=60)
            if resp.status_code >= 400:
                raise RuntimeError(f"Embedding API error {resp.status_code}: {resp.text[:200]}")
            data = resp.json()
            return data.get("vectors") or []
        except Exception as e:
            logger.warning(f"Embedding API failed: {e}")
            raise
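
The client only assumes the POST /embed contract documented above. For local testing, a throwaway stand-in service could look like the sketch below; the hash-derived vectors and the model name are placeholders, not the real embedder:

# Stand-in /embed service for local testing (hash-based pseudo-vectors, NOT real embeddings)
import hashlib
from typing import List
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class EmbedRequest(BaseModel):
    texts: List[str]

@app.post("/embed")
def embed(req: EmbedRequest):
    vectors = []
    for t in req.texts:
        digest = hashlib.sha256(t.encode("utf-8")).digest()
        # 384 deterministic floats in [0, 1), matching VECTOR_DIM in utils/rag/rag.py
        vectors.append([digest[i % len(digest)] / 255.0 for i in range(384)])
    return {"vectors": vectors, "model": "stub-hash-384"}

Point EMBEDDER_BASE_URL at wherever this runs (e.g. http://localhost:8000) and EmbeddingClient().embed(["hello"]) returns one 384-dimensional vector.
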
ingestion_python/utils/rag/rag.py
ADDED
@@ -0,0 +1,278 @@

# ────────────────────────────── utils/rag.py ──────────────────────────────
import os
from typing import List, Dict, Any, Optional
from pymongo import MongoClient, ASCENDING, TEXT
from pymongo.collection import Collection
from pymongo.errors import PyMongoError
import numpy as np
from ..logger import get_logger

VECTOR_DIM = 384  # all-MiniLM-L6-v2
INDEX_NAME = os.getenv("MONGO_VECTOR_INDEX", "vector_index")
USE_ATLAS_VECTOR = os.getenv("ATLAS_VECTOR", "0") == "1"

logger = get_logger("RAG", __name__)


class RAGStore:
    def __init__(self, mongo_uri: str, db_name: str = "studybuddy"):
        self.client = MongoClient(mongo_uri)
        self.db = self.client[db_name]
        self.chunks: Collection = self.db["chunks"]
        self.files: Collection = self.db["files"]

    # ── Write ────────────────────────────────────────────────────────────────
    def store_cards(self, cards: List[Dict[str, Any]]):
        if not cards:
            return
        for c in cards:
            # Basic validation before the bulk insert
            emb = c.get("embedding")
            if not emb or len(emb) != VECTOR_DIM:
                raise ValueError(f"Invalid embedding length; expected {VECTOR_DIM}")
        self.chunks.insert_many(cards, ordered=False)
        logger.info(f"Inserted {len(cards)} cards into MongoDB")

    def upsert_file_summary(self, user_id: str, project_id: str, filename: str, summary: str):
        self.files.update_one(
            {"user_id": user_id, "project_id": project_id, "filename": filename},
            {"$set": {"summary": summary}},
            upsert=True
        )
        logger.info(f"Upserted summary for {filename} (user {user_id}, project {project_id})")

    # ── Read ────────────────────────────────────────────────────────────────
    def list_cards(self, user_id: str, project_id: str, filename: Optional[str], limit: int, skip: int):
        q = {"user_id": user_id, "project_id": project_id}
        if filename:
            q["filename"] = filename
        cur = self.chunks.find(q, {"embedding": 0}).skip(skip).limit(limit).sort([("_id", ASCENDING)])
        return [self._serialize_doc(card) for card in cur]

    def get_file_summary(self, user_id: str, project_id: str, filename: str):
        doc = self.files.find_one({"user_id": user_id, "project_id": project_id, "filename": filename})
        return self._serialize_doc(doc) if doc else None

    def get_file_chunks(self, user_id: str, project_id: str, filename: str, limit: int = 20) -> List[Dict[str, Any]]:
        """Get chunks for a specific file."""
        cursor = self.chunks.find({
            "user_id": user_id,
            "project_id": project_id,
            "filename": filename
        }).limit(limit)
        return [self._serialize_doc(doc) for doc in cursor]

    def list_files(self, user_id: str, project_id: str):
        """List all files for a project with their summaries."""
        cursor = self.files.find(
            {"user_id": user_id, "project_id": project_id},
            {"_id": 0, "filename": 1, "summary": 1}
        ).sort("filename", ASCENDING)
        return [self._serialize_doc(f) for f in cursor]

    def vector_search(self, user_id: str, project_id: str, query_vector: List[float], k: int = 6, filenames: Optional[List[str]] = None, search_type: str = "hybrid"):
        """
        Vector search with multiple strategies:
        - hybrid: Atlas first (when enabled), falling back to local cosine similarity
        - flat:   exhaustive search over all matching documents for maximum accuracy
        - atlas:  Atlas Vector Search only
        - local:  local cosine similarity over a bounded sample only
        """
        if search_type == "flat" or (search_type == "hybrid" and not USE_ATLAS_VECTOR):
            return self._flat_vector_search(user_id, project_id, query_vector, k, filenames)
        elif search_type == "atlas" and USE_ATLAS_VECTOR:
            return self._atlas_vector_search(user_id, project_id, query_vector, k, filenames)
        elif search_type == "local":
            return self._local_vector_search(user_id, project_id, query_vector, k, filenames)
        else:
            # Default hybrid approach
            if USE_ATLAS_VECTOR:
                atlas_results = self._atlas_vector_search(user_id, project_id, query_vector, k, filenames)
                if atlas_results:
                    return atlas_results
            return self._local_vector_search(user_id, project_id, query_vector, k, filenames)

    def _atlas_vector_search(self, user_id: str, project_id: str, query_vector: List[float], k: int, filenames: Optional[List[str]] = None):
        """Atlas Vector Search implementation."""
        match_stage = {"user_id": user_id, "project_id": project_id}
        if filenames:
            match_stage["filename"] = {"$in": filenames}
        pipeline = [
            {
                "$search": {
                    "index": INDEX_NAME,
                    "knnBeta": {
                        "vector": query_vector,
                        "path": "embedding",
                        "k": k,
                    }
                }
            },
            {"$match": match_stage},
            {"$project": {"doc": "$$ROOT", "score": {"$meta": "searchScore"}}},
            {"$limit": k},
        ]
        hits = list(self.chunks.aggregate(pipeline))
        return self._serialize_hits(hits)

    def _local_vector_search(self, user_id: str, project_id: str, query_vector: List[float], k: int, filenames: Optional[List[str]] = None):
        """Local cosine similarity over a bounded sample of recent chunks."""
        q = {"user_id": user_id, "project_id": project_id}
        if filenames:
            q["filename"] = {"$in": filenames}

        # A larger sample improves recall at the cost of latency
        sample_limit = max(5000, k * 50)
        sample = list(self.chunks.find(q).sort([("_id", -1)]).limit(sample_limit))
        if not sample:
            return []

        qv = np.array(query_vector, dtype="float32")
        scores = []
        for d in sample:
            v = np.array(d.get("embedding", [0] * VECTOR_DIM), dtype="float32")
            denom = (np.linalg.norm(qv) * np.linalg.norm(v)) or 1.0
            scores.append((float(np.dot(qv, v) / denom), d))

        scores.sort(key=lambda x: x[0], reverse=True)
        top = scores[:k]
        logger.info(f"Local vector search: {len(sample)} docs sampled, {len(top)} results")
        return self._serialize_results(top)

    def _flat_vector_search(self, user_id: str, project_id: str, query_vector: List[float], k: int, filenames: Optional[List[str]] = None):
        """Exhaustive cosine-similarity search over all matching documents."""
        q = {"user_id": user_id, "project_id": project_id}
        if filenames:
            q["filename"] = {"$in": filenames}

        # Score every relevant document for maximum accuracy
        all_docs = list(self.chunks.find(q))
        if not all_docs:
            return []

        qv = np.array(query_vector, dtype="float32")
        scores = []
        for doc in all_docs:
            v = np.array(doc.get("embedding", [0] * VECTOR_DIM), dtype="float32")
            denom = (np.linalg.norm(qv) * np.linalg.norm(v)) or 1.0
            scores.append((float(np.dot(qv, v) / denom), doc))

        scores.sort(key=lambda x: x[0], reverse=True)
        top = scores[:k]
        logger.info(f"Flat vector search: {len(all_docs)} docs searched, {len(top)} results")
        return self._serialize_results(top)

    def _serialize_hits(self, hits):
        """Serialize Atlas search hits."""
        return [{"doc": self._serialize_doc(h["doc"]), "score": float(h.get("score", 0.0))} for h in hits]

    def _serialize_results(self, results):
        """Serialize local/flat (score, doc) results."""
        return [{"doc": self._serialize_doc(doc), "score": float(score)} for score, doc in results]

    def _serialize_doc(self, doc):
        """Convert a MongoDB document to a JSON-serializable dict."""
        out = {}
        for key, value in doc.items():
            if key == "_id":
                out[key] = str(value)          # ObjectId -> string
            elif hasattr(value, "isoformat"):  # datetime -> ISO string
                out[key] = value.isoformat()
            else:
                out[key] = value
        return out


def ensure_indexes(store: RAGStore):
    # Basic compound index plus a text index for fallback keyword search (optional)
    try:
        store.chunks.create_index([("user_id", ASCENDING), ("project_id", ASCENDING), ("filename", ASCENDING)])
        store.chunks.create_index([("content", TEXT), ("topic_name", TEXT), ("summary", TEXT)], name="text_idx")
        store.files.create_index([("user_id", ASCENDING), ("project_id", ASCENDING), ("filename", ASCENDING)], unique=True)
    except PyMongoError as e:
        logger.warning(f"Index creation warning: {e}")
    # Note: for Atlas Vector, create an Atlas Search index named INDEX_NAME on the
    # "embedding" field with vector options. Example (in the Atlas UI):
    # {
    #   "mappings": {
    #     "dynamic": false,
    #     "fields": {
    #       "embedding": {
    #         "type": "knnVector",
    #         "dimensions": 384,
    #         "similarity": "cosine"
    #       }
    #     }
    #   }
    # }
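
A minimal end-to-end sketch of the store; the connection string, the IDs, and the zero query vector are placeholders (a real query vector would come from EmbeddingClient):

# Illustrative wiring of RAGStore (URI and IDs are placeholders)
from utils.rag.rag import RAGStore, ensure_indexes, VECTOR_DIM

store = RAGStore("mongodb://localhost:27017", db_name="studybuddy")
ensure_indexes(store)

# Exhaustive cosine search with a dummy query vector
hits = store.vector_search("u1", "p1", [0.0] * VECTOR_DIM, k=3, search_type="flat")
for h in hits:
    print(round(h["score"], 3), h["doc"].get("topic_name"))
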
ingestion_python/utils/service/common.py
ADDED
@@ -0,0 +1,20 @@

import re
import unicodedata
from utils.logger import get_logger

logger = get_logger("COMMON", __name__)

def split_sentences(text: str):
    return re.split(r"(?<=[\.\!\?])\s+", text.strip())

def slugify(value: str):
    value = str(value)
    value = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    value = re.sub(r"[^\w\s-]", "", value).strip().lower()
    return re.sub(r"[-\s]+", "-", value)

def trim_text(s: str, n: int):
    s = s or ""
    if len(s) <= n:
        return s
    return s[:n] + "…"
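
Expected behavior of these helpers, doctest-style:

# Expected behavior of the text helpers
from utils.service.common import split_sentences, slugify, trim_text

assert split_sentences("One. Two! Three?") == ["One.", "Two!", "Three?"]
assert slugify("My File (v2).pdf") == "my-file-v2pdf"  # punctuation drops before hyphenation
assert trim_text("abcdef", 4) == "abcd…"
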
ingestion_python/utils/service/summarizer.py
ADDED
@@ -0,0 +1,48 @@

import re
from utils.logger import get_logger

logger = get_logger("SUM", __name__)


async def clean_chunk_text(text: str) -> str:
    """Clean and normalize text for processing."""
    if not text:
        return ""

    # Collapse runs of whitespace
    text = " ".join(text.split())

    # Remove common escape-sequence artifacts
    text = text.replace("\\n", " ").replace("\\t", " ")

    return text.strip()


async def cheap_summarize(text: str, max_sentences: int = 3) -> str:
    """Simple extractive summarization without external APIs."""
    if not text or len(text.strip()) < 50:
        return text.strip()

    try:
        # Extractive summarization: take the first few sentences
        sentences = re.split(r"[.!?]+", text)
        sentences = [s.strip() for s in sentences if s.strip()]

        if len(sentences) <= max_sentences:
            return text.strip()

        summary = ". ".join(sentences[:max_sentences])

        # Re-terminate with a period, since the split stripped the punctuation
        if not summary.endswith((".", "!", "?")):
            summary += "."

        return summary

    except Exception as e:
        logger.warning(f"Summarization failed: {e}")
        # Fallback: return the first part of the text
        return text[:200] + "..." if len(text) > 200 else text
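
Both helpers are async, so even a quick check needs an event loop:

# Quick async check of the summarizer helpers
import asyncio
from utils.service.summarizer import cheap_summarize, clean_chunk_text

async def main():
    raw = "First point.   Second point.\tThird point. Fourth point."
    cleaned = await clean_chunk_text(raw)
    print(await cheap_summarize(cleaned, max_sentences=2))  # First point. Second point.

asyncio.run(main())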