CrawlGPT Documentation

Overview

CrawlGPT is a web content crawler with GPT-powered summarization and chat capabilities. It extracts content from URLs, stores it in a vector database, and enables natural language querying of the stored content.

Project Structure

crawlgpt/
├── src/
│   └── crawlgpt/
│       ├── core/                           # Core functionality
│       │   ├── database.py                 # SQL database handling
│       │   ├── LLMBasedCrawler.py         # Main crawler implementation
│       │   ├── DatabaseHandler.py          # Vector database (FAISS)
│       │   └── SummaryGenerator.py         # Text summarization
│       ├── ui/                            # User Interface
│       │   ├── chat_app.py                # Main Streamlit app
│       │   ├── chat_ui.py                 # Development UI
│       │   └── login.py                   # Authentication UI
│       └── utils/                         # Utilities
│           ├── content_validator.py        # URL/content validation
│           ├── data_manager.py            # Import/export handling
│           ├── helper_functions.py         # General helpers
│           ├── monitoring.py              # Metrics collection
│           └── progress.py                # Progress tracking
├── tests/                                # Test suite
│   └── test_core/
│       ├── test_database_handler.py       # Vector DB tests
│       ├── test_integration.py           # Integration tests
│       ├── test_llm_based_crawler.py     # Crawler tests
│       └── test_summary_generator.py     # Summarizer tests
├── .github/                             # CI/CD
│   └── workflows/
│       └── Push_to_hf.yaml              # HuggingFace sync
├── Docs/
│   └── MiniDoc.md                       # Documentation
├── .dockerignore                        # Docker exclusions
├── .gitignore                          # Git exclusions
├── Dockerfile                          # Container config
├── LICENSE                             # MIT License
├── README.md                          # Project documentation
├── README_hf.md                       # HuggingFace README
├── pyproject.toml                     # Project metadata
├── pytest.ini                         # Test configuration
└── setup_env.py                       # Environment setup

Core Components

LLMBasedCrawler (src/crawlgpt/core/LLMBasedCrawler.py)

Main crawler class handling web content extraction and processing
Integrates with Groq API for language model operations
Manages content chunking, summarization and response generation
Includes rate limiting and metrics collection
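
The chunking step can be pictured with a minimal sketch. It assumes fixed-size character windows with a small overlap. `chunk_text`, `chunk_size`, and `overlap` are hypothetical names used for illustration, and the actual crawler may split content differently.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks
```

Overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, which tends to help later retrieval.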

DatabaseHandler (src/crawlgpt/core/DatabaseHandler.py)

Vector database implementation using FAISS
Stores and retrieves text embeddings for efficient similarity search
Handles data persistence and state management
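
To illustrate what similarity search over stored embeddings means, here is a plain-Python stand-in. It is not the FAISS API and not the project's code: `TinyVectorStore` and `cosine` are hypothetical names, and the vectors are toy values rather than real embeddings. A FAISS index performs the same kind of nearest-vector lookup, only far faster.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TinyVectorStore:
    """Brute-force vector store (illustrative stand-in for a FAISS index)."""

    def __init__(self) -> None:
        self._vectors: list[list[float]] = []
        self._texts: list[str] = []

    def add(self, vector: list[float], text: str) -> None:
        self._vectors.append(vector)
        self._texts.append(text)

    def search(self, query: list[float], k: int = 3) -> list[tuple[float, str]]:
        # Score every stored vector against the query, best match first.
        scored = [(cosine(query, v), t) for v, t in zip(self._vectors, self._texts)]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]
```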

SummaryGenerator (src/crawlgpt/core/SummaryGenerator.py)

Generates concise summaries of text chunks using Groq API
Configurable model selection and parameters
Handles empty input validation
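
A minimal sketch of the empty-input guard, under two assumptions that are not confirmed by the source: that blank input yields an empty summary, and that the model call can be passed in as a function. `summarize` and `llm_call` are hypothetical names, and the Groq client itself is deliberately left out.

```python
from typing import Callable


def summarize(text: str, llm_call: Callable[[str], str]) -> str:
    """Return a summary, skipping the model call for blank input (assumed behavior)."""
    if not text or not text.strip():
        return ""
    prompt = f"Summarize the following text concisely:\n\n{text}"
    return llm_call(prompt)
```

Passing the model call as a parameter also makes the function easy to test with a stub instead of a live API.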

Database (src/crawlgpt/core/database.py)

SQLAlchemy-based database handling for user management and chat history
Provides secure user authentication with BCrypt password hashing
Manages persistent storage of chat conversations and context

Configuration
- Uses SQLite by default (sqlite:///crawlgpt.db)
- Configurable via the DATABASE_URL environment variable
- Automatic schema creation on startup
- Session management with SQLAlchemy sessionmaker

Security Features
- BCrypt password hashing via Passlib
- Unique username enforcement
- Secure session handling
- Role-based message tracking
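
The database URL fallback described above could look roughly like this. `resolve_database_url` is a hypothetical helper, not a function taken from the project. Only the `DATABASE_URL` variable name and the SQLite default come from this document.

```python
import os
from typing import Mapping, Optional

DEFAULT_DATABASE_URL = "sqlite:///crawlgpt.db"


def resolve_database_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Return DATABASE_URL if set and non-empty, otherwise the SQLite default."""
    env = os.environ if env is None else env
    return env.get("DATABASE_URL") or DEFAULT_DATABASE_URL
```

The resulting string is the kind of value that would be handed to SQLAlchemy's `create_engine`.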

UI Components

chat_app.py (src/crawlgpt/ui/chat_app.py)

Main Streamlit application interface
URL processing and content extraction
Chat interface with message history
System metrics and debug information
Import/export functionality

chat_ui.py (src/crawlgpt/ui/chat_ui.py)

Development/testing UI with additional debug features
Extended metrics visualization
Raw data inspection capabilities

Utilities

content_validator.py (src/crawlgpt/utils/content_validator.py)

URL and content validation
MIME type checking
Size limit enforcement
Security checks for malicious content
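
The checks listed above can be sketched with the standard library alone. The allowed schemes, the MIME whitelist, the 5 MiB limit, and both function names are assumptions made for illustration. The real validator's rules are not specified here.

```python
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https"}
ALLOWED_MIME_TYPES = {"text/html", "text/plain"}  # assumed whitelist
MAX_CONTENT_BYTES = 5 * 1024 * 1024               # assumed 5 MiB limit


def is_valid_url(url: str) -> bool:
    """Accept only absolute http(s) URLs that include a host."""
    try:
        parsed = urlparse(url)
    except ValueError:
        return False
    return parsed.scheme in ALLOWED_SCHEMES and bool(parsed.netloc)


def is_acceptable_content(content_type: str, body: bytes) -> bool:
    """Check the MIME type (ignoring parameters such as charset) and the size."""
    mime = content_type.split(";", 1)[0].strip().lower()
    return mime in ALLOWED_MIME_TYPES and len(body) <= MAX_CONTENT_BYTES
```

Rejecting non-http schemes up front filters out inputs such as `javascript:` or `file:` URLs before any request is made.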

data_manager.py (src/crawlgpt/utils/data_manager.py)

Data import/export operations
File serialization (JSON/pickle)
Timestamped backups
State management
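
A timestamped JSON export and the matching import might be sketched as follows. `export_state`, `import_state`, and the `backup_<timestamp>.json` naming scheme are hypothetical. The pickle path mentioned above is omitted for brevity.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def export_state(state: dict, directory: str, prefix: str = "backup") -> Path:
    """Write state to a JSON file whose name carries a UTC timestamp."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = Path(directory) / f"{prefix}_{stamp}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")
    return path


def import_state(path) -> dict:
    """Load a previously exported JSON state file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```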

monitoring.py (src/crawlgpt/utils/monitoring.py)

Request metrics collection
Rate limiting implementation
Performance monitoring
Usage statistics
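
One common way to implement rate limiting is a sliding window over recent request timestamps. The sketch below shows that technique. It is not necessarily the algorithm used in monitoring.py, and the class name and parameters are assumptions.

```python
import time
from collections import deque
from typing import Callable


class SlidingWindowRateLimiter:
    """Allow at most max_requests calls within any window_seconds span."""

    def __init__(self, max_requests: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so tests can control time
        self._timestamps: deque = deque()

    def allow(self) -> bool:
        now = self._clock()
        # Drop timestamps that have aged out of the window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) < self.max_requests:
            self._timestamps.append(now)
            return True
        return False
```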

progress.py (src/crawlgpt/utils/progress.py)

Operation progress tracking
Status updates
Step counting
Time tracking
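
Step counting, status updates, and time tracking fit naturally into one small object. The following is a sketch with hypothetical names (`ProgressTracker`, `advance`, `fraction_done`), not the project's actual class.

```python
import time
from dataclasses import dataclass, field


@dataclass
class ProgressTracker:
    """Track steps completed, the latest status message, and elapsed time."""

    total_steps: int
    current_step: int = 0
    status: str = "pending"
    started_at: float = field(default_factory=time.monotonic)

    def advance(self, status: str) -> None:
        # Never count past the declared total.
        self.current_step = min(self.current_step + 1, self.total_steps)
        self.status = status

    @property
    def fraction_done(self) -> float:
        return self.current_step / self.total_steps if self.total_steps else 1.0

    @property
    def elapsed_seconds(self) -> float:
        return time.monotonic() - self.started_at
```

A UI such as Streamlit could read `fraction_done` to drive a progress bar and show `status` next to it.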

Testing

test_database_handler.py (tests/test_core/test_database_handler.py)

Tests for vector database operations
Integration tests for data storage/retrieval
End-to-end flow validation

test_integration.py (tests/test_core/test_integration.py)

Full system integration tests
URL extraction to response generation flow
State management validation

test_llm_based_crawler.py (tests/test_core/test_llm_based_crawler.py)

Crawler functionality tests
Content extraction validation
Response generation testing

test_summary_generator.py (tests/test_core/test_summary_generator.py)

Summary generation tests
Empty input handling
Model output validation

Configuration

pyproject.toml

Project metadata
Dependencies
Optional dev dependencies
Entry points

pytest.ini

Test configuration
Path settings
Test discovery patterns
Reporting options

setup_env.py

Environment setup script
Virtual environment creation
Dependency installation
Playwright setup

Features

Web Crawling
- Async web content extraction
- Playwright-based rendering
- Content validation
- Rate limiting

Content Processing
- Text chunking
- Vector embeddings
- Summarization
- Similarity search

Chat Interface
- Message history
- Context management
- Model parameter control
- Debug information

Data Management
- State import/export
- Progress tracking
- Metrics collection
- Error handling

Testing
- Unit tests
- Integration tests
- Mock implementations
- Async test support
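
As an illustration of combining mocks with async code in a test, the sketch below uses only the standard library (`unittest.mock.AsyncMock` and `asyncio.run`). `answer_question` and the `extract` and `respond` method names are hypothetical and do not reflect the crawler's real interface. The project's own tests may rely on pytest plugins instead.

```python
import asyncio
from unittest.mock import AsyncMock


async def answer_question(crawler, url: str, question: str) -> str:
    """Hypothetical flow: extract page content, then ask the model about it."""
    content = await crawler.extract(url)
    return await crawler.respond(content, question)


def test_answer_question_uses_extracted_content() -> None:
    # AsyncMock makes every attribute an awaitable mock, so no network is needed.
    crawler = AsyncMock()
    crawler.extract.return_value = "page text"
    crawler.respond.return_value = "answer"

    result = asyncio.run(
        answer_question(crawler, "https://example.com", "What is this?")
    )

    assert result == "answer"
    crawler.extract.assert_awaited_once_with("https://example.com")
    crawler.respond.assert_awaited_once_with("page text", "What is this?")
```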

Dependencies

Core:

streamlit
groq
sentence-transformers
faiss-cpu
crawl4ai
pydantic
aiohttp
beautifulsoup4
playwright

Development:

pytest
pytest-mockito
black
isort
flake8

License

MIT License