CrawlGPT Documentation
Overview
CrawlGPT is a web content crawler with GPT-powered summarization and chat capabilities. It extracts content from URLs, stores it in a vector database, and enables natural language querying of the stored content.
Project Structure
crawlgpt/
├── src/
│   └── crawlgpt/
│       ├── core/                         # Core functionality
│       │   ├── database.py               # SQL database handling
│       │   ├── LLMBasedCrawler.py        # Main crawler implementation
│       │   ├── DatabaseHandler.py        # Vector database (FAISS)
│       │   └── SummaryGenerator.py       # Text summarization
│       ├── ui/                           # User Interface
│       │   ├── chat_app.py               # Main Streamlit app
│       │   ├── chat_ui.py                # Development UI
│       │   └── login.py                  # Authentication UI
│       └── utils/                        # Utilities
│           ├── content_validator.py      # URL/content validation
│           ├── data_manager.py           # Import/export handling
│           ├── helper_functions.py       # General helpers
│           ├── monitoring.py             # Metrics collection
│           └── progress.py               # Progress tracking
├── tests/                                # Test suite
│   └── test_core/
│       ├── test_database_handler.py      # Vector DB tests
│       ├── test_integration.py           # Integration tests
│       ├── test_llm_based_crawler.py     # Crawler tests
│       └── test_summary_generator.py     # Summarizer tests
├── .github/                              # CI/CD
│   └── workflows/
│       └── Push_to_hf.yaml               # HuggingFace sync
├── Docs/
│   └── MiniDoc.md                        # Documentation
├── .dockerignore                         # Docker exclusions
├── .gitignore                            # Git exclusions
├── Dockerfile                            # Container config
├── LICENSE                               # MIT License
├── README.md                             # Project documentation
├── README_hf.md                          # HuggingFace README
├── pyproject.toml                        # Project metadata
├── pytest.ini                            # Test configuration
└── setup_env.py                          # Environment setup
Core Components
LLMBasedCrawler (src/crawlgpt/core/LLMBasedCrawler.py)
- Main crawler class handling web content extraction and processing
- Integrates with Groq API for language model operations
- Manages content chunking, summarization and response generation
- Includes rate limiting and metrics collection
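The content chunking step mentioned above can be sketched as follows. This is a minimal illustration, not the crawler's actual code: the function name `chunk_text`, the character-based windows, and the default sizes are all assumptions.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into overlapping character windows (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far each new window advances
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, which tends to help retrieval later on.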
DatabaseHandler (src/crawlgpt/core/DatabaseHandler.py)
- Vector database implementation using FAISS
- Stores and retrieves text embeddings for efficient similarity search
- Handles data persistence and state management
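To illustrate what similarity search over stored embeddings means, here is a brute-force cosine-similarity store in pure Python. It is a stand-in for the idea only: the real DatabaseHandler uses FAISS, and the class name `TinyVectorStore`, its methods, and the toy two-dimensional vectors are assumptions made for this sketch.

```python
import math
from typing import List, Sequence, Tuple


class TinyVectorStore:
    """Hypothetical in-memory store; FAISS replaces this in the real project."""

    def __init__(self) -> None:
        self._vectors: List[List[float]] = []
        self._texts: List[str] = []

    def add(self, vector: Sequence[float], text: str) -> None:
        self._vectors.append(list(vector))
        self._texts.append(text)

    @staticmethod
    def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def search(self, query: Sequence[float], k: int = 3) -> List[Tuple[float, str]]:
        """Return the k stored texts most similar to the query vector."""
        scored = [(self._cosine(query, v), t)
                  for v, t in zip(self._vectors, self._texts)]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]
```

In practice the vectors would come from an embedding model such as one from sentence-transformers, and an index library would avoid scanning every vector on each query.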
SummaryGenerator (src/crawlgpt/core/SummaryGenerator.py)
- Generates concise summaries of text chunks using Groq API
- Configurable model selection and parameters
- Handles empty input validation
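The empty-input handling can be sketched independently of any API client. In the example below, `summarize`, `llm_call`, and `max_chars` are hypothetical names, the prompt wording is invented, and returning an empty string for blank input is an assumption (the real class might raise instead).

```python
from typing import Callable


def summarize(text: str, llm_call: Callable[[str], str], max_chars: int = 4000) -> str:
    """Summarize text via an injected model call (hypothetical helper)."""
    # Skip the model entirely for empty or whitespace-only input.
    if not text or not text.strip():
        return ""
    prompt = "Summarize the following text concisely:\n\n" + text.strip()[:max_chars]
    return llm_call(prompt).strip()
```

Passing the model call in as a parameter keeps the validation logic testable without network access or an API key.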
Database (src/crawlgpt/core/database.py)
- SQLAlchemy-based database handling for user management and chat history
- Provides secure user authentication with BCrypt password hashing
- Manages persistent storage of chat conversations and context
Configuration
- Uses SQLite by default (sqlite:///crawlgpt.db)
- Configurable via DATABASE_URL environment variable
- Automatic schema creation on startup
- Session management with SQLAlchemy sessionmaker
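The default-plus-override behaviour described above can be sketched like this. The function name `get_database_url` and the `env` parameter are assumptions for illustration; only the DATABASE_URL variable and the SQLite default come from this document.

```python
import os
from typing import Mapping, Optional

DEFAULT_DATABASE_URL = "sqlite:///crawlgpt.db"


def get_database_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Return DATABASE_URL if set and non-empty, else the SQLite default."""
    env = os.environ if env is None else env
    return env.get("DATABASE_URL") or DEFAULT_DATABASE_URL
```

The resulting string is the kind of value that would typically be handed to SQLAlchemy's `create_engine`.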
Security Features
- BCrypt password hashing with PassLib
- Unique username enforcement
- Secure session handling
- Role-based message tracking
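Unique username enforcement and role-tagged messages can be illustrated at the schema level. The sketch below uses the standard-library `sqlite3` module rather than SQLAlchemy to stay self-contained; the table and column names, the allowed roles, and the helper functions are all assumptions. Password hashing is deliberately left out: the stored value is treated as an opaque, already-hashed string.

```python
import sqlite3


def init_schema(conn: sqlite3.Connection) -> None:
    """Create hypothetical users and messages tables."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS users (
            id INTEGER PRIMARY KEY,
            username TEXT NOT NULL UNIQUE,       -- the database rejects duplicates
            password_hash TEXT NOT NULL          -- hashed elsewhere, never plain text
        );
        CREATE TABLE IF NOT EXISTS messages (
            id INTEGER PRIMARY KEY,
            user_id INTEGER NOT NULL REFERENCES users(id),
            role TEXT NOT NULL CHECK (role IN ('user', 'assistant')),
            content TEXT NOT NULL
        );
        """
    )


def create_user(conn: sqlite3.Connection, username: str, password_hash: str) -> bool:
    """Insert a user; return False if the username is already taken."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO users (username, password_hash) VALUES (?, ?)",
                (username, password_hash),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

Enforcing uniqueness in the schema, rather than with a lookup before the insert, avoids a race between two simultaneous registrations.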
UI Components
chat_app.py (src/crawlgpt/ui/chat_app.py)
- Main Streamlit application interface
- URL processing and content extraction
- Chat interface with message history
- System metrics and debug information
- Import/export functionality
chat_ui.py (src/crawlgpt/ui/chat_ui.py)
- Development/testing UI with additional debug features
- Extended metrics visualization
- Raw data inspection capabilities
Utilities
content_validator.py (src/crawlgpt/utils/content_validator.py)
- URL and content validation
- MIME type checking
- Size limit enforcement
- Security checks for malicious content
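The URL, MIME type, and size checks listed above might look roughly like this. The function names, the allowed MIME types, and the 5 MiB limit are assumptions chosen for illustration, not values taken from content_validator.py.

```python
from urllib.parse import urlparse

# Hypothetical policy values.
ALLOWED_MIME_TYPES = {"text/html", "text/plain"}
MAX_CONTENT_BYTES = 5 * 1024 * 1024


def is_valid_url(url: str) -> bool:
    """Accept only absolute http(s) URLs with a host part."""
    try:
        parsed = urlparse(url)
    except ValueError:
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def is_acceptable_content(content_type: str, body: bytes) -> bool:
    """Check the MIME type (ignoring parameters like charset) and the body size."""
    mime = content_type.split(";", 1)[0].strip().lower()
    return mime in ALLOWED_MIME_TYPES and len(body) <= MAX_CONTENT_BYTES
```

Restricting the scheme to http and https already rules out inputs such as `javascript:` or `file:` URLs before any request is made.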
data_manager.py (src/crawlgpt/utils/data_manager.py)
- Data import/export operations
- File serialization (JSON/pickle)
- Timestamped backups
- State management
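A timestamped JSON export and its matching import can be sketched as below. The function names, the `crawlgpt_backup` file prefix, and the timestamp format are assumptions; the real data_manager.py may organise this differently.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Union


def export_state(state: dict, directory: Union[str, Path],
                 prefix: str = "crawlgpt_backup") -> Path:
    """Write state to <directory>/<prefix>_<UTC timestamp>.json and return the path."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = directory / f"{prefix}_{stamp}.json"
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")
    return path


def import_state(path: Union[str, Path]) -> dict:
    """Load a previously exported JSON state file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

JSON is the safer of the two formats for files that may come from other users, since unpickling untrusted data can execute arbitrary code.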
monitoring.py (src/crawlgpt/utils/monitoring.py)
- Request metrics collection
- Rate limiting implementation
- Performance monitoring
- Usage statistics
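One common way to implement rate limiting is a sliding window over recent request times. The sketch below shows that approach; whether monitoring.py uses a sliding window, a token bucket, or something else is not stated here, and the class name and parameters are assumptions.

```python
import time
from collections import deque
from typing import Callable, Deque


class RateLimiter:
    """Allow at most max_requests per window_seconds (hypothetical sketch)."""

    def __init__(self, max_requests: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so tests need not sleep
        self._timestamps: Deque[float] = deque()

    def allow(self) -> bool:
        """Record and permit the request if the window has capacity."""
        now = self._clock()
        # Drop timestamps that have aged out of the window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) < self.max_requests:
            self._timestamps.append(now)
            return True
        return False
```

A caller would check `limiter.allow()` before each outbound request and back off or queue the request when it returns False.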
progress.py (src/crawlgpt/utils/progress.py)
- Operation progress tracking
- Status updates
- Step counting
- Time tracking
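Step counting, status updates, and time tracking fit naturally into one small object. The class below is a hypothetical sketch; the name `ProgressTracker` and its attributes are assumptions rather than the API of progress.py.

```python
import time
from typing import Callable


class ProgressTracker:
    """Track completed steps, the latest status message, and elapsed time."""

    def __init__(self, total_steps: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if total_steps <= 0:
            raise ValueError("total_steps must be positive")
        self.total_steps = total_steps
        self.current_step = 0
        self.status = "pending"
        self._clock = clock
        self._started = clock()

    def advance(self, status: str) -> None:
        """Mark one more step as done, never exceeding total_steps."""
        self.current_step = min(self.current_step + 1, self.total_steps)
        self.status = status

    @property
    def fraction(self) -> float:
        """Completed share in the range 0.0 to 1.0."""
        return self.current_step / self.total_steps

    @property
    def elapsed(self) -> float:
        """Seconds since the tracker was created."""
        return self._clock() - self._started
```

A value like `fraction` maps directly onto a UI progress bar that expects a number between 0 and 1.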
Testing
test_database_handler.py (tests/test_core/test_database_handler.py)
- Tests for vector database operations
- Integration tests for data storage/retrieval
- End-to-end flow validation
test_integration.py (tests/test_core/test_integration.py)
- Full system integration tests
- URL extraction to response generation flow
- State management validation
test_llm_based_crawler.py (tests/test_core/test_llm_based_crawler.py)
- Crawler functionality tests
- Content extraction validation
- Response generation testing
test_summary_generator.py (tests/test_core/test_summary_generator.py)
- Summary generation tests
- Empty input handling
- Model output validation
Configuration
pyproject.toml
- Project metadata
- Dependencies
- Optional dev dependencies
- Entry points
pytest.ini
- Test configuration
- Path settings
- Test discovery patterns
- Reporting options
setup_env.py
- Environment setup script
- Virtual environment creation
- Dependency installation
- Playwright setup
Features
Web Crawling
- Async web content extraction
- Playwright-based rendering
- Content validation
- Rate limiting
Content Processing
- Text chunking
- Vector embeddings
- Summarization
- Similarity search
Chat Interface
- Message history
- Context management
- Model parameter control
- Debug information
Data Management
- State import/export
- Progress tracking
- Metrics collection
- Error handling
Testing
- Unit tests
- Integration tests
- Mock implementations
- Async test support
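Mock implementations and async support combine naturally when testing code that awaits a network call. The sketch below uses only the standard library (`unittest.mock.AsyncMock` and `asyncio.run`); the function `answer_from_url` and its parameters are hypothetical and do not mirror the project's actual test suite or plugins.

```python
import asyncio
from typing import Awaitable, Callable
from unittest.mock import AsyncMock


async def answer_from_url(url: str,
                          fetch: Callable[[str], Awaitable[str]],
                          summarize: Callable[[str], str]) -> str:
    """Hypothetical pipeline: fetch page content, then summarize it."""
    content = await fetch(url)
    return summarize(content)


def test_answer_from_url_uses_fetched_content() -> None:
    # AsyncMock stands in for the real crawler, so no network is needed.
    fetch = AsyncMock(return_value="page body")
    result = asyncio.run(answer_from_url("https://example.com", fetch, str.upper))
    fetch.assert_awaited_once_with("https://example.com")
    assert result == "PAGE BODY"
```

Because the fetch function is injected, the same test shape works for error paths by giving the mock a `side_effect` that raises.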
Dependencies
Core:
- streamlit
- groq
- sentence-transformers
- faiss-cpu
- crawl4ai
- pydantic
- aiohttp
- beautifulsoup4
- playwright
Development:
- pytest
- pytest-mockito
- black
- isort
- flake8
License
MIT License