---
title: ChatAi
emoji: 📚
colorFrom: yellow
colorTo: pink
sdk: docker
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

ChatAi - Intelligent Chatbot API


Overview

ChatAi is a Flask-based chatbot API that delivers intelligent, context-aware responses by extracting and processing content from specified URLs. It relies on web scraping techniques, including specialized handling for JavaScript-heavy sites, to gather comprehensive and accurate information. The API is containerized with Docker for easy deployment and scaling.

Features

  • Intelligent Web Scraping: Extracts content from various types of websites, including those built with modern JavaScript frameworks like React, Vue, and Angular.
  • Content Cleaning and Processing: Cleans and processes extracted content to remove duplicates, excessive whitespace, and irrelevant information.
  • Context-Aware Responses: Generates responses based on the extracted content, providing clear, concise, and relevant information.
  • Token-Based Authentication: Uses tokens to manage user data and ensure secure access to the API.
  • Persistent Storage: Stores user data, including indexed documents, in JSON files for persistence across application restarts.
  • Dockerized Deployment: Containerized using Docker for easy deployment and scalability.
  • API Endpoints:
    • /config: Registers URLs and obtains a token for accessing the chatbot.
    • /chat: Sends a message to the chatbot and receives a response.
    • /refresh_token: Refreshes an existing token to update indexed data.
    • /test: Tests the API to ensure it is functioning correctly.

Architecture

The application is structured into the following modules:

  • app.py: The main Flask application file that defines the API endpoints and orchestrates the chatbot functionality.
  • web_content_fetcher.py: Handles web scraping and content extraction from URLs. It includes specialized techniques for handling JavaScript-heavy sites.
  • token_manager.py: Manages token generation, validation, and refreshing. It also handles loading and saving user data to JSON files.
  • groq_api.py: Interacts with the Groq API to generate embeddings and completions based on the extracted content.
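
As a rough illustration of how these modules are assumed to fit together, the sketch below wires a /chat request through token validation, indexed-content lookup, and completion generation. The helper functions called on token_manager and groq_api are hypothetical names, not the project's actual API.

    # A simplified, hypothetical sketch of the /chat flow in app.py; the helpers
    # called on token_manager and groq_api are illustrative names only.
    from flask import Flask, jsonify, request

    import groq_api        # project module (see groq_api.py)
    import token_manager   # project module (see token_manager.py)

    app = Flask(__name__)

    @app.route("/chat", methods=["POST"])
    def chat():
        # Token-based authentication: expects "Authorization: Bearer <token>".
        auth = request.headers.get("Authorization", "")
        token = auth[len("Bearer "):] if auth.startswith("Bearer ") else ""
        if not token_manager.validate_token(token):           # hypothetical helper
            return jsonify({"status": "error", "message": "Invalid or expired token"}), 401

        # Answer the message against the content indexed for this token.
        context = token_manager.get_indexed_content(token)    # hypothetical helper
        answer = groq_api.generate_completion(request.json["message"], context)
        return jsonify({"status": "success", "response": answer})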

Setup and Deployment

Prerequisites

  • Docker installed and running on the host machine.
  • Git, for cloning the repository.

Steps

  1. Clone the repository:

    git clone [repository_url]
    cd ChatAi
    
  2. Build the Docker image:

    docker build -t chat-ai .
    
  3. Run the Docker container:

    docker run -d -p 7860:7860 chat-ai
    

    This will start the ChatAi API on port 7860.

Environment Variables

The following environment variables can be configured:

  • GROQ_API_KEY: The API key for accessing the Groq API. If not set, a default key is used (not recommended for production).
  • FLASK_APP: The name of the Flask application file (default: app.py).
  • FLASK_RUN_HOST: The host address for the Flask application (default: 0.0.0.0).
  • PORT: The port number for the Flask application (default: 7860).

You can set these environment variables in your Docker environment or in a .env file.
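
For example, they can be passed to the container with docker run -e GROQ_API_KEY=your_key -p 7860:7860 chat-ai. The snippet below is a sketch of how the application is assumed to read this configuration, with defaults mirroring the list above.

    # Assumed configuration loading; defaults mirror the variables documented above.
    import os

    GROQ_API_KEY = os.environ.get("GROQ_API_KEY")                # set a real key in production
    FLASK_RUN_HOST = os.environ.get("FLASK_RUN_HOST", "0.0.0.0")
    PORT = int(os.environ.get("PORT", "7860"))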

API Endpoints

1. /config

  • Method: POST

  • Description: Registers URLs and obtains a token for accessing the chatbot.

  • Request Body:

    {
      "urls": ["url1", "url2", ...],
      "default_message": "Optional default message",
      "contact_email": "Optional contact email"
    }
    
  • Response:

    {
      "status": "success",
      "message": "Successfully indexed [number] documents",
      "token": "generated_token"
    }
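
  • Example (a hedged client sketch using Python requests; the URL assumes a local deployment on port 7860, and the values are illustrative):

    import requests

    resp = requests.post(
        "http://localhost:7860/config",
        json={
            "urls": ["https://example.com"],
            "default_message": "Hello! How can I help?",
            "contact_email": "admin@example.com",
        },
    )
    token = resp.json()["token"]  # keep this token for /chat requests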
    

2. /chat

  • Method: POST

  • Description: Sends a message to the chatbot and receives a response.

  • Headers:

    • Authorization: Bearer <token>
  • Request Body:

    {
      "message": "user_message"
    }
    
  • Response:

    {
      "status": "success",
      "response": "chatbot_response"
    }
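
  • Example (an illustrative Python requests call; replace the placeholder with a token returned by /config):

    import requests

    token = "<token from /config>"  # placeholder
    resp = requests.post(
        "http://localhost:7860/chat",
        headers={"Authorization": f"Bearer {token}"},
        json={"message": "What services does this site offer?"},
    )
    print(resp.json()["response"])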
    

3. /refresh_token

  • Method: POST

  • Description: Refreshes an existing token to update indexed data.

  • Request Body:

    {
      "token": "old_token"
    }
    
  • Response:

    {
      "status": "success",
      "message": "Token refreshed successfully",
      "token": "new_token"
    }
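
  • Example (illustrative; exchanges an existing token for a fresh one, local deployment assumed):

    import requests

    old_token = "<existing token>"  # placeholder
    resp = requests.post(
        "http://localhost:7860/refresh_token",
        json={"token": old_token},
    )
    print(resp.json()["token"])  # new token to use in subsequent requests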
    

4. /test

  • Method: GET

  • Description: Tests the API to ensure it is functioning correctly.

  • Response:

    {
      "status": "ok",
      "message": "API is working!",
      "active_tokens": [number_of_active_tokens]
    }
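
  • Example (a quick health check against a locally running container):

    import requests

    resp = requests.get("http://localhost:7860/test")
    print(resp.json())  # e.g. {"status": "ok", "message": "API is working!", ...}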
    

Web Content Extraction

The web_content_fetcher.py module is responsible for extracting content from web pages. It uses the following techniques:

  • Standard HTML Parsing: Uses BeautifulSoup to parse HTML content and extract relevant information.
  • JavaScript Detection: Detects if a site is JS-heavy and applies appropriate extraction strategies.
  • Structured Data Extraction: Extracts JSON-LD structured data (commonly embedded for SEO) to capture machine-readable page information.
  • React/SPA State Extraction: Extracts React/SPA initial state data.
  • Crawler Simulation: Uses a bot user-agent to simulate a search engine crawler and extract pre-rendered content.
  • Content Cleaning: Removes duplicate content, excessive whitespace, and irrelevant information.
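
The snippet below is a minimal sketch of this approach (crawler user-agent, JSON-LD extraction, and basic cleaning) using requests and BeautifulSoup; it is illustrative and does not reproduce web_content_fetcher.py.

    # Minimal, illustrative sketch of the extraction approach; not the actual
    # web_content_fetcher.py implementation.
    import json
    import re

    import requests
    from bs4 import BeautifulSoup

    # A search-engine-style user-agent often receives pre-rendered HTML from JS-heavy sites.
    CRAWLER_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

    def fetch_page_content(url: str, timeout: int = 10):
        response = requests.get(url, headers={"User-Agent": CRAWLER_UA}, timeout=timeout)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")

        # Pull JSON-LD structured data before stripping <script> tags.
        structured = []
        for tag in soup.find_all("script", type="application/ld+json"):
            try:
                structured.append(json.loads(tag.string or ""))
            except json.JSONDecodeError:
                pass

        # Remove markup that rarely carries useful text.
        for tag in soup(["script", "style", "noscript", "nav", "footer"]):
            tag.decompose()

        # Collapse whitespace and drop duplicate lines while preserving order.
        lines = (re.sub(r"\s+", " ", line).strip() for line in soup.get_text("\n").splitlines())
        seen, cleaned = set(), []
        for line in lines:
            if line and line not in seen:
                seen.add(line)
                cleaned.append(line)
        return "\n".join(cleaned), structured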

Token Management

The token_manager.py module handles token generation, validation, and refreshing. It also manages loading and saving user data to JSON files for persistence across application restarts.

  • Token Generation: Generates a new unique token for each user.
  • Token Validation: Validates if a token exists and is not expired.
  • Token Refreshing: Refreshes an existing token, updating its creation timestamp.
  • Persistent Storage: Stores user data in JSON files for persistence across application restarts.
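
A minimal sketch of this lifecycle is shown below, assuming an in-memory mapping persisted to a JSON file; the file name, field names, and expiry window are illustrative and may differ from token_manager.py.

    # Illustrative token lifecycle; file name, fields, and TTL are assumptions.
    import json
    import time
    import uuid
    from pathlib import Path

    STORE = Path("user_data.json")   # illustrative storage file
    TOKEN_TTL = 30 * 24 * 3600       # assumed lifetime: 30 days

    def _load():
        return json.loads(STORE.read_text()) if STORE.exists() else {}

    def _save(data):
        STORE.write_text(json.dumps(data, indent=2))

    def generate_token(user_data):
        """Create a new unique token and persist the user's data under it."""
        data = _load()
        token = uuid.uuid4().hex
        data[token] = {"created_at": time.time(), **user_data}
        _save(data)
        return token

    def validate_token(token):
        """Return True if the token exists and has not expired."""
        entry = _load().get(token)
        return bool(entry) and time.time() - entry["created_at"] < TOKEN_TTL

    def refresh_token(old_token):
        """Issue a new token carrying the old token's data with a fresh timestamp."""
        data = _load()
        entry = data.pop(old_token, None)
        if entry is None:
            return None
        new_token = uuid.uuid4().hex
        entry["created_at"] = time.time()
        data[new_token] = entry
        _save(data)
        return new_token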

Groq API Integration

The groq_api.py module integrates with the Groq API to generate embeddings and completions based on the extracted content.

  • Embedding Generation: Generates an embedding for the given text using the Groq API.
  • Completion Generation: Generates a completion for the given prompt and context using the Groq API.
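
As a hedged example, the snippet below shows a context-grounded completion call using the official groq Python client; the model name is illustrative, and the embedding path in groq_api.py is not reproduced here.

    # Illustrative completion call with the `groq` client; the model name is an assumption.
    import os

    from groq import Groq

    client = Groq(api_key=os.environ.get("GROQ_API_KEY"))

    def generate_completion(prompt: str, context: str) -> str:
        """Answer `prompt` using only the scraped `context`."""
        chat = client.chat.completions.create(
            model="llama-3.1-8b-instant",  # illustrative model name
            messages=[
                {"role": "system",
                 "content": "Answer using only the provided context. Be clear, concise, and relevant."},
                {"role": "user",
                 "content": f"Context:\n{context}\n\nQuestion: {prompt}"},
            ],
        )
        return chat.choices[0].message.content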

Contributing

Contributions are welcome! Please feel free to submit pull requests or open issues to suggest improvements or report bugs.

License

[License]