chatAi / README.md
Soufianesejjari's picture
update requirements and Dockerfile for improved dependency management and application performance
fd63909
---
title: ChatAi
emoji: πŸ“š
colorFrom: yellow
colorTo: pink
sdk: docker
pinned: false
---
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# ChatAi - Intelligent Chatbot API
![Project Emoji](https://img.shields.io/badge/Emoji-%F0%9F%93%9A-brightgreen)
![Project Color](https://img.shields.io/badge/Color-Yellow%20to%20Pink-ff69b4)
![SDK](https://img.shields.io/badge/SDK-Docker-blue)
## Overview
ChatAi is a Flask-based chatbot API designed to provide intelligent and context-aware responses by extracting and processing content from specified URLs. It leverages web scraping techniques, including specialized handling for JavaScript-heavy sites, to provide comprehensive and accurate information. The API is containerized using Docker for easy deployment and scalability.
## Features
- **Intelligent Web Scraping**: Extracts content from various types of websites, including those built with modern JavaScript frameworks like React, Vue, and Angular.
- **Content Cleaning and Processing**: Cleans and processes extracted content to remove duplicates, excessive whitespace, and irrelevant information.
- **Context-Aware Responses**: Generates responses based on the extracted content, providing clear, concise, and relevant information.
- **Token-Based Authentication**: Uses tokens to manage user data and ensure secure access to the API.
- **Persistent Storage**: Stores user data, including indexed documents, in JSON files for persistence across application restarts.
- **Dockerized Deployment**: Containerized using Docker for easy deployment and scalability.
- **API Endpoints**:
- `/config`: Registers URLs and obtains a token for accessing the chatbot.
- `/chat`: Sends a message to the chatbot and receives a response.
- `/refresh_token`: Refreshes an existing token to update indexed data.
- `/test`: Tests the API to ensure it is functioning correctly.
## Architecture
The application is structured into the following modules:
- **`app.py`**: The main Flask application file that defines the API endpoints and orchestrates the chatbot functionality.
- **`web_content_fetcher.py`**: Handles web scraping and content extraction from URLs. It includes specialized techniques for handling JavaScript-heavy sites.
- **`token_manager.py`**: Manages token generation, validation, and refreshing. It also handles loading and saving user data to JSON files.
- **`groq_api.py`**: Interacts with the Groq API to generate embeddings and completions based on the extracted content.
## Setup and Deployment
### Prerequisites
- Docker: [Install Docker](https://docs.docker.com/get-docker/)
### Steps
1. **Clone the repository:**
```bash
git clone [repository_url]
cd ChatAi
```
2. **Build the Docker image:**
```bash
docker build -t chat-ai .
```
3. **Run the Docker container:**
```bash
docker run -d -p 7860:7860 chat-ai
```
This will start the ChatAi API on port 7860.
### Environment Variables
The following environment variables can be configured:
- `GROQ_API_KEY`: The API key for accessing the Groq API. If not set, a default key is used (not recommended for production).
- `FLASK_APP`: The name of the Flask application file (default: `app.py`).
- `FLASK_RUN_HOST`: The host address for the Flask application (default: `0.0.0.0`).
- `PORT`: The port number for the Flask application (default: `7860`).
You can set these environment variables in your Docker environment or in a `.env` file.
## API Endpoints
### 1. `/config`
- **Method**: `POST`
- **Description**: Registers URLs and obtains a token for accessing the chatbot.
- **Request Body**:
```json
{
"urls": ["url1", "url2", ...],
"default_message": "Optional default message",
"contact_email": "Optional contact email"
}
```
- **Response**:
```json
{
"status": "success",
"message": "Successfully indexed [number] documents",
"token": "generated_token"
}
```
### 2. `/chat`
- **Method**: `POST`
- **Description**: Sends a message to the chatbot and receives a response.
- **Headers**:
- `Authorization`: `Bearer <token>`
- **Request Body**:
```json
{
"message": "user_message"
}
```
- **Response**:
```json
{
"status": "success",
"response": "chatbot_response"
}
```
### 3. `/refresh_token`
- **Method**: `POST`
- **Description**: Refreshes an existing token to update indexed data.
- **Request Body**:
```json
{
"token": "old_token"
}
```
- **Response**:
```json
{
"status": "success",
"message": "Token refreshed successfully",
"token": "new_token"
}
```
### 4. `/test`
- **Method**: `GET`
- **Description**: Tests the API to ensure it is functioning correctly.
- **Response**:
```json
{
"status": "ok",
"message": "API is working!",
"active_tokens": [number_of_active_tokens]
}
```
## Web Content Extraction
The `web_content_fetcher.py` module is responsible for extracting content from web pages. It uses the following techniques:
- **Standard HTML Parsing**: Uses BeautifulSoup to parse HTML content and extract relevant information.
- **JavaScript Detection**: Detects if a site is JS-heavy and applies appropriate extraction strategies.
- **Structured Data Extraction**: Extracts JSON-LD structured data for SEO.
- **React/SPA State Extraction**: Extracts React/SPA initial state data.
- **Crawler Simulation**: Uses a bot user-agent to simulate a search engine crawler and extract pre-rendered content.
- **Content Cleaning**: Removes duplicate content, excessive whitespace, and irrelevant information.
## Token Management
The `token_manager.py` module handles token generation, validation, and refreshing. It also manages loading and saving user data to JSON files for persistence across application restarts.
- **Token Generation**: Generates a new unique token for each user.
- **Token Validation**: Validates if a token exists and is not expired.
- **Token Refreshing**: Refreshes an existing token, updating its creation timestamp.
- **Persistent Storage**: Stores user data in JSON files for persistence across application restarts.
## Groq API Integration
The `groq_api.py` module integrates with the Groq API to generate embeddings and completions based on the extracted content.
- **Embedding Generation**: Generates an embedding for the given text using the Groq API.
- **Completion Generation**: Generates a completion for the given prompt and context using the Groq API.
## Contributing
Contributions are welcome! Please feel free to submit pull requests or open issues to suggest improvements or report bugs.
## License
[License]