Spaces:
Sleeping
Sleeping
update requirements and Dockerfile for improved dependency management and application performance
fd63909
title: ChatAi | |
emoji: π | |
colorFrom: yellow | |
colorTo: pink | |
sdk: docker | |
pinned: false | |
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference | |
# ChatAi - Intelligent Chatbot API | |
 | |
 | |
 | |
## Overview | |
ChatAi is a Flask-based chatbot API designed to provide intelligent and context-aware responses by extracting and processing content from specified URLs. It leverages web scraping techniques, including specialized handling for JavaScript-heavy sites, to provide comprehensive and accurate information. The API is containerized using Docker for easy deployment and scalability. | |
## Features | |
- **Intelligent Web Scraping**: Extracts content from various types of websites, including those built with modern JavaScript frameworks like React, Vue, and Angular. | |
- **Content Cleaning and Processing**: Cleans and processes extracted content to remove duplicates, excessive whitespace, and irrelevant information. | |
- **Context-Aware Responses**: Generates responses based on the extracted content, providing clear, concise, and relevant information. | |
- **Token-Based Authentication**: Uses tokens to manage user data and ensure secure access to the API. | |
- **Persistent Storage**: Stores user data, including indexed documents, in JSON files for persistence across application restarts. | |
- **Dockerized Deployment**: Containerized using Docker for easy deployment and scalability. | |
- **API Endpoints**: | |
- `/config`: Registers URLs and obtains a token for accessing the chatbot. | |
- `/chat`: Sends a message to the chatbot and receives a response. | |
- `/refresh_token`: Refreshes an existing token to update indexed data. | |
- `/test`: Tests the API to ensure it is functioning correctly. | |
## Architecture | |
The application is structured into the following modules: | |
- **`app.py`**: The main Flask application file that defines the API endpoints and orchestrates the chatbot functionality. | |
- **`web_content_fetcher.py`**: Handles web scraping and content extraction from URLs. It includes specialized techniques for handling JavaScript-heavy sites. | |
- **`token_manager.py`**: Manages token generation, validation, and refreshing. It also handles loading and saving user data to JSON files. | |
- **`groq_api.py`**: Interacts with the Groq API to generate embeddings and completions based on the extracted content. | |
## Setup and Deployment | |
### Prerequisites | |
- Docker: [Install Docker](https://docs.docker.com/get-docker/) | |
### Steps | |
1. **Clone the repository:** | |
```bash | |
git clone [repository_url] | |
cd ChatAi | |
``` | |
2. **Build the Docker image:** | |
```bash | |
docker build -t chat-ai . | |
``` | |
3. **Run the Docker container:** | |
```bash | |
docker run -d -p 7860:7860 chat-ai | |
``` | |
This will start the ChatAi API on port 7860. | |
### Environment Variables | |
The following environment variables can be configured: | |
- `GROQ_API_KEY`: The API key for accessing the Groq API. If not set, a default key is used (not recommended for production). | |
- `FLASK_APP`: The name of the Flask application file (default: `app.py`). | |
- `FLASK_RUN_HOST`: The host address for the Flask application (default: `0.0.0.0`). | |
- `PORT`: The port number for the Flask application (default: `7860`). | |
You can set these environment variables in your Docker environment or in a `.env` file. | |
## API Endpoints | |
### 1. `/config` | |
- **Method**: `POST` | |
- **Description**: Registers URLs and obtains a token for accessing the chatbot. | |
- **Request Body**: | |
```json | |
{ | |
"urls": ["url1", "url2", ...], | |
"default_message": "Optional default message", | |
"contact_email": "Optional contact email" | |
} | |
``` | |
- **Response**: | |
```json | |
{ | |
"status": "success", | |
"message": "Successfully indexed [number] documents", | |
"token": "generated_token" | |
} | |
``` | |
### 2. `/chat` | |
- **Method**: `POST` | |
- **Description**: Sends a message to the chatbot and receives a response. | |
- **Headers**: | |
- `Authorization`: `Bearer <token>` | |
- **Request Body**: | |
```json | |
{ | |
"message": "user_message" | |
} | |
``` | |
- **Response**: | |
```json | |
{ | |
"status": "success", | |
"response": "chatbot_response" | |
} | |
``` | |
### 3. `/refresh_token` | |
- **Method**: `POST` | |
- **Description**: Refreshes an existing token to update indexed data. | |
- **Request Body**: | |
```json | |
{ | |
"token": "old_token" | |
} | |
``` | |
- **Response**: | |
```json | |
{ | |
"status": "success", | |
"message": "Token refreshed successfully", | |
"token": "new_token" | |
} | |
``` | |
### 4. `/test` | |
- **Method**: `GET` | |
- **Description**: Tests the API to ensure it is functioning correctly. | |
- **Response**: | |
```json | |
{ | |
"status": "ok", | |
"message": "API is working!", | |
"active_tokens": [number_of_active_tokens] | |
} | |
``` | |
## Web Content Extraction | |
The `web_content_fetcher.py` module is responsible for extracting content from web pages. It uses the following techniques: | |
- **Standard HTML Parsing**: Uses BeautifulSoup to parse HTML content and extract relevant information. | |
- **JavaScript Detection**: Detects if a site is JS-heavy and applies appropriate extraction strategies. | |
- **Structured Data Extraction**: Extracts JSON-LD structured data for SEO. | |
- **React/SPA State Extraction**: Extracts React/SPA initial state data. | |
- **Crawler Simulation**: Uses a bot user-agent to simulate a search engine crawler and extract pre-rendered content. | |
- **Content Cleaning**: Removes duplicate content, excessive whitespace, and irrelevant information. | |
## Token Management | |
The `token_manager.py` module handles token generation, validation, and refreshing. It also manages loading and saving user data to JSON files for persistence across application restarts. | |
- **Token Generation**: Generates a new unique token for each user. | |
- **Token Validation**: Validates if a token exists and is not expired. | |
- **Token Refreshing**: Refreshes an existing token, updating its creation timestamp. | |
- **Persistent Storage**: Stores user data in JSON files for persistence across application restarts. | |
## Groq API Integration | |
The `groq_api.py` module integrates with the Groq API to generate embeddings and completions based on the extracted content. | |
- **Embedding Generation**: Generates an embedding for the given text using the Groq API. | |
- **Completion Generation**: Generates a completion for the given prompt and context using the Groq API. | |
## Contributing | |
Contributions are welcome! Please feel free to submit pull requests or open issues to suggest improvements or report bugs. | |
## License | |
[License] | |