Spaces:

Soufianesejjari
/

chatAi

Sleeping

App Files Files Community

chatAi / README.md

Soufianesejjari

update requirements and Dockerfile for improved dependency management and application performance

fd63909 13 days ago

preview code

raw

history blame contribute delete

6.86 kB

	---
	title: ChatAi
	emoji: 📚
	colorFrom: yellow
	colorTo: pink
	sdk: docker
	pinned: false
	---

	Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

	# ChatAi - Intelligent Chatbot API

	![Project Emoji](https://img.shields.io/badge/Emoji-%F0%9F%93%9A-brightgreen)
	![Project Color](https://img.shields.io/badge/Color-Yellow%20to%20Pink-ff69b4)
	![SDK](https://img.shields.io/badge/SDK-Docker-blue)

	## Overview

	ChatAi is a Flask-based chatbot API designed to provide intelligent and context-aware responses by extracting and processing content from specified URLs. It leverages web scraping techniques, including specialized handling for JavaScript-heavy sites, to provide comprehensive and accurate information. The API is containerized using Docker for easy deployment and scalability.

	## Features

	- Intelligent Web Scraping: Extracts content from various types of websites, including those built with modern JavaScript frameworks like React, Vue, and Angular.
	- Content Cleaning and Processing: Cleans and processes extracted content to remove duplicates, excessive whitespace, and irrelevant information.
	- Context-Aware Responses: Generates responses based on the extracted content, providing clear, concise, and relevant information.
	- Token-Based Authentication: Uses tokens to manage user data and ensure secure access to the API.
	- Persistent Storage: Stores user data, including indexed documents, in JSON files for persistence across application restarts.
	- Dockerized Deployment: Containerized using Docker for easy deployment and scalability.
	- API Endpoints:
	- `/config`: Registers URLs and obtains a token for accessing the chatbot.
	- `/chat`: Sends a message to the chatbot and receives a response.
	- `/refresh_token`: Refreshes an existing token to update indexed data.
	- `/test`: Tests the API to ensure it is functioning correctly.

	## Architecture

	The application is structured into the following modules:

	- `app.py`: The main Flask application file that defines the API endpoints and orchestrates the chatbot functionality.
	- `web_content_fetcher.py`: Handles web scraping and content extraction from URLs. It includes specialized techniques for handling JavaScript-heavy sites.
	- `token_manager.py`: Manages token generation, validation, and refreshing. It also handles loading and saving user data to JSON files.
	- `groq_api.py`: Interacts with the Groq API to generate embeddings and completions based on the extracted content.

	## Setup and Deployment

	### Prerequisites

	- Docker: [Install Docker](https://docs.docker.com/get-docker/)

	### Steps

	1. Clone the repository:

	```bash
	git clone [repository_url]
	cd ChatAi
	```

	2. Build the Docker image:

	```bash
	docker build -t chat-ai .
	```

	3. Run the Docker container:

	```bash
	docker run -d -p 7860:7860 chat-ai
	```

	This will start the ChatAi API on port 7860.

	### Environment Variables

	The following environment variables can be configured:

	- `GROQ_API_KEY`: The API key for accessing the Groq API. If not set, a default key is used (not recommended for production).
	- `FLASK_APP`: The name of the Flask application file (default: `app.py`).
	- `FLASK_RUN_HOST`: The host address for the Flask application (default: `0.0.0.0`).
	- `PORT`: The port number for the Flask application (default: `7860`).

	You can set these environment variables in your Docker environment or in a `.env` file.

	## API Endpoints

	### 1. `/config`

	- Method: `POST`
	- Description: Registers URLs and obtains a token for accessing the chatbot.
	- Request Body:

	```json
	{
	"urls": ["url1", "url2", ...],
	"default_message": "Optional default message",
	"contact_email": "Optional contact email"
	}
	```

	- Response:

	```json
	{
	"status": "success",
	"message": "Successfully indexed [number] documents",
	"token": "generated_token"
	}
	```

	### 2. `/chat`

	- Method: `POST`
	- Description: Sends a message to the chatbot and receives a response.
	- Headers:
	- `Authorization`: `Bearer <token>`
	- Request Body:

	```json
	{
	"message": "user_message"
	}
	```

	- Response:

	```json
	{
	"status": "success",
	"response": "chatbot_response"
	}
	```

	### 3. `/refresh_token`

	- Method: `POST`
	- Description: Refreshes an existing token to update indexed data.
	- Request Body:

	```json
	{
	"token": "old_token"
	}
	```

	- Response:

	```json
	{
	"status": "success",
	"message": "Token refreshed successfully",
	"token": "new_token"
	}
	```

	### 4. `/test`

	- Method: `GET`
	- Description: Tests the API to ensure it is functioning correctly.
	- Response:

	```json
	{
	"status": "ok",
	"message": "API is working!",
	"active_tokens": [number_of_active_tokens]
	}
	```

	## Web Content Extraction

	The `web_content_fetcher.py` module is responsible for extracting content from web pages. It uses the following techniques:

	- Standard HTML Parsing: Uses BeautifulSoup to parse HTML content and extract relevant information.
	- JavaScript Detection: Detects if a site is JS-heavy and applies appropriate extraction strategies.
	- Structured Data Extraction: Extracts JSON-LD structured data for SEO.
	- React/SPA State Extraction: Extracts React/SPA initial state data.
	- Crawler Simulation: Uses a bot user-agent to simulate a search engine crawler and extract pre-rendered content.
	- Content Cleaning: Removes duplicate content, excessive whitespace, and irrelevant information.

	## Token Management

	The `token_manager.py` module handles token generation, validation, and refreshing. It also manages loading and saving user data to JSON files for persistence across application restarts.

	- Token Generation: Generates a new unique token for each user.
	- Token Validation: Validates if a token exists and is not expired.
	- Token Refreshing: Refreshes an existing token, updating its creation timestamp.
	- Persistent Storage: Stores user data in JSON files for persistence across application restarts.

	## Groq API Integration

	The `groq_api.py` module integrates with the Groq API to generate embeddings and completions based on the extracted content.

	- Embedding Generation: Generates an embedding for the given text using the Groq API.
	- Completion Generation: Generates a completion for the given prompt and context using the Groq API.

	## Contributing

	Contributions are welcome! Please feel free to submit pull requests or open issues to suggest improvements or report bugs.

	## License

	[License]