# General Information

## 1. Project Initialization

- After pulling the project, do the following to initialize it:
  - Make sure that a Python version >= 3.11 is installed.
  - Run the following command to execute the initialization script: `source setup.sh`
- If you want to insert new PDF documents and update the document base, you first need to install Tesseract, the OCR engine used in this code:
  - Download the Tesseract installer for Windows: https://github.com/UB-Mannheim/tesseract/wiki
  - For other platforms, see: https://tesseract-ocr.github.io/tessdoc/Installation.html
- Create a `.env` file at the root directory level with the following keys:
  - `TESSERACT_PATH={path}`
    - Set the path to the Tesseract installation, e.g. `C:\Program Files\Tesseract-OCR\tesseract.exe`
  - `CHROMA_PATH=./../chroma`

## 2. Using the Chatbot Locally

- In the `app/helper.py` file, comment out lines 8 to 10 if you are not on a Linux machine (see the sketch below for what these lines most likely are).
- To start the chatbot locally, run `cd app` and then `chainlit run app.py -w`.
- To use the chatbot, you need two API keys, which you can create under the following links:
  - [OpenAI](https://openai.com/blog/openai-api)
  - [Cohere](https://dashboard.cohere.com/api-keys)
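This README does not show the three lines in `app/helper.py`, but given the note in section 3 that they are required for Chroma inside the container, they are most likely the common SQLite override that Chroma needs on systems whose bundled `sqlite3` is too old. A minimal sketch of what they presumably look like, assuming the `pysqlite3-binary` package is installed:

```python
# Assumed contents of app/helper.py lines 8-10 (not copied from the repo):
# swap the stdlib sqlite3 module for pysqlite3-binary, which bundles a SQLite
# version new enough for Chroma. Comment these lines out on Windows/macOS.
__import__("pysqlite3")
import sys
sys.modules["sqlite3"] = sys.modules.pop("pysqlite3")
```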
## 3. Using Docker

- Go to `helper.py` and uncomment the three import lines. This is necessary to use Chroma within the container.
- Build the Docker image: `docker build -t iso_27001_chatbot .`
- Run the container: `docker run -p 7860:7860 iso_27001_chatbot`
- Access the app at: http://localhost:7860
- Note that the Dockerfile uses `requirements_Docker.txt`, which does not include CUDA support, as the free tier of Hugging Face Spaces does not come with GPU availability. If you want to include CUDA support, you need to integrate the command seen above for installing torch into the Dockerfile.

# Project Structure

## app

Contains the chatbot web application, created with Chainlit. Also includes classes for prompts and helper functions.

## chroma

The chroma folder contains all the indices that were created by using the notebooks inside the index_preparation folder.

## embedding_model

This folder contains the embedding model fine-tuned on an ISO 27001 text corpus. It is based on [bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) and can be accessed and downloaded on [HuggingFace](https://huggingface.co/Basti8499/bge-large-en-v1.5-ISO-27001).

## index_preparation

Stores all Jupyter notebooks needed to create the vector database which stores the ISO 27001 documents. Before creating the index with build_index.ipynb, the documents for PDFs, web pages, and templates need to be created with the other notebooks.

## input_data

### PDF Files (/PDF)

- Directory structure:
  - PDF/files: After manually cleaning the PDFs (removing pages), the PDFs should be moved manually to this folder.
  - PDF/documents
    - /all_documents: JSON file for all processed PDF documents
    - /new_documents: JSON file for newly processed PDF documents
  - PDF/PDF_images: Empty folder in which images are stored temporarily during OCR and deleted afterwards.

### Web Files (/Web)

- Directory structure:
  - Web/documents:
    - /all_documents: JSON file for all processed web documents
    - /new_documents: JSON file for newly processed web documents
  - Web/URLs:
    - /cleaned_urls.txt: .txt file for URLs that have already been processed and for which documents exist
    - /uncleaned_urls.txt: .txt file for URLs that have not been processed yet

### Template Files (/Templates)

- Directory structure:
  - Templates/documents:
    - /all_documents: JSON file for all processed template documents
    - /new_documents: JSON file for newly processed template documents
  - Templates/template_files:
    - /new: Template files that have not been processed yet
    - /processed: Template files that have already been processed
- For templates, it is important that the actual template files are stored under Templates/template_files/new for processing, as their paths are used in the chatbot.

## sparse_index

Stores the chunked documents that were created in build_index.ipynb in a .txt file for later sparse retrieval.
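The README does not name the sparse retrieval algorithm, so BM25 serves purely as an illustration here. A minimal sketch of querying the chunked documents, assuming one chunk per line, a hypothetical file name `chunks.txt`, and the `rank_bm25` package:

```python
# Illustrative sparse retrieval over the chunk file; the file name and query
# are hypothetical, and BM25 is assumed (the README only says "sparse retrieval").
from rank_bm25 import BM25Okapi

with open("sparse_index/chunks.txt", encoding="utf-8") as f:
    chunks = [line.strip() for line in f if line.strip()]

# Whitespace tokenization keeps the sketch simple; the real pipeline may differ.
bm25 = BM25Okapi([chunk.lower().split() for chunk in chunks])

query = "Which controls does ISO 27001 require for access management?"
scores = bm25.get_scores(query.lower().split())
top_ids = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)[:3]
for i in top_ids:
    print(f"{scores[i]:.2f}  {chunks[i][:80]}")
```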
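For comparison, the dense side of retrieval would load the persisted Chroma index (the chroma folder above) together with the fine-tuned embedding model. This is a sketch under assumptions, not the repository's actual code: the collection name and query are hypothetical, and the index is assumed to hold embeddings produced by the published model.

```python
# Hypothetical dense lookup against the persisted Chroma index; the collection
# name "iso_27001" and the query are illustrative, not taken from the repo.
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Basti8499/bge-large-en-v1.5-ISO-27001")
client = chromadb.PersistentClient(path="./chroma")
collection = client.get_collection("iso_27001")

query_embedding = model.encode("What does ISO 27001 require for risk assessment?")
results = collection.query(query_embeddings=[query_embedding.tolist()], n_results=3)
for doc in results["documents"][0]:
    print(doc[:100])
```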