## h2oGPT integration with LangChain and Chroma/FAISS/Weaviate for Vector DB

Our goal is to make it easy to have private, offline document question answering using LLMs.

## Get Started

Follow the [get started steps](../README.md#get-started) in the main README. In this readme, we focus on other optional aspects.

To support a GPU FAISS database, run:
```bash
pip install -r reqs_optional/requirements_optional_faiss.txt
```
or for a CPU FAISS database, run:
```bash
pip install -r reqs_optional/requirements_optional_faiss_cpu.txt
```
or for Weaviate, run:
```bash
pip install -r reqs_optional/requirements_optional_langchain.txt
```

## Supported Data types

Open-source data types are supported; `.msg` is not supported due to a GPL-3 requirement. Other meta types support other types inside them. Special support for some behaviors is provided by the UI itself.

### Supported Native Data types

- `.pdf`: Portable Document Format (PDF),
- `.txt`: Text file (UTF-8),
- `.csv`: CSV,
- `.toml`: TOML,
- `.py`: Python,
- `.rst`: reStructuredText,
- `.rtf`: Rich Text Format,
- `.md`: Markdown,
- `.html`: HTML File,
- `.mhtml`: MHTML File,
- `.htm`: HTML File,
- `.docx`: Word Document (optional),
- `.doc`: Word Document (optional),
- `.xlsx`: Excel Document (optional),
- `.xls`: Excel Document (optional),
- `.enex`: EverNote,
- `.eml`: Email,
- `.epub`: EPub,
- `.odt`: Open Document Text,
- `.pptx` : PowerPoint Document,
- `.ppt` : PowerPoint Document,
- `.xml`: XML,
- `.apng` : APNG Image (optional),
- `.blp` : BLP Image (optional),
- `.bmp` : BMP Image (optional),
- `.bufr` : BUFR Image (optional),
- `.bw` : BW Image (optional),
- `.cur` : CUR Image (optional),
- `.dcx` : DCX Image (optional),
- `.dds` : DDS Image (optional),
- `.dib` : DIB Image (optional),
- `.emf` : EMF Image (optional),
- `.eps` : EPS Image (optional),
- `.fit` : FIT Image (optional),
- `.fits` : FITS Image (optional),
- `.flc` : FLC Image (optional),
- `.fli` : FLI Image (optional),
- `.fpx` : FPX Image (optional),
- `.ftc` : FTC Image (optional),
- `.ftu` : FTU Image (optional),
- `.gbr` : GBR Image (optional),
- `.gif` : GIF Image (optional),
- `.grib` : GRIB Image (optional),
- `.h5` : H5 Image (optional),
- `.hdf` : HDF Image (optional),
- `.icb` : ICB Image (optional),
- `.icns` : ICNS Image (optional),
- `.ico` : ICO Image (optional),
- `.iim` : IIM Image (optional),
- `.im` : IM Image (optional),
- `.j2c` : J2C Image (optional),
- `.j2k` : J2K Image (optional),
- `.jfif` : JFIF Image (optional),
- `.jp2` : JP2 Image (optional),
- `.jpc` : JPC Image (optional),
- `.jpe` : JPE Image (optional),
- `.jpeg` : JPEG Image (optional),
- `.jpf` : JPF Image (optional),
- `.jpg` : JPG Image (optional),
- `.jpx` : JPX Image (optional),
- `.mic` : MIC Image (optional),
- `.mpeg` : MPEG Image (optional),
- `.mpg` : MPG Image (optional),
- `.msp` : MSP Image (optional),
- `.pbm` : PBM Image (optional),
- `.pcd` : PCD Image (optional),
- `.pcx` : PCX Image (optional),
- `.pgm` : PGM Image (optional),
- `.png` : PNG Image (optional),
- `.pnm` : PNM Image (optional),
- `.ppm` : PPM Image (optional),
- `.ps` : PS Image (optional),
- `.psd` : PSD Image (optional),
- `.pxr` : PXR Image (optional),
- `.qoi` : QOI Image (optional),
- `.ras` : RAS Image (optional),
- `.rgb` : RGB Image (optional),
- `.rgba` : RGBA Image (optional),
- `.sgi` : SGI Image (optional),
- `.tga` : TGA Image (optional),
- `.tif` : TIF Image (optional),
- `.tiff` : TIFF Image (optional),
- `.vda` : VDA Image (optional),
- `.vst` : VST Image (optional),
- `.webp` : WEBP Image (optional),
- `.wmf` : WMF Image (optional),
- `.xbm` : XBM Image (optional),
- `.xpm` : XPM Image (optional).
- `.mp4` : MP4 Audio (optional).
- `.mpeg` : MP4-based MPEG Audio (optional).
- `.mpg` : MP4-based MPG Audio (optional).
- `.mp3` : MP3 Audio (optional).
- `.ogg` : OGG Audio (optional).
- `.flac` : FLAC Audio (optional).
- `.aac` : AAC Audio (optional).
- `.au` : AU Audio (optional).
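
Before uploading a large folder, it can help to separate files whose extensions appear in the list above from those that do not (for example `.msg`). The following is a minimal sketch of such a pre-check. It is not part of h2oGPT: `SUPPORTED_SUBSET` is a small hand-picked subset of the list above chosen for illustration, and `split_by_support` is a hypothetical helper name.

```python
from pathlib import Path

# Illustrative subset of the native extensions listed above (assumption:
# extend this set by hand to match the types you actually have installed).
SUPPORTED_SUBSET = frozenset(
    {".pdf", ".txt", ".csv", ".md", ".html", ".docx", ".epub", ".png", ".jpg", ".mp3"}
)


def split_by_support(paths):
    """Split paths into (supported, skipped) by lower-cased file extension."""
    supported, skipped = [], []
    for p in paths:
        ext = Path(p).suffix.lower()  # "" when the file has no extension
        (supported if ext in SUPPORTED_SUBSET else skipped).append(p)
    return supported, skipped


if __name__ == "__main__":
    ok, skipped = split_by_support(["a.PDF", "notes.md", "mail.msg", "README"])
    print("upload:", ok)
    print("skip:", skipped)
```

The check is purely extension-based, so it says nothing about whether the optional packages needed for a given type are installed.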
### Supported Meta Data types

- `.zip` : Zip File containing any native datatype.
- `.urls` : Text file containing newline-separated URLs (to be consumed via download).

Note: If you upload files and one of the files is a zip that contains images to be read by BLIP/DocTR or PDFs to be read by DocTR, this will currently fail with:
```text
Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
```
Please upload the zip separately for now.

### Supported Data Types in UI

- `Files` : All Native and Meta Data Types as file(s),
- `URL` : Any URL (i.e. `http://` or `https://`),
- `ArXiv` : Any ArXiv name (e.g. `arXiv:1706.03762`),
- `Text` : Paste Text into UI.

### Supported Meta Tasks

- `ScrapeWithPlayWRight` : Async Web Scraping using headless Chromium via Playwright
- `ScrapeWithHttp` : Async Web Scraping using aiohttp (slower than Playwright)

* Timing
  * A typical page, such as `https://github.com/h2oai/h2ogpt`, takes about 300 seconds to process at the default depth of 1, which covers about 140 pages.
  * These packages provide no good progress indicators, so you just have to wait.
* Depth:
  * Set env `CRAWL_DEPTH=<depth>` to control depth for some integer `<depth>`, where 0 means only the actual page, 1 means that page plus all links on that page, etc. `CRAWL_DEPTH=1` by default to avoid excessive crawling.
  * Set env `ALL_CRAWL_DEPTH=<depth>` to force all URL loaders to crawl at some depth (will be slower than the async ones).
* BS4:
  * Set env `HTML_TRANS=BS4` to use `BS4` to transform instead of `Html2TextTransformer`. Set the `BS4_TAGS` env to a string representation of a list to set [tags](https://python.langchain.com/docs/use_cases/web_scraping#quickstart).
    * e.g. `export BS4_TAGS="['span']"`
  * Scrape text content tags such as `<p>`, `<li>`, `<div>`, and `<a>` tags from the HTML content:
    * `<p>`: The paragraph tag. It defines a paragraph in HTML and is used to group together related sentences and/or phrases.
    * `<li>`: The list item tag. It is used within ordered (`<ol>`) and unordered (`<ul>