docs/README_GPU.md · akashkj/H2OGPT at main

GPU

GPU via CUDA is supported via Hugging Face type models and LLaMa.cpp models.

GPU (CUDA)

For help installing cuda toolkit, see CUDA Toolkit

git clone https://github.com/h2oai/h2ogpt.git
cd h2ogpt
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu117
python generate.py --base_model=h2oai/h2ogpt-oig-oasst1-512-6_9b --load_8bit=True

Then point browser at http://0.0.0.0:7860 (linux) or http://localhost:7860 (windows/mac) or the public live URL printed by the server (disable shared link with --share=False). For 4-bit or 8-bit support, older GPUs may require older bitsandbytes installed as pip uninstall bitsandbytes -y ; pip install bitsandbytes==0.38.1. For production uses, we recommend at least the 12B model, ran as:

python generate.py --base_model=h2oai/h2ogpt-oasst1-512-12b --load_8bit=True

and one can use --h2ocolors=False to get soft blue-gray colors instead of H2O.ai colors. Here is a list of environment variables that can control some things in generate.py.

Note if you download the model yourself and point --base_model to that location, you'll need to specify the prompt_type as well by running:

python generate.py --base_model=<user path> --load_8bit=True --prompt_type=human_bot

for some user path <user path> and the prompt_type must match the model or a new version created in prompter.py or added in UI/CLI via prompt_dict.

For quickly using a private document collection for Q/A, place documents (PDFs, text, etc.) into a folder called user_path and run

pip install -r reqs_optional/requirements_optional_langchain.txt
python -m nltk.downloader all  # for supporting unstructured package
python generate.py --base_model=h2oai/h2ogpt-oig-oasst1-512-6_9b  --load_8bit=True --langchain_mode=UserData --user_path=user_path

For more ways to ingest on CLI and control see LangChain Readme. For example, for improved pdf handling via pymupdf (GPL) and support for docx, ppt, OCR, and ArXiV run:

sudo apt-get install -y libmagic-dev poppler-utils tesseract-ocr tesseract-ocr libreoffice
pip install -r reqs_optional/requirements_optional_langchain.gpllike.txt

For 4-bit support, the latest dev versions of transformers, accelerate, and peft are required, which can be installed by running:

pip uninstall peft transformers accelerate -y
pip install -r reqs_optional/requirements_optional_4bit.txt

where uninstall is required in case, e.g., peft was installed from GitHub previously. Then when running generate pass --load_4bit=True, which is only supported for certain architectures like GPT-NeoX-20B, GPT-J, LLaMa, etc.

Any other instruct-tuned base models can be used, including non-h2oGPT ones. Larger models require more GPU memory.

GPU with LLaMa

Install langchain, and GPT4All, and python LLaMa dependencies:

pip install -r reqs_optional/requirements_optional_langchain.txt
pip install -r reqs_optional/requirements_optional_gpt4all.txt

then compile llama-cpp-python with CUDA support:

conda install -c "nvidia/label/cuda-12.1.1" cuda-toolkit  # maybe optional
pip uninstall -y llama-cpp-python
export LLAMA_CUBLAS=1
export CMAKE_ARGS=-DLLAMA_CUBLAS=on
export FORCE_CMAKE=1
export CUDA_HOME=$HOME/miniconda3/envs/h2ogpt
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.68 --no-cache-dir --verbose

and uncomment # n_gpu_layers=20 in .env_gpt4all, one can try also 40 instead of 20. If one sees /usr/bin/nvcc mentioned in errors, that file needs to be removed as would likely conflict with version installed for conda. Then run:

python generate.py --base_model='llama' --prompt_type=wizard2 --score_model=None --langchain_mode='UserData' --user_path=user_path

when loading you should see something like:

Using Model llama
Prep: persist_directory=db_dir_UserData exists, user_path=user_path passed, adding any changed or new documents
load INSTRUCTOR_Transformer
max_seq_length  512
0it [00:00, ?it/s]
0it [00:00, ?it/s]
Loaded 0 sources for potentially adding to UserData
ggml_init_cublas: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090 Ti
  Device 1: NVIDIA GeForce RTX 2080
llama.cpp: loading model from WizardLM-7B-uncensored.ggmlv3.q8_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 1792
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 7 (mostly Q8_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090 Ti) as main device
llama_model_load_internal: mem required  = 4518.85 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 368 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 20 repeating layers to GPU
llama_model_load_internal: offloaded 20/35 layers to GPU
llama_model_load_internal: total VRAM used: 4470 MB
llama_new_context_with_model: kv self size  =  896.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
Model {'base_model': 'llama', 'tokenizer_base_model': '', 'lora_weights': '', 'inference_server': '', 'prompt_type': 'wizard2', 'prompt_dict': {'promptA': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.', 'promptB': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.', 'PreInstruct': '\n### Instruction:\n', 'PreInput': None, 'PreResponse': '\n### Response:\n', 'terminate_response': ['\n### Response:\n'], 'chat_sep': '\n', 'chat_turn_sep': '\n', 'humanstr': '\n### Instruction:\n', 'botstr': '\n### Response:\n', 'generates_leading_space': False}}
Running on local URL:  http://0.0.0.0:7860
Running on public URL: https://1ccb24d03273a3d085.gradio.live

and GPU usage when using. Note that once llama-cpp-python is compiled to support CUDA, it no longer works for CPU mode, so one would have to reinstall it without the above options to recovers CPU mode or have a separate h2oGPT env for CPU mode.