Spaces:
Paused
Paused
File size: 6,414 Bytes
dbd2ac6 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 |
# Offline Mode:
## Easy Way:
Run h2oGPT as would in offline mode, ensuring to use LLM and upload docs using same parsers as would want in offline mode. The `~/.cache` folder will be filled, and one can use that in offline mode.
## Moderately Easy Way:
If you can run on same (or better) system that will be like that in offline mode, you can run the below and collect all needed items in the `~/.cache/` and `~/nltk_data` folders, specifically:
* `~/.cache/selenium/`
* `~/.cache/huggingface/`
* `~/.cache/torch/`
* `~/.cache/clip/`
* `~/.cache/doctr/`
* `~/.cache/chroma/`
* `~/.cache/ms-playwright/`
* `~/.cache/selenium/`
* `~/nltk_data/`
```
python generate.py --score_model=None --gradio_size=small --model_lock="[{'base_model': 'h2oai/h2ogpt-4096-llama2-7b-chat'}]" --save_dir=save_fastup_chat --prepare_offline_level=2
# below are already in docker
python -m nltk.downloader all
playwright install --with-deps
```
Some of these locations can be controlled, but others not, so best to make local version of ~/.cache (e.g. move original out of way), run the above, archive it for offline system, restore old ~/.cache, then use offline. If same system, then those steps aren't required, one can just go fully offline.
If you are only concerned with what h2oGPT needs, not any inference servers, you can run with `--prepare_offline_level=1` that will not obtain models associated with inference severs (e.g. vLLM or TGI).
If you have a GGML file, you should download it ahead of time and place it in some path you provide to `--llamacpp_dict` for its `model_path_llama` dict entry.
## Hard Way:
Identify all models needed and download each. The below list is not exhaustive because the models added changes frequently and each uses different approach for downloading.
Note, when running `generate.py` and asking your first question, it will download the model(s), which for the 6.9B model takes about 15 minutes per 3 pytorch bin files if have 10MB/s download.
If all data has been put into `~/.cache` by HF transformers and GGML files downloaded already and one points to them (e.g. with `--model_path_llama=llama-2-7b-chat.ggmlv3.q8_0.bin`), then these following steps (those related to downloading HF models) are not required.
* Download model and tokenizer of choice
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
model_name = 'h2oai/h2ogpt-oasst1-512-12b'
model = AutoModelForCausalLM.from_pretrained(model_name)
model.save_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.save_pretrained(model_name)
```
If using GGML files, those should be downloaded separately manually, and point to file path, e.g. `--base_model=llama --model_path_llama=llama-2-7b-chat.ggmlv3.q8_0.bin`.
* Download reward model, unless pass `--score_model='None'` to `generate.py`
```python
# and reward model
reward_model = 'OpenAssistant/reward-model-deberta-v3-large-v2'
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model = AutoModelForSequenceClassification.from_pretrained(reward_model)
model.save_pretrained(reward_model)
tokenizer = AutoTokenizer.from_pretrained(reward_model)
tokenizer.save_pretrained(reward_model)
```
* For LangChain support, download embedding model:
```python
hf_embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
model_kwargs = 'cpu'
from langchain.embeddings import HuggingFaceEmbeddings
embedding = HuggingFaceEmbeddings(model_name=hf_embedding_model, model_kwargs=model_kwargs)
```
* For HF inference server and OpenAI, this downloads the tokenizers used for Hugging Face text generation inference server and gpt-3.5-turbo:
```python
import tiktoken
encoding = tiktoken.get_encoding("cl100k_base")
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
```
* Get gpt-2 tokenizer for summarization token counting
```python
from transformers import AutoTokenizer
model_name = 'gpt2'
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.save_pretrained(model_name)
```
## Run h2oGPT in offline mode
```bash
HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 python generate.py --base_model='h2oai/h2ogpt-oasst1-512-12b' --gradio_offline_level=2 --share=False
```
For more info for transformers, see [Offline Mode](https://huggingface.co/docs/transformers/installation#offline-mode).
Some code is always disabled that involves uploads out of user control: Huggingface telemetry, gradio telemetry, chromadb posthog.
The additional option `--gradio_offline_level=2` changes fonts to avoid download of google fonts. This option disables google fonts for downloading, which is less intrusive than uploading, but still required in air-gapped case. The fonts don't look as nice as google fonts, but ensure full offline behavior.
If the front-end can still access internet, but just backend should not, then one can use `--gradio_offline_level=1` for slightly better-looking fonts.
Note that gradio attempts to download [iframeResizer.contentWindow.min.js](https://cdnjs.cloudflare.com/ajax/libs/iframe-resizer/4.3.1/iframeResizer.contentWindow.min.js),
but nothing prevents gradio from working without this. So a simple firewall block is sufficient. For more details, see: https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/10324.
### Disable access or port
To ensure nobody can access your gradio server, disable the port via firewall. If that is a hassle, then one can enable authentication by adding to CLI when running `python generate.py`:
```
--auth=[('jon','password')]
```
with no spaces. Run `python generate.py --help` for more details.
### To fully disable Chroma telemetry, which documented options still do not disable, run:
```bash
sp=`python -c 'import site; print(site.getsitepackages()[0])'`
sed -i 's/posthog\.capture/return\n posthog.capture/' $sp/chromadb/telemetry/posthog.py
```
or the equivalent for windows/mac using. Or edit the file manually to just return in the `capture` function.
To avoid h2oGPT monitoring which elements are clicked in UI, set the ENV `H2OGPT_ENABLE_HEAP_ANALYTICS=False` pass `--enable-heap-analytics=False` to `generate.py`. Note that no data or user inputs are included, only raw svelte UI element IDs and nothing from the user inputs or data.
|