## Running the Server

PrivateGPT supports running with different LLMs & setups.

### Local models

Both the LLM and the Embeddings model will run locally.

Make sure you have followed the *Local LLM requirements* section before moving on.

This command will start PrivateGPT using the `settings.yaml` (default profile) together with the `settings-local.yaml`
configuration files. By default, it will enable both the API and the Gradio UI. Run:

```bash
PGPT_PROFILES=local make run
```

or

```bash
PGPT_PROFILES=local poetry run python -m private_gpt
```

Once the server has started, it will print the log *Application startup complete*.
Navigate to http://localhost:8001 to use the Gradio UI or to http://localhost:8001/docs (API section) to try the API
using Swagger UI.
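
If you prefer a quick command-line check, you can hit the health endpoint once that log line appears (this assumes the default port `8001` and the standard `/health` route; adjust if your setup differs):

```bash
# Minimal smoke test; port and route are the defaults and may differ in your setup
curl http://localhost:8001/health
```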

#### Customizing low-level parameters

Currently, not all of the `llama.cpp` and `llama-cpp-python` parameters are exposed in PrivateGPT's `settings.yaml` file.
If you need to customize parameters such as the number of layers offloaded to the GPU, you can change
them directly in `private_gpt/components/llm/llm_component.py`.
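
As a rough sketch of what such a change looks like (the exact code differs between PrivateGPT versions, so treat every name below as illustrative), the extra `llama.cpp` options are passed through `model_kwargs` when the `LlamaCPP` LLM is constructed:

```python
# Illustrative sketch only; not the verbatim PrivateGPT source.
# It shows how llama-cpp-python options that settings.yaml does not expose
# (e.g. the number of layers offloaded to the GPU) can be forwarded.
from llama_index.llms import LlamaCPP

llm = LlamaCPP(
    model_path="models/mistral-7b-instruct-v0.1.Q4_K_M.gguf",  # illustrative local model path
    max_new_tokens=256,
    context_window=3900,
    model_kwargs={"n_gpu_layers": 20},  # offload 20 layers to the GPU
)
```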

##### Available LLM config options

The `llm` section of the settings allows for the following configurations:

- `mode`: how to run your LLM
- `max_new_tokens`: the maximum number of new tokens the LLM will generate and add to the context window (Llama.cpp defaults to `256`)

Example:

```yaml
llm:
  mode: local
  max_new_tokens: 256
```

If you are getting an out-of-memory error, you might also try a smaller model or stick to the
recommended models, instead of custom-tuning the parameters.

### Using OpenAI

If you cannot run a local model (because you don't have a GPU, for example) or for testing purposes, you may
decide to run PrivateGPT using OpenAI as the LLM and Embeddings model.

In order to do so, create a profile `settings-openai.yaml` with the following contents:

```yaml
llm:
  mode: openai

openai:
  api_base: <openai-api-base-url> # Defaults to https://api.openai.com/v1
  api_key: <your_openai_api_key>  # You could skip this configuration and use the OPENAI_API_KEY env var instead
  model: <openai_model_to_use> # Optional model to use. Default is "gpt-3.5-turbo"
                               # Note: OpenAI models are listed here: https://platform.openai.com/docs/models
```
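
If you prefer to keep the key out of the profile file, you can instead export the `OPENAI_API_KEY` environment variable mentioned in the comment above before starting the server:

```bash
# Alternative to hard-coding the key in settings-openai.yaml
export OPENAI_API_KEY="<your_openai_api_key>"
```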

Then run PrivateGPT loading the profile you just created:

`PGPT_PROFILES=openai make run`

or

`PGPT_PROFILES=openai poetry run python -m private_gpt`

Once the server has started, it will print the log *Application startup complete*.
Navigate to http://localhost:8001 to use the Gradio UI or to http://localhost:8001/docs (API section) to try the API.
You'll notice the speed and quality of responses are higher, since OpenAI's servers handle the heavy
computation.

### Using OpenAI compatible API

Many tools, including [LocalAI](https://localai.io/) and [vLLM](https://docs.vllm.ai/en/latest/),
support serving local models with an OpenAI-compatible API. Even when overriding the `api_base`,
the `openai` mode doesn't let you use custom models. Instead, you should use the `openailike` mode:

```yaml
llm:
  mode: openailike
```

This mode uses the same settings as the `openai` mode.
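
For instance, a `settings-vllm.yaml` profile pointing at a vLLM server running on the same machine might look like the sketch below (the `api_base`, placeholder key, and model name are assumptions; adjust them to your deployment):

```yaml
llm:
  mode: openailike

openai:
  api_base: http://localhost:8000/v1 # vLLM's OpenAI-compatible server listens here by default
  api_key: EMPTY                     # placeholder; vLLM does not require a real key by default
  model: <model_served_by_vllm>      # the model name you started vLLM with
```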

As an example, you can follow the [vLLM quickstart guide](https://docs.vllm.ai/en/latest/getting_started/quickstart.html#openai-compatible-server)
to run an OpenAI compatible server. Then, you can run PrivateGPT using the `settings-vllm.yaml` profile:

`PGPT_PROFILES=vllm make run`

### Using Azure OpenAI

If you cannot run a local model (because you don't have a GPU, for example) or for testing purposes, you may
decide to run PrivateGPT using Azure OpenAI as the LLM and Embeddings model.

In order to do so, create a profile `settings-azopenai.yaml` with the following contents:

```yaml
llm:
  mode: azopenai

embedding:
  mode: azopenai

azopenai:
  api_key: <your_azopenai_api_key>  # You could skip this configuration and use the AZ_OPENAI_API_KEY env var instead
  azure_endpoint: <your_azopenai_endpoint> # You could skip this configuration and use the AZ_OPENAI_ENDPOINT env var instead
  api_version: <api_version> # The API version to use. Default is "2023_05_15"
  embedding_deployment_name: <your_embedding_deployment_name> # You could skip this configuration and use the AZ_OPENAI_EMBEDDING_DEPLOYMENT_NAME env var instead
  embedding_model: <openai_embeddings_to_use> # Optional model to use. Default is "text-embedding-ada-002" 
  llm_deployment_name: <your_model_deployment_name> # You could skip this configuration and use the AZ_OPENAI_LLM_DEPLOYMENT_NAME env var instead
  llm_model: <openai_model_to_use> # Optional model to use. Default is "gpt-35-turbo"
```
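
As with the OpenAI profile, the credentials can also come from the environment variables referenced in the comments above instead of the profile file:

```bash
# Alternative to hard-coding credentials in settings-azopenai.yaml
export AZ_OPENAI_API_KEY="<your_azopenai_api_key>"
export AZ_OPENAI_ENDPOINT="<your_azopenai_endpoint>"
```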

Then run PrivateGPT loading the profile you just created:

`PGPT_PROFILES=azopenai make run`

or

`PGPT_PROFILES=azopenai poetry run python -m private_gpt`

Once the server has started, it will print the log *Application startup complete*.
Navigate to http://localhost:8001 to use the Gradio UI or to http://localhost:8001/docs (API section) to try the API.
You'll notice the speed and quality of responses are higher, since Azure OpenAI's servers handle the heavy
computation.

### Using AWS SageMaker

For a fully private & performant setup, you can choose to have both your LLM and Embeddings model deployed using SageMaker.

Note: how to deploy models on SageMaker is outside the scope of this documentation.

In order to do so, create a profile `settings-sagemaker.yaml` with the following contents (remember to
update the values of `llm_endpoint_name` and `embedding_endpoint_name` to your own endpoint names):

```yaml
llm:
  mode: sagemaker

sagemaker:
  llm_endpoint_name: huggingface-pytorch-tgi-inference-2023-09-25-19-53-32-140
  embedding_endpoint_name: huggingface-pytorch-inference-2023-11-03-07-41-36-479
```
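
If you are unsure of your endpoint names, you can list them with the AWS CLI (assuming it is installed and configured for the account and region hosting the endpoints):

```bash
# List the SageMaker endpoint names available in the configured account/region
aws sagemaker list-endpoints --query "Endpoints[].EndpointName"
```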

Then run PrivateGPT loading the profile you just created:

`PGPT_PROFILES=sagemaker make run`

or

`PGPT_PROFILES=sagemaker poetry run python -m private_gpt`

Once the server has started, it will print the log *Application startup complete*.
Navigate to http://localhost:8001 to use the Gradio UI or to http://localhost:8001/docs (API section) to try the API.

### Using Ollama

Another option for a fully private setup is using [Ollama](https://ollama.ai/).

Note: how to deploy Ollama and pull models onto it is outside the scope of this documentation.

In order to do so, create a profile `settings-ollama.yaml` with the following contents:

```yaml
llm:
  mode: ollama

ollama:
  model: <ollama_model_to_use> # Required. The model to use.
                               # Note: Ollama Models are listed here: https://ollama.ai/library
                               #       Be sure to pull the model to your Ollama server
  api_base: <ollama-api-base-url> # Defaults to http://localhost:11434
```
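
As the comment above notes, the model must already be pulled onto the Ollama server. For example, assuming you want to serve `mistral`:

```bash
# Pull the model onto the Ollama server before starting PrivateGPT
ollama pull mistral
```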

Then run PrivateGPT loading the profile you just created:

`PGPT_PROFILES=ollama make run`

or

`PGPT_PROFILES=ollama poetry run python -m private_gpt`

Once the server has started, it will print the log *Application startup complete*.
Navigate to http://localhost:8001 to use the Gradio UI or to http://localhost:8001/docs (API section) to try the API.