
import Image from '@theme/IdealImage'; import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem';

💥 OpenAI Proxy Server

LiteLLM Server lets you call 100+ LLMs (Huggingface, Bedrock, TogetherAI, Azure, OpenAI, and more) in a unified OpenAI-compatible format, with load balancing, model-specific configs, caching, and key-based auth.

See LiteLLM Proxy code

Quick Start

View all the supported args for the Proxy CLI here

$ pip install litellm[proxy]
$ litellm --model huggingface/bigcode/starcoder

#INFO: Proxy running on http://0.0.0.0:8000

Test

In a new shell, run the following command. This makes an openai.chat.completions request to the proxy. Ensure you're using openai v1.0.0+.

litellm --test

This will automatically route any requests for gpt-3.5-turbo to bigcode/starcoder, hosted on Huggingface Inference Endpoints.

Using LiteLLM Proxy - Curl Request, OpenAI Package

curl --location 'http://0.0.0.0:8000/chat/completions' \
--header 'Content-Type: application/json' \
--data ' {
      "model": "gpt-3.5-turbo",
      "messages": [
        {
          "role": "user",
          "content": "what llm are you"
        }
      ]
    }
'
import openai
client = openai.OpenAI(
    api_key="anything",
    base_url="http://0.0.0.0:8000"
)

# request sent to model set on litellm proxy, `litellm --model`
response = client.chat.completions.create(model="gpt-3.5-turbo", messages = [
    {
        "role": "user",
        "content": "this is a test request, write a short poem"
    }
])

print(response)

Server Endpoints

  • POST /chat/completions - chat completions endpoint to call 100+ LLMs
  • POST /completions - completions endpoint
  • POST /embeddings - embedding endpoint for Azure, OpenAI, Huggingface endpoints
  • GET /models - available models on server (see the example below)
  • POST /key/generate - generate a key to access the proxy
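
For example, you can check which models the proxy exposes by hitting GET /models with the OpenAI Python client. A minimal sketch (assuming no master key is set, so any placeholder api_key works):

import openai

# point the OpenAI client at the LiteLLM proxy
client = openai.OpenAI(api_key="anything", base_url="http://0.0.0.0:8000")

# GET /models - list the models configured on the proxy
for model in client.models.list():
    print(model.id)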

Supported LLMs

All LiteLLM supported LLMs are supported on the Proxy. See all supported LLMs

# AWS Bedrock
$ export AWS_ACCESS_KEY_ID=
$ export AWS_REGION_NAME=
$ export AWS_SECRET_ACCESS_KEY=
$ litellm --model bedrock/anthropic.claude-v2

# Azure OpenAI
$ export AZURE_API_KEY=my-api-key
$ export AZURE_API_BASE=my-api-base
$ litellm --model azure/my-deployment-name

# OpenAI
$ export OPENAI_API_KEY=my-api-key
$ litellm --model gpt-3.5-turbo

# Huggingface (Inference Endpoint or local endpoint)
$ export HUGGINGFACE_API_KEY=my-api-key #[OPTIONAL]
$ litellm --model huggingface/<your model name> --api_base https://k58ory32yinf1ly0.us-east-1.aws.endpoints.huggingface.cloud
$ litellm --model huggingface/<your model name> --api_base http://0.0.0.0:8001

# AWS Sagemaker
$ export AWS_ACCESS_KEY_ID=
$ export AWS_REGION_NAME=
$ export AWS_SECRET_ACCESS_KEY=
$ litellm --model sagemaker/jumpstart-dft-meta-textgeneration-llama-2-7b

# Anthropic
$ export ANTHROPIC_API_KEY=my-api-key
$ litellm --model claude-instant-1

# VLLM (assuming you're running vllm locally)
$ litellm --model vllm/facebook/opt-125m

# TogetherAI
$ export TOGETHERAI_API_KEY=my-api-key
$ litellm --model together_ai/lmsys/vicuna-13b-v1.5-16k

# Replicate
$ export REPLICATE_API_KEY=my-api-key
$ litellm \
  --model replicate/meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3

# Petals
$ litellm --model petals/meta-llama/Llama-2-70b-chat-hf

# Palm
$ export PALM_API_KEY=my-palm-key
$ litellm --model palm/chat-bison

# AI21
$ export AI21_API_KEY=my-api-key
$ litellm --model j2-light

# Cohere
$ export COHERE_API_KEY=my-api-key
$ litellm --model command-nightly

Using with OpenAI compatible projects

Set base_url to the LiteLLM Proxy server

import openai
client = openai.OpenAI(
    api_key="anything",
    base_url="http://0.0.0.0:8000"
)

# request sent to model set on litellm proxy, `litellm --model`
response = client.chat.completions.create(model="gpt-3.5-turbo", messages = [
    {
        "role": "user",
        "content": "this is a test request, write a short poem"
    }
])

print(response)

Start the LiteLLM proxy

litellm --model gpt-3.5-turbo

#INFO: Proxy running on http://0.0.0.0:8000

1. Clone the repo

git clone https://github.com/danny-avila/LibreChat.git

2. Modify Librechat's docker-compose.yml

LiteLLM Proxy runs on port 8000, so set port 8000 as the proxy below

OPENAI_REVERSE_PROXY=http://host.docker.internal:8000/v1/chat/completions

3. Save fake OpenAI key in Librechat's .env

Copy Librechat's .env.example to .env and overwrite the default OPENAI_API_KEY (by default it requires the user to pass a key).

OPENAI_API_KEY=sk-1234

4. Run LibreChat:

docker compose up

Continue-Dev brings ChatGPT to VSCode. See how to install it here.

In config.py, set this as your default model.

  default=OpenAI(
      api_key="IGNORED",
      model="fake-model-name",
      context_length=2048, # customize if needed for your model
      api_base="http://localhost:8000" # your proxy server url
  ),

Credits @vividfog for this tutorial.

$ pip install aider 

$ aider --openai-api-base http://0.0.0.0:8000 --openai-api-key fake-key
pip install pyautogen
from autogen import AssistantAgent, UserProxyAgent, oai
config_list=[
    {
        "model": "my-fake-model",
        "api_base": "http://localhost:8000",  #litellm compatible endpoint
        "api_type": "open_ai",
        "api_key": "NULL", # just a placeholder
    }
]

response = oai.Completion.create(config_list=config_list, prompt="Hi")
print(response) # works fine

llm_config={
    "config_list": config_list,
}

assistant = AssistantAgent("assistant", llm_config=llm_config)
user_proxy = UserProxyAgent("user_proxy")
user_proxy.initiate_chat(assistant, message="Plot a chart of META and TESLA stock price change YTD.", config_list=config_list)

Credits @victordibia for this tutorial.

A guidance language for controlling large language models. https://github.com/guidance-ai/guidance

NOTE: Guidance sends additional params like stop_sequences which can cause some models to fail if they don't support them.

Fix: Start your proxy using the --drop_params flag

litellm --model ollama/codellama --temperature 0.3 --max_tokens 2048 --drop_params
import guidance

# set api_base to your proxy
# set api_key to anything
gpt4 = guidance.llms.OpenAI("gpt-4", api_base="http://0.0.0.0:8000", api_key="anything")

experts = guidance('''
{{#system~}}
You are a helpful and terse assistant.
{{~/system}}

{{#user~}}
I want a response to the following question:
{{query}}
Name 3 world-class experts (past or present) who would be great at answering this?
Don't answer the question yet.
{{~/user}}

{{#assistant~}}
{{gen 'expert_names' temperature=0 max_tokens=300}}
{{~/assistant}}
''', llm=gpt4)

result = experts(query='How can I be more productive?')
print(result)

Proxy Configs

The Config allows you to set the following params

| Param Name | Description |
|------------|-------------|
| model_list | List of supported models on the server, with model-specific configs |
| litellm_settings | litellm module settings, e.g. litellm.drop_params=True, litellm.set_verbose=True, litellm.api_base, litellm.cache |
| general_settings | Server settings, e.g. setting master_key: sk-my_special_key |
| environment_variables | Environment variables, e.g. REDIS_HOST, REDIS_PORT |

Example Config

model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/gpt-turbo-small-eu
      api_base: https://my-endpoint-europe-berri-992.openai.azure.com/
      api_key: 
      rpm: 6      # Rate limit for this deployment: in requests per minute (rpm)
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/gpt-turbo-small-ca
      api_base: https://my-endpoint-canada-berri992.openai.azure.com/
      api_key: 
      rpm: 6
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/gpt-turbo-large
      api_base: https://openai-france-1234.openai.azure.com/
      api_key: 
      rpm: 1440

litellm_settings:
  drop_params: True
  set_verbose: True

general_settings: 
  master_key: sk-1234 # [OPTIONAL] Only use this if you want to require all calls to contain this key (Authorization: Bearer sk-1234)


environment_variables:
  OPENAI_API_KEY: sk-123
  REPLICATE_API_KEY: sk-cohere-is-okay
  REDIS_HOST: redis-16337.c322.us-east-1-2.ec2.cloud.redislabs.com
  REDIS_PORT: "16337"
  REDIS_PASSWORD: 

Config for Multiple Models - GPT-4, Claude-2

Here's how you can use multiple LLMs with one proxy config.yaml.

Step 1: Setup Config

model_list:
  - model_name: zephyr-alpha # the 1st model is the default on the proxy
    litellm_params: # params for litellm.completion() - https://docs.litellm.ai/docs/completion/input#input---request-body
      model: huggingface/HuggingFaceH4/zephyr-7b-alpha
      api_base: http://0.0.0.0:8001
  - model_name: gpt-4
    litellm_params:
      model: gpt-4
      api_key: sk-1233
  - model_name: claude-2
    litellm_params:
      model: claude-2
      api_key: sk-claude    

:::info

The proxy uses the first model in the config as the default model - in this config, the default model is zephyr-alpha.

:::

Step 2: Start Proxy with config

$ litellm --config /path/to/config.yaml

Step 3: Use proxy

Curl Command

curl --location 'http://0.0.0.0:8000/chat/completions' \
--header 'Content-Type: application/json' \
--data ' {
      "model": "zephyr-alpha",
      "messages": [
        {
          "role": "user",
          "content": "what llm are you"
        }
      ]
    }
'

Load Balancing - Multiple Instances of 1 model

Use this config to load balance between multiple instances of the same model. The proxy will handle routing requests (using LiteLLM's Router). Set rpm in the config if you want to maximize throughput.

Example config

Requests with model=gpt-3.5-turbo will be routed across multiple instances of azure/gpt-3.5-turbo

model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/gpt-turbo-small-eu
      api_base: https://my-endpoint-europe-berri-992.openai.azure.com/
      api_key: 
      rpm: 6      # Rate limit for this deployment: in requests per minute (rpm)
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/gpt-turbo-small-ca
      api_base: https://my-endpoint-canada-berri992.openai.azure.com/
      api_key: 
      rpm: 6
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/gpt-turbo-large
      api_base: https://openai-france-1234.openai.azure.com/
      api_key: 
      rpm: 1440

Step 2: Start Proxy with config

$ litellm --config /path/to/config.yaml

Step 3: Use proxy

Curl Command

curl --location 'http://0.0.0.0:8000/chat/completions' \
--header 'Content-Type: application/json' \
--data ' {
      "model": "gpt-3.5-turbo",
      "messages": [
        {
          "role": "user",
          "content": "what llm are you"
        }
      ]
    }
'
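
To see the load balancing in action, you can send a few requests in a row from the OpenAI Python client; the proxy's router spreads them across the three Azure deployments defined above. A rough sketch (again assuming no master key is set):

import openai

client = openai.OpenAI(api_key="anything", base_url="http://0.0.0.0:8000")

# every call uses the same public model name ("gpt-3.5-turbo");
# the router picks one of the underlying Azure deployments per request
for i in range(3):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": f"request {i}: what llm are you"}],
    )
    print(response.choices[0].message.content)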

Fallbacks + Cooldowns + Retries + Timeouts

If a call fails after num_retries, fall back to another model group.

If the error is a context window exceeded error, fall back to a larger model group (if given).

See Code

Set via config

model_list:
  - model_name: zephyr-beta
    litellm_params:
        model: huggingface/HuggingFaceH4/zephyr-7b-beta
        api_base: http://0.0.0.0:8001
  - model_name: zephyr-beta
    litellm_params:
        model: huggingface/HuggingFaceH4/zephyr-7b-beta
        api_base: http://0.0.0.0:8002
  - model_name: zephyr-beta
    litellm_params:
        model: huggingface/HuggingFaceH4/zephyr-7b-beta
        api_base: http://0.0.0.0:8003
  - model_name: gpt-3.5-turbo
    litellm_params:
        model: gpt-3.5-turbo
        api_key: <my-openai-key>
  - model_name: gpt-3.5-turbo-16k
    litellm_params:
        model: gpt-3.5-turbo-16k
        api_key: <my-openai-key>

litellm_settings:
  num_retries: 3 # retry call 3 times on each model_name (e.g. zephyr-beta)
  request_timeout: 10 # raise Timeout error if call takes longer than 10s
  fallbacks: [{"zephyr-beta": ["gpt-3.5-turbo"]}] # fallback to gpt-3.5-turbo if call fails num_retries 
  context_window_fallbacks: [{"zephyr-beta": ["gpt-3.5-turbo-16k"]}, {"gpt-3.5-turbo": ["gpt-3.5-turbo-16k"]}] # fallback to gpt-3.5-turbo-16k if context window error
  allowed_fails: 3 # cooldown a deployment if it fails > 3 calls in a minute

Set dynamically

curl --location 'http://0.0.0.0:8000/chat/completions' \
--header 'Content-Type: application/json' \
--data ' {
      "model": "zephyr-beta",
      "messages": [
        {
          "role": "user",
          "content": "what llm are you"
        }
      ],
      "fallbacks": [{"zephyr-beta": ["gpt-3.5-turbo"]}],
      "context_window_fallbacks": [{"zephyr-beta": ["gpt-3.5-turbo"]}],
      "num_retries": 2,
      "request_timeout": 10
    }
'
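
If you're calling the proxy from the OpenAI Python client, the same per-request overrides can be forwarded with extra_body, which adds arbitrary fields to the request body. A sketch, assuming the proxy accepts these fields exactly as in the curl above:

import openai

client = openai.OpenAI(api_key="anything", base_url="http://0.0.0.0:8000")

response = client.chat.completions.create(
    model="zephyr-beta",
    messages=[{"role": "user", "content": "what llm are you"}],
    # extra_body fields are passed through to the proxy unchanged
    extra_body={
        "fallbacks": [{"zephyr-beta": ["gpt-3.5-turbo"]}],
        "context_window_fallbacks": [{"zephyr-beta": ["gpt-3.5-turbo"]}],
        "num_retries": 2,
        "request_timeout": 10,
    },
)
print(response)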

Config for Embedding Models - xorbitsai/inference

Here's how you can load balance multiple OpenAI-compatible embedding deployments with one proxy config.yaml. Here is how LiteLLM calls OpenAI-compatible embedding models

Config

model_list:
  - model_name: custom_embedding_model
    litellm_params:
      model: openai/custom_embedding  # the `openai/` prefix tells litellm it's openai compatible
      api_base: http://0.0.0.0:8000/
  - model_name: custom_embedding_model
    litellm_params:
      model: openai/custom_embedding  # the `openai/` prefix tells litellm it's openai compatible
      api_base: http://0.0.0.0:8001/

Run the proxy using this config

$ litellm --config /path/to/config.yaml
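
You can then call the embedding model through the proxy with the OpenAI Python client. A minimal sketch (assuming the proxy itself is reachable at http://0.0.0.0:8000; the proxy picks one of the two custom_embedding_model deployments per request):

import openai

client = openai.OpenAI(api_key="anything", base_url="http://0.0.0.0:8000")

# POST /embeddings - routed to one of the custom_embedding_model api_bases
response = client.embeddings.create(
    model="custom_embedding_model",
    input=["write a litellm poem"],
)
print(len(response.data[0].embedding))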

Managing Auth - Virtual Keys

Grant others temporary access to your proxy, with keys that expire after a set duration.

Requirements:

  • A Postgres database (used to store the generated keys - see database_url in the config below)

You can then generate temporary keys by hitting the /key/generate endpoint.

See code

Step 1: Save postgres db url

model_list:
  - model_name: gpt-4
    litellm_params:
        model: ollama/llama2
  - model_name: gpt-3.5-turbo
    litellm_params:
        model: ollama/llama2

general_settings: 
  master_key: sk-1234 # [OPTIONAL] if set all calls to proxy will require either this key or a valid generated token
  database_url: "postgresql://<user>:<password>@<host>:<port>/<dbname>"

Step 2: Start litellm

litellm --config /path/to/config.yaml

Step 3: Generate temporary keys

curl 'http://0.0.0.0:8000/key/generate' \
--header 'Authorization: Bearer sk-1234' \
--data '{"models": ["gpt-3.5-turbo", "gpt-4", "claude-2"], "duration": "20m"}'
  • models: list or null (optional) - Specify the models a token has access to. If null, the token has access to all models on the server.

  • duration: str or null (optional) - Specify the length of time the token is valid for. If null, the default is 1 hour. You can set duration as seconds ("30s"), minutes ("30m"), hours ("30h"), days ("30d").

Expected response:

{
    "key": "sk-kdEXbIqZRwEeEiHwdg7sFA", # Bearer token
    "expires": "2023-11-19T01:38:25.838000+00:00" # datetime object
}
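
The returned key can then be used as the api_key / Bearer token for requests to the proxy until it expires. A sketch using the OpenAI Python client with the example key above:

import openai

# use the key returned by /key/generate
client = openai.OpenAI(
    api_key="sk-kdEXbIqZRwEeEiHwdg7sFA",
    base_url="http://0.0.0.0:8000",
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "hello from a temporary key"}],
)
print(response)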

Managing Auth - Upgrade/Downgrade Models

If a user is expected to use a given model (e.g. gpt-3.5-turbo), and you want to:

  • try to upgrade the request (e.g. to GPT-4)
  • or downgrade it (e.g. to Mistral)
  • OR rotate the API key (e.g. the OpenAI key)
  • OR access the same model through different endpoints (e.g. OpenAI vs OpenRouter vs Azure)

Here's how you can do that:

Step 1: Create a model group in config.yaml (save model name, api keys, etc.)

model_list:
  - model_name: my-free-tier
    litellm_params:
        model: huggingface/HuggingFaceH4/zephyr-7b-beta
        api_base: http://0.0.0.0:8001
  - model_name: my-free-tier
    litellm_params:
        model: huggingface/HuggingFaceH4/zephyr-7b-beta
        api_base: http://0.0.0.0:8002
  - model_name: my-free-tier
    litellm_params:
        model: huggingface/HuggingFaceH4/zephyr-7b-beta
        api_base: http://0.0.0.0:8003
  - model_name: my-paid-tier
    litellm_params:
        model: gpt-4
        api_key: my-api-key

Step 2: Generate a user key - enabling them access to specific models, custom model aliases, etc.

curl -X POST "http://0.0.0.0:8000/key/generate" \
-H "Authorization: Bearer sk-1234" \
-H "Content-Type: application/json" \
-d '{
    "models": ["my-free-tier"], 
    "aliases": {"gpt-3.5-turbo": "my-free-tier"}, 
    "duration": "30min"
}'
  • How do you upgrade / downgrade a request? Change the alias mapping (see the sketch below)
  • How is routing between different keys/api bases done? litellm handles this by shuffling between different models in the model list with the same model_name. See Code
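
Here's a sketch of what this looks like from the user's side (the key value is hypothetical): the user asks for gpt-3.5-turbo, and the alias mapping on their key silently serves the request from the my-free-tier group.

import openai

# key generated in Step 2 (hypothetical value)
client = openai.OpenAI(api_key="<generated-user-key>", base_url="http://0.0.0.0:8000")

# the user requests "gpt-3.5-turbo", but the key's alias mapping
# routes it to the "my-free-tier" model group (zephyr-7b-beta)
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "what llm are you"}],
)
print(response)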

Managing Auth - Tracking Spend

You can get spend for a key by using the /key/info endpoint.

curl 'http://0.0.0.0:8000/key/info?key=<user-key>' \
     -X GET \
     -H 'Authorization: Bearer <your-master-key>'

This is automatically updated (in USD) when calls are made to /completions, /chat/completions, /embeddings using litellm's completion_cost() function. See Code.

Sample response

{
    "key": "sk-tXL0wt5-lOOVK9sfY2UacA",
    "info": {
        "token": "sk-tXL0wt5-lOOVK9sfY2UacA",
        "spend": 0.0001065,
        "expires": "2023-11-24T23:19:11.131000Z",
        "models": [
            "gpt-3.5-turbo",
            "gpt-4",
            "claude-2"
        ],
        "aliases": {
            "mistral-7b": "gpt-3.5-turbo"
        },
        "config": {}
    }
}
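
For reference, the spend figure is computed with litellm's completion_cost() function; you can also call it directly in Python to estimate the USD cost of a single response. A minimal sketch:

import litellm

# make a call, then compute its cost with the same function the proxy uses
response = litellm.completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "this is a test request, write a short poem"}],
)
cost = litellm.completion_cost(completion_response=response)
print(f"cost (USD): {cost:.6f}")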

Save Model-specific params (API Base, API Keys, Temperature, Headers etc.)

You can use the config to save model-specific information like api_base, api_key, temperature, max_tokens, etc.

Step 1: Create a config.yaml file

model_list:
  - model_name: gpt-4-team1
    litellm_params: # params for litellm.completion() - https://docs.litellm.ai/docs/completion/input#input---request-body
      model: azure/chatgpt-v-2
      api_base: https://openai-gpt-4-test-v-1.openai.azure.com/
      api_version: "2023-05-15"
      azure_ad_token: eyJ0eXAiOiJ
  - model_name: gpt-4-team2
    litellm_params:
      model: azure/gpt-4
      api_key: sk-123
      api_base: https://openai-gpt-4-test-v-2.openai.azure.com/
  - model_name: mistral-7b
    litellm_params:
      model: ollama/mistral
      api_base: your_ollama_api_base
      headers: {
        "HTTP-Referer": "litellm.ai",  
        "X-Title": "LiteLLM Server"
      }

Step 2: Start server with config

$ litellm --config /path/to/config.yaml

Load API Keys from Vault

If you have secrets saved in Azure Vault, etc. and don't want to expose them in the config.yaml, here's how to load model-specific keys from the environment.

os.environ["AZURE_NORTH_AMERICA_API_KEY"] = "your-azure-api-key"
model_list:
  - model_name: gpt-4-team1
    litellm_params: # params for litellm.completion() - https://docs.litellm.ai/docs/completion/input#input---request-body
      model: azure/chatgpt-v-2
      api_base: https://openai-gpt-4-test-v-1.openai.azure.com/
      api_version: "2023-05-15"
      api_key: os.environ/AZURE_NORTH_AMERICA_API_KEY

See Code

s/o to @David Manouchehri for helping with this.

Config for setting Model Aliases

Set a model alias for your deployments.

In the config.yaml the model_name parameter is the user-facing name to use for your deployment.

In the config below requests with model=gpt-4 will route to ollama/llama2

model_list:
  - model_name: text-davinci-003
    litellm_params:
        model: ollama/zephyr
  - model_name: gpt-4
    litellm_params:
        model: ollama/llama2
  - model_name: gpt-3.5-turbo
    litellm_params:
        model: ollama/llama2
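
With this config, a standard OpenAI-style request for gpt-4 is actually served by ollama/llama2. A minimal sketch:

import openai

client = openai.OpenAI(api_key="anything", base_url="http://0.0.0.0:8000")

# "gpt-4" is only the user-facing alias; the proxy forwards the call to ollama/llama2
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "what llm are you"}],
)
print(response)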

Caching Responses

Caching can be enabled by adding the cache key in the config.yaml

Step 1: Add cache to the config.yaml

model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: gpt-3.5-turbo

litellm_settings:
  set_verbose: True
  cache:          # init cache
    type: redis   # tell litellm to use redis caching

Step 2: Add Redis Credentials to .env

LiteLLM requires the following REDIS credentials in your env to enable caching

REDIS_HOST = ""       # REDIS_HOST='redis-18841.c274.us-east-1-3.ec2.cloud.redislabs.com'
REDIS_PORT = ""       # REDIS_PORT='18841'
REDIS_PASSWORD = ""   # REDIS_PASSWORD='liteLlmIsAmazing'

Step 3: Run proxy with config

$ litellm --config /path/to/config.yaml

Using Caching

Send the same request twice:

curl http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
     "model": "gpt-3.5-turbo",
     "messages": [{"role": "user", "content": "write a poem about litellm!"}],
     "temperature": 0.7
   }'

curl http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
     "model": "gpt-3.5-turbo",
     "messages": [{"role": "user", "content": "write a poem about litellm!"}],
     "temperature": 0.7
   }'

Control caching per completion request

Caching can be switched on/off per /chat/completions request

  • Caching on for completion - pass caching=True:
    curl http://0.0.0.0:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
       "model": "gpt-3.5-turbo",
       "messages": [{"role": "user", "content": "write a poem about litellm!"}],
       "temperature": 0.7,
       "caching": true
     }'
    
  • Caching off for completion - pass caching=False:
    curl http://0.0.0.0:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
       "model": "gpt-3.5-turbo",
       "messages": [{"role": "user", "content": "write a poem about litellm!"}],
       "temperature": 0.7,
       "caching": false
     }'
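
If you're calling the proxy from the OpenAI Python client, the same caching flag can be forwarded via extra_body. A sketch:

import openai

client = openai.OpenAI(api_key="anything", base_url="http://0.0.0.0:8000")

# pass the proxy-specific "caching" flag through extra_body
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "write a poem about litellm!"}],
    temperature=0.7,
    extra_body={"caching": True},
)
print(response)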
    

Set Custom Prompt Templates

LiteLLM by default checks if a model has a prompt template and applies it (e.g. if a huggingface model has a saved chat template in its tokenizer_config.json). However, you can also set a custom prompt template on your proxy in the config.yaml:

Step 1: Save your prompt template in a config.yaml

# Model-specific parameters
model_list:
  - model_name: mistral-7b # model alias
    litellm_params: # actual params for litellm.completion()
      model: "huggingface/mistralai/Mistral-7B-Instruct-v0.1" 
      api_base: "<your-api-base>"
      api_key: "<your-api-key>" # [OPTIONAL] for hf inference endpoints
      initial_prompt_value: "\n"
      roles: {"system":{"pre_message":"<|im_start|>system\n", "post_message":"<|im_end|>"}, "assistant":{"pre_message":"<|im_start|>assistant\n","post_message":"<|im_end|>"}, "user":{"pre_message":"<|im_start|>user\n","post_message":"<|im_end|>"}}
      final_prompt_value: "\n"
      bos_token: "<s>"
      eos_token: "</s>"
      max_tokens: 4096

Step 2: Start server with config

$ litellm --config /path/to/config.yaml
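
For reference, here's roughly what the template above does to a conversation. This is a hand-worked illustration with a hypothetical helper function, not proxy output, and it glosses over exactly where LiteLLM inserts the bos/eos tokens:

# hypothetical helper, only to illustrate how the roles config is applied
def render_prompt(messages, roles, initial="\n", final="\n", bos="<s>", eos="</s>"):
    prompt = bos + initial
    for m in messages:
        r = roles[m["role"]]
        prompt += r["pre_message"] + m["content"] + r["post_message"]
    return prompt + final + eos

roles = {
    "system": {"pre_message": "<|im_start|>system\n", "post_message": "<|im_end|>"},
    "user": {"pre_message": "<|im_start|>user\n", "post_message": "<|im_end|>"},
    "assistant": {"pre_message": "<|im_start|>assistant\n", "post_message": "<|im_end|>"},
}

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "what llm are you"},
]

print(render_prompt(messages, roles))
# <s>
# <|im_start|>system
# You are a helpful assistant.<|im_end|><|im_start|>user
# what llm are you<|im_end|>
# </s>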

Debugging Proxy

Run the proxy with --debug to easily view debug logs

litellm --model gpt-3.5-turbo --debug

When making requests, you should see the POST request sent by LiteLLM to the LLM in the terminal output

POST Request Sent from LiteLLM:
curl -X POST \
https://api.openai.com/v1/chat/completions \
-H 'content-type: application/json' -H 'Authorization: Bearer sk-qnWGUIW9****************************************' \
-d '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "this is a test request, write a short poem"}]}'

Health Check LLMs on Proxy

Use this to health check all LLMs defined in your config.yaml

Request

curl --location 'http://0.0.0.0:8000/health'

You can also run litellm --health; it makes a GET request to http://0.0.0.0:8000/health for you

litellm --health

Response

{
    "healthy_endpoints": [
        {
            "model": "azure/gpt-35-turbo",
            "api_base": "https://my-endpoint-canada-berri992.openai.azure.com/"
        },
        {
            "model": "azure/gpt-35-turbo",
            "api_base": "https://my-endpoint-europe-berri-992.openai.azure.com/"
        }
    ],
    "unhealthy_endpoints": [
        {
            "model": "azure/gpt-35-turbo",
            "api_base": "https://openai-france-1234.openai.azure.com/"
        }
    ]
}

Logging Proxy Input/Output - OpenTelemetry

Step 1 Start OpenTelemetry Collector Docker Container

This container sends logs to your selected destination

Install OpenTelemetry Collector Docker Image

docker pull otel/opentelemetry-collector:0.90.0
docker run -p 127.0.0.1:4317:4317 -p 127.0.0.1:55679:55679 otel/opentelemetry-collector:0.90.0

Set Destination paths on OpenTelemetry Collector

Here's the OpenTelemetry yaml config to use with Elasticsearch

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
  
processors:
  batch:
    timeout: 1s
    send_batch_size: 1024

exporters:
  logging:
    loglevel: debug
  otlphttp/elastic:
    endpoint: "<your elastic endpoint>"
    headers: 
      Authorization: "Bearer <elastic api key>"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [logging, otlphttp/elastic]
    traces:
      receivers: [otlp]
      exporters: [logging, otlphttp/elastic]
    logs: 
      receivers: [otlp]
      exporters: [logging,otlphttp/elastic]

Start the OpenTelemetry container with config

Run the following command to start your docker container. We pass otel_config.yaml from the previous step

docker run -p 4317:4317 \
    -v $(pwd)/otel_config.yaml:/etc/otel-collector-config.yaml \
    otel/opentelemetry-collector:latest \
    --config=/etc/otel-collector-config.yaml

Step 2 Configure LiteLLM proxy to log on OpenTelemetry

Pip install opentelemetry

pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp -U

Set (OpenTelemetry) otel=True on the proxy config.yaml

Example config.yaml

model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/gpt-turbo-small-eu
      api_base: https://my-endpoint-europe-berri-992.openai.azure.com/
      api_key: 
      rpm: 6      # Rate limit for this deployment: in requests per minute (rpm)

general_settings: 
  otel: True      # set OpenTelemetry=True, on litellm Proxy

Set OTEL collector endpoint

LiteLLM will read the OTEL_ENDPOINT environment variable to send data to your OTEL collector

os.environ['OTEL_ENDPOINT'] # defaults to 127.0.0.1:4317 if not provided

Start LiteLLM Proxy

litellm --config config.yaml

Run a test request to Proxy

curl --location 'http://0.0.0.0:8000/chat/completions' \
    --header 'Authorization: Bearer sk-1244' \
    --data ' {
    "model": "gpt-3.5-turbo",
    "messages": [
        {
        "role": "user",
        "content": "request from LiteLLM testing"
        }
    ]
    }'

Test & View Logs on OpenTelemetry Collector

On successful logging, you should be able to see this log on your OpenTelemetry Collector Docker container

Events:
SpanEvent #0
     -> Name: LiteLLM: Request Input
     -> Timestamp: 2023-12-02 05:05:53.71063 +0000 UTC
     -> DroppedAttributesCount: 0
     -> Attributes::
          -> type: Str(http)
          -> asgi: Str({'version': '3.0', 'spec_version': '2.3'})
          -> http_version: Str(1.1)
          -> server: Str(('127.0.0.1', 8000))
          -> client: Str(('127.0.0.1', 62796))
          -> scheme: Str(http)
          -> method: Str(POST)
          -> root_path: Str()
          -> path: Str(/chat/completions)
          -> raw_path: Str(b'/chat/completions')
          -> query_string: Str(b'')
          -> headers: Str([(b'host', b'0.0.0.0:8000'), (b'user-agent', b'curl/7.88.1'), (b'accept', b'*/*'), (b'authorization', b'Bearer sk-1244'), (b'content-length', b'147'), (b'content-type', b'application/x-www-form-urlencoded')])
          -> state: Str({})
          -> app: Str(<fastapi.applications.FastAPI object at 0x1253dd960>)
          -> fastapi_astack: Str(<contextlib.AsyncExitStack object at 0x127c8b7c0>)
          -> router: Str(<fastapi.routing.APIRouter object at 0x1253dda50>)
          -> endpoint: Str(<function chat_completion at 0x1254383a0>)
          -> path_params: Str({})
          -> route: Str(APIRoute(path='/chat/completions', name='chat_completion', methods=['POST']))
SpanEvent #1
     -> Name: LiteLLM: Request Headers
     -> Timestamp: 2023-12-02 05:05:53.710652 +0000 UTC
     -> DroppedAttributesCount: 0
     -> Attributes::
          -> host: Str(0.0.0.0:8000)
          -> user-agent: Str(curl/7.88.1)
          -> accept: Str(*/*)
          -> authorization: Str(Bearer sk-1244)
          -> content-length: Str(147)
          -> content-type: Str(application/x-www-form-urlencoded)
SpanEvent #2

View Log on Elastic Search

Here's the log view on Elasticsearch. You can see the request input, output and headers

<Image img={require('../img/elastic_otel.png')} />

Logging Proxy Input/Output - Langfuse

We will use the --config to set litellm.success_callback = ["langfuse"]. This will log all successful LLM calls to Langfuse.

Step 1 Install langfuse

pip install langfuse

Step 2: Create a config.yaml file and set litellm_settings: success_callback

model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: gpt-3.5-turbo

litellm_settings:
  success_callback: ["langfuse"]

Step 3: Start the proxy, make a test request

Start proxy

litellm --config config.yaml --debug

Test Request

litellm --test

Expected output on Langfuse

<Image img={require('../img/langfuse_small.png')} />

Deploying LiteLLM Proxy

Deploy on Render https://render.com/

LiteLLM Proxy Performance

Throughput - 30% Increase

LiteLLM proxy + load balancer gives a 30% increase in throughput compared to the raw OpenAI API <Image img={require('../img/throughput.png')} />

Latency Added - 0.00325 seconds

LiteLLM proxy adds 0.00325 seconds of latency compared to the raw OpenAI API <Image img={require('../img/latency.png')} />

Proxy CLI Arguments

--host

  • Default: '0.0.0.0'
  • The host for the server to listen on.
  • Usage:
    litellm --host 127.0.0.1
    

--port

  • Default: 8000
  • The port to bind the server to.
  • Usage:
    litellm --port 8080
    

--num_workers

  • Default: 1
  • The number of uvicorn workers to spin up.
  • Usage:
    litellm --num_workers 4
    

--api_base

  • Default: None
  • The API base for the model litellm should call.
  • Usage:
    litellm --model huggingface/tinyllama --api_base https://k58ory32yinf1ly0.us-east-1.aws.endpoints.huggingface.cloud
    

--api_version

  • Default: None
  • For Azure services, specify the API version.
  • Usage:
    litellm --model azure/gpt-deployment --api_version 2023-08-01 --api_base https://<your api base>
    

--model or -m

  • Default: None
  • The model name to pass to Litellm.
  • Usage:
    litellm --model gpt-3.5-turbo
    

--test

  • Type: bool (Flag)
  • Makes a test chat completions request to the proxy.
  • Usage:
    litellm --test
    

--health

  • Type: bool (Flag)
  • Runs a health check on all models in config.yaml
  • Usage:
    litellm --health
    

--alias

  • Default: None
  • An alias for the model, for user-friendly reference.
  • Usage:
    litellm --alias my-gpt-model
    

--debug

  • Default: False
  • Type: bool (Flag)
  • Enable debugging mode for the input.
  • Usage:
    litellm --debug
    

--temperature

  • Default: None
  • Type: float
  • Set the temperature for the model.
  • Usage:
    litellm --temperature 0.7
    

--max_tokens

  • Default: None
  • Type: int
  • Set the maximum number of tokens for the model output.
  • Usage:
    litellm --max_tokens 50
    

--request_timeout

  • Default: 600
  • Type: int
  • Set the timeout in seconds for completion calls.
  • Usage:
    litellm --request_timeout 300
    

--drop_params

  • Type: bool (Flag)
  • Drop any unmapped params.
  • Usage:
    litellm --drop_params
    

--add_function_to_prompt

  • Type: bool (Flag)
  • If a function is passed but unsupported, pass it as part of the prompt.
  • Usage:
    litellm --add_function_to_prompt
    

--config

  • Configure Litellm by providing a configuration file path.
  • Usage:
    litellm --config path/to/config.yaml
    

--telemetry

  • Default: True
  • Type: bool
  • Help track usage of this feature.
  • Usage:
    litellm --telemetry False