# OpenAI-Compatible RESTful APIs

FastChat provides OpenAI-compatible APIs for its supported models, so you can use FastChat as a local drop-in replacement for OpenAI APIs.
The FastChat server is compatible with both the [openai-python](https://github.com/openai/openai-python) library and cURL commands.

The following OpenAI APIs are supported:
- Chat Completions. (Reference: https://platform.openai.com/docs/api-reference/chat)
- Completions. (Reference: https://platform.openai.com/docs/api-reference/completions)
- Embeddings. (Reference: https://platform.openai.com/docs/api-reference/embeddings)

The REST API can also be operated from Google Colab, as demonstrated in the [FastChat_API_GoogleColab.ipynb](https://github.com/lm-sys/FastChat/blob/main/playground/FastChat_API_GoogleColab.ipynb) notebook in our repository, which provides a practical example of how to use the API within the Colab environment.
## RESTful API Server

First, launch the controller:

```bash
python3 -m fastchat.serve.controller
```

Then, launch the model worker(s):

```bash
python3 -m fastchat.serve.model_worker --model-path lmsys/vicuna-7b-v1.5
```

Finally, launch the RESTful API server:

```bash
python3 -m fastchat.serve.openai_api_server --host localhost --port 8000
```
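Before trying the endpoints below, you can optionally confirm that the three processes can reach each other by listing the registered models. The sketch below is just a sanity check; it assumes the server above is listening on `localhost:8000` and that the `requests` package is installed, which is an extra assumption rather than a FastChat requirement.

```python
# Optional sanity check: list the models registered with the local API server.
# Assumes `pip install requests` and the default host/port used above.
import requests

resp = requests.get("http://localhost:8000/v1/models")
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model["id"])  # e.g. "vicuna-7b-v1.5"
```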
Now, let us test the API server.

### OpenAI Official SDK

The goal of `openai_api_server.py` is to implement a fully OpenAI-compatible API server, so the models can be used directly with the [openai-python](https://github.com/openai/openai-python) library.
First, install the OpenAI Python package (version >= 1.0):
```bash
pip install --upgrade openai
```

Then, interact with the Vicuna model:

```python
import openai

openai.api_key = "EMPTY"
openai.base_url = "http://localhost:8000/v1/"

model = "vicuna-7b-v1.5"
prompt = "Once upon a time"

# create a completion
completion = openai.completions.create(model=model, prompt=prompt, max_tokens=64)
# print the completion
print(prompt + completion.choices[0].text)

# create a chat completion
completion = openai.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Hello! What is your name?"}],
)
# print the completion
print(completion.choices[0].message.content)
```
Streaming is also supported. See [test_openai_api.py](../tests/test_openai_api.py). If your API server is behind a proxy, you will need to turn off response buffering; in Nginx, you can do this by setting `proxy_buffering off;` in the location block for the proxy.
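For illustration, here is a minimal streaming sketch; it reuses the same client setup and model as the example above and simply prints tokens as they arrive.

```python
# Minimal streaming sketch: print chat-completion tokens as they arrive.
# Assumes the same openai>=1.0 setup and vicuna-7b-v1.5 model as above.
import openai

openai.api_key = "EMPTY"
openai.base_url = "http://localhost:8000/v1/"

stream = openai.chat.completions.create(
    model="vicuna-7b-v1.5",
    messages=[{"role": "user", "content": "Tell me a short story."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```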
### cURL

cURL is another good tool for inspecting the output of the API.
List Models:

```bash
curl http://localhost:8000/v1/models
```

Chat Completions:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vicuna-7b-v1.5",
    "messages": [{"role": "user", "content": "Hello! What is your name?"}]
  }'
```

Text Completions:

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vicuna-7b-v1.5",
    "prompt": "Once upon a time",
    "max_tokens": 41,
    "temperature": 0.5
  }'
```

Embeddings:

```bash
curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vicuna-7b-v1.5",
    "input": "Hello world!"
  }'
```
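The embeddings endpoint can also be called through the Python client. The following is a small illustrative sketch, assuming the same local server and model as above, that embeds two sentences and compares them with a hand-rolled cosine similarity.

```python
# Embed two sentences and compare them with cosine similarity.
# Assumes the same openai>=1.0 setup and vicuna-7b-v1.5 model as above.
import math
import openai

openai.api_key = "EMPTY"
openai.base_url = "http://localhost:8000/v1/"

def embed(text):
    resp = openai.embeddings.create(model="vicuna-7b-v1.5", input=text)
    return resp.data[0].embedding

a, b = embed("Hello world!"), embed("Hi there!")
cosine = sum(x * y for x, y in zip(a, b)) / (
    math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
)
print(f"cosine similarity: {cosine:.3f}")
```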
### Running multiple models

If you want to run multiple models on the same machine and in the same process,
you can replace the `model_worker` step above with a multi-model variant:
```bash
python3 -m fastchat.serve.multi_model_worker \
    --model-path lmsys/vicuna-7b-v1.5 \
    --model-names vicuna-7b-v1.5 \
    --model-path lmsys/longchat-7b-16k \
    --model-names longchat-7b-16k
```
This loads both models onto the same accelerator and into the same process. This
works best when using a Peft model that triggers the `PeftModelAdapter`.

TODO: Base model weight optimization will be fixed once [this
Peft](https://github.com/huggingface/peft/issues/430) issue is resolved.
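Once the multi-model worker has registered with the controller, both models are reachable through the same API server under their `--model-names` values. A minimal sketch, assuming the controller and `openai_api_server` from the earlier steps are still running:

```python
# Query both models served by multi_model_worker through the same API server.
# Assumes the controller and openai_api_server launched earlier are running.
import openai

openai.api_key = "EMPTY"
openai.base_url = "http://localhost:8000/v1/"

for model in ("vicuna-7b-v1.5", "longchat-7b-16k"):
    completion = openai.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Hello! What is your name?"}],
    )
    print(f"{model}: {completion.choices[0].message.content}")
```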
## LangChain Support

This OpenAI-compatible API server supports LangChain. See [LangChain Integration](langchain_integration.md) for details.
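As a rough sketch of the idea (the linked guide is the authoritative reference), LangChain's OpenAI chat wrapper can simply be pointed at the local server. This assumes the `langchain-openai` package is installed, which is an assumption beyond this document.

```python
# Point LangChain's ChatOpenAI wrapper at the local FastChat server.
# Assumes `pip install langchain-openai`; see langchain_integration.md for the full guide.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="vicuna-7b-v1.5",
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)
print(llm.invoke("Hello! What is your name?").content)
```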
## Adjusting Environment Variables

### Timeout

By default, a timeout error occurs if a model worker does not respond within 100 seconds. If your model or hardware is slower, you can raise this timeout through an environment variable:

```bash
export FASTCHAT_WORKER_API_TIMEOUT=<larger timeout in seconds>
```
### Batch size

If you encounter an out-of-memory (OOM) error while creating embeddings, you can reduce the batch size by setting:

```bash
export FASTCHAT_WORKER_API_EMBEDDING_BATCH_SIZE=1
```
## Todos

Some features to be implemented:

- [ ] Support more parameters like `logprobs`, `logit_bias`, `user`, `presence_penalty` and `frequency_penalty`
- [ ] Model details (permissions, owner and create time)
- [ ] Edits API
- [ ] Rate limit settings