# FastChat API using Google Colab

[ggcr](https://github.com/ggcr)

In [None]:
%cd /content/

# clone FastChat
!git clone https://github.com/lm-sys/FastChat.git

# install dependencies
%cd FastChat
!python3 -m pip install -e ".[model_worker,webui]" --quiet

See [openai_api.md](https://github.com/lm-sys/FastChat/blob/main/docs/openai_api.md) in the FastChat docs for details.

Google Colab has limited resources and running processes in the background is not stable there, so we run each API process (controller, model worker, API server) in its own thread and connect them through explicit addresses:

In [11]:
import subprocess
import threading

%cd /content/

# Using 127.0.0.1 because localhost does not work properly in Colab

def run_controller():
    subprocess.run(["python3", "-m", "fastchat.serve.controller", "--host", "127.0.0.1"])

def run_model_worker():
    subprocess.run([
        "python3", "-m", "fastchat.serve.model_worker",
        "--host", "127.0.0.1",
        "--controller-address", "http://127.0.0.1:21001",
        "--model-path", "lmsys/vicuna-7b-v1.5",
        "--load-8bit",
    ])

def run_api_server():
    subprocess.run([
        "python3", "-m", "fastchat.serve.openai_api_server",
        "--host", "127.0.0.1",
        "--controller-address", "http://127.0.0.1:21001",
        "--port", "8000",
    ])


/content


In [3]:
# Start controller thread
# see `controller.log` on the local storage provided by Colab
controller_thread = threading.Thread(target=run_controller)
controller_thread.start()

In [4]:
# Start model worker thread

# see the `model_worker_*.log` file on the local storage provided by Colab
# important: wait until the checkpoint shards are fully downloaded before continuing
model_worker_thread = threading.Thread(target=run_model_worker)
model_worker_thread.start()


In [12]:
# Start API server thread
api_server_thread = threading.Thread(target=run_api_server)
api_server_thread.start()
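
The model worker needs time to download and load the checkpoint, so requests sent too early can fail. One way to wait is to poll the API until the model is listed. This is a minimal sketch, assuming the OpenAI-compatible server exposes `GET /v1/models` in the usual OpenAI list format; `list_models` and `wait_for_model` are hypothetical helper names, not part of FastChat.

```python
import json
import time
import urllib.error
import urllib.request


def list_models(base_url):
    """Return the model ids reported by GET {base_url}/models, or [] if unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Server not up yet, connection refused, or a non-JSON reply.
        return []
    return [m["id"] for m in payload.get("data", [])]


def wait_for_model(model, base_url="http://127.0.0.1:8000/v1",
                   timeout=600, interval=5, fetch=list_models, sleep=time.sleep):
    """Poll until `model` is listed; return True on success, False on timeout.

    `fetch` and `sleep` are injectable so the loop can be exercised without a server.
    """
    deadline = time.monotonic() + timeout
    while True:
        if model in fetch(base_url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

In a notebook cell, `wait_for_model("vicuna-7b-v1.5")` would block until the worker has registered the model or the timeout (an assumed default of 600 seconds) expires.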

The API is now running locally inside the Colab runtime at http://127.0.0.1:8000/v1/.

We can run the curl examples from the FastChat docs against it.

Try chat completion with

In [14]:
!curl http://127.0.0.1:8000/v1/chat/completions \
 -H "Content-Type: application/json" \
 -d '{ \
 "model": "vicuna-7b-v1.5", \
 "messages": [{"role": "user", "content": "Hello, can you tell me a joke for me?"}], \
 "temperature": 0.5 \
 }'

{"id":"chatcmpl-3RViU5mrsEBNu8oSxexAEb","object":"chat.completion","created":1705781842,"model":"vicuna-7b-v1.5","choices":[{"index":0,"message":{"role":"assistant","content":"Sure thing! Here's one for you:\n\nWhy did the tomato turn red?\n\nBecause it saw the salad dressing!"},"finish_reason":"stop"}],"usage":{"prompt_tokens":50,"total_tokens":82,"completion_tokens":32}}

Try embeddings with

In [18]:
!curl http://127.0.0.1:8000/v1/embeddings \
 -H "Content-Type: application/json" \
 -d '{ \
 "model": "vicuna-7b-v1.5", \
 "input": "Hello, can you tell me a joke for me?" \
 }'

{"object":"list","data":[{"object":"embedding","embedding":[0.0229715034365654,-0.020740192383527756,0.01663232035934925,0.013713006861507893,-0.01602417416870594,-0.006382038351148367,0.011642662808299065,-0.021167458966374397,0.004879815969616175,-0.005442662630230188,0.0034834356047213078,-0.010336925275623798,-0.009551243856549263,0.0005828586872667074,-0.0089940270408988,-0.0018360239919275045,-0.021827373653650284,0.007349758874624968,-0.0011765437666326761,-0.01432803925126791,0.012239773757755756,-0.018455859273672104,0.016475312411785126,-0.006144467741250992,-0.013893244788050652,-0.00961716752499342,0.00827623251825571,0.0013034207513555884,0.006355977617204189,0.007773293182253838,0.0029199880082160234,-0.014487813226878643,-0.01615595631301403,0.007242684718221426,-0.004686516709625721,-0.0034376305993646383,-0.0046915397979319096,0.0007899928605183959,-0.003679676679894328,-0.022176748141646385,-0.005467468872666359,-0.02236158587038517,0.02086811512708664,0.0029669292271

Try text completion with

In [20]:
!curl http://127.0.0.1:8000/v1/completions \
 -H "Content-Type: application/json" \
 -d '{ \
 "model": "vicuna-7b-v1.5", \
 "prompt": "Once upon a time", \
 "max_tokens": 20, \
 "temperature": 0.5 \
 }'

{"id":"cmpl-kB3gg4KtgcGdif9V4eNbh6","object":"text_completion","created":1705782008,"model":"vicuna-7b-v1.5","choices":[{"index":0,"text":", there was a little girl named Alice. Alice lived in a small village nestled in a valley","logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":5,"total_tokens":24,"completion_tokens":19}}

Use the embeddings endpoint to analyze how similar different prompts are!

In [21]:
import json
import numpy as np
import requests
from scipy.spatial.distance import cosine


def get_embedding_from_api(word, model='vicuna-7b-v1.5'):
    url = 'http://127.0.0.1:8000/v1/embeddings'
    headers = {'Content-Type': 'application/json'}
    data = json.dumps({
        'model': model,
        'input': word
    })

    response = requests.post(url, headers=headers, data=data)
    if response.status_code == 200:
        embedding = np.array(response.json()['data'][0]['embedding'])
        return embedding
    else:
        print(f"Error: {response.status_code} - {response.text}")
        return None


def cosine_similarity(vec1, vec2):
    # scipy's `cosine` is a distance, so subtract it from 1 to get a similarity
    return 1 - cosine(vec1, vec2)


def print_cosine_similarity(embeddings, texts):
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            sim = cosine_similarity(embeddings[texts[i]], embeddings[texts[j]])
            print(f"Cosine similarity between '{texts[i]}' and '{texts[j]}': {sim:.2f}")


texts = [
    'The quick brown fox',
    'The quick brown dog',
    'The fast brown fox',
    'A completely different sentence'
]

embeddings = {}
for text in texts:
    embeddings[text] = get_embedding_from_api(text)

print_cosine_similarity(embeddings, texts)

Cosine similarity between 'The quick brown fox' and 'The quick brown dog': 0.90
Cosine similarity between 'The quick brown fox' and 'The fast brown fox': 0.86
Cosine similarity between 'The quick brown fox' and 'A completely different sentence': 0.58
Cosine similarity between 'The quick brown dog' and 'The fast brown fox': 0.84
Cosine similarity between 'The quick brown dog' and 'A completely different sentence': 0.66
Cosine similarity between 'The fast brown fox' and 'A completely different sentence': 0.62
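
For reference, the score printed above is the dot product of the two vectors divided by the product of their norms, which is what `1 - cosine(vec1, vec2)` evaluates to. The sketch below spells that out without SciPy; `cosine_similarity_plain` is a hypothetical name chosen to avoid clashing with the function in the cell above, and it assumes neither vector is all zeros.

```python
import math


def cosine_similarity_plain(vec1, vec2):
    """Cosine similarity: dot(vec1, vec2) / (|vec1| * |vec2|)."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    # Assumes non-zero vectors; a zero vector would divide by zero here.
    return dot / (norm1 * norm2)
```

Values near 1 mean the embeddings point in almost the same direction, values near 0 mean they are close to orthogonal.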
