# Introduction
Please check out my [blog post](https://datavistics.github.io/posts/jais-inference-endpoints/) for more details!

# Setup

## Requirements

In [1]:
%pip install -q "huggingface-hub>=0.20" ipywidgets


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m23.3.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


## Imports

In [None]:
from huggingface_hub import login, whoami, create_inference_endpoint
from getpass import getpass

## Config
Choose your `ENDPOINT_NAME` if you like.

In [3]:
ENDPOINT_NAME = "jais13b-demo"

In [None]:
login()

Some users have their payment method registered on an organization rather than on their personal account. The prompt below lets you create the endpoint under an organization you are a member of, as long as that organization has a payment method.

Leave it blank if you want to use your username.

In [5]:
who = whoami()
organization = getpass(prompt="What is your Hugging Face 🤗 username or organization? (with an added payment method)")

namespace = organization or who['name']

What is your Hugging Face 🤗 username or organization? (with an added payment method) ········


# Inference Endpoints
## Create Inference Endpoint
We are going to use the [API](https://huggingface.co/docs/inference-endpoints/api_reference) to create an [Inference Endpoint](https://huggingface.co/inference-endpoints). This should provide a few main benefits:
- It's convenient (no clicking)
- It's repeatable (we have the code to run it again easily)
- It's cheaper (no paid time spent waiting on you while it loads, and it can be shut down automatically)

Here is a convenient table of instance details you can use when selecting a GPU. Once you have chosen a GPU in Inference Endpoints, you can use the corresponding `instanceType` and `instanceSize`.

| hw_desc | instanceType | instanceSize | vRAM |
|---------------------|----------------|--------------|-------|
| 1x Nvidia Tesla T4 | g4dn.xlarge | small | 16GB |
| 4x Nvidia Tesla T4 | g4dn.12xlarge | large | 64GB |
| 1x Nvidia A10G | g5.2xlarge | medium | 24GB |
| 4x Nvidia A10G | g5.12xlarge | xxlarge | 96GB |
| 1x Nvidia A100 | p4de | xlarge | 80GB |
| 2x Nvidia A100 | p4de | 2xlarge | 160GB |

Note: To use a multi-GPU node you will need a sharded version of jais. I'm not sure whether such a version is currently available on the Hub.
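If you prefer to pick the instance in code, the table above can be turned into a small lookup. This is a minimal sketch: the `Instance` tuple and the `pick_instance` and `hw_kwargs` helpers are hypothetical names of mine, not part of `huggingface_hub`, and the rows are copied from the table as-is (vRAM for multi-GPU rows is the total across GPUs).

```python
from typing import NamedTuple, Optional


class Instance(NamedTuple):
    hw_desc: str
    instance_type: str
    instance_size: str
    gpus: int
    vram_gb: int  # total vRAM across all GPUs on the instance


# Rows copied from the table above.
INSTANCES = [
    Instance("1x Nvidia Tesla T4", "g4dn.xlarge", "small", 1, 16),
    Instance("4x Nvidia Tesla T4", "g4dn.12xlarge", "large", 4, 64),
    Instance("1x Nvidia A10G", "g5.2xlarge", "medium", 1, 24),
    Instance("4x Nvidia A10G", "g5.12xlarge", "xxlarge", 4, 96),
    Instance("1x Nvidia A100", "p4de", "xlarge", 1, 80),
    Instance("2x Nvidia A100", "p4de", "2xlarge", 2, 160),
]


def pick_instance(min_vram_gb: int, single_gpu_only: bool = True) -> Optional[Instance]:
    """Return the smallest listed instance with at least `min_vram_gb` of vRAM."""
    candidates = [
        inst
        for inst in INSTANCES
        if inst.vram_gb >= min_vram_gb and (inst.gpus == 1 or not single_gpu_only)
    ]
    # min() with default=None avoids an exception when nothing fits.
    return min(candidates, key=lambda inst: inst.vram_gb, default=None)


def hw_kwargs(inst: Instance) -> dict:
    """Build the two instance-related keyword arguments used in `hw_dict`."""
    return {"instance_type": inst.instance_type, "instance_size": inst.instance_size}
```

For example, `hw_kwargs(pick_instance(80))` gives the `p4de` / `xlarge` pair used in the next cell. `single_gpu_only` defaults to `True` because of the sharding caveat in the note above.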

In [6]:
hw_dict = dict(
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_type="p4de",
    instance_size="xlarge",
)

In [7]:
tgi_env = {
    "MAX_BATCH_PREFILL_TOKENS": "2048",
    "MAX_INPUT_LENGTH": "2000",
    "TRUST_REMOTE_CODE": "true",
    "QUANTIZE": "bitsandbytes",
    "MODEL_ID": "/repository",
}
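A quick sanity check on these settings can catch a misconfiguration before you pay for a container that fails to start. The sketch below is a hypothetical helper (`check_tgi_env` is my own name, not a TGI or `huggingface_hub` function). It rests on two assumptions: that environment values should be passed as strings, as they are in `tgi_env` above, and that TGI expects `MAX_INPUT_LENGTH` to be no larger than `MAX_BATCH_PREFILL_TOKENS`, which is my understanding of the launcher's validation.

```python
def check_tgi_env(env: dict) -> list:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []

    # Assumption: environment variables are passed to the container as strings.
    for key, value in env.items():
        if not isinstance(value, str):
            problems.append(f"{key} should be a string, got {type(value).__name__}")

    max_input = env.get("MAX_INPUT_LENGTH")
    max_prefill = env.get("MAX_BATCH_PREFILL_TOKENS")
    if max_input is not None and max_prefill is not None:
        try:
            # Assumption: the prefill budget must be able to hold one full-length input.
            if int(max_input) > int(max_prefill):
                problems.append(
                    "MAX_INPUT_LENGTH must not exceed MAX_BATCH_PREFILL_TOKENS"
                )
        except (TypeError, ValueError):
            problems.append("MAX_INPUT_LENGTH and MAX_BATCH_PREFILL_TOKENS must be integers")

    return problems


# Same values as the notebook's tgi_env, repeated here so the block is self-contained.
example_env = {
    "MAX_BATCH_PREFILL_TOKENS": "2048",
    "MAX_INPUT_LENGTH": "2000",
    "TRUST_REMOTE_CODE": "true",
    "QUANTIZE": "bitsandbytes",
    "MODEL_ID": "/repository",
}
print(check_tgi_env(example_env))
```

The helper only inspects the dictionary, so it is safe to run before any endpoint exists.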

A couple of notes on my choices in the next cell:
- I used `derek-thomas/jais-13b-chat-hf` because that repo has SafeTensors merged, which leads to faster loading of the TGI container
- I'm using the latest TGI container as of the time of writing (1.3.4)
- `min_replica=0` allows [zero scaling](https://huggingface.co/docs/inference-endpoints/autoscaling#scaling-to-0), which is really useful for your wallet. Think through whether this makes sense for your use case, though, as there will be loading times whenever the endpoint scales back up
- `max_replica` allows you to handle high throughput. Make sure you read through the [docs](https://huggingface.co/docs/inference-endpoints/autoscaling#scaling-criteria) to understand how this scales

In [8]:
endpoint = create_inference_endpoint(
    ENDPOINT_NAME,
    repository="derek-thomas/jais-13b-chat-hf",
    framework="pytorch",
    task="text-generation",
    **hw_dict,
    min_replica=0,
    max_replica=1,
    namespace=namespace,
    custom_image={
        "health_route": "/health",
        "env": tgi_env,
        "url": "ghcr.io/huggingface/text-generation-inference:1.3.4",
    },
)

## Wait until it's running

In [None]:
%%time
endpoint.wait()

In [10]:
endpoint.client.text_generation("""
### Instruction: What is the sentiment of the input?
### Examples
I wish the screen was bigger - Negative
I hate the battery - Negative
I love the default applications - Positive
### Input
I am happy with this purchase - 
### Response
""",
 do_sample=True,
 repetition_penalty=1.2,
 top_p=0.9,
 temperature=0.3)

'POSITIVE'
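The prompt above follows a simple few-shot layout: an instruction, labelled examples, the input, and an empty response slot. If you want to reuse that layout for other inputs, a small builder along these lines may help. `build_sentiment_prompt` is a hypothetical helper of mine; it only reproduces the string format used in the cell above and makes no claim about what prompt format jais works best with.

```python
def build_sentiment_prompt(examples, text):
    """Assemble a few-shot prompt in the same layout as the cell above.

    `examples` is a list of (sentence, label) pairs; `text` is the new input.
    """
    lines = [
        "### Instruction: What is the sentiment of the input?",
        "### Examples",
    ]
    lines += [f"{sentence} - {label}" for sentence, label in examples]
    # The trailing "" makes the prompt end with a newline after "### Response".
    lines += ["### Input", f"{text} - ", "### Response", ""]
    # The leading newline mirrors the triple-quoted string in the notebook.
    return "\n" + "\n".join(lines)


prompt = build_sentiment_prompt(
    [
        ("I wish the screen was bigger", "Negative"),
        ("I hate the battery", "Negative"),
    ],
    "I am happy with this purchase",
)
print(prompt)

# With a running endpoint, the prompt would be passed as in the cell above:
# endpoint.client.text_generation(prompt, do_sample=True, temperature=0.3)
```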

## Pause Inference Endpoint
Now that we have finished, let's pause the endpoint so we don't incur any extra charges. This will also allow us to analyze the cost.

In [11]:
endpoint = endpoint.pause()

print(f"Endpoint Status: {endpoint.status}")

Endpoint Status: paused
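If an experiment raises an exception halfway through, the pause cell never runs and the endpoint keeps billing. One way to guard against that, sketched under the assumption that all you need is the `endpoint.pause()` call used above, is a small context manager. `paused_on_exit` is a hypothetical helper name, not something `huggingface_hub` provides.

```python
from contextlib import contextmanager


@contextmanager
def paused_on_exit(endpoint):
    """Yield the endpoint and always call `endpoint.pause()` afterwards.

    Works with any object exposing a `pause()` method, such as the
    `endpoint` object used in this notebook.
    """
    try:
        yield endpoint
    finally:
        # Runs on normal exit and when the body raises.
        endpoint.pause()


# Usage sketch (needs a running endpoint, so it is left commented out):
# with paused_on_exit(endpoint) as ep:
#     ep.client.text_generation("...")
```

The exception from the body still propagates, so failures stay visible; the helper only makes sure the pause request is sent first.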


## Analyze Usage
1. Go to your `dashboard_url` printed below
1. Check the dashboard
1. Analyze the Usage & Cost tab

In [None]:
dashboard_url = f'https://ui.endpoints.huggingface.co/{namespace}/endpoints/{ENDPOINT_NAME}/analytics'
print(dashboard_url)

## Delete Endpoint

In [13]:
# delete() returns None, so a failure shows up as an exception, not as a return value.
try:
    endpoint.delete()
    print('Endpoint deleted successfully')
except Exception as err:
    print(f'Delete the endpoint manually: {err}')

Endpoint deleted successfully
