## Triton Inference Server

To get optimal inference performance for h2oGPT models, we will be using the [FasterTransformer Backend for Triton](https://github.com/triton-inference-server/fastertransformer_backend/). Make sure to [Set Up GPU Docker](README_DOCKER.md#setup-docker-for-gpus) first.

### Build Docker image for Triton with FasterTransformer backend

```bash
git clone https://github.com/triton-inference-server/fastertransformer_backend.git
cd fastertransformer_backend
git clone https://github.com/NVIDIA/FasterTransformer.git
export WORKSPACE=$(pwd)
export CONTAINER_VERSION=22.12
export TRITON_DOCKER_IMAGE=triton_with_ft:${CONTAINER_VERSION}
docker build --rm \
    --build-arg TRITON_VERSION=${CONTAINER_VERSION} \
    -t ${TRITON_DOCKER_IMAGE} \
    -f docker/Dockerfile \
    .
```

### Create model definition files

We convert the h2oGPT model from [HF to FT format](https://github.com/NVIDIA/FasterTransformer/pull/569):

#### Fetch model from Hugging Face

```bash
export MODEL=h2ogpt-oig-oasst1-512-6_9b
if [ ! -d ${MODEL} ]; then
    git lfs clone https://huggingface.co/h2oai/${MODEL}
fi
```

If `git lfs` fails, make sure to install it first. For Ubuntu:

```bash
sudo apt-get install git-lfs
```

#### Convert to FasterTransformer format

```bash
export WORKSPACE=$(pwd)
export CONTAINER_VERSION=22.12
export TRITON_DOCKER_IMAGE=triton_with_ft:${CONTAINER_VERSION}

# Go into Docker
docker run -it --rm --runtime=nvidia --shm-size=1g \
    --ulimit memlock=-1 -v ${WORKSPACE}:${WORKSPACE} \
    -e CUDA_VISIBLE_DEVICES=0 \
    -e MODEL=${MODEL} \
    -e WORKSPACE=${WORKSPACE} \
    -w ${WORKSPACE} ${TRITON_DOCKER_IMAGE} bash

export PYTHONPATH=${WORKSPACE}/FasterTransformer/:$PYTHONPATH
python3 ${WORKSPACE}/FasterTransformer/examples/pytorch/gptneox/utils/huggingface_gptneox_convert.py \
    -i_g 1 \
    -m_n gptneox \
    -i ${WORKSPACE}/${MODEL} \
    -o ${WORKSPACE}/FT-${MODEL}
```

#### Test the FasterTransformer model

FIXME

```bash
echo "Hi, who are you?" > gptneox_input
echo "And you are?" >> gptneox_input
python3 ${WORKSPACE}/FasterTransformer/examples/pytorch/gptneox/gptneox_example.py \
    --ckpt_path ${WORKSPACE}/FT-${MODEL}/1-gpu \
    --tokenizer_path ${WORKSPACE}/${MODEL} \
    --sample_input_file gptneox_input
```

#### Update Triton configuration files

Fix a typo in the example:

```bash
sed -i -e 's@postprocessing@preprocessing@' all_models/gptneox/preprocessing/config.pbtxt
```

Update the path to the PyTorch model, and set it to use 1 GPU:

```bash
sed -i -e "s@/workspace/ft/models/ft/gptneox/@${WORKSPACE}/FT-${MODEL}/1-gpu@" all_models/gptneox/fastertransformer/config.pbtxt
sed -i -e 's@string_value: "2"@string_value: "1"@' all_models/gptneox/fastertransformer/config.pbtxt
```

#### Launch Triton

```bash
CUDA_VISIBLE_DEVICES=0 mpirun -n 1 \
    --allow-run-as-root /opt/tritonserver/bin/tritonserver \
    --model-repository=${WORKSPACE}/all_models/gptneox/ &
```

Now you should see something like this:

```bash
+-------------------+---------+--------+
| Model             | Version | Status |
+-------------------+---------+--------+
| ensemble          | 1       | READY  |
| fastertransformer | 1       | READY  |
| postprocessing    | 1       | READY  |
| preprocessing     | 1       | READY  |
+-------------------+---------+--------+
```

which means the pipeline is ready to make predictions!
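Before running the client tests below, you can optionally confirm that the server is responding. This is a minimal sketch that assumes Triton is running with its default HTTP port (8000) on the same host; adjust the host/port if you changed them:

```bash
# Server-level readiness (standard KServe v2 endpoint exposed by Triton)
curl -sf localhost:8000/v2/health/ready && echo "server is ready"

# Readiness of the full ensemble pipeline specifically
curl -sf localhost:8000/v2/models/ensemble/ready && echo "ensemble is ready"
```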
### Run client test

Let's test the endpoint:

```bash
python3 ${WORKSPACE}/tools/gpt/identity_test.py
```

And now the end-to-end test. We first have to fix a bug in the inputs for postprocessing:

```bash
sed -i -e 's@prepare_tensor("RESPONSE_INPUT_LENGTHS", output2, FLAGS.protocol)@prepare_tensor("sequence_length", output1, FLAGS.protocol)@' ${WORKSPACE}/tools/gpt/end_to_end_test.py
```

```bash
python3 ${WORKSPACE}/tools/gpt/end_to_end_test.py
```
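When you are done testing, you can peek at the server's Prometheus metrics and then stop the background Triton process. This is a minimal sketch that assumes Triton's default metrics port (8002) and that this is the only `tritonserver` instance running on the host:

```bash
# Inspect inference counters and GPU stats (default metrics port 8002)
curl -s localhost:8002/metrics | head -n 20

# Stop the Triton server that was launched in the background with mpirun.
# Assumes no other tritonserver processes are running on this machine.
pkill -f tritonserver
```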