---
license: llama2
---

## Description

This model is intended to be used as an accelerator for llama 13B (chat).


The underlying implementation of the paged-attention KV cache and the speculator is in the `fms-extras` repository: https://github.com/foundation-model-stack/fms-extras

A production implementation built on `fms-extras` is available at https://github.com/tdoublep/text-generation-inference/tree/speculative-decoding


To try this out in a production-like environment, use the pre-built Docker image:

```bash
docker pull docker-eu-public.artifactory.swg-devops.com/res-zrl-snap-docker-local/tgis-os:spec.7
docker run -d --rm --gpus all \
    --name my-tgis-server \
    -p 8033:8033 \
    -v /path/to/all/models:/models \
    -e MODEL_NAME=/models/model_weights/llama/13B-F \
    -e SPECULATOR_PATH=/models/speculator_weights/llama/13B-F \
    -e FLASH_ATTENTION=true \
    -e PAGED_ATTENTION=true \
    -e DTYPE_STR=float16 \
    docker-eu-public.artifactory.swg-devops.com/res-zrl-snap-docker-local/tgis-os:spec.7

# check logs and wait for "gRPC server started on port 8033" and "HTTP server started on port 3000"
docker logs my-tgis-server -f

# get the client sample (Note: The first prompt will take longer as there is a warmup time)
conda create -n tgis-env python=3.11
conda activate tgis-env
git clone --branch speculative-decoding --single-branch https://github.com/tdoublep/text-generation-inference.git
cd text-generation-inference/integration_tests
make gen-client
pip install . --no-cache-dir
python sample_client.py
```
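Instead of watching the logs by hand, the wait step above can be scripted. The following is a minimal sketch, not part of the TGIS tooling: `wait_for_log` and `POLL_INTERVAL` are hypothetical names introduced here, and the log message is the one quoted in the comment above.

```bash
# Hypothetical helper: poll a command's output until a pattern appears,
# or give up after a fixed number of attempts.
# Usage: wait_for_log PATTERN MAX_ATTEMPTS COMMAND [ARGS...]
wait_for_log() {
    pattern="$1"
    max_attempts="$2"
    shift 2
    attempt=0
    while [ "$attempt" -lt "$max_attempts" ]; do
        # Merge stderr into stdout, since container logs may arrive on either stream.
        if "$@" 2>&1 | grep -q -- "$pattern"; then
            return 0
        fi
        attempt=$((attempt + 1))
        # POLL_INTERVAL is an assumed knob (seconds between attempts), default 5.
        sleep "${POLL_INTERVAL:-5}"
    done
    return 1
}
```

For example, `wait_for_log "gRPC server started on port 8033" 120 docker logs my-tgis-server && python sample_client.py` would run the sample client only once the server reports that it has started.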

To try this out with the fms-native compiled model, please execute the following:

#### batch_size=1 (compile + cudagraphs)

```bash
git clone https://github.com/foundation-model-stack/fms-extras
(cd fms-extras && pip install -e .)
pip install transformers==4.35.0 sentencepiece numpy
python fms-extras/scripts/paged_speculative_inference.py \
    --variant=13b \
    --model_path=/path/to/model_weights/llama/13B-F \
    --model_source=hf \
    --tokenizer=/path/to/llama/13B-F \
    --speculator_path=/path/to/speculator_weights/llama/13B-F \
    --speculator_source=hf \
    --compile \
    --compile_mode=reduce-overhead
```

#### batch_size=1 (compile)

```bash
git clone https://github.com/foundation-model-stack/fms-extras
(cd fms-extras && pip install -e .)
pip install transformers==4.35.0 sentencepiece numpy
python fms-extras/scripts/paged_speculative_inference.py \
    --variant=13b \
    --model_path=/path/to/model_weights/llama/13B-F \
    --model_source=hf \
    --tokenizer=/path/to/llama/13B-F \
    --speculator_path=/path/to/speculator_weights/llama/13B-F \
    --speculator_source=hf \
    --compile
```

#### batch_size=4 (compile)

```bash
git clone https://github.com/foundation-model-stack/fms-extras
(cd fms-extras && pip install -e .)
pip install transformers==4.35.0 sentencepiece numpy
python fms-extras/scripts/paged_speculative_inference.py \
    --variant=13b \
    --model_path=/path/to/model_weights/llama/13B-F \
    --model_source=hf \
    --tokenizer=/path/to/llama/13B-F \
    --speculator_path=/path/to/speculator_weights/llama/13B-F \
    --speculator_source=hf \
    --batch_input \
    --compile
```
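
The three invocations above differ only in their trailing flags, so the shared arguments can be factored into a small wrapper. This is a sketch under assumptions: `run_speculative`, `DRY_RUN`, and the `MODEL_PATH`, `TOKENIZER_PATH`, and `SPECULATOR_PATH` variables are hypothetical names introduced here, and the default paths are the same placeholders used above.

```bash
# Placeholder paths copied from the commands above; override them in your environment.
MODEL_PATH="${MODEL_PATH:-/path/to/model_weights/llama/13B-F}"
TOKENIZER_PATH="${TOKENIZER_PATH:-/path/to/llama/13B-F}"
SPECULATOR_PATH="${SPECULATOR_PATH:-/path/to/speculator_weights/llama/13B-F}"

# Hypothetical wrapper: runs the fms-extras script with the shared arguments
# and appends whatever extra flags are passed in.
run_speculative() {
    # Set DRY_RUN=1 to print the command instead of executing it.
    ${DRY_RUN:+echo} python fms-extras/scripts/paged_speculative_inference.py \
        --variant=13b \
        --model_path="$MODEL_PATH" \
        --model_source=hf \
        --tokenizer="$TOKENIZER_PATH" \
        --speculator_path="$SPECULATOR_PATH" \
        --speculator_source=hf \
        "$@"
}
```

With this in place, the three variants become `run_speculative --compile --compile_mode=reduce-overhead`, `run_speculative --compile`, and `run_speculative --batch_input --compile`.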