JRosenkranz committed · Commit 1575adf · 1 Parent(s): b1c3a10

moving model weights to ibm-granite org

README.md CHANGED
@@ -1,3 +1,166 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: llama2
+ ---
+
+ ## Installation from source
+
+ ```bash
+ git clone https://github.com/foundation-model-stack/fms-extras
+ cd fms-extras
+ pip install -e .
+ ```
+
+
+ ## Description
+
+ This model is intended to be used as an accelerator for [granite 7B (instruct lab)](https://huggingface.co/instructlab/granite-7b-lab) and takes inspiration
+ from the Medusa speculative decoding architecture. The accelerator modifies the Medusa MLP into a multi-stage MLP, where each stage predicts
+ a single token of the draft based on both a state vector and the token sampled
+ in the prior stage (the base model can be considered stage 0).
+ The state vector from the base model provides contextual information to the accelerator,
+ while conditioning on previously sampled tokens allows it to produce higher-quality draft n-grams.
+
+ Note: The underlying MLP speculator is a generic architecture that can be trained with any generative model to accelerate inference.
+ Training is lightweight and can be completed in only a few days, depending on base model size and speed.
+
+ ## Repository Links
+
+ 1. [Paged Attention KV-Cache / Speculator](https://github.com/foundation-model-stack/fms-extras)
+ 2. [Production Server with speculative decoding](https://github.com/IBM/text-generation-inference.git)
+ 3. [Speculator training](https://github.com/foundation-model-stack/fms-fsdp/pull/35)
+
+ ## Samples
+
+ _Note: All samples require an environment with access to CUDA._
+
+ ### Production Server Sample
+
+ *To try this out in a production-like environment, use the pre-built Docker image:*
+
+ #### Setup
+
+ ```bash
+ HF_HUB_CACHE=/hf_hub_cache
+ chmod a+w $HF_HUB_CACHE
+ HF_HUB_TOKEN="your huggingface hub token"
+ TGIS_IMAGE=quay.io/wxpe/text-gen-server:main.ee927a4
+
+ docker pull $TGIS_IMAGE
+
+ # optionally download granite-7b-lab if the weights do not already exist
+ docker run --rm \
+     -v $HF_HUB_CACHE:/models \
+     -e HF_HUB_CACHE=/models \
+     -e TRANSFORMERS_CACHE=/models \
+     $TGIS_IMAGE \
+     text-generation-server download-weights \
+     instructlab/granite-7b-lab \
+     --token $HF_HUB_TOKEN
+
+ # optionally download the speculator model if the weights do not already exist
+ docker run --rm \
+     -v $HF_HUB_CACHE:/models \
+     -e HF_HUB_CACHE=/models \
+     -e TRANSFORMERS_CACHE=/models \
+     $TGIS_IMAGE \
+     text-generation-server download-weights \
+     ibm/granite-7b-lab-accelerator \
+     --token $HF_HUB_TOKEN
+
+ # note: if the weights were downloaded separately (not with the above commands), please place them in the HF_HUB_CACHE directory and refer to them with /models/<model_name>
+ docker run -d --rm --gpus all \
+     --name my-tgis-server \
+     -p 8033:8033 \
+     -v $HF_HUB_CACHE:/models \
+     -e HF_HUB_CACHE=/models \
+     -e TRANSFORMERS_CACHE=/models \
+     -e MODEL_NAME=instructlab/granite-7b-lab \
+     -e SPECULATOR_NAME=ibm/granite-7b-lab-accelerator \
+     -e FLASH_ATTENTION=true \
+     -e PAGED_ATTENTION=true \
+     -e DTYPE=float16 \
+     $TGIS_IMAGE
+
+ # check logs and wait for "gRPC server started on port 8033" and "HTTP server started on port 3000"
+ docker logs my-tgis-server -f
+
+ # get the client sample (Note: The first prompt will take longer as there is a warmup time)
+ conda create -n tgis-client-env python=3.11
+ conda activate tgis-client-env
+ git clone --branch main --single-branch https://github.com/IBM/text-generation-inference.git
+ cd text-generation-inference/integration_tests
+ make gen-client
+ pip install . --no-cache-dir
+ ```
+
+ #### Run Sample
+
+ ```bash
+ python sample_client.py
+ ```
+
+ _Note: The first prompt may be slower because of a brief warmup._
+
+ ### Minimal Sample
+
+ *To try this out with the fms-native compiled model, please execute the following:*
+
+ #### Install
+
+ ```bash
+ git clone --branch ibm_7b_instruct_lab_variant --single-branch https://github.com/JRosenkranz/fms-extras.git
+ (cd fms-extras && pip install -e .)
+ pip install transformers==4.35.0 sentencepiece numpy
+ ```
+
+ #### Run Sample
+
+ ##### batch_size=1 (compile + cudagraphs)
+
+ ```bash
+ MODEL_PATH=/path/to/instructlab/granite-7b-lab
+ python fms-extras/scripts/paged_speculative_inference.py \
+     --variant=7b.ibm_instruct_lab \
+     --model_path=$MODEL_PATH \
+     --model_source=hf \
+     --tokenizer=$MODEL_PATH \
+     --speculator_path=ibm/granite-7b-lab-accelerator \
+     --speculator_source=hf \
+     --speculator_variant=1_4b \
+     --top_k_tokens_per_head=4,3,2,2,2 \
+     --compile \
+     --compile_mode=reduce-overhead
+ ```
+
+ ##### batch_size=1 (compile)
+
+ ```bash
+ MODEL_PATH=/path/to/instructlab/granite-7b-lab
+ python fms-extras/scripts/paged_speculative_inference.py \
+     --variant=7b.ibm_instruct_lab \
+     --model_path=$MODEL_PATH \
+     --model_source=hf \
+     --tokenizer=$MODEL_PATH \
+     --speculator_path=ibm/granite-7b-lab-accelerator \
+     --speculator_source=hf \
+     --speculator_variant=1_4b \
+     --top_k_tokens_per_head=4,3,2,2,2 \
+     --compile
+ ```
+
+ ##### batch_size=4 (compile)
+
+ ```bash
+ MODEL_PATH=/path/to/instructlab/granite-7b-lab
+ python fms-extras/scripts/paged_speculative_inference.py \
+     --variant=7b.ibm_instruct_lab \
+     --model_path=$MODEL_PATH \
+     --model_source=hf \
+     --tokenizer=$MODEL_PATH \
+     --speculator_path=ibm/granite-7b-lab-accelerator \
+     --speculator_source=hf \
+     --speculator_variant=1_4b \
+     --top_k_tokens_per_head=4,3,2,2,2 \
+     --batch_input \
+     --compile
+ ```
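
The Description in README.md says each speculator stage predicts one draft token from a state vector and the token chosen in the prior stage. The toy sketch below walks through that loop in plain Python. Everything in it is an assumption made for illustration: the tiny dimensions, the random weights, the tanh activation, the way state and token embedding are added together, the greedy pick in place of sampling, and the names `make_stage` and `draft` are all hypothetical and not taken from fms-extras.

```python
import math
import random

EMB_DIM = 8      # toy size; config.json uses emb_dim=4096
VOCAB = 16       # toy vocabulary; config.json uses vocab_size=32008
N_PREDICT = 5    # number of stages, matching n_predict in config.json


def rand_matrix(rows, cols, rng):
    """Random weights standing in for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def make_stage(rng):
    """One hypothetical stage: token embedding, state projection, output head."""
    return {
        "emb": rand_matrix(VOCAB, EMB_DIM, rng),    # one row per token id
        "proj": rand_matrix(EMB_DIM, EMB_DIM, rng),
        "head": rand_matrix(VOCAB, EMB_DIM, rng),
    }


def draft(stages, base_state, base_token):
    """Greedy draft: each stage conditions on the prior state and prior token."""
    state, token, tokens = base_state, base_token, []
    for stage in stages:
        projected = matvec(stage["proj"], state)
        # Combine the running state with the embedding of the prior token.
        state = [math.tanh(p + e) for p, e in zip(projected, stage["emb"][token])]
        logits = matvec(stage["head"], state)
        token = max(range(VOCAB), key=logits.__getitem__)  # greedy pick
        tokens.append(token)
    return tokens


rng = random.Random(0)
stages = [make_stage(rng) for _ in range(N_PREDICT)]
base_state = [rng.uniform(-1, 1) for _ in range(EMB_DIM)]
draft_tokens = draft(stages, base_state, base_token=3)
print(draft_tokens)
```

The base model plays the role of stage 0 here: it supplies `base_state` and `base_token`, and the five stages extend them into a five-token draft that the base model would then verify.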
added_tokens.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "<|assistant|>": 32003,
+   "<|endoftext|>": 32000,
+   "<|pad|>": 32001,
+   "<|system|>": 32004,
+   "<|user|>": 32002
+ }
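
As a quick consistency check, the added token IDs can be compared against the `vocab_size` of 32008 in config.json. The sketch below assumes the base tokenizer occupies IDs 0 to 31999, which this diff does not state; why three slots remain above the last added token is not explained in the commit either, so the code only counts them.

```python
added_tokens = {
    "<|assistant|>": 32003,
    "<|endoftext|>": 32000,
    "<|pad|>": 32001,
    "<|system|>": 32004,
    "<|user|>": 32002,
}
VOCAB_SIZE = 32008          # from config.json
BASE_VOCAB_SIZE = 32000     # assumption: base tokenizer uses IDs 0..31999

# Invert the mapping so tokens can be looked up by ID.
id_to_token = {token_id: token for token, token_id in added_tokens.items()}
ids = sorted(id_to_token)

# The added tokens form one contiguous block directly after the base vocabulary.
is_contiguous = ids == list(range(BASE_VOCAB_SIZE, BASE_VOCAB_SIZE + len(ids)))

# Slots between the last added token and vocab_size that have no token assigned.
unassigned_slots = VOCAB_SIZE - (max(ids) + 1)

print(is_contiguous, unassigned_slots)
```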
config.json ADDED
@@ -0,0 +1,20 @@
+ {
+   "architectures": [
+     "MLPSpeculatorPreTrainedModel"
+   ],
+   "emb_dim": 4096,
+   "inner_dim": 4096,
+   "model_type": "mlp_speculator",
+   "n_candidates": 5,
+   "n_predict": 5,
+   "top_k_tokens_per_head": [
+     4,
+     3,
+     2,
+     2,
+     2
+   ],
+   "torch_dtype": "float16",
+   "transformers_version": "4.38.2",
+   "vocab_size": 32008
+ }
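
config.json sets `top_k_tokens_per_head` to `[4, 3, 2, 2, 2]` and `n_candidates` to 5. One plausible reading, offered here as an assumption rather than as the documented behaviour of fms-extras or the server, is that head i contributes its top-k tokens, the combinations form candidate drafts, and only the `n_candidates` highest-scoring drafts go to the base model for verification. The sketch works through that arithmetic with made-up per-head probabilities; `top_candidates` is a hypothetical helper.

```python
import itertools
import math

top_k_tokens_per_head = [4, 3, 2, 2, 2]  # from config.json
n_candidates = 5                          # from config.json

# Number of distinct drafts if every combination of per-head choices is expanded.
total_combinations = math.prod(top_k_tokens_per_head)
print(total_combinations)  # 4 * 3 * 2 * 2 * 2 = 96


def top_candidates(per_head_probs, n):
    """Return the n rank tuples with the highest joint probability.

    per_head_probs[i][j] is a made-up probability for the j-th ranked token
    of head i; heads are scored independently, which is a simplification.
    """
    ranges = [range(len(probs)) for probs in per_head_probs]
    scored = [
        (math.prod(per_head_probs[h][j] for h, j in enumerate(combo)), combo)
        for combo in itertools.product(*ranges)
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [combo for _, combo in scored[:n]]


# Illustrative probabilities, strictly decreasing with rank within each head.
per_head_probs = [
    [0.5 / (rank + 1) for rank in range(k)] for k in top_k_tokens_per_head
]
best = top_candidates(per_head_probs, n_candidates)
print(best[0])  # the draft built from every head's top-ranked token
```

Under this reading, 96 possible drafts are narrowed to 5 before verification, which keeps the extra work done by the base model per step bounded.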
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:23f0815eb53a3cee5448c4a46edf38e23f09f42d939420eabbc7ac27384c93ca
+ size 2789951944
special_tokens_map.json ADDED
@@ -0,0 +1,35 @@
+ {
+   "additional_special_tokens": [
+     "<|system|>",
+     "<|user|>",
+     "<|assistant|>"
+   ],
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<|pad|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9e556afd44213b6bd1be2b850ebbbd98f5481437a8021afaf58ee7fb1818d347
+ size 499723
tokenizer_config.json ADDED
@@ -0,0 +1,86 @@
+ {
+   "add_bos_token": false,
+   "add_eos_token": false,
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "32000": {
+       "content": "<|endoftext|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "32001": {
+       "content": "<|pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "32002": {
+       "content": "<|user|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "32003": {
+       "content": "<|assistant|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "32004": {
+       "content": "<|system|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "additional_special_tokens": [
+     "<|system|>",
+     "<|user|>",
+     "<|assistant|>"
+   ],
+   "bos_token": "<s>",
+   "chat_template": "{% for message in messages %}{% if message['role'] == 'system' %}{{'<|system|>'+ '\n' + message['content'] + '\n'}}{% elif message['role'] == 'user' %}{{'<|user|>' + '\n' + message['content'] + '\n'}}{% elif message['role'] == 'assistant' %}{{'<|assistant|>' + '\n' + message['content'] + '<|endoftext|>' + ('' if loop.last else '\n')}}{% endif %}{% endfor %}",
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "<|endoftext|>",
+   "fast_tokenizer": true,
+   "model_max_length": 1000000000000000019884624838656,
+   "pad_token": "<|pad|>",
+   "sp_model_kwargs": {},
+   "tokenizer_class": "LlamaTokenizer",
+   "unk_token": "<unk>",
+   "use_default_system_prompt": false
+ }
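
The `chat_template` above is dense on a single line, so a plain-Python mirror of its logic may make the resulting prompt layout easier to see. This is a sketch for reading purposes only: `render_chat` is a hypothetical helper, not part of transformers, and real code would normally let the tokenizer apply the Jinja template itself.

```python
def render_chat(messages):
    """Plain-Python mirror of the Jinja chat_template in tokenizer_config.json."""
    parts = []
    last = len(messages) - 1
    for i, message in enumerate(messages):
        role, content = message["role"], message["content"]
        if role == "system":
            parts.append("<|system|>\n" + content + "\n")
        elif role == "user":
            parts.append("<|user|>\n" + content + "\n")
        elif role == "assistant":
            # A trailing newline is added unless this is the final message.
            suffix = "" if i == last else "\n"
            parts.append("<|assistant|>\n" + content + "<|endoftext|>" + suffix)
        # Any other role produces no output, as in the template.
    return "".join(parts)


prompt = render_chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is speculative decoding?"},
])
print(repr(prompt))
```

Note that the template has no generation-prompt branch: it renders only the messages it is given and does not append an `<|assistant|>` header after the final user turn.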