JRosenkranz committed
Commit 8ac38c5
1 Parent(s): 972329a

Update README.md

Files changed (1): README.md (+87 -0)

README.md CHANGED
@@ -92,3 +92,90 @@ cd text-generation-inference/integration_tests
  make gen-client
  pip install . --no-cache-dir
  ```
+
+ ### Minimal Sample
+
+ *To try this out with the fms-native compiled model, run the following:*
+
+ #### Install
+
+ ```bash
+ git clone https://github.com/foundation-model-stack/fms-extras
+ (cd fms-extras && pip install -e .)
+ pip install transformers==4.35.0 sentencepiece numpy
+ ```
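+
+ As a quick sanity check (illustrative, not part of the original instructions), you can confirm the editable install and CUDA visibility before running the sample:
+
+ ```bash
+ # Both imports should succeed, and CUDA should report as available on a GPU box.
+ python -c "import fms_extras; print('fms_extras OK')"
+ python -c "import torch; print('cuda available:', torch.cuda.is_available())"
+ ```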
+
+ #### Run Sample
+
+ ```bash
+ python sample_client.py
+ ```
+
+ _Note: the first prompt may be slower, as there is a slight warmup time._
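+
+ A quick, illustrative way to see the warmup amortize (assuming the serving process the client talks to stays up between calls):
+
+ ```bash
+ # Illustrative only: the first invocation pays the warmup cost,
+ # so the second run should complete noticeably faster.
+ time python sample_client.py
+ time python sample_client.py
+ ```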
+
+ ### Minimal Sample (Llama 3)
+
+ #### Install
+
+ ```bash
+ git clone --branch llama_3_variants --single-branch https://github.com/JRosenkranz/fms-extras
+ (cd fms-extras && pip install -e .)
+ pip install transformers==4.35.0 sentencepiece numpy
+ ```
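+
+ The run commands below expect local Llama 3 weights at `MODEL_PATH`. One way to fetch them (assuming `huggingface_hub` is installed and your account has access to the gated `meta-llama` repo; the target directory is just an example):
+
+ ```bash
+ # Download the base model weights to a local directory (example path).
+ pip install "huggingface_hub[cli]"
+ huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct \
+   --local-dir /path/to/llama3/hf/Meta-Llama-3-8B-Instruct
+ ```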
+
+ #### Run Sample
+
+ ##### batch_size=1 (compile + cudagraphs)
+
+ ```bash
+ MODEL_PATH=/path/to/llama3/hf/Meta-Llama-3-8B-Instruct
+ python fms-extras/scripts/paged_speculative_inference.py \
+     --architecture=llama3 \
+     --variant=8b \
+     --model_path=$MODEL_PATH \
+     --model_source=hf \
+     --tokenizer=$MODEL_PATH \
+     --speculator_path=ibm-fms/llama3-8b-accelerator \
+     --speculator_source=hf \
+     --speculator_variant=3_2b \
+     --top_k_tokens_per_head=4,3,2,2 \
+     --compile \
+     --compile_mode=reduce-overhead
+ ```
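+
+ The `reduce-overhead` mode here is presumably forwarded to `torch.compile(mode="reduce-overhead")`, which in PyTorch uses CUDA graphs to cut per-step kernel-launch overhead; as with the sample client above, expect extra warmup on the first prompt while the graphs are captured.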
+
+ ##### batch_size=1 (compile)
+
+ ```bash
+ MODEL_PATH=/path/to/llama3/hf/Meta-Llama-3-8B-Instruct
+ python fms-extras/scripts/paged_speculative_inference.py \
+     --architecture=llama3 \
+     --variant=8b \
+     --model_path=$MODEL_PATH \
+     --model_source=hf \
+     --tokenizer=$MODEL_PATH \
+     --speculator_path=ibm-fms/llama3-8b-accelerator \
+     --speculator_source=hf \
+     --speculator_variant=3_2b \
+     --top_k_tokens_per_head=4,3,2,2 \
+     --compile
+ ```
+
+ ##### batch_size=4 (compile)
+
+ ```bash
+ MODEL_PATH=/path/to/llama3/hf/Meta-Llama-3-8B-Instruct
+ python fms-extras/scripts/paged_speculative_inference.py \
+     --architecture=llama3 \
+     --variant=8b \
+     --model_path=$MODEL_PATH \
+     --model_source=hf \
+     --tokenizer=$MODEL_PATH \
+     --speculator_path=ibm-fms/llama3-8b-accelerator \
+     --speculator_source=hf \
+     --speculator_variant=3_2b \
+     --top_k_tokens_per_head=4,3,2,2 \
+     --batch_input \
+     --compile
+ ```
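+
+ With batch_size=4, the paged KV-cache footprint grows with the batch, so it can help to watch GPU memory while the run executes and scale the batch to fit your card (an optional, illustrative check):
+
+ ```bash
+ # In a second terminal: refresh GPU utilization/memory once per second.
+ watch -n 1 nvidia-smi
+ ```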
+
+ Sample code can be found [here](https://github.com/foundation-model-stack/fms-extras/pull/24).