|
DEPLOY_TEXT = f""" |
|
|
|
# 🚀 Deployment Tips |
|
|
|
A collection of powerful models is valuable, but ultimately, you need to be able to use them effectively. |
|
This tab is dedicated to providing guidance and code snippets for performing inference with leaderboard models on Intel platforms. |
|
|
|
Below is a table of open-source software options for inference, along with the Intel hardware platforms each one supports. A 🚀 indicates that inference with the associated software package is supported on the hardware. We hope this information helps you choose the best option for your specific use case. Happy building!
|
|
|
<div style="display: flex; justify-content: center;"> |
|
<table border="1"> |
|
<tr> |
|
<th>Inference Software</th> |
|
<th>Gaudi</th> |
|
<th>Xeon</th> |
|
<th>GPU Max</th> |
|
<th>Arc GPU</th> |
|
<th>Core Ultra</th> |
|
</tr> |
|
<tr> |
|
<td>Optimum Habana</td> |
|
<td>🚀</td> |
|
<td></td> |
|
<td></td> |
|
<td></td> |
|
<td></td> |
|
</tr> |
|
<tr> |
|
<td>Intel Extension for PyTorch</td> |
|
<td></td> |
|
<td>🚀</td> |
|
<td>🚀</td> |
|
<td>🚀</td> |
|
<td></td> |
|
</tr> |
|
<tr> |
|
<td>Intel Extension for Transformers</td> |
|
<td></td> |
|
<td>🚀</td> |
|
<td>🚀</td> |
|
<td>🚀</td> |
|
<td></td> |
|
</tr> |
|
<tr> |
|
<td>OpenVINO</td> |
|
<td></td> |
|
<td>🚀</td> |
|
<td>🚀</td> |
|
<td>🚀</td> |
|
<td>🚀</td> |
|
</tr> |
|
<tr> |
|
<td>BigDL</td> |
|
<td></td> |
|
<td>🚀</td> |
|
<td>🚀</td> |
|
<td>🚀</td> |
|
<td>🚀</td> |
|
</tr> |
|
<tr> |
|
<td>NPU Acceleration Library</td> |
|
<td></td> |
|
<td></td> |
|
<td></td> |
|
<td></td> |
|
<td>🚀</td> |
|
</tr> |
|
|
<tr> |
|
<td>PyTorch</td> |
|
<td>🚀</td> |
|
<td>🚀</td> |
|
<td></td> |
|
<td></td> |
|
<td>🚀</td> |
|
</tr> |
|
|
<tr> |
|
<td>TensorFlow</td>
|
<td>🚀</td> |
|
<td>🚀</td> |
|
<td></td> |
|
<td></td> |
|
<td>🚀</td> |
|
</tr> |
|
</table> |
|
</div> |
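
The table can also be written as a small lookup, which may be handy when scripting environment setup. This is only an illustrative sketch: `SUPPORT` and `options_for` are names invented here, and the entries simply mirror the table above.

```python
# Support matrix copied from the table above: (software, supported platforms).
SUPPORT = (
    ("Optimum Habana", ("Gaudi",)),
    ("Intel Extension for PyTorch", ("Xeon", "GPU Max", "Arc GPU")),
    ("Intel Extension for Transformers", ("Xeon", "GPU Max", "Arc GPU")),
    ("OpenVINO", ("Xeon", "GPU Max", "Arc GPU", "Core Ultra")),
    ("BigDL", ("Xeon", "GPU Max", "Arc GPU", "Core Ultra")),
    ("NPU Acceleration Library", ("Core Ultra",)),
    ("PyTorch", ("Gaudi", "Xeon", "Core Ultra")),
    ("TensorFlow", ("Gaudi", "Xeon", "Core Ultra")),
)


def options_for(platform):
    # Return every software package marked as supported on the given platform.
    return [name for name, platforms in SUPPORT if platform in platforms]


print(options_for("Core Ultra"))
```

Calling `options_for("Gaudi")` in the same way lists the packages marked for Gaudi.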
|
|
|
<hr> |
|
|
|
# Intel® Max Series GPU |
|
The Intel® Data Center GPU Max Series is Intel's highest-performing, highest-density general-purpose discrete GPU. It packs over 100 billion transistors into one package and contains up to 128 Xe Cores, Intel's foundational GPU compute building block. You can learn more about this GPU [here](https://www.intel.com/content/www/us/en/products/details/discrete-gpus/data-center-gpu/max-series.html).
|
|
|
### INT4 Inference (GPU) with Intel Extension for Transformers and Intel Extension for PyTorch
|
Intel® Extension for Transformers is a toolkit for accelerating Transformer-based GenAI and LLM workloads on various Intel platforms, including Intel Gaudi2, Intel CPU, and Intel GPU.
|
|
|
👍 [Intel Extension for Transformers GitHub](https://github.com/intel/intel-extension-for-transformers) |
|
|
|
Intel® Extension for PyTorch* extends PyTorch* with up-to-date feature optimizations for an extra performance boost on Intel hardware. Optimizations take advantage of Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Vector Neural Network Instructions (VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX) on Intel CPUs as well as Intel Xe Matrix Extensions (XMX) AI engines on Intel discrete GPUs. Moreover, Intel® Extension for PyTorch* provides easy GPU acceleration for Intel discrete GPUs through the PyTorch* xpu device.
|
|
|
👍 [Intel Extension for PyTorch GitHub](https://github.com/intel/intel-extension-for-pytorch) |
|
|
|
```python |
|
import torch
import intel_extension_for_pytorch as ipex
from intel_extension_for_transformers.transformers.modeling import AutoModelForCausalLM
from transformers import AutoTokenizer

device_map = "xpu"  # PyTorch device name for Intel GPUs
model_name = "Qwen/Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
prompt = "When winter becomes spring, the flowers..."
inputs = tokenizer(prompt, return_tensors="pt").input_ids.to(device_map)

# Load the model with 4-bit weights directly onto the GPU
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True,
                                             device_map=device_map, load_in_4bit=True)

# Apply Intel Extension for PyTorch optimizations for the xpu device
model = ipex.optimize_transformers(model, inplace=True, dtype=torch.float16, woq=True, device=device_map)

output = model.generate(inputs)
|
``` |
|
<hr> |
|
|
|
# Intel® Xeon® CPUs |
|
The Intel® Xeon® CPUs have the most built-in accelerators of any CPU on the market, including Advanced Matrix Extensions (AMX) to accelerate matrix multiplication in deep learning training and inference. Learn more about the Xeon CPUs [here](https://www.intel.com/content/www/us/en/products/details/processors/xeon.html). |
|
|
|
### Optimum Intel and Intel Extension for PyTorch (no quantization) |
|
🤗 Optimum Intel is the interface between the 🤗 Transformers and Diffusers libraries and the different tools and libraries provided by Intel to accelerate end-to-end pipelines on Intel architectures. |
|
|
|
👍 [Optimum Intel GitHub](https://github.com/huggingface/optimum-intel) |
|
|
|
Requires installing or updating Optimum with the IPEX extra: `pip install --upgrade --upgrade-strategy eager "optimum[ipex]"`
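
A quick way to confirm the installation is to check that the relevant modules can be found. The sketch below uses only the standard library; `missing_packages` is a hypothetical helper, and the three module names are the import names this section assumes (`optimum`, `intel_extension_for_pytorch`, `transformers`).

```python
from importlib.util import find_spec


def missing_packages(names):
    # Return the top-level module names that cannot be found in this environment.
    return [name for name in names if find_spec(name) is None]


required = ("optimum", "intel_extension_for_pytorch", "transformers")
missing = missing_packages(required)
print("Missing packages:", ", ".join(missing) if missing else "none")
```

If anything is reported missing, rerun the install command before trying the inference code.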
|
|
|
```python |
|
from optimum.intel import IPEXModelForCausalLM
from transformers import AutoTokenizer, pipeline

model_id = "gpt2"  # example only; substitute any supported causal LM

model = IPEXModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
results = pipe("A fisherman at sea...")
|
``` |
|
|
|
### Intel® Extension for PyTorch - Mixed Precision (fp32 and bf16) |
|
|
|
```python |
|
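import torch
import intel_extension_for_pytorch as ipex
import transformers

# Placeholder: substitute any causal LM checkpoint name or local path
model_name_or_path = "Intel/neural-chat-7b-v3-1"

tokenizer = transformers.AutoTokenizer.from_pretrained(model_name_or_path)
model = transformers.AutoModelForCausalLM.from_pretrained(model_name_or_path).eval()

dtype = torch.float  # or torch.bfloat16
model = ipex.llm.optimize(model, dtype=dtype)

inputs = tokenizer("When winter becomes spring, the flowers...", return_tensors="pt")

# generation inference loop
with torch.inference_mode():
    output = model.generate(**inputs)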
|
``` |
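
Whether `torch.bfloat16` pays off depends on the CPU. As a rough, Linux-only check, one could inspect the flags the kernel reports in `/proc/cpuinfo`, where `amx_bf16` and `avx512_bf16` indicate AMX and AVX-512 bfloat16 support. The helper names `parse_cpu_flags` and `suggest_dtype` are hypothetical, and the rule "use bf16 if either flag is present" is a simplifying assumption rather than guidance from Intel Extension for PyTorch.

```python
from pathlib import Path


def parse_cpu_flags(cpuinfo_text):
    # Collect the tokens of the first "flags" line of /proc/cpuinfo-style text.
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def suggest_dtype(flags):
    # Assumption: prefer bfloat16 only when the CPU advertises native bf16 support.
    if "amx_bf16" in flags or "avx512_bf16" in flags:
        return "bfloat16"
    return "float32"


cpuinfo = Path("/proc/cpuinfo")
text = cpuinfo.read_text() if cpuinfo.exists() else ""
print(suggest_dtype(parse_cpu_flags(text)))
```

The returned string is only a hint; it would still need to be mapped to `torch.bfloat16` or `torch.float` when calling `ipex.llm.optimize`.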
|
|
|
### Intel® Extension for Transformers - INT4 Inference (CPU) |
|
```python |
|
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "Intel/neural-chat-7b-v3-1"
prompt = "When winter becomes spring, the flowers..."

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs)
``` |
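
To see why 4-bit weights matter on memory-constrained systems, a back-of-the-envelope estimate of weight storage helps. The sketch assumes roughly 7 billion parameters (inferred from the "7b" in the model name) and counts weights only. It ignores quantization scales and zero points, activations, and the KV cache, so real memory use will be higher.

```python
def weight_gib(num_params, bits_per_weight):
    # Bytes needed for the weights alone, converted to GiB.
    return num_params * bits_per_weight / 8 / 2**30


num_params = 7_000_000_000  # assumption: approximate size of a "7B" model

for label, bits in (("fp32", 32), ("bf16", 16), ("int4", 4)):
    print(label, round(weight_gib(num_params, bits), 1), "GiB")
```

Under these assumptions, INT4 weights need one eighth of the fp32 footprint and one quarter of the bf16 footprint.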
|
|
|
<hr> |
|
|
|
# Intel® Core Ultra (NPUs and iGPUs) |
|
Intel® Core™ Ultra Processors are optimized for premium thin and powerful laptops. They feature a 3D performance hybrid architecture and advanced AI capabilities, and are available with a built-in Intel® Arc™ GPU. Learn more about Intel Core Ultra [here](https://www.intel.com/content/www/us/en/products/details/processors/core-ultra.html). For now, smaller models such as [TinyLlama-1.1B](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0) are supported.
|
|
|
### Intel® NPU Acceleration Library |
|
The Intel® NPU Acceleration Library is a Python library designed to boost the efficiency of your applications by leveraging the power of the Intel Neural Processing Unit (NPU) to perform high-speed computations on compatible hardware. |
|
|
|
👍 [Intel NPU Acceleration Library GitHub](https://github.com/intel/intel-npu-acceleration-library) |
|
|
|
```python |
|
from transformers import AutoTokenizer, TextStreamer, AutoModelForCausalLM
import intel_npu_acceleration_library
import torch

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

model = AutoModelForCausalLM.from_pretrained(model_id, use_cache=True).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, use_default_system_prompt=True)
tokenizer.pad_token_id = tokenizer.eos_token_id
streamer = TextStreamer(tokenizer, skip_special_tokens=True)

print("Compile model for the NPU")
model = intel_npu_acceleration_library.compile(model, dtype=torch.int8)

query = input("Ask something: ")
prefix = tokenizer(query, return_tensors="pt")["input_ids"]

generation_kwargs = dict(
    input_ids=prefix,
    streamer=streamer,
    do_sample=True,
    top_k=50,
    top_p=0.9,
    max_new_tokens=512,
)

print("Run inference")
_ = model.generate(**generation_kwargs)
|
``` |
|
|
|
### OpenVINO Tooling with Optimum Intel |
|
OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference. |
|
|
|
👍 [OpenVINO GitHub](https://github.com/openvinotoolkit/openvino) |
|
|
|
```python |
|
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer, pipeline

model_id = "helenai/gpt2-ov"
model = OVModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

pipe("In the spring, beautiful flowers bloom...")
``` |
|
|
|
<hr> |
|
|
|
# Intel® Gaudi Accelerators |
|
The Intel Gaudi 2 accelerator is Intel's most capable deep learning chip. You can learn about Gaudi 2 [here](https://habana.ai/products/gaudi2/). |
|
|
|
Intel Gaudi Software supports PyTorch and DeepSpeed for accelerating LLM training and inference. The Intel Gaudi Software graph compiler optimizes the execution of the operations accumulated in the graph (e.g., operator fusion, data layout management, parallelization, pipelining, memory management, and graph-level optimizations).
|
|
|
Optimum Habana provides convenient functionality for various tasks. Below is a command line snippet to run inference on Gaudi with meta-llama/Llama-2-7b-hf.

👍 [Optimum Habana GitHub](https://github.com/huggingface/optimum-habana)

The "run_generation.py" script below can be found [here on GitHub](https://github.com/huggingface/optimum-habana/tree/main/examples/text-generation).
|
|
|
```bash |
|
python run_generation.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --use_hpu_graphs \
    --use_kv_cache \
    --max_new_tokens 100 \
    --do_sample \
    --batch_size 2 \
    --prompt "Hello world" "How are you?"
``` |
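
When the generation script is launched from Python, for example inside a benchmarking harness, the same invocation can be assembled as an argument list. This is a sketch under stated assumptions: `build_command` is a hypothetical helper, the flags are exactly those in the command above, `run_generation.py` is assumed to be in the working directory, and the batch size is assumed to equal the number of prompts, as it does in the example. Nothing is executed here.

```python
import shlex


def build_command(model, prompts, max_new_tokens=100):
    # Mirror the flags from the shell command above; one prompt per batch slot.
    return [
        "python", "run_generation.py",
        "--model_name_or_path", model,
        "--use_hpu_graphs",
        "--use_kv_cache",
        "--max_new_tokens", str(max_new_tokens),
        "--do_sample",
        "--batch_size", str(len(prompts)),
        "--prompt", *prompts,
    ]


cmd = build_command("meta-llama/Llama-2-7b-hf", ["Hello world", "How are you?"])
print(shlex.join(cmd))
```

On a Gaudi machine, the resulting list could be passed to `subprocess.run(cmd, check=True)`. Passing a list rather than a single string avoids shell quoting problems with prompts that contain spaces.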
|
<hr> |
|
|
|
# Intel Arc GPUs |
|
You can learn more about Arc GPUs [here](https://www.intel.com/content/www/us/en/products/details/discrete-gpus/arc.html). |
|
|
|
Code snippets coming soon! |
|
|
|
""" |