---
license: apache-2.0
base_model: tiiuae/falcon-7b
language:
  - en
tags:
  - falcon-7b
  - falcon
  - onnxruntime
  - onnx
  - llm
---

# falcon-7b for ONNX Runtime

## Introduction

This repository hosts an optimized version of **falcon-7b** to accelerate inference with the ONNX Runtime CUDA execution provider.

See the [usage instructions](#usage-example) for how to run inference with this model using the ONNX files hosted in this repository.

## Model Description

- **Developed by:** Technology Innovation Institute (TII)
- **Model type:** Pretrained generative text model
- **License:** Apache 2.0
- **Model Description:** This is a conversion of [falcon-7b](https://huggingface.co/tiiuae/falcon-7b) for [ONNX Runtime](https://github.com/microsoft/onnxruntime) inference with the CUDA execution provider.

## Performance Comparison

#### Latency for token generation

Below is the average latency per generated token for prompts of varying length, measured on an NVIDIA A100-SXM4-80GB GPU:

| Prompt Length | Batch Size | PyTorch 2.1 torch.compile | ONNX Runtime CUDA |
|---------------|------------|---------------------------|-------------------|
| 32            | 1          | 53.64ms                   | 15.68ms           |
| 256           | 1          | 59.55ms                   | 26.05ms           |
| 1024          | 1          | 89.82ms                   | 99.05ms           |
| 2048          | 1          | 208.0ms                   | 227.0ms           |
| 32            | 4          | 70.8ms                    | 19.62ms           |
| 256           | 4          | 78.6ms                    | 81.29ms           |
| 1024          | 4          | 373.7ms                   | 369.6ms           |
| 2048          | 4          | N/A                       | 879.2ms           |

## Usage Example

1. Clone the onnxruntime repository.

   ```shell
   git clone https://github.com/microsoft/onnxruntime
   cd onnxruntime
   ```

2. Install the required dependencies.

   ```shell
   python3 -m pip install -r onnxruntime/python/tools/transformers/models/llama/requirements-cuda.txt
   ```

3. Run inference using a custom model API, or use Hugging Face's `ORTModelForCausalLM` from Optimum as shown below.

   ```python
   from onnxruntime import InferenceSession
   from optimum.onnxruntime import ORTModelForCausalLM
   from transformers import AutoConfig, AutoTokenizer

   # Create an ONNX Runtime session for the downloaded model on the CUDA execution provider.
   sess = InferenceSession("falcon-7b.onnx", providers=["CUDAExecutionProvider"])
   config = AutoConfig.from_pretrained("tiiuae/falcon-7b")
   model = ORTModelForCausalLM(sess, config, use_cache=True, use_io_binding=True)

   tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
   # Move the inputs to the GPU to match the CUDA execution provider.
   inputs = tokenizer("Instruct: What is a fermi paradox?\nOutput:", return_tensors="pt").to("cuda")

   outputs = model.generate(**inputs)
   print(tokenizer.decode(outputs[0], skip_special_tokens=True))
   ```
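
As an alternative to building the `InferenceSession` by hand, the same model can typically be loaded in one step with Optimum's `ORTModelForCausalLM.from_pretrained`. The snippet below is a minimal sketch, assuming the repository files (the ONNX model plus `config.json`) have been downloaded to a local directory; the directory path and `file_name` are placeholders to adjust.

```python
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

# Load the exported model from a local directory (placeholder path); the
# directory is assumed to contain the ONNX file and the model's config.json.
model = ORTModelForCausalLM.from_pretrained(
    "./falcon-7b-onnx",
    file_name="falcon-7b.onnx",        # adjust if the ONNX file is named differently
    provider="CUDAExecutionProvider",
    use_io_binding=True,
)
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")

# Keep the inputs on the GPU to match the CUDA execution provider.
inputs = tokenizer("What is the Fermi paradox?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```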
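
To get a rough local reproduction of the per-token latency figures reported above, a simple timing loop around `generate` is usually enough. This is an illustrative sketch rather than the exact benchmark behind the table: the synthetic prompt, warm-up strategy, and token count are assumptions, and the model directory is the same placeholder used in the loading sketch above.

```python
import time

from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model = ORTModelForCausalLM.from_pretrained(
    "./falcon-7b-onnx",                # placeholder path, see the loading sketch above
    provider="CUDAExecutionProvider",
    use_io_binding=True,
)
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")

def ms_per_token(prompt_len: int, batch_size: int, new_tokens: int = 32) -> float:
    # Build a synthetic prompt of roughly `prompt_len` tokens and replicate it across the batch.
    prompt = "hello " * prompt_len
    inputs = tokenizer(
        [prompt] * batch_size, return_tensors="pt",
        truncation=True, max_length=prompt_len,
    ).to("cuda")

    # One warm-up run so kernel compilation and buffer allocation are not timed.
    model.generate(**inputs, max_new_tokens=new_tokens, do_sample=False)

    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=new_tokens, do_sample=False)
    elapsed = time.perf_counter() - start
    return elapsed / new_tokens * 1000.0  # average milliseconds per generated token

for prompt_len, batch_size in [(32, 1), (256, 1), (32, 4), (256, 4)]:
    print(f"prompt={prompt_len} batch={batch_size}: {ms_per_token(prompt_len, batch_size):.2f} ms/token")
```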