Multi-backend support (non-CUDA backends)
As part of a recent refactoring effort, we will soon offer official multi-backend support. Currently, this feature is available as a preview alpha release, which lets us gather early feedback from users, improve the functionality, and identify bugs.
At present, the Intel CPU and AMD ROCm backends are considered fully functional. The Intel XPU backend has limited functionality and is less mature.
Please refer to the installation instructions for details on installing the backend you intend to test (and hopefully provide feedback on).
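As a quick smoke test once a backend is installed, you can exercise it through the standard `transformers` integration. The snippet below is a minimal sketch, assuming a multi-backend preview build of bitsandbytes is installed and that the backend (CPU, ROCm, or XPU) is picked up through normal device placement; the model is the Llama-2-7b-chat-hf checkpoint used in the benchmarks below.

```python
# Minimal sketch: 4-bit NF4 inference via the standard transformers integration.
# Assumes a multi-backend preview build of bitsandbytes is installed; the
# backend (CPU, ROCm, XPU) is selected through normal device placement.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-chat-hf"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4, as benchmarked below
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```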
Apple Silicon support is planned for Q4 2024. We are actively seeking contributors to help develop a concrete plan, compile a detailed list of requirements, and implement this; due to limited resources, we rely on community contributions for this effort. To discuss further, please share your thoughts in this GitHub discussion and tag @Titus-von-Koeller and @matthewdouglas. Thank you!
Alpha Release
As we are currently in the alpha testing phase, bugs are expected, and performance might not meet expectations. However, this is exactly what we want to learn about from your perspective as an end user!
Please share and discuss your feedback with us here:
- GitHub Discussion: Multi-backend refactor: Alpha release (AMD ROCm ONLY)
- GitHub Discussion: Multi-backend refactor: Alpha release (Intel ONLY)
Thank you for your support!
Benchmarks
Intel
The following performance data was collected on an Intel 4th Gen Xeon (Sapphire Rapids, SPR) platform. The tables below show the speed-up and memory usage of Llama-2-7b-chat-hf across different data types.
Inference (CPU)
| Data Type | BF16 | INT8 | NF4 | FP4 |
|---|---|---|---|---|
| Speed-Up (vs BF16) | 1.0x | 0.6x | 2.3x | 0.03x |
| Memory (GB) | 13.1 | 7.6 | 5.0 | 4.6 |
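For reference, the speed-up rows can in principle be reproduced by timing a fixed-length generation for each data type and dividing the quantized throughput by the BF16 baseline. The sketch below is illustrative only: `tokens_per_second` is a hypothetical helper, and measured numbers will vary with hardware, prompt, and sequence length.

```python
# Hypothetical timing helper for the "Speed-Up (vs BF16)" rows: measure
# generation throughput for one model, then divide by the BF16 baseline.
import time

def tokens_per_second(model, tokenizer, prompt: str, new_tokens: int = 128) -> float:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=new_tokens, do_sample=False)
    elapsed = time.perf_counter() - start
    return new_tokens / elapsed

# speed_up = tokens_per_second(nf4_model, tokenizer, prompt) \
#          / tokens_per_second(bf16_model, tokenizer, prompt)
```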
Fine-Tuning (CPU)
| Data Type | AMP BF16 | INT8 | NF4 | FP4 |
|---|---|---|---|---|
| Speed-Up (vs AMP BF16) | 1.0x | 0.38x | 0.07x | 0.07x |
| Memory (GB) | 40 | 9 | 6.6 | 6.6 |
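For context on what the fine-tuning rows measure, a typical setup quantizes the base model and trains LoRA adapters on top of it (QLoRA-style). The sketch below is one plausible configuration, not the exact benchmark script; the LoRA hyperparameters and target modules are illustrative assumptions.

```python
# Hypothetical fine-tuning sketch: NF4-quantized base model with LoRA adapters
# via peft, mirroring the NF4 column of the table above. Hyperparameters and
# target modules are illustrative, not the benchmark's actual settings.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf", quantization_config=bnb_config
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # illustrative choice for Llama
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# From here, train as usual (e.g., with transformers.Trainer).
```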