Multi-backend support (non-CUDA backends)
As part of a recent refactoring effort, we will soon offer official multi-backend support. Currently, this feature is available as a preview alpha release, which lets us gather early feedback from users, improve the functionality, and identify bugs.
At present, the Intel CPU and AMD ROCm backends are considered fully functional. The Intel XPU backend has limited functionality and is less mature.
Please refer to the installation instructions for details on installing the backend you intend to test (and hopefully provide feedback on).
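As a quick smoke test once a backend is installed, you can exercise it through the standard `transformers` integration. The snippet below is a minimal sketch, assuming a multi-backend preview build of bitsandbytes is installed and that the backend (CPU, ROCm, or XPU) is picked up through normal device placement; the model is the Llama-2-7b-chat-hf checkpoint used in the benchmarks below.

```python
# Minimal sketch: 4-bit NF4 inference via the standard transformers integration.
# Assumes a multi-backend preview build of bitsandbytes is installed; the
# backend (CPU, ROCm, XPU) is selected through normal device placement.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-chat-hf"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4, as benchmarked below
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```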
Apple Silicon support is planned for Q4 2024. We are actively seeking contributors to help develop a concrete plan, compile a detailed list of requirements, and implement this; due to limited resources, we rely on community contributions for this effort. To discuss further, please share your thoughts in this GitHub discussion and tag @Titus-von-Koeller and @matthewdouglas. Thank you!
Alpha Release
As we are currently in the alpha testing phase, bugs are expected, and performance might not meet expectations. However, this is exactly what we want to learn about from your perspective as an end user!
Please share and discuss your feedback with us here:
- GitHub Discussion: Multi-backend refactor: Alpha release (AMD ROCm ONLY)
- GitHub Discussion: Multi-backend refactor: Alpha release (Intel ONLY)
Thank you for your support!
Benchmarks
Intel
The following performance data was collected on an Intel 4th Gen Xeon (Sapphire Rapids, SPR) platform. The tables below show the speed-up and memory usage of Llama-2-7b-chat-hf across different data types.
Inference (CPU)
| Data Type | BF16 | INT8 | NF4 | FP4 |
|---|---|---|---|---|
| Speed-Up (vs BF16) | 1.0x | 0.6x | 2.3x | 0.03x |
| Memory (GB) | 13.1 | 7.6 | 5.0 | 4.6 |
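For reference, the speed-up rows can in principle be reproduced by timing a fixed-length generation for each data type and dividing the quantized throughput by the BF16 baseline. The sketch below is illustrative only: `tokens_per_second` is a hypothetical helper, and measured numbers will vary with hardware, prompt, and sequence length.

```python
# Hypothetical timing helper for the "Speed-Up (vs BF16)" rows: measure
# generation throughput for one model, then divide by the BF16 baseline.
import time

def tokens_per_second(model, tokenizer, prompt: str, new_tokens: int = 128) -> float:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=new_tokens, do_sample=False)
    elapsed = time.perf_counter() - start
    return new_tokens / elapsed

# speed_up = tokens_per_second(nf4_model, tokenizer, prompt) \
#          / tokens_per_second(bf16_model, tokenizer, prompt)
```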
Fine-Tuning (CPU)
| Data Type | AMP BF16 | INT8 | NF4 | FP4 |
|---|---|---|---|---|
| Speed-Up (vs AMP BF16) | 1.0x | 0.38x | 0.07x | 0.07x |
| Memory (GB) | 40 | 9 | 6.6 | 6.6 |
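For context on what the fine-tuning rows measure, a typical setup quantizes the base model and trains LoRA adapters on top of it (QLoRA-style). The sketch below is one plausible configuration, not the exact benchmark script; the LoRA hyperparameters and target modules are illustrative assumptions.

```python
# Hypothetical fine-tuning sketch: NF4-quantized base model with LoRA adapters
# via peft, mirroring the NF4 column of the table above. Hyperparameters and
# target modules are illustrative, not the benchmark's actual settings.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf", quantization_config=bnb_config
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # illustrative choice for Llama
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# From here, train as usual (e.g., with transformers.Trainer).
```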