mfuntowicz (HF staff) committed on
Commit 15b232a · verified · 1 Parent(s): 50fd940

Update Organisation Card

Files changed (1):
  1. README.md +24 -5
README.md CHANGED
@@ -1,10 +1,29 @@
  ---
- title: README
- emoji: 📉
- colorFrom: red
- colorTo: gray
  sdk: static
  pinned: false
  ---

- Edit this `README.md` markdown file to author your organization card.

  ---
+ title: Optimum-Nvidia - TensorRT-LLM optimized inference engines
+ emoji: 🚀
+ colorFrom: green
+ colorTo: yellow
  sdk: static
  pinned: false
  ---

+ [Optimum-Nvidia](https://github.com/huggingface/optimum-nvidia) lets you easily leverage Nvidia's [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) inference tool
+ through a seamless integration that follows the huggingface/transformers API.
+
+ This organisation holds prebuilt TensorRT-LLM compatible engines for various foundational models, which you can use, fork, and deploy to get started as fast as possible and benefit from
+ out-of-the-box peak performance on Nvidia hardware.
+
+ Prebuilt engines are built, whenever possible, with the best options available, and updated models will be pushed as new features land in the TensorRT-LLM repository.
+ This can include (but is not limited to):
+ - Leveraging `float8` quantization on supported hardware (H100/L4/L40/RTX 40xx)
+ - Enabling `float8` or `int8` KV cache
+ - Enabling in-flight batching for dynamic batching when used in combination with the Nvidia Triton Inference Server
+ - Enabling XQA attention kernels
+
+ Current engines target the following Nvidia Tensor Core GPUs; within each repo, the engine for a given GPU lives on a dedicated branch matching the targeted architecture:
+
+ - [4090 (sm_89)](https://huggingface.co/collections/optimum-nvidia/rtx-4090-optimized-tensorrt-llm-models-65e5ebc1240c11001a3e666b)
+
+ Feel free to open discussions and request models to support through the community tab.
+
+ - The Optimum-Nvidia team at 🤗
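As a sketch of how the transformers-style API described above can consume a prebuilt engine: the repo id, branch name, and flags below are illustrative assumptions (not taken from this card), and running it requires a supported Nvidia GPU with TensorRT-LLM installed.

```python
# Hypothetical usage sketch: loading a prebuilt engine from this organisation
# through Optimum-Nvidia's transformers-style API. The repo id, the `revision`
# branch, and `use_fp8` are illustrative; adjust them to the engine and GPU
# you actually target (float8 needs H100/L4/L40/RTX 40xx-class hardware).
from transformers import AutoTokenizer
from optimum.nvidia import AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained(
    "optimum-nvidia/llama-2-7b-chat-hf",  # prebuilt engine repo (illustrative)
    revision="sm_89",                     # branch matching the targeted GPU, e.g. RTX 4090
    use_fp8=True,                         # float8 quantization on supported hardware only
)

# Standard transformers-style generation loop.
inputs = tokenizer("Hello, TensorRT-LLM!", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because the engine is fetched from a GPU-specific branch via `revision`, the same calling code works across the supported architectures by swapping the branch name.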