michaelfeil committed
Commit ef4f5f0
1 Parent(s): 479383c

Create README.md

Files changed (1): README.md +35 -0
README.md ADDED

# Fast-Inference with CTranslate2
Speed up inference by 2x-8x using int8 inference in C++

Quantized version of [declare-lab/flan-alpaca-base](https://huggingface.co/declare-lab/flan-alpaca-base).
```bash
pip install "hf_hub_ctranslate2>=1.0.0" "ctranslate2>=3.13.0"
```
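
For reference, a checkpoint in this format can be produced with CTranslate2's Transformers converter. A minimal sketch (the output directory name is illustrative, not necessarily how this repo was built):

```python
from ctranslate2.converters import TransformersConverter

# Convert the original Transformers checkpoint to the CTranslate2 format
# with int8_float16 weight quantization; "ct2fast-flan-alpaca-base" is
# just an example output directory.
converter = TransformersConverter("declare-lab/flan-alpaca-base")
converter.convert("ct2fast-flan-alpaca-base", quantization="int8_float16")
```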

Checkpoint compatible with [ctranslate2](https://github.com/OpenNMT/CTranslate2) and [hf-hub-ctranslate2](https://github.com/michaelfeil/hf-hub-ctranslate2):
- `compute_type=int8_float16` for `device="cuda"`
- `compute_type=int8` for `device="cpu"`
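
A minimal sketch for choosing between the two at runtime, assuming the only criterion is CUDA availability:

```python
import ctranslate2

# Prefer the GPU int8/float16 kernels when a CUDA device is visible,
# otherwise fall back to plain int8 on the CPU.
if ctranslate2.get_cuda_device_count() > 0:
    device, compute_type = "cuda", "int8_float16"
else:
    device, compute_type = "cpu", "int8"
```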

```python
from hf_hub_ctranslate2 import TranslatorCT2fromHfHub

model_name = "michaelfeil/ct2fast-flan-alpaca-base"
# flan-alpaca-base is an encoder-decoder (T5) model, so the Translator
# wrapper is used (GeneratorCT2fromHfHub is for decoder-only models).
model = TranslatorCT2fromHfHub(
    # load in int8 on CUDA
    model_name_or_path=model_name,
    device="cuda",
    compute_type="int8_float16",
)
outputs = model.generate(
    text=["How do you call a fast Flan-ingo?", "Translate to German: How are you doing?"],
    min_decoding_length=24,
    max_decoding_length=32,
    max_input_length=512,
    beam_size=5,
)
print(outputs)
```
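
The checkpoint can also be loaded with plain `ctranslate2` plus the original model's tokenizer. A minimal sketch, assuming a CPU setup and that `huggingface_hub` and `transformers` are installed:

```python
import ctranslate2
import transformers
from huggingface_hub import snapshot_download

# Download the converted weights and load them with the plain CTranslate2 API.
model_path = snapshot_download("michaelfeil/ct2fast-flan-alpaca-base")
translator = ctranslate2.Translator(model_path, device="cpu", compute_type="int8")

# CTranslate2 consumes token strings, produced by the original tokenizer.
tokenizer = transformers.AutoTokenizer.from_pretrained("declare-lab/flan-alpaca-base")
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode("How do you call a fast Flan-ingo?"))

results = translator.translate_batch([tokens], beam_size=5, max_decoding_length=32)
output_ids = tokenizer.convert_tokens_to_ids(results[0].hypotheses[0])
print(tokenizer.decode(output_ids, skip_special_tokens=True))
```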

# Licence and other remarks:
This is just a quantized version of [declare-lab/flan-alpaca-base](https://huggingface.co/declare-lab/flan-alpaca-base); the licence and other remarks of the original model apply.