Fast Inference with CTranslate2
Speed up inference by 2x-8x using int8 inference in C++.
Quantized version of google/flan-ul2.
pip install "hf_hub_ctranslate2>=2.0.6" "ctranslate2>=3.13.0"
Checkpoint compatible with ctranslate2 and hf-hub-ctranslate2, using:
compute_type=int8_float16 for device="cuda"
compute_type=int8 for device="cpu"
from hf_hub_ctranslate2 import TranslatorCT2fromHfHub, GeneratorCT2fromHfHub
model_name = "michaelfeil/ct2fast-flan-ul2"
model = TranslatorCT2fromHfHub(
    # load in int8 on CUDA
    model_name_or_path=model_name,
    device="cuda",
    compute_type="int8_float16"
)
outputs = model.generate(
    text=["How do you call a fast Flan-ingo?", "Translate to german: How are you doing?"],
    min_decoding_length=24,
    max_decoding_length=32,
    max_input_length=512,
    beam_size=5
)
print(outputs)
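
To run the same example on CPU, only the device and compute_type arguments need to change. As a minimal sketch, the device-to-compute_type pairing listed above can be captured in a small helper. The names RECOMMENDED_COMPUTE_TYPE, pick_compute_type and loader_kwargs are hypothetical and not part of hf_hub_ctranslate2; the helper only builds the keyword arguments and does not load the model.

```python
# Hypothetical helper, not part of hf_hub_ctranslate2 or ctranslate2.
# It encodes the device -> compute_type pairing recommended in this card.
RECOMMENDED_COMPUTE_TYPE = {
    "cuda": "int8_float16",
    "cpu": "int8",
}


def pick_compute_type(device: str) -> str:
    """Return the compute_type this card recommends for the given device."""
    try:
        return RECOMMENDED_COMPUTE_TYPE[device]
    except KeyError:
        raise ValueError(
            f"unsupported device {device!r}; expected 'cuda' or 'cpu'"
        ) from None


def loader_kwargs(model_name: str, device: str) -> dict:
    """Build the keyword arguments used to construct the model in the example."""
    return {
        "model_name_or_path": model_name,
        "device": device,
        "compute_type": pick_compute_type(device),
    }


if __name__ == "__main__":
    # Only builds and prints the arguments; nothing is downloaded here.
    print(loader_kwargs("michaelfeil/ct2fast-flan-ul2", "cpu"))
```

Assuming the constructor accepts the same keyword arguments as in the example above, the result could be passed on as TranslatorCT2fromHfHub(**loader_kwargs(model_name, "cpu")).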
Licence and other remarks:
This is just a quantized version. Licence conditions are intended to be identical to those of the original Hugging Face repo.