---
license: mit
tags:
- ctranslate2
- dolly-v2
---
# Fast-Inference with Ctranslate2
Speed up inference by 2x-8x using int8 inference in C++.

Quantized version of [databricks/dolly-v2-12b](https://huggingface.co/databricks/dolly-v2-12b).

```bash
pip install "hf_hub_ctranslate2>=1.0.0" "ctranslate2>=3.13.0"
```

Checkpoint compatible with [ctranslate2](https://github.com/OpenNMT/CTranslate2) and [hf-hub-ctranslate2](https://github.com/michaelfeil/hf-hub-ctranslate2):
- `compute_type=int8_float16` for `device="cuda"`
- `compute_type=int8` for `device="cpu"`

```python
from hf_hub_ctranslate2 import TranslatorCT2fromHfHub, GeneratorCT2fromHfHub

model_name = "michaelfeil/ct2fast-dolly-v2-12b"
model = GeneratorCT2fromHfHub(  # load in int8 on CUDA
    model_name_or_path=model_name,
    device="cuda",
    compute_type="int8_float16",
)
outputs = model.generate(
    text=["How do you call a fast Flan-ingo?", "User: How are you doing?"],
)
print(outputs)
```

# Licence and other remarks:
This is just a quantized version. Licence conditions are intended to be identical to those of the original Hugging Face repo.

# Usage of Dolly-v2:
According to the instruction pipeline of databricks/dolly-v2-12b:

```python
# from https://huggingface.co/databricks/dolly-v2-12b
def encode_prompt(instruction):
    INSTRUCTION_KEY = "### Instruction:"
    RESPONSE_KEY = "### Response:"
    END_KEY = "### End"
    INTRO_BLURB = (
        "Below is an instruction that describes a task. Write a response that appropriately completes the request."
    )

    # This is the prompt that is used for generating responses using an already
    # trained model. It ends with the response key, where the job of the model
    # is to provide the completion that follows it (i.e. the response itself).
    PROMPT_FOR_GENERATION_FORMAT = """{intro}
{instruction_key}
{instruction}
{response_key}
""".format(
        intro=INTRO_BLURB,
        instruction_key=INSTRUCTION_KEY,
        instruction="{instruction}",
        response_key=RESPONSE_KEY,
    )
    return PROMPT_FOR_GENERATION_FORMAT.format(instruction=instruction)
```
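
As a minimal sketch (not part of the original instruction pipeline), the template above can be combined with the quantized generator from the first example. It assumes the `model` object and the `model.generate(text=...)` call shown earlier; the instruction string is only an illustrative placeholder:

```python
# Sketch: wrap an instruction in the Dolly prompt template, then generate
# with the int8-quantized CTranslate2 model loaded in the first example.
# The instruction text below is a hypothetical placeholder.
prompt = encode_prompt("Explain what int8 quantization does to model weights.")
outputs = model.generate(text=[prompt])
print(outputs)
```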