---
license: mit
---
# Fast-Inference with Ctranslate2
Speed up inference by 2x-8x using int8 inference in C++.

Quantized version of [databricks/dolly-v2-12b](https://huggingface.co/databricks/dolly-v2-12b).
```bash
pip install "hf_hub_ctranslate2>=1.0.0" "ctranslate2>=3.13.0"
```

Checkpoint compatible with [ctranslate2](https://github.com/OpenNMT/CTranslate2) and [hf-hub-ctranslate2](https://github.com/michaelfeil/hf-hub-ctranslate2):
- `compute_type=int8_float16` for `device="cuda"`
- `compute_type=int8` for `device="cpu"`

```python
from hf_hub_ctranslate2 import TranslatorCT2fromHfHub, GeneratorCT2fromHfHub

model_name = "michaelfeil/ct2fast-dolly-v2-12b"
model = GeneratorCT2fromHfHub(
    # load in int8 on CUDA
    model_name_or_path=model_name,
    device="cuda",
    compute_type="int8_float16",
)
outputs = model.generate(
    text=["How do you call a fast Flan-ingo?", "User: How are you doing?"],
)
print(outputs)
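
To run the same checkpoint without a GPU, swap in the second option from the list above; a minimal sketch, assuming the same `GeneratorCT2fromHfHub` interface:
```python
from hf_hub_ctranslate2 import GeneratorCT2fromHfHub

# same checkpoint, loaded in int8 on CPU (no GPU required, but slower)
model_cpu = GeneratorCT2fromHfHub(
    model_name_or_path="michaelfeil/ct2fast-dolly-v2-12b",
    device="cpu",
    compute_type="int8",
)
print(model_cpu.generate(text=["User: How are you doing?"]))
```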

# Licence and other remarks:
This is just a quantized version. Licence conditions are intended to be identical to those of the original Hugging Face repo.

# Usage of Dolly-v2:
According to the instruction pipeline of databricks/dolly-v2-12b:
```python
# from https://huggingface.co/databricks/dolly-v2-12b
def encode_prompt(instruction):
    INSTRUCTION_KEY = "### Instruction:"
    RESPONSE_KEY = "### Response:"
    END_KEY = "### End"
    INTRO_BLURB = (
        "Below is an instruction that describes a task. Write a response that appropriately completes the request."
    )

    # This is the prompt that is used for generating responses using an already trained model. It ends with the response
    # key, where the job of the model is to provide the completion that follows it (i.e. the response itself).
    PROMPT_FOR_GENERATION_FORMAT = """{intro}
{instruction_key}
{instruction}
{response_key}
""".format(
        intro=INTRO_BLURB,
        instruction_key=INSTRUCTION_KEY,
        instruction="{instruction}",
        response_key=RESPONSE_KEY,
    )
    return PROMPT_FOR_GENERATION_FORMAT.format(instruction=instruction)
```
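
Putting the two snippets together, a minimal sketch: it assumes `generate` returns one decoded string per input prompt (as `print(outputs)` above suggests) and that the model emits the `### End` marker when a response is complete.
```python
# build a Dolly-style prompt and generate with the quantized model from above
prompt = encode_prompt("Explain int8 quantization in one sentence.")
outputs = model.generate(text=[prompt])

# trim everything after the END_KEY marker, if the model emitted it
response = outputs[0].split("### End")[0].strip()
print(response)
```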