michaelfeil committed
Commit ef4f5f0
1 Parent(s): 479383c

Create README.md

Files changed (1): README.md +35 -0
README.md ADDED

# Fast-Inference with CTranslate2
Speed up inference by 2x-8x using int8 inference in C++

Quantized version of [declare-lab/flan-alpaca-base](https://huggingface.co/declare-lab/flan-alpaca-base).
```bash
pip install "hf_hub_ctranslate2>=1.0.0" "ctranslate2>=3.13.0"
```
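
For reference, a checkpoint in this format can be produced with CTranslate2's Transformers converter. A minimal sketch (the output directory name is illustrative, not necessarily how this repo was built):

```python
from ctranslate2.converters import TransformersConverter

# Convert the original Transformers checkpoint to the CTranslate2 format
# with int8_float16 weight quantization; "ct2fast-flan-alpaca-base" is
# just an example output directory.
converter = TransformersConverter("declare-lab/flan-alpaca-base")
converter.convert("ct2fast-flan-alpaca-base", quantization="int8_float16")
```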

Checkpoint compatible with [ctranslate2](https://github.com/OpenNMT/CTranslate2) and [hf-hub-ctranslate2](https://github.com/michaelfeil/hf-hub-ctranslate2):
- `compute_type=int8_float16` for `device="cuda"`
- `compute_type=int8` for `device="cpu"`
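
A minimal sketch for choosing between the two at runtime, assuming the only criterion is CUDA availability:

```python
import ctranslate2

# Prefer the GPU int8/float16 kernels when a CUDA device is visible,
# otherwise fall back to plain int8 on the CPU.
if ctranslate2.get_cuda_device_count() > 0:
    device, compute_type = "cuda", "int8_float16"
else:
    device, compute_type = "cpu", "int8"
```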

```python
from hf_hub_ctranslate2 import TranslatorCT2fromHfHub

model_name = "michaelfeil/ct2fast-flan-alpaca-base"
# flan-alpaca-base is an encoder-decoder (T5) model, so the Translator
# wrapper is used (GeneratorCT2fromHfHub is for decoder-only models).
model = TranslatorCT2fromHfHub(
    # load in int8 on CUDA
    model_name_or_path=model_name,
    device="cuda",
    compute_type="int8_float16",
)
outputs = model.generate(
    text=["How do you call a fast Flan-ingo?", "Translate to German: How are you doing?"],
    min_decoding_length=24,
    max_decoding_length=32,
    max_input_length=512,
    beam_size=5,
)
print(outputs)
```
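
The checkpoint can also be loaded with plain `ctranslate2` plus the original model's tokenizer. A minimal sketch, assuming a CPU setup and that `huggingface_hub` and `transformers` are installed:

```python
import ctranslate2
import transformers
from huggingface_hub import snapshot_download

# Download the converted weights and load them with the plain CTranslate2 API.
model_path = snapshot_download("michaelfeil/ct2fast-flan-alpaca-base")
translator = ctranslate2.Translator(model_path, device="cpu", compute_type="int8")

# CTranslate2 consumes token strings, produced by the original tokenizer.
tokenizer = transformers.AutoTokenizer.from_pretrained("declare-lab/flan-alpaca-base")
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode("How do you call a fast Flan-ingo?"))

results = translator.translate_batch([tokens], beam_size=5, max_decoding_length=32)
output_ids = tokenizer.convert_tokens_to_ids(results[0].hypotheses[0])
print(tokenizer.decode(output_ids, skip_special_tokens=True))
```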

# Licence and other remarks:
This is just a quantized version of [declare-lab/flan-alpaca-base](https://huggingface.co/declare-lab/flan-alpaca-base); the licence and other remarks of the original model apply.